# Probability/Conditional Distributions

## Motivation

Suppose there is an earthquake. Let ${\displaystyle X}$  be the number of casualties and ${\displaystyle Y}$  be the Richter magnitude of the earthquake.

(a) Without any further information, what is the distribution of ${\displaystyle X}$ ?

(b) Given that ${\displaystyle Y=1}$  , what is the distribution of ${\displaystyle X}$ ?

(c) Given that ${\displaystyle Y=9}$  , what is the distribution of ${\displaystyle X}$ ?

Remark.

• ${\displaystyle Y=1}$  means the earthquake is micro, and ${\displaystyle Y=9}$  means the earthquake is great.

Indeed, we condition on ${\displaystyle \{Y=1\}}$  and ${\displaystyle \{Y=9\}}$  in (b) and (c), and the distributions we are finding in (b) and (c) should actually be denoted by ${\displaystyle X|Y=1}$  and ${\displaystyle X|Y=9}$  respectively for clarity. They are conditional distributions: ${\displaystyle X|Y}$  means ${\displaystyle X}$  given ${\displaystyle Y}$  (before observing the value of ${\displaystyle Y}$ ), and ${\displaystyle X|Y=y}$  means ${\displaystyle X}$  given ${\displaystyle Y=y}$  (after observing the value of ${\displaystyle Y}$ ).

## Conditional distributions

Recall the definition of conditional probability:

${\displaystyle \mathbb {P} (A|B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (B)}},}$

in which ${\displaystyle A,B}$  are events, with ${\displaystyle \mathbb {P} (B)>0}$ .

We have similar definitions for conditional distributions.

Definition. (Conditional probability function) Let ${\displaystyle X,Y}$  be random variables. The conditional probability (mass or density) function of ${\displaystyle X}$  given ${\displaystyle Y=y}$ , in which ${\displaystyle y}$  is a real number, is

${\displaystyle \underbrace {f_{X|Y}({\color {darkgreen}x}|y)} _{{\text{function of }}{\color {darkgreen}x}}={\frac {\overbrace {f({\color {darkgreen}x},y)} ^{\text{joint probability function}}}{\underbrace {f_{Y}(y)} _{\text{marginal pdf}}}}\propto \underbrace {f({\color {darkgreen}x},y)} _{{\text{function of }}{\color {darkgreen}x}}}$

Remark.

• For discrete random variables ${\displaystyle X,Y}$ , the pmf is

${\displaystyle \underbrace {f_{X|Y}({\color {darkgreen}x}|y)} _{{\text{function of }}{\color {darkgreen}x}}={\frac {\mathbb {P} (X={\color {darkgreen}x}\cap Y=y)}{\mathbb {P} (Y=y)}}{\overset {\text{ def }}{=}}\mathbb {P} (X={\color {darkgreen}x}|Y=y).}$

• The marginal pdf can be interpreted as a normalizing constant, which makes the integral ${\displaystyle \int _{-\infty }^{\infty }f_{X|Y}({\color {darkgreen}x}|y)\,d{\color {darkgreen}x}=1}$ , since ${\displaystyle \int _{-\infty }^{\infty }f({\color {darkgreen}x},y)\,d{\color {darkgreen}x}=\underbrace {f_{Y}(y)} _{\text{marginal pdf}}}$  (we integrate over the region in which ${\displaystyle Y}$  is fixed to be ${\displaystyle y}$ , i.e. the region in which the condition is satisfied, so we integrate only over the corresponding interval of ${\displaystyle x}$ , with ${\displaystyle x}$  still a variable).
• This is similar to the denominator in the definition of conditional probability, which makes the conditional probability of the whole sample space equal one, to satisfy the probability axioms.
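For discrete random variables, the definition can be checked numerically. Below is a minimal sketch in Python; the joint pmf values and the function names are made up for illustration:

```python
# Hypothetical joint pmf f(x, y) of two discrete random variables X and Y.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def marginal_Y(y):
    """Marginal pmf f_Y(y) = sum over x of f(x, y)."""
    return sum(p for (x, yy), p in joint.items() if yy == y)

def cond_pmf(x, y):
    """Conditional pmf f_{X|Y}(x|y) = f(x, y) / f_Y(y)."""
    return joint.get((x, y), 0.0) / marginal_Y(y)
```

As the remark above suggests, the division by the marginal normalizes the cross-section: the values ${\displaystyle f_{X|Y}(x|y)}$  sum to one over ${\displaystyle x}$  for each fixed ${\displaystyle y}$ .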

Graphical illustration of the definition:

Top view:

```
        |
        |
        *---------------*
        |               |
        |               |
fixed y *===============* <--- corresponding interval
        |               |
        |               |
        *---------------*
        |
        *---------------- x
```

Side view:

```
*
/ \
*\  *  /
/|#\   \
|  / |##\ / *---------*
| *  |###\            /\
| |\ |##/#\----------/--\
| | \|#/###*--------*   /
| |  \/############/#\ /
| |y *\===========/===*
| | /  *---------*   /
| |/              \ /
| *----------------*
|/
*------------------------- x
```

Front view:

```
|
|
|
*\
|#\
|##\
|###\
|####\   <------ Area: f_Y(y)
|#####*--------*
|###############\
*================*-------------- x

*---*
|###| : corresponding cross section from joint pdf
*---*
```


Definition. (Conditional cumulative distribution function) Let ${\displaystyle X,Y}$  be random variables. The conditional cumulative distribution function (cdf) of ${\displaystyle X}$  given ${\displaystyle Y=y}$ , in which ${\displaystyle y}$  is a real number, is

${\displaystyle F_{X|Y}({\color {darkgreen}x}|y){\overset {\text{ def }}{=}}\mathbb {P} (X\leq {\color {darkgreen}x}|Y=y)={\begin{cases}\displaystyle \sum _{{\color {red}u}:{\color {red}u}\leq {\color {darkgreen}x}}^{}f_{X|Y}({\color {red}u}|y),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\color {darkgreen}x}f_{X|Y}({\color {red}u}|y)\,d{\color {red}u},&X{\text{ is continuous}}.\end{cases}}}$

Graphical illustration of the definition (continuous random variables):

Top view:

```
        |
        |
        *---------------*
        |               |
        |               |
fixed y *=========@=====* <--- corresponding interval
        |         x     |
        |               |
        *---------------*
        |
        *---------------- x
```

Side view:

```
*
/ \
*\  *  /
/|#\   \
|  / |##\ / *---------*
| *  |###\            /\
| |\ |##/#\----------/--\
| | \|#/###*--------*   /
| |  \/#########   / \ /
| |y *\========@==/===*
| | /  *-------x-*   /
| |/              \ /
| *----------------*
|/
*------------------------- x
```

Front view:

```
|
|
|
*\
|#\
|##\
|###\
|####\   <------------- Area: f_Y(y)
|#####*--------*
|###########    \
*==========@=====*--------------
           x

*---*
|###| : the desired region from the cross section from joint pdf, whose area is the probability from the cdf
*---*
```
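For a discrete ${\displaystyle X}$ , the conditional cdf is just a cumulative sum of the conditional pmf. A minimal sketch with made-up joint pmf values:

```python
# Hypothetical joint pmf of discrete X and Y (values made up for illustration).
joint = {(0, 1): 0.1, (1, 1): 0.3, (2, 1): 0.2, (0, 0): 0.4}

def cond_cdf(x, y):
    """F_{X|Y}(x|y) = sum over u <= x of f_{X|Y}(u|y)."""
    f_Y = sum(p for (u, yy), p in joint.items() if yy == y)
    return sum(p for (u, yy), p in joint.items() if yy == y and u <= x) / f_Y
```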


If ${\displaystyle Y=\mathbf {1} \{A\}}$  for some event ${\displaystyle A}$ , we have some special notations for simplicity:

• the conditional probability function of ${\displaystyle X}$  given ${\displaystyle Y=y}$  becomes

${\displaystyle f_{X|Y}({\color {darkgreen}x}|y)={\begin{cases}f({\color {darkgreen}x}|A),&y=1;\\f({\color {darkgreen}x}|A^{c}),&y=0.\end{cases}}}$

• the conditional cdf of ${\displaystyle X}$  given ${\displaystyle Y=y}$  becomes

${\displaystyle F_{X|Y}({\color {darkgreen}x}|y)=\mathbb {P} (X\leq {\color {darkgreen}x}|Y=y)={\begin{cases}F({\color {darkgreen}x}|A),&y=1;\\F({\color {darkgreen}x}|A^{c}),&y=0.\end{cases}}}$

Proposition. (Determining independence of two random variables) Random variables ${\displaystyle X,Y}$  are independent if and only if ${\displaystyle f_{X|Y}(x|y)=f_{X}(x){\text{ or }}f_{Y|X}(y|x)=f_{Y}(y)}$  for each ${\displaystyle x,y}$ .

Proof. Recall the definition of independence between two random variables:

${\displaystyle X,Y}$  are independent if

${\displaystyle f(x,y)=f_{X}(x)f_{Y}(y)}$

for each ${\displaystyle x,y}$ .

Since

${\displaystyle f_{X|Y}({\color {darkgreen}x}|y)={\frac {\overbrace {f({\color {darkgreen}x},y)} ^{f_{X}({\color {darkgreen}x})f_{Y}(y)}}{f_{Y}(y)}}=f_{X}(x){\text{ and }}f_{Y|X}({\color {darkgreen}y}|x)={\frac {\overbrace {f({\color {darkgreen}y},x)} ^{f_{Y}({\color {darkgreen}y})f_{X}(x)}}{f_{X}(x)}}=f_{Y}(y)}$

for each ${\displaystyle x,y}$ , we have the desired result.

${\displaystyle \Box }$

Remark.

• This is expected: conditioning on an independent random variable should not affect the distribution of the other.
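The proposition can be illustrated numerically: if the joint pmf factors into the marginals, the conditional pmf equals the marginal. A small sketch with made-up marginal values:

```python
# Marginal pmfs of two independent discrete random variables (made-up values).
f_X = {0: 0.3, 1: 0.7}
f_Y = {0: 0.5, 1: 0.5}
# Under independence, the joint pmf is the product of the marginals.
joint = {(x, y): f_X[x] * f_Y[y] for x in f_X for y in f_Y}

def cond_pmf(x, y):
    """Conditional pmf f_{X|Y}(x|y) = f(x, y) / f_Y(y)."""
    f_Y_y = sum(p for (u, v), p in joint.items() if v == y)
    return joint[(x, y)] / f_Y_y

# Check f_{X|Y}(x|y) = f_X(x) for every x and y, as the proposition states.
independent = all(abs(cond_pmf(x, y) - f_X[x]) < 1e-12 for x in f_X for y in f_Y)
```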

We can extend the definitions of the conditional probability function and cdf to groups of random variables, giving conditional joint probability functions and joint cdf's, as follows:

Definition. (Conditional joint probability function) Let ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{r})^{T}}$  and ${\displaystyle \mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}}$  be two random vectors. The conditional joint probability function of ${\displaystyle \mathbf {X} }$  given ${\displaystyle \mathbf {Y} =(y_{1},\dotsc ,y_{s})^{T}}$  is

${\displaystyle f_{\mathbf {X} |\mathbf {Y} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}|y_{1},\dotsc ,y_{s}){\overset {\text{ def }}{=}}\mathbb {P} (X_{1}={\color {darkgreen}x_{1}}\cap \dotsb \cap X_{r}={\color {darkgreen}x_{r}}|Y_{1}=y_{1}\cap \dotsb \cap Y_{s}=y_{s})={\frac {f({\color {darkgreen}x_{1},\dotsc ,x_{r}},y_{1},\dotsc ,y_{s})}{f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}}$

Then, we also have a similar proposition for determining independence of two random vectors.

Proposition. (Determining independence of two random vectors) Random vectors ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{r})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}}$  are independent if and only if ${\displaystyle f_{\mathbf {X} |\mathbf {Y} }(x_{1},\dotsc ,x_{r}|y_{1},\dotsc ,y_{s})=f_{\mathbf {X} }(x_{1},\dotsc ,x_{r}){\text{ or }}f_{\mathbf {Y} |\mathbf {X} }(y_{1},\dotsc ,y_{s}|x_{1},\dotsc ,x_{r})=f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}$  for each ${\displaystyle x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}}$ .

Proof. The definition of independence between two random vectors is

• ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{r})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{s})^{T}}$  are independent if

${\displaystyle f(x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s})=f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}$

for each ${\displaystyle x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}}$ .

Since

${\displaystyle f_{\mathbf {X} |\mathbf {Y} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}|y_{1},\dotsc ,y_{s})={\frac {\overbrace {f({\color {darkgreen}x_{1},\dotsc ,x_{r}},y_{1},\dotsc ,y_{s})} ^{f_{\mathbf {X} }({\color {darkgreen}x_{1},\dotsc ,x_{r}})f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}{f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}}=f_{\mathbf {X} }({\color {darkgreen}x_{1},\dotsc ,x_{r}}){\text{ and }}f_{\mathbf {Y} |\mathbf {X} }({\color {darkgreen}y_{1},\dotsc ,y_{s}}|x_{1},\dotsc ,x_{r})={\frac {\overbrace {f({\color {darkgreen}y_{1},\dotsc ,y_{s}},x_{1},\dotsc ,x_{r})} ^{f_{\mathbf {Y} }({\color {darkgreen}y_{1},\dotsc ,y_{s}})f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})}}{f_{\mathbf {X} }(x_{1},\dotsc ,x_{r})}}=f_{\mathbf {Y} }(y_{1},\dotsc ,y_{s})}$

for each ${\displaystyle x_{1},\dotsc ,x_{r},y_{1},\dotsc ,y_{s}}$ , we have the desired result.

${\displaystyle \Box }$

### Conditional distributions of bivariate normal distribution

Recall from the Probability/Important Distributions chapter that the joint pdf of ${\displaystyle {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})}$  is

${\displaystyle f(x,y)={\frac {1}{2\pi \sigma _{X}\sigma _{Y}{\sqrt {1-\rho ^{2}}}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right),\quad (x,y)\in \mathbb {R} ^{2}}$

in which ${\displaystyle \rho =\rho (X,Y)}$  and ${\displaystyle \sigma _{X},\sigma _{Y}}$  are positive; in this case, ${\displaystyle X\sim {\mathcal {N}}(\mu _{X},\sigma _{X}^{2})}$  and ${\displaystyle Y\sim {\mathcal {N}}(\mu _{Y},\sigma _{Y}^{2})}$ .

Proposition. (Conditional distributions of bivariate normal distribution) Let ${\displaystyle (X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})}$ . Then,

${\displaystyle X|(Y=y)\sim {\mathcal {N}}\left(\mu _{X}+\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y}),\sigma _{X}^{2}(1-\rho ^{2})\right),{\text{ and }}Y|(X=x)\sim {\mathcal {N}}\left(\mu _{Y}+\rho \cdot {\frac {\sigma _{Y}}{\sigma _{X}}}(x-\mu _{X}),\sigma _{Y}^{2}(1-\rho ^{2})\right)}$

(with a slight abuse of notation).

Proof.

• First, the conditional pdf

{\displaystyle {\begin{aligned}f_{X|Y}(x|y)&{\overset {\text{ def }}{=}}{\frac {f(x,y)}{f_{Y}(y)}}\\&=\left.{\frac {1}{{\color {darkgreen}2\pi }\sigma _{X}{\cancel {\sigma _{Y}}}{\sqrt {1-\rho ^{2}}}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\right/{\frac {1}{\sqrt {{\color {darkgreen}2\pi }{\cancel {\sigma _{Y}^{2}}}}}}\exp {\big (}{\color {blue}-(y-\mu _{Y})^{2}/2\sigma _{Y}^{2}}{\big )}\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right){\color {blue}+(y-\mu _{Y})^{2}/2\sigma _{Y}^{2}}\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+{\cancel {\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}}}{\color {purple}-({\cancel {1}}-\rho ^{2})}\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2{\color {blue}\sigma _{X}^{2}}(1-\rho ^{2})}}\left(\left(x-\mu _{X}\right)^{2}-2\rho \cdot {\frac {\color {blue}\sigma _{X}}{\sigma _{Y}}}(x-\mu _{X})(y-\mu _{Y})+\left({\color {purple}\rho }\cdot {\frac {\color {blue}\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)^{2}\right)\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2\sigma _{X}^{2}(1-\rho ^{2})}}\left((x-\mu _{X})-\left(\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)\right)^{2}\right)\\&={\frac {1}{\sqrt {2\pi \sigma _{X}^{2}(1-\rho ^{2})}}}\exp \left(-{\frac {1}{2\sigma _{X}^{2}(1-\rho ^{2})}}\left(x-\mu _{X}-\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y})\right)^{2}\right)\end{aligned}}}

• Then, we can see that ${\displaystyle X|(Y=y)\sim {\mathcal {N}}\left(\mu _{X}+\rho \cdot {\frac {\sigma _{X}}{\sigma _{Y}}}(y-\mu _{Y}),\sigma _{X}^{2}(1-\rho ^{2})\right)}$ ,
• and by symmetry (interchanging ${\displaystyle X}$  and ${\displaystyle Y}$ , and also interchanging ${\displaystyle x}$  and ${\displaystyle y}$ ), ${\displaystyle Y|(X=x)\sim {\mathcal {N}}\left(\mu _{Y}+\rho \cdot {\frac {\sigma _{Y}}{\sigma _{X}}}(x-\mu _{X}),\sigma _{Y}^{2}(1-\rho ^{2})\right)}$ .

${\displaystyle \Box }$
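The closed-form conditional parameters can be packaged in a short helper; a minimal sketch (the function name and the example parameters below are our own):

```python
def bvn_conditional(mu_x, mu_y, sigma_x, sigma_y, rho, y):
    """Mean and variance of X | (Y = y) for a bivariate normal.

    Uses E[X|Y=y] = mu_X + rho * (sigma_X / sigma_Y) * (y - mu_Y)
    and Var(X|Y=y) = sigma_X**2 * (1 - rho**2).
    """
    mean = mu_x + rho * (sigma_x / sigma_y) * (y - mu_y)
    var = sigma_x ** 2 * (1 - rho ** 2)
    return mean, var
```

Note that when ${\displaystyle \rho =0}$  the conditional distribution coincides with the marginal ${\displaystyle {\mathcal {N}}(\mu _{X},\sigma _{X}^{2})}$ , matching the independence case.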

## Conditional version of concepts

We can obtain conditional versions of the concepts previously established for 'unconditional' distributions by replacing the 'unconditional' cdf, pdf or pmf, i.e. ${\displaystyle F(\cdot )}$  or ${\displaystyle f(\cdot )}$ , with their conditional counterparts, i.e. ${\displaystyle F(\cdot {\color {darkgreen}|\cdot })}$  or ${\displaystyle f(\cdot {\color {darkgreen}|\cdot })}$ .

### Conditional independence

Definition. Random variables ${\displaystyle X_{1},X_{2},\dotsc ,X_{n}}$  are conditionally independent given ${\displaystyle Y=y}$  if and only if

${\displaystyle F_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}(x_{1},\dotsc ,x_{n}{\color {darkgreen}|y})=F_{X_{1}{\color {darkgreen}|Y}}(x_{1}{\color {darkgreen}|y})\dotsb F_{X_{n}{\color {darkgreen}|Y}}(x_{n}{\color {darkgreen}|y})}$

or
${\displaystyle f_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}(x_{1},\dotsc ,x_{n}{\color {darkgreen}|y})=f_{X_{1}{\color {darkgreen}|Y}}(x_{1}{\color {darkgreen}|y})\dotsb f_{X_{n}{\color {darkgreen}|Y}}(x_{n}{\color {darkgreen}|y})}$

for each real number ${\displaystyle x_{1},\dotsc ,x_{n},{\color {darkgreen}y}}$ , in which ${\displaystyle F_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}}$  and ${\displaystyle f_{X_{1},\dotsc ,X_{n}{\color {darkgreen}|Y}}}$  denote the joint cdf and probability function of ${\displaystyle (X_{1},\dotsc ,X_{n})}$  conditional on ${\displaystyle Y=y}$  respectively.

Remark.

• For random variables, conditional independence and independence are not related, i.e. neither implies the other.

Example. (Conditional independence does not imply independence) TODO

Example. (Independence does not imply conditional independence) TODO

### Conditional expectation

Definition. (Conditional expectation) Let ${\displaystyle f_{X|Y}(x|y)}$  be the conditional probability function of ${\displaystyle X}$  given ${\displaystyle Y=y}$ . Then,

${\displaystyle \mathbb {E} [X{\color {darkgreen}|Y=y}]={\begin{cases}\displaystyle \sum _{x\in \operatorname {supp} (X)}^{}xf_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y}),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }xf_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})\,dx,&X{\text{ is continuous}}.\end{cases}}}$

Remark.

• ${\displaystyle \mathbb {E} [X{\color {darkgreen}|Y=y}]}$  is a function of ${\displaystyle y}$
• the random variable ${\displaystyle \mathbb {E} [*{\color {darkgreen}|Y=Y}]}$ , which is a function of ${\displaystyle Y}$  after computing the expectation, is written as ${\displaystyle \mathbb {E} [*{\color {darkgreen}|Y}]}$  for brevity, in which ${\displaystyle *}$ 's are the same term.
• ${\displaystyle \mathbb {E} [*{\color {darkgreen}|Y=y}]}$  is a realization of ${\displaystyle \mathbb {E} [*|Y]}$  when ${\displaystyle Y}$  is observed to be ${\displaystyle y}$  in which ${\displaystyle *}$ 's are the same term.

Similarly, we have a conditional version of the law of the unconscious statistician.

Proposition. (Law of the unconscious statistician (conditional version)) Let ${\displaystyle f_{X|Y}(x|y)}$  be the conditional probability function of ${\displaystyle X}$  given ${\displaystyle Y=y}$ . Then, for each function ${\displaystyle g(x)}$ ,

${\displaystyle \mathbb {E} [g(X){\color {darkgreen}|Y=y}]={\begin{cases}\displaystyle \sum _{x\in \operatorname {supp} (X)}^{}g(x)f_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y}),&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }g(x)f_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})\,dx,&X{\text{ is continuous}}.\end{cases}}}$
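For a discrete ${\displaystyle X}$ , the conditional expectation of ${\displaystyle g(X)}$  is a weighted sum over the conditional pmf. A minimal sketch with a made-up joint pmf:

```python
# Hypothetical joint pmf of discrete X and Y (values made up for illustration).
joint = {(0, 0): 0.1, (1, 0): 0.2, (2, 0): 0.1,
         (0, 1): 0.2, (1, 1): 0.1, (2, 1): 0.3}

def cond_expect(g, y):
    """E[g(X) | Y = y] = sum over x of g(x) * f_{X|Y}(x|y)."""
    f_Y = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / f_Y
```

Passing `g = lambda x: x` gives the conditional mean, and other choices of `g` give the conditional LOTUS directly.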

Proposition. (Conditional expectation under independence) If random variables ${\displaystyle X,Y}$  are independent,

${\displaystyle \mathbb {E} [g(X)|Y]=\mathbb {E} [g(X)]}$

for each function ${\displaystyle g}$ .

Proof.

${\displaystyle \mathbb {E} [g(X)|Y]={\begin{cases}\displaystyle \sum _{x}^{}g(x)f_{X|Y}(x|Y)=\sum _{x}^{}g(x)f_{X}(x)=\mathbb {E} [g(X)],&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }g(x)f_{X|Y}(x|Y)\,dx=\int _{-\infty }^{\infty }g(x)f_{X}(x)\,dx=\mathbb {E} [g(X)],&X{\text{ is continuous}}.\end{cases}}}$

${\displaystyle \Box }$

Remark.

• This equality may not hold if ${\displaystyle X,Y}$  are not independent.

Example. Suppose ${\displaystyle \mathbf {X} =(Y,Z)^{T}}$  is a random vector in which ${\displaystyle Y,Z}$  are independent random variables, and let ${\displaystyle g(\mathbf {x} )=y+z}$ . Then,

${\displaystyle \mathbb {E} [g(\mathbf {X} )|Y]=\mathbb {E} [\underbrace {Y} _{{\text{constant given }}Y}+Z|Y]=Y+\mathbb {E} [Z],}$

(${\displaystyle Y}$  is treated as constant, because of the conditioning: it is constant after realization of ${\displaystyle \mathbb {E} [Y+Z|Y]}$ ) but
${\displaystyle \mathbb {E} [g(\mathbf {X} )]=\mathbb {E} [Y+Z]=\mathbb {E} [Y]+\mathbb {E} [Z]\neq \mathbb {E} [g(\mathbf {X} )|Y]}$  in general, since the left-hand side is a constant while ${\displaystyle \mathbb {E} [g(\mathbf {X} )|Y]}$  is a random variable.

The properties of ${\displaystyle \mathbb {E} [\cdot ]}$  still hold for conditional expectations ${\displaystyle \mathbb {E} [\cdot {\color {darkgreen}|Y}]}$ , with every 'unconditional' expectation replaced by conditional expectation and some suitable modifications, as follows:

Proposition. (Properties of conditional expectation) For each random variable ${\displaystyle Y}$ ,

• (linearity) ${\displaystyle \mathbb {E} [\underbrace {\alpha {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}X_{1}+\underbrace {\beta {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}X_{2}+\underbrace {\gamma {\color {darkgreen}(Y)}} _{{\text{constant given }}Y}{\color {darkgreen}|Y}]=\alpha {\color {darkgreen}(Y)}\mathbb {E} [X_{1}{\color {darkgreen}|Y}]+\beta {\color {darkgreen}(Y)}\mathbb {E} [X_{2}{\color {darkgreen}|Y}]+\gamma {\color {darkgreen}(Y)}}$
for all functions ${\displaystyle \alpha (Y),\beta (Y),\gamma (Y)}$  of ${\displaystyle Y}$  and all random variables ${\displaystyle X_{1},X_{2}}$
• (nonnegativity) if ${\displaystyle X{\color {darkgreen}|Y}\geq 0}$ , ${\displaystyle \mathbb {E} [X{\color {darkgreen}|Y}]\geq 0}$
• (monotonicity) if ${\displaystyle X_{1}\geq X_{2}}$ , ${\displaystyle \mathbb {E} [X_{1}{\color {darkgreen}|Y}]\geq \mathbb {E} [X_{2}{\color {darkgreen}|Y}]}$  for each random variable ${\displaystyle X_{1},X_{2}}$
• (triangle inequality)

${\displaystyle |\mathbb {E} [X{\color {darkgreen}|Y}]|\leq \mathbb {E} [|X|{\color {darkgreen}|Y}]}$

• (multiplicativity under independence) if ${\displaystyle X_{1},X_{2}}$  are conditionally independent given ${\displaystyle Y}$ ,

${\displaystyle \mathbb {E} [X_{1}X_{2}{\color {darkgreen}|Y}]=\mathbb {E} [X_{1}{\color {darkgreen}|Y}]\mathbb {E} [X_{2}{\color {darkgreen}|Y}]}$

Proof. The proof is similar to the one for 'unconditional' expectations.

${\displaystyle \Box }$

Remark.

• ${\displaystyle \alpha (Y),\beta (Y),\gamma (Y)}$  are treated as constants given ${\displaystyle Y}$ , since after observing the value of ${\displaystyle Y}$  , they cannot be changed.
• Each result also holds with ${\displaystyle Y}$  replaced by random vectors ${\displaystyle (Y_{1},\dotsc ,Y_{s})^{T}}$ .

The following theorem about conditional expectation is quite important.

Theorem. (Law of total expectation) For each function ${\displaystyle g(x)}$  and for each random variable ${\displaystyle X,Y}$ ,

${\displaystyle \mathbb {E} {\big [}\underbrace {\mathbb {E} [g(X)|Y]} _{{\text{function of }}Y}{\big ]}=\mathbb {E} [g(X)].}$

Proof.

${\displaystyle \mathbb {E} [\mathbb {E} [g(X)|Y]]={\begin{cases}\displaystyle \sum _{y}^{}\mathbb {E} [g(X)|Y=y]f_{Y}(y)=\sum _{x}^{}{\bigg (}\sum _{y}^{}g(x)\overbrace {f_{X|Y}(x|y)} ^{f(x,y){\cancel {/f_{Y}(y)}}}{\cancel {f_{Y}(y)}}{\bigg )}=\sum _{x}^{}g(x){\bigg (}\overbrace {\sum _{y}^{}f(x,y)} ^{f_{X}(x)}{\bigg )}=\mathbb {E} [g(X)],&X{\text{ is discrete}};\\\displaystyle \int _{-\infty }^{\infty }\mathbb {E} [g(X)|Y=y]f_{Y}(y)\,dy=\int _{-\infty }^{\infty }{\bigg (}\int _{-\infty }^{\infty }g(x)\underbrace {f_{X|Y}(x|y)} _{f(x,y){\cancel {/f_{Y}(y)}}}\,dx{\bigg )}{\cancel {f_{Y}(y)}}\,dy=\int _{-\infty }^{\infty }g(x){\bigg (}\underbrace {\int _{-\infty }^{\infty }f(x,y)\,dy} _{f_{X}(x)}{\bigg )}\,dx=\mathbb {E} [g(X)],&X{\text{ is continuous}}.\end{cases}}}$

${\displaystyle \Box }$

Remark.

• We can replace ${\displaystyle g(X)}$  by ${\displaystyle g(X,Y,Z,\dotsc )}$  and get

${\displaystyle \mathbb {E} [g(X,Y,Z,\dotsc )]=\mathbb {E} [\mathbb {E} [g(X,{\color {darkgreen}Y},Z,\dotsc ){\color {darkgreen}|Y}]]=\mathbb {E} [\mathbb {E} [g(X,{\color {darkgreen}Y,Z},\dotsc ){\color {darkgreen}|Y,Z}]]=\dotsb }$
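The law of total expectation can be verified numerically on a discrete table: averaging the conditional expectations over ${\displaystyle Y}$  recovers the unconditional expectation. A minimal sketch with made-up pmf values:

```python
# Hypothetical joint pmf of discrete X and Y (values made up for illustration).
joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.2}

# Marginal pmf of Y.
f_Y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

def cond_E(g, y):
    """E[g(X) | Y = y] computed from the conditional pmf."""
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / f_Y[y]

# Outer expectation over Y of the inner conditional expectation:
lhs = sum(cond_E(lambda x: x * x, y) * f_Y[y] for y in (0, 1))
# Direct (unconditional) expectation of g(X):
rhs = sum((x * x) * p for (x, y), p in joint.items())
```

Here `lhs` and `rhs` agree, as the theorem asserts for any choice of ${\displaystyle g}$ .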

Corollary. (Generalized law of total probability) For each event ${\displaystyle A}$ ,

${\displaystyle \mathbb {E} _{Y}[\mathbb {P} (A|{\color {darkgreen}Y})]=\mathbb {P} (A).}$

Proof.

• First,

${\displaystyle \mathbb {E} [\mathbf {1} \{A\}|Y]=1\cdot \mathbb {P} (\mathbf {1} \{A\}=1|Y)+0\cdot \mathbb {P} (\mathbf {1} \{A\}=0|Y)=\mathbb {P} (A|Y).}$

• Then, using law of total expectation,

${\displaystyle \mathbb {E} _{Y}[\mathbb {P} (A|{\color {darkgreen}Y})]{\overset {\text{ above }}{=}}\mathbb {E} _{Y}[\mathbb {E} [\mathbf {1} \{A\}|{\color {darkgreen}Y}]]=\mathbb {E} [\mathbf {1} \{A\}]=\mathbb {P} (A).}$

${\displaystyle \Box }$

Remark.

• The expectation is taken with respect to ${\displaystyle Y}$ , so we use the ${\displaystyle \mathbb {E} _{Y}[\cdot ]}$  notation. When needed, we will use similar notations to indicate the random variable with respect to which an expectation is taken.
• We can replace ${\displaystyle Y}$  by ${\displaystyle (Y_{1},\dotsc ,Y_{s})}$ , which is a random vector.
• If ${\displaystyle Y}$  is discrete, then the expanded form of the result is ${\displaystyle \sum _{i}^{}\mathbb {P} (A|{\color {darkgreen}Y=i})\mathbb {P} ({\color {darkgreen}Y=i})=\mathbb {P} (A)}$  (discrete case of the law of total probability).
• If ${\displaystyle Y}$  is continuous, then the expanded form of the result is ${\displaystyle \int _{\operatorname {supp} (Y)}\mathbb {P} (A|{\color {darkgreen}Y=y})f_{Y}({\color {darkgreen}y})\,dy=\mathbb {P} (A)}$  (continuous case of the law of total probability).

Corollary. (Expectation version of law of total probability) Suppose the sample space ${\displaystyle \Omega =A_{1}\cup A_{2}\cup \dotsb }$  in which ${\displaystyle A_{i}}$ 's are mutually exclusive. Then,

${\displaystyle \mathbb {E} [X]=\mathbb {E} [X|A_{1}]\mathbb {P} (A_{1})+\mathbb {E} [X|A_{2}]\mathbb {P} (A_{2})+\dotsb .}$

Proof. Define ${\displaystyle Y=i}$  if ${\displaystyle A_{i}}$  occurs, in which ${\displaystyle i}$  is a positive integer. Then,

${\displaystyle \mathbb {E} [X]=\mathbb {E} _{Y}[\mathbb {E} _{X}[X|Y]]=\sum _{i=1}^{\infty }\mathbb {E} _{X}[X|Y=i]\mathbb {P} (Y=i)=\sum _{i=1}^{\infty }\mathbb {E} [X|A_{i}]\mathbb {P} (A_{i})}$

${\displaystyle \Box }$

Remark.

• The number of events can be finite, as long as they are mutually exclusive and their union is the whole sample space.
• If ${\displaystyle X=\mathbf {1} \{B\}}$ , the result reduces to the law of total probability.

Example. Let ${\displaystyle X}$  be the height (in metres) of a person randomly selected from a population consisting of equal numbers of men and women. Given that the mean height of a man is 1.8 m, and that of a woman is 1.7 m, the mean height of the entire population is

${\displaystyle \mathbb {E} [X]=\mathbb {E} [X|\{{\text{man selected}}\}]\mathbb {P} ({\text{man selected}})+\mathbb {E} [X|\{{\text{woman selected}}\}]\mathbb {P} ({\text{woman selected}})=1.8(1/2)+1.7(1/2)=1.75}$
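The weighted-average arithmetic above can be sketched in a couple of lines (group labels and values taken from the example):

```python
# (E[X | group], P(group)) for each group in the partition.
groups = {"man": (1.8, 0.5), "woman": (1.7, 0.5)}

# E[X] = sum of E[X | A_i] * P(A_i) over the partition.
mean_height = sum(m * p for m, p in groups.values())
```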

Corollary. (formula of expectation conditional on event) For each random variable ${\displaystyle X}$  and event ${\displaystyle A}$  with ${\displaystyle \mathbb {P} (A)>0}$ ,

${\displaystyle \mathbb {E} [X|A]={\frac {\mathbb {E} [X\mathbf {1} \{A\}]}{\mathbb {P} (A)}}.}$

Proof. By the formula of expectation computed by weighted average of conditional expectations,

${\displaystyle \mathbb {E} [X\mathbf {1} \{A\}]=\mathbb {E} [X\underbrace {\mathbf {1} \{A\}} _{1}|A]\mathbb {P} (A)+\mathbb {E} [X\underbrace {\mathbf {1} \{A\}} _{0}|A^{c}]\mathbb {P} (A^{c})=\mathbb {E} [X|A]\mathbb {P} (A),}$

and the result follows if ${\displaystyle \mathbb {P} (A)>0}$ .

${\displaystyle \Box }$

Remark.

• if ${\displaystyle X=\mathbf {1} \{B\}}$ , it reduces to the definition of the conditional probability ${\displaystyle \mathbb {P} (B|A)}$  by the fundamental bridge between probability and expectation

After defining conditional expectation, we can also have conditional variance, covariance and correlation coefficient, since variance, covariance, and correlation coefficient are built upon expectation.

#### Conditional expectations of bivariate normal distribution

Proposition. (Conditional expectations of bivariate normal distribution) Let ${\displaystyle (X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})}$ . Then,

${\displaystyle \mathbb {E} [X|Y=y]=\mathbb {E} [X]+\rho (X,Y)\cdot {\frac {\sqrt {\operatorname {Var} (X)}}{\sqrt {\operatorname {Var} (Y)}}}(y-\mathbb {E} [Y]),{\text{ and }}\mathbb {E} [Y|X=x]=\mathbb {E} [Y]+\rho (X,Y)\cdot {\frac {\sqrt {\operatorname {Var} (Y)}}{\sqrt {\operatorname {Var} (X)}}}(x-\mathbb {E} [X]).}$

Proof.

• The result follows readily from the proposition about conditional distributions of the bivariate normal distribution.

${\displaystyle \Box }$

### Conditional variance

Definition. (Conditional variance) The conditional variance of random variable ${\displaystyle X}$  given ${\displaystyle Y=y}$  is

${\displaystyle \operatorname {Var} (X{\color {darkgreen}|Y=y})=\mathbb {E} [(X-\mathbb {E} [X{\color {darkgreen}|Y=y}])^{2}{\color {darkgreen}|Y=y}].}$

Similarly, we have properties of conditional variance which are similar to that of variance.

Proposition. (Properties of conditional variance) For each random variable ${\displaystyle X,Y}$ ,

• (alternative formula of conditional variance) ${\displaystyle \operatorname {Var} (X{\color {darkgreen}|Y})=\mathbb {E} [X^{2}{\color {darkgreen}|Y}]-(\mathbb {E} [X{\color {darkgreen}|Y}])^{2}}$
• (invariance under change in location parameter) ${\displaystyle \operatorname {Var} (X+a{\color {darkgreen}(Y)}{\color {darkgreen}|Y})=\operatorname {Var} (X{\color {darkgreen}|Y})}$
• (homogeneity of degree two) ${\displaystyle \operatorname {Var} (b{\color {darkgreen}(Y)}X{\color {darkgreen}|Y})=\left(b{\color {darkgreen}(Y)}\right)^{2}\operatorname {Var} (X{\color {darkgreen}|Y})}$
• (nonnegativity) ${\displaystyle \operatorname {Var} (X{\color {darkgreen}|Y})\geq 0}$
• (zero variance implies non-randomness) ${\displaystyle \operatorname {Var} (X{\color {darkgreen}|Y})=0\Leftrightarrow \mathbb {P} (X=c{\color {darkgreen}(Y)|Y})=1}$  for some function ${\displaystyle c(Y)}$  of ${\displaystyle Y}$
• (additivity under independence) if ${\displaystyle X_{1},\dotsc ,X_{n}}$  are conditionally independent given ${\displaystyle Y}$ ,${\displaystyle \operatorname {Var} (X_{1}+\dotsb +X_{n}{\color {darkgreen}|Y})=\operatorname {Var} (X_{1}{\color {darkgreen}|Y})+\dotsb +\operatorname {Var} (X_{n}{\color {darkgreen}|Y})}$

Proof. The proof is similar to the one for properties of variance.

${\displaystyle \Box }$

Beside law of total expectation, we also have law of total variance, as follows:

Proposition. (Law of total variance) For each random variable ${\displaystyle X,Y}$ ,

${\displaystyle \operatorname {Var} (X)=\mathbb {E} [\operatorname {Var} (X|Y)]+\operatorname {Var} (\mathbb {E} [X|Y]).}$

Proof.

{\displaystyle {\begin{aligned}\mathbb {E} [\operatorname {Var} (X|Y)]+\operatorname {Var} (\mathbb {E} [X|Y])&=\mathbb {E} \left[\mathbb {E} [X^{2}|Y]-(\mathbb {E} [X|Y])^{2}\right]+\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]-(\mathbb {E} [\mathbb {E} [X|Y]])^{2}\\&=\mathbb {E} [\mathbb {E} [X^{2}|Y]]{\cancel {+\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]}}+\mathbb {E} \left[(\mathbb {E} [X|Y])^{2}\right]{\cancel {-(\mathbb {E} [\mathbb {E} [X|Y]])^{2}}}\\&=\mathbb {E} [X^{2}]-(\mathbb {E} [X])^{2}\qquad {\text{by law of total expectation}}\\&=\operatorname {Var} (X)\end{aligned}}}

${\displaystyle \Box }$

Remark.

• We can replace ${\displaystyle Y}$  by a random vector ${\displaystyle (Y_{1},\dotsc ,Y_{s})^{T}}$ .
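The decomposition can be checked numerically on a discrete table: the mean of the conditional variances plus the variance of the conditional means equals the unconditional variance. A minimal sketch with made-up pmf values:

```python
# Hypothetical joint pmf of discrete X and Y (values made up for illustration).
joint = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.2}
ys = {y for (_, y) in joint}
f_Y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in ys}

def cond_moment(k, y):
    """E[X**k | Y = y] computed from the conditional pmf."""
    return sum((x ** k) * p for (x, yy), p in joint.items() if yy == y) / f_Y[y]

cond_mean = {y: cond_moment(1, y) for y in ys}
cond_var = {y: cond_moment(2, y) - cond_moment(1, y) ** 2 for y in ys}

e_var = sum(cond_var[y] * f_Y[y] for y in ys)              # E[Var(X|Y)]
m = sum(cond_mean[y] * f_Y[y] for y in ys)                 # E[E[X|Y]] = E[X]
var_e = sum((cond_mean[y] - m) ** 2 * f_Y[y] for y in ys)  # Var(E[X|Y])

# Unconditional variance, for comparison.
EX = sum(x * p for (x, _), p in joint.items())
EX2 = sum(x * x * p for (x, _), p in joint.items())
var_X = EX2 - EX ** 2
```

Here `e_var + var_e` matches `var_X`, as the law of total variance asserts.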

#### Conditional variances of bivariate normal distribution

Proposition. (Conditional variances of bivariate normal distribution) Let ${\displaystyle (X,Y)^{T}\sim {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})}$ . Then,

${\displaystyle \operatorname {Var} (X|Y=y)={\big (}1-(\rho (X,Y))^{2}{\big )}\operatorname {Var} (X),{\text{ and }}\operatorname {Var} (Y|X=x)={\big (}1-(\rho (X,Y))^{2}{\big )}\operatorname {Var} (Y).}$

Proof.

• The result follows readily from the proposition about conditional distributions of the bivariate normal distribution.

${\displaystyle \Box }$

Remark.

• Observe that the exact values of ${\displaystyle x}$  and ${\displaystyle y}$  in the conditions do not matter: the conditional variances are the same for every conditioned value.

### Conditional covariance

Definition. (Conditional covariance) The conditional covariance of ${\displaystyle X}$  and ${\displaystyle Y}$  given ${\displaystyle Z=z}$  is

${\displaystyle \operatorname {Cov} (X,Y{\color {darkgreen}|Z=z})=\mathbb {E} [(X-\mathbb {E} [X{\color {darkgreen}|Z=z}])(Y-\mathbb {E} [Y{\color {darkgreen}|Z=z}]){\color {darkgreen}|Z=z}]}$

Proposition. (Properties of conditional covariance)

(i) (symmetry) for each random variable ${\displaystyle X,Y}$ ,

${\displaystyle \operatorname {Cov} (X,Y{\color {darkgreen}|Z})=\operatorname {Cov} (Y,X{\color {darkgreen}|Z})}$

(ii) for each random variable ${\displaystyle X}$ ,
${\displaystyle \operatorname {Cov} (X,X{\color {darkgreen}|Z})=\operatorname {Var} (X{\color {darkgreen}|Z})}$

(iii) (alternative formula of covariance)
${\displaystyle \operatorname {Cov} (X,Y{\color {darkgreen}|Z})=\mathbb {E} [XY{\color {darkgreen}|Z}]-\mathbb {E} [X{\color {darkgreen}|Z}]\mathbb {E} [Y{\color {darkgreen}|Z}]}$

(iv) for each constant ${\displaystyle a_{1},\dotsc ,a_{n},b_{1},\dotsc ,b_{m},c,d}$ , and for each random variables ${\displaystyle X_{1},\dotsc ,X_{n},Y_{1},\dotsc ,Y_{m}}$ ,
${\displaystyle \operatorname {Cov} \left(\sum _{i=1}^{n}(a_{i}X_{i}+c),\sum _{j=1}^{m}(b_{j}Y_{j}+d){\color {darkgreen}|Z}\right)=\sum _{i=1}^{n}\sum _{j=1}^{m}a_{i}b_{j}\operatorname {Cov} (X_{i},Y_{j}{\color {darkgreen}|Z})}$

(v) for each random variable ${\displaystyle X_{1},\dotsc ,X_{n}}$ ,
${\displaystyle \operatorname {Var} (X_{1}+\dotsb +X_{n}{\color {darkgreen}|Z})=\sum _{i=1}^{n}\operatorname {Var} (X_{i}{\color {darkgreen}|Z})+2\sum _{1\leq i<j\leq n}\operatorname {Cov} (X_{i},X_{j}{\color {darkgreen}|Z})}$

### Conditional correlation coefficient

Definition. (Conditional correlation coefficient) The conditional correlation coefficient of random variables ${\displaystyle X}$  and ${\displaystyle Y}$  given ${\displaystyle Z=z}$  is

${\displaystyle \rho (X,Y{\color {darkgreen}|Z=z})={\frac {\operatorname {Cov} (X,Y{\color {darkgreen}|Z=z})}{\sqrt {\operatorname {Var} (X{\color {darkgreen}|Z=z})\operatorname {Var} (Y{\color {darkgreen}|Z=z})}}}.}$

Remark.

• Similar to 'unconditional' correlation coefficient, conditional correlation coefficient also lies between ${\displaystyle -1}$  and ${\displaystyle 1}$  inclusively. The proof is similar, by replacing every unconditional terms with conditional terms.

### Conditional quantile

Definition. (Conditional quantile) The conditional ${\displaystyle \alpha }$ th quantile of ${\displaystyle X}$  given ${\displaystyle Y=y}$  is

${\displaystyle \inf\{x\in \mathbb {R} :F_{X{\color {darkgreen}|Y}}(x{\color {darkgreen}|y})>\alpha \}.}$

Remark.

• Then, we can define the conditional median, interquartile range, etc., using conditional quantiles in the same way as the unconditional ones.