Probability/Conditional Distributions



Motivation

Suppose there is an earthquake. Let $X$ be the number of casualties and $Y$ be the Richter magnitude of the earthquake.

(a) Given no further information, what is the distribution of $X$?

(b) Given that $Y=1$, what is the distribution of $X$?

(c) Given that $Y=9$, what is the distribution of $X$?

Remark.

  • $Y=1$ means the earthquake is micro, and $Y=9$ means the earthquake is great.

Are your answers to (a), (b) and (c) different?

In (b) and (c), we have the conditional distribution of $X$ given $Y=1$, and the conditional distribution of $X$ given $Y=9$, respectively.

In general, we have the conditional distribution of $X$ given $Y$ (before observing the value of $Y$), or of $X$ given $Y=y$ (after observing the value of $Y$).

Conditional distributions

Recall the definition of conditional probability:
$$\mathbb P(A|B)=\frac{\mathbb P(A\cap B)}{\mathbb P(B)},$$
in which $A,B$ are events, with $\mathbb P(B)>0$. Applying this definition to discrete random variables $X,Y$, we have
$$\mathbb P(X=x|Y=y)=\frac{\mathbb P(X=x\cap Y=y)}{\mathbb P(Y=y)}=\frac{f(x,y)}{f_Y(y)},$$
where $f(x,y)$ is the joint pmf of $X$ and $Y$, and $f_Y(y)$ is the marginal pmf of $Y$. It is natural to call such a conditional probability a conditional pmf. We will denote it by $f_{X|Y}(x|y)$. This is basically the definition of the conditional pmf: the conditional pmf of $X$ given $Y=y$ is the conditional probability $\mathbb P(X=x|Y=y)$. Naturally, we expect the conditional pdf to be defined similarly. This is indeed the case:

Definition. (Conditional probability function) Let $X,Y$ be random variables that are both discrete or both continuous. The conditional probability (mass or density) function of $X$ given $Y=y$, in which $y$ is a real number with $f_Y(y)>0$, is
$$f_{X|Y}(x|y)=\frac{f(x,y)}{f_Y(y)}.$$

Remark.

  • The marginal pdf $f_Y(y)$ can be interpreted as a normalizing constant, which makes the integral $\int_{-\infty}^{\infty}f_{X|Y}(x|y)\,dx=1$, since $\int_{-\infty}^{\infty}f(x,y)\,dx=f_Y(y)$ (we integrate over the region in which $Y$ is fixed to be $y$, i.e. the region in which the condition is satisfied, so we only integrate over the corresponding interval of $x$, which is still a variable).
  • This is similar to the denominator in the definition of conditional probability, which makes the conditional probability of the whole sample space equal one, to satisfy the probability axiom.
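
As a minimal sketch of the discrete case, the snippet below divides a joint pmf by the marginal pmf of $Y$, following the standard formula $f_{X|Y}(x|y)=f(x,y)/f_Y(y)$, and checks the normalization described in the remark. The joint pmf values and the helper names `marginal_pmf_y` and `conditional_pmf` are assumptions made up for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf f(x, y) of two discrete random variables X and Y,
# stored as {(x, y): probability}. The numbers are made up for illustration.
joint_pmf = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(1, 8),
    (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 2),
}

def marginal_pmf_y(joint, y):
    """f_Y(y): sum the joint pmf over all x with y held fixed."""
    return sum(p for (_, y_val), p in joint.items() if y_val == y)

def conditional_pmf(joint, y):
    """Return {x: f(x, y) / f_Y(y)}; requires f_Y(y) > 0."""
    f_y = marginal_pmf_y(joint, y)
    if f_y == 0:
        raise ValueError("conditioning value has zero marginal probability")
    return {x: p / f_y for (x, y_val), p in joint.items() if y_val == y}

cond_given_y1 = conditional_pmf(joint_pmf, 1)
print(cond_given_y1)                # conditional pmf of X given Y = 1
print(sum(cond_given_y1.values()))  # normalization check: the values sum to 1
```

Exact fractions are used so that the normalization check is not blurred by floating-point rounding.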

To understand the definition more intuitively for the continuous case, consider the following diagram.

Top view:
     
        |
        |
        *---------------* 
        |               |
        |               |
fixed y *===============* <--- corresponding interval
        |               |
        |               |
        *---------------*
        |
        *---------------- x

Side view:

          *  
         / \ 
        *\  *  /                                           
       /|#\   \
   |  / |##\ / *---------*
   | *  |###\            /\
   | |\ |##/#\----------/--\     
   | | \|#/###*--------*   /                             
   | |  \/############/#\ /                              
   | |y *\===========/===*                               
   | | /  *---------*   /                                
   | |/              \ /                                 
   | *----------------*                                  
   |/                                                    
   *------------------------- x                          


Front view:
             
    |
    |
    |               
    *\     
    |#\    
    |##\   
    |###\             
    |####\   <------ Area: f_Y(y)
    |#####*--------*  
    |###############\ 
    *================*-------------- x

*---*
|###| : corresponding cross section from joint pdf
*---*   

We can see that when we condition on $Y=y$, we take a "slice" out of the region under the joint pdf, and the area of the whole "slice" is the area between the univariate function $f(x,y)$, with fixed $y$ and variable $x$, and the $x$-axis. This area is given by $f_Y(y)$, while according to the probability axioms the area under a pdf should equal 1. Hence, we scale down the area of the "slice" by a factor of $f_Y(y)$, by dividing $f(x,y)$ by $f_Y(y)$. After that, the curve at the top of the scaled "slice" is the graph of the conditional pdf $f_{X|Y}(x|y)$.

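So far, we have discussed the case where both random variables are discrete or both are continuous. How about the case where one of them is discrete and the other is continuous? In this case, there is no "joint probability function" of these two random variables, since one is discrete and the other is continuous. But we can still define the conditional probability function in another way. To motivate the following definition, let $F_{X|Y}(x|y)$ be the conditional probability $\mathbb P(X\le x|Y=y)$. Then, differentiating $F_{X|Y}(x|y)$ with respect to $x$ should yield the conditional pdf $f_{X|Y}(x|y)$. So, we have
$$f_{X|Y}(x|y)=\frac{\partial}{\partial x}\mathbb P(X\le x|Y=y)=\lim_{\Delta x\to 0}\frac{\mathbb P(x<X\le x+\Delta x|Y=y)}{\Delta x}=\lim_{\Delta x\to 0}\frac{\mathbb P(Y=y|x<X\le x+\Delta x)\,\mathbb P(x<X\le x+\Delta x)}{\mathbb P(Y=y)\,\Delta x}=\frac{\mathbb P(Y=y|X=x)f_X(x)}{\mathbb P(Y=y)}.$$
Thus, it is natural to have the following definition.

Definition. (Conditional probability density function when $X$ is continuous and $Y$ is discrete) Let $X$ be a continuous random variable and $Y$ be a discrete random variable. The conditional probability density function of $X$ given $Y=y$, where $y$ is a real number with $\mathbb P(Y=y)>0$, is
$$f_{X|Y}(x|y)=\frac{\mathbb P(Y=y|X=x)f_X(x)}{\mathbb P(Y=y)}.$$

Now, how about the case where $X$ is discrete and $Y$ is continuous? In this case, we use the above definition as motivation. However, we should interchange $X$ and $Y$ so that the assumptions are still satisfied. Then, we get
$$f_{Y|X}(y|x)=\frac{\mathbb P(X=x|Y=y)f_Y(y)}{\mathbb P(X=x)}.$$
In this case, $X$ is discrete, so it is natural to define the conditional pmf of $X$ given $Y=y$ as $\mathbb P(X=x|Y=y)$ in the expression. After rearranging the terms, we get
$$\mathbb P(X=x|Y=y)=\frac{f_{Y|X}(y|x)\,\mathbb P(X=x)}{f_Y(y)}.$$
Thus, we have the following definition.

Definition. (Conditional probability mass function when $X$ is discrete and $Y$ is continuous) Let $X$ be a discrete random variable and $Y$ be a continuous random variable. The conditional probability mass function of $X$ given $Y=y$, where $y$ is a real number with $f_Y(y)>0$, is
$$f_{X|Y}(x|y)=\frac{f_{Y|X}(y|x)\,\mathbb P(X=x)}{f_Y(y)}.$$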
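
To make the mixed case concrete, here is a small sketch that evaluates the conditional pmf of a discrete $X$ given a continuous $Y=y$ as $f_{Y|X}(y|x)\,\mathbb P(X=x)/f_Y(y)$, with the marginal pdf obtained as $f_Y(y)=\sum_x f_{Y|X}(y|x)\,\mathbb P(X=x)$. The model (a two-valued $X$ with normally distributed $Y$ given $X=x$), all numbers, and the names `prior_x`, `means` and `conditional_pmf_x_given_y` are illustrative assumptions, not part of the text above.

```python
import math

# Hypothetical model (all numbers are illustrative assumptions):
# X is discrete with P(X=0)=0.7 and P(X=1)=0.3, and Y | X=x ~ Normal(means[x], 1).
prior_x = {0: 0.7, 1: 0.3}
means = {0: 0.0, 1: 2.0}

def normal_pdf(y, mean, sd):
    """Density of Normal(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def conditional_pmf_x_given_y(y):
    """Return {x: f_{Y|X}(y|x) P(X=x) / f_Y(y)} for the model above."""
    numerators = {x: normal_pdf(y, means[x], 1.0) * prior_x[x] for x in prior_x}
    f_y = sum(numerators.values())  # marginal pdf of Y at y
    return {x: num / f_y for x, num in numerators.items()}

posterior = conditional_pmf_x_given_y(1.5)
print(posterior)  # observing y = 1.5 shifts probability towards X = 1
```

At the midpoint $y=1$ the two conditional densities coincide, so the conditional pmf equals the unconditional one there, which is a quick sanity check on the formula.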

Based on the definitions of conditional probability functions, it is natural to define the conditional cdf as follows.

Definition. (Conditional cumulative distribution function) Let $X,Y$ be discrete or continuous random variables. The conditional cumulative distribution function (cdf) of $X$ given $Y=y$, in which $y$ is a real number, is
$$F_{X|Y}(x|y)=\mathbb P(X\le x|Y=y)=\begin{cases}\sum_{u\le x}f_{X|Y}(u|y),&X\text{ is discrete};\\ \int_{-\infty}^{x}f_{X|Y}(u|y)\,du,&X\text{ is continuous}.\end{cases}$$

Remark.

  • We should be aware that when $Y$ is continuous, the event $\{Y=y\}$ has probability zero. So, according to the definition of conditional probability, the conditional cdf in this case should be undefined. However, in this context, we still define the conditional probability through the expression above, which makes sense and is well defined.

Graphical illustration of the definition (continuous random variables):

Top view:
     
        |
        |
        *---------------* 
        |               |
        |               |
fixed y *=========@=====* <--- corresponding interval
        |         x     |
        |               |
        *---------------*
        |
        *---------------- 

Side view:

          *  
         / \ 
        *\  *  /                                           
       /|#\   \
   |  / |##\ / *---------*
   | *  |###\            /\
   | |\ |##/#\----------/--\     
   | | \|#/###*--------*   /                             
   | |  \/#########   / \ /                              
   | |y *\========@==/===*                               
   | | /  *-------x-*   /                                
   | |/              \ /                                 
   | *----------------*                                  
   |/                                                    
   *------------------------- x                          


Front view:

    |
    |
    |
    *\      
    |#\    
    |##\              
    |###\             
    |####\   <------------- Area: f_Y(y)         
    |#####*--------*  
    |###########    \ 
    *==========@=====*--------------  
               x
*---*
|###| : the desired region from the cross section from joint pdf, whose area is the probability from the cdf
*---*   

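If $Y=\mathbf 1\{A\}$ for some event $A$, we have some special notations for simplicity:

  • the conditional probability function of $X$ given $A$ (i.e. given $Y=1$) becomes

$$f_X(x|A)$$

  • the conditional cdf of $X$ given $A$ (i.e. given $Y=1$) becomes

$$F_X(x|A)$$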

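Proposition. (Determining independence of two random variables) Random variables $X,Y$ are independent if and only if $f_{X|Y}(x|y)=f_X(x)$ for each $(x,y)$ with $f_Y(y)>0$.

Proof. Recall the definition of independence between two random variables:

$X,Y$ are independent if

$$f(x,y)=f_X(x)f_Y(y)$$

for each $(x,y)$.

Since $f(x,y)=f_{X|Y}(x|y)f_Y(y)$ for each $(x,y)$ with $f_Y(y)>0$, we have the desired result.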

 

Remark.

  • This is expected, since conditioning on an event should not affect the probability of another event that is independent of it.
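
The criterion can be checked mechanically for small discrete examples. The sketch below tests whether $f_{X|Y}(x|y)=f_X(x)$ for every $(x,y)$ with $f_Y(y)>0$; the two joint pmfs and the helper names `marginals` and `is_independent` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal pmfs, used to build a joint pmf that factorizes.
pmf_x = {0: Fraction(1, 3), 1: Fraction(2, 3)}
pmf_y = {0: Fraction(1, 4), 1: Fraction(3, 4)}
independent_joint = {(x, y): pmf_x[x] * pmf_y[y] for x, y in product(pmf_x, pmf_y)}

# Hypothetical joint pmf in which X always equals Y, so X and Y are dependent.
dependent_joint = {
    (0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2),
    (0, 1): Fraction(0), (1, 0): Fraction(0),
}

def marginals(joint):
    """Return the marginal pmfs (f_X, f_Y) of a joint pmf {(x, y): p}."""
    f_x, f_y = {}, {}
    for (x, y), p in joint.items():
        f_x[x] = f_x.get(x, 0) + p
        f_y[y] = f_y.get(y, 0) + p
    return f_x, f_y

def is_independent(joint):
    """Check f(x, y) / f_Y(y) == f_X(x) for all (x, y) with f_Y(y) > 0."""
    f_x, f_y = marginals(joint)
    return all(
        joint.get((x, y), 0) / f_y[y] == f_x[x]
        for x in f_x for y in f_y if f_y[y] > 0
    )

print(is_independent(independent_joint))  # True
print(is_independent(dependent_joint))    # False
```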


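We can extend the definitions of the conditional probability function and cdf to groups of random variables, for joint cdf's and joint probability functions, as follows:

Definition. (Conditional joint probability function) Let $\mathbf X=(X_1,\dots,X_n)^T$ and $\mathbf Y=(Y_1,\dots,Y_m)^T$ be two random vectors. The conditional joint probability function of $\mathbf X$ given $\mathbf Y=\mathbf y$ is
$$f_{\mathbf X|\mathbf Y}(\mathbf x|\mathbf y)=\frac{f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y)}{f_{\mathbf Y}(\mathbf y)}.$$

Then, we also have a similar proposition for determining independence of two random vectors.

Proposition. (Determining independence of two random vectors) Random vectors $\mathbf X,\mathbf Y$ are independent if and only if $f_{\mathbf X|\mathbf Y}(\mathbf x|\mathbf y)=f_{\mathbf X}(\mathbf x)$ for each $\mathbf x,\mathbf y$ with $f_{\mathbf Y}(\mathbf y)>0$.

Proof. The definition of independence between two random vectors is

  • $\mathbf X,\mathbf Y$ are independent if

$$f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y)=f_{\mathbf X}(\mathbf x)f_{\mathbf Y}(\mathbf y)$$

for each $\mathbf x,\mathbf y$.

Since $f_{\mathbf X,\mathbf Y}(\mathbf x,\mathbf y)=f_{\mathbf X|\mathbf Y}(\mathbf x|\mathbf y)f_{\mathbf Y}(\mathbf y)$ for each $\mathbf x,\mathbf y$ with $f_{\mathbf Y}(\mathbf y)>0$, we have the desired result.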

 

Conditional distributions of bivariate normal distribution

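Recall from the Probability/Important Distributions chapter that the joint pdf of $\mathcal N_2(\boldsymbol\mu,\boldsymbol\Sigma)$ is
$$f(x,y)=\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2-2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right)+\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right),$$
and $X\sim\mathcal N(\mu_X,\sigma_X^2)$ and $Y\sim\mathcal N(\mu_Y,\sigma_Y^2)$ in this case, in which $\sigma_X$ and $\sigma_Y$ are positive and $-1<\rho<1$.

Proposition. (Conditional distributions of bivariate normal distribution) Let $(X,Y)^T\sim\mathcal N_2(\boldsymbol\mu,\boldsymbol\Sigma)$. Then,
$$X|Y=y\sim\mathcal N\left(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y),\ (1-\rho^2)\sigma_X^2\right)\quad\text{and}\quad Y|X=x\sim\mathcal N\left(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\ (1-\rho^2)\sigma_Y^2\right)$$
(abuse of notation: when we say the distribution of "$X|Y=y$", we mean the conditional distribution of $X$ given $Y=y$).

Proof.

  • First, writing $z_x=(x-\mu_X)/\sigma_X$ and $z_y=(y-\mu_Y)/\sigma_Y$ for brevity, the conditional pdf is

$$f_{X|Y}(x|y)=\frac{f(x,y)}{f_Y(y)}=\frac{\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{z_x^2-2\rho z_xz_y+z_y^2}{2(1-\rho^2)}\right)}{\frac{1}{\sqrt{2\pi}\sigma_Y}\exp\left(-\frac{z_y^2}{2}\right)}=\frac{1}{\sqrt{2\pi(1-\rho^2)}\,\sigma_X}\exp\left(-\frac{(z_x-\rho z_y)^2}{2(1-\rho^2)}\right)=\frac{1}{\sqrt{2\pi(1-\rho^2)\sigma_X^2}}\exp\left(-\frac{\left(x-\mu_X-\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y)\right)^2}{2(1-\rho^2)\sigma_X^2}\right).$$

  • Then, we can see that $X|Y=y\sim\mathcal N\left(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y),\ (1-\rho^2)\sigma_X^2\right)$,
  • and by symmetry (interchanging $X$ and $Y$, and also interchanging $x$ and $y$), $Y|X=x\sim\mathcal N\left(\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X),\ (1-\rho^2)\sigma_Y^2\right)$.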
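
The closed form can be sanity-checked numerically. The sketch below compares the ratio $f(x,y)/f_Y(y)$, computed from the standard bivariate normal pdf, with the pdf of a normal distribution with mean $\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y)$ and variance $(1-\rho^2)\sigma_X^2$. The parameter values and function names are arbitrary assumptions chosen for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of Normal(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bivariate_normal_pdf(x, y, mu_x, mu_y, sd_x, sd_y, rho):
    """Joint pdf of the bivariate normal distribution at (x, y)."""
    zx = (x - mu_x) / sd_x
    zy = (y - mu_y) / sd_y
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    norm_const = 2.0 * math.pi * sd_x * sd_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm_const

# Arbitrary illustrative parameters (assumptions, not from the text).
MU_X, MU_Y, SD_X, SD_Y, RHO = 1.0, -2.0, 2.0, 0.5, 0.6

def conditional_pdf_by_ratio(x, y):
    """f(x, y) / f_Y(y), straight from the definition."""
    joint = bivariate_normal_pdf(x, y, MU_X, MU_Y, SD_X, SD_Y, RHO)
    return joint / normal_pdf(y, MU_Y, SD_Y ** 2)

def conditional_pdf_by_formula(x, y):
    """Normal pdf with the conditional mean and variance from the proposition."""
    cond_mean = MU_X + RHO * SD_X / SD_Y * (y - MU_Y)
    cond_var = SD_X ** 2 * (1.0 - RHO ** 2)
    return normal_pdf(x, cond_mean, cond_var)

grid = [(x, y) for x in (-1.0, 0.5, 3.0) for y in (-2.5, -2.0, -1.2)]
max_gap = max(
    abs(conditional_pdf_by_ratio(x, y) - conditional_pdf_by_formula(x, y))
    for x, y in grid
)
print(max_gap)  # expected to be at floating-point rounding level
```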

 

Conditional version of concepts

We can obtain conditional versions of the concepts previously established for 'unconditional' distributions analogously, by substituting the 'unconditional' cdf, pdf or pmf, i.e. $F_X(x)$ or $f_X(x)$, by their conditional counterparts, i.e. $F_{X|Y}(x|y)$ or $f_{X|Y}(x|y)$.

Conditional independence

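Definition. Random variables $X_1,\dots,X_n$ are conditionally independent given $Y=y$ if and only if
$$F(x_1,\dots,x_n|y)=F_{X_1|Y}(x_1|y)\cdots F_{X_n|Y}(x_n|y)$$
or
$$f(x_1,\dots,x_n|y)=f_{X_1|Y}(x_1|y)\cdots f_{X_n|Y}(x_n|y)$$
for each real number $x_1,\dots,x_n$ and for each positive integer $n$, in which $F(\cdot|y)$ and $f(\cdot|y)$ denote the joint cdf and joint probability function of $X_1,\dots,X_n$ conditional on $Y=y$ respectively.

Remark.

  • For random variables, conditional independence and independence are not related, i.e. neither of them implies the other.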

Example. (Conditional independence does not imply independence) TODO

Example. (Independence does not imply conditional independence) TODO
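
As a hedged illustration of the remark that neither notion implies the other, the sketch below checks both properties on two standard coin-flip constructions. These constructions, the numbers, and the helper names are assumptions chosen for illustration; they are not necessarily the examples intended above. In the first, $X_1,X_2$ are independent fair bits and $Y$ is their exclusive-or; in the second, $Y$ is a fair bit that selects the bias of two conditionally independent coins.

```python
from fractions import Fraction
from itertools import product

def prob(joint, x1=None, x2=None, y=None):
    """Probability that the listed coordinates of (X1, X2, Y) take the given values."""
    return sum(
        p for (a, b, c), p in joint.items()
        if (x1 is None or a == x1) and (x2 is None or b == x2) and (y is None or c == y)
    )

def independent(joint):
    """Check P(X1=a, X2=b) == P(X1=a) P(X2=b) for all a, b in {0, 1}."""
    return all(
        prob(joint, x1=a, x2=b) == prob(joint, x1=a) * prob(joint, x2=b)
        for a, b in product((0, 1), repeat=2)
    )

def conditionally_independent(joint):
    """Check P(a, b | c) == P(a | c) P(b | c), written without division as
    P(a, b, c) P(c) == P(a, c) P(b, c), for all a, b, c in {0, 1}."""
    return all(
        prob(joint, a, b, c) * prob(joint, y=c)
        == prob(joint, x1=a, y=c) * prob(joint, x2=b, y=c)
        for a, b, c in product((0, 1), repeat=3)
    )

# Construction 1: X1, X2 independent fair bits, Y = X1 XOR X2.
xor_joint = {(a, b, a ^ b): Fraction(1, 4) for a, b in product((0, 1), repeat=2)}

# Construction 2: Y is a fair bit; given Y = c, X1 and X2 are independent
# Bernoulli(bias[c]) variables.
bias = {0: Fraction(1, 4), 1: Fraction(3, 4)}

def bernoulli_pmf(p, value):
    return p if value == 1 else 1 - p

mix_joint = {
    (a, b, c): Fraction(1, 2) * bernoulli_pmf(bias[c], a) * bernoulli_pmf(bias[c], b)
    for a, b, c in product((0, 1), repeat=3)
}

print(independent(xor_joint), conditionally_independent(xor_joint))  # True False
print(independent(mix_joint), conditionally_independent(mix_joint))  # False True
```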

Conditional expectation

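Definition. (Conditional expectation) Let $f_{X|Y}(x|y)$ be the conditional probability function of $X$ given $Y=y$. Then,
$$\mathbb E[X|Y=y]=\begin{cases}\sum_{x}x\,f_{X|Y}(x|y),&X\text{ is discrete};\\ \int_{-\infty}^{\infty}x\,f_{X|Y}(x|y)\,dx,&X\text{ is continuous}.\end{cases}$$

Remark.

  • $\mathbb E[X|Y=y]$ is a function of $y$.
  • The random variable obtained by evaluating this function at $Y$, which is a function of $Y$, is written as $\mathbb E[X|Y]$ for brevity.
  • $\mathbb E[X|Y=y]$ is the realization of $\mathbb E[X|Y]$ when $Y$ is observed to be $y$.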
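
To illustrate that the conditional expectation is a function of the observed value $y$, the sketch below computes $\mathbb E[X|Y=y]=\sum_x x\,f(x,y)/f_Y(y)$ for each $y$ from a small joint pmf. The joint pmf and the name `conditional_expectation` are made-up assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf {(x, y): probability}; the numbers are illustrative.
joint_pmf = {
    (1, 0): Fraction(1, 10), (2, 0): Fraction(2, 10), (3, 0): Fraction(1, 10),
    (1, 1): Fraction(1, 10), (2, 1): Fraction(1, 10), (3, 1): Fraction(4, 10),
}

def conditional_expectation(joint, y):
    """E[X | Y = y] = sum_x x * f(x, y) / f_Y(y), assuming f_Y(y) > 0."""
    f_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    return sum(x * p for (x, y_val), p in joint.items() if y_val == y) / f_y

# One value per possible observation of Y; plugging the random Y into this
# mapping gives the random variable E[X | Y].
cond_mean = {y: conditional_expectation(joint_pmf, y) for y in (0, 1)}
print(cond_mean)  # E[X | Y = 0] = 2 and E[X | Y = 1] = 5/2
```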

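Similarly, we have a conditional version of the law of the unconscious statistician.

Proposition. (Law of the unconscious statistician (conditional version)) Let $f_{X|Y}(x|y)$ be the conditional probability function of $X$ given $Y=y$. Then, for each function $g$,
$$\mathbb E[g(X)|Y=y]=\begin{cases}\sum_{x}g(x)f_{X|Y}(x|y),&X\text{ is discrete};\\ \int_{-\infty}^{\infty}g(x)f_{X|Y}(x|y)\,dx,&X\text{ is continuous}.\end{cases}$$

Proposition. (Conditional expectation under independence) If random variables $X,Y$ are independent, $\mathbb E[g(X)|Y]=\mathbb E[g(X)]$ for each function $g$.

Proof. For the continuous case (the discrete case is similar, with a sum in place of the integral), since $f_{X|Y}(x|y)=f_X(x)$ under independence,
$$\mathbb E[g(X)|Y=y]=\int_{-\infty}^{\infty}g(x)f_{X|Y}(x|y)\,dx=\int_{-\infty}^{\infty}g(x)f_X(x)\,dx=\mathbb E[g(X)]$$
for each $y$.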

 

Remark.

  • This equality may not hold if $X,Y$ are not independent.

Example. Suppose random vector   in which   are independent random variables, and  . Then,   (  is treated as constant, because of the conditioning: it is constant after realization of  ) but  

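The properties of $\mathbb E[\cdot]$ still hold for conditional expectations $\mathbb E[\cdot|Y]$, with every 'unconditional' expectation replaced by a conditional expectation and some suitable modifications, as follows:

Proposition. (Properties of conditional expectation) For each random variable $X$,

  • (linearity) $\mathbb E[a(Y)X+b(Y)Z|Y]=a(Y)\mathbb E[X|Y]+b(Y)\mathbb E[Z|Y]$
for all functions $a(Y),b(Y)$ of $Y$ and for each random variable $Z$
  • (nonnegativity) if $X\ge 0$, $\mathbb E[X|Y]\ge 0$
  • (monotonicity) if $X\ge Z$, $\mathbb E[X|Y]\ge\mathbb E[Z|Y]$ for each random variable $Z$
  • (triangle inequality)

$$\big|\mathbb E[X|Y]\big|\le\mathbb E\big[|X|\,\big|\,Y\big]$$

  • (multiplicativity under independence) if $X,Z$ are conditionally independent given $Y$,

$$\mathbb E[XZ|Y]=\mathbb E[X|Y]\,\mathbb E[Z|Y]$$

Proof. The proof is similar to the one for 'unconditional' expectations.

 

Remark.

  • $a(Y),b(Y)$ are treated as constants given $Y$, since after observing the value of $Y$, they cannot be changed.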
  • Each result also holds with   replaced by random vectors  .

The following theorem about conditional expectation is quite important.

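Theorem. (Law of total expectation) For each function $g$ and for each random variable $X$,
$$\mathbb E[g(X)]=\mathbb E\big[\mathbb E[g(X)|Y]\big].$$

Proof. We show the case where $X,Y$ are both continuous; the other cases are similar, with sums in place of the corresponding integrals.
$$\mathbb E\big[\mathbb E[g(X)|Y]\big]=\int_{-\infty}^{\infty}\mathbb E[g(X)|Y=y]f_Y(y)\,dy=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x)f_{X|Y}(x|y)f_Y(y)\,dx\,dy=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x)f(x,y)\,dy\,dx=\int_{-\infty}^{\infty}g(x)f_X(x)\,dx=\mathbb E[g(X)].$$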
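
As a small check of the identity $\mathbb E[X]=\mathbb E[\mathbb E[X|Y]]$ (the case where $g$ is the identity), the sketch below uses a hypothetical two-stage experiment: $Y$ is a fair die roll and, given $Y=y$, $X$ counts heads in $y$ fair coin tosses. The experiment and all names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

# Hypothetical two-stage experiment: Y is a fair die roll, and given Y = y,
# X is the number of heads in y fair coin tosses, i.e. Binomial(y, 1/2).
pmf_y = {y: Fraction(1, 6) for y in range(1, 7)}

def cond_pmf_x_given_y(x, y):
    """Conditional pmf of X given Y = y: the Binomial(y, 1/2) pmf at x."""
    return Fraction(comb(y, x), 2 ** y)

# Joint pmf f(x, y) = f_{X|Y}(x|y) f_Y(y).
joint = {(x, y): cond_pmf_x_given_y(x, y) * pmf_y[y]
         for y in pmf_y for x in range(y + 1)}

def cond_expectation(y):
    """E[X | Y = y], computed from the conditional pmf."""
    return sum(x * cond_pmf_x_given_y(x, y) for x in range(y + 1))

# Left side: average the conditional means over the distribution of Y.
mean_of_cond_means = sum(cond_expectation(y) * pmf_y[y] for y in pmf_y)
# Right side: E[X] computed directly from the joint pmf.
mean_direct = sum(x * p for (x, _), p in joint.items())

print(mean_of_cond_means, mean_direct)  # both equal 7/4
```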

 

Remark.

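  • We can replace $g(X)$ by $X$ (i.e. take $g$ to be the identity function) and get

$$\mathbb E[X]=\mathbb E\big[\mathbb E[X|Y]\big].$$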

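Corollary. (Generalized law of total probability) For each event $A$,
$$\mathbb P(A)=\mathbb E_Y\big[\mathbb P(A|Y)\big].$$

Proof.

  • First, by the fundamental bridge between probability and expectation,

$$\mathbb P(A|Y)=\mathbb E\big[\mathbf 1\{A\}\big|Y\big].$$

  • Then, using the law of total expectation,

$$\mathbb E_Y\big[\mathbb P(A|Y)\big]=\mathbb E\big[\mathbb E[\mathbf 1\{A\}|Y]\big]=\mathbb E[\mathbf 1\{A\}]=\mathbb P(A).$$

 

Remark.

  • The expectation is taken with respect to $Y$, so we use the $\mathbb E_Y[\cdot]$ notation. We will use similar notation to indicate the random variable with respect to which an expectation is taken, when needed.
  • We can replace $Y$ by $\mathbf Y$, which is a random vector.
  • If $Y$ is discrete, then the expanded form of the result is $\mathbb P(A)=\sum_{y}\mathbb P(A|Y=y)\,\mathbb P(Y=y)$ (discrete case for law of total probability).
  • If $Y$ is continuous, then the expanded form of the result is $\mathbb P(A)=\int_{-\infty}^{\infty}\mathbb P(A|Y=y)f_Y(y)\,dy$ (continuous case for law of total probability).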

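Corollary. (Expectation version of law of total probability) Suppose the sample space $\Omega=\bigcup_{i=1}^{\infty}B_i$, in which the $B_i$'s are mutually exclusive events with $\mathbb P(B_i)>0$. Then,
$$\mathbb E[X]=\sum_{i=1}^{\infty}\mathbb E[X|B_i]\,\mathbb P(B_i).$$

Proof. Define $Y=i$ if $B_i$ occurs, in which $i$ is a positive integer. Then,
$$\mathbb E[X]=\mathbb E\big[\mathbb E[X|Y]\big]=\sum_{i=1}^{\infty}\mathbb E[X|Y=i]\,\mathbb P(Y=i)=\sum_{i=1}^{\infty}\mathbb E[X|B_i]\,\mathbb P(B_i).$$

 

Remark.

  • The number of events can also be finite, as long as they are mutually exclusive and their union is the whole sample space.
  • If $X=\mathbf 1\{A\}$, it reduces to the law of total probability.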

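Example. Let $X$ be the human height in m. A person is randomly selected from a population consisting of the same number of men and women. Given that the mean height of a man is 1.8 m, and that of a woman is 1.7 m, the mean height of the entire population is
$$\mathbb E[X]=\mathbb E[X|\text{man}]\,\mathbb P(\text{man})+\mathbb E[X|\text{woman}]\,\mathbb P(\text{woman})=1.8(0.5)+1.7(0.5)=1.75\text{ m}.$$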

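Corollary. (formula of expectation conditional on event) For each random variable $X$ and event $A$ with $\mathbb P(A)>0$,
$$\mathbb E[X|A]=\frac{\mathbb E[X\mathbf 1\{A\}]}{\mathbb P(A)}.$$

Proof. By the formula of expectation computed by weighted average of conditional expectations,
$$\mathbb E[X\mathbf 1\{A\}]=\mathbb E[X\mathbf 1\{A\}|A]\,\mathbb P(A)+\mathbb E[X\mathbf 1\{A\}|A^c]\,\mathbb P(A^c)=\mathbb E[X|A]\,\mathbb P(A)+0,$$
and the result follows if $\mathbb P(A)>0$.

 

Remark.

  • If $X=\mathbf 1\{B\}$, it reduces to the definition of the conditional probability $\mathbb P(B|A)$ by the fundamental bridge between probability and expectation.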
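
A quick numerical illustration of the formula $\mathbb E[X|A]=\mathbb E[X\mathbf 1\{A\}]/\mathbb P(A)$, assuming a fair six-sided die with $X$ the face value and $A$ the event that the roll is even. The die, the event and the variable names are hypothetical choices for illustration.

```python
from fractions import Fraction

# Hypothetical setting: X is the face value of a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def in_A(outcome):
    """Event A: the roll is even."""
    return outcome % 2 == 0

p_A = sum(p for k, p in pmf.items() if in_A(k))                 # P(A)
e_x_indicator = sum(k * p for k, p in pmf.items() if in_A(k))   # E[X 1{A}]
e_x_given_A = e_x_indicator / p_A                               # E[X | A]

# Cross-check: average X under the pmf renormalized on A.
direct = sum(k * (p / p_A) for k, p in pmf.items() if in_A(k))

print(e_x_given_A, direct)  # both equal 4
```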

After defining conditional expectation, we can also have conditional variance, covariance and correlation coefficient, since variance, covariance, and correlation coefficient are built upon expectation.

Conditional expectations of bivariate normal distribution

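Proposition. (Conditional expectations of bivariate normal distribution) Let $(X,Y)^T\sim\mathcal N_2(\boldsymbol\mu,\boldsymbol\Sigma)$. Then,
$$\mathbb E[X|Y=y]=\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y)\quad\text{and}\quad\mathbb E[Y|X=x]=\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X).$$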

Proof.

  • The result follows from the proposition about conditional distributions of bivariate normal distribution readily.

 


Conditional variance

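Definition. (Conditional variance) The conditional variance of random variable $X$ given $Y$ is
$$\operatorname{Var}(X|Y)=\mathbb E\big[(X-\mathbb E[X|Y])^2\big|Y\big].$$

Similarly, we have properties of conditional variance which are similar to those of variance.

Proposition. (Properties of conditional variance) For each random variable $X$,

  • (alternative formula of conditional variance) $\operatorname{Var}(X|Y)=\mathbb E[X^2|Y]-(\mathbb E[X|Y])^2$
  • (invariance under change in location parameter) $\operatorname{Var}(X+b(Y)|Y)=\operatorname{Var}(X|Y)$ for each function $b(Y)$ of $Y$
  • (homogeneity of degree two) $\operatorname{Var}(a(Y)X|Y)=a(Y)^2\operatorname{Var}(X|Y)$ for each function $a(Y)$ of $Y$
  • (nonnegativity) $\operatorname{Var}(X|Y)\ge 0$
  • (zero variance implies non-randomness) $\operatorname{Var}(X|Y)=0$ implies that $X=h(Y)$ with probability 1 for some function $h$ of $Y$
  • (additivity under independence) if $X,Z$ are conditionally independent given $Y$, $\operatorname{Var}(X+Z|Y)=\operatorname{Var}(X|Y)+\operatorname{Var}(Z|Y)$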

Proof. The proof is similar to the one for properties of variance.

 

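Besides the law of total expectation, we also have the law of total variance, as follows:

Proposition. (Law of total variance) For each random variable $X$,
$$\operatorname{Var}(X)=\mathbb E\big[\operatorname{Var}(X|Y)\big]+\operatorname{Var}\big(\mathbb E[X|Y]\big).$$

Proof. Using the law of total expectation and the alternative formula $\mathbb E[X^2|Y]=\operatorname{Var}(X|Y)+(\mathbb E[X|Y])^2$,
$$\operatorname{Var}(X)=\mathbb E[X^2]-(\mathbb E[X])^2=\mathbb E\big[\mathbb E[X^2|Y]\big]-\big(\mathbb E\big[\mathbb E[X|Y]\big]\big)^2=\mathbb E\big[\operatorname{Var}(X|Y)\big]+\mathbb E\big[(\mathbb E[X|Y])^2\big]-\big(\mathbb E\big[\mathbb E[X|Y]\big]\big)^2=\mathbb E\big[\operatorname{Var}(X|Y)\big]+\operatorname{Var}\big(\mathbb E[X|Y]\big).$$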
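
The decomposition $\operatorname{Var}(X)=\mathbb E[\operatorname{Var}(X|Y)]+\operatorname{Var}(\mathbb E[X|Y])$ can be verified exactly on a small discrete example. The sketch below does so for a made-up joint pmf; the numbers and helper names are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical joint pmf {(x, y): probability}; the numbers are illustrative.
joint_pmf = {
    (1, 0): Fraction(1, 10), (2, 0): Fraction(2, 10), (3, 0): Fraction(1, 10),
    (1, 1): Fraction(1, 10), (2, 1): Fraction(1, 10), (3, 1): Fraction(4, 10),
}

def expectation(joint, h):
    """E[h(X, Y)] under the joint pmf."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

def conditional_moments(joint, y):
    """Return (E[X | Y=y], Var(X | Y=y), f_Y(y)), assuming f_Y(y) > 0."""
    f_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    mean = sum(x * p for (x, y_val), p in joint.items() if y_val == y) / f_y
    second = sum(x * x * p for (x, y_val), p in joint.items() if y_val == y) / f_y
    return mean, second - mean ** 2, f_y

# Left side: Var(X) = E[X^2] - (E[X])^2.
var_x = (expectation(joint_pmf, lambda x, y: x * x)
         - expectation(joint_pmf, lambda x, y: x) ** 2)

# Right side: E[Var(X | Y)] + Var(E[X | Y]).
y_values = sorted({y for (_, y) in joint_pmf})
moments = {y: conditional_moments(joint_pmf, y) for y in y_values}
mean_of_cond_var = sum(v * f_y for (_, v, f_y) in moments.values())
mean_of_cond_mean = sum(m * f_y for (m, _, f_y) in moments.values())
var_of_cond_mean = (sum(m * m * f_y for (m, _, f_y) in moments.values())
                    - mean_of_cond_mean ** 2)

print(var_x, mean_of_cond_var + var_of_cond_mean)  # both equal 61/100
```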

 

Remark.

  • We can replace $Y$ by $\mathbf Y$, a random vector.

Conditional variances of bivariate normal distribution

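Proposition. (Conditional variances of bivariate normal distribution) Let $(X,Y)^T\sim\mathcal N_2(\boldsymbol\mu,\boldsymbol\Sigma)$. Then,
$$\operatorname{Var}(X|Y=y)=(1-\rho^2)\sigma_X^2\quad\text{and}\quad\operatorname{Var}(Y|X=x)=(1-\rho^2)\sigma_Y^2.$$

Proof.

  • The result follows from the proposition about conditional distributions of bivariate normal distribution readily.

 

Remark.

  • It can be observed that the exact values of $x$ and $y$ in the conditions do not matter. The result is the same for different values of them.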


Conditional covariance

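Definition. (Conditional covariance) The conditional covariance of $X$ and $Y$ given $Z$ is
$$\operatorname{Cov}(X,Y|Z)=\mathbb E\big[(X-\mathbb E[X|Z])(Y-\mathbb E[Y|Z])\big|Z\big].$$

Proposition. (Properties of conditional covariance)

(i) (symmetry) for each random variable $X,Y$, $\operatorname{Cov}(X,Y|Z)=\operatorname{Cov}(Y,X|Z)$
(ii) for each random variable $X$, $\operatorname{Cov}(X,X|Z)=\operatorname{Var}(X|Z)$
(iii) (alternative formula of covariance) $\operatorname{Cov}(X,Y|Z)=\mathbb E[XY|Z]-\mathbb E[X|Z]\,\mathbb E[Y|Z]$
(iv) for each constant $a_1,\dots,a_n,b_1,\dots,b_m,c,d$, and for each random variables $X_1,\dots,X_n,Y_1,\dots,Y_m$, $\operatorname{Cov}\left(c+\sum_{i=1}^{n}a_iX_i,\ d+\sum_{j=1}^{m}b_jY_j\,\Big|\,Z\right)=\sum_{i=1}^{n}\sum_{j=1}^{m}a_ib_j\operatorname{Cov}(X_i,Y_j|Z)$
(v) for each random variable $X_1,\dots,X_n$, $\operatorname{Var}(X_1+\dots+X_n|Z)=\sum_{i=1}^{n}\operatorname{Var}(X_i|Z)+2\sum_{1\le i<j\le n}\operatorname{Cov}(X_i,X_j|Z)$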


Conditional correlation coefficient

Definition. (Conditional correlation coefficient) The conditional correlation coefficient of random variables $X$ and $Y$ given $Z$ is
$$\rho(X,Y|Z)=\frac{\operatorname{Cov}(X,Y|Z)}{\sqrt{\operatorname{Var}(X|Z)\operatorname{Var}(Y|Z)}}.$$

Remark.

  • Similar to the 'unconditional' correlation coefficient, the conditional correlation coefficient also lies between $-1$ and $1$ inclusive. The proof is similar, with every unconditional term replaced by the corresponding conditional term.


Conditional quantile

Definition. (Conditional quantile) The conditional $p$th quantile of $X$ given $Y=y$ is
$$F_{X|Y}^{-1}(p|y)=\inf\{x:F_{X|Y}(x|y)\ge p\}.$$

Remark.

  • Then, we can have conditional median, interquartile range, etc., which are defined using conditional quantile in the same way as the unconditional ones
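
As a rough sketch for the discrete case, the snippet below computes a conditional quantile as the smallest support point $x$ whose conditional cdf value is at least $p$. This generalized-inverse convention, $\inf\{x: F_{X|Y}(x|y)\ge p\}$, is an assumption about the intended definition, and the conditional pmf and function names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical conditional pmf of X given Y = y for one fixed y,
# stored as {x: probability}.
cond_pmf = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}

def conditional_cdf(pmf, x):
    """F(x | y): sum of the conditional pmf over support points <= x."""
    return sum(p for value, p in pmf.items() if value <= x)

def conditional_quantile(pmf, p):
    """Smallest support point x with F(x | y) >= p, for 0 < p <= 1.

    Assumes the generalized-inverse convention inf{x : F(x | y) >= p}.
    """
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    for x in sorted(pmf):
        if conditional_cdf(pmf, x) >= p:
            return x
    raise ValueError("conditional pmf does not sum to at least p")

median = conditional_quantile(cond_pmf, Fraction(1, 2))
upper_quartile = conditional_quantile(cond_pmf, Fraction(3, 4))
print(median, upper_quartile)  # 2 3
```

The conditional median and the conditional interquartile range then follow from `conditional_quantile` in the same way as their unconditional counterparts.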