Probability/Transformation of Random Variables

Transformation of random variables

Underlying principle

Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be ${\displaystyle n}$  random variables, ${\displaystyle Y_{1},\dotsc ,Y_{n}}$  be another ${\displaystyle n}$  random variables, and ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{n})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{n})^{T}}$  be random (column) vectors.

Suppose the vector-valued function[1] ${\displaystyle \mathbf {g} :\operatorname {supp} (\mathbf {X} )\to \operatorname {supp} (\mathbf {Y} )}$  is bijective (it is also called one-to-one correspondence in this case). Then, its inverse ${\displaystyle \mathbf {g} ^{-1}:\operatorname {supp} (\mathbf {Y} )\to \operatorname {supp} (\mathbf {X} )}$  exists.

After that, we can transform ${\displaystyle \mathbf {X} }$  to ${\displaystyle \mathbf {Y} }$  by applying the transformation ${\displaystyle \mathbf {g} }$ , i.e. by ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$ , and transform ${\displaystyle \mathbf {Y} }$  to ${\displaystyle \mathbf {X} }$  by applying the inverse transformation ${\displaystyle \mathbf {g} ^{-1}}$ , i.e. by ${\displaystyle \mathbf {X} =\mathbf {g} ^{-1}(\mathbf {Y} )}$ .

We are often interested in deriving the joint probability function ${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )}$  of ${\displaystyle \mathbf {Y} }$ , given the joint probability function ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$  of ${\displaystyle \mathbf {X} }$ . We will examine the discrete and continuous cases one by one in the following.

Transformation of discrete random variables

Proposition. (transformation of discrete random variables) For each discrete random vector ${\displaystyle \mathbf {X} }$  with joint pmf ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$ , the corresponding joint pmf of the transformed random vector ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$  where ${\displaystyle \mathbf {g} }$  is bijective is

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=f_{\mathbf {X} }\left(\mathbf {g} ^{-1}(\mathbf {y} )\right),\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

Proof. Starting from the definition of the pmf of ${\displaystyle \mathbf {Y} }$ , we have

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} ){\overset {\text{ def }}{=}}\mathbb {P} (\mathbf {Y} =\mathbf {y} )=\mathbb {P} \left(\mathbf {g} ^{-1}(\mathbf {Y} )=\mathbf {g} ^{-1}(\mathbf {y} )\right)=\mathbb {P} \left(\mathbf {X} =\mathbf {g} ^{-1}(\mathbf {y} )\right){\overset {\text{ def }}{=}}f_{\mathbf {X} }\left(\mathbf {g} ^{-1}(\mathbf {y} )\right),\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

(The inverse ${\displaystyle \mathbf {g} ^{-1}}$  exists, and so the second equality is valid, since ${\displaystyle \mathbf {g} }$  is bijective.)

${\displaystyle \Box }$
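As a small illustration (a toy example, not from the text), the proposition says a bijection simply relabels the support while the probability masses move along unchanged:

```python
# Sketch: pushing a discrete pmf through a bijection g(x) = 2x + 1; the
# probability masses move to the new support unchanged, since
# f_Y(y) = f_X(g^{-1}(y)).
def transform_pmf(pmf_x, g):
    """Push a pmf, stored as {x: P(X = x)}, through a bijective g."""
    return {g(x): p for x, p in pmf_x.items()}

pmf_x = {0: 0.2, 1: 0.5, 2: 0.3}                       # toy pmf of X
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 1)      # pmf of Y = 2X + 1
print(pmf_y)   # {1: 0.2, 3: 0.5, 5: 0.3}
```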

Transformation of continuous random variables

For continuous random variables, the situation is more complicated.

Let us investigate the case for univariate pdf, which is simpler.

Theorem. (Transformation of continuous random variable (univariate case)) Let ${\displaystyle X}$  be a continuous random variable with pdf ${\displaystyle f_{X}(x)}$ . Assume that the function ${\displaystyle g}$  is differentiable and strictly monotone. Then, the pdf of the transformed random variable ${\displaystyle Y=g(X)}$  is

${\displaystyle f_{Y}(y)=f_{X}(g^{-1}(y))\left\vert {\frac {dx}{dy}}\right\vert ,\quad y\in \operatorname {supp} (Y).}$

Proof. Under the assumption that ${\displaystyle g}$  is differentiable and strictly monotone, the cdf ${\displaystyle F_{Y}(y)=\mathbb {P} (g(X)\leq y)={\begin{cases}\mathbb {P} (X\leq g^{-1}(y))=F_{X}(g^{-1}(y)),&g^{-1}{\text{ is increasing}};\\\mathbb {P} (X\geq g^{-1}(y))=1-F_{X}(g^{-1}(y)),&g^{-1}{\text{ is decreasing}}.\end{cases}}}$  (${\displaystyle g^{-1}}$  exists since ${\displaystyle g}$  is strictly monotone.) Differentiating both sides of the above equation (assuming the cdf's involved are differentiable) gives

${\displaystyle f_{Y}(y)={\begin{cases}f_{X}(g^{-1}(y)){\frac {dg^{-1}(y)}{dy}},&g^{-1}{\text{ is increasing}};\\-f_{X}(g^{-1}(y)){\frac {dg^{-1}(y)}{dy}},&g^{-1}{\text{ is decreasing}}.\\\end{cases}}}$

Since ${\displaystyle x=g^{-1}(y)}$ , we can write ${\displaystyle {\frac {dg^{-1}(y)}{dy}}}$  as ${\displaystyle {\frac {dx}{dy}}}$ . Also, we can summarize the above piecewise formula into a single expression by applying the absolute value function to both sides:
${\displaystyle f_{Y}(y)=f_{X}(g^{-1}(y))\left\vert {\frac {dx}{dy}}\right\vert ,}$

where the absolute value sign is applied only to ${\displaystyle {\frac {dx}{dy}}}$ , since the pdf's are nonnegative and thus unaffected by it.

${\displaystyle \Box }$
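A numerical sanity check of the theorem (the specific distribution and transformation here are our own choice, not from the text): take ${\displaystyle X\sim \operatorname {Exp} (1)}$  and ${\displaystyle g(x)={\sqrt {x}}}$ , strictly increasing on the support, so ${\displaystyle g^{-1}(y)=y^{2}}$  and ${\displaystyle f_{Y}(y)=f_{X}(y^{2})\,|2y|}$ .

```python
import math

# Sketch: check f_Y(y) = f_X(g^{-1}(y)) |dx/dy| for X ~ Exp(1) and
# g(x) = sqrt(x), so g^{-1}(y) = y^2, dx/dy = 2y, f_Y(y) = 2y e^{-y^2}.
f_X = lambda x: math.exp(-x)              # pdf of Exp(1)
f_Y = lambda y: f_X(y * y) * abs(2 * y)   # transformed pdf

# Riemann sum of f_Y over (0, 1]: should match P(X <= 1) = 1 - e^{-1},
# since Y <= 1 exactly when X = Y^2 <= 1.
dy = 1e-5
p = sum(f_Y(k * dy) * dy for k in range(1, 100_001))
print(p, 1 - math.exp(-1))   # both ≈ 0.632
```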

Remark.

• To explain this theorem in a more intuitive manner, we rewrite the equation in the theorem as

${\displaystyle |f_{Y}(y)dy|=|f_{X}(g^{-1}(y))dx|}$

where both sides of the equation can be regarded as differential areas, which are nonnegative due to the absolute value signs.
• This equation should intuitively hold since both sides represent areas under the pdf's, which are probabilities. The quantity ${\displaystyle |f_{X}(g^{-1}(y))dx|=|f_{X}(x)dx|}$  is the area of the region ${\displaystyle R_{X}}$  under the pdf of ${\displaystyle X}$  over an "infinitesimal" interval ${\displaystyle dx}$ , which represents the probability that ${\displaystyle X}$  lies in this infinitesimal interval ${\displaystyle dx}$ . After the transformation, the region ${\displaystyle R_{X}}$  is mapped to a region ${\displaystyle R_{Y}}$  under the pdf of ${\displaystyle Y}$  over an infinitesimal interval ${\displaystyle dy=g'(x)dx}$ , with area ${\displaystyle |f_{Y}(y)dy|}$ . Since ${\displaystyle g}$  is a bijective function (its strict monotonicity implies this), ${\displaystyle dy}$  "corresponds" to ${\displaystyle dx}$  in some sense: the values in ${\displaystyle dy}$  "originate" from the values in ${\displaystyle dx}$ , and so does the randomness. It follows that the probabilities of ${\displaystyle X}$  lying in ${\displaystyle dx}$  and of ${\displaystyle Y}$  lying in ${\displaystyle dy}$  are the same, and hence so are the two differential areas.

Let us define Jacobian matrix, and introduce several notations in the definition.

Definition. (Jacobian matrix) Suppose the function ${\displaystyle \mathbf {g} }$  is differentiable with nonsingular Jacobian matrix (then, by the inverse function theorem, ${\displaystyle \mathbf {g} ^{-1}}$  is also differentiable). The Jacobian matrix is

${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}={\begin{pmatrix}{\frac {\partial g_{1}(\mathbf {x} )}{\partial x_{1}}}&\dotsb &{\frac {\partial g_{1}(\mathbf {x} )}{\partial x_{n}}}\\\vdots &\ddots &\vdots \\{\frac {\partial g_{n}(\mathbf {x} )}{\partial x_{1}}}&\dotsb &{\frac {\partial g_{n}(\mathbf {x} )}{\partial x_{n}}}\end{pmatrix}},\quad \mathbf {y} =\mathbf {g} (\mathbf {x} )}$

in which ${\displaystyle g_{j}}$  is the component function of ${\displaystyle \mathbf {g} }$  for each ${\displaystyle j\in \{1,\dotsc ,n\}}$ , i.e. ${\displaystyle \mathbf {g} (\mathbf {x} )=(g_{1}(\mathbf {x} ),\dotsc ,g_{n}(\mathbf {x} ))}$ .

Remark.

• We have ${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}{\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}=I_{n\times n}\Leftrightarrow {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}=\left({\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right)^{-1}}$ .

Example. Suppose ${\displaystyle \mathbf {x} =(x_{1},x_{2})}$ , ${\displaystyle \mathbf {y} =(y_{1},y_{2})}$ , and ${\displaystyle \mathbf {y} =\mathbf {g} (\mathbf {x} )=({\color {red}2x_{1}},{\color {blue}3x_{2}})}$ . Then, ${\displaystyle g_{1}(\mathbf {x} )={\color {red}2x_{1}}}$ ,${\displaystyle g_{2}(\mathbf {x} )={\color {blue}3x_{2}}}$ , and

${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}={\begin{pmatrix}{\frac {\partial ({\color {red}2x_{1}})}{\partial x_{1}}}&{\frac {\partial ({\color {red}2x_{1}})}{\partial x_{2}}}\\{\frac {\partial ({\color {blue}3x_{2}})}{\partial x_{1}}}&{\frac {\partial ({\color {blue}3x_{2}})}{\partial x_{2}}}\end{pmatrix}}={\begin{pmatrix}2&0\\0&3\end{pmatrix}}.}$

Also, ${\displaystyle \mathbf {x} =\mathbf {g} ^{-1}(\mathbf {y} )=({\color {darkgreen}y_{1}/2},{\color {purple}y_{2}/3})}$ . Then, ${\displaystyle g_{1}^{-1}(\mathbf {y} )={\color {darkgreen}y_{1}/2}}$ , ${\displaystyle g_{2}^{-1}(\mathbf {y} )={\color {purple}y_{2}/3}}$ , and

${\displaystyle {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}={\begin{pmatrix}{\frac {\partial ({\color {darkgreen}y_{1}/2})}{\partial y_{1}}}&{\frac {\partial ({\color {darkgreen}y_{1}/2})}{\partial y_{2}}}\\{\frac {\partial ({\color {purple}y_{2}/3})}{\partial y_{1}}}&{\frac {\partial ({\color {purple}y_{2}/3})}{\partial y_{2}}}\end{pmatrix}}={\begin{pmatrix}1/2&0\\0&1/3\end{pmatrix}}={\begin{pmatrix}2&0\\0&3\end{pmatrix}}^{-1}={\frac {1}{6}}{\begin{pmatrix}3&0\\0&2\end{pmatrix}}.}$
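The example above can be checked mechanically; a minimal sketch (using plain nested lists rather than any matrix library) verifying that the two Jacobians multiply to the identity, as the remark states:

```python
# Sketch: the two Jacobian matrices from the example are inverses of
# each other, i.e. their product is the 2x2 identity matrix.
dy_dx = [[2.0, 0.0], [0.0, 3.0]]          # ∂y/∂x for g(x) = (2*x1, 3*x2)
dx_dy = [[0.5, 0.0], [0.0, 1.0 / 3.0]]    # ∂x/∂y for g^{-1}(y) = (y1/2, y2/3)

def matmul2(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

product = matmul2(dy_dx, dx_dy)
print(product)   # ≈ the 2x2 identity matrix
```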

Theorem. (Transformation of continuous random variables) Let ${\displaystyle \mathbf {X} }$  be a continuous random vector with joint pdf ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$ , and assume ${\displaystyle \mathbf {g} }$  is differentiable and bijective. The corresponding joint pdf of transformed random vector ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$  is

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=f_{\mathbf {X} }{\big (}\mathbf {g} ^{-1}(\mathbf {y} ){\big )}\left|\det {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right|,\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

Proof (partial). Assume ${\displaystyle \mathbf {g} }$  is differentiable and bijective, as in the theorem statement.

First,

${\displaystyle \mathbb {P} (\mathbf {Y} \in S)=\int \dotsi \int _{S}f_{\mathbf {Y} }(\mathbf {y} )\,dy_{1}\cdots \,dy_{n}\qquad (1).}$

On the other hand, we have

${\displaystyle \mathbb {P} (\mathbf {Y} \in S)=\mathbb {P} {\big (}\mathbf {X} =\mathbf {g} ^{-1}(\mathbf {Y} )\in \mathbf {g} ^{-1}(S){\big )}=\int \dotsi \int _{\mathbf {g} ^{-1}(S)}f_{\mathbf {X} }(\mathbf {x} )\,dx_{1}\cdots \,dx_{n}}$

where ${\displaystyle \mathbf {g} ^{-1}(S)=\{\mathbf {x} \in \operatorname {supp} (\mathbf {X} ):\mathbf {g} (\mathbf {x} )\in S\}}$ , which is the preimage of the set ${\displaystyle S}$  under ${\displaystyle \mathbf {g} }$ .

Applying the change of variable formula to this integral (whose proof is advanced and uses our assumptions), we get

${\displaystyle \int \dotsi \int _{\mathbf {g} ^{-1}(S)}f_{\mathbf {X} }(\mathbf {x} )\,dx_{1}\cdots \,dx_{n}=\int \dotsi \int _{S}f_{\mathbf {X} }{\big (}\mathbf {g} ^{-1}(\mathbf {y} ){\big )}\left|\det {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right|\,dy_{1}\cdots \,dy_{n}\qquad (2)}$

Comparing the integrals in ${\displaystyle (1)}$  and ${\displaystyle (2)}$ , we can observe the desired result.

${\displaystyle \Box }$
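A Monte Carlo sanity check of the theorem, reusing the Jacobian example above (the uniform choice of ${\displaystyle \mathbf {X} }$  is our own assumption): for independent ${\displaystyle X_{1},X_{2}\sim \operatorname {Unif} (0,1)}$  and ${\displaystyle \mathbf {Y} =(2X_{1},3X_{2})}$ , the formula gives ${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=1\cdot |{\det }\,\partial \mathbf {x} /\partial \mathbf {y} |=1/6}$  on ${\displaystyle (0,2)\times (0,3)}$ .

```python
import random

# Sketch: Monte Carlo check of the change-of-variables formula for
# X1, X2 ~ Unif(0,1) independent and Y = g(X) = (2*X1, 3*X2).
# The theorem gives f_Y(y) = f_X(y1/2, y2/3) * |det ∂x/∂y| = 1/6 on
# (0,2) x (0,3), so P(Y in [0,1]^2) should be 1 * 1 * (1/6) = 1/6.
random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    y1, y2 = 2 * random.random(), 3 * random.random()
    if y1 <= 1 and y2 <= 1:
        hits += 1
print(hits / n, 1 / 6)   # both ≈ 0.167
```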

Moment generating function

Definition. (Moment generating function) The moment generating function (mgf) for the distribution of a random variable ${\displaystyle X}$  is ${\displaystyle M_{X}({\color {darkgreen}t})=\mathbb {E} \left[e^{{\color {darkgreen}t}X}\right]}$ .

Remark.

• For comparison: cdf is ${\displaystyle F_{X}({\color {darkgreen}t})=\mathbb {E} [\mathbf {1} \{X\leq {\color {darkgreen}t}\}]}$ .
• The mgf, similar to the pmf, pdf and cdf, gives a complete description of a distribution: it uniquely identifies a distribution, provided that the mgf exists (the expectation may be infinite).
• In other words, we can recover the probability function from the mgf.
• The proof of this result is complicated, and thus omitted.

Proposition. (Moment generating property of mgf) Assuming mgf ${\displaystyle M_{X}({\color {darkgreen}t})}$  exists for ${\displaystyle t\in (-\varepsilon ,\varepsilon )}$  in which ${\displaystyle \varepsilon }$  is a positive number, we have

${\displaystyle \mathbb {E} [X^{n}]=\left.{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}M_{X}({\color {darkgreen}t})\right|_{{\color {darkgreen}t}=0}}$

for each nonnegative integer ${\displaystyle n}$ .

Proof.

• Since

${\displaystyle M_{X}({\color {darkgreen}t})=\mathbb {E} \left[e^{{\color {darkgreen}t}X}\right]=\mathbb {E} \left[1+{\color {darkgreen}t}X+{\frac {{\color {darkgreen}t}^{2}X^{2}}{2!}}+\dotsb \right]{\overset {\text{linearity}}{=}}1+{\color {darkgreen}t}\mathbb {E} [X]+{\frac {{\color {darkgreen}t}^{2}}{2!}}\mathbb {E} [X^{2}]+\dotsb ,}$

${\displaystyle {\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}M_{X}({\color {darkgreen}t}){\bigg |}_{{\color {darkgreen}t}=0}={\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}\left(1+{\color {darkgreen}t}\mathbb {E} [X]+{\frac {{\color {darkgreen}t}^{2}}{2!}}\mathbb {E} [X^{2}]+\dotsb \right){\bigg |}_{{\color {darkgreen}t}=0}=\mathbb {E} [X]{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t}+{\frac {\mathbb {E} [X^{2}]}{2!}}{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t^{2}}+\dotsb ,}$

• The result follows from simplifying the above expression by ${\displaystyle {\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t}^{m}=\mathbf {1} \{m=n\}n!+\mathbf {1} \{m\neq n\}(0).}$

${\displaystyle \Box }$
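The moment generating property can be checked numerically; a sketch under our own choice of distribution, ${\displaystyle X\sim \operatorname {Exp} (2)}$  with ${\displaystyle M_{X}(t)=2/(2-t)}$ , so ${\displaystyle \mathbb {E} [X]=1/2}$  and ${\displaystyle \mathbb {E} [X^{2}]=2/\lambda ^{2}=1/2}$ :

```python
# Sketch: recovering moments from an mgf by numerical differentiation,
# for X ~ Exp(2) with M_X(t) = 2/(2 - t), so E[X] = 1/2 and E[X^2] = 1/2.
M = lambda t: 2.0 / (2.0 - t)
h = 1e-4
first = (M(h) - M(-h)) / (2 * h)              # central difference ≈ M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / (h * h)  # ≈ M''(0) = E[X^2]
print(first, second)   # both ≈ 0.5
```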

Proposition. (Relationship between independence and mgf) If ${\displaystyle X}$  and ${\displaystyle Y}$  are independent,

${\displaystyle M_{XY}({\color {darkgreen}t})={\color {blue}\mathbb {E} _{X}[}M_{Y}({\color {darkgreen}t}{\color {blue}X}){\color {blue}]}={\color {red}\mathbb {E} _{Y}[}M_{X}({\color {darkgreen}t}{\color {red}Y}){\color {red}]}.}$

Proof.

${\displaystyle M_{XY}({\color {darkgreen}t})=\mathbb {E} [e^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}]{\overset {\text{lote}}{=}}{\color {blue}\mathbb {E} _{X}{\bigg [}}{\color {red}\mathbb {E} _{Y}[e}^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}|{\color {blue}X}{\color {red}]}{\color {blue}{\bigg ]}}={\color {blue}\mathbb {E} _{X}[}M_{Y}({\color {darkgreen}t}{\color {blue}X}){\color {blue}]}.}$

Similarly,
${\displaystyle M_{XY}({\color {darkgreen}t})=\mathbb {E} [e^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}]{\overset {\text{lote}}{=}}{\color {red}\mathbb {E} _{Y}{\bigg [}}{\color {blue}\mathbb {E} _{X}[e}^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}|{\color {red}Y}{\color {blue}]}{\color {red}{\bigg ]}}={\color {red}\mathbb {E} _{Y}[}M_{X}({\color {darkgreen}t}{\color {red}Y}){\color {red}]}.}$

• lote: law of total expectation

${\displaystyle \Box }$

Remark.

• This equality need not hold if ${\displaystyle X}$  and ${\displaystyle Y}$  are not independent.

Joint moment generating function

In the following, we will use ${\displaystyle \mathbf {X} }$  to denote ${\displaystyle (X_{1},\dotsc ,X_{n})^{T}}$ .

Definition. (Joint moment generating function) The joint moment generating function (mgf) of random vector ${\displaystyle \mathbf {X} }$  is

${\displaystyle M_{\mathbf {X} }({\color {darkgreen}\mathbf {t} })=\mathbb {E} [e^{{\color {darkgreen}\mathbf {t} }\cdot \mathbf {X} }]=\mathbb {E} [e^{{\color {darkgreen}t_{1}}X_{1}+\dotsb +{\color {darkgreen}t_{n}}X_{n}}]}$

for each (column) vector ${\displaystyle \mathbf {t} =(t_{1},\dotsc ,t_{n})^{T}}$ , if the expectation exists.

Remark.

• When ${\displaystyle n=1}$ , the dot product reduces to the ordinary product of two numbers.
• ${\displaystyle \mathbf {t} \cdot \mathbf {X} {\overset {\text{ def }}{=}}\mathbf {t} ^{T}\mathbf {X} }$ .

Proposition. (Relationship between independence and mgf) Random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent if and only if

${\displaystyle M_{\mathbf {X} }(\mathbf {t} )=M_{X_{1}}(t_{1})\dotsb M_{X_{n}}(t_{n}).}$

Proof. 'only if' part: Assume ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then,

${\displaystyle M_{\mathbf {X} }(\mathbf {t} )=\mathbb {E} [e^{\mathbf {t} \cdot \mathbf {X} }]=\mathbb {E} [e^{t_{1}X_{1}}\dotsb e^{t_{n}X_{n}}]{\overset {\text{ independence }}{=}}\mathbb {E} [e^{t_{1}X_{1}}]\dotsb \mathbb {E} [e^{t_{n}X_{n}}]=M_{X_{1}}(t_{1})\dotsb M_{X_{n}}(t_{n}).}$

The proof of the 'if' part is quite complicated, and thus omitted.

${\displaystyle \Box }$

Analogously, we have marginal mgf.

Definition. (Marginal mgf) The marginal mgf of ${\displaystyle X_{i}}$ , one of the random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$ , is

${\displaystyle M_{X_{i}}(t)=M_{\mathbf {X} }(0,\dotsc ,0,\underbrace {t} _{i{\text{ th position}}},0,\dotsc ,0).}$

Proposition. (Moment generating function of linear transformation of random variables) For each constant vector ${\displaystyle {\color {red}\mathbf {a} }=({\color {red}a_{1}},\dotsc ,{\color {red}a_{n}})}$  and a real constant ${\displaystyle {\color {blue}b}}$ , the mgf of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}$  is

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}\mathbf {a} })=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}a_{1}},\dotsc ,t{\color {red}a_{n}}).}$

Proof.

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=\mathbb {E} [e^{t{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}t}]=e^{{\color {blue}b}t}\mathbb {E} [e^{t{\color {red}\mathbf {a} }\cdot \mathbf {X} }]=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}\mathbf {a} })=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}a_{1}},\dotsc ,t{\color {red}a_{n}}).}$

${\displaystyle \Box }$

Remark.

• If ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent,

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X_{1}}(t{\color {red}a_{1}})\dotsb M_{X_{n}}(t{\color {red}a_{n}}).}$

• This provides an alternative, and possibly more convenient method to derive the distribution of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}$ , compared with deriving it from probability functions of ${\displaystyle X_{1},\dotsc ,X_{n}}$ .
• Special case: if ${\displaystyle {\color {red}\mathbf {a} }=(1,\dotsc ,1)^{T}}$  and ${\displaystyle {\color {blue}b}=0}$ , then ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}=X_{1}+\dotsb +X_{n}}$ , the sum of the r.v.'s.
• So, ${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{\mathbf {X} }(t,\dotsc ,t)}$ .
• In particular, if ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent, then ${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)}$ .
• We can use this result to prove the formulas for sums of independent r.v.'s, instead of using the proposition about convolution of r.v.'s.
• Special case: if ${\displaystyle n=1}$ , then the expression for linear transformation becomes ${\displaystyle {\color {red}a}X+{\color {blue}b}}$ .
• So, ${\displaystyle M_{{\color {red}a}X+{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X}({\color {red}a}t)}$ .

Moment generating function of some important distributions

Proposition. (Moment generating function of binomial distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Binom} (n,p)}$  is ${\displaystyle M_{X}(t)=(pe^{t}+1-p)^{n}}$ .

Proof.

${\displaystyle M_{X}(t)=\sum _{k=0}^{n}{\color {blue}e^{tk}}\underbrace {{\binom {n}{k}}{\color {blue}p^{k}}(1-p)^{n-k}} _{{\text{for}}\operatorname {Binom} (n,p)}=\sum _{k=0}^{n}{\binom {n}{k}}{\color {blue}(pe^{t})^{k}}(1-p)^{n-k}=(pe^{t}+1-p)^{n}\quad {\text{by binomial theorem}}.}$

${\displaystyle \Box }$
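Because the binomial support is finite, the closed form can be checked against the defining sum exactly; a small sketch (the values ${\displaystyle n=10}$ , ${\displaystyle p=0.3}$  are arbitrary illustrative choices):

```python
import math

# Sketch: checking the closed form (p e^t + 1 - p)^n against the defining
# sum E[e^{tX}] for X ~ Binom(n, p), at a few values of t.
n, p = 10, 0.3
checks = []
for t in (-1.0, 0.0, 0.5):
    direct = sum(math.exp(t * k) * math.comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(n + 1))
    closed = (p * math.exp(t) + 1 - p) ** n
    checks.append(abs(direct - closed) < 1e-10)
print(checks)   # [True, True, True]
```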

Proposition. (Moment generating function of Poisson distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Pois} (\lambda )}$  is ${\displaystyle M_{X}(t)=e^{\lambda (e^{t}-1)}}$ .

Proof.

${\displaystyle M_{X}(t){\overset {\text{ def }}{=}}\mathbb {E} [e^{tX}]{\overset {\text{ LOTUS }}{=}}\sum _{{\color {darkgreen}k}=0}^{\infty }e^{{\color {darkorange}t}{\color {darkgreen}k}}\cdot \underbrace {\frac {e^{\color {red}-\lambda }{\color {purple}\lambda }^{\color {darkgreen}k}}{{\color {darkgreen}k}!}} _{{\text{for }}\operatorname {Pois} (\lambda )}=e^{{\color {blue}\lambda }({\color {blue}e^{t}}{\color {red}-1})}\overbrace {\sum _{{\color {darkgreen}k}=0}^{\infty }\underbrace {\frac {e^{\color {blue}-\lambda e^{t}}({\color {purple}\lambda }e^{\color {darkorange}t})^{\color {darkgreen}k}}{{\color {darkgreen}k}!}} _{{\text{for }}\operatorname {Pois} (\lambda e^{t})}} ^{=1}=e^{\lambda (e^{t}-1)}.}$

${\displaystyle \Box }$

Proposition. (Moment generating function of exponential distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Exp} (\lambda )}$  is ${\displaystyle M_{X}(t)={\frac {\lambda }{\lambda -t}},\quad t<\lambda }$ .

Proof.

• ${\displaystyle M_{X}(t)=\mathbb {E} [e^{tX}]=\lambda \int _{0}^{\infty }e^{tx}e^{-\lambda x}\,dx=\lambda \int _{0}^{\infty }e^{-(\lambda -t)x}\,dx={\frac {\lambda }{\color {blue}\lambda -t}}\overbrace {\int _{0}^{\infty }\underbrace {{\color {blue}(\lambda -t)}e^{-(\lambda -t)x}} _{{\text{for}}\operatorname {Exp} (\lambda -t)}\,dx} ^{=1},\quad \underbrace {\lambda -t>0} _{\text{ensuring valid rate parameter}}\Leftrightarrow t<\lambda .}$

• The result follows.

${\displaystyle \Box }$

Proposition. (Moment generating function of gamma distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Gamma} (\alpha ,\lambda )}$  is ${\displaystyle M_{X}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha },\quad t<\lambda }$ .

Proof.

• We use similar proof technique from the proof for mgf of exponential distribution.

${\displaystyle M_{X}(t)=\mathbb {E} [e^{tX}]={\frac {\lambda ^{\alpha }}{\Gamma (\alpha )}}\int _{0}^{\infty }e^{tx}x^{\alpha -1}e^{-\lambda x}\,dx={\frac {\lambda ^{\alpha }}{\Gamma (\alpha )}}\int _{0}^{\infty }e^{-(\lambda -t)x}x^{\alpha -1}\,dx={\frac {\lambda ^{\alpha }}{\color {blue}(\lambda -t)^{\alpha }}}\overbrace {\int _{0}^{\infty }\underbrace {{\frac {\color {blue}(\lambda -t)^{\alpha }}{\Gamma (\alpha )}}e^{-(\lambda -t)x}x^{\alpha -1}} _{{\text{for}}\operatorname {Gamma} (\alpha ,\lambda -t)}\,dx} ^{=1},\quad \underbrace {\lambda -t>0} _{\text{ensuring valid rate parameter}}\Leftrightarrow t<\lambda .}$

• The result follows.

${\displaystyle \Box }$

Proposition. (Moment generating function of normal distribution) The moment generating function of ${\displaystyle X\sim {\mathcal {N}}({\color {blue}\mu },{\color {red}\sigma ^{2}})}$  is ${\displaystyle M_{X}(t)=e^{{\color {blue}\mu }t+{\color {red}\sigma ^{2}}t^{2}/2}}$ .

Proof.

• Let ${\displaystyle Z={\frac {X-{\color {blue}\mu }}{\color {red}\sigma }}\sim {\mathcal {N}}(0,1)}$ . Then, ${\displaystyle X={\color {red}\sigma }Z+{\color {blue}\mu }}$ .
• First, consider the mgf of ${\displaystyle Z}$ :

${\displaystyle M_{Z}(t){\overset {\text{ def }}{=}}\mathbb {E} [e^{tZ}]={\frac {1}{\sqrt {2\pi }}}\int _{-\infty }^{\infty }\underbrace {e^{tx}e^{-x^{2}/2}} _{=e^{-(x^{2}-2tx)/2}}\,dx={\frac {1}{\sqrt {2\pi }}}\int _{-\infty }^{\infty }\exp {\big (}\overbrace {-(x^{2}-2tx+{\color {darkgreen}t^{2}})} ^{=-(x-t)^{2}}/2+{\color {darkgreen}t^{2}/2}{\big )}\,dx=e^{t^{2}/2}\overbrace {\int _{-\infty }^{\infty }\underbrace {{\frac {1}{\sqrt {2\pi }}}\cdot e^{-(x-t)^{2}/2}} _{{\text{for }}{\mathcal {N}}(t,1)}\,dx} ^{=1}=e^{t^{2}/2}.}$

• It follows that the mgf of ${\displaystyle X}$  is

${\displaystyle M_{X}(t)=e^{{\color {blue}\mu }t}M_{Z}({\color {red}\sigma }t)=e^{{\color {blue}\mu }t}e^{{\color {red}\sigma }^{2}t^{2}/2}.}$

• The result follows.

${\displaystyle \Box }$
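A Monte Carlo sanity check of the standard normal mgf (the value ${\displaystyle t=0.5}$  and the sample size are our own illustrative choices):

```python
import math, random

# Sketch: Monte Carlo check that the mgf of Z ~ N(0,1) is e^{t^2/2},
# here at t = 0.5 (so the target value is e^{0.125}).
random.seed(1)
t, n = 0.5, 200_000
est = sum(math.exp(t * random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(est, math.exp(t * t / 2))   # both ≈ 1.13
```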

Distribution of linear transformation of random variables

We will prove some propositions about distributions of linear transformation of random variables using mgf. Some of them are mentioned in previous chapters. As we will see, proving these propositions using mgf is quite simple.

Proposition. (Distribution of linear transformation of normal r.v.'s) Let ${\displaystyle X\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$ . Then, ${\displaystyle {\color {red}a}X+{\color {blue}b}\sim {\mathcal {N}}({\color {red}a}\mu +{\color {blue}b},{\color {red}a^{2}}\sigma ^{2})}$ .

Proof.

• The mgf of ${\displaystyle {\color {red}a}X+{\color {blue}b}}$  is

${\displaystyle M_{{\color {red}a}X+{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X}({\color {red}a}t)=e^{{\color {blue}b}t}\left(\exp({\color {red}a}\mu t+({\color {red}a}\sigma )^{2}t^{2}/2)\right)=\exp \left(({\color {red}a}\mu +{\color {blue}b})t+{\color {red}a^{2}}\sigma ^{2}t^{2}/2\right),}$

which is the mgf of ${\displaystyle {\mathcal {N}}({\color {red}a}\mu +{\color {blue}b},{\color {red}a^{2}}\sigma ^{2})}$ , and the result follows since the mgf identifies a distribution uniquely.

${\displaystyle \Box }$

Sum of independent random variables

Proposition. (Sum of independent binomial r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Binom} (n_{1},p),\dotsc ,X_{m}\sim \operatorname {Binom} (n_{m},p)}$ , in which ${\displaystyle X_{1},\dotsc ,X_{m}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{m}\sim \operatorname {Binom} (n_{1}+\dotsb +n_{m},p)}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{m}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{m}}(t)=M_{X_{1}}(t)\dotsb M_{X_{m}}(t)=(pe^{t}+1-p)^{n_{1}}\dotsb (pe^{t}+1-p)^{n_{m}}=(pe^{t}+1-p)^{n_{1}+\dotsb +n_{m}},}$

which is the mgf of ${\displaystyle \operatorname {Binom} (n_{1}+\dotsb +n_{m},p)}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent Poisson r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Pois} (\lambda _{1}),\dotsc ,X_{n}\sim \operatorname {Pois} (\lambda _{n})}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Pois} (\lambda _{1}+\dotsb +\lambda _{n})}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=e^{\lambda _{1}(e^{t}-1)}\dotsb e^{\lambda _{n}(e^{t}-1)}=e^{(\lambda _{1}+\dotsb +\lambda _{n})(e^{t}-1)},}$

which is the mgf of ${\displaystyle \operatorname {Pois} (\lambda _{1}+\dotsb +\lambda _{n})}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent exponential r.v.'s) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be i.i.d. r.v.'s following ${\displaystyle \operatorname {Exp} (\lambda )}$ . Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Gamma} (n,\lambda )}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{n},}$

which is the mgf of ${\displaystyle \operatorname {Gamma} (n,\lambda )}$ , as desired.

${\displaystyle \Box }$
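A Monte Carlo sanity check of this proposition (not a proof; the parameter values are our own illustrative choices): the simulated sums should have the ${\displaystyle \operatorname {Gamma} (n,\lambda )}$  mean ${\displaystyle n/\lambda }$  and variance ${\displaystyle n/\lambda ^{2}}$ .

```python
import random

# Sketch: Monte Carlo sanity check that the sum of n i.i.d. Exp(lam)
# r.v.'s has the Gamma(n, lam) mean n/lam and variance n/lam^2.
random.seed(2)
n, lam, reps = 5, 2.0, 100_000
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
mean = sum(sums) / reps
var = sum((s - mean) ** 2 for s in sums) / reps
print(mean, var)   # ≈ 2.5 (= n/lam) and ≈ 1.25 (= n/lam^2)
```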

Proposition. (Sum of independent gamma r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Gamma} (\alpha _{1},\lambda ),\dotsc ,X_{n}\sim \operatorname {Gamma} (\alpha _{n},\lambda )}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Gamma} (\alpha _{1}+\dotsb +\alpha _{n},\lambda )}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{1}}\dotsb \left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{n}}=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{1}+\dotsb +\alpha _{n}},}$

which is the mgf of ${\displaystyle \operatorname {Gamma} (\alpha _{1}+\dotsb +\alpha _{n},\lambda )}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent normal r.v.'s) Let ${\displaystyle X_{1}\sim {\mathcal {N}}(\mu _{1},\sigma _{1}^{2}),\dotsc ,X_{n}\sim {\mathcal {N}}(\mu _{n},\sigma _{n}^{2})}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then ${\displaystyle X_{1}+\dotsb +X_{n}\sim {\mathcal {N}}(\mu _{1}+\dotsb +\mu _{n},\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})}$ .

Proof.

• By independence, the mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\exp(\mu _{1}t+\sigma _{1}^{2}t^{2}/2)\dotsb \exp(\mu _{n}t+\sigma _{n}^{2}t^{2}/2)=\exp \left((\mu _{1}+\dotsb +\mu _{n})t+(\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})t^{2}/2\right),}$

which is the mgf of ${\displaystyle {\mathcal {N}}(\mu _{1}+\dotsb +\mu _{n},\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})}$ , as desired.

${\displaystyle \Box }$

Central limit theorem

We will provide a proof to central limit theorem (CLT) using mgf here.

Theorem. (Central limit theorem) Let ${\displaystyle X_{1},X_{2},\dotsc }$  be a sequence of i.i.d. random variables with finite mean ${\displaystyle \mu }$  and positive variance ${\displaystyle \sigma ^{2}}$ , and ${\displaystyle {\overline {X}}_{n}}$  be the sample mean of the first ${\displaystyle n}$  random variables, i.e. ${\displaystyle {\overline {X}}_{n}={\frac {X_{1}+\dotsb +X_{n}}{n}}}$ . Then, the standardized sample mean ${\displaystyle {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$  converges in distribution to a standard normal random variable as ${\displaystyle n\to \infty }$ .

Proof.

• Define ${\displaystyle T_{n}={\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$ . Then, we have

${\displaystyle T_{n}={\frac {{\sqrt {n}}{\big (}(X_{1}+\dotsb +X_{n})/n-\mu {\big )}}{\sigma }}={\frac {X_{1}+\dotsb +X_{n}}{\color {red}\sigma {\sqrt {n}}}}{\color {blue}-{\frac {{\sqrt {n}}\mu }{\sigma }}},}$

• which is in the form of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b},\quad {\color {red}\mathbf {a} =\left({\frac {1}{\sigma {\sqrt {n}}}},\dotsc ,{\frac {1}{\sigma {\sqrt {n}}}}\right)^{T}}{\text{ and }}{\color {blue}b=-{\frac {{\sqrt {n}}\mu }{\sigma }}}}$ .
• Therefore,

{\displaystyle {\begin{aligned}M_{T_{n}}(t)&=e^{{\color {blue}-{\sqrt {n}}\mu t/\sigma }}\left(M_{X_{1}}\left({\frac {t}{{\color {red}\sigma {\sqrt {n}}}}}\right)\dotsb M_{X_{n}}\left({\frac {t}{{\color {red}\sigma {\sqrt {n}}}}}\right)\right)\\&=e^{-{\sqrt {n}}\mu t/\sigma }\left(M_{X_{1}}\left({\frac {t}{\sigma {\sqrt {n}}}}\right)\right)^{n}\quad {\text{since }}X_{1},\dotsc ,X_{n}{\text{ are identically distributed, i.e. have the same mgf}}\\\Rightarrow \ln M_{T_{n}}(t)&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[e^{tX_{1}/(\sigma {\sqrt {n}})}\right]\\&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[1+{\frac {t}{\sigma {\sqrt {n}}}}X_{1}+{\frac {1}{2!}}{\frac {t^{2}}{\sigma ^{2}n}}X_{1}^{2}+\dotsb \right]\quad {\text{since }}e^{x}=1+x+{\frac {x^{2}}{2!}}+\dotsb \\&=-{\sqrt {n}}\mu t/\sigma +n\ln {\bigg (}1+{\frac {t}{\sigma {\sqrt {n}}}}\underbrace {\mathbb {E} [X_{1}]} _{=\mu }+{\frac {t^{2}}{2\sigma ^{2}n}}\underbrace {\mathbb {E} [X_{1}^{2}]} _{=\operatorname {Var} (X_{1})+(\mathbb {E} [X_{1}])^{2}=\sigma ^{2}+\mu ^{2}}+{\text{terms of order smaller than }}n^{-1}{\bigg )}\\&=-{\sqrt {n}}\mu t/\sigma +n\left[{\frac {t\mu }{\sigma {\sqrt {n}}}}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}n}}-{\frac {1}{2}}\left({\frac {t\mu }{\sigma {\sqrt {n}}}}\right)^{2}+{\text{terms of order smaller than }}n^{-1}\right]\quad {\text{since }}\ln(1+x)=x-{\frac {x^{2}}{2}}+\dotsb \\&=\underbrace {-{\sqrt {n}}\mu t/\sigma +{\sqrt {n}}\mu t/\sigma } _{=0}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}}}-{\frac {t^{2}\mu ^{2}}{2\sigma ^{2}}}+{\text{terms of order smaller than }}n^{0}\\&={\frac {t^{2}}{2}}+\underbrace {{\text{terms of order smaller than }}n^{0}} _{\to 0{\text{ as }}n\to \infty }\\\Rightarrow \lim _{n\to \infty }M_{T_{n}}(t)&=\underbrace {e^{t^{2}/2}} _{{\text{mgf of }}{\mathcal {N}}(0,1)},\end{aligned}}}

and the result follows from the mgf property of identifying distribution uniquely.

${\displaystyle \Box }$
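The theorem can be seen in simulation; a sketch under our own choice of distribution, i.i.d. ${\displaystyle \operatorname {Unif} (0,1)}$  with ${\displaystyle \mu =1/2}$  and ${\displaystyle \sigma ^{2}=1/12}$ , checking that the standardized sample mean behaves like ${\displaystyle {\mathcal {N}}(0,1)}$ :

```python
import math, random

# Sketch: simulate the standardized sample mean T_n for i.i.d. Unif(0,1)
# r.v.'s (mu = 1/2, sigma^2 = 1/12) and check that the fraction of
# simulated T_n within one standard deviation of 0 is close to the
# N(0,1) value, about 0.6827.
random.seed(3)
n, reps = 100, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)
ts = [math.sqrt(n) * (sum(random.random() for _ in range(n)) / n - mu) / sigma
      for _ in range(reps)]
within_1sd = sum(1 for t in ts if abs(t) <= 1) / reps
print(within_1sd)   # ≈ 0.68
```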

Remark.

• Since ${\displaystyle {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$  is approximately ${\displaystyle {\mathcal {N}}(0,1)}$ -distributed for large ${\displaystyle n}$ , and ${\displaystyle {\color {blue}{\frac {\sigma }{\sqrt {n}}}}\cdot {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}{\color {red}+\mu }={\overline {X}}_{n}}$ ,
• the sample mean ${\displaystyle {\overline {X}}_{n}}$  is approximately ${\displaystyle {\mathcal {N}}({\color {red}\mu },{\color {blue}\sigma ^{2}/n})}$ -distributed for large ${\displaystyle n}$ .
• For normal r.v.'s with the same mean ${\displaystyle \mu }$  and the same variance ${\displaystyle \sigma ^{2}}$ , this distribution is exact,
• since if ${\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$  independently, then ${\displaystyle {\frac {X_{1}+\dotsb +X_{n}}{\color {blue}n}}\sim {\mathcal {N}}\left({\frac {\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}}}{\color {blue}n}},{\frac {\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}}{\color {blue}n^{2}}}\right)\equiv {\mathcal {N}}(\mu ,\sigma ^{2}/n)}$ .
• By the proposition about the distribution of linear transformations of normal r.v.'s, the sample sum ${\displaystyle X_{1}+\dotsb +X_{n}={\color {blue}n}{\overline {X}}_{n}}$  is then approximately ${\displaystyle {\mathcal {N}}({\color {blue}n}\mu ,{\color {blue}n^{2}}\sigma ^{2}/n)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}$ -distributed for large ${\displaystyle n}$ .
• For normal r.v.'s with the same mean ${\displaystyle \mu }$  and the same variance ${\displaystyle \sigma ^{2}}$ , this is again exact,
• since if ${\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$  independently, then ${\displaystyle X_{1}+\dotsb +X_{n}\sim {\mathcal {N}}\left(\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}},\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}\right)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}$ .
• In general, if a r.v. converges in distribution, then we can use the limiting distribution to approximate probabilities involving the r.v.

A special case of using the CLT as an approximation is using the normal distribution to approximate a discrete distribution. To improve accuracy, we should ideally apply a continuity correction, as explained in the following.

Proposition. (Continuity correction) A continuity correction is rewriting the probability expression ${\displaystyle \mathbb {P} (X=i)}$  (${\displaystyle i}$  is an integer) as ${\displaystyle \mathbb {P} (i-1/2<X<i+1/2)}$  when approximating a discrete distribution by a normal distribution using the CLT.

Remark.

• The reason for doing this is to make ${\displaystyle i}$  to be at the 'middle' of the interval, so that it is better approximated.

Illustration of continuity correction:

|
|              /
|             /
|            /
|           /|
|          /#|
|         *##|
|        /|##|
|       /#|##|
|      /##|##|
|     /|##|##|
|    / |##|##|
|   /  |##|##|
|  /   |##|##|
| /    |##|##|
*------*--*--*---------------------
i-1/2 i i+1/2

|
|              /
|             /
|            /
|           /
|          /
|         *
|        /|
|       /#|
|      /##|
|     /###|
|    /####|
|   /#####|
|  /|#####|
| / |#####|
*---*-----*------------------------
i-1    i

|
|              /|
|             /#|
|            /##|
|           /###|
|          /####|
|         *#####|
|        /|#####|
|       / |#####|
|      /  |#####|
|     /   |#####|
|    /    |#####|
|   /     |#####|
|  /      |#####|
| /       |#####|
*---------*-----*------------------
i     i+1
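As a concrete sketch of the correction (the values ${\displaystyle n=20}$ , ${\displaystyle p=0.5}$ , ${\displaystyle i=10}$  are illustrative choices, not from the text), we can compare the exact binomial probability with its continuity-corrected normal approximation:

```python
import math

# Sketch: approximate P(X = 10) for X ~ Binom(20, 0.5) by
# P(9.5 < W < 10.5) where W ~ N(np, np(1-p)), i.e. with the
# continuity correction described above.
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # N(0,1) cdf

n, p, i = 20, 0.5, 10
mu, sd = n * p, math.sqrt(n * p * (1 - p))
exact = math.comb(n, i) * p**i * (1 - p)**(n - i)
approx = Phi((i + 0.5 - mu) / sd) - Phi((i - 0.5 - mu) / sd)
print(exact, approx)   # ≈ 0.1762 and ≈ 0.1769
```

Without the correction, the approximating interval degenerates (a normal r.v. hits the single point ${\displaystyle i}$  with probability zero), which is exactly what the illustrations above depict.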

1. or equivalently, transformation between supports of ${\displaystyle \mathbf {X} }$  and ${\displaystyle \mathbf {Y} }$