Probability/Transformation of Random Variables

Transformation of random variables

Underlying principle

Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be ${\displaystyle n}$  random variables, ${\displaystyle Y_{1},\dotsc ,Y_{n}}$  be another ${\displaystyle n}$  random variables, and ${\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{n})^{T},\mathbf {Y} =(Y_{1},\dotsc ,Y_{n})^{T}}$  be random (column) vectors.

Suppose the vector-valued function[1] ${\displaystyle \mathbf {g} :\operatorname {supp} (\mathbf {X} )\to \operatorname {supp} (\mathbf {Y} )}$  is bijective (it is also called one-to-one correspondence in this case). Then, its inverse ${\displaystyle \mathbf {g} ^{-1}:\operatorname {supp} (\mathbf {Y} )\to \operatorname {supp} (\mathbf {X} )}$  exists.

After that, we can transform ${\displaystyle \mathbf {X} }$  to ${\displaystyle \mathbf {Y} }$  by applying the transformation ${\displaystyle \mathbf {g} }$ , i.e. by ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$ , and transform ${\displaystyle \mathbf {Y} }$  to ${\displaystyle \mathbf {X} }$  by applying the inverse transformation ${\displaystyle \mathbf {g} ^{-1}}$ , i.e. by ${\displaystyle \mathbf {X} =\mathbf {g} ^{-1}(\mathbf {Y} )}$ .

We are often interested in deriving the joint probability function ${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )}$  of ${\displaystyle \mathbf {Y} }$ , given the joint probability function ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$  of ${\displaystyle \mathbf {X} }$ . We will examine the discrete and continuous cases one by one in the following.

Transformation of discrete random variables

Proposition. (transformation of discrete random variables) For each discrete random vector ${\displaystyle \mathbf {X} }$  with joint pmf ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$ , the corresponding joint pmf of the transformed random vector ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$  where ${\displaystyle \mathbf {g} }$  is bijective is

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=f_{\mathbf {X} }\left(\mathbf {g} ^{-1}(\mathbf {y} )\right),\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

Proof. Starting from the definition of the pmf of ${\displaystyle \mathbf {Y} }$ , we have

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} ){\overset {\text{ def }}{=}}\mathbb {P} (\mathbf {Y} =\mathbf {y} )=\mathbb {P} \left(\mathbf {g} ^{-1}(\mathbf {Y} )=\mathbf {g} ^{-1}(\mathbf {y} )\right)=\mathbb {P} \left(\mathbf {X} =\mathbf {g} ^{-1}(\mathbf {y} )\right){\overset {\text{ def }}{=}}f_{\mathbf {X} }\left(\mathbf {g} ^{-1}(\mathbf {y} )\right),\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

(The inverse ${\displaystyle \mathbf {g} ^{-1}}$  exists, and so the second equality is valid, since ${\displaystyle \mathbf {g} }$  is bijective.)

${\displaystyle \Box }$
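As a small illustration (a toy example, not from the text), the proposition says a bijection simply relabels the support while the probability masses move along unchanged:

```python
# Sketch: pushing a discrete pmf through a bijection g(x) = 2x + 1; the
# probability masses move to the new support unchanged, since
# f_Y(y) = f_X(g^{-1}(y)).
def transform_pmf(pmf_x, g):
    """Push a pmf, stored as {x: P(X = x)}, through a bijective g."""
    return {g(x): p for x, p in pmf_x.items()}

pmf_x = {0: 0.2, 1: 0.5, 2: 0.3}                       # toy pmf of X
pmf_y = transform_pmf(pmf_x, lambda x: 2 * x + 1)      # pmf of Y = 2X + 1
print(pmf_y)   # {1: 0.2, 3: 0.5, 5: 0.3}
```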

Transformation of continuous random variables

For continuous random variables, the situation is more complicated.

Let us investigate the case for univariate pdf, which is simpler.

Theorem. (Transformation of continuous random variable (univariate case)) Let ${\displaystyle X}$  be a continuous random variable with pdf ${\displaystyle f_{X}(x)}$ . Assume that the function ${\displaystyle g}$  is differentiable and strictly monotone. Then, the pdf of the transformed random variable ${\displaystyle Y=g(X)}$  is

${\displaystyle f_{Y}(y)=f_{X}(g^{-1}(y))\left\vert {\frac {dx}{dy}}\right\vert ,\quad y\in \operatorname {supp} (Y).}$

Proof. Under the assumption that ${\displaystyle g}$  is differentiable and strictly monotone, the cdf ${\displaystyle F_{Y}(y)=\mathbb {P} (g(X)\leq y)={\begin{cases}\mathbb {P} (X\leq g^{-1}(y))=F_{X}(g^{-1}(y)),&g^{-1}{\text{ is increasing}};\\\mathbb {P} (X\geq g^{-1}(y))=1-F_{X}(g^{-1}(y)),&g^{-1}{\text{ is decreasing}}.\end{cases}}}$  (${\displaystyle g^{-1}}$  exists since ${\displaystyle g}$  is strictly monotone.) Differentiating both sides of the above equation (assuming the cdf's involved are differentiable) gives

${\displaystyle f_{Y}(y)={\begin{cases}f_{X}(g^{-1}(y)){\frac {dg^{-1}(y)}{dy}},&g^{-1}{\text{ is increasing}};\\-f_{X}(g^{-1}(y)){\frac {dg^{-1}(y)}{dy}},&g^{-1}{\text{ is decreasing}}.\\\end{cases}}}$

Since ${\displaystyle x=g^{-1}(y)}$ , we can write ${\displaystyle {\frac {dg^{-1}(y)}{dy}}}$  as ${\displaystyle {\frac {dx}{dy}}}$ . Also, we can summarize the above piecewise formula into a single expression by applying the absolute value function to both sides:
${\displaystyle f_{Y}(y)=f_{X}(g^{-1}(y))\left\vert {\frac {dx}{dy}}\right\vert ,}$

where the absolute value sign is applied only to ${\displaystyle {\frac {dx}{dy}}}$ , since the pdf's are nonnegative and thus unaffected by it.

${\displaystyle \Box }$
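A numerical sanity check of the theorem (the specific distribution and transformation here are our own choice, not from the text): take ${\displaystyle X\sim \operatorname {Exp} (1)}$  and ${\displaystyle g(x)={\sqrt {x}}}$ , strictly increasing on the support, so ${\displaystyle g^{-1}(y)=y^{2}}$  and ${\displaystyle f_{Y}(y)=f_{X}(y^{2})\,|2y|}$ .

```python
import math

# Sketch: check f_Y(y) = f_X(g^{-1}(y)) |dx/dy| for X ~ Exp(1) and
# g(x) = sqrt(x), so g^{-1}(y) = y^2, dx/dy = 2y, f_Y(y) = 2y e^{-y^2}.
f_X = lambda x: math.exp(-x)              # pdf of Exp(1)
f_Y = lambda y: f_X(y * y) * abs(2 * y)   # transformed pdf

# Riemann sum of f_Y over (0, 1]: should match P(X <= 1) = 1 - e^{-1},
# since Y <= 1 exactly when X = Y^2 <= 1.
dy = 1e-5
p = sum(f_Y(k * dy) * dy for k in range(1, 100_001))
print(p, 1 - math.exp(-1))   # both ≈ 0.632
```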

Remark.

• To explain this theorem in a more intuitive manner, we rewrite the equation in the theorem as

${\displaystyle |f_{Y}(y)dy|=|f_{X}(g^{-1}(y))dx|}$

where both sides of the equation can be regarded as differential areas, which are nonnegative due to the absolute value signs.
• This equation should intuitively hold since both sides represent areas under the pdf's, which are probabilities. The quantity ${\displaystyle |f_{X}(g^{-1}(y))dx|=|f_{X}(x)dx|}$  is the area of the region ${\displaystyle R_{X}}$  under the pdf of ${\displaystyle X}$  over an "infinitesimal" interval ${\displaystyle dx}$ , which represents the probability that ${\displaystyle X}$  lies in this infinitesimal interval ${\displaystyle dx}$ . After the transformation, the region ${\displaystyle R_{X}}$  is mapped to a region ${\displaystyle R_{Y}}$  under the pdf of ${\displaystyle Y}$  over an infinitesimal interval ${\displaystyle dy=g'(x)dx}$ , with area ${\displaystyle |f_{Y}(y)dy|}$ . Since ${\displaystyle g}$  is a bijective function (its strict monotonicity implies this), ${\displaystyle dy}$  "corresponds" to ${\displaystyle dx}$  in some sense: the values in ${\displaystyle dy}$  "originate" from the values in ${\displaystyle dx}$ , and so does the randomness. It follows that the probabilities of ${\displaystyle X}$  lying in ${\displaystyle dx}$  and of ${\displaystyle Y}$  lying in ${\displaystyle dy}$  are the same, and hence so are the two differential areas.

Let us define Jacobian matrix, and introduce several notations in the definition.

Definition. (Jacobian matrix) Suppose the function ${\displaystyle \mathbf {g} }$  is differentiable with nonsingular Jacobian matrix (then, by the inverse function theorem, ${\displaystyle \mathbf {g} ^{-1}}$  is also differentiable). The Jacobian matrix is

${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}={\begin{pmatrix}{\frac {\partial g_{1}(\mathbf {x} )}{\partial x_{1}}}&\dotsb &{\frac {\partial g_{1}(\mathbf {x} )}{\partial x_{n}}}\\\vdots &\ddots &\vdots \\{\frac {\partial g_{n}(\mathbf {x} )}{\partial x_{1}}}&\dotsb &{\frac {\partial g_{n}(\mathbf {x} )}{\partial x_{n}}}\end{pmatrix}},\quad \mathbf {y} =\mathbf {g} (\mathbf {x} )}$

in which ${\displaystyle g_{j}}$  is the component function of ${\displaystyle \mathbf {g} }$  for each ${\displaystyle j\in \{1,\dotsc ,n\}}$ , i.e. ${\displaystyle \mathbf {g} (\mathbf {x} )=(g_{1}(\mathbf {x} ),\dotsc ,g_{n}(\mathbf {x} ))}$ .

Remark.

• We have ${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}{\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}=I_{n\times n}\Leftrightarrow {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}=\left({\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right)^{-1}}$ .

Example. Suppose ${\displaystyle \mathbf {x} =(x_{1},x_{2})}$ , ${\displaystyle \mathbf {y} =(y_{1},y_{2})}$ , and ${\displaystyle \mathbf {y} =\mathbf {g} (\mathbf {x} )=({\color {red}2x_{1}},{\color {blue}3x_{2}})}$ . Then, ${\displaystyle g_{1}(\mathbf {x} )={\color {red}2x_{1}}}$ ,${\displaystyle g_{2}(\mathbf {x} )={\color {blue}3x_{2}}}$ , and

${\displaystyle {\frac {\partial \mathbf {y} }{\partial \mathbf {x} }}={\begin{pmatrix}{\frac {\partial ({\color {red}2x_{1}})}{\partial x_{1}}}&{\frac {\partial ({\color {red}2x_{1}})}{\partial x_{2}}}\\{\frac {\partial ({\color {blue}3x_{2}})}{\partial x_{1}}}&{\frac {\partial ({\color {blue}3x_{2}})}{\partial x_{2}}}\end{pmatrix}}={\begin{pmatrix}2&0\\0&3\end{pmatrix}}.}$

Also, ${\displaystyle \mathbf {x} =\mathbf {g} ^{-1}(\mathbf {y} )=({\color {darkgreen}y_{1}/2},{\color {purple}y_{2}/3})}$ . Then, ${\displaystyle g_{1}^{-1}(\mathbf {y} )={\color {darkgreen}y_{1}/2}}$ , ${\displaystyle g_{2}^{-1}(\mathbf {y} )={\color {purple}y_{2}/3}}$ , and

${\displaystyle {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}={\begin{pmatrix}{\frac {\partial ({\color {darkgreen}y_{1}/2})}{\partial y_{1}}}&{\frac {\partial ({\color {darkgreen}y_{1}/2})}{\partial y_{2}}}\\{\frac {\partial ({\color {purple}y_{2}/3})}{\partial y_{1}}}&{\frac {\partial ({\color {purple}y_{2}/3})}{\partial y_{2}}}\end{pmatrix}}={\begin{pmatrix}1/2&0\\0&1/3\end{pmatrix}}={\begin{pmatrix}2&0\\0&3\end{pmatrix}}^{-1}={\frac {1}{6}}{\begin{pmatrix}3&0\\0&2\end{pmatrix}}.}$
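The example above can be checked mechanically; a minimal sketch (using plain nested lists rather than any matrix library) verifying that the two Jacobians multiply to the identity, as the remark states:

```python
# Sketch: the two Jacobian matrices from the example are inverses of
# each other, i.e. their product is the 2x2 identity matrix.
dy_dx = [[2.0, 0.0], [0.0, 3.0]]          # ∂y/∂x for g(x) = (2*x1, 3*x2)
dx_dy = [[0.5, 0.0], [0.0, 1.0 / 3.0]]    # ∂x/∂y for g^{-1}(y) = (y1/2, y2/3)

def matmul2(a, b):
    """Multiply two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

product = matmul2(dy_dx, dx_dy)
print(product)   # ≈ the 2x2 identity matrix
```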

Theorem. (Transformation of continuous random variables) Let ${\displaystyle \mathbf {X} }$  be a continuous random vector with joint pdf ${\displaystyle f_{\mathbf {X} }(\mathbf {x} )}$ , and assume ${\displaystyle \mathbf {g} }$  is differentiable and bijective. The corresponding joint pdf of transformed random vector ${\displaystyle \mathbf {Y} =\mathbf {g} (\mathbf {X} )}$  is

${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=f_{\mathbf {X} }{\big (}\mathbf {g} ^{-1}(\mathbf {y} ){\big )}\left|\det {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right|,\quad \mathbf {y} \in \operatorname {supp} (\mathbf {Y} ).}$

Proof (partial). Assume ${\displaystyle \mathbf {g} }$  is differentiable and bijective, as in the theorem statement.

First,

${\displaystyle \mathbb {P} (\mathbf {Y} \in S)=\int \dotsi \int _{S}f_{\mathbf {Y} }(\mathbf {y} )\,dy_{1}\cdots \,dy_{n}\qquad (1).}$

On the other hand, we have

${\displaystyle \mathbb {P} (\mathbf {Y} \in S)=\mathbb {P} {\big (}\mathbf {X} =\mathbf {g} ^{-1}(\mathbf {Y} )\in \mathbf {g} ^{-1}(S){\big )}=\int \dotsi \int _{\mathbf {g} ^{-1}(S)}f_{\mathbf {X} }(\mathbf {x} )\,dx_{1}\cdots \,dx_{n}}$

where ${\displaystyle \mathbf {g} ^{-1}(S)=\{\mathbf {x} \in \operatorname {supp} (\mathbf {X} ):\mathbf {g} (\mathbf {x} )\in S\}}$ , which is the preimage of the set ${\displaystyle S}$  under ${\displaystyle \mathbf {g} }$ .

Applying the change of variable formula to this integral (whose proof is advanced and uses our assumptions), we get

${\displaystyle \int \dotsi \int _{\mathbf {g} ^{-1}(S)}f_{\mathbf {X} }(\mathbf {x} )\,dx_{1}\cdots \,dx_{n}=\int \dotsi \int _{S}f_{\mathbf {X} }{\big (}\mathbf {g} ^{-1}(\mathbf {y} ){\big )}\left|\det {\frac {\partial \mathbf {x} }{\partial \mathbf {y} }}\right|\,dy_{1}\cdots \,dy_{n}\qquad (2)}$

Comparing the integrals in ${\displaystyle (1)}$  and ${\displaystyle (2)}$ , we can observe the desired result.

${\displaystyle \Box }$
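A Monte Carlo sanity check of the theorem, reusing the Jacobian example above (the uniform choice of ${\displaystyle \mathbf {X} }$  is our own assumption): for independent ${\displaystyle X_{1},X_{2}\sim \operatorname {Unif} (0,1)}$  and ${\displaystyle \mathbf {Y} =(2X_{1},3X_{2})}$ , the formula gives ${\displaystyle f_{\mathbf {Y} }(\mathbf {y} )=1\cdot |{\det }\,\partial \mathbf {x} /\partial \mathbf {y} |=1/6}$  on ${\displaystyle (0,2)\times (0,3)}$ .

```python
import random

# Sketch: Monte Carlo check of the change-of-variables formula for
# X1, X2 ~ Unif(0,1) independent and Y = g(X) = (2*X1, 3*X2).
# The theorem gives f_Y(y) = f_X(y1/2, y2/3) * |det ∂x/∂y| = 1/6 on
# (0,2) x (0,3), so P(Y in [0,1]^2) should be 1 * 1 * (1/6) = 1/6.
random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    y1, y2 = 2 * random.random(), 3 * random.random()
    if y1 <= 1 and y2 <= 1:
        hits += 1
print(hits / n, 1 / 6)   # both ≈ 0.167
```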

Moment generating function

Definition. (Moment generating function) The moment generating function (mgf) for the distribution of a random variable ${\displaystyle X}$  is ${\displaystyle M_{X}({\color {darkgreen}t})=\mathbb {E} \left[e^{{\color {darkgreen}t}X}\right]}$ .

Remark.

• For comparison: cdf is ${\displaystyle F_{X}({\color {darkgreen}t})=\mathbb {E} [\mathbf {1} \{X\leq {\color {darkgreen}t}\}]}$ .
• The mgf, similar to the pmf, pdf and cdf, gives a complete description of a distribution: it uniquely identifies a distribution, provided that the mgf exists (the expectation may be infinite).
• In other words, we can recover the probability function from the mgf.
• The proof of this result is complicated, and thus omitted.

Proposition. (Moment generating property of mgf) Assuming mgf ${\displaystyle M_{X}({\color {darkgreen}t})}$  exists for ${\displaystyle t\in (-\varepsilon ,\varepsilon )}$  in which ${\displaystyle \varepsilon }$  is a positive number, we have

${\displaystyle \mathbb {E} [X^{n}]=\left.{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}M_{X}({\color {darkgreen}t})\right|_{{\color {darkgreen}t}=0}}$

for each nonnegative integer ${\displaystyle n}$ .

Proof.

• Since

${\displaystyle M_{X}({\color {darkgreen}t})=\mathbb {E} \left[e^{{\color {darkgreen}t}X}\right]=\mathbb {E} \left[1+{\color {darkgreen}t}X+{\frac {{\color {darkgreen}t}^{2}X^{2}}{2!}}+\dotsb \right]{\overset {\text{linearity}}{=}}1+{\color {darkgreen}t}\mathbb {E} [X]+{\frac {{\color {darkgreen}t}^{2}}{2!}}\mathbb {E} [X^{2}]+\dotsb ,}$

${\displaystyle {\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}M_{X}({\color {darkgreen}t}){\bigg |}_{{\color {darkgreen}t}=0}={\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}\left(1+{\color {darkgreen}t}\mathbb {E} [X]+{\frac {{\color {darkgreen}t}^{2}}{2!}}\mathbb {E} [X^{2}]+\dotsb \right){\bigg |}_{{\color {darkgreen}t}=0}=\mathbb {E} [X]{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t}+{\frac {\mathbb {E} [X^{2}]}{2!}}{\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t^{2}}+\dotsb ,}$

• The result follows from simplifying the above expression by ${\displaystyle {\frac {d^{n}}{d{\color {darkgreen}t}^{n}}}{\color {darkgreen}t}^{m}=\mathbf {1} \{m=n\}n!+\mathbf {1} \{m\neq n\}(0).}$

${\displaystyle \Box }$
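The moment generating property can be checked numerically; a sketch under our own choice of distribution, ${\displaystyle X\sim \operatorname {Exp} (2)}$  with ${\displaystyle M_{X}(t)=2/(2-t)}$ , so ${\displaystyle \mathbb {E} [X]=1/2}$  and ${\displaystyle \mathbb {E} [X^{2}]=2/\lambda ^{2}=1/2}$ :

```python
# Sketch: recovering moments from an mgf by numerical differentiation,
# for X ~ Exp(2) with M_X(t) = 2/(2 - t), so E[X] = 1/2 and E[X^2] = 1/2.
M = lambda t: 2.0 / (2.0 - t)
h = 1e-4
first = (M(h) - M(-h)) / (2 * h)              # central difference ≈ M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / (h * h)  # ≈ M''(0) = E[X^2]
print(first, second)   # both ≈ 0.5
```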

Proposition. (Relationship between independence and mgf) If ${\displaystyle X}$  and ${\displaystyle Y}$  are independent,

${\displaystyle M_{XY}({\color {darkgreen}t})={\color {blue}\mathbb {E} _{X}[}M_{Y}({\color {darkgreen}t}{\color {blue}X}){\color {blue}]}={\color {red}\mathbb {E} _{Y}[}M_{X}({\color {darkgreen}t}{\color {red}Y}){\color {red}]}.}$

Proof.

${\displaystyle M_{XY}({\color {darkgreen}t})=\mathbb {E} [e^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}]{\overset {\text{lote}}{=}}{\color {blue}\mathbb {E} _{X}{\bigg [}}{\color {red}\mathbb {E} _{Y}[e}^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}|{\color {blue}X}{\color {red}]}{\color {blue}{\bigg ]}}={\color {blue}\mathbb {E} _{X}[}M_{Y}({\color {darkgreen}t}{\color {blue}X}){\color {blue}]}.}$

Similarly,
${\displaystyle M_{XY}({\color {darkgreen}t})=\mathbb {E} [e^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}]{\overset {\text{lote}}{=}}{\color {red}\mathbb {E} _{Y}{\bigg [}}{\color {blue}\mathbb {E} _{X}[e}^{{\color {darkgreen}t}{\color {blue}X}{\color {red}Y}}|{\color {red}Y}{\color {blue}]}{\color {red}{\bigg ]}}={\color {red}\mathbb {E} _{Y}[}M_{X}({\color {darkgreen}t}{\color {red}Y}){\color {red}]}.}$

• lote: law of total expectation

${\displaystyle \Box }$

Remark.

• This equality need not hold if ${\displaystyle X}$  and ${\displaystyle Y}$  are not independent.

Joint moment generating function

In the following, we will use ${\displaystyle \mathbf {X} }$  to denote ${\displaystyle (X_{1},\dotsc ,X_{n})^{T}}$ .

Definition. (Joint moment generating function) The joint moment generating function (mgf) of random vector ${\displaystyle \mathbf {X} }$  is

${\displaystyle M_{\mathbf {X} }({\color {darkgreen}\mathbf {t} })=\mathbb {E} [e^{{\color {darkgreen}\mathbf {t} }\cdot \mathbf {X} }]=\mathbb {E} [e^{{\color {darkgreen}t_{1}}X_{1}+\dotsb +{\color {darkgreen}t_{n}}X_{n}}]}$

for each (column) vector ${\displaystyle \mathbf {t} =(t_{1},\dotsc ,t_{n})^{T}}$ , if the expectation exists.

Remark.

• When ${\displaystyle n=1}$ , the dot product reduces to the ordinary product of two numbers.
• ${\displaystyle \mathbf {t} \cdot \mathbf {X} {\overset {\text{ def }}{=}}\mathbf {t} ^{T}\mathbf {X} }$ .

Proposition. (Relationship between independence and mgf) Random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent if and only if

${\displaystyle M_{\mathbf {X} }(\mathbf {t} )=M_{X_{1}}(t_{1})\dotsb M_{X_{n}}(t_{n}).}$

Proof. 'only if' part: Assume ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then,

${\displaystyle M_{\mathbf {X} }(\mathbf {t} )=\mathbb {E} [e^{\mathbf {t} \cdot \mathbf {X} }]=\mathbb {E} [e^{t_{1}X_{1}}\dotsb e^{t_{n}X_{n}}]{\overset {\text{ independence }}{=}}\mathbb {E} [e^{t_{1}X_{1}}]\dotsb \mathbb {E} [e^{t_{n}X_{n}}]=M_{X_{1}}(t_{1})\dotsb M_{X_{n}}(t_{n}).}$

The proof of the 'if' part is quite complicated, and thus omitted.

${\displaystyle \Box }$

Analogously, we have marginal mgf.

Definition. (Marginal mgf) The marginal mgf of ${\displaystyle X_{i}}$ , one of the random variables ${\displaystyle X_{1},\dotsc ,X_{n}}$ , is

${\displaystyle M_{X_{i}}(t)=M_{\mathbf {X} }(0,\dotsc ,0,\underbrace {t} _{i{\text{ th position}}},0,\dotsc ,0).}$

Proposition. (Moment generating function of linear transformation of random variables) For each constant vector ${\displaystyle {\color {red}\mathbf {a} }=({\color {red}a_{1}},\dotsc ,{\color {red}a_{n}})}$  and a real constant ${\displaystyle {\color {blue}b}}$ , the mgf of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}$  is

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}\mathbf {a} })=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}a_{1}},\dotsc ,t{\color {red}a_{n}}).}$

Proof.

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=\mathbb {E} [e^{t{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}t}]=e^{{\color {blue}b}t}\mathbb {E} [e^{t{\color {red}\mathbf {a} }\cdot \mathbf {X} }]=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}\mathbf {a} })=e^{{\color {blue}b}t}M_{\mathbf {X} }(t{\color {red}a_{1}},\dotsc ,t{\color {red}a_{n}}).}$

${\displaystyle \Box }$

Remark.

• If ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent,

${\displaystyle M_{{\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X_{1}}(t{\color {red}a_{1}})\dotsb M_{X_{n}}(t{\color {red}a_{n}}).}$

• This provides an alternative, and possibly more convenient method to derive the distribution of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}}$ , compared with deriving it from probability functions of ${\displaystyle X_{1},\dotsc ,X_{n}}$ .
• Special case: if ${\displaystyle {\color {red}\mathbf {a} }=(1,\dotsc ,1)^{T}}$  and ${\displaystyle {\color {blue}b}=0}$ , then ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b}=X_{1}+\dotsb +X_{n}}$ , the sum of the r.v.'s.
• So, ${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{\mathbf {X} }(t,\dotsc ,t)}$ .
• In particular, if ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent, then ${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)}$ .
• We can use this result to prove the formulas for sums of independent r.v.'s, instead of using the proposition about convolution of r.v.'s.
• Special case: if ${\displaystyle n=1}$ , then the expression for linear transformation becomes ${\displaystyle {\color {red}a}X+{\color {blue}b}}$ .
• So, ${\displaystyle M_{{\color {red}a}X+{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X}({\color {red}a}t)}$ .

Moment generating function of some important distributions

Proposition. (Moment generating function of binomial distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Binom} (n,p)}$  is ${\displaystyle M_{X}(t)=(pe^{t}+1-p)^{n}}$ .

Proof.

${\displaystyle M_{X}(t)=\sum _{k=0}^{n}{\color {blue}e^{tk}}\underbrace {{\binom {n}{k}}{\color {blue}p^{k}}(1-p)^{n-k}} _{{\text{for}}\operatorname {Binom} (n,p)}=\sum _{k=0}^{n}{\binom {n}{k}}{\color {blue}(pe^{t})^{k}}(1-p)^{n-k}=(pe^{t}+1-p)^{n}\quad {\text{by binomial theorem}}.}$

${\displaystyle \Box }$
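Because the binomial support is finite, the closed form can be checked against the defining sum exactly; a small sketch (the values ${\displaystyle n=10}$ , ${\displaystyle p=0.3}$  are arbitrary illustrative choices):

```python
import math

# Sketch: checking the closed form (p e^t + 1 - p)^n against the defining
# sum E[e^{tX}] for X ~ Binom(n, p), at a few values of t.
n, p = 10, 0.3
checks = []
for t in (-1.0, 0.0, 0.5):
    direct = sum(math.exp(t * k) * math.comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(n + 1))
    closed = (p * math.exp(t) + 1 - p) ** n
    checks.append(abs(direct - closed) < 1e-10)
print(checks)   # [True, True, True]
```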

Proposition. (Moment generating function of Poisson distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Pois} (\lambda )}$  is ${\displaystyle M_{X}(t)=e^{\lambda (e^{t}-1)}}$ .

Proof.

${\displaystyle M_{X}(t){\overset {\text{ def }}{=}}\mathbb {E} [e^{tX}]{\overset {\text{ LOTUS }}{=}}\sum _{{\color {darkgreen}k}=0}^{\infty }e^{{\color {darkorange}t}{\color {darkgreen}k}}\cdot \underbrace {\frac {e^{\color {red}-\lambda }{\color {purple}\lambda }^{\color {darkgreen}k}}{{\color {darkgreen}k}!}} _{{\text{for }}\operatorname {Pois} (\lambda )}=e^{{\color {blue}\lambda }({\color {blue}e^{t}}{\color {red}-1})}\overbrace {\sum _{{\color {darkgreen}k}=0}^{\infty }\underbrace {\frac {e^{\color {blue}-\lambda e^{t}}({\color {purple}\lambda }e^{\color {darkorange}t})^{\color {darkgreen}k}}{{\color {darkgreen}k}!}} _{{\text{for }}\operatorname {Pois} (\lambda e^{t})}} ^{=1}=e^{\lambda (e^{t}-1)}.}$

${\displaystyle \Box }$

Proposition. (Moment generating function of exponential distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Exp} (\lambda )}$  is ${\displaystyle M_{X}(t)={\frac {\lambda }{\lambda -t}},\quad t<\lambda }$ .

Proof.

• ${\displaystyle M_{X}(t)=\mathbb {E} [e^{tX}]=\lambda \int _{0}^{\infty }e^{tx}e^{-\lambda x}\,dx=\lambda \int _{0}^{\infty }e^{-(\lambda -t)x}\,dx={\frac {\lambda }{\color {blue}\lambda -t}}\overbrace {\int _{0}^{\infty }\underbrace {{\color {blue}(\lambda -t)}e^{-(\lambda -t)x}} _{{\text{for}}\operatorname {Exp} (\lambda -t)}\,dx} ^{=1},\quad \underbrace {\lambda -t>0} _{\text{ensuring valid rate parameter}}\Leftrightarrow t<\lambda .}$

• The result follows.

${\displaystyle \Box }$

Proposition. (Moment generating function of gamma distribution) The moment generating function of ${\displaystyle X\sim \operatorname {Gamma} (\alpha ,\lambda )}$  is ${\displaystyle M_{X}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha },\quad t<\lambda }$ .

Proof.

• We use similar proof technique from the proof for mgf of exponential distribution.

${\displaystyle M_{X}(t)=\mathbb {E} [e^{tX}]={\frac {\lambda ^{\alpha }}{\Gamma (\alpha )}}\int _{0}^{\infty }e^{tx}x^{\alpha -1}e^{-\lambda x}\,dx={\frac {\lambda ^{\alpha }}{\Gamma (\alpha )}}\int _{0}^{\infty }e^{-(\lambda -t)x}x^{\alpha -1}\,dx={\frac {\lambda ^{\alpha }}{\color {blue}(\lambda -t)^{\alpha }}}\overbrace {\int _{0}^{\infty }\underbrace {{\frac {\color {blue}(\lambda -t)^{\alpha }}{\Gamma (\alpha )}}e^{-(\lambda -t)x}x^{\alpha -1}} _{{\text{for}}\operatorname {Gamma} (\alpha ,\lambda -t)}\,dx} ^{=1},\quad \underbrace {\lambda -t>0} _{\text{ensuring valid rate parameter}}\Leftrightarrow t<\lambda .}$

• The result follows.

${\displaystyle \Box }$

Proposition. (Moment generating function of normal distribution) The moment generating function of ${\displaystyle X\sim {\mathcal {N}}({\color {blue}\mu },{\color {red}\sigma ^{2}})}$  is ${\displaystyle M_{X}(t)=e^{{\color {blue}\mu }t+{\color {red}\sigma ^{2}}t^{2}/2}}$ .

Proof.

• Let ${\displaystyle Z={\frac {X-{\color {blue}\mu }}{\color {red}\sigma }}\sim {\mathcal {N}}(0,1)}$ . Then, ${\displaystyle X={\color {red}\sigma }Z+{\color {blue}\mu }}$ .
• First, consider the mgf of ${\displaystyle Z}$ :

${\displaystyle M_{Z}(t){\overset {\text{ def }}{=}}\mathbb {E} [e^{tZ}]={\frac {1}{\sqrt {2\pi }}}\int _{-\infty }^{\infty }\underbrace {e^{tx}e^{-x^{2}/2}} _{=e^{-(x^{2}-2tx)/2}}\,dx={\frac {1}{\sqrt {2\pi }}}\int _{-\infty }^{\infty }\exp {\big (}\overbrace {-(x^{2}-2tx+{\color {darkgreen}t^{2}})} ^{=-(x-t)^{2}}/2+{\color {darkgreen}t^{2}/2}{\big )}\,dx=e^{t^{2}/2}\overbrace {\int _{-\infty }^{\infty }\underbrace {{\frac {1}{\sqrt {2\pi }}}\cdot e^{-(x-t)^{2}/2}} _{{\text{for }}{\mathcal {N}}(t,1)}\,dx} ^{=1}=e^{t^{2}/2}.}$

• It follows that the mgf of ${\displaystyle X}$  is

${\displaystyle M_{X}(t)=e^{{\color {blue}\mu }t}M_{Z}({\color {red}\sigma }t)=e^{{\color {blue}\mu }t}e^{{\color {red}\sigma }^{2}t^{2}/2}.}$

• The result follows.

${\displaystyle \Box }$
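A Monte Carlo sanity check of the standard normal mgf (the value ${\displaystyle t=0.5}$  and the sample size are our own illustrative choices):

```python
import math, random

# Sketch: Monte Carlo check that the mgf of Z ~ N(0,1) is e^{t^2/2},
# here at t = 0.5 (so the target value is e^{0.125}).
random.seed(1)
t, n = 0.5, 200_000
est = sum(math.exp(t * random.gauss(0.0, 1.0)) for _ in range(n)) / n
print(est, math.exp(t * t / 2))   # both ≈ 1.13
```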

Distribution of linear transformation of random variables

We will prove some propositions about distributions of linear transformation of random variables using mgf. Some of them are mentioned in previous chapters. As we will see, proving these propositions using mgf is quite simple.

Proposition. (Distribution of linear transformation of normal r.v.'s) Let ${\displaystyle X\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$ . Then, ${\displaystyle {\color {red}a}X+{\color {blue}b}\sim {\mathcal {N}}({\color {red}a}\mu +{\color {blue}b},{\color {red}a^{2}}\sigma ^{2})}$ .

Proof.

• The mgf of ${\displaystyle {\color {red}a}X+{\color {blue}b}}$  is

${\displaystyle M_{{\color {red}a}X+{\color {blue}b}}(t)=e^{{\color {blue}b}t}M_{X}({\color {red}a}t)=e^{{\color {blue}b}t}\left(\exp({\color {red}a}\mu t+({\color {red}a}\sigma )^{2}t^{2}/2)\right)=\exp \left(({\color {red}a}\mu +{\color {blue}b})t+{\color {red}a^{2}}\sigma ^{2}t^{2}/2\right),}$

which is the mgf of ${\displaystyle {\mathcal {N}}({\color {red}a}\mu +{\color {blue}b},{\color {red}a^{2}}\sigma ^{2})}$ , and the result follows since the mgf identifies a distribution uniquely.

${\displaystyle \Box }$

Sum of independent random variables

Proposition. (Sum of independent binomial r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Binom} (n_{1},p),\dotsc ,X_{m}\sim \operatorname {Binom} (n_{m},p)}$ , in which ${\displaystyle X_{1},\dotsc ,X_{m}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{m}\sim \operatorname {Binom} (n_{1}+\dotsb +n_{m},p)}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{m}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{m}}(t)=M_{X_{1}}(t)\dotsb M_{X_{m}}(t)=(pe^{t}+1-p)^{n_{1}}\dotsb (pe^{t}+1-p)^{n_{m}}=(pe^{t}+1-p)^{n_{1}+\dotsb +n_{m}},}$

which is the mgf of ${\displaystyle \operatorname {Binom} (n_{1}+\dotsb +n_{m},p)}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent Poisson r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Pois} (\lambda _{1}),\dotsc ,X_{n}\sim \operatorname {Pois} (\lambda _{n})}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Pois} (\lambda _{1}+\dotsb +\lambda _{n})}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=e^{\lambda _{1}(e^{t}-1)}\dotsb e^{\lambda _{n}(e^{t}-1)}=e^{(\lambda _{1}+\dotsb +\lambda _{n})(e^{t}-1)},}$

which is the mgf of ${\displaystyle \operatorname {Pois} (\lambda _{1}+\dotsb +\lambda _{n})}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent exponential r.v.'s) Let ${\displaystyle X_{1},\dotsc ,X_{n}}$  be i.i.d. r.v.'s following ${\displaystyle \operatorname {Exp} (\lambda )}$ . Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Gamma} (n,\lambda )}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{n},}$

which is the mgf of ${\displaystyle \operatorname {Gamma} (n,\lambda )}$ , as desired.

${\displaystyle \Box }$
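A Monte Carlo sanity check of this proposition (not a proof; the parameter values are our own illustrative choices): the simulated sums should have the ${\displaystyle \operatorname {Gamma} (n,\lambda )}$  mean ${\displaystyle n/\lambda }$  and variance ${\displaystyle n/\lambda ^{2}}$ .

```python
import random

# Sketch: Monte Carlo sanity check that the sum of n i.i.d. Exp(lam)
# r.v.'s has the Gamma(n, lam) mean n/lam and variance n/lam^2.
random.seed(2)
n, lam, reps = 5, 2.0, 100_000
sums = [sum(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]
mean = sum(sums) / reps
var = sum((s - mean) ** 2 for s in sums) / reps
print(mean, var)   # ≈ 2.5 (= n/lam) and ≈ 1.25 (= n/lam^2)
```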

Proposition. (Sum of independent gamma r.v.'s) Let ${\displaystyle X_{1}\sim \operatorname {Gamma} (\alpha _{1},\lambda ),\dotsc ,X_{n}\sim \operatorname {Gamma} (\alpha _{n},\lambda )}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then, ${\displaystyle X_{1}+\dotsb +X_{n}\sim \operatorname {Gamma} (\alpha _{1}+\dotsb +\alpha _{n},\lambda )}$ .

Proof.

• The mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{1}}\dotsb \left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{n}}=\left({\frac {\lambda }{\lambda -t}}\right)^{\alpha _{1}+\dotsb +\alpha _{n}},}$

which is the mgf of ${\displaystyle \operatorname {Gamma} (\alpha _{1}+\dotsb +\alpha _{n},\lambda )}$ , as desired.

${\displaystyle \Box }$

Proposition. (Sum of independent normal r.v.'s) Let ${\displaystyle X_{1}\sim {\mathcal {N}}(\mu _{1},\sigma _{1}^{2}),\dotsc ,X_{n}\sim {\mathcal {N}}(\mu _{n},\sigma _{n}^{2})}$ , in which ${\displaystyle X_{1},\dotsc ,X_{n}}$  are independent. Then ${\displaystyle X_{1}+\dotsb +X_{n}\sim {\mathcal {N}}(\mu _{1}+\dotsb +\mu _{n},\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})}$ .

Proof.

• By independence, the mgf of ${\displaystyle X_{1}+\dotsb +X_{n}}$  is

${\displaystyle M_{X_{1}+\dotsb +X_{n}}(t)=M_{X_{1}}(t)\dotsb M_{X_{n}}(t)=\exp(\mu _{1}t+\sigma _{1}^{2}t^{2}/2)\dotsb \exp(\mu _{n}t+\sigma _{n}^{2}t^{2}/2)=\exp \left((\mu _{1}+\dotsb +\mu _{n})t+(\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})t^{2}/2\right),}$

which is the mgf of ${\displaystyle {\mathcal {N}}(\mu _{1}+\dotsb +\mu _{n},\sigma _{1}^{2}+\dotsb +\sigma _{n}^{2})}$ , as desired.

${\displaystyle \Box }$

Central limit theorem

We will provide a proof to central limit theorem (CLT) using mgf here.

Theorem. (Central limit theorem) Let ${\displaystyle X_{1},X_{2},\dotsc }$  be a sequence of i.i.d. random variables with finite mean ${\displaystyle \mu }$  and positive variance ${\displaystyle \sigma ^{2}}$ , and ${\displaystyle {\overline {X}}_{n}}$  be the sample mean of the first ${\displaystyle n}$  random variables, i.e. ${\displaystyle {\overline {X}}_{n}={\frac {X_{1}+\dotsb +X_{n}}{n}}}$ . Then, the standardized sample mean ${\displaystyle {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$  converges in distribution to a standard normal random variable as ${\displaystyle n\to \infty }$ .

Proof.

• Define ${\displaystyle T_{n}={\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$ . Then, we have

${\displaystyle T_{n}={\frac {{\sqrt {n}}{\big (}(X_{1}+\dotsb +X_{n})/n-\mu {\big )}}{\sigma }}={\frac {X_{1}+\dotsb +X_{n}}{\color {red}\sigma {\sqrt {n}}}}{\color {blue}-{\frac {{\sqrt {n}}\mu }{\sigma }}},}$

• which is in the form of ${\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b},\quad {\color {red}\mathbf {a} =\left({\frac {1}{\sigma {\sqrt {n}}}},\dotsc ,{\frac {1}{\sigma {\sqrt {n}}}}\right)^{T}}{\text{ and }}{\color {blue}b=-{\frac {{\sqrt {n}}\mu }{\sigma }}}}$ .
• Therefore,

{\displaystyle {\begin{aligned}M_{T_{n}}(t)&=e^{{\color {blue}-{\sqrt {n}}\mu t/\sigma }}\left(M_{X_{1}}\left({\frac {t}{{\color {red}\sigma {\sqrt {n}}}}}\right)\dotsb M_{X_{n}}\left({\frac {t}{{\color {red}\sigma {\sqrt {n}}}}}\right)\right)\\&=e^{-{\sqrt {n}}\mu t/\sigma }\left(M_{X_{1}}\left({\frac {t}{\sigma {\sqrt {n}}}}\right)\right)^{n}\quad {\text{since }}X_{1},\dotsc ,X_{n}{\text{ are identically distributed, i.e. have the same mgf}}\\\Rightarrow \ln M_{T_{n}}(t)&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[e^{tX_{1}/(\sigma {\sqrt {n}})}\right]\\&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[1+{\frac {t}{\sigma {\sqrt {n}}}}X_{1}+{\frac {1}{2!}}{\frac {t^{2}}{\sigma ^{2}n}}X_{1}^{2}+\dotsb \right]\quad {\text{since }}e^{x}=1+x+{\frac {x^{2}}{2!}}+\dotsb \\&=-{\sqrt {n}}\mu t/\sigma +n\ln {\bigg (}1+{\frac {t}{\sigma {\sqrt {n}}}}\underbrace {\mathbb {E} [X_{1}]} _{=\mu }+{\frac {t^{2}}{2\sigma ^{2}n}}\underbrace {\mathbb {E} [X_{1}^{2}]} _{=\operatorname {Var} (X_{1})+(\mathbb {E} [X_{1}])^{2}=\sigma ^{2}+\mu ^{2}}+{\text{terms of order smaller than }}n^{-1}{\bigg )}\\&=-{\sqrt {n}}\mu t/\sigma +n\left[{\frac {t\mu }{\sigma {\sqrt {n}}}}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}n}}-{\frac {1}{2}}\left({\frac {t\mu }{\sigma {\sqrt {n}}}}\right)^{2}+{\text{terms of order smaller than }}n^{-1}\right]\quad {\text{since }}\ln(1+x)=x-{\frac {x^{2}}{2}}+\dotsb \\&=\underbrace {-{\sqrt {n}}\mu t/\sigma +{\sqrt {n}}\mu t/\sigma } _{=0}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}}}-{\frac {t^{2}\mu ^{2}}{2\sigma ^{2}}}+{\text{terms of order smaller than }}n^{0}\\&={\frac {t^{2}}{2}}+\underbrace {{\text{terms of order smaller than }}n^{0}} _{\to 0{\text{ as }}n\to \infty }\\\Rightarrow \lim _{n\to \infty }M_{T_{n}}(t)&=\underbrace {e^{t^{2}/2}} _{{\text{mgf of }}{\mathcal {N}}(0,1)},\end{aligned}}}

and the result follows from the mgf property of identifying distribution uniquely.

${\displaystyle \Box }$
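The theorem can be seen in simulation; a sketch under our own choice of distribution, i.i.d. ${\displaystyle \operatorname {Unif} (0,1)}$  with ${\displaystyle \mu =1/2}$  and ${\displaystyle \sigma ^{2}=1/12}$ , checking that the standardized sample mean behaves like ${\displaystyle {\mathcal {N}}(0,1)}$ :

```python
import math, random

# Sketch: simulate the standardized sample mean T_n for i.i.d. Unif(0,1)
# r.v.'s (mu = 1/2, sigma^2 = 1/12) and check that the fraction of
# simulated T_n within one standard deviation of 0 is close to the
# N(0,1) value, about 0.6827.
random.seed(3)
n, reps = 100, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)
ts = [math.sqrt(n) * (sum(random.random() for _ in range(n)) / n - mu) / sigma
      for _ in range(reps)]
within_1sd = sum(1 for t in ts if abs(t) <= 1) / reps
print(within_1sd)   # ≈ 0.68
```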

Remark.

• Since ${\displaystyle {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}$  is approximately ${\displaystyle {\mathcal {N}}(0,1)}$ -distributed for large ${\displaystyle n}$ , and ${\displaystyle {\color {blue}{\frac {\sigma }{\sqrt {n}}}}\cdot {\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}{\color {red}+\mu }={\overline {X}}_{n}}$ ,
• the sample mean ${\displaystyle {\overline {X}}_{n}}$  is approximately ${\displaystyle {\mathcal {N}}({\color {red}\mu },{\color {blue}\sigma ^{2}/n})}$ -distributed for large ${\displaystyle n}$ .
• For normal r.v.'s with the same mean ${\displaystyle \mu }$  and the same variance ${\displaystyle \sigma ^{2}}$ , this distribution is exact,
• since if ${\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$  independently, then ${\displaystyle {\frac {X_{1}+\dotsb +X_{n}}{\color {blue}n}}\sim {\mathcal {N}}\left({\frac {\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}}}{\color {blue}n}},{\frac {\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}}{\color {blue}n^{2}}}\right)\equiv {\mathcal {N}}(\mu ,\sigma ^{2}/n)}$ .
• By the proposition about the distribution of linear transformations of normal r.v.'s, the sample sum ${\displaystyle X_{1}+\dotsb +X_{n}={\color {blue}n}{\overline {X}}_{n}}$  is then approximately ${\displaystyle {\mathcal {N}}({\color {blue}n}\mu ,{\color {blue}n^{2}}\sigma ^{2}/n)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}$ -distributed for large ${\displaystyle n}$ .
• For normal r.v.'s with the same mean ${\displaystyle \mu }$  and the same variance ${\displaystyle \sigma ^{2}}$ , this is again exact,
• since if ${\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}$  independently, then ${\displaystyle X_{1}+\dotsb +X_{n}\sim {\mathcal {N}}\left(\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}},\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}\right)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}$ .
• In general, if a r.v. converges in distribution, then we can use the limiting distribution to approximate probabilities involving the r.v.

A special case of using the CLT as an approximation is using the normal distribution to approximate a discrete distribution. To improve accuracy, we should ideally apply a continuity correction, as explained in the following.

Proposition. (Continuity correction) A continuity correction is rewriting the probability expression ${\displaystyle \mathbb {P} (X=i)}$  (${\displaystyle i}$  is an integer) as ${\displaystyle \mathbb {P} (i-1/2<X<i+1/2)}$  when approximating a discrete distribution by a normal distribution using the CLT.

Remark.

• The reason for doing this is to make ${\displaystyle i}$  to be at the 'middle' of the interval, so that it is better approximated.

Illustration of continuity correction:

|
|              /
|             /
|            /
|           /|
|          /#|
|         *##|
|        /|##|
|       /#|##|
|      /##|##|
|     /|##|##|
|    / |##|##|
|   /  |##|##|
|  /   |##|##|
| /    |##|##|
*------*--*--*---------------------
i-1/2 i i+1/2

|
|              /
|             /
|            /
|           /
|          /
|         *
|        /|
|       /#|
|      /##|
|     /###|
|    /####|
|   /#####|
|  /|#####|
| / |#####|
*---*-----*------------------------
i-1    i

|
|              /|
|             /#|
|            /##|
|           /###|
|          /####|
|         *#####|
|        /|#####|
|       / |#####|
|      /  |#####|
|     /   |#####|
|    /    |#####|
|   /     |#####|
|  /      |#####|
| /       |#####|
*---------*-----*------------------
i     i+1
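As a concrete sketch of the correction (the values ${\displaystyle n=20}$ , ${\displaystyle p=0.5}$ , ${\displaystyle i=10}$  are illustrative choices, not from the text), we can compare the exact binomial probability with its continuity-corrected normal approximation:

```python
import math

# Sketch: approximate P(X = 10) for X ~ Binom(20, 0.5) by
# P(9.5 < W < 10.5) where W ~ N(np, np(1-p)), i.e. with the
# continuity correction described above.
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # N(0,1) cdf

n, p, i = 20, 0.5, 10
mu, sd = n * p, math.sqrt(n * p * (1 - p))
exact = math.comb(n, i) * p**i * (1 - p)**(n - i)
approx = Phi((i + 0.5 - mu) / sd) - Phi((i - 0.5 - mu) / sd)
print(exact, approx)   # ≈ 0.1762 and ≈ 0.1769
```

Without the correction, the approximating interval degenerates (a normal r.v. hits the single point ${\displaystyle i}$  with probability zero), which is exactly what the illustrations above depict.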

1. or equivalently, transformation between supports of ${\displaystyle \mathbf {X} }$  and ${\displaystyle \mathbf {Y} }$