Underlying principle
Let $X_1,\dotsc,X_n$ be $n$ random variables, $Y_1,\dotsc,Y_n$ be another $n$ random variables, and $\mathbf{X}=(X_1,\dotsc,X_n)^T$, $\mathbf{Y}=(Y_1,\dotsc,Y_n)^T$ be random (column) vectors.
Suppose the vector-valued function[1] $\mathbf{g}:\operatorname{supp}(\mathbf{X})\to\operatorname{supp}(\mathbf{Y})$ is bijective (also called a one-to-one correspondence in this case). Then its inverse $\mathbf{g}^{-1}:\operatorname{supp}(\mathbf{Y})\to\operatorname{supp}(\mathbf{X})$ exists.
We can then transform $\mathbf{X}$ to $\mathbf{Y}$ by applying the transformation $\mathbf{g}$, i.e. $\mathbf{Y}=\mathbf{g}(\mathbf{X})$, and transform $\mathbf{Y}$ back to $\mathbf{X}$ by applying the inverse transformation $\mathbf{g}^{-1}$, i.e. $\mathbf{X}=\mathbf{g}^{-1}(\mathbf{Y})$.
We are often interested in deriving the joint probability function $f_{\mathbf{Y}}(\mathbf{y})$ of $\mathbf{Y}$, given the joint probability function $f_{\mathbf{X}}(\mathbf{x})$ of $\mathbf{X}$. We will examine the discrete and continuous cases in turn.
Transformation of discrete random variables
Proof. Considering the pmf $f_{\mathbf{Y}}(\mathbf{y})$, we have
$$f_{\mathbf{Y}}(\mathbf{y})\overset{\text{def}}{=}\mathbb{P}(\mathbf{Y}=\mathbf{y})=\mathbb{P}\left(\mathbf{g}^{-1}(\mathbf{Y})=\mathbf{g}^{-1}(\mathbf{y})\right)=\mathbb{P}\left(\mathbf{X}=\mathbf{g}^{-1}(\mathbf{y})\right)\overset{\text{def}}{=}f_{\mathbf{X}}\left(\mathbf{g}^{-1}(\mathbf{y})\right),\quad\mathbf{y}\in\operatorname{supp}(\mathbf{Y}).$$
In particular, the inverse $\mathbf{g}^{-1}$ exists since $\mathbf{g}$ is bijective. $\Box$
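The discrete identity $f_{\mathbf{Y}}(\mathbf{y})=f_{\mathbf{X}}(\mathbf{g}^{-1}(\mathbf{y}))$ can be checked directly in code. A minimal sketch, using a hypothetical pmf and map (uniform $X$ on $\{0,1,2,3\}$ and $g(x)=2x+1$, chosen only for illustration):

```python
# Check f_Y(y) = f_X(g^{-1}(y)) for a discrete r.v. under a bijective map.
# Hypothetical example: X uniform on {0, 1, 2, 3}, g(x) = 2x + 1.
f_X = {x: 0.25 for x in range(4)}

def g(x):
    return 2 * x + 1

def g_inv(y):
    return (y - 1) // 2

# pmf of Y obtained directly by pushing each probability mass forward...
f_Y_direct = {g(x): p for x, p in f_X.items()}

# ...and via the transformation formula f_Y(y) = f_X(g^{-1}(y)).
f_Y_formula = {y: f_X[g_inv(y)] for y in f_Y_direct}

assert f_Y_direct == f_Y_formula
```

For a bijective map, no probability mass is merged or split, which is why the two constructions agree exactly.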
Transformation of continuous random variables
For continuous random variables, the situation is more complicated. Let us first investigate the univariate case, which is simpler.
Proof. Under the assumption that $g$ is differentiable and strictly monotone, the cdf is
$$F_Y(y)=\mathbb{P}(g(X)\leq y)=\begin{cases}\mathbb{P}(X\leq g^{-1}(y))=F_X(g^{-1}(y)),&g^{-1}\text{ is increasing};\\\mathbb{P}(X\geq g^{-1}(y))=1-F_X(g^{-1}(y)),&g^{-1}\text{ is decreasing}.\end{cases}$$
($g^{-1}$ exists since $g$ is strictly monotone.)
Differentiating both sides of the above equation (assuming the cdfs involved are differentiable) gives
$$f_Y(y)=\begin{cases}f_X(g^{-1}(y))\dfrac{dg^{-1}(y)}{dy},&g^{-1}\text{ is increasing};\\-f_X(g^{-1}(y))\dfrac{dg^{-1}(y)}{dy},&g^{-1}\text{ is decreasing}.\end{cases}$$
Since $x=g^{-1}(y)$, we can write $\frac{dg^{-1}(y)}{dy}$ as $\frac{dx}{dy}$.
Also, we can combine the two cases above into a single expression by applying the absolute value function to both sides:
$$f_Y(y)=f_X(g^{-1}(y))\left|\frac{dx}{dy}\right|,$$
where the absolute value sign is applied only to $\frac{dx}{dy}$, since pdfs are nonnegative and thus need no absolute value. $\Box$
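The formula $f_Y(y)=f_X(g^{-1}(y))\left|\frac{dx}{dy}\right|$ can be verified numerically. A sketch under a hypothetical choice of distribution and map ($X\sim\operatorname{Exp}(1)$ and $Y=X^2$, not taken from the text):

```python
import math

# Numerical check of the univariate change-of-variables formula.
# Hypothetical example: X ~ Exp(1) and Y = g(X) = X^2, strictly increasing
# on x >= 0, so g^{-1}(y) = sqrt(y) and dx/dy = 1/(2*sqrt(y)).
def f_Y(y):
    x = math.sqrt(y)
    return math.exp(-x) / (2.0 * x)  # f_X(g^{-1}(y)) * |dx/dy|

# Integrating f_Y over [1, 4] should give
# P(1 <= Y <= 4) = P(1 <= X <= 2) = e^{-1} - e^{-2}.
N = 100_000
h = 3.0 / N
approx = sum(f_Y(1.0 + (k + 0.5) * h) * h for k in range(N))
exact = math.exp(-1.0) - math.exp(-2.0)
assert abs(approx - exact) < 1e-6
```

The midpoint rule is used so that the derived density, integrated over an interval, reproduces the probability computed directly from $X$.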
Remark. To explain this theorem more intuitively, we rewrite the equation in the theorem as
$$|f_Y(y)\,dy|=|f_X(g^{-1}(y))\,dx|,$$
where both sides can be regarded as differential areas, which are nonnegative due to the absolute value signs. This equation should hold intuitively, since both sides represent areas under pdfs, which represent probabilities. The quantity $|f_X(g^{-1}(y))\,dx|=|f_X(x)\,dx|$ is the area of the region $R_X$ under the pdf of $X$ over an "infinitesimal" interval $dx$, which represents the probability for $X$ to lie in this infinitesimal interval. After the transformation, we get the pdf of $Y$, and the original region $R_X$ is transformed to a region $R_Y$ under the pdf of $Y$ over an infinitesimal interval $dy=g'(x)\,dx$, with area $|f_Y(y)\,dy|$. Since $g$ is a bijective function (its strict monotonicity implies this), $dy$ "corresponds" to $dx$ in some sense, and the values in $dy$ "originate" from the values in $dx$, and so does the randomness. It follows that the probability of $X$ lying in $dx$ and of $Y$ lying in $dy$ should be the same, and hence the two differential areas are equal.
Let us now define the Jacobian matrix, introducing several notations in the definition.
Definition.
(Jacobian matrix)
Suppose the function $\mathbf{g}$ is differentiable (it then follows that $\mathbf{g}^{-1}$ is differentiable). The Jacobian matrix is
$$\frac{\partial\mathbf{y}}{\partial\mathbf{x}}=\begin{pmatrix}\frac{\partial g_1(\mathbf{x})}{\partial x_1}&\cdots&\frac{\partial g_1(\mathbf{x})}{\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial g_n(\mathbf{x})}{\partial x_1}&\cdots&\frac{\partial g_n(\mathbf{x})}{\partial x_n}\end{pmatrix},\quad\mathbf{y}=\mathbf{g}(\mathbf{x}),$$
in which $g_j$ is the $j$th component function of $\mathbf{g}$ for each $j\in\{1,\dotsc,n\}$, i.e. $\mathbf{g}(\mathbf{x})=(g_1(\mathbf{x}),\dotsc,g_n(\mathbf{x}))$.
Remark.
We have
$$\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\frac{\partial\mathbf{x}}{\partial\mathbf{y}}=I_{n\times n}\;\Leftrightarrow\;\frac{\partial\mathbf{y}}{\partial\mathbf{x}}=\left(\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\right)^{-1}.$$
Example.
Suppose $\mathbf{x}=(x_1,x_2)$, $\mathbf{y}=(y_1,y_2)$, and $\mathbf{y}=\mathbf{g}(\mathbf{x})=(2x_1,3x_2)$. Then $g_1(\mathbf{x})=2x_1$, $g_2(\mathbf{x})=3x_2$, and
$$\frac{\partial\mathbf{y}}{\partial\mathbf{x}}=\begin{pmatrix}\frac{\partial(2x_1)}{\partial x_1}&\frac{\partial(2x_1)}{\partial x_2}\\\frac{\partial(3x_2)}{\partial x_1}&\frac{\partial(3x_2)}{\partial x_2}\end{pmatrix}=\begin{pmatrix}2&0\\0&3\end{pmatrix}.$$
Also, $\mathbf{x}=\mathbf{g}^{-1}(\mathbf{y})=(y_1/2,y_2/3)$. Then $g_1^{-1}(\mathbf{y})=y_1/2$, $g_2^{-1}(\mathbf{y})=y_2/3$, and
$$\frac{\partial\mathbf{x}}{\partial\mathbf{y}}=\begin{pmatrix}\frac{\partial(y_1/2)}{\partial y_1}&\frac{\partial(y_1/2)}{\partial y_2}\\\frac{\partial(y_2/3)}{\partial y_1}&\frac{\partial(y_2/3)}{\partial y_2}\end{pmatrix}=\begin{pmatrix}1/2&0\\0&1/3\end{pmatrix}=\begin{pmatrix}2&0\\0&3\end{pmatrix}^{-1}=\frac{1}{6}\begin{pmatrix}3&0\\0&2\end{pmatrix}.$$
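The inverse relationship between the two Jacobians in this example can be confirmed numerically, e.g. with NumPy:

```python
import numpy as np

# Check that the two Jacobians from the example are matrix inverses:
# dy/dx = diag(2, 3) and dx/dy = diag(1/2, 1/3).
J_y_x = np.array([[2.0, 0.0], [0.0, 3.0]])
J_x_y = np.array([[0.5, 0.0], [0.0, 1.0 / 3.0]])

# Their product should be the 2x2 identity matrix.
assert np.allclose(J_y_x @ J_x_y, np.eye(2))

# det(dx/dy) = 1/det(dy/dx) = 1/6, the factor appearing in the
# multivariate change-of-variables formula below.
assert np.isclose(np.linalg.det(J_x_y), 1.0 / 6.0)
```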
Proof.
Partial proof: assume $\mathbf{g}$ is differentiable and bijective. First,
$$\mathbb{P}(\mathbf{Y}\in S)=\int\dotsi\int_{S}f_{\mathbf{Y}}(\mathbf{y})\,dy_1\cdots dy_n\qquad(1).$$
On the other hand, we have
$$\mathbb{P}(\mathbf{Y}\in S)=\mathbb{P}\left(\mathbf{X}=\mathbf{g}^{-1}(\mathbf{Y})\in\mathbf{g}^{-1}(S)\right)=\int\dotsi\int_{\mathbf{g}^{-1}(S)}f_{\mathbf{X}}(\mathbf{x})\,dx_1\cdots dx_n,$$
where $\mathbf{g}^{-1}(S)=\{\mathbf{x}\in\operatorname{supp}(\mathbf{X}):\mathbf{g}(\mathbf{x})\in S\}$ is the preimage of the set $S$ under $\mathbf{g}$.
Applying the change of variables formula to this integral (whose proof is advanced and uses our assumptions), we get
$$\int\dotsi\int_{\mathbf{g}^{-1}(S)}f_{\mathbf{X}}(\mathbf{x})\,dx_1\cdots dx_n=\int\dotsi\int_{S}f_{\mathbf{X}}\big(\mathbf{g}^{-1}(\mathbf{y})\big)\left|\det\frac{\partial\mathbf{x}}{\partial\mathbf{y}}\right|\,dy_1\cdots dy_n\qquad(2).$$
Comparing the integrands in $(1)$ and $(2)$, we obtain the desired result. $\Box$
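The multivariate formula can be sanity-checked with the diagonal map from the Jacobian example above, applied to a hypothetical $\mathbf{X}$ uniform on the unit square (an assumed setup, not from the text):

```python
# Sketch of f_Y(y) = f_X(g^{-1}(y)) * |det(dx/dy)| for a simple linear map.
# Hypothetical example: X uniform on [0,1]^2, Y = g(X) = (2*X1, 3*X2), so
# f_X = 1 on its support and dx/dy = diag(1/2, 1/3).
det_dx_dy = (1 / 2) * (1 / 3)      # determinant of diag(1/2, 1/3)
f_Y = 1.0 * abs(det_dx_dy)         # constant density of Y on [0,2] x [0,3]

# The transformed density must integrate to 1 over its support [0,2] x [0,3].
total_mass = f_Y * (2 * 3)
assert abs(total_mass - 1.0) < 1e-12

# P(Y1 <= 1, Y2 <= 1) = P(X1 <= 1/2, X2 <= 1/3) = 1/6, matching f_Y * area.
assert abs(f_Y * 1 * 1 - (1 / 2) * (1 / 3)) < 1e-12
```

Because the map is linear, the Jacobian determinant is constant, which makes the check exact rather than approximate.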
Definition.
(Moment generating function)
The moment generating function (mgf) of the distribution of a random variable $X$ is
$$M_X(t)=\mathbb{E}\left[e^{tX}\right].$$
Remark.
For comparison: the cdf is $F_X(t)=\mathbb{E}[\mathbf{1}\{X\leq t\}]$.
The mgf, like the pmf, pdf and cdf, gives a complete description of a distribution, so it can likewise uniquely identify a distribution, provided that the mgf exists (the expectation may be infinite); i.e., we can recover the probability function from the mgf.
The proof of this result is complicated, and thus omitted.
Proof.
$$M_X(t)=\mathbb{E}\left[e^{tX}\right]=\mathbb{E}\left[1+tX+\frac{t^2X^2}{2!}+\dotsb\right]\overset{\text{linearity}}{=}1+t\,\mathbb{E}[X]+\frac{t^2}{2!}\mathbb{E}[X^2]+\dotsb,$$
$$\frac{d^n}{dt^n}M_X(t)\bigg|_{t=0}=\frac{d^n}{dt^n}\left(1+t\,\mathbb{E}[X]+\frac{t^2}{2!}\mathbb{E}[X^2]+\dotsb\right)\bigg|_{t=0}=\left(\mathbb{E}[X]\frac{d^n}{dt^n}t+\frac{\mathbb{E}[X^2]}{2!}\frac{d^n}{dt^n}t^2+\dotsb\right)\bigg|_{t=0}.$$
The result follows from simplifying the above expression using
$$\frac{d^n}{dt^n}t^m\bigg|_{t=0}=\mathbf{1}\{m=n\}\,n!+\mathbf{1}\{m\neq n\}(0).\qquad\Box$$
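The moment-extraction property $M_X^{(n)}(0)=\mathbb{E}[X^n]$ can be checked with finite differences. A sketch using the exponential mgf $M_X(t)=\lambda/(\lambda-t)$ (stated later in this chapter) and the known moments $\mathbb{E}[X^n]=n!/\lambda^n$:

```python
# Check M_X'(0) = E[X] and M_X''(0) = E[X^2] numerically for X ~ Exp(lam),
# where M_X(t) = lam / (lam - t), E[X] = 1/lam and E[X^2] = 2/lam^2.
lam = 2.0

def M(t):
    return lam / (lam - t)

h = 1e-3
# Central finite differences for the first and second derivatives at t = 0.
M1 = (M(h) - M(-h)) / (2 * h)                 # ~ E[X] = 1/lam
M2 = (M(h) - 2 * M(0.0) + M(-h)) / (h * h)    # ~ E[X^2] = 2/lam^2

assert abs(M1 - 1.0 / lam) < 1e-5
assert abs(M2 - 2.0 / lam ** 2) < 1e-4
```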
Proof.
$$M_{XY}(t)=\mathbb{E}[e^{tXY}]\overset{\text{lote}}{=}\mathbb{E}_X\Big[\mathbb{E}_Y[e^{tXY}\mid X]\Big]=\mathbb{E}_X[M_Y(tX)].$$
Similarly,
$$M_{XY}(t)=\mathbb{E}[e^{tXY}]\overset{\text{lote}}{=}\mathbb{E}_Y\Big[\mathbb{E}_X[e^{tXY}\mid Y]\Big]=\mathbb{E}_Y[M_X(tY)].$$
(lote: law of total expectation) $\Box$
Remark.
This equality does not hold if $X$ and $Y$ are not independent.
Joint moment generating function
In the following, we will use $\mathbf{X}$ to denote $(X_1,\dotsc,X_n)^T$.
Definition.
(Joint moment generating function)
The joint moment generating function (mgf) of a random vector $\mathbf{X}$ is
$$M_{\mathbf{X}}(\mathbf{t})=\mathbb{E}[e^{\mathbf{t}\cdot\mathbf{X}}]=\mathbb{E}[e^{t_1X_1+\dotsb+t_nX_n}]$$
for each (column) vector $\mathbf{t}=(t_1,\dotsc,t_n)^T$, if the expectation exists.
Remark.
When $n=1$, the dot product of the two vectors reduces to the product of two numbers. Here $\mathbf{t}\cdot\mathbf{X}\overset{\text{def}}{=}\mathbf{t}^T\mathbf{X}$.
Proposition.
(Relationship between independence and mgf)
Random variables $X_1,\dotsc,X_n$ are independent if and only if
$$M_{\mathbf{X}}(\mathbf{t})=M_{X_1}(t_1)\dotsb M_{X_n}(t_n).$$
Proof.
'Only if' part: assume $X_1,\dotsc,X_n$ are independent. Then
$$M_{\mathbf{X}}(\mathbf{t})=\mathbb{E}[e^{\mathbf{t}\cdot\mathbf{X}}]=\mathbb{E}[e^{t_1X_1}\dotsb e^{t_nX_n}]\overset{\text{independence}}{=}\mathbb{E}[e^{t_1X_1}]\dotsb\mathbb{E}[e^{t_nX_n}]=M_{X_1}(t_1)\dotsb M_{X_n}(t_n).$$
The proof of the 'if' part is quite complicated, and thus omitted. $\Box$
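The factorization for independent r.v.'s can be checked by Monte Carlo. A sketch with a hypothetical choice of two i.i.d. $\operatorname{Exp}(1)$ variables (and a fixed seed, so the check is reproducible):

```python
import numpy as np

# Monte Carlo check that M_X(t) factorizes for independent r.v.'s.
# Hypothetical example: X1, X2 i.i.d. Exp(1) and t = (0.3, 0.3), so the
# joint mgf should equal M_{X1}(0.3) * M_{X2}(0.3) = (1/0.7)^2.
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.exponential(1.0, n)
x2 = rng.exponential(1.0, n)
t1 = t2 = 0.3

joint_mgf = np.mean(np.exp(t1 * x1 + t2 * x2))      # estimate of E[e^{t.X}]
product = (1.0 / (1.0 - t1)) * (1.0 / (1.0 - t2))   # M_{X1}(t1) * M_{X2}(t2)

assert abs(joint_mgf - product) < 0.05
```

The tolerance is loose because the estimate carries Monte Carlo noise; it shrinks as `n` grows.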
Analogously, we have the marginal mgf.
Definition.
(Marginal mgf)
The marginal mgf of $X_i$, a member of the random variables $X_1,\dotsc,X_n$, is
$$M_{X_i}(t)=M_{\mathbf{X}}(0,\dotsc,0,\underbrace{t}_{i\text{th position}},0,\dotsc,0).$$
Proof.
$$M_{\mathbf{a}\cdot\mathbf{X}+b}(t)=\mathbb{E}[e^{t\mathbf{a}\cdot\mathbf{X}+bt}]=e^{bt}\,\mathbb{E}[e^{t\mathbf{a}\cdot\mathbf{X}}]=e^{bt}M_{\mathbf{X}}(t\mathbf{a})=e^{bt}M_{\mathbf{X}}(ta_1,\dotsc,ta_n).\qquad\Box$$
Remark.
If $X_1,\dotsc,X_n$ are independent, then
$$M_{\mathbf{a}\cdot\mathbf{X}+b}(t)=e^{bt}M_{X_1}(ta_1)\dotsb M_{X_n}(ta_n).$$
This provides an alternative, and possibly more convenient, method to derive the distribution of $\mathbf{a}\cdot\mathbf{X}+b$, compared with deriving it from the probability functions of $X_1,\dotsc,X_n$.
Special case: if $\mathbf{a}=(1,\dotsc,1)^T$ and $b=0$, then $\mathbf{a}\cdot\mathbf{X}+b=X_1+\dotsb+X_n$, the sum of the r.v.'s. So
$$M_{X_1+\dotsb+X_n}(t)=M_{\mathbf{X}}(t,\dotsc,t).$$
In particular, if $X_1,\dotsc,X_n$ are independent, then
$$M_{X_1+\dotsb+X_n}(t)=M_{X_1}(t)\dotsb M_{X_n}(t).$$
We can use this result to prove the formulas for sums of independent r.v.'s, instead of using the proposition about convolution of r.v.'s.
Special case: if $n=1$, then the expression for the linear transformation becomes $aX+b$. So
$$M_{aX+b}(t)=e^{bt}M_X(at).$$
Moment generating function of some important distributions
Proposition.
(Moment generating function of the binomial distribution)
The moment generating function of $X\sim\operatorname{Binom}(n,p)$ is
$$M_X(t)=(pe^t+1-p)^n.$$
Proof.
$$M_X(t)=\sum_{k=0}^{n}e^{tk}\underbrace{\binom{n}{k}p^k(1-p)^{n-k}}_{\text{pmf of }\operatorname{Binom}(n,p)}=\sum_{k=0}^{n}\binom{n}{k}(pe^t)^k(1-p)^{n-k}=(pe^t+1-p)^n\quad\text{by the binomial theorem.}\qquad\Box$$
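The closed form can be checked against the defining sum directly, e.g. for hypothetical parameter values $n=10$, $p=0.3$, $t=0.7$:

```python
import math

# Check the binomial mgf (p*e^t + 1 - p)^n against the defining sum
# sum_k e^{tk} * C(n, k) * p^k * (1 - p)^{n - k}.
n, p, t = 10, 0.3, 0.7
direct = sum(
    math.exp(t * k) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    for k in range(n + 1)
)
closed = (p * math.exp(t) + 1 - p) ** n
assert abs(direct - closed) < 1e-9
```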
Proposition.
(Moment generating function of the Poisson distribution)
The moment generating function of $X\sim\operatorname{Pois}(\lambda)$ is
$$M_X(t)=e^{\lambda(e^t-1)}.$$
Proof.
$$M_X(t)\overset{\text{def}}{=}\mathbb{E}[e^{tX}]\overset{\text{LOTUS}}{=}\sum_{k=0}^{\infty}e^{tk}\cdot\underbrace{\frac{e^{-\lambda}\lambda^k}{k!}}_{\text{pmf of }\operatorname{Pois}(\lambda)}=e^{\lambda(e^t-1)}\overbrace{\sum_{k=0}^{\infty}\underbrace{\frac{e^{-\lambda e^t}(\lambda e^t)^k}{k!}}_{\text{pmf of }\operatorname{Pois}(\lambda e^t)}}^{=1}=e^{\lambda(e^t-1)}.\qquad\Box$$
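Again the closed form can be compared with a truncation of the defining series, for hypothetical parameter values $\lambda=1.5$, $t=0.5$:

```python
import math

# Check the Poisson mgf e^{lam*(e^t - 1)} against a truncation of the
# defining series sum_k e^{tk} * e^{-lam} * lam^k / k!.
lam, t = 1.5, 0.5
direct = sum(
    math.exp(t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
    for k in range(60)  # the tail beyond k = 60 is negligible here
)
closed = math.exp(lam * (math.exp(t) - 1))
assert abs(direct - closed) < 1e-9
```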
Proposition.
(Moment generating function of the exponential distribution)
The moment generating function of $X\sim\operatorname{Exp}(\lambda)$ is
$$M_X(t)=\frac{\lambda}{\lambda-t},\quad t<\lambda.$$
Proof.
$$M_X(t)=\mathbb{E}[e^{tX}]=\lambda\int_0^\infty e^{tx}e^{-\lambda x}\,dx=\lambda\int_0^\infty e^{-(\lambda-t)x}\,dx=\frac{\lambda}{\lambda-t}\overbrace{\int_0^\infty\underbrace{(\lambda-t)e^{-(\lambda-t)x}}_{\text{pdf of }\operatorname{Exp}(\lambda-t)}\,dx}^{=1},\quad\underbrace{\lambda-t>0}_{\text{ensuring a valid rate parameter}}\Leftrightarrow t<\lambda.$$
The result follows. $\Box$
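The integral in this proof can be evaluated numerically and compared with $\lambda/(\lambda-t)$, here for hypothetical values $\lambda=2$, $t=0.5$ (truncating the integral at a large upper limit):

```python
import math

# Check the exponential mgf lam/(lam - t) by numerically integrating
# E[e^{tX}] = lam * integral_0^inf e^{tx} e^{-lam*x} dx  (truncated).
lam, t = 2.0, 0.5
upper, N = 30.0, 200_000
h = upper / N
integral = sum(
    math.exp(-(lam - t) * (k + 0.5) * h) for k in range(N)
) * h
approx = lam * integral
exact = lam / (lam - t)
assert abs(approx - exact) < 1e-4
```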
Proposition.
(Moment generating function of the gamma distribution)
The moment generating function of $X\sim\operatorname{Gamma}(\alpha,\lambda)$ is
$$M_X(t)=\left(\frac{\lambda}{\lambda-t}\right)^{\alpha},\quad t<\lambda.$$
Proof.
We use a similar technique to the proof of the mgf of the exponential distribution.
$$M_X(t)=\mathbb{E}[e^{tX}]=\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty e^{tx}x^{\alpha-1}e^{-\lambda x}\,dx=\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty e^{-(\lambda-t)x}x^{\alpha-1}\,dx=\frac{\lambda^\alpha}{(\lambda-t)^\alpha}\overbrace{\int_0^\infty\underbrace{\frac{(\lambda-t)^\alpha}{\Gamma(\alpha)}e^{-(\lambda-t)x}x^{\alpha-1}}_{\text{pdf of }\operatorname{Gamma}(\alpha,\lambda-t)}\,dx}^{=1},\quad\underbrace{\lambda-t>0}_{\text{ensuring a valid rate parameter}}\Leftrightarrow t<\lambda.\qquad\Box$$
Proposition.
(Moment generating function of the normal distribution)
The moment generating function of $X\sim\mathcal{N}(\mu,\sigma^2)$ is
$$M_X(t)=e^{\mu t+\sigma^2t^2/2}.$$
Distribution of linear transformation of random variables
We will prove some propositions about the distributions of linear transformations of random variables using mgfs. Some of them were mentioned in previous chapters. As we will see, proving these propositions using mgfs is quite simple.
Proposition.
(Distribution of a linear transformation of a normal r.v.)
Let $X\sim\mathcal{N}(\mu,\sigma^2)$. Then
$$aX+b\sim\mathcal{N}(a\mu+b,a^2\sigma^2).$$
Proof.
The mgf of $aX+b$ is
$$M_{aX+b}(t)=e^{bt}M_X(at)=e^{bt}\exp\left(a\mu t+(a\sigma)^2t^2/2\right)=\exp\left((a\mu+b)t+a^2\sigma^2t^2/2\right),$$
which is the mgf of $\mathcal{N}(a\mu+b,a^2\sigma^2)$; the result follows since the mgf identifies a distribution uniquely. $\Box$
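A quick Monte Carlo sanity check of this proposition, for hypothetical parameter values (seeded so the run is reproducible):

```python
import numpy as np

# If X ~ N(mu, sigma^2), then a*X + b should have mean a*mu + b and
# variance a^2 * sigma^2.  Hypothetical values: mu=1, sigma=2, a=3, b=-1.
rng = np.random.default_rng(42)
mu, sigma, a, b = 1.0, 2.0, 3.0, -1.0
x = rng.normal(mu, sigma, 500_000)
y = a * x + b

assert abs(y.mean() - (a * mu + b)) < 0.05       # target mean: 2
assert abs(y.var() - a ** 2 * sigma ** 2) < 0.5  # target variance: 36
```

This only checks the first two moments, of course; the mgf argument above is what pins down the full distribution.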
Sum of independent random variables
Proposition.
(Sum of independent binomial r.v.'s)
Let $X_1\sim\operatorname{Binom}(n_1,p),\dotsc,X_m\sim\operatorname{Binom}(n_m,p)$, in which $X_1,\dotsc,X_m$ are independent. Then
$$X_1+\dotsb+X_m\sim\operatorname{Binom}(n_1+\dotsb+n_m,p).$$
Proof.
The mgf of $X_1+\dotsb+X_m$ is
$$M_{X_1+\dotsb+X_m}(t)=M_{X_1}(t)\dotsb M_{X_m}(t)=(pe^t+1-p)^{n_1}\dotsb(pe^t+1-p)^{n_m}=(pe^t+1-p)^{n_1+\dotsb+n_m},$$
which is the mgf of $\operatorname{Binom}(n_1+\dotsb+n_m,p)$, as desired. $\Box$
Proposition.
(Sum of independent Poisson r.v.'s)
Let $X_1\sim\operatorname{Pois}(\lambda_1),\dotsc,X_n\sim\operatorname{Pois}(\lambda_n)$, in which $X_1,\dotsc,X_n$ are independent. Then
$$X_1+\dotsb+X_n\sim\operatorname{Pois}(\lambda_1+\dotsb+\lambda_n).$$
Proof.
The mgf of $X_1+\dotsb+X_n$ is
$$M_{X_1+\dotsb+X_n}(t)=M_{X_1}(t)\dotsb M_{X_n}(t)=e^{\lambda_1(e^t-1)}\dotsb e^{\lambda_n(e^t-1)}=e^{(\lambda_1+\dotsb+\lambda_n)(e^t-1)},$$
which is the mgf of $\operatorname{Pois}(\lambda_1+\dotsb+\lambda_n)$, as desired. $\Box$
Proposition.
(Sum of independent exponential r.v.'s)
Let $X_1,\dotsc,X_n\sim\operatorname{Exp}(\lambda)$ be independent. Then
$$X_1+\dotsb+X_n\sim\operatorname{Gamma}(n,\lambda).$$
Proof.
The mgf of $X_1+\dotsb+X_n$ is
$$M_{X_1+\dotsb+X_n}(t)=M_{X_1}(t)\dotsb M_{X_n}(t)=\left(\frac{\lambda}{\lambda-t}\right)^n,$$
which is the mgf of $\operatorname{Gamma}(n,\lambda)$, as desired. $\Box$
Proposition.
(Sum of independent gamma r.v.'s)
Let $X_1\sim\operatorname{Gamma}(\alpha_1,\lambda),\dotsc,X_n\sim\operatorname{Gamma}(\alpha_n,\lambda)$, in which $X_1,\dotsc,X_n$ are independent. Then
$$X_1+\dotsb+X_n\sim\operatorname{Gamma}(\alpha_1+\dotsb+\alpha_n,\lambda).$$
Proof.
The mgf of $X_1+\dotsb+X_n$ is
$$M_{X_1+\dotsb+X_n}(t)=M_{X_1}(t)\dotsb M_{X_n}(t)=\left(\frac{\lambda}{\lambda-t}\right)^{\alpha_1}\dotsb\left(\frac{\lambda}{\lambda-t}\right)^{\alpha_n}=\left(\frac{\lambda}{\lambda-t}\right)^{\alpha_1+\dotsb+\alpha_n},$$
which is the mgf of $\operatorname{Gamma}(\alpha_1+\dotsb+\alpha_n,\lambda)$, as desired. $\Box$
Proposition.
(Sum of independent normal r.v.'s)
Let $X_1\sim\mathcal{N}(\mu_1,\sigma_1^2),\dotsc,X_n\sim\mathcal{N}(\mu_n,\sigma_n^2)$, in which $X_1,\dotsc,X_n$ are independent. Then
$$X_1+\dotsb+X_n\sim\mathcal{N}(\mu_1+\dotsb+\mu_n,\sigma_1^2+\dotsb+\sigma_n^2).$$
Proof.
The mgf of $X_1+\dotsb+X_n$ (in which the summands are independent) is
$$M_{X_1+\dotsb+X_n}(t)=M_{X_1}(t)\dotsb M_{X_n}(t)=\exp(\mu_1t+\sigma_1^2t^2/2)\dotsb\exp(\mu_nt+\sigma_n^2t^2/2)=\exp\left((\mu_1+\dotsb+\mu_n)t+(\sigma_1^2+\dotsb+\sigma_n^2)t^2/2\right),$$
which is the mgf of $\mathcal{N}(\mu_1+\dotsb+\mu_n,\sigma_1^2+\dotsb+\sigma_n^2)$, as desired. $\Box$
We will provide a proof of the central limit theorem (CLT) using mgfs here.
Proof.
Define
T
n
=
n
(
X
¯
n
−
μ
)
σ
{\displaystyle T_{n}={\frac {{\sqrt {n}}({\overline {X}}_{n}-\mu )}{\sigma }}}
. Then, we have
T
n
=
n
(
(
X
1
+
⋯
+
X
n
)
/
n
−
μ
)
σ
=
X
1
+
⋯
+
X
n
σ
n
−
n
μ
σ
,
{\displaystyle T_{n}={\frac {{\sqrt {n}}{\big (}(X_{1}+\dotsb +X_{n})/n-\mu {\big )}}{\sigma }}={\frac {X_{1}+\dotsb +X_{n}}{\color {red}\sigma {\sqrt {n}}}}{\color {blue}-{\frac {{\sqrt {n}}\mu }{\sigma }}},}
which is in the form of
a
⋅
X
+
b
,
a
=
(
1
σ
n
,
…
,
1
σ
n
)
T
and
b
=
−
n
μ
σ
{\displaystyle {\color {red}\mathbf {a} }\cdot \mathbf {X} +{\color {blue}b},\quad {\color {red}\mathbf {a} =\left({\frac {1}{\sigma {\sqrt {n}}}},\dotsc ,{\frac {1}{\sigma {\sqrt {n}}}}\right)^{T}}{\text{ and }}{\color {blue}b=-{\frac {{\sqrt {n}}\mu }{\sigma }}}}
.
Therefore,
M
T
n
(
t
)
=
e
−
n
μ
t
/
σ
(
M
X
1
(
t
σ
n
)
⋯
M
X
n
(
t
σ
n
)
)
M
T
n
(
t
)
=
e
−
n
μ
t
/
σ
(
M
X
1
(
t
σ
n
)
)
n
since
X
1
,
…
,
X
n
are identically distributed, which is equivalent to they have the same mgf
⇒
ln
M
T
n
(
t
)
=
−
n
μ
t
/
σ
+
n
ln
(
E
[
e
t
/
(
σ
n
)
X
1
]
)
=
−
n
μ
t
/
σ
+
n
ln
E
[
1
+
t
/
(
σ
n
)
X
1
+
(
1
/
2
!
)
t
2
/
(
σ
2
n
)
+
⋯
]
since
e
x
=
1
+
x
+
x
2
2
!
+
⋯
=
−
n
μ
t
/
σ
+
n
ln
(
1
+
t
/
(
σ
n
)
E
[
X
]
+
(
1
/
2
!
)
t
2
/
(
σ
2
n
)
(
E
[
X
2
]
⏟
Var
(
X
)
+
(
E
[
X
]
)
2
)
+
terms of order smaller than
n
−
1
)
=
−
n
μ
t
/
σ
+
n
ln
(
1
+
t
/
(
σ
n
)
μ
+
(
1
/
2
)
t
2
/
(
σ
2
n
)
(
σ
2
+
μ
2
)
+
terms of order smaller than
n
−
1
)
=
−
n
μ
t
/
σ
+
n
[
t
/
(
σ
n
)
μ
+
(
1
/
2
)
t
2
/
(
σ
2
n
)
(
σ
2
+
μ
2
)
−
(
1
/
2
)
(
t
/
(
σ
n
)
μ
)
2
+
terms of order smaller than
n
−
1
]
since
ln
(
1
+
x
)
=
x
−
x
2
/
2
+
⋯
=
−
n
μ
t
/
σ
+
n
μ
t
/
σ
+
n
(
1
/
2
)
t
2
/
(
σ
2
n
)
(
σ
2
+
μ
2
)
−
n
(
1
/
2
)
(
t
2
/
(
σ
2
n
)
μ
2
)
+
terms of order smaller than
n
0
=
(
1
/
2
)
t
2
(
σ
2
/
σ
2
)
+
(
1
/
2
)
μ
2
t
2
−
(
1
/
2
)
t
2
(
μ
2
)
+
terms of order smaller than
n
0
=
(
1
/
2
)
t
2
+
terms of order smaller than
n
0
⏟
→
0
as
n
→
∞
⇒
lim
n
→
∞
M
T
n
(
t
)
=
e
t
2
/
2
⏟
mgf of
N
(
{\displaystyle {\begin{aligned}&&M_{T_{n}}(t)&=e^{-{\sqrt {n}}\mu t/\sigma }\left(M_{X_{1}}\left({\frac {t}{\sigma {\sqrt {n}}}}\right)\dotsb M_{X_{n}}\left({\frac {t}{\sigma {\sqrt {n}}}}\right)\right)\\&&&=e^{-{\sqrt {n}}\mu t/\sigma }\left(M_{X_{1}}\left({\frac {t}{\sigma {\sqrt {n}}}}\right)\right)^{n}\quad {\text{since }}X_{1},\dotsc ,X_{n}{\text{ are identically distributed, and hence have the same mgf}}\\&\Rightarrow &\ln M_{T_{n}}(t)&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[e^{tX_{1}/(\sigma {\sqrt {n}})}\right]\\&&&=-{\sqrt {n}}\mu t/\sigma +n\ln \mathbb {E} \left[1+{\frac {tX_{1}}{\sigma {\sqrt {n}}}}+{\frac {t^{2}X_{1}^{2}}{2!\,\sigma ^{2}n}}+\dotsb \right]\quad {\text{since }}e^{x}=1+x+{\frac {x^{2}}{2!}}+\dotsb \\&&&=-{\sqrt {n}}\mu t/\sigma +n\ln {\Big (}1+{\frac {t}{\sigma {\sqrt {n}}}}\underbrace {\mathbb {E} [X_{1}]} _{\mu }+{\frac {t^{2}}{2\sigma ^{2}n}}\underbrace {\mathbb {E} [X_{1}^{2}]} _{\sigma ^{2}+\mu ^{2}}+{\text{terms of order smaller than }}n^{-1}{\Big )}\\&&&=-{\sqrt {n}}\mu t/\sigma +n\left[{\frac {t\mu }{\sigma {\sqrt {n}}}}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}n}}-{\frac {1}{2}}\left({\frac {t\mu }{\sigma {\sqrt {n}}}}\right)^{2}+{\text{terms of order smaller than }}n^{-1}\right]\quad {\text{since }}\ln(1+x)=x-{\frac {x^{2}}{2}}+\dotsb \\&&&={\cancel {-{\sqrt {n}}\mu t/\sigma }}+{\cancel {{\sqrt {n}}\mu t/\sigma }}+{\frac {t^{2}(\sigma ^{2}+\mu ^{2})}{2\sigma ^{2}}}-{\frac {t^{2}\mu ^{2}}{2\sigma ^{2}}}+{\text{terms of order smaller than }}n^{0}\\&&&={\frac {t^{2}}{2}}+\underbrace {{\text{terms of order smaller than }}n^{0}} _{\to 0{\text{ as }}n\to \infty }\\&\Rightarrow &\lim _{n\to \infty }M_{T_{n}}(t)&=\underbrace {e^{t^{2}/2}} _{{\text{mgf of }}{\mathcal {N}}(0,1)},\end{aligned}}}
and the result follows since an mgf, when it exists, determines a distribution uniquely.
{\displaystyle \Box }
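The limiting behaviour established in the proof can be illustrated numerically. The following is a minimal simulation sketch (an illustration, not part of the proof; the Uniform(0,1) choice, the sample sizes, and the seed are arbitrary assumptions): for i.i.d. Uniform(0,1) r.v.'s, the standardized sample mean T_n should behave approximately like N(0,1) for large n.

```python
import math
import random
import statistics

random.seed(0)

# For i.i.d. Uniform(0,1) draws, T_n = sqrt(n) * (X_bar - mu) / sigma
# should be approximately N(0, 1) for large n by the CLT.
mu = 0.5                    # mean of Uniform(0,1)
sigma = math.sqrt(1 / 12)   # standard deviation of Uniform(0,1)
n = 1000                    # sample size per replication

t_values = []
for _ in range(2000):       # 2000 independent replications of T_n
    x_bar = statistics.fmean(random.random() for _ in range(n))
    t_values.append(math.sqrt(n) * (x_bar - mu) / sigma)

# Empirically, T_n has mean close to 0 and standard deviation close to 1.
print(round(statistics.fmean(t_values), 2))
print(round(statistics.stdev(t_values), 2))
```

A histogram of `t_values` would likewise resemble the standard normal density, but the first two moments already give a quick sanity check.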
Remark.
Since
{\displaystyle {\frac {{\sqrt {n}}({\overline {X}}-\mu )}{\sigma }}\sim {\mathcal {N}}(0,1)\Leftrightarrow {\color {blue}{\frac {\sigma }{\sqrt {n}}}}\cdot {\frac {{\sqrt {n}}({\overline {X}}-\mu )}{\sigma }}{\color {red}+\mu }\sim {\mathcal {N}}({\color {red}\mu },{\color {blue}\sigma ^{2}/n})\Leftrightarrow {\overline {X}}\sim {\mathcal {N}}(\mu ,\sigma ^{2}/n)}
,
the sample mean is approximately distributed as
{\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2}/n)}
for large
{\displaystyle n}
. The same result holds exactly for the sample mean of independent normal r.v.'s with the same mean
{\displaystyle \mu }
and the same variance
{\displaystyle \sigma ^{2}}
,
since if
{\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}
, then
{\displaystyle {\frac {X_{1}+\dotsb +X_{n}}{\color {blue}n}}\sim {\mathcal {N}}\left({\frac {\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}}}{\color {blue}n}},{\frac {\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}}{\color {blue}n^{2}}}\right)\equiv {\mathcal {N}}(\mu ,\sigma ^{2}/n)}
. It follows from the proposition about the distribution of a linear transformation of normal r.v.'s that the sample sum, i.e.
{\displaystyle X_{1}+\dotsb +X_{n}={\color {blue}n}{\overline {X}}}
is approximately distributed as
{\displaystyle {\mathcal {N}}({\color {blue}n}\mu ,{\color {blue}n^{2}}\sigma ^{2}/n)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}
for large
{\displaystyle n}
. The same result holds exactly for the sample sum of independent normal r.v.'s with the same mean
{\displaystyle \mu }
and the same variance
{\displaystyle \sigma ^{2}}
,
since if
{\displaystyle X_{1},\dotsc ,X_{n}\sim {\mathcal {N}}(\mu ,\sigma ^{2})}
, then
{\displaystyle X_{1}+\dotsb +X_{n}\sim {\mathcal {N}}\left(\overbrace {\mu +\dotsb +\mu } ^{n{\text{ times}}},\overbrace {\sigma ^{2}+\dotsb +\sigma ^{2}} ^{n{\text{ times}}}\right)\equiv {\mathcal {N}}(n\mu ,n\sigma ^{2})}
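The exact distribution of the sample sum of i.i.d. normal r.v.'s can be checked empirically. A minimal sketch (the values of μ, σ, n, the replication count, and the seed are arbitrary assumptions):

```python
import math
import random
import statistics

random.seed(1)

# For i.i.d. N(mu, sigma^2) draws, the sample sum X_1 + ... + X_n
# is exactly N(n*mu, n*sigma^2); we check the first two moments empirically.
mu, sigma, n = 2.0, 3.0, 50

sums = []
for _ in range(5000):  # 5000 independent replications of the sample sum
    sums.append(sum(random.gauss(mu, sigma) for _ in range(n)))

print(round(statistics.fmean(sums), 1))   # close to n * mu = 100
print(round(statistics.pstdev(sums), 1))  # close to sqrt(n) * sigma (about 21.2)
```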
. If a r.v. converges in distribution to some distribution, then we can use that distribution to approximate probabilities involving the r.v.
A special case of using CLT as approximation is using normal distribution to approximate discrete distribution.
To improve accuracy, we should ideally apply a continuity correction, as explained in the following.
Remark.
The reason for doing this is to place
{\displaystyle i}
at the 'middle' of the interval, so that the probability at that point is better approximated.
Illustration of continuity correction:
|
| /
| /
| /
| /|
| /#|
| *##|
| /|##|
| /#|##|
| /##|##|
| /|##|##|
| / |##|##|
| / |##|##|
| / |##|##|
| / |##|##|
*------*--*--*---------------------
i-1/2 i i+1/2
|
| /
| /
| /
| /
| /
| *
| /|
| /#|
| /##|
| /###|
| /####|
| /#####|
| /|#####|
| / |#####|
*---*-----*------------------------
i-1 i
|
| /|
| /#|
| /##|
| /###|
| /####|
| *#####|
| /|#####|
| / |#####|
| / |#####|
| / |#####|
| / |#####|
| / |#####|
| / |#####|
| / |#####|
*---------*-----*------------------
i i+1
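As a concrete instance of the first diagram above, the following sketch compares the exact pmf of a binomial r.v. at a point with its continuity-corrected normal approximation (the choice of n = 20, p = 0.5, i = 10 is an arbitrary assumption):

```python
import math

# Continuity correction: approximate P(X = i) for X ~ Binomial(n, p) by the
# N(np, np(1-p)) probability of the interval (i - 1/2, i + 1/2).
def normal_cdf(x, mean, sd):
    # CDF of N(mean, sd^2) via the error function
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p, i = 20, 0.5, 10
mean = n * p                     # mean of the binomial
sd = math.sqrt(n * p * (1 - p))  # standard deviation of the binomial

exact = math.comb(n, i) * p**i * (1 - p)**(n - i)
approx = normal_cdf(i + 0.5, mean, sd) - normal_cdf(i - 0.5, mean, sd)

print(round(exact, 4))   # exact binomial probability
print(round(approx, 4))  # continuity-corrected normal approximation
```

Without the correction, the analogous "interval" for P(X = i) would have length zero and the normal approximation would assign it probability zero, so the half-unit widening is essential for approximating point probabilities.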
↑ or equivalently, a transformation between the supports of
{\displaystyle \mathbf {X} }
and
{\displaystyle \mathbf {Y} }