Preliminary concept: Bernoulli trial
Definition.
(Bernoulli trial)
A Bernoulli trial is an experiment with only two possible outcomes, namely success and failure.
Remark.
'Success' and 'failure' act as labels only, i.e. we can designate either of the two outcomes in the experiment as 'success'.
Definition.
(Independence of Bernoulli trials)
Let $S_i$ be the event $\{i\text{th Bernoulli trial is a success}\},\quad i=1,2,\dotsc$ [1].
If $S_1,S_2,\dotsc$ are independent, then the corresponding Bernoulli trials are independent.
Example.
If we interpret the outcomes of tossing a coin as 'head comes up' and 'tail comes up', then tossing a coin is a Bernoulli trial.
Exercise.
Remark.
We typically interpret the outcomes of tossing a coin as 'head comes up' and 'tail comes up'.
Binomial distribution
Motivation
Consider $n$ independent Bernoulli trials with the same success probability $p$.
We would like to calculate the probability $\mathbb{P}(\{r\text{ successes in }n\text{ trials}\})$.
Let $S_i$ be the event $\{i\text{th Bernoulli trial is a success}\},\quad i=1,2,\dotsc$, as in the previous section.
Let's consider a particular sequence of outcomes such that there are $r$ successes in $n$ trials:
$$\underbrace{S\cdots S}_{r\text{ successes}}\ \overbrace{F\cdots F}^{n-r\text{ failures}}$$
Its probability is
$$\mathbb{P}(S_1\cap\dotsb\cap S_r\cap S_{r+1}^c\cap\dotsb\cap S_n^c)\overset{\text{indpt.}}{=}\mathbb{P}(S_1)\dotsb\mathbb{P}(S_r)\,\mathbb{P}(S_{r+1}^c)\dotsb\mathbb{P}(S_n^c)=p^r(1-p)^{n-r}$$
[2]
Since every other sequence with the $r$ successes occurring in other trial positions has the same probability, and there are $\binom{n}{r}$ distinct possible sequences[3],
$$\mathbb{P}(\{r\text{ successes in }n\text{ trials}\})=\binom{n}{r}p^r(1-p)^{n-r}.$$
This is the pmf of a random variable following the binomial distribution .
Definition
Definition.
(Binomial distribution)
Pmf's of $\operatorname{Binom}(20,0.5)$, $\operatorname{Binom}(20,0.7)$ and $\operatorname{Binom}(40,0.5)$.
A random variable $X$ follows the binomial distribution with $n$ independent Bernoulli trials and success probability $p$, denoted by $X\sim\operatorname{Binom}(n,p)$, if its pmf is
$$f(x;n,p)=\binom{n}{x}p^x(1-p)^{n-x},\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc,n\}.$$
Cdf's of $\operatorname{Binom}(20,0.5)$, $\operatorname{Binom}(20,0.7)$ and $\operatorname{Binom}(40,0.5)$.
Remark.
The "$;n,p$" in the pmf emphasizes that the parameters of the distribution (quantities that describe the distribution) are $n$ and $p$. We can apply similar notations to pdf's. There are some alternative notations for emphasizing the parameter values: for example, when the parameter value is $\theta$, the pdf/pmf can be denoted by $f(\cdot\,|\,\theta),f_{\theta}(\cdot),\dotsc$ Of course, it is not necessary to add these to the pdf/pmf, but doing so makes the parameter values involved explicit and clear. The pmf involves a binomial coefficient, and hence the name 'binomial distribution'.
General remark for each distribution: we may also just write down the notation for the distribution to denote the distribution itself, e.g. $\operatorname{Binom}(n,p)$ stands for the binomial distribution.
For simplicity, we sometimes say pmf, pdf, or support of a distribution to mean the pmf, pdf, or support (respectively) of a random variable following that distribution; the same applies to other properties of a distribution (discussed in a later chapter), e.g. mean and variance.
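As a quick numerical check, the binomial pmf can be computed directly from the formula above. This is a minimal sketch in Python; the function name `binom_pmf` is our own, not a library API:

```python
from math import comb

def binom_pmf(x, n, p):
    # Pmf of Binom(n, p): C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# The pmf sums to 1 over the support {0, 1, ..., n}.
total = sum(binom_pmf(x, 20, 0.5) for x in range(21))
```

For example, $f(10;20,0.5)=\binom{20}{10}/2^{20}\approx 0.176$, the most likely count of successes for $\operatorname{Binom}(20,0.5)$.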
Bernoulli distribution
The Bernoulli distribution is simply a special case of the binomial distribution, as follows:
Definition.
(Bernoulli distribution)
Pmf's of $\operatorname{Ber}(0.8)$, $\operatorname{Ber}(0.2)$ and $\operatorname{Ber}(0.5)$.
A random variable $X$ follows the Bernoulli distribution with success probability $p$, denoted by $X\sim\operatorname{Ber}(p)$, if its pmf is
$$f(x;p)=p^x(1-p)^{1-x},\quad x\in\operatorname{supp}(X)=\{0,1\}.$$
Cdf's of $\operatorname{Ber}(1)$, $\operatorname{Ber}(0.8)$, $\operatorname{Ber}(0.5)$ and $\operatorname{Ber}(0.3)$.
Remark.
$\operatorname{Ber}(p)=\operatorname{Binom}(1,p)$.
One Bernoulli trial is involved, and hence the name 'Bernoulli distribution'.
Poisson distribution
Motivation
The Poisson distribution can be viewed as the 'limit case' for the binomial distribution.
Consider $n$ independent Bernoulli trials with success probability $p=\lambda/n$. By the binomial distribution,
$$\mathbb{P}(r\text{ successes in }n\text{ trials})=\binom{n}{r}(\lambda/n)^r(1-\lambda/n)^{n-r}.$$
After that, consider a unit time interval with (positive) occurrence rate $\lambda$ of a rare event (i.e. the mean number of occurrences of the rare event is $\lambda$). We can divide the unit time interval into $n$ time subintervals of length $1/n$ each.
If $n$ is large and $p$ is relatively small, so that the probability of two or more rare events occurring in a single time subinterval is negligible, then the probability of exactly one rare event occurring in each time subinterval is $p=\lambda/n$, by the definition of mean.
Then, we can view the unit time interval as a sequence of $n$ Bernoulli trials [4] with success probability $p=\lambda/n$.
After that, we can use $\operatorname{Binom}(n,\lambda/n)$ to model the number of occurrences of the rare event. To be more precise,
$$\begin{aligned}\mathbb{P}(\underbrace{r\text{ successes in }n\text{ trials}}_{r\text{ rare events in the unit time}})&=\binom{n}{r}(\lambda/n)^{r}(1-\lambda/n)^{n-r}\\&=\frac{n(n-1)\dotsb(n-r+1)}{r!}(\lambda^{r}/n^{r})(1-\lambda/n)^{n-r}\\&=(\lambda^{r}/r!)\overbrace{(1-\underbrace{1/n}_{\to 0\text{ as }n\to\infty})\dotsb\big(1-\underbrace{(r-1)/n}_{\to 0\text{ as }n\to\infty}\big)}^{\to 1\text{ as }n\to\infty}\underbrace{(1-\lambda/n)^{n-r}}_{\to e^{-\lambda}\text{ as }n\to\infty}\\&\to e^{-\lambda}\lambda^{r}/r!\quad\text{as }n\to\infty.\end{aligned}$$
This is the pmf of a random variable following the Poisson distribution, and this result is known as the Poisson limit theorem (or law of rare events). We will state it formally after introducing the definition of the Poisson distribution.
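The convergence above can be checked numerically: holding $\lambda$ and $r$ fixed, the $\operatorname{Binom}(n,\lambda/n)$ pmf at $r$ approaches $e^{-\lambda}\lambda^r/r!$ as $n$ grows. A sketch (the helper names are ours):

```python
from math import comb, exp, factorial

lam, r = 4.0, 3

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# The Poisson limit e^(-lam) * lam^r / r!
poisson = exp(-lam) * lam**r / factorial(r)

# Binom(n, lam/n) pmf at r, for increasing n; the error shrinks with n.
approx = [binom_pmf(r, n, lam / n) for n in (10, 100, 10_000)]
```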
Definition
Definition.
(Poisson distribution)
Pmf's of $\operatorname{Pois}(1)$, $\operatorname{Pois}(4)$ and $\operatorname{Pois}(10)$.
A random variable $X$ follows the Poisson distribution with positive rate parameter $\lambda$, denoted by $X\sim\operatorname{Pois}(\lambda)$, if its pmf is
$$f(x;\lambda)=e^{-\lambda}\lambda^{x}/x!,\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}.$$
Cdf's of $\operatorname{Pois}(1)$, $\operatorname{Pois}(4)$ and $\operatorname{Pois}(10)$.
Remark.
As a result, the Poisson distribution can be used as an approximation to the binomial distribution for large $n$ and relatively small $p=\lambda/n$.
Geometric distribution
Motivation
Consider a sequence of independent Bernoulli trials with success probability $p$.
We would like to calculate the probability $\mathbb{P}(\{x\text{ failures before first success}\})$.
By considering this sequence of outcomes:
$$\underbrace{F\cdots F}_{x\text{ failures}}S,$$
we can calculate that
$$\mathbb{P}(\{x\text{ failures before first success}\})=(1-p)^{x}p,\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}$$
[5]
This is the pmf of a random variable following the geometric distribution .
Definition
Definition.
(Geometric distribution)
Pmf's of $\operatorname{Geo}(0.2)$, $\operatorname{Geo}(0.5)$ and $\operatorname{Geo}(0.8)$.
A random variable $X$ follows the geometric distribution with success probability $p$, denoted by $X\sim\operatorname{Geo}(p)$, if its pmf is
$$f(x;p)=(1-p)^{x}p,\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}.$$
Cdf's of $\operatorname{Geo}(0.2)$, $\operatorname{Geo}(0.5)$ and $\operatorname{Geo}(0.8)$.
Remark.
The sequence of probabilities starting from $f(0;p)$, with the input value $x$ increased one by one (i.e. $p,(1-p)p,(1-p)^2p,\dotsc$), is a geometric sequence, and hence the name 'geometric distribution'.
In an alternative definition, the pmf is instead $(1-p)^{x-1}p$, which is the probability $\mathbb{P}(\{\text{first success occurs at the }x\text{th trial}\})$, with support $\operatorname{supp}(X)=\{1,2,\dotsc\}$.
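Two checks of the pmf above can be done numerically: it sums to 1 over the support, and the tail probability satisfies $\mathbb{P}(X>n)=(1-p)^{n+1}$ by the geometric series formula. A minimal sketch (helper names are ours):

```python
p = 0.3

def geo_pmf(x, p):
    # Pmf of Geo(p): (1 - p)^x * p, for x = 0, 1, 2, ...
    return (1 - p) ** x * p

# Tail P(X > 5): sum the pmf over x = 6, 7, ... (truncated; the remainder is negligible).
tail = sum(geo_pmf(x, p) for x in range(6, 400))
# By the geometric series formula, P(X > 5) = (1 - p)^6.
closed_form = (1 - p) ** 6
```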
Proposition.
(Memorylessness of geometric distribution)
If $X\sim\operatorname{Geo}(p)$, then
$$\mathbb{P}(X>m+n\mid X\ge m)=\mathbb{P}(X>n)$$
for each nonnegative integer $m$ and $n$.
Proof.
$$\begin{aligned}\mathbb{P}(X>m+n\mid X\ge m)&\overset{\text{def}}{=}\frac{\mathbb{P}(\overbrace{\{X>m+n\}\cap\{X\ge m\}}^{=\{X>m+n\}})}{\mathbb{P}(X\ge m)}\\&\overset{\text{def}}{=}\frac{p\big((1-p)^{m+n+1}+(1-p)^{m+n+2}+\dotsb\big)}{p\big((1-p)^{m}+(1-p)^{m+1}+\dotsb\big)}\\&=\frac{(1-p)^{m+n+1}/\big(1-(1-p)\big)}{(1-p)^{m}/\big(1-(1-p)\big)}&&\text{by geometric series formula}\\&=(1-p)^{n+1}\cdot\frac{p}{p}\\&=p\cdot\frac{(1-p)^{n+1}}{1-(1-p)}\\&=p\big((1-p)^{n+1}+(1-p)^{n+2}+\dotsb\big)&&\text{by geometric series formula}\\&\overset{\text{def}}{=}\mathbb{P}(X>n)&&\text{since }X>n\Leftrightarrow X=n+1,n+2,\dotsc\end{aligned}$$
In particular, $\{X>m+n\}\cap\{X\ge m\}=\{X>m+n\}$ since
$$\underbrace{X>m+n}_{X=m+n+1,\,m+n+2,\,\dotsc}\ \subsetneq\ \underbrace{X\ge m}_{X=m,\,m+1,\,\dotsc}.$$
$\Box$
Remark.
$X>m+n$ can be interpreted as 'there are more than $m+n$ failures before the first success'; $X\ge m$ can be interpreted as '$m$ failures have occurred, so there are at least $m$ failures before the first success'.
This implies that the condition $X\ge m$ does not affect the distribution of the remaining number of failures before the first success (it still follows the geometric distribution with the same success probability).
So, we can assume the trials start afresh after an arbitrary trial at which a failure occurs. E.g., if a failure occurs in the first trial, then the distribution of the remaining number of failures before the first success is not affected.
Also, if a success occurs in the first trial, then the condition becomes $X=0$ instead of $X\ge m$, so the above formula cannot be applied in this situation. Indeed, $\mathbb{P}(X>m+n\mid X=0)=0$, since $X$ cannot exceed zero given that $X=0$.
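Memorylessness can also be verified numerically from the pmf, without the algebra above. A sketch with our own helper names; the infinite sums are truncated where the tail is negligible:

```python
p, m, n = 0.3, 2, 4

def geo_pmf(x, p):
    return (1 - p) ** x * p

def prob(event, p, cutoff=500):
    # P(event) by summing the pmf over a truncated support.
    return sum(geo_pmf(x, p) for x in range(cutoff) if event(x))

# P(X > m + n | X >= m) = P(X > m + n) / P(X >= m), since {X > m+n} ⊆ {X >= m}.
lhs = prob(lambda x: x > m + n, p) / prob(lambda x: x >= m, p)
rhs = prob(lambda x: x > n, p)
```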
Negative binomial distribution
Motivation
Consider a sequence of independent Bernoulli trials with success probability $p$.
We would like to calculate the probability $\mathbb{P}(\{x\text{ failures before }k\text{th success}\})$.
By considering this sequence of outcomes:
$$\overbrace{\underbrace{F\cdots F}_{x_{1}\text{ failures}}S\underbrace{F\cdots F}_{x_{2}\text{ failures}}S\cdots\underbrace{F\cdots F}_{x_{k}\text{ failures}}}^{x+k-1\text{ trials}}\overbrace{S}^{k\text{th success}},\quad x_{1}+x_{2}+\dotsb+x_{k}=x,$$
we can calculate that
$$\mathbb{P}(\{x\text{ failures before }k\text{th success}\})=(1-p)^{x}p^{k},\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}.$$
Since every other sequence with the $x$ failures occurring in other trial positions (and the first $k-1$ successes, excluding the $k$th success which must occur in the last trial, occurring in other positions) has the same probability, and there are $\binom{x+k-1}{x}$ (or equivalently $\binom{x+k-1}{k-1}$) distinct possible sequences [6],
$$\mathbb{P}(\{x\text{ failures before }k\text{th success}\})=\binom{x+k-1}{x}(1-p)^{x}p^{k},\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}.$$
This is the pmf of a random variable following the negative binomial distribution .
Definition
Definition.
(Negative binomial distribution)
Pmf's of $\operatorname{NB}(10,0.9)$, $\operatorname{NB}(10,0.8)$, $\operatorname{NB}(10,0.5)$ and $\operatorname{NB}(10,0.3)$.
A random variable $X$ follows the negative binomial distribution with target number of successes $k$ and success probability $p$, denoted by $X\sim\operatorname{NB}(k,p)$, if its pmf is
$$f(x;k,p)=\binom{x+k-1}{x}(1-p)^{x}p^{k},\quad x\in\operatorname{supp}(X)=\{0,1,2,\dotsc\}.$$
Cdf's of $\operatorname{NB}(10,0.9)$, $\operatorname{NB}(10,0.8)$, $\operatorname{NB}(10,0.5)$ and $\operatorname{NB}(10,0.3)$.
Remark.
A negative binomial coefficient is involved in the pmf, and hence the name 'negative binomial distribution'.
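Two quick checks of the pmf above: it sums to 1 over the support, and with $k=1$ it reduces to the geometric pmf. A sketch (helper names are ours):

```python
from math import comb

def nb_pmf(x, k, p):
    # Pmf of NB(k, p): C(x + k - 1, x) * (1 - p)^x * p^k
    return comb(x + k - 1, x) * (1 - p) ** x * p**k

# NB(1, p) coincides with Geo(p): C(x, x) (1 - p)^x p = (1 - p)^x p.
reduces_to_geo = all(
    abs(nb_pmf(x, 1, 0.3) - (1 - 0.3) ** x * 0.3) < 1e-15 for x in range(50)
)
# The NB(3, 0.5) pmf sums to (essentially) 1 over a truncated support.
total = sum(nb_pmf(x, 3, 0.5) for x in range(200))
```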
Hypergeometric distribution
Motivation
Consider a sample of size $n$ drawn without replacement from a population of size $N$, containing $K$ objects of type 1 and $N-K$ objects of another type.
Then, the probability
$$\mathbb{P}(\{k\text{ type 1 objects are found when }n\text{ objects are drawn from }N\text{ objects}\})=\underbrace{\binom{K}{k}}_{\text{type 1}}\overbrace{\binom{N-K}{n-k}}^{\text{another type}}\bigg/\underbrace{\binom{N}{n}}_{\text{all outcomes}},\quad k\in\big\{\max\{n-N+K,0\},\dotsc,\min\{K,n\}\big\}$$
[7].
$\binom{K}{k}$: unordered selection of $k$ objects of type 1 from the $K$ (distinguishable) objects of type 1, without replacement;
$\binom{N-K}{n-k}$: unordered selection of $n-k$ objects of the other type from the $N-K$ (distinguishable) objects of the other type, without replacement;
$\binom{N}{n}$: unordered selection of $n$ objects from the $N$ (distinguishable) objects, without replacement.
This is the pmf of a random variable following the hypergeometric distribution.
Definition
Edit
Definition. (Hypergeometric distribution)
Pmf's of $\operatorname{HypGeo}(500,50,100)$, $\operatorname{HypGeo}(500,60,200)$ and $\operatorname{HypGeo}(500,70,300)$.
A random variable $X$ follows the hypergeometric distribution with $n$ objects drawn from a collection of $K$ objects of type 1 and $N-K$ objects of another type, denoted by $X\sim\operatorname{HypGeo}(N,K,n)$, if its pmf is
$$f(k;N,K,n)=\binom{K}{k}\binom{N-K}{n-k}\bigg/\binom{N}{n},\quad k\in\operatorname{supp}(X)=\big\{\max\{n-N+K,0\},\dotsc,\min\{K,n\}\big\}.$$
Cdf's of $\operatorname{HypGeo}(500,50,100)$, $\operatorname{HypGeo}(500,60,200)$ and $\operatorname{HypGeo}(500,70,300)$.
Remark.
The pmf resembles the terms of a hypergeometric series [8], and hence the name 'hypergeometric distribution'.
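The pmf above sums to 1 over its support, which is Vandermonde's identity in disguise; a minimal sketch (helper names are ours):

```python
from math import comb

def hypgeo_pmf(k, N, K, n):
    # Pmf of HypGeo(N, K, n): C(K, k) * C(N - K, n - k) / C(N, n)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 50, 10, 12
# Support: {max(n - N + K, 0), ..., min(K, n)}
support = range(max(n - N + K, 0), min(K, n) + 1)
total = sum(hypgeo_pmf(k, N, K, n) for k in support)
```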
Finite discrete distribution
This type of distribution generalizes every discrete distribution with finite support, e.g. the Bernoulli distribution and the hypergeometric distribution.
Another special case of this type of distribution is the discrete uniform distribution, which is similar to the continuous uniform distribution (to be discussed later).
Definition.
(Finite discrete distribution)
A random variable $X$ follows the finite discrete distribution with vector $\mathbf{x}=(x_{1},\dotsc,x_{n})^{T}$ and probability vector $\mathbf{p}=(p_{1},\dotsc,p_{n})^{T}$, where $p_{1},\dotsc,p_{n}\ge 0$ and $p_{1}+\dotsb+p_{n}=1$, denoted by $X\sim\operatorname{FD}(\mathbf{x},\mathbf{p})$, if its pmf is
$$f(x_{i};\mathbf{p})=p_{i},\quad i=1,\dotsc,n.$$
Remark.
We can calculate the mean and variance directly by definition; there are no special formulas for the finite discrete distribution.
Definition.
(Discrete uniform distribution)
The discrete uniform distribution, denoted by $\operatorname{D}\mathcal{U}\{x_{1},\dotsc,x_{n}\}$, is $\operatorname{FD}(\mathbf{x},\mathbf{p})$ with
$$\mathbf{p}=\bigg(\underbrace{\frac{1}{n},\dotsc,\frac{1}{n}}_{n\text{ times}}\bigg)^{T}.$$
Remark.
Its pmf is
$$f(x_{i})=\frac{1}{n},\quad i=1,\dotsc,n.$$
Example.
Suppose a r.v. $X\sim\operatorname{FD}\big((1,2,3)^{T},(0.2,0.3,0.5)^{T}\big)$. Then,
$$\mathbb{P}(X=1)=0.2,\quad\mathbb{P}(X=2)=0.3,\quad\text{and}\quad\mathbb{P}(X=3)=0.5.$$
Illustration of the pmf:
|
| *
| |
| * |
| * | |
| | | |
*----*----*----*-------
1 2 3
Example.
Suppose a r.v. $X\sim\operatorname{D}\mathcal{U}\{1,2,3\}$. Then,
$$\mathbb{P}(X=1)=\mathbb{P}(X=2)=\mathbb{P}(X=3)=\frac{1}{3}.$$
Illustration of the pmf:
|
|
|
| * * *
| | | |
| | | |
*----*----*----*-------
1 2 3
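The finite discrete distribution from the first example can be sampled with Python's standard library; the empirical frequencies approximate the pmf (a sketch):

```python
import random

xs = (1, 2, 3)
ps = (0.2, 0.3, 0.5)  # probability vector: nonnegative, sums to 1

random.seed(0)  # fixed seed for reproducibility
draws = random.choices(xs, weights=ps, k=100_000)
# Empirical frequency of each value; should be close to the pmf.
freq = {x: draws.count(x) / len(draws) for x in xs}
```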
Exercises
Exercise.
Uniform distribution (continuous)
The continuous uniform distribution is a model for 'no preference', i.e. all intervals of the same length within its support are equally likely [9] (this can be seen from the pdf of the continuous uniform distribution).
There is also the discrete uniform distribution, but it is less important than the continuous one.
So, from now on, 'uniform distribution' refers to the continuous one, not the discrete one.
Definition.
(Uniform distribution)
Pdf's of $\mathcal{U}[a,b]$.
A random variable $X$ follows the uniform distribution, denoted by $X\sim\mathcal{U}[a,b]$, if its pdf is
$$f(x)=1/(b-a),\quad x\in\operatorname{supp}(X)=[a,b],\text{ with }a\le b.$$
Remark.
The support of $\mathcal{U}[a,b]$ can alternatively be $[a,b)$, $(a,b]$ or $(a,b)$ without affecting the probabilities of the events involved, since the probability at any single point, calculated using the pdf, is zero anyway.
The distribution $\mathcal{U}[0,1]$ is the standard uniform distribution.
Proposition.
(Cdf of uniform distribution)
Cdf's of $\mathcal{U}[a,b]$.
The cdf of $\mathcal{U}[a,b]$ is
$$F(x)=\begin{cases}0,&x<a;\\(x-a)/(b-a),&a\le x\le b;\\1,&x>b.\end{cases}$$
Proof.
$$F(x)=\int_{-\infty}^{x}\frac{\mathbf{1}\{a\le y\le b\}}{b-a}\,dy=\begin{cases}0,&x<a;\\ [y]_{a}^{x}/(b-a),&a\le x\le b;\\ [y]_{a}^{b}/(b-a),&x>b.\end{cases}$$
Then, the result follows.
$\Box$
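The piecewise cdf derived above translates directly into code; a minimal sketch (the function name is ours):

```python
def unif_cdf(x, a, b):
    # Cdf of U[a, b]: 0 below a, linear on [a, b], 1 above b.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# For U[2, 6]: F is 0 up to 2, rises linearly, and is 1 from 6 onward.
vals = [unif_cdf(x, 2.0, 6.0) for x in (1.0, 2.0, 4.0, 6.0, 7.0)]
```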
Exponential distribution
The exponential distribution with rate parameter $\lambda$ is often used to describe the interarrival time of rare events with rate $\lambda$.
Compared with the Poisson distribution: the exponential distribution describes the interarrival time of rare events, while the Poisson distribution describes the number of occurrences of rare events within a fixed time interval.
By the definition of rate, when the rate increases, the interarrival time decreases (i.e. the frequency of the rare event increases).
So, we would like the pdf to be more skewed to the left when $\lambda$ increases (i.e. the pdf takes higher values at small $x$), so that the area under the pdf over intervals involving small values of $x$ increases with $\lambda$.
Also, with a fixed rate $\lambda$, larger interarrival times should be less likely. So, intuitively, we would also like the pdf to be a strictly decreasing function, so that the probability involved (the area under the pdf over an interval) decreases as $x$ increases.
As we can see, the pdf of the exponential distribution, $f(x)=\lambda e^{-\lambda x}\mathbf{1}\{x\ge 0\}$, satisfies both of these properties.

Proposition.
(Cdf of exponential distribution)
The cdf of $\operatorname{Exp}(\lambda)$ is $F(x)=(1-e^{-\lambda x})\mathbf{1}\{x\ge 0\}$.
Proof.
Suppose $X\sim\operatorname{Exp}(\lambda)$. The cdf of $X$ is
$$\begin{aligned}F(x)&=\int_{-\infty}^{x}\lambda e^{-\lambda y}\,\mathbf{1}\{y\ge 0\}\,dy\\&=\begin{cases}\int_{0}^{x}\lambda e^{-\lambda y}\,dy,&x\ge 0;\\0,&x<0\end{cases}&&\big(\text{when }x<0,\ x\notin\operatorname{supp}(X),\text{ so }F(x)=\mathbb{P}(X\le x)=0\big)\\&=\mathbf{1}\{x\ge 0\}\,\lambda\int_{0}^{x}e^{-\lambda y}\,dy\\&=\mathbf{1}\{x\ge 0\}\,\frac{\lambda}{-\lambda}\big[e^{-\lambda y}\big]_{0}^{x}\\&=-\mathbf{1}\{x\ge 0\}\,(e^{-\lambda x}-1)\\&=(1-e^{-\lambda x})\,\mathbf{1}\{x\ge 0\}.\end{aligned}$$
$\Box$
Proposition.
(Memorylessness of exponential distribution)
If $X\sim\operatorname{Exp}(\lambda)$, then
$$\mathbb{P}(X>s+t\mid X>s)=\mathbb{P}(X>t)$$
for each nonnegative number $s$ and $t$.
Proof.
{\displaystyle \mathbb {P} (X>s+t\,|\,X>s){\overset {\text{def}}{=}}{\frac {\mathbb {P} (\{X>s+t\}\cap \{X>s\})}{\mathbb {P} (X>s)}}={\frac {\mathbb {P} (X>s+t)}{\mathbb {P} (X>s)}}={\frac {1-(1-e^{-\lambda (s+t)})}{1-(1-e^{-\lambda s})}}={\frac {e^{-\lambda (s+t)}}{e^{-\lambda s}}}=e^{-\lambda t}=\mathbb {P} (X>t).}
{\displaystyle \Box }
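The memorylessness property can be checked numerically as well. This minimal Python sketch (our own helper name `survival`, not from the text) compares the conditional survival probability with the unconditional one.

```python
import math

def survival(t, lam):
    """P(X > t) for X ~ Exp(lam), from the cdf derived above."""
    return math.exp(-lam * t) if t >= 0 else 1.0

lam, s, t = 1.5, 0.7, 2.0
# Conditional probability P(X > s+t | X > s) = P(X > s+t) / P(X > s).
cond = survival(s + t, lam) / survival(s, lam)
# Memorylessness: the conditional probability equals P(X > t).
assert abs(cond - survival(t, lam)) < 1e-12
```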
Gamma distribution
The gamma distribution is a generalized exponential distribution, in the sense that we can also change the shape of the pdf of the exponential distribution.
Definition.
(Gamma distribution)
Pdf's of {\displaystyle \operatorname {Gamma} (1,1)}, {\displaystyle \operatorname {Gamma} (2,1)}, {\displaystyle \operatorname {Gamma} (3,1)} and {\displaystyle \operatorname {Gamma} (3,0.5)}.
A random variable {\displaystyle X} follows the gamma distribution with positive shape parameter {\displaystyle \alpha } and positive rate parameter {\displaystyle \lambda }, denoted by {\displaystyle X\sim \operatorname {Gamma} (\alpha ,\lambda )}, if its pdf is
{\displaystyle f(x)={\frac {\lambda ^{\alpha }x^{\alpha -1}e^{-\lambda x}}{\Gamma (\alpha )}},\quad x\in \operatorname {supp} (X)=[0,\infty ).}
Cdf's of {\displaystyle \operatorname {Gamma} (1,1)}, {\displaystyle \operatorname {Gamma} (2,1)}, {\displaystyle \operatorname {Gamma} (3,1)} and {\displaystyle \operatorname {Gamma} (3,0.5)}.
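To see the sense in which the gamma distribution generalizes the exponential distribution, a small Python sketch (the helper name `gamma_pdf` is ours) implements the pdf above and checks that {\displaystyle \operatorname {Gamma} (1,\lambda )} reduces to {\displaystyle \operatorname {Exp} (\lambda )}.

```python
import math

def gamma_pdf(x, alpha, lam):
    """Pdf of Gamma(alpha, lam) in the rate parametrization, supported on [0, inf)."""
    if x < 0:
        return 0.0
    return lam**alpha * x**(alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

# Gamma(1, lam) has pdf lam * e^(-lam*x), the Exp(lam) pdf.
lam = 0.8
for x in [0.1, 1.0, 4.0]:
    assert abs(gamma_pdf(x, 1, lam) - lam * math.exp(-lam * x)) < 1e-12
```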
Beta distribution
The beta distribution is a generalized {\displaystyle {\mathcal {U}}[0,1]} distribution, in the sense that we can also change the shape of the pdf, using two shape parameters.
Definition.
(Beta distribution)
Pdf's of {\displaystyle \operatorname {Beta} (0.5,0.5)}, {\displaystyle \operatorname {Beta} (5,1)}, {\displaystyle \operatorname {Beta} (1,3)}, {\displaystyle \operatorname {Beta} (2,2)} and {\displaystyle \operatorname {Beta} (2,5)}.
A random variable {\displaystyle X} follows the beta distribution with positive shape parameters {\displaystyle \alpha } and {\displaystyle \beta }, denoted by {\displaystyle X\sim \operatorname {Beta} (\alpha ,\beta )}, if its pdf is
{\displaystyle f(x)={\frac {\Gamma (\alpha +\beta )}{\Gamma (\alpha )\Gamma (\beta )}}x^{\alpha -1}(1-x)^{\beta -1},\quad x\in \operatorname {supp} (X)=[0,1].}
Cdf's of {\displaystyle \operatorname {Beta} (0.5,0.5)}, {\displaystyle \operatorname {Beta} (5,1)}, {\displaystyle \operatorname {Beta} (1,3)}, {\displaystyle \operatorname {Beta} (2,2)} and {\displaystyle \operatorname {Beta} (2,5)}.
Remark.
{\displaystyle \operatorname {Beta} (1,1)\equiv {\mathcal {U}}[0,1]}, since the pdf of {\displaystyle \operatorname {Beta} (1,1)} is
{\displaystyle f(x)={\frac {\overbrace {\Gamma (2)} ^{=1!=1}}{\underbrace {\Gamma (1)} _{=0!=1}\Gamma (1)}}x^{1-1}(1-x)^{1-1}\mathbf {1} \{0\leq x\leq 1\}=\mathbf {1} \{0\leq x\leq 1\},}
which is the pdf of {\displaystyle {\mathcal {U}}[0,1]}.
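The remark above can be checked directly: a Python sketch (our helper name `beta_pdf`) implements the beta pdf and confirms that {\displaystyle \operatorname {Beta} (1,1)} has pdf identically 1 on {\displaystyle [0,1]}.

```python
import math

def beta_pdf(x, a, b):
    """Pdf of Beta(a, b), supported on [0, 1]."""
    if not 0 <= x <= 1:
        return 0.0
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x**(a - 1) * (1 - x)**(b - 1)

# Beta(1, 1) is the uniform distribution on [0, 1]: pdf identically 1.
for x in [0.0, 0.25, 0.5, 0.99]:
    assert abs(beta_pdf(x, 1, 1) - 1.0) < 1e-12
```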
Cauchy distribution
The Cauchy distribution is a heavy-tailed distribution [10] .
As a result, it is a 'pathological' distribution, in the sense that it has some counterintuitive properties, e.g. an undefined mean and variance, even though its graph suggests that the mean and variance should be well defined.
Remark.
This definition refers to a special case of the Cauchy distribution. To be more precise, the complete definition of the Cauchy distribution also has a scale parameter, which is set to one in the pdf here; this special case is used for simplicity. The pdf is symmetric about {\displaystyle \theta }, since {\displaystyle f(\theta +x)=f(\theta -x)}.
Normal distribution (very important)
The normal or Gaussian distribution is a thing of beauty, appearing in many places in nature. This is probably because sample means and sample sums often approximately follow normal distributions, by the central limit theorem.
As a result, the normal distribution is important in statistics.
Definition.
(Normal distribution)
Pdf's of {\displaystyle {\mathcal {N}}(0,0.2)}, {\displaystyle {\mathcal {N}}(0,1)}, {\displaystyle {\mathcal {N}}(0,5)} and {\displaystyle {\mathcal {N}}(-2,0.5)}.
A random variable {\displaystyle X} follows the normal distribution with mean {\displaystyle \mu } and variance {\displaystyle \sigma ^{2}}, denoted by {\displaystyle X\sim {\mathcal {N}}(\mu ,\sigma ^{2})}, if its pdf is
{\displaystyle f(x)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right),\quad x\in \operatorname {supp} (X)=\mathbb {R} .}
Cdf's of {\displaystyle {\mathcal {N}}(0,0.2)}, {\displaystyle {\mathcal {N}}(0,1)}, {\displaystyle {\mathcal {N}}(0,5)} and {\displaystyle {\mathcal {N}}(-2,0.5)}.
Remark.
The distribution {\displaystyle {\mathcal {N}}(0,1)} is the standard normal distribution. For {\displaystyle {\mathcal {N}}(0,1)}, its pdf is often denoted by {\displaystyle \varphi (\cdot )}, and its cdf is often denoted by {\displaystyle \Phi (\cdot )}. The pdf of {\displaystyle {\mathcal {N}}(0,1)} is
{\displaystyle \varphi (x)={\frac {1}{\sqrt {2\pi }}}e^{-x^{2}/2}.}
It follows that the pdf of {\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})} is {\displaystyle (1/\sigma )\varphi {\big (}(x-\mu )/\sigma {\big )}}. It will be proved that {\displaystyle \mu } is actually the mean, and {\displaystyle \sigma ^{2}} is actually the variance. The pdf is symmetric about {\displaystyle \mu }, since {\displaystyle f(\mu +x)=f(\mu -x)}.
Remark.
A special case is when {\displaystyle a=1/\sigma } and {\displaystyle b=-\mu /\sigma }: then {\displaystyle Y=aX+b=(X-\mu )/\sigma \sim {\mathcal {N}}(0,1)}, since {\displaystyle a\mu +b=(1/\sigma )\mu -\mu /\sigma =0} and {\displaystyle a^{2}\sigma ^{2}=\sigma ^{2}/\sigma ^{2}=1}. This shows that we can transform each normally distributed r.v. into a r.v. following the standard normal distribution. This eases the calculation of probabilities relating to normally distributed r.v.'s, since we have the standard normal table, in which values of {\displaystyle \Phi (x)} at different {\displaystyle x} are given.
For some types of standard normal table, only the values of {\displaystyle \Phi (x)} at different nonnegative {\displaystyle x} are given. Then, we can calculate its values at different negative {\displaystyle x} using
{\displaystyle \Phi (-x)=1-\Phi (x).}
This formula holds since
{\displaystyle {\begin{aligned}&&\varphi (-y)&=\varphi (y)\\&\Leftrightarrow &\int _{-\infty }^{x}\varphi (-y)\,dy&=\int _{-\infty }^{x}\varphi (y)\,dy\\&\Leftrightarrow &-\int _{\infty }^{-x}\varphi (u)\,du&=\Phi (x)&{\text{let }}u=-y\Rightarrow du=-dy\\&\Leftrightarrow &[\Phi (u)]_{-x}^{\infty }&=\Phi (x)\\&\Leftrightarrow &\underbrace {\Phi (\infty )} _{=\mathbb {P} (\Omega )=1}-\Phi (-x)&=\Phi (x).\end{aligned}}}
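The reflection identity can be verified numerically. This Python sketch (the helper name `Phi` is ours) uses the standard library's error function, via the known relation {\displaystyle \Phi (x)={\tfrac {1}{2}}{\big (}1+\operatorname {erf} (x/{\sqrt {2}}){\big )}}.

```python
import math

def Phi(x):
    """Standard normal cdf, via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# The reflection identity lets a table of nonnegative x cover negative x too.
for x in [0.0, 0.5, 1.96, 3.0]:
    assert abs(Phi(-x) - (1 - Phi(x))) < 1e-12
```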
Distributions especially important in statistics
The following distributions are especially important in statistics, and they are all related to the normal distribution.
We will introduce them briefly.
Chi-squared distribution
The chi-squared distribution is a special case of the gamma distribution, and is also related to the standard normal distribution.
Definition.
(Chi-squared distribution)
Pdf's of {\displaystyle \chi _{1}^{2},\chi _{2}^{2},\chi _{3}^{2},\chi _{4}^{2},\chi _{6}^{2}} and {\displaystyle \chi _{9}^{2}}.
The chi-squared distribution with positive {\displaystyle \nu } degrees of freedom, denoted by {\displaystyle \chi _{\nu }^{2}}, is the distribution of {\displaystyle Z_{1}^{2}+\dotsb +Z_{\nu }^{2}}, in which {\displaystyle Z_{1},\dotsc ,Z_{\nu }} are i.i.d., and they all follow {\displaystyle {\mathcal {N}}(0,1)}.
Cdf's of {\displaystyle \chi _{1}^{2},\chi _{2}^{2},\chi _{3}^{2},\chi _{4}^{2},\chi _{6}^{2}} and {\displaystyle \chi _{9}^{2}}.
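Since the definition is constructive (a sum of squared standard normal draws), it is easy to simulate. This Monte Carlo sketch in Python (the helper name `chi2_sample` is ours; the seed is fixed for reproducibility) checks that the sample mean is close to {\displaystyle \nu }, the known mean of {\displaystyle \chi _{\nu }^{2}}.

```python
import random

random.seed(0)

def chi2_sample(nu):
    """One draw from chi-squared with nu degrees of freedom,
    built as a sum of nu squared N(0,1) draws, per the definition."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))

nu, n = 4, 50_000
mean = sum(chi2_sample(nu) for _ in range(n)) / n
# E[Z_1^2 + ... + Z_nu^2] = nu * E[Z^2] = nu, so the sample mean should be near nu.
assert abs(mean - nu) < 0.1
```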
Student's t -distribution
The Student's {\displaystyle t}-distribution is related to the chi-squared distribution and the normal distribution.
Definition.
(Student's {\displaystyle t}-distribution)
Pdf's of {\displaystyle t_{1},t_{2},t_{5}} and {\displaystyle t_{\infty }}.
The Student's {\displaystyle t}-distribution with {\displaystyle \nu } degrees of freedom, denoted by {\displaystyle t_{\nu }}, is the distribution of
{\displaystyle {\frac {Z}{\sqrt {Y/\nu }}}}
in which {\displaystyle Y\sim \chi _{\nu }^{2}} and {\displaystyle Z\sim {\mathcal {N}}(0,1)} are independent.
Cdf's of {\displaystyle t_{1},t_{2},t_{5}} and {\displaystyle t_{\infty }}.
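This definition, too, can be simulated directly. The Python sketch below (the helper name `t_sample` is ours; the seed is fixed for reproducibility) draws from {\displaystyle t_{\nu }} as {\displaystyle Z/{\sqrt {Y/\nu }}} with independent draws, and checks the symmetry of the distribution about 0: about half the draws should be positive.

```python
import random

random.seed(1)

def t_sample(nu):
    """One draw from t_nu: Z / sqrt(Y/nu), with Z ~ N(0,1) and
    Y ~ chi-squared(nu) generated from independent draws."""
    z = random.gauss(0.0, 1.0)
    y = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / (y / nu) ** 0.5

n = 50_000
frac_positive = sum(1 for _ in range(n) if t_sample(5) > 0) / n
# t_nu is symmetric about 0, so roughly half of the draws are positive.
assert abs(frac_positive - 0.5) < 0.02
```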
F -distribution
The {\displaystyle F}-distribution is sort of a generalized Student's {\displaystyle t}-distribution, in the sense that it has one more changeable parameter, for a second degrees of freedom.
Definition.
({\displaystyle F}-distribution)
The {\displaystyle F}-distribution with {\displaystyle \nu _{1}} and {\displaystyle \nu _{2}} degrees of freedom, denoted by {\displaystyle F_{\nu _{1},\nu _{2}}}, is the distribution of
{\displaystyle {\frac {X_{1}/\nu _{1}}{X_{2}/\nu _{2}}}}
in which {\displaystyle X_{1}\sim \chi _{\nu _{1}}^{2}} and {\displaystyle X_{2}\sim \chi _{\nu _{2}}^{2}} are independent.
Pdf's and cdf's of {\displaystyle F_{1,1},F_{2,1},F_{5,2},F_{10,1}} and {\displaystyle F_{100,100}}.
If you are interested in knowing how the chi-squared distribution, Student's {\displaystyle t}-distribution, and {\displaystyle F}-distribution are useful in statistics, then you may briefly look at, for instance, Statistics/Interval Estimation (applications in confidence interval construction) and Statistics/Hypothesis Testing (applications in hypothesis testing).
Multinomial distribution
Motivation
The multinomial distribution is a generalized binomial distribution, in the sense that each trial can have more than two outcomes.
Suppose {\displaystyle n} objects are to be allocated to {\displaystyle k} cells independently, where each object is allocated to one and only one cell, with probability {\displaystyle p_{i}} of being allocated to the {\displaystyle i}th cell ({\displaystyle i=1,2,\dotsc ,k}) [12].
Let {\displaystyle X_{i}} be the number of objects allocated to cell {\displaystyle i}. We would like to calculate the probability {\displaystyle \mathbb {P} {\big (}\mathbf {X} {\overset {\text{def}}{=}}(X_{1},\dotsc ,X_{k})^{T}=\mathbf {x} {\overset {\text{def}}{=}}(x_{1},\dotsc ,x_{k})^{T}{\big )}}, i.e. the probability that the {\displaystyle i}th cell has {\displaystyle x_{i}} objects for each {\displaystyle i}.
We can regard each allocation as an independent trial with {\displaystyle k} outcomes (since each object is allocated to one and only one of the {\displaystyle k} cells). We can recognize that the allocation of the {\displaystyle n} objects is a partition of the {\displaystyle n} objects into {\displaystyle k} groups. There are hence {\displaystyle {\binom {n}{x_{1},\dotsc ,x_{k}}}} ways of allocation.
So,
{\displaystyle \mathbb {P} (\mathbf {X} =\mathbf {x} )={\binom {n}{x_{1},\dotsc ,x_{k}}}p_{1}^{x_{1}}\dotsb p_{k}^{x_{k}}.}
In particular, the probability of allocating {\displaystyle x_{i}} objects to the {\displaystyle i}th cell is {\displaystyle p_{i}^{x_{i}}} by independence, and so the probability of a particular allocation of the {\displaystyle n} objects to the {\displaystyle k} cells is {\displaystyle p_{1}^{x_{1}}\dotsb p_{k}^{x_{k}}}, again by independence.
Definition
Definition.
(Multinomial distribution)
A random vector {\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{k})^{T}} follows the multinomial distribution with {\displaystyle n} trials and probability vector {\displaystyle \mathbf {p} =(p_{1},\dotsc ,p_{k})^{T}}, denoted by {\displaystyle \mathbf {X} \sim \operatorname {Multinom} (n,\mathbf {p} )}, if its joint pmf is
{\displaystyle f_{\mathbf {X} }(x_{1},\dotsc ,x_{k};n,\mathbf {p} )={\binom {n}{x_{1},\dotsc ,x_{k}}}p_{1}^{x_{1}}\dotsb p_{k}^{x_{k}},\quad x_{1},\dotsc ,x_{k}\geq 0,{\text{ and }}x_{1}+\dotsb +x_{k}=n.}
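This joint pmf is straightforward to implement. The Python sketch below (the helper name `multinom_pmf` is ours) computes the multinomial coefficient with factorials and checks that the pmf sums to 1 over all valid count vectors for a small example, and that with two cells it matches the binomial pmf.

```python
import math
from itertools import product

def multinom_pmf(x, n, p):
    """Joint pmf of Multinom(n, p) at the count vector x."""
    if any(xi < 0 for xi in x) or sum(x) != n:
        return 0.0
    coef = math.factorial(n)
    for xi in x:
        coef //= math.factorial(xi)   # multinomial coefficient n! / (x1! ... xk!)
    prob = float(coef)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob

n, p = 5, (0.2, 0.3, 0.5)
# The pmf sums to 1 over all (x1, x2, x3) with x1 + x2 + x3 = n.
total = sum(multinom_pmf(x, n, p)
            for x in product(range(n + 1), repeat=3) if sum(x) == n)
assert abs(total - 1.0) < 1e-12

# With p = (p, 1-p), the pmf reduces to the binomial pmf.
assert abs(multinom_pmf((2, 3), 5, (0.4, 0.6))
           - math.comb(5, 2) * 0.4**2 * 0.6**3) < 1e-12
```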
Remark.
{\displaystyle \operatorname {Multinom} (n,\mathbf {p} )\equiv \operatorname {Binom} (n,p)} if {\displaystyle \mathbf {p} =(p,1-p)^{T}}. In this case, if {\displaystyle (X_{1},X_{2})^{T}\sim \operatorname {Multinom} (n,\mathbf {p} )}, then {\displaystyle X_{1}} is the number of successes for the binomial distribution (and {\displaystyle X_{2}\;(=n-X_{1})} is the number of failures). Also, in general, {\displaystyle X_{i}\sim \operatorname {Binom} (n,p_{i})}. This can be seen by regarding the allocation of an object into the {\displaystyle i}th cell as a 'success' for each allocation of a single object [13]. Then, the success probability is {\displaystyle p_{i}}.
Multivariate normal distribution
The multivariate normal distribution is, as suggested by its name, a multivariate (and also generalized) version of the (univariate) normal distribution.
Definition.
(Multivariate normal distribution)
A random vector {\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{k})^{T}} follows the {\displaystyle k}-dimensional normal distribution with mean vector {\displaystyle {\boldsymbol {\mu }}} and covariance matrix {\displaystyle {\boldsymbol {\Sigma }}}, denoted by {\displaystyle \mathbf {X} \sim {\mathcal {N}}_{k}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})} [14], if its joint pdf is
{\displaystyle f_{\mathbf {X} }(x_{1},\dotsc ,x_{k};{\boldsymbol {\mu }},{\boldsymbol {\Sigma }})={\frac {\exp \left(-(\mathbf {x} -{\boldsymbol {\mu }})^{T}{\boldsymbol {\Sigma }}^{-1}(\mathbf {x} -{\boldsymbol {\mu }})/2\right)}{\sqrt {(2\pi )^{k}\det {\boldsymbol {\Sigma }}}}},\quad \mathbf {x} =(x_{1},\dotsc ,x_{k})^{T}\in \mathbb {R} ^{k}}
in which {\displaystyle {\boldsymbol {\mu }}=(\mu _{1},\dotsc ,\mu _{k})^{T}=(\mathbb {E} [X_{1}],\dotsc ,\mathbb {E} [X_{k}])^{T}} is the mean vector, and
{\displaystyle {\boldsymbol {\Sigma }}={\begin{pmatrix}\operatorname {Cov} (X_{1},X_{1})&\cdots &\operatorname {Cov} (X_{1},X_{k})\\\vdots &\ddots &\vdots \\\operatorname {Cov} (X_{k},X_{1})&\cdots &\operatorname {Cov} (X_{k},X_{k})\end{pmatrix}}={\begin{pmatrix}\sigma _{1}^{2}&\cdots &\operatorname {Cov} (X_{1},X_{k})\\\vdots &\ddots &\vdots \\\operatorname {Cov} (X_{k},X_{1})&\cdots &\sigma _{k}^{2}\end{pmatrix}}}
is the covariance matrix (with size {\displaystyle k\times k}).
Remark.
The distribution for the case {\displaystyle k=2} is the most commonly used, and is called the bivariate normal distribution. An alternative and equivalent definition is that {\displaystyle \mathbf {X} =(X_{1},\dotsc ,X_{k})^{T}\sim {\mathcal {N}}_{k}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})} if
{\displaystyle {\begin{aligned}X_{1}&=a_{11}Z_{1}+\dotsb +a_{1n}Z_{n}+\mu _{1};\\\vdots \\X_{k}&=a_{k1}Z_{1}+\dotsb +a_{kn}Z_{n}+\mu _{k},\end{aligned}}}
for some constants {\displaystyle a_{11},\dotsc ,a_{1n},\dotsc ,a_{k1},\dotsc ,a_{kn},\mu _{1},\dotsc ,\mu _{k}}, in which {\displaystyle Z_{1},\dotsc ,Z_{n}} are {\displaystyle n} i.i.d. standard normal random variables. Using this result, the marginal distribution of {\displaystyle X_{i}} is {\displaystyle {\mathcal {N}}(\mu _{i},\sigma _{i}^{2})} for {\displaystyle i=1,2,\dotsc ,k}, as one would expect. By the propositions about the sum of independent normal random variables and the distribution of a linear transformation of a normal random variable (see the Probability/Transformation of Random Variables chapter), the mean is {\displaystyle 0+\dotsb +0+\mu _{i}=\mu _{i}}, and the variance is {\displaystyle a_{i1}^{2}+\dotsb +a_{in}^{2}} (this equals {\displaystyle \sigma _{i}^{2}} by definition).
Proposition.
(Joint pdf of the bivariate normal distribution)
The joint pdf of {\displaystyle {\mathcal {N}}_{2}({\boldsymbol {\mu }},{\boldsymbol {\Sigma }})} is
{\displaystyle f(x,y)={\frac {1}{2\pi \sigma _{X}\sigma _{Y}{\sqrt {1-\rho ^{2}}}}}\exp \left(-{\frac {1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right),\quad (x,y)^{T}\in \mathbb {R} ^{2}}
in which {\displaystyle \rho =\rho (X,Y)} and {\displaystyle \sigma _{X},\sigma _{Y}} are positive.
Graph of an example of bivariate normal distribution
Proof.
For the bivariate normal distribution, the mean vector is {\displaystyle {\boldsymbol {\mu }}=(\mu _{X},\mu _{Y})^{T}}, and the covariance matrix is
{\displaystyle {\boldsymbol {\Sigma }}={\begin{pmatrix}\operatorname {Cov} (X,X)&\operatorname {Cov} (X,Y)\\\operatorname {Cov} (Y,X)&\operatorname {Cov} (Y,Y)\end{pmatrix}}={\begin{pmatrix}\operatorname {Var} (X)&\operatorname {Cov} (X,Y)\\\operatorname {Cov} (X,Y)&\operatorname {Var} (Y)\end{pmatrix}}={\begin{pmatrix}\sigma _{X}^{2}&\rho \sigma _{X}\sigma _{Y}\\\rho \sigma _{X}\sigma _{Y}&\sigma _{Y}^{2}\end{pmatrix}}.}
Hence, using the formula for the inverse of a {\displaystyle 2\times 2} matrix,
{\displaystyle {\begin{aligned}(\mathbf {x} -{\boldsymbol {\mu }})^{T}{\boldsymbol {\Sigma }}^{-1}(\mathbf {x} -{\boldsymbol {\mu }})&={\frac {1}{\det {\boldsymbol {\Sigma }}}}{\begin{pmatrix}x-\mu _{X}&y-\mu _{Y}\end{pmatrix}}{\begin{pmatrix}\sigma _{Y}^{2}&-\rho \sigma _{X}\sigma _{Y}\\-\rho \sigma _{X}\sigma _{Y}&\sigma _{X}^{2}\end{pmatrix}}{\begin{pmatrix}x-\mu _{X}\\y-\mu _{Y}\end{pmatrix}}\\&={\frac {1}{\underbrace {\det {\boldsymbol {\Sigma }}} _{=\sigma _{X}^{2}\sigma _{Y}^{2}-(\rho \sigma _{X}\sigma _{Y})^{2}}}}{\big (}(x-\mu _{X})^{2}\sigma _{Y}^{2}-2\rho (x-\mu _{X})(y-\mu _{Y})\sigma _{X}\sigma _{Y}+(y-\mu _{Y})^{2}\sigma _{X}^{2}{\big )}\\&={\frac {(x-\mu _{X})^{2}\sigma _{Y}^{2}-2\rho (x-\mu _{X})(y-\mu _{Y})\sigma _{X}\sigma _{Y}+(y-\mu _{Y})^{2}\sigma _{X}^{2}}{\sigma _{X}^{2}\sigma _{Y}^{2}(1-\rho ^{2})}}\\&={\frac {1}{1-\rho ^{2}}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right).\end{aligned}}}
It follows that the joint pdf is
{\displaystyle {\begin{aligned}f(x,y)&={\frac {1}{\sqrt {(2\pi )^{2}\det {\boldsymbol {\Sigma }}}}}\exp \left(-{\frac {1}{2}}\cdot {\frac {1}{1-\rho ^{2}}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\\&={\frac {1}{2\pi {\sqrt {\sigma _{X}^{2}\sigma _{Y}^{2}(1-\rho ^{2})}}}}\exp \left({\frac {-1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right)\\&={\frac {1}{2\pi \sigma _{X}\sigma _{Y}{\sqrt {1-\rho ^{2}}}}}\exp \left({\frac {-1}{2(1-\rho ^{2})}}\left(\left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)^{2}-2\rho \left({\frac {x-\mu _{X}}{\sigma _{X}}}\right)\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)+\left({\frac {y-\mu _{Y}}{\sigma _{Y}}}\right)^{2}\right)\right).\end{aligned}}}
◻
{\displaystyle \Box }
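The bivariate pdf just derived has a well-known consequence worth checking: when {\displaystyle \rho =0}, the cross term vanishes and the joint pdf factors into the product of the two marginal normal pdfs. The Python sketch below (the helper names `norm_pdf` and `bivariate_pdf` are ours) verifies this factorization numerically.

```python
import math

def norm_pdf(x, mu, sigma):
    """Univariate normal pdf with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def bivariate_pdf(x, y, mu_x, mu_y, s_x, s_y, rho):
    """Joint pdf of the bivariate normal distribution, as derived above."""
    zx = (x - mu_x) / s_x
    zy = (y - mu_y) / s_y
    q = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    return math.exp(-q / 2) / (2 * math.pi * s_x * s_y * math.sqrt(1 - rho**2))

# With rho = 0, the joint pdf factors into the product of the marginal pdfs,
# i.e. X and Y are independent.
for (x, y) in [(0.0, 0.0), (1.0, -2.0), (0.3, 0.7)]:
    joint = bivariate_pdf(x, y, 0.0, 1.0, 1.0, 2.0, 0.0)
    assert abs(joint - norm_pdf(x, 0.0, 1.0) * norm_pdf(y, 1.0, 2.0)) < 1e-12
```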