Statistics/Distributions

How did students perform on the latest SAT? What is the average height of females under 21 in Zambia? How does beer consumption among students at engineering colleges compare with that of students at liberal arts colleges?

To answer these questions, we would collect data and put them in a form that is easy to summarize, visualize, and discuss. Loosely speaking, the collection and aggregation of data result in a distribution. Distributions are most often in the form of a histogram or a table. That way, we can "see" the data immediately and begin our scientific inquiry.
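
As a minimal sketch of turning raw data into a table and a rough text histogram, the example below groups scores into bins of equal width. The scores are invented for illustration, and the helper name `frequency_table` and the bin width of 100 are assumptions, not part of any standard method.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical test scores, invented for illustration only.
scores = [480, 520, 530, 610, 590, 450, 700, 510, 560, 640]

table = frequency_table(scores, bin_width=100)

# Print one row per bin: the range, a bar of '#' marks, and the count.
for start, count in table.items():
    print(f"{start}-{start + 99}: {'#' * count} ({count})")
```

Each row of the printout is one bar of the histogram, so the shape of the distribution can be read directly from the lengths of the bars.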

For example, if we want to know more about students' latest performance on the SAT, we would collect SAT scores from ETS, compile them in a way that is pertinent to us, and then form a distribution of these scores. The result may be a data table or it may be a plot. Regardless, once we "see" the data, we can begin asking more interesting research questions about our data.

The distributions we create often parallel distributions that are mathematically generated. For example, if we obtain the heights of all high school students and plot this data, the graph may resemble a normal distribution, which is generated mathematically. Then, instead of painstakingly collecting heights of all high school students, we could simply use a normal distribution to approximate the heights without sacrificing too much accuracy.
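
The following sketch illustrates the idea of approximating collected data with a normal distribution. The heights are invented for illustration, and the helper names `empirical_fraction` and `model_fraction` are assumptions. It uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later) to fit a normal distribution to the sample and then compares the fraction of observed heights in a range with the fraction the fitted distribution predicts.

```python
from statistics import NormalDist

# Hypothetical heights in cm, invented for illustration only.
heights = [158, 162, 165, 167, 168, 170, 170, 171, 173, 174, 176, 178, 181, 185]

# Fit a normal distribution using the sample mean and standard deviation.
model = NormalDist.from_samples(heights)

def empirical_fraction(data, low, high):
    """Fraction of the observed values that lie in [low, high]."""
    return sum(low <= x <= high for x in data) / len(data)

def model_fraction(dist, low, high):
    """Fraction that the fitted distribution places in [low, high]."""
    return dist.cdf(high) - dist.cdf(low)

low, high = 165, 175
print(f"observed: {empirical_fraction(heights, low, high):.3f}")
print(f"normal approximation: {model_fraction(model, low, high):.3f}")
```

With a sample this small the two fractions will not agree closely; the approximation generally improves as more data is collected, provided the data really are close to normally distributed.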

In the study of statistics, we focus on mathematical distributions for the sake of simplicity and relevance to the real world. Understanding these distributions will enable us to visualize the data more easily and build models more quickly. However, they cannot and do not replace the work of collecting data and generating the actual data distribution.

Distributions show what percentage of the data lies within a certain range. So, given a distribution and a range of values, we can determine the probability that a data point will lie within that range.
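
In symbols, this can be sketched as follows. The names $X$, $f$, $F$, $a$, and $b$ are introduced here for illustration only. For a continuous random variable $X$ with density $f$ and cumulative distribution function $F$, the probability of falling between $a$ and $b$ is

$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx = F(b) - F(a).$$

For a discrete distribution, the integral is replaced by a sum of the probabilities of the individual values between $a$ and $b$. As an example, for a standard normal variable $Z$, $P(-1 \le Z \le 1) \approx 0.683$, so roughly 68% of the data lies within one standard deviation of the mean.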

The same data may lead to different conclusions if it is fitted to different distributions. So it is vital in any statistical analysis to model the data with the correct distribution.
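
To make this concrete, the sketch below asks how likely a value more than three standard deviations above the mean is under two different models with the same mean and variance. The choice of the Laplace distribution as the alternative model is an assumption made purely for illustration, as are the function names. The Laplace tail formula used here holds only for `z >= 0`.

```python
import math
from statistics import NormalDist

def normal_upper_tail(z):
    """P(X > mean + z * sd) if X follows a normal distribution."""
    return 1.0 - NormalDist().cdf(z)

def laplace_upper_tail(z):
    """P(X > mean + z * sd) if X follows a Laplace distribution (z >= 0).

    A Laplace distribution with scale b has variance 2 * b**2, so one
    standard deviation equals b * sqrt(2). Its upper tail beyond a
    distance t >= 0 from the mean is 0.5 * exp(-t / b).
    """
    return 0.5 * math.exp(-z * math.sqrt(2))

z = 3.0
p_normal = normal_upper_tail(z)
p_laplace = laplace_upper_tail(z)
print(f"normal model:  {p_normal:.5f}")
print(f"Laplace model: {p_laplace:.5f}")
```

Under the Laplace model the same observation is several times more probable than under the normal model, so a value that looks like a rare outlier under one distribution may be fairly unremarkable under the other.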

Distributions

Comparison of Some Distributions

**Some Distributions**

| Name | Notation | Formula | Symbols | Use | Continuous/discrete | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Bernoulli | $f(x) =$ | $p^{x}(1-p)^{1-x}$ | $p$ probability of success; $x \in \{0, 1\}$ outcome | 2 outcomes | Discrete | 1 trial |
| Binomial | $b(x; n, p) =$ | $\binom{n}{x} p^{x}(1-p)^{n-x}$ | $n$ trials; $x$ successes; $p$ probability of success | Number of successes when the success probability is fixed, not random | Discrete | |
| Poisson | $P(x) =$ | $\frac{e^{-\lambda t}(\lambda t)^{x}}{x!}$ | $\mu = \lambda t$; $\sigma^{2} = \lambda t$ | Outcomes per unit of time or per region | Discrete | |
| Hypergeometric | $h(x; N, n, k) =$ | $\frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}$ | $n$ samples from $N$ items; $k$ of the $N$ items are successes, $N-k$ are failures | Number of times $x$ that success occurs, regardless of position, in a random selection | Discrete | Without replacement |
| Multivariate hypergeometric | $h(x_{1}, x_{2}, \dots, x_{k}; a_{1}, a_{2}, \dots, a_{k}, N, n) =$ | $\frac{\binom{a_{1}}{x_{1}}\binom{a_{2}}{x_{2}}\cdots\binom{a_{k}}{x_{k}}}{\binom{N}{n}}$ | $n$ sample size; $N$ items; $k$ cells $A_{1}, \dots, A_{k}$, each with $a_{1}, \dots, a_{k}$ elements | | Discrete | Without replacement |
| Normal | $n(x; \mu, \sigma) =$ | $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^{2}/(2\sigma^{2})}$ | $x$ value; $\mu$ average; $\sigma$ standard deviation | $P(a < X < b) = \int_{a}^{b} n(x; \mu, \sigma)\,dx$ | Continuous | $Z = \frac{x-\mu}{\sigma}$ is a random variable with $\mu = 0$ and $\sigma^{2} = 1$ |
| Chi-square | $\chi^{2} =$ | $\frac{(n-1)S^{2}}{\sigma^{2}}$ | $S^{2}$ variance of a random sample of size $n$ taken from a normal population with variance $\sigma^{2}$ | Relates the variance of a random sample to the variance of the population | Continuous | |
| Student-t | $T =$ | $\frac{\bar{X}-\mu}{S/\sqrt{n}}$ | $\bar{X}$ mean of a random sample of size $n$ | When $\sigma$ is unknown | Continuous | $v = n-1$ degrees of freedom |
| F | $F =$ | $\frac{\sigma_{2}^{2}S_{1}^{2}}{\sigma_{1}^{2}S_{2}^{2}} = \frac{S_{1}^{2}}{S_{2}^{2}}$ | | | Continuous | The second equality holds when $\sigma_{1}^{2} = \sigma_{2}^{2}$ |
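
As a rough illustration of how the discrete formulas in the table can be evaluated, the sketch below implements the binomial, Poisson, and hypergeometric formulas with Python's standard `math` module (`math.comb` requires Python 3.8 or later). The function names and argument order are assumptions chosen to mirror the table's notation; the functions expect valid non-negative integer counts and do no input checking.

```python
import math

def binomial_pmf(x, n, p):
    """b(x; n, p): probability of x successes in n trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, rate, t=1.0):
    """P(x): probability of x outcomes, with mean rate * t (lambda * t)."""
    mean = rate * t
    return math.exp(-mean) * mean**x / math.factorial(x)

def hypergeometric_pmf(x, N, n, k):
    """h(x; N, n, k): x successes in n draws without replacement
    from N items, of which k are successes."""
    return math.comb(k, x) * math.comb(N - k, n - x) / math.comb(N, n)

# Probability of exactly 2 heads in 4 fair coin flips.
print(binomial_pmf(2, 4, 0.5))  # 0.375

# Probability of no events in one time unit at an average rate of 2 per unit.
print(poisson_pmf(0, 2.0))

# Probability of drawing exactly 1 success in 2 draws from 5 items, 2 of them successes.
print(hypergeometric_pmf(1, N=5, n=2, k=2))
```

Summing any of these functions over all possible values of `x` should give 1, which is a quick sanity check when experimenting with the formulas.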