Data Science: An Introduction/Single Variable Analysis
Chapter Summary
As discussed in chapter three, a variable is a set of values we have measured from a group of objects. For example, we can record the first name of each person in a class. Each person's recorded name is that person's value for the variable (which, in this case, we would call "FirstName"). When we put all the values of "FirstName" together in a group, we call that group of values a Distribution. In data science speak we would say that "a variable has a distribution of values." In practice, however, many data scientists use the words distribution and variable interchangeably, as if they were synonyms.
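As a small illustration of a variable and its distribution, the sketch below tallies how often each value occurs. The names in `first_names` are made up for the example and are not data from this book.

```python
from collections import Counter

# Hypothetical values of the "FirstName" variable for a small class.
first_names = ["Ana", "Ben", "Ana", "Chloe", "Ben", "Ana"]

# One simple way to look at the distribution of a variable is to
# count how many times each distinct value occurs.
distribution = Counter(first_names)

print(distribution.most_common())
# → [('Ana', 3), ('Ben', 2), ('Chloe', 1)]
```

Each (value, count) pair describes one part of the distribution; together the counts add up to the number of people measured.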
Descriptive Statistics are calculations we perform on distributions simply to describe the variables. The two most common kinds of descriptive statistics are Measures of Central Tendency and Measures of Dispersion. Every variable, and hence every distribution, has a data type: nominal, ordinal, interval, or ratio. We have distinct descriptive statistics for each data type. The table below lists the names of the simple descriptive statistics for each data type.
Measure (by Data Type) | Nominal | Ordinal | Interval | Ratio |
---|---|---|---|---|
Central Tendency | Mode | Median | Arithmetic Mean | Geometric Mean |
Dispersion | Variation Ratio | Inter-quartile Range | Standard Deviation | Coefficient of Variation |
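The statistics in the table can be sketched with Python's standard `statistics` module. The data lists and the helper functions `variation_ratio`, `interquartile_range`, and `coefficient_of_variation` are hypothetical, written only for this illustration. The definitions used here (sample standard deviation, the default quartile method of `statistics.quantiles`) are assumptions, since the text does not specify them.

```python
import statistics
from collections import Counter


def variation_ratio(values):
    """Proportion of observations that fall outside the modal category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)


def interquartile_range(values):
    """Q3 minus Q1, using the default method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def coefficient_of_variation(values):
    """Sample standard deviation divided by the arithmetic mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical example data, one list per data type.
eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
satisfaction = [1, 2, 2, 3, 4, 4, 5]                      # ordinal (1-5 scale)
temperature_c = [18.0, 20.0, 22.0, 24.0, 26.0]            # interval
income = [20.0, 40.0, 80.0]                               # ratio

# (central tendency, dispersion) for each data type, following the table.
summary = {
    "nominal": (statistics.mode(eye_color), variation_ratio(eye_color)),
    "ordinal": (statistics.median(satisfaction), interquartile_range(satisfaction)),
    "interval": (statistics.mean(temperature_c), statistics.stdev(temperature_c)),
    "ratio": (statistics.geometric_mean(income), coefficient_of_variation(income)),
}

for data_type, (center, spread) in summary.items():
    print(data_type, center, spread)
```

Other conventions exist, for example the population standard deviation or a different quartile rule, and they would give slightly different dispersion values on the same data.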
Generally speaking, outside of physics and chemistry, most data science projects either do not use ratio data or convert the ratio data to interval data (into what is sometimes called "log-normal" data). Thus, the Geometric Mean and the Coefficient of Variation are rarely used by data scientists. We must also be careful not to misapply the descriptive statistics of one data type to another, since this often results in a misinterpretation of the data. The exception is that we can cautiously apply the descriptive statistics of a "lower" data type to a "higher" data type. That is, we can appropriately calculate the median for interval data, but not the arithmetic mean for ordinal data.
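One common way to treat ratio data as interval data is to take logarithms. The text does not name the conversion method, so the log transform here is an assumption, and the values in `ratio_values` are invented. The sketch shows that the arithmetic mean of the logged values, transformed back, matches the geometric mean of the original values.

```python
import math
import statistics

# Hypothetical ratio-scale measurements; all must be strictly positive.
ratio_values = [10.0, 100.0, 1000.0]

# On the log scale, equal differences correspond to equal ratios
# in the original data.
log_values = [math.log(v) for v in ratio_values]

# Arithmetic mean on the log scale, transformed back to the original scale.
back_transformed = math.exp(statistics.mean(log_values))

# Geometric mean computed directly on the original values.
geometric = statistics.geometric_mean(ratio_values)

print(math.isclose(back_transformed, geometric))
```

This is one reason the geometric mean is seldom computed directly once the data have been logged: the ordinary arithmetic mean of the transformed values carries the same information.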
Discussion
Distributions
The Normal Distribution
Other Common Distributions
Nominal Variables
Central Tendency
Dispersion
Ordinal Variables
Central Tendency
Dispersion
From Ordinal to "ordered nominal"
Interval Variables
Central Tendency
Dispersion
From Interval to Ordinal
Ratio Variables
Central Tendency
Dispersion
From Ratio to Interval
Assignment/Exercise
More Reading
References
