Statistical Analysis: an Introduction using R/Chapter 2
Data is the life blood of statistical analysis. A recurring theme in this book is that most analysis consists of constructing sensible statistical models to explain the data that has been observed. This requires a clear understanding of the data and where it came from. It is therefore important to know the different types of data that are likely to be encountered. Thus in this chapter we focus on different types of data, including simple ways in which they can be examined, and how data can organised into coherent datasets.
R topics in Chapter 2
The R examples used in this chapter are intended to introduce you to the nuts and bolts of R, so may seem dry or even overly technical compared to the rest of the book. However, the topics introduced here are essential to understand how to use R and it is particularly important to understand them. They assume that you are comfortable with the concepts of assignment (i.e. storing objects) and functions, as detailed previously.
VariablesEdit
The simplest sort of data is just a collection of measurements, each measurement being a single "data point". In statistics, a collection of single measurements of the same sort is commonly known as a variable, and these are often given a name^{[1]}. Variables usually have a reasonable amount of background context associated with them: what the measurements represent, why and how they were collected, whether there are any known omissions or exceptional points, and so forth. Knowing or finding out this associated information is an essential part of any analysis, along with examination of the variables (e.g. by plotting or other means).
Single variables in REdit
Vectors
TRUE
or FALSE
^{[2]}). In this topic we will use some example vectors provided by the "datasets" package, containing data on States of the USA (see ?state
).
R is an inherently vectorbased program; in fact the numbers we have been using in previous calculations are just treated as vectors with a single element. This means that most basic functions in R will behave sensibly when given a vector as a argument, as shown below.
Input:
1 state.area #a NUMERIC vector giving the area of US states, in square miles
2 state.name #a CHARACTER vector (note the quote marks) of state names
3 sq.km < state.area*2.59 #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
4 sq.km #... the new vector has the calculation applied to each element in turn
5 sqrt(sq.km) #Many mathematical functions also apply to each element in turn
6 range(state.area) #But some functions return different length vectors (here, just the max & min).
7 length(state.area) #and some, like this useful one, just return a single value.
Result:
> state.area #a NUMERIC vector giving the area of US states, in square miles [1] 51609 589757 113909 53104 158693 104247 5009 2057 58560 58876 6450 83557 56400 [14] 36291 56290 82264 40395 48523 33215 10577 8257 58216 84068 47716 69686 147138 [27] 77227 110540 9304 7836 121666 49576 52586 70665 41222 69919 96981 45333 1214 [40] 31055 77047 42244 267339 84916 9609 40815 68192 24181 56154 97914 > state.name #a CHARACTER vector (note the quote marks) of state names [1] "Alabama" "Alaska" "Arizona" "Arkansas" [5] "California" "Colorado" "Connecticut" "Delaware" [9] "Florida" "Georgia" "Hawaii" "Idaho" [13] "Illinois" "Indiana" "Iowa" "Kansas" [17] "Kentucky" "Louisiana" "Maine" "Maryland" [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi" [25] "Missouri" "Montana" "Nebraska" "Nevada" [29] "New Hampshire" "New Jersey" "New Mexico" "New York" [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma" [37] "Oregon" "Pennsylvania" "The smallest state" "South Carolina" [41] "South Dakota" "Tennessee" "Texas" "Utah" [45] "Vermont" "Virginia" "Washington" "West Virginia" [49] "Wisconsin" "Wyoming" > sq.km < state.area*2.59 #Standard arithmatic works on numeric vectors, e.g. convert sq miles to sq km > sq.km #... giving another vector with the calculation performed on each element in turn [1] 133667.31 1527470.63 295024.31 137539.36 411014.87 269999.73 12973.31 5327.63 [9] 151670.40 152488.84 16705.50 216412.63 146076.00 93993.69 145791.10 213063.76 [17] 104623.05 125674.57 86026.85 27394.43 21385.63 150779.44 217736.12 123584.44 [25] 180486.74 381087.42 200017.93 286298.60 24097.36 20295.24 315114.94 128401.84 [33] 136197.74 183022.35 106764.98 181090.21 251180.79 117412.47 3144.26 80432.45 [41] 199551.73 109411.96 692408.01 219932.44 24887.31 105710.85 176617.28 62628.79 [49] 145438.86 253597.26 > sqrt(sq.km) #Many mathematical functions also apply to each element in turn [1] 365.60540 1235.90883 543.16140 370.86299 641.10441 519.61498 113.90044 72.99062 [9] 389.44884 390.49819 129.24976 465.20171 382.19890 306.58390 381.82601 461.58830 [17] 323.45487 354.50609 293.30334 165.51263 146.23826 388.30328 466.62203 351.54579 [25] 424.83731 617.32278 447.23364 535.06878 155.23324 142.46136 561.35100 358.33202 [33] 369.04978 427.81111 326.74911 425.54695 501.17940 342.65503 56.07370 283.60615 [41] 446.71213 330.77479 832.11058 468.96955 157.75712 325.13205 420.25859 250.25745 [49] 381.36447 503.58441 > range(state.area) #But some functions return different length vectors (here, just the max & min). [1] 1214 589757 > length(state.area) #and some, like this useful one, just return a single value. [1] 50
c()
, so named because it concatenates objects together. However, if you wish to create vectors consisting of regular sequences of numbers (e.g. 2,4,6,8,10,12, or 1,1,2,2,1,1,2,2) there are several alternative functions you can use, including seq()
, rep()
, and the :
operator.
Input:
1 c("one", "two", "three", "pi") #Make a character vector
2 c(1,2,3,pi) #Make a numeric vector
3 seq(1,3) #Create a sequence of numbers
4 1:3 #A shortcut for the same thing (but less flexible)
5 i < 1:3 #You can store a vector
6 i
7 i < c(i,pi) #To add more elements, you must assign again, e.g. using c()
8 i
9 i < c(i, "text") #A vector cannot contain different data types, so ...
10 i #... R converts all elements to the same type
11 i+1 #The numbers are now strings of text: arithmetic is impossible
12 rep(1, 10) #The "rep" function repeats its first argument
13 rep(3:1,10) #The first argument can also be a vector
14 huge.vector < 0:(10^7) #R can easily cope with very big vectors
15 #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
16 rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
Result:
> c("one", "two", "three", "pi") #Make a character vector [1] "one" "two" "three" "pi" > c(1,2,3,pi) #Make a numeric vector [1] 1.000000 2.000000 3.000000 3.141593 > seq(1,3) #Create a sequence of numbers [1] 1 2 3 > 1:3 #A shortcut for the same thing (but less flexible) [1] 1 2 3 > i < 1:3 #You can store a vector > i [1] 1 2 3 > i < c(i,pi) #To add more elements, you must assign again, e.g. using c() > i [1] 1.000000 2.000000 3.000000 3.141593 > i < c(i, "text") #A vector cannot contain different data types, so ... > i #... R converts all elements to the same type [1] "1" "2" "3" "3.14159265358979" "text" > i+1 #The numbers are now strings of text: arithmetic is impossible Error in i + 1 : nonnumeric argument to binary operator > rep(1, 10) #The "rep" function repeats its first argument [1] 1 1 1 1 1 1 1 1 1 1 > rep(3:1,10) #The first argument can also be a vector [1] 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 > huge.vector < 0:(10^7) #R can easily cope with very big vectors > #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers > rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
Accessing elements of vectors
[]
(i.e. square brackets). If these square brackets contain
 A positive number or numbers, then this has the effect of picking those particular elements of the vector
 A negative number or numbers, then this has the effect of picking the whole vector except those elements
 A logical vector, then each element of the logical vector indicates whether to pick (if TRUE) or not (if FALSE) the equivalent element of the original vector^{[3]}.
The use of logical vectors may seem a little complicated. However, they can be extremely useful, because they are the key behind using comparison operators. These can be used, for example, to identify which US states are small, with an area less than (<) 10 000 square miles (as demonstrated below).
Input:
1 min(state.area) #This gives the area of the smallest US state...
2 which.min(state.area) #... this shows which element it is (the 39th as it happens)
3 state.name[39] #You can obtain individual elements by using square brackets
4 state.name[39] < "THE SMALLEST STATE" #You can replace elements using [] too
5 state.name #The 39th name ("Rhode Island") should now have been changed
6 state.name[1:10] #This returns a new vector consisting of only the first 10 states
7 state.name[(1:10)] #Using negative numbers gives everything but the first 10 states
8 state.name[c(1,2,2,1)] #You can also obtain the same element multiple times
9 ###Logical vectors are a little more complicated to get your head round
10 state.area < 10000 #A LOGICAL vector, identifying which states are small
11 state.name[state.area < 10000] #So this can be used to select the names of the small states
Result:
> min(state.area) #This gives the area of the smallest US state... [1] 1214 > which.min(state.area) #... this shows which element it is (the 39th as it happens) [1] 39 > state.name[39] #You can obtain individual elements by using square brackets [1] "Rhode Island" > state.name[39] < "The smallest state" #You can replace elements using [] too > state.name #The 39th name ("Rhode Island") should now have been changed [1] "Alabama" "Alaska" "Arizona" "Arkansas" [5] "California" "Colorado" "Connecticut" "Delaware" [9] "Florida" "Georgia" "Hawaii" "Idaho" [13] "Illinois" "Indiana" "Iowa" "Kansas" [17] "Kentucky" "Louisiana" "Maine" "Maryland" [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi" [25] "Missouri" "Montana" "Nebraska" "Nevada" [29] "New Hampshire" "New Jersey" "New Mexico" "New York" [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma" [37] "Oregon" "Pennsylvania" "THE SMALLEST STATE" "South Carolina" [41] "South Dakota" "Tennessee" "Texas" "Utah" [45] "Vermont" "Virginia" "Washington" "West Virginia" [49] "Wisconsin" "Wyoming" > state.name[1:10] #This returns a new vector consisting of only the first 10 states [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California" "Colorado" [7] "Connecticut" "Delaware" "Florida" "Georgia" > state.name[(1:10)] #Using negative numbers gives everything but the first 10 states [1] "Hawaii" "Idaho" "Illinois" "Indiana" [5] "Iowa" "Kansas" "Kentucky" "Louisiana" [9] "Maine" "Maryland" "Massachusetts" "Michigan" [13] "Minnesota" "Mississippi" "Missouri" "Montana" [17] "Nebraska" "Nevada" "New Hampshire" "New Jersey" [21] "New Mexico" "New York" "North Carolina" "North Dakota" [25] "Ohio" "Oklahoma" "Oregon" "Pennsylvania" [29] "THE SMALLEST STATE" "South Carolina" "South Dakota" "Tennessee" [33] "Texas" "Utah" "Vermont" "Virginia" [37] "Washington" "West Virginia" "Wisconsin" "Wyoming" > state.name[c(1,2,2,1)] #You can also obtain the same element multiple times [1] "Alabama" "Alaska" "Alaska" "Alabama" > ###Logical vectors are a little more complicated to get your head round > state.area < 10000 #A LOGICAL vector, identifying which states are small [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE [16] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [31] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE [46] FALSE FALSE FALSE FALSE FALSE > state.name[state.area < 10000] #So this can be used to select the names of the small states [1] "Connecticut" "Delaware" "Hawaii" "Massachusetts" [5] "New Hampshire" "New Jersey" "THE SMALLEST STATE" "Vermont"
[]
operator can be used to access just a single element of a vector, it is particularly useful for accessing a number of elements at once. Another operator, the double squarebracket ([[
) exists for specifically accessing a single element. While not particularly useful for vectors, it comes into its own for #Lists and #Data frames.Logical operations
<
) to produce a logical vector, which could then be used to select elements less than a certain value. This type of logical operation is very useful thing to be able to do. As well as <
, there are a handful of other comparison operators. Here is the full set (See ?Comparison
for more details)
<
(less than) and<=
(less than or equal to)>
(greater than) and>=
(greater than or equal to)==
(equal to^{[4]}) and!=
(not equal to)
Even more flexibility can be gained by combining logical vectors using and, or, and not. For example, we might want to identify which US states have an area less than 10 000 or greater than 100 000 square miles, or to identify which have an area greater than 100 000 square miles and which have a short name. The code below shows how can be used to do this, using the following R symbols:
&
("and")
("or")!
("not")
When using logical vectors, the following functions are particularly useful, as illustrated below
which()
identifies which elements of a logical vector areTRUE
sum()
can be used to give the number of elements of a logical vector which areTRUE
. This is becausesum()
forces its input to be converted to numbers, and if TRUE and FALSE are converted to numbers, they take the values 1 and 0 respectively.ifelse()
returns different values depending on whether each element of a logical vector is TRUE or FALSE. Specifically, a command such asifelse(aLogicalVector, vectorT, vectorF)
takesaLogicalVector
and returns, for each element that isTRUE
, the corresponding element fromvectorT
, and for each element that isFALSE
, the corresponding element fromvectorF
. An extra elaboration is that ifvectorT
orvectorF
are shorter thanaLogicalVector
they are extended by duplication to the correct length.
Input:
1 ### In these examples, we'll reuse the American states data, especially the state names
2 ### To remind yourself of them, you might want to look at the vector "state.names"
3
4 nchar(state.name) # nchar() returns the number of characters in strings of text ...
5 nchar(state.name) <= 6 #so this indicates which states have names of 6 letters or fewer
6 ShortName < nchar(state.name) <= 6 #store this logical vector for future use
7 sum(ShortName) #With a logical vector, sum() tells us how many are TRUE (11 here)
8 which(ShortName) #These are the positions of the 11 elements which have short names
9 state.name[ShortName] #Use the index operator [] on the original vector to get the names
10 state.abb[ShortName] #Or even on other vectors (e.g. the 2 letter state abbreviations)
11
12 isSmall < state.area < 10000 #Store a logical vector indicating states <10000 sq. miles
13 isHuge < state.area > 100000 #And another for states >100000 square miles in area
14 sum(isSmall) #there are 8 "small" states
15 sum(isHuge) #coincidentally, there are also 8 "huge" states
16
17 state.name[isSmall  isHuge] #  means OR. So these are states which are small OR huge
18 state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name
19 state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name
20
21 ### Examples of ifelse() ###
22
23 ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones
24 # (think of this as "*if* ShortName is TRUE then use state.name *else* use state.abb)
25
26 ### Many functions in R increase input vectors to the correct size by duplication ###
27 ifelse(ShortName, state.name, "tooBIG") #A silly example: the 3rd argument is duplicated
28 size < ifelse(isSmall, "small", "large") #A more useful example, for both 2nd & 3rd args
29 size #might be useful as an indicator variable?
30 ifelse(size=="large", ifelse(isHuge, "huge", "medium"), "small") #A more complex example
Result:
> ### In these examples, we'll reuse the American states data, especially the state names > ### To remind yourself of them, you might want to look at the vector "state.names" > > nchar(state.name) # nchar() returns the number of characters in strings of text ... [1] 7 6 7 8 10 8 11 8 7 7 6 5 8 7 4 6 8 9 5 8 13 8 9 11 8 7 8 6 13 [30] 10 10 8 14 12 4 8 6 12 12 14 12 9 5 4 7 8 10 13 9 7 > nchar(state.name) <= 6 #so this indicates which states have names of 6 letters or fewer [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [15] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE [29] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE [43] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE > ShortName < nchar(state.name) <= 6 #store this logical vector for future use > sum(ShortName) #With a logical vector, sum() tells us how many are TRUE (11 here) [1] 11 > which(ShortName) #These are the positions of the 11 elements which have short names [1] 2 11 12 15 16 19 28 35 37 43 44 > state.name[ShortName] #Use the index operator [] on the original vector to get the names [1] "Alaska" "Hawaii" "Idaho" "Iowa" "Kansas" "Maine" "Nevada" "Ohio" "Oregon" [10] "Texas" "Utah" > state.abb[ShortName] #Or even on other vectors (e.g. the 2 letter state abbreviations) [1] "AK" "HI" "ID" "IA" "KS" "ME" "NV" "OH" "OR" "TX" "UT" > > isSmall < state.area < 10000 #Store a logical vector indicating states <10000 sq. miles > isHuge < state.area > 100000 #And another for states >100000 square miles in area > sum(isSmall) #there are 8 "small" states [1] 8 > sum(isHuge) #coincidentally, there are also 8 "huge" states [1] 8 > > state.name[isSmall  isHuge] #  means OR. So these are states which are small OR huge [1] "Alaska" "Arizona" "California" "Colorado" "Connecticut" [6] "Delaware" "Hawaii" "Massachusetts" "Montana" "Nevada" [11] "New Hampshire" "New Jersey" "New Mexico" "Rhode Island" "Texas" [16] "Vermont" > state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name [1] "Alaska" "Nevada" "Texas" > state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name [1] "Arizona" "California" "Colorado" "Montana" "New Mexico" > > ### Examples of ifelse() ### > > ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones [1] "AL" "Alaska" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" [10] "GA" "Hawaii" "Idaho" "IL" "IN" "Iowa" "Kansas" "KY" "LA" [19] "Maine" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" [28] "Nevada" "NH" "NJ" "NM" "NY" "NC" "ND" "Ohio" "OK" [37] "Oregon" "PA" "RI" "SC" "SD" "TN" "Texas" "Utah" "VT" [46] "VA" "WA" "WV" "WI" "WY" > # (think of this as "*if* ShortName is TRUE then use state.name *else* use state.abb) > > ### Many functions in R increase input vectors to the correct size by duplication ### > ifelse(ShortName, state.name, "tooBIG") #A silly example: the 3rd argument is duplicated [1] "tooBIG" "Alaska" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" [10] "tooBIG" "Hawaii" "Idaho" "tooBIG" "tooBIG" "Iowa" "Kansas" "tooBIG" "tooBIG" [19] "Maine" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" [28] "Nevada" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "Ohio" "tooBIG" [37] "Oregon" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "Texas" "Utah" "tooBIG" [46] "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" > size < ifelse(isSmall, "small", "large") #A more useful example, for both 2nd & 3rd args > size #might be useful as an indicator variable? [1] "large" "large" "large" "large" "large" "large" "small" "small" "large" "large" [11] "small" "large" "large" "large" "large" "large" "large" "large" "large" "large" [21] "small" "large" "large" "large" "large" "large" "large" "large" "small" "small" [31] "large" "large" "large" "large" "large" "large" "large" "large" "small" "large" [41] "large" "large" "large" "large" "small" "large" "large" "large" "large" "large" > ifelse(size=="large", ifelse(isHuge, "huge", "medium"), "small") #A more complex example [1] "medium" "huge" "huge" "medium" "huge" "huge" "small" "small" "medium" [10] "medium" "small" "medium" "medium" "medium" "medium" "medium" "medium" "medium" [19] "medium" "medium" "small" "medium" "medium" "medium" "medium" "huge" "medium" [28] "huge" "small" "small" "huge" "medium" "medium" "medium" "medium" "medium" [37] "medium" "medium" "small" "medium" "medium" "medium" "huge" "medium" "small" [46] "medium" "medium" "medium" "medium" "medium"
If you have done any computer programming, you may be more used to dealing with logic in the context of "if" statements. While R also has an if()
statement, it is less useful when dealing with vectors. For example, the following R expression
if(aVariable == 0) then print("zero") else print("not zero")expects
aVariable
to be a single number: it outputs "zero" if this number is 0, or "not zero" if it is a number other than zero^{[5]}. If aVariable
is a vector of 2 values or more, only the first element counts: everything else is ignored^{[6]}. There are also logical operators which ignore everything but the first element of a vector: these are &&
for AND and 
for OR^{[7]}.Missing data
We could just omit missing data, but in many cases, we have information for some variables, but not for others. For example, we might not want to completely omit a US state from an analysis, just because it it missing one particular datum of interest. For this reason, R provides a special value, NA, meaning "not available". Any vector, numeric, character, or logical, can have elements which are NA. These can be identified by the function "is.na".
Input:
1 some.missing < c(1,NA)
2 is.na(some.missing)
Result:
some.missing < c(1,NA) is.na(some.missing) [1] FALSE TRUE
Measurement valuesEdit
One important feature of any variable is the values it is allowed to have. For example, a variable such as sex can only take a limited number of values ('Male' and 'Female' in this instance), whereas a variable such as humanHeight can take any numerical value between about 0 and 3 metres. This is the sort of obvious background information that cannot necessarily be inferred from the data, but which can be vital for analysis. Only a limited amount of this information is usually fed directly into statistical analysis software. As always, it's very important to take account of such background information. This can be usually done using that commodity  unavailable to a computer  known as common sense. For example, a computer could be used to perform an analysis of human height without realising that one person has been recorded as (say) 175, rather than 1.75, metres tall. A computer can blindly perform analysis on this variable without noticing the error, even though it is glaringly obvious to a human^{[8]}. That's one of the primary reasons that it is important to plot data before analysis.
Categorical versus quantitative variablesEdit
Nevertheless, a few bits of information about a variable can (indeed, often must) be given to analysis software. Nearly all statistical software packages require you, at a minimum, to distinguish between categorical variables (e.g. sex) in which each data point takes one of a fixed number of predefined "levels", and quantitative variables (e.g. humanHeight) in which each data point is a number on a welldefined scale. Further examples are given in Table 2.1. This distinction is important even for such simple analyses as taking an average: a procedure which is meaningful for a quantitative variable, but rarely for a categorical variable (what is the "average" sex of a 'male' and a 'female'?).
Categorical (also known as "qualitative" variables or "factors") 


 
Quantitative (also known as "numeric" variables or "covariates") 

 

It is not always immediately obvious from the plain data whether a variable is categorical or quantitative: often this judgement must be made by careful consideration of the context of the data. For example, a variable containing numbers 1, 2, and 3 might seem to be a numerical quantity, but it could just as easily be a categorical variable describing (say) a medical treatment using either drug 1, drug 2, or drug 3. More rarely, a seemingly categorical variable such as colour (levels 'blue', 'green', 'yellow', 'red') might be better represented as a numerical quantity such as the wavelength of light emitted in an experiment. Again, it's your job to make this sort of judgement, on the basis of what you are trying to do.
Borderline categorical/quantitative variablesEdit
Despite the importance of the categorical/quantitative distinction (and its prominence in many textbooks), reality is not always so clearcut. It can sometimes be reasonable to treat categorical variables as quantitative, or vice versa. Perhaps the commonest case is when the levels of a categorical variable seem to have a natural order, such as the class variable in Table 2.1, or the Likert scale often used in questionnaires.
In rare and specific circumstances, and depending on the nature of the question being asked, there may be rough numerical values that can be allocated to each level. For example, maybe a survey question is accompanied by a visual scale on which the Likert categories are marked, from 'absolutely agree' to 'absolutely disagree'. In this case it may be justifiable to convert the categorical variable straight into to a quantitative one.
More commonly, the order of levels is known, but exact values cannot be generally agreed upon. Such categorical variables can be described as ordinal or ranked, as opposed ones such as gender or professedReligion which are purely nominal. Hence we can recognise two major types of categorical variable: ordered ("ordinal") and unordered ("nominal"), as illustrated by the two examples in Table 2.1.
Classification of quantitative variablesEdit
Although the categorical/quantitative division is the most important one, we can further subdivide each type (as we have already seen when discussing categorical variables) . The most commonly taught classification is due to Stevens (1946). As well as dividing categorical variables into ordinal and nominal types, he classified quantitative variables into two types, interval or ratio, depending on the nature of the scale that was used. To this classification can be added circular variables. Hence classifying quantitative variables on the basis of the measurement scale leads to three subdivision (as illustrated by the subdivisions in Table 2.1):
 Ratio data is the most commonly encountered. Examples include distances, lengths of time, numbers of items, etc. These variables are measured on a scale with a natural zero point; because we can work with exclusively positive integers.
 Interval data is measured on a scale where there is no natural zero point. The most common examples are temperature (in degrees Centigrade or Fahrenheit) and calendar date. Since the zero point on the scale is essentially arbitrary, The name comes from the fact that while ratios are not meaningful, intervals are. E.g. means that ****
 Circular data is measured on a scale which "wraps around", such as Direction, TimeOfDay, Logitude etc. ***
The Stevens classification is not the only way to categorise quantitative variables. Another sensible division recognises the difference between continuous and discrete measurements. Specifically, quantitative variables can represent either
 Continuous data, in which it makes sense to talk about intermediate values (e.g. 1.2 hours, 12.5%, etc.). This includes cases where data have been rounded ***.
 "Discrete data", where intermediate values are nonsensical (e.g. doesn't make much sense to talk about 1.2 deaths, or 2.6 cancer cases in a group of 10 people). Often these are counts of things: this is sometime known as meristic data.
In practice, discrete data are often treated as continuous, especially when the units into which they are divided are relatively small. For example, the population size of different countries is theoretically discrete (you can't have half a person), but the values are so huge that it may be reasonable to treat such data as continuous. However, for small values, such as the number of people in a household, the data are rather "granular", and the discrete nature of values becomes very apparent. One common result of this is the presence of multiple repeated values (e.g. there will be a lot of 2 person households in most data sets).
A third way of classifying quantitative variables depends on whether the scale has upper or lower bounds, or even both.
 bounded at one end (e.g. landArea cannot be below 0),
 at both ends (e.g. percentages cannot be less then 0 or greater than 100). Also see censored data ***
 unbounded (weightLoss).
Most important is circular  often requires very different analytical tools. Often best to make linear in some way (e.g. difference from a fixed direction).
Interval data cannot use ratios (division). Rather rare
Bounds: very common to have lower bound. Unusual to have only an upper bound. Both often indicates a percentage.  often treated by transformation (e.g. log)
Count data: if multiple identical values, can affect plotting etc. If true independent counts, indicates error function.
The distinctions between the different types of variables are summarised in Figure ***. Note that it is also common to
Independence of data pointsEdit
Does the actual value cause correllations in "surrounding" values (e.g. time series), or do both reflect some common association (e.g. blocks/heterogeneity).
Time series Spatial data Blocks
Incorporating informationEdit
Time series, other sources of nonindependence
Factors
The factor()
function creates a factor and defines the available levels. By default the levels are taken from the ones in the vector***. Actually, you don't often need to use factor()
, because when reading data in from a file, R assumes by default that text should be converted to factors (see Statistical Analysis: an Introduction using R/R/Data frames). You may need to use as.factor()
. Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.
Ordinal variables, that is factors in which the levels have a natural order, are known to R as ordered factors. They can be created in the normal way a factor is created, but in addition specifying ordered=TRUE
.
Input:
state.region #An example of a factor: note that the levels are printed out state.name #this is *NOT* a factor state.name[1] < "Any text" #you can replace text in a character vector state.region[1] < "Any text" #but you can't in a factor state.region[1] < "South" #this is OK state.abb #this is not a factor, just a character vector character.vector < c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Female" , "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female" , "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female") #a bit tedious to do all that typing #might be easier to use codes, e.g. 1 for female and 2 for male Coded < factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1)) Gender < factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels
Result:
Matrices and Arrays
Essentially, a matrix (plural: matrices) is the two dimensional equivalent of a vector. In other words, it is a rectangular grid of numbers, arranged in rows and columns. In R, a matrix object can be created by the matrix()
function, which takes, as a first argument, a vector of numbers with which the matrix is filled, and as the second and third arguments, the number of rows and the number of columns respectively.
R can also use array objects, which are like matrices, but can have more than 2 dimensions. These are particularly useful for tables: a type of array containing counts of data classified according to various criteria. Examples of these "contingency tables" are the HairEyeColor
and Titanic
tables shown below.
As with vectors, the indexing operator []
can be used to access individual elements or sets of elements in a matrix or array. This is done by separating the numbers inside the brackets by commas. For example, for matrices, you need to specify the row index then a comma, then the column index. If the row index is blank, it is assumed that you want all the rows, and similarly for the columns.
Input:
1 m < matrix(1:12, 3, 4) #Create a 3x4 matrix filled with numbers 1 to 12
2 m #Display it!
3 m*2 #Arithmetic, just like with vectors
4 m[2,3] #Pick out a single element (2nd row, 3rd column)
5 m[1:2, 2:4] #Or a range (rows 1 and 2, columns 2, 3, and 4.)
6 m[,1] #If the row index is missing, assume all rows
7 m[1,] #Same for columns
8 m[,2] < 99 #You can assign values to one or more elements
9 m #See!
10 ###Some real data, stored as "arrays"
11 HairEyeColor #A 3D array
12 HairEyeColor[,,1] #Select only the males to make it a 2D matrix
13 Titanic #A 4D array
14 Titanic[1:3,"Male","Adult",] #A matrix of only the adult male passengers
Result:
> m < matrix(1:12, 3, 4) #Create a 3x4 matrix filled with numbers 1 to 12 > m #Display it! [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > m*2 #Arithmetic, just like with vectors [,1] [,2] [,3] [,4] [1,] 2 8 14 20 [2,] 4 10 16 22 [3,] 6 12 18 24 > m[2,3] #Pick out a single element (2nd row, 3rd column) [1] 8 > m[1:2, 2:4] #Or a range (rows 1 and 2, columns 2, 3, and 4.) [,1] [,2] [,3] [1,] 4 7 10 [2,] 5 8 11 > m[,1] #If the row index is missing, assume all rows [1] 1 2 3 > m[1,] #Same for columns [1] 1 4 7 10 > m[,2] < 99 #You can assign values to one or more elements > m #See! [,1] [,2] [,3] [,4] [1,] 1 99 7 10 [2,] 2 99 8 11 [3,] 3 99 9 12 > ###Some real data, stored as "arrays" > HairEyeColor #A 3D array , , Sex = Male Eye Hair Brown Blue Hazel Green Black 32 11 10 3 Brown 53 50 25 15 Red 10 10 7 7 Blond 3 30 5 8 , , Sex = Female Eye Hair Brown Blue Hazel Green Black 36 9 5 2 Brown 66 34 29 14 Red 16 7 7 7 Blond 4 64 5 8 > HairEyeColor[,,1] #Select only the males to make it a 2D matrix Eye Hair Brown Blue Hazel Green Black 32 11 10 3 Brown 53 50 25 15 Red 10 10 7 7 Blond 3 30 5 8 > Titanic #A 4D array , , Age = Child, Survived = No Sex Class Male Female 1st 0 0 2nd 0 0 3rd 35 17 Crew 0 0 , , Age = Adult, Survived = No Sex Class Male Female 1st 118 4 2nd 154 13 3rd 387 89 Crew 670 3 , , Age = Child, Survived = Yes Sex Class Male Female 1st 5 1 2nd 11 13 3rd 13 14 Crew 0 0 , , Age = Adult, Survived = Yes Sex Class Male Female 1st 57 140 2nd 14 80 3rd 75 76 Crew 192 20 > Titanic[1:3,"Male","Adult",] #A matrix of only the adult male passengers Survived Class No Yes 1st 118 57 2nd 154 14 3rd 387 75
Lists
If you want a collection of elements which are of different types, or not of one of the allowed vector types, you need to use a list.
Input:
1 l1 < list(a=1, b=1:3)
2 l2 < c(sqrt, log) #
Result:
Visualising a single variableEdit
Before carrying out a formal analysis, you should always perform an Initial Data Analysis, part of which is to inspect the variables that are to be analysed. If there are only a few data points, the numbers can be scanned by eye, but normally it is easier to inspect data by plotting.
Scatter plots, such as those in Chapter 1, are perhaps the most familiar sort of plot, and are useful for showing patterns of association between two variables. These are discussed later in this chapter, but in this section we first examine the various ways of visualising a single variable.
Plots of a single variable, or univariate plots are particularly used to explore the distribution of a variable; that is its shape and position. Apart from initial inspection of data, one very common use of these plots is to look at the residuals (see Figure 1.2): the unexplained part of the data that remains after fitting a statistical model. Assumptions about the distribution of these residuals are often checked by plotting them.
The plots which follow illustrate a few of the more common types of univariate plot. The classic text is Tufte (cite: the visual display of quantitative information).
Categorical variablesEdit
For categorical variables, the choice of plots is quite simple. The most basic plots simply involve counting up the data points at each level.
Figure 2.1(a) displays these counts as a bar chart; another possibility is to use points as in Figure 2.1(b). In the case of gender, the order of the levels is not important: either 'male' or 'female' could come first. In the case of class, the natural order of the levels is used in the plot. In the extreme case, where intermediate levels might be meaningful, or where you wish to emphasise the pattern between levels, it may be reasonable to connect points by lines. For illustrative purposes, this has been done in Figure 2.1(b), although the reader may question whether it is appropriate in this instance.
plot(1:length(Gender), Gender, yaxs="n"); axis(2, 1:2, levels(Gender), las=1)
In some cases we may be interested in the actual sequence of data points. This is particularly so for time series data, but may also be relevant elsewhere. For instance, in the case of gender, the data was recorded in the order that each child was born. If we think that the preceeding birth influences the following birth (unlikely in this case, but just within the realm of possibility if pheremones are involved), then we might want to do SymbolbySymbol plot . If we are looking for associations with time, however, then a bivariate plot may be more appropriate ***, or there are particular features of the data that we are interested in (e.g. repeat rate), then other possibilities exist (doi:10.1016/j.stamet.2007.05.001).See chapter on time series
Quantitative variablesEdit
landArea  
Alabama  133666 

Alaska  1527463 
Arizona  295022 
Arkansas  137538 
California  411012 
Colorado  269998 
Connecticut  12973 
Delaware  5327 
Florida  151669 
Georgia  152488 
Hawaii  16705 
Idaho  216411 
Illinois  146075 
Indiana  93993 
Iowa  145790 
Kansas  213062 
Kentucky  104622 
Louisiana  125673 
Maine  86026 
Maryland  27394 
Massachusetts  21385 
Michigan  150778 
Minnesota  217735 
Mississippi  123583 
Missouri  180485 
Montana  381085 
Nebraska  200017 
Nevada  286297 
New Hampshire  24097 
New Jersey  20295 
New Mexico  315113 
New York  128401 
North Carolina  136197 
North Dakota  183021 
Ohio  106764 
Oklahoma  181089 
Oregon  251179 
Pennsylvania  117411 
Rhode Island  3144 
South Carolina  80432 
South Dakota  199550 
Tennessee  109411 
Texas  692404 
Utah  219931 
Vermont  24887 
Virginia  105710 
Washington  176616 
West Virginia  62628 
Wisconsin  145438 
Wyoming  253596 
deathsPerYear  
1875  3 

1876  5 
1877  7 
1878  9 
1879  10 
1880  18 
1881  6 
1882  14 
1883  11 
1884  9 
1885  5 
1886  11 
1887  15 
1888  6 
1889  11 
1890  17 
1891  12 
1892  15 
1893  8 
1894  4 
A quantitative variable can be plotted in many more ways than a categorical variable. Some of the most common singlevariable plots are discussed below, using the land area of the 50 US states as our example of a continuous variable, and a famous data set of the number of deaths by horse kick as our example of a discrete variable. These data are tabulated in Tables 2.2 and 2.3
Some sorts of data consist of many data points with identical values. This is particularly true for count data where there are low numbers of counts (e.g. number of offspring).
There are 3 things we might want to look for in these sorts of plots
 points that seem extreme in some way (these are known as outliers). Outliers often reveal mistakes in data collection, and even if they don't, they can have a disproportionate effect on further analysis. If it turns out that they aren't due to an obvious mistake, one option is to remove them from the analysis, but this causes problems of its own/
 shape & position of the distribution (e.g. normal, bimodal, etc.)
 similarity to known distributions (QQ)
We'll keep the focus mostly on the variable "landArea" ***
The simplest way to represent quantitative data is to plot the points on a line, as in Figure 2.3(a). This is often called a 'dot plot, although this is also sometimes used to describe a number of other types of plot (e.g. Figure 2.7)^{[12]}. To avoid confusion, it may be best to call it a onedimensional scatterplot. As well as simplicity, there are two advantages to a 1D scatterplot
 All the information present in the data is retained.
 Outliers are easily identified. Indeed, it is often useful to be able to identify which data points are outliers. Some software packages allow you to identify points interactively (e.g. by clicking points on the plot). Otherwise, points can be labelled, as has been done for some outlying points in Figure 2.3a^{[13]}.
One dimensional scatterplots do not work so well for large datasets. Figure 2.3(a) consists of only 50 points. Even so, it is difficult to get an overall impression of the data, to (as the saying goes) "see the wood for the trees". This is partly because some points lie almost on top of each other, but also because of the sheer number of closely placed points. It is often the case that features of the data are best explored by summarising it in some way.
Figure 2.3(b) shows a step on the way to a better plot. To alleviate the problem of points obscuring each other, they have been displaced, or jittered sideways by a small, random amount. More importantly, the data have been summarised by dividing into quartiles (and coloured accordingly, for ease of explanation). The quarter of states with the largest area have been coloured red. The smallest quarter of states have been coloured green.
More generally, we can talk about the quantiles of our data^{[14]}. The red line represents the 75% quantile: 75% of the points lie below it. The green line represents the 25% quantile: 25% of the points lie below it. The distance between these lines is known as the Interquartile Range (IQR), and is a measure of the spread of the data. The thick black line has a special name: the median. It marks the middle of the data, the 50% quantile: 50% of the points lie above, and 50% lie below. A major advantage of quantiles over other summary measures is that they are relatively insensitive to outliers, or changes in scale ****.
Figure 2.3(c) is a coloured version of a widely used statistical summary plot: the boxplot. Here it has been coloured to show the correspondence to Figure 2.3(b). The box marks the quartiles of the data, with the median marked within the box. If the median is not positioned centrally within the box, this is often an indicator that the data are skewed in some way. The lines on either side of the box are known as "whiskers", and summarise the data which lies outside of the upper and lower quartiles. In this case, the whiskers have simply been extended to the maximum and minimum observed values.
Figure 2.3(d) is a more sophisticated boxplot of the same data. Here, notches have been drawn on the box: these are useful for comparing the medians in different boxplots. The whiskers have been shortened so that they do not include points considered as outliers. There are various ways of defining these outliers automatically. This figure is based on a convention that considers outlying points as those more than one and a half times the IQR from either side of the box. However it is often more informative to identify and inspect interesting points (including outliers) by visual inspection. For example, in Figure 2.1a it is clear that Alaska and (to a lesser extent) Texas are unusually large states, but that California (identified as an outlier by this automatic procedure) is not so setapart from the rest.
One problem with plotting on a single line is that, if points are repeated, there ***. This is particularly problematic for discrete data. NB, there is no particular reason (or established convention) for these plots to be vertical. Figure 2.2 shows . The stacked plot (Figure 2.4d) is similar to a histogram (Figure 2.5).
This gives another way of picturing the median & other quantiles: as dividing the area into sections ***
We can space out the points along the other axis. For example, if the order of points in the dataset is meaningful, we can just plot each point in turn. This is true for whatsit's horsekick data. The data by year are plotted in Figure 2.6.
One thing we can always do is to sort the data points by their value, plotting the smallest first, etc. This is seen in Figure 2.3b. If all the data points were equally spaced (and excluded each other****), we would see a straight line. The plot for the logged variables shows that this transformation has evened out the spacing somewhat. This is called a quantile plot, for the following reason
when the axes are swapped, this is called the empirical cumulative distribution function. The unswapped data is useful for understanding qq plots. Also for understanding quantiles. median, etc.
TransformationsEdit
We could put a scale break in, but a better option is usually to transform the variable.
Sometimes, plotting on a different scale (e.g. a logarithmic scale) can be more informative. We can visualise this either as a plot with a nonlinear (e.g. logarithmic) axis, or as a conventional plot of a transformed variable (e.g. a plot of log(my.variable) on a standard, linear axis). Figure 2.1(b) illustrates this point: the left axis marks the state areas, the right axis marks the logarithm of the state areas. This sort of rescaling can highlight quite different features of a variable. In this case, it seems clear that there are a batch of nine states which seem distinctly smaller than most, and while Alaska still seems extraordinarily large, Texas does not seem so unusual in that respect. This is also reflected by the automatic labelling of outliers in the logtransformed variable.
It is particularly common for smaller numbers to have greater resolution. As discussed in later chapters ***, logarithmic scales are particularly useful for multiplicative data ***.
There are other common transformations, for example, the squareroot transformation (often used for count data). This may be more appropriate for state areas, if the limiting factor for state size is (e.g.) the distance across the state, or factors associated with it (e.g. the length of time to cross from one side to another). Figure 2.1c shows a sqrt rescaling of the data. You can see that in some sense this is less extreme than the log transformation...
Univariate plots
Input:
1 stripchart(state.areas, xlab="Area (sq. miles)") #see method="stack" & method="jitter" for others
2 boxplot(sqrt(state.area))
3 hist(sqrt(state.area))
4 hist(sqrt(state.area), 25)
5 plot(density(sqrt(state.area))
6 plot(UKDriverDeaths)
7
8 qqnorm()
9 ecdf(
Result:
DatasetsEdit
Multiple variables in a table. Notation. most packages do this.
Bivariate plottingEdit
Quantitative versus quantitativeEdit
Scatter plots problems with overplotting? Sunflowerplots etc.
Quantitative versus categoricalEdit
Vioplots (&boxplots)
Categorical versus categoricalEdit
Mosaicplots
NotesEdit
 ↑ The convention (which is followed here) is to write variable names in italics
 ↑ These are special words in R, and cannot be used as names for objects. The objects
T
andF
are temporary shortcuts forTRUE
andFALSE
, but if you use them, watch out: since T and F are just normal object names you can change their meaning by overwriting them.  ↑ if the logical vector is shorter then the original vector, then it is sequentially repeated until it is of the right length
 ↑ Note that, when using continuous (fractional) numbers, rounding error may mean that results of calculations are not exactly equal to each other, even if they seem as if they should be. For this reason, you should be careful when using == with continuous numbers. R provides the function all.equal to help in this case
 ↑ But unlike
ifelse
, it can't cope withNA
values  ↑ For this reason, using
==
inif
statements may not be a good idea, see the Note in?"=="
for details.  ↑ These are particularly used in more advanced computer programming in R, see ?"&&" for details
 ↑ Similar examples are given in Chatfield ****
 ↑ There are actually 3 types of allowed numbers: "normal" numbers, complex numbers, and simple integers. This book deals almost exclusively with the first of these.
 ↑ This is not quite true, but unless you are a computer specialist, you are unlikely to use the final type: a vectors of elements storing "raw" computer bits, see
?raw
 ↑ This dataset was collected by the Russian economist von Bortkiewicz, in 1898, to illustrate the pattern seen when events occur independently of each other (this is known as a Poisson distribution). The table here gives the total number of deaths summed over all 14 corps. For the full dataset, broken down into corp, see Statistical Analysis: an Introduction using R/Datasets
 ↑ Authors such as Wild & Seber call it a dot plot, but R uses the term "dotplot" to refer to a Cleveland (1985) dotplot as shown in Figure 2.1(b). Other authors (***) specifically use it to refer to a sorted ("quantile") plot as used in Figure *** (cite)
 ↑ Labels often obscure the plot, so For plots intended only for viewing on a computer, it is possible to print labels in a size so small that they can only be seen when zooming in.
 ↑ There a many different methods for calculating the precise value of a quantile, when it lies between two points. See Hyndman and Fan (1996)
ReferencesEdit
Carifio, J. & Perla, R. (2007). Ten Common Misunderstandings, Misconceptions, Persistent Myths and Urban Legends about Likert Scales and Likert Response Formats and their Antidotes. Journal of Social Sciences, 2, 106116. http://www.scipub.org/fulltext/jss/jss33106116.pdf