Statistical Analysis: an Introduction using R/R/R fundamentals

R fundamentals

edit

If you carry out the exercises in all these topics, you should be relatively competent in using R (also see programming)


Text marked like this is used to discuss an R-specific point. The basics of R can be learned by reading these sections in the order they appear in the book. There will also be commands that can be entered directly into R; you should be able to copy-and-paste them directly into your R session[1]. Try the following to see how to use R as a simple calculator
  Input:
100+2/3
  Result:
> 100+2/3

[1] 100.6667

In the absence of any instructions of what to do with the output of a command, R usually prints the result to the screen. For the time being, ignore the [1] before the answer: we will see that this is useful when R outputs many numbers at once. Note that R respects the standard mathematical rules of carrying out multiplication and division before addition and subtraction: it divides 2 by 3 before adding 100.
R commands can sometimes be rather difficult to follow, so occasionally it can be useful to annotate them with comments. This can be done by typing a hash (#) character: any further text on the same line is ignored by R. This will be used extensively in the R examples in this wikibook, e.g.
  Input:
#this is a comment: R will ignore it
(100+2)/3    #You can use round brackets to group operations so that they are carried out first
5*10^2       #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
1/0          #R knows about infinity (and minus infinity)
0/0          #undefined results take the value NaN ("not a number")
(0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
  Result:
> #this is a comment: R will ignore it

> (100+2)/3 #You can use round brackets to group operations so that they are carried out first [1] 34 > 5*10^2 #The symbol * means multiply, and ^ means "to the power", so this is 5 times (10 squared) [1] 500 > 1/0 #R knows about infinity (and minus infinity) [1] Inf > 0/0 #undefined results take the value NaN ("not a number") [1] NaN > (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers [1] 0+3i

  • If you don't know anything about complex numbers, don't worry: they are not important here.
  • Note that you can't use curly brackets {} or square brackets [] to group operations together


R is what is known as an "object-oriented" program. Everything (including the numbers you have just typed) is a type of object. Later we will see why this concept is so useful. For the time being, you need only note that you can give a name to an object, which has the effect of storing it for later use. Names can be assigned by using the arrow-like signs <- and -> as demonstrated in the exercise below. Which sign you use depends on whether you prefer putting the name first or last (it may be helpful to think of -> as "put into" and <- as "set to"). Unlike many statistical packages, R does not usually display the results of analyses you perform. Instead, analyses usually end up by producing an object which can be stored. Results can then be obtained from the object at leisure. For this reason, when doing statistics in R, you will often find yourself naming and storing objects. The name you choose should consist of letters, numbers, and the "." character[2], and should not start with a number.
  Input:
0.001 -> small.num                #Store the number 0.0001 under the name "small.num" (i.e. put 0.0001 into small.num)
big.num <- 10 * 100               #You can put the name first if you reverse the arrow (set big.num to 10000).
big.num+small.num+1               #Now you can treat big.num and small.num as numbers, and use them in calculations
my.result <- big.num+small.num+2  #And you can store the result of any calculation
my.result                         #To look at the stored object, just type its name
pi                                #There are some named objects that R provides for you
  Result:
> 0.001 -> small.num #Store the number 0.0001 under the name "small.num" (i.e. put 0.0001 into small.num)

> big.num <- 10 * 100 #You can put the name first if you reverse the arrow (set big.num to 10000). > big.num+small.num+1 #Now you can treat big.num and small.num as numbers, and use them in calculations [1] 1001.001 > my.result <- big.num+small.num+2 #And you can store the result of any calculation > my.result #To look at the stored object, just type its name [1] 1002.001 > pi #There are some named objects that R provides for you [1] 3.141593

Note that when the end result of a command is to store (assign) an object, as on input lines 1, 2, and 4, R doesn't print anything to the screen.


Apart from numbers, perhaps the most useful named objects in R are functions. Nearly everything useful that you will do in R is carried out using a function, and many are available in R by default. You can use (or "call") a function by typing its name followed by a pair of round brackets. For instance, the start up text mentions the following function, which you might find useful if you want to reference R in published work:
  Input:
citation()
  Result:
> citation() To cite R in publications use: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. A BibTeX entry for LaTeX users is @Manual{, url = {http://www.R-project.org}, title = {R: A Language and Environment for Statistical Computing}, author = {{R Development Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2008}, note = {{ISBN} 3-900051-07-0}, } We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also ‘citation("pkgname")’ for citing R packages.
Many R functions can produce results which differ depending on arguments that you provide to them. Arguments are placed inside the round brackets, separated by commas. Many functions have one or more optional arguments: that is, you can choose whether or not to provide them. An example of this is the citation() function. It can take an optional argument giving the name of an R add-on package. If you do not provide an optional argument, there is usually an assumed default value (in the case of citation(), this default value is "base", i.e. provide the citation reference for the base package: the package which provides most of the foundations of the R language).

Most arguments to a function are named. For example, the first argument of the citation function is named package. To provide extra clarity, when using a function you can provide arguments in the longer form name=value. Thus

citation("base")

does the same as

citation(package="base")
If a function can take more than one argument, using the long form also allows you to change the order of arguments, as shown in the example code below.
  Input:
citation("base")      #Does the same as citation(), because the default for the first argument is "base"
                      #Note: quotation marks are needed in this particular case (see discussion below)
citation("datasets")  #Find the citation for another package (in this case, the result is very similar)
sqrt(25)              #A different function: "sqrt" takes a single argument, returning its square root.
sqrt(25-9)            #An argument can contain arithmetic and so forth
sqrt(25-9)+100        #The result of a function can be used as part of a further analysis
max(-10, 0.2, 4.5)    #This function returns the maximum value of all its arguments
sqrt(2 * max(-10, 0.2, 4.5))             #You can use results of functions as arguments to other functions
x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100  #... and you can store the results of any of these calculations
x
log(100)              #This function returns the logarithm of its first argument
log(2.718282)         #By default this is the natural logarithm (base "e")
log(100, base=10)     #But you can change the base of the logarithm using the "base" argument
log(100, 10)          #This does the same, because "base" is the second argument of the log function
log(base=10, 100)     #To have the base as the first argument, you have to use the form name=value
  Result:
> citation("base") #Does the same as citation(), because the default for the first argument is "base" To cite R in publications use: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. A BibTeX entry for LaTeX users is @Manual{, title = {R: A Language and Environment for Statistical Computing}, author = {{R Development Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2008}, note = {{ISBN} 3-900051-07-0}, url = {http://www.R-project.org}, } We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also ‘citation("pkgname")’ for citing R packages. > #Note: quotation marks are needed in this particular case (see discussion below) > citation("datasets") #Find the citation for another package (in this case, the result is very similar) The 'datasets' package is part of R. To cite R in publications use: R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. A BibTeX entry for LaTeX users is @Manual{, title = {R: A Language and Environment for Statistical Computing}, author = {{R Development Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2008}, note = {{ISBN} 3-900051-07-0}, url = {http://www.R-project.org}, } We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also ‘citation("pkgname")’ for citing R packages. > sqrt(25) #A different function: "sqrt" takes a single argument, returning its square root. [1] 5 > sqrt(25-9) #An argument can contain arithmetic and so forth [1] 4 > sqrt(25-9)+100 #The result of a function can be used as part of a further analysis [1] 104 > max(-10, 0.2, 4.5) #This function returns the maximum value of all its arguments [1] 4.5 > sqrt(2 * max(-10, 0.2, 4.5)) #You can use results of functions as arguments to other functions [1] 3 > x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100 #... and you can store the results of any of these calculations > x [1] 103 > log(100) #This function returns the logarithm of its first argument [1] 4.60517 > log(2.718282) #By default this is the natural logarithm (base "e") [1] 1 > log(100, base=10) #But you can change the base of the logarithm using the "base" argument [1] 2 > log(100, 10) #This does the same, because "base" is the second argument of the log function [1] 2 > log(base=10, 100) #To have the base as the first argument, you have to use the form name=value [1] 2
Note that when typing normal text (as in the name of a package), it needs to be surrounded by quotation marks[3], to avoid confusion with the names of objects. In other words, in R
citation

refers to a function, whereas

"citation"

is a "string" of text. This is useful, for example when providing titles for plots, etc.

You will probably find that one of the trickiest aspects of getting to know R is knowing which function to use in a particular situation. Fortunately, R not only provides documentation for all its functions, but also ways of searching through the documentation, as well as other ways of getting help.
There are a number of ways to get help in R, and there is also a wide variety of online information. Most installations of R come with a reasonably detailed help file called "An Introduction to R", but this can be rather technical for first-time users of a statistics package. Almost all functions and other objects that are automatically provided in R have a help page which gives intricate details about how to use them. These help pages usually also contain examples, which can be particularly helpful for new users. However, if you don't know the name of what you are looking for, then finding help may not be so easy, although it is possible to search for keywords and concepts that are associated with objects. Some versions of R give easy access to help files without having to type in commands (for example, versions which provide menu bars usually have a "help" menu, and the Macintosh interface also has a help box in the top right hand corner). However, this functionality can always be accessed by typing in the appropriate commands. You might like to type some or all of the following into an R session (no output is listed here because the result will depend on your R system).
help.start()            #A web-based set of help pages (try the link to "An Introduction to R")
help(sqrt)              #Show details of the "sqrt" and similar functions
?sqrt                   #A shortcut to do the same thing
example(sqrt)           #run the examples on the bottom of the help page for "sqrt"
help.search("maximum")  #gives a list of functions involving the word "maximum", but oddly, "max" is not in there!
### The next line is commented out to reduce internet load. To try it, remove the first # sign.
#RSiteSearch("maximum")  #search the R web site for anything to do with "maximum". Probably overkill here!
The last but one command illustrates a problem you may come across with using the R help functions. The searching facility for help files is sometimes a bit hit-and-miss. If you can't find exactly what you are looking for, it is often useful to look at the "See also" section of any help files that sound vaguely similar or relevant. In this case, you might probably eventually find the max() function by looking at the "See also" section of the help file for which.max(). Not ideal!.

Statistical Analysis: an Introduction using R/R/Quitting

One of the most fundamental objects in R is the vector, used to store multiple measurements of the same type (e.g. data variables). There are several different sorts of data that can be stored in a vector. Most common is the numeric vector, in which each element of the vector is simply a number. Other commonly used types of vector are character vectors (where each element is a piece of text) and logical vectors (where each element is either TRUE or FALSE[4]). In this topic we will use some example vectors provided by the "datasets" package, containing data on States of the USA (see ?state). R is an inherently vector-based program; in fact the numbers we have been using in previous calculations are just treated as vectors with a single element. This means that most basic functions in R will behave sensibly when given a vector as a argument, as shown below.
  Input:
state.area                #a NUMERIC vector giving the area of US states, in square miles
state.name                #a CHARACTER vector (note the quote marks) of state names 
sq.km <- state.area*2.59  #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
sq.km                     #... the new vector has the calculation applied to each element in turn
sqrt(sq.km)               #Many mathematical functions also apply to each element in turn 
range(state.area)         #But some functions return different length vectors (here, just the max & min).
length(state.area)        #and some, like this useful one, just return a single value.
  Result:
> state.area #a NUMERIC vector giving the area of US states, in square miles
[1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450  83557  56400

[14] 36291 56290 82264 40395 48523 33215 10577 8257 58216 84068 47716 69686 147138 [27] 77227 110540 9304 7836 121666 49576 52586 70665 41222 69919 96981 45333 1214 [40] 31055 77047 42244 267339 84916 9609 40815 68192 24181 56154 97914 > state.name #a CHARACTER vector (note the quote marks) of state names

[1] "Alabama"            "Alaska"             "Arizona"            "Arkansas"          
[5] "California"         "Colorado"           "Connecticut"        "Delaware"          
[9] "Florida"            "Georgia"            "Hawaii"             "Idaho"             

[13] "Illinois" "Indiana" "Iowa" "Kansas" [17] "Kentucky" "Louisiana" "Maine" "Maryland" [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi" [25] "Missouri" "Montana" "Nebraska" "Nevada" [29] "New Hampshire" "New Jersey" "New Mexico" "New York" [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma" [37] "Oregon" "Pennsylvania" "The smallest state" "South Carolina" [41] "South Dakota" "Tennessee" "Texas" "Utah" [45] "Vermont" "Virginia" "Washington" "West Virginia" [49] "Wisconsin" "Wyoming" > sq.km <- state.area*2.59 #Standard arithmatic works on numeric vectors, e.g. convert sq miles to sq km > sq.km #... giving another vector with the calculation performed on each element in turn

[1]  133667.31 1527470.63  295024.31  137539.36  411014.87  269999.73   12973.31    5327.63
[9]  151670.40  152488.84   16705.50  216412.63  146076.00   93993.69  145791.10  213063.76

[17] 104623.05 125674.57 86026.85 27394.43 21385.63 150779.44 217736.12 123584.44 [25] 180486.74 381087.42 200017.93 286298.60 24097.36 20295.24 315114.94 128401.84 [33] 136197.74 183022.35 106764.98 181090.21 251180.79 117412.47 3144.26 80432.45 [41] 199551.73 109411.96 692408.01 219932.44 24887.31 105710.85 176617.28 62628.79 [49] 145438.86 253597.26 > sqrt(sq.km) #Many mathematical functions also apply to each element in turn

[1]  365.60540 1235.90883  543.16140  370.86299  641.10441  519.61498  113.90044   72.99062
[9]  389.44884  390.49819  129.24976  465.20171  382.19890  306.58390  381.82601  461.58830

[17] 323.45487 354.50609 293.30334 165.51263 146.23826 388.30328 466.62203 351.54579 [25] 424.83731 617.32278 447.23364 535.06878 155.23324 142.46136 561.35100 358.33202 [33] 369.04978 427.81111 326.74911 425.54695 501.17940 342.65503 56.07370 283.60615 [41] 446.71213 330.77479 832.11058 468.96955 157.75712 325.13205 420.25859 250.25745 [49] 381.36447 503.58441 > range(state.area) #But some functions return different length vectors (here, just the max & min). [1] 1214 589757 > length(state.area) #and some, like this useful one, just return a single value. [1] 50

Note that the first part of your output may look slightly different to that above. Depending on the width of your screen, the number of elements printed on each line of output may differ. This is the reason for the numbers in square brackets, which are produced when vectors are printed to the screen. These bracketed numbers give the position of the first element on that line, which is a useful visual aid. For instance, looking at the printout of state.name, and counting across from the second line, we can tell that the eighth state is Delaware.
You may occasionally need to create your own vectors from scratch (although most vectors are obtained from processing data in already-existing files). The most commonly used function for constructing vectors is c(), so named because it concatenates objects together. However, if you wish to create vectors consisting of regular sequences of numbers (e.g. 2,4,6,8,10,12, or 1,1,2,2,1,1,2,2) there are several alternative functions you can use, including seq(), rep(), and the : operator.
  Input:
c("one", "two", "three", "pi")  #Make a character vector
c(1,2,3,pi)                     #Make a numeric vector
seq(1,3)                        #Create a sequence of numbers
1:3                             #A shortcut for the same thing (but less flexible)
i <- 1:3                        #You can store a vector
i
i <- c(i,pi)                    #To add more elements, you must assign again, e.g. using c() 
i                             
i <- c(i, "text")               #A vector cannot contain different data types, so ... 
i                               #... R converts all elements to the same type
i+1                             #The numbers are now strings of text: arithmetic is impossible 
rep(1, 10)                      #The "rep" function repeats its first argument
rep(3:1,10)                     #The first argument can also be a vector
huge.vector <- 0:(10^7)         #R can easily cope with very big vectors
#huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
rm(huge.vector)                 #"rm" removes objects. Deleting huge unused objects is sensible
  Result:
> c("one", "two", "three", "pi") #Make a character vector

[1] "one" "two" "three" "pi" > c(1,2,3,pi) #Make a numeric vector [1] 1.000000 2.000000 3.000000 3.141593 > seq(1,3) #Create a sequence of numbers [1] 1 2 3 > 1:3 #A shortcut for the same thing (but less flexible) [1] 1 2 3 > i <- 1:3 #You can store a vector > i [1] 1 2 3 > i <- c(i,pi) #To add more elements, you must assign again, e.g. using c() > i [1] 1.000000 2.000000 3.000000 3.141593 > i <- c(i, "text") #A vector cannot contain different data types, so ... > i #... R converts all elements to the same type [1] "1" "2" "3" "3.14159265358979" "text" > i+1 #The numbers are now strings of text: arithmetic is impossible Error in i + 1 : non-numeric argument to binary operator > rep(1, 10) #The "rep" function repeats its first argument

[1] 1 1 1 1 1 1 1 1 1 1

> rep(3:1,10) #The first argument can also be a vector

[1] 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1

> huge.vector <- 0:(10^7) #R can easily cope with very big vectors > #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers > rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible

Categorical variables in R are stored as a special vector object known as a factor. This is not the same as a character vector filled with a set of names (don't get the two mixed up). In particular, R has to be told that each element can only be one of a number of known levels (e.g. Male or Female). If you try to place a data point with a different, unknown level into the factor, R will complain. When you print a factor to the screen, R will also list the possible levels that factor can take (this may include ones that aren't present)

The factor() function creates a factor and defines the available levels. By default the levels are taken from the ones in the vector***. Actually, you don't often need to use factor(), because when reading data in from a file, R assumes by default that text should be converted to factors (see Statistical Analysis: an Introduction using R/R/R/Data frames). You may need to use as.factor(). Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.

Ordinal variables, that is factors in which the levels have a natural order, are known to R as ordered factors. They can be created in the normal way a factor is created, but in addition specifying ordered=TRUE.
  Input:
state.region #An example of a factor: note that the levels are printed out

state.name #this is *NOT* a factor state.name[1] <- "Any text" #you can replace text in a character vector state.region[1] <- "Any text" #but you can't in a factor state.region[1] <- "South" #this is OK state.abb #this is not a factor, just a character vector

character.vector <- c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Female" , "Male", "Female", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Male", "Male", "Male", "Female" , "Male", "Female", "Male", "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female") #a bit tedious to do all that typing
  1. might be easier to use codes, e.g. 1 for female and 2 for male

Coded <- factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1)) Gender <- factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels

  Result:
When collecting data, it is often the case that certain data points are unknown. This happens for a variety of reasons. For example, when analysing experimental data, we might be recording a number of variables for each experiment (e.g. temperature, time of day, etc.), yet have forgotten (or been unable) to record temperature in one instance. Or when collecting social data on US states, it might be that certain states do not record certain statistics of interest. Another example is the ship passenger data from the sinking of the Titanic, where careful research has identified the ticket class of all 2207 people on board, but not been able to ascertain the age of 10 or so of the victims (see http://www.encyclopedia-titanica.org). We could just omit missing data, but in many cases, we have information for some variables, but not for others. For example, we might not want to completely omit a US state from an analysis, just because it it missing one particular datum of interest. For this reason, R provides a special value, NA, meaning "not available". Any vector, numeric, character, or logical, can have elements which are NA. These can be identified by the function "is.na".
  Input:
some.missing <- c(1,NA)
is.na(some.missing)
  Result:
some.missing <- c(1,NA)

is.na(some.missing) [1] FALSE TRUE

Note that some analyses are hard to do if there are missing data. You can use "complete.cases" or "na.omit" to construct datasets with the missing values omitted.

Statistical Analysis: an Introduction using R/R/Matrices and arrays

R is very particular about what can be contained in a vector. All the elements need to be of the same type, an moreover must be either types of number[5], logical values, or strings of text[6]. If you want a collection of elements which are of different types, or not of one of the allowed vector types, you need to use a list.
  Input:
l1 <- list(a=1, b=1:3)
l2 <- c(sqrt, log) #
  Result:

Statistical Analysis: an Introduction using R/R/Simple plotting

Producing rough plots in R is extremely easy, although it can be time consuming tweaking them to get a certain look. The defaults are usually sensible.
  Input:
stripchart(state.areas, xlab="Area (sq. miles)") #see method="stack" & method="jitter" for others
boxplot(sqrt(state.area))
hist(sqrt(state.area))
hist(sqrt(state.area), 25)
plot(density(sqrt(state.area))
plot(UKDriverDeaths)

qqnorm()
ecdf(
  Result:

Statistical Analysis: an Introduction using R/R/Data frames Statistical Analysis: an Introduction using R/R/Getting data into R Statistical Analysis: an Introduction using R/R/Bivariate plots

  1. Depending on how you are viewing this book, may see a ">" character in front of each command. This is not part of the command to type: it is produced by R itself to prompt you to type something. This character should be automatically omitted if you are copying and pasting from the online version of this book, but if you are reading the paper or pdf version, you should omit the ">" prompt when typing into R.
  2. If you are familiar with computer programming languages, you may be used to using the underscore ("_") character in names. In R, "." is usually used in its place.
  3. you can use either single (') or double (") quotes to delimit text strings, as long as the start and end quotes match
  4. These are special words in R, and cannot be used as names for objects. The objects T and F are temporary shortcuts for TRUE and FALSE, but if you use them, watch out: since T and F are just normal object names you can change their meaning by overwriting them.
  5. There are actually 3 types of allowed numbers: "normal" numbers, complex numbers, and simple integers. This book deals almost exclusively with the first of these.
  6. This is not quite true, but unless you are a computer specialist, you are unlikely to use the final type: a vectors of elements storing "raw" computer bits, see ?raw