# Statistical Analysis: an Introduction using R/R/R fundamentals

## R fundamentals

If you carry out the exercises in all these topics, you should be relatively competent in using R (also see programming). Try the following to see how to use R as a simple calculator^{[1]}.

###### Input:

```
100+2/3
```

###### Result:

    > 100+2/3
    [1] 100.6667

###### Input:

```
#this is a comment: R will ignore it
(100+2)/3 #You can use round brackets to group operations so that they are carried out first
5*10^2 #The symbol * means multiply, and ^ means "to the power", so this gives 5 times (10 squared), i.e. 500
1/0 #R knows about infinity (and minus infinity)
0/0 #undefined results take the value NaN ("not a number")
(0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
```

###### Result:

    > #this is a comment: R will ignore it
    > (100+2)/3 #You can use round brackets to group operations so that they are carried out first
    [1] 34
    > 5*10^2 #The symbol * means multiply, and ^ means "to the power", so this is 5 times (10 squared)
    [1] 500
    > 1/0 #R knows about infinity (and minus infinity)
    [1] Inf
    > 0/0 #undefined results take the value NaN ("not a number")
    [1] NaN
    > (0i-9)^(1/2) #for the mathematically inclined, you can force R to use complex numbers
    [1] 0+3i

- If you don't know anything about complex numbers, don't worry: they are not important here.
- Note that only round brackets should be used to group operations together: curly brackets {} and square brackets [] have other meanings in R.
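
A few further calculator-style operations may be useful alongside the ones above. The following is a minimal sketch using only standard R operators and functions; the particular numbers are arbitrary choices for illustration.

```r
7 %/% 2               # integer division: how many whole times 2 goes into 7
7 %% 2                # the remainder left over after that division
-1/0                  # minus infinity
is.nan(0/0)           # test whether a result is NaN
is.finite(1/0)        # test whether a result is an ordinary finite number
sqrt(as.complex(-9))  # another way to ask R for a complex result
```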

Unlike many statistical packages, R does not usually display the results of analyses you perform. Instead, analyses usually end up by producing an object which can be stored. Results can then be obtained from the object at leisure. For this reason, when doing statistics in R, you will often find yourself naming and storing objects. The name you choose should consist of letters, numbers, and the "." character^{[2]}, and should not start with a number.

To store an object under a name, use one of the assignment arrows `<-` and `->`, as demonstrated in the exercise below. Which sign you use depends on whether you prefer putting the name first or last (it may be helpful to think of `->` as "put into" and `<-` as "set to").

###### Input:

```
0.001 -> small.num #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
big.num <- 10 * 100 #You can put the name first if you reverse the arrow (set big.num to 1000).
big.num+small.num+1 #Now you can treat big.num and small.num as numbers, and use them in calculations
my.result <- big.num+small.num+2 #And you can store the result of any calculation
my.result #To look at the stored object, just type its name
pi #There are some named objects that R provides for you
```

###### Result:

    > 0.001 -> small.num #Store the number 0.001 under the name "small.num" (i.e. put 0.001 into small.num)
    > big.num <- 10 * 100 #You can put the name first if you reverse the arrow (set big.num to 1000).
    > big.num+small.num+1 #Now you can treat big.num and small.num as numbers, and use them in calculations
    [1] 1001.001
    > my.result <- big.num+small.num+2 #And you can store the result of any calculation
    > my.result #To look at the stored object, just type its name
    [1] 1002.001
    > pi #There are some named objects that R provides for you
    [1] 3.141593
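
As a quick check that the two arrows really are interchangeable, the sketch below stores the same value both ways and compares the results. The object names `a.left`, `a.right` and `same` are made up for this illustration; `identical()`, `ls()` and `rm()` are standard R functions.

```r
a.left <- 10 * 100                   # name first, arrow pointing left
10 * 100 -> a.right                  # name last, arrow pointing right
same <- identical(a.left, a.right)   # compare the two stored objects
same                                 # TRUE: both names hold the same value
ls()                                 # list the names of the objects stored so far
rm(a.left, a.right)                  # remove objects that are no longer needed
```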

**Functions.** Nearly everything useful that you will do in R is carried out using a function, and many are available in R by default. You can use (or "call") a function by typing its name followed by a pair of round brackets. For instance, the start-up text mentions the following function, which you might find useful if you want to reference R in published work:

###### Input:

```
citation()
```

###### Result:

    > citation()

    To cite R in publications use:

      R Development Core Team (2008). R: A language and environment for
      statistical computing. R Foundation for Statistical Computing,
      Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

    A BibTeX entry for LaTeX users is

      @Manual{,
        url = {http://www.R-project.org},
        title = {R: A Language and Environment for Statistical Computing},
        author = {{R Development Core Team}},
        organization = {R Foundation for Statistical Computing},
        address = {Vienna, Austria},
        year = {2008},
        note = {{ISBN} 3-900051-07-0},
      }

    We have invested a lot of time and effort in creating R, please cite it
    when using it for data analysis. See also ‘citation("pkgname")’ for
    citing R packages.

What a function does usually depends on the **arguments** that you provide to it. Arguments are placed inside the round brackets, separated by commas. Many functions have one or more *optional* arguments: that is, you can choose whether or not to provide them. An example of this is the `citation()` function. It can take an optional argument giving the name of an R add-on package. If you do not provide an optional argument, there is usually an assumed default value (in the case of `citation()`, this default value is `"base"`, i.e. provide the citation reference for the base package: the package which provides most of the foundations of the R language).

Most arguments to a function are *named*. For example, the first argument of the citation function is named *package*. To provide extra clarity, when using a function you can provide arguments in the longer form *name=value*. Thus `citation("base")` does the same as `citation(package="base")`.

If a function can take more than one argument, using the long form also allows you to change the order of arguments, as shown in the example code below.

###### Input:

```
citation("base") #Does the same as citation(), because the default for the first argument is "base"
#Note: quotation marks are needed in this particular case (see discussion below)
citation("datasets") #Find the citation for another package (in this case, the result is very similar)
sqrt(25) #A different function: "sqrt" takes a single argument, returning its square root.
sqrt(25-9) #An argument can contain arithmetic and so forth
sqrt(25-9)+100 #The result of a function can be used as part of a further analysis
max(-10, 0.2, 4.5) #This function returns the maximum value of all its arguments
sqrt(2 * max(-10, 0.2, 4.5)) #You can use results of functions as arguments to other functions
x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100 #... and you can store the results of any of these calculations
x
log(100) #This function returns the logarithm of its first argument
log(2.718282) #By default this is the natural logarithm (base "e")
log(100, base=10) #But you can change the base of the logarithm using the "base" argument
log(100, 10) #This does the same, because "base" is the second argument of the log function
log(base=10, 100) #To have the base as the first argument, you have to use the form name=value
```

###### Result:

    > citation("base") #Does the same as citation(), because the default for the first argument is "base"

    To cite R in publications use:

      R Development Core Team (2008). R: A language and environment for
      statistical computing. R Foundation for Statistical Computing,
      Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

    A BibTeX entry for LaTeX users is

      @Manual{,
        title = {R: A Language and Environment for Statistical Computing},
        author = {{R Development Core Team}},
        organization = {R Foundation for Statistical Computing},
        address = {Vienna, Austria},
        year = {2008},
        note = {{ISBN} 3-900051-07-0},
        url = {http://www.R-project.org},
      }

    We have invested a lot of time and effort in creating R, please cite it
    when using it for data analysis. See also ‘citation("pkgname")’ for
    citing R packages.

    > #Note: quotation marks are needed in this particular case (see discussion below)
    > citation("datasets") #Find the citation for another package (in this case, the result is very similar)

    The 'datasets' package is part of R. To cite R in publications use:

      R Development Core Team (2008). R: A language and environment for
      statistical computing. R Foundation for Statistical Computing,
      Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

    A BibTeX entry for LaTeX users is

      @Manual{,
        title = {R: A Language and Environment for Statistical Computing},
        author = {{R Development Core Team}},
        organization = {R Foundation for Statistical Computing},
        address = {Vienna, Austria},
        year = {2008},
        note = {{ISBN} 3-900051-07-0},
        url = {http://www.R-project.org},
      }

    We have invested a lot of time and effort in creating R, please cite it
    when using it for data analysis. See also ‘citation("pkgname")’ for
    citing R packages.

    > sqrt(25) #A different function: "sqrt" takes a single argument, returning its square root.
    [1] 5
    > sqrt(25-9) #An argument can contain arithmetic and so forth
    [1] 4
    > sqrt(25-9)+100 #The result of a function can be used as part of a further analysis
    [1] 104
    > max(-10, 0.2, 4.5) #This function returns the maximum value of all its arguments
    [1] 4.5
    > sqrt(2 * max(-10, 0.2, 4.5)) #You can use results of functions as arguments to other functions
    [1] 3
    > x <- sqrt(2 * max(-10, 0.2, 4.5)) + 100 #... and you can store the results of any of these calculations
    > x
    [1] 103
    > log(100) #This function returns the logarithm of its first argument
    [1] 4.60517
    > log(2.718282) #By default this is the natural logarithm (base "e")
    [1] 1
    > log(100, base=10) #But you can change the base of the logarithm using the "base" argument
    [1] 2
    > log(100, 10) #This does the same, because "base" is the second argument of the log function
    [1] 2
    > log(base=10, 100) #To have the base as the first argument, you have to use the form name=value
    [1] 2
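
If you are unsure what a function's arguments are called, or what their default values are, R itself can tell you. The sketch below uses the standard functions `args()` and `formals()` on the functions already met above; the name `arg.names` is just an illustrative choice.

```r
args(log)                               # prints the argument names and defaults of log()
arg.names <- names(formals(citation))   # the names of the arguments of citation()
arg.names                               # the first of these is "package"
formals(citation)$package               # its default value, "base"
log(100, base=10) == log(base=10, 100)  # named arguments can be given in any order
```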

Note that when typing normal text (as in the name of a package), it needs to be surrounded by quotation marks^{[3]}, to avoid confusion with the names of objects. In other words, in R `citation` refers to a function, whereas `"citation"` is a "string" of text. This is useful, for example, when providing titles for plots.
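
The difference between a name and a string can be checked directly. This is a small sketch using the standard functions `class()` and `nchar()`; nothing here is specific to `citation` beyond its use as an example.

```r
class(citation)     # "function": the bare name refers to an object
class("citation")   # "character": with quotes it is just a piece of text
nchar("citation")   # strings can be examined, e.g. by counting their characters
'text' == "text"    # single and double quotes produce the same string
```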

You will probably find that one of the trickiest aspects of getting to know R is knowing which function to use in a particular situation. Fortunately, R not only provides documentation for all its functions, but also ways of searching through the documentation, as well as other ways of getting help.

Some versions of R give easy access to help files without having to type in commands (for example, versions which provide menu bars usually have a "help" menu, and the Macintosh interface also has a help box in the top right hand corner). However, this functionality can always be accessed by typing in the appropriate commands. You might like to type some or all of the following into an R session (no output is listed here because the result will depend on your R system).

```
help.start() #A web-based set of help pages (try the link to "An Introduction to R")
help(sqrt) #Show details of the "sqrt" and similar functions
?sqrt #A shortcut to do the same thing
example(sqrt) #run the examples on the bottom of the help page for "sqrt"
help.search("maximum") #gives a list of functions involving the word "maximum", but oddly, "max" is not in there!
### The next line is commented out to reduce internet load. To try it, remove the first # sign.
#RSiteSearch("maximum") #search the R web site for anything to do with "maximum". Probably overkill here!
```

You can, however, find the `max()` function by looking at the "See also" section of the help file for `which.max()`. Not ideal!
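
When you can guess part of a function's name, another option is `apropos()`, which searches the names of the objects currently available rather than the help pages. This is offered only as a supplementary sketch; the exact list returned depends on which packages are loaded, and the name `found` is an illustrative choice.

```r
found <- apropos("max")   # names of available objects that contain "max"
found                     # print the matching names
"max" %in% found          # TRUE: the max() function is among them
```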

A fundamental type of object in R is the *vector*, used to store multiple measurements of the same type (e.g. data variables). There are several different sorts of data that can be stored in a vector. Most common is the **numeric vector**, in which each element of the vector is simply a number. Other commonly used types of vector are **character vectors** (where each element is a piece of text) and **logical vectors** (where each element is either `TRUE` or `FALSE`^{[4]}). In this topic we will use some example vectors provided by the "datasets" package, containing data on States of the USA (see `?state`).

R is an inherently vector-based program; in fact the numbers we have been using in previous calculations are just treated as vectors with a single element. This means that most basic functions in R will behave sensibly when given a vector as an argument, as shown below.

###### Input:

```
state.area #a NUMERIC vector giving the area of US states, in square miles
state.name #a CHARACTER vector (note the quote marks) of state names
sq.km <- state.area*2.59 #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
sq.km #... the new vector has the calculation applied to each element in turn
sqrt(sq.km) #Many mathematical functions also apply to each element in turn
range(state.area) #But some functions return different length vectors (here, just the max & min).
length(state.area) #and some, like this useful one, just return a single value.
```

###### Result:

    > state.area #a NUMERIC vector giving the area of US states, in square miles
     [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450  83557  56400
    [14]  36291  56290  82264  40395  48523  33215  10577   8257  58216  84068  47716  69686 147138
    [27]  77227 110540   9304   7836 121666  49576  52586  70665  41222  69919  96981  45333   1214
    [40]  31055  77047  42244 267339  84916   9609  40815  68192  24181  56154  97914
    > state.name #a CHARACTER vector (note the quote marks) of state names
     [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
     [5] "California"     "Colorado"       "Connecticut"    "Delaware"
     [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
    [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
    [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
    [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
    [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
    [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
    [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
    [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
    [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
    [45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
    [49] "Wisconsin"      "Wyoming"
    > sq.km <- state.area*2.59 #Arithmetic works on numeric vectors, e.g. convert sq miles to sq km
    > sq.km #... the new vector has the calculation applied to each element in turn
     [1]  133667.31 1527470.63  295024.31  137539.36  411014.87  269999.73   12973.31    5327.63
     [9]  151670.40  152488.84   16705.50  216412.63  146076.00   93993.69  145791.10  213063.76
    [17]  104623.05  125674.57   86026.85   27394.43   21385.63  150779.44  217736.12  123584.44
    [25]  180486.74  381087.42  200017.93  286298.60   24097.36   20295.24  315114.94  128401.84
    [33]  136197.74  183022.35  106764.98  181090.21  251180.79  117412.47    3144.26   80432.45
    [41]  199551.73  109411.96  692408.01  219932.44   24887.31  105710.85  176617.28   62628.79
    [49]  145438.86  253597.26
    > sqrt(sq.km) #Many mathematical functions also apply to each element in turn
     [1]  365.60540 1235.90883  543.16140  370.86299  641.10441  519.61498  113.90044   72.99062
     [9]  389.44884  390.49819  129.24976  465.20171  382.19890  306.58390  381.82601  461.58830
    [17]  323.45487  354.50609  293.30334  165.51263  146.23826  388.30328  466.62203  351.54579
    [25]  424.83731  617.32278  447.23364  535.06878  155.23324  142.46136  561.35100  358.33202
    [33]  369.04978  427.81111  326.74911  425.54695  501.17940  342.65503   56.07370  283.60615
    [41]  446.71213  330.77479  832.11058  468.96955  157.75712  325.13205  420.25859  250.25745
    [49]  381.36447  503.58441
    > range(state.area) #But some functions return different length vectors (here, just the max & min).
    [1]   1214 589757
    > length(state.area) #and some, like this useful one, just return a single value.
    [1] 50
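
Because functions work on whole vectors, two vectors of the same length can be used together. The following sketch picks out state names according to their areas. It relies on square-bracket indexing and on `which.max()` and `which.min()`, which the text above does not cover in detail, and the name `big` and the 100000 square mile cut-off are arbitrary choices for illustration.

```r
state.name[which.max(state.area)]   # name of the state with the largest area
state.name[which.min(state.area)]   # name of the state with the smallest area
big <- state.area > 100000          # a LOGICAL vector: TRUE where the area exceeds 100000 sq miles
sum(big)                            # TRUE counts as 1, so this counts the large states
state.name[big]                     # a logical vector can pick out elements of another vector
```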

You can create your own vectors using the function `c()`, so named because it concatenates objects together. However, if you wish to create vectors consisting of regular sequences of numbers (e.g. 2,4,6,8,10,12, or 1,1,2,2,1,1,2,2) there are several alternative functions you can use, including `seq()`, `rep()`, and the `:` operator.

###### Input:

```
c("one", "two", "three", "pi") #Make a character vector
c(1,2,3,pi) #Make a numeric vector
seq(1,3) #Create a sequence of numbers
1:3 #A shortcut for the same thing (but less flexible)
i <- 1:3 #You can store a vector
i
i <- c(i,pi) #To add more elements, you must assign again, e.g. using c()
i
i <- c(i, "text") #A vector cannot contain different data types, so ...
i #... R converts all elements to the same type
i+1 #The numbers are now strings of text: arithmetic is impossible
rep(1, 10) #The "rep" function repeats its first argument
rep(3:1,10) #The first argument can also be a vector
huge.vector <- 0:(10^7) #R can easily cope with very big vectors
#huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
```

###### Result:

    > c("one", "two", "three", "pi") #Make a character vector
    [1] "one"   "two"   "three" "pi"
    > c(1,2,3,pi) #Make a numeric vector
    [1] 1.000000 2.000000 3.000000 3.141593
    > seq(1,3) #Create a sequence of numbers
    [1] 1 2 3
    > 1:3 #A shortcut for the same thing (but less flexible)
    [1] 1 2 3
    > i <- 1:3 #You can store a vector
    > i
    [1] 1 2 3
    > i <- c(i,pi) #To add more elements, you must assign again, e.g. using c()
    > i
    [1] 1.000000 2.000000 3.000000 3.141593
    > i <- c(i, "text") #A vector cannot contain different data types, so ...
    > i #... R converts all elements to the same type
    [1] "1"                "2"                "3"                "3.14159265358979" "text"
    > i+1 #The numbers are now strings of text: arithmetic is impossible
    Error in i + 1 : non-numeric argument to binary operator
    > rep(1, 10) #The "rep" function repeats its first argument
    [1] 1 1 1 1 1 1 1 1 1 1
    > rep(3:1,10) #The first argument can also be a vector
    [1] 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1
    > huge.vector <- 0:(10^7) #R can easily cope with very big vectors
    > #huge.vector #VERY BAD IDEA TO UNCOMMENT THIS, unless you want to print out 10 million numbers
    > rm(huge.vector) #"rm" removes objects. Deleting huge unused objects is sensible
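
The two regular sequences mentioned earlier (2,4,6,8,10,12 and 1,1,2,2,1,1,2,2) can be produced as follows. This is one possible sketch among several; `by`, `length.out`, `each` and `times` are standard arguments of `seq()` and `rep()`.

```r
seq(2, 12, by=2)                  # 2 4 6 8 10 12: count from 2 to 12 in steps of 2
seq(0, 1, length.out=5)           # 5 evenly spaced numbers from 0 to 1
rep(1:2, each=2)                  # 1 1 2 2: repeat each element twice
rep(rep(1:2, each=2), times=2)    # 1 1 2 2 1 1 2 2: then repeat the whole pattern twice
10:1                              # the colon operator can also count downwards
```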

Categorical variables are stored in R as a **factor**. This is *not* the same as a character vector filled with a set of names (don't get the two mixed up). In particular, R has to be told that each element can only be one of a number of known *levels* (e.g. *Male* or *Female*). If you try to place a data point with a different, unknown level into the factor, R will complain. When you print a factor to the screen, R will also list the possible levels that factor can take (this may include ones that aren't present).

The `factor()` function creates a factor and defines the available levels. By default the levels are taken from the ones in the vector. Actually, you don't often need to use `factor()`, because when reading data in from a file, versions of R before 4.0.0 assume by default that text should be converted to factors (see Statistical Analysis: an Introduction using R/R/Data frames). From R 4.0.0 onwards text is left as character vectors by default, so you may need to use `as.factor()`. Internally, R stores the levels as numbers from 1 upwards, but it is not always obvious which number corresponds to which level, and it should not normally be necessary to know.

Ordinal variables, that is factors in which the levels have a natural order, are known to R as **ordered factors**. They can be created in the normal way a factor is created, but in addition specifying `ordered=TRUE`.
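
As a minimal sketch of the ideas above, the following creates an ordered factor from a short character vector. The data and the names `sizes` and `size.factor` are invented for illustration; `factor()`, `levels()`, `table()` and `as.integer()` are standard R functions.

```r
sizes <- c("small", "large", "medium", "small")   # a plain character vector
size.factor <- factor(sizes, levels=c("small", "medium", "large"), ordered=TRUE)
size.factor                 # printing shows the values and the ordered levels
levels(size.factor)         # the allowed levels, in the order given
table(size.factor)          # count how many elements fall in each level
size.factor < "large"       # comparisons make sense for ordered factors
as.integer(size.factor)     # the internal numeric codes (rarely needed)
```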

###### Input:

    state.region #An example of a factor: note that the levels are printed out
    state.name #this is *NOT* a factor
    state.name[1] <- "Any text" #you can replace text in a character vector
    state.region[1] <- "Any text" #but you can't in a factor
    state.region[1] <- "South" #this is OK
    state.abb #this is not a factor, just a character vector
    character.vector <- c("Female", "Female", "Male", "Male", "Male", "Female", "Female", "Male",
                          "Male", "Male", "Male", "Male", "Female", "Female", "Male", "Female",
                          "Female", "Male", "Male", "Male", "Male", "Female", "Female", "Female",
                          "Female", "Male", "Male", "Male", "Female", "Male", "Female", "Male",
                          "Male", "Male", "Male", "Male", "Female", "Male", "Male", "Male",
                          "Male", "Female", "Female", "Female")
    #a bit tedious to do all that typing
    #might be easier to use codes, e.g. 1 for female and 2 for male
    Coded <- factor(c(1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
                      1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1))
    Gender <- factor(Coded, labels=c("Female", "Male")) #we can then convert this to named levels


Data sets are often incomplete. For example, we might have intended to measure several variables on each occasion (*temperature*, *time of day*, etc.), yet have forgotten (or been unable) to record *temperature* in one instance. Or when collecting social data on US states, it might be that certain states do not record certain statistics of interest. Another example is the ship passenger data from the sinking of the Titanic, where careful research has identified the ticket class of all 2207 people on board, but not been able to ascertain the age of 10 or so of the victims (see http://www.encyclopedia-titanica.org).

We could just omit missing data, but in many cases, we have information for *some* variables, but not for others. For example, we might not want to completely omit a US state from an analysis, just because it is missing one particular datum of interest. For this reason, R provides a special value, *NA*, meaning "not available". Any vector, whether numeric, character, or logical, can have elements which are *NA*. These can be identified by the function `is.na()`.

###### Input:

```
some.missing <- c(1,NA)
is.na(some.missing)
```

###### Result:

    > some.missing <- c(1,NA)
    > is.na(some.missing)
    [1] FALSE  TRUE
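
Missing values affect calculations, so it helps to see how common functions treat them. The sketch below uses invented temperature readings; `na.rm` is a standard argument of `mean()` and several similar functions.

```r
temperature <- c(21.5, NA, 19.0, 23.5)   # made-up measurements, one of them missing
mean(temperature)                        # NA: the answer is unknown if any value is unknown
mean(temperature, na.rm=TRUE)            # remove the missing values before averaging
sum(is.na(temperature))                  # count the missing values
temperature[!is.na(temperature)]         # keep only the values that are present
```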


The elements of a vector can only be numbers^{[5]}, logical values, or strings of text^{[6]}.

If you want a collection of elements which are of different types, or not of one of the allowed vector types, you need to use a **list**.

###### Input:

```
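l1 <- list(a=1, b=1:3) #a list with two named elements, of different lengths
l2 <- c(sqrt, log) #functions are not an allowed vector type, so c() produces a list here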
```
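
Once a list has been made, its elements can be picked out by name or by position. The following is a small sketch; the list contents and the names `my.list` and `fun.list` are invented for illustration.

```r
my.list <- list(name="Alaska", area=589757, flags=c(TRUE, FALSE))  # mixed types in one object
my.list$name               # pick out an element by name using $
my.list[["area"]]          # double square brackets do the same job
length(my.list)            # the number of elements in the list (here 3)
fun.list <- c(sqrt, log)   # a list holding two functions
is.list(fun.list)          # TRUE
fun.list[[1]](16)          # call the first stored function, i.e. sqrt(16)
```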


###### Input:

```
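stripchart(state.area, xlab="Area (sq. miles)") #see method="stack" & method="jitter" for others
boxplot(sqrt(state.area))
hist(sqrt(state.area))
hist(sqrt(state.area), 25) #the second argument suggests the number of bars
plot(density(sqrt(state.area)))
plot(UKDriverDeaths) #a time series from the "datasets" package

qqnorm(sqrt(state.area)) #normal Q-Q plot; the data argument is an assumption, since none was given
plot(ecdf(sqrt(state.area))) #empirical cumulative distribution; the argument is likewise assumed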
```
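
Plotting functions also fit the pattern described earlier, in which an analysis produces an object that can be stored and examined. As a hedged illustration, the sketch below asks `hist()` for its results without drawing anything (using its standard `plot=FALSE` argument) and uses `fivenum()` to show the numbers that underlie a boxplot. The name `h` is an arbitrary choice.

```r
h <- hist(sqrt(state.area), plot=FALSE)   # compute the histogram without drawing it
h$breaks                                  # the boundaries of the bars
h$counts                                  # the number of states falling in each bar
sum(h$counts)                             # every state is counted exactly once
fivenum(sqrt(state.area))                 # minimum, lower hinge, median, upper hinge, maximum
```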


1. Depending on how you are viewing this book, you may see a ">" character in front of each command. This is not part of the command to type: it is produced by R itself to prompt you to type something. This character should be automatically omitted if you are copying and pasting from the online version of this book, but if you are reading the paper or pdf version, you should omit the ">" prompt when typing into R.
2. If you are familiar with computer programming languages, you may be used to using the underscore ("_") character in names. In R, "." is usually used in its place.
3. You can use either single (') or double (") quotes to delimit text strings, as long as the start and end quotes match.
4. These are special words in R, and cannot be used as names for objects. The objects `T` and `F` are predefined shortcuts for `TRUE` and `FALSE`, but if you use them, watch out: since `T` and `F` are just normal object names, you can change their meaning by overwriting them.
5. There are actually 3 types of allowed numbers: "normal" numbers, complex numbers, and simple integers. This book deals almost exclusively with the first of these.
6. This is not quite true, but unless you are a computer specialist, you are unlikely to use the final type: a vector of elements storing "raw" computer bits, see `?raw`.