Biostatistics with R/Printable version


Biostatistics with R

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Biostatistics_with_R

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.


Biostatistics with R authors

License

The text of this book is released under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License. The particular versions of those licenses that are being used can be found at:

Wikibooks:Creative Commons Attribution-ShareAlike 3.0 Unported License
Wikibooks:GNU Free Documentation License

Images used in this document are available under various licenses. Clicking on the image will take you to a description page where the licensing information is displayed.

Authors

List



A Brief Introduction To R/The First Step in R

What is R?

How to install R

RStudio

Use R package

Data Entry to R

Some Special Values

Reference



Import

Why R for biostatistics?

R has several advantages over common statistical packages such as SPSS, SAS and MINITAB. It is

  • powerful
  • available for many platforms (Mac OS X, Windows, Linux etc.)
  • programmable
  • non-commercial
  • extensively documented

Obtaining R/Installation

You may refer to the R FAQ.

Data Import

The data sets on Wiley's website are available in CSV, Excel, MINITAB, SAS and SPSS formats. Although data saved in some of the other formats can be imported into R (for example, SAS and SPSS files using the foreign package), you should download the data in CSV format, because CSV is the easiest format to process in R.

For example, suppose you would like to import the "Large Data Set" data file. Assuming the downloaded data file (LDS_C02_NCBIRTH800.csv) is stored in the directory "/Desktop", it can be imported into R as a data frame called "largedataset" using the following syntax:

> largedataset <- read.csv("/Desktop/LDS_C02_NCBIRTH800.csv", header=TRUE,na.strings="NA")

If you prefer to choose the data file the standard "point-and-click" GUI way, you may use the function file.choose(), i.e.

> largedataset <- read.csv(file.choose(), header=TRUE,na.strings="NA")

You should now have imported the data from the CSV file into a data frame called "largedataset". You can look inside the data frame by calling its name:

> largedataset
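
Printing a large data frame in full can flood the console. The following is a minimal sketch of a gentler first look. It assumes a tiny stand-in CSV written to a temporary file; the object name demo_data and the column names are hypothetical, not those of the Wiley file.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a
# temporary location so that the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sex,weeks,weight",
             "1,39,7.5",
             "2,40,NA",
             "1,38,6.9"),
           csv_path)

# Same call as in the text; the string "NA" is read as a missing value.
demo_data <- read.csv(csv_path, header = TRUE, na.strings = "NA")

str(demo_data)    # column names, types and the first few values
head(demo_data)   # first rows only
nrow(demo_data)   # number of observations
```

For a real data set, replace csv_path with the path to your own file.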

You can access the variable (in computer lingo, column) "sex" inside the largedataset dataframe by

largedataset$sex

For example, to count the frequency of each value of sex:

> table(largedataset$sex)

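You can attach the data frame so that you can call its variables directly. Remember to detach it afterwards, because attached variables are easily confused with other objects of the same name:

> attach(largedataset)
> table(sex)
> detach(largedataset) # cancel attaching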
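
If you would rather not attach the data frame at all, with() gives the same convenience for a single expression. This is only a sketch: demo_data is a hypothetical stand-in for largedataset.

```r
# Hypothetical data frame standing in for largedataset
demo_data <- data.frame(sex = c(1, 2, 1, 1, 2))

# with() evaluates the expression using the columns of the data frame,
# without changing the search path, so nothing needs to be detached.
sex_counts <- with(demo_data, table(sex))
sex_counts
```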

Basic data management

R is designed to be an analysis system rather than an integrated environment such as SPSS. Unlike SPSS, R doesn't have a spreadsheet-like environment for data input. Usually, data are entered using other software (e.g. a database, or spreadsheet software such as OpenOffice.org Calc) and then imported into R as described above. For quick one-off calculations, you can do the data entry in R. For example, if you want to calculate the mean age of ten patients (30, 31, 32, 34, 35, 36, 37, 30, 40, 45), you can enter the data into R using the c() function.

> pt_age <- c(30,31,32,34,35,36,37,30,40,45)

You may call the newly created object pt_age by its name...

> pt_age

...and then calculate the mean age of the ten patients.

> mean(pt_age)
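
As a quick check on the result, the mean can also be worked out step by step from the same ten ages. This is a small sketch using only base R functions.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)

n_patients <- length(pt_age)   # 10 patients
total_age  <- sum(pt_age)      # 350
total_age / n_patients         # 35, the same value as mean(pt_age)

summary(pt_age)                # minimum, quartiles, median, mean and maximum
```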



Introduction to Biostatistics

REVIEW EXERCISES

1. Explain what is meant by descriptive statistics.

2. Explain what is meant by inferential statistics.

3. Define: (a) Statistics (b) Biostatistics (c) Variable (d) Quantitative variable (e) Qualitative variable (f) Random variable (g) Population (h) Finite population (i) Infinite population (j) Sample (k) Discrete variable (l) Continuous variable (m) Simple random sample (n) Sampling with replacement (o) Sampling without replacement

4. Define the word measurement.

5. List, describe, and compare the four measurement scales.

6. For each of the following variables, indicate whether it is quantitative or qualitative and specify the measurement scale that is employed when taking measurements on each: (a) Class standing of the members of this class relative to each other (b) Admitting diagnosis of patients admitted to a mental health clinic (c) Weights of babies born in a hospital during a year (d) Gender of babies born in a hospital during a year (e) Range of motion of elbow joint of students enrolled in a university health sciences curriculum (f) Under-arm temperature of day-old infants born in a hospital

7. For each of the following situations, answer questions a through e: (a) What is the sample in the study? (b) What is the population? (c) What is the variable of interest? (d) How many measurements were used in calculating the reported results? (e) What measurement scale was used? Situation A. A study of 300 households in a small southern town revealed that 20 percent had at least one school-age child present. Situation B. A study of 250 patients admitted to a hospital during the past year revealed that, on the average, the patients lived 15 miles from the hospital.

8. Consider the two situations given in Exercise 7. For Situation A describe how you would use a stratified random sample to collect the data. For Situation B describe how you would use systematic sampling of patient records to collect the data.



Descriptive Statistics

Summary of Formulas with R

Formula Number Name Formula Formula with R
2.3.1 Class interval width using Sturges’s Rule Example
2.4.1 Mean of a population Example
2.4.2 Skewness Example
2.4.2 Mean of a sample Example
2.5.1 Range Example
2.5.2 Sample variance Example
2.5.3 Population variance Example
2.5.4 Standard deviation Example
2.5.5 Coefficient of variation Example
2.5.6 Quartile location in ordered array Example
2.5.7 Interquartile range Example
2.5.8 Kurtosis Example
Symbol Key
  • C.V. = coefficient of variation
  • IQR = interquartile range
  • k = number of class intervals
  • μ = population mean
  • N = population size
  • n = sample size
  • (n − 1) = degrees of freedom
  • Q1 = first quartile
  • Q2 = second quartile = median
  • Q3 = third quartile
  • R = range
  • s = standard deviation
  • s² = sample variance
  • σ² = population variance
  • x_i = data observation
  • x_L = largest data point
  • x_S = smallest data point
  • x̄ = sample mean
  • w = class width
Example
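
As a rough sketch of how several of the measures listed above can be computed in base R, the code below reuses the ten patient ages from the introduction. The variable names are illustrative assumptions, and Sturges's rule is taken in its usual form k = 1 + 3.322 log10(n). Skewness and kurtosis are left out because base R has no dedicated function for them.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n <- length(pt_age)

k <- 1 + 3.322 * log10(n)              # Sturges's rule: number of class intervals
R <- max(pt_age) - min(pt_age)         # range
w <- R / k                             # class interval width

x_bar  <- mean(pt_age)                 # sample mean
s2     <- var(pt_age)                  # sample variance (divides by n - 1)
s      <- sd(pt_age)                   # sample standard deviation
sigma2 <- s2 * (n - 1) / n             # variance if the data were a whole population
cv     <- s / x_bar * 100              # coefficient of variation, in percent

q   <- quantile(pt_age, c(0.25, 0.50, 0.75))   # Q1, Q2 (median), Q3
iqr <- IQR(pt_age)                             # Q3 - Q1
```

Note that quantile() uses its own default interpolation rule (type 7), so hand calculations based on an (n + 1)/4 location rule may differ slightly; quantile(pt_age, 0.25, type = 6) uses positions of the form (n + 1)p and is probably the closer match.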



The Ordered Array

The Frequency Distribution

Example 2.2.1 details the procedure for sorting an array. The array is a series of ages of subjects who took part in two kinds of smoking cessation programs. Suppose you have already imported the data set using the following command:

> SmokeCProg <- read.csv("/EXA_C01_S04_01.csv", header=TRUE, na.strings="NA")

It is better to use a descriptive name (SmokeCProg for Smoking Cessation Program) than a commonly used placeholder name such as x or y. We can obtain a sorted array of ages using the following command:

> sort(SmokeCProg$AGE)

The frequency distribution of ages shown in Table 2.3.1 can be obtained using:

> table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      46      70      45      16       1 

The cut() command breaks up the AGE variable into classes based on the break points (0,39,49,59,69,79,89) provided. Each class excludes its lower break point and includes its upper one, as labels such as (39,49] indicate. Table 2.3.2 provides the frequency table of age.

As suggested by Venables et al. in the book "An Introduction to R", statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Compared to other statistical packages, R gives only minimal output. We will demonstrate this important characteristic in this example. In the previous example, we calculated the frequency distribution of ages using the table() and cut() commands. We can store the results in an object called "AgeFreqTable" using:

> AgeFreqTable <- table(cut(SmokeCProg$AGE, breaks=c(0,39,49,59,69,79,89)))

You will get no output until you call the object "AgeFreqTable":

> AgeFreqTable
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      46      70      45      16       1

In order to obtain the cumulative frequency, we can process the object "AgeFreqTable" using the cumsum() command:

> cumsum(AgeFreqTable)
(0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
    11      57     127     172     188     189

Before we move on to the calculation of relative frequency, we can obtain the total number of observations in a variable using the length() function:

> length(SmokeCProg$AGE)
[1] 189

We can calculate the relative frequency by dividing each item in the object "AgeFreqTable" by the total number of observations using:

> AgeFreqTable/length(SmokeCProg$AGE)
    (0,39]     (39,49]     (49,59]     (59,69]     (69,79]     (79,89] 
0.058201058 0.243386243 0.370370370 0.238095238 0.084656085 0.005291005

Similarly, the cumulative relative frequency can be calculated using:

> cumsum(AgeFreqTable)/length(SmokeCProg$AGE)
    (0,39]    (39,49]    (49,59]    (59,69]    (69,79]    (79,89] 
0.05820106 0.30158730 0.67195767 0.91005291 0.99470899 1.00000000
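
The four columns computed so far can also be gathered into a single table. The sketch below re-enters the class frequencies shown in the output above by hand, so that it runs without the data file; the object names age_freq and freq_table are illustrative only.

```r
# Class frequencies copied from the table() output above
age_freq <- c("(0,39]" = 11, "(39,49]" = 46, "(49,59]" = 70,
              "(59,69]" = 45, "(69,79]" = 16, "(79,89]" = 1)

n_total <- sum(age_freq)   # 189 observations in total

freq_table <- data.frame(
  frequency         = age_freq,
  cum_frequency     = cumsum(age_freq),
  rel_frequency     = round(age_freq / n_total, 4),
  cum_rel_frequency = round(cumsum(age_freq) / n_total, 4)
)
freq_table
```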

If you would like to round the relative frequencies to 4 decimal places, you can use the round() function:

> round(AgeFreqTable/length(SmokeCProg$AGE), digits=4)
 (0,39] (39,49] (49,59] (59,69] (69,79] (79,89] 
 0.0582  0.2434  0.3704  0.2381  0.0847  0.0053 

Alternatively, you can store the relative frequencies in a new object and then process that object with the round() function:

> AgeRelFreqTable <- AgeFreqTable/length(SmokeCProg$AGE)
> round(AgeRelFreqTable, digits=4)

Exercise: Try to round the cumulative relative frequency to 4 decimal places using an R command.

To plot a histogram, you can use the hist() function, e.g.

> hist(SmokeCProg$AGE)

You can customize the histogram by adding some arguments (i.e. options); type ?hist to learn more about the arguments of the hist() function. For example, if you want to plot a histogram with about five bars (similar to Figure 2.3.2):

> hist(SmokeCProg$AGE, breaks=5)

You can add more arguments to the hist() function, e.g.

> hist(SmokeCProg$AGE, breaks=5, ylim=c(0,70), main="Histogram of Ages of 189 subjects", col="red", xlab="Age")
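
When breaks is a single number, hist() treats it only as a suggestion and chooses "pretty" break points itself. If the bars must match particular class intervals, a vector of break points can be supplied instead. The sketch below assumes simulated ages (not the SmokeCProg data) and hypothetical object names.

```r
set.seed(1)
# Hypothetical ages between 30 and 89, standing in for SmokeCProg$AGE
ages <- round(runif(50, min = 30, max = 89))

# Explicit class limits: (29,39], (39,49], ..., (79,89]
class_breaks <- c(29, 39, 49, 59, 69, 79, 89)

# plot = FALSE returns the computed histogram without drawing it,
# which is handy for checking the class counts first.
h <- hist(ages, breaks = class_breaks, plot = FALSE)
h$counts
```

Dropping plot = FALSE (or calling plot(h)) draws the histogram with exactly these classes.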

Remember to always consult the documentation (e.g. ?hist or help.search("histogram")) when you have a question. Most of the time, you can find the answer in the help documents. For example, suppose you don't know how to plot a stem-and-leaf display of your data, and you don't even know the name of the function. You can use help.search() to search for the keyword "stem", i.e.

> help.search("stem")

A function called stem() should be in the results. We can then try to use this function to visualize our data:

> stem(SmokeCProg$AGE)
The decimal point is 1 digit(s) to the right of the |
 3 | 04
 3 | 577888899
 4 | 00223333334444444
 4 | 55566666677777788888889999999
 5 | 0000000011112222223333333333333333344444444444
 5 | 555666666777777788999999
 6 | 000011111111111222222233444444
 6 | 556666667888999
 7 | 0111111123
 7 | 567888
 8 | 2

Unlike in MINITAB, the stem unit is adjusted by the scale argument. The plot above uses the default scale of 1, which here is equivalent to a stem unit of 5. To change the stem unit to 10, the value of the scale argument should be changed to 0.5:

> stem(SmokeCProg$AGE, scale=0.5)
 The decimal point is 1 digit(s) to the right of the |
 3 | 04577888899
 4 | 0022333333444444455566666677777788888889999999
 5 | 00000000111122222233333333333333333444444444445556666667777777889999
 6 | 000011111111111222222233444444556666667888999
 7 | 0111111123567888
 8 | 2

Central Tendency



Some Basic Probability Concepts

Formulas with R

Formula Number Name Formula Formula with R
3.2.1 Classical probability Example
3.2.2 Relative frequency probability Example
3.3.1–3.3.3 Properties of probability

Example
3.4.1 Multiplication rule Example
3.4.2 Conditional probability Example
3.4.3 Addition rule Example
3.4.4 Independent events Example
3.4.5 Complementary events Example
3.4.6 Marginal probability Example
Sensitivity of a screening test Example
Specificity of a screening test Example
3.5.1 Predictive value positive of a screening test Example
3.5.2 Predictive value negative of a screening test Example
Symbol Key
  • D = disease
  • E = event
  • m = the number of times an event E_i occurs
  • n = sample size or the total number of times a process occurs
  • N = population size or the total number of mutually exclusive and equally likely events
  • P(Ā) = a complementary event; the probability of an event A not occurring
  • P(E_i) = probability of some event E_i occurring
  • P(A ∩ B) = an "intersection" or "and" statement; the probability of an event A and an event B occurring
  • P(A ∪ B) = a "union" or "or" statement; the probability of an event A or an event B or both occurring
  • P(A | B) = a conditional statement; the probability of an event A occurring given that an event B has already occurred
  • T = test results
Example
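
The screening-test measures listed above can be sketched in a few lines of R. The counts below are invented for illustration and the variable names are assumptions; the predictive values are computed with Bayes' theorem, using the sample prevalence as the disease probability.

```r
# Hypothetical 2 x 2 screening results
tp <- 90    # disease present, test positive
fn <- 10    # disease present, test negative
fp <- 30    # disease absent,  test positive
tn <- 870   # disease absent,  test negative
n  <- tp + fn + fp + tn

sensitivity <- tp / (tp + fn)   # P(T | D)
specificity <- tn / (fp + tn)   # P(not T | not D)
prevalence  <- (tp + fn) / n    # P(D), here taken from the sample itself

# Predictive values via Bayes' theorem
ppv <- sensitivity * prevalence /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv <- specificity * (1 - prevalence) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

c(sensitivity = sensitivity, specificity = specificity, ppv = ppv, npv = npv)
```

In practice the prevalence often comes from an outside source rather than from the screening sample, in which case the predictive values will differ from the simple table proportions.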



Probability Distributions

Summary of Formulas with R

Formula Number Name Formula Formula with R
4.2.1 Mean of a frequency distribution Example
4.2.2 Variance of a frequency distribution

or

Example
4.3.1 Combination of objects Example
4.3.2 Binomial distribution function Example
4.3.3–4.3.5 Tabled binomial probability equalities

Example
4.4.1 Poisson distribution function Example
4.6.1 Normal distribution function Example
4.6.2 z-transformation Example
4.6.3 Standard normal distribution function Example
Symbol Key
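
As a hedged sketch of how the distribution formulas named in the table above map to base R functions, the code below uses made-up parameter values (10 trials with success probability 0.5, a Poisson mean of 3, and a normal distribution with mean 100 and standard deviation 15).

```r
choose(5, 2)                       # combinations of 5 objects taken 2 at a time: 10

# Binomial: P(X = 3) and P(X <= 3) for n = 10, p = 0.5
p_eq_3 <- dbinom(3, size = 10, prob = 0.5)
p_le_3 <- pbinom(3, size = 10, prob = 0.5)

# Mean and variance of a discrete distribution from its probabilities
x       <- 0:10
p_x     <- dbinom(x, size = 10, prob = 0.5)
mu      <- sum(x * p_x)            # equals n * p = 5
sigma2  <- sum((x - mu)^2 * p_x)   # equals n * p * (1 - p) = 2.5

# Poisson: P(X = 2) when lambda = 3
p_pois <- dpois(2, lambda = 3)

# Normal: z-transformation, then the standard normal distribution function
z      <- (120 - 100) / 15
p_norm <- pnorm(z)                 # same as pnorm(120, mean = 100, sd = 15)
```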



Some Important Sampling Distributions

Summary of Formulas with R

Formula Number Name Formula Formula with R
5.3.1 z-transformation for sample mean Example
5.4.1 z-transformation for difference between two means Example
5.5.1 z-transformation for sample proportion Example
5.5.2 Continuity correction when x < np Example
5.5.3 Continuity correction when x > np Example
5.6.1 z-transformation for difference between two proportions Example
Symbol Key
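
The z-transformations listed above can be sketched directly with arithmetic and pnorm(). All numbers below are hypothetical, and the continuity correction is written in the usual form in which 0.5 is subtracted from x when x > np.

```r
# z-transformation for a sample mean (hypothetical population values)
mu    <- 100
sigma <- 15
n     <- 25
x_bar <- 106
z_mean <- (x_bar - mu) / (sigma / sqrt(n))   # (106 - 100) / 3 = 2
p_upper <- 1 - pnorm(z_mean)                 # P(sample mean > 106)

# z-transformation for a sample proportion (hypothetical values)
p     <- 0.3
n_p   <- 100
x     <- 36                                  # observed successes, x > n * p
se_p  <- sqrt(p * (1 - p) / n_p)
z_prop    <- (x / n_p - p) / se_p
z_prop_cc <- ((x - 0.5) / n_p - p) / se_p    # with continuity correction
```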



Estimation

Summary of Formulas with R

Formula Number Name Formula Formula with R
6.2.1 Expression of an interval estimate estimator ± (reliability coefficient)× standard error of the estimator Example
6.2.2 Interval estimate for μ when σ is known Example
6.3.1 t-transformation Example
6.3.2 Interval estimate for μ when σ is unknown Example
6.4.1 Interval estimate for the difference between two population means when σ₁ and σ₂ are known Example
6.4.2 Pooled variance estimate Example
6.4.3 Standard error of estimate Example
6.4.4 Interval estimate for the difference between two population means when σ is unknown Example
6.4.5 Cochran’s correction for reliability coefficient when variances are not equal Example
6.4.6 Interval estimate using Cochran’s correction for t Example
6.5.1 Interval estimate for a population proportion Example
6.6.1 Interval estimate for the difference between two population proportions Example
6.7.1–6.7.3 Sample size determination when sampling with replacement Example
6.7.4–6.7.5 Sample size determination when sampling without replacement Example
6.8.1 Sample size determination for proportions when sampling with replacement Example
6.8.2 Sample size determination for proportions when sampling without replacement Example
6.9.1 Interval estimate for σ² Example
6.9.2 Interval estimate for σ Example
6.10.1 Interval estimate for the ratio of two variances Example
6.10.2 Relationship among F ratios Example
Symbol Key
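
As an illustration of the general form "estimator ± (reliability coefficient) × standard error", the sketch below builds 95% intervals for a mean by hand and compares the t-based interval with the one reported by t.test(). The sample is the ten patient ages from the introduction, and the "known" σ of 5 is purely an assumption for the example.

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
n      <- length(pt_age)
x_bar  <- mean(pt_age)

# Sigma treated as known (hypothetical value): z-based interval
sigma_known <- 5
ci_z <- x_bar + c(-1, 1) * qnorm(0.975) * sigma_known / sqrt(n)

# Sigma unknown: t-based interval with n - 1 degrees of freedom
ci_t <- x_bar + c(-1, 1) * qt(0.975, df = n - 1) * sd(pt_age) / sqrt(n)

# The same t-based interval as computed by t.test()
ci_builtin <- as.numeric(t.test(pt_age, conf.level = 0.95)$conf.int)

ci_z
ci_t
```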



Hypothesis Testing

Summary of Formulas with R

Formula Number Name Formula Formula with R
7.1.1, 7.1.2, 7.2.1 z-transformation (using either μ or μ₀) Example
7.2.2 t-transformation Example
7.2.3 Test statistic when sampling from a population that is not normally distributed Example
7.3.1 Test statistic when sampling from normally distributed populations:population variances known Example
7.3.2 Test statistic when sampling from normally distributed populations:population variances unknown and equal Example Example
7.3.3, 7.3.4 Test statistic when sampling from normally distributed populations: population variances unknown and unequal Example Example
7.3.5 Sampling from populations that are not normally distributed Example Example
7.4.1 Test statistic for paired differences when the population variance is unknown Example Example
7.4.2 Test statistic for paired differences when the population variance is known Example Example
7.5.1 Test statistic for a single population proportion Example Example
7.6.1, 7.6.2 Test statistic for the difference between two population proportions Example Example
7.7.1 Test statistic for a single population variance Example Example
7.8.1 Variance ratio Example Example
7.9.1, 7.9.2 Upper and lower critical values for x̄ Example Example
7.10.1, 7.10.2 Critical value for determining sample size to control type II errors Example Example
7.10.3 Sample size to control type II errors Example Example
Symbol Key
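
A one-sample t test shows how a test statistic from the table above relates to R's built-in functions. The sketch assumes the ten patient ages from the introduction and a hypothetical null value of 32 years; it computes the statistic and two-sided p-value by hand and then with t.test().

```r
pt_age <- c(30, 31, 32, 34, 35, 36, 37, 30, 40, 45)
mu_0   <- 32                      # hypothetical value under the null hypothesis
n      <- length(pt_age)

# t statistic: (sample mean - mu_0) / (s / sqrt(n))
t_manual <- (mean(pt_age) - mu_0) / (sd(pt_age) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)   # two-sided p-value

# Built-in equivalent
tt <- t.test(pt_age, mu = mu_0)
tt
```

For two independent samples x and y, t.test(x, y, var.equal = TRUE) gives the pooled-variance test, while the default var.equal = FALSE gives the Welch test for unequal variances.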



Analysis of Variance

Summary of Formulas with R

Formula Number Name Formula Formula with R
8.2.1 One-way ANOVA model Example Example
8.2.2 Total sum-of-squares Example Example
8.2.3 Within-group sum-of-squares Example Example
8.2.4 Among-group sum-of-squares Example Example
8.2.5 Within-group variance Example Example
8.2.6 Among-group variance I Example Example
8.2.9 Tukey’s HSD (equal sample sizes) Example Example
8.2.10 Tukey’s HSD (unequal sample sizes) Example Example
8.3.1 Two-way ANOVA model Example Example
8.3.2 Sum-of-squares representation Example Example
8.3.3 Sum-of-squares total Example Example
8.3.4 Sum-of-squares block Example Example
8.3.5 Sum-of-squares treatments Example Example
8.3.6 Sum-of-squares error Example Example
8.4.1 Fixed-effects, additive single-factor, repeated-measures ANOVA model Example Example
8.4.2 Fixed-effects, additive two-factor, repeated-measures ANOVA model Example Example
8.5.1 Two-factor completely randomized fixed-effects factorial model Example Example
8.5.2 Probabilistic representation of α Example Example
8.5.3 Sum-of-squares total I Example Example
8.5.4 Sum-of-squares total II Example Example
8.5.5 Sum-of-squares treatment partition Example Example
Symbol Key
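
The one-way ANOVA sums of squares can be checked by hand against aov(). The data below are invented (three groups of three observations), and the object names are illustrative assumptions.

```r
# Hypothetical data: three treatment groups, three observations each
demo <- data.frame(
  response = c(5.1, 4.9, 5.6, 6.2, 6.8, 6.5, 7.9, 8.1, 7.4),
  group    = factor(rep(c("A", "B", "C"), each = 3))
)

grand_mean  <- mean(demo$response)
group_means <- tapply(demo$response, demo$group, mean)
group_sizes <- tapply(demo$response, demo$group, length)

sst <- sum((demo$response - grand_mean)^2)                        # total
ssw <- sum((demo$response - ave(demo$response, demo$group))^2)    # within groups
ssa <- sum(group_sizes * (group_means - grand_mean)^2)            # among groups

# Built-in one-way ANOVA and Tukey's HSD comparisons
fit <- aov(response ~ group, data = demo)
summary(fit)
TukeyHSD(fit)
```

The "Sum Sq" column of summary(fit) should reproduce ssa and ssw, and the two add up to sst.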



Simple Linear Regression and Correlation

Summary of Formulas with R

Formula Number Name Formula Formula with R
9.2.1 Assumption of linearity Example Example
9.2.2 Simple linear regression model Example Example
9.2.3 Error (residual) term Example Example
9.3.1 Algebraic representation of a straight line Example Example
9.3.2 Least square estimate of the slope of a regression line Example Example
9.3.3 Least square estimate of the intercept of a regression line Example Example
9.4.1 Deviation equation Example Example
9.4.2 Sum-of-squares equation Example Example
9.4.3 Estimated population coefficient of determination Example Example
9.4.4–9.4.7 Means and variances of point estimators a and b Example Example
9.4.8 z statistic for testing hypotheses about b Example Example
9.4.9 t statistic for testing hypotheses about b Example Example
9.5.1 Prediction interval for Y for a given X Example Example
9.5.2 Confidence interval for the mean of Y for a given X Example Example
9.7.1–9.7.2 Correlation coefficient Example Example
9.7.3 t statistic for correlation coefficient Example Example
9.7.4 z statistic for correlation coefficient Example Example
9.7.5 Estimated standard deviation for z statistic Example Example
9.7.6 Z statistic for correlation coefficient Example Example
9.7.7 Z statistic for correlation coefficient when n < 25 Example Example
9.7.8 Standard deviation for z* Example Example
9.7.9 Z* statistic for correlation coefficient Example Example
9.7.10 Confidence interval for r Example Example
Symbol Key
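
The least-squares estimates named above can be computed by hand and compared with lm(). The x and y values are made up for illustration, and the names b0 and b1 for the intercept and slope are assumptions.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Least-squares slope and intercept from the defining formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same fit with lm()
fit <- lm(y ~ x)
coef(fit)
r_squared <- summary(fit)$r.squared     # coefficient of determination

# Intervals for a new X value
new_x <- data.frame(x = 4.5)
predict(fit, newdata = new_x, interval = "confidence")   # for the mean of Y
predict(fit, newdata = new_x, interval = "prediction")   # for a single Y

cor.test(x, y)                          # correlation coefficient and its t test
```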



Multiple Regression and Correlation

Summary of Formulas with R

Formula Number Name Formula Formula with R
10.2.1 Representation of the multiple linear regression equation Example Example
10.2.2 Representation of the multiple linear regression equation with two independent variables Example Example
10.2.3 Random deviation of a point from a plane when there are two independent variables Example Example
10.3.1 Sum-of-squared residuals Example Example
10.4.1 Sum-of-squares equation Example Example
10.4.2 Coefficient of multiple determination Example Example
10.4.3 t statistic for testing hypotheses about β_i Example Example
10.5.1 Estimation equation for multiple linear regression Example Example
10.5.2 Confidence interval for the mean of Y for a given X Example Example
10.5.3 Prediction interval for Y for a given X Example Example
10.6.1 Multiple correlation model Example Example
10.6.2 Multiple correlation coefficient Example Example
10.6.3 F statistic for testing the multiple correlation coefficient Example Example
10.6.4–10.6.6 Partial correlation between two variables (1 and 2) after controlling for a third (3) Example Example
10.6.7 t statistic for testing hypotheses about partial correlation coefficients Example Example
Symbol Key
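
A multiple regression with two independent variables can be sketched with lm() on simulated data. Everything below is hypothetical: the coefficients used to simulate y, the seed, and the object names.

```r
set.seed(42)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)   # simulated response
demo <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = demo)
summary(fit)                         # coefficients with their t statistics

# Coefficient of multiple determination from the sums of squares
sse <- sum(residuals(fit)^2)         # sum of squared residuals
sst <- sum((demo$y - mean(demo$y))^2)
r2  <- 1 - sse / sst
multiple_r <- sqrt(r2)               # multiple correlation coefficient

# Intervals for a given pair of X values
new_x <- data.frame(x1 = 0, x2 = 1)
predict(fit, newdata = new_x, interval = "confidence")
predict(fit, newdata = new_x, interval = "prediction")
```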



Regression Analysis: Some Additional Techniques

Summary of Formulas with R

Formula Number Name Formula Formula with R
11.4.1–11.4.3 Representations of the simple linear regression model Example Example
11.4.4 Simple logistic regression model Example Example
11.4.5 Alternative representation of the simple logistic regression model Example Example
11.4.6 Alternative representation of the multiple logistic regression model Example Example
11.4.7 Alternative representation of the multiple logistic regression model Example Example
Symbol Key
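
A simple logistic regression can be fitted with glm() and the binomial family. The sketch uses simulated data, so the true coefficients (-0.5 and 1.2), the seed and the object names are all assumptions made for illustration.

```r
set.seed(7)
x <- rnorm(100)
p <- plogis(-0.5 + 1.2 * x)          # logistic model: 1 / (1 + exp(-(b0 + b1 * x)))
y <- rbinom(100, size = 1, prob = p) # simulated 0/1 outcomes

fit <- glm(y ~ x, family = binomial)
b   <- coef(fit)                     # estimates on the log-odds (logit) scale
exp(b["x"])                          # estimated odds ratio per unit increase in x

# Predicted probability at x = 1, from predict() and by hand
p_hat    <- predict(fit, newdata = data.frame(x = 1), type = "response")
p_manual <- plogis(b[["(Intercept)"]] + b[["x"]] * 1)
```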



The Chi-Square Distribution and the Analysis of Frequencies

Summary of Formulas with R

Formula Number Name Formula Formula with R
12.2.1 Standard normal random variable Example Example
12.2.2 Chi-square distribution with n degrees of freedom Example Example
12.2.3 Chi-square probability density function Example Example
12.2.4 Chi-square test statistic Example Example
12.4.1 Chi-square calculation formula for a 2 × 2 contingency table Example Example
12.4.2 Yates’s corrected chi-square calculation for a 2 × 2 contingency table Example Example
12.6.1–12.6.2 Large-sample approximation to the chi-square Example Example
12.7.1 Relative risk estimate Example Example
12.7.2 Confidence interval for the relative risk estimate Example Example
12.7.3 Odds ratio estimate Example Example
12.7.4 Confidence interval for the odds ratio estimate Example Example
12.7.5 Expected frequency in the Mantel–Haenszel statistic Example Example
12.7.6 Stratum expected frequency in the Mantel–Haenszel statistic Example Example
12.7.7 Mantel–Haenszel test statistic Example Example
12.7.8 Mantel–Haenszel estimator of the common odds ratio Example Example
Symbol Key
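
For a 2 × 2 contingency table, the chi-square statistic, relative risk and odds ratio can be sketched as follows. The counts are invented and the names are illustrative; chisq.test() applies Yates's correction by default for 2 × 2 tables, so correct = FALSE is needed to reproduce the uncorrected statistic.

```r
# Hypothetical 2 x 2 table: rows = exposure, columns = disease
tab <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("yes", "no"),
                              disease  = c("yes", "no")))

# Chi-square statistic from observed and expected frequencies
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
x2 <- sum((tab - expected)^2 / expected)

# Shortcut formula for a 2 x 2 table: n(ad - bc)^2 / product of the margins
x2_short <- sum(tab) * (30 * 40 - 20 * 10)^2 /
  (prod(rowSums(tab)) * prod(colSums(tab)))

test_plain <- chisq.test(tab, correct = FALSE)   # uncorrected
test_yates <- chisq.test(tab)                    # Yates's correction

relative_risk <- (30 / (30 + 20)) / (10 / (10 + 40))   # risk exposed / risk unexposed
odds_ratio    <- (30 * 40) / (20 * 10)                 # ad / bc
```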



Nonparametric and Distribution-Free Statistics

Summary of Formulas with R

Formula Number Name Formula Formula with R
13.3.1 Sign test statistic Example Example
13.3.2 Large-sample approximation of the sign test Example Example
13.6.1 Mann–Whitney test statistic Example Example
13.6.2 Large-sample approximation of the Mann–Whitney test Example Example
13.6.3 Equivalence of the Mann–Whitney and Wilcoxon two-sample statistics Example Example
13.7.1–13.7.2 Kolmogorov–Smirnov test statistic Example Example
13.8.1 Kruskal–Wallis test statistic Example Example
13.8.2 Kruskal–Wallis test statistic adjustment for ties Example Example
13.9.2 Friedman test statistic Example Example
13.10.1 Spearman rank correlation test statistic Example Example
13.10.2 Large-sample approximation of the Spearman rank correlation Example Example
13.10.3–13.10.4 Correction for tied observations in the Spearman rank correlation Example Example
13.11.1 Theil's estimator of b Example Example
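
Two of the statistics listed above, Mann–Whitney and Spearman, can be checked by hand against base R. The samples are made up and contain no ties, so the simple formulas apply; all object names are assumptions.

```r
# Mann-Whitney test on two hypothetical independent samples
x <- c(1.1, 2.3, 3.8, 4.0, 5.6)
y <- c(2.9, 4.7, 6.1, 7.2, 8.5, 9.0)
n1 <- length(x)

ranks_x  <- rank(c(x, y))[seq_len(n1)]        # ranks of x in the combined sample
w_manual <- sum(ranks_x) - n1 * (n1 + 1) / 2  # statistic reported by wilcox.test()
mw <- wilcox.test(x, y)

# Kruskal-Wallis test takes a list of groups (two groups here for brevity)
kruskal.test(list(x, y))

# Spearman rank correlation on hypothetical paired data (no ties)
u <- c(1, 2, 3, 4, 5, 6)
v <- c(2.1, 1.9, 3.5, 5.2, 4.8, 6.0)
d <- rank(u) - rank(v)
n <- length(u)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho <- cor(u, v, method = "spearman")
```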



Survival Analysis

Summary of Formulas with R

Formula Number Name Formula Formula with R
14.2.1 Example Example Example
14.2.2 Example Example Example
14.2.3 Example Example Example
14.2.4 Example Example Example
14.2.5 Example Example Example



Vital Statistics

Summary of Formulas with R

Formula Number Name Formula Formula with R



Further reading

For Biostatistics

For R programming