R Programming/Descriptive Statistics

In this section, we present descriptive statistics, i.e. a set of tools to describe and explore data. This mainly covers univariate and bivariate statistical tools.

Generic Functions

We introduce some functions to describe a dataset.

  • names() gives the names of each variable
  • str() gives the structure of the dataset
  • summary() gives the mean, median, min, max, and 1st and 3rd quartiles of each variable in the data.
> summary(mydat)
  • describe() (Hmisc package) gives more details than summary()
> library("Hmisc")
> describe(mydat)
  • contents() (Hmisc package)
  • dims() in the Zelig package.
  • descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
  • whatis() (YaleToolkit) gives a good description of a dataset.
  • describe() in the psych package also provides summary statistics:
> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11

Univariate analysis

Continuous variable

Moments

  • mean() computes the mean
  • var() the variance
  • sd() the standard deviation
  • skewness() the skewness (fUtilities, moments or e1071)
  • kurtosis() the kurtosis (fUtilities, moments or e1071)
  • moment() and all.moments() (moments) compute moments of any order.
> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # raw moments of order 0 to 4
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the third moment
[1] 0.0626506


Order statistics

  • the range, minimum and maximum : range() returns both the minimum and the maximum of a vector, min() the minimum and max() the maximum.
  • IQR() computes the interquartile range, median() the median and mad() the median absolute deviation.
  • quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg package compute the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is large.
> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"
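The other order statistics listed above can be illustrated on the same kind of simulated sample (a minimal sketch; the numerical results depend on the random draw, so none are shown):

```r
# Basic order statistics on a simulated sample
x <- rnorm(1000)
range(x)   # minimum and maximum in one call
min(x)
max(x)
median(x)  # the sample median
IQR(x)     # interquartile range: 3rd quartile minus 1st quartile
mad(x)     # median absolute deviation (scaled by 1.4826 by default)
```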


Inequality Index

  • The Gini coefficient : Gini() (ineq) and gini() (reldist).
  • ineq() (ineq) computes all the inequality indices.
> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694


  • Concentration index
> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646


  • Poverty index
> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index


Plotting the distribution

We can plot the distribution using a box plot (boxplot()), a histogram (hist()), a kernel density estimate (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and third quartiles.

  • A box plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum (by default, boxplot() draws whiskers only to the most extreme points within 1.5 times the interquartile range and plots points beyond them as outliers).
  • stripchart() and stem() are also available.
> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # does not create a new plot but adds a line to the existing one


Goodness of fit tests

Kolmogorov-Smirnov test:

The KS test is a one-sample goodness-of-fit test. The test statistic is simply the maximum absolute difference between the empirical cumulative distribution function and the theoretical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and test whether it fits a Beta(2,2), a Beta(1,1) and a uniform distribution.

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if it fits a uniform distribution (in fact the Beta(1,1) is the uniform distribution)


Some tests are specific to the normal distribution. The Lilliefors test is an extension of the KS test for the case where the parameters are unknown. It is implemented as lillie.test() in the nortest package. shapiro.test() implements the Shapiro-Wilk normality test.

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)
 
         Lilliefors (Kolmogorov-Smirnov) normality test
 
data:  x 
D = 0.0955, p-value = 0.9982
> shapiro.test(x)
 
	Shapiro-Wilk normality test
 
data:  x 
W = 0.9916, p-value = 0.7902
> library("nortest")
> ad.test(x)
 
	Anderson-Darling normality test
 
data:  x 
A = 0.2541, p-value = 0.7247

See also the package ADGofTest for another version of this test[1].

> sf.test(x)
 
	Shapiro-Francia normality test
 
data:  x 
W = 0.9866, p-value = 0.9953
> library("nortest")
> pearson.test(x)
 
	Pearson chi-square normality test
 
data:  x 
P = 0.8, p-value = 0.8495
  • Cramer-von Mises normality test
> cvm.test(x)
 
	Cramer-von Mises normality test
 
data:  x 
W = 0.0182, p-value = 0.9756
> library(tseries)
> jarque.bera.test(x)
 
	Jarque Bera Test
 
data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

Discrete variable

We generate a discrete variable using sample() and tabulate it using table(). We can plot it using a pie chart (pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot() (lattice)).

  • freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.
> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100


Multivariate analysis

Continuous variables

  • Covariance : cov()
  • Pearson's linear correlation : cor().
  • cor.test() performs Pearson's correlation test.
  • Spearman's rank correlation :
    • cor() with method = "spearman".
    • spearman() (Hmisc)
  • Spearman's rank correlation test :
    • spearman2() (Hmisc)
    • spearman.test() (Hmisc)
    • spearman.test() (pspearman package) performs the Spearman’s rank correlation test with precomputed exact null distribution for n <= 22.
  • Kendall's correlation : cor() with method = "kendall". See also the Kendall package.
> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")


Discrete variables

  • table(), xtabs() and prop.table() for contingency tables.
  • assocplot() and mosaicplot() for graphical displays of contingency tables.
  • CrossTable() (descr) is similar to SAS PROC FREQ. It returns a contingency table with chi-square and Fisher independence tests.
  • my.table.NA() and my.table.margin() (cwhmisc)
  • chisq.detail() (TeachingDemos)
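A minimal sketch tying the base functions together on two simulated (hypothetical) discrete variables:

```r
# Two-way contingency table for two simulated discrete variables
x <- sample(c("A", "B", "C"), 200, replace = TRUE)
y <- sample(c("low", "high"), 200, replace = TRUE)
tab <- xtabs(~ x + y)        # same counts as table(x, y)
tab
prop.table(tab)              # joint proportions
prop.table(tab, margin = 1)  # row proportions
chisq.test(tab)              # chi-square independence test
mosaicplot(tab)              # graphical display
```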

Discrete and Continuous variables

  • bystats() (Hmisc) computes statistics by categories
  • summaryBy() (doBy)
  • Multiple box plots : plot() or boxplot()
> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)


  • Equality of two sample means: t.test() and wilcox.test(); equality of variances: var.test(); equality of two distributions: ks.test().
> N <- 100
> x <- sample(0:1, N, replace = T) 
> y <- x + rnorm(N)
> t.test(y ~ x)
> wilcox.test(y ~ x)
> var.test(y ~ x) # equality of variances
> ks.test(y[x == 0], y[x == 1]) # equality of the two distributions


References

  1. Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest
Last modified on 29 January 2013, at 18:33