R Programming/Descriptive Statistics

In this section, we present descriptive statistics, ie a set of tools to describe and explore data. This mainly includes univariate and bivariate statistical tools.

Generic Functions

edit

We introduce some functions to describe a dataset.

  • names() gives the names of each variable
  • str() gives the structure of the dataset
  • summary() gives the mean, median, min, max, 1st and 3rd quartile of each variable in the data.
> summary(mydat)
  • describe() (Hmisc package) gives more details than summary()
> library("Hmisc")
> describe(mydat)
  • contents() (Hmisc package)
  • dims() in the Zelig package.
  • descr() in the descr package gives min, max, mean and quartiles for continuous variables, frequency tables for factors and length for character vectors.
  • whatis() (YaleToolkit) gives a good description of a dataset.
  • detail() in the SciencesPo package gives a broad range of statistics for continuous variables, frequency tables for factors and length for character vectors.
  • describe() in the psych package also provides summary statistics:
> x = runif(100)
> y = rnorm(100)
> z = rt(100,1)
> sample.data = x*y*z
> require(psych)
Loading required package: psych
> describe(cbind(sample.data,x,z,y))
            var   n  mean   sd median trimmed  mad    min   max range  skew kurtosis   se
sample.data   1 100  0.37 3.21   0.00    0.07 0.31  -9.02 24.84 33.86  4.79    36.91 0.32
x             2 100  0.54 0.28   0.56    0.55 0.35   0.02  1.00  0.98 -0.12    -1.13 0.03
z             3 100  0.12 6.28   0.02   -0.01 1.14 -30.40 37.93 68.33  1.49    22.33 0.63
y             4 100 -0.01 1.07   0.09   -0.02 1.12  -2.81  2.35  5.16  0.00    -0.30 0.11

Univariate analysis

edit

Continuous variable

edit

Moments

edit
  • mean() computes the mean
  • the variance : var().
  • the standard deviation sd().
  • the skewness skewness() (fUtilities, moment or e1071)
  • the kurtosis : kurtosis() (fUtilities, moment or e1071)
  • all the moments : moment() (moment) and all.moments() (moment).
> library(moments)
>  x <- rnorm(1000)
> moment(x,order = 2) # the variance
[1] 0.999782
> all.moments(x, order.max = 4) # mean, variance, skewness and kurtosis
[1] 1.000000000 0.006935727 0.999781992 0.062650605 2.972802009
> library("e1071")
> moment(x,order = 3) # the skewness
[1] 0.0626506


Order statistics

edit
  • the range, the minimum and the maximum : range() returns the range of a vector (minimum and maximum of a vector), min() the minimum and max() the maximum.
  • IQR() computes the interquartile range. median() computes the median and mad() the median absolute deviation.
  • quantile(), hdquantile() in the Hmisc package and kuantile() in the quantreg packages computes the sample quantiles of a continuous vector. kuantile() may be more efficient when the sample size is big.
> library(Hmisc)
> library(quantreg)
> x <- rnorm(1000)
> seq <- seq(0, 1, 0.25)
> quantile(x, probs = seq, na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
> hdquantile(x, probs = seq, se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
       0.00        0.25        0.50        0.75        1.00 
-3.07328999 -0.66901899  0.02157989  0.72378407  2.92897970 
> kuantile(x, probs = seq(0, 1, .25), na.rm = FALSE, names = TRUE)
         0%         25%         50%         75%        100% 
-3.07328999 -0.66800917  0.02010969  0.72620061  2.92897970 
attr(,"class")
[1] "kuantile"


Inequality Index

edit
  • The gini coefficient : Gini() (ineq) and gini() (reldist).
  • ineq() (ineq) gives all inequalities index.
> library(ineq)
> x <- rlnorm(1000)
> Gini(x)
[1] 0.5330694
> RS(x) #  Ricci-Schutz coefficient
[1] 0.3935813
> Atkinson(x, parameter = 0.5)
[1] 0.2336169
> Theil(x, parameter = 0)
[1] 0.537657
> Kolm(x, parameter = 1)
[1] 0.7216194
> var.coeff(x, square = FALSE)
[1] 1.446085
> entropy(x, parameter = 0.5)
[1] 0.4982675
> library("reldist")
> gini(x)
[1] 0.5330694


  • Concentration index
> library(ineq)
> Herfindahl(x)
[1] 0.003091162
>  Rosenbluth(x)
[1] 0.002141646


  • Poverty index
> library(ineq)
> Sen(x,median(x)/2)
[1] 0.1342289
> ?pov # learn more about poverty index


Plotting the distribution

edit

We can plot the distribution using a box plot (boxplot()), an histogram (hist()), a kernel estimator (plot() with density()) or the empirical cumulative distribution function (plot() with ecdf()). See the Nonparametric section to learn more about histograms and kernel density estimators. qqnorm() produces a normal QQ plot and qqline() adds a line to the QQ plot which passes through the first and the third quartile.

  • A box-plot is a graphical representation of the minimum, the first quartile, the median, the third quartile and the maximum.
  • stripchart() and stem() are also availables.
> x <- rnorm(10^3)
> hist(x)
> plot(density(x))
> boxplot(x)
> plot(ecdf(x)) # plots the empirical distribution function
 
> qqnorm(x)
> qqline(x, col="red") # it does not do the plot but adds a line to existing one


Goodness of fit tests

edit

Kolmogorov Smirnov Test :

The KS test is one sample goodness of fit test. The test statistic is simply the maximum of the absolute value of the difference between the empirical cumulative distribution function and the theoritical cumulative distribution function. KSd() (sfsmisc) gives the critical values for the KS statistic. As an example, we draw a sample from a Beta(2,2) distribution and we test if it fits a Beta(2,2) a Beta(1,1) and a uniform distribution.

> y <- rbeta(1000,2,2) # Draw y in a Beta(2,2) distribution
> ks.test(y,"pbeta",2,2) # Test if it fits a beta(2,2) distribution
> ks.test(y,"pbeta",1,1) # Test if it fits a beta(1,1) distribution
> ks.test(y,"punif") # Test if its fit a uniform distribution (in fact the beta(1,1) is a uniform distribution)


Some tests are specific to the normal distribution. The Lillie Test is an extension of the KS test when the parameters are unknown. This is implemented with the lillie.test() in the nortest package. shapiro.test() implements the Shapiro Wilk Normality Test

> N <- 100
> x <- rnorm(N)
> library("nortest")
> lillie.test(x)

         Lilliefors (Kolmogorov-Smirnov) normality test

data:  x 
D = 0.0955, p-value = 0.9982*
> shapiro.test(x)

	Shapiro-Wilk normality test

data:  x 
W = 0.9916, p-value = 0.7902
> library("nortest")
> ad.test(x)

	Anderson-Darling normality test

data:  x 
A = 0.2541, p-value = 0.7247

See also the package ADGofTest for another version of this test[1].

> sf.test(x)

	Shapiro-Francia normality test

data:  x 
W = 0.9866, p-value = 0.9953
> library("nortest")
> pearson.test(x)

	Pearson chi-square normality test

data:  x 
P = 0.8, p-value = 0.8495
  • Cramer-von Mises normality test
> cvm.test(x)

	Cramer-von Mises normality test

data:  x 
W = 0.0182, p-value = 0.9756
> jarque.bera.test(x)

	Jarque Bera Test

data:  x 
X-squared = 0.6245, df = 2, p-value = 0.7318

Discrete variable

edit

We generate a discrete variable using sample() and we tabulate it using table(). We can plot using a pie chart (pie()), a bar chart (barplot() or barchart() (lattice)) or a dot chart (dotchart() or dotplot() (lattice)).

  • freq() (descr) prints the frequency, the percentages and produces a barplot. It supports weights.
> x <- sample(c("A","B","C"),100,replace=T)
> tab <- table(x)
> tab
> prop.table(tab)
> pie(tab)
> barplot(tab)
> dotchart(tab)
> library("descr")
> freq(x) 
x 
      Frequency Percent
A            32      32
B            34      34
C            34      34
Total       100     100


Multivariate analysis

edit

Continuous variables

edit
  • Covariance : cov()
  • Pearson's linear correlation : cor().
  • Pearson's correlation test cor.test() performs the test.
  • Spearman's rank correlation :
    • cor() with method = "spearman".
    • spearman() (Hmisc)
  • Spearman's rank correlation test :
    • spearman2() (Hmisc)
    • spearman.test() (Hmisc)
    • spearman.test() (pspearman package) performs the Spearman’s rank correlation test with precomputed exact null distribution for n <= 22.
  • Kendall's correlation : cor() with method = "kendall". See also the Kendall package.
> N <- 100
> x1 <- rnorm(N)
> x2 <- rnorm(N) + x1 + 1
> y <- 1 + x1 + x2 + rnorm(N)
> plot(y ~ x1 ) # Scatter plot 
> mydat <- data.frame(y,x1,x2)
> cor(mydat)
> cor(mydat, method = "spearman")
> cor(mydat, method = "kendall")
> cor.test(mydat$x1,mydat$x2, method = "pearson")
> cor.test(mydat$x1,mydat$x2, method = "spearman")
> cor.test(mydat$x1,mydat$x2, method = "kendall")


Discrete variables

edit
  • table(), xtabs() and prop.table() for contingency tables. ftable() (stats package) for a flat (nested) table.
  • assocplot() and mosaicplot() for graphical display of contingency table.
  • CrossTable() (descr) is similar to SAS Proc Freq. It returns a contingency table with Chi square and Fisher independence tests.
  • my.table.NA() and my.table.margin() (cwhmisc)
  • chisq.detail() (TeachingDemos)

Discrete and Continuous variables

edit
  • bystats() Statistics by Categories in the Hmisc package
  • summaryBy() (doBy)
  • Multiple box plots : plot() or boxplot()
> N <- 100
> x <- sample(1:4,N, replace = T) 
> y <- x + rnorm(N)
> plot(y ~ x) # scatter plot
> plot(y ~ as.factor(x)) # multiple box plot
> boxplot(y ~ x) # multiple box plot
> bystats(y , as.factor(x), fun = mean) 
> bystats(y , as.factor(x), fun = quantile)


  • Equality of two sample mean t.test() and wilcox.test(), Equality of variance var.test(), equality of two distributions ks.test().
N <- 100
x <- sample(0:1,N, replace = T) 
y <- x + rnorm(N)
t.test(y ~ x )
wilcox.test(y ~ x)


References

edit
  1. Carlos J. Gil Bellosta (2009). ADGofTest: Anderson-Darling GoF test. R package version 0.1. http://CRAN.R-project.org/package=ADGofTest
Previous: Graphics Index Next: Linear Models