Statistics/Numerical Methods/Quantile Regression

Quantile Regression, as introduced by Koenker and Bassett (1978), seeks to complement classical linear regression analysis. Central hereby is the extension of "ordinary quantiles from a location model to a more general class of linear models in which the conditional quantiles have a linear form" (Buchinsky (1998), p. 89). In Ordinary Least Squares (OLS) the primary goal is to determine the conditional mean of a random variable $Y$, given some explanatory variable $x_i$, reaching the expected value $E[Y \mid x_i]$. Quantile Regression goes beyond this and enables one to pose such a question at any quantile of the conditional distribution function. The following seeks to introduce the reader to the ideas behind Quantile Regression. First, the issue of quantiles is addressed, followed by a brief outline of least squares estimators focusing on Ordinary Least Squares. Finally, Quantile Regression is presented, along with an example utilizing the Boston Housing data set.

Preparing the Grounds for Quantile Regression

What are Quantiles

Gilchrist (2001, p. 1) describes a quantile as "simply the value that corresponds to a specified proportion of an (ordered) sample of a population". For instance, a very commonly used quantile is the median, which is equal to a proportion of 0.5 of the ordered data. This corresponds to a quantile with a probability of 0.5 of occurrence. Quantiles hereby mark the boundaries of equally sized, consecutive subsets. (Gilchrist, 2001)

More formally stated, let $Y$ be a continuous random variable with a distribution function $F_Y(y)$ such that

$F_Y(y) = P(Y \leq y) = \tau \qquad (1)$

which states that for the distribution function $F_Y(y)$ one can determine for a given value $y$ the probability $\tau$ of occurrence. Now if one is dealing with quantiles, one wants to do the opposite, that is, one wants to determine for a given probability $\tau$ of the sample data set the corresponding value $y$. A $\tau$-quantile refers in a sample data set to the probability $\tau$ for a value $y$:

$P(Y \leq y_\tau) = \tau \qquad (2)$

Another form of expressing the $\tau$-quantile mathematically is the following:

$y_\tau = F_Y^{-1}(\tau) \qquad (3)$

$y_\tau$ is such that it constitutes the inverse of the function $F_Y(\cdot)$ for a probability $\tau$.

Note that there are two possible scenarios. On the one hand, if the distribution function $F_Y(y)$ is monotonically increasing, quantiles are well defined for every $\tau \in (0,1)$. However, if a distribution function $F_Y(y)$ is not strictly monotonically increasing, there are some $\tau$s for which a unique quantile cannot be defined. In this case one uses the smallest value that $y$ can take on for a given probability $\tau$.

Both cases, with and without a strictly monotonically increasing function, can be described as follows:

$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y \mid F_Y(y) \geq \tau\} \qquad (4)$

That is, $Q_Y(\tau)$ is equal to the inverse of the distribution function, which in turn is equal to the infimum of $y$ such that the distribution function $F_Y(y)$ is greater than or equal to a given probability $\tau$, i.e. the $\tau$-quantile. (Handl (2000))

However, a problem that frequently occurs is that an empirical distribution function is a step function. Handl (2000) describes a solution to this problem. As a first step, one reformulates equation (4) in such a way that one replaces the continuous random variable $Y$ with the observations $y_1, \dots, y_n$ in the distribution function $F_Y(y)$, resulting in the empirical distribution function $F_n(y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{y_i \leq y\}$. This gives the following equation:

$Q_n(\tau) = F_n^{-1}(\tau) = \inf\{y \mid F_n(y) \geq \tau\} \qquad (5)$

The empirical distribution function can be separated into equally sized, consecutive subsets via the number of observations $n$, which leads one to the following step:

$Q_n(\tau) = y_{(i)} \quad \text{for} \quad \frac{i-1}{n} < \tau \leq \frac{i}{n} \qquad (6)$

with $i = 1, \dots, n$ and $y_{(1)}, \dots, y_{(n)}$ as the sorted observations. Hereby, of course, the range of values that $y_\tau$ can take on is limited simply by the observations $y_i$ and their nature. However, what if one wants to implement a different subset, i.e. different quantiles than those that can be derived from the number of observations $n$?

Therefore a further step necessary to solve the problem of a step function is to smooth the empirical distribution function by replacing it with a continuous linear function $\tilde{F}(y)$. In order to do this there are several algorithms available, which are well described in Handl (2000) and in more detail, with an evaluation of the different algorithms and their implementation in statistical packages, in Hyndman and Fan (1996). Only then can one apply any division of the data set into quantiles that is suitable for the purpose of the analysis. (Handl (2000))
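As a brief practical aside, these ideas can be tried out directly. The sketch below, assuming NumPy (the original text relies on XploRe instead), contrasts the raw step-function quantile of equation (6) with a smoothed estimate; the `method` argument of `np.quantile` implements several of the interpolation schemes catalogued by Hyndman and Fan (1996), and the helper name `step_quantile` is hypothetical.

```python
import numpy as np

# A small sample; quantiles falling "between" observations must be interpolated.
y = np.array([2.0, 3.5, 4.1, 5.0, 7.3, 8.8, 9.4, 12.0])

def step_quantile(y, tau):
    """Equation (6): smallest order statistic y_(i) whose share i/n reaches tau."""
    y_sorted = np.sort(y)
    n = len(y_sorted)
    i = max(int(np.ceil(tau * n)), 1)   # smallest i with i/n >= tau
    return y_sorted[i - 1]              # 1-based index into the sorted sample

print(step_quantile(y, 0.5))                        # raw step-function quantile
print(np.quantile(y, 0.5, method="inverted_cdf"))   # the same via NumPy (H&F type 1)
print(np.quantile(y, 0.5, method="linear"))         # smoothed, piecewise-linear scheme
```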

Ordinary Least Squares

In regression analysis the researcher is interested in analyzing the behavior of a dependent variable $y$ given the information contained in a set of explanatory variables $x$. Ordinary Least Squares is a standard approach for specifying a linear regression model and estimating its unknown parameters by minimizing the sum of squared errors. This leads to an approximation of the mean function of the conditional distribution of the dependent variable. OLS achieves the property of BLUE (it is the Best Linear Unbiased Estimator) if the following four assumptions hold:

1. The explanatory variable $x$ is non-stochastic.

2. The expectation of the error term $\varepsilon$ is zero, i.e. $E[\varepsilon_i] = 0$.

3. Homoscedasticity: the variance of the error terms $\varepsilon_i$ is constant, i.e. $Var(\varepsilon_i) = \sigma^2$.

4. No autocorrelation, i.e. $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

However, frequently one or more of these assumptions are violated, so that OLS is no longer the best linear unbiased estimator. Quantile Regression can then tackle the following issues: (i) frequently the error terms are not constant across the distribution, thereby violating the assumption of homoscedasticity; (ii) by focusing on the mean as a measure of location, information about the tails of a distribution is lost; (iii) OLS is sensitive to extreme outliers, which can distort the results significantly. (Montenegro (2001))
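As a minimal illustration of point (iii), the following sketch (assuming NumPy; it is not part of the original text) contaminates a well-behaved sample with a single extreme value and compares the mean, the quantity behind OLS, with the median, the quantity behind median regression:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=1.0, size=99)   # well-behaved sample around 10
y_contaminated = np.append(y, 1000.0)          # a single extreme outlier

# The mean is dragged far away from the bulk of the data ...
print(np.mean(y), np.mean(y_contaminated))     # roughly 10 vs roughly 20
# ... while the median barely moves.
print(np.median(y), np.median(y_contaminated))
```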

Quantile Regression

The Method

Quantile Regression essentially transforms a conditional distribution function into a conditional quantile function by slicing it into segments. These segments describe the cumulative distribution of a conditional dependent variable $y$ given the explanatory variable $x$, with the use of quantiles as defined in equation (4).

For a dependent variable $y$ given the explanatory variable $x$ and fixed $\tau$, $0 < \tau < 1$, the conditional quantile function is defined as the $\tau$-th quantile $Q_{y|x}(\tau)$ of the conditional distribution function $F_{y|x}(y)$. For the estimation of the location of the conditional distribution function, the conditional median $Q_{y|x}(0.5)$ can be used as an alternative to the conditional mean. (Lee (2005))

One can nicely illustrate Quantile Regression by comparing it with OLS. In OLS, modeling the conditional distribution function of a random sample $(y_1, \dots, y_n)$ with a parametric function $\mu(x_i, \beta)$, where $x_i$ represents the independent variables, $\beta$ the corresponding estimates and $\mu$ the conditional mean, one gets the following minimization problem:

$\min_{\beta} \sum_{i=1}^{n} \left( y_i - \mu(x_i, \beta) \right)^2 \qquad (7)$

One thereby obtains the conditional expectation function $E[Y \mid x_i]$. Now, in a similar fashion one can proceed in Quantile Regression. The central feature thereby becomes the check function $\rho_\tau(\cdot)$:

$\rho_\tau(u) = \begin{cases} \tau \cdot u & \text{if } u \geq 0 \\ (\tau - 1) \cdot u & \text{if } u < 0 \end{cases} \qquad (8)$

This check function ensures that

1. all $\rho_\tau(u)$ are non-negative, and

2. the scale is according to the probability $\tau$.

Such a function with two branches is necessary when dealing with $L_1$ distances, since residuals can become negative; a short numerical sketch follows below.
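The following sketch, assuming NumPy and the hypothetical helper name `check_function`, implements equation (8) and verifies numerically that the constant $q$ minimizing the average check loss $\frac{1}{n} \sum_i \rho_\tau(y_i - q)$ is (approximately) the empirical $\tau$-quantile, which is exactly the sense in which the check function generalizes the absolute-value loss and its minimizer, the median.

```python
import numpy as np

def check_function(u, tau):
    """Equation (8): rho_tau(u) = tau*u for u >= 0 and (tau - 1)*u for u < 0."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=10_000)
tau = 0.9

# Scan candidate constants q and keep the one with the smallest average loss.
candidates = np.linspace(y.min(), y.max(), 2_000)
losses = [check_function(y - q, tau).mean() for q in candidates]
print(candidates[np.argmin(losses)])   # approximately ...
print(np.quantile(y, tau))             # ... the empirical 0.9-quantile
```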

In Quantile Regression one now minimizes the following function:

$\min_{\beta} \sum_{i=1}^{n} \rho_\tau \left( y_i - q(x_i, \beta) \right) \qquad (9)$

Here, as opposed to OLS, the minimization is done for each subsection defined by $\rho_\tau$, where the estimate of the $\tau$-quantile function is achieved with the parametric function $q(x_i, \beta)$. (Koenker and Hallock (2001))

Features that characterize Quantile Regression and differentiate it from other regression methods are the following:

1. The entire conditional distribution of the dependent variable $y$ can be characterized through different values of $\tau$.

2. Heteroscedasticity can be detected.

3. If the data is heteroscedastic, median regression estimators can be more efficient than mean regression estimators.

4. The minimization problem as illustrated in equation (9) can be solved efficiently by linear programming methods, making estimation easy (see the sketch after this list).

5. Quantile functions are also equivariant to monotone transformations, that is $Q_{h(Y)|X}(\tau) = h(Q_{Y|X}(\tau))$ for any monotone function $h(\cdot)$.

6. Quantiles are robust with regard to outliers. (Lee (2005))
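To make point 4 concrete: with a linear specification $q(x_i, \beta) = x_i^{\prime}\beta$, equation (9) can be rewritten as a linear program by splitting each residual into positive and negative parts, $y_i - x_i^{\prime}\beta = u_i - v_i$ with $u_i, v_i \geq 0$, and minimizing $\tau \sum_i u_i + (1 - \tau) \sum_i v_i$. The sketch below solves this with SciPy's general-purpose `linprog` solver; it is meant as an illustration of the reformulation, not as a substitute for the specialized simplex and interior-point algorithms used in dedicated quantile regression software.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """Solve min_beta sum_i rho_tau(y_i - x_i'beta) as a linear program.

    Decision vector: [beta (p entries, free), u (n entries, >= 0), v (n, >= 0)],
    subject to X beta + u - v = y, so u and v are the two parts of each residual.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Usage on simulated heteroscedastic data with an intercept column.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=1 + 0.3 * x)   # noise grows with x
X = np.column_stack([np.ones_like(x), x])
print(quantile_regression_lp(X, y, 0.5))   # median regression coefficients
print(quantile_regression_lp(X, y, 0.9))   # steeper slope in the upper tail
```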

A graphical illustration of Quantile Regression

Before proceeding to a numerical example, the following subsection seeks to graphically illustrate the concept of Quantile Regression. First, as a starting point for this illustration, consider figure 1. For a given explanatory value of $x$, the density of the conditional dependent variable $y$ is indicated by the size of the balloon. The bigger the balloon, the higher is the density, with the mode, i.e. where the density is highest, for a given $x$ being the biggest balloon. Quantile Regression essentially connects the equally sized balloons, i.e. probabilities, across the different values of $x$, thereby allowing one to focus on the interrelationship between the explanatory variable $x$ and the dependent variable $y$ for the different quantiles, as can be seen in figure 2. These subsets, marked by the quantile lines, reflect the probability density of the dependent variable $y$ given $x$.

 
Figure 1: Probabilities of occurrence for individual explanatory variables

The example used in figure 2 is originally from Koenker and Hallock (2000) and illustrates a classical empirical application: Ernst Engel's (1857) investigation into the relationship between household food expenditure, being the dependent variable, and household income as the explanatory variable. In Quantile Regression the conditional function of $y$ given $x$ is segmented by the $\tau$-quantile. In the analysis, the $\tau$-quantiles, indicated by the thin blue lines that separate the different color sections, are superimposed on the data points. The conditional median ($\tau = 0.5$) is indicated by a thick dark blue line, the conditional mean by a light yellow line. The color sections thereby represent the subsections of the data as generated by the quantiles.

 
Figure 2: Engel curve, with the median highlighted in dark blue and the mean in yellow

Figure 2 can be understood as a contour plot representing a 3-D graph, with food expenditure and income on the respective y and x axes. The third dimension arises from the probability density of the respective values. The density of a value is thereby indicated by the darkness of the shade of blue; the darker the color, the higher is the probability of occurrence. For instance, on the outer bounds, where the blue is very light, the probability density for the given data set is relatively low, as they are marked by the quantiles 0.05 to 0.1 and 0.9 to 0.95. It is important to notice that figure 2 represents, for each subsection, the individual probability of occurrence; quantiles, however, utilize the cumulative probability of a conditional function. For example, a $\tau$ of 0.05 means that 5% of the observations are expected to fall below this line, while a $\tau$ of 0.25 means that 25% of the observations are expected to fall below this line, i.e. below this and all lower quantile lines.

The graph in figure 2 suggests that the error variance is not constant across the distribution. The dispersion of food expenditure increases as household income goes up. Also, the data is skewed to the left, indicated by the spacing of the quantile lines, which decreases above the median, and by the relative position of the median, which lies above the mean. This suggests that the assumption of homoscedasticity, on which OLS relies, is violated. The statistician is therefore well advised to engage in an alternative method of analysis such as Quantile Regression, which is able to deal with heteroscedasticity.
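A plot along the lines of figure 2 can be recreated in a few lines of Python. The sketch below is an assumption-laden illustration, not the original XploRe code: it relies on the statsmodels package, whose example data sets happen to include Engel's food expenditure data, and simply prints the fitted quantile lines rather than drawing them.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Engel's (1857) food expenditure data, shipped with statsmodels.
data = sm.datasets.engel.load_pandas().data   # columns: income, foodexp

# One quantile regression line per tau, plus OLS for the conditional mean.
model = smf.quantreg("foodexp ~ income", data)
for tau in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    fit = model.fit(q=tau)
    print(f"tau={tau:.2f}  intercept={fit.params['Intercept']:8.2f}  "
          f"slope={fit.params['income']:.3f}")

ols = smf.ols("foodexp ~ income", data).fit()
print(f"OLS      intercept={ols.params['Intercept']:8.2f}  "
      f"slope={ols.params['income']:.3f}")
```

The increasing slope across the quantiles mirrors the fanning-out of the quantile lines in figure 2.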

A Quantile Regression Analysis

In order to give a numerical example of the analytical power of Quantile Regression and to compare it with OLS within the boundaries of a statistical application, the following section analyzes some selected variables of the Boston Housing data set, which is available at the md-base website. The data was first analyzed by Belsley, Kuh, and Welsch (1980). The original data comprises 506 observations on 14 variables, stemming from the census of the Boston metropolitan area.

This analysis utilizes as the dependent variable the median value of owner-occupied homes (a metric variable, abbreviated H) and investigates the effects of four independent variables as shown in table 1. These variables were selected as they best illustrate the difference between OLS and Quantile Regression. For the sake of simplicity, potential difficulties related to finding the correct specification of a parametric model are neglected here, and a simple linear regression model is assumed. For the estimation of asymptotic standard errors see, for example, Buchinsky (1998), who illustrates the design-matrix bootstrap estimator, or alternatively Powell (1986) for kernel-based estimation of asymptotic standard errors.

Table 1: The explanatory variables

Name          Short  What it is                                              Type
NonrTail      T      Proportion of non-retail business acres                 metric
NoorOoms      O      Average number of rooms per dwelling                    metric
Age           A      Proportion of owner-occupied units built prior to 1940  metric
PupilTeacher  P      Pupil-teacher ratio                                     metric

In the following, first an OLS model was estimated. Three digits after the decimal point are reported in the tables, as some of the estimates turned out to be very small.

$H_i = \beta_0 + \beta_1 T_i + \beta_2 O_i + \beta_3 A_i + \beta_4 P_i + \varepsilon_i \qquad (10)$

Computing this via XploRe one obtains the results as shown in the table below.

Table 2: OLS estimates

$\hat{\beta}_0$  $\hat{\beta}_1$ (T)  $\hat{\beta}_2$ (O)  $\hat{\beta}_3$ (A)  $\hat{\beta}_4$ (P)
36.459           0.021                38.010               0.001                -0.953


Analyzing this data set via Quantile Regression, utilizing the quantiles $\tau \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$, the model is characterized as follows:

$Q_H(\tau \mid T, O, A, P) = \beta_0(\tau) + \beta_1(\tau) T + \beta_2(\tau) O + \beta_3(\tau) A + \beta_4(\tau) P \qquad (11)$

Just for illustrative purposes, and to further foster the reader's understanding of Quantile Regression, the minimization problem for the 0.1 quantile is briefly illustrated; all others follow analogously:

$\min_{\beta} \sum_{i=1}^{506} \rho_{0.1} \left( H_i - \beta_0 - \beta_1 T_i - \beta_2 O_i - \beta_3 A_i - \beta_4 P_i \right) \qquad (12)$

Solving this yields the estimated 0.1-quantile function (see table 3):

$\hat{Q}_H(0.1 \mid T, O, A, P) = 23.442 + 0.087\,T + 29.606\,O - 0.022\,A - 0.443\,P \qquad (13)$
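Since the XploRe environment used for the original computations is no longer maintained, the following is a hedged sketch of how such estimates could be reproduced today with statsmodels; the file name `boston.csv` and its column names are hypothetical stand-ins for a local copy of the Boston Housing data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical local copy of the Boston Housing data; columns assumed to be
# named after table 1 (T, O, A, P) plus the dependent variable H.
df = pd.read_csv("boston.csv")

formula = "H ~ T + O + A + P"

# OLS estimates, analogous to table 2.
print(smf.ols(formula, df).fit().params.round(3))

# Quantile Regression estimates for each tau, analogous to table 3.
model = smf.quantreg(formula, df)
for tau in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(tau, model.fit(q=tau).params.round(3).to_dict())
```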

Table 3: Quantile Regression estimates

$\tau$  $\hat{\beta}_0(\tau)$  $\hat{\beta}_1(\tau)$ (T)  $\hat{\beta}_2(\tau)$ (O)  $\hat{\beta}_3(\tau)$ (A)  $\hat{\beta}_4(\tau)$ (P)
0.1     23.442                 0.087                      29.606                     -0.022                     -0.443
0.3     15.713                 -0.001                     45.281                     -0.037                     -0.617
0.5     14.850                 0.022                      53.252                     -0.031                     -0.737
0.7     20.791                 -0.021                     50.999                     -0.003                     -0.925
0.9     34.031                 -0.067                     51.353                     0.004                      -1.257

Comparing the OLS estimates from table 2 with the Quantile Regression estimates in table 3, one finds that the latter method allows much more subtle inferences about the effect of the explanatory variables on the dependent variable. Of particular interest are thereby quantile estimates that differ considerably from those at other quantiles for the same regressor.

Probably the most interesting result, and the most illustrative with regard to the functioning of Quantile Regression and its differences to OLS, are the results for the independent variable of the proportion of non-retail business acres (T). OLS indicates that this variable has a positive influence on the dependent variable, the value of homes, with an estimate of $\hat{\beta}_1 = 0.021$, i.e. the value of houses increases as the proportion of non-retail business acres (T) increases, with regard to the Boston Housing data.

Looking at the output that Quantile Regression provides, one finds a more differentiated picture. For the 0.1 quantile, we find an estimate of $\hat{\beta}_1(0.1) = 0.087$, which suggests that for this low quantile the effect seems to be even stronger than is suggested by OLS. Here house prices go up when the proportion of non-retail businesses (T) goes up, too. However, considering the other quantiles, this effect is not quite as strong anymore; for the 0.7 and 0.9 quantiles this effect even seems to be reversed, as indicated by the parameters $\hat{\beta}_1(0.7) = -0.021$ and $\hat{\beta}_1(0.9) = -0.067$. These values indicate that in these quantiles the house price is negatively influenced by an increase of non-retail business acres (T). The influence of non-retail business acres (T) on the housing price is obviously very ambiguous, depending on which quantile one is looking at. The general conclusion from OLS that house prices increase if the proportion of non-retail business acres (T) increases can obviously not be generalized. A policy recommendation based on the OLS estimate could therefore be grossly misleading.

One would intuitively find the statement that the average number of rooms of a property (O) positively influences the value of a house to be true. This is also suggested by OLS with an estimate of $\hat{\beta}_2 = 38.010$. Quantile Regression confirms this statement, but it also allows for much subtler conclusions. There seems to be a significant difference between the 0.1 quantile and the rest of the quantiles, in particular the 0.9 quantile. For the lowest quantile the estimate is $\hat{\beta}_2(0.1) = 29.606$, whereas for the 0.9 quantile it is $\hat{\beta}_2(0.9) = 51.353$. Looking at the other quantiles, one finds values similar to that of the 0.9 quantile, with estimates of 45.281, 53.252, and 50.999 for the 0.3, 0.5, and 0.7 quantiles respectively. So for the lowest quantile the influence of an additional room (O) on the house price seems to be considerably smaller than for all the other quantiles.

Another illustrative example is provided by analyzing the proportion of owner-occupied units built prior to 1940 (A) and its effect on the value of homes. Whereas OLS indicates that this variable has hardly any influence, with an estimate of $\hat{\beta}_3 = 0.001$, Quantile Regression gives a different impression. For the 0.1 quantile, age has a negative influence on the value of the home, with $\hat{\beta}_3(0.1) = -0.022$. Comparing this with the highest quantile, where the estimate is $\hat{\beta}_3(0.9) = 0.004$, one finds that the value of the house is now positively influenced by its age. The negative influence is confirmed by all other quantiles besides the highest, the 0.9 quantile.

Last but not least, looking at the pupil-teacher ratio (P) and its influence on the value of houses, one finds that the tendency OLS indicates with a value of $\hat{\beta}_4 = -0.953$ is also reflected in the Quantile Regression analysis. However, in Quantile Regression one can see that the negative influence of the pupil-teacher ratio (P) on the housing price gradually increases in magnitude over the different quantiles, from the 0.1 quantile with an estimate of $\hat{\beta}_4(0.1) = -0.443$ to the 0.9 quantile with a value of $\hat{\beta}_4(0.9) = -1.257$.

This analysis makes clear that Quantile Regression allows one to make much more differentiated statements than OLS. Sometimes OLS estimates can even be misleading about the true relationship between an explanatory and a dependent variable, as the effects can be very different for different subsections of the sample.

Conclusion

For a distribution function $F_Y(y)$ one can determine for a given value $y$ the probability $\tau$ of occurrence. Quantiles do exactly the opposite: one determines, for a given probability $\tau$ of the sample data set, the corresponding value $y$. In OLS, the primary goal is to determine the conditional mean of a random variable $Y$, given some explanatory variable $x_i$, i.e. $E[Y \mid x_i]$. Quantile Regression goes beyond this and enables us to pose such a question at any quantile of the conditional distribution function. It focuses on the interrelationship between a dependent variable and its explanatory variables for a given quantile. Quantile Regression thereby overcomes various problems that OLS is confronted with. Frequently, error terms are not constant across a distribution, thereby violating the assumption of homoscedasticity. Also, by focusing on the mean as a measure of location, information about the tails of a distribution is lost. And last but not least, OLS is sensitive to extreme outliers, which can distort the results significantly. As indicated in the small example of the Boston Housing data, sometimes a policy based upon an OLS analysis might not yield the desired result, as a certain subsection of the population does not react as strongly to this policy, or even worse, responds in a negative way, which was not indicated by OLS.


References

Abrevaya, J. (2001): “The effects of demographics and maternal behavior on the distribution of birth outcomes,” in Economic Application of Quantile Regression, ed. by B. Fitzenberger, R. Koenker, and J. A. Machade, pp. 247–257. Physica-Verlag Heidelberg, New York.

Belsley, D. A., E. Kuh, and R. E. Welsch (1980): Regression Diagnostics. Wiley, New York.

Buchinsky, M. (1998): “Recent Advances in Quantile Regression Models: A Practical Guideline for Empirical Research,” Journal of Human Resources, 33(1), 88–126.

Cade, B.S. and B.R. Noon (2003): A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment 1(8): 412-420. http://www.fort.usgs.gov/products/publications/21137/21137.pdf

Cizek, P. (2003): “Quantile Regression,” in XploRe Application Guide, ed. by W. Härdle, Z. Hlavka, and S. Klinke, chap. 1, pp. 19–48. Springer, Berlin.

Curry, J., and J. Gruber (1996): “Saving Babies: The Efficacy and Costs of Recent Changes in the Medicaid Eligibility of Pregnant Women,” Journal of Political Economy, 104, 457–470.

Handl, A. (2000): “Quantile,” available at http://www.wiwi.uni-bielefeld.de/~frohn/Lehre/Datenanalyse/Skript/daquantile.pdf

Härdle, W. (2003): Applied Multivariate Statistical Analysis. Springer Verlag, Heidelberg.

Hyndman, R. J., and Y. Fan (1996): “Sample Quantiles in Statistical Packages,” The American Statistician, 50(4), 361–365.

Jeffreys, H., and B. S. Jeffreys (1988): Upper and Lower Bounds. Cambridge University Press.

Koenker, R., and G. W. Bassett (1978): “Regression Quantiles,” Econometrica, 46, 33–50.

Koenker, R., and G. W. Bassett (1982): “Robust Tests for Heteroscedasticity Based on Regression Quantiles,” Econometrica, 50, 43–61.

Koenker, R., and K. F. Hallock (2000): “Quantile Regression an Introduction,” available at http://www.econ.uiuc.edu/~roger/research/intro/intro.html

Koenker, R., and K. F. Hallock (2001): “Quantile Regression,” Journal of Economic Perspectives, 15(4), 143–156.

Lee, S. (2005): “Lecture Notes for MECT1 Quantile Regression,” available at http://www.homepages.ucl.ac.uk/~uctplso/Teaching/MECT/lecture8.pdf

Lewit, E. M., L. S. Baker, H. Corman, and P. Shiono (1995): “The Direct Costs of Low Birth Weight,” The Future of Children, 5, 35–51.

mdbase (2005): “Statistical Methodology and Interactive Data Analysis,” available at http://www.quantlet.org/mdbase/

Montenegro, C. E. (2001): “Wage Distribution in Chile: Does Gender Matter? A Quantile Regression Approach,” Working Paper Series 20, The World Bank, Development Research Group.

Powell, J. (1986): “Censored Regression Quantiles,” Journal of Econometrics, 32, 143– 155.

Scharf, F. S., F. Juanes, and M. Sutherland (1998): “Inferring Ecological Relationships from the Edges of Scatter Diagrams: Comparison of Regression Techniques,” Ecology, 79(2), 448–460.

XploRe (2006): “XploRe,” available at http://www.xplore-stat.de/index_js.html