
CRAN Task View: Statistics for the Social Sciences

Maintainer: John Fox


Social scientists use a wide range of statistical methods. To make the burden carried by this task view lighter, I have suppressed detail in some areas that are well covered by related task views (e.g., the Spatial task view for spatial statistics), and have pointed to those task views instead. Most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.

One area of central interest to social scientists that I do not cover here is statistical graphics, even though this is one of the great strengths of R: basic R graphics, trellis graphics (in the recommended lattice package), dynamic 3D graphs (via the rgl package), and the many packages that include facilities for various statistical graphs are just too extensive to detail here. Fortunately, a Graphics task view is currently in preparation.

Linear and Generalized Linear Models: Univariate and multivariate linear models are fit by the lm function, generalized linear models by the glm function, both in the R-base stats package. Beyond summary and plot methods for lm and glm objects, there is a wide array of functions that support these objects:

The generic anova function in the stats package constructs sequential analysis of variance and analysis of deviance tables, and can compute F and likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have anova methods as well.) The generic Anova function in the car package (associated with Fox, An R and S-PLUS Companion to Applied Regression, Sage, 2002) constructs so-called "Type-II" and "Type-III" tests for linear and generalized linear models. F and Wald tests for a variety of hypotheses are available from the coeftest and waldtest functions in the lmtest package, and the linear.hypothesis function in the car package. All of these functions permit the use of heteroscedasticity and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the sandwich and car packages. Also see the glh.test function in the gmodels package. Nonlinear functions of parameters can be tested via the delta.method function in the alr3 package (associated with Weisberg, Applied Linear Regression, 3rd Ed., Wiley, 2005). The multcomp package includes functions for multiple comparisons. The vuong function in the pscl package tests non-nested hypotheses for generalized linear and some other models. Also see the Design package for tests on linear and generalized linear models.

A basic R installation has excellent facilities for linear and generalized linear model "diagnostics," including, for example, hat-values and deletion statistics such as studentized residuals and Cook's distances (hatvalues, rstudent, and cooks.distance, all in the stats package). These are augmented by other packages: several functions in the car package, which emphasizes graphical methods, e.g., cr.plots for component-plus-residual plots and av.plots for added-variable plots, in addition to numerical diagnostics, such as vif for (generalized) variance-inflation factors; the dr package for dimension reduction in regression, including SIR, SAVE, and pHd; and the lmtest package, which implements a wide variety of tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). More diagnostic methods, e.g., for inverse-response plots, may be found in the alr3 package. The forward package implements diagnostics based on a "forward search" (Atkinson and Riani, Robust Diagnostic Regression Analysis, Springer, 2000). Other collinearity diagnostics are in the perturb package. Diagnostics may also be found in the Design package.

Several packages contain functions that are useful for interpreting linear and generalized linear models that have been fit to data: The qvcalc package computes "quasi variances" for factors in linear and generalized linear models (and more generally). The effects package constructs effect displays, including, e.g., "adjusted means," for linear and generalized linear models. The Zelig package (see under "Collections") creates displays for many kinds of statistical models.
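The base facilities described in this section can be sketched in a few lines. This is a minimal illustration using only the stats package; the built-in mtcars data and the particular model formulas are assumptions chosen for the example, not part of the task view.

```r
# Two nested linear models fit to the built-in mtcars data (illustrative choice)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_full  <- lm(mpg ~ wt + hp, data = mtcars)

# F test for the nested models via the generic anova function
nested_test <- anova(fit_small, fit_full)
print(nested_test)

# A binomial logit model fit with glm; am is a 0/1 transmission indicator
fit_logit <- glm(am ~ wt, family = binomial, data = mtcars)

# Sequential analysis-of-deviance table with likelihood-ratio tests
lr_test <- anova(fit_logit, test = "Chisq")
print(lr_test)

# Hat-values and deletion statistics for the larger linear model
diagnostics <- data.frame(
  hat      = hatvalues(fit_full),
  rstudent = rstudent(fit_full),
  cook     = cooks.distance(fit_full)
)

# The three cases with the largest Cook's distances
head(diagnostics[order(diagnostics$cook, decreasing = TRUE), ], 3)
```

The hat-values of a linear model sum to the number of estimated coefficients (three here), which gives a quick sanity check on the diagnostics table.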

Analysis of Categorical and Count Data: Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the glm function in the stats package. For over-dispersed data, see also the aod package and the glm.nb function in the recommended MASS package (in the VR package-bundle, associated with Venables and Ripley, Modern Applied Statistics with S, Fourth Ed., Springer, 2002), which fits negative-binomial GLMs. The multinomial logit model is fit by the multinom function in the recommended nnet package (also part of the VR bundle), and ordered logit and probit models by the polr function in the MASS package. Also see the MNP package for the multinomial probit model, and multinomRob for the analysis of overdispersed multinomial data. There are other noteworthy facilities for analyzing categorical and count data:

The table function in the R-base base package and the xtabs and ftable functions in the stats package construct contingency tables. The chisq.test and fisher.test functions in the stats package may be used to test for independence in two-way contingency tables.

The loglin function in the stats package fits hierarchical loglinear models to contingency tables by iterative proportional fitting; the loglm function in the MASS package provides a formula-based front end to loglin. Also see the brlr package for bias-reduced logistic regression (useful, e.g., in cases of complete separation); the catspec package for multi-way percentage tables and models for square tables; the exactLoglinTest package for exact tests of loglinear models; the gllm package for the analysis of incomplete tables; the clogit function in the survival package for conditional logistic regression; and the vcd package for graphical displays of categorical data. The gnm package estimates generalized nonlinear models, and can be used, e.g., to fit certain specialized models to mobility tables.
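The table-building and testing functions mentioned above can be tried on data that ship with R. The following is a minimal sketch; the HairEyeColor and mtcars data sets are assumptions chosen for illustration, and loglin is called as it is provided by the stats package in current versions of R.

```r
# Two-way table of hair by eye colour, collapsed over sex with xtabs
hair_eye <- xtabs(Freq ~ Hair + Eye, data = as.data.frame(HairEyeColor))
print(hair_eye)

# A "flat" display of the full three-way table
print(ftable(HairEyeColor, row.vars = c("Sex", "Hair")))

# Pearson chi-square test of independence in the two-way table
indep_test <- chisq.test(hair_eye)
print(indep_test)

# The same independence model fit by iterative proportional fitting;
# margin = list(1, 2) fits the row and column margins only
ipf_fit <- loglin(hair_eye, margin = list(1, 2), fit = TRUE, print = FALSE)
c(pearson = ipf_fit$pearson, lrt = ipf_fit$lrt, df = ipf_fit$df)

# Fisher's exact test for a small 2 x 2 table built from raw cases
small_tab <- xtabs(~ am + vs, data = mtcars)
exact_test <- fisher.test(small_tab)
print(exact_test)
```

For the independence model, the Pearson statistic reported by loglin agrees with the one from chisq.test, since both are computed from the same fitted counts.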

Other Regression Models: It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and a much wider variety of models with contributed packages:

Nonlinear regression: The nls function in the stats package fits nonlinear models by least-squares.

Generalized least-squares regression: The gls function in the recommended nlme package fits models by generalized least squares. (The lm function can also fit weighted least-squares regressions; also see the dyn and dynlm packages, which allow gls, lm, and other regression functions to handle time-series data structures.)

Mixed-effects models: The recommended nlme package, associated with Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS (Springer, 2000), fits linear and nonlinear mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the glmmPQL function in the MASS package, and by the lmer function in the Matrix package (related to the lme4 package, which largely supersedes nlme for linear mixed models). Also see the lmeSplines and lmm packages.

Generalized estimating equations: The gee and geepack packages fit marginal models by generalized estimating equations.

Nonparametric regression analysis: This is one of the conspicuous strengths of R. A standard R installation includes several functions for smoothing scatterplots, including loess.smooth and smooth.spline, both in the stats package. The loess function in the stats package fits simple and multiple-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended mgcv package, and the gam package, the latter associated with Hastie and Tibshirani, Generalized Additive Models (Chapman and Hall, 1990). Some other noteworthy contributed packages in this area are gss, which fits spline regressions; locfit, for local-polynomial regression (and also density estimation) (Loader, Local Regression and Likelihood, Springer, 1999); sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini, Applied Smoothing Techniques for Data Analysis, Oxford, 1997); and acepack, for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.

Robust regression: The rlm function fits linear models by M-estimation and lqs computes bounded-influence estimators; both are in the MASS package. (The cov.rob function in the same package computes a robust covariance-matrix estimator.) Also see the quantreg package, which computes linear, nonlinear, and nonparametric quantile regressions; the roblm package, for MM estimation; and the riv package, for robust instrumental-variables estimation.

Structural-equation models: The sem package fits general (i.e., latent-variable) SEMs by FIML, and structural equations in observed-variable models by 2SLS. Categorical variables in SEMs can be accommodated via the polycor package. The systemfit package implements a wider variety of estimators for observed-variable models, including nonlinear simultaneous-equations models. See also the pls package, for partial least-squares estimation, and the gR task view for graphical models.

Selection bias and censored regression: Censored regression models, such as the tobit model, can be fit by the survreg function in the recommended survival package. The rq function in the quantreg package can estimate censored quantile-regression models. The hurdle and zeroinfl functions in the pscl package fit hurdle and zero-inflated Poisson and negative-binomial models to count data; also see the zicounts package. The heckit function in the micEcon package implements two-step Heckman estimators to correct for sample-selection bias. Also see under Survival Analysis below.
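Three of the model-fitting functions named in this section (nls, loess, and rlm) can be compared side by side on data that ship with R. This is a rough sketch only: the data sets, the exponential mean function, and the starting values are assumptions made for the illustration.

```r
library(MASS)  # recommended package providing rlm

# Nonlinear least squares: an exponential decay of mpg in weight
# (the functional form and starting values are illustrative assumptions)
fit_nls <- nls(mpg ~ a * exp(b * wt), data = mtcars,
               start = list(a = 40, b = -0.3))
print(coef(fit_nls))

# Local polynomial regression of stopping distance on speed
fit_lo <- loess(dist ~ speed, data = cars)
smooth_at <- predict(fit_lo, data.frame(speed = c(10, 20)))
print(smooth_at)

# Least squares versus M-estimation on the stackloss data
fit_ols <- lm(stack.loss ~ ., data = stackloss)
fit_rob <- rlm(stack.loss ~ ., data = stackloss)
print(round(cbind(OLS = coef(fit_ols), M = coef(fit_rob)), 3))
```

Printing the least-squares and M-estimates in adjacent columns makes it easy to see which coefficients are sensitive to outlying observations.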

Other Statistical Methods: Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists:

Survival (Event-History) Analysis: There is an extensive implementation of methods of survival analysis in the recommended survival package, which is associated with Therneau and Grambsch, Modeling Survival Data (Springer, 2000). Also see the eha, survrec, frailtypack, and Design packages.

"Dimensional" Analysis: Exploratory maximum-likelihood factor analysis is implemented in the factanal function in the stats package, which also provides for varimax and promax factor rotation. (Confirmatory factor-analysis models can be fit with the sem package.) Additional rotations are available through functions in the GPArotation package. The prcomp and princomp functions in the stats package perform principal-components analysis. The cmdscale function in the stats package performs metric multidimensional scaling, while the isoMDS and sammon functions in the MASS package perform non-metric multidimensional scaling. For methods of cluster analysis and mixtures see the Cluster task view. The BradleyTerry package fits the Bradley-Terry model for paired comparisons. The ltm package fits Rasch and other item-response models to binary items. The concord and irr packages contain functions for assessing inter-rater reliability; also see the psy package.

Other Multivariate Statistics: See the Multivariate task view, which includes information on graphs for visualizing multivariate data.

Missing Data: A variety of packages implement methods for handling missing data by multiple imputation, including the norm, cat, mix, and pan packages associated with Schafer, Analysis of Incomplete Multivariate Data (Chapman and Hall, 1997), and the mice and mitools packages (the latter for drawing inferences from multiply imputed data sets). Also see the EMV and mvnmle packages. There are also some facilities for missing-data imputation in the general Hmisc package, which is described below, under "Collections".

Bootstrapping and Other Resampling Methods: The recommended package boot, associated with Davison and Hinkley, Bootstrap Methods and Their Application (Cambridge, 1997), has excellent facilities for bootstrapping and some related methods. Also notable is the bootstrap package, associated with Efron and Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, 1993), which has functions for bootstrapping and jackknifing.

Model Selection: The step function in the stats package and the more broadly applicable stepAIC function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The regsubsets function in the leaps package performs all-subsets regression. The BMA package performs Bayesian model averaging. Beyond these, see the MachineLearning task view.

Social Network Analysis: There are several packages useful for social network analysis, including sna for sociometric analysis of networks (e.g., blockmodeling), network for manipulating and displaying network objects, and latentnet for latent position and cluster models for networks.

Bayesian Statistical Methods: Because of its easy programmability, R is a natural environment within which to implement and use Bayesian methods, and there are many packages that provide such methods, including interfaces to external Bayesian software, such as BUGS. For details, see the Bayesian task view.

Spatial Statistics: In addition to the recommended spatial package (part of the VR bundle), see the Spatial task view for an extensive list of functions and packages for spatial data analysis.

Time-Series Analysis: Beyond time-series regression (see generalized least-squares regression, above), R has very extensive facilities for time-series analysis, both in the standard R distribution and in contributed packages; for details, see the Econometrics and Finance task views.

Surveys: The sampling package includes functions for selecting survey samples; the survey package includes functions for the analysis of data from complex sample surveys, among them functions for fitting linear and generalized linear models.

Meta Analysis: See the meta and rmeta packages.

Propensity Scores and Matching: See the Matching, MatchIt, optmatch, and USPS packages.
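Several of the base and recommended functions surveyed above can be exercised on data sets that ship with R. The sketch below is illustrative only: the data (USArrests, eurodist, mtcars), the bootstrapped statistic, and the helper name cor_stat are assumptions made for the example.

```r
library(boot)  # recommended package for bootstrapping

# Principal-components analysis on standardized variables
pca <- prcomp(USArrests, scale. = TRUE)
print(summary(pca))

# Metric multidimensional scaling of road distances between European cities
mds <- cmdscale(eurodist, k = 2)
print(head(mds, 3))

# Bootstrap of a correlation coefficient; cor_stat is a hypothetical helper
# that recomputes the statistic on the resampled rows given by idx
cor_stat <- function(data, idx) cor(data$Murder[idx], data$Assault[idx])
set.seed(123)
boot_out <- boot(USArrests, cor_stat, R = 999)
print(boot.ci(boot_out, type = "perc"))

# Stepwise selection by AIC, starting from a four-predictor linear model
start_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
selected <- step(start_fit, trace = 0)
print(formula(selected))
```

Setting the seed before calling boot makes the resampling reproducible; the percentile interval is only one of the interval types that boot.ci can report.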

Collections of Functions: There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are potentially of interest to social scientists:

I have already made several references to the recommended MASS package, which is associated with Venables and Ripley's Modern Applied Statistics With S. Other recommended packages associated with this book are nnet, for fitting neural networks (but also, as mentioned, multinomial logistic-regression models); spatial for spatial statistics; and class, which contains functions for classification. All are part of the VR bundle.

The Hmisc and Design packages (both mentioned above), associated with Harrell, Regression Modeling Strategies (Springer, 2001), provide functions for data manipulation, linear models, logistic-regression models, and survival analysis, many of them "front ends" to or modifications of other facilities in R.

The Zelig package integrates a wide array of statistical models of interest to social scientists (see the Zelig web site for details).

Base R ships with a lot of functionality useful for computational econometrics, in particular in the stats package. This functionality is complemented by many packages on CRAN; a brief overview is given below. There is also considerable overlap between the tools for econometrics in this view and those for finance in the Finance view. Furthermore, the finance SIG is a suitable mailing list for obtaining help and discussing questions about both computational finance and econometrics. The packages in this view can be roughly structured into the following topics. If you think that some package is missing from the list, please let me know.

Linear regression models: Linear models can be fitted (via OLS) with lm() (from stats) and standard tests for model comparisons are available in various methods such as summary() and anova(). Analogous functions that also support asymptotic tests (z instead of t tests, and Chi-squared instead of F tests) and plug-in of other covariance matrices are coeftest() and waldtest() in lmtest. Tests of more general linear hypotheses are implemented in linear.hypothesis() in car. HC and HAC covariance matrices that can be plugged into these functions are available in sandwich. The packages car and lmtest also provide a large collection of further methods for diagnostic checking in linear regression models.

Microeconometrics: Many standard microeconometric models belong to the family of generalized linear models (GLM) and can be fitted by glm() from package stats. This includes in particular logit and probit models for modelling choice data and Poisson models for count data. Negative binomial GLMs are available via glm.nb() in package MASS from the VR bundle. Zero-inflated count models are provided in zicounts. Further over-dispersed and inflated models, including hurdle models, are available in package pscl. Bivariate Poisson regression models are implemented in bivpois. Basic censored regression models (e.g., tobit models) can be fitted by survreg() in survival. Further, more refined tools for microeconometrics are provided in micEcon. The package bayesm implements a Bayesian approach to microeconometrics and marketing. Inference for relative distributions is contained in package reldist.

Further regression models: Various extensions of the linear regression model and other model fitting techniques are available in base R and several CRAN packages. Nonlinear least squares modelling is available in nls() in package stats. Relevant packages include quantreg (quantile regression), sem (linear structural equation models, including two-stage least squares), systemfit (simultaneous equation estimation), betareg (beta regression), nlme (nonlinear mixed-effect models), VR (multinomial logit models in package nnet) and MNP (Bayesian multinomial probit models). The packages Design and Hmisc provide several tools for extended handling of (generalized) linear regression models.

Basic time series infrastructure: The class "ts" in package stats is R's standard class for regularly spaced time series, which can be coerced back and forth without loss of information to "zooreg" from package zoo. zoo provides infrastructure for both regularly and irregularly spaced time series (the latter via the class "zoo") where the time information can be of arbitrary class. Several other implementations of irregular time series building on the "POSIXt" time-date classes are available in its, tseries and fCalendar, which are all aimed particularly at finance applications (see the Finance view).

Time series modelling: Classical time series modelling tools are contained in the stats package and include arima() for ARIMA modelling and Box-Jenkins-type analysis. Furthermore, stats provides StructTS() for fitting structural time series and decompose() and HoltWinters() for time series filtering and decomposition. For estimating VAR models, several methods are available: simple models can be fitted by ar() in stats, more elaborate models are provided by estVARXls() in dse, and a Bayesian approach is available in MSBVAR. A convenient interface for fitting dynamic regression models via OLS is available in dynlm; a different approach that also works with other regression functions is implemented in dyn. More advanced dynamic system equations can be fitted using dse. Unit root and cointegration techniques are available in urca, uroot and tseries.

Matrix manipulations: As a vector- and matrix-based language, base R ships with many powerful tools for doing matrix manipulations, which are complemented by the packages Matrix and SparseM.

Inequality: For measuring inequality, concentration and poverty the package ineq provides some basic tools such as Lorenz curves, Pen's parade, the Gini coefficient and many more.

Structural change: R is particularly strong when dealing with structural changes and changepoints in parametric models; see strucchange and segmented.

Data sets: Many of the packages in this view contain collections of data sets from the econometric literature, and the package Ecdat contains a complete collection of data sets from various standard econometric textbooks. micEcdat provides several data sets from the Journal of Applied Econometrics and the Journal of Business & Economic Statistics data archives. Package pwt provides the Penn world table.
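The classical time series tools in the stats package mentioned above can be tried directly on "ts" objects that ship with R. This is a minimal sketch; the series (lh, AirPassengers, co2) and the AR(1) specification are assumptions chosen for illustration.

```r
# lh is a regularly spaced series of class "ts" shipped with R
print(class(lh))

# ARIMA modelling: an AR(1) model with a mean term
fit_ar1 <- arima(lh, order = c(1, 0, 0))
print(fit_ar1)

# Forecasts three steps ahead
ahead <- predict(fit_ar1, n.ahead = 3)$pred
print(ahead)

# Autoregressive model with the order selected by AIC
fit_ar <- ar(lh)
print(fit_ar$order)

# Classical decomposition into trend, seasonal, and random components
parts <- decompose(AirPassengers, type = "multiplicative")

# Holt-Winters filtering with trend and additive seasonality
hw <- HoltWinters(co2)
print(c(alpha = unname(hw$alpha), beta = unname(hw$beta),
        gamma = unname(hw$gamma)))
```

The same fitted objects have plot methods (for example plot(parts) or plot(hw)), which are often the quickest way to inspect a decomposition.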

Jeff Racine and Rob Hyndman have an article Using R to Teach Econometrics, Journal of Applied Econometrics, Vol. 17, No. 2, 2002, pp. 149-174. [Associated files]

Mahmood Arai has written a useful document A brief guide to R for beginners in Econometrics.

Econ 472, at UIUC, has a nice website which has many examples in R.

An Introduction to Modern Bayesian Econometrics by Tony Lancaster, Blackwell, May 2004. The book uses R.

I have heard good things about a book which focuses on MLE: Yudi Pawitan (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, Clarendon Press, Oxford. The associated website has a good set of R codes.

Edward W. Frees has a book project Longitudinal and Panel Data: Analysis and Applications for the Social Sciences with an associated web page with R codes.

See my set of files R by example. http://www.mayin.org/ajayshah/KB/R/index.html
