AUTHOR: Gries, Stefan Th.
TITLE: Statistics for Linguistics with R
SUBTITLE: A Practical Introduction
SERIES TITLE: Trends in Linguistics. Studies and Monographs [TiLSM] 208
PUBLISHER: De Gruyter Mouton
Andrew Caines, Computation Cognition and Language Group, Research Centre for
English and Applied Linguistics, University of Cambridge
This book discusses methods of statistical analysis using R, the open source
software. It contains numerous linguistic datasets as case studies, features
'think breaks' and exercises at the end of every chapter, and gives
comprehensive instruction on how to code the required calculations and chart
plotting in R. There is also higher level discussion of experiment design, as
well as top-to-tail coaching in diligent empirical research: from hypothesis
formation to data collection, and thereafter through to appropriate statistical
analysis to reporting the results.
Prerequisites to getting the most out of this book include downloading R which,
being open source, is free (instructions can be found in chapter 2); and
downloading the data and exercise files from the companion website. These files
also contain all the code that is shown and referred to in the text of each
chapter, and answer keys for the exercises.
The first two chapters deal with the essential preliminaries. Chapter 1 outlines
procedures for empirical work, including hypothesis formation,
operationalization of variables, best practice for data collection and
annotation, and experiment design. Chapter 2 gives instruction on how to install
R, how to obtain the code and exercise files from the companion website, how to
load, manipulate and save data in the R console, and the difference between
vectors, factors and data frames. It is essential at this point that the reader
has acquired the skills presented so far, whether from previous experience or
from working through chapter 2. From this point forward, Gries proceeds with
case studies which the reader should simultaneously work through in R so as to
get the most out of the book. To do so, the skills presented in chapter 2 are
Chapter 3 introduces various descriptive statistical methods relating to
measures of central tendency and bivariate analysis. The measures of central
tendency include the arithmetic and geometric mean, the median and mode. The
measures of dispersion -- at least one of which, Gries emphasises, should always
accompany any measure of central tendency -- covered here are relative entropy,
range, quartiles and quantiles, average deviation, standard deviation, variation
coefficient and standard error. The next section describes centering and
standardization methods (i.e., z-scores), followed by confidence intervals. The
chapter closes with a section on bivariate statistics, necessary to characterize
datasets with more than one variable and the relation(s) among those variables.
The methods discussed include crosstabulation, correlation coefficients, linear
regression and a range of plotting techniques. The plot types covered in this
chapter include scatter, mosaic, box, spine, line and bar plots.
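By way of illustration, the kind of calculation covered in chapter 3 can be sketched in a few lines of base R. This is not code from the book: the vector word_len and its values are invented here purely as an example, and the sketch shows only one measure of dispersion alongside the mean, together with z-standardization.

```r
# Hypothetical data: lengths (in characters) of ten word tokens
word_len <- c(2, 3, 3, 4, 4, 4, 5, 6, 7, 12)

centre <- mean(word_len)                     # arithmetic mean
middle <- median(word_len)                   # median
spread <- sd(word_len)                       # sample standard deviation
se     <- spread / sqrt(length(word_len))    # standard error of the mean

# z-scores: centre on the mean, then divide by the standard deviation
z <- (word_len - centre) / spread

# The built-in scale() function performs the same standardization
z_builtin <- as.numeric(scale(word_len))
```

Reporting the mean together with the standard deviation, rather than the mean alone, follows the recommendation that a measure of central tendency should always be accompanied by a measure of dispersion.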
The fourth chapter turns to analytical statistics and identifies the appropriate
calculation(s) for various experiment design scenarios, according to number of
variables, variable type (dependent or independent), and data type (nominal,
ordinal, ratio-scaled). The distribution tests include measures of
goodness-of-fit, such as chi-square, as well as the finer details of
distributions -- dispersions and means. In addition, Gries shows how to compute
a table of p-values in R adjusted for degrees of freedom. There are further
demonstrations of correlation and regression analyses as well as versatile plot
types such as association plots, cross-tabulation plots and strip charts.
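As a rough sketch of the goodness-of-fit testing described above, the following uses the base R function chisq.test on a one-dimensional table of counts. The counts and the names variant_a and variant_b are hypothetical and do not come from the book; the null hypothesis assumed here is that both variants are equally frequent.

```r
# Hypothetical counts of two word-order variants in a sample of 100 clauses
observed <- c(variant_a = 60, variant_b = 40)

# Null hypothesis: both variants are equally likely
gof <- chisq.test(observed, p = c(0.5, 0.5))

chi_sq  <- unname(gof$statistic)   # chi-square value
df      <- unname(gof$parameter)   # degrees of freedom
p_value <- gof$p.value             # probability of the data under the null
```

With these invented counts the expected frequency is 50 per variant, so the test statistic is 4 on 1 degree of freedom.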
Chapter 5 takes an advanced step into multifactorial modelling. After all, "we
live in a multifactorial world in which probably no phenomenon is really
monofactorial" (p238). The techniques explored are multiple regression analysis,
both mono- and multifactorial analysis of variance (ANOVA), binary logistic
regression, and cluster analysis. To conclude the book, there is a brief but
very thought-provoking Epilog in chapter 6 (on which more below).
This book successfully performs at least three roles. Firstly, it gives a
wide-ranging overview of statistical techniques and when to use them. Secondly,
it is a well-written instruction manual on how to carry out these techniques in
R. Thirdly, in terms of a more general context, it codifies the standards which
should be observed in empirical linguistic work. The book's most significant
contribution is in bringing together advice on both experiment design and data
analysis in one volume. The comparable work by Baayen (2008), for instance, goes
further in its exploration of clustering, regression and mixed models but lacks
the section on best practice for experiment design with which Gries begins. Each
has its own role, then -- Baayen (2008) being more narrowly focused on
statistics and the present volume being a more complete guide to linguistic
This book by Gries, along with others such as Baayen (2008) and Gries (2009b),
provides some much-needed rigour to linguistic study. It is desirable that such
high standards and procedures for empirical work are followed, if research is to
be properly discussed and built upon. If all university linguistics courses
could feature one of these works as a textbook it would be an important step in
the right direction. Not only is there guidance on data collection and
analysis, there are also recommendations on how to summarize the results of
statistical analyses in prose -- an essential skill for journal papers and
conference proceedings which is all too easy to get wrong, especially when
blindly imitating other publications.
The interactive nature of this book is its best asset. Code files are
available from the companion website so that the reader can follow the narrative
and replicate the case studies, and there are exercises at the end of every
chapter and frequent 'think breaks'. Much effort and care have evidently gone into preparing
the accompanying code and exercise files. The code files contain an appropriate
amount of editorial comment as well as suggestions for the reader to try
alternative statistical or graphical techniques to those outlined in the main
text of the book.
The calculations range from the straightforward (p64), such as this:
> sqrt(9) # compute square root of 9
to the relatively complex (p296), like this:
> model.lrm<-lrm(CONSTRUCTION1 ~ V_CHANGPOSS + REC_ACT + PAT_ACT, x=T, y=T,
linear.predictors=T) # compute binary logistic regression of pre-loaded dataset
The reader may well feel challenged by the advancement in complexity but the
progression is steady enough that there should be no reason for it to be
overwhelming, thanks moreover to the supporting code files and exercise answer
keys. One enhancement, however, would have been a glossary of the functions
covered in the book. In the absence of a glossary, the index is comprehensive
enough to perform this function but a ready-reference function list would have
been better still.
The plot types introduced in this book (and the statistical methods themselves,
for that matter) are acknowledged as being only the tip of the iceberg. The
reader is referred to specialist works for more advanced techniques (those cited
in the book are Murrell 2005, Cook and Swayne 2007 and Sarkar 2008; see also
Wickham 2009).
However, with histograms, boxplots, stripcharts, pie charts, scatter plots,
association plots and many more, there are plenty of data display methods to
satisfy most needs.
The Epilog (chapter 6) observes that the sections on linear models (ANOVAs and
regressions) are short and points the reader to references on further techniques
which might be of use: Poisson regression, repeated-measures ANOVAs and
multi-level models. The reader is also referred to R libraries containing more
powerful graphical tools and books dedicated to graphics in R. Finally, both to
"shake up a bit what you have learnt so far" and "stimulate some curiosity for
what else is out there" (p320), Gries observes that the null hypothesis testing
paradigm which is central to every case study in the book is not quite so
uncontroversial as it seems and is generally held to be. This is an appropriate
overview of what the book stands for: on the one hand it offers more than enough
for the reader to get by with data collection, analysis and reporting; on the
other hand it can be the jumping off point for the reader's exploration into
more advanced statistical methods and theoretical considerations. The book is
thus not only an introduction to data analysis with R but also an introduction
to statistical theory and reconsideration of current techniques.
Gries has diligently compiled a work of great use and interest. It is relevant
above all to linguistics students and researchers, and can readily act as a
textbook for taught courses. It should be noted that the book is equally useful
as a reference guide, with the analysis scenarios sufficiently well labelled and
organized so that the reader can dip into it as and when necessary, or as a
complete set of exercises which the reader can work through section by section.
Baayen, R. H. (2008). Analyzing Linguistic Data: a Practical Introduction to
Statistics using R. Cambridge: Cambridge University Press.
Cook, D. and D. F. Swayne (2007). Interactive and Dynamic Graphics for Data
Analysis. New York: Springer.
Gries, St. Th. (2009b). Quantitative Corpus Linguistics with R: A practical
introduction. London: Routledge.
Murrell, P. (2005). R Graphics. Boca Raton, FL: Chapman and Hall / CRC.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. New York:
Springer.
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. New York:
Springer.
ABOUT THE REVIEWER