
Review of Analyzing Linguistic Data

Reviewer: Aditi Ghosh
Book Title: Analyzing Linguistic Data
Book Author: Harald R. Baayen
Publisher: Cambridge University Press
Linguistic Field(s): Text/Corpus Linguistics
Discipline of Linguistics
Book Announcement: 19.3453

AUTHOR: Baayen, Harald
TITLE: Analyzing Linguistic Data
SUBTITLE: A Practical Introduction to Statistics using R
PUBLISHER: Cambridge University Press
YEAR: 2008

Aditi Ghosh, Department of Linguistics, University of Calcutta, Kolkata

This book is a guidebook for researchers and students who want to use
statistical computation to analyze linguistic data. It teaches how different
types of linguistic data can be dealt with quantitatively by using 'R', a free
statistical environment that implements the S language originally developed at
AT&T Bell Laboratories. The book is divided into seven chapters, each giving
instruction on how to use R to analyze
different types of linguistic data, starting from relatively simple problems,
gradually leading to more complex ones. The first chapter, entitled 'An
introduction to R', as anticipated, gives instructions on the basic handling of
R: how to download the relevant packages, how to use the software on
different operating systems, how to import or export datasets and even how to
use R as a simple calculator. This chapter also teaches how to select a portion
of data out of a data set, how to order or sort a data frame, how to change
specific information and how to extract relevant portions out of a data frame.
Lastly, it shows how to perform basic calculations on a data frame, such as the
mean or the sum of a numeric vector. The chapter, like all the other chapters in
the book, is followed by a set of exercises, and the solutions to these are
provided in Appendix A.
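
The data-frame operations listed above (selecting rows, sorting, and computing a mean or a sum) can be sketched as follows. The book itself works in R; this is only a Python stand-in using the standard library, and the toy records and field names are invented for illustration rather than taken from the book's datasets.

```python
from statistics import mean

# Hypothetical toy "data frame": one dict per row (not a dataset from the book).
rows = [
    {"word": "walk", "frequency": 120, "length": 4},
    {"word": "stroll", "frequency": 15, "length": 6},
    {"word": "run", "frequency": 300, "length": 3},
    {"word": "amble", "frequency": 5, "length": 5},
]

# Select a portion of the data: rows whose frequency is above 10.
frequent = [r for r in rows if r["frequency"] > 10]

# Sort the rows by frequency, highest first.
by_frequency = sorted(rows, key=lambda r: r["frequency"], reverse=True)

# Basic calculations on a numeric column.
frequencies = [r["frequency"] for r in rows]
total_frequency = sum(frequencies)
mean_frequency = mean(frequencies)

print([r["word"] for r in frequent])
print([r["word"] for r in by_frequency])
print(total_frequency, mean_frequency)
```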

The second chapter, "Graphical Data Exploration", deals with producing
graphical presentations of data. It starts with a brief definition of a random
variable and goes on to show how bar plots and histograms can be produced and
how curves can be added to a histogram. The chapter also shows how to plot
ordered values and densities and how to generate boxplots. For two or more
variables, it is often more useful to create a mosaic plot or a scatter plot,
and the third section of this chapter deals with this. The final section
introduces trellis graphics, in which the data are represented by many
systematically organized graphs at the same time.

The third chapter, "Probability Distributions", begins with an introduction to
distributions and goes on to demonstrate how to deal with discrete and
continuous distributions, introducing the relevant functions available in R. It
also introduces the Poisson distribution and different types of normal
distribution. Lastly, it introduces three important continuous distributions,
namely the t, F and chi-squared distributions, and the functions used in R for
these distributions.

Chapter four is entitled "Basic statistical methods". The first two sections of
this chapter introduce tests for single vectors and for two independent vectors.
For a single vector, one can examine the distribution by plotting the density in
different kinds of graphical representations, or one can use appropriate tests,
such as the Shapiro-Wilk test for normality or the one-sample Kolmogorov-Smirnov
test. To test the mean of a single vector one can use a t-test. To compare the
distributions of two independent vectors, one can plot them as two differently
colored lines. To check whether the means of the two vectors are the same, one
can use the boxplot function to compare their distributions, or one may run a
t-test to verify whether the two means are significantly different. R also has
specific functions to test whether the variances of the two vectors are the
same. For paired vectors, the t-test and the Wilcoxon test can again be used. To
find out whether two vectors are significantly related, one can plot their
individual points in a scatterplot and obtain a regression line. This chapter
also shows how to evaluate the significance of a correlation between two
variables. It discusses problems that arise in linear regression and ways of
dealing with them. The chapter also shows how to examine the joint density of
two paired vectors, and how to deal with one or two numerical vectors and a
factor and with two vectors with counts. The final section explains how to
estimate the significance of a statistical test.
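
As a rough illustration of two of the methods summarized above, the sketch below computes a two-sample t statistic and a least-squares regression line with a Pearson correlation. It is not code from the book, which uses R; it is a standard-library Python stand-in. The function names and the toy vectors are assumptions made for this example, the t statistic is Welch's unequal-variance form, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two independent samples (statistic only)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y ~ x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy vectors, invented for illustration.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]
t_statistic = welch_t(group_a, group_b)

predictor = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = least_squares_line(predictor, response)
r = pearson_r(predictor, response)

print(t_statistic, slope, intercept, r)
```

Turning the statistic into a significance level additionally requires the t distribution, which the Python standard library does not provide; statistical environments such as R supply it.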

Chapter five, ''Clustering and classification'', discusses methods of handling
more than two vectors. With the help of 'principal component analysis' and
'factor analysis', it explores the relationship between the use of 27
derivational affixes and the type of text in which they appear. In the next two
sub-sections we are introduced to 'correspondence analysis' and
'multidimensional scaling', used to create a low-dimensional map of the data and
to trace structure in a matrix of distances respectively. The last subsection of
this section (5.1.5) considers hierarchical cluster analysis – techniques to
cluster data and display them in a tree diagram. The next section (5.2) moves
from clustering to classification of data. The first subsection teaches how to
create a classification tree with CART (classification and regression tree)
analysis. CART and related analyses are applied to the 'dative' data to find
out whether the realization of the recipient as NP or PP can be predicted from
other variables such as 'semantic class', 'length of theme' etc. The second
subsection shows how linear discriminant analysis can be done in R to predict an
item's class from a set of numerical predictors. The last subsection (5.2.3)
demonstrates the use of 'support vector machines' for classification.

Chapter six takes up regression modeling, a topic which was introduced earlier
in chapter four. This chapter discusses multiple linear regression and the
related functions available in R, for example for ordinary least squares
regression. Two subsections show how to deal with models with a nonlinear
relation between independent and dependent variables and with datasets in which
all the independent variables are strongly correlated. The two following
subsections show how to check whether the model that one arrives at is
satisfactory. The third section introduces 'generalized linear models'. Two
subsections in this section (6.3.1 and 6.3.2) show how binary responses can be
handled in R with a logistic regression model and how ordered responses can be
handled with ordinal logistic regression. Section 6.4 demonstrates how to deal
with discontinuity in an otherwise linear relation. The next section (6.5) shows
how to study lexical richness. It introduces various functions available in R to
identify the unique units in a dataset, to compare datasets etc. The last
section in this chapter
discusses some general issues in using statistical models with reference to the
examples used in this chapter.
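
To give a concrete sense of what counting "unique units" for lexical richness involves, here is a minimal sketch. It is a Python stand-in rather than the R functions the book introduces, and the measures shown (type count, type-token ratio, hapax legomena) are common textbook choices assumed for illustration, not necessarily the ones the book uses. The function name and sample sentence are hypothetical.

```python
from collections import Counter


def lexical_richness(tokens):
    """Summarize a token list: tokens, types, type-token ratio, hapaxes."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)  # number of distinct word forms
    hapaxes = sum(1 for c in counts.values() if c == 1)  # forms seen once
    ratio = n_types / n_tokens if n_tokens else 0.0
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": ratio,
        "hapax_legomena": hapaxes,
    }


# Hypothetical sample text, invented for illustration.
text = "the cat sat on the mat and the dog sat too"
summary = lexical_richness(text.split())
print(summary)
```

The type-token ratio depends strongly on text length, so comparisons between datasets are only meaningful for samples of similar size.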

The last chapter is entitled "Mixed Models". The first four sections of this
chapter deal with various strategies for building mixed models. Section 7.1
introduces the R packages and functions for fitting mixed-effects models and
illustrates their use on datasets. The next section compares mixed-effects
models with traditional approaches such as quasi-F ratios, Latin square designs
and traditional regression. The following section deals with BLUPs (best linear
unbiased predictors), which mixed-effects models, unlike classical models, make
available and which provide 'shrinkage' estimates for the by-subject and by-item
adjustments. Section 7.4 discusses the mixed-model counterpart of the
generalized linear model, the 'generalized linear mixed model'. The last section
(7.5) presents case studies where mixed models are put into practice.

The chapters are followed by two appendices. Appendix A provides solutions to
the exercises that appear at the end of each chapter and Appendix B gives an
overview of functions for R. There are four indexes – of datasets, of R, of
topics and of authors.

This book is like a course or a tutorial on how to use R to analyze linguistic
data. Linguistic research has successfully used quantitative tools for a long
time now and there are quite a few introductory books dealing with the subject
(e.g. Douglas 1943, Herdan 1964, Butler 1985, Woods, Fletcher & Hughes 1986,
Tesitelová 1992, Rietveld & van Hout 1993, Kretzschmar & Schneider 1996,
Paolillo 2002, Johnson 2008, Rasinger 2008). However, this field is
developing rapidly and researchers need up-to-date knowledge about the new
resources available. Since R is becoming one of the most widely used tools for
statistical analysis in the social sciences, books exploring its utility in
linguistics are the need of the hour. This book meets that requirement. Though
there are other works dealing with this topic (cf. Johnson 2008), Baayen's
introduction is valuable because, apart from being a practical introduction to
statistics, it is also a thorough introduction to R, beginning with downloading
the relevant packages for linguistics. It starts by introducing basic
statistics and their use in R and progresses step by step to more sophisticated
methods. This makes the book very systematically organized and also equally
useful for linguists with a limited mathematical background and for those with
considerable expertise in statistical methods. With examples drawn from a
number of real data sets, it demonstrates how to study linguistic data
quantitatively. The exercises at the end of each chapter are very useful for
practicing the functions introduced in that chapter. The separate indexes are
also quite useful for researchers who need to look up specific R functions or
topics to meet their research requirements. It is also enriching to be
introduced to the actual datasets used in the course of this book.

However, I wish the datasets were more varied in type. Almost all the sets used
here are morphological or lexical data sets. It would have been worthwhile to
see more sociolinguistic or language-teaching-oriented data. I faced a few
problems in installing the packages, as apparently the version of R that I had
earlier (2.7.0) was not compatible with languageR, the package that is used
extensively in this book. This problem was solved when I downloaded and
installed version 2.7.1. All in all, in my opinion, this book succeeds
effectively in its aim to provide its readers with "a driving license for
exploratory data analysis" (p. xi).

Butler, Christopher. (1985) _Statistics in linguistics_. New York: Blackwell.

Douglas, Chretien, C. (1943) _Quantitative Method for Determining Linguistic
Relationships: Interpretation of Results and Tests of Significance_. Berkeley,
CA: University of California Berkeley.

Herdan, Gustav. (1964) _Quantitative Linguistics_. London: Butterworths.

Johnson, Keith. (2008) _Quantitative methods in linguistics_. Malden, MA:
Blackwell Publishing.

Kretzschmar, William A., Jr., and Edgar W. Schneider.
(1996) _Introduction to Quantitative Analysis of Linguistic Survey Data: An
Atlas by the Numbers_. Thousand Oaks, CA: Sage Publications.

Paolillo, John C. (2002) _Analyzing linguistic variation: Statistical models and
methods_. Stanford, CA: Center for the Study of Language and Information.

Rasinger, Sebastian M. (2008) _Quantitative Research in Linguistics: An
Introduction_. London and New York: Continuum International Publishing Group.

Rietveld, Toni & Roeland van Hout. (1993) _Statistics in language research:
Analysis of variance_. Berlin and New York: Mouton de Gruyter.

Tesitelová, Marie. (1992) _Quantitative Linguistics_. Amsterdam and
Philadelphia: John Benjamins.

Woods, Anthony, Paul Fletcher, and Arthur Hughes. (1986) _Statistics in Language
Studies_. New York: Cambridge University Press.

Dr Aditi Ghosh is a Lecturer at the Department of Linguistics at Calcutta
University. Her current research interests include the impacts of
multilingualism, the relationship between society and language, linguistic
politics and semantics. At present, she is engaged in two major research
projects, one on language use and attitudes and one on concepts in linguistics.
