This book "asserts that the origin and spread of languages must be examined primarily through the time-tested techniques of linguistic analysis, rather than those of evolutionary biology" and "defends traditional practices in historical linguistics while remaining open to new techniques, including computational methods" and "will appeal to readers interested in world history and world geography."
Review of Statistical Methods in Language and Linguistic Research
The aim of the book, as the author formulates it in the preface, is to ''illustrate with numerous examples how quantitative methods can most fruitfully contribute to linguistic analysis research'' (p. xi). It introduces basic and intermediate-level statistical techniques that can be used by linguists, especially in the domains of applied and corpus linguistics. The techniques range from basic parametric and non-parametric tests, such as the chi-squared test, to more advanced multivariate techniques, such as factor analysis and multiple linear regression. The book explains, step-by-step, the mathematical and conceptual apparatus behind various statistical methods. It also contains chapters on fundamental corpus-linguistic topics, namely, word frequency lists and collocations.
The book consists of six chapters, a list of references and an index. It also includes a vast appendix, which contains tables with critical values of the most important statistical distributions and a table with examples of appropriate statistical tests for different types of variables.
Chapter 1 introduces basic descriptive statistics, such as measures of central tendency (i.e. mean, median and mode) and dispersion (e.g. range, variance, standard deviation). It also discusses z-scores and t-scores, which can be used for data standardization. The author provides detailed explanations of how these measures can be computed. The chapter also offers a brief introduction to probability theory and gives examples of different types of distributions.
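To make the idea of standardization concrete, here is a minimal sketch of a z-score computation. The data are invented for illustration, and the choice of the population standard deviation is an assumption; the book's own worked examples may use the sample version.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    m = mean(values)
    # Assumption: population SD; use statistics.stdev for the sample SD instead.
    sd = pstdev(values)
    return [(v - m) / sd for v in values]

# Invented test scores, not taken from the book.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(scores)
```

After standardization, each value expresses how many standard deviations the original score lies above or below the mean, which makes scores from differently scaled tests comparable.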
In Chapter 2, the reader learns about different types of variables, depending on their level of measurement, or scale (i.e. interval, ratio, nominal and ordinal), and their role in a statistical model (i.e. dependent, independent, moderator, control and intervening). This is the shortest chapter, comprising only seven pages.
Chapter 3 discusses univariate and bivariate parametric and non-parametric tests that can be used to compare two or more groups or investigate relationships between variables. The chapter begins with an overview of the most important statistics, where the author explains how to use the tests appropriately depending on specific research questions and characteristics of the available data. Parametric tests include the t-test for independent and paired samples, analysis of variance (ANOVA), Pearson's correlation coefficient and simple linear regression, while the non-parametric section deals with the Mann-Whitney U-test, the sign test, the chi-squared test, the median test and Spearman's rank correlation. The author explains the underlying assumptions and theoretical principles of each test, and provides extensive illustrations.
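As an illustration of one of the non-parametric tests listed above, the following sketch computes the Pearson chi-squared statistic for a contingency table. The function name and the example counts are hypothetical, and the sketch returns only the statistic and the degrees of freedom; the result would still have to be compared against a table of critical values such as the one in the book's appendix.

```python
def chi_squared(table):
    """Return the Pearson chi-squared statistic and degrees of freedom
    for a contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Invented counts: two learner groups (rows) by two error types (columns).
statistic, df = chi_squared([[10, 20], [20, 10]])
```

The test makes no assumption about normality, which is one reason it is popular for categorical linguistic data.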
Chapter 4 describes four multivariate statistical methods: cluster analysis in its hierarchical and non-hierarchical (i.e. k-means) instantiations, discriminant functions, factor analysis, and multiple linear regression. As in the previous chapter, the assumptions that should be met are discussed for each method. The author walks the reader through all the main conceptual steps of each analysis. Most calculations in this chapter are done by the author with the help of SPSS.
Chapters 5 and 6 deal with some fundamental issues in corpus linguistics related to word frequency lists and collocation measures. Chapter 5 is probably the most heterogeneous chapter of the book. First, it discusses at length the usefulness of different ways of sorting frequency lists, and illustrates Dunning's (1993) method of finding the keywords in a text or corpus. The 'keyness' is determined with the help of the log-likelihood test. The method is illustrated by computing the keyness of words in one of Barack Obama's speeches. The reference corpus, which is used to measure the degree of unexpectedness of the words in Obama's speech, is, somewhat surprisingly, the British National Corpus. In addition, the author mentions different types of corpus annotation, and suggests a method of comparing wordlists from different domains with the help of meta-frequency lists, which are conceptually similar to the popular Venn diagrams. Next, the author moves on to discuss type and token distribution in a corpus, as well as Zipf's law. Finally, he describes how to measure the dispersion of a word in a corpus by using Gries's (2008) DP (i.e. Deviation of Proportions) measure.
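The keyness computation can be sketched as follows. This is a minimal illustration of the log-likelihood (G2) statistic on a 2x2 table of word frequency against corpus, assuming that corpus sizes are counted in tokens; the function name and the frequencies are invented, and the book's own computation may differ in detail.

```python
import math

def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) for a word's frequency in a target text
    versus a reference corpus, computed over all four cells of the table."""
    observed = [
        [freq_target, size_target - freq_target],  # the word vs. all other tokens
        [freq_ref, size_ref - freq_ref],
    ]
    total = size_target + size_ref
    row_totals = [size_target, size_ref]
    col_totals = [freq_target + freq_ref, total - freq_target - freq_ref]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total
            if o > 0:  # a cell with zero observations contributes nothing
                g2 += o * math.log(o / e)
    return 2 * g2

# Invented frequencies: 50 hits in a 1,000-token speech,
# 100 hits in a 10,000-token reference corpus.
keyness = log_likelihood_keyness(50, 1000, 100, 10000)
```

A word with the same relative frequency in both corpora receives a score of zero; the larger the score, the more unexpected the word's frequency in the target text given the reference corpus.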
Finally, Chapter 6 provides the reader with information about concordance, KWIC (i.e. Key Words In Context) format and collocation. It discusses four association measures (i.e. mutual information (MI) and its modified version, MI3, z-score, and log-likelihood) and compares them in a case study of a small list of collocates. After that, the author introduces the notion of lexical constellations, which reflect hierarchical and asymmetric relationships between collocates.
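Three of these association measures can be sketched in a few lines. The sketch assumes the common simplified definitions based on the observed co-occurrence frequency O and the expected frequency E = f(node) * f(collocate) / N, ignoring span size; the function names and counts are hypothetical, and the book's exact formulas may differ.

```python
import math

def expected_cooccurrence(f_node, f_collocate, corpus_size):
    """Expected co-occurrence frequency if the two words were independent."""
    return f_node * f_collocate / corpus_size

def mutual_information(observed, expected):
    """Pointwise MI: log2 of observed over expected frequency."""
    return math.log2(observed / expected)

def mi3(observed, expected):
    """MI3 cubes the observed frequency to reduce MI's bias toward rare pairs."""
    return math.log2(observed ** 3 / expected)

def z_score(observed, expected):
    """Simplified z-score: deviation from expectation in units of sqrt(E)."""
    return (observed - expected) / math.sqrt(expected)

# Invented counts: both words occur 1,000 times in a 1,000,000-token corpus
# and co-occur 8 times.
e = expected_cooccurrence(1000, 1000, 1_000_000)
scores = (mutual_information(8, e), mi3(8, e), z_score(8, e))
```

Because the measures weight rare and frequent pairs differently, they can rank the same list of collocates in different orders, which is what makes a side-by-side comparison informative.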
''Statistical Methods in Language and Linguistic Research'' provides a useful and accessible introduction to the world of statistics for beginners. The main advantage of the book, in my opinion, is the fact that it offers a detailed explanation of classical statistical techniques. The text contains many examples, which will definitely help a novice to understand the logic behind the statistical tests. The book can thus be used as a supplement to more practically oriented textbooks, e.g., Baayen (2008) and Gries (2009, 2013). Another strong point is the systematic discussion and comparison of parametric and non-parametric methods offered in Chapter 3. Since linguistic data tend to deviate from normality, this approach is very welcome.
That being said, there are a few concerns. First, I have some doubts that the book fully achieves its goal formulated in the preface, namely, to demonstrate how statistics can contribute to linguistic studies. Unfortunately, the examples and topics covered in the book are too limited from a theoretical point of view. Most illustrations come from foreign language acquisition (e.g. the case studies that compare the effectiveness of different teaching methods, or determine the weight of factors that influence students' motivation) and 'old-school' corpus linguistics (e.g. keywords and concordance analysis, automatic text classification, etc.), with all due respect to those domains. This is a bit odd, since the application of quantitative methods in contemporary linguistic research has been extremely productive in many areas, especially within the usage-based paradigm and in variationist research, psycholinguistics and typology. In addition, the data in examples are often fictional or come from an unnamed source, especially in the first chapters of the book.
Second, in the age of the statistical software boom, it is somewhat surprising to find no practical guidelines on how to perform statistical tests with the help of existing packages (for instance, SPSS, which is extensively used by the author). After all, these calculations are no longer done with pencil and paper. It would be useful, therefore, if the book contained at least an appendix with the relevant code.
Another problematic issue is the imprecise use of statistical terminology. Consider the following smaller-scale errors: i. Figure 1.12 is called a histogram (21), but is really a standard x-y plot without bars; ii. The term 'probability ratios' (26) should be replaced simply by 'probabilities' or 'proportions'; iii. The 'independent value' (63) in regression modelling is normally called the 'intercept'; iv. Gries's (2008) DP measure is not the 'Degree of Dispersion' (183), but rather the 'Deviation of Proportions'; v. Finally, a normalized version of DP (Lijffijt & Gries 2012) might have been more appropriate to include.
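For readers unfamiliar with the measure, DP and its normalized version can be sketched as follows. The sketch assumes the usual definitions: DP is half the sum of absolute differences between the observed and expected proportions of a word across corpus parts, and the normalized version divides DP by one minus the smallest expected proportion. The function name and the counts are invented for illustration.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Return (DP, normalized DP) for a word across corpus parts.

    word_counts: occurrences of the word in each corpus part
    part_sizes:  size of each corpus part in tokens
    """
    total_tokens = sum(part_sizes)
    total_hits = sum(word_counts)
    # Expected proportions: how large each part is relative to the corpus.
    expected = [size / total_tokens for size in part_sizes]
    # Observed proportions: how the word's occurrences are actually spread.
    observed = [count / total_hits for count in word_counts]
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm

# Invented example: all 40 occurrences fall into the first of four equal parts.
dp, dp_norm = deviation_of_proportions([40, 0, 0, 0], [1000, 1000, 1000, 1000])
```

Without normalization, the maximum of DP depends on the number and size of the corpus parts (0.75 in this example); the normalized version reaches 1 for a maximally uneven distribution.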
Some errors are, however, more serious on conceptual grounds: i. “To perform multiple regression, the variables should either be interval or continuous and they should be related linearly” (p. 122). This assumption is erroneous. In fact, there exist perfectly legitimate solutions that enable one to incorporate categorical predictors (e.g. dummy coding) and non-linear relationships (e.g. power transformation) in a linear regression model; ii. The strategy of fitting the regression model described on pp. 131-133 has a serious flaw. A model with 8 independent variables and only 32 observations runs a huge risk of overfitting. As a result, such a model cannot be extrapolated to new data, which makes it useless (see Harrell 2001).
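The point about categorical predictors can be illustrated with a minimal sketch of dummy coding. The example uses a single two-level predictor and an ordinary least-squares fit written out by hand; the teaching methods and scores are invented, and the helper names are hypothetical.

```python
from statistics import mean

def dummy_code(labels, reference):
    """Code the reference level as 0 and any other level as 1."""
    return [0 if label == reference else 1 for label in labels]

def simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Invented data: test scores under two teaching methods.
methods = ["A", "A", "A", "B", "B", "B"]
scores = [10, 12, 14, 15, 17, 19]
x = dummy_code(methods, reference="A")
intercept, slope = simple_ols(x, scores)
```

With this coding the intercept equals the mean score of the reference group (method A) and the slope equals the difference between the two group means, so a nominal variable enters the linear model without any special machinery.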
Finally, the book contains a few minor misprints, which can be a source of confusion for beginners: i. The mean, median and mode should be in the reverse order in Figure 1.16 (24); ii. Instead of 1000/1650 = 0.91, this calculation should read 1000/1100 = 0.91 (29); iii. T2 = 1 + 5 + 6 + 7 + 10 = 29, not 27, as in the text (70); iv. ''screen plot'' (correct: ''scree plot'') (117); v. ''beta-axis'' (correct: ''y-axis'') (122); vi. ''? the slope'' (correct: ''beta the slope'') (122); vii. The formula of (pointwise) mutual information is not MI = P(w1, w2)/log2 P(w1)*P(w2), but rather MI = log2(P(w1, w2)/(P(w1)*P(w2))) (Manning & Schütze 1999: 68) (205).
Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1). 61–74.
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403–437.
Gries, Stefan Th. 2009. Statistics for Linguistics with R. A practical introduction. Berlin: De Gruyter Mouton.
Gries, Stefan Th. 2013. Statistics for Linguistics with R. A practical introduction. 2nd rev. and ext. ed. Berlin: De Gruyter Mouton.
Harrell, Frank E. Jr. 2001. Regression Modeling Strategies. With Application to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer.
Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics 17(1). 147–149.
Manning, Chris & Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
ABOUT THE REVIEWER:
Natalia Levshina is a postdoctoral researcher at the Research Group 'Language Typology and Quantitative Linguistics' at Philipps University of Marburg, Germany. She obtained her PhD from the University of Leuven, Belgium, in 2011. Her thesis was based on multivariate statistical analyses of periphrastic causatives in Netherlandic and Belgian Dutch. Among her main interests are multifactorial models of language use and spatial representations of natural language semantics in Cognitive Linguistics and typology. She has been teaching courses in Corpus Linguistics and quantitative methods of linguistic analysis at the University of Jena and the University of Marburg in Germany.