Review of A Taste for Corpora

Reviewer: Marlies Gabriele Prinzl
Book Title: A Taste for Corpora
Book Author: Fanny Meunier, Sylvie De Cock, Gaëtanelle Gilquin, Magali Paquot
Publisher: John Benjamins
Linguistic Field(s): Applied Linguistics
Text/Corpus Linguistics
Book Announcement: 23.2764

EDITORS: Meunier, Fanny; de Cock, Sylvie; Gilquin, Gaëtanelle; Paquot, Magali
TITLE: A Taste for Corpora
SUBTITLE: In honour of Sylviane Granger
SERIES TITLE: Studies in Corpus Linguistics 45
PUBLISHER: John Benjamins Publishing Company
YEAR: 2011

Marlies Gabriele Prinzl, Centre for Intercultural Studies, University College
London, UK


‘A Taste for Corpora’ is a collection of eleven essays presented in honour of
Sylviane Granger’s sixtieth birthday. Although focusing predominantly on the
applications of corpora in the field of language learning, the book covers quite
a range of topics from within that area that are meant to whet the reader’s
appetite -- or taste -- for more. Bengt Altenberg’s preface is followed by an
introductory chapter, “Putting corpora to good use”, from the collection’s
editors, providing details on Granger’s work in corpus linguistics, from her
beginnings as a PhD student at University College London, to her role in
founding the Centre for English Corpus Linguistics (CECL), to her current
research interests. An overview of all the essays is also included.

Chapter 1 “Frequency, corpora and language learning” (Geoffrey Leech)

According to Leech, one particular benefit of corpora is that they provide
information about frequency that is otherwise not available. He distinguishes
between three frequency types: ‘raw frequency’; ‘normalized (or relative)
frequency’; and ‘ordinal frequency’, which he deems the most useful measure in
language learning. A historical overview of frequency is given, including its
rejection by Chomsky’s Generative School of linguistics in the second half of
the twentieth century, as well as the role of the computer age in reviving
frequency studies and thus challenging “a tradition long established in language
study, whereby grammars and dictionaries provide distinct kinds of information
about a language” (12). The equation ‘more frequent = more important to learn’
is discussed fairly extensively, with Leech concluding that language learning
(i.e. input, performance, evaluation) should be ‘frequency-informed’.
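The three frequency types can be illustrated with a minimal sketch. It assumes that ‘ordinal frequency’ means a word's rank in a frequency-ordered list, which is this sketch's reading rather than a definition quoted from the chapter; the function name, the toy sentence and the per-1,000 base are hypothetical.

```python
from collections import Counter

def frequency_profile(tokens, per=1_000_000):
    """Return raw, normalized and rank-based frequency for each word.

    Assumption: 'ordinal frequency' is taken to be the rank in a list sorted
    by descending raw frequency; tied words share the same rank.
    """
    counts = Counter(tokens)
    total = len(tokens)
    profile = {}
    rank = 0
    previous_raw = None
    for position, (word, raw) in enumerate(counts.most_common(), start=1):
        if raw != previous_raw:
            # A new (lower) frequency starts a new rank.
            rank = position
            previous_raw = raw
        profile[word] = {
            "raw": raw,                        # absolute count in the corpus
            "normalized": raw / total * per,   # count per `per` tokens
            "rank": rank,                      # ordinal position
        }
    return profile

# Hypothetical toy corpus of 11 tokens.
tokens = "the cat sat on the mat and the dog sat too".split()
profile = frequency_profile(tokens, per=1000)
```

Raw counts depend on corpus size, so only the normalized and rank values can be compared across corpora of different sizes.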

Chapter 2 “Learner corpora and contrastive language analysis” (Hilde Hasselgård
and Stig Johansson)

The second chapter commences with an overview of interlanguage studies before
computer corpora, going back to contrastive analyses of native and foreign
languages in the 1940s and 1950s, to more systematic analyses in the 1960s and
1970s, which focused on error analysis, until it became apparent that both error
and success in language learning needed to be considered. The authors proceed to
the introduction of computer corpora, which allowed for projects that were
larger and more varied in scope. One such project was Sylviane Granger’s
“International Corpus of Learner English” (ICLE) in 1990. Innovative in its
systematic approach to corpus design and the compilation of comparable
sub-corpora for text produced by learners with different native languages, it
also introduced a new framework for learner corpus research: Contrastive
Interlanguage Analysis (CIA). While contrastive analysis involves the comparison
of two languages, CIA “concerns varieties of the ‘same’ language” (38;
italics in the original are replaced with single quotation marks),
meaning that native language (L1) and learner language (L2) are compared (L1
vs. L2), as are interlanguage varieties (L2 vs. L2). Hasselgård
and Johansson present significant findings of CIA research and discuss case studies
before identifying some challenges (e.g. application of findings in pedagogy and
EFL practice; genre; need for data at different stages of the learning process).

Chapter 3 “The use of small corpora for tracing the development of academic
literacies” (JoAnne Neff van Aertselaer and Caroline Bunce)

Chapter 3 presents a study on academic literacy of Spanish university students
based on two corpora: the Spanish subcorpus of the International Corpus of
Learner English, containing texts of students with no specific training in
academic writing (AW); and a corpus of texts produced by Spanish students of
English as part of an AW course used to ascertain the students’ progress, and
with the pedagogical aim of syllabus revision in mind. ‘Can do’ descriptors
specifying structural and rhetorical features to be learned by the students in
the course -- specifically, discourse-oriented words, with a focus on the use of
intertextual dialogue devices (such as various types of rhetors and grammar
patterns) -- are used. The authors provide details on the descriptors and the
study’s methods and procedures. Based on the data obtained, they note that
students with and without AW training perform differently, with only one of the
categories evaluated (use of deictic as subject) showing no improvement. They
conclude that academic literacy can be improved by providing students with ‘can
do’ descriptors and by studying students’ “use of text-internal and external
features” (80) and “centring sets of exercises around these features” (80).

Chapter 4 “Revisiting apprentice texts: Using lexical bundles to investigate
expert and apprentice performances in academic writing” (Christopher Tribble)

Tribble commences with the observation that corpora only became a resource for
language learning from the late 1980s onwards, a development that was partially
motivated by concerns over the made-up, rather than real, language data used in
classrooms until then. He presents a study on ‘lexical bundles’, which are
defined as the “most frequently occurring sequences of words” and are normally
“not idiomatic in meaning” nor “complete grammatical structures” (87). Tribble
looks at the use of both 3-word and 4-word lexical bundles by advanced students
in specific disciplinary areas. This corpus of apprentice written production is
compared to the language use of experts in the same field (i.e. the exemplar
corpus), as well as several analogue corpora, all of which lead him to conclude
that the comparison of such data provides valuable insights into what students
“use, fail to use, underuse and overuse” (102) -- insights that are crucial for
pedagogy and curriculum design.
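As a rough illustration of how lexical bundles might be extracted, the sketch below counts recurrent word sequences and keeps those that meet a frequency threshold and occur in more than one text. The thresholds, the whitespace tokenisation and the sample texts are hypothetical choices for this sketch; they are not the cut-offs or data used in Tribble's study.

```python
from collections import Counter

def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-word sequences that recur across a set of texts.

    Assumptions: tokens are whitespace-separated and lower-cased; the
    min_freq and min_texts thresholds are illustrative only.
    """
    freq = Counter()    # total occurrences of each n-gram
    spread = Counter()  # number of texts each n-gram occurs in
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(grams)
        spread.update(set(grams))  # count each n-gram once per text
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and spread[gram] >= min_texts
    }

# Hypothetical miniature 'corpus' of three texts.
texts = [
    "on the other hand the results of the study are clear",
    "the results of the study show on the other hand a decline",
    "as a result of the change the data are clear",
]
bundles = lexical_bundles(texts, n=4)
```

Running the same function over an apprentice corpus and an exemplar corpus, and then comparing the two result sets, would give a first approximation of the underuse and overuse comparison described above.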

Chapter 5 “Automatic error tagging of spelling mistakes in learner corpora”
(Paul Rayson and Alistair Baron)

In computer learner corpus research, the marking (i.e. ‘tagging’) of learner errors in
corpora has been and still is done mostly manually or semi-automatically.
However, more recently, results from natural language processing (NLP) have been
applied to learner corpora. In their study, Rayson and Baron employ Variant
Detector (VARD) software for tagging language learners’ spelling mistakes, with
the aim of evaluating VARD’s potential for the automatic detection of such
errors and the insertion of corrections within learner corpora. The experiment
uses an expanded data set from Lefer & Thewissen (2007), which consists of
30,000 words from Spanish, German and French language learners drawn from the
ICLE corpus. With data having already been manually marked up for all types of
learner errors, the researchers were able to determine the accuracy of VARD and,
in the second stage of the experiment, used part of the manually corrected
corpus to train VARD. With results showing high accuracy and, after training,
increased correction, Rayson and Baron conclude that NLP methods can contribute
to the automatic error analysis of learner corpora.

Chapter 6 “Data mining with learner corpora: Choosing classifiers for L1
detection” (Scott Jarvis)

Jarvis presents a study on data mining, using a supervised classification
approach to evaluate which classifiers are “best able to learn to recognise the
relationship between n-gram patterns in ICLE texts and the L1 group membership
of learners who produced the texts” (147). The chapter distinguishes between
unsupervised and supervised classification, providing details on different
classifier types (i.e. centroid-based, boundary-based, Bayesian, artificial
neural networks, decision trees, rule-based, means-based, composite), feature
selection and parameter tuning. A lot of background is also covered in terms of
previous research, such as studies on L1 detection, projects tackling the
question of which classifier is best, and more. Jarvis then proceeds to his own
research, which used 20 classifiers and experimented with various parameter
settings and feature selection methods to determine optimal classification
accuracy. He observes that the best-performing classifiers (i.e. Linear
Discriminant Analysis, Sequential Minimal Optimization, Naïve Bayes Multinomial,
Nearest Shrunken Centroid) for the task demonstrate relatively little difference
between them and considers the question of whether an ensemble of classifiers
might produce higher classification accuracies, stating that results are
inconclusive with respect to this.

Chapter 7 “Learners and users: Who do we want corpus data from?” (Anna Mauranen)

With corpora compiling data from native speakers and language learners, Mauranen
considers the next step: data from second-language (L2) speakers who use the
language as a lingua franca. She explores how L2 users differ from L2 learners,
noting, among other things, that the former do not typically share a cultural
and linguistic background, and use the language due to convenience or necessity,
with the target audience being international rather than English-speaking
countries. Unlike learner language, there is also the potential that L2 users
may influence the target language with the increasing usage of English as a
lingua franca. Finally, L2 users focus on making sense and being understood, not
on language learning. The differences between the two L2 groups are reflected in
corpus compilation, with corpora such as the Helsinki-based ELFA (English as a
Lingua Franca in Academic Settings) containing no data from learners and
allowing for variation in the mother tongue and proficiency of participants.
Although the principal differences between learner and ELF corpora provide good
reasons to keep them separate, Mauranen concludes that the results yielded by
either type of corpus are of mutual interest, as both L2 users and learners are
using a non-native language.

Chapter 8 “Learner knowledge of phrasal verbs: A corpus-informed study” (Norbert
Schmitt and Stephen Redwood)

Chapter 8 deals with phrasal verbs (PVs), which are key features in spoken and
written language that can pose difficulties for learners. While PV lists in
textbooks and dictionaries are mostly intuition and tradition based, the
researchers use a selection from the 100 most frequently occurring PVs in the
British National Corpus for their study, asking the question “[D]o learners tend
to know more about the most frequently occurring phrasal verbs than the less
frequent ones?” (181). Distinguishing between productive and receptive knowledge
of PVs, a group of 68 EFL/ESL students of different nationalities, at both
intermediate and upper intermediate levels, was tested. Although some variation
in PV knowledge was seen, Schmitt and Redwood conclude that there is an overall
relationship between frequency of occurrence and knowledge. Other factors --
language proficiency, gender, age, formal language instruction, extensive
reading, watching films and TV, listening to music and social networking -- are
also discussed, some of which the authors find to play a role.

Chapter 9 “Corpora and the new Englishes: Using the ‘Corpus of Cyber-Jamaica’ to
explore research perspectives for the future” (Christian Mair)

Chapter 9 commences with a brief overview of corpus-based research on ‘New
English(es)’, including a discussion on the term’s definition, before
introducing an ongoing project at Freiburg University on the use of Jamaican
English (JE) and Jamaican Creole (JC) in the 15 million+ word Corpus of
Cyber-Jamaica (CCJ). Building on research from the Jamaican component of the
International Corpus of English, the study investigates innovations in two
areas: 1) the use of Non-Standard English online through Jamaican web posts,
where increased usage of JC forms is seen when compared with traditional
writing; and 2) the sociolinguistics of globalisation, as originally localised
vernaculars spread through the medium of the web, which becomes a contact site
for non-standard varieties of English. The issue of legitimacy and authenticity
of data from the web for sociolinguistic research is raised and Mair concludes
that multilingual diasporic web forums “await corpus-linguistic exploration” (234).

Chapter 10 “Towards a new generation of corpus-derived lexical resources for
language learning” (David Wible and Nai-Lung Tsao)

Wible and Tsao put forth the argument that corpora are “by their very nature as
collections of texts and tokens, severely limited in what they can offer
directly to language learners or teachers” (237). Their focus is on exploring
these limitations as they look at the gap between corpora and learners’ lexical
knowledge. Corpus-supported learning, they argue, depends on guided exposure to
tokens in use that reveal underlying language patterns. Concordancing and KWIC
(Key Word in Context), however, do not find patterns but rather strings, while
software that allows pattern searches requires knowledge of technical language
(e.g. regular expression), and will only search for patterns that learners
stipulate -- not the ones that they are unaware of. The authors further discuss
the limitations of n-grams, congrams and skipgrams, suggesting that a
paradigmatic dimension is missing with all of these. Hybrid n-grams are
introduced as an alternative that not only “identify patterns of word use” but
also “create a new entire space where relations hold among these patterns and
among the words in them” (243), resulting in a massive StringNet of “organic
lexical knowledgebase whose structure is not prescribed but emerges” (244).
Illustrative examples are provided to demonstrate how this approach is
beneficial for language learners.
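The contrast between string-based concordancing and pattern searching can be sketched as follows. This is an illustrative toy, not the authors' StringNet or any particular concordancer; the function name, the example sentence and the regular expression are all hypothetical.

```python
import re

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each exact match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical example sentence.
text = "She takes the result into account and he took the cost into account"
tokens = text.split()

# String search: finds every occurrence of the literal word 'account'.
hits = kwic(tokens, "account", width=2)

# Pattern search: the user must already know the 'take ... into account'
# pattern and be able to express it as a regular expression.
pattern = re.compile(r"\b(take|takes|took|taken|taking) (\w+ ){1,3}into account\b")
matches = [m.group(0) for m in pattern.finditer(text)]
```

The concordance lines leave it to the learner to notice what the contexts have in common, while the regular expression only retrieves a pattern the user already knew how to specify; this is the gap that the chapter argues hybrid n-grams are meant to address.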

Chapter 11 “Automating the creation of dictionaries: Where will it all end?”
(Michael Rundell and Adam Kilgarriff)

The final chapter of ‘A Taste for Corpora’ explores automation in dictionaries.
It provides an overview of the developments, starting with the 1960s to 1970s,
when computers were first used for dictionary making, before proceeding to the
technological advances and increased accessibility of PCs in the 80s and 90s.
1981 is identified as “Year Zero” with the COBUILD project, for which “every
linguistic fact… [was] supported by the empirical evidence in the form of corpus
extracts” (259). However, although changes were clearly occurring in the way
dictionaries were made, automation only became more prominent in the late 90s.
Rundell and Kilgarriff then detail their work on the Macmillan English
Dictionary for Advanced Learners, looking at the tasks involved (e.g. corpus
creation, headword lists, collocations and word sketches, labelling, examples,
tickbox lexicography) and commenting on the state of automation in each. All
these tasks have by now been automated to a significant degree; however, further
advances are still on the horizon. The researchers conclude that “the
lexicographer’s task changes from selecting and copying data from the software,
to validating… the choices made by the computer” (278), but note that fully
automated lexicography is “still some way off” (279).


‘A Taste for Corpora’ offers a diverse and rich collection of essays, all of a
high quality, covering a wide spectrum of topics related to the applications of
corpora in language learning. The volume can be perused as a whole or as an
introduction to different topics of interest. Although it is generally also
suitable for newcomers to corpus-based language learning, some chapters are
quite specialised and may require further reading on the topic.

Quite a number of essays -- such as Chapters 6, 10 and 11 -- point to approaches
that are at early stages and may thus spark both fascination and controversy
in terms of possible future directions and developments. It is doubtful, for
example, that Wible and Tsao’s previously quoted argument in Chapter 10 that
corpora are “by their very nature as collections of texts and tokens, severely
limited in what they can offer directly to language learners or teachers” (237)
will immediately be welcomed by every corpus linguist, as it seems to question
the field itself and also appears to be phrased in a purposely provocative
manner. Corpora and corpus-based resources may have limitations, but they are
not completely useless even if they only “find the patterns that the user tells
them to search for” (241) or are “one-dimensional” (243). Corpus linguistics
challenged approaches to language and language learning that came before it and
significantly shifted the focus from “an ideal speaker-listener, in a
completely homogeneous speech-community, who knows its language perfectly and
is unaffected by such grammatically irrelevant conditions as memory
limitations, distractions, shifts of attention and interest, and errors”
(Chomsky 1965: 3) to models and methods that deal with language as it is
actually spoken by both L1 and L2 users in real life. Equally, corpus
linguistics as a whole and its specific applications for language teaching and
learning should be questioned for usefulness both in theory and practice, so
that the rightly identified “unfortunate gap [that] still stand[s] between what
learners need… and what corpora currently provide” (237) may be filled, but to
declare corpora outright “severely limited” (237) goes somewhat too far.

Other chapters also deserve a second mention in this review. The contributions
from Mauranen (Chapter 7, “Learners and users: Who do we want corpus data
from?”) and Mair (Chapter 9, “Corpora and the new Englishes: Using the ‘Corpus
of Cyber-Jamaica’ to explore research perspectives for the future”) are
exciting, as they strongly signal a move away from the traditional research
focus on English as spoken in a select few nations, or certain kinds of English
users. Although projects like the International Corpus of English (ICE),
containing multiple 1 million-token subcorpora from Singaporean to Sri Lankan
English, have existed for a while already, both Mauranen's and Mair's research
expands the field further. Mair's investigation into the still relatively
unmapped territory of language in cyberspace -- web posts in Jamaican English as
well as Jamaican Creole -- makes this turn even more interesting, as cyber
language in both its more static (e.g. informative websites) and dynamic forms
(e.g. Twitter, text messaging) becomes an increasingly significant part of our
everyday language usage.

‘A Taste for Corpora’ is a fitting and wonderful collection of essays to
honour the achievements of Sylviane Granger. Most chapters make at least some
reference to how Granger’s work is significant to a particular subfield within
corpus-based language learning. These references, on occasion, feel a little
forced, but then this is the nature of such a volume published in honour of a
researcher. Altogether, ‘A Taste for Corpora’ certainly manages to whet the
reader’s appetite -- or taste -- for what the future of corpus-based language
learning holds.


Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.

Marlies Gabriele Prinzl is a PhD candidate, supervised by Prof. Theo Hermans and Dr. Daniel Abondolo, at the Centre for Intercultural Studies, University College London, UK. Her research interests include literary translation, particularly with regard to creativity and experimental writing, retranslation and corpus linguistic approaches to literature and translation. She is further interested in East Asian cinema and cultural products, including aspects of fansubbing and fantranslation. Details can be found at: