Review of Corpora Galore: Analyses and Techniques in Describing English

Reviewer: Joybrato Mukherjee
Book Title: Corpora Galore: Analyses and Techniques in Describing English
Book Author: John M Kirk
Publisher: Rodopi
Linguistic Field(s): Text/Corpus Linguistics
Subject Language(s): English
Issue Number: 11.1905

John M. Kirk, ed. (2000): Corpora Galore: Analyses and Techniques in
Describing English (Language and Computers: Studies in Practical
Linguistics No 30). Amsterdam: Rodopi.

Reviewed by Joybrato Mukherjee, University of Bonn

This book presents a selection of papers from the Nineteenth Conference
on English Language Research on Computerised Corpora, commonly referred
to as ICAME 19-98 (International Computer Archive of Modern and Medieval
English). This ICAME conference was held at the Slieve Donard Hotel in
Newcastle, Northern Ireland, from 20 to 24 May 1998. To make it clear
right at the beginning of this review: this collection is, on the whole,
as worth reading as the proceedings of previous ICAME conferences, since
it introduces "new descriptions from new corpora using new techniques" (p.
i), as Kirk points out in his preface (correctly, I believe). The
following synopsis is intended to provide the reader with some kind of
bird's eye view of the contents of the book.


The papers are arranged in three groups. The first section comprises
studies which are devoted to the lexical and collocational description of
English, whereas the papers in the second group present corpus-based
analyses of syntactic and semantic phenomena. The third section is about
new methods and innovative techniques in the rapidly developing field of
corpus linguistics.

Opening the first section, Susan Blackwell offers an extremely inspiring
article on the relevance of corpus data to forensic linguistics. She
compares the use of "honest", "look" and "well" as discourse markers in
the 10 million word spoken component of the Bank of English Corpus (BofE)
with their use in disputed and undisputed utterances of two suspects
(convicted of armed robbery and drug abuse respectively). By means of
this technique, she shows in both cases that the disputed and unsigned
police interviews presumably do not form contemporaneously written and
reliable transcripts, but summaries which have been infelicitously
produced at a later stage.

The exploration of low frequency collocations in the British National
Corpus (BNC) lies at the heart of the second paper. Sebastian Hoffmann
and Hans Martin Lehmann compare the knowledge of 16 native speakers and
16 non-native speakers concerning the most frequent collocates of 55 node
words (e.g. "goalless draw") which occur between 50 and 100 times in the
BNC. Whereas the native speakers performed at an average rate of 70%, the
non-native speakers guessed correctly at an average rate of 34%. Two main
conclusions are drawn: (1) native speakers are able to memorise
collocational patterns which are comparatively rare in language use; (2)
taking into account the relatively small exposure to the English
language, non-native speakers' performance in this experiment turns out
to be surprisingly good.
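
To illustrate what "the most frequent collocates of a node word" means in
practice, the following is a minimal sketch in Python. It is not the
procedure used by Hoffmann and Lehmann: the function name top_collocates,
the window size and the toy token list are assumptions made purely for
illustration, and raw co-occurrence counts stand in for whatever measure
the authors applied to the BNC.

```python
from collections import Counter


def top_collocates(tokens, node, span=1, n=3):
    """Return the n most frequent words found within `span` tokens of `node`.

    Hypothetical helper: counts raw co-occurrences only, with no
    lemmatisation and no statistical association measure.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        # Count every token in the window except the node occurrence itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts.most_common(n)


# Toy data (invented), not taken from the BNC.
tokens = "a goalless draw was a goalless draw not a goalless win".split()
print(top_collocates(tokens, "goalless", span=1, n=2))
```

On real corpus data, function words such as "a" dominate raw counts of
this kind, which is one reason why studies of collocation usually rely on
association measures or restrict the window to one side of the node.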

Using the spoken and written Wellington corpora, Graeme Kennedy and
Shunji Yamazaki investigate the influence of Maori on the lexicon of New
Zealand English. They find that, in terms of frequency, this
influence is not as strong as previously assumed.

Anthony McEnery, John Paul Baker and Andrew Hardie provide a progress
report on the current compilation of the Lancaster Corpus of Abuse (LCA)
containing examples of swearing from spoken language. So far, corpus data
reveal, for example, that terms which have been traditionally regarded as
sexist language (e.g. "bitch") are significantly often applied to males
as well.

Newspaper CDs as corpora are used by David C. Minugh who explores the
frequency of idioms in this genre. Although there are many caveats (e.g.
the lack of representativeness), his findings, in general, support the
use of these easily accessible corpora in the teaching of English as a
Foreign Language (EFL).

The first section is concluded by a highly illuminating article by
Vincent B.Y. Ooi. He analyses the use and frequency of culturally
distinctive collocations (e.g. "fish-head curry", "urine detector") in
Singaporean-Malaysian English (represented by several newspaper corpora)
and in the newspaper sections of the BofE. This paper impressively
exemplifies the impact of corpus linguistics on the description of
collocational differences between varieties of English.

The second section opens with an article by Jürgen Gerner, putting into
perspective the choice of singular or plural pronouns in coreference with
the indefinite personal pronouns "someone", "anyone", "everyone" and "no
one" (as well as the corresponding items ending in "-body"). Whichever
pronominal form is chosen, there is either a violation of gender concord
(e.g. "himself") or a violation of number concord (e.g. "themselves")
between the anaphoric pronoun and its antecedent. Drawing on the spoken
component of the BNC, Gerner's analysis reveals that in this medium, the
so-called singular "they" ("them", "theirs" etc.) is used in around
96-98% of all cases. Only with regard to "someone/somebody" does this
relative frequency drop to 84-87%, as this indefinite pronoun may be used
with specific singular reference so that a violation of number concord by
means of singular "they" is not necessitated. As far as written usage
is concerned, future research will certainly profit from the exploration
of the remaining 90 million words in the written domain of the BNC.

In the 50 million word Cobuild Direct Corpus (CDC), a sub-corpus of the
BofE, Göran Kjellmer detects 47 occurrences of the verb "try" followed by
a bare infinitive. As "try" in these authentic instances meets at least
some of the criteria for auxiliary verbs, the hypothesis that this verb
is moving towards auxiliaryhood seems plausible. This study shows that
ongoing diachronic processes may become detectable at an early stage only
thanks to large corpora.

Hans Lindquist studies the choice between inflectional and periphrastic
comparison (e.g. "costlier" vs. "more costly") of disyllabic adjectives
in two newspaper corpora. His findings suggest that the selection is
usually not a matter of free variation, but to a large extent guided by
morphological, syntactic and prosodic factors. Periphrastic
constructions, for example, tend to be placed at the end of a clause as
they are somewhat heavier than inflectional forms: this could be regarded
as a realization of the principle of end-weight. Whether these mechanisms
are genre-specific or not remains to be seen.

In Inge de Mönnink's paper, seven types of noun phrases (e.g. with a
fronted premodifier) are described which elude the usual noun phrase
structure and, consequently, a clear-cut immediate constituent analysis.
One of the interesting points in this article is the underlying
methodology which combines corpus data (from an unspecified 175,000 word
corpus and the BNC) with elicitation data so that (1) intuition-based
hypotheses can be verified or falsified in the light of empirical corpus
data and (2) interpretations of corpus data can be tested by means of
intuition or elicitation, potentially leading to new hypotheses (as in
this study). Thus, an innovative "data cycle for descriptive linguistics"
(p. 144) is established.

Degree modifiers of adjectives in spoken English are investigated by
Carita Paradis. To describe diachronic changes in the use of
constructions such as "it's well weird", she draws on the 500,000 word
London-Lund Corpus (LLC) comprising texts from the sixties and seventies,
the Corpus of London Teenage Language (COLT) of the same size and
compiled in the nineties, and the spoken component of the BNC. She
observes, for example, that there are remarkably fewer degree modifiers
in COLT than in LLC, although two degree modifiers are attested in COLT
only, namely "well" and "enough".

The article by Aimo Seppänen and Joe Trotta is devoted to the use of the
pattern "wh- + that" in sentences such as "I yielded to whatever
arguments that were given". The widespread assumption that this pattern
became extinct after the Early Modern English period is refuted, since 90
examples are discovered in the BNC and the CDC. These occurrences are
restricted neither to spoken or written language nor to specific
varieties of English. Therefore the authors make a plea to include this -
no doubt marginal - structure in the grammar of present-day English.

Anna-Brita Stenström investigates intensifiers in teenage talk as
attested in COLT. Two striking results refer to (1) the increasing use of
"well" as adjective intensifier and (2) of "enough" as intensifier in
premodifying position. These phenomena exemplify the innovative potential
of teenage language. Surprisingly enough, both lexical items had already
been used as intensifiers in the 8th and 9th centuries, so that the recent
developments in London teenage language may be considered as a process of

The specialized 800,000 word Corpus of Early English Medical Writing
1375-1750 (still under construction) is the database of Irma
Taavitsainen's analysis of the linguistic processes involved in the
development of this very genre. In general, there is a clear change from
a rather detached to a more involved writing which is, for example, based
on a general trend from textual to interpersonal kinds of metatextual
comments. This study emphasizes the relevance of corpus data to
diachronic linguistics.

Medical writing, though from a synchronic perspective, is also the topic
of Minna Vihla's paper which focuses on modal expressions (of epistemic
possibility), e.g. "may" and "might", in a 400,000 word corpus of
contemporary American medical texts. On the whole, the extensive use of
modal expressions ensures that writers do not identify themselves with
the research results presented and remain, thus, sincere towards the
reader. More specifically, one can differentiate between several
sub-genres in which modal expressions are used to different extents: in
manuals and clinical textbooks, for example, they are much more frequent
than in expository and argumentative texts.

The third section opens with Magnar Brekke's at times hilarious, but no
doubt thought-provoking considerations of the future role of the world
wide web as a cybercorpus. The occurrences of two test items, i.e.
"chaos" and "quantum", and their collocates in the cybercorpus are
studied. The results are then compared with corpus data from the BNC,
leading to the general assumption that the exploration of the constantly
growing and changing web may provide very useful linguistic insights.
However, two fundamental problems are clearly identified as well: (1) the
lack of representativeness and of any other standards of corpus
compilation in the web; (2) the, linguistically speaking, primitive
toolkits provided by today's web browsers.

Sylviane Granger and Martin Wynne make an attempt to optimise measures of
lexical richness in essays written by EFL learners. Drawing on data from
five sections of the International Corpus of Learner English (ICLE) and
the concept of adjusted lemma/token ratios, they draw the conclusion that
it is not the lack of words but the lack of native-like use of the words
used by language learners which should be the prime concern in EFL

The BNCweb is a client for accessing the BNC via the world wide web. Some
of its main features are sketched out by Hans Martin Lehmann, Peter
Schneider and Sebastian Hoffmann.

Oliver Mason claims that the collocates of a word can be determined
empirically. He introduces the concept of lexical gravity of a word which
is described in terms of entropy, i.e. the degree of lexico-grammatical
stability in the context of a word. Thus, the so-called window, i.e. the
number of words to the left and to the right of the node word, in which
collocates are to be described, is not a fixed frame, but a variable span
the size of which is dependent on the specific node word.
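
The idea of describing the context of a node word in terms of entropy can
be sketched as follows. This is only an illustration under stated
assumptions, not Mason's actual definition of lexical gravity: it uses
plain Shannon entropy over the words observed at each position relative
to the node, and the function names positional_entropy and
low_entropy_offsets, the threshold and the toy data are all hypothetical.

```python
import math
from collections import Counter


def positional_entropy(tokens, node, max_span=3):
    """Map each offset from `node` to the Shannon entropy (in bits) of the
    words observed at that offset. Low entropy means a stable context."""
    offsets = [o for o in range(-max_span, max_span + 1) if o != 0]
    counts = {o: Counter() for o in offsets}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for o in offsets:
            j = i + o
            if 0 <= j < len(tokens):
                counts[o][tokens[j]] += 1
    entropies = {}
    for o, counter in counts.items():
        total = sum(counter.values())
        if total:  # skip offsets at which nothing was observed
            entropies[o] = sum(
                (c / total) * math.log2(total / c) for c in counter.values()
            )
    return entropies


def low_entropy_offsets(entropies, threshold):
    """Offsets whose entropy is below `threshold` (an assumed cut-off)."""
    return sorted(o for o, h in entropies.items() if h < threshold)


# Toy data (invented): "goalless" is always preceded by "a".
tokens = "a goalless draw and a goalless draw then a goalless win".split()
entropies = positional_entropy(tokens, "goalless", max_span=1)
print(entropies)
print(low_entropy_offsets(entropies, threshold=0.5))
```

Because the set of low-entropy positions is computed separately for each
node word, the resulting span varies from word to word rather than being
fixed in advance, which is the point made in the paragraph above.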

Nelleke Oostdijk's case study of the linguistic annotation of the English
verb phrase underlines how important it is for corpus users to thoroughly
know the descriptive model underlying the linguistic analysis provided by
corpus compilers. While, for example, the verb in "a bottle containing
milk" constitutes a genuine simple verb phrase (i.e. main verb only), the
formally similar verb phrase in "the man walking in the park" could be
considered as a reduction of a complex verb phrase, namely "is walking".
The findings presented in this paper call for a very careful
interpretation of corpus data since different systems of linguistic
annotation yield, consequently, different results.

The unresolved problem of grammatically annotating spontaneous speech is
discussed by Anna Rahman and Geoffrey Sampson. Speech repairs and
grammatically ill-formed utterances are two examples of phenomena which
still pose great problems to hitherto existing software tools for word
tagging and syntactic parsing. If natural language processing is to make
substantial progress, grammar annotation standards will have to be
extended to these particularities of spoken language.

Considering the increasing availability of syntactically parsed corpora,
Pasi Tapanainen and Timo Järvinen develop a new type of concordance which
is not based on node words, but on syntactic functions in node position.
This approach allows for syntactic concordances in which the key-word is
missing, as for example in zero relative clauses.

Focusing on the potential of parsing procedures as well, Atro Voutilainen
gives a progress report on recent trends in parser design at the
University of Helsinki. In particular, the performance of a new
functional dependency parser, visualizing dependency relations between
words, seems to be promising: the overall precision of the parser ranges
from 90% to 96% with regard to subjects, objects and predicatives.

Finally, Sean Wallis, Bas Aarts and Gerald Nelson provide a general
introduction to the ICE Corpus Utility Program (ICECUP). ICECUP is a
software tool which has been designed for the exploration of the
syntactically parsed 1,000,000 word British component of the
International Corpus of English (ICE). ICECUP draws on the use of
so-called fuzzy tree fragments which are intended to visualize the
function, the category, the features and the edges of text unit elements.

Critical Evaluation

The selection of papers underlines on how many corpus-linguistic fronts
progress is being made. As the title implies, the focus is on the
diversity of corpora which are available today. This very aspect is, in
fact, successfully represented by the selected papers. Some thirty
different corpora are used to different extents. In this, three main
domains can be identified which do, of course, overlap at times: (1)
extensive lexico-grammatical and semantic studies of particular
phenomena; (2) comparative analyses of several corpora (including the use
of databases as control corpora); (3) putting new techniques and methods
to the test.

Those studies which cover the first domain are, on the whole,
well-written and plausible contributions to what the prime concern of
linguists should be according to the forefather of British contextualism:
"The business of linguistics is to describe languages" (Firth 1957: 32).
To pick out but one example, Susan Blackwell's paper exemplifies the
relevance of corpus-based descriptions of authentic language use to the
field of applied (e.g. forensic) linguistics. Hans Lindquist's study is
another good example of the advantages of corpus-based analyses (over,
say, generative approaches) because the mechanisms which underlie the
choice between inflectional and periphrastic comparison can only be
identified by considering real language in context as attested in large
corpora: "the comprehensive study of language must be based on textual
evidence" (Sinclair 1991: 6).

Stubbs (1996: 33) states as a central principle of British traditions in
text analysis that "text types must be studied comparatively across text
corpora". Accordingly and successfully, Anna-Brita Stenström, for
example, compares the use of "well" and "enough" in COLT (representing
teenage language) with general tendencies in the BNC and its
subcorpora. The versatility of available corpora allows for such
empirical text-typological analyses and leaves no excuses for stylistic
descriptions based on intuition and/or invented examples only. In a
similar way, Vincent B.Y. Ooi's study shows that today's corpora are a
goldmine for English dialectology in that different collocational
strengths in different varieties of English become tangible in
quantitative terms.

Corpus linguistics is a process, and the myriad of new methods and
techniques presented in this book reveals the rapid development in this
field. For example, Magnar Brekke's paper makes it clear that there may
be the cybercorpus on the horizon - a database of unprecedented size and

Having highlighted the positive so far, it is, however, necessary to make
some critical remarks about the selection of papers in general and about
some of the contributions in particular.

On the one hand, the book suffers from a lack of theoretical commitment.
Of course, I do not know whether this is due to the selection procedure
or to the entirety of papers submitted for consideration in the first
place. Kirk explicitly states that "ICAME papers have not only been
descriptive, they have been concerned with theoretical issues" (p. v),
but in the corresponding third section, progress reports and
introductions to new software hold the field. This is not to say that
questions of corpus linguistic theory in a wider setting are not
addressed at all. But sometimes, there is no genuine attempt to answer
them. For example, Sebastian Hoffmann and Hans Martin Lehmann present
most inspiring findings as to the acquisition of low frequency
collocations by native and non-native speakers. Their final conclusion is
"that an even larger corpus would be needed to provide reliable data for
future investigations" (p. 31). Notwithstanding the correctness of this
conclusion, I feel that their results may also challenge the
traditionally established, generative approach to language competence:
obviously, exposure to authentic language use plays a much more important
role in the shaping of (collocational) competence than previously
assumed. A second example of leaving loose threads is David C. Minugh's
paper. He is perfectly correct in observing that "students, particularly
EFL students, are both encouraged to learn idioms [...] and
simultaneously discouraged from using them" (p. 57). However, he does not
provide the reader with a clear-cut conclusion as to this problem on the
basis of the numerous - and no doubt valuable - quantitative corpus

On the other hand, some papers are affected by more specific and minor
infelicities. Again, two examples should suffice to illustrate this
point. The methodology of Inge de Mönnink's study of the mobility of
constituents in the English noun phrase has already been pointed out as
being very effective and innovative. However, she does not go very much
into detail about the 175,000 word corpus which she draws on. In my view,
this vagueness is at odds with her general (and true) statement that
"corpus data are verifiable, which is an important requirement for a
scientific approach to linguistics" (p. 133). Some studies seem to get
carried away by the irresistible power of figures, tables and diagrams.
In Oliver Mason's paper, for example, quite a considerable number of
diagrams are intended to illustrate the lexical gravity of several words,
but I personally would have preferred a more explicit explanation of the
underlying concepts of entropy and gravity (although, as a molecular
biologist, I am acquainted with the major aspects of entropy in
biochemistry and gravity in the physical sciences). I think that quite
generally and perhaps inevitably, there is a latent danger in corpus
linguistics of focusing on figures and frequencies at the expense of
theoretical and functional considerations, explanations and conclusions.

On the whole, Corpora Galore is a celebration of the fact that only a few
years after Jan Svartvik's (1992: 7) statement that "[c]orpus linguistics
comes of age", it has by now come of age and is rapidly growing and
consistently flourishing. The book provides many interesting results by
using many different methods and many different corpora. Everyone who is
interested in the linguistic description of authentic English will no
doubt profit from reading this selection. As conference proceedings tend
to be in general (and this is not a criticism at all), it is more like a
jigsaw puzzle than a straightforward introduction to the state of the
art in corpus linguistics. Hopefully, many linguists will try to put the
puzzle together by reading the book.

(Some small typographical errata: On p. 12, some lines of the
running text have been duplicated, on p. 162 one finds *"structur",
and some tables are inconsistently formatted (e.g. p. 171). In one
paper, the introductory sentence of section 1 and the first sentence
of section 2, comprising 32 words, are identical (p. 133 and p. 134).)


Firth, John Rupert (1957): "A synopsis of linguistic theory 1930-1955",
Studies in Linguistic Analysis, Special Volume of the Philological
Society, 1-32.

Sinclair, John (1991): Corpus, Concordance, Collocation. Oxford: Oxford
University Press.

Stubbs, Michael (1996): Text and Corpus Analysis: Computer-assisted
Studies of Language and Culture. Oxford: Blackwell.

Svartvik, Jan (1992): "Corpus linguistics comes of age", Directions in
Corpus Linguistics: Proceedings of Nobel Symposium 82, edited by Jan
Svartvik. Berlin: Mouton de Gruyter. 7-13.

Joybrato Mukherjee is an Assistant Professor of Modern English
Linguistics at the English Department of the University of Bonn. His
research interests include corpus linguistics, stylistics,
textlinguistics, intonation, syntax and EFL teaching. In his forthcoming
PhD thesis, interactions between prosody and syntax at tone unit
boundaries are described on the basis of quantitative and functional
corpus analyses.



