Date: Thu, 02 Oct 2003 14:04:22 +0000
From: Petek Kurtböke
Subject: Extending the Scope of Corpus-Based Research: New
Applications, New Challenges
Granger, Sylviane and Stephanie Petch-Tyson, eds. (2003) Extending the
Scope of Corpus-Based Research: New Applications, New Challenges,
Petek Kurtböke, Ph.D.
Much of the 1980s and 1990s was taken up by considerations of three major
areas in the field of Corpus Linguistics: 1. corpus design; 2. corpus
annotation (a. encoding; b. tagging; c. parsing); 3. linguistic
exploration of the data (Oostdijk and de Haan 1994, Svartvik 1992,
Meijs 1987). As we have moved into the 21st century, the focus of Corpus
Linguistics has moved too, and publications such as the present volume
are a sign of that shift.
Such a volume is also a confirmation that the tension of the 1990s, between
"(a) those who want[ed] to know as much as possible about language
[...] and (b) those who want[ed] to know as much as possible about what
the computer c[ould] do" (Quirk 1992), has relaxed. There now seems to
be agreement that both approaches are equally valid and "potentially
complementary", hence a collective effort to establish the direction
and future of Corpus Linguistics research (see Grefenstette 1998 on
Both parties have used computers, the former to interpret and the
latter to generate natural language. Generally speaking, the term
'natural language' has been perceived as speech or writing produced in
'natural settings', with the term 'natural' meaning 'ideal' in a
setting where only one language is used with its rules perfectly in
place. Such a view has enabled the expert to approach language
processes in procedural terms. In fact, computational applications in
linguistics have so far tested the grammars proposed by theoretical
linguists. There is endless literature on these experiments and their
results, in which the language, most commonly English, is treated in
terms of a limited set of rules.
In Linguistics, then, in both theoretical and computational terms,
there has been a tendency to view the 'natural setting' as monolingual,
although this is hardly the case in everyday life. Long before the age
of 'multiculturalism' and computers, most communities used at least
some elements from a second language, or from several languages, as
part of their daily communication. For example, in the Balkans,
communities have for centuries used a mixture of two or three of the
following languages side by side in speech (in spite of nationalistic
language planning movements to discourage this tendency): Greek,
Turkish, Albanian, Croatian, Serbian, Slovenian and others.
Similar examples may be listed from all over the world.
Despite how common bilingual and multilingual settings are,
studies reporting computational treatment of mixed linguistic data are
rare. In other words, no such data sets have been fully analyzed using
computational techniques. Until recently, it was also uncommon to
create corpora in bilingual or multilingual settings (Kurtböke 2000).
Researchers in three areas, LANGUAGE CONTACT, CORPUS LINGUISTICS and
NATURAL LANGUAGE PROCESSING, are now starting to think about the
problem of how to treat mixed linguistic data computationally, even
though some still fail to go beyond the traditional distinction between
"borrowing" and "code-switching". In corpus construction, on the other hand,
some still discuss whether texts of mixed nature should be allowed into
a corpus at all. And in computational research, entire funds are still
dedicated to the resolution of monolingual grammars by developing more
elegant yet robust systems.
EXTENDING THE SCOPE OF CORPUS-BASED RESEARCH is a happy indication that
the scene might be changing, with articles reporting on the analysis of
contact data, be it in the local press (Hajar & Harjita on Malay-
English pp. 159-175) or in language learning contexts (Aronsson on
Swedish-English pp. 197-210; Neff et al. on
Spanish/Dutch/Italian/French/German in contact with English pp. 211-
230; Schmied on German-English pp. 231-247).
In the past, corpus texts were usually categorised according to their
primary discourse function (Sinclair 1987:12; Rissanen et al. 1987).
Biber's extensive work on the typology of English texts showed that a
thorough definition of the target population based on the co-occurrence
of grammatical features was possible (e.g. 1989, 1990). Text
categorisation and document clustering have been of interest
particularly to those in the area of Artificial Intelligence (e.g.
Machine Translation), although research outcomes depend largely on how
the corpus available has been accessed: raw, annotated or analysed
(McNaught 1993). Corpus annotation adds interpretative (especially
linguistic) information to an existing corpus of spoken and/or written
language, by some kind of coding attached to, or interspersed with, the
electronic representation of the language material itself (Leech 1987,
1993). "At the time that Biber conducted his research, no corpora were
available that had been annotated with detailed syntactic information"
(p. 16). Since then fully-parsed corpora have become available (e.g.
ICE-GB) and a structurally annotated corpus to replicate Biber's Multi-
Feature/Multi-Dimension method has "simplified and improved the search
for the linguistic features considerably" (p. 23). Also "a factor
analysis carried out on the frequency counts of a set of word class
tags resulted in largely the same classification" (De Mönnink, Brom and
Oostdijk pp. 15-25).
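The kind of input such a factor analysis works from can be illustrated with a minimal sketch. It is not the contributors' method: it assumes a hypothetical "word_TAG" annotation format and an invented toy tagset, and it only shows how word class tags interspersed with the text can be turned into normalised frequency counts per text.

```python
from collections import Counter


def tag_frequencies(tagged_text, per=1000):
    """Count word class tags in a 'word_TAG' annotated text.

    Returns each tag's frequency normalised per `per` tagged tokens.
    The 'word_TAG' format is an assumption made for this illustration.
    """
    # Keep only tokens that carry a tag, and take the part after the last "_".
    tags = [token.rsplit("_", 1)[1] for token in tagged_text.split() if "_" in token]
    total = len(tags)
    if total == 0:
        return {}
    counts = Counter(tags)
    return {tag: n * per / total for tag, n in counts.items()}


# Two tiny invented "texts"; real input would be whole corpus files.
spoken = "well_UH I_PRP think_VB it_PRP works_VB"
written = "the_DT analysis_NN of_IN the_DT corpus_NN"

# One frequency profile per text; profiles like these would form the rows
# of the matrix passed to a factor analysis.
profiles = {"spoken": tag_frequencies(spoken), "written": tag_frequencies(written)}
```

Normalising per 1,000 tokens keeps texts of different lengths comparable before any statistical procedure is applied.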
Parsing has been one of the concerns of computational corpus research
since the early 1970s (e.g. TOSCA in Nijmegen). Raw (spoken) data may need
"normalization" before syntactic parsing can proceed, although how far
the normalization procedures should go is still debated (Oostdijk pp.
59-85). Parsing a corpus in order to build a syntactic representation
for it is of course rarely an end in itself. The syntactic structure
usually serves as input to some further processing towards the
refinement of grammar descriptions (Wallis pp. 27-38).
In the previous decade, numerous corpus exploitation tools became
available on the market (see detailed surveys by Schulze et al. 1994,
Christ 1996). However, as the advances in computer technology
facilitated the exchange of on-line textual resources and electronic
transfer on the internet, the largest corpus has become the Web itself.
This development has moved the focus of tool design from the
exploration of a number of controlled and monitored corpora to one wild
and uncontrollable corpus sans frontiers. Three articles in the present
volume relate to this aspect of Corpus Linguistics: Renouf on the
development of a new tool, "WebCorp" (pp. 39-58); Peters and Smith on how
e-documents are slowly but surely changing conventional print
documents (pp. 71-85); Schmied on the Internet Grammar project at Chemnitz
With the shift of emphasis in the late 1980s and early 1990s from
language system to language use, it became obvious that the data
extracted from corpora were more complex than was described by the
rule-based systems. For example, the traditional parsing technology
ignored certain aspects of the lexicon such as collocations and word
associations since they were too difficult to capture using rule-based
systems (Atkins et al. 1994). The sentence as the central unit of
linguistic analysis was questioned (Sinclair 1996), and alternative
units of analysis continue to be discussed today (Mukherjee on the tone
unit pp. 21-134).
As the 20th century came to an end, one prediction about the future of
Linguistics in general was that it would advance in two directions:
computational corpus research and the mental lexicon (Halliday 1998).
Sampson's article on WORDINESS - or LEXICAL DENSITY in Hallidayan
terms (Halliday and Martin 1993) - in children's writing (pp. 177-193)
and Kjellmer's article (pp. 149-158) on potential words which
constitute unexpected "lexical gaps" in the Bank of English are indeed
evidence that Psycholinguistics and Corpus Research are coming closer.
Finally, Pérez-Paredes (pp. 248-261) shows that the tension between
corpus-based and task-based approaches to language teaching is no
longer there. Learner corpora in a classroom setting provide naturally
occurring examples for the immediate use of the teacher and the student,
whereas in the past, language course books as well as traditional
grammars and dictionaries used invented examples which seemed
intuitively right to the native speaker. With an electronic corpus
available to the teacher and the learner, tasks in the classroom are
now designed around the application of corpus examples to discourse
organization. Hence corpus-based and task-based approaches no longer
stand in opposition but have become complementary.
a) More corpus research on typologically different language pairs
L2 acquisition has been a subject of corpus research before. For example,
Biber et al. (1994) used corpus analysis to examine the development of
discourse competence and register awareness in adult learners of
English. Similarly, Lux and Grabe (1991) used corpus-based analysis to
compare the compositions of university students, written in Ecuadorian
Spanish and English. In Canada, too, the acquisition of French as a
second language by Portuguese and other migrant groups has been
investigated using a corpus-based approach (Bazergui et al. 1990). The
studies reported in the present volume pleasantly add to the Language
Learning-Teaching research library. It would be worthwhile though to
extend the boundaries of such investigation to more typologically
different language pairs.
b) Corpus-based vs corpus-driven
The editors do not make reference to this significant distinction in
corpus research but the title of the volume EXTENDING THE SCOPE OF
CORPUS-BASED RESEARCH must have been selected with this distinction in
mind. In the corpus-driven approach the linguist investigates the corpus
with an open mind to discover how language really works as opposed to
the corpus-based approach where the linguist first establishes the
model and then investigates the corpus to find natural examples to fit
into that model (Clear et al. 1996). While the majority of the
contributions in the volume may be considered corpus-based, some may be
considered corpus-driven (e.g. Gotti on the use of SHALL and WILL pp.
91-109; Kettemann, König and Marko on the morpheme ECO pp. 135-148).
c) Written vs spoken corpus material
The volume places emphasis on devising better methods of
differentiation between speech and writing, although this seems to be a
contradiction in terms. One cannot ignore that the use of the internet
for daily communication, and the globalisation factor creating new
diasporas, are two strong forces that are rapidly narrowing the gap
between the spoken and written input.
Lastly, most of the contributors are Corpus Linguists "firmly
established" in their area of research and it is much to our
community's benefit that they felt "the need to ask [themselves] where
the future [of Corpus Linguistics] lies" (p. 9). With the points above
considered, perhaps the title of the book could have been THE SCOPE OF
CORPUS RESEARCH - A VIEW OF THE PRESENT IN TERMS OF THE PAST rather
than "Extending the scope of corpus-based research - new applications,
new challenges". A final word from the editors, Granger and Petch-Tyson,
on how they see the work in progress reported in this volume
developing in the future would have been a fitting close to an elegant
volume showing how far Corpus Linguistics has come.
Atkins, B. T. S., B. Levin and A. Zampolli (1994) Computational
Approaches to the Lexicon: An Overview. In B.T.S. Atkins and A.
Zampolli (eds.) Computational Approaches to the Lexicon. Oxford
University Press, Oxford. pp. 17-45.
Bazergui, N. et al. (eds.) (1990) Acquisition du français chez des
adultes à Montréal. Office de la langue française, Québec.
Biber, D. (1989) A Typology of English Texts. Linguistics 27:3-43.
Biber, D. (1990) Methodological Issues Regarding Corpus-Based Analyses
of Linguistic Variation. Literary and Linguistic Computing 5:4:257-269.
Biber, D., S. Conrad and R. Reppen (1994) Corpus-based Approaches to
Issues in Applied Linguistics. Applied Linguistics 15:2:169-189.
Christ, O. (1996) Corpus Exploration Tools. Tutorial script. EURALEX
96, University of Göteborg, Sweden.
Clear, J. et al. (1996) COBUILD, The State of the Art. International
Journal of Corpus Linguistics 1:2:303-314.
Grefenstette, G. (1998) The Future of Linguistics and Lexicographers:
Will there be lexicographers in the year 3000? Plenary address. EURALEX
98, Proceedings, Univ. of Liège. pp. 25-41.
Halliday, M. A. K. (1998) Representing the child as a semiotic being
(one who means). Plenary Address. Intl. Conference on Representing The
Child. Monash University, Melbourne. 2-3 October.
Halliday, M. A. K. and J. R. Martin (1993) Writing Science. The Falmer
Kurtböke, P. (2000) 1001 texts: Ali Baba's Charcoal Chicken Delivery
YapIlIr. Paper presented at the 21st ICAME Conference, Macquarie
Leech, G. (1987) General Introduction. In R. Garside et al. (eds.), The
Computational Analysis of English - a corpus-based approach. Longman,
London. pp. 1-15.
Leech, G. (1993) Corpus Annotation Schemes. Literary and Linguistic
Lux, P. and W. Grabe (1991) Multivariate approaches to contrastive
rhetoric. Lenguas Modernas 18:133-60.
McNaught, J. (1993) User needs for textual corpora in Natural Language
Processing. Literary and Linguistic Computing 8:227-234.
Meijs, W. (1987) Preface. In W. Meijs (ed.) Corpus Linguistics and
Beyond - Proceedings of the Seventh International Conference on English
Language Research on Computerised Corpora. Rodopi, Amsterdam. pp. ii-
Oostdijk, N. and P. de Haan (eds.) (1994) Corpus-Based Research into
Language: In Honour of Jan Aarts. Rodopi, Amsterdam.
Quirk, R. (1992) On corpus principles and design. In Svartvik, pp. 457-
Rissanen, M., O. Ihalainen and M. Kytö (1987) The Helsinki Corpus of
English Texts. In Meijs, pp. 21-32.
Sinclair, J. (1996) The Search for Units of Meaning. Textus 9:75-105.
Sinclair, J. (ed.) (1987) Looking Up: An Account of the Cobuild Project
in Lexical Computing. Collins, London.
Schulze, B. M. et al. (1994) DECIDE Designing and Evaluating Extraction
Tools for Collocations in Dictionaries and Corpora. MLAP Project 93-
Svartvik, J. (1992) Corpus linguistics comes of age. In J. Svartvik (ed.)
Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82.
Stockholm, 4-8 August 1991. Mouton de Gruyter, Berlin. pp. 7-13.