Review of From the COLT’s mouth ... and others’.

Reviewer: Yuancheng Tu
Book Title: From the COLT’s mouth ... and others’.
Book Author: Leiv Egil Breivik Angela Hasselgreen
Publisher: Rodopi
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Subject Language(s): English
Norwegian Bokmål
Language Family(ies): Germanic
New English
Issue Number: 13.3277

Date: Tue, 10 Dec 2002 20:20:26 -0600 (CST)
From: Yuancheng Tu
Subject: Sociolinguistics: Review of Breivik and Hasselgren (eds) (2002)

Breivik, Leiv Egil and Angela Hasselgren (2002) From the Colt's Mouth
... And Others'. Rodopi, x+260pp, hardback ISBN 90-420-1479-2, $61.00,
Language and Computers: Studies in Practical Linguistics 40.

Yuancheng Tu, Department of Linguistics, University of Illinois at

'From the COLT's mouth ... and others'' is a collection of fifteen
papers, each exploring a different problem in corpus linguistics. The
title makes more sense once the reader knows that COLT (The Bergen Corpus
of London Teenage Language) is an English corpus of teenage speech
compiled by Anna-Brita Stenström; its connection with the subtitle,
'Language Corpora Studies in honour of Anna-Brita Stenström', then becomes
clear. The title also reflects the fact that most of the research in this
collection is related to spoken corpora such as COLT.

The first paper, 'Does corpus linguistics exist? Some old and new issues'
by Jan Aarts, deals with a number of methodological questions in corpus
linguistics. The 'old issues' include the types, nature and use of corpus
data. There are two 'new issues'. One is a heightened interest in spoken
language, prompted by the availability of new electronic resources such as
COLT. The other is the distinction between the corpus-based approach and
the corpus-driven approach (Tognini-Bonelli 2001). Aarts argues that the
most prominent difference between the two lies in their attitude towards
the annotation of corpus data: in the corpus-based approach annotation is
indispensable, while in the corpus-driven approach it is anathema. The
paper concludes that corpus linguistics does exist and that it differs
from theoretical linguistics in its object of study: theoretical
linguistics is concerned with competence, corpus linguistics with language
in use, a distinction first pointed out by Leech (1992).

In 'Zero translations and cross-linguistic equivalence: Evidence from the
English-Swedish Parallel Corpus', Karin Aijmer and Bengt Altenberg report
that cross-linguistic non-equivalence is not the only reason for omission
in translation. They use the English-Swedish Parallel Corpus to
demonstrate that the occurrence of zero translation is also governed by
other factors, such as the clarity of the context, language-specific
conventions and even cultural differences. The evidence comes from
adverbial connectors in both English and Swedish, Swedish modal particles,
English discourse particles and translations of terms of endearment from
Swedish into English. By Jan Aarts's criteria, the research reported in
this paper is typically corpus-based: it uses corpus statistics as
evidence for a theoretical view.

Gisle Andersen's 'Corpora and the double copula', by contrast, is a
typical corpus-driven paper. Data from the Internet and the British
National Corpus exhibit a new sentence structure involving a double
copula, as in _The best part is, is that you get to shoot your opponent_.
Rather than explaining the double copula as an arbitrary hesitation
feature, Andersen shows that it is a new grammatical feature: the tendency
to repeat the copula before a nominal that-clause in the context of a
focus construction. He argues that the construction is a conflation of two
focusing structures, the wh-cleft and clausal subject postponement of the
type 'The point/issue/question is that'. Because it is not clear whether
the Internet data represent spoken or written language, the double copula
may be no more than a phenomenon of spoken language. However, the author
provides evidence that the structure is spreading along several
dimensions: from spoken to written language, from American English to
English more generally, and from informal to more formal contexts.
'The non-nominal character of spoken English' by Pieter de Haan seeks
evidence from the British National Corpus sampler CD-ROM (one million
words of spoken English and one million words of written English) to
confirm the claim that written English has a strong nominal character
whereas spoken English has a strong verbal, or clausal, character. It is
therefore typical corpus-based research. The paper also provides evidence
of a cline from informal spoken language to informative writing, which has
the strongest nominal character.

The main concern of the next paper is exactly what its title says:
'Teenage slang in Norway'. Eli-Marie Drange summarizes some of the results
from a research project survey on Nordic teenage language. The survey
shows a new trend: apart from English, more and more words are coming from
other languages such as Arabic and Spanish. Many of these words are in the
process of being adapted to Norwegian spelling and morphology.

'The semantics and pragmatics of the Norwegian concessive marker likevel:
Evidence from the English-Norwegian Parallel Corpus' by Thorstein Fretheim
and Stig Johansson recalls the second paper in this book, 'Zero
translations and cross-linguistic equivalence: Evidence from the
English-Swedish Parallel Corpus' by Karin Aijmer and Bengt Altenberg. Both
papers use a parallel corpus, compare languages and deal with translation
strategies. Fretheim and Johansson claim that no single form in English
parallels the Norwegian concessive marker _likevel_. This lack of a formal
counterpart in English triggers omission in translation from Norwegian
into English. In addition, evidence from the English-Norwegian Parallel
Corpus supports the idea that the differences between Norwegian and
English are most striking with _likevel_ in medial and final position,
where more inferential processing is required. The two languages are more
alike with regard to local concessive linking, signaled by initial
_likevel_ and by English concessive links of the _even so_ type.

'Sound a bit foreign', by Angela Hasselgren, compares the use of small
words such as _well_, _all right_ and _sort of_ by more or less fluent
Norwegian learners of English and by native English speakers. The quality
of small-word usage is evaluated functionally, in terms of the ability to
send the signals most essential to communication. The paper demonstrates
that as speakers' fluency increases, they are likely to use more small
words and send more basic signals. However, the real difference lies in
the range of small words used by the more fluent learners and by the
native speakers. The limited range of the fluent learners deprives them of
the pragmatic overtones that native speakers give to their signals and
therefore makes them sound a little foreign.

'Congratulations, like: -Gratulerer, liksom! Pragmatic particles in
English and Norwegian' by Ingrid Kristine Hasund presents the similarities
between the pragmatic particles _like_ in English and _liksom_ in
Norwegian. Hasund suggests that these two particles are used in similar
ways to mark the speaker's epistemic stance towards the content or form of
an utterance. The Bergen Corpus of London Teenage Language (COLT) is used
for the English part of the study and a corpus of spoken Oslo teenage
language for the Norwegian part.

'Applications of the Stenström model of discourse structure' by John M.
Kirk applies the Stenströmian model to a variety of transcribed spoken
datasets and focuses on question and response exchanges, which are
numbered in each excerpt. The excerpts Kirk uses in this paper are from
the London-Lund Corpus, the Map Task Corpus, and Dynasty, an American
television soap opera. All of them support the idea that different types
of conversational data or written dramatic dialogue can be identified and
categorized by the Stenströmian model.

In 'The Britain: An unexpected case of article usage in present-day
English', Goran Kjellmer investigates variation in article usage among
names of countries such as _the UK_, which influences the use of the
article with Britain. According to Quirk (1985), names of countries have
no article, even with a premodifying adjective. However, one advertisement
for the British Council on the Internet uses the article _the_ before
Britain. By searching the BNC, Kjellmer found that 'the Britain' actually
occurs repeatedly. He attributes this to analogy with usages such as _the
UK_.

'What vocabulary tells us about genre differences: A study of lexis in
five newspaper genres' by Magnus Ljung is a corpus-based study of lexical
differences. Five newspaper genres were selected: hard news, sports news,
business news, arts articles and obituaries. The data were taken from the
same five weekdays in the CD-ROM-based 1997 issues of The Times and The
New York Times. The results show that differences in word use do signal
genre differences within certain textual parameters. Both newspapers tend
to be most formal in general news and least formal in sports.

'What is a grammatical rule?' by Dieter Mindt presents a new perspective
on the definition of grammatical rules. Instead of a description with
exceptions, a grammatical rule here resembles a mathematical function,
namely the exponential decay function. The evidence comes from
probability distributions derived from corpus statistics. Each grammatical
rule is represented by a probability distribution over a set of classes,
and the classes that fall below 5% are what are traditionally called
exceptions. This distributional representation of grammatical rules can
predict diachronic change in language, which cannot be achieved with the
traditional definition of a grammatical rule.
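
The distributional view described above can be made concrete with a small
sketch. The counts and class labels below are invented for illustration
and do not come from Mindt's paper; the function name class_distribution
is likewise an assumption. Only the 5% cutoff is taken from the summary
above.

```python
def class_distribution(counts, threshold=0.05):
    """Rank the classes of a rule by relative frequency.

    Returns a list of (label, proportion, is_exception) tuples, where
    is_exception marks classes whose share falls below the threshold.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(label, n / total, n / total < threshold) for label, n in ranked]


# Invented corpus counts for three hypothetical classes of one rule.
counts = {"class_a": 800, "class_b": 170, "class_c": 30}

dist = class_distribution(counts)
for label, proportion, is_exception in dist:
    tag = "exception" if is_exception else "core"
    print(f"{label}: {proportion:.1%} ({tag})")
```

On this view an 'exception' is simply the low-frequency tail of the same
distribution rather than a separate category, which is what allows
changes in the proportions to be tracked over time.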

David Minugh investigates the distribution of the formal adposition
_notwithstanding_ in English in 'Her COLTISH energy notwithstanding: An
examination of the adposition notwithstanding'. This word is interesting
because it can occur prepositionally or postpositionally. Using statistics
from 1845 million words from present-day English and newspaper CDs, he
shows that written American English is most willing to use the
postpositional form and that the governed NP is longer with the
postpositional form than with the prepositional form.

'As and other relativizers after same in present-day standard English' by
Gunnel Tottie and Hans Martin Lehmann examines the use of _as_ as a
relative marker in constructions where the antecedent contains the word
_same_. Data from BNC-S and The Times show that same-constructions occur
much more frequently with relativizers that have adverbial function,
predominantly of the manner type. A pragmatic explanation is provided to
account for this phenomenon, and etymology is used to show why _as_ is
used as a relativizer after _same_.

In 'Looking for attitudes in corpora', Anne Wichmann looks into the ways
people say things, using ICE-GB, the British contribution to the
International Corpus of English. She chooses nine word tokens and two
sentence structures as seeds to explore the corpus, and her statistics
reveal that people do not seem to talk about tone of voice very much,
though they intuitively recognize it and respond to it. Wichmann also
presents her categorization of the various kinds of meaning that seem to
be encoded in the attitudes of people saying things.

This book is in honor of Anna-Brita Stenström. All fifteen papers are
directly or indirectly marked by something she has done or written on
spoken corpora and discourse analysis. The research in every paper is more
or less related to spoken data, except the first one, which is about
methodology. Even in that paper, however, Aarts points out that a new
trend in corpus linguistics is the investigation of spoken data. This
collection provides concrete evidence of the contribution of corpus
linguistics. Researchers observe new structures in large corpora that lie
beyond linguists' intuition and introspection, such as the double copula
structure reported by Gisle Andersen. The probability distribution of a
grammatical rule can signal diachronic change in language in a way that
traditional description cannot. In summary, this is a valuable collection
for corpus-related studies, especially studies of spoken corpora.

Leech, G. 1992. Corpora and theories of linguistic performance. In J.
Svartvik (ed.) _Directions in corpus linguistics. Proceedings of Nobel
Symposium 82, Stockholm_, 4-8 August 1991. Berlin: Mouton de Gruyter.

Tognini-Bonelli, E. 2001. _Corpus linguistics at work_. Amsterdam: John

Yuancheng Tu is currently a Ph.D. student in the Department of Linguistics at the University of Illinois at Urbana-Champaign. His research areas are computational lexical semantics and corpus linguistics. He is now working on his Ph.D. thesis, which involves building a semantic network called PhraseNet from large corpora. Functions are written for PhraseNet to interact with WordNet and expand it, in order to generate semantic features for other Natural Language Processing applications such as Question Answering and Prepositional Phrase Attachment.

