LINGUIST List 17.104
Fri Jan 13 2006
Review: Corpus Linguistics: Sinclair (2004)
Editor for this issue: Lindsay Butler
What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley at dooleylinguistlist.org.
Trust The Text: Language, Corpus and Discourse
Message 1: Trust The Text: Language, Corpus and Discourse
From: Oliver Streiter <ostreiterweb.de>
Subject: Trust The Text: Language, Corpus and Discourse
AUTHOR: Sinclair, John McH.
EDITOR: Carter, Ronald
TITLE: Trust The Text
SUBTITLE: Language, Corpus and Discourse
PUBLISHER: Routledge (Taylor and Francis)
Announced at http://linguistlist.org/issues/15/15-2786.html
Oliver Streiter, National University of Kaohsiung, Taiwan
The book under review, ''Trust the Text: Language, corpus and
discourse'' by John Sinclair, is a collection of 12 papers on written
discourse structure, lexis structure, phraseology, lexicography and
linguistic theory. All papers have been published previously between
1982 and 2003, but many of these papers are not easily accessible.
Some have been published in Festschriften, others are transcripts of
lectures. The book thus tries to make these papers accessible to a
The author, John Sinclair, is one of the most influential and original
figures in contemporary linguistics. His focus on the analysis of
spoken language and his practical and theoretical work in corpus
linguistics, long before this had become mainstream, have influenced
many linguists and changed the face of modern linguistics.
Most of the ideas presented in this collection have been discussed or
assimilated by the research community and taken as a basis for
further research. A summary of this follow-up is beyond the scope of
this review. What this review can do, then, is to identify and explain
the key ideas presented in each paper and finally to evaluate whether
the book succeeds in disseminating these ideas.
The book, edited by Ronald Carter, is organized in three parts,
called 'Foundations', 'The organization of text' and 'Lexis and
grammar'.
PART I Foundations
Paper 1: Trust the text
This paper argues that the availability of electronic corpora should
lead to a re-evaluation of linguistic research traditions. It warns of
upward projections of proven linguistic techniques to areas with larger
linguistic units. For the analysis of discourse, thus, new techniques
and a new framework of description are needed.
One notion introduced is that of ''prospection'' in spoken discourse. A
prospection classifies what is going to follow in the discourse. Thus,
in contrast to backward-oriented models, which focus on antecedents
in the preceding discourse, it is argued that either the entire discourse
is encapsulated via a reference in the current sentence (examples can
be found in Paper 5, pg. 86, e.g. words like 'and', 'however', 'also' etc.)
or the current sentence has been projected by the preceding
discourse (as when you say ''... has dramatic consequence.'', where what
follows will be understood as the consequences).
The paper then continues and makes a number of claims which
challenge established assumptions:
+ The idea of a stable lemma is questioned as different word forms of
a lemma have different patterns of meaning.
+ A word that can be used in more than one word class tends to have
specific meanings associated with each word class. This correlation
between word class and meaning breaks down when the words form
part of idiomatic phrases or technical terms.
+ Words may have specific privileges or restrictions on how they are used
(as subject, in prepositional phrases etc.)
+ Words have subliminal meanings, as with the verb 'happen', which
refers to something nasty.
+ Grammar is a grammar of meaning and should state which meaning
corresponds to which grammatical pattern.
+ Words are not selected independently but share meaning
components which cannot be ascribed to a single word or a single
+ As a result of the common selection of related words, these words
have to give up parts of their meaning. This is referred to
as 'delexicalization'. This delexicalization is easily visible with adjective-
noun combinations in which adjectives lose much of their meaning,
e.g. when they stress part of the meaning of the noun (e.g. 'physical
Paper 2: The search for units of meaning
This paper proposes a linguistic unit called the 'lexical item', a unit in
the lexical structure to be selected independently and which then
selects lexical or grammatical patterns for its expression.
That words are not independent units can be seen from compounds,
phrasal verbs, proverbs etc. Words are more or less dependent on
each other and this dependence lies somewhere between an 'open
choice' and an 'idiom'. Open choice represents the 'terminological
tendency', i.e. the tendency for each word to have a fixed, context-
independent meaning. Idiomaticity represents the 'phraseological
tendency' where words are selected together and make meanings
from their combinations. While traditionally the terminological principle
is seen as central to language, this paper focuses on the
Phraseological combinations, even if considered to be fixed, allow for
small variations to fit the phraseological combination into its context. In
addition, the different components of a phraseological combination
have distinct functions. This is taken as an argument for their co-
The phraseological combination 'the naked eye' is analyzed. It is
shown that it consists of a semantic prosody ('difficult'), a semantic
preference ('see'), a colligation (preposition) and an invariable core,
i.e. the collocation 'the naked eye', example: 'just visible to the naked
eye'.
For the phraseological combination 'true feeling' the lexical item
consists of a semantic prosody ('reluctant'), a semantic preference
('communicate'), a colligation (possessive) and a collocation ('true
feelings'), as in 'try to communicate our true feelings'. The semantic
prosody and the semantic preference can be fused, as
in 'conceal', 'hide' or 'mask'.
A similar analysis is provided for the verb 'brook', which, because of its
infrequent usage, might be more independent of the context. But even
for this verb, a complex lexical unit can be identified if sufficient corpus
data are available.
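As a rough illustration of how an analysis of this kind can be approached computationally, the following Python sketch builds a small keyword-in-context (KWIC) concordance for a core phrase and counts the words found in its left-hand context. The function names `tokenize`, `kwic` and `left_collocates`, the window size and the three-sentence toy corpus are assumptions made for this example only; none of them is taken from the book, and a real study would need a large corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, core, window=4):
    """Return (left, core, right) context triples for every hit of `core`."""
    n = len(core)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == core:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            hits.append((left, core, right))
    return hits


def left_collocates(hits):
    """Count the words seen in the left-hand context of all hits."""
    counts = Counter()
    for left, _, _ in hits:
        counts.update(left)
    return counts


# Toy corpus, invented for illustration only (not corpus evidence).
corpus = (
    "The comet was just visible to the naked eye. "
    "These mites are too small to see with the naked eye. "
    "The crack is barely visible to the naked eye."
)

tokens = tokenize(corpus)
hits = kwic(tokens, ["the", "naked", "eye"])
counts = left_collocates(hits)

# Print the concordance lines, aligned on the core phrase.
for left, core, right in hits:
    print(" ".join(left).rjust(30), "|", " ".join(core), "|", " ".join(right))
print(counts.most_common(5))
```

Sorting or grouping the left-hand contexts in such a concordance is one simple way to look for recurring prepositions (colligation) or verbs of seeing (semantic preference); the interpretation of the patterns remains the analyst's task.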
PART II The organization of text
Paper 3: Planes of discourse
This paper integrates written language and discourse in one
framework as both are essentially interactive. Two notions are
introduced. The 'autonomous plane' of discourse gives access to the
record of experience of speakers by integrating previous experiences
in the form of words and phrases in a text structure. The 'interactive
plane' of discourse is in charge of negotiating between participants,
selecting the effect of utterances and what features of the outside
world utterances should incorporate. The organization of written text
is also managed on the interactive plane, e.g. predictions,
anticipations, self-reference, discourse labeling and participant
Some operations allow attention to be switched between the two
planes. 'Reports' transfer attention to the autonomous plane within an
utterance, so that the author does not have to adhere to the fact.
A 'reference' to the preceding discourse encapsulates the old
interaction and makes it available on the autonomous plane. 'Quotes'
however remain on the interactive plane.
In fiction, then, similar to a report, the author no longer avers each
utterance. However, she does not attribute the utterances to an author
in the real world either. The evaluation at the end (laughter, moral)
marks then the return to averral. The notions introduced in this paper
are then illustrated in the analysis of a fragment of fiction.
Paper 4: On the integration of linguistic description
This paper elaborates and illustrates the notions developed in the
previous paper. It is shown how the identification of the interactive and
autonomous plane of discourse can be used for a descriptive system
(annotation scheme) for the analysis of written texts and spoken
Paper 5: Written discourse structure
This paper elaborates ideas presented in Paper 1 in the analysis of
data. Of central importance is the idea of encapsulation. Each new
sentence takes over from the previous sentence the status of 'state of
the text'. By default, each new sentence encapsulates the previous
one by a reference. This removes the discourse function from the
previous sentence and leaves mainly a meaning trace in memory, and
only partially a trace of form. The encapsulation creates coherence
and cohesion is defined as the referencing act. Point-to-point
references, e.g. a pronoun referring to its antecedent, are then
interpreted mainly with reference to the shared knowledge and not the
'Logical acts' encapsulate the whole of the previous sentence (e.g.
through the words 'but', 'therefore') or the previous half of the same
sentence (e.g. through the words 'and', 'rather'). 'Deictic acts' also
include the whole of the previous sentence (e.g. 'that', 'this').
A 'prospection' about the next sentence requires the next sentence to
fulfill the created expectancies if coherence is to be maintained. A text
is analyzed to illustrate and discuss this notion. Different sub-types of
prospections, such as prospection through an attribution, internal
prospection or advanced labelling are introduced.
Paper 6: The internalization of dialogue
This paper tries to link spoken and written discourse in a single
description and does so in a very original way. The author claims that
properties of sentence grammar can be understood by relating
grammatical structures (subordinate clause, relative clause, noun
phrase etc.) to features of spoken interaction, and that in the
phylogenetic development of languages these features of spoken
interaction are internalized (understood as ''creating a (language)-
internal representation of'').
Through the internalization of the 'speaker change', a single speaker
can change the posture and present conflicting ideas. The speaker,
when marking this change, is no longer bound by the requirement to
be coherent in his posture.
Declarative, interrogative or imperative mood can be equally
understood as internalization of performative aspects of discourse. By
internalizing them the speaker can now achieve the same speech act
with a combination of different moods. This extends the range and the
finesse of mood choices and thus creates an open set of possible
The internalization of speech acts as subordinate clauses frees them
from their interactive function. Thus, hypotheses can be formulated by
the speaker. Through the internalization, the move (i.e. the discourse
unit) becomes a proposition, the averral becomes a truth value and
the situational context becomes a possible world. When internalized
as a restrictive relative clause, the clause may specify which
referents are included under a denotation by reference to a possible
world. Prepositional phrases and attributive adjectives are derived
from these by leaving the truth value unexpressed (e.g. dropping the
Paper 7: A tool for text explication
The author describes the history of text analysis/explication in its
various forms (stylistics, discourse analysis) as a periodical movement
between the poles of objectivity (e.g. using descriptive schemes) and
subjectivity (to achieve a qualitatively rich analysis). In an impressive
analysis of a small text fragment, the author shows how corpus data
can be used in a qualitatively rich analysis of discourse strategies
that is supported by massive objective data.
PART III Lexis and grammar
Paper 8: The lexical item
This paper starts from a historical account of the distinction
between 'word' and 'lexical item'. The author revives the notion
of 'lexical item' to describe the vocabulary in more meaningful terms,
e.g. to account for the fact that a vocabulary is a limited set of
meaningful items which in text can assume an unlimited number of
meanings. An alternative model according to which words are
exchanged within their paradigm is rejected as it creates artificial meanings
and meaning ambiguities which are not felt by a native speaker.
Instead, a mechanism called 'reversal' is introduced according to
which meaning is created from the context and takes precedence over
the meaning assigned in the vocabulary. When using 'lexical items' in
generation, there is less choice than with words and almost no
The components of lexical items are those we have seen in Paper 2:
the core, the semantic prosody (both obligatory), collocation,
colligation and semantic preference. Through their syntactic flexibility
(colligation) and semantic flexibility, lexical items allow for a limited
paradigmatic choice and thus an integration with other lexical items in
their context. New meanings are created when contextual constraints
and lexical specifications do not match. The nature of a lexical item is
illustrated in an analysis of the usage of the verb 'budge'.
Paper 9: The empty lexicon
This paper argues against the conception of language as a simple
code for a message. According to the author, a message is only part
of communication and the message cannot be easily separated or
distilled from the form as many elements are concerned with
negotiating the interaction and contributing to the message at the
Discussing terminology first, the paper contrasts the 'terminological
tendency', where words have fixed meanings, with the natural flexibility
and variability of language. The function that terminology has in the lexis
is the same function that sublanguages have in grammar.
Sublanguages also try to protect a chosen set of patterns and limit
contextual factors on meaning. The terminological approach and the
sublanguage approach are prevalent in a technical view on language,
e.g. in Natural Language Processing. The technical approach is better
suited to describe written language, especially scientific texts.
A proposal for a lexicon structure is elaborated. It includes two
sublexica. One is similar to a termbank, the other is the flexible
lexicon, initially empty. The lexicon learns about vocabulary from text
and it is constantly updated. The only fixed element in this lexicon is
its structure. It has three subcomponents: (1) the form of a lexical item,
(2) an environment and (3) a meaning, and associations between
elements of these subcomponents.
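The flexible part of this proposal can be pictured with a small Python sketch. It is only one possible reading of the description above: the class names `LexicalEntry` and `EmptyLexicon`, the methods `observe` and `meanings`, and the representation of an environment as a tuple of neighbouring words are assumptions introduced here, not details given in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One lexical item: its form plus observed environment/meaning links."""
    form: str
    # Maps an environment (hypothetical: a tuple of context words) to the
    # meanings observed in that environment, with occurrence counts.
    associations: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(int))
    )


class EmptyLexicon:
    """A lexicon that starts with no entries and is updated from text."""

    def __init__(self):
        self.entries = {}

    def observe(self, form, environment, meaning):
        """Record one occurrence of `form` in `environment` with `meaning`."""
        entry = self.entries.setdefault(form, LexicalEntry(form))
        entry.associations[tuple(environment)][meaning] += 1

    def meanings(self, form, environment):
        """Return the meanings recorded for `form` in `environment`."""
        entry = self.entries.get(form)
        if entry is None:
            return {}
        return dict(entry.associations.get(tuple(environment), {}))


lexicon = EmptyLexicon()
print(len(lexicon.entries))  # the lexicon starts out empty

# Invented observations, for illustration only.
lexicon.observe("naked eye", ["visible", "to", "the"], "unaided sight")
lexicon.observe("naked eye", ["visible", "to", "the"], "unaided sight")
print(lexicon.meanings("naked eye", ["visible", "to", "the"]))
```

In this sketch only the structure is fixed in advance; every entry and every association between form, environment and meaning is added by calls to `observe`, which stands in for whatever procedure would extract such observations from text.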
Paper 10: Lexical grammar
This paper discusses the notions of lexis and grammar. It explains why
these notions have been seen historically as two separate entities. A
model based on this opposition, however, cannot account for
meaning. Neither the study of the lexis with the help of referential or
logical semantics, nor the study of grammar can assign meaning to
syntagmatic patterns (cf. 'the naked eye'). Traditional frameworks
cannot handle cross-border categories, semantic prosody or the
vagueness of word classes. Without presenting an alternative model,
however, the paper finishes with an exemplary analysis, similar to
what we have seen with 'the naked eye'.
Paper 11: Phraseognomy
This short paper provides an analysis of the phrases 'Society of X'
and 'Society for X'. This paper does not pretend to provide deeper
insight beyond the specific example.
Paper 12: Current issues in corpus linguistics
This paper argues, essentially, against a number of ideas that are
neither referenced nor fully described. The first argument is directed
against the idea of a fixed, adequate lexicon for the purpose of Natural
Language Processing and, related to it, the idea of sublanguage. The
second fusillade goes against small corpora and the third against the
(over-)annotation of corpora.
While the overall impression of the book is very positive in terms of its
intellectual challenges, its linguistic inspirations, the historical
perspectives it offers and its capacity to bring together different lines
of research, I will not withhold some critical remarks.
First, different contributions vary in quality, scope and relevance.
Paper 11 is nice to read but lacks any import beyond what has been
stated repeatedly in the book. Paper 12, I experienced as simply
annoying. This paper epitomizes a writing style where positions are
criticized with only minimal summary of, or reference to, a specific person,
publication or school. I have been forgiving throughout the book,
seeing this style as the price for the wider view the author offers to the
reader, but this paper doesn't offer this wider view and the discourse
slips into unfair and unscientific shadow-boxing.
''But when someone says their corpus does not need to get any
bigger ...'' (pg. 188)
Second, statements such as the one above can only be understood in the
light of the assumption that corpus linguistics is a scientific paradigm
defined by the 'exemplary instance of scientific research' (Kuhn
1996/1962) realized by the author and his colleagues. Sometimes, this
assumption shows up in half-sentences:
''In corpus linguistics, by contrast, we have to work on the assumption
that ...'' (pg. 170)
''[T]he vast majority of work with corpora still takes place under the
assumptions of pre-corpus linguistics'' (pg. 176)
The author thus silently tries to monopolize the term 'corpus
linguistics' and to assign it the meaning of what Tognini-Bonelli
identifies correctly as 'corpus-driven approach' within the area of
corpus linguistics. The author thus denies the label 'corpus linguistics'
to those researchers who understand corpus linguistics differently,
e.g. as a (complementary) research method (Biber et al. 1998).
Third, the general tendency in these articles to cite research only
when it can be integrated en passant or to fire a broadside
on 'computational linguistics' or 'structural linguistics' is
counterproductive to the advancement of science. As Kuhn
(1996/1962) has taught us, new paradigms not only come up with a
new theory but also with new data. And this is what the author does
extraordinarily well. But as long as the data of the other paradigms
cannot be accounted for, or can be shown to be artificial data or
represent an artificial problem, we have two theories (old and new)
which describe different data derived from the same world. Much
would have been gained in this book if, instead of repeatedly
providing new data for theory verification, an analysis of other
theories' data had been given (e.g. in Paper 5, the so-called
donkey-sentences of Kamp & Reyle 1993, or in Part III,
Mel'cuk's 'heavy smoker' (1974) or Pustejovsky's 'fast car' and 'fast
Finally, attempts to make the language of the book accessible have
either not been made or they have not been successful. Sentence
structure is unnecessarily complex, e.g.:
''This chapter concerns the relation between the two types of patterns
that are mainly recognized as the means whereby language creates
meaning.'' (pg. 164)
and sometimes barely understandable:
''A user community that kept clearly separate the language that was
used in a particular subject-matter area, and whose usage in that area
differed markedly from its other usage and the usage of comparable
communities, while remaining largely within the rules of the general
language - such conditions would identify a sublanguage.'' (pg. 152)
''Professional linguists should not be surprised to experience a rather
disturbing effect from the massive surge in the availability of evidence
and the growing sophistication of the tools for examining it and testing
hypotheses against it that corpus linguistics has brought.'' (pg. 173)
To sum up, the content of the book will serve as a rich source of inspiration
to those who are involved in corpus linguistic research, lexicography
and discourse analysis. The book, however, is not suited as a general
introduction and certainly not as a textbook for university courses.
The price of the book, the writing style and the fragmented
presentation of ideas mean that the ideas will
remain difficult to access.
Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus
Linguistics: Investigating Language Structure and Use, Cambridge
Hans Kamp & Uwe Reyle (1993) From Discourse to Logic.
Introduction to Model theoretic Semantics of Natural Language,
Formal Logic and Discourse Representation Theory, Dordrecht,
Kluwer Academic Publishers.
Thomas S. Kuhn (1996/1962) The Structure of Scientific Revolutions.
University of Chicago Press, 3rd edition.
Igor A. Mel'cuk (1974) Opyt teorii lingvisticeskix modelej Smysl <=>
Text. Semantika, sintaksis. Izdatel'stvo ''Nauka'', Moskva.
James Pustejovsky (1995) The Generative Lexicon, MIT Press,
Elena Tognini-Bonelli (2001) Corpus Linguistics at Work. Benjamins.
ABOUT THE REVIEWER
Oliver Streiter teaches computational linguistics and corpus linguistics
at the National University of Kaohsiung, Taiwan. His current research
focuses on applications in Computer Assisted Language Learning
("Gymnzilla") and a project which aims at the compilation and
annotation of linguistic resources to support low density languages.