LINGUIST List 16.1724
Tue May 31 2005
Review: Corpus Ling: Barnbrook et al. (2004)
Editor for this issue: Naomi Ogasawara
What follows is a review or discussion note contributed to our Book Discussion Forum. We expect discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in. If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for review." Then contact Sheila Dooley at collberglinguistlist.org.
Message 1: Meaningful Texts
From: Mikhail Mikhailov <Mihail.Mihailovuta.fi>
Subject: Meaningful Texts
EDITORS: Barnbrook, Geoff; Danielsson, Pernilla; Mahlberg, Michaela
TITLE: Meaningful Texts
SUBTITLE: The Extraction of Semantic Information from Monolingual
and Multilingual Corpora
SERIES: Corpus and Discourse
PUBLISHER: Continuum International Publishing Group Ltd
Announced at http://linguistlist.org/issues/15/15-3578.html
Mikhail Mikhailov, School of Modern Languages and Translation
Studies, University of Tampere, Finland
[This review contains ISO-8859-2 (Latin 2) and Cyrillic characters, and
is best viewed using Unicode encoding. -- Eds.]
This volume is an edited collection of papers on corpus and corpus-
based linguistics. Many of the papers were originally presented at the
5th and the 6th TELRI seminars held in Ljubljana, Slovenia (2000) and
Bansko, Bulgaria (2001). The papers draw on material from different
languages, the topics discussed vary considerably, and different
approaches are used. All in all, there are 21 papers in the volume
plus an Introduction. The book is divided into two parts: Part I is
devoted to monolingual corpora and Part II deals with multilingual
corpora.
PART I. MONOLINGUAL CORPORA
1. Extracting concepts from dynamic legislative text collections (Gaël
Dias, Sara Madeira, and José Gabriel Pereira Lopes), pp. 5-16.
In this paper the problems of automated extraction of multiword terms
from legal texts are discussed. The software developed by the authors
of the paper is used for processing a dynamic collection of raw texts in
Portuguese. The SENTA (Software for the Extraction of N-ary Textual
Associations) module extracts multiword combinations which are likely
to be terms. Both contiguous and non-contiguous terms can be
extracted. The basic principles are similar to those of most research
in the field: the observed frequency with which the elements of the
string occur together is compared with the frequency statistically
expected. A web-based
interface of the module has been developed.
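The comparison of observed and expected frequencies can be illustrated
with a minimal sketch. It is not the SENTA algorithm, whose association
measure is not described in this review; the function name, the toy
token list and the independence-based expectation are assumptions made
purely for illustration, and only contiguous two-word strings are
handled.

```python
from collections import Counter

def observed_vs_expected(tokens):
    """Return {bigram: (observed, expected)} for adjacent word pairs.

    Expected counts assume the two words occur independently
    (an illustrative assumption, not the measure used by SENTA).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)          # number of word positions
    n_pairs = n - 1          # number of adjacent pairs
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] / n * unigrams[w2] / n * n_pairs
        scores[(w1, w2)] = (observed, expected)
    return scores

# Hypothetical toy data standing in for a legal text
tokens = "diario da republica publica o diario da republica".split()
scores = observed_vs_expected(tokens)
```

A pair whose observed count clearly exceeds its expected count, such
as ("diario", "da") in the toy data, would be a candidate term.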
2. A diachronic genre corpus: problems and findings from the
DIALAYMED-Corpus (DIAchronic Multilingual Corpus of LAYman-
oriented MEDical texts) (Eva Martha Eckkrammer), pp. 17-30.
The paper is concerned with the issues of compiling diachronic
corpora. The need to include texts from different chronological
periods presents many difficulties for the compiler: 1) genuine oral
data are not available for the early periods, 2) classifying and
sampling texts is problematic, 3) texts of certain genres are lacking
for certain periods. The corpus presented in the paper is DIALAYMED, a
multilingual diachronic corpus of medical information texts
(self-counseling texts). The corpus comprises seven languages (Spanish,
French, Italian, Portuguese, German, English) and is divided into
seven periods, from the Late Middle Ages to the 21st century.
DIALAYMED can be used both for studying changes within one of the
languages of the corpus and for cross-cultural research.
3. Word meaning in dictionaries, corpora and the speaker's mind
(Christiane Fellbaum with Lauren Delfs, Susanne Wolff and Martha
Palmer), pp. 31-38.
The authors of the paper point out the importance of merging
dictionaries and text corpora. Semantic tagging based on dictionary
definitions is one possible solution to the problem. It is clear that
manual semantic annotation of a large text corpus is an enormously
difficult and expensive task, and automated semantic tagging is much
needed. Nevertheless, it is important first to study the results produced
by human tagging. It has been found that there is a rather high rate of
disagreement between different human annotators. Most probably, the
reason lies in traditional dictionary models of meaning representation.
The authors claim it is not likely that human or machine annotators can
successfully perform one-to-one mapping of word senses in
polysemous words. They suggest it would be more reasonable to aim
at selecting clusters of senses or broader senses for the contexts
where the meaning is unclear.
4. Extracting meaning from text (Gregory Grefenstette), pp. 38-47.
There are two approaches to automated extraction of meaning from
text: the first aims at imitating human understanding, while the
second is based on statistical methods without reference to
knowledge representation. In this paper the second approach is
considered. Automated analysis of word lists and text structure can
provide the user with answers to various important questions: What kind of
text is this? What other texts are like this? What is the text about? How
good is the text? Annotated corpora can supply linguists with data on
morphology and syntax.
5. Translators at work: a case study of electronic tools used by
translators in industry (Riitta Jääskeläinen and Anna Mauranen), pp.
In this paper the use of software tools by Finnish translators is
reviewed. The study was part of the international project SPIRIT
(Supporting Peripheral Industries with Realistic Applications of
Internet-based Technology). It was found that most translators use basic tools
like electronic dictionaries and the Internet. Terminology management
software, translation memories, and corpus tools were virtually
unknown to them. The experiment with a group of in-house translators
showed that people of this profession are rather conservative and
prefer familiar software. Jääskeläinen and Mauranen suggest that
there should be more cooperation between the developers of software
for translators and the end-users.
6. Extracting meteorological contexts from the newspaper corpus of
Slovenian (Primož Jakopin), pp. 54-61.
In his paper, Jakopin presents methods for identifying weather
forecasts in a corpus of newspaper texts and for extracting significant
data from such contexts. It was relatively easy to extract the texts
of weather forecasts from the corpus automatically because of their fixed
length and standard headings. The quantitative study of the
meteorological texts shows that rather few lexemes appear mostly in
weather forecasts and not in the other texts of the corpus. However,
word bigrams prove to be much more interesting. Eight two-word terms
were extracted which occur in meteorological texts
with a probability of over 99 percent. Finally, there are clichéd
sentences in weather forecasts that have rather high frequency and
occur mostly in meteorological texts.
7. The Hungarian possibility suffix "-hat/-het" as a dictionary entry
(Ferenc Kiefer), pp. 62-69.
In this paper the problems of lexicographical description of modal
words are discussed, with the Hungarian possibility suffix as an example.
Kiefer demonstrates that the entry from a traditional dictionary is
inadequate both theoretically and descriptively. A small text corpus
gives much more information about the use of the possibility suffix and
the kinds of possibility it expresses. Still, the author argues that a good
entry cannot be based exclusively on corpus material; a good theory is
needed for interpreting usage examples. In the case of the Hungarian
suffix, a clear distinction between different kinds of modality (epistemic,
deontic, circumstantial, boulomaic, and dispositional) and
between semantic and pragmatic functions would help to develop
a more consistent lexicographical entry.
8. Dictionaries, corpora and word-formation (Simon Krek, Vojko
Gorjanc and Marko Stabej), pp. 70-82.
Like the previous one, this paper is concerned with issues of
lexicographical description. The authors study the principles of
presenting English adverbs derived from adjectives and ending in
-ly. The data from the dictionaries were compared with data from the BNC
and the Google search engine. Many -ly adverbs are registered as run-on
entries, although derivatives do not always take on all the meanings of
the primitives. Sometimes high-frequency derivatives are promoted to
headwords. Compilers of bilingual dictionaries often have to seek
different solutions because of the need to provide translation
equivalents, which often reveals differences in meaning
between the two words. The authors suggest that corpora and the
Internet can give lexicographers additional data, which would help
in evaluating the importance of lexical items and making decisions on
their status in the dictionary.
9. Hidden culture: using the British National Corpus with language
learners to investigate collocational behaviour, wordplay and culture-
specific references (Dominic Stewart), pp. 83-95.
Stewart shows how corpora can be used in language learning as a
source of culture-specific information. Some kinds of information are
fairly difficult to obtain from traditional dictionaries and encyclopedias. It
is particularly difficult to find the clue to wordplay in which some elements
of an idiom are replaced by other words (e.g. special queue <= special
brew). Stewart suggests looking up recurring collocates in the
corpora. The method seems to work well even when the retained elements
are high-frequency words (as in the example above). The use of
corpora even makes it possible to look up idioms by structural patterns.
10. Language as an economic factor: the importance of terminology
(Wolfgang Teubert), pp. 96-106.
Teubert focuses on the importance of terminology in the modern world.
Standardization of terminology is very important for the development of
technologies; many projects are carried out by international teams.
Although English is used more and more as a lingua franca, the development
of national scientific discourse and national terminologies remains a
part of technological progress. Therefore, there remains a great need
for updating multilingual terminological banks and collecting
multilingual text corpora. Special attention should be paid to 'soft
terminology', i.e. new terms which have already become part of
discourse but are not yet standardized. The development of knowledge
extraction technologies would help to 'filter out' and create lists of such
terms.
11. Lemmatization and collocational analysis of Lithuanian nouns
(Andrius Utka), pp. 107-114.
In this paper, issues of lemmatization are discussed. Lemmatization is
on the one hand a very useful procedure for bringing together all word
forms of a lexeme; on the other hand it is sometimes criticized because
important information on individual constituents of the lemma becomes
unavailable. The Lithuanian language is heavily inflected, which makes
the use of a lemmatized text corpus much more convenient.
Nevertheless, the researcher should not forget about the different
forms of the word and their usage. A case study of the word "teisybė"
('truth') demonstrates that different forms have different frequencies
and different collocations. Thus, the analysis of the lemma only gives a
generalized profile, while studying of each separate form would give
more precise information on the usage of the word.
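The contrast between a lemma profile and form-by-form profiles can be
sketched as follows. The small form-to-lemma table and the sample
tokens are hypothetical English stand-ins, not data from Utka's study;
a real lemmatizer would replace the lookup table.

```python
from collections import Counter, defaultdict

# Hypothetical form-to-lemma table (a real lemmatizer would supply this)
LEMMAS = {"truth": "truth", "truths": "truth", "truth's": "truth"}

def profiles(tokens):
    """Count each lemma overall and each word form under its lemma."""
    by_lemma = Counter()
    by_form = defaultdict(Counter)
    for token in tokens:
        lemma = LEMMAS.get(token)
        if lemma is None:
            continue  # token not covered by the toy table
        by_lemma[lemma] += 1
        by_form[lemma][token] += 1
    return by_lemma, by_form

tokens = ["truth", "truths", "truth", "the", "truth's", "truth"]
by_lemma, by_form = profiles(tokens)
```

Keeping both tables side by side gives the generalized lemma profile
without discarding the frequencies of the individual forms.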
12. Challenging the native-speaker norm: a corpus-driven analysis of
scientific usage (Geoffrey Williams), pp. 115-130.
Williams centers on the problem of non-native-speaker English. More
and more researchers, whose native language is not English, submit
their papers in English, while the proportion of native speakers of
English is declining. The situation in technical writing is probably the
most difficult. A case study of the use of the relative pronouns
"which/that" in a
corpus of plant biology research articles has shown that in many cases
an avoidance strategy is chosen, e.g. the writers tend to use simple
constructions that avoid relative clauses. Williams emphasizes the
importance of compiling specialized corpora as well as learner
corpora, which would help to improve the level of technical writing.
PART II. MULTILINGUAL CORPORA
13. Chinese-English translation database: extracting units of
translation from parallel texts (Chang Baobao, Pernilla Danielsson and
Wolfgang Teubert), pp. 131-142.
This paper examines methods of extracting translation equivalents
from parallel texts. The research is carried out on a Chinese-English
parallel corpus of about 17 million running words per language.
Translation correspondences detected by software should be
unambiguous. That is why the authors suggest that the best solution is
to seek correspondences between multiword units. The texts of the
corpus are therefore chunked into multiword units. Both chunking and the
search for equivalents are done using statistical techniques. Four different
statistical scores were tested (MI, Dice, log likelihood, chi-2) and it was
found that LL and chi-2 achieved better accuracy than the other
two coefficients. Precision and recall of the software were improved by
checking syntax patterns.
14. Abstract noun collocations: their nature in a parallel English-Czech
corpus (Frantisek Cermák), pp. 143-153.
Cermák shows in his article that there are differences in the functioning
of abstract and concrete nouns. A contrastive analysis of
abstract nouns in Orwell's "1984" and its Czech translation has been
performed. It was found that there is no direct correspondence
between items of source and target texts. Verbs can be translated as
nouns, nouns as adjectives, etc. The study of verbal collocational
patterns of ACTION, EMOTION and LANGUAGE abstract nouns
demonstrates the following tendencies: 1) inchoative verbs were the
most typical collocations for all three groups of abstract nouns, 2)
terminative verbal collocations were the least typical, and the
LANGUAGE nouns seem to avoid the terminative phase, 3) the study
revealed a certain asymmetry between English and Czech noun
15. Parallel corpora and translation studies: old questions, new
perspectives? Reporting "that" in Gepcolt: a case study (Dorothy
Kenny), pp. 154-165.
Comparable corpora are used extensively nowadays in translation
studies; the main issue of current research projects is the language of
translations in contrast to authentic language (see e.g. Baker 1993,
Laviosa 1998, cf. Mauranen and Jantunen 2005). Kenny shows in her
paper that it is difficult to explain findings from comparable corpora
using only translated texts without comparing them to the original
texts. That is why parallel corpora should be used together with
comparable ones. The cross-language comparison of the use of the
optional German connective "dass" and the optional English
connective "that" in a German-English parallel corpus of literary texts
(Gepcolt) demonstrates that it is difficult to claim direct influence of
the source text on the language of translation.
16. Structural derivation and meaning extraction: a comparative study
of French/Serbo-Croatian parallel texts (Cvetana Krstev and Dusko
Vitas), pp. 166-178.
This article shows the importance of structural derivation in the
Serbo-Croatian language and the necessity of taking it into account in
linguistic software applications. The use of traditional lemmatization
(only different forms of the same lexeme) narrows the results of a
search. The authors of the article suggest expanding the inflectional
classes of nouns so that various kinds of structural derivation (diminutives,
augmentatives, feminine forms, possessive adjectives, etc.) are also
included in augmented entries. This improves search results in
parallel corpora and makes search for translation equivalents more
17. Noun collocations from a multilingual perspective (Ruta
Marcinkeviciene), pp. 179-187.
The topic of the paper is close to that of Cermák's paper in this
volume. Marcinkeviciene studies a parallel concordance of the English
noun "memory" in Orwell's "1984" and six translations of the novel.
Special attention is paid to the verbal collocations of "memory" and its
equivalents in other languages. The research demonstrates that
translators in most cases preserve collocational patterns of the target
language rather than try to keep collocations of the source language
18. Studies of English-Latvian legal texts for Machine Translation
(Inguna Skadina), pp. 188-195.
The paper deals with the study of ambiguous words in parallel corpora.
The aim of the research is to find methods of improving the quality
of machine translation. A study of parallel contexts for several Latvian
words provided new translation equivalents not registered in the
dictionaries; some of the equivalents turned out to be quite frequently
used. The author suggests that parallel corpora of specialized texts
are a very valuable source of data for terminology databases and
machine translation systems. The corpus-based approach is also one
of the ways to improve the quality of printed dictionaries.
19. The applicability of lemmatization in translation equivalents
detection (Marko Tadic, Sanja Fulgosi and Kresimir Sojat), pp. 196-
In this paper, the process of automated extraction of translation
equivalents from a Croatian-English parallel corpus is outlined. The
current version of the software is based exclusively on statistical
methods (pointwise mutual information); however, the use of linguistic
filters is also planned for later stages. The algorithm extracts
one-to-one equivalent pairs; generation of other kinds of equivalent pairs
(1-2, 2-1, 2-2, ...) is also possible. Still, the problem of the very large
number of combinations (combinatorial explosion) remains to be solved. The
algorithm was tested on both non-lemmatized and lemmatized
material. The hypothesis that search for translation equivalents on
lemmatized texts is more effective for inflected languages like Croatian
20. Cognates: free rides, false friends or stylistic devices? A corpus-
based comparative study (Spela Vintar and Silvia Hansen-Schirra), pp.
Vintar and Hansen-Schirra study cognate words (like EN "sport" vs.
GE "Sport") in English-German and English-Slovene parallel corpora.
The research demonstrated that the percentage of cognates in Slovene
translations from English is quite close to that in translations from
English into German. However, the comparison with texts originally
written in German and Slovene shows that the percentage of cognates in
Slovene translated texts is slightly lower than in original Slovene texts,
while in German translations there are twice as many cognates as in
original German texts. The phenomenon is most likely caused by purist
tendencies in Slovene, a language of only two million speakers, and
the openness of the German language to linguistic influences. The
comparison of frequencies of cognates and 'native' synonyms in
Slovene and German reference corpora confirms the hypothesis.
21. Trilingual corpus and its use for the teaching of reading
comprehension in French (Xu Xunfeng and Régis Kawecki), pp. 222-
The paper examines the possibilities of using parallel corpora in
language teaching. An online English-French-Chinese parallel corpus
was used in teaching reading comprehension. An experiment with
three groups of students in Hong Kong showed that the reading
comprehension skills of the test group improved significantly after six
weeks of reading trilingual texts on the Web. There are plans to develop
this learning tool further: a comprehension test and an online
concordancer will be added.
The issues discussed in this volume have received a great deal of
attention in research of the past decade. A strength of the book is
that it includes work by different scholars working with different
languages. In fact, the papers in this book deal with twelve
languages. The papers in the volume are fairly short and most of them
present results of case studies. However, the articles are interesting to
read and the methods introduced are applicable to different linguistic
phenomena. I read with special interest the papers by Christiane
Fellbaum et al. (3), Dominic Stewart (9), Chang Baobao et al. (13),
Dorothy Kenny (15), Cvetana Krstev and Dusko Vitas (16), Ruta
Marcinkeviciene (17), Spela Vintar and Silvia Hansen-Schirra (20).
1. I understand that it is extremely difficult to put together these very
diverse contributions under the same title. However, the title of the
volume is rather misleading. One would expect this to be a collection
of papers on automated text processing, semantic tagging,
disambiguation, translation memories, etc. Unfortunately, only half of
the articles can be considered as studying the issues of "The
Extraction of Semantic Information from Monolingual and Multilingual
Corpora". Many articles are rather loosely related to the subject (e.g.
chapters 2, 5, 7, 8). The title "Meaningful texts" without the subtitle
would have been ambiguous enough to cover all the publications of
2. The division of the volume into two parts does not seem to work
well. The editors themselves admit that there are the same questions
discussed in both 'monolingual' and 'multilingual' parts, e.g.
lemmatization, noun collocations (p. 1). Furthermore, the DIALAYMED
corpus, presented in the paper by Eckkrammer, is a multilingual
corpus, and I don't quite understand why the article is placed in the
first part of the book. The paper by Wolfgang Teubert is not devoted
exclusively to monolingual corpora either.
3. The order in which the papers are arranged gives the impression of
being random. Both Cermák (14) and Marcinkeviciene (17)
study noun collocations. Chang Baobao et al. (13), Skadina (18),
and Tadic et al. (19) discuss translation equivalents in parallel corpora.
Why did the editors not place the papers dealing with closely related
problems one after another? After studying the table of contents once
again I realized that the order is alphabetical. In any case, the best
solution would be to arrange ALL the papers of the book in
alphabetical order without dividing them into two parts.
4. Some papers of the volume reference each other (Marcinkeviciene
=> Cermák) but it does not seem that cross-referencing is carried out
COMMENTARY ON SPECIFIC PAPERS
Gaël Dias, Sara Madeira, and José Gabriel Pereira Lopes:
It is not quite clear how effective the method is and what percentage of
noise the software produces.
Andrius Utka:
The facts about different frequencies and different collocations of
different forms of the same word are very interesting and important.
However, I do not understand why one should give up lemmatization
just because of that. To my mind, the researcher should combine the
study of the lexeme with that of its different forms. Besides,
lemmatization and tagging would help to filter out homonymous forms.
Thus, studying raw
text would be a step back; it is better to improve the mark-up of the
Chang Baobao, Pernilla Danielsson and Wolfgang Teubert:
"Translation Equivalent Pair (TEP): a Translation Equivalent Pair is
composed of both a source-language Translation Unit and a target-
language Translation Unit, which are mutual translations" (p. 133). I
am not quite sure that bidirectional equivalence is common even in
terminology. If the TEPs are extracted from a Chinese-English corpus,
they will be Chinese-English TEPs, not English-Chinese as well. To
obtain English-Chinese TEPs one would need an English-Chinese
parallel corpus. I am pretty sure the lists of TEPs obtained from
English-Chinese and Chinese-English corpora would be different. In
this respect, the idea of the existence of '_mutual_ translations' is rather
misleading and simplistic.
Cvetana Krstev and Dusko Vitas:
I completely agree with the authors that including derivatives in
the dictionary entry is very important for automated text analysis of
Slavonic languages. However, only noun derivatives are discussed; it
would be interesting at least to mention verbal and adjectival
derivation as well (e.g. verbal aspect pairs in Russian present a very
serious problem for word alignment). Besides, although the problem of
word alignment for "baron" is solved very elegantly in the paper, it
would be interesting to discuss the possibilities of word-suffix
alignment as well. Sometimes diminutives and other derivational
suffixes may have explicit correspondences in translation, sometimes
they have to be ignored, e.g. the Russian diminutive
noun "berezka" 'little birch' can be translated into English
as "birch", "pretty birch" or "little birch".
Marko Tadic, Sanja Fulgosi and Kresimir Sojat:
The method of extraction of translation equivalents introduced in the
paper seems to generate at the first stage many 'impossible' pairs
like 'article-verb' or 'conjunction-noun', which are of course filtered out
at later stages but still slow down the process considerably.
Generating 'reasonable' translation pairs from the very beginning by
employing linguistic filters would help to avoid combinatorial explosion.
Use of stopwords would be an easy and robust solution.
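A minimal sketch of the stopword suggestion: function words are dropped
before any candidate pair is generated, and the remaining one-to-one
pairs are scored by pointwise mutual information over aligned sentence
pairs. The stopword lists, the toy sentences and the function name are
hypothetical, and the counting scheme is an assumption rather than the
one used in the paper.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical stopword lists; real lists would be far longer
STOP_HR = {"je", "i"}
STOP_EN = {"the", "a", "is"}

def candidate_pairs(aligned, stop_src=STOP_HR, stop_tgt=STOP_EN):
    """Score 1-1 word pairs by PMI over aligned sentence pairs,
    removing stopwords before any pair is generated."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in aligned:
        src_words = set(src.lower().split()) - stop_src
        tgt_words = set(tgt.lower().split()) - stop_tgt
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    n = len(aligned)
    return {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }

# Toy Croatian-English sentence pairs (illustrative only)
aligned = [
    ("kuća je velika", "the house is big"),
    ("kuća je stara", "the house is old"),
    ("veliki pas", "a big dog"),
]
pmi = candidate_pairs(aligned)
```

Because the filter is applied before the Cartesian product is built,
pairs such as ("je", "the") never enter the candidate table at all.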
Spela Vintar and Silvia Hansen-Schirra:
The principles of automated search for cognates formulated in the
paper look rather simplified; differences in orthographic traditions
should be taken into account (see e.g. Tiedemann 2003: 50-51).
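One simple way of allowing for orthographic differences is to
normalize spelling before comparing strings. The sketch below strips
diacritics and then measures string similarity with Python's difflib;
it illustrates the general idea only and is not the procedure used in
the paper or in Tiedemann (2003). The function names are assumptions.

```python
import unicodedata
from difflib import SequenceMatcher

def strip_diacritics(word):
    """Lower-case a word and remove combining marks (e.g. š -> s)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def similarity(w1, w2, normalized=True):
    """String similarity in [0, 1] based on difflib's matching blocks."""
    if normalized:
        w1, w2 = strip_diacritics(w1), strip_diacritics(w2)
    return SequenceMatcher(None, w1, w2).ratio()

# English "sport" vs. Slovene "šport" (written with an escape for š)
raw = similarity("sport", "\u0161port", normalized=False)
norm = similarity("sport", "\u0161port")
```

Without normalization the pair scores 0.8; after normalization it
scores 1.0, so a fixed similarity threshold would treat the two cases
differently.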
"According to Baker (1996), translations should be longer than
originally produced texts in the target language or in the source
language. The evidence for this tendency may, for example, be found
in the text length (number of words of the individual texts)" (p. 212). I
think the idea was originally formulated by Nida and Taber (Nida &
Taber 1974: 163). Still, it is not quite clear how one can compare
the lengths of source and target texts that are written in different
languages. E.g. there are fewer words in translations from English into
Russian (because there are no articles in Russian); translations from
English and Russian into Finnish (no articles and few prepositions plus
compound words in Finnish) also tend to be 'shorter'. Character counts
can also be misleading, because word lengths differ from language to
language. Thus, although the heuristic seems to be quite reasonable,
it is not possible to prove the explicitation tendency simply by
comparing word or character counts of source and target texts
(Mikhailov 2003: 165-174).
Xu Xunfeng and Régis Kawecki:
It would be interesting to know what kind of teaching methods were
used. Or was it just reading parallel texts?
Finally, I noticed some misprints in the volume, e.g. Russian examples
in the paper by Marcinkeviciene (pp. 180-185). [Examples omitted; see
To sum up, the book can be recommended to those who are
interested in corpus-based linguistics and corpus-based translation
studies, especially if their research is concerned with Slavonic or Baltic
languages.
Baker, Mona (1993) Corpora in Translation Studies: An Overview and
Some Suggestions for Future Research. Target 7(2): 223-43.
Laviosa, Sara (1998) The English Comparable Corpus: A Resource
and a Methodology, in Bowker, Lynne, Cronin, Michael et al (eds.)
Unity in Diversity? Current Trends in Translation Studies. Manchester:
Mikhailov, Mikhail (2003) Parallel'nye korpusa xudozhestvennyx
tekstov: principy sostavlenija i vozmozhnosti primenenija v
lingvisticheskix i perevodovedcheskix issledovanijax (Parallel corpora
of literary texts: principles of compilation and use in linguistics and
translation studies, in Russian) Acta Universitatis Tamperensis, 956.
Acta Electronica Universitatis Tamperensis, 280. University of
Mauranen, Anna & Jarmo Jantunen, eds. (2005) Käännössuomeksi.
Tutkimuksia suomennosten kielestä. Tampere University Press.
Nida, E. A. & Taber, C. R. (1974) The Theory and Practice of
Translation. Leiden: E.J. Brill.
Tiedemann, Jörg (2003) Recycling translations: Extraction of lexical
data from parallel corpora and their application in natural language
processing. Uppsala: Acta Universitatis Upsaliensis.
ABOUT THE REVIEWER
Mikhail Mikhailov is a senior lecturer at the School of Modern
Languages and Translation Studies, University of Tampere, Finland.
His main research interests lie in parallel corpora and corpus-based
translation studies. He is currently working on methods of studying
Russian-Finnish parallel texts.