Date: Fri, 27 May 2005 17:04:50 +0300 From: Mikhail Mikhailov Subject: Meaningful Texts: The Extraction of Semantic Information from Monolingual and Multilingual Corpora
EDITORS: Barnbrook, Geoff; Danielsson, Pernilla; Mahlberg, Michaela TITLE: Meaningful Texts SUBTITLE: The Extraction of Semantic Information from Monolingual and Multilingual Corpora SERIES: Corpus and Discourse PUBLISHER: Continuum International Publishing Group Ltd YEAR: 2004
Mikhail Mikhailov, School of Modern Languages and Translation Studies, University of Tampere, Finland
[This review contains ISO-8859-2 (Latin 2) and Cyrillic characters, and is best viewed using Unicode encoding. -- Eds.]
This volume is an edited collection of papers on corpus and corpus-based linguistics. Many of the papers were originally presented at the 5th and the 6th TELRI seminars held in Ljubljana, Slovenia (2000) and Bansko, Bulgaria (2001). The papers draw on different language material, the topics vary considerably, and different approaches are used. All in all, there are 21 papers in the volume plus an Introduction. The book is divided into two parts: part I is devoted to monolingual corpora and part II deals with multilingual corpora.
PART I. MONOLINGUAL CORPORA
1. Extracting concepts from dynamic legislative text collections (Gaël Dias, Sara Madeira, and José Gabriel Pereira Lopes), pp. 5-16. This paper discusses the problems of automated extraction of multiword terms from legal texts. The software developed by the authors is used for processing a dynamic collection of raw texts in Portuguese. The SENTA (Software for the Extraction of N-ary Textual Associations) module extracts multiword combinations which are likely to be terms. Both contiguous and non-contiguous terms can be extracted. The basic principle is similar to that of most research in the field: the observed frequency of re-occurrence of the elements of the string is compared with the frequency that is statistically expected. A web-based interface to the module has been developed.
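The general principle of comparing observed with expected frequency can be sketched as follows. This is only an illustration for contiguous two-word strings under a simple independence assumption; it is not the measure implemented in SENTA, and the function name and the toy sentence are made up for the example.

```python
from collections import Counter


def bigram_observed_vs_expected(tokens):
    """Return {bigram: (observed, expected)} for contiguous word pairs.

    expected = count(w1) * count(w2) / N is the count one would predict
    if the two words occurred independently of each other. This is an
    illustrative baseline, not the SENTA association measure.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: (obs, unigrams[pair[0]] * unigrams[pair[1]] / n)
        for pair, obs in bigrams.items()
    }


# Toy input (invented for the example).
tokens = "the supreme court ruled that the supreme court may review the law".split()
scores = bigram_observed_vs_expected(tokens)

# Rank candidates by how far the observed count exceeds the expected one.
ranked = sorted(scores.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for pair, (obs, exp) in ranked[:3]:
    print(pair, obs, round(exp, 2))
```

Pairs whose observed count is well above the expected count are term candidates; a real system would also need a frequency threshold, since such ratios are unstable for rare words.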
2. A diachronic genre corpus: problems and findings from the DIALAYMED-Corpus (DIAchronic Multilingual Corpus of LAYman-oriented MEDical texts) (Eva Martha Eckkrammer), pp. 17-30. The paper is concerned with the issues of compiling diachronic corpora. The necessity to include texts from different chronological periods presents many difficulties for the compiler: 1) genuine orality is not available for the early periods, 2) texts are difficult to classify and sample, 3) texts of certain genres are lacking for certain periods. The corpus presented in the paper is DIALAYMED, a multilingual diachronic corpus of medical information texts (self-counseling texts). The corpus comprises seven languages (Spanish, French, Italian, Portuguese, German, English) and is divided into seven periods, from the Late Middle Ages to the 21st century. DIALAYMED can be used both for studying changes within one of the languages of the corpus and for cross-cultural research.
3. Word meaning in dictionaries, corpora and the speaker's mind (Christiane Fellbaum with Lauren Delfs, Susanne Wolff and Martha Palmer), pp. 31-38. The authors point out the importance of merging dictionaries and text corpora. Semantic tagging based on dictionary definitions is one possible solution to the problem. It is clear that manual semantic annotation of a large text corpus is an enormously difficult and expensive task, and automated semantic tagging is much needed. Nevertheless, it is important to study first the results produced by human tagging. It has been found that there is a rather high rate of disagreement between different human annotators. Most probably, the reason lies in traditional dictionary models of meaning representation. The authors claim it is not likely that human or machine annotators can successfully perform one-to-one mapping of word senses in polysemous words. They suggest it would be more reasonable to aim at selecting clusters of senses or broader senses for the contexts where the meaning is unclear.
4. Extracting meaning from text (Gregory Grefenstette), pp. 38-47. There are two approaches to automated extraction of meaning from text: the first aims at imitating human understanding, while the second is based on statistical methods that make no reference to knowledge representation. This paper considers the second approach. Automated analysis of word lists and text structure can provide the user with answers to various important questions: What kind of text is this? What other texts are like this? What is the text about? How good is the text? Annotated corpora can supply linguists with data on morphology and syntax.
5. Translators at work: a case study of electronic tools used by translators in industry (Riitta Jääskeläinen and Anna Mauranen), pp. 48-53. This paper reviews the use of software tools by Finnish translators. The study was part of the international project SPIRIT (Supporting Peripheral Industries with Realistic Applications of Internet-based Technology). It was found that most translators use basic tools like electronic dictionaries and the Internet. Terminology management software, translation memories, and corpus tools were virtually unknown to them. An experiment with a group of in-house translators showed that people of this profession are rather conservative and prefer familiar software. Jääskeläinen and Mauranen suggest that there should be more cooperation between the developers of software for translators and the end-users.
6. Extracting meteorological contexts from the newspaper corpus of Slovenian (Primož Jakopin), pp. 54-61. Jakopin presents methods of identifying weather forecasts in a corpus of newspaper texts and extracting significant data from such contexts. It was relatively easy to extract the texts of weather forecasts from the corpus automatically because of their fixed length and standard headings. The quantitative study of the meteorological texts shows that rather few lexemes appear mostly in weather forecasts and not in the other texts of the corpus. However, word bigrams prove to be much more interesting. Eight two-word terms were extracted which occur in meteorological texts with a probability of over 99 percent. Finally, there are clichéd sentences in weather forecasts that have rather high frequency and occur mostly in meteorological texts.
7. The Hungarian possibility suffix "–hat/–het" as a dictionary entry (Ferenc Kiefer), pp. 62-69. This paper discusses the problems of lexicographical description of modal words, taking the Hungarian possibility suffix as an example. Kiefer demonstrates that the entry from a traditional dictionary is inadequate both theoretically and descriptively. A small text corpus gives much more information about the use of the possibility suffix and the kinds of possibility it expresses. Still, the author argues that a good entry cannot be based exclusively on corpus material; a good theory is needed for interpreting usage examples. In the case of the Hungarian suffix, a clear distinction between different kinds of modality (epistemic, deontic, circumstantial, boulomaic, and dispositional) and between semantic and pragmatic function would help to develop a more consistent lexicographical entry.
8. Dictionaries, corpora and word-formation (Simon Krek, Vojko Gorjanc and Marko Stabej), pp. 70-82. Like the previous one, this paper is concerned with the issues of lexicographical description. The authors study the principles of presenting English adverbs derived from adjectives and ending in -ly. The data from the dictionaries were compared to those from the BNC and the Google search engine. Many -ly adverbs are registered as run-on entries, although derivatives do not always take on all the meanings of the primitives. Sometimes high-frequency derivatives are promoted to headwords. Compilers of bilingual dictionaries often have to seek different solutions because of the necessity to provide translation equivalents, which often reveals differences in meaning between the two words. The authors suggest that corpora and the Internet can give lexicographers additional data, which would help in evaluating the importance of lexical items and in deciding on their status in the dictionary.
9. Hidden culture: using the British National Corpus with language learners to investigate collocational behaviour, wordplay and culture-specific references (Dominic Stewart), pp. 83-95. Stewart shows ways of using corpora in language learning as a source of culture-specific information. Some kinds of information are fairly difficult to obtain from traditional dictionaries and encyclopedias. It is particularly difficult to find the clue to a wordplay in which some elements of an idiom are replaced by other words (e.g. special queue <= special brew). Stewart suggests looking up recurring collocates in the corpora. The method seems to work well even when the retained elements are high-frequency words (as in the example above). The use of corpora even makes it possible to look up idioms by structural patterns.
10. Language as an economic factor: the importance of terminology (Wolfgang Teubert), pp. 96-106. Teubert focuses on the importance of terminology in the modern world. Standardization of terminology is very important for the development of technologies, since many projects are carried out by international teams. Although English is used more and more as a lingua franca, the development of national scientific discourse and national terminologies remains a part of technological progress. Therefore, there remains a great need for updating multilingual terminological banks and collecting multilingual text corpora. Special attention should be paid to 'soft terminology', i.e. new terms which have already become part of the discourse but are not yet standardized. The development of knowledge extraction technologies would help to 'filter out' such terms and create lists of them.
11. Lemmatization and collocational analysis of Lithuanian nouns (Andrius Utka), pp. 107-114. This paper discusses issues of lemmatization. Lemmatization is on the one hand a very useful procedure for bringing together all word forms of a lexeme; on the other hand it is sometimes criticized because important information on the individual constituents of the lemma becomes unavailable. The Lithuanian language is heavily inflected, which makes the use of a lemmatized text corpus much more convenient. Nevertheless, the researcher should not forget about the different forms of the word and their usage. A case study of the word "teisybė" ('truth') demonstrates that different forms have different frequencies and different collocations. Thus, the analysis of the lemma gives only a generalized profile, while studying each separate form would give more precise information on the usage of the word.
12. Challenging the native-speaker norm: a corpus-driven analysis of scientific usage (Geoffrey Williams), pp. 115-130. Williams centers on the problem of non-native-speaker English. More and more researchers whose native language is not English submit their papers in English, while the proportion of native speakers of English is declining. The situation in technical writing is probably the most difficult. A case study on the use of the relative pronouns "which/that" in a corpus of plant biology research articles has shown that in many cases an avoidance strategy is chosen, e.g. the writers tend to use simple constructions that avoid relative clauses. Williams emphasizes the importance of compiling specialized corpora as well as learner corpora, which would help to improve the level of technical writing.
PART II. MULTILINGUAL CORPORA
13. Chinese-English translation database: extracting units of translation from parallel texts (Chang Baobao, Pernilla Danielsson and Wolfgang Teubert), pp. 131-142. This paper examines methods of extracting translation equivalents from parallel texts. The research is carried out on a Chinese-English parallel corpus of about 17 million running words per language. Translation correspondences detected by software should be unambiguous. That is why the authors suggest that the best solution is to seek correspondences between multiword units, and the texts of the corpus are therefore chunked into multiword units. Both the chunking and the search for equivalents are done using statistical techniques. Four different statistical scores were tested (MI, Dice, log-likelihood, chi-2), and LL and chi-2 were found to achieve better accuracy than the other two coefficients. The precision and recall of the software were improved by checking syntax patterns.
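For readers unfamiliar with the four scores, the sketch below computes them from a 2x2 contingency table using their standard textbook definitions. It assumes that co-occurrence is counted over aligned segments; the exact variants and counting scheme used by the authors are not given in this review, and the function name and the cell labels a, b, c, d are chosen here for illustration.

```python
import math


def association_scores(a, b, c, d):
    """Standard association scores for a candidate pair (s, t).

    a: aligned segments containing both s and t
    b: segments containing s but not t
    c: segments containing t but not s
    d: segments containing neither
    Assumes all four marginal totals are non-zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d

    # Pointwise mutual information: log ratio of observed to expected co-occurrence.
    mi = math.log2(a * n / (row1 * col1)) if a else float("-inf")
    # Dice coefficient: overlap relative to the two marginal frequencies.
    dice = 2 * a / (row1 + col1)
    # Pearson chi-square for a 2x2 table.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Log-likelihood ratio (G2): 2 * sum of O * ln(O / E) over non-empty cells.
    observed = (a, b, c, d)
    expected = (row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n)
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    return {"mi": mi, "dice": dice, "chi2": chi2, "ll": ll}


# A pair that always co-occurs versus a pair that co-occurs only by chance.
perfect = association_scores(10, 0, 0, 90)
chance = association_scores(1, 9, 9, 81)
print(perfect)
print(chance)
```

MI and Dice look only at the co-occurrence cell and the two marginal totals, whereas chi-2 and LL use all four cells of the table.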
14. Abstract noun collocations: their nature in a parallel English-Czech corpus (František Čermák), pp. 143-153. Čermák shows in his article that abstract and concrete nouns function differently. A contrastive analysis of abstract nouns in Orwell's "1984" and its Czech translation has been performed. It was found that there is no direct correspondence between items of the source and target texts. Verbs can be translated as nouns, nouns as adjectives, etc. The study of verbal collocational patterns of ACTION, EMOTION and LANGUAGE abstract nouns demonstrates the following tendencies: 1) inchoative verbs were the most typical collocations for all three groups of abstract nouns, 2) terminative verbal collocations were the least typical, and the LANGUAGE nouns seem to avoid the terminative phase, 3) the study revealed a certain asymmetry between English and Czech noun collocations.
15. Parallel corpora and translation studies: old questions, new perspectives? Reporting "that" in Gepcolt: a case study (Dorothy Kenny), pp. 154-165. Comparable corpora are used extensively nowadays in translation studies; the main issue of current research projects is the language of translations in contrast to authentic language (see e.g. Baker 1993, Laviosa 1998, cf. Mauranen and Jantunen 2005). Kenny shows in her paper that it is difficult to explain findings from comparable corpora using only the texts of translations without comparing them to the original texts. That is why parallel corpora should be used together with comparable ones. The cross-language comparison of the use of the optional German connective "dass" and the optional English connective "that" in a German-English parallel corpus of literary texts (Gepcolt) demonstrates that it is difficult to claim a direct influence of the source text on the language of translation.
16. Structural derivation and meaning extraction: a comparative study of French/Serbo-Croatian parallel texts (Cvetana Krstev and Duško Vitas), pp. 166-178. This article shows the importance of structural derivation in the Serbo-Croatian language and the necessity of taking it into account in linguistic software applications. The use of traditional lemmatization (only different forms of the same lexeme) narrows the results of a search. The authors suggest expanding the inflective classes of nouns so that various kinds of structural derivation (diminutives, augmentatives, feminine forms, possessive adjectives, etc.) are also included in augmented entries. This improves search results in parallel corpora and makes the search for translation equivalents more accurate.
17. Noun collocations from a multilingual perspective (Rūta Marcinkevičienė), pp. 179-187. The topic of the paper is close to that of Čermák's paper in this volume. Marcinkevičienė studies a parallel concordance of the English noun "memory" in Orwell's "1984" and six translations of the novel. Special attention is paid to the verbal collocations of "memory" and its equivalents in other languages. The research demonstrates that translators in most cases preserve the collocational patterns of the target language rather than try to keep the collocations of the source language in translation.
18. Studies of English-Latvian legal texts for Machine Translation (Inguna Skadiņa), pp. 188-195. The paper deals with the study of ambiguous words in parallel corpora. The aim of the research is to find methods of improving the quality of machine translation. A study of parallel contexts for several Latvian words provided new translation equivalents not registered in the dictionaries; some of these equivalents turned out to be quite frequently used. The author suggests that parallel corpora of specialized texts are a very valuable source of data for terminology databases and machine translation systems. The corpus-based approach is also one of the ways to improve the quality of printed dictionaries.
19. The applicability of lemmatization in translation equivalents detection (Marko Tadić, Sanja Fulgosi and Krešimir Šojat), pp. 196-207. This paper outlines the process of automated extraction of translation equivalents from a Croatian-English parallel corpus. The current version of the software is based exclusively on statistical methods (pointwise mutual information); the use of linguistic filters is planned for later stages. The algorithm extracts one-to-one equivalent pairs; generation of other kinds of equivalent pairs (1-2, 2-1, 2-2, ...) is also possible. Still, the problem of the very large number of combinations (combinatorial explosion) remains to be solved. The algorithm was tested on both non-lemmatized and lemmatized material. The hypothesis that the search for translation equivalents is more effective on lemmatized texts for inflected languages like Croatian was confirmed.
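The effect of lemmatization on such an extraction can be illustrated with a small sketch. It scores one-to-one candidate pairs by pointwise mutual information over aligned sentence pairs; the function name, the four toy sentence pairs and the hand-written lemma table are all invented for the example and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def pmi_pairs(aligned, lemmas=None):
    """Score 1-1 candidate pairs by PMI over aligned sentence pairs.

    aligned: list of (source_tokens, target_tokens)
    lemmas:  optional dict mapping word forms to lemmas (a hypothetical
             stand-in for a real lemmatizer); unknown forms are kept as-is
    """
    lemmas = lemmas or {}
    n = len(aligned)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in aligned:
        s = {lemmas.get(w, w) for w in src}
        t = {lemmas.get(w, w) for w in tgt}
        src_count.update(s)
        tgt_count.update(t)
        # Every source item is paired with every target item in the segment.
        pair_count.update(product(s, t))
    return {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }


# Toy Croatian-English data (invented for the example).
aligned = [
    ("vidim kuću".split(), "i see the house".split()),
    ("kuća je velika".split(), "the house is big".split()),
    ("vidim psa".split(), "i see the dog".split()),
    ("pas je velik".split(), "the dog is big".split()),
]
toy_lemmas = {"kuću": "kuća", "psa": "pas", "velika": "velik"}

raw = pmi_pairs(aligned)
lemmatized = pmi_pairs(aligned, toy_lemmas)
print(len(raw), len(lemmatized))
print(lemmatized[("kuća", "house")])
```

On raw text the evidence for "house" is split between the forms "kuća" and "kuću"; after lemmatization it is pooled under one entry and the number of candidate pairs shrinks. Extending the same pairing step to 1-2 or 2-2 pairs multiplies the candidates further, which is the combinatorial explosion mentioned above.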
20. Cognates: free rides, false friends or stylistic devices? A corpus-based comparative study (Špela Vintar and Silvia Hansen-Schirra), pp. 208-221. Vintar and Hansen-Schirra study cognate words (like EN "sport" vs. GE "Sport") in English-German and English-Slovene parallel corpora. The research demonstrated that the percentage of cognates in Slovene translations from English is quite close to that in translations from English into German. However, the comparison with texts originally written in German and Slovene shows that the percentage of cognates in Slovene translated texts is slightly lower than in original Slovene texts, while in German translations there are twice as many cognates as in original German texts. The phenomenon is most likely caused by purist tendencies in Slovene, a language of only two million speakers, and by the openness of the German language to linguistic influences. The comparison of the frequencies of cognates and 'native' synonyms in Slovene and German reference corpora confirms the hypothesis.
21. Trilingual corpus and its use for the teaching of reading comprehension in French (Xu Xunfeng and Régis Kawecki), pp. 222-228. The paper examines the possibilities of using parallel corpora in language teaching. An online English-French-Chinese parallel corpus was used in the teaching of reading comprehension. An experiment with three groups of students in Hong Kong showed that the reading comprehension skills of the test group improved significantly after six weeks of reading trilingual texts on the Web. There are plans to develop this learning tool further: a comprehension test and an online concordancer will be added.
The issues discussed in this volume have received a great deal of attention in the research of the past decade. A strong side of the book is that it includes work by different scholars working with different languages: the publications in this book deal with twelve languages. The papers are fairly short and most of them present the results of case studies. However, the articles are interesting to read and the methods introduced are applicable to different linguistic phenomena. I read with special interest the papers by Christiane Fellbaum et al. (3), Dominic Stewart (9), Chang Baobao et al. (13), Dorothy Kenny (15), Cvetana Krstev and Duško Vitas (16), Rūta Marcinkevičienė (17), and Špela Vintar and Silvia Hansen-Schirra (20).
1. I understand that it is extremely difficult to put together these very diverse contributions under the same title. However, the title of the volume is rather misleading. One would expect this to be a collection of papers on automated text processing, semantic tagging, disambiguation, translation memories, etc. Unfortunately, only half of the articles can be considered to address "The Extraction of Semantic Information from Monolingual and Multilingual Corpora". Many articles are rather loosely related to the subject (e.g. chapters 2, 5, 7, 8). The title "Meaningful Texts" without the subtitle would have been ambiguous enough to cover all the publications in the volume.
2. The division of the volume into two parts does not seem to work well. The editors themselves admit that the same questions are discussed in both the 'monolingual' and the 'multilingual' parts, e.g. lemmatization and noun collocations (p. 1). Furthermore, the DIALAYMED corpus presented in the paper by Eckkrammer is a multilingual corpus, and I don't quite understand why the article is placed in the first part of the book. The paper by Wolfgang Teubert is not devoted exclusively to monolingual corpora either.
3. The order in which the papers are arranged gives the impression of being random. Both Čermák (14) and Marcinkevičienė (17) study noun collocations. Chang Baobao et al. (13), Skadiņa (18), and Tadić et al. (19) discuss translation equivalents in parallel corpora. Why did the editors not place the papers dealing with closely related problems one after another? After studying the table of contents once again I realized that the order is alphabetical. In that case, the best solution would be to arrange ALL the papers of the book in alphabetical order without dividing it into two parts.
4. Some papers in the volume reference each other (Marcinkevičienė => Čermák), but cross-referencing does not seem to have been carried out consistently.
COMMENTARY ON SPECIFIC PAPERS
Gaël Dias, Sara Madeira, and José Gabriel Pereira Lopes: It is not quite clear how effective the method is and what percentage of noise the software produces.
Andrius Utka: The facts about the different frequencies and different collocations of different forms of the same word are very interesting and important. However, I do not understand why one should give up lemmatization just because of that. To my mind, the researcher should combine the study of the lexeme with the study of its different forms. Besides, lemmatization and tagging would help to filter out homonymous forms. Thus, studying raw text would be a step back; it is better to improve the mark-up of the corpora.
Chang Baobao, Pernilla Danielsson and Wolfgang Teubert: "Translation Equivalent Pair (TEP): a Translation Equivalent Pair is composed of both a source-language Translation Unit and a target-language Translation Unit, which are mutual translations" (p. 133). I am not quite sure that bidirectional equivalence is common even in terminology. If the TEPs are extracted from a Chinese-English corpus, they will be Chinese-English TEPs, and not English-Chinese ones as well. To obtain English-Chinese TEPs one would need an English-Chinese parallel corpus. I am pretty sure the lists of TEPs obtained from English-Chinese and Chinese-English corpora would be different. In this respect, the idea of the existence of '_mutual_ translations' is rather misleading and simplistic.
Cvetana Krstev and Duško Vitas: I completely agree with the authors that including derivatives in the dictionary entry is very important for automated text analysis of Slavonic languages. However, only noun derivatives are discussed; it would be interesting at least to mention verbal and adjectival derivation as well (e.g. verbal aspect pairs in Russian present a very serious problem for word alignment). Besides, although the problem of word alignment for "baron" is solved very elegantly in the paper, it would be interesting to discuss the possibilities of word-suffix alignment as well. Sometimes diminutives and other derivational suffixes have explicit correspondences in translation, and sometimes they have to be ignored, e.g. the Russian diminutive noun "berezka" 'little birch' can be translated into English as "birch", "pretty birch" or "little birch".
Marko Tadić, Sanja Fulgosi and Krešimir Šojat: The method of extracting translation equivalents introduced in the paper seems to generate at the first stage many 'impossible' pairs like 'article-verb' or 'conjunction-noun', which are of course filtered out at later stages but still slow down the process considerably. Generating 'reasonable' translation pairs from the very beginning by employing linguistic filters would help to avoid combinatorial explosion. The use of stopwords would be an easy and robust solution.
Špela Vintar and Silvia Hansen-Schirra: The principles of automated search for cognates formulated in the paper look rather simplified; differences in orthographic traditions should be taken into account (see e.g. Tiedemann 2003: 50–51). "According to Baker (1996), translations should be longer than originally produced texts in the target language or in the source language. The evidence for this tendency may, for example, be found in the text length (number of words of the individual texts)" (p. 212). I think the idea was originally formulated by Nida and Taber (Nida & Taber 1974: 163). Still, it is not quite clear how one can compare the lengths of source and target texts that are written in different languages. E.g. there are fewer words in translations from English into Russian (because there are no articles in Russian); translations from English and Russian into Finnish (no articles and few prepositions plus composite words in Finnish) also tend to be 'shorter'. Character counts can also be misleading, because word lengths differ from language to language. Thus, although the heuristic seems to be quite reasonable, it is not possible to prove the explicitation tendency simply by comparing the word or character counts of source and target texts (Mikhailov 2003: 165–174).
Xu Xunfeng and Régis Kawecki: It would be interesting to know what kind of teaching methods were used. Or was it just reading parallel texts?
Finally, I noticed some misprints in the volume, e.g. Russian examples in the paper by Marcinkevičienė: "переметивать" (should be "перемешивать", p. 180), "...несколко секунд после провуждения..." (should be "...несколько секунд после пробуждения...", p. 183), "...после ровуждения..." (should be "...после пробуждения...", p. 183), "...какуюто древную..." (should be "...какую-то древнюю...", p. 185).
To sum up, the book can be recommended for those who are interested in corpus-based linguistics and corpus-based translation studies, especially if their research is concerned with Slavonic or Baltic languages.
Baker, Mona (1993) Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target 7(2): 223–43.
Laviosa, Sara (1998) The English Comparable Corpus: A Resource and a Methodology, in Bowker, Lynne, Cronin, Michael et al (eds.) Unity in Diversity? Current Trends in Translation Studies. Manchester: St. Jerome.
Mikhailov, Mikhail (2003) Parallel'nye korpusa xudožestvennyx tekstov: principy sostavlenija i vozmožnosti primenenija v lingvisticheskix i perevodovedcheskix issledovanijax (Parallel corpora of literary texts: principles of compilation and use in linguistics and translation studies, in Russian) Acta Universitatis Tamperensis, 956. Acta Electronica Universitatis Tamperensis, 280. University of Tampere 2003.
Mauranen, Anna & Jarmo Jantunen, eds. (2005) Käännössuomeksi. Tutkimuksia suomennosten kielestä. Tampere University Press.
Nida, E. A. & Taber, C. R. (1974) The Theory and Practice of Translation. Leiden: E.J. Brill.
Tiedemann, Jörg (2003) Recycling translations: Extraction of lexical data from parallel corpora and their application in natural language processing. Uppsala: Acta Universitatis Upsaliensis. http://publications.uu.se/theses/abstract.xsql?dbid=3791.
ABOUT THE REVIEWER:
Mikhail Mikhailov is a senior lecturer at the School of Modern Languages and Translation Studies, University of Tampere, Finland. His main research interests lie in parallel corpora and corpus-based translation studies. He is currently working on methods of studying Russian-Finnish parallel texts.