Date: Mon, 17 Feb 2003 15:51:14 -0500 (EST) From: Pablo Ariel Duboue Subject: Computational Linguistics in the Netherlands 2001
Theune, Mariët, Anton Nijholt, and Hendri Hondorp, ed. (2002) Computational Linguistics in the Netherlands 2001: Selected Papers from the Twelfth CLIN Meeting. Rodopi, viii+207pp, hardback, ISBN 90-420-0943-8, US$50, EUR50, Language and Computers series 45.
Book Announcement on Linguist: http://linguistlist.org/issues/13/13-2106.html
Pablo A. Duboue, Computer Science Department, Columbia University, USA
SYNOPSIS Continuing with the tradition of a second round of submission and reviewing after the CLIN meeting, this year's "Computational Linguistics in the Netherlands" offers mostly Dutch-related content. The book covers a wide variety of topics in 14 papers and the extended abstract of the invited talk. The second round of revisions ensures a high level of quality, and the authors profit from the discussions at the meeting before sending their extended versions.
DETAILED ANALYSIS I decided to divide the published papers into four sections to facilitate the discussion. The division is not clear-cut; it should be taken mostly as an expository device. These sections are Theory (including psycholinguistically motivated works), Speech (including dialogs), Corpus (including creation and evaluation) and Tools (for Dutch and multilingual).
Theory I found four papers in this category, two dealing with specific results and two with psycholinguistic concerns. Their results are mostly language independent, with most papers providing English examples. In "Conservative vs Set-driven Learning Functions for the Classes k-valued" (Christophe COSTA FLORÊNCIO), the author answers an open question left unresolved by Kanazawa (1998), by means of a constructive proof. The work focuses on Classical Categorial Grammars (CCGs). In "Reference Resolution in Context" (Jan van EIJCK), pronoun reference resolution is analyzed in terms of incremental semantics. The concepts of the paper are exemplified with an implemented Haskell prototype, available from the author's homepage.
Finally, two contributions deal with psycholinguistically motivated formalisms: "Incremental Generation of Self-corrections Using Underspecification" (Markus GUHE and Frank SCHILDER), from a generation perspective; and "Performance Grammar: a Declarative Definition" (Gerard KEMPEN and Karin HARBUSCH), from an understanding perspective. Guhe and Schilder's work builds on a psycholinguistically plausible generation architecture to generate self-corrections (e.g. "I have two seats... uh no... one seat available"). The authors argue that such self-corrections are required for systems with dynamic input data. Kempen and Harbusch, on the other hand, present an HPSG-motivated grammar formalism, Performance Grammar (PG). PG also captures important psycholinguistic features such as incrementality and late linearization. The authors provide both Dutch and English formalizations.
Speech Dutch morphology makes for a particularly challenging environment in Speech Recognition. The number of out-of-vocabulary (OOV) words can be quite large, as a result of compounding and other word formation rules. Two papers explore solutions to this problem: "Memory-Based Phoneme-to-Grapheme Conversion" (Bart DECADT, Jacques DUCHATEAU, Walter DAELEMANS, and Patrick WAMBACQ) and "Automated Compounding as a Means for Maximizing Lexical Coverage" (Vincent VANDEGHINSTE). Decadt et al. investigate the automatic guessing of Dutch spelling from phoneme transcriptions (phoneme-to-grapheme conversion). Their algorithm performs outstandingly well on clean input. However, the authors acknowledge that further work is required to accommodate the highly noisy phonetic transcriptions coming from the speech recognition system. Vandeghinste explores a different but related problem: optimizing the use of the bounded memory of the speech recognition system. As only 36,000 words can be stored in that memory, the author combines several readily available lexicons for Dutch to extract roots and "quasi"-roots for Dutch words. He later recombines them into more complex words, using a statistically trained module. His results seem to be ready for practical application and his statistical analysis is very thorough.
Dealing with errors in dialogs, "Multi-feature Error Detection in Spoken Dialogue Systems" (Piroska LENDVAI, Antal van den BOSCH, Emiel KRAHMER, and Marc SWERTS) analyses the impact of combining prosodic and non-prosodic features in automatic error detection. The authors try to reproduce results previously reported for English spoken corpora; their results on a Dutch corpus provide mixed evidence regarding the importance of prosodic features.
In the extended abstract of the invited talk, "Ideas on Multi-layer Dialogue Management for Multi-party, Multi-conversation, Multi-modal Communication" (David R. TRAUM), the challenges behind the complex Mission Rehearsal Exercise (MRE) are outlined. The MRE is a military training environment where synthetic agents interact with a human trainee in a Bosnian village setting. The talk stresses the multiple problems involved in the MRE's development. The MRE challenges go well beyond those faced by regular dialogue systems.
Corpus Corpus creation and evaluation in Dutch is a matter of optimizing existing, limited resources and maximizing the impact of the resources' applications. On those grounds, "The Alpino Dependency Treebank" (Leonoor van der BEEK, Gosse BOUMA, Rob MALOUF, and Gertjan van NOORD) describes the on-going construction of a dependency treebank for Dutch, with the objective of theory-neutrality. Also on the Alpino treebank, "Corpus-based Acquisition of Collocational Prepositional Phrases" (Gosse BOUMA and Begoña VILLADA) investigates the problem of collocational prepositional phrases (CPPs), and experiments with techniques for automated acquisition. While their initial analysis of the linguistics of the CPPs is very thorough (and goes beyond computational linguistics, being of interest for linguists in general), the authors express slight disappointment with their acquisition results. It seems a better definition of the CPPs is required.
Working on the PAROLE corpus, in "Tagging the Dutch PAROLE Corpus" (Jesse de DOES et al.) the authors are confronted with little training data and a large tagset (with syntactically motivated, complex tags). They try to cope with such a challenging situation by using a mixture of different part-of-speech taggers. They also adapted POS-taggers trained on larger corpora with a different tagset, by learning tag-transformation rules. While the authors express regret about their overall results, the constraints on their task render it a very challenging one indeed.
"Creating a Dutch Information Retrieval Test Corpus" (Djoerd HIEMSTRA, David van LEEUWEN) explains the internals of the Dutch section employed in CLEF (the Cross-language Evaluation Forum). CLEF is a European, multilingual counterpart of the Text Retrieval Conference (TREC), focusing on information retrieval (IR). The paper discusses the logistics involved in the construction of the Dutch corpus, together with some CLEF results. A very thorough analysis of the impact of judge subjectivity on the overall IR results is worth mentioning.
Tools This very general section captures the three remaining papers. "A Named Entity Recognition System for Dutch" (Fien DE MEULDER, Walter DAELEMANS, Véronique HOSTE) presents an interesting approach for the rapid development of language technology tools: a small sample of expected output is hand-tagged and a rule induction machine learning system (RIPPER) is run over it. System developers then analyze the rules and integrate them into a rule-based system. The benefits of this approach are the ability of the human programmer to tell good rules from bad ones, together with the possibility of integrating rules from different runs of the machine learning system. The use of machine learning as an aid for human knowledge acquisition seems to speed up their development process quite a bit, and it is a technique easily applicable to other problems or domains.
The question of whether stemming (reducing a word to a rough version of its root) is useful for text classification is revisited in "Accurate Stemming of Dutch for Text Classification" (Tanja GAUSTAD and Gosse BOUMA). The authors carry out an extrinsic evaluation of two stemmers: a complex, very accurate, dictionary-based stemmer and the Dutch version of the Porter stemmer (straightforward but inaccurate). Their results provide mixed evidence of the utility of stemming and diverge from published English experiments.
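To make the contrast between the two kinds of stemmer concrete, here is a minimal Python sketch. It is not the authors' system: the mini-dictionary, the suffix list and all function names are hypothetical, and the suffix rules are a toy stand-in rather than the actual Dutch Porter rules. It only illustrates how a rough suffix stripper and a dictionary lookup can map the same word forms to different classification features.

```python
from collections import Counter

# Hypothetical mini-dictionary mapping inflected forms to a root (illustration only).
LEMMA_DICT = {"lopen": "loop", "loopt": "loop", "boeken": "boek", "fietsen": "fiets"}

# Hypothetical suffix list; NOT the actual Dutch Porter stemmer rules.
SUFFIXES = ("en", "t", "e", "s")


def suffix_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def dictionary_stem(word):
    """Look the word up in the dictionary; unknown words are left unchanged."""
    return LEMMA_DICT.get(word, word)


def bag_of_stems(text, stemmer):
    """Count stems in a text; such counts could serve as classifier features."""
    return Counter(stemmer(word) for word in text.lower().split())


if __name__ == "__main__":
    sentence = "lopen loopt"
    print(bag_of_stems(sentence, dictionary_stem))
    print(bag_of_stems(sentence, suffix_stem))
```

In this toy setting the dictionary lookup conflates "lopen" and "loopt" into one feature, while the suffix stripper keeps them apart because it ignores the vowel change. Whether such differences matter for classification accuracy is exactly what an extrinsic evaluation measures.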
Finally, "Applying Monte Carlo Techniques to Language Identification" (Arjen POUTSMA) provides an interesting new methodology for language identification. Although the author acknowledges that automatically guessing the language of a given document is generally considered a solved problem, he proposes a novel, more efficient approach. The technique, based on Monte Carlo sampling, requires only a small sample of the text in question. It provides results slightly below the state of the art but with an 850% speed-up.
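The general idea of sampling-based language identification can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm from the paper: the word lists, the function name `identify_language` and the scoring by function-word hits are all hypothetical. The point is only that a decision can be made from a small random sample of the words instead of the full document.

```python
import random

# Hypothetical, tiny function-word lists; a real system would use far richer profiles.
COMMON_WORDS = {
    "dutch": {"de", "het", "een", "en", "van", "is", "dat", "niet", "op", "te"},
    "english": {"the", "a", "an", "and", "of", "is", "that", "not", "on", "to"},
}


def identify_language(text, sample_size=20, seed=0):
    """Guess the language from a random sample of words, not the whole text."""
    words = text.lower().split()
    if not words:
        return None
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    sample = [rng.choice(words) for _ in range(sample_size)]
    # Score each language by how many sampled words appear in its word list.
    scores = {
        lang: sum(word in vocab for word in sample)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_language("het is niet de kat van de buren"))
    print(identify_language("the cat sat on the mat and not on the chair"))
```

The cost of the decision depends on `sample_size` rather than on the document length, which is the kind of trade-off between accuracy and speed the review describes.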
OVERALL ANALYSIS A quick scan over the list of contributors shows that, out of 31 contributors, only three authors (the US invited speaker and two German authors) are located outside the Netherlands and Flanders. Such focus on Dutch processing makes the book of particular interest for researchers working on Dutch or on similar languages with a complex morphology. Nevertheless, computational linguists focusing on languages spoken by small communities can also profit from the experiences reported in the book. It is also worth noting that the new edition is hardcover, compared to last year's paperback. This can motivate purchasing the actual book, even though its contents are also available online.
REFERENCE Kanazawa, M. (1998) Learnable Classes of Categorial Grammars, CSLI Publications, Stanford University.
ABOUT THE REVIEWER Pablo Ariel Duboue is a senior PhD student working under the supervision of Dr. Kathleen McKeown at the Natural Language Processing group, Columbia University in the City of New York (USA). His research interest falls in the area of Natural Language Generation, mainly on the automatic construction of content planners from aligned corpora. More information about Pablo is available at http://www.cs.columbia.edu/~pablo