Date: Fri, 31 May 2002 16:02:30 -0400 (EDT) From: Pablo Ariel Duboue Subject: Computational Linguistics in the Netherlands 2000
Daelemans, Walter, Khalil Sima'an, Jorn Veenstra, and Jakub Zavrel, eds. (2001) Computational Linguistics in the Netherlands 2000. Rodopi, 204pp, paperback ISBN 90-420-1247-1, US$ 45.00, EUR 48,00.
Pablo A. Duboue, Computer Science Department, Columbia University
SYNOPSIS This book contains a selection of the papers presented at the eleventh annual conference on Computational Linguistics in the Netherlands (Tilburg, 2000). Although its title seems to suggest an audience limited to the Netherlands and Flanders, this is far from true: the book is targeted at a wide audience. As noted in its introduction, 50% of the contributions are not from the Benelux area. However, while the book does not concentrate exclusively on Dutch computational linguistics (CL) issues, people interested in these issues will find valuable articles in it. In general, it seems to me this book is a nice conclusion to the process started in The Balancing Act (Klavans and Resnik, 1996): looking for some stability between knowledge-based and statistical approaches. In the papers presented in this book, you can see both statistical systems trying to incorporate more knowledge into their structure (e.g., Carson-Berndsen, Joue and Walsh) and symbolic systems trying to uncover areas for learning in order to improve robustness (e.g., Poibeau and Kosseim). The book's topics cover a considerable spectrum, including parsing, generation, speech processing and information retrieval.
DETAILED ANALYSIS In his Invited Talk "Very Large Lexicons" (1-15), Gregory Grefenstette brings added value to the book. Invited talks are normally for the enjoyment of the attendees of a conference; by providing a transcription of his talk, Dr. Grefenstette significantly enriches the collection. The talk points out the new challenges of Internet-aware CL research and focuses on the prospects for building a full-lexicon language model out of the Internet. Two years after publication, his figures on hard disk space seem very conservative, making a full-lexicon language model an even more feasible task.
The first of the regular papers, "Phonotactic Speech Ranking for Speech Recognition" (16-29), by Julie Carson-Berndsen, Gina Joue and Michael Walsh, deals with the question of how to add more knowledge to a speech recognition system. One of the proposed solutions involves the use of constraints on the permissible combinations of sounds from a language (phonotactic constraints). The authors draw from their previous work to present a technique which extrapolates constraints for enhanced robustness, together with additional techniques which acquire constraints automatically from corpora.
The next article is "Through a glass darkly: Part-of-speech distribution in original and translated text" (30-44), by Lars Borin and Klas Prutz. This article will be of real interest to linguists studying language acquisition and bilingualism. The authors work with a very interesting resource (a magazine for foreigners in Sweden, available in eight different languages) and look for particular patterns of POS tags that, while not appearing in a balanced corpus of general English, do appear in the translated counterparts. They then formulate a hypothesis about how the source language affects the choice among possible translations. As pointed out in their conclusions, this is just one of the possible experiments that can be undertaken to study the same issue.
In the first article on Dutch linguistics, "Alpino: Wide-coverage Computational Analysis of Dutch" (45-59), Gosse Bouma, Gertjan van Noord, and Robert Malouf present a clearly written contribution, understandable for people with no prior knowledge of Dutch. It is a broad system description of Alpino, an analytical tool for Dutch, including its hand-built, head-driven lexicalized grammar (over 100 rules, with inheritance) and its part-of-speech module. Aside from its obvious impact on Dutch CL, this article can be of interest to other researchers working on wide-coverage systems for new languages.
The following article is a very unusual paper for a CL conference, "Revolution in Computational Linguistics: Towards a Genuinely Applied Science" (60-72), written by Pius ten Hacken. I believe the article is really important for the average CL person, and for students in particular. It points out how CL has moved in the last 30 years from being merely linguistics with different methods to a completely applied field. In the words of the paper, the shift has been from: Problem: Understanding human language processing; Knowledge: Contemporary linguistics theories; Solution: A running program in a computer; to: Problem: A practical problem occurring in real life; Knowledge: Whatever turns out to be helpful in a solution; Solution: A system or program in practical use. I consider this paper and Grefenstette's paper to be the ones that define the style of the book as a whole. A last note of caution: for the reader used to CL articles, the eleven pages of margin-to-margin running text with no figures can make for hard reading.
In "Syntactic Annotation for the Spoken Dutch Corpus Project (CGN)" (73-87), the first of the two articles dealing with the Spoken Dutch Corpus Project, Heleen Hoekstra, Michael Moortgat, Ineke Schuurman, and Ton van der Wouden describe how to annotate Dutch continuous speech. The overall task is to annotate one thousand hours of speech (circa 10M words) using a theory-neutral formalism. Aside from the problems involved in achieving theory neutrality, the peculiarities of Dutch (crossing dependencies, etc.) make this annotation a complex endeavor.
Andre Kempe's "Part-of-Speech Tagging with Two Sequential Transducers" (88-96) presents an interesting idea: use two sequential transducers (i.e., finite state technology) to correct the errors of a baseline (most frequent tag-per-word) part-of-speech tagger. The transducers are applied in opposite directions (the first one left-to-right and the second one right-to-left). It is interesting to note that while this technique does not improve on existing part-of-speech taggers, it is very efficient, which makes it useful for domains such as information retrieval that may need to trade accuracy for speed.
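The two-pass idea described above can be illustrated with a small sketch. It is not Kempe's method: the paper uses sequential finite-state transducers, whereas the code below only assumes a most-frequent-tag baseline followed by one left-to-right and one right-to-left correction pass. The function names, the tag set and both correction rules are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Build a lexicon mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def baseline_tag(words, lexicon, default="NOUN"):
    """Assign every word its most frequent tag (unknown words get `default`)."""
    return [lexicon.get(word.lower(), default) for word in words]


def left_to_right_pass(tags):
    """Hypothetical rule: a VERB right after a DET is re-tagged as NOUN."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i - 1] == "DET" and out[i] == "VERB":
            out[i] = "NOUN"
    return out


def right_to_left_pass(tags):
    """Hypothetical rule: a NOUN right before a DET is re-tagged as VERB."""
    out = list(tags)
    for i in range(len(out) - 2, -1, -1):
        if out[i + 1] == "DET" and out[i] == "NOUN":
            out[i] = "VERB"
    return out


def tag(words, lexicon):
    """Baseline tagging followed by the two directional correction passes."""
    return right_to_left_pass(left_to_right_pass(baseline_tag(words, lexicon)))


# Toy training data (invented): "man" is a NOUN twice and a VERB once.
train = [
    [("the", "DET"), ("man", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("man", "NOUN"), ("sleeps", "VERB")],
    [("they", "PRON"), ("man", "VERB"), ("the", "DET"), ("boat", "NOUN")],
]
lexicon = train_baseline(train)
print(tag(["they", "man", "the", "boat"], lexicon))
```

Each pass is a single linear scan over the sentence, which is the source of the speed advantage the review mentions; real correction rules would be learned or compiled rather than hand-written as here.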
Regarding information retrieval, "Different approaches to Cross Language Information Retrieval" (97-110) by Wessel Kraaij and Renee Pohlmann presents an overview of cross-language information retrieval as seen, for instance, in the TREC-6 evaluation conference. It seems to me that the most outstanding contribution of their technique is to mix different approaches regarding whether to translate the documents or translate the query. They achieve this by incorporating translation information into their document ranking model. They provide a thorough evaluation covering all translations, the most probable translation and word-sense-disambiguated translation. Their results seemed a little counter-intuitive to me, but their analysis is indeed thorough.
In "A New-Old Class of Linguistically Motivated Regulated Grammars" (111-125), S. Marcus, C. Martin-Vide, V. Mitrana and Gh. Paun present a heavily theoretical paper, following some ideas presented by I. Bellert in 1965 that seem to have been overlooked. The authors are interested in the central problem of the generative power of families of grammars: which grammar formalism can be used to deal with natural language beyond the context-free grammars but below the context-sensitive ones. Their proposed formalism, "Path Controlled Grammars", works with two context-free grammars over different alphabets. The first one is an ordinary context-free grammar over the alphabet of the target language. The second one is defined over the possible set of intermediate rewritings of the first grammar. A derivation of the whole system can only use strings validated by the second grammar. This family of grammars is mildly context-sensitive. The authors also prove a pumping lemma. It would be interesting to see further development of parsers and grammars using this formalism.
In the second article dealing with the CGN project, "CGN to Grail: Extracting a Type-logical Lexicon from the CGN Annotation" (126-143), Michael Moortgat and Richard Moot describe how to convert the CGN annotation to one particular formalism. The formalism itself (proof nets for the Grail theorem prover/parser) is quite complicated. The two articles, (Hoekstra, Moortgat, Schuurman and van der Wouden) and this one, are best read together. However, this one requires a good deal of background knowledge of the formalism to be understood, as well as knowledge of Dutch linguistics. It is interesting to see how the annotation affects the transformation process. In any case, this paper is the most Dutch-dependent in the collection.
Thierry Poibeau and Leila Kosseim present in "Proper Name Extraction from Non-Journalistic Texts" (146-157) a series of experiments on named entity recognition in unusual domains. I found this article an important contribution since, in my personal experience, general tools work poorly on domains different from the ones for which they were trained. The figures shown in the paper (90% performance on journalistic text dropping to 50% in non-traditional domains) are indicative of the effects a practitioner may find with tools for tasks other than proper name extraction when the tools are trained on general text. The process the authors follow to adapt the tools to new domains allows them to recover most of the lost performance. The article should be mandatory reading for researchers working on specific domains and sub-languages.
The only generation article in the proceedings, "Generating Referring Expressions in a Multimodal Context: An empirically oriented approach" (158-176), by Ielka van der Sluis and Emiel Krahmer, targets the classic problem of generating referring expressions, but now in a multimodal context. The authors extend Dale & Reiter's classic algorithm with information such as the distance between the object and the focus of attention. Their algorithm, however, is NP-complete. It can regain polynomial time behavior under certain conditions, explained in (van Deemter 2001). Their "empirical approach" relates to the fact that they derive their algorithm from the experiments with human subjects done by Beun and Cremers (1998).
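For readers unfamiliar with the starting point, the following is a minimal sketch of the attribute-selection loop commonly associated with Dale & Reiter's Incremental Algorithm. It is the base algorithm only, not the multimodal extension discussed in the paper, and it leaves out details such as the special treatment of the object type. The function name, the dictionary representation of objects and the example attributes are assumptions made for illustration.

```python
def incremental_algorithm(target, distractors, preferred_attributes):
    """Select attribute-value pairs that distinguish `target` from `distractors`.

    Objects are plain dicts (an assumption of this sketch). Attributes are
    tried in preference order; one is kept only if it rules out at least one
    remaining distractor. Returns None if the target cannot be distinguished.
    """
    description = {}
    remaining = list(distractors)
    if not remaining:
        return description  # nothing to distinguish the target from

    for attribute in preferred_attributes:
        value = target.get(attribute)
        if value is None:
            continue
        # Distractors that do not share this value are ruled out.
        ruled_out = [d for d in remaining if d.get(attribute) != value]
        if ruled_out:
            description[attribute] = value
            remaining = [d for d in remaining if d.get(attribute) == value]
        if not remaining:
            return description
    return None


# Invented example scene: one target and two distractors.
target = {"type": "block", "colour": "red", "size": "small"}
distractors = [
    {"type": "block", "colour": "blue", "size": "small"},
    {"type": "ball", "colour": "red", "size": "small"},
]
preferred = ["type", "colour", "size"]
print(incremental_algorithm(target, distractors, preferred))
```

Because each attribute is considered once and never retracted, this base loop runs in time polynomial in the number of attributes and distractors; the complexity concern raised in the review applies to the extended algorithm, not to this sketch.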
Erik F. Tjong Kim Sang's "Transforming a Chunker to a Parser" (177-188) presents the promising idea of building a parse tree by cascading chunker applications. The techniques and experiments described in the paper seem sound and well grounded, although the results do not compare well to the state of the art in parsing technology. I like to compare this paper to Kempe's approach to part-of-speech tagging. In this case, however, each of the chunkers has to be trained separately and its information must be loaded at runtime, so no claim of an efficiency gain can be made.
The last paper in the book is "Automatic Detection of Problematic Turns in Human-Machine Interactions" (189-200), by Antal van den Bosch, Emiel Krahmer and Marc Swerts. The authors describe a Dutch travel reservation system. In particular, they address the automatic construction of a classifier for errors in dialogs (such as "I want to go to Amsterdam/So you want to go to Rotterdam?"). Their system is quite successful. They use two machine learning techniques, with the rule induction one clearly outperforming the memory-based approach (an interesting result from the machine learning perspective).
OVERALL ANALYSIS The book itself contains a good snapshot of natural language processing and computational linguistics in Europe at the beginning of the decade. In general, the Dutch contributions are more homogeneous than their non-Dutch counterparts. All in all, the book makes for interesting reading, covering a variety of topics.
REFERENCES Klavans J L, Resnik P, (1996) The Balancing Act, Combining Symbolic and Statistical Approaches to Language. Cambridge, MA: MIT Press. (Linguist List review at http://linguistlist.org/issues/8/8-834.html)
ABOUT THE REVIEWER Pablo Ariel Duboue is a senior PhD student working under the supervision of Dr. Kathleen McKeown at the Natural Language Processing group, Columbia University in the City of New York (USA). His research interest falls in the area of Natural Language Generation, mainly on the automatic construction of content planners from aligned corpora. More information about Pablo is available at: http://www.cs.columbia.edu/~pablo