Date: Wed, 27 Apr 2005 11:11:38 +0200
From: Svetoslav Marinov
Subject: Memory-based Parsing
AUTHOR: Kübler, Sandra
TITLE: Memory-Based Parsing
SERIES: Natural Language Processing 7
PUBLISHER: John Benjamins
Svetoslav Marinov, Gothenburg University/University College
The book consists of 8 chapters plus 2 appendices.
The first chapter (Introduction) sets up the stage for the two central
topics in this monograph -- parsing and machine-learning. The notion
of syntactic parsing is explained and the basic distinction between
data-driven and discrete approaches to parsing is given. Relying on
the former to create a broad coverage parser for German, the author
has chosen machine learning (ML) techniques and specifically the
paradigm of memory-based learning (MBL), which is an example of
supervised methods within ML. Brief introduction to MBL is given and
the place of ML in Natural Language Processing (NLP) is also
Chapter 2 gives the reader a more extensive introduction to MBL.
Section 2.1 covers the basic approach and the crucial idea of lazy
learning - i.e. "storing the presented training instances without
abstraction" (p. 11), contrary to eager learning, as well as the
similarity metrics used to calculate the distance of a new instance to
be classified to the already stored ones. The algorithms extending the
basic model in order to achieve faster retrieval, reduction of storage
requirements and tolerance to noise are given in section 2.2 together
with results of their performance. Feature selection is important in any
k-nearest neighbour approach within ML. It is therefore necessary to
carefully select the relevant ones and have proper metrics for
calculating their weights. This is clearly explained in section 2.3,
where the author discusses two of the most widely used - mutual
information and class projection, relying mainly, but not only, on the
work of (Wettschereck et al, 1997).
The next chapter deals with the application of MBL to syntactic
parsing. It is a careful review of the existing albeit not a very extensive
literature. The difficulty in utilising MBL for parsing stems from the fact
that "[n]atural language parsing [...] is not a task that can easily be
defined [as a classification problem]" (p. 34). While the majority of
approaches described in this part split the task into sub-tasks in order
to better mold parsing into the classification paradigm, Sandra
Kübler's method is different, as it will become clear later on in the
book. Chapter 3, therefore, describes how different authors solve the
partial tasks or adopt the idea of MBL by cascaded classifiers. Section
3.1 summarises two approaches to noun-phrase (NP) chunking. Next,
shallow parsing and grammatical relation finding are discussed.
Memory-based sequence learning, an approach in which the order of
the words is very important, is presented with its results for NP
chunking and subject-verb, and verb-object relation assignment.
Certain emphasis is also given to the work of Sabine Buchholz (e.g.
(Buchholz, 2002)), especially the optimal results and parameters
involved in discovering relations between verb chunks and other
chunks in a sentence. Last, in section 3.3, an approach to full parsing
using MBL is reviewed.
Data-oriented parsing (DOP) is the topic of chapter 4. While not an
embodiment of the k-nearest neighbour model, DOP shares many
similarities with MBL, which Kübler summarises on p. 58. The basic
DOP model, i.e. DOP1, for phrase structure trees is presented in
section 4.1. The author then goes on to describe the two important
stages in DOP - the actual parsing of new input and the
disambiguation procedures. The former being a generation of a
parsed forest, while the latter is the finding of the most probable
derivation (an NP-hard problem, unlike in standard probabilistic
context-free parsing). Two evaluations of DOP1 are discussed in
section 4.2 - one on the Wall Street Journal section of the Penn
Treebank (Marcus et al., 1993) and the other on the Air Travel
Information System corpus. In the next section the non-probabilistic
DOP is mentioned and the comparison to its stochastic counterpart is
shown. Certain enhancements of DOP to tackle unknown words (the
so called DOP3 model) are presented in section 4.4. Further, the
possibility to render DOP to PCFG in order to achieve faster search
for the most probable derivation and alleviate the exponential growth
of grammar, as well as a memory-based approach to DOP are cited in
the remaining two sections.
Chapter 5 is central to the book since it presents the memory-based
parser TueSBL and the treebank it is trained on TueBa-D/S. TueBa-
D/S is a corpus of transliterated spontaneous dialogues developed in
the context of VERBMOBIL (speech-to-speech machine translation
project). Therefore it covers the domain of business appointments,
travel scheduling and hotel reservations and contains approximately
67000 trees. In quite some detail the annotation scheme is presented
with appropriate examples and a comparison to the Penn Treebank is
given in section 5.1. In the following section Kübler gives reasons for
not employing the cascaded approach to parsing in the MBL
paradigm. Complete trees are thus viewed as classification classes, a
decision reminiscent to DOP. To quickly batter away doubts on
the 'trees as classes'-concept, the hybrid parsing architecture of
TueSBL is presented. Its indispensable preprocessing module CASS,
Abney's (1991) chunk parser and its adoption to German, and the
present task, are described in section 5.3. The learning component in
the MBL paradigm requires two things in order to succeed - lots of
data and carefully selected relevant features used for the
classification task. Since syntactically annotated corpora is after all of
modest size, the same goes for TueBa-D/S, TueSBL is given access
to all possible information in the treebank, yet not simultaneously
but "ordered according to their reliability" (p. 160). These design
decisions, the organisation of the instance base, as well as the search
in it are clearly explained in section 5.5. A sample parse and the
weighting scheme are then presented. The latter is also an example of
non-standard approach. The chapter ends with an explanation of the
backing-off strategy of TueSBL, which is triggered in those case
where no tree structures were discovered for input chunks or no
sentences were matched to the input sentence.
Chapter 6 is devoted to the evaluation of TueSBL. The standard
PARSEVAL metrics of bracketed/labelled precision and recall are
described and TueSBL's performance in these terms is given. Since
TueBa-D/S contains functional labels (e.g. HD for head, described in
Chapter 5), "functional"-labelled precision and recall are also
calculated. Additionally, certain modules of the parser are evaluated
separately and the results are given in section 6.4. Analysis of
TueSBL's errors is then presented. Since it is debatable whether
PARCEVAL's measures offer an accurate picture of a parser's output,
a dependency-based evaluation is proposed (cf. (Lin, 1995)). Section
6.6 also discussed the conversion of TueBa-D/S into dependency-
based representation. Section 6.7 not only gives the results of the
dependency-based evaluation, but it also offers a comparison of the
Bringing us back to the wider concept of MBL, Chapter 7 compares
Kübler's parser to other approaches already mentions in Chapters 2-
4. TueSBL resembles DOP, since both methods "rely on trees or tree
fragments that go beyond the local information present in rules." [p.
252]. Yet memory requirements and levels of generalisation are quite
different in the two approaches. TueSBL strives to return a complete
parse while other MBL methods based on cascaded classifiers focus
on chunk and grammatical relation finding (section 7.2). Section 7.3
gives a comparison to a conceptually different full memory-based
Chapter 8 concludes the book and brings up the fact that the
presented approach "is especially tailored towards processing
spontaneous speech." [p. 262]. Appendix A is the Stuttgart-Tuebingen
Tagset. Appendix B gives the syntactic categories and functional
labels of TueBa-D/S.
The book by Sandra Kübler is an important contribution to the area of
syntactic parsing in several respects. First, this is the monograph's
main point - a memory-based robust parser for German spontaneous
speech. A data-driven approach to NLP in its incarnation as an MBL is
used for the design of a parser (TueSBL) whose architecture
deserves to be looked at by anyone interested in parsing spoken input
or using analogy-based methods in Computational Linguistics, or
TueSBL is designed to exploit a relatively small treebank to its
maximum when trained on it. Complete trees are stored in the memory
and also represent the classification classes. This is something that
resembles DOP very much yet one do not get the weak points of pure
DOP - large storage space and intractability of the parsing model.
TueSBL is also different than the standard MBL approaches. It does
not use cascades of MBL classifiers and does not only return
grammatical relations between chunks but delivers a complete parse.
On the other hand it is not a purely k-nearest neighbour approach. As
such certain modifications are necessary. The notion of classification
classes becomes a very relaxed and loose idea. These have internal
structure and allow for modifications, such as omission of words and
chunks from a tree (see pages 127-132). This particular decision is
very well motivated in the book and also stems from the design and
organisation of the treebank (TueBa-D/S). With 3 levels of annotation -
phrasal, topological and functional (section 5.2), compared to the flat
structure of Penn Treebank; with the particularities of spoken input -
omissions, false starts, corrections, etc; with the free word-order of
German (while most MBL approaches use English and the Penn
Treebank), TueBa-D/S would offer an impossible task to the majority
of standard MBL methods using cascaded classifier. The weighting
metrics are also non-standard and this hinges again on the design
decisions of TueSBL.
Another strong point of the monograph is that the work described in it
is clearly placed in the context of other memory-based approaches to
parsing. Chapters 3 and 4 give in enough detail to what previous
authors have done in this field. Chapter 2 is also essential to better
understand TueBSL. The reader is given enough detail to understand
the essence of memory-based learning, yet not too much to be
distracted from the central topic - TueSBL. Recently, another way to
exploit the benefits of MBL for syntactic parsing has been presented,
namely the work of (Nivre and Sholz, 2004) and (Nivre et al., 2004).
There, TiMBL (Tilburg Memory-Based Learner, (Daelemans et al.,
2003)) is used to predict the next step of a deterministic dependency
based parser. (Nivre et al., 2004) present results for Swedish and in
(Nivre and Scholz, 2004) the results for English are given.
Finally, the question remains how scalable Kübler's approach is. While
the author claims that TueSBL can be trained on any treebank that
conforms to the TueBa-D/S format, it is well known that treebanks are
time- and resource-consuming. It would have been interesting to know
whether TueSBL could be trained on other treebanks (not strictly
following the TueBa-D/S format) or even on non-phrase structure
treebanks, i.e. dependency-based representations.
Abney, Steven. 1991. Parsing By Chunks. In: Robert Berwick, Steven
Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer
Academic Publishers, Dordrecht. 1991.
Buchholz, Sabine. 2002. Memory-Based Grammatical Relation
Finding. PhD Thesis. University of Tilburg, The Netherlands.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot and Antal van den
Bosch. 2003. 'TiMBL: Tilburg Memory Based Learner, version 5.0,
Reference Guide.' ILK Research Group Technical Report Series no.
03-10, 56 pages.
Lin, Dekang. 1995. A Dependency-based Method for Evaluating
Broad-Coverage Parsers. In Proceedings of the 14th International
Joint Conference on Artificial Intelligence, IJCAI-95. Montreal, Canada.
Marcus, Mitchell, Beatrice Santorini and Mary Ann Marcinkiewicz.
1993. Building a Large Annotated Corpus of English: The Penn
Treebank. Computational Linguistics 19:2.313-330
Nivre, Joakim, Johan Hall and Jens Nilsson. 2004. Memory-Based
Dependency Parsing. In Ng, H. T. and Riloff, E. (eds.) Proceedings of
the Eighth Conference on Computational Natural Language Learning
(CoNLL). May 6-7, 2004. Boston, Massachusetts. pp. 49-56.
Nivre, Joakim and Mario Scholz. 2004. Deterministic Dependency
Parsing of English Text. In Proceedings of COLING 2004, Geneva,
Switzerland, August 23-27, 2004.
Wettschereck, Dietrich, David W. Aha and Takao Mohri. 1997. A
Review and Empirical Evaluation of Feature Weighting Methods for a
Class of Lazy Learning Algorithms. Artificial Intelligence Review 11(1-