Review of Foundations of Statistical Natural Language Processing


Reviewer: Richard Evans
Book Title: Foundations of Statistical Natural Language Processing
Book Author: Christopher D. Manning, Hinrich Schütze
Publisher: MIT Press
Linguistic Field(s): Computational Linguistics
Book Announcement: 10.1349

Review:

Christopher D. Manning & Hinrich Schuetze (1999) Foundations of
Statistical Natural Language Processing, MIT Press, Massachusetts,
US, 680 pp., hardback, $60

Reviewed by Richard Evans, Research Assistant, Computational
Linguistics Research Group, University of Wolverhampton, UK

SYNOPSIS

The book provides an introduction to the field of Statistical
Natural Language Processing. Aimed at graduate students and
researchers, it should also be seen as a valuable teaching aid for
courses in computational linguistics. Deriving mathematical formulae
from basic principles with reference to specific language processing
tasks prevents the descriptions from becoming too dry. At all points
the material is thoroughly reinforced with the relevant linguistic
examples. The authors succeed in ensuring that the material is
relevant and interesting, one of the most important yet difficult
criteria to meet when teaching statistics.

The book has been written in LaTeX and has the format commonly
associated with such documents. One useful stylistic feature is that
as important terms are introduced to the text, they are printed in
the margin, which makes it easy to scan the text for topics of
interest. These terms are also listed in the index. Each chapter is
concluded by a fairly thorough 'Further Reading' section and a set
of exercises with tasks of varying difficulty. Several of the
chapters are also broken up by small sets of exercises. The book
concludes with a 44 page bibliography and 23 page index.

Part I Preliminaries

Chapter 1 sets out the empiricist standpoint adopted throughout the
volume, providing a critique of rationalist views on linguistics.
The points concerning weaknesses with the 'categorical judgment'
approach to linguistics exemplified by the work of Chomsky are
convincingly made with illustrative examples (section 1.2.1).

Chapter 2 provides a self-contained introduction to the mathematical
foundations of the ensuing material. It is divided into two parts,
one on probability theory and the other on information theory. The
section on probability theory includes the notions of conditional
probability, Bayes' theorem, random variables (functions that map
the possible outcomes of an experiment to numerical values), joint and
conditional distributions, Bayesian updating (where our statistical
expectation estimates are influenced by our prior beliefs about what
those expectations should be) and Bayesian decision theory. The
section on Information Theory covers Entropy (defined both as the
amount of information contained in a random variable and also as a
measure of the average amount of information required to describe an
outcome of that variable), Mutual Information (the amount of
information that one random variable contains about another) and
Noisy Channel models (in which the output from a communication
channel has a probability of differing from the input to that
channel), among others.
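
To make the two information-theoretic definitions concrete, here is a
minimal sketch in Python. It is not taken from the book; the function
names and the dictionary representation of the joint distribution are
assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """Mutual information I(X;Y), in bits.

    `joint` maps (x, y) pairs to their joint probability p(x, y).
    """
    # Marginal distributions are obtained by summing the joint over the other variable.
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )
```

A fair coin carries one bit of entropy, two independent variables share
no information, and two perfectly correlated fair coins share one bit.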

Chapter 3 provides a self-contained introduction to the linguistic
(largely syntactic) theory that will be used in subsequent chapters.
Here we have an introduction to parts of speech, phrase structure
and brief descriptions of morphology, semantics and pragmatics. The
introduction is quite detailed and the authors have not been afraid
to present some quite difficult examples (complex NPs and the like).
Having said this, there is not much coverage of analyses above the
level of the sentence but this is reflective of the field itself. In
combination with chapter 2, a basic statistical and linguistic
toolkit has been formed upon which the ensuing approaches will
depend. Later chapters do introduce further statistical methods, but
it is to chapters 2 and 3 that the reader will return for the
fundamentals.

Chapter 4 introduces the notion of corpus-based work and provides an
overview of the low level formatting issues that must be addressed
when using documents as an information source for further processing
(section 4.2). This chapter usefully provides details about
organisations that can be contacted in order to obtain these crucial
resources (table 4.1). There is also discussion of the SGML encoding
that is important for much current work (section 4.3).

Part II Words

Chapter 5 examines collocations and simple term extraction using
Mutual Information (2.2.3) methods. There is some brief discussion
of proper name recognition (sections 5.5 and 5.6), but a failure to
highlight the particular problems associated with that subject. For
instance the Named Entity Recognition task that has challenged
participants in the MUC conferences is not mentioned, nor any of the
approaches taken to address the problem (Mikheev, Grover & Moens
1998). This chapter also covers the notions of hypothesis testing
and significance (section 5.3).
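
As a rough illustration of scoring candidate collocations with
(pointwise) mutual information, the following Python sketch estimates
probabilities from raw counts over a token list. It is an assumption
about one simple way to do this, not the book's own procedure, and the
function name is hypothetical.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information, in bits, for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # relative frequency of the pair
        p_x = unigrams[w1] / n_uni     # relative frequency of each word
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Scores computed this way tend to favour pairs of rare words, so in
practice a frequency threshold is usually applied first.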

Chapter 6 concerns statistical inference and the application of
probabilistic approaches to language modeling. This is a stochastic
method where our expectation of seeing some word or category in a
text is based only on the information we have about the preceding n
words (section 6.1). The chapter also covers a variety of
statistical estimation methods over those models (section 6.2) and
the process of smoothing (sections 6.2 and 6.3) which lets us apply
statistical methods in the face of sparse data.
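
The idea of conditioning on the preceding words, and of smoothing to
cope with unseen events, can be sketched for the bigram case as
follows. This is an illustrative Python example using add-one
(Laplace) smoothing, one of the simplest estimators; the function
names are hypothetical and the code is not drawn from the book.

```python
from collections import Counter

def train_bigram(tokens):
    """Collect history counts, bigram counts and the vocabulary from a token list."""
    histories = Counter(tokens[:-1])            # words that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return histories, bigrams, vocab

def prob_mle(prev, word, histories, bigrams):
    """Maximum likelihood estimate of P(word | prev); zero for unseen pairs."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

def prob_laplace(prev, word, histories, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev); never zero."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))
```

With the toy corpus "the cat sat on the mat", the unsmoothed estimate
gives the unseen pair ("the", "sat") probability zero, whereas the
smoothed estimate reserves some probability mass for it.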

Chapter 7 applies the prior methods to word sense disambiguation.
Several different algorithms are presented and reviewed. The authors
set out supervised and unsupervised methods for disambiguation, the
supervised ones being based on Bayes decision rule and Mutual
Information techniques. On supervised learning methods that require
manually annotated corpora, the authors note (p.232) "the
production of labeled training data is expensive". However they do
not mention any of the software tools that have been produced that
make the annotation task less time consuming and therefore less
expensive (such as MITRE's Alembic Workbench).

Chapter 8 presents methods for Lexical Acquisition. Here, the goal
is to classify lexical items on the basis of verb subcategorisation,
selectional restrictions, attachment ambiguity and semantic
similarity. Co-occurrence statistics and vector similarity methods
are used to obtain classes of semantically similar words. The
chapter also gives good coverage of evaluation measures (precision,
recall and f-measure).
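
For readers unfamiliar with these evaluation measures, a minimal
Python sketch follows. It assumes that system output and gold standard
are both given as sets of items; the function name and the `beta`
parameter are illustrative choices rather than anything prescribed by
the book.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F-measure) for a set of predictions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

With `beta=1.0` the F-measure is the harmonic mean of precision and
recall.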

Part III Grammar

Chapter 9 presents Markov models, a variation of the language models
presented in section 6.1. Familiarity with them is widely presumed
in current work and it is useful to have them derived here from
scratch for the benefit of the uninitiated reader. The Viterbi
algorithm is presented as a means of finding the most probable
path through a Markov model.
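
The dynamic programming idea behind the Viterbi algorithm can be
sketched in Python as below. This is a generic textbook-style
formulation working in log space, not a transcription of the book's
presentation; the parameter names and the nested-dictionary model
format are assumptions, and the observation sequence is assumed to be
non-empty.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

In a tagging setting the states would be tags and the observations
words, which is how chapter 10 puts the model to use.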

Chapter 10 Part of Speech Tagging sets out 4 different strategies
and concludes with a discussion of performance and applications. The
algorithms include methods based on the Markov model techniques
introduced in chapter 9 and Brill's transformation based learning
method. There is some coverage of issues like base NP chunking, but
discussion of complex NP extraction (section 10.6.2) is omitted.

Chapter 11 Probabilistic Context Free Grammars describes an
application of Hidden Markov Models to determine the probabilities
of strings of words in a language. The authors present the Inside-
Outside algorithm as a method for finding the most likely analysis
for a sentence. There do appear to be a number of typographical
problems with this chapter. Space prevents me from making them
explicit but examination of pages 384, 385 and 391 should reveal
them to the interested reader.

Chapter 12 Probabilistic Parsing shows how annotated corpora
(treebanks) can be used as the basis for finding a syntactic
analysis for new sentences. The distinction between phrase-structure
and dependency grammars is presented (section 12.1.7) and various
statistical methods and search techniques are put forward. The
authors present a sample of the Penn treebank. Here we note that the
analysis consists of many 'flat' infrequent trees that do not
contain X-Bar nodes, only X and XP ones. Many current systems are
based on this treebank and the astute reader will be somewhat
concerned about the quality of the analyses returned by such
systems. There is a good, thorough description of evaluation
difficulties with respect to parsing in 12.1.8. An assumption made
in this chapter is that a parser should first try out the analysis
of a word string that is most commonly observed in a treebank.
However some best-first techniques based on human reading-time
experiments suggest that this is not always the best approach
(Crocker and Pickering 1996 unpublished work).

Part IV Applications and Techniques

Chapter 13 Statistical Alignment and Machine Translation presents
the idea of aligning sentences and paragraphs between documents of
different languages and using this information as the basis for
automatic translation (section 13.1). The method is based on the
noisy channel model (chapter 2). When reviewing the problems with
machine translation techniques, the authors write, "on the surface
these are problems of the model, but they are all related to the
lack of linguistic knowledge in the model." They then give examples
(pp. 489-492) which demonstrate a range of linguistic information that
is not exploited by the statistical models and could serve as the
basis for future work.

Chapter 14 Clustering presents a number of methods and algorithms
that classify items on the basis of some measure of similarity.
Hierarchical (section 14.1) and non-hierarchical (section 14.2)
approaches are covered. By using these techniques, words can be
classified automatically into categories that reflect something like
semantic similarity. Some promising results are shown in table 14.5.

Chapter 15 Topics in Information Retrieval covers automatic term
extraction from documents. One of the approaches uses a vector space
model, following from material in chapter 8, and the measures of
Term Frequency and Term Frequency Inverse Document Frequency which
are derived here (section 15.2.2). The other method for term
identification is based on a term distribution model. The review is
followed by a method for discourse segmentation 'TextTiling' that is
based on information about the distribution of terms in a document
(section 15.5).
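
The vector space idea can be illustrated with a short Python sketch.
The weighting used here (raw term frequency multiplied by the log of
the inverse document frequency) is one common variant and is an
assumption; the book discusses several weighting schemes, and the
function names below are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Turn tokenised documents into sparse TF-IDF vectors (dicts term -> weight)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))       # count each term once per document
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A term that occurs in every document receives weight zero, so it
contributes nothing to the similarity between documents.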

Chapter 16 Text Categorisation introduces a number of statistical
classification methods. The goal here is to automatically identify
the topics or themes of documents. Several methods are used. With
Decision Trees (section 16.1) a given document is described in terms
of feature-value trees where possible values are labeled with
probability scores. The combination of a document's value scores
gives the likelihood that it belongs to a given class. Maximum
entropy models (16.2) are described in which a number of pre-
classified documents are defined by means of constraint features.
The classification with the highest entropy score is defined as the
maximum entropy model. New documents are then classified according
to their similarity with this model. With the perceptron learning
method, term vectors (chapter 15) and iteratively induced weights are
used to classify documents. The K-nearest neighbour (16.4)
classification method is also described. Here, documents are
classified according to their similarity to positively classified
documents.
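
As an illustration of the nearest neighbour approach, the Python
sketch below classifies a document by majority vote among its k most
similar labelled documents. The use of cosine similarity over sparse
term vectors, the voting rule and the function names are assumptions
made for the example, not details confirmed by the book.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(query, labelled, k=3):
    """Label `query` by majority vote among its k most similar labelled vectors.

    `labelled` is a list of (vector, label) pairs.
    """
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The choice of k and of the similarity measure both affect the result,
and in practice they would be tuned on held-out data.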

Although much of the material here is also covered by (Charniak
1996) and (Krenn and Samuelsson 1997) and less so by (Allen 1995),
Manning and Schuetze's work provides wider, more detailed coverage.
Strangely, none of the works discusses the application of corpus-
based optimisation techniques such as genetic algorithms (Mitchell
1997) to natural language processing.

I recommend this book both as an exemplary teaching aid and a
rigorous introduction to statistical NLP. It is to be commended for
its readability and the coherent presentation of a notoriously
difficult subject. This reviewer did note some flaws, but they
represent very minor points in the context of a 680 page book. Part
of the beauty of this work is that it can stand alone without the
reader having to refer to anything else in order to understand or
clarify parts of it. All the crucial information is here, presented
from first principles. It is a very good reference book for anyone
working in the field of NLP.

Bibliography
Allen, J. (1995) Natural Language Understanding, Benjamin/Cummings
Charniak, E. (1996) Statistical Language Learning, MIT Press
Crocker, M. & Pickering, M. (1996) A Rational Analysis of Parsing
and Interpretation, Unpublished
Day, D. et al. (1997) Mixed Initiative Development of Language
Processing Systems, The Mitre Corporation
Krenn, B. & Samuelsson, C. (1997) The Linguist's Guide to
Statistics, at http://coli.uni-sb.de/{~Krenn,~christer}
Mikheev, A., Grover, C. & Moens, M. (1998) Description of the LTG
System Used for MUC-7, Language Technology Group,
http://www.ltg.ed.ac.uk/papers/muc.ps
Mitchell, T. (1997) Machine Learning, McGraw Hill

Richard Evans is a research assistant with the Computational
Linguistics Research Group at the University of Wolverhampton in the
UK. His current research interest is anaphor resolution and the
application of corpus-based machine learning and optimisation
methods to that task.



 