AUTHOR: Smith, Noah A. TITLE: Linguistic Structure Prediction PUBLISHER: Morgan & Claypool Publishers YEAR: 2011
Benjamin Börschinger, Department of Computing, Macquarie University
In this book, Noah Smith sets out to ''bridge the gap between NLP [Natural Language Processing] and ML [Machine Learning]'' (p. xv). It is targeted at ''graduate students, NLP researchers wanting to better understand ML, and ML researchers wanting to better understand NLP,'' and as such, assumes a ''[b]asic familiarity'' (p. xv) with at least one of the areas.
Smith proposes the general framework of 'linguistic structure prediction' as an approach to NLP problems that makes clear the connections to ML. Within this framework, Smith discusses a wide variety of techniques and approaches to both supervised and unsupervised learning in considerable technical detail. He covers both important 'traditional' ideas such as the semiring approach in NLP, by now 'standard' approaches such as Support Vector Machines (SVMs) and Maximum Entropy models, and the general idea of and differences between generative and discriminative models.
Chapter one introduces some typical NLP tasks, ranging from Part-of-Speech tagging to syntactic parsing. While providing a basic overview of NLP problems for people with a primary background in ML, Smith also shows how to formulate them as instances of a generic ML problem, showing the connections between the two fields. He proposes the abstract ''linguistic structure prediction framework'' that is made up of four steps:
First, ''[f]ormally define the inputs and outputs'' (p. 20). Smith provides concrete examples how to do this in cases such as Part-of-Speech tagging or Syntactic Parsing, thus also introducing in an intuitive way notation that is used consistently throughout the remainder of the book. Second, ''[i]dentify a scoring function over input-output pairs, and an algorithm that can find the maximum-scoring output given an input'' (p. 20), the topic of Chapter two. Third, identify potential data that actually allows one ''to learn to predict outputs from inputs, and apply a learning algorithm'' (p. 20), discussed further in Chapters three to five. Finally, ''evaluate the model on an objective criterion measured on unseen test data'' (p. 21), a point that is addressed in one of the four appendices.
Chapter two starts with an explicit definition of the scoring function mentioned in step two. A scoring function needs to assign a single real number (a score) to every possible input-output pair for the given task; for example, in Part-of-Speech (PoS) tagging it needs to assign a score to every possible pair of word and PoS sequences of the same length. Given a formal representation of both input and output along the lines of Chapter one, these representations are first mapped to a feature vector, an ordered sequence of real numbers. As in Chapter one, Smith provides concrete and intuitive examples of how to define a feature function for specific NLP tasks, giving people with primarily a ML background an idea of what relevant features can look like in NLP. Given a feature vector V, a score is then simply assigned to the corresponding input-output pair by taking the dot product between the vector V and a given weight vector W. Thus, problems in NLP can be stated as standard optimization problems: for a given input I, find the output O such that the product of the corresponding V and a given weight vector W is as large as possible. Obviously, this raises two further questions, namely how one can efficiently find the optimal output value and where the weight vector W comes from. The rest of Chapter two is dedicated to the first problem, finding the optimal output if one already knows the weight vector, for which Smith introduces the technical term 'decoding'.
Before addressing how the problem of decoding can efficiently be solved, however, Smith introduces his 'five views on decoding' that serve to provide more intuitive ways to think about the rather abstract formulation of the scoring function as just a product of vectors. In particular, Smith shows how apparently different approaches such as Probabilistic Graphical Models, Parsing or Weighted Logic Programs can all be seen as (not necessarily equivalent) instantiations of the abstract idea of decoding just outlined, and that certain tasks are more naturally using one or the other.
The rest of Chapter two provides an exhaustive discussion of Dynamic Programming, one of the key ideas behind many efficient algorithms that have a long history in NLP. Even for people familiar with specific instances of Dynamic Programming, e.g. the CKY or the Viterbi algorithm, Smith's presentation is interesting in that he uses the mathematical idea of a semiring to provide a very general look at Dynamic Programming. His use of important NLP algorithms as examples again makes it easy for people already familiar with them to understand the abstract approach, and provides people without a strong background in NLP with a clear understanding of some of the core algorithms of the field.
Finally, chapter two includes a quick overview of approximate algorithms that can be used to solve problems for which no efficient Dynamic Programming formulation is known, or in cases where solving the DP exactly is still impractical, a situation that often arises in practice.
Chapter three addresses the problem of finding a weight vector W, assuming many correct input-output pairs are available. This situation is known as supervised learning, and the first half of Chapter three provides a detailed discussion of how probabilistic models, both generative and conditional, can be used in this setting. Smith succeeds in making clear the difference between generative and conditional probabilistic models and their relative merits and shortcomings, as well as in relating them back to the abstract notion of decoding developed in Chapter two. He starts with basic and ''traditional'' approaches such as Hidden Markov Models, Probabilistic Context Free Grammars and Naïve Bayes, again providing an opportunity to get used in a 'familiar' setting to the intricate notation and the not exactly trivial mathematical details that necessarily come up in this context. It is also worth pointing out that Smith provides an introduction to the use of Dirichlet Priors as an elegant approach to smoothing, an idea that has become increasingly popular in the NLP community over the last years.
The second half of chapter three discusses fully discriminative models, in particular large margin methods, and Smith succeeds in explaining their general difference from probabilistic models and why one may choose to use one or the other. Starting with the very intuitive Perceptron algorithm, Smith goes on to sketch the more powerful framework of Support Vector Machines, pointing out the challenges these models pose in learning weight vectors from the training data.
Chapter four addresses the problem of finding a weight vector W in cases where one does not have any input-output pairs to learn from, commonly called unsupervised learning. It provides an excellent discussion of the Expectation Maximization algorithm, and how it can be applied to probabilistic generative models for unsupervised learning. Particularly noteworthy is the discussion of Bayesian Unsupervised Learning, a very recent trend in NLP. Smith both makes very clear the general difference between the Bayesian and the 'standard' (frequentist) approach and provides an explanation of the relevant formal background, including a quick discussion of Markov Chain Monte Carlo methods and Variational Inference.
In addition to unsupervised learning, Chapter four provides a short but insightful discussion of Hidden Variable Learning. This is the case where one actually has input-output pairs to learn from, as in Chapter three, but assumes that there is unknown hidden structure on the input side that is useful in predicting the output. Importantly, Hidden Variable Learning can also be applied to conditional and discriminative models whereas 'classical' unsupervised learning cannot, thus providing a connection to the discussions in Chapter three.
Chapter five provides additional mathematical details about some of the subproblems that arose in Chapters three and four. It is the mathematically most challenging part of the book but can be used more like a reference for specific mathematical questions that arise when actually implementing one of the techniques discussed in the previous chapters.
Finally, there are four appendices that provide additional background on some of the topics mentioned in the main text. The first appendix discusses numerical optimization methods, particularly important for Chapters three and four. The second appendix discusses experimental methodology in NLP, thus addressing the important question of how to do evaluation. The third and the fourth appendix elaborate on Chapters three and four, discussing Maximum Entropy models and the difference between locally and globally normalized probabilistic models.
''Linguistic Structure Prediction'' provides an exhaustive and detailed discussion of ML techniques that are currently used within NLP. It is clearly not an introductory book for people with absolutely no background in either field, but given that it defines its own target audience as graduate students and researchers, this is not a real criticism. The only thing that I would consider 'missing' is a discussion of directed probabilistic graphical models, as Smith focuses exclusively on undirected ones. Again, however, this is something that Smith points out himself, and his reason for focusing on the undirected case, ''general methods for decoding are [...] easier to understand in the undirected case'' (p. 29), is reasonable.
Particularly noteworthy is how up-to-date the book is. Even very recent developments are covered, e.g. Bayesian non-parametric approaches, and the extensive bibliography makes it easy for the reader to identify relevant and recent papers. Researchers in NLP will welcome it as a useful and up-to-date reference book for ML techniques, in particular given that most of the literature in ML uses examples that are often hard to connect back to the discrete and richly structured world of NLP problems.
To conclude, then, one can say that ''Linguistic Structure Prediction'' does a good job in bringing ML and NLP closer together.
ABOUT THE REVIEWER
ABOUT THE REVIEWER:
Benjamin Börschinger received his M.A. in Philosophy and Computational
Linguistics from the University of Heidelberg and is currently working
towards his PhD in Computational Linguistics at Macquarie University,
Australia. His research focuses on computational models of human language
acquisition and unsupervised learning of linguistic structure.
Additionally, he is interested in conceptual questions that arise at the
interface of Linguistics and other disciplines, in particular the Cognitive