Review of Translating the Untranslatable
AUTHOR: Bond, Francis
TITLE: Translating the untranslatable
SUBTITLE: A solution to the problem of generating English determiners
SERIES: CSLI Series in Computational Linguistics
Shigeko Nariyama, Asia Institute, University of Melbourne
Here is the list of chapters with a brief description of each one.
Ch. 1. Introduction
Ch. 2. Background: literature review on reference, countability,
definiteness and thematic marking
Ch. 3. Determiners and Number in Machine Translation: literature
review on machine translation and related areas of natural language
Ch. 4. Semantic Representation: a tractable representation of
referentiality, boundedness (entities with or without salient boundary)
and definiteness proposed
Ch. 5. Automatic Interpretation: the algorithms proposed that
determine values for referentiality, boundedness and definiteness
Ch. 6. Evaluation and discussion: implementation of the algorithms
and comparison with other systems
Ch. 7. Construction of the Lexicon: compilation of the detailed
knowledge used in the lexicon
Ch. 8. Automatic Acquisition of Lexical Information: acquired from
existing dictionaries and corpora
Ch. 9. Conclusion
Chapters 2, 4 and 7 are more linguistic in focus, while the remaining
chapters are oriented more towards computational NLP. Chapters 5, 6 and 8
are the heart of the book. It would have been better had Chapter 7 been
placed immediately after Chapter 4.
'Translating the untranslatable' is exactly what this book is all about --
a challenge to accomplish a near impossible mission concerning
languages -- generating required linguistic information in the target
language from input sentences in the source language that apparently
contain no such information!
'Determiners' cover a wide range of linguistic phenomena, including
in/definite articles (i.e. a/the) or null articles, possessive pronouns,
and number (and therefore they relate to generic referents,
countability and numeral classifiers). Although all of these are
syntactically obligatory in English and must be appropriately reflected
in every sentence, none of these except for numeral classifiers are
grammaticalised in Japanese. Hence, it is easy to see the magnitude
of difficulties in generating English sentences with correct determiners
and number from Japanese sentences that contain no overt
information concerning these. For example, inu 'dog' can be 'a
dog', 'the dog', or 'dogs'.
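The ambiguity can be made concrete with a small sketch. The function names `realize_np` and `candidates` are hypothetical and are not taken from the book or from ALT-J/E; the code only handles regular count nouns and simply enumerates the English noun phrases that a bare Japanese noun could correspond to once definiteness and number are chosen.

```python
from itertools import product


def realize_np(noun: str, definite: bool, plural: bool) -> str:
    """Toy English NP realizer (hypothetical; regular count nouns only)."""
    head = noun + "s" if plural else noun
    if definite:
        return f"the {head}"
    if plural:
        return head  # an indefinite plural takes no article
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {head}"


def candidates(noun: str) -> list[str]:
    """All NPs obtained by varying definiteness and number."""
    return [realize_np(noun, d, p) for d, p in product([False, True], repeat=2)]


print(candidates("dog"))
```

Besides the three renderings listed above, the enumeration also yields 'the dogs'. A translation system has to pick one of these without any overt cue in the Japanese noun itself.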
As a native speaker of Japanese who speaks English as a second
language, I learned a great deal from reading this book. As
mentioned in the book, errors in articles and number are the most
frequent types of errors for Japanese speakers, ranging from 9% to 18%
depending on one's competence. The problem of incorrect determiner use
is more serious than it may appear, since an incorrect determiner can
not only give a sentence the wrong nuance, but also make it refer to a
different entity.
This book is truly comprehensive and has something for everyone.
Apart from the benefits for second language acquisition mentioned
above, it examines the issue of determiners and number from
theoretical linguistics, computational linguistics, and various
applications in NLP and generation, including machine translation
systems, on which this book is focused. It is easy to read, particularly
because of the range of appropriate examples. Japanese fonts in the
examples make reading so much easier for Japanese speakers, as
Japanese words have an abundance of homonyms.
Moving on to the details of the book: given the frequent absence of
determiners and number in Japanese, the solution to the issues of
determiners and number has to be sought elsewhere in the sentence.
Lexical knowledge is one good source, convincingly discussed in
Chapter 7. Determiners and number also have an intricate relation to
discourse elements, such as the notions of topic and familiarity.
All of these linguistic phenomena and discourse elements that play an
important role for determiners and number are complex issues in their
own right, and none of them has been satisfactorily accounted for, let
alone integrated into a comprehensive treatment of determiners and
number. For example, various definitions of 'definiteness' in English
have been proposed: e.g., uniqueness, discourse givenness, familiarity.
However,
corpus analysis shows that 21% of the definite articles are used even
for unfamiliar and discourse new entities (Poesio 2004), and thus the
definition has not reached consensus among English speakers (see
the series of work by Poesio). Even among those languages that use
in/definite articles, definiteness is often language specific. Hence,
determiners and number have been known to be a perennial problem
particularly in NLP. Because computers do not have the faculty to rely
on intuition that humans can utilise, they require explicit procedures
for generating determiners and number.
As a solution to these issues, Bond proposes three algorithms
concerning 1) referentiality, 2) number and countability, and 3)
definiteness. These algorithms combine a deep semantic analysis with
the use of sensible defaults. They were tested in the wide-coverage
Japanese-to-English machine translation system ALT-J/E. The reported
result is highly promising: determiners (articles and possessives, to
be more precise) are generated with an accuracy of over 85%. The
methodology and evaluation seem sound, as the algorithms were tested on
398 sentences containing 3,000 NPs.
While this high accuracy may not always be maintained in other
domains of texts, it is still highly promising given all the complexities
associated with the issues. The author ought to be congratulated for
I found Chapter 8 particularly meaningful. Bond successfully shows,
with high precision and F-score, that the countability of unknown
words, including multiword (compound) expressions, can be learnt
automatically with a precision rivalling manual annotation. It is
acquired from semantic classes and corpora. The main obstacle there
lies in distinguishing
different senses of a word. For example, both countable and
uncountable usages of 'interest' are in corpora; countable for the
sense 'a sense of concern with and curiosity', while uncountable
for 'fixed charge for borrowing money'.
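The sense problem can be pictured with a toy lookup. This is only an illustration of why a word-level countability label is not enough, not the acquisition method of Chapter 8; the table `COUNTABILITY_BY_SENSE` and both helper functions are hypothetical, and the two entries simply restate the example above.

```python
# Hypothetical sense-keyed lexicon; entries restate the 'interest' example.
COUNTABILITY_BY_SENSE = {
    ("interest", "a sense of concern with and curiosity"): "countable",
    ("interest", "fixed charge for borrowing money"): "uncountable",
}


def countability_values(lemma: str) -> set[str]:
    """Collect every countability label recorded for any sense of the lemma."""
    return {
        label
        for (entry_lemma, _sense), label in COUNTABILITY_BY_SENSE.items()
        if entry_lemma == lemma
    }


def is_ambiguous(lemma: str) -> bool:
    """True when the senses of a lemma disagree about countability."""
    return len(countability_values(lemma)) > 1


print(is_ambiguous("interest"))
```

A learner that sees only the surface form in a corpus observes both usages mixed together, which is why the senses have to be told apart before a label can be assigned.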
My main criticism concerns the way the book captures the relationship
between definiteness and the Japanese thematic marker wa and the
nominative case marker ga. Bond treats wa (also mo) as
definite in the algorithm. The relevant discussion is found in Watanabe
(1989:140-1), who reports that 99.5% of wa-marked arguments are
definite, whereas only 61.6% of ga-marked arguments are definite,
and, as a reference point, 100% of elided arguments are definite (ibid.
75-154). Looking at the issue of definiteness from another
perspective, 69.9% of definite subjects are marked by wa and 30.1%
by ga, while 1.7% of indefinite subjects are marked by wa and 98.3% by ga.
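As a rough check on how these two views fit together, the figures can be combined with Bayes' rule. The sketch below assumes, purely for illustration, that all the percentages describe one and the same population of subjects marked by either wa or ga; Watanabe's counts are not available here, so this is arithmetic on the quoted figures only, and the function names are hypothetical.

```python
def implied_definite_share(p_wa_given_def, p_wa_given_indef, p_def_given_wa):
    """Share of definite subjects implied by the wa figures (Bayes' rule solved for the prior)."""
    r, a, b = p_def_given_wa, p_wa_given_def, p_wa_given_indef
    return r * b / (a * (1 - r) + r * b)


def posterior_definite(p_marker_given_def, p_marker_given_indef, share_def):
    """P(definite | marker) from the marking rates and the definite share."""
    numerator = p_marker_given_def * share_def
    return numerator / (numerator + p_marker_given_indef * (1 - share_def))


# Figures quoted above: 69.9% / 1.7% marked by wa, 99.5% of wa-marked definite.
share = implied_definite_share(0.699, 0.017, 0.995)
# ga rates are the complements: 30.1% of definite, 98.3% of indefinite subjects.
predicted_ga = posterior_definite(0.301, 0.983, share)

print(f"implied share of definite subjects: {share:.3f}")
print(f"predicted P(definite | ga): {predicted_ga:.3f} (reported: 0.616)")
```

Under that assumption the wa figures imply that a little over 80% of the subjects are definite, and the predicted proportion of definite ga-marked items comes out at roughly 60%, close to the reported 61.6%. The estimate is sensitive to rounding in the 99.5% figure, so it should be read only as showing that the two sets of numbers are broadly compatible.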
In principle, ga-marked arguments are indefinite, unless denoting an
exhaustive listing, which connotes emphasis (see Kuno 1973). I
suspect that another reason why 30.1% of definite subjects are
marked by ga is their occurrence in subordinate clauses; the
subject in a subordinate clause must be marked by ga, irrespective
of definiteness. I do not have access to
Watanabe's corpus to check this point. Even though Watanabe did not
specify her definition of definiteness in the analysis and the
definiteness there may not necessarily correspond to the use of the,
these findings are still sufficient to show that the
differences between wa and ga do indeed correlate strongly with definiteness.
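The observations in this paragraph could be read as a default rule along the following lines. This is a sketch of my suggestion under stated assumptions, not Bond's algorithm: the function name and its two boolean flags are hypothetical, and a real system would have to detect subordinate clauses and exhaustive-listing readings by other means.

```python
from typing import Optional


def default_definiteness(
    particle: str,
    in_subordinate_clause: bool = False,
    exhaustive_listing: bool = False,
) -> Optional[str]:
    """Return a default definiteness value, or None when the particle gives no cue."""
    if particle in ("wa", "mo"):
        # The review notes that Bond treats wa (and mo) as definite.
        return "definite"
    if particle == "ga":
        if in_subordinate_clause:
            # ga is obligatory on subordinate subjects, so it carries no signal.
            return None
        if exhaustive_listing:
            # The 'indefinite in principle' default does not apply here.
            return None
        return "indefinite"
    return None  # other particles: leave the decision to other evidence


print(default_definiteness("wa"))
print(default_definiteness("ga"))
print(default_definiteness("ga", in_subordinate_clause=True))
```

Returning no value rather than a guess for subordinate-clause ga keeps the default from overriding whatever lexical or discourse evidence is available elsewhere.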
Furthermore, the classification of the uses of wa and ga described in
Figure 4, quoted from Hinds (1987), may not be the best representation
for capturing the difference between the two, because three of its
seven categories allow both wa and ga. Watanabe (1989: 162) offers a
more precise representation of the mental processing of wa and ga in
relation to ellipsis (zero anaphora).
Finally, I fully agree with Bond that using discourse context will
improve the generation of determiners that enter into anaphoric
relations. This is also the overall future direction of work in
linguistics and NLP, including work on determiners: to deal with
discourse (sequences of sentences), not just isolated sentences.
Hinds, John. 1987. Thematization, assumed familiarity, staging, and
syntactic binding in Japanese. In J. Hinds et al. (eds.), Perspectives on
Topicalization: The case of Japanese wa. 83-106.
Kuno, Susumu. 1973. The structure of the Japanese language. Mass:
Poesio, Massimo. 2004. An empirical investigation of definiteness.
Proceedings of International Conference on Linguistic Evidence,
Vieira, Renata and Massimo Poesio. 2000. Processing definite
descriptions in corpora. In S. Botley and T. McEnery (eds.),
Corpus-based and computational approaches to anaphora. UCL Press.
Watanabe, Yashuko. 1989. The function of "WA" and "GA" in
Japanese discourse. Eugene: University of Oregon Ph.D. dissertation.
ABOUT THE REVIEWER
Shigeko Nariyama is a lecturer at the Asia Institute, the University of
Melbourne, Australia. Her main research area is zero anaphora, along
with lexical semantics, pragmatics and world knowledge that
contribute to resolving zero anaphora.