AUTHOR: Bond, Francis
TITLE: Translating the untranslatable
SUBTITLE: A solution to the problem of generating English determiners
SERIES: CSLI Series in Computational Linguistics
PUBLISHER: CSLI
YEAR: 2005
Shigeko Nariyama, Asia Institute, University of Melbourne
Here is the list of chapters with a brief description of each one.
Ch. 1. Introduction
Ch. 2. Background: literature review on reference, countability, definiteness and thematic marking
Ch. 3. Determiners and Number in Machine Translation: literature review on machine translation and related areas of natural language processing (NLP)
Ch. 4. Semantic Representation: a tractable representation of referentiality, boundedness (entities with or without salient boundary) and definiteness proposed
Ch. 5. Automatic Interpretation: the algorithms proposed that determine values for referentiality, boundedness and definiteness
Ch. 6. Evaluation and discussion: implementation of the algorithms and comparison with other systems
Ch. 7. Construction of the Lexicon: compilation of the detailed knowledge used in the lexicon
Ch. 8. Automatic Acquisition of Lexical Information: acquired from existing dictionaries and corpora
Ch. 9. Conclusion
Chapters 2, 4 and 7 are more linguistic in focus, while the remaining chapters are oriented more towards computational NLP. Chapters 5, 6 and 8 are the heart of the book. It would have been better had Chapter 7 been placed immediately after Chapter 4.
'Translating the untranslatable' is exactly what this book is all about -- a challenge to accomplish a near impossible mission concerning languages -- generating required linguistic information in the target language from input sentences in the source language that apparently contain no such information!
'Determiners' cover a wide range of linguistic phenomena, including in/definite articles (i.e. a/the) or null articles, possessive pronouns, and number (and therefore they relate to generic referents, countability and numeral classifiers). Although all of these are syntactically obligatory in English and must be appropriately reflected in every sentence, none of these except for numeral classifiers are grammaticalised in Japanese. Hence, it is easy to see the magnitude of difficulties in generating English sentences with correct determiners and number from Japanese sentences that contain no overt information concerning these. For example, inu 'dog' can be 'a dog', 'the dog', or 'dogs'.
Being a native speaker of Japanese myself and speaking English as a second language, I learned a great deal from reading this book. As mentioned in the book, articles and number are the most frequent types of errors for Japanese speakers, ranging from 9% to 18% depending on one's competence. The problem of incorrect determiners is more serious than it may appear, since an incorrect determiner can not only give a sentence the wrong nuance but also make it refer to a different entity.
This book is truly comprehensive and has something for everyone. Apart from the benefits for second language acquisition mentioned above, it examines the issue of determiners and number from theoretical linguistics, computational linguistics, and various applications in NLP and generation, including machine translation systems, on which this book is focused. It is easy to read, particularly because of the range of appropriate examples. Japanese fonts in the examples make reading so much easier for Japanese speakers, as Japanese words have an abundance of homonyms.
Moving on to the details of the book: given the frequent absence of determiners and number in Japanese, the solution to the issues of determiners and number has to be sought elsewhere in the sentence. Lexical knowledge is one good source, convincingly discussed in Chapter 7. Determiners and number also have an intricate relation to discourse elements, such as the notions of topic and familiarity.
All of these linguistic phenomena and discourse elements that play an important role in determiners and number are complex issues in their own right, and none of them has been satisfactorily accounted for, let alone integrated into a comprehensive treatment of determiners and number. For example, various definitions of 'definiteness' in English have been proposed: e.g., uniqueness, discourse givenness, familiarity. However, corpus analysis shows that 21% of definite articles are used for unfamiliar and discourse-new entities (Poesio 2004), and thus no definition has reached consensus even among English speakers (see the series of work by Poesio). Even among those languages that use in/definite articles, definiteness is often language specific. Hence, determiners and number have been known to be a perennial problem, particularly in NLP. Because computers cannot rely on the intuition that humans can utilise, they require explicit procedures for generating determiners and number.
As a solution to these issues, Bond proposes three algorithms concerning 1) referentiality, 2) number and countability, and 3) definiteness. These algorithms combine a deep semantic analysis with the use of sensible defaults. They were tested in the wide-coverage Japanese-to-English machine translation system ALT-J/E. The result reported is highly promising: determiners (articles and possessives, to be more precise) are generated at an accuracy of over 85%. The methodology and evaluation seem sound, as the system was tested on 398 sentences with 3,000 NPs.
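To make the idea of combining feature values with defaults concrete, here is a toy sketch in Python. It is not Bond's algorithm and has nothing to do with the internals of ALT-J/E: the function name `english_np`, its parameters and the default values are all hypothetical, chosen only to show how a single Japanese noun such as inu can surface as 'a dog', 'the dog' or 'dogs' once values for definiteness, countability and number have been decided.

```python
from typing import Optional


def english_np(noun: str, plural: str, *,
               definite: Optional[bool] = None,
               countable: bool = True,
               is_plural: Optional[bool] = None) -> str:
    """Toy illustration only: choose an English NP form from feature values.

    A value of None means "the source sentence gave no information", in
    which case a default applies. The defaults (indefinite, singular) are
    assumptions made for this sketch, not the defaults used in the book.
    """
    if definite is None:
        definite = False      # assumed default: indefinite
    if is_plural is None:
        is_plural = False     # assumed default: singular
    if not countable:
        is_plural = False     # treat uncountable nouns as having no plural

    head = plural if is_plural else noun
    if definite:
        return f"the {head}"
    if countable and not is_plural:
        # Crude spelling-based choice between "a" and "an".
        article = "an" if noun[0].lower() in "aeiou" else "a"
        return f"{article} {head}"
    return head               # bare plural or bare mass noun


print(english_np("dog", "dogs"))                  # no information: defaults
print(english_np("dog", "dogs", definite=True))   # definiteness known
print(english_np("dog", "dogs", is_plural=True))  # number known
```

The hard part, and the part the book addresses, is deciding the feature values in the first place; the final rendering step shown here is the easy end of the problem.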
While this high accuracy may not always be maintained in other domains of texts, it is still highly promising given all the complexities associated with the issues. The author ought to be congratulated for his achievement.
I found Chapter 8 particularly meaningful. Bond successfully shows, with high precision and F-score, that the countability of unknown words, including multi-word (compound) expressions, can be learnt automatically with a precision rivalling manual annotation. This information is acquired from semantic classes and corpora. The main obstacle lies in distinguishing different senses of a word. For example, both countable and uncountable usages of 'interest' occur in corpora: countable for the sense 'a sense of concern with and curiosity', and uncountable for 'fixed charge for borrowing money'.
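For readers less familiar with the evaluation measures mentioned here, the following minimal sketch shows how precision, recall and F-score are computed for a binary countability decision. The labels are invented for illustration and are not data from the book; the function name `precision_recall_f1` is likewise an assumption of this sketch.

```python
def precision_recall_f1(gold, predicted, positive="countable"):
    """Compute precision, recall and F1 for one class of a labelling task."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy labels: manual annotation versus an automatic learner.
gold = ["countable", "uncountable", "countable", "countable", "uncountable"]
predicted = ["countable", "countable", "countable", "uncountable", "uncountable"]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A word such as 'interest', whose senses differ in countability, would need one label per sense rather than one per word for a comparison of this kind to be fair.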
My main criticism concerns the way the book captures the relationship between definiteness and the Japanese thematic marker wa and the nominative case marker ga. Bond treats wa (also mo) as definite in the algorithm. The relevant discussion is found in Watanabe (1989:140-1), who reports that 99.5% of wa-marked arguments are definite, whereas only 61.6% of ga-marked arguments are definite, and, as a reference point, 100% of elided arguments are definite (ibid. 75-154). Looking at the issue of definiteness from another perspective, 69.9% of definite subjects are marked by wa and 30.1% by ga, while 1.7% of indefinite subjects are marked by wa and 98.3% by ga.
In principle, ga-marked arguments are indefinite, unless they denote an exhaustive listing, which connotes emphasis (see Kuno 1973). I suspect that one of the other reasons why 30.1% of definite subjects are marked by ga has to do with their appearing in subordinate clauses; the subject of a subordinate clause must be marked by ga, irrespective of definiteness. I do not have access to Watanabe's corpus to check this point. Even though Watanabe did not specify her definition of definiteness in the analysis, and definiteness there may not necessarily correspond to the use of 'the', these findings are still sufficient to show that the differences between wa and ga do indeed correlate strongly with definiteness.
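The two perspectives on Watanabe's figures quoted above can be related with Bayes' rule. The sketch below is a back-of-the-envelope check, not a reanalysis of her data: it assumes that all the percentages describe the same population, which may not hold since the first set is stated for arguments and the second for subjects. The function names `definite_share` and `posterior` are hypothetical.

```python
def definite_share(p_wa_given_def, p_wa_given_indef, p_def_given_wa):
    """Overall proportion of definite items implied by Bayes' rule.

    Solves  P(def | wa) = x*d / (x*d + y*(1 - d))  for d, where
    x = P(wa | def) and y = P(wa | indef).
    """
    a = p_wa_given_def * (1 - p_def_given_wa)
    b = p_wa_given_indef * p_def_given_wa
    return b / (a + b)


def posterior(p_marker_given_def, p_marker_given_indef, share_def):
    """P(def | marker) from the marker likelihoods and the definite share."""
    num = p_marker_given_def * share_def
    return num / (num + p_marker_given_indef * (1 - share_def))


# Figures as quoted in the review, expressed as proportions.
share = definite_share(0.699, 0.017, 0.995)
p_def_given_ga = posterior(0.301, 0.983, share)

print(f"implied share of definite subjects: {share:.3f}")
print(f"implied P(definite | ga):           {p_def_given_ga:.3f}")
```

Under these assumptions the implied share of definite subjects is roughly 83%, and the implied proportion of ga-marked subjects that are definite is about 60%, close to the 61.6% reported for ga-marked arguments. The small gap is unsurprising given that the two sets of figures need not cover the same items.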
Furthermore, the classification of the uses of wa and ga described in Figure 4, quoted from Hinds (1987), may not be the best representation for capturing the difference between the two, because three out of the seven categories allow both wa and ga. Watanabe (1989: 162) offers a more precise representation of the mental processing of wa and ga in relation to ellipsis (zero anaphora).
Finally, I totally agree with Bond that discourse context will make it possible to generate more accurate determiners for NPs that stand in anaphoric relations. This is the overall future direction of work in linguistics and NLP, including work on determiners: to deal with discourse (sequences of sentences), not just isolated sentences.
Hinds, John. 1987. Thematization, assumed familiarity, staging, and syntactic binding in Japanese. In J. Hinds et al. (eds.), Perspectives on Topicalization: The case of Japanese wa. 83-106.
Kuno, Susumu. 1973. The structure of the Japanese language. Cambridge, Mass.: MIT Press.
Poesio, Massimo. 2004. An empirical investigation of definiteness. Proceedings of International Conference on Linguistic Evidence, Tuebingen.
Vieira, Renata and Massimo Poesio. 2000. Processing definite descriptions in corpora. In S. Botley and T. McEnery (eds.), Corpus-based and computational approaches to anaphora. UCL Press.
Watanabe, Yashuko. 1989. The function of "WA" and "GA" in Japanese discourse. Eugene: University of Oregon Ph.D. dissertation.
ABOUT THE REVIEWER:
Shigeko Nariyama is a lecturer at the Asia Institute, the University of Melbourne, Australia. Her main research area is zero anaphora, along with lexical semantics, pragmatics and world knowledge that contribute to resolving zero anaphora.