Date: Thu, 4 Aug 2005 20:07:07 +0530 (IST)
From: Veena Dixit
Subject: Computational and Quantitative Studies

AUTHOR: Halliday, M. A. K.
EDITOR: Webster, Jonathan J.
TITLE: Computational and Quantitative Studies
SERIES: Collected Works of M. A. K. Halliday
PUBLISHER: Continuum International Publishing Group Ltd
YEAR: 2004
Veena Dixit, Center for Indian Language Technology, Indian Institute of Technology, Bombay, India.
This is the sixth of the ten volumes of the collected works of Professor M. A. K. Halliday. Professor Halliday has had a lifelong engagement with language, and these volumes represent the outcome. The book portrays the developmental phases of machine translation (MT) from the perspective of the Firthian frame of systemic functional grammar. Computer technologies have developed considerably since the first article in the volume appeared. Nevertheless, the early articles remain relevant, and not only from a historical point of view.
The book contains eleven articles divided into three parts. Each part has a brief introduction by the Editor. There is an appendix containing a trial grammar for a text generation project. The selection of articles represents the sequential shift in the focus of the author's interest while stressing the continuity and development of themes articulated in the 1950s.
The central theme of the first part is that linguistic analysis grounded in sound scientific theory is the prerequisite of any language-oriented mechanical task. Such analysis offers language description in mutually, unilaterally approximating comparative terms. The author proposes that the descriptions of the two languages, source language (SL) and target language (TL), should cover the levels of grammar and lexis at one end and context at the other. The descriptions can take the form of statistical statements displaying a quantitative analysis of the occurrences of items. The rules for systematically relating the two descriptions should be appended to them. The expected relationship between items is one of translation equivalence.
The second part contains six chapters, which continue and develop the central propositions of the first part of the book. The linguistic system is inherently probabilistic in nature. Grammatics, the theory of grammar, has to be paradigmatic. Quantitative analysis throws light on the probability of each choice. The basis for the quantitative analysis of language is the principle that frequency in a text instantiates probability in a system.
Corpus linguistics is as much about theory building as it is about data collecting. A corpus provides the methodological means for collecting evidence of relative frequencies in the grammar, from which the probability profiles of grammatical systems can be established. This is the theme of the third part.
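The principle that frequency in a text instantiates probability in a system can be illustrated with a minimal Python sketch. It is not taken from the book: the function name `probability_profile` and the ten sample polarity choices are hypothetical, chosen only to show how relative frequencies in a sample yield an estimated probability profile for a grammatical system.

```python
from collections import Counter


def probability_profile(choices):
    """Estimate the probability of each term in a system from observed choices.

    `choices` is a sequence of term labels, one per instance in the sample.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no instances observed")
    # Relative frequency in the sample stands in for probability in the system.
    return {term: n / total for term, n in counts.items()}


# Hypothetical annotations: the polarity selected in each of ten clauses.
polarity = ["positive"] * 9 + ["negative"]
profile = probability_profile(polarity)
print(profile)
```

A sample this small gives only a rough estimate; the larger the sample, the more stable the estimated profile can be expected to be.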
EVALUATION BY CHAPTER
Chapter one: 'The Linguistic Basis of a Mechanical Thesaurus' (1956): The chapter exploits the fact that grammar and lexis exhibit a high degree of internal determination. Machine translation is defined as a function between two given languages. The translation procedure involves translation equivalence, equivalence of determining features, and the operation of particular determining features in the TL.
Autonomous analysis and the construction of a mechanical thesaurus are needed for MT. Grammar should be viewed as a statistics-based statement of lexical redundancy, which can be handled autonomously by a Lattice program. A thesaurus is defined as the lexical analogue of a grammatical paradigm, in which words are arranged in a contextually determined series to achieve translation equivalence as well as contextual equivalence. One can abstract the collocational and the non-collocational features of context from the language text.
The proposition is substantiated with examples from Chinese and English.
Chapter two: 'Linguistics and Machine Translation' (1962): This article foreshadows the themes developed by the author over the next thirty years. There is no analogy between code and message on the one hand and form and content on the other. A full description of a language involves categories and methods that are peculiar to that language. These categories need to be used for stating the patterns of the language and for showing how it works. The author introduces the necessary technical categories, such as unit, form, rank, and level. A description is complete when the independent grammatical and lexical descriptions are shown to be related.
The author stresses the necessity of quantitative analysis for the description of languages. The computer has to translate on a more-likely or less-likely basis rather than on a yes-no basis.
He concludes that the interlingua for translation between pairs or groups of the languages concerned can be neither a natural language nor a machine language. It will have to be a mathematical construct serving as a transit code between natural languages.
Chapter three: 'Towards Probabilistic Interpretations' (1991): Professor Halliday starts from a rather distant point, asking how change is to be incorporated into the structural linguistic concept of a system. Language may have infinite possibilities, but it has a finite number of users. A probabilistic model of lexicogrammar enables us to explain register variation, which is related to diachronic variation. When a probability reaches certainty, a category change has taken place. Every single instance alters the probabilities of the system in some measure.
The difference between physical or biological systems and semiotic systems lies in the key concepts of instantiation and realization. In a semiotic system, instances have differential qualitative values (referred to as the Hamlet factor). As to realization, linguistic systems are characterized by stratification. The author wants to escape from the constructivist trap.
Chapter four: 'Corpus Studies and Probabilistic Grammar' (1991): The chapter is about the theoretical status of corpus frequencies. The author rejects Chomsky's theory of competence and performance because, by definition, it made it impossible for the analysis of an actual text to play any part in explaining the grammar of a language. He points out that corpus studies are a well-established source of information about the grammar of a language. A statement about quantitative patterns in the grammar is not an attack on the individual's freedom of choice in using the language.
Probabilities do not predict single instances; rather, they predict the general pattern. The significance of probabilities lies in interpretation rather than in the prediction of a single instance. It is evident that even children construe the lexicogrammar, on the evidence of text frequency, as a probabilistic system.
Consistent with his views on the role of linguistics, Professor Halliday holds that lexis and grammar are complementary perspectives and not contrastive, opposing or unrelated fields. Each explains different aspects of a single phenomenon.
Chapter five: 'Language as System and Language as Instance: The Corpus as a Theoretical Construct' (1992): System and instance are two observer perspectives, at opposite ends, on a single phenomenon: language. Every instance of a text perturbs the overall probabilities of the system. The more instances we observe, the better we perform as observers of the system. Professor Halliday emphasizes that the corpus needs to be a very large sample of real text.
We can check the relative frequencies, and the frequencies broken down by register, to test hypotheses regarding probability typology. We need to measure how the probability of selecting one term is affected by previous selections made within the same system. It is possible to measure the complexity of the language through general measures such as lexical density or specific measures such as the length of nominal chains. The degree of association between simultaneous systems can be found. Measures of conditional probability can give insights into historical linguistics.
The chapter thus discusses several aspects of the statistical measurement of natural language.
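One of the measurements mentioned above, how the probability of selecting a term is affected by the previous selection within the same system, can be sketched in Python as follows. This is an illustration only, under stated assumptions: the function name `transition_probabilities` and the sample sequence of tense choices are hypothetical and do not come from the book, and only the immediately preceding selection is taken into account.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next term | previous term) from consecutive selections."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {term: n / sum(nexts.values()) for term, n in nexts.items()}
        for prev, nexts in pair_counts.items()
    }


# Hypothetical sequence of primary tense selections in successive clauses.
tenses = ["past", "past", "present", "past", "past", "past", "present", "present"]
conditional = transition_probabilities(tenses)

# Probability of choosing "past" again, given that the previous choice was "past".
print(conditional["past"]["past"])
```

Comparing such conditional estimates with the unconditioned relative frequencies gives a rough indication of whether earlier selections influence later ones.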
Chapter six: 'A Quantitative Study of Polarity and Primary Tense in the English Finite Clause' (1993): This chapter is co-authored with Z. L. James. The intention was to undertake basic quantitative research into the grammar of modern English. The authors decided to access the corpus directly using existing programs. They hoped to test the hypothesis that grammatical systems fall largely into two types. In the first type, the options are equally probable and there is no unmarked term in the quantitative sense. In the second type, the options are skewed, one term being unmarked. The authors then detail the procedure adopted, the problems faced and the important decisions taken during the course of the study.
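The contrast between the two types of system can be made concrete with a small Python sketch. It uses Shannon information as one possible way of quantifying how far a system departs from equal probability; this is an assumption made for illustration, not a description of the procedure followed in the chapter. The function names are hypothetical, and the figures 0.5/0.5 and 0.9/0.1 are illustrative values, not results reported in this review.

```python
import math


def information(probabilities):
    """Shannon entropy in bits of a system with the given term probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def redundancy(probabilities):
    """1 - H / H_max, where H_max is the entropy of an equiprobable system."""
    if len(probabilities) < 2:
        raise ValueError("a system needs at least two terms")
    h_max = math.log2(len(probabilities))
    return 1 - information(probabilities) / h_max


# Illustrative (assumed) probability profiles for two binary systems.
equiprobable = [0.5, 0.5]
skewed = [0.9, 0.1]

for name, system in (("equiprobable", equiprobable), ("skewed", skewed)):
    print(name, round(information(system), 3), round(redundancy(system), 3))
```

An equiprobable binary system carries the maximum of one bit per choice and has no redundancy, whereas a skewed system carries less information per choice, which is one way of expressing what it means for a term to be quantitatively unmarked.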
Chapter seven: 'Quantitative Studies and Probabilities in Grammar' (1993): According to Professor Halliday, corpus linguistics modifies our thinking about theoretical linguistics. He maintains that because of quantitative studies, some interesting patterns seemed to emerge. Any concern with grammatical probabilities makes sense only in the context of a paradigmatic model of grammar.
Systemic functional corpus studies investigate systemic variation in patterns of meaning on the plane of content rather than on the plane of expression. The studies investigate the internal relationship between two systems within the grammar in terms of their interdependencies and their logical-semantic relationship.
In the second half of the chapter, the author discusses the factors that identify the grammatical systems for investigation and the decisions taken during the study. He sets out the procedures adopted and the observations made during the studies, together with an analysis of inaccuracy and the steps taken to deal with errors and omissions. He holds that the analysis should be valid when applied to any natural text.
Chapter eight: 'The Spoken Language Corpus: A Foundation for Grammatical Theory' (2002): The author holds that it is only in spoken language that the full semantic potential of the system is brought into play, and that new insights for the theory of language as a whole flow from this.
The metaphor 'reducing spoken language to writing' suggests that some features, such as melody and rhythm, are lost in transcribing the spoken variety. Transcription should be faithful to the essential natural features of the spoken variety, which are functional in carrying meaning.
With some reservations, the author accepts the distinction between 'corpus-based' and 'corpus-driven' descriptions; both essentially need to be theory based. He describes structure as the theory of the syntagm and system as the theory of the paradigm.
He concludes that grammatical probabilities, both global and local, are an essential aspect of 'what language really is and how it works'. The discussion is supported by a few interesting examples and the results of spoken corpus studies.
Chapter nine: 'On Language in Relation to Fuzzy Logic and Intelligent Computing' (1995): The author expresses the need for a systemic analysis of language for MT, rather than dependence on commonsense knowledge about language. After detailing the distinctive features of language as a semiotic system, he summarizes the complexity of language. The complexity arises because the systems are not fully independent but relate to one another. Nor do they form any kind of strict taxonomy. There are various degrees and kinds of partial association among the systems. Thus, there is a great deal of indeterminacy, both in the systems and in their relationships. The overall picture is notably fuzzy. It is essential to account for the fuzziness of language, its disorder and complexity, not as accidental and aberrant, but as systemic and necessary for conveying meaning.
Finally, he outlines the basic principles adopted in attempting to theorize about language. He wants to formulate grammar paradigmatically, contextually, functionally and fuzzily. Examples are used to illustrate the principles of systemic modeling.
Chapter ten: 'Fuzzy Grammatics: A Systemic Functional Approach to Fuzziness in Natural Language' (1995): This chapter is about the role of grammar when natural language is to be used as a metalanguage for intelligent computing. The basic metafunctions of natural language are ideational, interpersonal and textual. The ideational metafunction construes experience, which can be material, mental, verbal or relational. The interpersonal metafunction enacts social relationships, and the textual metafunction creates discourse. Metafunctions are comprehensive, extravagant, telescopic, non-autonomous, variable and indeterminate. Rhetorical toning, indistinctness, unexpectedness, logogenesis, complexity, irrelevance, jocularity and error are some of the problem areas of natural language as a metalanguage.
The author expresses the need to model language reality in terms of tendencies rather than in terms of categories. This makes it possible for natural language to be its own metalanguage.
Chapter eleven: 'Computing Meanings: Some Reflections on Past Experience and Present Prospects' (1995): MT began in the 1950s with the premise that the approach had to be mathematical and logical. It was only in the mid-1960s that the phenomenon of language came to be seen as autonomous. In the 1980s, language came to occupy center stage and computers became a tool for linguistic research. Research is now at a stage where we can think of computers functioning through the medium of natural language. It was recognized that a word has its meaning only within the total meaning potential of the language.
For intelligent computing to succeed, we will have to align language and knowledge on the one hand and instance and the system of which it is an instance on the other. Professor Halliday then summarizes those points of linguistic complexity that will have to be taken into account if computing with natural language is to succeed.
When computing involves operating with natural languages, we will finally be computing meaning.
A general theme runs through the book. Language is described as made up of choices among alternative patterns. It is therefore inherently probabilistic. Different aspects of the same issues are discussed in appropriate contexts across different chapters. The author often draws on probability-based results to support his hypotheses.
The theoretical statements regarding sentence equivalence are not supported by adequate discussion in chapter two.
In chapter three, the author supports his propositions by discussing fieldwork for child language acquisition as well as cognitive processes regarding language. This makes the propositions more meaningful.
It is stated that cause and effect in the case of physical systems are directional. However, the author has not considered whether this also holds for human perception.
The Firthian concept of 'system' in chapter four provides the necessary paradigmatic base for corpus-based probabilistic studies of language.
In chapter six, the conclusions are tabulated. These conclusions are not always short and sharp answers.
I personally disagree with the following statement in chapter nine, "Literate, educated adults no longer have access to commonsense knowledge about language; what they bring to language are the ideas they learnt in primary school, which have neither unconscious insights of everyday practical experience nor the theoretical power of designed systematic knowledge" (p. 197). It appears to me that a person can improve an acquired language by constant access to contemporary knowledge about language. The capability for second language learning may support this view.
There is no justification for excluding ungrammaticality from the formal model of language in chapter ten. It is generally accepted that, to be complete, a linguistic description has to account for ungrammaticality.
It should make us pause and think that school education is sufficient for day-to-day language use but not adequate for MT. Is this inadequacy merely the difference between the use and the explanation for the use?
Can we relate the differences between the patterns of the spoken and written versions of the language to the gestures, facial expressions and body language that accompany spoken language?
Perhaps corpus linguistics can be usefully supplemented by a study of forms of non-verbal communication.
This unusual book displays Professor Halliday's varied concerns and his endeavor to give linguistics, particularly probabilistic corpus studies, a central role in MT. While illuminating these developments, he provides insights into, and links with, various contemporary subjects.
On reading the book, the reader cannot but feel that it is only on the development of a comprehensive theory of meaning that computational linguistics can finally come into its own.
Chomsky, Noam (2004): New Horizons in the Study of Language and Mind, Cambridge University Press
Dash, Niladri (2004): Corpus Linguistics and Language Technology, Mittal Publications, New Delhi
ABOUT THE REVIEWER:
The reviewer holds an M. A. in Linguistics and is pursuing her Ph. D. in 'Word Sense Disambiguation'. She is engaged in research on Marathi, a less-studied and resource-poor language that is the state language of Maharashtra State in India. She is a significant contributor to the development of a Morphology Rule-Based Spellchecker for Marathi. At present, she is working on a Rule-Based Part-of-Speech Tagger for Marathi. She is participating in the development of a Wordnet for Marathi. She has undertaken to design a course for learning Marathi as a second language. Her lectures on morphology are available on the net. She has presented her work at national and international conferences.