Editors: Soudi, Abdelhadi; van den Bosch, Antal P.; Neumann, Günter
Title: Arabic Computational Morphology
Subtitle: Knowledge-based and Empirical Methods
Series Title: Text, Speech and Language Technology
Adel Jebali, Département de Linguistique et de Didactique des Langues,
Université du Québec à Montréal (UQAM), Canada.
This book is a collection of papers on the methods used in Arabic computational
morphology and on the use of these methods in large-scale applications. The
collection covers two main approaches: knowledge-based and empirical. They
differ in how computational linguists supply the linguistic knowledge used in
the analysis. In knowledge-based techniques, the linguist encodes the
linguistic knowledge manually, based on a predefined theory of the
morphological entities. In the empirical approaches, on the other hand, linguistic
knowledge is extracted directly from natural language data by employing machine
learning.
The book's preface is written by Richard Sproat, an eminent linguist working on
computational morphology. The book itself is divided into four parts, containing
fifteen chapters. Part 1 is a three-chapter introduction to Arabic computational
morphology, and specifically to the two methods used in this field:
knowledge-based and empirical. Part 2 contains four chapters and focuses on
knowledge-based methods. Part 3 contains four chapters and deals with empirical
methods. Finally, Part 4's four chapters deal with the integration of Arabic
morphology in two main applications: information retrieval and machine translation.
Chapter 1 is written by the editors to offer a brief roadmap of the book. They
introduce the two approaches widely used in Arabic computational morphology, the
applications related to this field, and Basic Language Resource Kits (BLARK) for
Chapter 2 focuses on the transliteration scheme adopted in this book to
represent Arabic characters. The authors also present guidelines for
pronouncing Arabic using this scheme. The goal is a standard for
transliterating Arabic script that all the authors in this book follow. This
scheme is proposed as a complete system to be widely adopted by the natural
language processing research community working on Arabic, a standard that is
Chapter 3 provides a presentation of the main issues facing Arabic morphological
analysis. Although the relation between modern dialects and Modern Standard
Arabic is a challenging one, Timothy Buckwalter argues that the salient issues
are orthographic. These include the status of non-standard Arabic characters,
the persistent variation in the spelling of some letters, problems related to
the tokenization of Arabic input strings and the absence of annotation for
lexically-determined features, such as gender, number and humanness.
Chapter 4 begins Part 2. It introduces the first of the knowledge-based
approaches, called Syllable-Based Morphology (SBM). In this model, morphological
realizations are defined in terms of their syllable structure. Cahill shows that
this framework accounts for facts from Semitic languages, and particularly
Arabic, in the same way it accounts for facts from European ones.
The second knowledge-based approach to Arabic morphology is presented in the
fifth chapter. In this inheritance-based approach, Al-Najem demonstrates the
benefits of modelling Arabic root-and-pattern morphology in a way that
captures generalizations, dependencies and syncretisms. He further implements
his analysis in DATR, an inheritance network formalism designed for the
representation of natural language lexical information.
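For readers unfamiliar with the term, root-and-pattern morphology can be
illustrated with a minimal sketch. It is not Al-Najem's DATR analysis and uses
no inheritance: it only shows how the consonants of a triliteral root are
slotted into a vocalic pattern. The function name, the digit notation for the
consonant slots and the informal romanization are assumptions made for this
illustration.

```python
def interdigitate(root, pattern):
    """Fill slots 1, 2, 3 of a pattern with the consonants of a triliteral root.

    Hypothetical helper for illustration; digits mark consonant slots and
    every other character of the pattern is copied unchanged.
    """
    if len(root) != 3:
        raise ValueError("this sketch handles triliteral roots only")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


ROOT_KTB = ("k", "t", "b")  # root associated with 'writing'

# Pattern -> informal gloss of the resulting stem
PATTERNS = {
    "1a2a3": "wrote (perfective stem)",
    "1aa2i3": "writer (active participle)",
    "ma12uu3": "written (passive participle)",
    "1i2aa3": "book",
}

if __name__ == "__main__":
    for pattern, gloss in PATTERNS.items():
        print(pattern, interdigitate(ROOT_KTB, pattern), gloss)
```

The same root yields several stems because only the pattern changes, which is
why purely concatenative descriptions fit Arabic poorly.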
In the sixth chapter, Cavalli-Sforza and Soudi present and use another approach,
the Lexeme-Based Morphology of Aronoff (1994) and Beard (1995). In this theory,
priority is given to stems rather than to prefixes and suffixes. The authors
propose a concatenative method to generate Arabic inflected forms even when the
underlying morphological process is not concatenative. They implement this
approach in an extension of the MORPHÉ tool developed by Leavitt (1994).
The last method in this paradigm draws on work in two projects: DIINAR.1 and
the SYSTRAN Arabic-English translator. The approach adopted in those
projects is a stem-based Arabic lexicon with grammar and lexis specifications.
It is presented in Chapter 7 by Dichy and Farghaly. The authors argue that the
most appropriate organization for the storage of information for a language like
Arabic is to use stem-grounded lexical databases in conjunction with entries
associated with grammar and lexis specifications.
The third part of the book focuses on empirical methods and presents four
accounts of data-driven processing models of Arabic morphology. Chapter 8, by
Daya et al., serves as an introduction to these methods. The authors
present a machine learning approach to the problem of extracting the consonantal
roots of Arabic words. This approach combines statistical methods with
linguistic constraints. The accuracy of the predictions thus obtained is
by no means inferior to that of human judgments about the correct roots.
The second account in this paradigm is presented in Chapter 9, by Diab et al.
These authors provide a Support Vector Machine (SVM) based approach to tokenize,
tag and annotate Modern Standard Arabic data. They apply a method that has
proved efficient on English data, and they obtain high scores
on the Arabic Treebank.
Chapters 10 and 11 present two memory-based models whose application data come
from the Arabic Treebank. The first of these models is semi-supervised while the
second is supervised. In the partially supervised machine learning techniques,
largely motivated by first language acquisition, Clark presents a pair of sets
of words to the learner, which must align them. The author's focus is on the
broken plural (a nonconcatenative morphological process). In Chapter 11, Van den
Bosch et al. use annotated corpora to apply memory-based learning to
morphological analysis and part-of-speech tagging of written Arabic.
Chapter 12 begins Part 4. Larkey et al. focus on one possible application for
Arabic computational morphology: information retrieval (henceforth IR). They use
a method called light stemming, i.e. stemming without resorting to morphological
analysis. They argue that this method is more efficient than several stemmers
that are based on morphological analysis.
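The idea of light stemming can be sketched in a few lines. The sketch below is
not the stemmer described by Larkey et al.: the affix lists, the minimum stem
length and the function name are illustrative assumptions, and the words are
written in a Buckwalter-style ASCII transliteration for readability, which is
not necessarily the scheme adopted in the book. The point is only that a fixed
list of frequent prefixes and suffixes is stripped with no lexicon and no
morphological analysis.

```python
# Illustrative affix lists (assumed for this sketch, Buckwalter-style ASCII).
PREFIXES = ("wAl", "bAl", "kAl", "fAl", "Al", "ll", "w")
SUFFIXES = ("hA", "An", "At", "wn", "yn", "yh", "yp", "h", "p", "y")
MIN_STEM = 3  # hypothetical lower bound on what must remain after stripping

# Try longer affixes first so that "wAl" wins over "w".
_PREFIXES = sorted(PREFIXES, key=len, reverse=True)
_SUFFIXES = sorted(SUFFIXES, key=len, reverse=True)


def light_stem(word):
    """Strip at most one listed prefix, then listed suffixes repeatedly."""
    for prefix in _PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    stripped = True
    while stripped:
        stripped = False
        for suffix in _SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


if __name__ == "__main__":
    for w in ("wAlktAb", "mktbp", "mktbAt", "ktAb"):
        print(w, "->", light_stem(w))
```

In this sketch the singular "mktbp" and the plural "mktbAt" are reduced to the
same index term, which is the kind of conflation an IR system needs, although
no root or pattern has been identified.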
Chapter 13 deals with IR as well. Darwish and Oard present a method to adapt
existing Arabic morphological analysis techniques with the aim of making them
suitable for the requirements of IR. They also present a shallow statistical
Arabic morphological analyzer called Sebawai and a light stemmer called Al-Stem.
Both were used by the authors in an IR application to produce Arabic index terms.
The second application to benefit from morphological analysis is Machine
Translation (henceforth MT). In Chapter 14, Habash is mainly concerned with the
representations used by different MT-relevant resources (morphological
analyzers, dictionaries and treebanks). He discusses the usability of these
representations in different MT approaches and argues that the
lexeme-and-feature level of representation is motivated.
The last chapter focuses on MT as well. Guessoum and Zantout investigate the
impact of Arabic Morphological Generation on the quality of MT systems. The
system they chose is Ajeeb, a web-based English-to-Arabic MT system. They
have translated thousands of sentences using this tool and analyzed these
translations. Their analysis reveals that the morphological information captures
various linguistic aspects and affects the quality of the translation.
I think this collection could indeed be a very good starting point for every
researcher who wants to engage in Arabic computational morphology, its
challenges, its theories and its applications. The tripartite division provides
a clear distinction between the main problems, the theoretical issues and the
areas of application. As the editors state with reason, this book is unique in
several respects. I know of no other book with so wide a coverage of both
knowledge-based and empirical methods and of applications as well.
The book offers a general view of the trends of Arabic computational morphology,
but it omits one of the most important approaches. The so-called finite-state
morphology of Beesley (1989, 1990) has contributed greatly to
Arabic computational morphology, yet no paper is devoted to this knowledge-based
approach. The editors mention it in Chapter 1 and present some of its concepts,
but I think that this brief presentation does not do justice to such an
important theory in the history of computational morphology.
Furthermore, redundancy is the main drawback of this book. Each author in each
chapter, with the exception of Chapter 12, is concerned with presenting an
introduction to Arabic morphology. While this could be useful for someone
reading only one paper or some of the papers in isolation, it may be somewhat
tedious for someone who reads all the chapters in the book. It would have been
preferable to devote a chapter to introducing Arabic and specifically Arabic
morphology. Chapter 3 was meant for that purpose, but Buckwalter focuses mainly
on orthographic issues while it is well established that the main issues in
Arabic morphology are linguistic (its nonconcatenative nature, for example, as
stated by McCarthy (1981)).
Some dialects are mentioned in the papers, such as Egyptian and Levantine Arabic
in Chapter 3, which is the only chapter which takes into account the complexity
of the data from both Standard Arabic and modern dialects. In the remainder of
the book, however, the main focus is on Standard Arabic. While this is a natural
choice when dealing with written Arabic, dialects should have been taken into
account to propose more precise linguistic analyses. In addition, what some
authors call ‘Standard Arabic’ is not defined in the papers or in the
introduction. Cahill states: “The data we will cover in this chapter is from
Standard Arabic.” (Chapter 4, page 48). Cahill states further: “We will not address
bi- and quadriliteral roots, even though the latter do occur in Classical
Arabic.” (pages 48-49). This suggests that ‘Standard Arabic’ somehow includes the
variety called ‘Classical Arabic’, but that data from the latter are not taken
into account. Dichy and Farghaly (Chapter 7) state clearly that the variety
studied is ‘Modern Standard Arabic’ (page 116), which means that data from
‘Classical Arabic’ are not discussed. Finally, Larkey et al. (Chapter 12)
declare “The morphological complexity of Arabic (see Chapter 3 of this volume)
makes it particularly difficult to develop natural language processing
applications for Arabic information retrieval.” (page 222). They refer
to Chapter 3, where Buckwalter takes into account both Standard Arabic and the
modern dialects. Nevertheless, their analysis takes only Standard Arabic into
account.
Apart from these issues, there are some minor considerations I would like to
address. I think that a glossary at the end of the book would have been very
useful for someone looking for the definition of a specific notion. In
addition, the index is too short and does not contain the names of the authors
cited in the papers. The bibliographic reference lists follow different
formatting standards from one chapter to another. The editors should have paid
more attention to this aspect. Finally, while most authors gloss Arabic examples
and give a translation too, some of them translate without glossing (see
Chapters 6 and 15 for example).
Aronoff, M. (1994) _Morphology by Itself: Stems and Inflectional Classes_.
Cambridge, MA: MIT Press.
Beard, R. (1995) _Lexeme-Morpheme Base Morphology: A General Theory of
Inflection and Word Formation_. Albany: State University of New York Press.
Beesley, K. R. (1989) Computer Analysis of Arabic Morphology: A Two-Level
Approach with Detours. In _Third Annual Symposium on Arabic Linguistics_. Salt
Lake City: University of Utah. Published as Beesley, 1991.
Beesley, K. R. (1990) Finite-State Description of Arabic Morphology. In
_Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic
and English_. No pagination.
Leavitt, J.R. (1994) MORPHÉ: A Morphological Rule Compiler. Technical Report,
McCarthy, J. (1981) A Prosodic Theory of Nonconcatenative Morphology.
_Linguistic Inquiry_, vol. 12, pp. 373–418.
ABOUT THE REVIEWER
Adel Jebali is currently a lecturer and a PhD student in linguistics at the
Université du Québec à Montréal (UQAM). His research focuses on the
implementation of Arabic argument markers within the HPSG framework using the
LKB system. He is also interested in computational linguistics and more
specifically in Arabic computational morphology and syntax.