Date: Tue, 17 Jun 2003 14:21:40 +0200 From: Hans Paulussen Subject: Programming for Linguists: Perl for Language Researchers
Hammond, Michael, (2003) Programming for Linguists: Perl for Language Researchers, Blackwell Publishing. Announced at http://linguistlist.org/issues/14/14-900.html
Reviewed by Hans Paulussen, University of Leuven (campus Kortrijk)
"Programming for linguists" is an introductory book for linguists who want to learn to program in Perl.
The book consists of nine chapters, which I would group into three sections (although the author does not make this distinction): an introductory section (chapters 1-2), a description of Perl (chapters 3-7), and an extension to this basic introduction, covering HTML and CGI scripting (chapters 8-9).
The introductory section (chapters 1-2) gives a brief introduction to Perl and how to start using Perl. The first chapter explains in a few lines why Perl programming skills would help the linguist in his research tasks. This is followed by a brief note on how to download and install Perl, and a note on how to read the book. The second chapter explains how to edit and run your first script, the ubiquitous print command: "Hello world".
The second section (chapters 3-7) gives a carefully documented description of the basics of the Perl programming language. Basic control structures and variables are introduced in chapter 3. At the end of the chapter, the knowledge gained is illustrated in a linguistic experimental task involving the automatic generation of 'nonsense' syllables (p. 25).
Chapter 4 covers the subject of input and output, array operations and randomizing. The whole set of new elements (input/output, array, randomization) is then illustrated in a program (expprog.pl) which shows how to collect experimental data (p. 43).
Chapter 5 deals with the subject of subroutines and modules. It is a bit longer than the previous chapters and requires close attention from the beginner programmer. The topics discussed also include the anonymous variable, variable scope, arguments to subroutines, multidimensional arrays, and the Exporter module to round off the modularity of Perl. All the new features are integrated in new versions of the sample program expprog.pl introduced in the previous chapter.
Regular expressions form the main topic of chapter 6. All the main features of pattern matching are explained, and the regular expressions are illustrated in a "pig latin" generator (p. 89): a program which swaps syllables (similar to the French verlan), taking into account some syllabic constraints. A sentence splitter (p. 90) is shown as another linguistic example.
Chapter 7 deals with all the Perl tools used for text and string manipulation. This includes, for example, string replacement based on regular expressions, conversion of strings to arrays (split and join) and sorting. The power of hashes is introduced, and the chapter ends with two linguistic illustrations: a concordancing program (p. 114) and a bigram selector (p. 118).
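To give an idea of what a hash-based bigram count involves, here is a minimal sketch, written in Python rather than Perl purely for illustration. It is not the book's bigram program; the function name count_bigrams and the sample sentence are assumptions made up for this example.

```python
from collections import Counter

def count_bigrams(text: str) -> Counter:
    """Count adjacent word pairs, using a dictionary keyed by the pair."""
    # Naive tokenisation: lowercase and split on whitespace only.
    words = text.lower().split()
    # zip pairs every word with its right-hand neighbour.
    return Counter(zip(words, words[1:]))

counts = count_bigrams("the cat saw the cat and the dog")
for (first, second), frequency in counts.most_common(3):
    print(first, second, frequency)
```

The dictionary plays the role that a hash plays in Perl: the key is the word pair and the value is its frequency.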
The third and final section covers two chapters which go beyond the topic of a basic Perl introduction, but which nevertheless show some interesting features of Perl. An introduction to HTML is given in chapter 8, which also shows how to retrieve and parse web pages, illustrated in a simplified websearch script (p. 136). Chapter 9 is an introduction to CGI scripting, which is used to create web pages dynamically. The chapter finishes with a further development of the syllable experiment introduced in previous chapters, which is now run over the web.
The book finishes with four appendices. The first two appendices can be considered additional chapters on an extra, rather intricate feature of Perl: references. Appendix A is a brief yet well-documented introduction to object-oriented Perl programming. Appendix B is a general introduction to the Perl Tk module, which is a library used for building a graphical user interface (GUI). Appendix C lists the most important "special" variables, and appendix D gives some hints about how to find further information on Perl.
Perl is a well-known scripting language, and many books have been written on this topic, but most of them are aimed at readers with some background in programming. Even Schwartz & Christiansen (1997), which is an excellent introduction to Perl, covers topics which are not always that transparent to beginning programmers, in particular those with a background in the humanities. Moreover, programming books in general use samples situated mainly in the field of mathematics.
Many books have been written on Perl and CGI scripting and web tracking and logging, all domains for computer specialists. Only recently have introductory books on Perl been written for other domains, mainly for researchers in the field of bioinformatics (cf. Cross 2001, Dwyer 2002, Tisdall 2001), where a knowledge of mathematics and logic is a prerequisite.
Hammond's book is the first work, as far as I know, which focuses on the specific needs of language researchers. It is surprising that an introductory book for language researchers has appeared so late, since the extensive support of regular expressions makes Perl a perfect programming language for linguistic applications.
The main asset of this introductory book is the gradual introduction of all the basic features of Perl and the use of language samples. Seldom have I seen an introductory programming book which explains in such minute detail the steps needed to understand what variables and program loops are. Every new feature is consolidated in a number of exercises at the end of each chapter. One can see that Hammond has many years of experience in teaching basic programming in general, but also, and especially, to learners with a background in the humanities. Very instructive is the use of small modifications in the proposed scripts, which gradually introduce new features in a very digestible way. Each script is explained in detail.
Apart from the table on p. 31, showing a rather obscure nomenclature for console input and output, the whole chapter on input and output gives a clear overview of the different aspects of input and output facilities in Perl. On the other hand, the exact input and output for numerous examples is often not shown at all. In a number of cases, it is exactly the input and output files which would give the extra information needed to understand what the script is about.
You can download the sample programs from the author's website in three versions: Unix, Windows and Macintosh. However, there are now and then some minor practical problems. Some of the downloaded exercises do not match their paper version, especially in chapter 5. A small number of exercises are missing, but since the programs are not very lengthy at all, this cannot be a big problem. The book says that answers to selected even-numbered exercises are also available on the website, but I have not found them.
Often special attention is given to environment-specific features, including Unix, Windows and Macintosh, the last one being quite different from the others, thus posing some problems for beginners. However, a proper explanation of the basic text file formats is missing. In fact, an introduction to end-of-line conventions is a sine qua non in text manipulation, especially for the linguistic researcher.
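The end-of-line issue can be made concrete with a small sketch, given in Python rather than Perl purely for illustration. Unix files end lines with LF, Windows files with CR LF, and classic Macintosh files with CR. The function name normalize_newlines and the sample byte strings are assumptions invented for this example.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first, so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# The same two-line text as it would be stored on each platform.
samples = {
    "unix": b"one\ntwo\n",
    "windows": b"one\r\ntwo\r\n",
    "classic_mac": b"one\rtwo\r",
}
normalized = {name: normalize_newlines(raw) for name, raw in samples.items()}
for name, raw in normalized.items():
    print(name, raw)
```

A script that assumes one convention and receives a file in another will typically see stray carriage returns at the end of each word, or the whole file as a single line.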
Because of the detailed explanations, one might be tempted to use the book as a self-study course book. However, there are a number of flaws which mean that the beginner will often need a helping hand from a teacher. For example, no script starts with the warning flag (-w) and the "use strict" command, which are the minimum requirements for defensive programming. Especially in the case of beginners, the lack of these two items can result in many mistakes. The chapters on HTML and CGI are brief, clear introductions to the subject, but I doubt whether a beginner can get any hands-on experience without the help of a web guru.
Hammond limits himself to introducing only the most common structures, which is fair enough in the context of a beginners' handbook of Perl. However, to call some structures merely redundant (note 8, p. 29) is an exaggeration. Some of these redundant constructions are very handy instruments which render the programming code more transparent, as soon as one gets used to them. If one wants to discard all redundant elements, one could just as well start by leaving out the ubiquitous shorthand operations. Another example is the function each() which is of limited use for a small hash (cf. p. 114), but which is very useful when reading a hash table of thousands of elements, which are likely to occur in real world applications.
In a world where localisation and internationalisation have become important topics, it is strange that no attention has been paid to multilingual text processing. The only multilingual program sample is replace5.pl (p. 97) which converts a number of diacritical characters to their base form. Is this once again a typical English-oriented approach to analysing language (English being one of the few languages using few or no diacritics)? There are more and more language researchers who need to know how to handle language-specific features efficiently.
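For readers unfamiliar with the task, reducing accented characters to their base form can be sketched as follows, in Python rather than Perl and purely for illustration. This is not how replace5.pl works; it relies on Unicode decomposition instead, and the function name strip_diacritics is an assumption made up for this example.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Reduce accented letters to their base form via Unicode decomposition."""
    # NFD splits e.g. "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("café naïve Ångström"))
```

Letters that Unicode does not decompose, such as the Danish "ø", are left untouched by this approach, which already hints at how language-specific such processing quickly becomes.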
Since regexes are considered one of the most useful features of Perl, why then is the chapter on regexes so short? Why not show how to expand regexes based on the concatenation of previously defined regexes? One could develop for example a regex procedure to detect French words in an English text.
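What is meant by building regexes out of previously defined regexes can be sketched as follows, in Python rather than Perl and purely for illustration. The names VOWEL, CONSONANT, SYLLABLE, WORD and is_cv_word are assumptions invented for this example, and the CV(C) syllable is a deliberately crude toy, not a linguistic claim.

```python
import re

# Small building blocks, each defined once.
VOWEL = r"[aeiou]"
CONSONANT = r"[b-df-hj-np-tv-z]"
# A toy CV(C) syllable assembled from the two classes above.
SYLLABLE = rf"{CONSONANT}{VOWEL}(?:{CONSONANT})?"
# A word is one or more such syllables.
WORD = re.compile(rf"(?:{SYLLABLE})+")

def is_cv_word(word: str) -> bool:
    """True if the whole word can be parsed as a sequence of toy syllables."""
    return WORD.fullmatch(word.lower()) is not None

for candidate in ["banana", "window", "strip", "apple"]:
    print(candidate, is_cv_word(candidate))
```

Each layer stays readable because it is written in terms of the layer below it, and changing the definition of a vowel in one place updates every pattern that depends on it.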
Hammond is a true master in explaining complicated programming features in a simple way. A case in point is the introduction to references (appendix A) which shows a transparency seldom seen in any other introductory book on Perl. This didactic approach, however, conceals the intricacy of programming in the field of linguistics. The language examples are too general to be considered full-fledged scripts. They only scratch the surface of the complexities involved in language research and language technology. Linguistic analysis is simplified and thus mystified. It is simple to render some linguistic features into a regex (such as vowels and verbal inflection, cf. p. 85), but what about the intricacies involved in determining syllables? The tokenisation algorithm as illustrated in the sentence splitter script (chapter 6) does not even take account of abbreviations (any dot is considered the end of a sentence) nor the possibility of multiple dots, to name just the basic problems in sentence splitting. On the other hand, the concordance program concord3.pl (p. 116) shows one possible way to deal with capitalized nouns.
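The two sentence-splitting problems mentioned above (abbreviations and multiple dots) can be addressed, at least roughly, along the following lines. This is a minimal sketch in Python rather than Perl, not the book's script; the abbreviation list and the function name split_sentences are assumptions made up for this example.

```python
import re

# A tiny, hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "cf.", "p.", "e.g.", "i.e."}
# One or more closing punctuation marks, so "..." counts as a single boundary.
SENTENCE_END = re.compile(r"[.!?]+$")

def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if not SENTENCE_END.search(token):
            continue  # no closing punctuation on this token
        if token.lower() in ABBREVIATIONS:
            continue  # the dot belongs to the abbreviation
        sentences.append(" ".join(current))
        current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

for sentence in split_sentences(
    "Dr. Smith arrived late... He apologised. See p. 90 for details!"
):
    print(sentence)
```

Even this version fails when an abbreviation happens to end a sentence, or when a quotation mark follows the final dot, which illustrates how quickly the real problem outgrows a short script.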
To conclude, this book is a nice introduction to Perl programming, aimed at researchers with an interest in language. There are quite a number of program examples and exercises that deal with language, but the scripts presented are too rudimentary to be used for real world applications. They nevertheless give a general idea of what one can do with text and string manipulation. The main asset of this handbook is the gradual introduction of the most important features of Perl, with a transparency which appeals to beginners with no background in mathematics or logic. As such, this book can be considered an excellent complement to the well-known introductory Perl book of Schwartz & Christiansen (1997).
Cross, David (2001), "Data Munging with Perl", Manning Publications Company.
Dwyer, Rex A. (2002), "Genomic Perl: From Bioinformatics Basics to Working Code", Cambridge University Press.
Schwartz, Randal L. & Tom Christiansen (1997), "Learning Perl", O'Reilly.
Tisdall, James (2001), "Beginning Perl for Bioinformatics", O'Reilly.
ABOUT THE REVIEWER:
Hans Paulussen is a translator and computational linguist. He has worked in the field of language teaching and text corpora research. He wrote a PhD on the contrastive analysis of prepositions in English, French and Dutch, within the cognitive linguistic framework. This empirical work was based on a parallel aligned corpus specifically compiled for that purpose. He has also been involved in the computational linguistic support of the development of the first corpus-based Arabic-Dutch dictionary. He is presently involved in a CALL research project at the University of Leuven (campus Kortrijk) and he teaches an introductory postgraduate course in corpus linguistics at the University of Lille.