Date: Wed, 28 May 2003 12:59:21 -0400
From: Anne Mahoney
Subject: Review: Hammond (2003) Programming for Linguists: Perl for Language Researchers
Hammond, Michael (2003) Programming for Linguists: Perl for Language Researchers. Blackwell Publishing.
Reviewer: Anne Mahoney, Tufts University
Programming for Linguists is an introduction to computer programming using the Perl language, aimed at people who work with language. Although Hammond seems to envision it as a self-study guide, it would probably work better as a course textbook. It is a generally sound introduction to the language and to the notion of programming a computer.
Perl is a particularly nice language for text processing because of its wealth of pattern-matching and string-handling constructs. It is easy in Perl to say, for example, "find all words that end in a vowel" or "replace every occurrence of the word 'cat' with the word 'feline.'" In addition, Perl is easy for beginners because it is interpreted rather than compiled: one simply writes a program and runs it, without explicitly having to turn it into machine code. As Hammond points out (p. 2), Perl is moreover available, and free, for every type of computer system in current use. I therefore agree that Perl is a good starting point for a linguist with a computational problem.
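As a rough illustration of the two tasks just mentioned, here is a minimal sketch. The sample sentence and the variable names are invented for this example and do not come from Hammond's book.

```perl
use strict;
use warnings;

# Sample text invented for this illustration.
my $text = "The cat sat on a sofa near the piano";

# "Find all words that end in a vowel": a run of word characters whose
# last letter is a, e, i, o, or u, immediately followed by a word boundary.
my @vowel_final = $text =~ /\b(\w*[aeiou])\b/gi;

# "Replace every occurrence of the word 'cat' with the word 'feline'":
# the \b anchors keep words such as "catalog" untouched.
(my $replaced = $text) =~ s/\bcat\b/feline/g;

print "@vowel_final\n";
print "$replaced\n";
```

Each task takes a single statement, which is the kind of economy the paragraph above has in mind.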
Hammond's intended audience is "a naive reader who may know nothing about programming" (p. ix). The reader who already knows another programming language and wants to pick up Perl will be better served by Wall et al. (2000). Hammond's naive reader, however, is expected to understand how to install software, how to use a text editor as distinct from a word processor, and how files and directories work. Although Hammond gives basic instructions on how to invoke an editor, how to invoke the Perl interpreter, and how to display the text of a Perl program, he leaves the reader helpless if anything goes wrong. While the details of using a text editor really are beyond the scope of the book, especially if the reader could be using any of several computing platforms, it is often the case that someone who has never thought about programming before has also never had occasion to use a text editor, change the path (set of directories from which executable programs can automatically be found), or install anything that requires configuration or compilation. Although Hammond sensibly suggests that some of these are "delicate tasks" and "you should seek assistance before attempting them on your own if you've never done this before" (p. 7), it would be useful to provide more concrete information about where such assistance might be available. A college- or university-affiliated linguist may be able to ask the school's "academic technology" group. If no such resource is available, the reader will want a good book on the relevant operating system, perhaps one introducing system administration or development.
The first two chapters introduce Perl and how to create and run a program. Chapters 3-7 cover the core features of the language. Not every bit of Perl syntax is included, only what beginners need to write basic programs. Each chapter includes examples, which are also available from the author's home page, http://www.u.arizona.edu/~hammond/ (p. x), and ends with a group of exercises, many of which are variations on the example programs. The exercises are all relatively easy, a few minutes' to half an hour's work; there are no term projects or research questions here. They provide practice on the language features introduced in the text, and may help the reader figure out what kinds of problems a computer program might help solve. The core chapters introduce, in order, control statements, scalar variables, and arrays; input and output, both at the user's screen and to files; organizing programs into subroutines; regular expressions; substitutions, sorting, and tokenization. Examples grow increasingly elaborate, including an English-to-Pig Latin translator.
Chapter 8, on HTML, talks about using Perl to generate or parse HTML files. Chapter 9 is about CGI, the "Common Gateway Interface" for web programming. Oddly, it does not mention the commonly used CGI module, available from CPAN (the Comprehensive Perl Archive Network, http://www.cpan.org, discussed in appendix D), which includes functions to do several of the things Hammond has the reader do laboriously by hand, notably retrieving the input to a CGI routine.
Four appendices round out the book. Appendix A mentions object-oriented programming as it is done in Perl. While it is appropriate to explain the odd syntax that object-style modules may use (all those double colons and extra pointers), this topic is otherwise rather more advanced than the rest of the book. Appendix B discusses the Perl implementation of the Tk toolkit for building graphical interfaces. Finally, appendix C lists the basic "special variables" built in to Perl, and appendix D gives a few pointers to further information.
Any introductory programming textbook is necessarily its readers' first initiation not only into the mechanics of programming, but also into style. Here Hammond's recommendations and examples are sometimes inappropriate, and often unidiomatic for Perl. For example, on p. 49 he suggests that programmers should avoid "command condensation," by which he means using the output of one routine as an argument to another, or more generally doing more than one operation in a single step. He notes that this technique produces shorter programs, but "it results in far less clarity and should be avoided." (p. 50) The alternative, however, is generally to introduce new variables to hold intermediate results. This is also confusing, as another programmer reading or working on the code some time later must determine what happens to each of those variables, and whether they are still relevant in some later part of the code. In programming languages as in natural languages, greater fluency makes it possible to read longer "sentences" without getting confused. A first-year Latin student might be thoroughly confounded by the sentence of 60-odd words that begins Cicero's speech for Archias the poet, but the experienced Latinist understands its sense, its structure, and its sound effects. Similarly, experienced programmers learn to use increasingly complicated statements. (While in natural language acquisition students can generally read more complex sentences than they can accurately write, in programming language acquisition the sequence is often the reverse, because students rarely get practice in reading existing code. This is unfortunate, however, as working programmers spend much more time reading, documenting, and modifying existing code than they do writing new code from scratch.)
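To make the contrast concrete, here is a small sketch of the same computation written both ways. The input line and all variable names are invented for this illustration; neither version is taken from Hammond's text.

```perl
use strict;
use warnings;

# Input invented for this illustration.
my $line = "  Arma virumque cano  ";

# Step-by-step style: one operation per statement, with a named
# variable holding every intermediate result.
my $lowered  = lc($line);
my @words    = split(' ', $lowered);
my @sorted   = sort(@words);
my $stepwise = join(',', @sorted);

# Condensed style: the output of each routine is passed directly
# as the argument of the next, so no temporaries are left behind.
my $condensed = join(',', sort(split(' ', lc($line))));

print "$stepwise\n";
print "$condensed\n";
```

Both statements produce the same string. The first version leaves three extra variables in scope for a later reader to track, while the second must be read from the inside out.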
Hammond points out that code should have comments (p. 48), but the examples rarely do. He also notes that variable names should give some information about the use of the variable (p. 49); although most of the examples follow this precept, there are occasional one-letter or otherwise neutral names. He characterizes the ubiquitous Perl "anonymous variables" as "one major threat to writing easy-to-read programs" (p. 50), yet anyone who will be working with Perl will run into them almost at once. Anonymous (or "implicit") variables in Perl are supplied from the context when a function requires an argument which it is not given. They include the current main input filehandle, the current record from a file being read, and the current element of an array within a loop.
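The following sketch shows the default variable $_ standing in for the current array element and for the current record of a file. The data and variable names are invented for this illustration, and the "file" is an in-memory string so that the example is self-contained.

```perl
use strict;
use warnings;

# Data invented for this illustration.
my @words = ("ventum", "erat", "ad", "limen");

# Explicit style: the loop variable is named wherever it is used.
my @explicit;
for my $word (@words) {
    push @explicit, uc($word) if $word =~ /m$/;
}

# Implicit style: the loop sets $_ to each element in turn, and both
# the pattern match and uc() fall back on $_ when given no argument.
my @implicit;
for (@words) {
    push @implicit, uc() if /m$/;
}

# The same default applies to input: each record read in the while
# condition lands in $_, and chomp with no argument trims $_.
my $data = "arma\nvirum\ncano\n";
open(my $fh, '<', \$data) or die "Cannot read in-memory text: $!\n";
my @records;
while (<$fh>) {
    chomp;
    push @records, $_;
}
close($fh);

print "@explicit\n@implicit\n@records\n";
```

Whatever one thinks of the style, a reader who has seen only the explicit form will be unable to follow a great deal of existing Perl code.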
Real programs must be prepared for errors, especially if they expect to receive any data from outside. Hammond notes (p. 36) that it's always necessary to check whether a program has successfully opened a file it intends to use, and gives the standard idiom for doing so. The example programs, however, merely complain that there has been an error, without saying what error or on which file; the information necessary to construct an informative error message is relegated to a footnote (p. 45). Once we get to regular expressions, in chapter 6, a series of example programs allow the user to enter a regular expression as input to the program. These expressions are then used without any check on their validity (examples p. 80, 81, 85, etc.).
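A minimal sketch of both checks follows. The helper names open_for_reading and compile_pattern are hypothetical, invented for this illustration; this is one common way to write such checks, not a reproduction of the book's code.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading, or die with a message
# that names the file and includes the system's error text from $!.
sub open_for_reading {
    my ($path) = @_;
    open(my $fh, '<', $path)
        or die "Cannot open '$path' for reading: $!\n";
    return $fh;
}

# Hypothetical helper: compile a user-supplied pattern inside eval, so
# that an invalid pattern yields undef instead of killing the program.
# On failure, $@ holds Perl's description of what was wrong.
sub compile_pattern {
    my ($source) = @_;
    my $regex = eval { qr/$source/ };
    return $regex;
}

my $good = compile_pattern('o.*s');
print "pattern accepted\n" if defined $good;

my $bad = compile_pattern('(unclosed');
print "pattern rejected: $@" unless defined $bad;
```

A program that takes patterns from the user could call something like compile_pattern first and ask again when it returns undef, rather than failing in the middle of a run.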
Although some aspects of programming have changed in the last fifty years or so, the basic principles of good style are much the same as ever. The style manual Kernighan and Plauger (1978) has really not been superseded; its key style rules (including "Avoid temporary variables"; "Use the good features of a language; avoid the bad ones"; "Make sure all variables are initialized before use"; and so on) are as relevant to object-oriented Perl as they were to Fortran and PL/1. Hammond is an experienced enough programmer to know this, as is clear from the programs he makes available on his home page. Students may as well learn good habits from the beginning, rather than being encouraged by the textbook to be sloppy.
The book has little to say about either design or debugging. Any non-trivial program should be sketched out first, before the programmer starts writing code, to be sure nothing major will be overlooked. Simply starting in to write without first thinking about the structure of the program can lead to using the wrong structure. How large a program is non-trivial depends on experience; for the intended readers of this book, the solutions to the exercises are not yet trivial. Moreover, few programs are correct when first written, and Hammond gives no suggestions about how to determine why a program does not do what you think it should. Perl does include several tools: the "use strict" pragma to enforce variable declarations, the "-w" command line switch to enable warnings, and a debugger. New programmers need to be reminded that everyone makes mistakes, that programming mistakes are rarely disastrous unless the program modifies a file or something else beyond its own borders, and that there are systematic ways of finding and fixing the mistakes that will happen.
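As a small sketch of what "use strict" buys a beginner, the program below compiles a deliberately misspelled snippet at run time so that the resulting error can be captured and printed. The variable names and the snippet are invented for this illustration; in ordinary use the pragma simply goes at the top of the program file.

```perl
use strict;     # undeclared variables become compile-time errors
use warnings;   # in-file counterpart of running with: perl -w script.pl

# With strict in force, every variable must be declared with "my".
my $total = 0;
$total += $_ for (1, 2, 3);
print "total: $total\n";

# A snippet with a typo ("$totl" for "$total"). It is held in a
# single-quoted string and compiled with string eval only so that
# this demonstration can report the error instead of stopping.
my $typo_code = 'use strict; my $total = 0; $totl = $total + 1; 1;';
my $compiled  = eval $typo_code;
my $message   = $@;
print "strict caught: $message" unless $compiled;

# The debugger is started from the command line: perl -d script.pl
```

Without the pragma, the misspelled name would silently become a new, empty variable, which is exactly the kind of mistake a new programmer finds hardest to track down.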
The book is in general accurate and well-edited, but I found a few errors or inaccuracies which might lead to a bit of confusion. For example, on page 9, the Perl escape "\n" is described as "an explicit return -- or newline"; they are not the same thing. In a footnote on p. 29, the definition of "prime number" is correct, but the example includes 1, which is incorrect. On p. 57, the scope of a variable defined as a loop index in a "for" or "foreach" statement is the loop itself, not the block or routine that encloses the loop. In the discussion of regular expressions, p. 82-83, the pattern that is intended to contain a backslashed vertical bar is twice printed with a space between the backslash and the bar: for "\|" we have "\ |" instead. The example on p. 87 misses the first match: the pattern /o.*s/ applied to "John loves Mary" will match "ohn loves" rather than merely "oves". In the discussion of sorts, p. 105-106, the text says you specify "an explicit sorting function" as an argument to the standard sort routine. In fact, what you specify is only a comparison function, which tells how to determine if one item comes before another, not an entire sort function. In the discussion of HTML, correct terminology seems to be deliberately avoided: "escape sequence" on p. 129 instead of the standard term "entity," "parameter" on p. 131 instead of the standard term "attribute."
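Two of these points are easy to check directly. The sketch below applies the pattern to the sentence quoted above, then passes a comparison block to sort; the word list and variable names are invented for this illustration.

```perl
use strict;
use warnings;

# The match starts at the leftmost possible position (the "o" in
# "John"), and the greedy .* then extends to the last available "s".
my $sentence = "John loves Mary";
my ($matched) = $sentence =~ /(o.*s)/;
print "matched: $matched\n";

# What sort takes is a comparison, not a sorting algorithm: the block
# says only how two items, $a and $b, are ordered relative to each
# other, and the built-in sort routine does the rest.
my @words     = ("pear", "fig", "banana");
my @by_length = sort { length($a) <=> length($b) } @words;
print "@by_length\n";
```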
The code sample on p. 136-138 does not handle URLs with directories, and will also fail on a relative link from a page whose URL includes a filename. Although I did not test all of the code, this is the only program in the book which had visible errors: a commendable success rate.
The preface suggests that the audience for the book includes not only linguists but "literary theorists" (p. ix, repeated on p. 1). I assume Hammond means "literary scholars" here; while literary theorists are unlikely to need computational tools, many of us who work on literature -- applying theories rather than creating them -- do have occasion to program. Literary scholars might want to write programs for stylometrics, collation and textual criticism, metrical analyses, concordancing, and so on. In addition, knowledge of programming greatly facilitates marking up a text for other uses, for example turning a plain typed or scanned text into TEI XML.
Overall, this is a sound book, with only a few questionable recommendations and very few errors. It would make a good foundation text for an introductory course on computational linguistics or humanities computing, perhaps coupled with something like Hockey (2001) to give the students some ideas about what this new skill will allow them to do.
Hockey, Susan M. (2001). Electronic Texts in the Humanities: Principles and Practice. Oxford.
Kernighan, Brian W., and P. J. Plauger (1978). The Elements of Programming Style, second edition. New York: McGraw-Hill.
Wall, Larry, Tom Christiansen, and Jon Orwant (2000). Programming Perl, third edition. Sebastopol, CA: O'Reilly and Associates.
ABOUT THE REVIEWER:
Anne Mahoney teaches in the department of classics at Tufts University and is the lead programmer at the Perseus Project there. Her research interests include Greek and Latin meter and poetics, ancient drama, and vocabulary.