Date: Mon, 3 Jan 2005 11:47:13 -0600
From: Philip McCarthy
Subject: Lexical Diversity and Language Development: Quantification and Assessment

AUTHORS: Malvern, David D.; Chipere, Ngoni; Richards, Brian J.; Durán, Pilar
TITLE: Lexical Diversity and Language Development
SUBTITLE: Quantification and Assessment
PUBLISHER: Palgrave Macmillan
YEAR: 2004
Philip M. McCarthy, Department of English (Linguistics), and the Institute for Intelligent Systems (IIS), the University of Memphis.
"Lexical Diversity and Language Development: Quantification and Assessment" is, predominantly, a summary of David Malvern and Brian Richards' last seven years' work on the lexical richness measure known as 'D'. The measure D, it is argued, is the most reliable measure of lexical diversity and is particularly useful for measuring short transcripts such as those produced by young children. The book is of interest to researchers working in the areas of language acquisition, English as a second language (ESL), aphasiology, or any other field where the quantification of language deployment (lexical diversity) is a factor.
Lexical diversity, reported throughout this book (despite the title) as lexical richness, is one of the greatest linguistic enigmas -- if a rather unsung one. In brief, we have long known that people of different ages and abilities, and different texts for different purposes, appear to produce significantly different degrees of lexical diversity. No one, for instance, would argue that Shakespeare was less diverse in his vocabulary deployment than a typical five-year-old child. By the same token, we all seem to know intuitively that works by such authors as Joyce or Tolstoy are lexically richer than works by, say, Hemingway or Steinbeck. Despite such appearances, however, no one has yet been able to produce a measure that is capable of scoring such differences meaningfully and accurately: it is as if we were all aware of differences in temperature, had tacitly agreed what constituted heat, and yet had been unable to invent the thermometer. What Malvern et al. are offering us, therefore, is the best attempt yet at a lexical diversity thermometer.
Malvern et al.'s book is organized into four parts. The first, and main part of the book, serves to explain the concept of lexical richness, to outline why lexical richness is such a tricky and elusive measurement, to explain where and how lexical richness measures have been employed, to discuss the various types of lexical richness measures that have been proposed, to show where and why these measures fail to reliably account for lexical richness, and, most importantly, to introduce and discuss the measure known as D. Part II offers a collection of previously published papers that serve to support the authors' claims as to D's reliability. Part III offers a look at other considerations for lexical richness measures, and part IV is a brief overview and conclusion.
The book's review of previously proposed measures of lexical richness is probably the most thorough ever published. The authors begin by explaining the underlying problem of basic lexical richness measures, such as type-token ratio (TTR). In brief, types are the distinct words used in a text, whereas tokens are all the instances of words used in a text. Thus, the sentence "the big dog chased the small dog" has five types and seven tokens; the types "the" and "dog" have two tokens each. The problem, as Malvern et al. explain, is that as a text increases in length the likelihood of new types being introduced decreases. Consequently, the longer a text is, the lower the TTR is likely to be.
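The type-token arithmetic, and the way TTR drops as a text grows, can be illustrated with a few lines of Python. This is only a minimal sketch: the function name type_token_ratio is hypothetical, tokens are obtained by naive whitespace splitting, and no lemmatizing or case folding is attempted.

```python
def type_token_ratio(tokens):
    """Return (number of types, number of tokens, TTR) for a list of word tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    n_types = len(set(tokens))   # distinct word forms
    n_tokens = len(tokens)       # every running word counts
    return n_types, n_tokens, n_types / n_tokens


sentence = "the big dog chased the small dog".split()
print(type_token_ratio(sentence))

# Repeating the same text adds tokens but no new types,
# so the ratio falls even though the vocabulary is unchanged.
print(type_token_ratio(sentence * 3))
```

The second call is an extreme case, since real texts do introduce some new types as they grow, but it shows why raw TTR values from texts of different lengths cannot be compared directly.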
Over the years, numerous alternatives to TTR have been proposed, and Malvern et al. explain each with great clarity. Mathematically manipulated lexical richness scores such as RootTTR (G) and Corrected-TTR (C), logarithmic variations of lexical richness such as R and H, and frequency-based measures such as Z and K are all explained, dissected, and discredited. Malvern et al. show the problems with each measure through theoretical and empirical approaches. The studies of Jarvis (2002) and Tweedie and Baayen (1998) form a good deal of the empirical testing that has shown problems with other lexical richness measures, and where theory rather than empirical evidence discredits the measures, Malvern et al. go to great lengths themselves to explain the problems.
Part I builds towards the most complex method of obtaining a lexical richness score: the "curve fitting" approach of Sichel (1986). It is largely on the basis of this model that Malvern et al. have composed their measure of D. Like Sichel's model, D operates by trying to fit empirical data, derived from TTR scores, to a theoretical TTR curve. D differs from Sichel's model in a number of ways: of primary importance is that D operates by taking hundreds of samples of data and averaging them to fit an ideal TTR curve. Because of the complexity of D, the freely available vocd software (MacWhinney, 2000) is used to make the calculation. The 18 pages dedicated to D's development are highly enlightening and clearly the book's most important section. Although much of what is written in this section has been said in previously published journal articles (Malvern & Richards, 1997; McKee, Malvern & Richards, 2000), the thoroughness and clarity with which the development of D is relayed here make it well worth the read.
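For readers who want a feel for the curve-fitting idea, the following Python sketch may help. It is not the vocd program and should not be read as the authors' implementation. It assumes the TTR curve usually quoted in the vocd literature, TTR = (D/N) * (sqrt(1 + 2N/D) - 1), which is not reproduced in this review; it assumes 100 random samples, drawn without replacement, at each sample size from 35 to 50 tokens; and it substitutes a plain grid search for whatever fitting routine vocd actually uses. All function names are hypothetical.

```python
import math
import random


def ttr(tokens):
    """Type-token ratio of a list of tokens."""
    return len(set(tokens)) / len(tokens)


def model_ttr(n, d):
    """Assumed theoretical TTR for n tokens given diversity parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def mean_sampled_ttr(tokens, n, trials, rng):
    """Average TTR over `trials` random samples of n tokens each."""
    return sum(ttr(rng.sample(tokens, n)) for _ in range(trials)) / trials


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Grid-search the d whose model curve best fits the averaged empirical TTRs."""
    if len(tokens) < max(sizes):
        raise ValueError("transcript is shorter than the largest sample size")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = [(n, mean_sampled_ttr(tokens, n, trials, rng)) for n in sizes]

    def squared_error(d):
        return sum((model_ttr(n, d) - t) ** 2 for n, t in observed)

    candidates = [k / 10 for k in range(1, 2001)]  # d from 0.1 to 200.0
    return min(candidates, key=squared_error)


# Two artificial 100-token "transcripts": 10 types versus 50 types.
low_diversity = [f"w{i % 10}" for i in range(100)]
high_diversity = [f"w{i % 50}" for i in range(100)]
print(estimate_d(low_diversity), estimate_d(high_diversity))
```

Under these assumptions the more varied transcript receives the larger D. Because the samples are random, a different seed gives slightly different values, which is one reason an averaging step is part of the procedure.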
If Part I is the synthesis and expansion of D's genesis (Malvern & Richards, 1997; McKee et al., 2000; Duran, Malvern, Richards, & Chipere, 2004), then Part II is simply the collection and reprinting of more recent papers (Malvern & Richards, 2002; Richards & Malvern, 2004). The four chapters forming Part II provide empirical evidence supporting D and its operating methodology: Chapters 4 and 5 focus on measures of D across different corpora, Chapter 6 offers compelling evidence on the inadequacies of assessment examination testing as opposed to the reliability of results produced by D, and Chapter 7 investigates how variations in lemmatizing the analysis of words can lead to markedly differing results. While these chapters would have been more convincing had there been more work from other researchers, Malvern et al.'s own breadth of experimentation and investigation is quite forceful. Hopefully, more research will soon be underway to support even further these initial findings.
Part III of the book compares lexical richness to other methods for assessing texts: type-type ratios (as opposed to type-token ratios), for example, are considered. Evidence compiled here suggests that investigations into the diversity of parts of speech are also a product of text length and that, once again, D may provide the best answers. In Part III, the authors also expand the investigation of D's reliability into written texts, concluding that the measure effectively discriminates across ages and developmental levels.
Part IV is a bare six-page overview and conclusion. The brevity is somewhat disturbing, as one would imagine that the potential for future research involving lexical richness and D is vast, and that far more testing of D needs to be undertaken. That said, Malvern et al. do take this opportunity to once more drive home the importance of an accurate measure of lexical richness, and they once more go to great pains to show how numerous previous studies using flawed measures of lexical richness have led to results that must now be seriously questioned (for example, see Le Normand & Cohen, 1999; Ouellet, Cohen, Le Normand, & Braun, 2000; and Dalaney-Black et al., 2000). Even studies as recent as Ertmer, Strong, and Sadagopan (2003) use TTR of differing text lengths and quote the questionable "norms" of Templin (1957). Malvern et al. show their clear concern by writing:
These things matter. Much of the research based on flawed measures has significant implications for theory, practice, and policy. It is important therefore that the methodological issues of measuring vocabulary richness are understood and that these confusions are cleared up.
The authors' conclusion also acknowledges a few of D's problems: problems involving topic change and rhetorical styles that confound the curve fitting approach of D. Such problems are not dwelt upon, however, and it would be fair to assume that later analyses of D will be somewhat more critical.
The authors' claim that previous lexical diversity measures are unreliable is well made, as is their evidence for it. It would be hard to believe that following such work any previously published approach could now win favor as the lexical richness measure of choice. Unfortunately, whether D itself is truly capable of carrying the crown is also, as we shall see, less than assured.
As the book is essentially an advertisement for D, rather than a disinterested history of lexical richness, criticism and potential problems with D are less than boldly stated. The main problem for D lies in its limitations caused by the attempt to satisfy its primary aim. As stated above, this aim is to offer a reliable measure of lexical richness for short samples of transcripts. The problem for Malvern et al. is that while other measures of lexical richness are particularly weak at measuring short samples, in establishing a measure that actually does accomplish the task, Malvern et al. appear to have made a measure that is only accurate for short samples. In other words, we must ask whether the baby has been thrown out with the bathwater. A closer look at how D is calculated may show why this is so.
Malvern et al. use the vocd system to sample items from the available data. These samples are between 35 and 50 tokens in length. As such, the minimum transcript size is 50 words; however, Malvern et al. state that they cannot guarantee a reliable lexical richness score for samples this small. Thus, the lower end of reliability for the measure is not made clear -- except to say that it must be above 50 tokens. Similarly, Malvern et al. cannot claim that D is reliable for longer texts. In fact, they place their upper limit at an unspecified "few hundred" tokens. The first question to ask, therefore, is: if D is reliable, then where exactly is it reliable? The transcript borders are not that far apart (greater than 50 tokens but less than a few hundred), yet if the border areas are so murky then researchers would seriously have to wonder whether their data were of a suitable length for D.
The next issue is that Malvern et al. recommend using only stem forms in any lexical analysis so as to reduce the potential for confounding results. They further recommend controls for testing participants so that conversational topics do not diversify greatly. Perhaps most worrying, however, is that they base the primary evidence of empirical testing on a corpus of 32 transcripts from children of just 2;8 years of age (Duran et al., 2004).
Such limited borders of transcript size, based on the production of such young children, from such a small corpus, with only stem forms recommended for fear of confounding D, do not yet secure faith that D is the most reliable (or the most robust) of lexical richness measures.
We can look at Owen and Leonard's (2002) study for supporting concerns over D. In this work, it was concluded that D may not be a reliable measure of lexical richness. Owen and Leonard's transcripts were divided into sample sizes of 100, 250 and 500 tokens, but when these were measured for lexical richness, differing D scores were produced. Jarvis (2002), despite knowing of D, chose to use an earlier D incarnation (see Malvern and Richards, 1997) and was quite critical of the theoretical underpinning of the latest version of D (the one used in this book). The earlier D, used by Jarvis (2002), was quite successful at predicting lexical richness measures; however, the texts used in his study all had fewer than 400 words, and an alternative measure, U, actually performed better. Silverman and Bernstein Ratner (2002), on the other hand, do provide support for D, and Owen and Leonard (2002), while finding fault with D, still mention that it is a promising tool. On the whole, however, while Malvern and his colleagues continue to turn out positive studies on D, the wider community has not yet reached the same level of enthusiasm.
With a relatively limited use for the measure D, it is extremely hard to see how the measure could become the standard for lexical richness. That said, whatever the weaknesses of D, it does appear to be more reliable than any other available measure for texts of shorter length. Researchers would certainly be strongly advised to, at least, include D in their measurements, whatever the text size. However, with data of differing text length, or from different sources, researchers are equally strongly advised to interpret results with great care. While D itself may yet have a number of problems to overcome, while Malvern et al. may well have been a shade generous in their assessment of D, and while this book appears to promise much discussion on lexical diversity but in the end serves more as a commercial for a single measure, the book itself is nonetheless clearly the best (and indeed the only) book on lexical diversity currently available. Its competitors, Yule (1944) and Herdan (1960), have long been out of date, and a more recent offering by Baayen (2001) neither comes close to the expansive history offered by Malvern et al., nor does it focus on diversity so much as it does distribution. The significance of the differences between the two approaches may best be described by stating that neither author sees fit to mention the other's work. In sum, Lexical Diversity and Language Development makes a good attempt to fill a gaping hole in linguistic enquiry; however, whether its proposed product lives up to its authors' faith will only be revealed if greater research in this area (and through this method) is undertaken.
Baayen, R. H. (2001). Word frequency distributions. Kluwer Academic Publishers, Dordrecht.
Dalaney-Black, V., Covington, C., Templin, T., Kershaw, T., Nordstrom-Klee, B., Ager, J., Clark, N., Surendon, A., Martier, S., and Sokol, R. J. (2000). Expressive language development of children exposed to cocaine prenatally: Literature review and report of a prospective cohort study. Journal of Communication Disorders, 33, 463-81.
Ertmer, D. J., Strong, L. M., and Sadagopan, N. (2003). Beginning to communicate after cochlear implantation: Oral language development in a young child. Journal of Speech, Language and Hearing Research, 46, 328-40.
Herdan, G. (1960). Type-Token mathematics: A textbook of mathematical linguistics. The Hague: Mouton.
Jarvis, S. (2002). Short texts, best fitting curves, and new measures of lexical diversity. Language Testing, 19, 1-15.
Le Normand, M. T., and Cohen, H. (1999). The delayed emergence of lexical morphology in preterm children: The case of verbs. Journal of Neurolinguistics, 12, 235-46.
MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed, Vol. 1: Transcription format and programs). Mahwah, NJ: Erlbaum.
McKee, G., Malvern, D. D., and Richards, B. J. (2000). Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing, 15, 323-38.
Malvern, D. D., and Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan and A. Wray (Eds), Evolving models of language: Papers from the Annual Meeting of the British Association of Applied Linguists held at the University of Wales, Swansea, September 1996 (pp. 58-71). Clevedon: Multilingual Matters.
Malvern, D. D., and Richards, B. J. (2002). Investigating accommodation in language proficiency interviews using a new measure of lexical diversity. Language Testing, 19, 85-104.
Ouellet, C., Cohen, H., Le Normand, M. T., and Braun, C. (2000). Asynchronous language acquisition in developmental dysphasia. Brain and Cognition, 43, 352-7.
Owen, A. and Leonard, L. B. (2002). Lexical diversity in the spontaneous speech of children with specific language impairment: Application of VOCD. Journal of Speech, Language and Hearing Research, 45, 927-37.
Richards, B. J. and Malvern, D. D. (2004). Investigating the validity of a new measure of lexical diversity for root and inflected forms. In K. Trott, S. Dobbinson and P. Griffith (Eds), The child language reader (pp. 81-9). London: Routledge.
Sichel, H. S. (1986). Word frequency distributions and type-token characteristics. Mathematical Scientist, 11, 45-72.
Silverman, S. and Bernstein Ratner, N. (2002). Measuring lexical diversity in children who stutter: application of vocd. Journal of Fluency Disorders, 27, 289-304.
Templin, M. (1957). Certain language skills in children. Minneapolis: University of Minnesota Press.
Tweedie, F. J., and Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323-52.
Yule, G. U. (1944). The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
ABOUT THE REVIEWER:
Philip McCarthy moved to the United States in 2001 having spent 11 years as an English teacher in England, Turkey and Japan. In 2003, he graduated with a Master's degree in English (Linguistics) from The University of Memphis, and he is currently conducting research for his Ph.D. in applied linguistics at the same university. Philip's primary work concerns lexical and textual diversity algorithms though he has also published work on child readers and the application of cohesion measures across genres. Philip is currently working as a research assistant on three grants at the Institute for Intelligent Systems at the FedEx Institute for Technology: iSTART, CohMetrix, and the iMAP project. His primary responsibilities are corpus analyses and programming. Philip teaches a variety of linguistics, ESL and composition courses. He is also working on a number of software projects including a phoneme acquisition application, and temporal and structural cohesion algorithms. When not working, Philip coaches one of Memphis's most successful soccer teams: Strangers FC.