EDITORS: Renouf, Antoinette; Kehoe, Andrew TITLE: The Changing Face of Corpus Linguistics SERIES: Language and Computers Vol. 55 PUBLISHER: Rodopi YEAR: 2006
Kalyanamalini Sahoo, Zi Corporation, Calgary
The book under review is the edited proceedings of the 24th conference of the International Computer Archive of Modern and Mediaeval English (ICAME), held in Guernsey in May 2003. It contains a brief introduction by the editors followed by 22 contributions from different authors. Each article has an abstract, endnotes and bibliographic references. The editors have organized the articles thematically into six sections: corpus creation, diachronic corpus study, synchronic corpus study, the web as a corpus, corpus linguistics and grammatical theory, and a grammar discussion panel.
The book opens with Sue Blackwell's ''The corpus-user's chorus'', which praises the vitality with which corpus users carry out corpus-based research. This is followed by an introductory chapter in which the editors, Antoinette Renouf and Andrew Kehoe, lay out the key concepts of each chapter and outline the scope of the book. The contributions that follow reflect a fruitful period in the evolution of the field.
Section 1, 'Corpus Creation', starts with Stefan Dollinger's 'Oh Canada! Towards the Corpus of Early Ontario English', in which Dollinger introduces the Corpus of Early Ontario English (CONTE), the first electronic corpus of a variety of early Canadian English. He discusses Ontarian English texts, focusing on the selection of authors and texts, which plays a major role in corpus compilation. He exemplifies the three genres of the corpus (diaries, letters and newspaper texts from 1776 to 1899) and also addresses the problem of transcribing Late Modern English handwriting.
This is followed by Clemens Fritz's 'Favoring Americanisms? vs. before and in Early English in Australia: A corpus-based approach'. Like Dollinger, Fritz deals with the classic theoretical dilemmas of the diachronic corpus linguist: at what point in its history is a language variety to be regarded as representative or fully formed? What is the crucial selectional criterion for corpus compilation: the language of the texts themselves, or the geographical circumstances of the settlers? Fritz works with the Corpus of Oz Early English (COOEE), which contains about two million words. The corpus is structured chronologically and takes into account various registers and text types, including court minutes, parliamentary proceedings, private letters and diaries, reports, memoirs, narratives, legal texts and petitions. One characteristic spelling difference between American English and British English is -or vs. -our in words of the hono(u)r type. Australian English lies between the standards followed by the two other varieties. The author shows that this is not due to an increasing influence of American English on Australian English, but is the result of the historical development from 'English in Australia' to 'Australian English'. He suggests that the education and origin of the author, the semantics of a particular word and the period in which it was written all play a significant role in determining the choice between -or and -our.
The next article is by Ian Lancashire, who discusses the Lexicons of Early Modern English (LEME) compendium of lexicographic and bibliographical material, a resource which builds on the unique information provided by his EMEDD (Early Modern English Dictionaries Database). LEME documents what speakers of English thought about their language over the lifetimes of authors like Sir Thomas More, William Shakespeare, John Milton and John Dryden, covering the period served by the Short-Title and Wing catalogues, from the advent of printing to the early eighteenth century. It lists word-entries alphabetically by lemmatized headword, and then chronologically by lexicon date. The author shows how LEME serves as a source of 'contemporary comments', quotations potentially useful in illustrating word usage.
Introducing the HEDGEHOG database of 18th and 19th century EFL pedagogical and reference works, Manfred Markus discusses 'EFL dictionaries, grammars and language guides from 1700 to 1850: testing a new corpus on points of spoken-ness'. He examines the corpus for features of spoken-ness by analyzing typically spoken types of sound and syllable reduction, morphemic and lexical colloquialisms, and syntactic, semantic, pragmatic and idiomatic features of spoken English.
Antonio Miranda García, Javier Calle Martín, David Moreno Olalla and Gustavo Muñoz González conclude the section with a report on their electronic database of the Old English work ''Apollonius of Tyre'', with reference to the performance of a newly developed concordancing software tool. They present the Old English Concordancer (OEC), a new tool for processing an annotated corpus of Old English, which goes beyond the prototypical operations of similar programmes (lists, indexes, concordances, statistical information, queries, etc.). OEC retrieves general and specific morpho-syntactic information from an annotated OE corpus. It allows lemma-based studies as well as some simple syntactic research at sentence level, solves morphological queries and generates statistical information, including absolute and relative values of items, the distribution of words, lemmas, class and/or accidence [inflection], vocabulary profiles, etc.
Section 2, 'Diachronic Corpus Study', starts with Maurizio Gotti's study of the semantic and functional evolution of the verbs SHALL and WILL from 1350 to the present day. The paper analyses the evolution of the use of SHALL and WILL for the expression of the predictive function, using data drawn from both diachronic and synchronic corpora.
Anneli Meurman-Solin & Päivi Pahta's article 'Circumstantial adverbials in discourse: a synchronic and a diachronic perspective' presents a study of adverbials introduced by the grammaticalised connectives 'seeing' and 'considering', appearing in corpora from 1550. Drawing on electronic corpora that range from historical ones, such as the Helsinki Corpus of Older Scots (HCOS), the Corpus of Scottish Correspondence (CSC) and the Corpus of Early English Medical Writing (CEEM), to present-day ones, such as the British National Corpus (BNC) and the International Corpus of English – Great Britain (ICE-GB), the authors distinguish 'circumstance' from other semantic roles of contingency. They demonstrate how, chiefly because of their thematic potential, circumstantial adverbials can be used in specific functions in genres as different from one another as 'letter' and 'medical treatise'.
Building on Leech's 1966 categorisation of formal features, Caren auf dem Keller discusses 'Changes in textual structures of book advertisements in the ZEN corpus'. She reviews the changes in the textual structures of book advertisements in early modern English newspapers from 1671 to 1791, and provides a detailed overview of the textual components and graphic markers used in the eighteenth century.
Next comes Marianne Hundt's paper '''Curtains like these are selling right in the city of Chicago for $1.50'' – The mediopassive in American 20th-century advertising language'. Studying mediopassives in a corpus of late nineteenth- and twentieth-century American mail order catalogues, Hundt notes an increase in their use, which contradicts a claim by Leech (1966).
Geoffrey Leech & Nicholas Smith discuss grammatical changes in American and British written English in the Brown Corpus (AmE, 1961), the LOB Corpus (BrE, 1961), the Frown Corpus (AmE, 1992) and the FLOB Corpus (BrE, 1991). The authors use the POS-tagged versions of these corpora to track frequency changes in grammatical usage in written English between 1961 and 1991/2 and to compare similar changes in American and British English. They note a significant increase in the use of semi-modals, the present progressive, that-relativization, proper nouns, s-genitives, verbs and negative contractions, and a significant decrease in the use of core modals, the passive voice, wh-relativization and of-genitives. They discuss these changes in terms of general diachronic processes such as colloquialization and Americanization. They also note that the changes in AmE are more extreme than those in BrE.
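A comparison of this kind can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by Leech & Smith: the function names are hypothetical, and any counts passed to them would have to come from the corpora themselves. It normalises raw counts to a rate per million words, computes the percentage change between two corpora, and computes the log-likelihood (G2) statistic that is widely used in corpus linguistics to judge whether a frequency difference is significant.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count / corpus_size * 1_000_000


def percent_change(count_old, size_old, count_new, size_new):
    """Percentage change in normalised frequency from the older to the newer corpus."""
    old = per_million(count_old, size_old)
    new = per_million(count_new, size_new)
    return (new - old) / old * 100


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood (G2) for a single item.

    Expected counts assume the item is equally frequent in both corpora.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2
```

With one degree of freedom, a G2 value above 3.84 corresponds to significance at the 5% level.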
Section 3 contains a fairly representative spread of synchronic studies of present-day English.
Mats Deutschmann explores sociolinguistic variation in the act of apologizing in the spoken part of the British National Corpus (BNC). He investigates 'apology formulae', as exemplified by the lexemes 'afraid', 'apologise', 'apology', 'excuse', 'forgive', 'pardon', 'regret' and 'sorry'. Analyzing more than 3,000 examples of apology forms, he notes that in the BNC young and middle-class speakers favour the use of the apology form, although only minor gender differences in apologizing are apparent. He argues that formulaic politeness is an important linguistic marker of social class and shows that corpus linguistic methodology can be used successfully in socio-pragmatic research.
Göran Kjellmer takes a metalinguistic stance on the problem of semantic and referential ambiguity of certain lexemes in the modern-day English of the Cobuild Direct Corpus. Discussing 'How recent is recent? On overcoming interpretational difficulties', he shows that the words 'recent' and 'recently' are ambiguous between the meanings of 'not long before the present time' and 'not long before the time of the event described'. He illustrates how to resolve the ambiguity and claims that the disambiguation phenomenon sheds some light on the process of textual interpretation and comprehension.
Ute Römer's article 'Looking at looking: Functions and contexts of progressives in spoken English and 'school' English' examines pedagogical texts and their shortcomings, in particular their unnatural representation of present-day verb usage. Studying the use of progressive forms in large collections of spoken British English and in a small corpus of 'spoken-type' texts from German EFL textbooks, she investigates the differences between English as it is used in natural communicative situations and the type of English pupils are confronted with in a foreign language teaching context. Given the discrepancies found between 'real' spoken English and so-called 'school' English, she argues that if linguists, teachers and textbook writers aim to achieve a greater degree of naturalness or authenticity in English language teaching, corpus evidence must be taken more seriously.
Gabriel Ozón maintains the focus on verbs in his detailed study of 'Ditransitives, the Given Before New principle, and textual retrievability: a corpus-based study using ICECUP'. Exploring English double object constructions, he tries to find out if corpus studies can help track and confirm the divergences in the use of these constructions.
Anna-Brita Stenström represents contrastive corpus linguistics with her study of the functions of the Spanish pragmatic marker 'pues' and its English equivalents 'cos' and 'well'. Discussing various functions at the syntactic, discursive and pragmatic levels, she shows that 'well' corresponds to 'pues' in most of its functions, except at the syntactic level, where 'cos' is the only equivalent. Like 'pues', 'well' and 'cos' have been grammaticalized, but 'cos' less so than 'well', which partly explains its fairly restricted use.
Section 4 reflects a recent change in the definition of 'corpus' brought about by the emergence of the World Wide Web. The potential of web-based text is increasingly recognized, as it contains rare, obsolescent and brand new language use not found in existing corpora. Several corpus linguists are engaged in making it a more readily usable source of language data. Three 'Web and Corpus' initiatives are presented here: WebCorp, developed by the Research Unit headed by the editors of this book; WebPhraseCount, developed by Josef Schmied and team; and GlossaNet, developed by Cédrick Fairon. The papers in this section focus on tools for extracting and analysing corpus data.
Barry Morley discusses 'WebCorp: A tool for online linguistic information retrieval and analysis'. The WebCorp project has demonstrated how the Web may be used as a large corpus of text for linguistic research. Morley presents the improved functionality of WebCorp, such as the ability to specify the web domain for a search, the production of internal collocates, alphabetical sorting on left and right context, and concordance filtering.
Andrew Kehoe also reports on WebCorp, in 'Diachronic linguistic analysis on the web with WebCorp', and on the heuristics that he has developed to overcome the obstacle to diachronic study of web text caused by the absence of reliable date-marking. He discusses the dating mechanisms available on the Web and the date query facilities offered by standard web search engines, assessing their usefulness for linguistic analysis and describing how the WebCorp system has been adapted to support diachronic analysis.
In 'New ways of analysing ESL on the WWW with WebCorp and WebPhraseCount', Josef Schmied discusses how software tools can be developed to interface with search engines and help linguists to make use of the World Wide Web in their work. He demonstrates the potential of WebPhraseCount, a tool devised to measure the relative frequency of individual aspects of language use across the varieties of English on the web. He shows how tools like WebCorp and WebPhraseCount can be used by advanced language learners as well as by linguists interested in variation in English world-wide.
In 'I'm like, ''Hey, it works!'': Using GlossaNet to find attestations of the quotative (be) like in English-language newspapers', Cédrick Fairon & John V. Singler discuss another automatic web text retrieval and analysis system, GlossaNet, which downloads selected newspaper web sites and executes complex linguistic queries on them. They describe how GlossaNet monitors newspapers and analyses their texts using the programs and linguistic resources of a corpus parser.
The papers in Section 5, 'Corpus Linguistics and Grammatical Theory', raise some of the theoretical concerns which attest to the maturity of the field, emerging in the light of extensive empirical observation and experience. In 'Corpus linguistics and English reference grammars', Joybrato Mukherjee reviews some major English reference grammars: the new Cambridge Grammar of the English Language (CamGr), the Comprehensive Grammar of the English Language (CGEL) and the Longman Grammar of Spoken and Written English (LGSWE). He discusses major conceptual and methodological differences between these grammars and asks how far they need to be informed by corpus data. He argues that the combination of CGEL and LGSWE provides a first important step towards genuinely corpus-based reference grammars, in that a theoretically eclectic descriptive apparatus of English grammar is complemented by qualitative and quantitative insights from corpus data. He emphasizes that future corpus-based grammars need to be optimized with regard to the transparency of corpus design and corpus analysis and the balance between general and genre-specific language data.
Christian Mair discusses 'Tracking ongoing grammatical change and recent diversification in present-day standard English: the complementary role of small and large corpora'. He stresses the need for closer cooperation between the two traditions in corpus linguistics: (1) a ''small-and-tidy'' approach, which emphasizes detailed philological analysis of clean corpora, and (2) a ''big and messy'' one, which stresses the advantages to be gained from the computer-assisted analysis of vast quantities of dirty data. Taking the get-passive as an example, he argues that there are aspects of this well-studied and fairly common construction which cannot be investigated even in a very large closed corpus such as the BNC, although good results can easily be obtained from the World Wide Web. He emphasizes that, in spite of its obvious shortcomings as a corpus, the Web is an indispensable source of data for the study of infrequent and recent linguistic phenomena.
In the article 'but it will take time…points of view on a lexical grammar of English', Michaela Mahlberg takes time phrases to demonstrate how a 'lexical' grammar can reveal more about the semantics of language in use than a more surface-structural pattern grammar such as that of Hunston and Francis (2000).
The volume is rounded off in Section 6 by Jan Aarts's 'Corpus linguistics, grammar and theory: Report on a panel discussion at the 24th ICAME conference', where the main focus is on the impact of corpora on English reference grammars. The panelists address the characteristics of a reference grammar and the corpus-linguistic methodology appropriate for writing such a grammar, as well as for corpus-linguistic research in general. This chapter is a fitting conclusion to a volume that provides a very perceptive overview of the field of corpus linguistics and grammatical theory.
This edited volume of papers in corpus linguistics deals mainly with corpora of English. The book has been compiled with a lot of thought. In over 400 pages it covers a wide range of topics, from corpus creation to corpus analysis, evaluation and the use of the World Wide Web as a corpus; it spans fields such as EFL, ESL, contrastive studies, grammatical theory, lexicography, semantics and socio-pragmatics; and it discusses tools such as WebCorp, WebPhraseCount, GlossaNet and concordancing software. The volume shows just how diverse and complex corpus-based research can be. The richness of the book can be credited not only to the editors' vast experience and knowledge in selecting and arranging the chapters by theme and style, but also to the authors' presentation of developments in linguistic research. The inclusion of the report on the panel discussion is especially useful in bringing to light the topics currently in focus in the corpus-linguistics research community.
However, there are certain shortcomings as well. Although information can be retrieved through various tools like WebCorp, WebPhraseCount, OEC etc., the volume does not discuss dialectal corpora, whose variation in spelling poses a challenge to such techniques and tools. Frequency profiling, concordancing, n-grams and keyword methods all become unreliable when applied to dialectal corpora, because variant spellings of the same word are treated as different items. Secondly, the volume does not discuss how corpora can be used by language learners themselves, although Schmied touches on the issue lightly and demonstrates how advanced language learners can make use of tools like WebCorp and WebPhraseCount.
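The spelling problem raised above can be illustrated with a minimal Python sketch. The sample sentence, the variant table and the function names are all invented for illustration and are not taken from any of the tools or corpora discussed in the volume. The sketch shows how a plain frequency profile splits one word across its spelling variants, and how a simple normalisation table, one possible remedy among several, merges them again.

```python
import re
from collections import Counter

# Hypothetical variant table: maps a variant spelling to one normalised form.
VARIANTS = {"honor": "honour", "color": "colour", "favor": "favour"}

# Invented sample text containing two spellings of the same word.
SAMPLE = "The honour of the colony; the honor of the court; his Honour spoke."


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_profile(text, variants=None):
    """Count word tokens, optionally mapping variant spellings to one form."""
    tokens = tokenize(text)
    if variants:
        tokens = [variants.get(token, token) for token in tokens]
    return Counter(tokens)


raw = frequency_profile(SAMPLE)
normalised = frequency_profile(SAMPLE, VARIANTS)

print(raw["honour"], raw["honor"])  # counts split across two spellings
print(normalised["honour"])         # counts merged under one form
```

In the raw profile the word is under-counted under either spelling, which is exactly what distorts frequency lists, keyword rankings and n-gram counts; a concordance search for one spelling would likewise miss the other.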
The editors have certainly justified the title of the book, 'The Changing Face of Corpus Linguistics', by acknowledging the recent change in the definition of 'corpus' that has accompanied the availability of texts on the World Wide Web and by emphasizing the maturity of the field, from corpus building to corpus analysis and evaluation. But making use of the text available on the World Wide Web is not that simple. Although the use of the web as a corpus is becoming more and more common, it raises the question of how such large amounts of data can be cleaned, encoded, annotated, stored and shared. In particular, copyright clearance for web data, as for other corpus data, is a vital issue.
Overall, the book is an extremely valuable resource, not only for professional corpus linguists but also for beginners who want to understand the wider field of corpus linguistics, including the historical developments it has undergone. A plus point of the book is the inclusion of many useful figures, tables and URLs that present the research findings to the reader in a concrete manner. The volume is concerned with issues relevant to linguists using corpora to carry out purely linguistic studies, and does not venture far into the allied discipline of natural language processing (NLP).
Hunston, S. and G. Francis (2000). 'Pattern Grammar. A corpus-based approach to the lexical grammar of English'. Amsterdam: Benjamins.
Leech, G.N. (1966). 'English in Advertising'. London: Longman.
ABOUT THE REVIEWER:
Kalyanamalini Sahoo works on computational morphology and South Asian languages for the Zi Corporation, Calgary, Canada. She is primarily interested in computational morphology and syntax.