AUTHORS: Hoffmann, Sebastian; Evert, Stefan; Smith, Nicholas; Lee, David; Prytz, Ylva Berglund
TITLE: Corpus Linguistics with BNCweb
SUBTITLE: A practical guide
SERIES: English Corpus Linguistics, Volume 6
PUBLISHER: Peter Lang AG
YEAR: 2008
Elizabeth Craig, Department of English, The University of Georgia
''Corpus Linguistics with BNCweb'' is the sixth in a series of titles from Peter Lang devoted to English corpus linguistics. BNC stands for the British National Corpus (http://www.natcorp.ox.ac.uk/), which has been maintained at Lancaster University since the early 1990s and consists of 100 million words of both written (90%) and spoken (10%) British English in over 4000 texts, which are categorized by genre. A large corpus such as the BNC can provide accurate information on both a word's meaning and its usage through the various query tools described explicitly in this detailed guide to the BNCweb.
Emphasized at the outset is the fact that working with a corpus solves two problems for language researchers: how to base conclusions on actual usage rather than on mere introspection and how to consider a large amount of data without the time-consuming task of interviewing individual informants. A corpus then is not about what a researcher believes, but about what many people do with language. Lexical behavior is revealed in patterns that can be quickly and conveniently displayed in concordance lines through the use of sophisticated search tools such as the BNCweb, which is designed for working with words and phrases and their co-occurrence frequencies.
Chapter 1 begins by describing the purpose of each of the subsequent chapters and advises readers to work through the manual while logged on to the BNCweb (http://www.bncweb.info/), although the exhaustive inclusion of screenshots for every sample query discussed renders such full participation unnecessary. The authors delineate both the advantages and the limitations of working with a corpus and with the BNCweb query tool, which is essentially a search engine with various parameter settings.
For example, using the DISTRIBUTION feature to examine the behavior of 'shall' in the spoken portion of the corpus shows that the word co-occurs with either 'I' or 'we' in 90% of cases. Because the BNCweb also allows data to be separated by such sociological features as age, gender, and class, it can further be determined that the declarative forms 'I/we shall' are more commonly used by older speakers, whereas the interrogative forms 'shall I/we' are more commonly used by younger speakers. This may provide a 'snapshot' of language change in progress, indicating that the declarative form is on its way out of the language, or it may simply attest to the fact that younger speakers ask more questions. It would be interesting to compare this British usage of 'shall' to North American usage. The corpus was not catalogued by race of the speakers, an unfortunate oversight in the data collection phase.
Some basic principles of corpus linguistics research such as representativeness and methodology are outlined in Chapter 2. A corpus is a principled collection of text, and no corpus can be truly representative of a language as a whole. The BNC, however, by incorporating a massive number of different text types from an array of social strata strives to present a picture of late 20th century British English usage. It is described by the authors as ''a synchronic and static corpus which consists of a large number of text samples that are heavily marked-up with information about the texts, speakers, and writers, and annotated with linguistic information (e.g. parts of speech).''
Also, the authors make the important point here that corpus linguistics, although concerned only with performance data, does offer a way to expose linguistic competence. The example of complex (multi-word) prepositions such as 'in terms of' and 'in response to' is used to illustrate how frequent phrasal patterns can be indicative of mental chunking of ''indivisible units.'' Constituent boundaries are evidenced by the non-random distribution of filled pauses in the spoken portion of the corpus. The authors demonstrate that ''filled pauses occur very frequently both immediately before and after complex prepositions,'' but rarely in internal positions surrounding the noun. I found this particular example extremely relevant to my own corpus research on noun plus preposition clusters in academic writing.
Chapter 3 is largely cautionary as to how generalizable findings from any corpus, and from the BNC in particular, can be. After describing the BNC in some detail as a balanced reference corpus of 4000 files, the authors explain why they used both the highly accurate (98-99%) CLAWS POS tagset and a smaller, simplified tagset of only 11 tags, which facilitates searches such as one for ''any verb.'' All words in the corpus are also annotated for HEADWORD and LEMMA, using XML format in the underlying source files. A discussion of the significance of type/token ratios is also useful here.
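As a rough illustration of the type/token ratio mentioned above, the following Python sketch divides the number of distinct word forms (types) by the number of running words (tokens). The function name and the sample sentence are invented for this illustration; they are not taken from the book or from BNCweb, and lowercasing plus whitespace splitting is only one of several possible ways to define a type.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word forms divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type/token ratio for an empty text")
    # Assumption: case is ignored, so 'The' and 'the' count as one type.
    normalized = [token.lower() for token in tokens]
    return len(set(normalized)) / len(normalized)


# Invented sample text: 11 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(sample)
print(f"types/tokens = {ttr:.3f}")
```

Because the ratio tends to fall as a text grows longer, values from samples of very different sizes are not directly comparable.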
Chapters 4 and 5 focus on methodology. Chapter 4 is where the reader may want to begin actually sitting at a computer with access to the BNCweb, but screenshots are provided. Several alternative ways to conduct basic searches are covered along with some guidance in how to read and manipulate the display of concordance lines. The default view is of complete sentences, but the user can select the KWIC (''Key Word in Context'') view, which aligns the query item in a fixed, central position to facilitate detection of recurrent language patterns. Query results can also be displayed in random or corpus order and saved in QUERY HISTORY. The inclusion of hands-on exercises at the end of this and other chapters gives the reader a good idea of the kinds of specific questions that can be answered through corpus inquiry and enhances the suitability of this text for classroom instruction.
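The idea behind the KWIC view can be sketched in a few lines of Python. This is only a toy illustration of the display principle, not BNCweb's own implementation; the function names, the sample sentence, and the context width are all invented here.

```python
def kwic(tokens, keyword, width=4):
    """Collect (left context, hit, right context) triples for each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so every hit starts in the same column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {hit}  {right}" for left, hit, right in lines]


# Invented sample text, split on whitespace for simplicity.
tokens = "I shall go now and we shall see what happens but shall I call first".split()
hits = kwic(tokens, "shall", width=3)
for line in format_kwic(hits):
    print(line)
```

Aligning the hit in a fixed column is what lets the eye pick out recurrent words immediately to its left and right.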
Chapter 5, on ''the comparability and reliability of findings,'' emphasizes why normalized frequencies and tests of statistical significance are fundamentally important when comparing corpora or subsections of a corpus, so that differences in frequency are not mistaken for real effects when they are due to chance alone. Raw frequencies are directly comparable only if you are dealing with corpora of the same size. In comparing the normalized frequencies of the discourse marker 'in fact' in the written and spoken subsections of the BNC, the authors demonstrate that it is almost twice as frequent in the spoken data, even though the spoken portion is far smaller than the written portion. The calculation of normalized frequencies is discussed in some detail because the authors contend that it is ''the number one source of error for novices in corpus linguistics.'' In the interest of reliability, there is also a 'Corpus Frequency Wizard' interface on-line for doing statistical calculations at http://sigil.collocations.de/wizard.html.
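The arithmetic of normalization can be shown with a short Python sketch. The subcorpus sizes below are rounded from the 90%/10% split of 100 million words cited earlier in this review, and the raw hit counts are hypothetical numbers invented purely for illustration; they are not the book's figures for 'in fact'. The function name is likewise an assumption of this sketch.

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency into occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Approximate sizes based on the proportions given in this review.
written_size = 90_000_000
spoken_size = 10_000_000

# Hypothetical raw counts, invented for illustration only.
written_hits = 9_000
spoken_hits = 2_000

written_pm = per_million(written_hits, written_size)
spoken_pm = per_million(spoken_hits, spoken_size)

print(f"written: {written_hits} hits, {written_pm:.1f} per million words")
print(f"spoken:  {spoken_hits} hits, {spoken_pm:.1f} per million words")
```

With these invented figures the written subcorpus has more raw hits, yet the item is relatively more frequent in speech once corpus size is taken into account, which is exactly the kind of reversal that trips up novices.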
Chapter 6 outlines the use of ''Simple Query Syntax'' for more sophisticated searches of particular affixes, parts-of-speech, wildcards, and lexico-grammatical patterns using metacharacters.
Chapters 7 and 8 explain how search results can be further manipulated and analyzed for specific purposes. Chapter 7 describes the automated features of DISTRIBUTION and SORT. For example, 'because' is deemed to be ''overused'' in school essays because this is the only written genre showing frequencies comparable to those in the spoken genres of the corpus. Frequency breakdowns further allow the sorting of co-occurrence patterns by type and token.
Chapter 8, in which COLLOCATIONS are discussed in great detail, covers the automated analysis of concordance lines. A collocation is ''the habitual co-occurrence of two (or more) words,'' and ''collocational tendencies can arguably be seen as part of the meaning of a word.'' The concept of semantic prosody is discussed here using the example of the word 'cause', which is shown to have ''an overwhelming tendency to co-occur with events of a negative or unfortunate nature.'' The value of such idiomatic information to non-native speakers is appropriately mentioned here.
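One common way to quantify how ''habitual'' a co-occurrence is compares the observed number of co-occurrences with the number expected if the two words were distributed independently. The Python sketch below computes a mutual information score on that basis. It is a simplified illustration under stated assumptions: the review does not specify which association measures the book covers, the function name is invented, the window size around the node word is ignored, and all counts are hypothetical.

```python
import math


def mutual_information(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Return log2(observed / expected) for a node-collocate pair."""
    if min(cooccurrences, node_freq, collocate_freq, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    # Expected co-occurrences if the two words were independent.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(cooccurrences / expected)


# Hypothetical counts, invented for illustration only.
score = mutual_information(
    cooccurrences=80,
    node_freq=1_000,
    collocate_freq=2_000,
    corpus_size=1_000_000,
)
print(f"MI score: {score:.2f}")
```

A score well above zero indicates that the pair occurs together more often than chance would predict; measures of this kind are known to favor rare words, so they are usually read alongside raw co-occurrence counts.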
Chapter 9 explains how concordance lines may be manually annotated (tagged or classified) depending on the user's query results. Both advantages and disadvantages of categorizing queries are discussed. Users more familiar with Microsoft Excel will appreciate the inclusion of instructions on how to export query results to the spreadsheet program and re-import them.
Chapter 10 provides a detailed guide to creating subcorpora in order to restrict searches to particular text types. All texts are classified according to domain, genre, time period, medium, and the sociological factors mentioned above.
Chapter 11 covers KEYWORD and FREQUENCY LIST features. A keyword is defined as one that occurs ''with significantly greater frequency in one part of the corpus than [in] another.'' A comparison between academic lectures and academic writing confirms the relatively high concentration of verbs in the former and nouns in the latter. Frequency lists are considered ''useful for detecting potentially salient linguistic items within the corpus.'' In written genres, 'the' is found to be the most frequent word (again attesting to the 'nouniness' of more formal registers), and pronouns such as 'I', 'you', and 'it' are the most frequent words in spoken genres. The more nominalized style of academic texts is also indicated by the relatively higher frequencies of prepositions such as 'of', 'in', 'by', and 'with' in this genre, another fact I found particularly supportive of my own corpus research.
Chapter 12 discusses the Corpus Query Processor (CQP) for more advanced searches and experienced users. Also mentioned is the IMS Open Corpus Workbench (http://cwb.sourceforge.net/), which allows for searching any annotated corpus in the proper format.
Chapter 13 concerns the more practical aspects of running BNCweb for network administrators. Topics include administrative access, customizable configuration settings, the cache system of previous searches, and disk-space requirements.
Finally, a brief list of references is provided, noting seminal works in English corpus linguistics by Douglas Biber, Graeme Kennedy, Geoffrey Leech, Charles Meyer, Michael Scott, John Sinclair, and Michael Stubbs. There is also an 11-page glossary of computing terms relevant to corpus inquiry. Four appendices provide all genre classifications for the texts in the corpus, part-of-speech tags (CLAWS), explication of the Simple Query Syntax, and HTML entities for less common characters. A brief index is included as well.
This is a general, introductory text suitable for an undergraduate and/or graduate class in corpus linguistics. It demonstrates how corpus work is very much a balance between what the tools can deliver and how the human user can manipulate those tools to answer very elaborate types of questions about lexico-syntactic patterns.
The greatest attribute of this text is that it is not just a corpus usage manual, but an explication of corpus linguistics theory and methodology. In clear prose and using many illustrative examples, the authors go into great detail in their discussions about conducting various search queries, customizing annotations, contrasting raw and normalized frequencies, and enhancing validity and reliability. Throughout the text, the authors point out that the reader/user should consider intuitively what they may expect to find with particular queries before doing the actual searches. This practice reinforces the value of corpus work in that our assumptions about language usage are frequently found to be in error or in need of some finer revision in light of the search results. Even though the BNCweb provides a wide range of search options, the web-based interface is attractive and quite easy to use.
Some may find it a tedious read, especially the later chapters for advanced users and network administrators, but such is the nature of the beast. This volume keeps it interesting with numerous suggestions about the types of questions that can and cannot be answered through both simple and more complex queries, and the chapter-final exercises inspire innovative approaches to corpus linguistics. The potential for corpus linguistics discoveries about word/phrase frequencies has yet to be fully exploited, especially in the areas of lexicography, sociolinguistics, and second/foreign language teaching. A comparable, user-friendly mechanism for discovering and comparing the patterns of North American English usage would certainly be welcome on this side of the pond.
ABOUT THE REVIEWER
Elizabeth Craig is an experienced ESL/EFL teacher and teacher-trainer with
a master's degree in applied linguistics (TESOL) and a doctorate in second
language acquisition. She was the English Language Fellow to Paraguay in
2006-2007 for the U.S. State Department and is currently teaching English
and linguistics courses at The University of Georgia. Her dissertation
consists of an examination of N+P clusters in a corpus of native-speaker
freshman compositions in an effort to address preposition errors in second
language writing. Dr. Craig is also Supervising On-line Editor of 'English
around the World', a free, weekly newspaper insert for English language
educators in and around Asunción, Paraguay.