Date: Fri, 31 Dec 2004 18:33:21 +0100 From: Rolf Kreyer Subject: Advances in Corpus Linguistics: Papers from ICAME 23
EDITOR: Aijmer, Karin; Altenberg, Bengt TITLE: Advances in Corpus Linguistics SUBTITLE: Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23) Göteborg 22-26 May 2002 SERIES: Language and Computers Vol. 49 PUBLISHER: Rodopi YEAR: 2004
Rolf Kreyer, University of Bonn
The volume under review is a collection of papers from the 23rd International Conference on English Language Research on Computerized Corpora and contains a total of 22 articles on 419 pages. The papers cover a wide range of topics, which according to the editors ''illustrate clearly the diversity of research that is characteristic of corpus linguistics today'' (1). The contributions are subsumed under six ''broad -- and inevitably overlapping -- categories'' (1):
* The role of corpora in linguistic research
* Exploring lexis, grammar and semantics
* Discourse and pragmatics
* Language change and language development
* Cross-linguistic studies
* Software development
The following synopsis will give a summary of the key points of each of the articles. The review will conclude with a critical evaluation.
The first section, 'The role of corpora in linguistic research', starts with an article by Michael Halliday, who explores the spoken language corpus as a foundation for grammatical theory. Quantitative research into spoken language, in his view, will increase our understanding not only of spoken language itself but also of language as a whole. He argues that it is in spoken language that ''systemic patterns are established and maintained [...], instantial patterns are all the time being created [...] and the instantial can become systemic.'' (25) For instance, patterns as described by Hunston/Francis (2000) will, Halliday claims, most probably develop and change in speech. Here also 'non-standard' patterns like the ones below are found (19): (1) It's been going to've been being taken out for a long time. [of a package left on the back seat of the car] (2) All the system was somewhat disorganized, because of not being sitting in the front of the screen. [cf. because I wasn't sitting ...].
Such instances should not be dismissed as errors but rather as ''productive innovations which pass unnoticed in speech but have not (yet) found their way into the written language'' (19). The transcription of spoken corpora, however, is not without problems: it is well-known that meaningful prosodic features are often not represented, but in Halliday's view the problem of over-transcribing is more serious. For instance, only in transcribed speech are homophonous forms such as 'icicle' and 'eye sickle' overtly distinguishable; thus ''writing systems mask the indeterminacy in the spoken language'' (16). The analysis of spoken corpora might also prove challenging, due to what Halliday calls ''the lexicogrammatical bind'' (21) of corpus research. Obviously, lexical phenomena are more accessible by corpus linguistic methods than grammatical ones. Spoken language, however, shows a high level of grammatical intricacy and favours grammatical systems as opposed to written language, where meaning tends to be conveyed through lexis (cf. Halliday 1989). Written language therefore is inherently more prone to corpus linguistic analysis than spoken language. So, ''especially in relation to a spoken language corpus, there is work to be done to discover ways of designing a corpus for the use of grammarians''. (23)
John Sinclair examines ''the roles of intuition and annotation in corpus linguistics'' (41), thereby trying to clarify the stance of corpus-driven as opposed to corpus-based linguists. For Sinclair the ''elusive faculty'' (41) of intuition seems to have a dual status. On the one hand, intuition has been shown not to be trustworthy: for the most part invented sentences are not of the kind that are usually found in a corpus, and the findings that the corpus yields often differ drastically from what has been expected. On the other hand, the corpus-driven linguist has ''a great respect for intuition, and cannot work without it'' (56), since it organises corpus evidence; as Sinclair puts it: ''[t]here is no escape from intuition if you have command of the language you are investigating'' (47). However, while the corpus-based linguist ''allows his intuition to overrule his corpus data and hence gives primacy to the former'' (40), the corpus-driven linguist tries to keep intuition at bay and is aware of its limitations at all times.
Similar discrepancies seem to divide corpus-based and corpus-driven researchers on the topic of annotation: while it seems indispensable to the former, it is rather obfuscating to the latter. Sinclair's scepticism towards annotation is due to two reasons. Firstly, the language models that underlie most of the tagging programmes are usually pre-corpus models. Unfortunately, these models are not made subject to close scrutiny on the basis of corpus evidence but, according to Sinclair, it is usually assumed ''that the models are basically correct, and [that ...] there is no need to open up the whole complexity of language theory and description for the sake of some minor blemishes'' (52). The second argument against annotation is linked to the first one: since pre-corpus language models are inadequate for the description of corpus data, human intervention is necessary. As a consequence, the process of annotation is not entirely replicable, thereby failing the first test of scientific method. However, despite his conclusion that ''corpus-driven linguists are not likely to have much use for annotation'' (56), Sinclair concedes that it ''has its place in application, where quick results are needed and rough-and-ready ones will suffice'' (56).
Starting off with a short discussion of Chomsky's well-known three levels of explanatory, descriptive and observational adequacy (1964: 62-3), Leech argues that ''a more realistic account of the main strata of investigation in linguistics'' (62) could be arrived at by the following hierarchy:
THEORY: formal [and functional] characterization or explanation of language as a phenomenon of the human mind and of society. DESCRIPTION: formal [and functional] characterization of a given language, in terms of theory. DATA COLLECTION: collection of observations which a description, and ultimately a theory, has to account for [e.g. corpora] (62).
In order to explore the relation between the above levels and in order ''to argue against the common assumption that corpus linguistics is concerned with 'mere data collection' or 'mere description''' (62), Leech describes two short-term diachronic case studies on modal auxiliaries and grammatical changes relating to colloquialization. Both studies are based on the Brown, LOB, Frown and FLOB corpora and two spoken mini-corpora extracted from the SEU and the ICE-GB corpora. Leech emphasizes that the description of corpus data does not necessarily lead to true statements about a language as such. The corpus linguist always has ''to bear in mind some hazardous assumptions which can be made in moving from data description to language'' (70), for instance, the well-known issues of representativeness and of interpreting statistical significance.
This, however, should not lead to discarding the corpus linguistic enterprise as such. Rather, these hazards should be regarded as a reminder that corpus-linguistic results usually are provisional and that ''further corroborating evidence as well as means of increasing accuracy and reliability'' (71) need to be sought. Finally, in moving from the level of description to the level of theory, the researcher will have to find explanations for empirical data: for instance, the decline of modals between the 1960s and the 1990s, which Leech describes, might be accounted for by language-internal factors, such as processes of grammaticalization, or by external factors, such as colloquialization, democratization or Americanization. On the whole, then, ''corpus linguistics is not purely observational or descriptive in its goals, but also has theoretical implications'' (61).
Section 2, 'Exploring lexis, grammar and semantics', starts with an article by Joybrato Mukherjee who investigates the place of corpus data in a usage-based cognitive grammar. The author tries to show ''that corpus linguistics and cognitive linguistics are not at all mutually exclusive but can fruitfully complement each other in developing a genuinely usage-based model of [...] speakers' knowledge of the underlying language system'' (96). In particular, the author uses an analysis of the ditransitive verb GIVE in ICE-GB to illustrate how the lexical and constructional networks of cognitive grammar (e.g. Langacker 1999) can be refined by incorporating corpus data. Firstly, corpora provide frequencies, which in turn yield insights into the strength of the different links between a particular lexical item and the constructions in which it can occur. In the case of GIVE, for instance, it is found that 38% of all tokens occur in the pattern 'GIVE + Oi + Od'. The second most frequent pattern, 'GIVE + Od', accounts for 23.2% of the data. These patterns are supposed to be more deeply entrenched in the cognitive system than the other less frequent patterns of GIVE. In addition, corpus data also provide insights into the context-dependent principles that are at work in the selection of a particular pattern. The author, for instance, finds that the pattern 'GIVE + Od' is used in those cases only where the recipient is either retrievable from the context or where the specification of the recipient is irrelevant. Thus, Mukherjee claims, ''corpus-linguistic methodology obviously opens up new and promising perspectives in cognitive linguistics'' (97).
Caroline David puts 'putting verbs' to the test of corpora. In particular, she attempts to outline a new typology of 'putting verbs' by taking into account quantitative data from the corpora Brown, Frown, LOB, FLOB and the BNC. The first part of her paper is concerned with PUT, SET, PLACE, and LAY. The author finds that PUT is the most frequent of the four and is more likely to occur in idiomatic structures than the other three. This the author counts as evidence for ''generalness of meaning'' (102). The other three, in contrast, seem to be associated with a particular way of putting, namely a rather careful way. PUT, therefore, ''is considered the prototypical verb of the general process of putting with little additional information regarding the way things are displaced'' (105) while the other three ''are classified together as a kind of manner of putting'' (105). The second part of the paper concerns verbs of the SPRAY/LOAD class, namely LOAD, COIL and FILL. Here, the author is mainly concerned with syntactic alternations of the following kind: (3) I loaded school trunks on to the car. (4) I loaded the car with school trunks.
The author claims that in example (3) ''the default interpretation is that all the trunks are loaded, irrespective of whether the car is 'full' or not'' (107). Constructions of type (3), therefore, usually take a 'quantification' reading and are thus similar to constructions with COIL. In the second case, however, a qualification, namely that the car is now full, is emphasized. Constructions of type (4) thus resemble those with FILL-verbs, such as CLOAK, FLOOD, or SOAK.
Peter Willemse explores the relationship of 'esphoric' reference, i.e. cataphoric reference within the same nominal group, and pseudo-definite NPs, i.e. NPs that ''are formally definite but in fact realize presenting rather than presuming reference'' (117). Willemse focuses on pseudo-definite NPs in unmarked existential constructions, since their semantics entail that the postverbal NP is indefinite. A formally definite postverbal NP will therefore always have 'pseudo'-definite referential status, as the NP ''the usual sleazy reasons for that'' in the following (122): (5) The Woody Allen-Mia Farrow breakup [...] seems to have everyone's attention. There are the usual sleazy reasons for that, of course - the visceral thrill of seeing the extremely private couple's dirt in the street, etc.
On the basis of 200 tokens from the Bank-of-English corpus, the author tries to find a ''motivation of the use of the definite article in [...] the pseudo-definite NPs'' (130). Willemse provides two possible explanations: (i) The postverbal NP may have 'dual reference', i.e. it may refer to a type, which is usually hearer-old, and a token, which is usually hearer-new. In example (5) above, for instance, the specific reasons for the public fascination are introduced into the discourse and, therefore, hearer-new. However, the general type of reason that explains such attention is assumed to be known to the hearer, i.e. hearer-old. (ii) The other explanation lies in what Willemse calls ''a relation of [...] 'forward bridging' within the NP'' (131). In example (6) below, the definite article in 'the shrunken head' is licensed through the fact that ''the identity of its referent is recoverable by virtue of an experiential connection with the entity introduced by the second NP: a head is a part of (the body of) a boy'' (123). In such cases, as in (6), the definite article is therefore motivated by esphoric reference (123). (6) In a room outside the court he talked with the French prosecuting counsel, [...]. There was the shrunken head of a Polish boy.
In his article 'Why ''an angel rides in the whirlwind and directs the storm''', Jonathan Charteris-Black analyses the use of metaphor in political corpora. On the basis of the 51 Inaugural Addresses of the American Presidents and the political manifestos of the Labour and the Conservative party from 1945 to 1997, the author explores the similarities and the differences between types of American and British political discourse. With regard to similarities, for instance, Charteris-Black finds that POLITICS IS CONFLICT is the most frequently used metaphor in the two corpora. This conflict either shows in action for ''abstract social goals that are positively evaluated'' (138) or in action against ''social phenomena that are negatively evaluated'' (138), as shown in these examples (138, 139): (7) While continuing to defend and respect the absolute right of individual conscience .... (8) [...] we intend to continue our fight against all form of social injustice.
Perhaps more interesting are the differences between the two corpora. For instance, the author finds that the fire metaphor is only used in the American corpus. This may be due to the fact that the fire metaphor was used by George Washington in the context of liberty. Apparently, ''the metaphorical link between fire and liberty has become a source of intertextual reference in presidential addresses'' (143). On the other hand, plant metaphors are only attested in the British manifestos. Again, the author suggests a historical-cultural explanation: ''the British passion for gardening lead[s] to the positive associations of words such as 'growth' and 'nurture''' (149). Charteris-Black also reports on metaphor borrowing. The conceptual metaphor POLITICS IS RELIGION is well represented in the American corpus but is only found in the more recent British manifestos; this metaphor seems to have found its way from American into British political discourse.
Peter Tan, Vincent Ooi and Andy Chan, in their article on ''Signalling spokenness in personal advertisements on the Web'', discuss the use of English as a second language in this register by South East Asians. Within this speech community, ''English is often relegated to the position of a 'neutral' and 'transactional' (as opposed to 'interactional') language where 'affect' (emotion) is played down'' (151). The question now arises as to how English language resources are employed for informal, private and personal means in personal advertisements (PA) by South East Asians. In particular, the authors want to analyse ''to what extent [...] resources of spoken discourse [are] relied on in PA'' (163). To this end, they compare the frequencies of augmenters (e.g. 'very', 'a lot', or 'really') and mitigators (e.g. 'somewhat', 'a bit' or 'only') in a corpus of South East Asian adverts with their usage in a spoken and a written subcorpus of ICE-SIN (the Singapore component of the International Corpus of English). On the basis of this data, the authors find ''that personal advertisers tend to make use of features of spokenness'' (163). However, it would be ''premature to say at this stage that Netspeak in South East Asia is closely associated with the norms of spoken language although it seems to be an important contributor to the norms associated with personal advertisements'' (163).
''Textual colligation: a special kind of lexical priming'' by Michael Hoey opens up the third section of the proceedings, ''Discourse and Pragmatics''. Hoey advocates a view that regards ''textual relationships (interactive, linear, cohesive, hierarchical and structural) as dependent upon and created by the lexis of the language in a manner not exhausted by the demands of the individual text'' (173), thereby claiming a vital role for corpus linguistic methods and findings in text linguistic research. In analogy to the term 'colligation', which captures the interdependencies of lexis and syntax, the author employs the term 'textual colligation' to denote the ''positive and negative preferences of a lexical item with regard to [...] textual features'' (174) such as participation in cohesive chains or occurrence as part of the theme in a Theme-Rheme relation.
An analysis of a 100 million word, predominantly Guardian newspaper corpus shows, for instance, that the lexical items 'army', 'baby', or 'political' occur as members of cohesive chains, whereas 'afterwards', 'best' or 'particularly' seem to show no tendency to form such chains, i.e. these lexical items have a negative preference with regard to the textual feature 'cohesion'. Words such as 'reason' or 'option', on the other hand, are neutral in this respect; they may occur in cohesive chains but if so, the chains are usually short. With regard to the feature 'occurrence as theme', Hoey finds that in 75% of 294 instances 'sixty' occurs as part of the theme in a Theme-Rheme relation; interestingly, orthography seems to be relevant here, since '60' does not show this tendency. The preferences of lexical items for particular textual features should not be analysed in isolation from each other. The simultaneous occurrence of a lexical item in two or more textual features will lead to highly interesting generalizations: for instance, an item that ''has a positive preference for both Theme and cohesive chains [...] will inevitably have a positive preference for Thematic Progression'' (177). Moreover, textual-colligation analysis need not stop at the word level. The lexical items within a phrase may share certain preferences for textual features and thus create a particular 'colligational prosody'.
Hilde Hasselgard explores ''adverbials in IT-cleft constructions'' on the basis of data drawn from the British component of the International Corpus of English (ICE-GB). In particular, Hasselgard focuses on two aspects: (1) the information structural role of the adverbial, and (2) the discourse function of the whole IT-cleft construction. As to the first point, the author reports a marked difference in the information structure of clefts with adverbials as opposed to the other kinds of cleft constructions: ''IT clefts with adverbials occur by far most commonly with cleft clauses conveying new information (86%), while the cleft clauses of IT-clefts in general seem to be divided about equally between given and new information'' (200). The author's discussion of the discourse functions of adverbial-IT-clefts largely capitalizes on Johansson's (2002) fourfold taxonomy, which distinguishes contrast, topic launching, topic linking and summative functions, all of which Hasselgard finds attested in her data, too. However, she adds a further function, namely thematization, which serves ''to make extra clear what is to be understood as the theme and the rheme of a sentence'' (204), as in the following example (204): (9) It is with much regret that I find it necessary to send you a copy of the enclosed letter which is self explanatory.
According to Hasselgard, the writer here ''wants to give thematic prominence to the regret he/she feels'' (204). In addition, she suggests that thematization might be regarded as superordinate to Johansson's four discourse functions. For instance, if the focused constituent in a cleft construction is especially marked off as the theme, this may also serve to mark the theme as contrastive or it may be employed to introduce a new topic into the discourse.
Section 3 concludes with Bernard De Clerck's article ''on the pragmatic functions of 'let's' utterances'' in the spoken part of ICE-GB. Prototypically, these utterances ''have the directive illocutionary force of a proposal for joint action [... where] the speaker commits herself to an action and seeks the addressee's agreement'' (217). However, 'let's' utterances may also assume speaker or hearer orientation. In the first case, the construction may be used to secure the addressee's agreement to an action that the speaker is currently carrying out. In the case of hearer-orientation, the utterance may ''camouflage an authoritative speech act as a collaborative one'' (219). In both cases, the idea of joint action recedes into the background. Most frequently, 'let's' is used in a conversational function, namely to influence the flow of conversation. In this case ''they are more like announcements of a topical shift that round off the present topic and introduce the next step in the talk'' (225). This function involves interesting sociolinguistic consequences: 'let's' as a conversational imperative ''seem[s] to be part of the repertoire of [...] interactionally more powerful speakers, who present the conversation as a joint enterprise, but actually try to control it by restricting the hearer's influence to a minimum'' (226). A minor function of the construction is to present the speaker's evaluations or feelings at an interpersonal level, as in example (10) below, where the speaker evaluates the hearer's behaviour. Again, the prototypical aspect of 'proposal for joint action' is no longer present in such cases (228): (10) A: God you really know how to put someone down don't you B: Oh let's not get touchy touchy.
The fourth section on ''Language change and language development'' starts off with a paper by Thomas Kohnen, who provides a diachronic case study of English directives, thereby addressing a number of ''methodological problems in corpus-based historical pragmatics''. Such problems, for instance, include what Kohnen calls 'pragmatic false friends', i.e. ''constructions which, against a contemporary background, suggest a wrong pragmatic interpretation'' (239). Example (11) (taken from Shakespeare's 'The Merry Wives of Windsor') is a case in point (239): (11) Ford: Blesse you sir. Fal.: And you sir: would you speake with me?
In this case, the utterance 'would you speake with me?' should not be understood as a request but as ''a real question which serves to identify the man who wanted to talk to Falstaff'' (240). Modern English does not allow this interpretation. Another methodological issue, not surprisingly, is the lack of sufficient data. This may be balanced by concentrating on individual text types or genres and their functional profiles. On the whole, Kohnen argues for what he calls 'structured eclecticism': diachronic pragmatic analysis should be based on ''a deliberate selection of typical patterns which we trace by way of representative analysis throughout the history of English'' (238). Furthermore, ''a diachronic analysis of speech acts should be embedded in a reasonably stable functional profile of text types'' (242). This method is put into practice in a diachronic analysis of English directives. The author finds that, on the whole, there seems to be a move away from the explicit and direct forms of directives (e.g. imperatives) to more indirect alternatives, such as interrogative realisations. As an underlying motivation for this development Kohnen regards ''the growing importance of considerations of politeness'' (246) which entails a reduction of possibly face-threatening speech acts.
Liselotte Brems discusses ''degrees of delexicalization and grammaticalization'' in measure nouns (MNs) such as 'bunch(es) of' or 'heap(s) of', and attempts to clarify ''the status of the MNs [...] within their respective NPs'' (250). In particular, two analyses seem appropriate: the MN may either function as the head of the bi-nominal NP of which it is a part, as in (12) (250), or it may be regarded as a quantifier of the second NP within the construction, as in (13) (251). Other instances, such as (14) (250), are not easily decided on. (12) The fox, unable to reach a bunch of grapes that hangs too high, decides that they were sour anyway. (13) But then, when I needed one, there were a load of excuses as to why I couldn't borrow one. (14) We still have to move loads of furniture and other stuff.
The general structural status of MNs, therefore, is far from clear. As an answer to this problem Brems suggests regarding ''the developments observed in MN constructions [...] as a case of ongoing delexicalization and grammaticalization in MNs'' (251). In particular, delexicalization is understood as a precursor to grammaticalization, i.e. the ''gradual broadening of collocational scatter [... and the] loosening of the collocational requirements imposed by the MN'' (256) paves the way for ''the re-interpretation of the MN as a quantifier'' (256). Her corpus study of MNs reveals different degrees of synchronic grammaticalization. For instance, 'heaps of' is used as a quantifier in 65.6% of all cases, whereas only 4.7% of the tokens of the semantically related 'piles of' occur in the same function. According to Brems, these findings can be explained by the fact that 'pile' is associated with a ''feature of verticality and constructional solidity'' (261) which blocks processes of semantic generalization. On the other hand, 'heap' lends itself more easily to delexicalization (and subsequent grammaticalization) since it is ''in itself more vague and simply profiles an undifferentiated mass'' (261).
Göran Kjellmer investigates the use of 'yourself' as ''a general-purpose emphatic-reflexive''. The traditional grammar view of the personal pronoun 'you' and its reflexive counterparts 'yourself' and 'yourselves' is fixed and stable. However, Kjellmer comes up with a large number of 'deviant' uses of 'yourself' in the CobuildDirect and the BNC corpora which seem to imply ''an ongoing extension of its semantic range, and consequently an increasing lack of precision'' (270). In (15) below, for example, 'yourself' unambiguously refers to plurals only (272): (15) Well can you sort that out amongst yourself [...]
Kjellmer reports on even more deviant (and also rarer) cases, where the plural that the reflexive pronoun refers to is not limited to the second person (273): (16) [...] we were told to use physical resources like deep breathing and actually making yourself sit down and making yourself go floppy.
Apparently, 'yourself' has ''become more general in its application'' (273). Furthermore, similar to 'you' as a substitute for the missing generic personal pronoun in English, 'yourself' also seems to be used generically. A most illustrative example is given in (17) where 'yourself' refers back to generic 'one' (274): (17) [...] in an engineering course one concerns yourself only with how to apply and harness phenomena
A possible final stage of the changing use of 'yourself', in Kjellmer's view, might be witnessed in the following examples (275): (18) I like boxing because it means I can defend yourself if you ever needed to (19) Pete's gone down to the shop and got yourself a bottle of whisky
Here, the reflexive pronoun is used specifically with reference to non-second-person entities. On the whole, Kjellmer argues that 'yourself' might be regarded as ''a general-purpose emphatic reflexive pronoun'' (175) which ''has become a close reflexive pronoun copy of [... 'you'] by getting rid of constraining features in its later stages of development'' (275).
Clive Souter explores ''aspects of spoken vocabulary development in the Polytechnic of Wales Corpus of Children's English [POW]''. Although the corpus is fairly small (roughly 61,000 words) and was originally compiled to study syntactic and semantic development in children from 6 to 12, Souter argues that ''it does have great value for researchers into child language development, TEFL [Teaching English as a Foreign Language] syllabus designers and course-book authors'' (280) and sets out to show the potential of POW for the study of children's vocabulary development. However, as Souter points out, results have to be interpreted with great care due to limitations of corpus size and corpus compilation. For instance, the data show that the active vocabulary of children in the corpus increases by only around 50 words per year, which, however, might be an artifact due to ''the limited activities used to elicit speech from the children'' (279), such as Lego building or conversation with adults about games or TV. The author also reports on a difference in frequency of the most common affirmative or negative expressions (e.g. 'yeah', 'yes', 'no' or 'can't') among boys and girls: boys, in general, seem to prefer positives while girls more frequently use negatives. Again, the interpretation of the results is difficult. They might indicate a general trend but the frequencies might also be explained as a consequence of corpus compilation - the author concedes: ''[p]erhaps Lego building elicits more positive responses from boys and more negative responses from girls'' (285). More interesting is the finding that the vocabulary of boys and girls used in similar contexts only partly overlaps. No more than half of the words boys and girls use are used by both sexes, whereas the other half seems to be sex-specific.
This feature, as Souter points out, is worth more investigation and then might indeed turn out to be ''promising and perhaps disturbing, from the point of view of syllabus and course material designers'' (288).
In the last paper of section 4, Roumiana Blagoeva describes the use of ''demonstrative reference as a cohesive device in advanced learner writing''. In particular, she is interested in ''the under/overuse of the demonstratives 'this', 'that' and their plural variants 'these', 'those''' (298) by advanced Bulgarian learners of English. As a basis for comparison she chooses the Bulgarian sub-corpus of the International Corpus of Learner English, the British component of the Louvain Corpus of Native English essays, a sub-corpus of the BNC from the domains 'Applied Science', 'Social Science' and 'World Affairs', and a collection of Bulgarian texts similar to the BNC sub-corpus. Her analysis shows, for instance, that 'near'-demonstratives, i.e. 'this' and 'these', are underused by Bulgarian learners when compared to British students while at the same time the 'remote' types of demonstratives are overrepresented. This cannot be accounted for by L1 interference, since the Bulgarian equivalents of 'that' and 'those' show a very low frequency in the Bulgarian corpus. Rather, the author suggests, a reason seems to lie in the teaching material that is used in Bulgaria: although Bulgarian, similarly to English, distinguishes near and remote demonstratives, the distinction between the English counterparts seems to be overlooked in teaching materials: ''learners are left with the impression [...] that both 'this' and 'that' [...] could be used indiscriminately to point to any word, phrase or longer stretch of text'' (304). Interestingly, both Bulgarian and British students show a high proportion of 'this' and 'these' in comparison to the BNC sub-corpus. Blagoeva suggests that this might be due to ''an influence on learner production by the nature of the text type'' (305).
Furthermore, the author contends that learners of a foreign language at some point stop learning and mainly seem to be focused on remedying remaining mistakes in the field of lexis and syntax rather than developing skills to arrive at ''a more target-like way of producing coherent texts'' (306), which, of course, would include a native-like use of demonstratives.
In ''Translation as semantic mirrors'', the first paper of section 5, Helge Dyvik describes a method for identifying wordnet relations (e.g. synonymy or hyponymy) on the basis of parallel corpora. The basic assumption underlying Dyvik's approach is that ''semantically closely related words ought to have strongly overlapping sets of translations, and words with wide meanings ought to have a larger number of translations than words with narrow meanings'' (311). The results he presents are extracted manually from the 2.6-million-word English-Norwegian Parallel Corpus (ENPC). Searching for a particular Norwegian or English word form in the corpus will yield all the original sentences that contain this word form and its translations into English or Norwegian, respectively. From this set of translations, a human analyser can then compile a list of possible translations of the word form in question. These lists form the basis for further analyses. The information they contain may, for instance, be used to distinguish different senses of a particular word. The Norwegian word 'tak', for example, is translated into 'roof', 'ceiling', 'cover', 'grip', 'hold'. These five word forms are translated into various Norwegian words, which form a number of sets which all contain 'tak' but also partially intersect. The translations for English 'roof' and 'ceiling', for instance, in addition to 'tak' also overlap in Norwegian 'hvelving'. Similarly, translations for 'grip' and 'hold' share Norwegian 'tak' and 'grep'. The respective translation sets, however, do not intersect. One can thus conclude that Norwegian 'tak' has at least two distinct senses, namely 'roof/ceiling' and 'grip/hold'. After different senses have been individuated, semantic fields can be established on the basis of overlaps of translation sets. 'Beautiful', for instance, translates into 'vakker' and 'nydelig'. These, in turn, in addition to 'beautiful' translate into 'cute' and 'cute'/'delicious', respectively.
It follows that 'beautiful', 'cute' and 'delicious' are part of the same semantic field. Further procedures assign lexical features to individual entries and eventually lead to lattices that reveal hyperonym and hyponym relations among senses, and even identify subsenses and near-synonyms of each individual sense.
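The sense-splitting step described above can be illustrated with a minimal Python sketch. It is not Dyvik's own procedure, which according to the review was carried out manually on the ENPC; it only assumes that two translations of a word belong to the same sense when their own translation sets share some word other than the source word. The toy data are restricted to the items mentioned in the review ('cover' is omitted because no translation set is reported for it), and the names `INVERSE` and `split_senses` are hypothetical.

```python
from itertools import combinations

# Toy translation sets, limited to what the review reports.
# 'cover' is left out because no translation set is given for it.
INVERSE = {
    "roof": {"tak", "hvelving"},
    "ceiling": {"tak", "hvelving"},
    "grip": {"tak", "grep"},
    "hold": {"tak", "grep"},
}


def split_senses(source, inverse):
    """Group the translations of `source` into candidate senses.

    Two groups are merged when their translation sets share at least
    one word other than `source` itself (an assumed criterion).
    """
    groups = [{word} for word in sorted(inverse)]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(groups, 2):
            words_a = set.union(*(inverse[w] for w in a))
            words_b = set.union(*(inverse[w] for w in b))
            if (words_a & words_b) - {source}:
                # Overlap beyond the source word: treat as one sense.
                groups.remove(a)
                groups.remove(b)
                groups.append(a | b)
                merged = True
                break
    return sorted(sorted(group) for group in groups)


senses = split_senses("tak", INVERSE)
print(senses)
```

On these toy sets, 'roof' and 'ceiling' are grouped through 'hvelving', and 'grip' and 'hold' through 'grep', while the two groups stay apart because they share nothing but 'tak'.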
Åke Viberg analyses ''physical contact verbs in English and Swedish from the perspective of crosslinguistic lexicology''. On the basis of data drawn from the English-Swedish Parallel Corpus (ESPC), the author presents an extensive and highly detailed comparison of the English verbs 'strike', 'hit' and 'beat' with their primary Swedish translation 'slå'. The author finds several interesting differences between the items at issue. 'Strike', 'hit' and 'beat' in their prototypical usage as a ''bodily action verb'', for instance, most frequently take human beings as objects. This, however, only seems to be a tendency, ''whereas it is more or less a requirement of Swedish 'slå''' (332). Furthermore, the Swedish verb occurs with a human subject in 70% of all instances. The English counterparts show a mixed picture: while 'beat' with 72% of human subjects is similar to 'slå', 'strike' and 'hit' are not (41% and 48%, respectively). With these verbs ''natural disasters, economic crises, wars and diseases'' (334) seem to be frequent subjects. In Swedish, the same subjects usually co-occur with a different verb, namely 'drabba', which could roughly be translated as 'afflict'. Similarly, if the subject is a projectile (e.g. a bullet), English 'hit' is the most frequent verb, whereas Swedish again does not use 'slå' but 'träffa', meaning 'hit a target'. It follows that generally, 'slå' ''is grounded more firmly in sensorimotoric experience of limb movement'' (349) which prototypically makes use of arm and hand. For the English counterparts the sensorimotoric aspect does not seem to be as central.
Anna-Lena Fredriksson aims ''to discuss different approaches to the notion of theme and to show how parallel corpora can successfully be used for cross-linguistic analyses of theme'' (353). The author starts off with a description of theme and rheme in Systemic Functional Grammar (SFG) as laid out in Halliday (1994). However, SFG ''has a strong orientation towards English which is a potential problem for using it in other languages'' (354). One problem arises out of the V2 requirement in Swedish, since this leads to a different distribution of clause elements with initial non-subject, as example (20) illustrates (EO = English Original; ST = Swedish Translation; LIT = Literal Translation) (361, adapted): (20) (a) EO: Surely I'd been freed from those painful memories long ago. (b) ST: Visst hade jag för länge sedan blivit befriad från de där plågsamma minnena. LIT: Surely had I for long ago become freed from those painful memories.
In (20a) 'surely' and 'I' make up the theme. In the Swedish translation, due to the V2 constraint, the two thematic components are separated by the auxiliary verb. The question that arises is where to locate the theme-rheme transition point. Fredriksson suggests a split theme, which ''(in a declarative clause) can be defined as including all elements preceding the finite verb plus the postverbal subject'' (365). Thus, the thematic elements 'surely' and 'I' of the English original can also be treated as thematic in the Swedish translation. Furthermore, the author questions Halliday's notion of 'topical theme'. In his approach, the thematic part of the clause contains one and only one experiential element, the topical theme, so ''everything that follows the topical theme constitutes the rheme'' (356). However, Fredriksson allows for several experiential elements in the theme. Accordingly, ''[t]he concept 'topical theme' has no function in [... her] approach'' (366). This modified understanding of the concept 'theme', in her view, is equally applicable to English and to Swedish data.
In their paper ''Welcoming children, pets and guests'' Elena Tognini Bonelli and Elena Manca search for translationally equivalent units in two comparable corpora, namely Italian texts that advertise 'Agriturismo' and English material that promotes 'Farmhouse Holidays'. The English corpus indicates that the notion of 'welcome' is central to the whole genre: a total of 324 instances of this word are attested in the data. Surprisingly, the 'prima facie' Italian equivalent 'benvenuto' and its related forms occur only 4 times in the Italian corpus. Translation equivalence, therefore, does not seem to be located at the word level. Rather, translation should always consider the context in which a particular word occurs. The authors therefore suggest a three-stage model of successive contextualisation for identifying translationally equivalent units. First, a collocational profile of the word to be translated should be established. For the word 'welcome' the corpus yields as collocates 'children', 'pets'/'dogs' and 'visitors'/'guests'. In a second step, the translator should try to find 'prima facie' translational equivalents for the respective collocates. In the current example these would be 'bambini', 'animali' and 'ospiti'. The final step would then try to identify collocates of these equivalents in L2. For instance, to find a suitable translation for 'welcome' in the context of 'guests' or 'visitors', the translator should compare the concordances of 'welcome' + 'guests'/'visitors' with the concordance of 'ospiti'. In the English corpus, the nouns at issue are found to occur regularly in the structure 'Vb BE + 'welcome' + 'to'-infinitive' ('guests are welcome to relax'). The concordance of 'ospiti', on the other hand, shows that the Italian equivalent to this structure is the Italian modal 'potere' and its inflected forms, as in 'gli ospiti potranno fruire'. Obviously then, translation equivalents are often not found at the word level.
Rather, translation should aim at ''identifying and comparing syntagmatic units that share certain contextual feature with the view of identifying a similar function'' (383).
In the last article of section 5, Natalie Kübler reports on her experience with ''using WebCorp in the classroom for building specialized dictionaries''. As the title already indicates, Kübler pursued pedagogical objectives that are different from language teaching, namely ''teaching students how to extract lexical and syntactic information to build customised dictionaries for machine translation (MT) in languages for specific purposes'' (387). The particular register envisaged in this experiment was computer science, more specifically, the most recent user manuals of the operating system Linux (HOWTOs). In this particular field of computer science, new terms are coined almost regularly. Therefore, existing parallel corpora of HOWTOs, although providing useful information for translation of the more recent HOWTOs, ''tend to become insufficient or slightly obsolete, even though they can be regularly updated'' (395). The web, on the other hand, will contain most of the neologisms in this field. Accordingly, accessing the internet via WebCorp may be a useful way of balancing the shortcomings of finite corpora. The term 'buffer', for instance, occurs as part of five different compounds in the parallel corpus of English and French HOWTOs. However, terms that were coined after the translation of the HOWTOs will not be included. Here WebCorp can help to supplement findings from finite corpora, since French computer scientists often use English terms together with their French translations: the search for 'buffer' in the French domain (.fr) yields two more recent compounds together with the appropriate French translations, namely 'buffer overflow' and 'heap buffer overflow'. Accordingly, Kübler concludes that ''WebCorp [...] is ideal for complementing and updating the information extracted from time-bound specialised finite corpora'' (398).
The final section, 'Software development', consists of an article by Antoinette Renouf, Andrew Kehoe and David Mezquiriz, who discuss ''some issues in extracting linguistic information from the web''. The article provides insights into the WebCorp project, which was launched at the University of Liverpool at the end of 2000 in order to investigate ''the usability of the Web as a linguistic resource, and [... to identify and address] some of the problems of retrieval and analysis that it presents'' (404). In particular, the authors describe issues that are pertinent in regard to the WebCorp tool, which allows the internet to be used as a corpus. Issues discussed include the fact that search engines are constantly changing, thereby reducing the comparability of results: ''corpus linguists [...] each access different pages, and different pages at each time. Thus the linguistic sample is not constant'' (409). Furthermore, Web text may not easily be transformed into a format that meets linguistic data requirements. In this context, the authors mention the problem of providing sentence-length concordances: since Web text is untagged, only ''few clues exist at surface level as to sentence boundary'' (410). The automatic retrieval of sentences therefore poses considerable problems. Nevertheless, WebCorp provides a number of useful ways to exploit the web linguistically. For instance, searches with wildcards serve to search the web for phrases. More elaborate searches may be used to discover new or unconventional forms: the string '[he|she|I] text* [him|her|me]', for example, ''reveals that 'text' not only functions as a verb but as an uninflected past tense verb'' (413), as in (21) below: (21) The next time I text him, he didn't reply (413). In addition, web information can be exploited by the WebCorp tool to refine searches.
This, for example, includes the specification of text types or genre via the Open Directory or Yahoo, or a limitation to certain domains, such as '.net' or '.ac.uk'. Domains may also be combined by Boolean operators. The next steps that the authors sketch out lead one to hope that eventually the WebCorp tool will turn out to be a highly useful means of opening up the web for corpus linguistic research.
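To make the pattern search for 'text' more concrete, the following Python sketch renders the query string quoted above as a regular expression. This is only an approximation under stated assumptions: the exact semantics of WebCorp's own query syntax are not described in the review, so the regular expression is an interpretation rather than the tool's implementation. The first sample sentence is example (21); the other two are invented for illustration.

```python
import re

# Assumed reading of '[he|she|I] text* [him|her|me]':
# a subject pronoun, any word starting with 'text', an object pronoun.
PATTERN = re.compile(r"\b(he|she|I) (text\w*) (him|her|me)\b", re.IGNORECASE)

samples = [
    "The next time I text him, he didn't reply",  # example (21)
    "She texted me yesterday",                    # invented sentence
    "I sent him a text",                          # invented non-match
]

hits = []
for sentence in samples:
    match = PATTERN.search(sentence)
    if match:
        hits.append(match.group(0))

print(hits)
```

The pattern alone cannot tell an uninflected past tense from a present tense form; as in example (21), that reading depends on the surrounding context ('he didn't reply'), which a human analyst still has to inspect.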
Karin Aijmer and Bengt Altenberg have edited an excellent selection of papers. The articles (with perhaps two or three exceptions) are of a very high quality and highly stimulating, and show impressively the relevance of corpus linguistic research to linguistics in general. Furthermore, the diversity of topics covered will make this volume an interesting read for linguists of almost any area: from functionalists to cognitive linguists, from synchrony to diachrony, from syntacticians to text linguists and even translators.
Also, the variety of corpora analysed by the contributors shows the wealth of material which corpus linguistics nowadays has at its disposal: in addition to the use of standard monolingual and parallel corpora, some contributors quite convincingly show how smaller special purpose corpora can be exploited: the HOWTOs corpus used by Kübler and the 'agriturismo' and 'farmhouse holidays' corpora by Tognini Bonelli and Manca are just two examples. In this context, mention must also be made of attempts to open up the worldwide web as a possible source of data; its relevance for future corpus linguistics, in my view, can hardly be overestimated. On the whole, the large variety of data reported on in this volume leaves no doubt as to the flexibility of corpus linguistic approaches in regard to data-mining.
A further point concerns the relationship between data and theory and the role of corpus linguistics, which ''have been debated ever since the rise of corpus linguistics'' (2). This debate has also found its way into the present volume. A number of extremely important issues are discussed by renowned linguists such as Michael Halliday, John Sinclair, and Geoffrey Leech. The mere fact that aspects like the role of intuition in corpus linguistics or the relation of corpus-based and corpus-driven approaches are still debated clearly shows the strong dedication of corpus linguists to theoretical and fundamental aspects of their approach. This is also mirrored in a number of papers that advance far beyond the word-crunching and case-studying that corpus linguistics often (and not always unfoundedly) has been accused of: Joybrato Mukherjee with his ''from-corpus- to cognition-approach'' (85), for instance, impressively shows how corpus data can refine cognitive models and thus lead to a more appropriate description of the speaker's linguistic knowledge. Michael Hoey, through his concept of 'textual colligation', establishes a ''theoretical relationship between lexis and text-linguistics'' (171). Anna-Lena Fredriksson uses contrastive corpus data to refine the theoretical notion of 'theme'. Even if theoretical aspects are not an explicit focus, the papers usually give convincing (theoretical) explanations for their findings and, where appropriate, discuss implications for the model of the speaker's competence or for the abstract language system.
Nonetheless, critical remarks should be made on two individual contributions. The first concerns Tan, Ooi and Chiang's conclusion on the use of augmenters in personal advertisements (PA) as opposed to spoken (SP) or written (WR) texts. I find it difficult to agree with the authors that ''PA tends towards SP norms -- but not quite reaching them, in most cases'' (161). Even if the rare cases 'incredibly' and 'ever' are not taken into consideration, we find that only two of the remaining five types, namely 'really' and 'too', show similar normalised frequencies in PA and SP. In contrast, the normalised frequency of 'very' in PA (29.7) just lies between that of 'very' in WR (9.7) and SP (50.1). In addition, the frequency of 'a lot' in PA (5.1) is more similar to that in WR (0.6) than to that in SP (15.2), and 'lah' is highly frequent in SP (77.2) but extremely rare in both PA (0.2) and WR (0.0). Admittedly, the authors concede that ''the situation is not always that clear-cut'' (162). However, on the basis of the data presented I would rather claim that the situation is not at all clear-cut and that the use of augmenters in PA more strongly resembles their use in WR than in SP. Another remark concerns the article by Clive Souter: he wants to convince the reader that POW ''is worth exploring, particularly if you are interested in learning and teaching language'' (288). At the same time, however, he repeatedly stresses the shortcomings of the corpus and the problems that may arise out of the corpus's size and the compilation of the material. So I am not quite convinced that ''interesting lexical information can be gleaned from this corpus for EFL instructors and curriculum designers'' (279).
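The comparison of normalised frequencies in the first remark can be spelled out with a few lines of Python. The sketch uses only the figures quoted above and takes the absolute difference between two frequencies as a crude measure of similarity; that measure, like the names `FREQS` and `closer_register`, is an assumption made for illustration and is not taken from Tan, Ooi and Chiang's paper.

```python
# Normalised frequencies as quoted in the review: (PA, WR, SP)
FREQS = {
    "very": (29.7, 9.7, 50.1),
    "a lot": (5.1, 0.6, 15.2),
    "lah": (0.2, 0.0, 77.2),
}


def closer_register(pa, wr, sp):
    """Return 'WR' or 'SP', whichever frequency lies nearer to the PA value."""
    return "WR" if abs(pa - wr) <= abs(pa - sp) else "SP"


closer = {word: closer_register(*values) for word, values in FREQS.items()}

for word, (pa, wr, sp) in FREQS.items():
    print(f"{word}: |PA-WR| = {abs(pa - wr):.1f}, "
          f"|PA-SP| = {abs(pa - sp):.1f} -> {closer[word]}")
```

By this measure 'a lot' and 'lah' in PA are clearly nearer to WR, while 'very' sits almost exactly midway between the two registers (a difference of about 20.0 to WR against about 20.4 to SP), which is in line with the objection that PA does not simply tend towards SP norms.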
The proofreading has been good: the number of typos and inconsistencies in layout (I found around 15 cases) is within reasonable limits for a book of over 400 pages.
On the whole, the volume makes for a highly stimulating and interesting read and gives a good insight into current issues and aspects of corpus linguistics, showing the vitality and the diversity of the field. Linguists from many different branches of linguistics will no doubt profit from the papers.
Johansson, M. (2002): Clefts in English and Swedish: A Contrastive Study of IT-clefts and WH-clefts in original texts and translations. PhD dissertation, Lund University.
Chomsky, N. (1964): ''Current issues in linguistic theory'', The Structure of Language, ed. by J. A. Fodor & J. J. Katz. Englewood Cliffs, New Jersey, 50-118.
Langacker, R. W. (1999): Grammar and conceptualization. Berlin: Mouton de Gruyter.
Halliday, M. A. K. (1989): Spoken and Written Language. Oxford: Oxford University Press.
Halliday, M. A. K. (1994): An Introduction to Functional Grammar, 2nd ed. London: Edward Arnold.
ABOUT THE REVIEWER:
Rolf Kreyer is an Assistant Professor of Modern English Linguistics at the English Department of the University of Bonn/Germany. He holds a degree in English and Mathematics and has recently finished his PhD thesis, a corpus-based analysis of inverted constructions in modern written English. His research interests include syntax, text linguistics, corpus linguistics and theoretical linguistics.