AUTHOR: Knoch, Ute
TITLE: Diagnostic Writing Assessment
SUBTITLE: The Development and Validation of a Rating Scale
SERIES TITLE: Language Testing and Evaluation 17
PUBLISHER: Peter Lang AG
Mark Brenchley, PhD Candidate, Graduate School of Education, University of Exeter
'Diagnostic Writing Assessment' details a recent study aiming to ''develop a
theoretically-based and empirically-developed rating scale'' suitable for
diagnostic contexts (p. 5). To this end, Knoch has conducted an empirical
comparison of two trait scales. The first, that of the Diagnostic English
Language Needs Assessment (DELNA), pre-dates the study; the second was
purpose-built and tested as part of the study. To address the overall research
problem, three specific questions were posed and addressed: 1) What discourse
analytic measures successfully distinguish between writing samples at different
DELNA levels? 2) Along what axes do the ratings produced using the two rating
scales differ? 3) How do raters perceive the two scales?
The first chapter provides a brief overview of the study, in which Knoch draws
attention to the lack of clarity regarding ''how direct diagnostic tests of
writing should differ from proficiency or placement tests'' (p. 13), a situation
particularly true of rating scale design. It is this latter issue that the study
addresses. Knoch investigates whether an empirically developed scale would be
''more valid for diagnostic writing assessment'' than an intuitively developed
scale of the kind typified by DELNA (p. 15).
In chapter 2, Knoch situates ''diagnostic assessment within the literature on
performance assessment of writing'' (p. 35). Diagnostic assessment is described
and distinguished through both content and purpose, with reference to features
identified by Alderson (2005). Diagnostic tests, for example, should be expected
to ''identify strengths and weaknesses'' (p. 21) and to provide ''a detailed
analysis and report of responses to items or tasks'' (p. 21).
Chapter 3 provides a definition of rating scales and focuses on issues relating
to their design, Knoch discussing how such scales relate to diagnostic contexts.
She argues that many current rating scales are significantly flawed, often
developed, for example, on the basis of a single theory of writing development.
She considers this to be a particular problem, our understanding of writing
being ''not sufficiently developed to base a writing scale just on one theory''.
In chapter 4, Knoch synthesises a taxonomy of linguistic constructs from the
various theories and models of writing development discussed in the previous
chapter, arguing that such a taxonomy provides ''the most comprehensive
description of our current knowledge about writing development'' (p. 71). Eight
potentially relevant writing constructs are identified, including Accuracy and
Cohesion (p. 75), and used to evaluate the current DELNA scale. She then further
analyses the available literature in order to determine what specific measures
have successfully identified writing development within these constructs, and to
determine which measures might be suitably operationalised for the pilot study
of phase one.
Chapter 5 outlines phase one. For the pilot study, Knoch catalogued 15 writing
scripts randomly selected from the University of Auckland's 2004 DELNA
administration. She coded them according to the specific linguistic measures
identified in chapter 4. From these scripts, Knoch determined a subset of
measures that successfully distinguished between different DELNA levels. This
subset then served as the basis for the main study, in which 601 randomly
selected DELNA scripts were analysed according to these measures.
Chapter 6 presents the analysis of the main study from phase one. Of the 26
linguistic measures identified by Knoch through the pilot study, the main study
identifies 17 – those which successfully differentiated between the various
levels of ability according to the original DELNA levels. These included
percentage of error-free t-units, the number of hedges, and the number of
propositions (p. 168).
Chapter 7 summarises the results presented in chapter 6, which are used to
devise a new rating scale for investigation during phase two. Knoch discusses
the success of each specific measure in turn, subsequently outlining a fresh
trait scale for that measure. She argues that the new trait scale offers more
explicit descriptors than the original DELNA scale, often stipulating fairly
precise quantitative measures (e.g. ''11-15 self-corrections'' (p. 172)). The new
scale is, therefore, arguably more objective and less open to rater subjectivity.
Chapter 8 presents the methodology for the empirical study comprising phase two.
Ten current DELNA raters were asked to use both the DELNA scale and the new scale
to rate 100 randomly selected DELNA scripts. Their ratings were subjected to a
Rasch analysis and further analysed according to five hypotheses designed to
evaluate their respective superiorities. Finally, the raters were interviewed
and asked to fill in a questionnaire about their experiences, affording feedback
as to the raters' own particular evaluations of the two scales.
Chapter 9 presents the results from phase two and addresses the two research
questions framing this phase. The individual trait ratings from both scales are
directly compared so as to determine their respective superiorities according to
the five hypotheses outlined in chapter 8; the scales are then further compared
overall. In terms of the individual traits, Knoch determines the new rating
scale to be generally superior, noting in particular that the new scale resulted
in a reduced halo effect and greater success in identifying learners' strengths
and weaknesses. She also notes, however, that, analysed as a whole, ''the
existing scale resulted in a higher candidate discrimination'' (p. 229), and
attributes this to the new scale ''assessing different information…not measured
by the existing scale'' (p. 230). Finally, Knoch presents and discusses the
results from the rater interviews, demonstrating a general preference for the
Chapter 10 discusses the results of phase two. The two scales are compared
according to various aspects relevant to rating scale validity, these aspects
defined in terms of ten distinct warrants. She finds the new scale to be clearly
more valid on those warrants covering a scale's Construct Validity (the extent
to which the test actually assesses what it is meant to assess), Reliability
(how consistently the same script is rated by different raters), and
Authenticity (how the test scores generalise to actual language use). The DELNA
scale, on the other hand, is deemed to have greater validity on the two warrants
covering scale Practicality (how easy a scale is to operationalise).
In chapter 11, Knoch summarises the findings of the study and discusses its
overall implications in both theoretical and practical terms. She argues the new
scale to be ''more suitable in a diagnostic context'' (p. 298), presents a model
of performance assessment, and argues the need to distinguish between analytic
scales that have been intuitively developed and those that have been developed
empirically.
A more extensive account of research recently published as Knoch (2009),
'Diagnostic Writing Assessment' represents a constructive contribution to the
literature on diagnostic language assessment. The study as a whole is
well-conceived, planned, and executed. Each stage is thoughtfully conducted so
as to serve as the empirical foundation for the succeeding stage, and the
results are carefully analysed and presented in a form which makes them easy to
engage with. Knoch is, further, aware of the inherent difficulties of rating
scale research and displays a clear understanding of the limits of her
study. Finally, the study's conclusions, primarily that an empirically-developed
scale with more explicit descriptors is more appropriate for diagnostic purposes
since it more reliably isolates distinct aspects of learner proficiency, are
measured, plausible and supported by the empirical evidence as presented.
A particular virtue of Knoch's study is the explicitness of the construction
process, allowing for a clearer understanding of the basis of the resultant
scale. Given the fundamental role rating scales play in operationalising the
relevant linguistic constructs and evaluating test-taker proficiency, such
explicitness is vital. Yet, as Knoch herself notes, and despite this state of
affairs having been noted as far back as Brindley (1998), ''there is surprisingly
little information on how commonly used rating scales are constructed'' (p. 42). Hence,
it is often difficult to evaluate the rationale and validity of assessment
scales, substantially hampering an understanding of the nature of these scales
in general. Not so here. Indeed, if productive research in this area is to take
place, then assessment scales need to be presented and investigated in much the
same manner as Knoch has done - openly, explicitly, and methodically.
A further virtue is Knoch’s comparative approach to rating scale validation.
Rather than evaluating a single scale in isolation, Knoch's study ascertains the
respective worth of two scales, using each to illuminate the validity of the
other. This is an approach that yields some interesting results. Thus, Knoch
notes that while ''most individual trait scales on the new scale were more
discriminating…as a whole, the existing scale was more discriminating'' (p. 222).
This discrepancy prompted a Principal Factor Analysis of the two scales, which
led her to conclude that the new scale accounts ''for not only more aspects of
writing ability, but also for a larger amount of variation of the scores'' (p.
228). Significantly, this is a conclusion prompted by the contrastive nature of
the study, a fact which marks such an approach out as a fruitful avenue for
further research into assessment scales.
A final desirable quality of Knoch's study is her synthetic approach to language
assessment. Firstly, the study is firmly contextualised within the framework of
current literature. This feature is, of course, perhaps to be expected given
that the study was undertaken for doctoral purposes. Nevertheless, it ensures
the study is firmly and properly grounded, drawing on a broad basis of
theoretical and empirical writing. Secondly, and more interestingly, Knoch
synthesises her linguistic constructs from a range of available models on
writing proficiency. This is significant since, as she herself rightly notes,
''no adequate model or theory of writing or writing proficiency is currently
available'' (p. 104), a situation which calls into question the validity of any
rating scale based on only one such model. Knoch's response circumvents this
difficulty, resulting in a construct taxonomy which has broad theoretical
support yet which is not tied to any one particular theory per se. Consequently,
her study can be pursued as a more open and empirically-driven investigation of
rating scale construction and validity, one which cuts across the various models
as well as having the potential to productively feed back into them. Her results
demonstrate this to be a promising approach which would allow rating scale
research to develop with its own measure of independence and integrity.
'Diagnostic Writing Assessment' is not without flaw, of course, and there are
several features worth drawing attention to. Throughout the study, for example,
Knoch is careful to control for the possible variables, something she makes
clear herself (p. 186). This care did not, however, extend to the selection of
raters, all of whom were drawn from a pool of current DELNA raters (p. 185).
Consequently, although the raters received training on the newly-devised scale,
they would have been substantially more familiar with the DELNA version, a
factor that could have had a significant effect on the rating outcomes. It would
perhaps have been preferable, therefore, to have selected raters who were
equally inexperienced on both scales, though relying on existing raters may have
been unavoidable given the inevitable labour constraints of a PhD study.
Further, Knoch makes clear that the goal of the study is to devise and
investigate an ''empirically developed rating scale'' (p. 15). To her credit, she
generally succeeds in this, each component empirically constructed, researched,
and feeding into the next. It is regrettable, therefore, that, following the
results of the pilot study of phase one, Knoch does not select a greater number
of measures for the main study than she actually does. So, for example, although
''error-free t-units, error-free clauses and errors/clause'' (p. 113) were all
found to distinguish successfully between different levels, only the ''percentage
of error-free t-units was selected for the second phase of this study'' (p. 113).
Knoch's choices are mostly not without reason; this particular measure was
selected, for example, because it ''might be the easiest for the raters to apply
and is unaffected by the length of the script'' (p. 113). Nevertheless, since the
study is decidedly empirical in intent, it would have made more sense to take on
all the measures empirically identified by phase one. These could then have been
further investigated during the two main studies to see how they actually
affected the raters and rating scores, rather than being eliminated a priori.
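The length-independence Knoch cites for the percentage measure can be illustrated with a minimal sketch. The function name and the per-t-unit error counts below are invented for illustration and are not taken from the study; the point is only that a percentage stays stable when a script with the same error profile becomes longer, whereas a raw count of error-free t-units does not.

```python
def error_free_tunit_percentage(tunit_error_counts):
    """Return the percentage of t-units that contain no errors.

    `tunit_error_counts` holds one integer per t-unit: the number of
    errors a coder found in that t-unit (hypothetical coding scheme).
    """
    if not tunit_error_counts:
        raise ValueError("a script must contain at least one t-unit")
    error_free = sum(1 for n in tunit_error_counts if n == 0)
    return 100 * error_free / len(tunit_error_counts)


# Invented data: a short script and a script three times as long
# with exactly the same proportion of error-free t-units.
short_script = [0, 1, 0, 2]
long_script = short_script * 3

# The raw count of error-free t-units grows with script length...
raw_short = sum(1 for n in short_script if n == 0)
raw_long = sum(1 for n in long_script if n == 0)

# ...while the percentage is the same for both scripts.
pct_short = error_free_tunit_percentage(short_script)
pct_long = error_free_tunit_percentage(long_script)

print(raw_short, raw_long)
print(pct_short, pct_long)
```

A ratio such as errors per clause would share this property, which is one reason the a priori exclusion of the other successful accuracy measures is open to question.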
Finally, though Knoch's analysis is generally sound, there are a couple of
points worth raising regarding the statistical methodology of the study. The
first is somewhat minor: no breakdown is provided according to
the native:non-native (47%:53%, respectively) profiles of the study cohort.
These are groups likely to display different proficiency characteristics and
needs, something particularly significant for a diagnostically-oriented
assessment scale. Hence, it would have been relevant to explore the extent to
which these groups were differently handled by the two scales. It is true, as
Knoch notes, that ''it is very difficult to establish the language background of
students'' (p. 295); nevertheless, even a brief exploration would have provided an
interesting further dimension for comparing the two scales.
The second point is more substantial and concerns the fact that the pilot study
bases its conclusions on an analysis of only 15 writing scripts. This is quite a
small sample, one for which ''no inferential statistics were calculated and the
data was not double coded'' (p. 112). This sample size makes her use of means
questionable since it renders the mean vulnerable to outlier scores. It also
often results in mean scores that are distinct but fairly close together (as in
the case of 'grammatical complexity' (p. 115)) and in standard deviation scores
that overlap (sometimes significantly, as in the case of 'number of words' (p.
115)). As a result, there is a residual uncertainty as to the accuracy of the
selected measures, reinforcing the point made above about carrying all of the
successful measures forward. That this may indeed have been a significant factor
is suggested by the fact that the success of the pilot study in identifying
clauses-per-t-units as a successful measure was not replicated in the main study
(p. 173). Consequently, it would have been helpful either to have utilised a
larger sample or to have included the individual scores alongside the means
so as to present a more detailed picture of the data; either would have improved
the general empirical rigour of the study.
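The concern about means in a very small sample can be made concrete with a short sketch. The scores below are invented for illustration (they are not Knoch's data), and the helper name is hypothetical; the sketch simply shows how one outlying script can shift a band's mean while leaving the median largely untouched, and how bands one standard deviation either side of the mean can overlap.

```python
from statistics import mean, median, stdev

# Invented scores on some measure for two adjacent proficiency levels,
# five scripts each. The last script in level_b is an outlier.
level_a = [1.6, 1.7, 1.8, 1.7, 1.7]
level_b = [1.8, 1.9, 1.8, 1.9, 3.1]


def sd_ranges_overlap(a, b):
    """True if the intervals mean +/- one sample SD of a and b overlap."""
    lo_a, hi_a = mean(a) - stdev(a), mean(a) + stdev(a)
    lo_b, hi_b = mean(b) - stdev(b), mean(b) + stdev(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Mean of level_b with and without the single outlying script.
mean_with_outlier = mean(level_b)
mean_without_outlier = mean(level_b[:-1])

print("level A mean:", round(mean(level_a), 2))
print("level B mean with outlier:", round(mean_with_outlier, 2))
print("level B mean without outlier:", round(mean_without_outlier, 2))
print("level B median:", median(level_b))
print("SD ranges overlap:", sd_ranges_overlap(level_a, level_b))
```

With only a handful of scripts per level, reporting the individual scores or a median alongside each mean would let readers judge how far an apparent difference between levels rests on one or two scripts.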
Nevertheless, it is to Knoch's credit that the above criticisms are possible
only because of the study's explicitness. It is also worth
remembering that 'Diagnostic Writing Assessment' is a PhD study and as such is
inevitably bound by all the labour constraints such a study entails. Indeed, in
this context, Knoch's work is particularly impressive, the end product a mature
piece of empirical research that extends our current understanding of diagnostic
rating scale design, raises relevant and important issues, and serves as a
useful staging post for future research in this area.
Alderson, J. C. (2005) Diagnosing Foreign Language Proficiency: The Interface
Between Learning and Assessment. London: Continuum.
Brindley, G. (1998) Describing Language Development? Rating Scales and Second
Language Acquisition. In Bachman, L. F. and Cohen, A. D. (eds.), Interfaces
Between Second Language Acquisition and Language Testing Research. Cambridge:
Cambridge University Press. pp. 112-140.
Knoch, U. (2009) Diagnostic Assessment of Writing: A Comparison of Two Rating
Scales. Language Testing 26(2), pp. 275-304.