Assessing learner's writing skills according to CEFR scales
Rating Approach and Rater Training
The construction of the rating scale was based on the analysis and adaptation of preexisting rating scales along with the creation of descriptors suitable for the proficiency level targeted by each writing task. The development of the rating scale was grounded in the following key documents:
As in the development of the test specifications described in the previous section, the rating scale descriptors were analyzed for content, terminology, coherence, and consistency. Based on this analysis, level-specific descriptors were constructed for four assessment criteria: task fulfillment, organization, grammar, and vocabulary. The initial draft was pretrialed and revised in an iterative process by the task developers and reviewed by an international expert team. Moreover, a set of initial benchmark texts was selected from the pretrials to illustrate prototypical features of student responses for each assessment criterion at each proficiency level. During the rater training seminars for the present study, which are described in more detail next, the rating scales were revised once more and further benchmark texts were added. The final version of the rating scale was validated by a team of teachers and international researchers in the field of writing assessment (for details, see Harsch, 2010). An illustration of the scale for Level B1 is provided in Appendix C.
The rating approach chosen was an analytic one, in which detailed analyses of the students' responses formed the basis for a final overall score. This approach was chosen because of the higher reliability of such a detailed rating procedure, as discussed above (Knoch, 2009). First, analytic ratings were given for each of the four criteria. In line with the level-specific task design approach, each student response was rated on a 2-point rating scale (“below pass” = not reaching the targeted level of performance, “pass” = reaching the targeted level of performance). To arrive at reliable analytic ratings for the criteria, each single descriptor was first rated on the 2-point scale. These descriptor ratings formed the basis of the four analytic criteria ratings. Finally, based on the analysis of the student text, an overall grade was assigned in line with the test purpose, that is, to report one proficiency score.
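The three-step procedure described above (descriptor ratings, then criterion ratings, then an overall grade) can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify how raters combined descriptor ratings into criterion ratings or criterion ratings into the overall grade, so the majority rule and the three-of-four rule below are hypothetical, as are all function and variable names.

```python
from typing import Dict, List

# The four assessment criteria named in the text.
CRITERIA = ("task fulfillment", "organization", "grammar", "vocabulary")


def criterion_rating(descriptor_ratings: List[bool]) -> bool:
    """Combine 2-point descriptor ratings (True = pass) into one criterion rating.

    ASSUMPTION: a criterion passes when more than half of its descriptors pass.
    The source does not state the actual rule; raters may have used judgment.
    """
    if not descriptor_ratings:
        raise ValueError("at least one descriptor rating is required")
    return sum(descriptor_ratings) * 2 > len(descriptor_ratings)


def rate_response(descriptors: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Return the four criterion ratings plus an overall pass/below-pass grade."""
    missing = [c for c in CRITERIA if c not in descriptors]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    ratings = {c: criterion_rating(descriptors[c]) for c in CRITERIA}
    # ASSUMPTION: overall pass requires at least three of the four criteria.
    ratings["overall"] = sum(ratings[c] for c in CRITERIA) >= 3
    return ratings


if __name__ == "__main__":
    # Hypothetical descriptor ratings for one student response.
    example = {
        "task fulfillment": [True, True, False],
        "organization": [True, False],
        "grammar": [True, True],
        "vocabulary": [False, False, True],
    }
    for name, passed in rate_response(example).items():
        print(f"{name}: {'pass' if passed else 'below pass'}")
```

Keeping the descriptor-level ratings separate from the criterion and overall ratings mirrors the order in which the ratings were given, and it would allow the fit between the levels to be inspected afterwards.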
The rater training took place between July and August 2007. We aimed to account for the known effects of rater characteristics by choosing raters with comparable educational backgrounds; by providing enough rating practice to account for different levels of experience and to allow for an in-depth familiarization with the procedures; by encouraging discussions of key task characteristics, assessment instruments, and sample student responses to arrive at a common level of expected performance; and by actively engaging raters in evaluating the adequacy and applicability of benchmark texts and rating scales, which they were allowed to revise where necessary.
In cooperation with the Data Processing Center in Hamburg, 13 graduate students from the University of Hamburg (English as a Foreign Language program) were trained on the level-specific rating approach. Their functional level of proficiency in English was established by an entrance test that included an assessment of their writing ability. The raters were between 25 and 33 years old and all had prior experience in either teaching English or marking essays. During the intensive training sessions (two 1-week seminars and six additional 1-day sessions over a period of 8 weeks), the raters were familiarized with the previously described rating instruments and procedures. Throughout the training period they rated sample student responses. These ratings were analyzed by the facilitator to monitor rater reliability, guide the training process, and further revise the assessment instruments. A fuller description of this process is beyond the scope of this article; for a detailed account, see Harsch and Martin (2011).