Assessing learners' writing skills according to CEFR scales
Approaches to Assessing Writing

The prototypical task design for assessing writing in large-scale assessment contexts consists of using tasks of approximately equal difficulty that are designed to elicit a wide range of written responses, which can then be rated by trained raters with a rating scale covering performance criteria across multiple proficiency levels (see, e.g., Hamp-Lyons & Kroll, 1998, or Weigle, 1999, 2002). Such a multilevel approach was used, for example, in the Study of German and English Language Proficiency (Deutsch-Englisch Schülerleistungen International; DESI; Beck & Klieme, 2007; Harsch, Neumann, Lehmann, & Schröder, 2007). In that context, the approach was justified by the test purpose, which was to report the proficiency distribution of all ninth graders in Germany based on curricular expectations.
If the purpose of an assessment is, however, to report whether students have reached a specific level of attainment or a specific performance standard within a broader framework of proficiency, a different approach to assessing writing is worth pursuing, namely the aforementioned level-specific approach, whereby one task operationalizes one specific level, as is traditionally done for assessing receptive skills. Examples of a level-specific task approach can be found in international English proficiency tests offered, for example, by Trinity, Pearson, or Cambridge Assessment. In the case of the Cambridge ESOL General English suite of exams (Taylor & Jones, 2006), for instance, different exams target five different proficiency levels; however, the written responses are assessed via different multiband (or multilevel) rating scales. Although the tasks can be characterized as level-specific, with the tasks of the different exam levels operationalizing different CEFR levels, the exam-specific rating scales cover six finer proficiency bands (“multilevels” from 0 to 5) within each of the five exam levels. As a result, each band of the rating scales has a meaning only in relation to the targeted exam level. It is stated that “candidates who fully satisfy the band 3 descriptor will demonstrate an adequate performance of writing at [the exam] level”. We could therefore interpret ratings at or above band 3 as constituting a “pass” and ratings below band 3 as a “fail.” To link the various rating bands across the five exam levels, Cambridge ESOL has recently completed a long-term project to develop a Common Scale for Writing covering the five upper CEFR levels. However, it remains unclear how the finer bands of the exam-specific rating scales can be interpreted with reference to the levels of this Common Scale and to the CEFR proficiency levels; more specifically, could a band 5 rating in the CAE, for instance, be interpreted as the candidate having demonstrated writing performance beyond CEFR Level C1? Although this issue is addressed for the overall grade (cf. the information on the CAE certificate, where grade A in the CAE is interpreted as performance at Level C2), it is not addressed for reporting a profile across the different skills covered in the exam. Thus, it seems difficult to transparently trace how multiband ratings of written performances in this suite of exams lead to the assessment of a candidate's writing proficiency in terms of CEFR levels. Although such level-specific exams aim to assess and report individuals' English proficiency with a focus on one proficiency level, they do not seem suitable for a large-scale context in which the aim is to screen a population spanning several proficiency levels. Here, instruments are needed that can account for a range of abilities in the sample tested, and the results of the assessment have to be generalizable to the population.
Another relevant example of assessing writing with level-specific tasks and level-specific rating criteria is the approach taken by the Australian Certificates in Spoken and Written English, an achievement test for adult migrants. Tasks are developed by teachers to operationalize four different levels of attainment. Written learner productions are then assessed by teachers using performance criteria that describe the features demanded or expected at each of the four levels; the criteria are assessed through binary judgments as to whether or not learner texts show the demanded features. Although this approach to assessing writing at a specific level is promising, it has certain constraints: The Australian assessment focuses on individual achievement, whereby teachers give individual feedback to their learners. Task development is not standardized, and neither are the administration and assessment procedures. The relationship between the four targeted levels of attainment and the task specifications targeting each level remains somewhat unclear. For a large-scale proficiency assessment, the focus is less on individual achievement and more on gaining generalizable data for the targeted population. Therefore, standardized procedures for task development, administration, and assessment are needed. Specifically, the relationship between the targeted proficiency level and the task characteristics, as well as the relationship between proficiency levels and rating scale levels, needs to be made transparent. The current study therefore aims to make a significant contribution toward researching the level-specific approach to tasks and rating scales in a large-scale assessment context.