Assessing learners' writing skills according to CEFR scales
Variability of Ratings
There exists ample research into the influence of assessment criteria and raters on the variability of ratings, yet it is mainly situated in contexts where multilevel rating approaches were used.
A general distinction is made between impressionistic holistic and detailed analytic assessment criteria, whereby the latter are reported to compensate for shortcomings of the former, such as impressionistic ratings of surface features (Hamp-Lyons & Kroll, 1996), halo effects (Knoch, 2009), or imprecise wording of the criteria (Weigle, 2002). An analytic approach using detailed rating criteria that account for the complexity of writing is therefore often favored.
There is also substantial research on the effects of rater characteristics on the variability of the ratings they assign, yet this research is, again, predominantly situated in multilevel approaches and does not focus on the particularities of level-specific ratings. For the traditional multilevel approach, critical characteristics include raters' background, experience, expectations, and preferred rating styles.
Although these studies can help determine which facets have to be explored and controlled in a level-specific context, there nevertheless exist gaps in the literature as far as the level-specific approach to assessing writing is concerned: There is a lack of research on the effects of holistic and analytic criteria on the variability of level-specific ratings, as well as on the effects of rater characteristics. With regard to these areas, two of the scarce studies can be found in the aforementioned Australian context. Smith (2000) reported that although raters may show acceptable consistency on an overall level, this might well “obscure differences at the level of individual performance criteria” (p. 174). These findings are corroborated by Brindley (2001), who concluded that the main sources of variance in his study were the terminology used in the performance criteria and the writing tasks, which had been developed by individual teachers rather than in a standardized process. Both studies revealed a need for rater training, which can help lessen unintended variance to a certain extent (e.g., Lumley, 2002). Therefore, all of the known characteristics that drive rating performance should be addressed during rater training, regardless of which approach is chosen.
Bearing in mind the peculiarities of the level-specific approach, where tasks and criteria focus on one level only and raters have to make a pass/fail judgment, an analytic approach seems vital both to gain insight into which aspects raters are actually rating at the descriptor level and to ensure rater reliability. It is this aspect of ensuring reliability that justifies the analytic approach even if the assessment purpose is to report one global proficiency score.
The review of the research literature thus reveals a gap in the area of level-specific approaches to assessing writing. There is, to our knowledge, no large-scale study reporting on how a level-specific rating approach affects the variability of ratings. Our article aims to explore facets known from other contexts to influence rating variability, namely tasks, assessment criteria, raters, and student ability, within a level-specific approach in a large-scale assessment study. Because an in-depth analysis of different rater characteristics and their effects on rating variability is beyond the scope of this article, we intend to control rater characteristics as far as possible by selecting raters with similar backgrounds and by providing extensive training. The purpose is to ensure reliable ratings, as these form the basis for inferences about task quality and task difficulty estimates.
Moreover, to our knowledge, there are no reports on test-centered methods for linking level-specific writing tasks to specific CEFR levels. Our article thus makes an important contribution to test-centered standard-setting methods by exploring the extent to which a priori CEFR level classifications of tasks correspond to empirical task difficulty estimates.