Rater Severity

Descriptive statistics showed that, on average, the raters displayed relatively little variation across all levels and tasks, suggesting that the rater training was effective to some extent. This conclusion is supported by the G-theory analyses, which show that the main effect of different raters is negligible (1.2% and 2.2%, respectively), and by the Rasch analysis, which shows a relatively small rater effect, so that the adjustments to the student proficiency estimates for rater severity remain mild. We therefore conclude that the rater training and the accompanying seminars, in which raters could select and justify benchmark texts and revise conspicuous rating scale descriptors, were very effective in producing raters who share a similar understanding of the different levels and criteria. Moreover, the results suggest that our approach of using descriptor ratings as the basis for criteria ratings (see Harsch & Martin, 2011) leads to reliable overall ratings.
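To make the G-theory figures above concrete, the following is a minimal sketch of how the share of score variance attributable to the rater main effect is derived from a G-study decomposition; the variance components used here are placeholder values for illustration only, not the study's actual estimates.

```python
# Minimal sketch (placeholder values, not the study's estimates): in a G-study,
# the observed score variance is decomposed into components for each facet and
# their interactions; the rater main effect is reported as its share of the total.

variance_components = {
    "person": 0.80,           # true differences between students
    "rater": 0.02,            # rater main effect (overall severity differences)
    "task": 0.15,             # differences in task difficulty
    "person_x_rater": 0.05,   # rater severity varying across students
    "person_x_task": 0.20,    # student performance varying across tasks
    "rater_x_task": 0.03,     # rater severity varying across tasks
    "residual": 0.40,         # remaining error (incl. person x rater x task)
}

total_variance = sum(variance_components.values())
for facet, component in variance_components.items():
    share = component / total_variance
    print(f"{facet:>15}: {share:6.1%} of total score variance")
```

A rater share in the low single digits, as reported above, is what licenses the conclusion that rater severity contributes little to observed score differences.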
Student Proficiency

The multifaceted Rasch analysis showed a plausible degree of variance among the students in each sample, and the descriptive statistics and G-theory analyses supported a higher average proficiency for the MSA sample than for the HSA sample. The analysis of the reading and listening comprehension tests yielded comparable degrees of variance, thus corroborating the findings for the writing scales.
Empirical Cut-Score Suggestions

The Rasch model analyses suggested regions where cut-scores could be set empirically. Although this holds for both student samples, potential regions are more clearly separated in the upper range of the proficiency continuum. That is, tasks targeting Level B1 in the HSA sample and tasks targeting Levels B2 and C1 in the MSA sample are more easily distinguished empirically from tasks at Levels A1 and A2 taken together than, for example, tasks at Level A2 are from tasks at Level B1, or tasks at Level A1 from tasks at Level A2. Unfortunately, some ambiguity about cut-score boundaries thus remains for tasks in the lower ranges of the proficiency continuum, which are critical for standards-based reporting. In addition, for the MSA sample there is a significant gap on the proficiency scale in the range between B1 and B2. Thus, although our analyses cannot provide unambiguous cut-score suggestions across the whole scale, they nevertheless suggest regions in which cut-scores could be set. More important, they show that the a priori task classifications align very well with the empirical difficulty estimates, implying that a test-centered approach to aligning level-specific tasks with CEFR Levels A1 to B2 provides an empirically defensible basis upon which consensus-based approaches can be used to confirm the CEFR levels of the tasks and to set cut-scores (see next).4
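As an illustration of the kind of inspection described above, the sketch below uses hypothetical Rasch difficulty estimates (in logits), grouped by the tasks' a priori CEFR classification, to check how well adjacent levels separate and where a candidate cut-score region might lie; the values and level groupings are assumed for demonstration and do not reproduce the study's data.

```python
# Illustrative sketch with hypothetical Rasch difficulty estimates (logits),
# grouped by the tasks' a priori CEFR levels. For each pair of adjacent levels
# it reports whether the difficulty ranges overlap and, if not, the midpoint
# of the gap as a candidate cut-score region.

task_difficulties = {
    "A1": [-2.1, -1.8, -1.5],
    "A2": [-1.6, -1.2, -0.9],   # slight overlap with A1, as in the lower ranges
    "B1": [0.1, 0.4, 0.8],
    "B2": [1.5, 1.8, 2.2],
}

levels = list(task_difficulties)
for lower, upper in zip(levels, levels[1:]):
    hardest_lower = max(task_difficulties[lower])
    easiest_upper = min(task_difficulties[upper])
    gap = easiest_upper - hardest_lower
    if gap <= 0:
        print(f"{lower}/{upper}: difficulty ranges overlap; no clear cut-score region")
    else:
        candidate = (hardest_lower + easiest_upper) / 2
        print(f"{lower}/{upper}: gap of {gap:.1f} logits; candidate cut-score near {candidate:.1f}")
```

In this toy example the A1/A2 boundary overlaps while the higher boundaries show clear gaps, mirroring the pattern reported above: clearer separation, and thus clearer candidate cut-score regions, in the upper range of the proficiency continuum.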