METHODOLOGY

The data for the study reported in this article were collected during a large-scale field trial of the standards-based tests for reading, listening, and writing that took place between April and May 2007. The broader purpose of the field trial was to obtain empirical information about the operating characteristics of the tests and to establish three separate one-dimensional proficiency scales for reading, listening, and writing. The focus of this article, however, is on the writing data: we investigate whether the level-specific approach allows for reliable inferences about the quality of the writing tasks and assess to what extent the a priori classification of the tasks by targeted CEFR level matches their empirical difficulties. Because the empirical difficulty of a task results from complex interactions between the task, the rating criteria, the raters, and the students' proficiency distribution (see Weigle, 1999, for multilevel approaches to assessing writing), the analyses need to take these design factors into account, all the more so because in our context these facets are restricted to specific levels.
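One common way to formalize such a facet structure is a many-facet Rasch-type decomposition, in which each observed rating is modeled as an additive combination of the design facets. The following is an illustrative sketch only, with generic facet labels; it is not claimed to be the exact model estimated in this study:

    \ln\frac{P_{nicrk}}{P_{nicr(k-1)}} = \theta_n - \delta_i - \gamma_c - \alpha_r - \tau_k

where \theta_n denotes the proficiency of student n, \delta_i the difficulty of task i, \gamma_c the difficulty of rating criterion c, \alpha_r the severity of rater r, and \tau_k the threshold between rating categories k-1 and k. Under a decomposition of this kind, facets that occur only at particular levels can be separated from task difficulty only if the design provides sufficient overlap, which is why the design facets are described in detail below.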
To address these aims, the design of the study, the specific research questions, and the statistical analyses have to be attuned to one another. We therefore first describe the relevant facets of the design of the writing field trial, namely the task samples, the rating approach, the rater training, and the rating design. On this basis, we then state the specific research questions and describe the statistical analyses we conducted.
Design of Study and Data Collection

Sampling issues
For the field trial, a representative national random sample of 2,065 students from all 16 federal states in Germany was selected. The students had undergone 8 to 10 years of schooling and were between 15 and 18 years of age. Most students were native German speakers. The students came from the HSA and MSA school tracks targeted by the NES: There were 791 students in the HSA sample (approximately balanced by grade with 383 in Grade 8 and 408 in Grade 9) and 1,274 students in the MSA sample (approximately balanced by grade with 629 in Grade 9 and 645 in Grade 10).
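As a simple check on the reported figures, the sketch below tallies the grade-by-track cells given above and confirms that they add up to the track subtotals and to the overall sample size; it is merely an illustrative tally, not part of the original analyses.

    # Illustrative tally of the reported sample composition
    # (all counts taken from the text above).
    sample = {
        "HSA": {"Grade 8": 383, "Grade 9": 408},
        "MSA": {"Grade 9": 629, "Grade 10": 645},
    }
    track_totals = {track: sum(grades.values()) for track, grades in sample.items()}
    assert track_totals == {"HSA": 791, "MSA": 1274}
    assert sum(track_totals.values()) == 2065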
Taken together, the students responded to 19 writing tasks that covered levels A1 to C1 of the CEFR. Due to time constraints and the cognitive demands of the writing tasks, however, each student responded only to two, three, or four writing tasks, depending on the CEFR level that the tasks targeted. The writing tasks were administered in 13 different test booklets within a complex rotation design across the two samples, which is also known as a matrix-sampling or balanced incomplete block design. The different booklets in the two sample designs were linked by common writing tasks, so-called anchor tasks. These were included to allow for an investigation of whether the two samples could be calibrated together on one common proficiency scale.
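Because the exact composition of the 13 booklets is not reproduced here, the following sketch uses invented booklet contents to illustrate the logic of linking through anchor tasks: booklets (and thereby the HSA and MSA samples) can be calibrated onto one common scale only if every booklet is connected to every other one through chains of shared tasks. All booklet and task identifiers below are hypothetical.

    # Hypothetical sketch of a matrix-sampling (incomplete block) booklet design.
    # The actual 13-booklet layout of the field trial is not reproduced here;
    # booklet contents are invented purely to illustrate how anchor tasks
    # link booklets into one connected design.
    from itertools import combinations

    # booklet id -> set of writing-task ids administered in that booklet (invented)
    booklets = {
        "HSA-1": {"A1-1", "A2-1", "A2-2"},
        "HSA-2": {"A2-2", "B1-1", "B1-2"},   # shares A2-2 with HSA-1
        "MSA-1": {"B1-2", "B2-1", "B2-2"},   # shares B1-2 with HSA-2 (anchor)
        "MSA-2": {"B2-2", "C1-1"},           # shares B2-2 with MSA-1
    }

    def is_linked(design):
        """Return True if every booklet can be reached from every other one
        through chains of shared (anchor) tasks, i.e. the design can in
        principle be calibrated onto a single common scale."""
        ids = list(design)
        # two booklets are adjacent if they share at least one task
        adjacency = {b: set() for b in ids}
        for a, b in combinations(ids, 2):
            if design[a] & design[b]:
                adjacency[a].add(b)
                adjacency[b].add(a)
        # depth-first search to test connectivity of the booklet graph
        seen, stack = set(), [ids[0]]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
        return len(seen) == len(ids)

    print(is_linked(booklets))  # True: all booklets form one linked design

In an operational analysis, the same kind of connectivity check would be applied to the actual 13-booklet design before attempting a joint calibration of the two samples.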