Results of Multifaceted Rasch Modeling
The previous analyses suggest that there is likely to be large variation in task difficulty and student proficiency estimates but only small variation in rater severity. To investigate this hypothesis with a suitable model, we first computed the deviance values between all nested models for both the HSA and MSA samples, including models with different main effects only and models with different main and two-way interaction effects. All deviance comparisons were statistically significant at the α = .001 level, so we do not present numerical details here to conserve space.
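For readers less familiar with this kind of model comparison, the sketch below illustrates the general logic of a deviance (likelihood-ratio) test between two nested models. The deviance values and parameter counts are placeholders for illustration only, not the values obtained for the HSA or MSA samples.

```python
from scipy.stats import chi2

def compare_nested_models(deviance_simple, deviance_complex, extra_params, alpha=0.001):
    """Likelihood-ratio test between two nested models.

    The deviance difference is approximately chi-square distributed with
    degrees of freedom equal to the number of additional parameters in
    the more complex model.
    """
    lr_statistic = deviance_simple - deviance_complex  # deviance drops as parameters are added
    p_value = chi2.sf(lr_statistic, df=extra_params)
    return lr_statistic, p_value, p_value < alpha

# Hypothetical deviances for a main-effects-only model and a model that also
# includes two-way interaction effects (not the values reported in the study).
lr, p, significant = compare_nested_models(
    deviance_simple=41250.3, deviance_complex=40980.7, extra_params=60
)
print(f"G^2 = {lr:.1f}, p = {p:.3g}, retain complex model: {significant}")
```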
Consequently, the final model chosen for reporting purposes was the most complex model, with all main and two-way interaction effects for the design facets of tasks, raters, and rating criteria as well as a single latent proficiency variable, as shown in Equation 2. To illustrate the structure of this model graphically, Figure 1 shows the Wright map from ACER ConQuest for this model for the MSA sample. It contains the writing proficiency estimates of the students in the leftmost column along with the parameter estimates for all model effects in the columns to the right of it. Note that we have replaced the numerical codes for the rating criteria with letters (F = task fulfillment, O = organization, V = vocabulary, G = grammar, H = overall) in the main effect and two-way interaction effect panels, and that we have replaced the numerical task IDs with the a priori CEFR classification (A1–C1) of the writing tasks by the task developers in the task main effect panel. Note further that the figure shows only some parameter estimates for the two-way interaction effects due to space limitations; however, the remaining parameter estimates are located within the visible clusters, so the boundaries are well represented.
FIGURE 1 Wright map for the multifaceted Rasch analysis of the Mittlerer Schulabschluss sample data (from Rupp & Porsch, 2010, p. 60; reprinted with permission).
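Equation 2 itself is not reproduced in this excerpt. As a point of orientation only, a generic many-facet Rasch (partial credit) model with main effects for students, tasks, raters, and rating criteria plus two-way interaction terms can be written roughly as follows; the exact parameterization used in the study may differ.

```latex
% A generic many-facet Rasch sketch; the study's Equation 2 may be
% parameterized differently (this is for orientation only).
\log \frac{P_{nijck}}{P_{nijc(k-1)}}
  = \theta_n                                  % proficiency of student n
  - \delta_i                                  % difficulty of task i
  - \alpha_j                                  % severity of rater j
  - \beta_c                                   % difficulty of rating criterion c
  - \gamma_{ij} - \gamma_{ic} - \gamma_{jc}   % two-way interaction effects
  - \tau_k                                    % step parameter for score category k
```

Here P_{nijck} denotes the probability that student n receives score category k from rater j on criterion c for task i.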

Due to the nature of the model used, all parameter estimates are placed on a common scale, and the resulting student proficiency estimates are conditioned on the remaining effects in the model, which, most importantly, removes any systematic differences due to rater severity. Overall, Figure 1 shows how this analysis captures the essential features observed earlier through the G-theory analyses. Specifically, we can see that there is a reasonably large amount of variation in the student proficiency estimates, consistent with similar analyses of reading and listening comprehension tasks that show a large degree of inter-individual proficiency differences for the population tested. Moreover, the rater variance is relatively small compared to the other effects, showing that raters performed, on average, very similarly. The students also received similar average scores per rating criterion, with task fulfillment being the easiest criterion on which to score highly and the global rating being an approximate average of the other criteria. Finally, the reliability of the writing scales for making individual norm-referenced decisions is moderate for the HSA sample at .73 and high for the MSA sample at .89; consequently, the reliability for norm-referenced decisions at aggregate levels (e.g., schools, school districts) will be higher. Reliabilities are the expected a posteriori / plausible value (EAP/PV) reliability estimates available through ACER ConQuest 2.0 (see Adams, 2006).
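The EAP/PV reliability mentioned above can be understood as the share of the total latent variance that is captured by the EAP point estimates. A minimal sketch of that computation is shown below, assuming one has EAP estimates and posterior variances per student; the arrays used here are simulated for illustration and are not output from ACER ConQuest.

```python
import numpy as np

def eap_reliability(eap_estimates, posterior_variances):
    """EAP reliability: variance of the EAP point estimates relative to the
    total latent variance (EAP variance plus mean posterior variance).
    This mirrors the usual definition; ConQuest's internal computation
    may differ in detail.
    """
    eap_var = np.var(eap_estimates, ddof=1)
    mean_post_var = np.mean(posterior_variances)
    return eap_var / (eap_var + mean_post_var)

# Illustrative values only: 400 simulated students with posterior SD of about 0.4.
rng = np.random.default_rng(0)
eaps = rng.normal(0.0, 1.0, size=400)
post_vars = np.full(400, 0.4 ** 2)
print(f"EAP reliability: {eap_reliability(eaps, post_vars):.2f}")
```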
Nevertheless, two obvious problems with this model remain. The first concerns the large “gap” on the proficiency scale that is visible between the B1 and the B2 tasks. Clearly, less information is available under this model about students in this range of the scale than about students in other ranges. The second problem is not directly visible from the map, as it concerns data-model fit at the item level. To characterize item fit, we use infit and outfit measures with cut-off values of .9 and 1.1 for both, in alignment with conventional large-scale assessment practice; these are reasonably conservative for a scenario with approximately n = 400 responses per writing task (e.g., Adams & Wu, 2009). Most fit statistics for main effect parameters and interaction effect parameters fall outside of their respective confidence intervals as computed in ACER ConQuest 2.0 (see Eckes, 2005, p. 210, for a discussion of alternatively suggested cut-off values), showing that some model effects are over- or underfitting. Specifically, raters for both samples show more variation in their ratings than expected (i.e., they underfit the model), even though the effect is relatively more pronounced for the HSA sample. These effects are not necessarily uncommon, however, as even the ACER ConQuest 2.0 manual shows similar parameter values for an illustrative multifaceted analysis of rater severity with multiple rating criteria (pp. 49−50). Due to the limited available sample size per rating per task and the desire to keep the writing assessment results in line with the reading and listening comprehension results, it was eventually decided to use the current results only cautiously, to help refine the test development, rater training, and operational scoring processes.
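For orientation, infit and outfit are the information-weighted and unweighted mean squares of the standardized residuals, respectively, with values near 1 indicating good data-model fit. The sketch below shows the standard definitions for a single model element (e.g., one rater); the observed scores, expected scores, and variances are illustrative placeholders rather than quantities from the fitted model.

```python
import numpy as np

def infit_outfit(observed, expected, variances):
    """Infit and outfit mean-square statistics for one model element
    (e.g., a rater or a task), following the standard Rasch definitions.

    observed, expected, variances: arrays over the responses involving this
    element; `expected` and `variances` are the model-implied means and
    variances of the observed scores.
    """
    residuals = observed - expected
    z_squared = residuals ** 2 / variances               # squared standardized residuals
    outfit = np.mean(z_squared)                          # unweighted mean square
    infit = np.sum(residuals ** 2) / np.sum(variances)   # information-weighted mean square
    return infit, outfit

# Toy example with dichotomous scores and model probabilities (illustrative only).
p = np.array([0.2, 0.5, 0.8, 0.6, 0.3])
x = np.array([0, 1, 1, 0, 1])
infit, outfit = infit_outfit(x, p, p * (1 - p))
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")  # values near 1 indicate good fit
```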
