Develop a reliable rating process that allows different raters
at different points in time to obtain the same - or nearly the same
- results, or allows a single teacher in a classroom to assess each
student using the same criteria.
- Definition of reliable scoring: A performance is rated reliably
or consistently when the same behavior is given the same score
regardless of who has performed the task, when it is rated, or by
whom it is rated. The scoring criteria are applied uniformly to
each performance each time an assessment is conducted. Subjective
application of the scoring criteria is minimized.
- Rater training: A prerequisite for consistent scoring
- All teachers who will be scoring the assessment task must be
trained in the following topics:
- Orientation to the assessment task - give teachers an overview
of the assessment task, what the results will be used for, who will
use them, what directions and prompts the students will receive,
and how the scoring guide operationalizes desired outcomes or
processes.
- Clarification of the scoring criteria - raters engage in
extensive discussion of scoring dimensions and scale values.
Dimensions and values are defined, and examples or models of
students' work are provided to exemplify each.
- Practice scoring - the very heart of the scorer training is to
allow prospective scorers to score student work, first in a group
with responses that are fairly simple to score, then with samples
that call for more difficult judgments (borderline or atypical
responses).
- Protocol revision - during the practice scoring, raters may
devise certain rules for dealing with unanticipated aspects of
judgment posed by a particular set of papers and not covered by the
scoring guide. For example, if a number of students misinterpreted
the task in the same way, raters may decide to revise the scoring
rubric to assess the students' papers based on the student-defined
task.
- Score recording - raters must be taught how to record student
scores. A sample score report should be provided.
- Documenting rater reliability - training ends when there is
agreement that scorers have reached an acceptable level of
consistency, usually rating sample pieces within one point of each
other. To determine when raters are ready to score actual student
work, reliability checks are conducted during training. Figure 1
below demonstrates two ways to calculate rater agreement.
Figure 1

          Is rater in perfect agreement     Is rater in agreement with the
          with the criterion score?         criterion score, plus or minus
                                            1 point?
Rater     Paper #1  Paper #2  Rater's       Paper #1  Paper #2  Rater's
                              average                           average
                              agreement                         agreement
Linda     yes       no         50%          yes       no         50%
Donna     no        no          0%          yes       yes       100%
Mark      yes       yes       100%          yes       yes       100%
TOTAL     67% yes   33% yes    50%          100% yes  67% yes    83%
Figure 1 illustrates a case in which three raters are asked to rate
two criterion (exemplary) papers after some training. According to
the results, Linda agrees with the score for paper 1 but not for
paper 2; in fact, for paper 2 she is not even within 1 point. Donna
is not in perfect agreement with the criterion scores on either
paper 1 or 2, but is in agreement plus or minus one point on both
papers. Mark is in agreement all the time and is ready to rate
student work. Linda and Donna probably need more training. If you
were to report these rater agreement results, you would say, "On
average, raters obtained perfect agreement with criterion scores 50
percent of the time, and reached plus or minus one agreement 83
percent of the time." (Herman, Aschbacher, & Winters, 1992)
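The two calculations in Figure 1 can be sketched in a few lines of
Python. This is only an illustration: the figure reports yes/no
results, not the underlying scores, so the criterion scores and rater
scores below are hypothetical values chosen to reproduce the same
yes/no pattern, and the function names are made up for this example.

```python
# Hypothetical scores: the source gives only the yes/no outcomes in
# Figure 1, so these numbers are assumptions that match that pattern.
CRITERION = {"Paper #1": 4, "Paper #2": 3}
RATINGS = {
    "Linda": {"Paper #1": 4, "Paper #2": 5},
    "Donna": {"Paper #1": 3, "Paper #2": 4},
    "Mark": {"Paper #1": 4, "Paper #2": 3},
}


def agreement_rate(scores, criterion, tolerance=0):
    """Percent of papers scored within `tolerance` points of the criterion."""
    hits = sum(abs(scores[p] - criterion[p]) <= tolerance for p in criterion)
    return 100 * hits / len(criterion)


def summarize(ratings, criterion, tolerance=0):
    """Return each rater's agreement rate and the average across raters."""
    per_rater = {
        rater: agreement_rate(scores, criterion, tolerance)
        for rater, scores in ratings.items()
    }
    overall = sum(per_rater.values()) / len(per_rater)
    return per_rater, overall


# tolerance=0 is perfect agreement; tolerance=1 is plus or minus 1 point.
exact_by_rater, exact_overall = summarize(RATINGS, CRITERION, tolerance=0)
adjacent_by_rater, adjacent_overall = summarize(RATINGS, CRITERION, tolerance=1)

for rater in RATINGS:
    print(f"{rater}: exact {exact_by_rater[rater]:.0f}%, "
          f"within 1 point {adjacent_by_rater[rater]:.0f}%")
print(f"Overall: exact {exact_overall:.0f}%, "
      f"within 1 point {adjacent_overall:.0f}%")
```

With these assumed scores the overall figures come out to 50 percent
perfect agreement and 83 percent agreement within one point, matching
the totals reported in Figure 1.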
- Scheduling considerations - Be sure you have an acceptable
level of agreement before teachers judge student work. Provide
time for training, and don't overwork your teachers: scoring
becomes inconsistent when raters are tired. Limit scoring to no
more than 6 hours a day. Provide time for retraining and
refreshing trainers at the beginning of each day of scoring.
- Criteria for checking the reliability of your rating process
- Documented, field-tested scoring guides
- Clear, concrete scoring criteria
- Annotated examples of student work at all score points on the
scale
- Ample practice and feedback for raters
- Multiple raters with demonstrated agreement prior to scoring
- Periodic reliability checks throughout
- Retraining when needed
- Arrangements for collection of suitable reliability data
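A periodic reliability check during live scoring could follow the
"within one point of each other" standard described for training. The
sketch below assumes two raters independently score the same small
sample of papers. The function names, the sample scores, and the 80
percent cutoff are all hypothetical; the source does not prescribe a
specific threshold.

```python
def pairwise_agreement(scores_a, scores_b, tolerance=1):
    """Fraction of double-scored papers on which two raters differ by
    at most `tolerance` points."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("Both raters must score the same non-empty set of papers")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)


def needs_retraining(scores_a, scores_b, min_adjacent=0.8):
    """Flag a rater pair whose within-one-point agreement falls below
    `min_adjacent` (an assumed cutoff, not one taken from the source)."""
    return pairwise_agreement(scores_a, scores_b, tolerance=1) < min_adjacent


# Hypothetical scores from two raters on the same five papers.
rater_a = [4, 3, 2, 5, 3]
rater_b = [4, 2, 2, 3, 3]

exact = pairwise_agreement(rater_a, rater_b, tolerance=0)
adjacent = pairwise_agreement(rater_a, rater_b, tolerance=1)
print(f"Exact agreement: {exact:.0%}, within one point: {adjacent:.0%}")
print("Retraining suggested:", needs_retraining(rater_a, rater_b))
```

Saving the double-scored samples and the resulting agreement rates
from each check is one way to build up the reliability data mentioned
in the last item of the checklist.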