Select or Design
Assessments That Elicit Established Outcomes
Herman, Aschbacher, and Winters (1992) suggest ten steps
as part of the assessment design process.
- Clearly state the purpose for the assessment, and do not expect
the assessment to meet purposes for which it was not designed.
- There are two basic purposes for assessment:
- To determine whether and to what extent students have learned
specific knowledge or skills (content goals). The assessment
should focus on the outcomes or products of student learning,
assessed through instruments such as objective tests and
projects/products.
- To diagnose student strengths and weaknesses and plan
appropriate instruction (process goals). Because you are
interested in understanding where the student is going wrong, you
need to assess the process as well as the product, through
activities such as interviews, documented observations, student
learning logs and/or self-evaluations, behavioral checklists, and
student think-alouds in conjunction with multiple-choice tests.
- Herman, Aschbacher, and Winters (1992) point out that "People
[students] perform better when they know the goal, see models, and
know how their performance compares to the standard." Keep this
observation in mind throughout the process of designing an
assessment.
- Describe the purposes for your assessment program:
- Diagnosis and placement (Describe)
- Formative, instructional planning (Describe)
- End-of-course/summative evaluation (Describe)
- Clearly define what it is you want to assess (the achievement
target).
- Answer the following questions to determine instructional and
achievement outcomes. (Provide sufficient detail so that your
colleagues understand the outcomes and can tell whether or not
students have attained them.)
- What important cognitive skills do I want my students to
develop? (e.g., to communicate effectively in writing or, more
specifically, to write persuasively, to write good descriptions,
and to write stories; to analyze important current events issues
from historical, political, geographic, and multicultural
perspectives; to use algebra to solve everyday problems, etc.)
Select no more than three to five skills per subject area. Use your
state or district outcomes as a guide.
- What social and affective skills do I want my students to
develop? (e.g., to work independently, to develop a spirit of
teamwork and skill in group work, to be persistent in the face of
challenges, to have a healthy skepticism about current arguments
and claims, etc.)
- What metacognitive skills do I want my students to develop?
(e.g., to reflect on the writing process they use, to evaluate its
effectiveness, to discuss and evaluate their problem-solving and/or
research strategies, etc.)
- What types of problems do I want my students to be able to
solve? (e.g., to do research, to solve problems that require
geometric proof, to understand the types of problems that can be
solved with the scientific method, to predict consequences, to
solve problems that have no right answer, etc.)
- What concepts and principles do I want my students to be able
to apply? (e.g., to understand what a democracy is, to understand
cause-and-effect relationships in history and everyday life, to
apply basic principles of ecology and conservation to everyday
life, etc.)
- Prioritize these outcomes.
- How much time will it take for students to develop and/or
acquire the skill or accomplishment? If the answer is only a short
amount of time, the skill probably does not merit an elaborate
assessment.
- How does the desired skill or accomplishment relate to other
complex cognitive, social, and affective skills? Higher priority
should be given to skills that are closely related to other
important skills and can be assessed in many situations.
- How does the desired skill or accomplishment relate to long-
term school and curricular goals? If the skill or accomplishment is
closely related to these goals, select it.
- What is the intrinsic importance of the desired skills and
accomplishments? (Avoiding the trivial seems obvious, but think
about how many multiple-choice items you've answered about trivial
details.)
- Are the desired skills and accomplishments teachable and
attainable for your students? While seeking to challenge students
in order to draw the highest level of achievement from all
students, pay attention to whether your students have the
prerequisite skills, concepts, and knowledge of principles needed
to attain your goals, and whether they have the materials and
capabilities to help them reach these goals.
- List your final set of skills, processes, and dispositions (by
subject area, if desired).
- Match the assessment method to the
achievement purpose and target defined in step 2.
- Specify illustrative tasks that require students to demonstrate
certain skills and accomplishments. Avoid tasks that may be merely
interesting activities for students, but may not yield evidence of
a student's mastery of the desired outcomes.
- Matching the assessment tasks to intended learning outcomes
- Does the task match specific instructional intentions?
- Does the task adequately represent the content and skills you
expect students to attain?
- Does the task enable students to demonstrate their progress and
capabilities?
- Does the assessment use authentic, real-world tasks?
- Does the task lend itself to an interdisciplinary approach?
- Can the task be structured to provide measures of several
outcomes?
- Task specification checklist (a structured sketch follows this
list):
- Outcomes to be measured
- Description of instructional goals
- Eligible content/topics
- Rules/process for selection
- Assessment administration process
- Group/individual roles
- Administration instructions
- Help allowed
- Time allowed
- Actual question/problem/prompt
- Options available
- Student directions
- Scoring procedures
- Use of scores
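For planning across multiple tasks, this checklist can be kept as a
structured record. The Python sketch below is purely illustrative:
every field name and sample value is an assumption, not part of
Herman, Aschbacher, and Winters' framework.

```python
# A hypothetical task specification mirroring the checklist above.
# All field names and sample values are illustrative assumptions.
task_spec = {
    "outcomes_measured": ["write persuasively"],
    "instructional_goals": "Students support a position with evidence.",
    "eligible_content": ["current events", "school and community issues"],
    "selection_rules": "Each student chooses one prompt from three options.",
    "administration": {
        "grouping": "individual",
        "instructions": "read aloud by the teacher",
        "help_allowed": "dictionary only",
        "time_allowed_minutes": 45,
    },
    "prompt": "Write a letter to the editor arguing for or against a policy.",
    "options_available": "word processor or handwritten response",
    "student_directions": "one-page handout distributed before the task",
    "scoring_procedures": "4-point analytic rubric, two trained raters",
    "use_of_scores": "formative feedback and course grade",
}

# A quick completeness check: flag any empty fields before the task is used.
missing = [k for k, v in task_spec.items() if not v]
print("Missing fields:", missing or "none")
```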
- Specify the criteria and standards for judging student
performance on the tasks selected in step 4. Be as specific as
possible, and provide samples of student work that exemplify each
of the standards.
- Regardless of the assessment's purpose, scoring criteria share
four common elements (a sketch of how they might be captured in
code follows this list):
- Dimensions - one or more traits serve as the basis for
judging the student response.
- Definitions and examples of the various criterion levels are
provided to clarify the meaning of each trait or dimension.
- Scale - a scale of values or a counting system, usually with 4
to 6 points, is used to rate each dimension.
- Standards of excellence for specified performance levels
are accompanied by examples or benchmarks at each level.
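As a concrete illustration, these four elements can be represented
as a small data structure. Everything in the sketch below (class
names, the sample dimension, the benchmark references) is
hypothetical, a minimal sketch rather than part of the authors'
framework.

```python
# Minimal sketch of a rubric built from the four elements above:
# dimensions, definitions/examples, a scale, and benchmark work per level.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str                  # the trait used to judge the response
    definition: str            # clarifies what the trait means
    benchmarks: dict[int, str] = field(default_factory=dict)  # examples per scale point

@dataclass
class Rubric:
    scale: range               # typically 4 to 6 points
    dimensions: list[Dimension]

# Hypothetical 4-point rubric for persuasive writing.
persuasive_writing = Rubric(
    scale=range(1, 5),
    dimensions=[
        Dimension(
            name="Organization",
            definition="Ideas follow a logical order with clear transitions.",
            benchmarks={4: "benchmark_essay_a.txt", 1: "benchmark_essay_d.txt"},
        ),
    ],
)
```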
- Selecting scoring dimensions requires answering the following
questions:
- What are the attributes of good writing, of good scientific
thinking, of good collaborative group process, of effective oral
presentation?
- What do I expect to see if this task is performed excellently,
adequately, or poorly?
- Do I have examples of students' work that exemplify some of the
criteria I may want to use to judge performance on this task?
- What criteria for similar tasks exist in my state curriculum
frameworks, my state assessment program (e.g., it might be
worthwhile to use the "Write on Illinois" rubrics for writing), my
district curriculum guide, and my school's assessment program?
- Writing clear descriptions of scoring dimensions
- Write about the behavior or elements you will see in the
student's performance, not what you will perceive through
intuition. Instead of "will demonstrate an understanding of the
scientific method," say "will build into his or her investigation
a definition of the problem, a review of the literature about this
topic, a listing of hypotheses that could be tested, an
experimental design that controls for all variables except those in
the hypotheses that are being tested, and an explanation of how the
findings were used to find a solution for the original problem."
These specific descriptions are easier for other scorers to
interpret and apply consistently.
- Provide examples of work at each point along the scoring scale
to help scorers articulate precise definitions of each dimension.
- Refine and revise scoring dimensions based on experience using
the assessment with students.
- Evaluating scoring criteria
- Do the criteria address all important outcomes?
- Do the rating strategies match the decision purpose - holistic
for a global, evaluative view; analytic for a diagnostic view?
- Do the rating scales provide usable, easily interpretable
results?
- Do the criteria employ concrete references and clear language
understandable to students, parents, and other teachers?
- Do the criteria reflect current conceptions of "excellence"
accepted in the field?
- Have the criteria been reviewed for developmental, ethnic, and
gender bias?
- Do the criteria reflect teachable outcomes?
- Are the criteria limited to a feasible number of dimensions?
- Are the criteria generalizable to other similar tasks or larger
performance domains?
- Develop a reliable rating process that allows different raters
at different points in time to obtain the same - or nearly the same
- results, or allows a single teacher in a classroom to assess each
student using the same criteria.
- Definition of reliable scoring: A performance is rated reliably
or consistently when the same behavior is given the same score
regardless of who has performed the task, when it is rated, or by
whom it is rated. The scoring criteria are applied uniformly to
each performance each time an assessment is conducted. Subjective
application of the scoring criteria is minimized.
- Rater training: A prerequisite for consistent scoring
- All teachers who will be scoring the assessment task must be
trained in the following topics:
- Orientation to the assessment task - give teachers an overview
of the assessment task, what the results will be used for, who will
use them, what directions and prompts the students will receive,
and how the scoring guide operationalizes desired outcomes or
goals.
- Clarification of the scoring criteria - raters engage in
extensive discussion of scoring dimensions and scale values.
Dimensions and values are defined, and examples or models of
students' work should be provided to exemplify each.
- Practice scoring - the very heart of the scorer training is to
allow prospective scorers to score student work, first in a group
with responses that are fairly simple to score, then with samples
that call for more difficult judgments (borderline or atypical
responses).
- Protocol revision - during the practice scoring, raters may
devise certain rules for dealing with unanticipated aspects of
judgment posed by a particular set of papers and not covered by the
scoring guide. For example, if a number of students misinterpreted
the task in the same way, raters may decide to revise the scoring
rubric to assess the students' papers based on the student-defined
task.
- Score recording - raters must be taught how to record student
scores. A sample score report should be provided.
- Documenting rater reliability - training ends when there is
agreement that scorers have reached an acceptable level of
consistency, usually rating sample pieces within one point of each
other. In order to determine when raters are ready for the real
thing, reliability checks are conducted during training. Figure 1
below demonstrates two ways to calculate rater agreement.
Figure 1. Two ways to calculate rater agreement

           Is Rater in Perfect Agreement    Is Rater in Agreement with the
           with the Criterion Score?        Criterion Score, Plus or Minus 1 Point?

Rater      Paper #1   Paper #2   Agreement  Paper #1   Paper #2   Agreement
Linda      yes        no         50%        yes        no         50%
Donna      no         no         0%         yes        yes        100%
Mark       yes        yes        100%       yes        yes        100%
TOTAL      67% yes    33% yes    50%        100% yes   67% yes    83%
Figure 1 illustrates a case in which three raters are asked to rate
two criterion (exemplary) papers after some training. According to
the results, Linda agrees with the score for paper 1 but not for
paper 2; in fact, for paper 2 she is not even within 1 point. Donna
is not in perfect agreement with the criterion scores on either
paper 1 or 2, but is in agreement plus or minus one point on both
papers. Mark is in agreement all the time and is ready to rate
student work. Linda and Donna probably need more training. If you
were to report these rater agreement results, you would say, "On
average, raters obtained perfect agreement with criterion scores 50
percent of the time, and reached plus or minus one agreement 83
percent of the time." (Herman, Aschbacher, & Winters, 1992)
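The agreement percentages in Figure 1 are simple to compute. The
Python sketch below reproduces the figure's results; the numeric
scores are assumed values chosen so that the yes/no pattern matches
the figure, which reports only the pattern itself.

```python
# Two ways to measure rater agreement, as in Figure 1.
# Criterion and rater scores are hypothetical; only the resulting
# yes/no pattern comes from the figure.
criterion = {"paper1": 4, "paper2": 3}

ratings = {
    "Linda": {"paper1": 4, "paper2": 5},   # paper 2 misses by 2 points
    "Donna": {"paper1": 3, "paper2": 4},   # off by 1 on both papers
    "Mark":  {"paper1": 4, "paper2": 3},   # matches the criterion on both
}

for rater, scores in ratings.items():
    perfect = [scores[p] == criterion[p] for p in criterion]
    within1 = [abs(scores[p] - criterion[p]) <= 1 for p in criterion]
    print(f"{rater}: {sum(perfect)/len(perfect):.0%} perfect, "
          f"{sum(within1)/len(within1):.0%} within one point")

# Averages across all raters and papers (the TOTAL row: 50% and 83%).
flat = [(r[p], criterion[p]) for r in ratings.values() for p in criterion]
print(f"Overall: {sum(a == b for a, b in flat)/len(flat):.0%} perfect, "
      f"{sum(abs(a - b) <= 1 for a, b in flat)/len(flat):.0%} within one point")
```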
- Scheduling considerations - Be sure you have an acceptable
level of agreement before teachers judge student work. Provide
time for training, and don't overwork your teachers; inconsistent
scoring occurs when raters are tired, so limit scoring to no more
than 6 hours a day. Provide time for retraining and refresher
exercises at the beginning of each day of scoring.
- Criteria for checking the reliability of your rating process
- Documented, field-tested scoring guides
- Clear, concrete scoring criteria
- Annotated examples of student work at all score points on the
scale
- Ample practice and feedback for raters
- Multiple raters with demonstrated agreement prior to scoring
- Periodic reliability checks throughout the scoring process
- Retraining when needed
- Arrangements for collection of suitable reliability data
- Avoid the pitfalls that threaten reliability and validity and
can lead to mismeasurement of students. Assessors should ensure (a)
adequate sampling of the content domain, (b) absence of bias or
subjective scoring, (c) reasonable uniformity in administering
assessments, (d) minimal effects of extraneous factors (e.g., too
much reading on a mathematics or social studies test), (e) a
suitable environment for assessment, and (f) awareness of and
compensation for temporary factors affecting the student (e.g.,
parents' recent divorce or illness).
- Collect evidence/data showing that the assessment is reliable
(yields consistent results) and valid (yields useful data for the
decisions being made). With performance assessments, reliability
and validity might be demonstrated through inter-rater agreement on
scoring and evidence that students who perform well on the
assessment also perform well on related items or tasks. With
multiple-choice assessments, correlations should demonstrate
internal consistency (students perform equally well or poorly on
all related items) and show that performance on the test correlates
to performance of similar skills presented differently. In the
classroom, where teachers have multiple measurements of each
student's performance, the formal collection of technical quality
data is not necessary.
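As a rough illustration of the evidence described above, the sketch
below correlates students' scores on an assessment with their
scores on a related task. The data and the 0.7 rule of thumb are
hypothetical; a formal technical-quality study would use larger
samples and established statistics (e.g., Cronbach's alpha for
internal consistency).

```python
# Hypothetical check that students who perform well on the assessment
# also perform well on a related task (requires Python 3.10+).
from statistics import correlation

assessment_scores = [78, 85, 62, 90, 71]     # made-up class results
related_task_scores = [74, 88, 65, 93, 70]   # scores on a related task

r = correlation(assessment_scores, related_task_scores)  # Pearson's r
print(f"Pearson r = {r:.2f}")
if r >= 0.7:   # illustrative threshold, not an established standard
    print("Performance on the two measures moves together.")
```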
- Ensure "consequential validity." That is, the assessment should
have a maximum of positive effects and a minimum of negative ones.
For example, the assessment should give teachers and students the
right messages about what is important to learn and to teach, it
should not restrict the curriculum, it should be a useful
instructional tool, and decisions made on the basis of the
assessment results should be appropriate.
- Use test results to refine the assessment and improve curriculum
and instruction; provide feedback to students, parents, and the
community.