Red Flags: Problems to Watch
for in Studies of Software Effectiveness
Studies of the effectiveness of educational software can suffer from
a number of problems. Some of these are "red flags"
that reduce the overall quality of the research and that may lead readers
to misjudge the size of a software program's likely effect
in a particular setting. Below we discuss the problems most commonly
encountered when reviewing studies of the effectiveness of
education technology:
Selecting a special group of students,
teachers, and sites
In medical research, participants are selected randomly to be part
of a trial of a new medication or form of treatment. By selecting
participants randomly, medical researchers have participants from
a variety of backgrounds and personal medical histories that could
affect the results. Educational research, by contrast, frequently observes
students and teachers in classrooms and schools that have not been selected
randomly. These students may have special characteristics at the outset,
such as being high-performing. Teachers may have volunteered
to use the software, and these teachers may be different from teachers
in a comparison group. Schools in the treatment group may have a
strong commitment to technology. Selection of particular kinds of
students and sites in a research study limits the conclusions one
can draw: if the study is otherwise well-designed, we can know for
sure only that the software has been effective with this kind of
student or this kind of school.
Using a comparison group that is different
from the treatment group
Many studies use a comparison group to help determine whether having
students use software to practice reading or math skills is more
effective than other methods that might be used in the classroom.
In these studies, student achievement may be measured only once,
at the end of the treatment. In some cases, student
achievement is measured at the beginning and the end of the treatment
(a pre-post design). In either design, it is important that the
comparison group be similar to the treatment group with respect
to the students' backgrounds, prior levels of achievement, and any
other factors that could affect achievement. If the comparison group
is too different from the treatment group, results may be biased.
For example, the treatment group may have higher prior levels of
achievement than the comparison group. Alternatively, initial performance
of the treatment group may be much lower, leaving more room
for its scores to increase than in the comparison group. Some researchers
randomly select students to be part of a treatment or a comparison
group, helping to ensure comparable treatment and comparison groups.
Other researchers instead match groups, using
demographic profiles of schools to find similar groups of students.
These alternative methods of matching samples are not as sound because
they increase the likelihood that groups will differ in other ways
such as instructional program, class size, or class scheduling.
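One way to check whether two groups were comparable at the outset is to express the gap between their pretest averages in standard deviation units. The sketch below is illustrative only: the function name `baseline_gap` and the scores are hypothetical, and the pooled standard deviation is one common choice of scale, not a method prescribed by this guide.

```python
from statistics import mean, variance

def baseline_gap(treatment_scores, comparison_scores):
    """Pretest difference between groups, in pooled standard deviation units."""
    n_t, n_c = len(treatment_scores), len(comparison_scores)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_t - 1) * variance(treatment_scores)
                  + (n_c - 1) * variance(comparison_scores)) / (n_t + n_c - 2)
    return (mean(treatment_scores) - mean(comparison_scores)) / pooled_var ** 0.5

# Hypothetical pretest scores, invented for illustration.
treatment = [52, 55, 60, 63, 70]
comparison = [50, 53, 58, 61, 68]
print(round(baseline_gap(treatment, comparison), 2))
```

A value near zero suggests the groups started at similar levels on this one measure. It says nothing about differences in instructional program, class size, or scheduling.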
Selecting an inappropriate test of student
achievement
Researchers must select some test to measure a program's effects.
On the one hand, it is important that this test be sensitive to
the kind of instruction that takes place in the program. For example,
if the software addresses reading skills, student scores on a reading
test might be expected to improve. At the same time, one could not
expect students' scores on a test of mathematics skills to improve
in such a program. Sometimes researchers measure technology's effectiveness
by using tests that fit the specific technology program goals so
narrowly that they do not reflect more common and familiar academic
outcomes. Ideally, researchers use tests that have been validated
for use across more than one program but that are also sensitive
to the kinds of things students might be expected to learn, given
the software's design. Other times researchers use tests that subject
students to performance conditions that differ from the performance
conditions supported by the technology. Such tests may not capture
an actual effect because students are performing at a disadvantage
(e.g., without a calculator).
Using a small sample
All else being equal, studies with larger numbers of students, classrooms,
or schools are of higher quality than studies with significantly
fewer participants. When too few subjects are included in a study,
the validity of the findings may be questionable. For example, with
larger sample sizes a research study is more likely to detect
meaningful effects of the intervention on students if the effects
in fact occurred. Readers should be particularly wary of studies
using small sample sizes (fewer than 30 subjects) that report no
effect of the use of the technology on student performance. In many
such cases, there are too few subjects to detect
an effect even when there might have been one. In addition, the
smaller the sample size, the higher the probability that the findings
could be due purely to chance and might not be replicated if the study
were repeated with a different set of students, classrooms, or schools. Lastly,
with the use of significantly smaller study samples, the findings
from a single study are less likely to accurately reflect the typical
experience of the typical student or classroom in the population
of interest. Thus, in most cases where small sample sizes are used,
you may not be able to generalize the findings from a single study
to a larger population.
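The point about small samples missing real effects can be illustrated with a rough power calculation. The sketch below is an approximation under stated assumptions: it uses a normal approximation to a two-group comparison with equal group sizes, the function name `approx_power` is hypothetical, and the effect size of 0.3 standard deviations is an arbitrary illustrative value rather than a figure from any study.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate chance of detecting a true effect (two-sided test).

    effect_size is the true difference between group means in standard
    deviation units. Uses a normal approximation, so it is slightly
    optimistic for very small groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the effect is real.
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Assumed true effect of 0.3 standard deviations (illustrative only).
for n in (15, 50, 200):
    print(n, "students per group:", round(approx_power(0.3, n), 2))
```

Under these assumptions, a study with 15 students per group (30 in total) would detect the effect well under half the time, so a "no effect" finding from such a study is weak evidence that the software does not work.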
Not documenting the duration of students'
exposure to the software
All too often, studies of effectiveness do not describe how often
or for how long students used the educational software. Not reporting
such information makes it difficult to know not just how much exposure
is needed to achieve the results the researcher found but also whether
the software is effective at all. It may be that the software was
used very little, and that other reforms led to the
results found (see Confounding below). It may also be that
students used the software quite extensively, and for such a length
of time that it would be impossible for another school to duplicate
the results.
Studying effectiveness too early in a
software program's use and the effects of novelty
Longer is better. Results are more likely to be applicable if the
study was conducted for at least an entire semester of the school
year. Studies of programs tested for shorter periods of time may underestimate
their impact if teachers and students have not yet learned how to incorporate
the software into instruction. In addition, research has found that newly
introduced technologies in the classroom may initially inflate scores
due to a short-lived period of teacher and student excitement over
access to the new program ("novelty effect"). Program
effectiveness measured over the course of an entire semester or
school year will provide a more realistic measure of the impacts
a school can expect to achieve due to regular use of the software.
The evaluator is not independent of the
vendor
Vendors typically hire evaluators to conduct studies. They may even
suggest to evaluators particular research methods or ways of framing
results that present the program's effectiveness in
a more favorable light. At the same time, evaluators typically follow
a code of ethics that requires them to be fair in their assessment
of programs and to be willing to report less favorable results.
Even so, knowing who sponsored a particular evaluation study and
making sure that the data reported match the conclusions drawn by
the evaluator are important parts of judging research quality.
Confounding: Separating the effect of
software use from the effect of other changes in the school
When interpreting the results from studies that do not use a no-use
comparison group, you must be aware of other factors taking place
at the same time within the school and district that could explain
the results. Often the introduction of technology programs
is accompanied by other changes within schools and districts that
may influence student performance, including changes in the curriculum,
instructional practice, teacher resources, school organization,
and how students are assessed. Without the use of a well-matched
comparison group, it is extremely difficult to separate the influence
of the technology intervention from the influence of other outside
factors. Scores on the testing instrument may be climbing across
all schools in the district regardless of any impact due to the
use of technology in the classroom.
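When a study does report pre- and post-test averages for both a treatment and a comparison group, a simple piece of arithmetic shows how much of the treatment group's gain might reflect a district-wide trend. The sketch below is a simplified illustration, not a substitute for the study's own analysis: the function name `net_gain` and all scores are hypothetical, and the subtraction only makes sense if the two groups are well matched.

```python
def net_gain(treat_pre, treat_post, comp_pre, comp_post):
    """Treatment group's gain beyond the gain seen in the comparison group."""
    treatment_gain = treat_post - treat_pre
    comparison_gain = comp_post - comp_pre
    return treatment_gain - comparison_gain

# Hypothetical average scores, invented for illustration.
# Treatment group rose 8 points, but comparison schools rose 6 points too.
print(net_gain(treat_pre=60, treat_post=68, comp_pre=61, comp_post=67))
```

In this made-up case most of the 8-point gain also appeared in schools that did not use the software, leaving only a small difference that could be attributed to it.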
Students in treatment groups dropping out
of the study at different rates from students in comparison groups
Differential sample attrition is a problem you must be aware of
any time a study compares a group of students who used the software
to a group who did not. Two groups that are almost identical at
the start of the study might become significantly different before
the study ends if students drop out of the groups before the end
of the study at different rates or for different reasons. Long-term
studies are particularly susceptible to differential sample attrition.
See if the author reports on the loss of subjects from either group
during the course of the study. Information on attrition may appear
in other ways such as in tables used to describe the samples and
results. See if you can determine whether the sample sizes for the
groups at the beginning and end of the study remain the same or
are only slightly different. If the difference is large, significant
improvements in test scores attributed to the use of the software
may be nothing more than the result of the differences in the types
of students that remained in the two groups.
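The check described above can be done by hand from the sample sizes a study reports. The sketch below is a minimal illustration: the function name `attrition_rates` and the counts are hypothetical, and no particular cutoff for an acceptable difference is implied.

```python
def attrition_rates(treat_start, treat_end, comp_start, comp_end):
    """Share of students lost from each group, and the gap between the two."""
    treat_rate = (treat_start - treat_end) / treat_start
    comp_rate = (comp_start - comp_end) / comp_start
    return {
        "treatment": treat_rate,
        "comparison": comp_rate,
        "differential": abs(treat_rate - comp_rate),
    }

# Hypothetical counts: both groups began with 100 students.
rates = attrition_rates(treat_start=100, treat_end=90, comp_start=100, comp_end=70)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

A large gap between the two rates is a signal to ask which students left each group and why, since the students who remained may no longer be comparable.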