White Paper: Effect Sizes and the Measurement of Learning
April 20, 2007
What Is an Effect Size and Why Should You Care?
An effect size is a standardized measure of change over time. Trainers and educators continually wonder if the materials they use, the courses they teach, and the examples and behaviors they model actually produce any change. This generalization applies equally to any industry or profession--medicine, law, accounting, sales training, and so forth. How do we know if what we are doing has any effect?
Traditionally, educators have relied on grade point averages, test results, GRE or GMAT scores, and other so-called "standardized tests." Surely these outcomes mean something, but just what? And how much of a difference makes a difference?
Over time, the training profession has attempted to recognize the role of confounds (factors other than the tests themselves that influence test results) and selection bias (who is included, who is excluded, and who drops out of a testing process). As a result, the profession has adopted such techniques as Likert scales, multiple regression, multivariate analysis . . . and the list goes on. Each of these techniques aims to overcome some deficiency of traditional tests, but each in turn has its own shortcomings.
Experts in scientific and clinical research methods have invested a small fortune in time and money to create antidotes to test deficiencies. Just go on Evaluated Medline and examine the vast array of meta-analyses, odds ratio studies, research repositories, such as the Cochrane Collection, and quasi-analytic studies. These powerful approaches lend strength to outcome findings because they rely upon the power of large sample sizes and the collective benefit of multiple research techniques.
Unfortunately, the faculty in a residency or other clinical training program has neither the time nor the money to evaluate, select, or employ such comprehensive tools. What can be done to arm the individual program director and his or her colleagues with a practical, feasible method of evaluating test results?
The answer is the effect size measure. In fact, effect size techniques of one kind or another either directly or implicitly underlie the methods employed in meta-analysis and comparable large-scale research and testing methods. Here is what the term means and how it can work to help your teaching and testing practice:
An effect size compares the mean results of two or more "tests" relative to the amount of variance between or among these outcomes. In a very simple setting, suppose you pre-test a group of residents on some element of content, and then post-test them on the same content a month later. You test the same people on the same materials. In each case, you obtain a mean result. In many, perhaps most, cases, we are left to ask ourselves what the difference in these mean results is telling us. Is the difference due to a real training influence, or is it just chance? Can we estimate the probability that our outcome is just a random or chance result, as opposed to a real training benefit? The answer to this question is YES.
Well, like everything in testing, the answer is a qualified YES. As the professional statisticians will tell us, we are actually making the following assumptions about the test condition "from the get go":
- The compared groups are part of the same population of test-takers, i.e., they are either the same people or the groups are samples from the same type of pool—residents, PAs in training, undergraduates, medical students, or some other pool with common properties.
- The normal amount of variance in this common population is stable.
We can think of variance as the spread or distribution of data points in testing outcomes. To keep it simple, a narrower spread or smaller variance indicates that the mean point in the data distribution is a better measure of a group's average performance than would be the case with a wider spread. So, to measure testing effectiveness, we compare the difference in means to the "average variance" among or between the compared groups. The square root of this average variance is known as the pooled standard deviation, and it serves as the denominator of the effect size measure. Here is a simplified version of the formula:
Es = (M2 - M1) / √[(V2 + V1) / 2]
Where Es is the effect size; M2 and M1 are the mean, that is, average results of two group performances; and V2 and V1 are the variance calculations for each group. So, to get an effect size, we divide the difference in mean outcomes by the square root of the average of the two variances. That is a mouthful, but just think "mean difference adjusted for variance"!
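The simplified formula above can be sketched in a few lines of Python. This is only an illustration, not the calculation used by any particular reporting system: the function name effect_size and the scores are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the formula does not say which variance is meant.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(group1, group2):
    """Return (M2 - M1) / sqrt((V2 + V1) / 2) for two lists of scores."""
    m1, m2 = mean(group1), mean(group2)
    # statistics.variance is the sample variance (n - 1 denominator);
    # that choice is an assumption, not something the formula specifies.
    v1, v2 = variance(group1), variance(group2)
    pooled_sd = sqrt((v2 + v1) / 2)
    return (m2 - m1) / pooled_sd


# Hypothetical pre-test and post-test scores for the same five residents.
pre_test = [60, 65, 70, 75, 80]
post_test = [65, 70, 75, 80, 85]

es = effect_size(pre_test, post_test)
print(f"Effect size: {es:.2f}")  # Effect size: 0.63
```

Here the mean rises by 5 points and each group has a variance of 62.5, so the result is 5 divided by the square root of 62.5, or about 0.63. Note that the sketch would fail with a division by zero if every score in both groups were identical.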
Once we have an effect size outcome, what does the number actually indicate? Well, for starters, we sure hope the number is positive, since a negative number would mean that the training had a negative or harmful effect. More typically, the outcome will be a positive number in the range of 0.2 through 0.9; these values represent a continuum where the top end is highly favorable and the bottom end suggests lack of much impact. If we were to examine the outcomes of big clinical trials of treatments and large-scale training programs, it would be typical for beneficial programs to have an effect size between 0.6 and 0.8. A much larger number would indicate an unexpectedly powerful result and might suggest a defect in the design of the training. For example, if you pre-tested a group with absolutely no knowledge of the test matter, and then post-tested them after training, you could get numbers greater than 1.0. However, what would be the point of pre-testing a completely uninformed audience? No tool can overcome a poor or misguided training or testing design. So, let's be sure we are making comparisons that make sense so that we can take comfort in the usefulness of our effect size measures.
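As a rough aid to reading these numbers, the ranges described above can be turned into a small lookup. The function name describe_effect_size is hypothetical, and the exact cutoffs (0.2, 0.6, 0.9) are illustrative assumptions drawn from the ranges mentioned in the text; they are a continuum, not hard thresholds.

```python
def describe_effect_size(es):
    """Map an effect size to a rough description; cutoffs are illustrative."""
    if es < 0:
        return "negative: scores fell after training"
    if es < 0.2:
        return "below the typical range: little measurable impact"
    if es < 0.6:
        return "low to middle of the typical range: modest impact"
    if es <= 0.9:
        return "upper part of the typical range: favorable impact"
    # Larger values may be real, but they are a cue to review the design.
    return "unusually large: check the training and testing design"


for value in (-0.1, 0.1, 0.4, 0.7, 1.3):
    print(f"{value:+.1f} -> {describe_effect_size(value)}")
```

A label of this kind is only a starting point; the design of the comparison still determines whether the number means anything.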
The Challenger StatPak™ system will automatically calculate effect size measures as part of its reporting options. With these measures, you can achieve the following objectives:
- increase your confidence in the raw outcomes of testing
- report verifiable results to the ACGME, your university, and your program administration
- more comfortably assign remediation based on validation of test result deficiencies
- modify your training curriculum to compensate for content gaps uncovered via test results.
Viewing Effect Size Results
Since the Challenger StatPak™ system automatically calculates effect size measures, taking advantage of this advanced feature is merely a matter of looking at the numbers and knowing how to interpret them.
Challenger StatPak™ includes two types of effect size reporting. The Pre Test/Post Test type allows you to compare assessment results for your trainees during two time periods, the Pretest period and the Posttest period. The Group Comparison Effect Size (World) compares scores in your program to others in the Challenger system.
Both reports are easily accessible from our main reports page for subscribers with the Premium Reporting Package version of Challenger StatPak™.
Other Data Included in Effect Size Reports
Both types of Effect Size Report also include:
- the mean: the average score on the tests
- the standard deviation: a measure of the variability of the score distribution; smaller is better
- the variance: the spread of scores around the mean (the square of the standard deviation)
- the confidence interval, assuming p < .05; that is, the Challenger benchmark is 95% confidence that a test result or comparison is not a chance result
- the sample size: the number of assessments sampled
These values can help you determine whether or not the Effect Size number is useful; for example, comparing two radically different sample sizes will provide a less reliable result than comparing two similar ones. A group of scores with a wider variance is a less reliable indicator of performance than a group with a smaller variance.
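To make the accompanying values concrete, here is a minimal sketch of how they could be computed for one set of scores. It is not a description of how the Challenger reports are produced: the function name summarize and the scores are hypothetical, and the 95% confidence interval for the mean uses a normal approximation, which is an assumption and is only rough for very small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance


def summarize(scores, confidence=0.95):
    """Return sample size, mean, SD, variance, and a CI for the mean."""
    n = len(scores)
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    # Two-sided critical value from the normal distribution (about 1.96
    # for 95%); a t-based value would be wider for small samples.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / sqrt(n)
    return {
        "n": n,
        "mean": m,
        "sd": sd,
        "variance": variance(scores),
        "ci": (m - half_width, m + half_width),
    }


# Hypothetical scores for one group of five trainees.
report = summarize([60, 65, 70, 75, 80])
low, high = report["ci"]
print(f"n={report['n']} mean={report['mean']} sd={report['sd']:.2f}")
print(f"variance={report['variance']:.1f} 95% CI=({low:.1f}, {high:.1f})")
```

Because the interval's half-width shrinks with the square root of the sample size, a small group yields a wide interval, which is one reason comparisons involving very small or very unequal samples deserve caution.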
Conclusions
The Challenger Program for Residencies™ is just a few years old. However, we already have one of the largest national pools of adult end users. In fact, there are over ten thousand "seats" occupied by residents, PAs, and fellows in training with Challenger content and subject to evaluation in our reporting system. That is far larger than any corporate university, and approximates the scope of training pools for the military and large government organizations. Already, our client institutions are beginning to report training outcomes that were either unavailable in the past or not easily interpreted in terms of validated training effects. I will simply cite two examples:
- The Family Medicine Residency Program at St. Anthony's Hospital in Oklahoma City has replaced some of its traditional didactic curriculum with the Challenger content library for cognitive training of residents. Moreover, the faculty can now assign and verify compliance with instruction from remote locations when other commitments require their absence from campus.
- The PA Training Program at Trevecca Nazarene University in Nashville reported a measurable effect size gain from use of Challenger content in PA training, measured as the success rate of their graduates on the PA National Certification Exam (PANCE). The initial test pass rate rose from 83% to 97% for the students who tested from this program from 2005 to 2006.
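The mean-and-variance formula given earlier does not apply directly to pass rates, since only two percentages are reported. One common way to express a change between two proportions as an effect size is Cohen's h, which is a different technique from the formula in this paper and is shown here only as an illustration. The sketch below applies it to the 83% and 97% pass rates; it says nothing about sample sizes or statistical significance, which are not reported here.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))


# Pass rates cited above: 83% before, 97% after.
h = cohens_h(0.83, 0.97)
print(f"Cohen's h: {h:.2f}")  # Cohen's h: 0.50
```

Cohen's h is on its own scale, so the result should not be compared directly with values from the mean-and-variance formula.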
For further discussion of the effect size methodology and how it might be applied to measure the benefits of your training program, you can e-mail or call me at:
Robert E. "Bob" Sweeney, PhD, MS
Chief Executive Officer
Challenger Corporation
5100 Poplar Avenue, Ste. 310
Memphis, Tennessee 38137
Tel: (901) 762-8425
E-mail: bobs@chall.com