What Is an Effect Size and Why Should You Care?

An effect size is a standardized measure of change over time. Trainers and educators continually wonder if the materials they use, the courses they teach, and the examples and behaviors they model actually produce any change. This generalization applies equally to any industry or profession--medicine, law, accounting, sales training, and so forth. How do we know if what we are doing has any effect?

Traditionally, educators have used such things as grade point averages, test scores, GRE or GMAT scores, and other so-called "standardized tests." Surely, these outcomes mean something, but just what? And how much of a difference makes a difference?

Over time, the training profession has attempted to recognize the role of confounds, that is, factors other than the tests themselves that influence test results, and of selection bias, that is, who is included, who is excluded, and who drops out in a testing process. As a result, the profession has adopted such techniques as Likert scales, multiple regression, multivariate analysis . . . and the list goes on. Each of these techniques aims to overcome some deficiency of traditional tests, but each in turn has its own shortcomings.

Experts in scientific and clinical research methods have invested a small fortune in time and money to create antidotes to test deficiencies. Just browse Evaluated Medline and examine the vast array of meta-analyses, odds ratio studies, quasi-analytic studies, and research repositories such as the Cochrane Library. These powerful approaches lend strength to outcome findings because they rely upon the power of large sample sizes and the collective benefit of multiple research techniques.

Unfortunately, the faculty in a residency or other clinical training program has neither the time nor the money to evaluate, select, or employ such comprehensive tools. What can be done to arm the individual program director and his or her colleagues with a practical and feasible method of evaluating test results?

The answer is the effect size measure. In fact, effect size techniques of one kind or another either directly or implicitly underlie the methods employed in meta-analysis and comparable large-scale research and testing methods. Here is what the term means and how it can work to help your teaching and testing practice:

An effect size compares the mean results of two or more "tests" relative to the amount of variance between or among these outcomes. In a very simple setting, suppose you pre-test a group of residents on some element of content, and then post-test them on the same content a month later. You test the same people on the same materials. In each case, you obtain a mean result. In many, perhaps most, cases, we are left to ask ourselves what the difference in these mean results is telling us. Is the difference due to a real training influence, or is it just chance? Can we estimate the probability that our outcome is just a random or chance result, as opposed to a real training benefit? The answer to this question is YES.

Well, like everything in testing, the answer is a qualified YES. As the professional statisticians will tell us, we are actually making the following assumptions about the test condition "from the get go":

We can think of variance as the spread or distribution of data points in testing outcomes.  To keep it simple, a narrower spread or smaller variance indicates that the mean point in the data distribution is a better measure of a group's average performance than would be the case with a wider spread. So, to measure testing effectiveness, we compare the difference in means to the "average variance" among or between the compared groups.  The technical term for this average variance is the pooled standard deviation of the effect size measure.  Here is a simplified version of the formula *:

Es = (M2 - M1) / √[(V2 + V1) / 2]

Where Es is the effect size; M2 and M1 are the means, that is, the average results, of the two group performances; and V2 and V1 are the variances calculated for each group. So, to get an effect size, we divide the difference in mean outcomes by the square root of the average of the two variances. That is a mouthful, but just think "mean difference adjusted for variance"!
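To make the formula concrete, here is a minimal sketch in Python. The function name effect_size and the sample scores are invented for illustration, and the sketch assumes the sample variance (the n - 1 form); the formula above does not say which form of variance is intended, and this is not a description of how any particular reporting system computes its numbers.

```python
from math import sqrt
from statistics import mean, variance


def effect_size(group1_scores, group2_scores):
    """Return (M2 - M1) divided by the square root of (V2 + V1) / 2.

    Assumption: V1 and V2 are sample variances (n - 1 denominator).
    """
    m1, m2 = mean(group1_scores), mean(group2_scores)
    v1, v2 = variance(group1_scores), variance(group2_scores)
    pooled_sd = sqrt((v1 + v2) / 2)
    if pooled_sd == 0:
        raise ValueError("both groups have zero variance; effect size is undefined")
    return (m2 - m1) / pooled_sd


# Hypothetical pre-test and post-test scores for the same five residents.
pre_test = [60, 65, 70, 75, 80]
post_test = [64, 69, 74, 79, 84]

print(round(effect_size(pre_test, post_test), 2))  # prints 0.51
```

In this made-up example the mean rises by 4 points while the pooled standard deviation is about 7.9, so the gain is roughly half a standard deviation.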

Once we have an effect size outcome, what does the number actually indicate? Well, for starters, we sure hope the number is positive, since a negative number would mean that the training had a negative or harmful effect. More typically, the outcome will be a positive number in the range of 0.2 through 0.9; these values represent a continuum where the top end is highly favorable and the bottom end suggests a lack of much impact. If we were to examine the outcomes of big clinical trials of treatments and large-scale training programs, it would be typical for beneficial programs to have an effect size between 0.6 and 0.8. A much larger number would indicate an unexpectedly powerful result and might suggest a defect in the design of the training. For example, if you pre-tested a group with absolutely no knowledge of the test matter, then post-tested them after training, you could get numbers greater than 1.0. However, what would be the point of pre-testing a completely uninformed audience? No tool can overcome a poor or misguided training or testing design. So, let's be sure we are making comparisons that make sense so that we can take comfort in the usefulness of our effect size measures.
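The rules of thumb in the preceding paragraph can be written down as a small lookup, sketched here in Python. The function name describe_effect_size and the wording of the labels are invented for illustration, and the hard cutoffs are a simplification: the text describes a continuum, not strict boundaries.

```python
def describe_effect_size(es):
    """Map an effect size to a rough description based on the ranges above.

    Assumption: the 0.2, 0.6, 0.8 and 0.9 figures are treated as hard
    cutoffs purely for illustration.
    """
    if es < 0:
        return "negative: scores fell after training"
    if es < 0.2:
        return "below the usual range: little apparent impact"
    if es <= 0.9:
        if 0.6 <= es <= 0.8:
            return "typical of beneficial programs"
        return "within the usual range"
    return "above the usual range: review the training and testing design"


for value in (-0.3, 0.1, 0.4, 0.7, 1.2):
    print(value, "->", describe_effect_size(value))
```

A label like this is only a prompt for a closer look; it says nothing about whether the comparison itself made sense.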

The Challenger StatPak™ system will automatically calculate effect size measures as part of its reporting options. With these measures, you can achieve the following objectives:

Viewing Effect Size Results

Since the Challenger StatPak system automatically calculates effect size measures, taking advantage of this advanced feature is merely a matter of looking at the numbers and knowing how to interpret them.

Challenger StatPak includes two types of effect size reporting. The Pre Test/Post Test type allows you to compare assessment results for your trainees during two time periods, the Pretest period and the Posttest period.  The Group Comparison Effect Size (World) compares scores in your program to others in the Challenger system.

Both reports are easily accessible from the main reports page.

Viewing Pre Test / Post Test Reports

This report compares data from your program for any two groups that you define based on date ranges, the Pre Test Group and the Post Test Group. Open the reporting system and click on Effect Size from the menu page. Next, select 1) Pre Test / Post Test Effect Size.

Fill in the appropriate information in the Date From and Date To fields.

If, for example, you enter Date From: October 1, 2006, and Date To: January 1, 2007, your effect size will be calculated using pretest data containing all results up to October 1, 2006, and posttest data using all results between October 1, 2006 and January 1, 2007.

Note: All dates are defined as 12:00 a.m., Central Time. In other words, if you choose a Date From of October 1, 2006, the time period begins at 12:00 a.m. on that day. Likewise, a date of October 1, 2006, entered in the Date To field represents an ending time of 12:00 a.m. Central Time on that date, so an ending date of October 1, 2006, includes results through the end of September 30 and none from October 1 itself.
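The date rules above can be pictured with a short Python sketch. It is only an illustration of the described behavior, not the system's actual code: the function name split_pre_post and the sample results are hypothetical, timestamps are assumed to be already expressed in Central Time, and a result stamped exactly at 12:00 a.m. on the Date From is assumed to fall in the posttest group.

```python
from datetime import date, datetime, time


def split_pre_post(results, date_from, date_to):
    """Split (timestamp, score) pairs into pretest and posttest scores.

    Assumptions: timestamps are naive datetimes in Central Time, and each
    date is interpreted as 12:00 a.m. at the start of that day.
    """
    start = datetime.combine(date_from, time.min)  # 12:00 a.m. on Date From
    end = datetime.combine(date_to, time.min)      # 12:00 a.m. on Date To
    pretest = [score for stamp, score in results if stamp < start]
    posttest = [score for stamp, score in results if start <= stamp < end]
    return pretest, posttest


# Hypothetical results around the example dates used above.
results = [
    (datetime(2006, 9, 30, 23, 59), 61),   # before Date From: pretest
    (datetime(2006, 10, 1, 0, 0), 70),     # exactly at Date From: posttest
    (datetime(2006, 12, 31, 23, 59), 75),  # before Date To: posttest
    (datetime(2007, 1, 1, 0, 0), 80),      # at Date To: excluded
]

pretest, posttest = split_pre_post(results, date(2006, 10, 1), date(2007, 1, 1))
print(pretest, posttest)  # prints [61] [70, 75]
```

Note how the result recorded on January 1, 2007, is left out, which is why a Date To one day later is needed to include that whole day.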

After you've completed the date fields, select the topic for which you'd like to view results. Here's how:

You'll see two groups, the Pretest Group and the Posttest Group, along with the Effect Size, which is displayed just to the right of the topic you selected.

Viewing Group Comparison Reports

The second type of Effect Size reporting contained in StatPak is the Group Comparison Effect Size (World). At present, this report compares scores from your program with all scores in the Challenger system for a given topic during the date range specified. Its use is very similar to the report discussed above. Open the main reports page and select Effect Size, then choose 2) Group Comparison (World).

In this case, the Date From and Date To fields actually specify a range from which the data is taken.  Enter a date range in the Date From and Date To fields.

Then choose a course, a section, and a topic by clicking on the displayed links. The results are displayed for you on the page; you'll see data for your program ("Program Specific") and for the world of Challenger customers ("Generic Data"). As above, the Effect Size is displayed to the right of the topic selected.
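If you only have the summary figures for each group (a mean and a variance), the simplified formula given earlier can still be applied by hand. The Python sketch below uses invented numbers and a hypothetical function name, and it assumes the comparison is "your program minus the world," so that a positive value means your program scored higher; the direction the report itself uses is not stated here.

```python
from math import sqrt


def effect_size_from_summary(world_mean, world_var, program_mean, program_var):
    """Apply (M2 - M1) / sqrt((V2 + V1) / 2) to summary statistics.

    Assumption: the world ("Generic Data") is group 1 and your program
    ("Program Specific") is group 2.
    """
    return (program_mean - world_mean) / sqrt((program_var + world_var) / 2)


# Invented summary values for one topic.
es = effect_size_from_summary(world_mean=70, world_var=100,
                              program_mean=74, program_var=64)
print(round(es, 2))  # prints 0.44
```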

Other Data Included in Effect Size Reports

Both Effect Size Reports also include:

These values can help you determine whether or not the Effect Size number is useful; for example, comparing two radically different sample sizes will provide a less reliable result than comparing two similar ones. A group of scores with a wider variance is a less reliable indicator of performance than a group with a smaller variance.

***

For further discussion of the effect size methodology, you can e-mail or call the following individuals:

For theory and method questions:

Bob Sweeney, PhD, MS
Chief Executive Officer
bobs@chall.com
901-762-8425

For technical questions related to using Challenger StatPak™ or your reporting system:

Dennis Plafcan
Director of Technology
dplafcan@chall.com
901-762-8437