When teachers and curriculum writers are developing tests to give to students, they can’t just write a test and assume it will measure the standards they wish to assess. You might be surprised to learn that curriculum writers have to test the validity of their assessments. They do this by running statistical analyses of individual test questions, a process called item analysis. Two statistical methods used in item analysis are the item difficulty index and the item discrimination index. The item difficulty index measures how easy a question is by determining the proportion of students who got it right. The item discrimination index measures how well a test question can help examiners differentiate between test takers who have attained mastery of the material and those who have not.
Determining the validity of individual question items on a test is vital, especially for standardized tests, the results of which can dictate the amount of money a school will receive in federal funding or whether students will be accepted to college. If tests do not measure the standards they intend to measure, then their results are not valid, and they cannot be used to judge schools, teachers or individual students. Essentially, a standardized test that has not been statistically analyzed before it is given is useless.
What Is the Item Difficulty Index?
The item difficulty index is a common and useful tool in item analysis, especially when it comes to determining the validity of test questions in an educational setting. The item difficulty index is often called the p-value because it is a measure of proportion: for example, the proportion of students who answer a particular question correctly on a test. This p-value is not the same thing as the p-value used in statistical significance testing. P-values are found by using the difficulty index formula, and they range from 0.0 to 1.0. In the scenario with students answering questions on a test, higher p-values, or p-values closer to 1.0, correspond with a greater proportion of students answering that question correctly. In other words, easier test questions will have greater p-values. That is why some statisticians also call the difficulty index “the easiness index” when they are performing an item analysis on data sets that have to do with education.
Different types of tests aim for different levels of easiness. Norm-referenced tests, for example, will have questions with varying levels of easiness because they are trying to create a wider spread in scores and categorize test takers into norms. Criterion-referenced tests, on the other hand, are trying to measure mastery. It is possible for criterion-referenced tests to have many questions with p-values close to 1.0.
Using the Difficulty Index Formula
The difficulty index formula is fairly easy to remember because it is the same as determining the percentage of students who answered the question correctly. The only difference is that the p-value is left as a decimal and is not converted to a percentage value out of 100.
The formula looks like this: the number of students who answer a question correctly (c) divided by the total number of students in the class who answered the question (s). The answer will equal a value between 0.0 and 1.0, with harder questions resulting in values closer to 0.0 and easier questions resulting in values closer to 1.0.
The formula: c ÷ s = p
Example: Out of the 20 students who answered question five, only four answered correctly.
Calculation: 4 ÷ 20 = 0.2
Because the resulting p-value is closer to 0.0, we know that this is a difficult question.
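The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name difficulty_index and its argument names are made up for this example, and the input checks are an added assumption rather than part of the formula itself.

```python
def difficulty_index(correct, answered):
    """Return p = c / s, the proportion of students who answered correctly.

    correct  -- number of students who answered the question correctly (c)
    answered -- total number of students who answered the question (s)
    """
    # Assumed sanity checks; the formula itself is just c / s.
    if answered <= 0:
        raise ValueError("at least one student must have answered the question")
    if not 0 <= correct <= answered:
        raise ValueError("correct must be between 0 and answered")
    return correct / answered


# Question five from the example: 4 of 20 students answered correctly.
p = difficulty_index(4, 20)
print(p)  # 0.2
```

A value this close to 0.0 flags the question as difficult, matching the worked example above.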
What Is the Item Discrimination Index?
The discrimination index is another way that test writers can evaluate the validity of their tests. Item discrimination evaluates how well an individual question sorts students who have mastered the material from students who have not. Test takers with mastery of the material should be more likely to answer a question correctly, whereas students without mastery of the material should be more likely to get the question wrong. Questions that do a good job of sorting those students who have mastered the material from students who have not are called “highly discriminating.” Such questions can be very beneficial, especially in areas where mastery is key, such as in medical certification.
There are several different formulas that calculate item discrimination, but the one that is most commonly used is called the point-biserial correlation, which compares a test taker’s score on an individual item with their score on the test overall. For highly discriminating questions, students who answer correctly are those who have done well on the rest of the test. The opposite is also true. Students who answer highly discriminating questions incorrectly tend to do poorly on the rest of the test as well.
Item discrimination is measured in a range between -1.0 and 1.0. Negative discrimination indicates that students who are scoring highly on the rest of the test are answering that question wrong. This could mean that there is a problem with the question, such as bias or even a typo in the answer key. Test writers should reevaluate questions that result in negative discrimination because they do not help to show mastery.
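For readers who want to see the point-biserial correlation in action, here is a minimal Python sketch. It assumes the point-biserial is computed as an ordinary Pearson correlation between each student's 0/1 score on the item and their total test score. The function name point_biserial and the sample data are hypothetical, and some analysts first remove the item from the total score, which this sketch does not do.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least two students")

    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n

    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_item) * (y - mean_total)
              for x, y in zip(item_scores, total_scores))
    var_item = sum((x - mean_item) ** 2 for x in item_scores)
    var_total = sum((y - mean_total) ** 2 for y in total_scores)

    if var_item == 0 or var_total == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / math.sqrt(var_item * var_total)


# Hypothetical class of six: the three highest scorers got the item right.
item = [1, 1, 1, 0, 0, 0]
totals = [90, 85, 80, 60, 55, 50]
r = point_biserial(item, totals)
print(round(r, 2))
```

With this made-up data the result is strongly positive. Reversing the item column, so that only the low scorers answer correctly, produces a negative value of the same size, which is the warning sign described above.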
How to Find the Item Discrimination Index
Determining item discrimination is more complicated and involves more steps than finding an item’s difficulty. The method described here is a simpler alternative to the point-biserial correlation: it compares the higher-scoring half of the class with the lower-scoring half. First, create a table of your students along with their test scores. In a third column, indicate whether the student answered the question you are measuring correctly by placing a 1 (for correct answers) or a 0 (for incorrect answers) in the corresponding box.
Now, arrange your students from highest scorers to lowest scorers, with the highest scorers at the top. Divide the table in half between high and low scorers, with an equal number of students on each side of the dividing line. Subtract the number of students in the lower-scoring group who answered the question correctly (lc) from the number of students in the higher-scoring group who answered the question correctly (hc). Then, divide the resulting number by the number of students on each side of your dividing line (t), which should be half of the class.
Item discrimination = (hc – lc) ÷ t
Example: You have a class of 20 students, so after you arrange them by score in a table, you should have 10 on each side of the dividing line. If six students in the higher-scoring group answered the question correctly, and six students in the lower-scoring group also answered the question correctly, you should already know without doing the math that the item is not very discriminating. However, we can still confirm this with the formula.
The calculation for this problem looks like this:
(6 – 6) ÷ 10 = 0
This item is not a good measure of mastery because its discrimination index is zero.
If, however, six students in the higher-scoring group answer correctly, and only two students in the lower-scoring group answer correctly, the item is a much better measure of mastery.
The new calculation looks like this:
(6 – 2) ÷ 10 = 0.4
Although the number could be higher, this question would still be a decent indicator of whether or not the student understood the material.
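The table-based steps above can also be sketched in Python. This is one possible implementation, not a standard one: the function name discrimination_index and the sample scores are invented for illustration, and the sketch assumes an even number of students so the class splits cleanly in half. Students with tied scores keep their original order, which is an arbitrary choice.

```python
def discrimination_index(scores, item_correct):
    """Return (hc - lc) / t using a top-half versus bottom-half split.

    scores       -- each student's total test score
    item_correct -- 1 if that student answered the item correctly, else 0
    """
    if len(scores) != len(item_correct):
        raise ValueError("scores and item_correct must be the same length")
    if len(scores) < 2 or len(scores) % 2 != 0:
        raise ValueError("this sketch assumes an even number of students")

    # Arrange students from highest to lowest total score.
    ranked = sorted(zip(scores, item_correct),
                    key=lambda pair: pair[0], reverse=True)

    t = len(ranked) // 2                       # students on each side of the line
    hc = sum(flag for _, flag in ranked[:t])   # correct answers, higher group
    lc = sum(flag for _, flag in ranked[t:])   # correct answers, lower group
    return (hc - lc) / t


# Hypothetical class of 20 with total scores from 100 down to 81.
scores = list(range(100, 80, -1))

# Six correct in the higher group, two correct in the lower group.
item_correct = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
print(discrimination_index(scores, item_correct))  # 0.4

# Six correct in each group: the item does not discriminate at all.
no_difference = [1] * 6 + [0] * 4 + [1] * 6 + [0] * 4
print(discrimination_index(scores, no_difference))  # 0.0
```

Both results match the two worked examples: 0.4 for the item that favors the higher scorers and 0 for the item that both groups answer equally well.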