
Written and designed by the staff of the Center for Teaching and Learning. Reproduce with permission only.
As the title of this section implies, testing is only part of the evaluation of learning. Every time you ask a question in class, monitor a student discussion, or read a term paper, you are evaluating learning. Moreover, the evaluation process (whether it involves examinations or not) is a valuable part of the teaching process. The primary purpose of evaluation is to provide corrective feedback to the student; the secondary purpose is to satisfy the administrative requirement of ranking students on a grading scale. Paper-and-pencil tests are simply a convenient method to fulfill both functions.
From the standpoint of measurement, tests fall into two general categories: those in which students select the correct response from information provided on the test and those in which students must supply the answers themselves. True-false, multiple-choice, and matching tests are in the former category; short-answer and essay tests are in the latter. The cognitive capabilities required to answer supply items are different from those required by select items, irrespective of content. Owing to limitations of space, we cannot provide an exhaustive explanation of the types of tests and rules for writing them, but we will offer a few guidelines for each type and focus primarily on the two most widely used types of exams: multiple-choice and essay.
The selection of material to be tested should be based on learning objectives for the course, but the complexity of the course material associated with those objectives (and the limited time for taking exams) means that you can only sample the material in any given unit or course. The chart in Figure 5 is a table of specifications or blueprint for a test. The important concepts in the unit to be tested are listed down the left side of the page. The vertical columns represent three levels of learning: "knowledge," "application," and "evaluation." These categories parallel the cognitive levels in the taxonomy provided in Figure 2 (in the section on course design) and the three levels of questions illustrated in Figure 3 (in the section on discussion-leading). Other categories may be used, but you should have a clear idea of the level of learning each one represents. As you write test questions, decide which level they fit and enter their numbers in the appropriate cells in the matrix. This simple method allows you to check if you are testing the levels of learning you want to test. For example, if you find large numbers of questions falling into the knowledge column, it will be instantly apparent. Note that some of the cells in the matrix may be blank and some will contain larger numbers of questions than others. These frequencies should reflect the emphasis placed on these concepts when they were taught.
| Content Categories | Objectives: Knowledge | Objectives: Application | Objectives: Evaluation |
|---|---|---|---|
| A. Identity crisis vs. role confusion; achievement motivation. | 2,9 | 4,21,33 | 16 |
| B. Adolescent sexual behavior; transition of puberty. | 5,8 | 1,13,26 | 11 |
| C. Social isolation and self esteem; person perception. | 14, 6 | 3, 20 | 25 |
| D. Egocentrism; adolescent idealism. | 7, 29 | 12, 31 | 10, 15, 27 |
| E. Law and maintenance of the social order. | 17 | 22 | 18 |
| F. Authoritarian bias: moral development. | 19 | 30 | 24 |
| G. Universal ethical principle orientation. | 28 | 23 | 32 |
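The check described above (seeing at a glance how many questions fall at each level) can also be done with a short script. The following is only a sketch in Python: the question numbers are copied from the blueprint in Figure 5, while the names `BLUEPRINT`, `questions_per_level`, and `questions_per_category` are hypothetical and chosen for illustration.

```python
# Question numbers per content category and level, copied from Figure 5.
# The dictionary layout and all names here are illustrative assumptions.
BLUEPRINT = {
    "A": {"knowledge": [2, 9], "application": [4, 21, 33], "evaluation": [16]},
    "B": {"knowledge": [5, 8], "application": [1, 13, 26], "evaluation": [11]},
    "C": {"knowledge": [14, 6], "application": [3, 20], "evaluation": [25]},
    "D": {"knowledge": [7, 29], "application": [12, 31], "evaluation": [10, 15, 27]},
    "E": {"knowledge": [17], "application": [22], "evaluation": [18]},
    "F": {"knowledge": [19], "application": [30], "evaluation": [24]},
    "G": {"knowledge": [28], "application": [23], "evaluation": [32]},
}


def questions_per_level(blueprint):
    """Count the questions in each level (each column of the blueprint)."""
    totals = {}
    for levels in blueprint.values():
        for level, numbers in levels.items():
            totals[level] = totals.get(level, 0) + len(numbers)
    return totals


def questions_per_category(blueprint):
    """Count the questions in each content category (each row of the blueprint)."""
    return {
        category: sum(len(numbers) for numbers in levels.values())
        for category, levels in blueprint.items()
    }


level_totals = questions_per_level(BLUEPRINT)
category_totals = questions_per_category(BLUEPRINT)
print(level_totals)
print(category_totals)
```

Comparing the column totals with the emphasis you intended shows quickly whether too many questions have drifted into the knowledge column.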
All tests should have complete, clearly written instructions, time limits for each section, and point values assigned to different questions or groups of questions. The question sheets should be clearly typed and duplicated so that students have no difficulty reading them. If your tests require optical scanning sheets, make sure students know where to buy them and how to use them (use #2 pencils, erase completely, avoid folding or crumpling).
When grading exams, strive for fairness and impartiality by keeping the identity of each student secret from yourself until you have finished the entire set of tests. Some teachers ask students to use their social security numbers (or some other code) instead of their names on exams to ensure anonymity.
Some additional issues arise in testing in math and the natural sciences, since students are required to work problems on their exams. Answers may be correct yet differ in accuracy and completeness, so the type of answer and the degree of precision you expect must be clearly specified. You must also decide how much work the student will be required to show and how partial credit will be allocated for incomplete answers.
Keep in mind that the basic purpose of a test is to measure student performance, and the best teachers constantly work to refine their testing techniques and procedures. Poor techniques may result in tests that only measure the ability to take a test--test-wise students will perform well whether or not they know the material.
The two most important characteristics of a test are its content validity and reliability. A test's validity is determined by how well it samples the range of knowledge, skills and abilities that students were supposed to acquire in the period covered by the exam. Reliability is determined by how consistently the test can be graded and how well the test discriminates between students of differing performance levels. Well-designed multiple-choice tests are generally more valid and reliable than essay tests because (a) they sample material more broadly (since you can ask so many more questions); (b) discrimination between performance levels is easier to determine; and (c) scoring consistency is virtually guaranteed. On the other hand, essay questions can test the upper levels of cognition (analysis, synthesis, evaluation) more easily than multiple-choice questions.
Writing good exam questions requires plenty of time for composition, review, and revision. If you jot down a few questions after class each day when the material is fresh in your mind, the exam is more likely to reflect your teaching emphases than if you wait to write them all later. Also, it is beneficial to ask a colleague to review the questions before you give the exam--another teacher might identify potential problems of interpretation or spot confusing language. The process of test development does not end when the students take the exam; careful analysis of the results will help refine your questions and sharpen your testing technique.
The major weakness of multiple-choice tests is that teachers may develop questions that require only recognition or recall of information. Multiple-choice questions in teachers' manuals that accompany textbooks often test only recognition and recall. Strive for questions that require application of knowledge rather than recall. For example, interpretation of data presented in charts, graphs, maps, or other formats can form the basis for higher-level multiple-choice questions.
Writing the Stem
The "stem" of the item, which poses a problem or states a question, should be written first. The basic rule for stem-writing is that students should be able to understand the question without reading it several times or having to read all the options.
Writing Response Options
Multiple-choice questions normally have four or five options, to make it difficult for students to guess the correct answer. Only one option should be unequivocally correct; "distractors" should be unequivocally wrong. If you write items in which more than one answer is correct and the student must pick out all the correct responses, each item is essentially a set of true-false questions, with their attendant problems. The basic rules for writing responses are: (a) students should be able to select the right response without having to sort out complexities that have nothing to do with knowing the correct answer, and (b) students should not be able to guess the correct answer from the way the responses are written.
All questions in a multiple-choice test should stand on their own, so avoid using questions that depend on knowing the answers to other questions on the test. Also, check to see if information given in some items provides clues to the answers of others. Randomly assign the position of the correct response--some teachers use "c" as the correct option 80% of the time, and students quickly recognize the pattern. Finally, never use trick questions--they have no legitimate testing function.
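One way to assign the position of the correct response at random is to let a script shuffle the options for each item. The sketch below is in Python and uses only the standard `random` module; the function name `shuffle_options` and the placeholder option texts are hypothetical, and it assumes that the correct answer's text differs from every distractor.

```python
import random


def shuffle_options(correct, distractors, rng):
    """Return the shuffled options and the position of the correct answer.

    Assumes the text of the correct answer differs from every distractor.
    """
    options = [correct] + list(distractors)
    rng.shuffle(options)  # in-place shuffle; every ordering is equally likely
    return options, options.index(correct)


# A fixed seed is used here only so that the example is repeatable.
rng = random.Random(2001)
options, answer_index = shuffle_options(
    "correct answer",
    ["distractor 1", "distractor 2", "distractor 3"],
    rng,
)
answer_letter = "abcd"[answer_index]

for letter, text in zip("abcd", options):
    print(f"{letter}. {text}")
print("Key:", answer_letter)
```

Recording the key at the same moment the options are shuffled avoids any mismatch between the printed test and the answer sheet.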
After a test has been given, it is important to perform a test-item analysis to improve its validity and reliability. Most machine-scored test printouts, including those at UNC, include statistics for item difficulty, item discrimination, and frequency of response. Figure 6, based on a printout for a test in Business Law, illustrates the way test statistics are usually presented. (For information about the test scoring and analysis service in Academic Affairs, contact User Services, Office of Information Technology.)
The difficulty index is simply the percentage of students who answered the question correctly. In a four-option question, the chance of guessing correctly is 25%, so it is wise to re-write any item that falls below 30%. Testing authorities suggest that you strive for items that yield a wide range of difficulty levels. In Figure 6, the difficulty index ranges from .3353 (33.53%) to .9281 (92.81%). The difficulty index is handy for checking items that you expect to be particularly difficult or easy. Results that vary widely from your expectations may require rewriting the questions or changing the way you teach the material.
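If you have item-level results but no printout, the difficulty index is easy to compute yourself. The following Python sketch assumes each item's results are recorded as 1 (correct) or 0 (incorrect) per student; the scores shown are invented for illustration, and the names `difficulty_index` and `needs_rewrite` are hypothetical. The 30% cutoff is the one suggested above for four-option questions.

```python
def difficulty_index(item_scores):
    """Proportion of students who answered the item correctly.

    item_scores holds one entry per student: 1 = correct, 0 = incorrect.
    """
    return sum(item_scores) / len(item_scores)


def needs_rewrite(index, threshold=0.30):
    """Flag an item whose difficulty index falls below the threshold."""
    return index < threshold


# Invented results for ten students on two items (illustration only).
scores = {
    "item 1": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],  # 7 of 10 correct
    "item 2": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],  # 2 of 10 correct
}

indices = {item: difficulty_index(s) for item, s in scores.items()}
flagged = [item for item, p in indices.items() if needs_rewrite(p)]

for item, p in indices.items():
    print(f"{item}: {p:.4f}")
print("Rewrite:", flagged)
```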
Item discrimination on the UNC printout is found under "point biserial" and "upper-lower disc. index." Both calculations are based on procedures that divide class scores into upper and lower portions and compare their performance on each question. For an item to discriminate well, most of the upper group should get it right and most of the lower group should miss it. The point biserial statistic is the correlation between the total correct score on the item and the total correct by the upper portion of the class, so the higher the number, the better the discrimination. In Figure 6, items 6, 9, and 23 have much lower correlations than the rest and should be examined for evidence of poor construction.
The upper-lower discrimination index is calculated by subtracting the proportion of students below the 27th percentile who answered the item correctly, from the proportion above the 73rd percentile who got it right. If all in the upper group are correct and all in the lower group are incorrect, the index would be +1.0000. In Figure 6, items 6, 9, 10, 15, 19, and 23 fall below .2000 and are therefore suspect.
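The subtraction described above can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the scoring service's actual procedure: it simply takes the top and bottom 27% of students ranked by total score (rounded to a whole number of students) and does nothing special about tied scores. The function name `upper_lower_index` and all of the data are hypothetical.

```python
def upper_lower_index(total_scores, item_correct, fraction=0.27):
    """Proportion correct in the upper group minus proportion correct in the lower group.

    total_scores: each student's total test score.
    item_correct: 1 or 0 for each student on the item, in the same order.
    Assumption: the groups are the top and bottom `fraction` of students by
    total score, rounded to a whole number; ties are not treated specially.
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Student positions ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower


# Invented data for ten students (illustration only).
totals = [12, 25, 18, 30, 9, 22, 27, 15, 20, 28]
good_item = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1]  # upper group all right, lower group all wrong
weak_item = [1, 0, 1, 1, 1, 1, 0, 0, 1, 1]  # both groups do equally well

good_index = upper_lower_index(totals, good_item)
weak_index = upper_lower_index(totals, weak_item)
print(f"{good_index:+.4f}")
print(f"{weak_index:+.4f}")
```

An item like the second one, whose index falls below .2000, would be suspect by the rule of thumb given above.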
Another page of the printout (not shown in Figure 6) will contain frequency of responses and proportion of responses for each alternative on each question. By examining the response figures for incorrect alternatives, you can determine if these choices were equally attractive to students who got the item wrong. Ideally, each incorrect option should be chosen by an equal number of students, and if no one chooses a particular distractor, rewrite it before using the question again. If many students chose an incorrect option, it would be wise to find out the reason.
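The same examination of incorrect alternatives can be done by hand or with a short script. The Python sketch below tallies how often each distractor was chosen on one item and lists any distractor that no one selected; the responses are invented, and the names `distractor_counts` and `unused` are hypothetical.

```python
from collections import Counter


def distractor_counts(responses, options, correct):
    """Count how many students chose each incorrect option."""
    counts = Counter(responses)
    # A Counter returns 0 for options that nobody chose.
    return {option: counts[option] for option in options if option != correct}


# Invented responses from twelve students to one item keyed "c" (illustration only).
responses = ["c", "a", "c", "b", "c", "c", "a", "c", "b", "c", "a", "c"]

counts = distractor_counts(responses, options="abcd", correct="c")
unused = [option for option, n in counts.items() if n == 0]

print(counts)
print("Never chosen:", unused)
```

In this invented case option "d" attracted no one, so it is the distractor to rewrite before the question is used again.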
| Item | Weight | Difficulty Index | Point Biserial | Upper-Lower Disc. Index |
|---|---|---|---|---|
| 1 | 1 | 0.6407 | 0.4078 | 0.5031 |
| 2 | 1 | 0.8263 | 0.4074 | 0.3725 |
| 3 | 1 | 0.3353 | 0.2988 | 0.3784 |
| 4 | 1 | 0.5030 | 0.3146 | 0.4097 |
| 5 | 1 | 0.8563 | 0.2982 | 0.2303 |
| 6 | 1 | 0.6627 | 0.1809 | 0.1794 |
| 7 | 1 | 0.8982 | 0.4243 | 0.2549 |
| 8 | 1 | 0.8253 | 0.3813 | 0.3709 |
| 9 | 1 | 0.6325 | 0.0561 | 0.0338 |
| 10 | 1 | 0.7904 | 0.2449 | 0.1648 |
| 11 | 1 | 0.8563 | 0.4997 | 0.4118 |
| 12 | 1 | 0.8383 | 0.3329 | 0.2286 |
| 13 | 1 | 0.8434 | 0.4213 | 0.3333 |
| 14 | 1 | 0.7485 | 0.4831 | 0.4460 |
| 15 | 1 | 0.9281 | 0.3010 | 0.1552 |
| 16 | 1 | 0.7485 | 0.2785 | 0.2612 |
| 17 | 1 | 0.6325 | 0.4413 | 0.5440 |
| 18 | 1 | 0.5663 | 0.3885 | 0.5357 |
| 19 | 1 | 0.8675 | 0.2695 | 0.1944 |
| 20 | 1 | 0.7831 | 0.4561 | 0.4493 |
| 21 | 1 | 0.6766 | 0.5350 | 0.5816 |
| 22 | 1 | 0.7485 | 0.2884 | 0.3872 |
| 23 | 1 | 0.5090 | 0.1200 | 0.1285 |
| 24 | 1 | 0.7425 | 0.5107 | 0.5670 |
| 25 | 1 | 0.7186 | 0.3505 | 0.3642 |
Matching questions are a type of multiple-choice question, and the same principles apply to writing them. It is extremely difficult to write matching items that test higher-order learning. The connections that students make between two concepts may reflect only a barely understood association rather than a full appreciation of the relationship.
In matching items, the student is presented with two related lists of words or phrases and must match those in one column with those in a longer column of alternative responses. Obviously, one should use only homogeneous words and phrases in a given set of items to reduce the possibility of guessing the correct answers through elimination. For example, a list which includes names, dates, and terms is obviously easier to match than one containing only names. Arrange the lists in alphabetical, chronological, or some other order. Keep the lists short (ten to twelve items) and type them on the same page of the exam.
Completion questions, short-answer questions, and essays form a continuum of questions that require students to supply the correct answers. Completion questions are an alternative to selection items for testing recall, but they cannot test higher-order learning. In writing completion items, give the student sufficient information to answer the question but not enough to give the answer away. For example, articles (a, an, the) and specific antecedents often provide clues. Blanks should occur at the end of the statement, and required responses should be short. Sometimes multiple-choice questions can be converted to completion items, a feature that can be useful in creating subsequent tests on the same material.
Short-answer items can take a variety of forms: definitions, descriptions, short essays, or mixtures of the three. Because of this flexibility, they can measure some elements of higher-order learning. Specific instructions are the key to successful short-answer questions. Questions that require students to generate their own response need clear, unambiguous directions for the expected answer. For example, if you ask for a definition, outline the expected length of the response and the specific elements you require in a complete definition. In this case, you might limit the response to "two sentences which contain a description of the term's literal meaning and its application to the course." On a typed exam, leaving only enough space for the desired length of response may help, but unless the instructions are specific, students may cram whole paragraphs of tiny writing into the space. Short essays can require students to apply their knowledge to a specific situation carefully delimited by instructions. This type of question is the equivalent of a math or physics problem.
The requirement of specificity is not only for the students' benefit in answering the questions, but also to make the answers easier to grade. With the directions, list the number of points each question is worth; for longer questions worth more points, the value of each section should be clear.
Many teachers consider essay questions the ideal form of testing, since essays seem to require more effort from the student than other types of questions. Students cannot answer an essay question by simply recognizing the correct answer, nor can they study for an essay exam by memorizing factual material. Essay questions can test complex thought processes, critical thinking, and problem-solving skills, and essays require students to use the English language to communicate in sentences and paragraphs--a skill that undergraduates need to exercise more frequently. But essay questions which require no more than a regurgitation of facts do not measure higher-order learning. Essay exams also place limitations on the amount of material that can be sampled in the test, a fact that may cause a student to complain (sometimes legitimately) that "I knew a lot more about the subject than the test showed," or "Your test didn't reflect the material we covered." For better sampling of the material, it is preferable to design tests that include several different kinds of questions: multiple-choice, short-answer, and essays of varying lengths. The following guidelines will help you avoid many of the drawbacks of essay questions. Although these guidelines are written from the perspective of the social sciences and humanities, most of these rules also apply to devising long problems in science courses.
Since one of the advantages of essay questions is their ability to test elements of higher-order learning, your first task is to define the type of learning you expect to measure. For example, do you expect students to be able to construct a reasoned argument from evidence, to analyze weaknesses in competing arguments, to select the best course of action in a new situation, or some combination of all these things? The best essay questions are based on the cognitive skills underlying the content rather than on the content alone.
If you wish to test problem-solving skills, the format and method for solving the problems must be clearly communicated to students. Presenting problems with no clues about how to proceed may cause students to adopt a plausible but incorrect approach, even if they knew how to solve the problem in the correct way. If you are interested in testing students' writing skills, you need to stipulate the kinds of skills that they must demonstrate and provide some test time for thinking and composing a well-crafted answer (otherwise, the effects of time pressure and test anxiety will usually result in poor writing).
Validity and Reliability of Essay Tests
It is helpful to distinguish between essay questions that require objectively verifiable answers and those that ask students to express their opinions, attitudes, or creativity. The latter are more difficult to construct and evaluate because it is more difficult to specify grading criteria (they therefore tend to be less valid measures of performance). Take-home tests and other out-of-class writing assignments may be more appropriate for demonstrating these kinds of skills.
Allowing students to select which essay questions to answer (e.g. "choose two out of five") is not a good practice. It is virtually impossible to compose five equivalent essay questions, and students will usually choose weaker questions and thereby reduce the validity of the exam. Some teachers follow this practice because students complain that their exams are too difficult. If their complaints are well-founded, the teacher would be wise to seek help in composing better questions rather than risk creating invalid exams.
The reliability of essay questions can be increased by paying close attention to the criteria for answers. Many teachers don't realize that it is necessary not only to compose a model answer, but also to provide students with instructions that will elicit the desired answer. First, write an outline of your best approximation of the correct answer, with all of its sections in place. Decide on the total number of points the essay will be worth and assign points to each section. When you have read over your answer several times and are satisfied that it will measure the appropriate course objective, write the instructions students will need to answer the question with the scope and direction you intend. Describe the expected length of the answer, its form and structure, and any special elements that should be present. Figure 7 is an example of an essay question from a mid-term test in Anthropology. This question exemplifies the guidelines for increasing the reliability of essay questions, and illustrates three levels of cognitive complexity: Part 1 is primarily recall of knowledge, Part 2 is application, and Part 3 is evaluation.
Lectures covering Piltdown Man, Gradualism, Punctuated Equilibrium, and Catastrophism were given sequentially to illustrate the interplay of theory and fact in the formulation of an anthropological account of the evolution of Humankind. Write a three-part essay addressing the following questions:
Good grading practices also increase the reliability of essay tests. Research has shown that the scoring of essays is usually unreliable; scores not only vary across different graders, they vary with the individual grader at different times. Graders can be influenced by extraneous factors such as handwriting, color of ink, and word spacing. If the grader knows the identity of the student, his/her overall impressions of that student's work will inevitably influence the scoring of the test.
Grading should be done anonymously. When grading essay questions, fold the blue books over so that the names are not visible (even better, ask students to use their social security numbers rather than their names). If there is more than one essay question on the test, grade each essay separately rather than grading a student's entire test at once. Otherwise, a brilliant performance on the first question may overshadow weaker answers in other questions (or vice-versa). It is also easier for the grader to keep in mind one answer key at a time. Shuffling the papers after grading each question will help compensate for the tendency to give later papers lower scores as you grow tired.
Before starting to grade a batch of tests, skim over several essays to determine if the model answer needs to be modified. If, through some quirk in wording, students misinterpret your intent, or if your standards are unrealistically high or low, you can alter the key in light of this information. The effects of an ambiguous lecture or other anomaly in teaching the material can also be a legitimate reason for altering the answer key. If these problems are not in evidence, and you have carefully constructed the model answer, students should not be able to surprise you with better answers than yours. However, you should be open to legitimate interpretations of the questions different from your own. Finally, unless you intend to grade grammar, syntax, spelling, and punctuation as part of the examination, try to overlook flaws in composition and focus instead on the accuracy and completeness of the answers.
It is important to write comments on test papers as you grade them, but comments do not have to be extensive to be effective. Point out specific elements of the answer that were omitted or incorrect and the number of points lost as a result. For example, you might assess penalties for incorrect statements, omission of relevant material, inclusion of irrelevant material, and errors in logic that lead to unsound conclusions. Students have a right to know the reasons for the grades they receive, and need specific guidance to improve their performance. Strive for a few analytical comments on the good and bad aspects of the essay rather than a detailed critique--writing too many comments tends to overwhelm students, and they may miss the main points of your critique.
Distributing your model answers with the corrected essays can alleviate some of the burden of writing comments on exams; this practice has several other benefits as well. Students tend to learn a little more when they compare their answers with the model, and they develop a clearer picture of why they received the grade they did, thereby reducing the number of requests that you re-grade their papers.

Last updated: January 30, 2001