
Cornerstones of Successful Test Development

by Raymond A. Talke, Jr.
President
Minds in Action, Inc.


            Organizations create and use testing for many purposes.  They might want to determine the effectiveness of a training class and evaluate how much content the students retained.  They may want to validate that an individual possesses the skills needed to be successful in a new job role.  They may want to validate that their representatives have the proper knowledge needed to effectively represent their business.  Testing has continued to permeate the corporate environment.

            Corporations need to ensure that their testing efforts are successful.  They have to validate that the tests they create and deliver provide accurate measurements of their intended purpose.  They must ensure that the tests provide reliable and consistent results.  Finally, they have to ensure that the delivery of testing does not increase their legal liability.  By following sound test design principles, organizations can help improve the validity and reliability of their tests and reduce their potential legal liability.

            Creating successful tests takes commitment and effort.  Yet, there are no mysteries to developing good tests.  There are six cornerstones that, when implemented consistently, can help enhance the validity, reliability and legal defensibility of your testing efforts.  Follow the principles of these cornerstones, and you will find that your tests will meet their intended purposes.

Cornerstone #1:  Determine the Purpose of the Test

            One of the most critical tasks in developing an effective and valid test instrument is determining the purpose of the test.  Although this task appears to be self-evident, tests are often developed without a clearly defined and communicated purpose.  Tests that are developed without a clear purpose cannot be used as a valid measurement.    Sometimes tests are defined for one purpose, but then used for another.  For example, I had the opportunity to develop an assessment test that was used to evaluate whether a Project Manager possessed the knowledge and skills necessary to enter an advanced project management class.  The assessment instrument was designed solely to evaluate the Project Manager’s mastery of class prerequisite concepts and skills.  Unfortunately, some individuals within the organization felt it would be useful to use the assessment instrument to evaluate the professional skill level of Project Managers.  Since the test was written to test only concepts and skills needed for the class rather than concepts and skills needed for professional success, the test was an invalid indicator of professional competence.  The use of the test to measure anything other than its defined purpose invalidates any meaningful interpretation of the results.

            In general, corporations create and administer three types of tests:

            Assessment tests are used to determine an individual’s retention of knowledge and skills.  Assessment tests are often used to determine the effectiveness of learning interventions by evaluating a student’s retention of the course materials.  Assessment tests are also often used to gauge overall knowledge and skill levels within an organization.  Generally, no benefit or sanction is accorded an individual based upon his or her assessment test scores.

            Qualification tests are designed to measure the mastery of a given knowledge and skills domain, and an individual is required to pass a qualification test before he or she is allowed to do something.  For example, an individual may be required to pass a course pre-test before he or she is permitted to attend a class.  Or an individual must pass a qualification test before being considered for a promotion.  Qualification tests are often given to business partners of corporations before they are allowed to carry an organization’s products.  Qualification tests simply measure the mastery of a given set of prerequisite knowledge and skills.

            Care must be taken to position qualification tests appropriately.  Although qualification tests are designed to assess the mastery of a given prerequisite knowledge and skills domain, passing the qualification test does not necessarily mean that an individual will master the job or activity for which the qualification test is a prerequisite.  The qualification test merely measures an individual’s knowledge and skills at the time the test is taken.  An individual’s performance on a qualification test should not be used as a predictor of future performance.  The results should be used only to evaluate the individual’s current knowledge and skill level.

            Because a qualification test is used to determine readiness to enter a given activity or profession, the test developer must be aware of the potential legal liability he or she assumes, and take the appropriate actions to lessen the potential liability.  Since a sanction is levied against anyone who fails the test (prohibition from entering the activity), care must be given to the selection of the knowledge and skills domain that is being tested.  All test items must measure knowledge and skills that are directly related to, and a requirement for, entering the activity or profession that is the subject of the test.  If any test item cannot be shown to be an essential requirement for the activity or profession, the test itself is an invalid qualification test.

            Certification tests have similar characteristics to qualification tests, but differ in that certification tests are designed to predict future performance in a given activity or profession.  By virtue of passing a certification test, the test sponsor asserts that the individual who passed the test is capable of performing a given activity or job safely and effectively. 

            The Bar exam for lawyers is an example of a certification test.  Medical Board exams are another.  In each case, an individual passing the test is certified by the sponsoring organization (in these examples the Bar and the Medical Board) as being capable of successfully performing the subject job. 

            As is the case with qualification tests, sponsors of certification tests assume a degree of legal liability.  In addition to the liability assumed by a qualification test, certification test sponsors also assume potential liability if the certified individual cannot perform the duties of that profession safely and effectively.  Therefore, it is imperative that a certification test thoroughly assesses all of the knowledge and skills critical to the profession and omits the evaluation of any knowledge and skills that are not critical.

            By determining and documenting the purpose of the test, the sponsor can take the appropriate actions to ensure that the test is developed properly.  By communicating the purpose, the sponsor can help prevent misuse of the results by individuals attempting to use the test for purposes other than those for which it was designed.

Cornerstone #2:  Ensure the Relevancy of the Test

            The second cornerstone to successful test development is ensuring that the test is relevant to its purpose.  In other words, the developer must make certain that the items on the test truly represent the knowledge and skills being measured.  For example, a qualification test for students entering a class must evaluate the mastery of knowledge and skills that will actually be used in the class.  If any test item assesses mastery of knowledge or skills that will not be used in the class, that test item is superfluous and an invalid indicator of the desired measurement.

            Relevancy can be established by creating a knowledge and skills domain that matches the knowledge and skills necessary to perform the evaluated job or activity.  This knowledge and skills domain is represented as a number of objectives, and test items are written to be congruent to specific objectives.  If the objectives are critical to the job or activity, and if the test items are congruent to the objectives, the test is likely to serve as a valid measurement for its purpose.

            The developer must ensure that the objectives that form the foundation of the test are relevant to the job or activity that is the focus of the test.  For qualification or certification tests designed to assess professional knowledge or skills, a job task analysis should be used to create the objectives of the test.  A job task analysis uses experts in a given job to identify the tasks that an individual in that job should be able to perform.  Once the tasks have been identified, the job task analysis identifies the knowledge and skills that an individual must possess to perform the identified tasks.  This knowledge and skills domain, coupled with specific proficiency requirements identified during the job task analysis, forms the basis of valid test objectives.  By performing a job task analysis for qualification and certification tests and ensuring that the test accurately evaluates the outputs of the job task analysis, the test developer may substantially lessen his or her legal liability and improve the validity of the test.

Cornerstone #3:  Evaluate the Right Characteristics in the Right Proportions

            Some components of a given job are more important than others.  For example, a driver of a motor vehicle should be able to both parallel park and stop the vehicle.  Both skills are hallmarks of a successful driver.  However, the importance of each of these skills differs.  The motor vehicle operator is called upon to stop the vehicle several times a day, while he or she may only be called upon to parallel park on an occasional basis.  In addition, a driver who cannot parallel park poses little risk to other individuals, while a driver who cannot stop a car is a menace on the road.  When evaluating these two skills, we find that stopping a car is performed more frequently and is more critical to good driving than parallel parking.  As a result, we can maintain that of the two skills, stopping a car is more important than parallel parking.  Based upon this finding, a test that evaluates driving ability should place more weight on successfully stopping a car than on successfully parallel parking.

            A valid test should assess the mastery of skills and knowledge based upon the relative importance of each of the knowledge and skill components.  Importance can be defined as the product of a task’s frequency of use and criticality.  Objectives relating to tasks that are critical to success or that are performed frequently should be better represented on a test than objectives that relate to less critical or less frequently performed tasks.

            A survey of subject matter experts can be used to determine the relative importance of each of the proposed objectives for the test.  Usually, the subject matter experts are asked to assign a value to each objective reflecting how frequently the knowledge or skill embedded in the objective is used on the job, and another value to reflect the criticality of the objective to the job.  The product of the average frequency and criticality values is calculated to arrive at the relative importance of each objective.  When blueprinting the test, an objective that is twice as important as another objective should be represented by twice the number of test items or by twice the scoring value.  This ensures that the test accurately represents the right characteristics of the job in the right proportions.
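            To make the arithmetic concrete, here is a minimal sketch in Python of the importance calculation and blueprint weighting.  The objective names, the survey averages (assumed to be on a 1-5 scale), and the test length are hypothetical:

    # Importance = average frequency rating x average criticality rating.
    # The objectives and survey values below are illustrative only.
    objectives = {
        # objective: (avg_frequency, avg_criticality) from the SME survey
        "Stop the vehicle": (5.0, 5.0),
        "Parallel park":    (2.0, 2.5),
        "Signal a turn":    (4.0, 4.0),
    }

    TOTAL_ITEMS = 40  # hypothetical test length

    importance = {name: freq * crit for name, (freq, crit) in objectives.items()}
    total = sum(importance.values())

    # Allocate test items in proportion to each objective's relative importance.
    for name, value in importance.items():
        items = round(TOTAL_ITEMS * value / total)
        print(f"{name}: importance {value:.1f}, {items} items")

An objective twice as important as another automatically receives twice the items (subject to rounding), keeping the blueprint aligned with the survey results.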

Cornerstone #4:  Evaluate the Test Items

            Before releasing the test, check and double check each of the test items.  Your evaluation of the test items should include:

Checking the congruence of each of the items.  Each test item should be written to an objective on the test blueprint.  Read each of the test items to ensure that the item actually measures the knowledge or skill embedded in the objective.  For example, you might find the following test objective:

Name the three ships Columbus used on his first voyage to the Western Hemisphere.

The following test item is submitted to you for this objective:

What type of ship was Columbus’ Santa Maria?

            A.  sloop
            B.  caravel
            C.  cargo ship
            D.  clipper ship

This item would not be congruent to the test objective because it asks about the characteristics of one of Columbus’ ships, not the names of his ships.  Since the test item is not congruent to the objective, the item must be rejected.

Validating the accuracy of each item.  Before including an item on the test, make sure that the stated correct answer is, in fact, correct.  For open-ended test items, such as “fill in the blank” tests, make sure that the answer key includes all possible variations of the correct answer.  This is particularly important if a machine grades the test.  The correct answer in the key should include all variations of the correct answer, including differences in capitalization and spelling (unless correct spelling is part of the test objective).
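            As a simple illustration, here is a minimal sketch in Python of machine grading for a fill-in-the-blank item.  The answer variations and normalization rules shown are assumptions for the example:

    # The key lists all accepted variations of the correct answer.
    ANSWER_KEY = {"nina", "niña", "pinta", "santa maria", "santa maría"}

    def is_correct(response: str) -> bool:
        # Ignore capitalization and surrounding whitespace so that
        # "Pinta", "pinta ", and "PINTA" all match the key.
        return response.strip().lower() in ANSWER_KEY

    print(is_correct("  Santa Maria "))  # True
    print(is_correct("Mayflower"))       # False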

If you are evaluating a multiple-choice item, ensure that the listed correct answer is always correct, and that the listed incorrect answers (foils or distracters) are always incorrect.  If an individual can maintain that a specific distracter can be correct under certain circumstances, the item is invalid and should be replaced.  For example, let’s examine the following multiple-choice test item:

Which planet in the solar system is farthest from the sun?  

A.  Pluto
B.  Jupiter
C.  Uranus
D.  Neptune

Although selection A (Pluto) is usually correct, Pluto’s eccentric orbit periodically carries it closer to the Sun than Neptune, leaving Neptune as the farthest planet for years at a time.  Although this does not happen often, the fact that D (Neptune) is also correct some of the time invalidates the entire test item.

            It is recommended that a formal technical review be held to validate the accuracy of each of the test items.  During the technical review, subject matter experts review and validate the accuracy of each test item.

Removing bias from the test items.  Each test item should be reviewed to ensure that there is no inherent bias to the test item.  Each test item should be free of any characteristics that would place any group of individuals at an advantage or disadvantage when compared to other groups.  The test should be free of any racial, cultural, religious, gender, or age bias. 

            You will rarely be confronted with a test item that exhibits symptoms of obvious bias.  However, you will often find test items that inadvertently bias the results in favor of one group or another.  Usually these inadvertently biased test items refer to examples or situations experienced by only one group, or use jargon and slang specific to a single region or group.  Let’s look at two inadvertently biased test items:

1.      When should you affix your John Hancock to a contract?  

2.      What is the rule of thumb for calculating the capacity of a single LAN server?  

Each of these test items displays characteristics of unintentional bias.  The first uses the slang phrase, John Hancock, instead of the word signature.  This would place an individual with knowledge of American history or slang at an advantage over someone without this knowledge.  Europeans, for example, are unlikely to use this phrase in their regular conversations.

            The second example uses a phrase that is not meaningful to the entire world.  The phrase, rule of thumb, is used primarily in English-speaking nations.  In addition, those who are aware of the origins of the phrase might find its use offensive.  As a result, this test item is biased in favor of individuals from English-speaking cultures.

            When evaluating an item for potential bias, look for certain characteristics that almost always indicate the potential for bias.  Jargon or slang phrases are often found in items that are biased.  References to events that are meaningful to only one generation can reflect inadvertent age bias.  For example, references to the Beatles or the Vietnam War might be more meaningful to the Baby Boom generation than to younger generations.  Sports analogies are almost always dangerous.  The use of baseball analogies might be meaningful to people in North America, but mystifying to people from the Middle East.  The use of historical or geographical references is also dangerous unless the test objective clearly requires knowledge of the listed historical or geographical concept.

Checking for cues to the answer.  There are those who maintain that “good test takers” have an advantage over average test takers.  This is because so-called “good test takers” are able to discern inadvertent hints that tip them off to the correct answer.  This is usually most apparent when multiple-choice tests are offered.

            When creating a test item, make sure that the test item conforms to standards established for the specific type of test item.  Ensure that the test item is grammatically correct.  For multiple-choice test items, make sure that the distracters are plausible to unqualified individuals and structurally similar to the correct answer.  By doing so, you can lessen the chance that the correct answer is cued to the experienced test taker.  This allows your test to measure the actual job abilities of test takers, not their ability to take tests.

            By evaluating each of the test items before the test is delivered, you will help ensure that the test as a whole serves as a valid indicator of the knowledge and skills it intends to measure.

Cornerstone #5:  Establish a Valid Passing Score

            One of the challenges to creating valid and reliable tests is establishing the proper passing score for the test.  In many instances, test developers arbitrarily choose a passing score based upon their previous experiences, usually their experiences with the public schools.  As a result, many tests are found with a passing score arbitrarily set at around 70%.

            Establishing an arbitrary passing score will immediately invalidate any reasonable interpretation of test results.  Test items vary in their difficulty, and it is difficult to create a large set of test items each with the same level of difficulty.  Some test items will be relatively easy, with a large proportion of test takers answering the item correctly.  Other items will be more challenging, and a smaller pool of test takers will answer these items correctly.

            In order to properly create a valid passing score for a test, the test developer must establish the characteristics of a minimally qualified individual taking the test.  In the case of a qualification or certification test, these characteristics will identify the minimum level of knowledge or skills required to be successful in a given job role or to complete a given activity.  The results of a job task analysis should be used to establish this standard.

            The concept of a minimally qualified individual makes many people uncomfortable.  They may assert that a test should identify above average performers, not those who are minimally qualified.  In fact, this assertion is incorrect.  A certification test, for example, is used to identify individuals who can perform a given job safely and effectively.  There may be varying degrees to which an individual may excel in a job, but there must be an absolute minimum set of standards that all job incumbents must meet or exceed.  Examine the medical profession.  In order to graduate from medical school and earn the right to use the M.D. initials, an individual must meet or exceed a certain set of standards.  While some people may graduate at the top of the class, there will always be someone who graduates at the bottom.  Yet all become doctors upon graduation.  This is because all of the graduates have met or exceeded the minimum standards of the profession.  Although some may have just barely met the standards, they possess the skills needed to function safely and effectively as medical doctors.  Those who were unable to meet the minimum standards were not permitted to graduate.

            When establishing a passing score for a test, the test developer must attempt to set a standard that constitutes the minimum set of qualifications for the skills and knowledge domain that forms the basis of the test.  If the passing score is set at the appropriate level, an ideal test will be one in which all of the qualified individuals pass the test, and all of the unqualified individuals fail.  Although there are several methods used to calculate a passing score, the two most commonly used are the contrasting groups approach and the modified Angoff method.

            The contrasting groups approach is typically used when you are able to identify a group of test takers as being qualified or unqualified individuals before they take the test.  In this approach, you have all of the individuals take the test, then separate the scores of the qualified individuals from the unqualified.  Usually, you will find that the distribution of the scores for each group resembles a bell curve when plotted, with the bell curve for the qualified individuals peaking at a higher score than the bell curve representing the unqualified test takers.  When the contrasting groups approach is used, the point at which the two bell curves intersect represents the suggested passing score.  This ensures that the vast majority of qualified individuals are likely to pass the test, while the vast majority of unqualified individuals will fail.

            As an example, suppose we graph the distributions of test scores for qualified and unqualified individuals.  The bell curve for the qualified individuals peaks at about 95%, the curve for the unqualified individuals peaks at about 86%, and the two curves intersect at 91%.  In this example, therefore, the contrasting groups approach would mandate a minimum passing score of 91%.

            This example demonstrates setting a passing score for a relatively easy test.  Note that if the passing score were arbitrarily set to 70%, virtually all test takers would pass the test, regardless of whether they were qualified or not.  If the passing score were arbitrarily set to a different value than the one indicated by the contrasting groups approach, the test results could not be used to discriminate between qualified and unqualified individuals.
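            Here is a minimal sketch in Python of the calculation, assuming each group’s scores are roughly normally distributed; the score lists are hypothetical:

    from math import exp, pi, sqrt
    import statistics

    qualified   = [92, 94, 95, 96, 97, 93, 95, 98]   # hypothetical scores
    unqualified = [84, 86, 85, 88, 87, 83, 86, 89]

    def normal_pdf(x, mu, sigma):
        # Height of the fitted bell curve at score x.
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    q_mu, q_sd = statistics.mean(qualified), statistics.stdev(qualified)
    u_mu, u_sd = statistics.mean(unqualified), statistics.stdev(unqualified)

    # Scan the scores between the two peaks for the point where the
    # curves cross; that crossing point is the suggested passing score.
    cutoff = min(
        (x / 10 for x in range(int(u_mu * 10), int(q_mu * 10))),
        key=lambda x: abs(normal_pdf(x, q_mu, q_sd) - normal_pdf(x, u_mu, u_sd)),
    )
    print(f"Suggested passing score: {cutoff:.1f}%")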

            There may be times when an initial pool of test takers is not available to be used to set passing scores.  The test may be brand new, and a pass or fail score must be immediately made available to all test takers.  You may not be able to identify qualified and unqualified control groups among the test takers.  If, for any reason, you are unable to use the contrasting groups approach, the modified Angoff method of establishing a passing score can be used.

            In the modified Angoff approach, a panel of experts familiar with the profession or activity that the test is designed to measure is convened.  After identifying the qualities that constitute a minimally qualified individual, the panel of experts analyzes each test item.  For each test item, the experts offer their opinion on the percentage of minimally qualified individuals that are likely to answer the test item correctly.  After all test items are analyzed, the percentage values assigned to all of the items are averaged, and the average result is used as the passing score.  When conducting a modified Angoff panel, care must be taken to ensure that the panelists keep in mind that they are assessing the likely results of minimally qualified, and not average or superior, performers.  There is often a tendency among panelists to set a higher passing score than is warranted if they stray from the concept of the minimally qualified individual.
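            The arithmetic is simply an average of averages, as this minimal Python sketch shows; the panelist estimates are hypothetical:

    from statistics import mean

    # ratings[i][j]: panelist j's estimate of the percentage of minimally
    # qualified test takers who would answer item i correctly (illustrative).
    ratings = [
        [90, 85, 80],   # item 1
        [70, 75, 65],   # item 2
        [60, 55, 65],   # item 3
    ]

    # Average across panelists for each item, then across items.
    item_estimates = [mean(item) for item in ratings]
    passing_score = mean(item_estimates)
    print(f"Suggested passing score: {passing_score:.1f}%")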

            By using a valid procedure to establish a passing score for the test, you are ensuring that the test will be a valid instrument for discriminating between those who are qualified and those who are not.  A formal method of determining a passing score allows items of varying difficulty to be included while still allowing the test to serve as a valid measurement device.  Setting an arbitrary passing score makes it less likely that the test scores can be used to assess the characteristics the test is intended to measure.

Cornerstone #6:  Perform Item Level Analysis

            After a number of individuals have taken the test, it is important to verify that all of the test items are validly measuring the intended knowledge and skill metrics.  This cornerstone involves performing a statistical or trend analysis on each of the test items to ensure that they are performing well.  The item level analysis generally examines three performance criteria for each test item:

            The first area considered when analyzing test items is the difficulty of the items.  Difficulty is measured by determining the proportion of test takers who answered each item correctly.  A higher percentage of correct responses indicates that an item is easier than one with a lower percentage of correct responses.  Since the purpose of any test is to discriminate between qualified and less than qualified individuals, we want the test to include items that are neither too easy nor too difficult.

            Test items with extremely high proportions of correct responses (95% - 100%) are generally undesirable, since they do little to discriminate between competent and less than competent individuals.  Conversely, items with extremely low proportions of correct responses (below 50%) are worthy of additional analysis.  Difficult items are not in themselves necessarily undesirable, but further analysis should be performed to validate that the item effectively discriminates between high and low performers, and that, in the case of multiple-choice tests, distracters are performing properly.  Test items that are found to be too difficult or too easy are candidates for replacement in the test.
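            A difficulty analysis reduces to computing this proportion for each item and flagging the outliers.  Here is a minimal sketch in Python with a hypothetical response matrix:

    # responses[i][j] is 1 if test taker j answered item i correctly, else 0.
    responses = [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # item 1: everyone correct
        [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # item 2
        [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],  # item 3: very hard
    ]

    for i, item in enumerate(responses, start=1):
        p = sum(item) / len(item)  # proportion answering correctly
        if p >= 0.95:
            flag = "too easy - candidate for replacement"
        elif p < 0.50:
            flag = "difficult - review discrimination and distracters"
        else:
            flag = "acceptable"
        print(f"Item {i}: p = {p:.2f} ({flag})")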

            The second area, discrimination analysis, indicates the reliability of individual test items to distinguish between competent and less than competent performers.  Generally, we would expect people who have a high score on the test as a whole to perform better on a given test item than people who have a low score for the entire test.  Reverse discrimination occurs when low overall performers perform better on a given test item than high overall performers.  This is undesirable, since it indicates that the test item was not effective in distinguishing the characteristics that differ between high performers and low performers.
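            One common way to quantify this, used here as an illustration rather than the only method, is an upper-lower discrimination index: compare how the top-scoring half of test takers performed on an item against the bottom-scoring half.  A minimal Python sketch with hypothetical data:

    # Each tuple pairs a test taker's total score with a 1/0 flag for
    # whether he or she answered this particular item correctly.
    takers = [(95, 1), (92, 1), (90, 1), (88, 0), (75, 0), (70, 1), (65, 0), (60, 0)]

    takers.sort(key=lambda t: t[0], reverse=True)
    half = len(takers) // 2
    upper, lower = takers[:half], takers[half:]

    p_upper = sum(correct for _, correct in upper) / len(upper)
    p_lower = sum(correct for _, correct in lower) / len(lower)
    discrimination = p_upper - p_lower  # negative = reverse discrimination

    print(f"Upper group: {p_upper:.2f}, lower group: {p_lower:.2f}")
    print(f"Discrimination index: {discrimination:+.2f}")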

            The third type of analysis performed on test items is a validity analysis, which identifies the frequency with which the various answer options were selected for a given test item.  This allows us to determine if the test item is accurately measuring the intended knowledge or skill.  If test takers, particularly high performers, often select a specific incorrect answer, the test item may be inaccurate or misleading.  This is usually an indicator that the test item should be removed from the test or replaced.  In the case of multiple-choice tests, we would expect all distracters (incorrect answers) on an individual test item to be selected at similar frequencies.  If one distracter is selected significantly more often than other distracters (or more often than the correct answer), this indicates that the distracter should be removed or replaced.  The distracter may, in fact, be a valid correct answer, or the test item may be inadvertently cueing the test taker to select the distracter.  In either event, the test item should be reviewed and revised in order to ensure a more even distribution of selection among the distracters.
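            A minimal sketch in Python of such a frequency count follows; the recorded answers and the flagging threshold are hypothetical:

    from collections import Counter

    CORRECT = "A"  # the keyed correct answer for this four-option item
    answers = ["A", "A", "B", "A", "D", "A", "B", "A", "B", "B", "A", "B"]

    counts = Counter(answers)
    distracter_total = sum(n for opt, n in counts.items() if opt != CORRECT)
    expected = distracter_total / 3  # three distracters on a four-option item

    for opt in "ABCD":
        n = counts.get(opt, 0)
        note = ""
        if opt != CORRECT and n > 2 * expected:
            note = "  <- chosen far more often than the other distracters; review item"
        print(f"Option {opt}: {n}{note}")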

            The item level analysis allows the test developer to maintain a valid and reliable test.  By removing or replacing problem items, the test developer ensures that the remaining test items accurately reflect the actual abilities of the test takers.  Superfluous or incorrect items that may adversely affect the validity of the test are easily identified when performing an item level analysis.

Conclusion

            The primary cornerstones of successful testing are not particularly difficult to implement.  However, test sponsors and developers must have the will and desire to consistently ensure that each of the cornerstones is incorporated into every test development effort.  By doing so, you will enhance the validity, reliability, and legal defensibility of your testing efforts and promote satisfaction among both test takers and those who rely upon the test’s results.

