RELIABILITY AND VALIDITY OF DATA COLLECTION PROCESSES
WHERE WE'VE BEEN
We've described how to identify research variables, use reference sources to obtain information about these variables, and devise operational definitions of them.
WHERE WE'RE GOING NOW
We're going to discuss how to make data collection processes reliable and valid - that is, how to make sure that our measurement processes do not generate evidence that is self-contradictory because of internal inconsistency or instability and make it more likely that they actually zero in on the outcome we want them to measure rather than on extraneous outcomes or no outcome at all.
CHAPTER PREVIEW
Once you have decided on the operational definitions of an outcome variable, you can collect data regarding the occurrence of that outcome. This chapter describes reliability and validity - two essential characteristics of all good data collection techniques.
The present chapter defines reliability in terms of how consistently a data collection process measures whatever it measures. This consistency concerns the level of agreement among independent tests, testing occasions, observers, or items that purport to be measuring the same outcome. The confidence we can place in judgments based on data collection processes will be greater to the extent that they are reliable. This chapter discusses ways to increase the prospect that you can make consistent decisions based on your tests and observations. We also introduce here the concept of validity of data collection processes - the extent to which a data collection process really measures what it is designed to measure. We will discuss the factors that influence validity of data collection processes and methods of establishing validity.
After reading this chapter, you should be able to
Reliability addresses the question of whether the results of measuring processes are consistent on occasions when they should be consistent. Consistent means what the dictionary says it means: "not self-contradictory." If a person possesses a certain degree of knowledge about a topic, for example, the estimate of knowledge that appears as a test score should not be contradicted by other administrations of the same or similar tests. The estimate should be approximately the same whether the test is taken today or tomorrow. If different tests are given to students in the first and third period, we should be able to assume that our judgments about the person's knowledge would have been about the same on either test; and we should be free from the impression that the score would have been substantially different if someone else had graded the test. A data collection process is less reliable if the results are influenced by irrelevant factors that cause our judgments to fluctuate when they should not fluctuate. Measurement is reliable to the extent that the results are similar every time they should be similar.
If a mother wants to measure the body temperature of a sick child, she will expect her assessment to be reliable. If she measures his temperature once and the temperature is 102.4, then tries again two minutes later and gets a reading of 99.9, she has an unreliable thermometer. (Of course, if she gives medication and then takes his temperature two hours later and discovers a large drop, this would have nothing to do with unreliability. The temperature would not be expected to be similar on the second occasion, since the medicine is likely to have had an effect.)
Reliability can be applied in the same way to instructional or research situations. If you ask a child a question and conclude on the basis of her response that she has achieved an educational outcome, you would hope that if you questioned her a few minutes later you would still come to the same conclusion. To the extent that the result is similar on repeated occasions, you are dealing with a reliable method of data collection. However, if you concluded the second time that she had not achieved the outcome, then you would be dealing with an unreliable data collection process. (If a week goes by and it is plausible that the child might have forgotten something during the intervening week, then a different result on the second occasion would have nothing to do with unreliability, any more than a reduction in temperature after a child has been cured would indicate unreliability of the thermometer.)
An important point to keep in mind is that it is the reliability of the data collection process - not of the data collection instrument - that must be demonstrated. What we are really looking for is consistency in the decisions we make based on the data collection process; we don't want to draw conclusions that would be likely to change if we took another estimate of the outcome variable. It is technically incorrect to refer to the reliability of a test. A test, a checklist, an interview schedule, or any other measurement device that is reliable in one setting or for one purpose may be unreliable in another setting or for a different purpose.
Therefore, this chapter always refers to the reliability of data collection processes. It is important to remember this distinction.
Examine each of the following descriptions and indicate whether it is reliable, is unreliable, or provides no information about reliability.
If you got most of the questions in Review Quiz 5.1 correct, or if you easily saw the logic of the explanations, then you probably have a good basic grasp of the concept of reliability. If not, reread the chapter to this point, check the chapter in the workbook, refer to the recommended readings, or ask your instructor or a peer for help. Be sure that you understand the summary in the following paragraph so that you will profit from the rest of this chapter.
In summary, reliability refers to whether a data collection process is consistent. Unreliability occurs when the data collection process contradicts itself: when observations, observers, items, or alternate forms of the same test give contradictory evidence.
Reliability is not an all-or-nothing characteristic; data collection processes range from strong reliability to weak reliability. Because of the highly internalized nature of educational outcomes, measurement processes in education can never be perfectly reliable. If the scores on a data collection process vary when they should not, then the test is less reliable. If a data collection process produces consistent, non-contradictory results over a span of time or in varying settings, then it is said to be reliable.
As you read the rest of this chapter, remember that reliability is not synonymous with reliability coefficients. The technical reliability coefficients are often irrelevant, unnecessary, or at least more trouble than they are worth in practical situations.
However, a major goal of professionals in education is to achieve respect and effectiveness by engaging in scholarly activity that is authentic, public, and replicable. Decisions or statements at any level are more worthy of respect and more likely to be fruitful if they are based on sound reasoning rooted in the scientific method. A major part of the scientific method is to be public in one's methods and results and to show that they are replicable - that others can independently arrive at much the same conclusion. Establishing that our data collection processes are reliable is an important step in this process of public, scientific thinking.
The best way to increase the reliability of our measuring instruments is to determine what causes unreliability and then to make sure that these causes of inconsistency are not present in the data collection strategies we employ. The following paragraphs summarize the major sources of unreliability.
The reliability of educational measurement can never be perfect. However, it can be improved by designing and administering data collection processes carefully. The way to increase reliability is to minimize the sources of unreliability cited in the previous section. There are statistical procedures for determining coefficients of reliability, and one of the ambitions of professional test constructors is to get this coefficient to be high. The use of these coefficients will be discussed in the next section of this chapter. At this point, let us say that it is possible (and important) to take steps to improve reliability (and to verify that others have done so) even if you never intend to compute a reliability coefficient outside an assignment for a college course. The following are specific guidelines for improving the reliability of measuring instruments:
{Page 93 is missing.}
It is important to understand this information before proceeding. If you are a true skeptic, you might by now realize that this quiz may itself be unreliable. If that worries you, try the appropriate exercises in the workbook. A longer test will enable you to make a more reliable (consistent) judgment regarding your knowledge of this material.
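The claim that a longer test supports a more consistent judgment can be illustrated with the Spearman-Brown formula, a standard psychometric result that this chapter does not itself present. The sketch below is only an illustration: the function name and the sample reliability of .50 are hypothetical, and the formula assumes that the added items are comparable in quality to the original ones.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after changing test length by length_factor.

    reliability   -- reliability coefficient of the current test (0 to 1)
    length_factor -- new length divided by old length (2 means twice as long)
    Assumes the added items are comparable to the existing ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical short quiz with a reliability of .50
for factor in (1, 2, 4):
    predicted = spearman_brown(0.50, factor)
    print(f"{factor}x length -> predicted reliability {predicted:.2f}")
```

Under these assumptions, each doubling of length raises the predicted reliability, although the gains shrink as the coefficient approaches 1.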
STATISTICAL PROCEDURES FOR ESTIMATING RELIABILITY
Reliability coefficients are statistical procedures for estimating how consistent a data collection process is. These are important tools. Even if you do not feel a particular urge to compute these statistics, you should still be concerned about the reliability of your data collection techniques. These procedures are described here because they are relatively easy to understand and can be helpful to you. In addition, you will often want to administer professionally prepared data collection procedures, interpret the results of such procedures, or read about them in the published literature. Understanding the meaning of these statistical procedures can be extremely helpful for these purposes.
The following are the basic types of statistical reliability coefficients:
The main value of the split-half procedure is that it can easily
be computed by hand; but it is not as precise as the others, and
computers have rendered it obsolete. The Kuder-Richardson
coefficient is a better estimate of internal consistency than the
split-half procedure, and it is frequently reported with
computerized scoring packages for objectively scored tests. Since
it is applicable to every situation in which the other two can be
used and to other situations as well, coefficient alpha is clearly
the most important indicator of internal consistency.
Internal consistency reliability sets the upper limit for the
other statistics that measure relationships among variables
(including the other reliability coefficients). The statistical
logic to support this statement will not be presented in this
book. In practical terms, this means that it is a good strategy to
use internal consistency as a starting point for developing
reliable data collection procedures. If you develop solid,
internally consistent procedures and then do other things right,
you will be able to conduct reliable measurements of outcome
variables. If you fail to develop internally consistent
procedures, then it is unlikely that your attempts to measure
outcome variables will be reliable.
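For readers who want to see what computing coefficient alpha involves, here is a minimal sketch using the commonly cited formula: alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the small data set are hypothetical, and the formula is supplied here as an assumption rather than taken from this chapter.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Estimate coefficient alpha from one administration of a test.

    item_scores -- list of respondents, each a list of item scores
                   (all respondents must answer the same k items, k >= 2).
    Raises ZeroDivisionError if every respondent has the same total score.
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    columns = list(zip(*item_scores))              # one tuple per item
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical data: five students, four items scored 1 (right) or 0 (wrong)
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(f"coefficient alpha = {coefficient_alpha(scores):.2f}")
```

The test is administered only once; the coefficient rises as the items rank the respondents in similar ways.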
Interobserver agreement is used only when a yes/no decision is made regarding the occurrence/nonoccurrence of an event. When the observer makes a rating, interscorer reliability (which we discussed earlier in this section) is the appropriate estimate of reliability. Interobserver agreement is important in situations (such as behavior modification programs) where the data collection consists of observing a child to determine whether he is performing a predefined behavior.
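Interobserver agreement, as described above, is simply the percentage of observation intervals on which two observers make the same yes/no decision. The following sketch shows the arithmetic; the function name and the ten intervals of data are hypothetical.

```python
def percent_agreement(observer_a, observer_b):
    """Return the percentage of intervals on which two observers agree.

    Each argument is a list of True/False decisions, one per interval,
    recording whether the predefined behavior occurred.
    """
    if len(observer_a) != len(observer_b):
        raise ValueError("both observers must record the same number of intervals")
    if not observer_a:
        raise ValueError("at least one interval is required")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100.0 * agreements / len(observer_a)


# Hypothetical records for ten intervals
observer_a = [True, True, False, False, True, False, True, True, False, True]
observer_b = [True, False, False, False, True, False, True, True, True, True]
print(percent_agreement(observer_a, observer_b))  # 80.0 (agreement on 8 of 10 intervals)
```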
In a very real sense, you may not need any reliability coefficient. What you need is the concept of reliability, because you want your measurements, observations, and interviews to be consistent. A coefficient is merely a tool to help estimate consistency. The question, therefore, is what kind of statistical reliability is going to be helpful to you in determining whether your measuring instruments are consistent. The preceding descriptions (summarized in Table 5.1) should help you make decisions regarding whether a statistical procedure may be helpful to you and to interpret these statistics when other researchers report them.
Table 5.1 Statistical Methods of Estimating Reliability

| Type of Reliability | Purpose | Procedure | Statistic Employed |
|---|---|---|---|
| Test-retest reliability | To ensure stability; to rule out the likelihood that results will fluctuate widely on different administrations of the same instrument to the same people | Administer the same test twice to the same group with a time interval in between; then compute the correlation coefficient | Correlation coefficient |
| Equivalent-forms reliability | To ensure that two forms of a test are actually equivalent | Administer two forms of the same test to the same group in close succession; compute the correlation coefficient | Correlation coefficient |
| Test-retest with equivalent forms reliability | To ensure both stability and equivalence (combines first two methods) | Administer one form; let time pass; administer second form; compute the correlation coefficient | Correlation coefficient |
| Internal consistency reliability | To determine the extent to which the items on a test are measuring a common characteristic (to ensure internal consistency) | Administer the test only once; apply a formula to compute coefficient alpha | Coefficient alpha |
| Interscorer reliability | To determine the extent to which the results are objective; i.e., will be the same no matter who scores the test | Administer the test once; have two different persons score the test; compute the correlation coefficient | Correlation coefficient |
| Interobserver agreement | To determine the extent to which different observers can agree whether an outcome is occurring | Have two observers watch for the occurrence of an event during a designated number of intervals; compute the percentage of intervals during which they agree | Percentage of agreement |
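Several of the methods in Table 5.1 end with the instruction to compute the correlation coefficient. As a rough illustration of that step for test-retest reliability, the sketch below computes a Pearson correlation between two administrations of the same test. The function name and the scores are hypothetical, and we assume that the Pearson coefficient is the one intended.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores.

    Raises ZeroDivisionError if either list has no variation.
    """
    if len(first) != len(second):
        raise ValueError("both administrations must include the same people")
    n = len(first)
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    return cross / sqrt(ss_1 * ss_2)


# Hypothetical scores for six students tested twice, two weeks apart
week_1 = [72, 85, 90, 64, 78, 95]
week_3 = [70, 88, 87, 66, 80, 93]
print(f"test-retest reliability estimate: {pearson_r(week_1, week_3):.2f}")
```

The same computation would serve for equivalent-forms and interscorer reliability; only the source of the two sets of scores changes.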
Identify the type of statistical reliability that would be helpful in determining whether the stated data collection technique is consistent.
While correlation coefficients give good estimates of the reliability of data collection processes, they are not directly useful for communicating information about the degree to which a specific score is likely to be accurate. The standard error of measurement is a statistic that is based on reliability coefficients and gives information about the relative accuracy of individual scores. The standard error of measurement indicates the range within which the "true" score of the individual is likely to fall - taking into consideration the unreliability of the test. For example, if a student received a score of 85 on a test with a standard error of measurement of 4.0, then her true score would probably range somewhere between 81.0 and 89.0. If the standard error of the test were 7.0, then this student's true score would probably lie in the range of 78.0 to 92.0. (The word probably in the previous two sentences means that the statistical formula gives about a 68% probability that the true score falls in the designated range.) Since the standard error of measurement is based on the reliability of the data collection process, higher test reliability leads to a smaller standard error of measurement - that is, to a more narrow range of scores within which the true score would be likely to fall.
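The relationship between reliability and the standard error of measurement can be sketched with the usual textbook formula, SEM = SD x sqrt(1 - reliability), which this chapter does not state and which we supply here as an assumption. The function names, the standard deviation of 10, and the reliability of .84 are hypothetical; the bands for the score of 85 reproduce the examples in the preceding paragraph.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM from the standard deviation of scores and a reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def score_band(observed, sem):
    """Range of one SEM on either side of the observed score (about 68%)."""
    return (observed - sem, observed + sem)


print(score_band(85, 4.0))  # (81.0, 89.0)
print(score_band(85, 7.0))  # (78.0, 92.0)

# Hypothetical test with SD = 10: higher reliability gives a smaller SEM
for reliability in (0.51, 0.84, 0.96):
    sem = standard_error_of_measurement(10, reliability)
    print(f"reliability {reliability:.2f} -> SEM {sem:.1f}")
```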
The standard error of measurement has considerable practical importance. Within the context of the preceding paragraph, it is reasonable to think of the standard error of measurement as an estimate of the "likely error" of a data collection process. For example, if a person scores 115 one year on an IQ test that has a standard error of measurement of 5 and then scores 112 on a parallel form of the test the next year, we would assume that this probably represents a normal fluctuation of scores rather than an actual deterioration in performance. The standard error of measurement is closely related to the concept of standard deviations (discussed in chapter 7) and to the concept of confidence intervals (discussed in chapter 8). A major advantage of standardized tests (discussed in chapter 6) is that their test manuals almost always include information on the standard error of measurement.
HOW RELIABLE DOES A DATA COLLECTION PROCESS HAVE TO BE?
It is an axiom that no data collection process in education can ever be perfectly reliable. Whether you use statistical procedures or not, it is obvious that some data collection processes are more reliable than others. The reliability of almost any given data collection process could be improved, if you worked a little harder or added more items or observations. How reliable is reliable enough? The answer is that the necessary degree of reliability depends on what you plan to do with the results of your data collection.
If you are giving a weekly arithmetic test, and you happen to make an inaccurate decision based on it, this is probably not a serious problem. If you give a child credit for mastering a topic and you discover a day later that she has not mastered it after all, then you can simply change your decision and offer her some additional instruction. Although you would not want to make frivolous decisions even in such cases, it is obvious that you could settle for a more unreliable instrument than you would require if you were deciding whether that same student should embark upon a college preparatory curriculum in mathematics. Therefore, the first answer to your question is that the data collection process needs to be more reliable to the extent that the decisions based on it are likely to be permanent or irreversible.
A second, closely related factor is whether the results of the data collection process will be the only source of information in making a decision or whether they will be supplemented by other sources of data. In chapter 4 we recommended multiple operational definitions of outcome variables and multiple methods to measure these outcomes. To the extent that a data collection process is effectively supplemented by other sources of information, lower reliability is tolerable. The inconsistencies and imprecision in one set of data will be counterbalanced by information from other sources.
The point is this: The more confidence you want to be able to place in the score an individual attains, the greater the reliability you should require from your instrument.
The situation is somewhat different when you are examining group accomplishments rather than diagnosing the performance of an individual. The factors that lead to unreliability (inconsistency) on a data collection process are often essentially random, and they tend to average out over the long run. In other words, if one student improves his score by guessing accurately on a test, it is probable that someone else's score has been hurt to a similar proportion by poor guessing on the same test. Therefore, a chance factor like guessing is likely to contribute less to inconsistency when group evaluations rather than individual evaluations are being considered. For this reason, substantially lower reliabilities are acceptable for comparing group scores than for comparing individual scores. In addition, note that when statistical comparisons are made among groups (see chapter 14), the statistical estimates of reliability will be accounted for in the computation of the statistical comparison.
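The idea that random errors tend to average out in group scores can be illustrated with a small simulation. This is only a sketch under stated assumptions: measurement error is treated as purely random and normally distributed, and the class size, number of classes, and error size are hypothetical values chosen for illustration.

```python
import random
from statistics import mean, pstdev


def simulate_errors(n_students=30, n_classes=500, error_sd=5.0, seed=0):
    """Compare the spread of individual errors with the spread of class-mean errors.

    Each student's score is assumed to be off by a random amount drawn
    from a normal distribution with standard deviation error_sd.
    Returns (spread of individual errors, spread of class-average errors).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    individual_errors = []
    class_mean_errors = []
    for _ in range(n_classes):
        errors = [rng.gauss(0.0, error_sd) for _ in range(n_students)]
        individual_errors.extend(errors)
        class_mean_errors.append(mean(errors))
    return pstdev(individual_errors), pstdev(class_mean_errors)


individual_spread, group_spread = simulate_errors()
print(f"typical error in one student's score: {individual_spread:.2f}")
print(f"typical error in a class average:     {group_spread:.2f}")
```

Under these assumptions, the error in a class average is much smaller than the error in any one student's score, which is why lower reliabilities can be tolerated for group comparisons.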
Although it would often be absurd to evaluate an individual's performance in a history course based on her answer to a small set of questions, it would nevertheless make sense to evaluate the performance of a group based on the group's answers to that same set of questions. (Of course, it is still relevant to ascertain that the questions properly sample the topics covered in the history course; see the discussion of content validity later in this chapter.) In fact, this is exactly what the highly reputable National Assessment of Educational Progress (NAEP) is attempting. The NAEP is asking several questions to carefully selected groups of students in schools throughout the United States. On the basis of NAEP results, it would be possible to conclude something like, "In 1980, only 70% of fifth graders knew who Christopher Columbus was, whereas in 1990, 95% of fifth graders knew who he was." On the other hand, it would not be appropriate to use one child's answer to that same question to draw reliable conclusions about his knowledge of history.
Finally, one must consider how high reliabilities should be for commercially prepared tests. If we're paying the pros to come up with good tests, shouldn't we expect the tests to be highly reliable? Here again, it depends on what kind of test you're looking for. Commercially available intelligence tests often report reliabilities of .90 or higher. On the other hand, some personality tests used for group research report reliabilities of only .60. The general strategy is to determine what you want to use the test for, and then to look for information regarding the specific type of reliability needed to achieve that goal. (For example, look for equivalent-forms reliability, not just internal consistency, if you are interested in using one form for a pretest and another for a posttest.) It's a good idea to look in a source like The Mental Measurements Yearbook (Kramer & Conoley, 2002) to find out what levels of reliability are available for tests of the sort you're looking for. If there are five tests of the same sort, and four of them report reliabilities of .85 or better, then the fifth one with a coefficient of .60 is substantially less reliable.
Validity of data collection addresses the question of whether a data collection process is really measuring what it purports to be measuring. A data collection process is valid to the extent that the results are actually a measurement of the characteristic the process was designed to measure, free from the influence of extraneous factors. Validity is the most important characteristic of a data collection process.
A data collection process is invalid to the extent that the results have been influenced by irrelevant characteristics rather than by the factors the process was intended to measure. For example, if a teacher gives a reading test and the test does not really measure reading performance, the test is useless. There is no logical way that the invalid test can help the teacher measure the outcome in which she is interested. If she gives a self-concept test that is so difficult to read that the third graders taking it are unable to interpret the tasks correctly, the test cannot validly measure self-concept among those students. It is invalid for that purpose, because it is so heavily influenced by reading skills that self-concept is not likely to come to the surface. This test cannot help the teachers make decisions about the outcome variable "self-concept." For example, if they ran a self-concept program for their students and their students' "self-concept" scores improved, how could they know whether it was really self-concept and not just reading ability that improved? In designing and carrying out any sort of data collection process, therefore, validity is of paramount importance.
As we said with regard to reliability, it is important to keep in mind that it is the validity of the data collection process - not of the data collection instrument - that must be demonstrated. What we really want to do is strengthen the validity of the conclusions we draw based on the data collection process; we don't want to draw conclusions based on the measurement of the wrong outcomes. It is technically incorrect to refer to the validity of a test. A test, a checklist, an interview schedule, or any other data collection device that is valid in one setting or for one purpose may be invalid in another setting or for another purpose. Therefore, this chapter always refers to the validity of data collection processes. It is important to remember this distinction.
What makes a data collection process valid or invalid? A data collection process is valid to the extent that it meets the triple criteria of (1) employing a logically appropriate operational definition, (2) matching the items to the operational definition, and (3) possessing a reasonable degree of reliability. Invalidity enters the picture when the data collection strategy fails seriously with regard to one of these criteria or fails to lesser degrees in a combination of these criteria.
It may be instructive to look at some examples of invalid data collection processes. Assume that a researcher wants to develop an intelligence test. He operationally defines intelligence as follows: "A person is intelligent to the extent that he/she agrees with me." He then makes up a list of 100 of his opinions and has people indicate whether they agree or disagree with each item on this list. A person agreeing with 95 of the items would be defined as being more intelligent than one who agreed with 90, and so on. This is an invalid measure of intelligence, because the operational definition has nothing to do with intelligence as any reputable theorist has ever defined it.
Not all invalid data collection processes are so blatantly invalid. Indeed, one of the most heated arguments in psychology today is over the question of what intelligence tests actually measure. This whole question is one of validity. The advocates of many IQ tests argue that intelligence can be defined as general problem-solving ability. They operationally define intelligence as something like, "People are intelligent to the extent that they can solve new problems presented to them." They test for intelligence by giving a child a series of problems and counting how many she can solve. A child who can solve a large number of problems is considered to be more intelligent than one who can solve only a few. The opponents of such tests argue that the tests are invalid. They say that general problem-solving ability is not the only quality - or even the most important one - required to do well on such tests. The tests, they argue, really measure how well a person has adapted to a specific middle-class culture. Success on such tests, therefore, is really an operational definition of "ability to adapt to middle-class culture." Since the test is designed to measure intelligence but really measures a different ability, it is invalid. The argument over the validity of IQ tests is far from settled. Important theorists continue to line up on both sides, and others continue to suggest compromises - such as recommending new tests or redefining the concept of intelligence.
Consider another hypothetical intelligence test. Assume that we ask the child one question directly related to a valid operational definition. This is an excessively short test, and thus it is likely to provide an unreliable estimate of intelligence. Our result is also likely to be invalid, because our conclusion that a child is a genius for answering 100% of the questions correctly is about as likely to be a result of chance factors (unreliability) as it is to be a result of real ability related to the concept of intelligence.
The factors that determine the validity of a data collection process are diagrammed in Figure 5.1. The first test cited in this section was invalid because the operational definition was inappropriate. In the second case, the operational definition was logically appropriate, but it was not clear whether the tasks the child performed were really related to this operational definition. The final IQ test was considerably limited in its validity because the test was unreliable.



To the extent that there is a complete breakdown at any of these stages, the data collection process is invalid. Likewise, if there is a cumulative breakdown at several stages, the data collection process can be invalid.
Factors Influencing Test Validity
From the preceding discussion, it can be seen that there are three steps to establishing the validity of a data collection process designed to measure an outcome variable:
1. Demonstrate that the operational definition upon which the data collection process is based is actually a logically appropriate operational definition of the outcome variable under consideration. The strategy for demonstrating logical appropriateness was discussed in detail in chapter 4, where we pointed out that operational definitions are not actually synonymous with the outcome variable but rather represent the evidence that we are willing to accept to indicate that an internal behavior is occurring. Table 5.2 lists some cases where the operational definitions are to varying degrees logically inappropriate. For example, if the instructors in English 101 administer an anonymous questionnaire at the end of the semester to evaluate their performance in the course, they might think that the students are responding to questions about how they performed during the course. However, it's possible that the students who are completing the questionnaire are thinking, "If we tell them what we really think, they'll be upset and come down hard on us when they grade the exam. I think we should play it safe and give them good ratings for the course." If this is what students are thinking, then the favorable comments on the questionnaire are actually an operational definition of "anxiety over alienating instructor" rather than of "quality teaching."
In many cases, the logical connection is easy to establish, and hence the logical fallacies found in Table 5.2 are often easy to avoid. For example, the connections between the operational definitions and the outcome variables in Table 5.3 are much more obvious than the connections in Table 5.2. It's still possible for a person to perform the behaviors described in the operational definitions without having achieved the outcome variable, but it is much less likely than was the case in the situations in Table 5.2.
Logical inappropriateness is most likely to occur when the outcome variable under consideration is a highly internalized one. Affective outcomes present particularly difficult problems, because the evidence is much less directly connected to the internal outcome than is the case with behavioral, psychomotor, and cognitive outcomes. The guidelines presented in chapter 4 are applicable here - namely, rule out as many alternative explanations as possible, and use more than one operational definition.
Table 5.2 Some Examples of Logically Inappropriate Operational Definitions of Outcome Variables

| Outcome Variable | Operational Definition | What the Behavior May Actually Indicate |
|---|---|---|
| Ability to understand reading passages | The pupil paraphrases a passage he/she has read silently | Ability to guess from context clues |
| Love of Shakespearean drama | The student will carry a copy of Shakespeare's plays with him to class | Eagerness to impress professor |
| Appreciation of English 101 | The students will indicate on a questionnaire that they liked the course | Anxiety over alienating instructor |
| Knowledge of driving laws | The candidate will get at least 17 out of 20 true-false questions right on license test | Ability to take true-false tests with subtle clues present in them |
| Friendliness toward peers | The pupil will stand near other children on the playground | Anxiety over being beaten up if he or she stands apart |
| Appreciation of American heritage | Child will voluntarily attend the Fourth of July picnic given by the American Legion | Appreciation of watching fireworks explode |
Table 5.3 Some Examples of Operational Definitions That Are Almost Certain to Be Appropriate for the Designated Outcome Variables

| Outcome Variable | Operational Definition |
|---|---|
| Ability to add single-digit integers | The student will add single-digit integers presented to him ten at a time on a test sheet |
| Ability to tie one's own shoes | The student will tie her own shoes after they have been presented to her untied |
| Ability to bench press 150 pounds | The student will bench press 150 pounds during the test period in the gymnasium |
| Ability to spell correctly from memory | The student will write down from memory the correct spelling of each word given in dictation |
| Ability to spell correctly on essays with use of dictionary | The student will make no more than two spelling errors in a 200-word essay written during class with the aid of a dictionary |
| Ability to type 60 words per minute | The student will type a designated 300-word passage in five minutes or less |
| Ability to raise hand before talking in class | The student will raise his hand before talking in class |
| Ability to recall the quadratic equation | The student will write from memory the quadratic equation |
| Ability to apply the quadratic equation | Given the quadratic equation and ten problems that can be solved using the equation, the student will solve at least nine correctly |
2. Demonstrate that the tasks the respondent has to perform to generate a score during the data collection process match the task suggested by the operational definition. The benefits of stating operational definitions can be completely nullified if the tasks that generate a score during the data collection process do not match the tasks stated in the operational definitions.
Table 5.4 provides examples of such mismatches. The first three are not intended to be facetious. Mismatches this obvious actually do occur on teacher-designed tests. Teachers say they are going to measure one thing, and then they measure something else. The other examples in Table 5.4 are more subtle. In these cases, the teacher has one behavior in mind; and in fact, many of the persons responding to the data collection process will perform the behavior anticipated by the teacher. But the mismatch occurs whenever a respondent performs the different or additional tasks indicated in the second column of the table.
Table 5.4 Some Examples of a Mismatch Between the Operational Definition and the Task the Respondent Has to Perform on the Instrument
| Operational Definition | Task the Respondent Actually Has to Perform |
|---|---|
| The student will add single-digit integers presented to him ten at a time on a test sheet | "If I have three apples and you give me two more apples, how many do I have?" |
| The student will solve problems using the quadratic equation | "Explain the derivation of the quadratic equation." |
| The student will use prepositions correctly in her essays | "Write the definition of a preposition." |
| The student will apply the principles of operant conditioning to hypothetical situations | The student first has to unscramble a complex multiple-choice thought pattern and then apply the principles |
| Given a (culturally familiar) novel problem to solve, the test taker will be able to solve the problem | The student is presented with a problem entirely foreign to his cultural background |
| The student will describe the relationship between nuclear energy and atmospheric pollution | The student will write, in correct grammatical structures, a description of the relationship between nuclear energy and atmospheric pollution |
| The student will circle each of the prepositions in the paragraph provided | The student will first decipher the teacher's unintelligible directions and then circle each of the prepositions |
| The respondent will place herself in the simulated job situation provided to her and will indicate how she would perform in that situation | The respondent has to first ignore that the situation is absurdly artificial and highly different from the real world and then still respond as she would perform in the hypothetical situation |
When questions arise concerning various sorts of bias in the data collection process, it is often the mismatch between task and operational definition that is being challenged. For example, with regard to bias in IQ tests, one of the most common arguments is essentially that middle-class youngsters who take the test are actually performing behaviors related to the operational definition, whereas equally intelligent lower-class youngsters are taking a test where there is a discrepancy between what they are doing and the operational definition of intelligence.
It is important to be aware of the various kinds of bias and other contaminating factors that could cause discrepancies, and to carefully rule these out. Such sources of mismatching include cultural bias, test-wiseness, reading ability, writing ability, ability to put oneself in a hypothetical framework, tendency to guess, and social-desirability bias. The preceding list is not to be considered exhaustive. There are other factors unique to specific individuals that produce a similar effect. A good way to assure a match is to have several different qualified persons examine the data collection process and state whether the task matches the operational definition.
A special type of mismatch between operational definition and task is worth mentioning. Some data collection strategies are so obtrusive that the respondent is more likely to be responding to the data collection process itself than to be performing the tasks indicated in the operational definition. For example, if a child knows that a questionnaire is measuring prejudice and that it is not nice to be prejudiced, the child may answer what he thinks he should answer instead of revealing his true attitude. (This is referred to as a social-desirability bias.) Likewise, if a researcher comes into the classroom and sits in a prominent position with a behavioral checklist, children may be acutely aware that something unusual is happening; and so the behavior recorded on the checklist is more a reaction to the data collection strategy than an indication of actual behavioral tendencies. (Specific strategies for overcoming obtrusiveness are discussed in chapter 6.)
3. Demonstrate that the data collection process is reliable. Reliability was discussed extensively earlier in this chapter. The contribution of reliability to validity was mentioned in Figure 5.1 and in the accompanying discussion. The relationship between reliability and validity is diagrammed more specifically in Figure 5.2. As this diagram suggests, a certain amount of reliability is necessary before a data collection process can possess validity. In other words, a data collection process cannot measure what it's supposed to measure if it measures nothing consistently. In demonstrating that data collection processes are valid, professional test constructors first demonstrate that their data collection processes are reliable - that they measure something consistently; then they demonstrate that this something is the characteristic that the data collection processes are supposed to measure. In other words, they first demonstrate reliability in several ways, and then they demonstrate validity.
An important caution is necessary in discussing the relationship between reliability and validity. It is crucial to realize that it is possible (but undesirable and inappropriate) to increase reliability while simultaneously reducing the validity of a data collection process. This can be done by either (1) narrowing or changing the operational definition so that it is no longer logically appropriate or (2) changing the tasks based on the operational definition to less directly related tasks and then (3) devising a more reliable data collection process based on the more measurable but less appropriate operational definition or tasks. This is obviously a bad idea, because the result is that the data collection now measures a less valid or wrong outcome "more reliably."
Such an increase in reliability accompanied by a reduction in validity occurs, for example, if a teacher introduces unnecessarily complex language into a data collection process. A data collection process that had previously measured "ability to apply scientific concepts" might now instead measure "ability to decipher complex language and then apply scientific concepts." The resulting reliability might be higher; but if the teacher is still making decisions about the original outcome, the data collection process has become less valid.
Overemphasis on reliability is one of the arguments against culturally biased norm-referenced tests. Their detractors argue that many standardized tests become more reliable when cultural bias is added, because such bias is a relatively stable (consistent) factor, which is likely to work the same way on all questions and on all administrations of the test. However, the cultural bias detracts from the validity of the test.
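The point of the preceding paragraphs can be illustrated with a small simulation. The sketch below is not drawn from any actual test; it assumes a simple additive model in which each observed score is the sum of the intended trait, an optional stable irrelevant factor (standing in for something like cultural familiarity or language complexity), and random error. The function names, variances, and sample size are all hypothetical choices made for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(bias_sd, n=5000, seed=1):
    """Return (reliability, validity) under an assumed additive score model.

    reliability: correlation between two administrations of the test
    validity:    correlation between one administration and the true trait
    """
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n)]        # what we want to measure
    bias = [rng.gauss(0, bias_sd) for _ in range(n)]   # stable irrelevant factor
    # The same trait and bias enter both administrations; only the error differs.
    first = [t + b + rng.gauss(0, 1) for t, b in zip(trait, bias)]
    second = [t + b + rng.gauss(0, 1) for t, b in zip(trait, bias)]
    return pearson(first, second), pearson(first, trait)

rel_plain, val_plain = simulate(bias_sd=0.0)
rel_biased, val_biased = simulate(bias_sd=2.0)
print(f"no stable bias:   reliability={rel_plain:.2f}, validity={val_plain:.2f}")
print(f"with stable bias: reliability={rel_biased:.2f}, validity={val_biased:.2f}")
```

Under these assumed variances, the model implies a reliability of roughly .50 and a validity of roughly .71 without the stable factor, and a reliability of roughly .83 but a validity of only about .41 with it. The simulated estimates should fall close to those values. The irrelevant factor makes the scores more consistent while making them a poorer indicator of the intended outcome.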
It is important to be alert to the tendency to accept spuriously high statistical estimates of reliability as solid evidence of validity. The fact that a certain amount of reliability is a necessary prerequisite for validity does not mean that the most reliable data collection process is also the most valid. Statistical reliability is only one factor in establishing the validity of a data collection process. Another way to state this is to say that reliability is a necessary but not sufficient condition for validity.
As you can see, establishing validity is predominantly a logical process.
Finally, before leaving this introduction to the validity of data collection processes, it is important to note that a data collection process that provides valid data for group decisions will not always provide valid data for decisions about individuals. On the other hand, a data collection process that provides valid data for decisions about individuals will always provide valid data for group decisions. This is not as complicated as it sounds. To take an example, we might operationally define appreciation of Shakespeare as "borrowing Shakespearean books from the library without being required to do so." Even if Janet Jones borrows books on Shakespeare without being required to do so, it is not possible to diagnose her specifically as either appreciating or not appreciating the bard using this operational definition. There are too many competing explanations for her behavior, and these would invalidate this data collection process as an estimate of her appreciation. (For example, she might hate the subject but need to pass the exam; and so she has to borrow a vast number of books to do burdensome, additional studying. Or she might like Shakespeare so much that she owns annotated copies of all the plays and never has to borrow from any library except her own.) Nevertheless, it may still be valid to evaluate the group based on this operational definition. If you teach the Shakespeare plays a certain way one year and only 2% of the students ever borrow related books from the library, and the next year you teach the same subject differently and 50% of the students spontaneously borrow books, it is probably valid to infer from their available documented records that appreciation of Shakespeare has increased. The group decision, at any rate, is more likely to be valid than is the individual diagnosis.
Box 5.1 An Argument-Based Approach to Validity
Kane (1992) presents the practical yet sophisticated idea that validity should be discussed in terms of the practical effectiveness of the argument to support the interpretation of the results of a data collection process for a particular purpose. The researcher or user of the research chooses an interpretation of the data, specifies the interpretive argument associated with that interpretation, identifies competing interpretations, and develops evidence to support the intended interpretation and refute the competing interpretations. The amount and type of evidence needed in a particular case depend on the inferences and assumptions associated with a particular application.
The key points in this approach are that the interpretive argument and the associated assumptions be stated as clearly as possible and that the assumptions be carefully tested by whatever strategies will best rule out bias and other sources of faulty conclusions. As the most questionable inferences and assumptions are checked and either supported by the evidence or adjusted so that they become more plausible, the plausibility (validity) of the interpretive argument increases.
This interpretation of validity is compatible with the discussion presented in this chapter. In addition, it has the advantage of presenting validity as a special instance of the overall application of formal and informal reasoning to solving problems. From this viewpoint, when educators do research, they are under the same obligation as any other person making public statements to demonstrate that those statements really do mean what the speaker or writer says they mean. Statistical procedures and other specific techniques are merely pieces of evidence to check the quality of inferences and the authenticity of the assumptions underlying a particular interpretation.
(Source: Kane, M. T. [1992]. An argument-based approach to validity. Psychological Bulletin, 112, 527-535.)
Part 1
Identify the item from each pair that is most likely to be an invalid measure of the outcome variable given in parentheses.
Set 1.
a. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (understands Spanish)
b. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (appreciates Spanish culture)
Set 2.
a. The student will identify examples of the principles of physics in the kitchen at home. (understands principles of physics)
b. The student will choose to take optional courses in the physical sciences. (appreciates physical sciences)
Part 2
Write Invalid next to statements that indicate an invalid data collection process; write Valid next to those that indicate a valid data collection process; write N if no relevant information regarding validity is contained in the statement.
1____ The questions were so hard that I was reduced to flipping a coin to guess the answers.
2____ The test measures mere trivia, not the important outcomes of the course.
3____ To rule out the influence of memorized information regarding a problem, only topics that were entirely novel to all the students were included on the problem-solving test.
4____ The only way he got an A was by having his girlfriend write the term paper for him.
5____ The length of the true-false English test was increased from 30 to 50 items to minimize the chances of getting a high score by guessing.
6____ The teacher ruled out the likelihood of cheating by giving each of the students seated at the same table a different form of the test.
7____ Since the personality test had such a difficult vocabulary level, it probably was influenced more by intelligence than by personality factors.
8____ The observer rated the classroom as displaying a hostile environment toward handicapped people, but the teacher argued that the observer's judgment was clouded because she observed from a position where she was next to students who were not at all typical of the entire class.
9____ The observer rated the atmosphere of the school board meeting as being supportive of innovative teaching, but the newspaper critic pointed out that this was because the board members were local residents with business interests and were therefore very likely to be supportive of innovation.
If you got most of the questions in Review Quiz 5.4 correct, or if you easily saw the logic of the explanations, then you probably have a good basic grasp of the concept of validity. If you do not understand the concept, reread the chapter to this point, check the chapter in the workbook, refer to the recommended readings, or ask your instructor or a peer for help. Be sure that you understand the summary in the following paragraph so that you will profit from the rest of this chapter.
In summary, validity refers to whether a data collection process really measures what it is designed to measure. Invalidity occurs to the extent that the data collection process measures an incorrect variable or no consistent variable at all. The main sources of invalidity are logically inappropriate operational definitions, mismatches between operational definitions and the tasks employed to measure them, and unreliability of data collection processes. Validity is not an all-or-nothing characteristic; data collection processes range from strong validity to weak validity. Because of the highly internalized nature of educational outcomes, data collection processes in education can never be perfectly valid. By carefully stating appropriate operational definitions, ascertaining that tasks employed in data collection processes are directly related to the operational definitions, and designing reliable data collection processes, we can increase the validity of our data collection processes and the probability that we will draw valid conclusions from them.
SPECIFIC, TECHNICAL EVIDENCE OF MEASUREMENT VALIDITY
If you read a test manual or look up the citation of a test in The Mental Measurements Yearbook (Kramer & Conoley, 2002), you will find references to three basic types of evidence to support measurement validity. These have been defined by several major organizations interested in mental measurement (American Educational Research Association et al., 1985). The technical types of evidence for validity are rooted in the theory discussed earlier in this chapter, and it is not difficult to achieve a fundamental understanding of these concepts. A brief discussion of these types of evidence for validity can help teachers and researchers develop more valid data collection processes for their own use. In addition, an understanding of these concepts will be especially useful when selecting or using standardized tests, reading the professional literature, and attempting to measure psychological or theoretical characteristics beyond those that are typically covered by classroom tests. These three types of evidence for validity are (1) content validity, (2) criterion-related validity, and (3) construct validity.
Content validity refers to the extent to which a data collection process measures a representative sample of the subject matter or behavior that should be encompassed by the operational definition. A high school English teacher's midterm exam, for example, lacks content validity when it focuses exclusively on what was covered in the last two weeks of the term and inadvertently ignores the first six weeks of the grading period. Likewise, a self-concept test would lack content validity if all the items focused on academic situations, ignoring the impact of home, church, and other factors outside the school. Content validity is assured by logically analyzing the domain of subject matter or behavior that would be appropriate for inclusion on a data collection process and examining the items to make sure that a representative sample of the possible domain is included. In classroom tests, a frequent violation of content validity occurs when test items are written that focus on knowledge and comprehension levels (because such items are easy to write), while ignoring the important higher levels, such as synthesis and application of principles (because such items are difficult to write).
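One informal way to check coverage is to tally the items against the domain they are supposed to sample. The following is a minimal sketch, assuming a hypothetical eight-week grading period, an invented item list resembling the midterm example above, and equal weight for every week. The function name and the tolerance threshold are illustrative assumptions, not a standard procedure.

```python
from collections import Counter

def coverage_gaps(item_weeks, all_weeks, tolerance=0.5):
    """Flag weeks whose share of items falls well below an even share.

    item_weeks: the week of instruction each test item draws on
    all_weeks:  every week the test is supposed to sample
    tolerance:  fraction of the even share below which a week is flagged
                (an arbitrary illustrative threshold)
    """
    counts = Counter(item_weeks)
    even_share = len(item_weeks) / len(all_weeks)
    return [week for week in all_weeks if counts[week] < tolerance * even_share]

# Hypothetical 20-item midterm that draws only on weeks 7 and 8.
midterm_items = [7] * 9 + [8] * 11
weeks = list(range(1, 9))
print(coverage_gaps(midterm_items, weeks))  # weeks 1 through 6 are flagged
```

A tally like this only shows that areas are missing or thin. Deciding how much weight each area deserves remains a logical judgment about the operational definition.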
Criterion-related validity refers to how closely performance on a data collection process is related to other measures of performance. There are two types of criterion-related validity: predictive and concurrent.
Predictive validity refers to how well a data collection process predicts some future performance. If a university uses the Graduate Record Exam (GRE) as a criterion for admission to graduate school, for example, the predictive validity of the GRE must be known. This predictive validity would have been established by administering the GRE to a group of students entering a school and determining how their performance on the GRE corresponded with their performance in that school. It would be expressed as a correlation coefficient. A high positive coefficient would indicate that persons who did well on the GRE tended to do well in graduate school, whereas those who scored low on the GRE tended to perform poorly in school. A low correlation would indicate that there was little relationship between GRE performance and success in that particular graduate school.
Concurrent validity refers to how well a data collection process correlates with some current criterion - usually another test. It "predicts" the present. At first glance it sounds like an exercise in futility to predict what is already known, but more careful consideration will suggest two important uses for concurrent validity. First, it is a useful predecessor for predictive validity. If the GRE, for example, does not even correlate with success among those who are going to school right now, then there is little value in doing the more expensive, time-consuming, predictive validity study. Second, concurrent validity enables us to use one measuring strategy in place of another. If a university wants to require that students either take freshman composition or take a test to "test out" of the course, concurrent validity would enable the English department to demonstrate that a high score on the alternative test has a similar meaning to a high grade in the course. Like predictive validity, concurrent validity is expressed by a correlation coefficient.
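Both kinds of criterion-related validity come down to a correlation between scores on the data collection process and scores on a criterion. A minimal sketch of that computation follows. The admission-test scores and later grade-point averages are invented for illustration and are not real GRE data; the function and variable names are likewise hypothetical.

```python
def pearson_r(test_scores, criterion_scores):
    """Pearson correlation between test scores and a criterion measure."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    cov = sum((t - mean_t) * (c - mean_c)
              for t, c in zip(test_scores, criterion_scores))
    var_t = sum((t - mean_t) ** 2 for t in test_scores)
    var_c = sum((c - mean_c) ** 2 for c in criterion_scores)
    return cov / (var_t * var_c) ** 0.5

# Hypothetical data: admission test score and first-year graduate GPA
# for the same eight students, listed in the same order.
admission_test = [148, 152, 155, 158, 160, 163, 165, 168]
graduate_gpa = [2.9, 3.1, 3.0, 3.4, 3.3, 3.6, 3.5, 3.9]

r = pearson_r(admission_test, graduate_gpa)
print(f"validity coefficient r = {r:.2f}")
```

For a predictive study the criterion scores would be collected later than the test scores; for a concurrent study both would be collected at about the same time. The arithmetic is the same in either case.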
Construct validity refers to the extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs. A construct is a label or hypothetical interpretation of an internal behavior or psychological quality - such as self-confidence, motivation, or intelligence - that we assume exists to explain some observed behavior. Construct validity often necessitates an extremely complicated process of validation. To state it briefly, the researcher develops a theory about how people should perform during the data collection process if it really measures the alleged construct and then collects data to see whether this is what really happens. The process is complicated because the researcher is doing two separate things: (1) proving that the data collection process possesses construct validity and (2) refining the theory about the construct. Note that this process of validation can never be completed; the goal of researchers engaging in construct validation is to refine concepts and data collection processes, not to arrive at ultimate conclusions. Construct validity often deals with the intervening variable (discussed in chapter 2), and it is of greatest relevance to theoretical research (discussed in chapter 17).
Remember: The three technical types of evidence for validity are merely tools for demonstrating that a data collection process measures what the test designer or researcher says it measures. The fundamental logic behind them is relatively straightforward. The difficulty lies in carrying out the procedures to collect these types of evidence for validity. The information presented here (summarized in Table 5.5) should be enough to enable you to deal with applying and interpreting these concepts in most situations. If you find that you need further information (for example, if your job requires that you select people accurately for various programs), consult the more technical references in the Annotated Bibliography at the end of this chapter.
Table 5.5 Summary of the Three Major Types of Psychological Validity
| Type of Validity | Definition | Mnemonic | Examples of How to Achieve and Demonstrate It |
|---|---|---|---|
| Content | The extent to which a data collection process measures a representative sample of the topic encompassed by the operational definition. | The content of the data collection process is a good sample of the content that it should cover. | 1. Use a plan (such as an item matrix) to plan a test so that all areas are properly represented. 2. Logically show that nothing has been omitted or overrepresented in the data collection process. |
| Predictive | How well a data collection process predicts some future performance. | The data collection process predicts something that has not yet occurred. | 1. Select students for an advanced algebra class based on a standardized math test. Then see if those who did well on the math test actually do better in the course. 2. Give students the SAT before they enter college. Then compute a correlation coefficient with college GPA to see if the SAT accurately predicts college performance. |
| Concurrent | How well a data collection process correlates with a current criterion. | Both data collection processes occur at the same time (concurrently). We want to demonstrate that one can be considered a substitute for the other. | 1. Determine that success in English composition classes has already demonstrated writing skill, making it unnecessary for the student to take the English exit exam (which measures the same thing). 2. Compute a correlation coefficient between the performance of students on the computerized and non-computerized versions of the GRE (so that we can consider performance on one to be equivalent to performance on the other). |
| Construct | The extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs. | A psychological construct (accent on first syllable) is something that exists inside a person's head. We construct it (accent on second syllable) by reasoning about observable information (such as test results). | 1. A person's test results show whites are smarter than blacks. We challenge this person by demonstrating that the test measures cultural familiarity rather than intelligence. 2. A person shows that her moral development test really does measure something that can be called moral reasoning - rather than reading ability, conformity, intelligence, or some other unrelated characteristic. |
Indicate the type of technical evidence for validity in each of the following situations. Choose from this list:
a. content validity
b. predictive validity
c. concurrent validity
d. construct validity
As you will recall, Eugene Anderson, the humane educator, had written several operational definitions of "attitude toward animal life." Several of these will be discussed in the next few chapters, but here we shall focus on just one of them. Based on his second operational definition ("The child protects animals from harm"), he devised the paper-and-pencil test shown in Figure 5.3. He reasoned that a person with a favorable attitude would want the fireman to save animals before he or she saved objects from a burning building. (The validity of this belief will be discussed next.)
Mr. Anderson planned to give each respondent a score between 0 and 3, depending on how many animals the child selected on this test. Of course, he wanted to be reasonably certain that the score a child received on any testing occasion would actually represent that child's feelings toward animals, not some irrelevant or transient factor. In addition, he wanted to have two forms of the test; and so he devised a second test ("Billy and the Fireman" - not shown here), which contained a different set of animals and objects. Mr. Anderson needed to ascertain that both tests really were equivalent forms of the same test. If they were really equivalent forms, then he could give one as a pretest and the other as a posttest, to determine whether attitudes really changed as a result of his visits.
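The scoring rule Mr. Anderson describes is simple enough to sketch in a few lines. The item labels below are shorthand for the choices on the Johnny form of the test (Figure 5.3). The function name, the labels, and the check for three distinct choices are illustrative assumptions, not part of his instrument.

```python
# Items on the "Johnny and the Fireman" form that are animals.
ANIMAL_ITEMS = {"dog", "cat", "gerbil"}

def humane_score(choices):
    """Score a response 0-3: the number of animals among the three choices."""
    if len(choices) != 3 or len(set(choices)) != 3:
        raise ValueError("a response must name three different items")
    return len(set(choices) & ANIMAL_ITEMS)

print(humane_score(["dog", "tv", "wallet"]))     # 1
print(humane_score(["dog", "cat", "gerbil"]))    # 3
print(humane_score(["tv", "coat", "cb radio"]))  # 0
```

The order of the three choices does not affect the score in this sketch. The explanations the children write for their choices are not scored here.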
Mr. Anderson tried to follow all the non-statistical guidelines listed in this chapter to make the test as reliable as possible. As he completed his task, the only guideline that caused him any real concern was the one about making the test long enough. Was a range of 0 to 3 a big enough span of scores? On the one hand, he thought it might be a good idea to increase the number of choices; but on the other hand, he felt that the larger number of choices might needlessly confuse his respondents, since many of them would be in only the third or fourth grade.
Because of his doubts about the length of the test, he decided to use statistical techniques to check its reliability. If the test was too short, he would obtain a low reliability coefficient; if he obtained high coefficients, he would know that the brevity of the test was not a serious problem. In addition, the statistical procedures would be helpful in establishing the equivalence of the two tests. He had tried to obtain equivalence by pairing items and assigning them from a larger pool, but he would feel more secure if he had statistical evidence to demonstrate that they were parallel. Finally, the statistical evidence would be helpful to Mr. Anderson when he presented his results to his colleagues at meetings. With the statistical reliability data, he would not have to persuade them of his personal capability as an item writer. He could simply show them the numbers to prove that the tests were consistent.
He found several schools in which he was allowed to field test his instrument. In some cases, he had the same students take the same form of the test with an interval of a week or two in between (test-retest reliability). In other cases, he had them take the alternate forms after an interval of only a day or so (equivalent-forms reliability). In two cases, he gave the alternate forms with two weeks between the two testing occasions (test-retest with equivalent-forms reliability). The results are summarized in Table 5.6. As Mr. Anderson looked at his results, he was quite satisfied. The reliability coefficients showed that he had devised a reasonably consistent instrument. In addition, the alternate forms of the test really did appear to be equivalent. When one of his colleagues pointed out that his correlation coefficients were not as high as the correlations of .90 often reported for good standardized tests, Mr. Anderson replied that he was not concerned about that. The standardized tests were intended for diagnosing individual abilities, and a higher degree of reliability was necessary for that purpose. All that Mr. Anderson wanted to do was examine group attitudes, and his statistical reliabilities were more than sufficient for his needs. Mr. Anderson had indeed developed a consistent test. His next problem was to demonstrate that the trait he was consistently measuring could legitimately be called "attitude toward animal life."
An even more important concern for Mr. Anderson was that his tests should be valid. He was concerned about the validity of all his measuring instruments, but in this section we'll focus exclusively on how he established the validity of his Fireman Test.
Johnny and the Fireman
Johnny is a boy about your age. One night his house catches fire. He and all the members of his family escape, but they have time to bring nothing with them. A fireman comes up to Johnny and says, "The house is going to be a total loss. Is there anything you would like us to try to get out of the house before it burns down?"
Here is a list of some of the things in the house. Choose the three things that Johnny should tell the firemen to try to save if there is time. Then explain the reasons for your choice.
Color portable TV (brand new: cost $450).
Father's wallet ($75 and credit cards).
Johnny's dog (1 year old: cost $30).
Johnny's stamp collection (worth $75).
His sister's cat (she got it free a year ago).
Dad's car keys (car is safely parked on the street).
Mother's expensive coat (worth $300).
CB radio (worth $210).
Little brother's pet gerbil.
Dad's checkbook.
What is the first thing to save?
What is the second thing to save?
What is the third thing to save?
Figure 5.3 Mr. Anderson's Humane Attitudes Test
Table 5.6
Same form administered twice (test-retest)
| Form | Grade (n) | Interval | Reliability Coefficient |
|---|---|---|---|
| Johnny | 5th (n=20) | 1 week | .63 |
| Johnny | 4th (n=24) | 1 week | .75 |
| Johnny | 6th (n=25) | 2 weeks | .70 |
| Billy | 5th (n=20) | 1 week | .69 |
| Billy | 4th (n=23) | 1 week | .70 |
Alternate forms
| Grade (n) | Interval | Reliability Coefficient |
|---|---|---|
| 4th (n=47) | 1 week | .70 |
| 4th (n=26) | 2 days | .64 |
| 3rd (n=35) | 1 day | .73 |
| 4th (n=24) | 4 days | .55 |
| 5th (n=65) | 1 day | .71 |
| 5th (n=65) | 1 day | .73 |
In determining the validity of this data collection process, Mr. Anderson followed the guidelines suggested in this chapter. First he looked at the operational definition to ascertain that it was really logically valid. This operational definition had been revised to state, "Given a hypothetical situation in which animals might undergo pain and suffering, the respondent will choose to save the animals from that pain and suffering." He talked this over with several of his colleagues, and they agreed that saving the animals was the behavior they would expect from a person with humane values.
Next, he ascertained that the children involved in the data collection process would actually be doing what the operational definition said they should be doing. At this point, he had to rule out such irrelevant tasks as reading ability and the tendency to give false but socially desirable answers. He ruled out the reading variable by consulting some reading specialists. They agreed that for most third through seventh graders, the vocabulary would not be excessively difficult. They suggested that in case of uncertainty, Mr. Anderson should simply read the test to the respondents. Next he ruled out the social-desirability factor by reasoning that all the objects in the house were socially desirable. In addition, since it would be introduced as part of a discussion of fire prevention, the test would be presented in such a way that the children would not even know that it had anything to do with attitudes toward animals. Finally, he noted that he had already established the reliability of the data collection process.
Mr. Anderson decided to use some statistical procedures to further authenticate validity. The procedures he used were a combination of criterion-related (concurrent) validity and construct validity. (It is not very important for you to distinguish precisely between the various techniques he used.) He asked himself, "If my data collection process is valid, what can I expect the results to be?" He answered this question with three predictions:
1. Scores on the Fireman Test would agree with scores on other measures of humane attitudes.
2. Students from a region known for humane attitudes would score higher than students from a region considered less humane.
3. Scores on the Fireman Test would not correlate substantially with scores on reading, math, or intelligence tests.
He set out to check each of these predictions.
Mr. Anderson found it hard to check his first prediction because he could not find any other good tests of humane attitudes. What he did, therefore, was compare the results of the Fireman Test with the results of some other measuring techniques he had derived from his own set of operational definitions. (Some of these other tests are described in chapter 6.) He found a definite pattern: those who did well on the Fireman Test also did well on the other instruments. This information seemed to verify his first prediction.
His second prediction was much easier to check. He knew from his professional reading that one specific geographic region of the country was noted for its humane attitudes. The largest humane organizations were in that part of the country, and the incidence of pet and animal abuse was very low there. He also knew of another area that was generally considered by experts to be populated by much less humane people. He arranged to have his test given at the same grade levels in comparable schools in each of these communities. The results overwhelmingly supported his prediction. The students from the part of the country where the attitudes were known to be favorable scored higher on the test than the other students. This provided very strong support for the validity of his data collection process.
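A known-groups comparison of this kind can be summarized by the difference between the two group means and a standardized effect size. The sketch below is one way to do so under stated assumptions: the scores are hypothetical (0 to 3 animals chosen), the group names are invented, and Cohen's d is a technique chosen for illustration rather than one the text says Mr. Anderson used.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Hypothetical scores for students in two comparable schools.
humane_region = [3, 3, 2, 3, 2, 3, 2, 3]
comparison_region = [1, 2, 0, 1, 2, 1, 0, 1]

mean_difference = mean(humane_region) - mean(comparison_region)
effect_size = cohens_d(humane_region, comparison_region)
print(f"Mean difference: {mean_difference:.3f}")
print(f"Cohen's d: {effect_size:.2f}")
```

A large positive difference in the predicted direction supports the validity argument. A difference near zero, or one in the wrong direction, would count against it.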
Then Mr. Anderson checked his third prediction. He correlated the test scores with scores on reading tests, math tests, and intelligence tests. The Fireman Test did not correlate substantially with any of these other scores. This is what he had hoped for. If the Fireman Test had correlated strongly with reading ability, for example, this might have indicated that the test was really a measure of reading ability rather than humane attitudes.
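The check described above can be sketched as a set of correlations between the Fireman Test and each unrelated measure, with a flag for any correlation large enough to cause concern. All data here are hypothetical, and the cutoff of .50 for a "substantial" correlation is an assumption for illustration. The text does not state a cutoff.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical Fireman Test scores (0-3) for eight students.
fireman_scores = [3, 1, 2, 0, 3, 1, 2, 0]

# Hypothetical scores for the same students on unrelated measures.
other_measures = {
    "reading": [52, 48, 55, 50, 47, 53, 45, 50],
    "math": [30, 34, 28, 31, 33, 29, 35, 28],
    "intelligence": [100, 105, 95, 110, 98, 102, 104, 94],
}

SUBSTANTIAL = 0.50  # assumed cutoff, not taken from the text

correlations = {
    name: pearson_r(fireman_scores, scores)
    for name, scores in other_measures.items()
}
# A measure is flagged when its correlation is large in either direction.
flagged = {name: abs(r) >= SUBSTANTIAL for name, r in correlations.items()}

for name, r in correlations.items():
    status = "possible contamination" if flagged[name] else "no substantial correlation"
    print(f"{name}: r = {r:.2f} ({status})")
```

With these invented scores none of the measures is flagged, which is the pattern Mr. Anderson hoped to see.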
Mr. Anderson was happy with his validity data. He had demonstrated both logically and with empirical data that his data collection process really did seem to measure attitude toward animal life. He still intended to supplement the Fireman Test with other measuring techniques (described in the next chapter), but at least he knew he was off to a good start.
Reliability refers to the degree to which measuring techniques are consistent rather than self-contradictory. Reliability is important because you will want to make decisions about your programs and students based on internally consistent and stable data rather than fleeting information that would change if you simply took the time to collect the information a second time. This chapter has discussed factors that introduce inconsistency into data collection procedures as well as strategies for controlling these factors. In addition to presenting these guidelines, this chapter has described statistical procedures that can be useful tools to help assure reliability.
Validity refers to the extent to which a data collection process really measures what it is designed to measure. The validity of a data collection process is established by demonstrating that (1) the operational definitions upon which the data collection process is based are actually logically appropriate operational definitions of the outcome variable under consideration, (2) the tasks the respondent performs during the data collection process match the task suggested by the operational definitions, and (3) the data collection process is reliable.
What Comes Next
In the next few chapters we'll discuss how to collect, report, and interpret reliable and valid data. Later we'll integrate this information into strategies for effectively conducting and interpreting research in education.
When conducting quantitative research, it is essential that your data collection processes be valid. (The issue of validity of qualitative research is discussed in chapter 9.) The following guidelines emerge from the principles discussed in this chapter:
- Develop good operational definitions of your outcome variables. Use the guidelines discussed in chapter 4.
- Keep your operational definitions in mind when developing or selecting data collection processes. Be aware of sources of invalidity, and design or select only data collection processes that are directly related to your operational definitions.
- Develop or select reliable data collection processes, but remember that reliability is only a tool for establishing validity - it is not an end in itself.
- Increase validity by triangulating - that is, by using more than one operational definition and more than one data collection process for each outcome variable.
- Check the reliability and validity of your data collection processes during the early stages of your research.
- Collect information about reliability and validity from published sources like those described in chapter 6 and from information in the methods section of published articles.
In addition, even if your research plan emphasizes quantitative methods, consider enhancing validity by supplementing quantitative methods with the qualitative strategies described in chapter 9.
The following sources provide more detailed information on the general topics of reliability and validity:
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association. This booklet includes the guidelines for reliability and validity recommended by the three corporate authors. A familiarity with these guidelines will help you make better use of published information regarding data collection processes.
Ebel, R. L., & Frisbie, D. A. (1979). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall. Chapter 5, "The Reliability of Test Scores," presents a clear statement of the theoretical rationale behind the traditional methods of assessing reliability, with a special emphasis on how to apply these methods to educational practice. Chapter 6, "Validity: Interpretation and Use," offers guidelines to help teachers enhance the validity of their tests by making sure they are appropriate for the purposes for which they are intended. Chapter 13, "Evaluating Test and Item Characteristics," describes important techniques for promoting internal consistency and generally revising early versions of data collection procedures.
Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan. Chapter 3, "Validity," provides some useful guidelines for teachers to increase the validity of their classroom tests. Part of Chapter 11, "Appraising Classroom Tests," describes item analysis, an important technique for promoting internal consistency. Chapter 4, "Reliability and Other Desired Characteristics," discusses the traditional methods of establishing reliability and gives concrete advice on how to apply these to improving classroom tests.
Kramer, J. J., & Conoley, J. C. (Eds.). (2002). The Fifteenth Mental Measurements Yearbook. Lincoln, NE: Buros Institute of Mental Measurements. This book and earlier volumes in the series provide critical, scholarly information about the reliability, validity, and other characteristics of published, standardized data collection materials. This resource is also available via the Internet as an online database.
Mager, R. (1984). Measuring instructional results (2nd ed.). Belmont, CA: David S. Lake. This book addresses the very important problem of matching test items to behavioral objectives (operational definitions of outcome variables). Although Mager does not use the specific term validity, this little programmed text offers an excellent guide to one of the most important validity-related problems the classroom teacher faces.
Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman. Chapters 6 and 7 offer practical and useful answers to the questions "Why worry about reliability?" and "Why worry about validity?" Chapter 8 focuses on "Cutting Down Test Score Pollution" by discussing ways to increase the reliability and validity of data collection processes.
The following source is useful for readers who are interested in more theoretical information on reliability:
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is a brief but comprehensive treatment of the major issues relating to reliability. It's heavy on statistical formulas.
The following sources are useful for readers who are interested in more detailed information on validity:
Cole, N. S. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is a detailed treatment of one of the major sources of invalidity in the interpretation of data collection processes.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is probably the most authoritative discussion available regarding the status of current thought on validity. Anyone doing serious work on validity of data collection processes should consult this chapter.
The following source provides more detailed information on the specific topic of reliability of criterion-referenced tests:
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall. Chapter 2, "Traditional Measurement Practices," discusses traditional approaches to reliability and points out some of the problems that are likely to occur when we try to apply these same approaches to the kinds of tests that teachers should be using to evaluate student performance. Chapter 7, "Reliability, Validity, and Performance Standards," is probably the most comprehensive treatment available on the reliability and validity of criterion-referenced tests.
Review Quiz 5.1
REVIEW QUIZ 5.2
REVIEW QUIZ 5.3
1. interobserver agreement
2. test-retest reliability
3. interscorer reliability
4. internal consistency reliability
5. equivalent-forms reliability
REVIEW QUIZ 5.4
Part 1
Set 1: b
Set 2: b
In both cases, the second item requires a greater inferential leap to conclude that it is evidence for the occurrence of the outcome variable. The first item in each pair offers more direct evidence.
Part 2
REVIEW QUIZ 5.5
The research report in Appendix C uses two strategies to measure the ability of students to solve problems. Evaluate the reliability of each strategy separately.
- How reliable were the unit pretests as a measure of problem-solving ability?
- How valid were the unit pretests as a measure of problem-solving ability?
- How reliable was the Watson-Glaser test as a measure of problem-solving ability?
- How valid was the Watson-Glaser test as a measure of problem-solving ability?
- How satisfactory was the overall data collection process with regard to reliability and validity?
ANSWERS: