Chapter 5

RELIABILITY AND VALIDITY OF DATA COLLECTION PROCESSES

 

WHERE WE'VE BEEN

We've described how to identify research variables, use reference sources to obtain information about these variables, and devise operational definitions of them.

 

WHERE WE'RE GOING NOW

We're going to discuss how to make data collection processes reliable and valid - that is, how to make sure that our measurement processes do not generate evidence that is self-contradictory because of internal inconsistency or instability and make it more likely that they actually zero in on the outcome we want them to measure rather than on extraneous outcomes or no outcome at all.

 

CHAPTER PREVIEW

 

Once you have decided on the operational definitions of an outcome variable, you can collect data regarding the occurrence of that outcome. This chapter describes reliability and validity - two essential characteristics of all good data collection techniques.

The present chapter defines reliability in terms of how consistently a data collection process measures whatever it measures. This consistency concerns the level of agreement among independent tests, testing occasions, observers, or items that purport to be measuring the same outcome. The confidence we can place in judgments based on data collection processes will be greater to the extent that they are reliable. This chapter discusses ways to increase the prospect that you can make consistent decisions based on your tests and observations. We also introduce here the concept of validity of data collection processes - the extent to which a data collection process really measures what it is designed to measure. We will discuss the factors that influence validity of data collection processes and methods of establishing validity.

After reading this chapter, you should be able to

  1. Define reliability and validity.

  2. Identify examples of data collection processes with strong reliability and validity and examples with weaker reliability and validity.

  3. Identify factors that contribute to the unreliability and invalidity of a data collection process.

  4. Identify effective ways to increase the reliability and validity of a data collection process.

  5. Identify appropriate statistical procedures for estimating the reliability of a data collection process and identify proper situations in which each of these procedures would be appropriately employed.

  6. Identify the weaknesses and limitations for each of these statistical procedures for estimating the reliability of data collection processes.

  7. Describe how to use the concept of reliability in selecting and improving techniques for measuring outcome variables.

  8. Describe the use of the standard error of measurement in interpreting test scores.

  9. Describe the process for establishing the validity of data collection processes.

  10. Describe the role of evidence from content validity, criterion-related validity, and construct validity in establishing the validity of data collection processes.

 

RELIABILITY OF DATA COLLECTION PROCESSES

Reliability addresses the question of whether the results of measuring processes are consistent on occasions when they should be consistent. Consistent means what the dictionary says it means: "not self-contradictory." If a person possesses a certain degree of knowledge about a topic, for example, the estimate of knowledge that appears as a test score should not be contradicted by other administrations of the same or similar tests. The estimate should be approximately the same whether the test is taken today or tomorrow. If different tests are given to students in the first and third period, we should be able to assume that our judgments about the person's knowledge would have been about the same on either test; and we should be free from the impression that the score would have been substantially different if someone else had graded the test. A data collection process is less reliable if the results are influenced by irrelevant factors that cause our judgments to fluctuate when they should not fluctuate. Measurement is reliable to the extent that the results are similar every time they should be similar.

If a mother wants to measure the body temperature of a sick child, she will expect her assessment to be reliable. If she measures his temperature once and the temperature is 102.4, then tries again two minutes later and gets a reading of 99.9, she has an unreliable thermometer. (Of course, if she gives medication and then takes his temperature two hours later and discovers a large drop, this would have nothing to do with unreliability. The temperature would not be expected to be similar on the second occasion, since the medicine is likely to have had an effect.)

Reliability can be applied in the same way to instructional or research situations. If you ask a child a question and conclude on the basis of her response that she has achieved an educational outcome, you would hope that if you questioned her a few minutes later you would still come to the same conclusion. To the extent that the result is similar on repeated occasions, you are dealing with a reliable method of data collection. However, if you concluded the second time that she had not achieved the outcome, then you would be dealing with an unreliable data collection process. (If a week goes by and it is plausible that the child might have forgotten something during the intervening week, then a different result on the second occasion would have nothing to do with unreliability, any more than a reduction in temperature after a child has been cured would indicate unreliability of the thermometer.)

An important point to keep in mind is that it is the reliability of the data collection process - not of the data collection instrument - that must be demonstrated. What we are really looking for is consistency in the decisions we make based on the data collection process; we don't want to draw conclusions that would be likely to change if we took another estimate of the outcome variable. It is technically incorrect to refer to the reliability of a test. A test, a checklist, an interview schedule, or any other measurement device that is reliable in one setting or for one purpose may be unreliable in another setting or for a different purpose.

Therefore, this chapter always refers to the reliability of data collection processes. It is important to remember this distinction.

 

REVIEW QUIZ 5.1

Examine each of the following descriptions and indicate whether it is reliable, is unreliable, or provides no information about reliability.

  1. Ralph got a B when Mrs. Washington scored his essay test. However, he got a D when Mr. Lincoln scored the same test.

     

  2. Both witnesses independently told the police that the suspect had been carrying a violin case when he got on the train.

     

  3. Thelma got a C in third-grade spelling, but she got a B in fourth-grade spelling.

     

  4. The students filled out a rating sheet on Mr. Rivers on Monday, and they rated him as an overall good instructor. On Wednesday, they filled out the same rating sheet and rated him a mediocre instructor.

     

  5. On Tuesday, the counselor concluded that the client had a severe neurosis. On Friday, she concluded that he was probably as well adjusted as anyone else. On the next Tuesday, she concluded again that he had a severe neurosis.

     

  6. Steve did not know much about reliability, and so when he took this review quiz he guessed wildly. The first time he got six right. He tried again, without any further studying, and this time he got three right.

     

  7. Mr. Monroe's class had to study a list of 1,000 spelling words. The exam consisted of two 50-word subtests. The average score on both tests was about 85% correct.

     

  8. Mr. Roth's new novel was rated the number one best seller in the Chicago Tribune poll, but it did not even make the list in the Time magazine best-seller poll.

     

  9. Miss Mears wanted to rate the degree to which students in her classroom accepted cultural attitudes of students from other cultures. She collected data that indicated that they were more accepting than most classes. She asked the principal to make a similar rating of her classroom, and he came to the same conclusion.

 

If you got most of the questions in Review Quiz 5.1 correct, or if you easily saw the logic of the explanations, then you probably have a good basic grasp of the concept of reliability. If not, reread the chapter to this point, check the chapter in the workbook, refer to the recommended readings, or ask your instructor or a peer for help. Be sure that you understand the summary in the following paragraph so that you will profit from the rest of this chapter.

In summary, reliability refers to whether a data collection process is consistent. Unreliability occurs when the data collection process contradicts itself: when observations, observers, items, or alternate forms of the same test give contradictory evidence.

Reliability is not an all-or-nothing characteristic; data collection processes range from strong reliability to weak reliability. Because of the highly internalized nature of educational outcomes, measurement processes in education can never be perfectly reliable. If the scores on a data collection process vary when they should not, then the process is less reliable. If a data collection process produces consistent, non-contradictory results over a span of time or in varying settings, then it is said to be reliable.

As you read the rest of this chapter, remember that reliability is not synonymous with reliability coefficients. The technical reliability coefficients are often irrelevant, unnecessary, or at least more trouble than they are worth in practical situations.

However, a major goal of professionals in education is to achieve respect and effectiveness by engaging in scholarly activity that is authentic, public, and replicable. Decisions or statements at any level are more worthy of respect and more likely to be fruitful if they are based on sound reasoning rooted in the scientific method. A major part of the scientific method is to be public in one's methods and results and to show that they are replicable - that others can independently arrive at much the same conclusion. Establishing that our data collection processes are reliable is an important step in this process of public, scientific thinking.

 

SOURCES OF UNRELIABILITY

The best way to increase the reliability of our measuring instruments is to determine what causes unreliability and then to make sure that these causes of inconsistency are not present in the data collection strategies we employ. The following paragraphs summarize the major sources of unreliability.

  1. Faulty items and observations. Questions on a test, items on a checklist, or statements on a questionnaire or interview schedule can be ambiguous, tricky, or presented in a confusing format. When people are presented with such faulty items, it is hard for them to respond consistently. If the respondents do not know what they are expected to do, it will be hard for them to respond reliably.

     

  2. Excessively difficult elements of the data collection process. This factor is a problem primarily with tests designed to measure students' knowledge or learning. If a test is too difficult, the test taker is likely to guess at the answers, and the result will be problems similar to those described in the previous paragraph. Some types of test items promote guessing on difficult items more readily than other types. For example, if a true-false item is too difficult, the student still has a 50-50 chance of getting it right; whereas on a short-answer test the probability of guessing correctly is considerably smaller.

     

  3. Excessively easy elements of the data collection process. This is another factor that presents problems on tests designed to measure students' knowledge or learning. If we ask you a question that is extraordinarily simple for you to answer, we may learn nothing about what you really know. This is especially true if the correct answer is obtained from extraneous clues that are unrelated to the learning task. Asking an excessively easy question is like asking no question at all. This becomes a problem of reliability if we are asking a student several questions on a test and plan to combine the responses to get a total score. If we ask you 10 questions, and 9 of them are absurdly easy, then we are really basing our decision about you on the single good question that was not excessively easy. We might think we have a 10-item test, but it is really a 1-item test camouflaged by 9 non-items. The problem of length is discussed in the next paragraph. The point here is that excessively easy items contribute nothing to increasing the sample of consistent items included on our instrument. The same effect occurs with attitude questionnaires. If you ask everyone in your class to fill out a 10-item agree-disagree questionnaire, and if 9 of them are written in such a way that practically everyone is guaranteed to answer "strongly agree," then you really have a 1-item questionnaire.

     

  4. Inadequate number of observations or items. A general rule about reliability is that the shorter the measuring instrument or the smaller the number of observations, the greater the opportunity for chance factors to operate, and the more likely that unreliability will be present.

     

  5. Accidentally focusing on multiple outcomes. If all the items on a test or all the aspects of a data collection process focus on pretty much the same characteristic, then the reliability of the data collection process will likely be high; whereas if the items or observations focus on several different characteristics, the reliability will be lower. In fact, when the data collection process focuses on several different characteristics, you really have a large number of very short data collection processes (each of which is therefore likely to be inconsistent) rather than a single longer data collection process. The point is this: if you are going to combine scores or observations to obtain a measure of a single characteristic (such as English-speaking ability, test anxiety, or attitude toward cooperative learning), then you should be certain that all the items are actually measuring much the same thing.

     

  6. Characteristics of the respondents. Reliability is reduced by any temporary characteristic of the respondents that causes them to respond or act differently than they would have responded under normal conditions. Such characteristics include inability to concentrate at a given time because of surrounding conditions, fluctuations in mood, and inconsistent recall of information.

     

  7. Faulty administration of the data collection process. The way a data collection process is administered can render the results inconsistent. If a test is given in a room that is extremely hot or full of distractions, the results may be affected. If the teacher gives one set of instructions to one class and a different set to another class, performance may be inconsistent and class comparisons will be based on less reliable test scores. In addition, the mannerisms, idiosyncrasies, and other characteristics of the person administering the instrument or conducting observations can influence reliability.

     

  8. Faulty scoring procedures. After the students or respondents have done their share of responding, inconsistency can still creep in when the scorer tries to assign values to the respondents' performance. The scorer could simply record information inaccurately or count the number of right answers incorrectly. If the answer sheet was at all ambiguous, it might be hard to determine what the respondent actually meant. Extended-answer essay tests are particularly notorious for the inconsistency with which they are graded. Research has shown that it is possible for one grader to give an essay an A, while another might give the same paper an F. Furthermore, even after the instrument has been accurately scored, it is still possible to introduce inconsistency through faulty record keeping.
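The general rule in item 4 above (the shorter the instrument, the greater the room for chance) is often quantified with the Spearman-Brown formula, which this chapter does not present. The following is only a minimal sketch: the function name and the starting reliability of .50 are hypothetical, and the formula assumes that any added items are comparable in quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after changing test length by length_factor.

    length_factor = 2 means twice as many comparable items;
    length_factor = 0.5 means half as many.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting point: a test whose reliability is .50.
doubled = spearman_brown(0.50, 2)    # test made twice as long
halved = spearman_brown(0.50, 0.5)   # test cut in half

print(round(doubled, 2), round(halved, 2))
```

Under these assumptions, lengthening the test raises the predicted reliability and shortening it lowers the prediction, which is the pattern the rule describes.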

 

HOW TO INCREASE RELIABILITY

The reliability of educational measurement can never be perfect. However, it can be improved by designing and administering data collection processes carefully. The way to increase reliability is to minimize the sources of unreliability cited in the previous section. There are statistical procedures for determining coefficients of reliability, and one of the ambitions of professional test constructors is to get this coefficient to be high. The use of these coefficients will be discussed in the next section of this chapter. At this point, let us say that it is possible (and important) to take steps to improve reliability (and to verify that others have done so) even if you never intend to compute a reliability coefficient outside an assignment for a college course. The following are specific guidelines for improving the reliability of measuring instruments:

  1. Use technically correct, unambiguous items. Make sure that the respondents are able to give the answers they really want to give. There are excellent textbooks available on educational and psychological measurement that offer specific guidelines on how to write technically correct items in various formats and for various content areas and how to develop effective observational systems. Teachers frequently collect data with instruments they have not even proofread properly, and such laxity is likely to lead to unreliability. A simple procedure for improving the technical quality of a data collection process is to have someone else look it over or take the test before the target audience actually sees it.

 

  2. Standardize the administration procedures. Collect the data in such a way as to promote consistency. Eliminate distractions. Don't make your personality a part of the data collection process. If more than one person will collect data, make sure they are using precisely the same set of instructions. The key point is to make it as likely as possible that each person administering the data collection procedure will make the same decisions when these decisions can influence the way an outcome will be recorded. This standardization is accomplished by writing the interview schedule or observation checklist as clearly as possible (covering as many as possible of the responses and behaviors that are likely to arise) and training the interviewers or observers carefully before they go into the field. (Strategies for standardizing interviews and observations are discussed in chapter 6.)

     

  3. Standardize the scoring procedures. Develop systematic strategies for consistency during the scoring process. This is easy with objectively scored instruments like true-false and multiple-choice tests, simple checklists, and Likert questionnaires. It becomes more difficult with extended-length essays, open-ended interviews, and unstructured observations. The idea here is that you want to allow the respondent to make as much of the decision as possible regarding what response will be recorded. Otherwise, you will have two sources of inconsistency - yours and the respondent's. When you do have to make decisions about how to record a response, make your decisions according to as structured a format as possible. An excellent way to get evidence that your scoring format is sufficiently structured and reliable is to let someone else indepen-

 

{Page 93 is missing.}

It is important to understand this information before proceeding. If you are a true skeptic, you might by now realize that this quiz may itself be unreliable. If that worries you, try the appropriate exercises in the workbook. A longer test will enable you to make a more reliable (consistent) judgment regarding your knowledge of this material.

 

STATISTICAL PROCEDURES FOR ESTIMATING RELIABILITY

Reliability coefficients are statistical procedures for estimating how consistent a data collection process is. These are important tools. Even if you do not feel a particular urge to compute these statistics, you should still be concerned about the reliability of your data collection techniques. These procedures are described here because they are relatively easy to understand and can be helpful to you. In addition, you will often want to administer professionally prepared data collection procedures, interpret the results of such procedures, or read about them in the published literature. Understanding the meaning of these statistical procedures can be extremely helpful for these purposes.

The following are the basic types of statistical reliability coefficients:

  1. Test-retest reliability (stability). The purpose of test-retest reliability is to estimate the likelihood that the results of the data collection process would have been the same if it had been administered on a different occasion. In other words, it helps us determine whether the measurement of the characteristic is likely to be stable. To compute this reliability coefficient, you would administer your data collection procedure, let some time pass, and then administer the same procedure a second time to the same people. Then you would compute a correlation coefficient between the two sets of scores. A high correlation coefficient (near 1.00) indicates that respondents performed comparably on both tests, whereas a low coefficient (near .00) indicates that their performance was inconsistent.

    A frequent misapplication of this concept of reliability is to give a pretest, then offer instruction to the students, then give them a posttest after the instruction, and finally compute a correlation coefficient. Actually, this coefficient has little to do with reliability - the two sets of scores would be subject to change because of the intervening instruction. If instruction has been successful, there is no reason why a person's score on the posttest should be at the same level as the pretest score.

     

  2. Equivalent-forms reliability (consistency among data collection procedures). The purpose of equivalent-forms reliability is to provide evidence that the results of a data collection process would have been similar if the results were obtained with a variant form of the data collection procedure. This form of reliability is useful when it is necessary to make comparisons or common judgments about people even though they could not be measured by exactly the same procedure. For example, if there are six forms of the SAT, it is important to know that a score of 1100 by a student taking form A has the same meaning as a score of 1100 by a student taking form B. To compute this form of statistical reliability, you would administer one form of the test to a group and then administer a different form of the same test to the same group. A high correlation between the two sets of scores would indicate that the respondents performed comparably on both tests; and this would mean that the two forms are essentially equivalent - that they consistently measure the same outcome.

    This form of reliability is especially useful when you need to determine the effectiveness of instruction by using one form of a test as a pretest and the other as a posttest. If you have a reliable test (that is, if the pretest and posttest data collection strategies are essentially equivalent), you can more logically attribute any improvements in performance to the intervening situation; whereas if the test is unreliable, then the improvements (or absence of improvements) could be the result of chance fluctuations resulting from the inconsistency of the test.

     

  3. Test-retest with equivalent-forms reliability. The purpose of this form of reliability is to provide evidence that the results of a data collection process would have been similar if they were obtained both on a different occasion and with a variant form of the data collection procedure. To compute this coefficient, you would administer one form of the test, let some time pass, and then administer the other form of the test to the same group of people. A resulting high correlation coefficient would indicate that there is a stable characteristic of some sort that both forms of the test are measuring. (This coefficient is a combination of the first two types.)

     

  4. Internal consistency reliability. The purpose of internal consistency reliability is to provide an estimate of the degree to which the items or elements that constitute a data collection process measure a single outcome rather than several diverse outcomes. The term internal consistency refers to the degree that all the elements or aspects composing the data collection process appear to be measuring the same thing. Internal consistency is expressed by coefficients arising from mathematical formulas that correlate scores on different items or separate parts of a data collection procedure with other items or parts of the same procedure. Unlike the other types of statistical reliability, internal consistency can be calculated from the administration of a single data collection process to a single group of people. The following are three common statistical estimates of internal consistency:
    1. Coefficient alpha is the internal consistency reliability coefficient that can be used with the widest variety of data collection procedures.

    2. The Kuder-Richardson reliability coefficient is used with measurement procedures (such as test items) that can be scored on a right-wrong or yes-no basis. (It is a special case of coefficient alpha.)

    3. The split-half reliability coefficient can be computed for tests by splitting the test in half and comparing the students' performance on each half of the test. (It is now considered to be obsolete, having been superseded by coefficient alpha.)


    The main value of the split-half procedure is that it can easily be computed by hand; but it is not as precise as the others, and computers have rendered it obsolete. The Kuder-Richardson coefficient is a better estimate of internal consistency than the split-half procedure, and it is frequently reported with computerized scoring packages for objectively scored tests. Since it is applicable to every situation in which the other two can be used and to other situations as well, coefficient alpha is clearly the most important indicator of internal consistency.

    Internal consistency reliability sets the upper limit for the other statistics that measure relationships among variables (including the other reliability coefficients). The statistical logic to support this statement will not be presented in this book. In practical terms, this means that it is a good strategy to use internal consistency as a starting point for developing reliable data collection procedures. If you develop solid, internally consistent procedures and then do other things right, you will be able to conduct reliable measurements of outcome variables. If you fail to develop internally consistent procedures, then it is unlikely that your attempts to measure outcome variables will be reliable.

     

  5. Interscorer reliability. The purpose of this procedure is to rule out the possibility that unreliability has been introduced by the person recording the results of a data collection process. In other words, it provides evidence that the scores would have been similar, regardless of who calculated the results of the data collection process. In using this procedure, you would have two different persons score the same set of tests (or make the same set of observations or conduct the same set of interviews), and then you would compare the two sets of results. A high correlation coefficient between the two sets of scores would indicate that both persons were interpreting the data collection process similarly. A low coefficient would indicate that differences among the scores of the persons being measured might be the result of the way the data collection was scored rather than the result of real differences among the respondents. (Some Olympics events are scored by ratings of observing judges. When spectators and critics charge that these events are inconsistently judged, this is actually a statement about poor interscorer reliability among the observers.)

    With many educational tests, interscorer reliability is irrelevant. This is particularly true of "objective tests," which would be described more specifically as "objectively scored tests." There is little chance that two scorings by a machine will differ significantly in giving the results of a multiple-choice test. With more subjective data collection processes, such as essay tests and ratings of personality characteristics or classroom social climate, an evaluation of the consistency of the scoring process is much more important.

    Like internal consistency, interscorer reliability sets an upper limit on the other types of reliability and on correlations with other variables. That is, if there is unreliability in the scoring process, these other types of reliability will be lower than if the scoring process were perfectly reliable. This is because the scoring process provides chances for error and disagreement in addition to whatever inconsistencies are inherent in the respondent's actual performance during the data collection process. This means, for example, that if the test-retest reliability of a data collection procedure is low and if its interscorer reliability is also low, the test-retest reliability can be increased by improving the interscorer reliability. This would occur because a major source of error would be removed on both testing occasions.

     

  6. Interobserver agreement. The purpose of this procedure is to verify that different observers can agree that an event has or has not occurred. This estimate of reliability differs from the others in that it is stated as a percentage rather than as a correlation coefficient. It is used when a rater is trying to observe a person or a group; it is also used to determine whether a behavior or a set of social conditions is occurring. The interobserver reliability is determined by having a second person simultaneously make the same set of observations. After this has been done on a certain number of occasions, a percentage is calculated to determine how often the two raters agreed. For example, a teacher might be concerned about the disruptive behavior of a kindergarten child. Disruptive behavior might be operationally defined as "being out of one's seat when children are supposed to be in their seats." It should be relatively easy to ascertain whether a child is or is not seated; but in actual practice it may be difficult to discern when children "should" be in their seats or when a child has actually left his seat at the wrong time. To establish reliability, the teacher could have two observers independently but simultaneously observe the child and record how often he is out of his seat. They might each watch the child for 10-second intervals and mark him as being in-seat or out-of-seat during each interval. Afterwards, they would compute their percentage of agreement. If they watched the child for 50 intervals and agreed on 40 of these, then their interobserver reliability would be 80%. Upon examining their data more closely, the raters might discover that 8 out of their 10 disagreements occurred when the student was out of his seat but still in the vicinity of his desk, as when reaching down below for something. By agreeing on whether this should be classified as in-seat or out-of-seat behavior and writing this decision into the guidelines for administering the observation instrument, the interobserver agreement could become much higher on subsequent administrations.
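The percentage-of-agreement calculation in the seat-behavior example can be sketched as follows. This is only an illustration: the function name and the two interval records are hypothetical, constructed so that the observers agree on 40 of 50 intervals as in the example.

```python
def percent_agreement(observer_a, observer_b):
    """Percentage of intervals on which two observers recorded the same thing."""
    if len(observer_a) != len(observer_b) or not observer_a:
        raise ValueError("both observers must record the same, nonzero number of intervals")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100.0 * agreements / len(observer_a)


# Hypothetical records for 50 ten-second intervals (True = out of seat).
observer_a = [True] * 15 + [False] * 35
observer_b = [True] * 10 + [False] * 5 + [True] * 5 + [False] * 30

print(percent_agreement(observer_a, observer_b))  # 40 of 50 intervals match: 80.0
```

Listing the intervals on which the two records differ would be the natural next step for finding patterns in the disagreements, such as the near-the-desk cases mentioned in the example.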

 

 

Interobserver agreement is used only when a yes/no decision is made regarding the occurrence/nonoccurrence of an event. When the observer makes a rating, interscorer reliability (which we discussed earlier in this section) is the appropriate estimate of reliability. Interobserver agreement is important in situations (such as behavior modification programs) where the data collection consists of observing a child to determine whether he is performing a predefined behavior.

 

 

In a very real sense, you may not need any reliability coefficient. What you need is the concept of reliability, because you want your measurements, observations, and interviews to be consistent. A coefficient is merely a tool to help estimate consistency. The question, therefore, is what kind of statistical reliability is going to be helpful to you in determining whether your measuring instruments are consistent. The preceding descriptions (summarized in Table 5.1) should help you make decisions regarding whether a statistical procedure may be helpful to you and to interpret these statistics when other researchers report them.

 

 

Table 5.1 Statistical Methods of Estimating Reliability

Type of Reliability | Purpose | Procedure | Statistic Employed
Test-retest reliability | To ensure stability; to rule out the likelihood that results will fluctuate widely on different administrations of same instrument to same people | Administer the same test twice to the same group with a time interval in between; then compute the correlation coefficient | Correlation coefficient
Equivalent-forms reliability | To ensure that two forms of a test are actually equivalent | Administer two forms of the same test to the same group in close succession; compute correlation coefficient | Correlation coefficient
Test-retest with equivalent forms reliability | To ensure both stability and equivalence (combines first two methods) | Administer one form; let time pass; administer second form; compute correlation coefficient | Correlation coefficient
Internal consistency reliability | To determine the extent to which the items on a test are measuring a common characteristic (to ensure internal consistency) | Administer test only once; apply formula to compute coefficient alpha | Coefficient alpha
Interscorer reliability | To determine the extent to which the results are objective; i.e., will be the same no matter who scores the test | Administer the test once; have two different persons score the test; compute correlation coefficient | Correlation coefficient
Interobserver agreement | To determine the extent to which different observers can agree whether an outcome is occurring | Have two observers watch for the occurrence of an event during a designated number of intervals; compute the percentage of intervals during which they agree | Percentage of agreement

 

REVIEW QUIZ 5.3

Identify the type of statistical reliability that would be helpful in determining whether the stated data collection technique is consistent.

  1. Mr. Perkins has decided to help Jamahl control his aggressive behavior. He has defined aggressive behavior as any attempt to inflict physical harm on another person. He plans to count how often such attempts occur during an hour-long period each day for two weeks.

  2. Ms. Wilkes is going to give her music students a test of tonal discrimination. She doesn't want to waste her time with a test that will give one result today and a different result next week.

  3. Mrs. Johns is a vocational education supervisor. She has developed a rating scale to determine how ready each student is to take a full-time job in an out-of-school situation. She plans to have each of the teachers use this instrument to rate their students, and she expects that the scores will reflect the students' capabilities, not the eccentricities of the teachers.

  4. Mr. Byrd teaches freshman composition. He has developed an end-of-the-year test that he claims gives a good indication of an overall skill he labels "proficiency in the basics." Students are required to get a score of at least 80 on this test before they can take more advanced courses.

  5. Miss Gordon wants to find out whether her new method of teaching speed-reading works. She wants to give one test of speed and comprehension at the beginning of her course and another at the end. She hopes to be able to determine that speed will increase while comprehension stays about the same.

 

STANDARD ERROR OF MEASUREMENT

While correlation coefficients give good estimates of the reliability of data collection processes, they are not directly useful for communicating information about the degree to which a specific score is likely to be accurate. The standard error of measurement is a statistic that is based on reliability coefficients and gives information about the relative accuracy of individual scores. The standard error of measurement indicates the range within which the "true" score of the individual is likely to fall - taking into consideration the unreliability of the test. For example, if a student received a score of 85 on a test with a standard error of measurement of 4.0, then her true score would probably range somewhere between 81.0 and 89.0. If the standard error of the test were 7.0, then this student's true score would probably lie in the range of 78.0 to 92.0. (The word probably in the previous two sentences means that the statistical formula gives about a 68% probability that the true score falls in the designated range.) Since the standard error of measurement is based on the reliability of the data collection process, higher test reliability leads to a smaller standard error of measurement - that is, to a narrower range of scores within which the true score would be likely to fall.

The standard error of measurement has considerable practical importance. Within the context of the preceding paragraph, it is reasonable to think of the standard error of measurement as an estimate of the "likely error" of a data collection process. For example, if a person scores 115 one year on an IQ test that has a standard error of measurement of 5 and then scores 112 on a parallel form of the test the next year, we would assume that this probably represents a normal fluctuation of scores rather than an actual deterioration in performance. The standard error of measurement is closely related to the concept of standard deviations (discussed in chapter 7) and to the concept of confidence intervals (discussed in chapter 8). A major advantage of standardized tests (discussed in chapter 6) is that their test manuals almost always include information on the standard error of measurement.
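
The score ranges discussed above can be reproduced with a short Python sketch. Two points are assumptions rather than part of this chapter: the formula SEM = SD x sqrt(1 - reliability) is the usual classical test theory formula, which the chapter does not state, and the standard deviation of 15 is chosen only for illustration. The function names are hypothetical, and the final check is a rough rule of thumb rather than a formal statistical test.

```python
import math


def standard_error_of_measurement(standard_deviation, reliability):
    """Classical formula (an assumption here): SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return standard_deviation * math.sqrt(1.0 - reliability)


def score_band(observed_score, sem):
    """Range of one SEM on either side of the observed score (about 68%)."""
    return observed_score - sem, observed_score + sem


def within_likely_error(score_a, score_b, sem):
    """Rough check: do two scores differ by no more than one SEM?"""
    return abs(score_a - score_b) <= sem


# The chapter's example: a score of 85 with SEMs of 4.0 and 7.0.
print(score_band(85, 4.0))  # (81.0, 89.0)
print(score_band(85, 7.0))  # (78.0, 92.0)

# The IQ example: 115 one year and 112 the next, with an SEM of 5.
print(within_likely_error(115, 112, 5))  # True

# Higher reliability gives a smaller SEM (illustrative SD of 15).
for reliability in (0.75, 0.90, 0.96):
    sem = standard_error_of_measurement(15, reliability)
    print(f"reliability {reliability:.2f} -> SEM {sem:.2f}")
```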

 

HOW RELIABLE DOES A DATA COLLECTION PROCESS HAVE TO BE?

It is an axiom that no data collection process in education can ever be perfectly reliable. Whether you use statistical procedures or not, it is obvious that some data collection processes are more reliable than others. The reliability of almost any given data collection process could be improved if you worked a little harder or added more items or observations. How reliable is reliable enough? The answer is that the necessary degree of reliability depends on what you plan to do with the results of your data collection.

If you are giving a weekly arithmetic test, and you happen to make an inaccurate decision based on it, this is probably not a serious problem. If you give a child credit for mastering a topic and you discover a day later that she has not mastered it after all, then you can simply change your decision and offer her some additional instruction. Although you would not want to make frivolous decisions even in such cases, it is obvious that you could settle for a less reliable instrument than you would require if you were deciding whether that same student should embark upon a college preparatory curriculum in mathematics. Therefore, the first answer to your question is that the data collection process needs to be more reliable to the extent that the decisions based on it are likely to be permanent or irreversible.

A second, closely related factor is whether the results of the data collection process will be the only source of information in making a decision or whether they will be supplemented by other sources of data. In chapter 4 we recommended multiple operational definitions of outcome variables and multiple methods to measure these outcomes. To the extent that a data collection process is effectively supplemented by other sources of information, lower reliability is tolerable. The inconsistencies and imprecision in one set of data will be counterbalanced by information from other sources.

The point is this: The more confidence you want to be able to place in the score an individual attains, the greater the reliability you should require from your instrument.

The situation is somewhat different when you are examining group accomplishments rather than diagnosing the performance of an individual. The factors that lead to unreliability (inconsistency) on a data collection process are often essentially random, and they tend to average out over the long run. In other words, if one student improves his score by guessing accurately on a test, it is probable that someone else's score has been hurt to a similar degree by poor guessing on the same test. Therefore, a chance factor like guessing is likely to contribute less to inconsistency when group evaluations rather than individual evaluations are being considered. For this reason, substantially lower reliabilities are acceptable for comparing group scores than for comparing individual scores. In addition, note that when statistical comparisons are made among groups (see chapter 14), the statistical estimates of reliability will be accounted for in the computation of the statistical comparison.
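
The way random errors average out in a group can be illustrated with a small simulation. Everything in this sketch is assumed for the sake of illustration: the class of 200 students, the range of "true" scores, and the error standard deviation of 5 points. In real data collection the true scores are never observed directly; they are available here only because the data are simulated.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical class of 200 students, each with a "true" score.
true_scores = [rng.uniform(60, 90) for _ in range(200)]

# Each observed score adds purely random error (assumed SD of 5 points).
observed_scores = [t + rng.gauss(0, 5) for t in true_scores]

# How far off is a typical individual score?
individual_errors = [abs(o - t) for o, t in zip(observed_scores, true_scores)]
typical_individual_error = statistics.mean(individual_errors)

# How far off is the group average?
group_mean_error = abs(
    statistics.mean(observed_scores) - statistics.mean(true_scores)
)

print(f"Typical error in one student's score: {typical_individual_error:.2f}")
print(f"Error in the group mean:              {group_mean_error:.2f}")
```

Under these assumptions the error in the group mean comes out much smaller than the typical error in any one student's score, which is the sense in which lower reliability is tolerable for group comparisons. The sketch covers only random error; a systematic bias shared by all students would not average out in this way.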

Although it would often be absurd to evaluate an individual's performance in a history course based on her answer to a small set of questions, it would nevertheless make sense to evaluate the performance of a group based on the group's answer to that same set of questions. (Of course, it is still relevant to ascertain that the questions properly sample the topics covered in the history course; see the discussion of content validity later in this chapter.) In fact, this is exactly what the highly reputable National Assessment of Educational Progress (NAEP) is attempting. The NAEP is asking several questions to carefully selected groups of students in schools throughout the United States. On the basis of NAEP results, it would be possible to conclude something like, "In 1980, only 70% of fifth graders knew who Christopher Columbus was, whereas in 1990, 95% of fifth graders knew who he was." On the other hand, it would not be appropriate to use one child's answer to that same question to draw reliable conclusions about his knowledge of history.

Finally, one must consider how high reliabilities should be for commercially prepared tests. If we're paying the pros to come up with good tests, shouldn't we expect the tests to be highly reliable? Here again, it depends on what kind of test you're looking for. Commercially available intelligence tests often report reliabilities of .90 or higher. On the other hand, some personality tests used for group research report reliabilities of only .60. The general strategy is to determine what you want to use the test for, and then to look for information regarding the specific type of reliability needed to achieve that goal. (For example, look for equivalent-forms reliability, not just internal consistency, if you are interested in using one form for a pretest and another for a posttest.) It's a good idea to look in a source like The Mental Measurements Yearbook (Kramer & Conoley, 2002) to find out what levels of reliability are available for tests of the sort you're looking for. If there are five tests of the same sort, and four of them report reliabilities of .85 or better, then the fifth one with a coefficient of .60 is substantially less reliable.

 

VALIDITY OF DATA COLLECTION PROCESSES

Validity of data collection addresses the question of whether a data collection process is really measuring what it purports to be measuring. A data collection process is valid to the extent that the results are actually a measurement of the characteristic the process was designed to measure, free from the influence of extraneous factors. Validity is the most important characteristic of a data collection process.

 

A data collection process is invalid to the extent that the results have been influenced by irrelevant characteristics rather than by the factors the process was intended to measure. For example, if a teacher gives a reading test and the test does not really measure reading performance, the test is useless. There is no logical way that the invalid test can help the teacher measure the outcome in which she is interested. If she gives a self-concept test that is so difficult to read that the third graders taking it are unable to interpret the tasks correctly, the test cannot validly measure self-concept among those students. It is invalid for that purpose, because it is so heavily influenced by reading skills that self-concept is not likely to come to the surface. This test cannot help the teacher make decisions about the outcome variable "self-concept." For example, if she ran a self-concept program for her students and their "self-concept" scores improved, how could she know whether it was really self-concept and not just reading ability that improved? In designing and carrying out any sort of data collection process, therefore, validity is of paramount importance.

As we said with regard to reliability, it is important to keep in mind that it is the validity of the data collection process - not of the data collection instrument - that must be demonstrated. What we really want to do is strengthen the validity of the conclusions we draw based on the data collection process; we don't want to draw conclusions based on the measurement of the wrong outcomes. It is technically incorrect to refer to the validity of a test. A test, a checklist, an interview schedule, or any other data collection device that is valid in one setting or for one purpose may be invalid in another setting or for another purpose. Therefore, this chapter always refers to the validity of data collection processes. It is important to remember this distinction.

 

SOURCES OF INVALIDITY

What makes a data collection process valid or invalid? A data collection process is valid to the extent that it meets the triple criteria of (1) employing a logically appropriate operational definition, (2) matching the items to the operational definition, and (3) possessing a reasonable degree of reliability. Invalidity enters the picture when the data collection strategy fails seriously with regard to one of these criteria or fails to lesser degrees in a combination of these criteria.

It may be instructive to look at some examples of invalid data collection processes. Assume that a researcher wants to develop an intelligence test. He operationally defines intelligence as follows: "A person is intelligent to the extent that he/she agrees with me." He then makes up a list of 100 of his opinions and has people indicate whether they agree or disagree with each item on this list. A person agreeing with 95 of the items would be defined as being more intelligent than one who agreed with 90, and so on. This is an invalid measure of intelligence, because the operational definition has nothing to do with intelligence as any reputable theorist has ever defined it.

Not all invalid data collection processes are so blatantly invalid. Indeed, one of the most heated arguments in psychology today is over the question of what intelligence tests actually measure. This whole question is one of validity. The advocates of many IQ tests argue that intelligence can be defined as general problem-solving ability. They operationally define intelligence as something like, "People are intelligent to the extent that they can solve new problems presented to them." They test for intelligence by giving a child a series of problems and counting how many she can solve. A child who can solve a large number of problems is considered to be more intelligent than one who can solve only a few. The opponents of such tests argue that the tests are invalid. They say that general problem-solving ability is not the only quality - or even the most important one - required to do well on such tests. The tests, they argue, really measure how well a person has adapted to a specific middle-class culture. Success on such tests, therefore, is really an operational definition of "ability to adapt to middle-class culture." Since the test is designed to measure intelligence but really measures a different ability, it is invalid. The argument over the validity of IQ tests is far from settled. Important theorists continue to line up on both sides, and others continue to suggest compromises - such as recommending new tests or redefining the concept of intelligence.

Consider another hypothetical intelligence test. Assume that we ask the child one question directly related to a valid operational definition. This is an excessively short test, and thus it is likely to provide an unreliable estimate of intelligence. Our result is also likely to be invalid, because our conclusion that a child is a genius for answering 100% of the questions correctly is about as likely to be a result of chance factors (unreliability) as it is to be a result of real ability related to the concept of intelligence.
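
The role of chance on an excessively short test can be made concrete with a little arithmetic. The sketch below rests on assumptions that are not part of the example above: the respondent guesses blindly, the items are independent, and each item has four answer choices, so that a single guess succeeds with probability .25. The function name chance_of_perfect_score is hypothetical.

```python
def chance_of_perfect_score(num_items, guess_probability):
    """Probability of answering every item correctly by blind guessing alone."""
    if num_items < 1:
        raise ValueError("num_items must be at least 1")
    if not 0.0 <= guess_probability <= 1.0:
        raise ValueError("guess_probability must be between 0 and 1")
    return guess_probability ** num_items


# Assumed four-choice items: each blind guess succeeds 25% of the time.
for num_items in (1, 5, 20):
    probability = chance_of_perfect_score(num_items, 0.25)
    print(f"{num_items:>2} item(s): chance of a perfect score = {probability:.6f}")
```

Under these assumptions a one-item test hands a "perfect" score to one guesser in four, whereas a perfect score on a 20-item test is almost impossible to reach by guessing. This is one reason why adding items tends to improve reliability.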

The factors that determine the validity of a data collection process are diagrammed in Figure 5.1. The first test cited in this section was invalid because the operational definition was inappropriate. In the second case, the operational definition was logically appropriate, but it was not clear whether the tasks the child performed were really related to this operational definition. The final IQ test was considerably limited in its validity because the test was unreliable.

 

 

 

To the extent that there is a complete breakdown at any of these stages, the data collection process is invalid. Likewise, if there is a cumulative breakdown at several stages, the data collection process can be invalid.

Figure 5.1 Factors Influencing Test Validity

 

ESTABLISHING VALIDITY

From the preceding discussion, it can be seen that there are three steps to establishing the validity of a data collection process designed to measure an outcome variable:

 

1. Demonstrate that the operational definition upon which the data collection process is based is actually a logically appropriate operational definition of the outcome variable under consideration. The strategy for demonstrating logical appropriateness was discussed in detail in chapter 4, where we pointed out that operational definitions are not actually synonymous with the outcome variable but rather represent the evidence that we are willing to accept to indicate that an internal behavior is occurring. Table 5.2 lists some cases where the operational definitions are to varying degrees logically inappropriate. For example, if the instructors in English 101 administer an anonymous questionnaire at the end of the semester to evaluate their performance in the course, they might think that the students are responding to questions about how they performed during the course. However, it's possible that the students who are completing the questionnaire are thinking, "If we tell them what we really think, they'll be upset and come down hard on us when they grade the exam. I think we should play it safe and give them good ratings for the course." If this is what students are thinking, then the favorable comments on the questionnaire are actually an operational definition of "anxiety over alienating instructor" rather than of "quality teaching."

 

In many cases, the logical connection is easy to establish, and hence the logical fallacies found in Table 5.2 are often easy to avoid. For example, the connections between the operational definitions and the outcome variables in Table 5.3 are much more obvious than the connections in Table 5.2. It's still possible for a person to perform behaviors described in the operational definitions without having achieved the outcome variable, but it is much less likely than was the case in the situations in Table 5.2.

Logical inappropriateness is most likely to occur when the outcome variable under consideration is a highly internalized one. Affective outcomes present particularly difficult problems, because the evidence is much less directly connected to the internal outcome than is the case with behavioral, psychomotor, and cognitive outcomes. The guidelines presented in chapter 4 are applicable here - namely, rule out as many alternative explanations as possible, and use more than one operational definition.

 

Table 5.2 Some Examples of Logically Inappropriate Operational Definitions of Outcome Variables

Assumed Outcome Variable | Operational Definition | Conceivable Real Outcome Variable
Ability to understand reading passages | The pupil paraphrases a passage he/she has read silently | Ability to guess from context clues
Love of Shakespearean drama | The student will carry a copy of Shakespeare's plays with him to class | Eagerness to impress professor
Appreciation of English 101 | The students will indicate on a questionnaire that they liked the course | Anxiety over alienating instructor
Knowledge of driving laws | The candidate will get at least 17 out of 20 true-false questions right on license test | Ability to take true-false tests with subtle clues present in them
Friendliness toward peers | The pupil will stand near other children on the playground | Anxiety over being beaten up if he or she stands apart
Appreciation of American heritage | Child will voluntarily attend the Fourth of July picnic given by the American Legion | Appreciation of watching fireworks explode

Table 5.3 Some Examples of Operational Definitions That Are Almost Certain to Be Appropriate for the Designated Outcome Variables

Outcome Variable | Operational Definition
Ability to add single-digit integers | The student will add single-digit integers presented to him ten at a time on a test sheet
Ability to tie one's own shoes | The student will tie her own shoes after they have been presented to her untied
Ability to bench press 150 pounds | The student will bench press 150 pounds during the test period in the gymnasium
Ability to spell correctly from memory | The student will write down from memory the correct spelling of each word given in dictation
Ability to spell correctly on essays with use of dictionary | The student will make no more than two spelling errors in a 200-word essay written during class with the aid of a dictionary
Ability to type 60 words per minute | The student will type a designated 300-word passage in five minutes or less
Ability to raise hand before talking in class | The student will raise his hand before talking in class
Ability to recall the quadratic equation | The student will write from memory the quadratic equation
Ability to apply the quadratic equation | Given the quadratic equation and ten problems that can be solved using the equation, the student will solve at least nine correctly

 

 

2. Demonstrate that the tasks the respondent has to perform to generate a score during the data collection process match the task suggested by the operational definition. The benefits of stating operational definitions can be completely nullified if the tasks that generate a score during the data collection process do not match the tasks stated in the operational definitions.

 

Table 5.4 provides examples of such mismatches. The first three are not intended to be facetious. Mismatches this obvious actually do occur on teacher-designed tests. They say they are going to measure one thing, and then they measure something else. The other examples in Table 5.4 are more subtle. In these cases, the teacher has one behavior in mind; and in fact, many of the persons responding to the data collection process will perform the behavior anticipated by the teacher. But the mismatch occurs whenever a respondent performs the different or additional tasks indicated in the second column of the table.

 

 

Table 5.4 Some Examples of a Mismatch Between the Operational Definition and the Task the Respondent Has to Perform on the Instrument

Operational Definition | Task on Instrument
The student will add single-digit integers presented to him ten at a time on a test sheet | "If I have three apples and you give me two more apples, how many do I have?"
The student will solve problems using the quadratic equation | "Explain the derivation of the quadratic equation."
The student will use prepositions correctly in her essays | "Write the definition of a preposition."
The student will apply the principles of operant conditioning to hypothetical situations | The student first has to unscramble a complex multiple-choice thought pattern and then apply the principles
Given a (culturally familiar) novel problem to solve, the test taker will be able to solve the problem | The student is presented with a problem entirely foreign to his cultural background
The student will describe the relationship between nuclear energy and atmospheric pollution | The student will write, in correct grammatical structures, a description of the relationship between nuclear energy and atmospheric pollution
The student will circle each of the prepositions in the paragraph provided | The student will first decipher the teacher's unintelligible directions and then circle each of the prepositions
The respondent will place herself in the simulated job situation provided to her and will indicate how she would perform in that situation | The respondent has to first ignore that the situation is absurdly artificial and highly different from the real world and then still respond as she would perform in the hypothetical situation

 

When questions arise concerning various sorts of bias in the data collection process, it is often the mismatch between task and operational definition that is being challenged. For example, with regard to bias in IQ tests, one of the most common arguments is essentially that middle-class youngsters who take the test are actually performing behaviors related to the operational definition, whereas equally intelligent lower-class youngsters are taking a test where there is a discrepancy between what they are doing and the operational definition of intelligence.

It is important to be aware of the various kinds of bias and other contaminating factors that could cause discrepancies, and to carefully rule these out. Such sources of mismatching include cultural bias, test-wiseness, reading ability, writing ability, ability to put oneself in a hypothetical framework, tendency to guess, and social desirability bias. The preceding list is not to be considered exhaustive. There are other factors unique to specific individuals that produce a similar effect. A good way to assure a match is to have several different qualified persons examine the data collection process and state whether the task matches the operational definition.

A special type of mismatch between operational definition and task is worth mentioning. Some data collection strategies are so obtrusive that the respondent is more likely to be responding to the data collection process itself than to be performing the tasks indicated in the operational definition. For example, if a child knows that a questionnaire is measuring prejudice and that it is not nice to be prejudiced, the child may answer what he thinks he should answer instead of revealing his true attitude. (This is referred to as a social-desirability bias.) Likewise, if a researcher comes into the classroom and sits in a prominent position with a behavioral checklist, children may be acutely aware that something unusual is happening; and so the behavior recorded on the checklist is more a reaction to the data collection strategy than an indication of actual behavioral tendencies. (Specific strategies for overcoming obtrusiveness are discussed in chapter 6.)

 

3. Demonstrate that the data collection process is reliable. Reliability was discussed extensively earlier in this chapter. The contribution of reliability to validity was mentioned in Figure 5.1 and in the accompanying discussion. The relationship between reliability and validity is diagrammed more specifically in Figure 5.2. As this diagram suggests, a certain amount of reliability is necessary before a data collection process can possess validity. In other words, a data collection process cannot measure what it's supposed to measure if it measures nothing consistently. In demonstrating that data collection processes are valid, professional test constructors first demonstrate that their data collection processes are reliable - that they measure something consistently; then they demonstrate that this something is the characteristic that the data collection processes are supposed to measure. In other words, they first demonstrate reliability in several ways, and then they demonstrate validity.

 

An important caution is necessary in discussing the relationship between reliability and validity. It is crucial to realize that it is possible (but undesirable and inappropriate) to increase reliability while simultaneously reducing the validity of a data collection process. This can be done by either (1) narrowing or changing the operational definition so that it is no longer logically appropriate or (2) changing the tasks based on the operational definition to less directly related tasks and then (3) devising a more reliable data collection process based on the more measurable but less appropriate operational definition or tasks. This is obviously a bad idea, because the result is that the data collection now measures a less valid or wrong outcome "more reliably."

Such an increase in reliability accompanied by a reduction in validity occurs, for example, if a teacher introduces unnecessarily complex language into a data collection process. A data collection process that had previously measured "ability to apply scientific concepts" might now instead measure "ability to decipher complex language and then apply scientific concepts." The resulting reliability might be higher; but if the teacher is still making decisions about the original outcome, the data collection process has become less valid.

Overemphasis on reliability is one of the arguments against culturally biased norm-referenced tests. Their detractors argue that many standardized tests become more reliable when cultural bias is added, because such bias is a relatively stable (consistent) factor, which is likely to work the same way on all questions and on all administrations of the test. However, the cultural bias detracts from the validity of the test.

It is important to be alert to the tendency to accept spuriously high statistical estimates of reliability as solid evidence of validity. The fact that a certain amount of reliability is a necessary prerequisite for validity does not mean that the most reliable data collection process is also the most valid. Statistical reliability is only one factor in establishing the validity of a data collection process. Another way to state this is to say that reliability is a necessary but not sufficient condition for validity.
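
The caution that a stable but irrelevant factor can raise reliability while lowering validity can be illustrated with a small simulation. All of the numbers here are assumptions made for illustration: 2,000 simulated respondents, a trait and an irrelevant stable factor that each have a standard deviation of 1, and a bias weight of 2. "Validity" is represented crudely as the correlation between scores and the simulated trait, which can be computed here only because the trait is simulated; in real research it cannot be observed directly. The names pearson and administer are hypothetical helpers.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 2000

trait = [rng.gauss(0, 1) for _ in range(n)]        # what we want to measure
stable_bias = [rng.gauss(0, 1) for _ in range(n)]  # irrelevant but consistent


def administer(bias_weight):
    """One administration: trait + weighted stable bias + fresh random error."""
    return [
        t + bias_weight * b + rng.gauss(0, 1)
        for t, b in zip(trait, stable_bias)
    ]


# Two administrations of a process free of the irrelevant factor ...
clean_first, clean_second = administer(0.0), administer(0.0)
# ... and two administrations of a process heavily influenced by it.
biased_first, biased_second = administer(2.0), administer(2.0)

clean_reliability = pearson(clean_first, clean_second)
biased_reliability = pearson(biased_first, biased_second)
clean_validity = pearson(clean_first, trait)
biased_validity = pearson(biased_first, trait)

print(f"Unbiased process: test-retest r = {clean_reliability:.2f}, "
      f"r with trait = {clean_validity:.2f}")
print(f"Biased process:   test-retest r = {biased_reliability:.2f}, "
      f"r with trait = {biased_validity:.2f}")
```

Under these assumptions the biased process yields the higher test-retest correlation, because the irrelevant factor works the same way on both administrations, while its correlation with the trait is lower. A high reliability coefficient alone therefore says nothing about which of the two processes is the more valid.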

 

 

As you can see, establishing validity is predominantly a logical process.

Finally, before leaving this introduction to the validity of data collection processes, it is important to note that a data collection process that provides valid data for group decisions will not always provide valid data for decisions about individuals. On the other hand, a data collection process that provides valid data for decisions about individuals will always provide valid data for group decisions. This is not as complicated as it sounds. To take an example, we might operationally define appreciation of Shakespeare as "borrowing Shakespearean books from the library without being required to do so." Even if Janet Jones borrows books on Shakespeare without being required to do so, it is not possible to diagnose her specifically as either appreciating or not appreciating the bard using this operational definition. There are too many competing explanations for her behavior, and these would invalidate this data collection process as an estimate of her appreciation. (For example, she might hate the subject but need to pass the exam; and so she has to borrow a vast number of books to do burdensome, additional studying. Or she might like Shakespeare so much that she owns annotated copies of all the plays and never has to borrow from any library except her own.) Nevertheless, it may still be valid to evaluate the group based on this operational definition. If you teach the Shakespeare plays a certain way one year and only 2% of the students ever borrow related books from the library, and the next year you teach the same subject differently and 50% of the students spontaneously borrow books, it is probably valid to infer from their available documented records that appreciation of Shakespeare has increased. The group decision, at any rate, is more likely to be valid than is the individual diagnosis.

 

Box 5.1

An Argument-Based Approach to Validity

 

Kane (1992) presents the practical yet sophisticated idea that validity should be discussed in terms of the practical effectiveness of the argument to support the interpretation of the results of a data collection process for a particular purpose. The researcher or user of the research chooses an interpretation of the data, specifies the interpretive argument associated with that interpretation, identifies competing interpretations, and develops evidence to support the intended interpretation and refute the competing interpretations. The amount and type of evidence needed in a particular case depend on the inferences and assumptions associated with a particular application.

The key points in this approach are that the interpretive argument and the associated assumptions be stated as clearly as possible and that the assumptions be carefully tested by whatever strategies will best rule out bias and other sources of faulty conclusions. As the most questionable inferences and assumptions are checked and either supported by the evidence or adjusted so that they become more plausible, the plausibility (validity) of the interpretive argument increases.

This interpretation of validity is compatible with the discussion presented in this chapter. In addition, it has the advantage of presenting validity as a special instance of the overall application of formal and informal reasoning to solving problems. From this viewpoint, when educators do research, they are under the same obligation as any other person making public statements to demonstrate that those statements really do mean what the speaker or writer says they mean. Statistical procedures and other specific techniques are merely pieces of evidence to check the quality of inferences and the authenticity of the assumptions underlying a particular interpretation.

(Source: Kane, M. T. [1992]. An argument-based approach to validity. Psychological Bulletin, 112, 527-535.)

 

 

 

 

 

REVIEW QUIZ 5.4

Part I

Identify the item from each pair that is most likely to be an invalid measure of the outcome variable given in parentheses.

Set 1.

a. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (understands Spanish)

b. The child will correspond intelligibly with an assigned Spanish-speaking pen pal. (appreciates Spanish culture)

 

 

Set 2.

a. The student will identify examples of the principles of physics in the kitchen at home. (understands principles of physics)

b. The student will choose to take optional courses in the physical sciences. (appreciates physical sciences)

 

 

Part 2

Write Invalid next to statements that indicate an invalid data collection process; write Valid next to those that indicate a valid data collection process; write N if no relevant information regarding validity is contained in the statement.

1____ The questions were so hard that I was reduced to flipping a coin to guess the answers.

2____ The test measures mere trivia, not the important outcomes of the course.

3____ To rule out the influence of memorized information regarding a problem, only topics that were entirely novel to all the students were included on the problem-solving test.

4____ The only way he got an A was by having his girlfriend write the term paper for him.

5____ The length of the true-false English test was increased from 30 to 50 items to minimize the chances of getting a high score by guessing.

6____ The teacher ruled out the likelihood of cheating by giving each of the students seated at the same table a different form of the test.

7____ Since the personality test had such a difficult vocabulary level, it probably was influenced more by intelligence than by personality factors.

8____ The observer rated the classroom as displaying a hostile environment toward handicapped people, but the teacher argued that the observer's judgment was clouded because she observed from a position where she was next to students who were not at all typical of the entire class.

9____ The observer rated the atmosphere of the school board meeting as being supportive of innovative teaching, but the newspaper critic pointed out that this was because the board members were local residents with business interests and were therefore very likely to be supportive of innovation.

 

 

If you got most of the questions in Review Quiz 5.4 correct, or if you easily saw the logic of the explanations, then you probably have a good basic grasp of the concept of validity. If you do not understand the concept, reread the chapter to this point, check the chapter in the workbook, refer to the recommended readings, or ask your instructor or a peer for help. Be sure that you understand the summary in the following paragraph so that you will profit from the rest of this chapter.

In summary, validity refers to whether a data collection process really measures what it is designed to measure. Invalidity occurs to the extent that the data collection process measures an incorrect variable or no consistent variable at all. The main sources of invalidity are logically inappropriate operational definitions, mismatches between operational definitions and the tasks employed to measure them, and unreliability of data collection processes. Validity is not an all-or-nothing characteristic; data collection processes range from strong validity to weak validity. Because of the highly internalized nature of educational outcomes, data collection processes in education can never be perfectly valid. By carefully stating appropriate operational definitions, ascertaining that tasks employed in data collection processes are directly related to the operational definitions, and designing reliable data collection processes, we can increase the validity of our data collection processes and the probability that we will draw valid conclusions from them.

 

SPECIFIC, TECHNICAL EVIDENCE OF MEASUREMENT VALIDITY

If you read a test manual or look up the citation of a test in The Mental Measurements Yearbook (Kramer & Conoley, 2002), you will find references to three basic types of evidence to support measurement validity. These have been defined by several major organizations interested in mental measurement (American Educational Research Association et al., 1985). The technical types of evidence for validity are rooted in the theory discussed earlier in this chapter, and it is not difficult to achieve a fundamental understanding of these concepts. A brief discussion of these types of evidence for validity can help teachers and researchers develop more valid data collection processes for their own use. In addition, an understanding of these concepts will be especially useful when selecting or using standardized tests, reading the professional literature, and attempting to measure psychological or theoretical characteristics beyond those that are typically covered by classroom tests. These three types of evidence for validity are (1) content validity, (2) criterion-related validity, and (3) construct validity.

 

Content Validity

Content validity refers to the extent to which a data collection process measures a representative sample of the subject matter or behavior that should be encompassed by the operational definition. A high school English teacher's midterm exam, for example, lacks content validity when it focuses exclusively on what was covered in the last two weeks of the term and inadvertently ignores the first six weeks of the grading period. Likewise, a self-concept test would lack content validity if all the items focused on academic situations, ignoring the impact of home, church, and other factors outside the school. Content validity is assured by logically analyzing the domain of subject matter or behavior that would be appropriate for inclusion on a data collection process and examining the items to make sure that a representative sample of the possible domain is included. In classroom tests, a frequent violation of content validity occurs when test items are written that focus on knowledge and comprehension levels (because such items are easy to write), while ignoring the important higher levels, such as synthesis and application of principles (because such items are difficult to write).

 

Criterion-Related Validity

Criterion-related validity refers to how closely performance on a data collection process is related to some other measure of performance. There are two types of criterion-related validity: predictive and concurrent.

Predictive validity refers to how well a data collection process predicts some future performance. If a university uses the Graduate Record Exam (GRE) as a criterion for admission to graduate school, for example, the predictive validity of the GRE must be known. This predictive validity would have been established by administering the GRE to a group of students entering a school and determining how their performance on the GRE corresponded with their performance in that school. It would be expressed as a correlation coefficient. A high positive coefficient would indicate that persons who did well on the GRE tended to do well in graduate school, whereas those who scored low on the GRE tended to perform poorly in school. A low correlation would indicate that there was little relationship between GRE performance and success in that particular graduate school.
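
As a minimal sketch of how such a coefficient might be computed, the following Python example correlates entrance-exam scores with later grade point averages. The scores, the GPAs, and the function name pearson_r are hypothetical and were invented for illustration; they are not actual GRE data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the two means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores in a list are identical")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: entrance-exam scores and later graduate GPAs for 8 students.
exam_scores = [310, 325, 298, 335, 302, 318, 290, 328]
graduate_gpa = [3.4, 3.7, 3.0, 3.9, 3.2, 3.5, 2.9, 3.6]

r = pearson_r(exam_scores, graduate_gpa)
print(f"predictive validity coefficient: r = {r:.2f}")
```

With these invented scores the coefficient is strongly positive, which is the pattern that would support predictive validity. In a real study the exam would be given first and the criterion (here, GPA) collected later from the same students.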

 

Concurrent validity refers to how well a data collection process correlates with some current criterion - usually another test. It "predicts" the present. At first glance it sounds like an exercise in futility to predict what is already known, but more careful consideration will suggest two important uses for concurrent validity. First, it is a useful predecessor for predictive validity. If the GRE, for example, does not even correlate with success among those who are going to school right now, then there is little value in doing the more expensive, time-consuming, predictive validity study. Second, concurrent validity enables us to use one measuring strategy in place of another. If a university wants to require that students either take freshman composition or take a test to "test out" of the course, concurrent validity would enable the English department to demonstrate that a high score on the alternative test has a similar meaning to a high grade in the course. Like predictive validity, concurrent validity is expressed by a correlation coefficient.

 

Construct Validity

Construct validity refers to the extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs. A construct is a label or hypothetical interpretation of an internal behavior or psychological quality - such as self-confidence, motivation, or intelligence - that we assume exists to explain some observed behavior. Construct validity often necessitates an extremely complicated process of validation. To state it briefly, the researcher develops a theory about how people should perform during the data collection process if it really measures the alleged construct and then collects data to see whether this is what really happens. The process is complicated because the researcher is doing two separate things: (1) proving that the data collection process possesses construct validity and (2) refining the theory about the construct. Note that this process of validation can never be completed; the goal of researchers engaging in construct validation is to refine concepts and data collection processes, not to arrive at ultimate conclusions. Construct validity often deals with the intervening variable (discussed in chapter 2), and it is of greatest relevance to theoretical research (discussed in chapter 17).

Remember: The three technical types of evidence for validity are merely tools for demonstrating that a data collection process measures what the test designer or researcher says it measures. The fundamental logic behind them is relatively straightforward. The difficulty lies in carrying out the procedures to collect these types of evidence for validity. The information presented here (summarized in Table 5.5) should be enough to enable you to deal with applying and interpreting these concepts in most situations. If you find that you need further information (for example, if your job requires that you select people accurately for various programs), consult the more technical references in the Annotated Bibliography at the end of this chapter.

 

Table 5.5 Summary of the Three Major Types of Psychological Validity

Type of Validity: Content

Definition: The extent to which a data collection process measures a representative sample of the topic encompassed by the operational definition.

Mnemonic: The content of the data collection process is a good sample of the content that it should cover.

Examples of How to Achieve and Demonstrate It:

1. Use a plan (such as an item matrix) to plan a test so that all areas are properly represented.

2. Logically show that nothing has been omitted or overrepresented in the data collection process.

 

Type of Validity: Predictive

Definition: How well a data collection process predicts some future performance.

Mnemonic: The data collection process predicts something that has not yet occurred.

Examples of How to Achieve and Demonstrate It:

1. Select students for an advanced algebra class based on a standardized math test. Then see if those who did well on the math test actually do better in the course.

2. Give students the SAT before they enter college. Then compute a correlation coefficient with college GPA to see if the SAT accurately predicts college performance.

 

Type of Validity: Concurrent

Definition: How well a data collection process correlates with a current criterion.

Mnemonic: Both data collection processes occur at the same time (concurrently). We want to demonstrate that one can be considered a substitute for the other.

Examples of How to Achieve and Demonstrate It:

1. Determine that success in English composition classes has already demonstrated writing skill, making it unnecessary for the student to take the English exit exam (which measures the same thing).

2. Compute a correlation coefficient between the performance of students on the computerized and non-computerized versions of the GRE (so that we can consider performance on one to be equivalent to performance on the other).

 

Type of Validity: Construct

Definition: The extent to which the results of a data collection process can be interpreted in terms of underlying psychological constructs.

Mnemonic: A psychological construct (accent on first syllable) is something that exists inside a person's head. We construct it (accent on second syllable) by reasoning about observable information (such as test results).

Examples of How to Achieve and Demonstrate It:

1. A person's test results show whites are smarter than blacks. We challenge this person by demonstrating that the test measures cultural familiarity rather than intelligence.

2. A person shows that her moral development test really does measure something that can be called moral reasoning - rather than reading ability, conformity, intelligence, or some other unrelated characteristic.

 

REVIEW QUIZ 5.5

Indicate the type of technical evidence for validity in each of the following situations. Choose from this list:

a. content validity

b. predictive validity

c. concurrent validity

d. construct validity

 

  1. A test designer has developed an Anxiety Measurement Scale and wants to verify that it really measures a characteristic that can be labeled "anxiety."

     

  2. A counselor wants to select students into his school's college preparatory program based on likelihood that they will succeed in college. He wants to know whether a certain data collection process can help him accomplish this selection process.

     

  3. A test designer has developed a new, 10-minute IQ test and wants to demonstrate that it measures about the same thing as the more expensive Stanford-Binet IQ test.

     

  4. The dean wants to make sure that all the exams in the English composition course cover all the objectives they are supposed to cover.

     

  5. A teacher wants to find out whether the students who fail her final comprehensive exam really are the ones who will have trouble with related materials the next year.

 

PUTTING IT ALL TOGETHER

As you will recall, Eugene Anderson, the humane educator, had written several operational definitions of "attitude toward animal life." Several of these will be discussed in the next few chapters, but here we shall focus on just one of them. Based on his second operational definition ("The child protects animals from harm"), he devised the paper-and-pencil test shown in Figure 5.3. He reasoned that a person with a favorable attitude would want the fireman to save animals before he or she saved objects from a burning building. (The validity of this belief will be discussed next.)

Mr. Anderson planned to give each respondent a score between 0 and 3, depending on how many animals the child selected on this test. Of course, he wanted to be reasonably certain that the score a child received on any testing occasion would actually represent that child's feelings toward animals, not some irrelevant or transient factor. In addition, he wanted to have two forms of the test; and so he devised a second test ("Billy and the Fireman" - not shown here), which contained a different set of animals and objects. Mr. Anderson needed to ascertain that both tests really were equivalent forms of the same test. If they were really equivalent forms, then he could give one as a pretest and the other as a posttest, to determine whether attitudes really changed as a result of his visits.
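
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration: the function name fireman_score and the short item labels (such as "dog" and "wallet") are hypothetical stand-ins for the full item text in Figure 5.3, and the sample responses are invented.

```python
# Items on the "Johnny and the Fireman" form that count as animals (from Figure 5.3).
ANIMAL_ITEMS = {"dog", "cat", "gerbil"}

def fireman_score(choices):
    """Return the number of animals (0 to 3) among a child's three choices."""
    if len(choices) != 3:
        raise ValueError("each respondent must choose exactly three items")
    if len(set(choices)) != 3:
        raise ValueError("the three choices must be different items")
    # One point for each chosen item that is an animal.
    return sum(1 for item in choices if item in ANIMAL_ITEMS)

# Hypothetical responses from two children.
print(fireman_score(["dog", "wallet", "cat"]))
print(fireman_score(["tv", "coat", "checkbook"]))
```

Because only three choices are allowed and the form lists three animals, every score falls between 0 and 3, which is the narrow range that later made Mr. Anderson wonder whether the test was long enough.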

Mr. Anderson tried to follow all the non-statistical guidelines listed in this chapter to make the test as reliable as possible. As he completed his task, the only guideline that caused him any real concern was the one about making the test long enough. Was a range of 0 to 3 a big enough span of scores? On the one hand, he thought it might be a good idea to increase the number of choices; but on the other hand, he felt that the larger number of choices might needlessly confuse his respondents, since many of them would be in only the third or fourth grade.

Because of his doubts about the length of the test, he decided to use statistical techniques to check its reliability. If the test was too short, he would obtain a low reliability coefficient; if he obtained high coefficients, he would know that the brevity of the test was not a serious problem. In addition, the statistical procedures would be helpful in establishing the equivalence of the two tests. He had tried to obtain equivalence by pairing items and assigning them from a larger pool, but he would feel more secure if he had statistical evidence to demonstrate that they were parallel. Finally, the statistical evidence would be helpful to Mr. Anderson when he presented his results to his colleagues at meetings. With the statistical reliability data, he would not have to persuade them of his personal capability as an item writer. He could simply show them the numbers to prove that the tests were consistent.

He found several schools in which he was allowed to field test his instrument. In some cases, he had the same students take the same form of the test with an interval of a week or two in between (test-retest reliability). In other cases, he had them take the alternate forms after an interval of only a day or so (equivalent-forms reliability). In two cases, he gave the alternate forms with two weeks between the two testing occasions (test-retest with equivalent-forms reliability). The results are summarized in Table 5.6. As Mr. Anderson looked at his results, he was quite satisfied. The reliability coefficients showed that he had devised a reasonably consistent instrument. In addition, the alternate forms of the test really did appear to be equivalent. When one of his colleagues pointed out that his correlation coefficients were not as high as the correlations of .90 often reported for good standardized tests, Mr. Anderson replied that he was not concerned about that. The standardized tests were intended for diagnosing individual abilities, and a higher degree of reliability was necessary for that purpose. All that Mr. Anderson wanted to do was examine group attitudes, and his statistical reliabilities were more than sufficient for his needs. Mr. Anderson had indeed developed a consistent test. His next problem was to demonstrate that the trait he was consistently measuring could legitimately be called "attitude toward animal life."
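
One rough way to look at the coefficients in Table 5.6 as a set is to find the lowest, median, and highest value for each type of reliability. The sketch below does this in Python; the coefficients are those reported in Table 5.6, while the function name summarize and the choice of summary statistics are assumptions made for illustration, not part of Mr. Anderson's procedure.

```python
from statistics import median

# Reliability coefficients as reported in Table 5.6.
test_retest = [0.63, 0.75, 0.70, 0.69, 0.70]
equivalent_forms = [0.70, 0.64, 0.73, 0.55, 0.71, 0.73]

def summarize(label, coefficients):
    """Print and return the (lowest, median, highest) coefficient."""
    low, mid, high = min(coefficients), median(coefficients), max(coefficients)
    print(f"{label}: lowest {low:.2f}, median {mid:.2f}, highest {high:.2f}")
    return low, mid, high

retest_summary = summarize("test-retest", test_retest)
forms_summary = summarize("equivalent forms", equivalent_forms)
```

A summary of this kind makes it easier to compare the field-test results with a benchmark such as the .90 mentioned by Mr. Anderson's colleague.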

An even more important concern for Mr. Anderson was that his tests should be valid. He was concerned about the validity of all his measuring instruments, but in this section we'll focus exclusively on how he established the validity of his Fireman Test.

 
Johnny and the Fireman

Johnny is a boy about your age. One night his house catches fire. He and all the members of his family escape, but they have time to bring nothing with them. A fireman comes up to Johnny and says, "The house is going to be a total loss. Is there anything you would like us to try to get out of the house before it burns down?"

Here is a list of some of the things in the house. Choose the three things that Johnny should tell the firemen to try to save if there is time. Then explain the reasons for your choice.

Color portable TV (brand new: cost $450).

Father's wallet ($75 and credit cards).

Johnny's dog (1 year old: cost $30).

Johnny's stamp collection (worth $75).

His sister's cat (she got it free a year ago).

Dad's car keys (car is safely parked on the street).

Mother's expensive coat (worth $300).

CB radio (worth $210).

Little brother's pet gerbil.

Dad's checkbook.

 

What is the first thing to save?

 

What is the second thing to save?

 

What is the third thing to save?

 

Figure 5.3 Mr. Anderson's Humane Attitudes Test

 

 

 

Table 5.6 Reliability Data on the Fireman Tests

Test-Retest Reliability

Test      Grade Level    Time Interval    Correlation
Johnny    5th (n=20)     1 week           .63
Johnny    4th (n=24)     1 week           .75
Johnny    6th (n=25)     2 weeks          .70
Billy     5th (n=20)     1 week           .69
Billy     4th (n=23)     1 week           .70

Equivalent-Forms Reliability

Grade Level    Time Interval    Correlation
4th (n=47)     1 week           .70
4th (n=26)     2 days           .64
3rd (n=35)     1 day            .73
4th (n=24)     4 days           .55
5th (n=65)     1 day            .71
5th (n=65)     1 day            .73

In determining the validity of this data collection process, Mr. Anderson followed the guidelines suggested in this chapter. First he looked at the operational definition to ascertain that it was really logically valid. This operational definition had been revised to state, "Given a hypothetical situation in which animals might undergo pain and suffering, the respondent will choose to save the animals from that pain and suffering." He talked this over with several of his colleagues, and they agreed that saving the animals was the behavior they would expect from a person with humane values.

Next, he ascertained that the children involved in the data collection process would actually be doing what the operational definition said they should be doing. At this point, he had to rule out such irrelevant tasks as reading ability and the tendency to give false but socially desirable answers. He ruled out the reading variable by consulting some reading specialists. They agreed that for most third through seventh graders, the vocabulary would not be excessively difficult. They suggested that in case of uncertainty, Mr. Anderson should simply read the test to the respondents. Next he ruled out the social-desirability factor by reasoning that all the objects in the house were socially desirable. In addition, since it would be introduced as part of a discussion of fire prevention, the test would be presented in such a way that the children would not even know that it had anything to do with attitudes toward animals. Finally, he noted that he had already established the reliability of the data collection process.

Mr. Anderson decided to use some statistical procedures to further authenticate validity. The procedures he used were a combination of criterion-related (concurrent) validity and construct validity. (It is not very important for you to distinguish precisely between the various techniques he used.) He asked himself, "If my data collection process is valid, what can I expect the results to be?" He answered this question with three predictions:

  1. Scores on the Fireman Test would correspond with scores on other measures of humane attitudes.

  2. Children from communities known for humane attitudes would score higher on the Fireman Test than children from communities considered less humane.

  3. Scores on the Fireman Test would not correlate substantially with unrelated characteristics, such as reading ability, math ability, or intelligence.

 

 

 

He set out to check each of these predictions.

Mr. Anderson found it hard to check his first prediction. This was because he could not find any other good tests of humane attitudes. What he did, therefore, was compare the results of the Fireman Tests with the results of some other measuring techniques he had derived from his own set of operational definitions. (Some of these other tests are described in chapter 6.) He found a definite pattern. Those who did well on the Fireman Test also did well on the other instruments. This information seemed to verify his first prediction.

His second prediction was much easier to check. He knew from his professional reading that one specific geographic region of the country was noted for its humane attitudes. The largest humane organizations were in that part of the country, and the incidence of pet and animal abuse was very low there. He also knew of another area that was generally considered by experts to be populated by much less humane people. He arranged to have his test given at the same grade levels in comparable schools in each of these communities. The results overwhelmingly supported his prediction. The students from the part of the country where the attitudes were known to be favorable scored higher on the test than the other students. This provided very strong support for the validity of his data collection process.

Then Mr. Anderson checked his third prediction. He correlated the test scores with scores on reading tests, math tests, and intelligence tests. The Fireman Test did not correlate substantially with any of these other scores. This is what he had hoped for. If the Fireman Test had correlated strongly with reading ability, for example, this might have indicated that the test was really a measure of reading ability rather than humane attitudes.
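
A check of this kind might be sketched as follows in Python. All of the scores are hypothetical and were invented for illustration, as are the function name pearson_r and the cutoff of .30 used to decide what counts as a "substantial" correlation; the chapter does not report Mr. Anderson's actual data or any specific cutoff.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical Fireman Test scores (0 to 3) for eight children.
fireman = [0, 1, 2, 3, 0, 1, 2, 3]

# Hypothetical scores for the same children on three unrelated measures.
other_measures = {
    "reading": [52, 61, 55, 58, 60, 54, 59, 57],
    "math": [70, 75, 72, 68, 74, 69, 71, 77],
    "intelligence": [100, 104, 98, 101, 103, 99, 102, 101],
}

THRESHOLD = 0.30  # assumed cutoff for a "substantial" correlation

correlations = {name: pearson_r(fireman, scores)
                for name, scores in other_measures.items()}
# Flag any unrelated measure whose correlation is at or above the cutoff.
flagged = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: r = {r:+.2f}")
print("measures correlating substantially:", flagged or "none")
```

With these invented scores no measure is flagged, which is the outcome Mr. Anderson hoped for. A flagged measure would suggest that the test might be measuring that characteristic rather than humane attitudes.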

Mr. Anderson was happy with his validity data. He had demonstrated both logically and with empirical data that his data collection process really did seem to measure attitude toward animal life. He still intended to supplement the Fireman Test with other measuring techniques (described in the next chapter), but at least he knew he was off to a good start.

 

SUMMARY

Reliability refers to the degree to which measuring techniques are consistent rather than self-contradictory. Reliability is important because you will want to make decisions about your programs and students based on internally consistent and stable data rather than fleeting information that would change if you simply took the time to collect the information a second time. This chapter has discussed factors that introduce inconsistency into data collection procedures as well as strategies for controlling these factors. In addition to presenting these guidelines, this chapter has described statistical procedures that can be useful tools to help assure reliability.

Validity refers to the extent to which a data collection process really measures what it is designed to measure. The validity of a data collection process is established by demonstrating that (1) the operational definitions upon which the data collection process is based are actually logically appropriate operational definitions of the outcome variable under consideration, (2) the tasks the respondent performs during the data collection process match the task suggested by the operational definitions, and (3) the data collection process is reliable.

 

What Comes Next

In the next few chapters we'll discuss how to collect, report, and interpret reliable and valid data. Later we'll integrate this information into strategies for effectively conducting and interpreting research in education.

 

DOING YOUR OWN RESEARCH

When conducting quantitative research, it is essential that your data collection processes be valid. (The issue of validity of qualitative research is discussed in chapter 9.) The following guidelines emerge from the principles discussed in this chapter:

  1. Develop good operational definitions of your outcome variables. Use the guidelines discussed in chapter 4.

  2. Keep your operational definitions in mind when developing or selecting data collection processes. Be aware of sources of invalidity, and design or select only data collection processes that are directly related to your operational definitions.

  3. Develop or select reliable data collection processes, but remember that reliability is only a tool for establishing validity - it is not an end in itself.

  4. Increase validity by triangulating - that is, by using more than one operational definition and more than one data collection process for each outcome variable.

  5. Check the reliability and validity of your data collection processes during the early stages of your research.

  6. Collect information about reliability and validity from published sources like those described in chapter 6 and from information in the methods section of published articles.

In addition, even if your research plan emphasizes quantitative methods, consider enhancing validity by supplementing quantitative methods with the qualitative strategies described in chapter 9.

 

FOR FURTHER THOUGHT

  1. Why is it that the same steps that increase reliability often interfere with the validity of a data collection process?

     

  2. Complete this sentence by answering the designated questions: "The concept of reliability ... (does what?) (To what or whom?) (When?) (Where?) (How?) (Why?)"

     

  3. Complete this sentence by answering the designated questions: "The concept of validity ... (does what?) (To what or whom?) (When?) (Where?) (How?) (Why?)"

 

ANNOTATED BIBLIOGRAPHY

The following sources provide more detailed information on the general topics of reliability and validity:

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association. This booklet includes the guidelines for reliability and validity recommended by the three corporate authors. A familiarity with these guidelines will help you make better use of published information regarding data collection processes.

Ebel, R. L., & Frisbie, D. A. (1979). Essentials of educational measurement (5th ed.). Englewood Cliffs, NJ: Prentice-Hall. Chapter 5, "The Reliability of Test Scores," presents a clear statement of the theoretical rationale behind the traditional methods of assessing reliability, with a special emphasis on how to apply these methods to educational practice. Chapter 6, "Validity: Interpretation and Use," offers guidelines to help teachers enhance the validity of their tests by making sure they are appropriate for the purposes for which they are intended. Chapter 13, "Evaluating Test and Item Characteristics," describes important techniques for promoting internal consistency and generally revising early versions of data collection procedures.

Gronlund, N. E., & Linn, R. L. (1990). Measurement and evaluation in teaching (6th ed.). New York: Macmillan. Chapter 3, "Validity," provides some useful guidelines for teachers to increase the validity of their classroom tests. Part of Chapter 11, "Appraising Classroom Tests," describes item analysis, an important technique for promoting internal consistency. Chapter 4, "Reliability and Other Desired Characteristics," discusses the traditional methods of establishing reliability and gives concrete advice on how to apply these to improving classroom tests.

Kramer, J. J., & Conoley, J. C. (Eds.). (2002). The Fifteenth Mental Measurements Yearbook. Lincoln, NE: Buros Institute of Mental Measurements. This book and earlier volumes in the series provide critical, scholarly information about the reliability, validity, and other characteristics of published, standardized data collection materials. This resource is also available via the Internet as an online database.

Mager, R. (1984). Measuring instructional results (2nd ed.). Belmont, CA: David S. Lake. This book addresses the very important problem of matching test items to behavioral objectives (operational definitions of outcome variables). Although Mager does not use the specific term validity, this little programmed text offers an excellent guide to one of the most important validity-related problems the classroom teacher faces.

Worthen, B. R., Borg, W. R., & White, K. R. (1993). Measurement and evaluation in the schools. New York: Longman. Chapters 6 and 7 offer practical and useful answers to the questions "Why worry about reliability?" and "Why worry about validity?" Chapter 8 focuses on "Cutting Down Test Score Pollution" by discussing ways to increase the reliability and validity of data collection processes.

 

The following source is useful for readers who are interested in more theoretical information on reliability:

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is a brief but comprehensive treatment of the major issues relating to reliability. It's heavy on statistical formulas.

 

The following sources are useful for readers who are interested in more detailed information on validity:

Cole, N. S. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is a detailed treatment of one of the major sources of invalidity in the interpretation of data collection processes.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education. This is probably the most authoritative discussion available regarding the status of current thought on validity. Anyone doing serious work on validity of data collection processes should consult this chapter.

 

The following source provides more detailed information on the specific topic of reliability of criterion-referenced tests:

Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall. Chapter 2, "Traditional Measurement Practices," discusses traditional approaches to reliability and points out some of the problems that are likely to occur when we try to apply these same approaches to the kinds of tests that teachers should be using to evaluate student performance. Chapter 7, "Reliability, Validity, and Performance Standards," is probably the most comprehensive treatment available on the reliability and validity of criterion-referenced tests.

 

ANSWERS TO QUIZZES

REVIEW QUIZ 5.1

  1. Unreliable. If they were measuring Ralph's ability consistently, they should be able to agree on his score. Ralph's score depends on who scores the test, not on what he wrote down. This is like having two parents read the thermometer and one conclude that the child is sick, while the other concludes that he is healthy.

  2. Reliable. Of course, you may need more than two witnesses to convince a jury. Their testimony could be invalid (discussed elsewhere in the chapter) if they had conspired to lie about seeing the violin case. However, in the sense that we are using the word here, their testimony is reliable.

  3. No information is provided about reliability. It is quite possible that her performance could have changed during the intervening year. The difference could be the result of unreliability, but we have no way to know. There is no solid reason to expect the two grades to be identical.

  4. Unreliable. If the same students are rating his overall ability as a teacher, there does not seem to be any good reason why this should change between Monday and Wednesday. If factors such as mood swings on the part of the students or teacher are causing the ratings to vary, these are sources of unreliability. (Note that if the students were rating him on how well he taught a specific lesson, then it might be plausible to say that he did well on one lesson and less well on another. In this case, the different scores would be a reflection of his actual performance, and the variation would not be evidence of unreliability.)

  5. Unreliable. Neuroses are supposed to be relatively permanent personality characteristics. They do not come one day and go the next. Neurosis is a vague term, and the counselor is probably having trouble operationally defining what she means.

  6. Unreliable. The test should prove him equally ignorant both times. Wild guessing is one of the most frequent sources of unreliability on "objectively" scored tests.

  7. Reliable. Mr. Monroe has essentially measured them twice and has come up with the same result. That's consistency. (This measurement of the class is apparently reliable; it still may be an unreliable way to diagnose individual students. The distinction is treated elsewhere in the chapter.)

  8. Unreliable. If both pollsters are trying to measure popularity of novels, then their results should be much alike. However, if one is measuring nationwide popularity and the other is measuring popularity in Chicago, then discrepancies are plausible, provided there is an actual reason (other than inaccuracy of the measuring process) for the differences.

  9. Reliable. The two persons have made independent ratings of the same students and have come to similar conclusions.

 

REVIEW QUIZ 5.2

  1. R. She is increasing the length of the test and getting a better sample of student behavior.

  2. U. He is using excessively difficult items.

  3. U. She is making the items more dissimilar. Items that are added into a single score should focus on a common topic.

  4. U. This is an excessively short (one-item) test.

  5. U. Each of these subtests is very short. Mr. Peters would be on solid ground if he knew that each of the subscales had adequate reliability.

  6. R. Miss Adams is avoiding the chance that temporary characteristics of the students anticipating the pep rally will lead to inconsistency.

  7. U. Mrs. Wolf is adding an additional source of inconsistency (the chance that she will make a mistake in instructions) to the sources that the students themselves bring to the test.

  8. R. Mrs. Johnson is using items of medium difficulty in a situation in which it is plausible to expect less than perfect mastery.
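
Answers 1, 4, and 5 all turn on test length. One conventional way to estimate how length affects reliability is the Spearman-Brown formula, offered here as a supplementary sketch and not as a method taken from this chapter. It assumes that the items added or removed are comparable to the existing ones. The function name `spearman_brown` and the starting reliability of .60 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Estimate reliability after changing test length by length_factor.

    reliability   -- reliability of the current test (0 to 1)
    length_factor -- new length divided by current length (2 = doubled)
    Assumes the added or removed items are comparable to the existing ones.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical 10-item test with a reliability of .60.
current = 0.60
doubled = spearman_brown(current, 2)        # lengthened to 20 items
one_item = spearman_brown(current, 1 / 10)  # cut down to a single item

print(f"20 items: {doubled:.2f}")
print(f"1 item:   {one_item:.2f}")
```

Under these assumptions, doubling the test raises the estimate from .60 to .75, while cutting it to a single item drops the estimate sharply. This is consistent with the reasoning behind the "R" and "U" answers above.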

REVIEW QUIZ 5.3

1. interobserver agreement

2. test-retest reliability

3. interscorer reliability

4. internal consistency reliability

5. equivalent-forms reliability

 

REVIEW QUIZ 5.4

Part 1

Set 1: b

Set 2: b

In both cases, the second item requires a greater inferential leap to conclude that it is evidence for the occurrence of the outcome variable. The first item in each pair offers more direct evidence.

 

Part 2

  1. Invalid. The test is unreliable and therefore invalid.

  2. Invalid. The tasks do not match the designated outcome variable.

  3. Valid. The selection of topics is designed to rule out a major source of bias.

  4. Invalid. Having someone else do the assignment is a different task from writing the assignment on one's own.

  5. Valid. This would increase reliability and hence validity - provided the true-false items are all appropriate for the outcome variable.

  6. Valid. Ruling out cheating increases the probability that the students will, in fact, respond to the correct tasks.

  7. Invalid. Intelligence and personality are different outcomes.

  8. Invalid. The observer's task was to observe the whole class. The teacher's criticism is that the observer has instead performed the different task of observing atypical students.

  9. Valid. The observer's job was to observe the atmosphere of the meeting, not to determine the causes for this atmosphere. The observer apparently assessed this information correctly.

REVIEW QUIZ 5.5

  1. d. She is interested in finding evidence that the characteristic (construct) being measured is really anxiety.

  2. b. This one should have been easy. The counselor is interested in finding evidence about the accuracy of predictions.

  3. c. The test designer wants evidence that the two tests really measure the same outcome.

  4. a. The dean is interested in finding evidence that the test samples the subject matter appropriately.

  5. b. The teacher is trying to predict who will have trouble with related materials the next year, and she wants evidence regarding whether predictions based on the comprehensive exam are likely to be accurate.

RESEARCH REPORT ANALYSIS

The research report in Appendix C uses two strategies to measure the ability of students to solve problems. Evaluate the reliability and validity of each strategy separately.

  1. How reliable were the unit pretests as a measure of problem-solving ability?

  2. How valid were the unit pretests as a measure of problem-solving ability?

  3. How reliable was the Watson-Glaser test as a measure of problem-solving ability?

  4. How valid was the Watson-Glaser test as a measure of problem-solving ability?

  5. How satisfactory was the overall data collection process with regard to reliability and validity?

 

ANSWERS:

  1. Information about the reliability of the unit pretests is not clearly stated. The fact that the pretest and posttest items were based on the same set of objectives would tend to make it likely that changes in performance from pretest to posttest reflected real gains rather than a change in the test, but this report does not focus on the pretest-posttest comparisons. The fact that the tests were administered under standardized conditions would also tend to make them more reliable. However, it would have been useful to have evidence from a reliability coefficient.

  2. The fact that the tests were taken from the manual suggests good content validity for the objectives of the unit, but we don't know how strongly the objectives of the unit are related to the concept of problem solving. In fact, it seems likely that the improved performance of students on subsequent tests could be a result of generalization of learning rather than problem solving. Because the first author of this textbook was the second author of this study, he knows that the tests did measure problem-solving ability, but this information is not clearly expressed in the report.

  3. The report itself contains no specific information regarding the reliability of the Watson-Glaser test, except to say that it was administered in accordance with the instructions in the manual. However, since this is a commercial test, interested readers could find information regarding reliability in several sources: the reference to Watson and Glaser cited in the text, the Mental Measurements Yearbook or a similar source, or the test manual.

  4. The report itself contains no specific information regarding the validity of the Watson-Glaser test. However, since this is a commercial test, interested readers could find information regarding validity in several sources: the reference to Watson and Glaser cited in the text, the Mental Measurements Yearbook or a similar source, or the test manual.

  5. If these tests were going to be used to diagnose the problem-solving ability of individual students, they would not be considered valid. However, since they were used to assess the performance of groups, they provide good evidence. (This distinction between individual diagnosis and group research is discussed on page 99 of this textbook.) By using two forms of data collection in tandem, the researchers enhanced validity. It would have been useful to use other methods as well. (In the original report, the researchers also reported using the Test of Integrated Process Skills and the Biological Sciences Curriculum Study Test.)
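
Answer 1 notes that a reliability coefficient would have been useful evidence. As a minimal sketch of what such a coefficient might look like, the following computes Cronbach's alpha, one common index of internal consistency. The report in Appendix C does not state that the researchers used this statistic. The item scores and the function name `cronbach_alpha` are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table of scores.

    item_scores -- list of rows, one row per student, one column per item.
    """
    if len(item_scores) < 2:
        raise ValueError("need scores from at least 2 students")
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least 2 items")
    if any(len(row) != n_items for row in item_scores):
        raise ValueError("every student needs a score on every item")

    # Variance of each item across students, then variance of total scores.
    item_columns = list(zip(*item_scores))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary")

    return (n_items / (n_items - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical pretest: five students, four items scored 1 (right) or 0 (wrong).
pretest = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(pretest)
print(f"Cronbach's alpha = {alpha:.2f}")
```

With real data, a coefficient like this would speak to the internal consistency of the unit pretests. It would not settle the validity question raised in answer 2.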