Reliability and Validity
Chapter 5 is designed to help you reach the following objectives:
1. Define reliability.
(Review question 1)(Frames 1-4 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 88-90)
2. Identify examples of reliable and unreliable data collection processes.
(Review questions 2-5)(Frames 5-23 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 90-91)
3. Identify factors which contribute to the unreliability of data collection processes.
(Review question 6)(Frames 24 & 26 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 90-91)
4. Identify effective ways to increase the reliability of data collection processes.
(Review question 7)(Frames 25 & 26 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 92-93)
5. Identify appropriate statistical procedures for estimating the reliability of data collection processes, and identify the situations in which each of these procedures would be appropriately employed.
(Review questions 8 & 9)(Frames 27-60 of the Semiprogrammed Reliability Unit)
Reliability and Validity Exercises A and B
(Textbook emphasis pp. 94-98)
6. Identify the weaknesses and limitations of these statistical procedures for estimating reliability.
(Review question 10)(Frames 62-63 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 96-99)
7. Describe how to use the concept of reliability in selecting and improving techniques for measuring research variables.
(Review questions 6-11)(Frames 1-62 of the Semiprogrammed Reliability Unit)
(Textbook emphasis pp. 98-99)
8. Describe the use of the standard error of measurement in interpreting test scores.
(Frame 61 of the Semiprogrammed Reliability Unit)(Textbook emphasis p. 98)
9. Define validity.
(Review question 12)(Frame 1 of the Semiprogrammed Validity Unit)
(Textbook emphasis pp. 99-100)
10. Identify factors that introduce invalidity into data collection processes.
(Review questions 13-20)(Frames 2-16 of the Semiprogrammed Validity Unit)
(Textbook emphasis pp. 48-49)
11. Describe the process for establishing the validity of classroom data collection processes.
(Review questions 13-20)(Frames 17-32 of the Semiprogrammed Validity Unit)
(Textbook emphasis pp. 101-108)
12. Identify guidelines for enhancing the validity of data collection processes.
(Review questions 13-20)(Frames 17-32 of the Semiprogrammed Validity Unit)
(Textbook emphasis pp. 101-108)
13. Describe the role of content validity, criterion-related validity, and construct validity in measuring research variables.
(Review questions 21-24)(Frames 33-44 of the Semiprogrammed Validity Unit)
Reliability and Validity Exercises A and B
(Textbook emphasis pp. 108-111)
1. Which of the following is the best definition of reliability?
a. Reliability refers to whether the data collection process measures what it is supposed to measure.
b. Reliability refers to the degree to which the data collection process covers the entire scope of the content it is supposed to cover.
c. Reliability refers to whether or not the data collection process is appropriate for the people to whom it will be administered.
d. Reliability refers to the consistency with which the data collection process measures whatever it measures.
2. Mary took a test on which she received a score of 75. The teacher's house burned down, and the tests were destroyed. Mary took the same test over again the next day and again received a score of 75.
a. There is evidence to suggest that Mary's test was reliable.
b. There is evidence to suggest that Mary's test was unreliable.
c. There is no evidence upon which to base even a tentative judgment about the reliability of the test.
3. Marvin's final exam was scored by his teacher, who gave him a 64. This would have caused him to fail the course. He protested to the school officials, and two other teachers scored the same test. One of them gave him a 75, and the other an 85.
a. There is evidence that Marvin's test was scored reliably.
b. There is evidence that Marvin's test was scored unreliably.
c. There is no evidence upon which to base even a tentative judgment about the reliability of the scoring process.
4. Ella May's teacher rated her behavior as indicating that she was quite popular. The teacher's aide assigned to the same classroom rated Ella May as being uncooperative when given assignments.
a. There is evidence that the rating process was reliable.
b. There is evidence that the rating process was unreliable.
c. There is no evidence upon which to base even a tentative judgment about the reliability of the rating process.
5. Donald was rated by his teacher as being unable to perform the mathematical skills necessary for the next math unit. Because of this low rating, Donald received a special programmed unit to help him review his skills. A day later, after completing the programmed materials, Donald was rated by the same teacher as able to perform the skills necessary for the next unit.
a. There is evidence that the teacher's rating system was reliable.
b. There is evidence that the rating system was unreliable.
c. There is no evidence upon which to base even a tentative judgment about the reliability of the rating process.
Questions 6 through 8 go together.
6. Miss Curtis was planning to teach a unit on English grammar to her ninth graders. She planned to give one test as a pretest, and another as a posttest. Then she planned to compare the two sets of scores to determine whether or not the students had profited from the unit. Examine the following list of statements and indicate which ones suggest that her tests lacked reliability. (Choose more than one answer.)
a. Each form of the test contained 50 items, worth two points apiece.
b. When the tests were scored by two separate persons, the results were exactly the same.
c. Thirty-five of the items on each of the alternate forms of the test were answered correctly by everyone.
d. Rather than giving highly structured instructions, she allowed the students to ask questions as they went along, and provided information as it was requested.
e. The average score on the posttest was substantially higher than the average score on the pretest.
7. After Miss Curtis had administered both forms of her English grammar test, she decided to revise it. By doing this, she hoped to make it a more reliable test the next year. Examine the following list of statements and indicate which ones would be likely to increase the reliability of the test. (Choose more than one.)
a. She wrote out a detailed set of instructions based on the questions which had arisen this time, and she attached these instructions to the tests.
b. She increased the length of the test from 50 to 75 items on each form.
c. She eliminated several of the items which everyone had answered correctly, because she found that these had included irrelevant clues which enabled the students to get them right. She replaced these with items that she felt contained no irrelevant clues.
d. She eliminated each of the items which had been missed by 40-60% of the students on the pretest and by 70-90% of the students on the posttest.
e. She decided to base many of her new items on contemporary music, since nearly all the students seemed to be interested in such music.
8. Miss Curtis decides to compute statistical reliability to help determine the degree of reliability her tests possess. The tests are multiple choice/true-false in format. She considers them to be criterion-referenced rather than norm-referenced tests. Her main concern is that the decisions she would make on the basis of the results would be based on actual abilities of the students rather than on unique aspects of the testing situation. She is also concerned that any differences between the pretest and the posttest should indicate real differences, rather than merely differences between the two tests. Which of the following types of statistical reliability would help Miss Curtis make useful decisions about her tests? (Choose as many as necessary.)
a. Test-retest reliability.
b. Equivalent-forms reliability.
c. Internal consistency reliability.
d. Interscorer reliability.
e. Interobserver agreement.
9. Which of the following types of statistical reliability require that the same test be administered to the same persons two times? (Choose as many as necessary.)
a. Test-retest reliability.
b. Equivalent-forms reliability.
c. Internal consistency reliability.
d. Interscorer reliability.
e. Interobserver agreement.
10. Which of the following is a major weakness of the statistical techniques for estimating reliability? (Choose only one.)
a. When respondents give different answers because of chance factors such as health problems or luck, this lowers the statistical estimate of reliability.
b. When a large number of persons master a skill and therefore get the answer right, this lowers statistical reliability.
c. Changes in the directions as they are given or as they are perceived by the respondents will lower the statistical reliability.
d. Essay tests receive lower estimates of statistical reliability than more objective tests, because there are more likely to be subjective factors influencing the scoring process.
11. If a teacher has access to two tests which attempt to measure the same research variable, she should almost always choose the test which is the more reliable (provided they take about the same amount of time to administer and score).
a. True.
b. False.
12. Which of the following is the best definition of validity?
a. Validity deals with whether the data collection process actually measures what it purports to be measuring.
b. Validity deals with whether the data collection process is designed at the appropriate level of difficulty.
c. Validity deals with whether the data collection process is consistent in measuring whatever it measures.
d. Validity deals with the question of how subjectivity can best be controlled in the scoring process.
e. Validity deals with the standardization of procedures for administering, scoring, and interpreting data collection processes.
13. All but one of the following are factors which directly influence the validity of a data collection process. Choose the exception.
a. The logical appropriateness of the operational definition.
b. The match between the tasks in the data collection process and the operational definition.
c. The difficulty of the data collection process.
d. The reliability of the data collection process.
14. Mr. Gomez wants to help his students become familiar with educational television. He defines "familiarity" with educational television as meaning that the students will be able to name several of the shows on the local educational television station. To measure this research variable, he asks the students one day to write down the names of as many shows as they can think of which were on the local educational channel the night before. He then concludes that the students who can name more shows are more familiar with educational television than those who name few or no shows accurately. What is the most obvious reason why this measurement strategy is likely to be invalid?
a. It is likely to be unreliable.
b. The task doesn't match the operational definition.
c. The operational definition is logically inappropriate.
d. The task requires that the students be familiar with the local educational television station.
15. Miss Chesterton is teaching her students to use English grammar correctly. She operationally defines using English grammar correctly as meaning that they will follow all the rules of normal English grammar in the compositions they write. On the exam, she determines how well the students have met this goal by requiring them to diagram twenty sentences of varying levels of complexity. What is the most obvious reason why this measurement strategy is likely to be invalid?
a. It is likely to be unreliable.
b. The task doesn't match the operational definition.
c. The operational definition is logically inappropriate.
d. The task requires that the students be capable of following the rules of English grammar in their writing.
16. Professor Carter wants her students to develop a genuine appreciation of Shakespeare's plays. She operationally defines this to mean that the students will be able to recall lines of the plays from memory. She measures this by giving the students several important scenes with lines omitted and having them fill in the missing lines. What is the most obvious reason why this measurement strategy is likely to be invalid?
a. It is likely to be unreliable.
b. The task doesn't match the operational definition.
c. The operational definition is logically inappropriate.
d. The students may not be able to recall the lines.
Questions 17 through 20 are based on the following information.
Mrs. Green wants to measure Kathy's reading comprehension by having her read a story and then relate it to her own experience. Examine each of the following statements (assuming they are all true), and indicate whether each would or would not weaken the validity of Mrs. Green's testing strategy.
17. Even outside reading situations, Kathy has a great deal of trouble relating any stories at all to her personal life.
a. Weakens the validity of the data collection process.
b. Does not weaken the validity of the data collection process.
18. Kathy has trouble understanding the passage.
a. Weakens the validity of the data collection process.
b. Does not weaken the validity of the data collection process.
19. Kathy becomes anxious because she has to take the test aloud in front of the class, and anxiety makes her perform poorly.
a. Weakens the validity of the data collection process.
b. Does not weaken the validity of the data collection process.
20. The passage is extremely short.
a. Weakens the validity of the data collection process.
b. Does not weaken the validity of the data collection process.
21. Ms. Monroe has developed a questionnaire to measure her students' attitudes toward the practicum in her nursing training program. She is concerned about whether the questions apply proportionately to all the aspects of the program. What tool for estimating aspects of validity would help Ms. Monroe make a sound judgment in this regard?
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
d. None of the above.
22. Mr. Shepard has developed a criterion-referenced test on basic mathematical abilities. He wants to be sure it gives appropriate coverage to all the topics covered during the semester. What tool for estimating aspects of validity would help Mr. Shepard make a sound judgment in this regard?
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
d. None of the above.
23. Professor DuParc has developed an observational strategy to measure a person's "independence from peer pressure." What tool for estimating aspects of validity would help Professor DuParc to demonstrate that his strategy really measures "independence from peer pressure" rather than some other characteristic?
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
d. None of the above.
24. Mrs. Masters has been admitting persons into her Advanced Composition course on the basis of their performance in Introductory English. She decides that she could make better selections if she would have the applicants take a special test, and then successful candidates would be those who scored highest on the test. What tool for estimating aspects of validity would help Mrs. Masters demonstrate that her new procedure is better than the old one?
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
d. None of the above.
1. Confusion of test-retest reliability with a pretest-posttest measurement strategy. If data collection takes place, then something independent of the data collection process happens to induce changes in the persons being measured, and then data collection occurs again, this is not an example of test-retest reliability. For example, if students score badly on a spelling test because some of them have not studied, then all of them study, and then they take another spelling test, we would not necessarily expect the scores to be highly correlated. Test-retest reliability refers to the idea that the results of data collection processes are similar when they should be similar. The results should be similar, of course, when data are collected from the same group of persons when these persons have not changed with regard to the characteristic being measured.
2. Confusion of equivalent-forms reliability with criterion-related validity. The similarity between these two techniques is that in each case the results of an initial data collection process are correlated with the results of another process. The difference is that with equivalent-forms reliability, the results are correlated with another administration of the same data collection process. For example, a researcher could correlate the results of students taking Form A and Form B of the same test in order to ascertain that the tests are essentially equivalent forms of the same test. On the other hand, to compute criterion-related validity the researcher would correlate the results of one data collection process with the results of an entirely different process that purports to collect data regarding the same (or a similar) outcome. For example, a person developing a short form of an IQ test might administer to one group of students both her test and another test that is generally accepted to measure IQ and then compute the correlation coefficient between the two sets of scores.
In some cases the distinction between equivalent-forms reliability and criterion-related validity becomes blurred. For example, assume that a student who plans to take the SAT purchases a book to help prepare for that test. At the end of the book are several practice tests. The instructions say that the student can estimate how well he will do on the actual SAT by taking one or more of the practice tests. If the author of this study guide claimed a correlation of .80 between performance on the SAT and performance on the practice test, does this coefficient represent reliability or validity? This is a hard question, and since it is likely to appear as the final question on Double Jeopardy, I am not allowed to reveal the answer. I personally doubt the value of spending a great deal of time on such subtle distinctions. In most cases, the differences between equivalent-forms reliability and concurrent validity are more obvious, and distinctions can be made by using the guidelines described in the preceding paragraph.
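The guidelines in item 2 can be made concrete with a minimal Python sketch. The scores below are invented for five students, and the helper `pearson_r` is written here for illustration; neither comes from the textbook. The sketch is meant to show that the arithmetic is identical in both cases. What differs is whether the second set of scores comes from another form of the same test or from an entirely different data collection process.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five students (illustrative only).
short_form_a = [12, 15, 9, 20, 17]       # Form A of a new short test
short_form_b = [13, 14, 10, 19, 18]      # Form B of the same short test
accepted_test = [98, 105, 92, 121, 110]  # a different, generally accepted test

# Same test, another form: an estimate of equivalent-forms reliability.
equivalent_forms_reliability = pearson_r(short_form_a, short_form_b)

# Entirely different process, same outcome: an estimate of criterion-related validity.
criterion_related_validity = pearson_r(short_form_a, accepted_test)

print(f"equivalent-forms reliability: {equivalent_forms_reliability:.2f}")
print(f"criterion-related validity:   {criterion_related_validity:.2f}")
```

In the SAT practice-test example, the same function would be called either way; the label attached to the result depends on whether the practice test is regarded as another form of the SAT or as a separate process.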
3. Confusion regarding what a correlation coefficient is. It is necessary to cover topics in some order, and we have chosen to discuss reliability in Chapter 5 and correlation coefficients in Chapter 13. We think this makes sense. However, helpful reviewers and editors have pointed out to us that it is improper to define concepts in terms of other concepts that are not familiar to students. The problem can be solved by telling students (as we do in the text) that correlation coefficients have an absolute value that ranges between .00 and 1.00. A reliability coefficient of 1.00 would indicate perfect reliability (but this never occurs in educational measurement), and a coefficient of .00 would indicate a complete lack of reliability. Coefficients closer to 1.00 indicate a higher degree of reliability than those closer to .00.
4. Confusion over how high a reliability or validity coefficient should be in order to be considered good. Although students would like to know that .90 is "excellent" and .50 is "getting pretty bad," it is generally not reasonable to give any such answer. The limits on the magnitude of these coefficients arise from practical factors in the data collection process; and these vary, depending on what is being measured. The "goodness" of a correlation coefficient depends on how it compares to what could be expected. For example, if you were purchasing a standardized test that came with two forms for pretest and posttest, you would want to know (among other things) both the equivalent-forms reliability coefficient and how this coefficient compares to the reliability of other tests that measure comparable outcomes.
A second answer to this question is more technical. Measurement processes with low reliabilities allow a larger number of extraneous factors to go uncontrolled. It is possible to give a mathematical estimate of the amounts of variation that are controlled and uncontrolled. The coefficient of determination can be used for this purpose. (This coefficient is not discussed in the textbook.) It is essentially the square of the internal consistency coefficient - or the product, if two separate data collection processes are involved. Hence, if a counselor wanted to use a test with a .70 internal consistency to predict success in college (assume that college success can also be measured with .70 internal consistency), we would multiply .70 x .70 and get a coefficient of determination of .49. This means that the predictor test can account for 49% of the variation in college test performance - the other 51% would be unaccounted for, if the counselor used this as the sole means of making the prediction. The coefficient of determination makes it possible, therefore, to give a more precise answer to "how good is this reliability coefficient?" A correlation of .50 (which becomes .25 when squared) could legitimately be considered "not very good," even though it may also be "the best we can do." What this means is that if .50 is the best you can do, you should at least be aware of the inadequacies of the measurement process.
The information in the preceding paragraph goes well beyond the current textbook. Students with a statistics background may understand it easily. Other students can be satisfied with the information given in the first paragraph.
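For students who want to check the arithmetic, the following short Python sketch reproduces the two figures given above (.70 x .70 and .50 squared). It takes the rule as stated in this guide at face value: square one coefficient, or multiply two. The function name `coefficient_of_determination` is chosen here for illustration only.

```python
def coefficient_of_determination(first, second=None):
    """Square one coefficient, or multiply two, following the rule in this guide."""
    if second is None:
        second = first
    return first * second

# The counselor example: predictor and criterion each at .70.
accounted_for = coefficient_of_determination(0.70, 0.70)
unaccounted_for = 1 - accounted_for

# The single-coefficient example: a correlation of .50.
half = coefficient_of_determination(0.50)

print(f"accounted for:   {accounted_for:.0%}")
print(f"unaccounted for: {unaccounted_for:.0%}")
print(f".50 squared:     {half:.2f}")
```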
5. Confusion of interscorer reliability with interobserver agreement. The similarity between these two is that they both deal with the degree of consistency among different persons scoring data on the same performance. The major difference is that interscorer reliability is calculated by a correlation coefficient, while interobserver agreement is calculated by a percentage of agreement. If the measurement process results in scores or ratings of the individuals being measured (e.g., a range of 1 to 5 or of 0 to 100), then interscorer reliability is appropriate. If the data collection process consists of a simple yes/no decision to decide whether an outcome has occurred, then interobserver agreement is appropriate.
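The difference between the two indices in item 5 can be sketched in a few lines of Python. The ratings and yes/no observations are invented, and the helpers `pearson_r` and `percent_agreement` are written here for illustration. Treat this as a sketch of the distinction, not as the textbook's own computing procedure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two scorers' ratings of the same performances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def percent_agreement(obs_1, obs_2):
    """Percentage of occasions on which two observers made the same yes/no decision."""
    matches = sum(a == b for a, b in zip(obs_1, obs_2))
    return 100 * matches / len(obs_1)

# Hypothetical ratings on a 1-to-5 scale: interscorer reliability applies.
scorer_1 = [4, 2, 5, 3, 1, 4]
scorer_2 = [5, 2, 4, 3, 1, 3]
interscorer_reliability = pearson_r(scorer_1, scorer_2)

# Hypothetical yes/no records of whether a behavior occurred: interobserver agreement applies.
observer_1 = [True, False, True, True, False, True, False, True]
observer_2 = [True, False, False, True, False, True, True, True]
interobserver_agreement = percent_agreement(observer_1, observer_2)

print(f"interscorer reliability: {interscorer_reliability:.2f}")
print(f"interobserver agreement: {interobserver_agreement:.0f}%")
```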
6. Confusion regarding how cultural bias relates to the factors that cause invalidity. When cultural bias interferes with the validity of a data collection process, it is almost always because it causes a mismatch between the operational definition and the task that the respondent performs during the data collection process. For example, if a mathematics test expects children to "solve word problems involving division," then there would be a mismatch if a child had to "first translate the problems into a language she can understand and then solve them using long division." Likewise, if the test made frequent references to topics of interest to boys but not to girls, then boys would be "solving division problems regarding familiar topics," whereas girls would be "solving division problems regarding unfamiliar topics."
SEMIPROGRAMMED UNIT ON RELIABILITY
1. Reliability means consistency. When we use data collection processes to measure a research variable, we want the judgments which we base on these processes to be consistent. Applied to educational settings, this means that if we use a test or other data collection process to evaluate a person's performance with regard to a research variable, our evaluation of that person's performance should be the same on different occasions - unless, of course, something happened between the occasions which would legitimately cause our evaluation to be different on the second occasion. A data collection process lacks reliability to the extent that performance on it is influenced by irrelevant factors which are likely to occur only during unique administrations of the data collection process. To the extent that we can say the data collection process is free from such unique and extraneous influences, we can say that the data collection process possesses reliability.
2. Mrs. Fox was in a real-estate training program. She went before an examiner who administered to her a test which would enable her to be certified as a broker. She failed the test. This angered her, and she went down the street to a different examiner and immediately took another form of the same test. She passed the test. Would you say that Mrs. Fox was reliably evaluated? The answer, of course, is that this evaluation process was not reliable. We don't know which test gave the more accurate results; but it seems obvious that Mrs. Fox's knowledge of real estate had not changed between the two testing occasions, and so her scores should have been the same on both occasions. Since she was judged competent on one occasion and incompetent on the other, the evaluation process was not consistent. In other words, the test was unreliable. If this is not clear to you, try rereading 1 and 2 before continuing.
3. Mr. Wolf was training to be a paramedic. After completing the training course, he took the test which would certify him as a paramedic. His score was just below the cut-off point. This disappointed Mr. Wolf, and so he spent the entire next week studying and reviewing the material covered in the course. At the end of the week, he took a different form of the same test. This time he passed with flying colors. Would you say that Mr. Wolf was reliably evaluated? The correct answer is that we don't know for sure whether the test was reliable. However, this case differs from Mrs. Fox's case because in this situation we don't have specific evidence to suggest that it was an unreliable test. Mrs. Fox scored differently on two occasions when she should have scored the same. However, when Mr. Wolf scored differently on two occasions, there was a perfectly logical basis for his difference in performance. If you do not see the difference between Mrs. Fox's and Mr. Wolf's situations, you should reexamine 1 through 3 or seek additional help until this becomes clear.
4. The examples in 2 and 3 demonstrate the need for reliability. If a person scores differently on two testing occasions, we would like to be able to assert that the reason for this difference is a change in the person's capability - a change in the research variable - rather than a simple inconsistency in the data collection process. Likewise, if a person scores the same on two occasions, we would like to be able to say that the person's capability was really the same on both occasions. With an unreliable test, it would be possible that a person whose capability had changed would still score the same, because inconsistencies in the test had produced fluctuations in scores to mask the changes. The same logic applies to comparisons between two or more persons. We would like to be able to assert that similarities and differences between people's performance on a test are indications of real similarities and differences between these people, not merely the result of inconsistencies in the data collection process.
5. Examine each of the following anecdotes and determine whether the measurement strategy appears to be reliable or unreliable. Remember: A data collection process is reliable if the results are the same when they should be the same. A data collection process is unreliable if the results are different when they should be the same. If there is no logical reason to expect that the results should be the same, then further information is needed for determining reliability.
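The rule in Frame 5 can also be written out as a small decision function. This is only a sketch; the function name `judge_reliability` and the wording of the returned labels are chosen here for illustration. The two calls apply the rule to the cases already discussed in Frames 2 and 3.

```python
def judge_reliability(should_be_same, were_same):
    """Apply the three-way rule from Frame 5 to one pair of results."""
    if not should_be_same:
        # No logical reason to expect the same result, so no judgment can be made yet.
        return "further information needed"
    return "reliable" if were_same else "unreliable"

# Mrs. Fox (Frame 2): her knowledge had not changed, yet the results differed.
mrs_fox = judge_reliability(should_be_same=True, were_same=False)

# Mr. Wolf (Frame 3): a week of study gives a logical reason for a different result.
mr_wolf = judge_reliability(should_be_same=False, were_same=False)

print("Mrs. Fox:", mrs_fox)
print("Mr. Wolf:", mr_wolf)
```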
6. Mr. Nelson wanted to find out how rapidly his son could read. He gave him a newspaper and told him to start reading. He then had the boy circle the word he was reading when he called "time" at the end of five minutes. He did this for three different sections of the newspaper, and each time the boy was evaluated as reading at about 300 words per minute. What is your judgment about the apparent reliability of this procedure to estimate the boy's reading speed?
a. It seems to be reliable. (Go to 7.)
b. It seems to be unreliable. (Go to 8.)
c. There is no basis for making a judgment about reliability. (Go to 9.)
7. Right. The scores were approximately the same on occasions when they should have been approximately the same. This is what is meant by reliability. (Go to 10.)
8. Wrong. There was no apparent inconsistency in the evaluation process. The boy was evaluated approximately the same on all three occasions. Reexamine 6 and see if you can see the logic of the correct answer before proceeding to 10.
9. Wrong. There is a basis for a judgment about reliability. We would expect the scores to be the same if the boy's rate of reading were estimated on three near-simultaneous occasions. This is exactly what his father discovered. Reexamine 6 and see if you can see the logic of the correct answer before going on to 10.
10. Ms. Peters submitted an article to a professional journal. The editor of the journal routed the article to three reviewers who were asked to evaluate it using the same set of criteria. One reviewer recommended publishing the article, the second recommended several revisions before it would be acceptable, and the third rejected the article. What is your opinion of the reliability of this review process?
a. It seems to be reliable. (Go to 11.)
b. It seems to be unreliable. (Go to 12.)
c. There is no basis for making a judgment about the reliability of the review process. (Go to 13.)
11. Wrong. The evaluations of the article's quality should have been similar, since the article was identical and the reviewers were all using the same guidelines. The fact that they were so divergent suggests unreliability. Reexamine 10, and then go on to 14.
12. Right. We would expect all three reviewers to be similar in their evaluations, since the article was identical and the reviewers were all using the same set of guidelines. Go to 14.
13. Wrong. There is a perfectly logical reason to expect the reviewers to have similar evaluations of the article (see 12), and therefore there is a basis for a judgment about reliability. Your answer would have been correct if there had been a logical basis for the reviewers to respond differently; for example, if one had been instructed to evaluate the article as a content-matter specialist, another as a statistician, and the third as a literary stylist. In such a case, the instructions for each reviewer would have been different - each would have been evaluating something different, and we would expect different evaluations. Reexamine 10 before going on to 14.
14. Denise is taking a physical education class. As part of the class, she shoots a series of 10 free throws on the basketball court to demonstrate her shooting ability. She makes 7 out of 10 shots. Her teacher is surprised and has her do it over again. This time she makes only 2 out of 10. The teacher concludes that Denise was just lucky the first time and that 20% is more representative of her overall ability. What is your opinion of the reliability of this assessment?
a. It seems to be reliable. (Go to 15.)
b. It seems to be unreliable. (Go to 16.)
c. There is no basis for a judgment about reliability. (Go to 17.)
15. Wrong. There is no reasonable way you could consider this evaluation process to be reliable. If you sincerely believe that this is the right answer, you should seek additional help before going any further. You have no understanding whatsoever of the principles being discussed. (Stop.)
16. Right. We would expect the result to be the same on both occasions. (There's no reason to expect a change in skill.) It is likely that extraneous factors - such as nervousness, variations in attentiveness, response to pressure, or just plain luck - influenced her performance on one or both of the testing occasions. (Go to 18.)
17. Wrong. There is a basis for a judgment (see 16). It is not likely that her "shooting ability" changed between the two occasions. If the teacher had ascertained some spurious reason for Denise's good performance on the first occasion (for example, if Denise stood too close to the basket), then her score would have been expected to be different on the second occasion. (Go to 18.)
18. In the preceding example (Frame 14) which of the performances by Denise is the best estimate of her ability?
a. 70% accuracy.
b. 20% accuracy.
c. Neither of the above is likely to be a good estimate.
The correct answer is (c) - that we cannot say that either is accurate. That's the whole point of saying that the test was unreliable. The instability of the data collection process makes it impossible to make a judgment. The best estimate would be to say that she averages about 45% accuracy. This would be obtained by averaging the 70% and 20%. It seems that Denise's performance is influenced by luck; and by basing our judgment on 20 rather than 10 shots, we would lessen the impact of chance hits or misses.
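The averaging in this frame can be sketched in a few lines of Python. The function names are hypothetical. The standard-error calculation assumes that each shot is an independent trial with the same chance of success, which is an assumption added here and not something the example states.

```python
from math import sqrt

def pooled_accuracy(trials):
    """Proportion of hits across all trials; each trial is (hits, attempts)."""
    total_hits = sum(hits for hits, _ in trials)
    total_attempts = sum(attempts for _, attempts in trials)
    return total_hits / total_attempts

def proportion_standard_error(p, n):
    """Standard error of a proportion p based on n independent attempts."""
    return sqrt(p * (1 - p) / n)

# Denise's two testing occasions: 7 of 10, then 2 of 10.
denise_trials = [(7, 10), (2, 10)]
pooled = pooled_accuracy(denise_trials)         # 9 hits in 20 shots

# Chance error expected with 10 shots versus 20 shots (illustrative only).
se_10_shots = proportion_standard_error(pooled, 10)
se_20_shots = proportion_standard_error(pooled, 20)
```

Under these assumptions the expected chance error is smaller with 20 shots than with 10, which is the sense in which the longer sample lessens the impact of lucky hits or misses.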
19. Fred misses 2 out of 10 words on an oral spelling test. His friends make fun of him because the words were rather easy. Fred asks to take another shot at the test, but he is so nervous that he misses 5 out of 10. He goes home and studies the words carefully. The next day he gets all 10 right. What is your opinion of the reliability of this data collection process for estimating Fred's spelling ability?
a. It seems to be reliable. (Go to 20.)
b. It seems to be unreliable. (Go to 21.)
c. There is no basis for judgment about reliability. (Go to 22.)
20. Wrong. Although it sounds easy to devise reliable spelling tests, there is no evidence that this was the case for Fred. Fred's scores were quite divergent, and so there is no sound reason to say that his ability was consistently estimated. (Try 19 again.)
21. Wrong. Although the scores were divergent, there seems to be a possible explanation for the difference. It's hard to unscramble the evidence here in order to provide an explanation for the reason why Fred performed so differently on the three occasions. (Try 19 again.)
22. Right. It's perfectly logical that even on a reliable test Fred's performance would differ on three such diverse occasions. We would need further information to determine how reliable the spelling test really was as a basis for evaluating Fred's spelling ability. (Go to 23.)
23. To summarize: A data collection process is designed to measure a specific research variable and only that research variable. The research variable is usually a characteristic such as a physical ability, a cognitive skill, or an attitude; and we want to measure this condition or outcome without the data collection process being contaminated by extraneous factors which are related to the measurement setting or variations in personal characteristics rather than to the outcome being measured. To the extent that a data collection process is free of such extraneous influences, the results (and decisions based on the results) will be consistent and stable. This is what is meant by a reliable data collection process.
24. The several factors that introduce unreliability into a data collection process are listed below. Each of these factors can influence the outcome of a data collection process in such a way that the scores obtained by the researcher reflect something besides the respondent's genuine performance on the test. To the extent that this is true, the results will become inconsistent.
a. Faulty items.
b. Excessively difficult elements in the data collection process.
c. Excessively easy elements in the data collection process.
d. An inadequate sampling of behavior.
e. Widely dissimilar tasks or items comprising the data collection process.
f. Unique temporary characteristics of the respondents.
g. Faulty administration of the data collection process.
h. Faulty scoring procedures.
The above items cannot be dealt with in this workbook chapter without making it too lengthy. They are explained in detail in the textbook on pages 90 to 92. Each of the above factors can introduce an additional factor into the data collection process. These additional factors are likely to change from occasion to occasion (whereas the genuine research variables are more stable), and hence they introduce inconsistency. By eliminating the above factors as thoroughly as possible, we can make our data collection processes more reliable.
25. Reliability can be increased by observing the following guidelines.
a. Devise technically correct, unambiguous data collection strategies.
b. Standardize the administration procedures for the data collection process.
c. Standardize the scoring procedures.
d. Be alert for respondent irregularities.
e. Make the data collection process long enough to include a good sample of items.
f. Make each element on the data collection process measure the same characteristic or set of characteristics.
g. Be sure that the data collection process is neither too easy nor too difficult.
What the above guidelines really say is don't introduce the factors listed in 24. These guidelines are discussed in detail on pages 92 to 94 of the textbook.
26. Important note: Reliability is not the most important characteristic of a measurement strategy. Validity is the most important characteristic. If following the above guidelines would lessen the validity of a measurement strategy, then it would often be advisable to violate the guidelines and work toward greater validity. The relationship between reliability and validity is discussed later in this chapter.
27. By following the guidelines described in 25 and by avoiding the factors listed in 24, it is possible to maintain a high degree of reliability in our data collection processes. This can be done even without any recourse to statistical procedures for estimating reliability. The statistical procedures for estimating reliability, which will be described next, are merely tools for estimating how successfully a researcher has followed the guidelines and eliminated the extraneous factors. Even in the many situations in which teachers/researchers are unwilling or unable to compute statistical reliability, the principles and guidelines discussed up to this point will help make a measurement strategy more reliable. All the statistical procedures can do is help ascertain the success of the strategies for eliminating inconsistency.
28. The statistical methods of estimating reliability are summarized in Table 5.1, which is taken from the textbook.
(Insert Table 5.1 about here.)(Identical to Table 5.1 of the Textbook)
29. An instructor might be concerned that his data collection processes are inconsistent in the sense that the scores might fluctuate widely on different measurement occasions. A judgment he makes one day might therefore have been different if he had made it after collecting data on a different day. The instructor can rule out this possibility by collecting the same data two times from the same group of people. If the results are approximately the same on the two occasions, then the data collection process is considered reliable for that group of people. To be more specific, a correlation coefficient (Chapter 13) is computed between the scores from the first occasion and the second occasion. A higher test-retest reliability coefficient indicates a more reliable data collection process than a lower coefficient.
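As a rough illustration of the computation described in this frame, the following Python sketch computes a Pearson correlation between two administrations of the same test. The scores and the function name are invented for the example. The same calculation would apply to the other correlation-based procedures discussed later, with the second list holding scores from an alternate form or from a second scorer.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: this divides by zero if either list has no variation at all.
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same five students on two occasions.
first_occasion = [35, 45, 55, 65, 75]
second_occasion = [38, 44, 57, 63, 78]

test_retest_r = pearson_r(first_occasion, second_occasion)
```

With these made-up scores the students keep nearly the same rank order and spacing on both occasions, so the coefficient comes out close to 1.00.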
30. The strategy described in 29 is emphatically not the same as giving a pretest prior to instruction and then a posttest after instruction. Although such a pretest-posttest strategy would be useful for other reasons, it would be useless for estimating reliability (for reasons, see frames 1 to 3). The test has to be administered twice to the same group of persons on occasions when their scores should logically be the same. There are three important components of this test-retest reliability strategy: it has to be administered (a) twice, (b) to the same group of persons, (c) on occasions when their scores should logically be the same.
31. Many inservice educators find it inconvenient to administer the same data collection process twice to the same persons. Many respondents would likewise feel upset about receiving the same data collection process twice. To the extent that such problems are insurmountable, then the test-retest reliability coefficient cannot be computed. By following the guidelines suggested in 24 and 25, teachers/researchers can establish a high degree of probability that they have developed a data collection process with good stability between different testing occasions. Likewise, data collection processes which possess other types of statistical reliability (see Frames 43-46) are often very likely to also possess good test-retest reliability. However, the only way to assure test-retest reliability is to follow the guidelines stated in 30. Researchers publishing formal reports are more likely than classroom teachers to want to compute test-retest reliability to determine the extent to which their efforts at achieving this type of stability of measurement are successful.
32. Which of the following researchers needs test-retest reliability?
a. Mr. Rosencrantz wonders whether all the items on his test tend to consistently focus on the same objective. (Go to 33.)
b. Mrs. Gildenstern wonders whether ambiguity in her instructions will cause people to respond differently depending on how they react to the instructions. (Go to 34.)
33. Wrong. Mr. Rosencrantz is concerned about the internal consistency of his test. This concept will be discussed later. (Try 32 again.)
34. Right. It's likely that ambiguity in instructions will lead respondents to answer differently on different occasions, since the ambiguities would present themselves differently at different times. A test-retest reliability coefficient would help Mrs. Gildenstern determine how serious a problem this would be.
35. Which of the following researchers needs test-retest reliability?
a. Miss Fitzgerald is planning to use an attitude-toward-the-library questionnaire at the beginning and end of her library unit. She plans to use the exact same questionnaire on both occasions. At present she is developing and field testing it. She is concerned that any differences in results between the two occasions should reflect real differences. (Go to 36.)
b. Mr. O'Neill is planning to use an attitude-toward-the-police-department questionnaire to evaluate children's attitudes both before and after an Officer Friendly program. He has two forms of the questionnaire. At the present time he is developing and field testing his questionnaire. He is concerned that any differences between the pretest form and the posttest form should reflect real differences. (Go to 37.)
36. Right. Miss Fitzgerald wants to make sure that the results of her questionnaire will not be influenced by chance occurrences accompanying either of the testing occasions. Establishing test-retest reliability during the developmental stage would be extremely useful. (Go to 38.)
37. Wrong. This example calls for equivalent-forms reliability. Even though he plans to administer the tests on separate occasions, Mr. O'Neill's main concern is to determine that the two forms of the test are actually equivalent. If he wanted to do a somewhat more thorough job, he could compute a coefficient to establish test-retest with equivalent-forms reliability. (Try 35 again.)
38. Equivalent-forms reliability is useful to educators when they plan to use two or more forms of the same data collection process and are concerned about whether the judgments they would base on the various forms would really be equivalent judgments. The two principal situations in which this type of reliability coefficient is helpful are (a) when one form of a data collection process will be used as a pretest and another as a posttest, and (b) when two or more groups are going to take the same data collection process but it's necessary to vary the exact content. The second situation occurs, for example, when a teacher has two sections of an English course during the same day. If she gives an exam to one section and lets them take it home with them after turning in their answers, then she can't really give the exact same test to the second section, since this would give an unfair advantage to students who had friends in the first section. Therefore she needs parallel forms of the same test. She could establish the statistical reliability of these two forms by administering both forms to a single class. (This single class could be either one of the actual classes or a "pilot" class.) Then she would compute a correlation coefficient (see Chapter 13) between the students' performance on the two forms of the test. A high coefficient would indicate that judgments based on the two forms were likely to be equivalent, whereas a low coefficient would suggest that she would be evaluating one of her sections on a different basis than the other.
39. In spite of the above example, most teachers do not often compute equivalent-form reliability for regular classroom exams. This is because it would require additional administrations of the exams; and if a teacher would perform the reliability procedure with either one of her course sections, she would need a third form for the other section! This is too much work for a busy instructor. However, even if we don't compute the statistic, it is important that the forms be equivalent. Otherwise, we are running the risk of being unfair to one group or the other. We can establish a high degree of probability that the forms are equivalent by following a nonstatistical procedure. We could simply write two questions which we consider to be matched (on the same objective, for example), and then randomly put one on the first form and the other on the second form of the test. (I followed this strategy in developing test items for this Workbook and Instructor's Manual; and so there is good reason to believe that the review quiz for a given chapter in the Workbook is roughly equivalent to the parallel quiz in the Instructor's manual, even though I have not computed the equivalent-forms reliability coefficient for these tests.) By following this procedure for both forms of the entire data collection process, we would have a fair degree of confidence that the two forms would be equivalent. Occasionally, we could compute the equivalent form reliability to make sure our method is working. On the other hand, when we are engaged in formal research, when we have sufficient time and need to pilot our data collection processes on a separate sample of respondents, then we should always perform the equivalent-forms analysis. The point is this: equivalent-form reliability merely verifies that parallel forms are really equivalent. The most important thing is to take active steps to actually make forms equivalent when parallel forms are needed.
40. Which of the following researchers needs equivalent-forms reliability?
a. Mr. Hoyte is evaluating his school system's writing program. At the end of each year, he plans to have the students write an essay on a given topic, and then he will hire graders to evaluate the essays. He is concerned that changes in scores should indicate real changes in ability, rather than merely reflecting differences in the topics given each year. (Go to 41.)
b. Mrs. Adams has asked her history students to write an essay on "The Value of the American Heritage." She plans to evaluate the essays according to specified criteria to ascertain whether or not her students are developing a patriotic attitude. She is concerned that the differences she finds between students might reflect inconsistencies in the opinions of the persons scoring the tests rather than actual differences. (Go to 42.)
41. Right. Since Mr. Hoyte plans to use different topics each year, it is possible that the choice of topics could influence the scores. In reality, he is giving different forms of the same data collection process; and it would be important for him to know that these forms are really equivalent. Concern about the reliability of the person scoring the test would also be legitimate, but Mr. Hoyte has not expressed this concern. (Go to 43.)
42. Wrong. Mrs. Adams has only one form of her test, and therefore she cannot be concerned about equivalent-forms reliability. She is concerned with interscorer reliability (discussed later). She may wish to compute this other form of statistical reliability, but equivalent-forms reliability would be of no use to her. (Try 40 again.)
43. A researcher is often interested in knowing whether or not all the elements of a data collection process tend to measure the same thing. This would be especially important, for example, if the researcher wanted to add all the scores on a test together to arrive at a total or composite score. For instance, if an instructor wants to know if her students "understand the Western heritage," it would be useful to know the extent to which all the items on her test or questionnaire actually measure this outcome. An example of unreliability in this case would occur if half of the items actually dealt with this topic while the others could more adequately be described as measuring something else, such as "ability to memorize trivial facts." The instructor in this example is concerned with internal consistency reliability. It has this label because it deals with the question of whether the data collection process is "consistent within itself." A data collection process is internally consistent to the extent that all the items on the test can be said to be measuring one aspect of knowledge, attitude, or skill. A data collection process is internally inconsistent to the extent that the same data collection process is measuring several diverse outcomes.
44. Coefficient alpha is the most commonly employed procedure for obtaining a statistical estimate of internal consistency. This procedure requires that the data collection process be administered once, and then the scores are analyzed through a mathematical formula. The resulting coefficient can theoretically range from .00 to 1.00, with a higher coefficient representing greater internal consistency.
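The workbook does not print the formula, so the sketch below uses the standard textbook form of coefficient alpha: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The data are invented (five respondents, four items scored 0 or 1), and the function name is hypothetical.

```python
from statistics import variance

def coefficient_alpha(item_scores):
    """Coefficient alpha from a single administration.

    item_scores is a list of rows, one per respondent; each row lists that
    respondent's score on every item, in the same item order.
    """
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Transpose rows into columns so each item's scores can be examined.
    columns = list(zip(*item_scores))
    sum_item_variances = sum(variance(col) for col in columns)
    # Variance of each respondent's total (composite) score.
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical responses: 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
```

The single administration is all the formula needs, which is why this coefficient is so readily available.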
45. Since the internal consistency procedures require only a single administration of a data collection process, many researchers naively hasten to use this procedure in preference to the others. This can often be a mistake. It is always important to look at your specific situation and needs in order to determine what kind of reliability your situation requires. It makes no sense to compute the "easiest" type of coefficient, if that type does not provide information relevant to your situation.
46. Many classroom tests and important data collection processes (such as screening devices for special education) deliberately include a wide variety of outcomes. If this is the case, the internal consistency will be lowered. However, even in such cases, the other forms of reliability can be kept at a high level. Teachers/researchers with limited time, therefore, will often be more concerned about these other forms of reliability - even though the internal consistency coefficient is often more readily available. The important point is this: don't make rash conclusions based on the internal consistency coefficients in situations where the internal consistency coefficient is irrelevant to the problem at hand.
47. Which of the following researchers calls for an internal consistency coefficient?
a. Professor Jenkins is teaching a course on ethics. He likes to give the students a moral problem at the beginning of the year and another at the end of the semester. He then examines them to see if they include more sophisticated philosophical reasoning on the second than on the first test. (Go to 48.)
b. Miss Jones wants to find out if her students "understand the application of long division to practical problems." She has devised a 30-element test, and on the basis of this test she hopes to determine which students have mastered the principles and which have not. (Go to 49.)
48. Wrong. Professor Jenkins is concerned with whether his alternate forms of the test are equivalent. He needs equivalent-forms reliability. (Try 47 again.)
49. Right. She plans to add the correct answers together to obtain a total score, and this would make more sense if they all measured a common outcome. Her concern is that all the items actually measure her designated outcome, and the internal consistency method would provide her with useful information. (Go to 50.)
50. Sometimes it's possible that the person scoring the data collection process (rather than the persons administering or taking it) will introduce inconsistency into a data collection process. This will often happen with "subjective" data collection processes, where the scorer must examine a relatively unstructured performance and provide a score for the respondent. Examples of such "subjective" data collection processes include essay exams, personality tests, judgments about speeches and artistic presentations, etc. A useful way to establish reliability in such cases is to have two different persons score (or rate) the respondents' performance, and then compute a correlation coefficient (Chapter 13) using these two sets of scores. The result of this calculation is referred to as interscorer reliability.
51. Since both the respondent and the person scoring the instrument can conceivably be inconsistent, it is often useful to compute both interscorer and another type of reliability. For example, if an administrator of a program wanted to use an essay on one topic as a pretest and an essay on another topic as a posttest, then it would be a good idea to compute both equivalent-forms reliability (to make sure they really were alternate forms) and interscorer reliability (to make sure the scoring process was reliable).
52. Which of the following researchers needs an interscorer reliability coefficient?
a. Miss Williams is teaching a speech course. Students' grades are based on "how they persuade an outside observer that their arguments are valid." Miss Williams wants to be sure that decisions don't depend merely on who the outside observer is. (Go to 53.)
b. Mr. Johnson is teaching a typing course. The students' grades depend on how fast they type and on how many errors they make. All students type the same material at the end of the course. Mr. Johnson is concerned that they are all judged on the basis of the same test, even though they may take the test several weeks apart. (Go to 54.)
53. Right. There are several different persons scoring the students, and therefore Miss Williams could compute a correlation between the ratings of several raters on the same students to determine their degree of agreement. Incidentally, she could increase the probability of a high degree of agreement by giving the raters a structured checklist of some sort to make sure they were looking for common elements in the arguments. (Go to 55.)
54. Wrong. Mr. Johnson is apparently the only person scoring the test. Even if there were more than one scorer, it would be likely that they would agree on the rate of speed and the number of errors. It sounds like Mr. Johnson needs test-retest reliability. (Try 52 again.)
55. A final type of reliability is useful in determining the extent to which different observers agree on whether or not an outcome is occurring. This is called interobserver agreement. In this case, it is not a correlation coefficient, but rather a percentage of agreement which is computed to estimate the degree of reliability. An example of this would occur if a teacher were trying to get a student to make more frequent eye contact during class time. The teacher might operationally define what is meant by eye contact and then have an observer record whether or not the child made eye contact during each of, say, a hundred 15-second intervals. To check for reliability, the teacher might have a second observer look for the same behavior during the same time period. The two observers would later compare notes to see how often they agreed on whether there was eye contact. If they agreed during 95% of the intervals, this would be a high degree of reliability. If they achieved consensus for only 50% of the intervals, however, this would be a much less satisfactory level of reliability. (This measure of reliability is often used in behavior modification research.) When such reliability estimates are low, the usual course of action is to define the behavior more specifically or to provide more careful training for the observers.
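The percentage of agreement described in this frame is simple to compute. The sketch below uses invented records for two observers over 20 intervals (True means eye contact was recorded in that interval). The function name is hypothetical.

```python
def percent_agreement(observer_a, observer_b):
    """Percentage of intervals on which two observers recorded the same thing."""
    if len(observer_a) != len(observer_b) or not observer_a:
        raise ValueError("need two non-empty records of equal length")
    # Count the intervals where both observers made the same yes/no judgment.
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100 * agreements / len(observer_a)

# Hypothetical records: the observers disagree on exactly one interval.
first_observer = [True] * 12 + [False] * 8
second_observer = [True] * 11 + [False] * 9

agreement = percent_agreement(first_observer, second_observer)
```

Here the observers agree on 19 of 20 intervals, which corresponds to the 95% level described above as a high degree of reliability.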
56. Which of the following researchers needs to compute interobserver agreement reliability?
a. At the end of each day, Ms. Freeman rates the behavior of each child in her class on a 5-point scale, ranging from "Very cooperative" to "Extremely uncooperative." She sometimes thinks that if someone else rated the same children, the results would be quite different. (Go to 57.)
b. Mr. Wells is a speech therapist. He is trying to get Wilma to pronounce the letter "m" correctly at the end of words. He has her say sentences about pictures, and he rates her as "yes" or "no" each time she has an opportunity to say an "m" at the end of a word. He has his aide make similar ratings during some of the periods and compares notes with his aide to see if they agree. (Go to 58.)
57. Wrong. This was a hard one. Ms. Freeman needs interscorer reliability. She is not focusing on the simple occurrence vs. nonoccurrence of a specified behavior. If she divided the day up into 1-minute periods and evaluated each child as either cooperative or noncooperative during each of these periods, then a measure of interobserver agreement would be useful. With her present procedure, however, interscorer reliability would be more useful. (Try 56 again.)
58. Right. Mr. Wells is trying to determine the extent to which objective observers can agree whether or not an outcome (saying "m" at the end of words) is occurring. Computing a percentage of interobserver agreement would help him determine how reliable his data collection process is. (Go to 59.)
59. This programmed unit has devoted a great deal of space to statistical methods for estimating reliability. Don't let this emphasis mislead you. The important thing is to understand the concept of reliability and to follow the guidelines (Frames 1-25) which will make your data collection processes as reliable as possible. The statistical procedures are useful only to estimate how successful you have been. A great deal of space has been devoted to this topic only because students often request additional help.
60. Examine the following list, and determine what kind of statistical reliability would provide the researcher/educator with useful information. Choose from the following list.
a. Test-retest reliability.
b. Equivalent-forms reliability.
c. Internal consistency reliability.
d. Interscorer reliability.
e. Interobserver agreement.
- ______ Mr. Levine is trying to estimate the level of moral development of his ninth graders. He presents them with moral dilemmas, and then he rates them into one of six categories describing levels of moral growth.
- ______ Doctor Jamison is in charge of comprehensive exams for students in her Masters program. For security reasons, she must develop a new set of questions each semester; but she wants the various exams to be as comparable as possible.
- ______ Miss Armstrong is the administrator of a nursing program. She wants her trainees to fill out a 50-item questionnaire which will indicate how well they can "develop rapport with patients." A judgment regarding how well each trainee possesses this skill will be made based on the score on the questionnaire.
- ______ Mrs. Anderson is planning a criterion-referenced test for her third graders. It will cover the basic mathematical principles that they have covered during the course of the year.
- ______ Mr. Washington wants to estimate "appreciation of art" by assessing the number of students who actually stop and gaze at a painting that has been newly displayed in his classroom.
Answers:
If you got all of these answers right, you probably have a sound basic understanding of the statistical procedures for estimating reliability. If you got some of them wrong, you should make a judgment about whether or not you need further review work before continuing with other chapters.
61. The standard error of measurement provides a slightly different approach to describing the reliability of a data collection process. The standard error of measurement describes the range within which the "true" score of an individual is likely to occur. The more reliable a test is, the narrower this range will be.
For example, a student might score 77 on an unreliable reading test with a reliability of .50 and a standard error of 10. Since the test is unreliable, the score would probably be different if the student retook the same test or an alternate form of that test. The standard error of 10 means that there is a good chance that the student's actual score - which could be ascertained by taking the test or its alternate forms a large number of times to rule out chance fluctuations - falls between 67 and 87.
If the test were more reliable - say, with a reliability of .80 and a standard error of 3 - we would have greater confidence in the accuracy of a given student's score. If a student received a score of 77 on a test with a standard error of 3, we would estimate that the actual score would be somewhere within the range of 74 to 80.
The standard error of measurement is directly related to the concept of reliability. It is similar to the concept of standard deviation (discussed in Chapter 7) and to the concept of confidence intervals (discussed in Chapter 8).
Although computations of the standard error of measurement may sound imposing, the concept is really not difficult. The point is this: the smaller the standard error of measurement, the more accurate the results of a data collection process are likely to be.
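The score ranges in this frame can be reproduced with a short Python sketch. The second function uses the usual classical formula, standard error = standard deviation * sqrt(1 - reliability), which the workbook does not state. The standard deviation of 10 and reliability of .91 passed to it are invented values for illustration, and all function names are hypothetical.

```python
from math import sqrt

def score_band(observed_score, sem):
    """Range of one standard error on either side of the observed score."""
    return observed_score - sem, observed_score + sem

def standard_error_of_measurement(score_sd, reliability):
    """Classical formula: SD of the scores times sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1 - reliability)

# The two cases from the example: a score of 77 with standard errors 10 and 3.
band_unreliable = score_band(77, 10)
band_reliable = score_band(77, 3)

# Hypothetical test: SD of 10 and reliability of .91 give an SEM of about 3.
sem_example = standard_error_of_measurement(10, 0.91)
```

Under the usual assumption that measurement errors are normally distributed, a band of one standard error on either side covers the true score roughly two times out of three, which is one way to read the "good chance" in the example.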
62. Although the statistical procedures for estimating reliability often provide useful information, they can sometimes be counterproductive by providing misleading information. This is easily understood if you know that correlation coefficients function correctly as estimates of reliability only when scores are distributed over a wide range. (If five students take a test and get scores of 35, 45, 55, 65, and 75, these scores are spread over a wide range. If the five students had scores of 61, 62, 63, 64, and 65 on the same test, these scores are spread over a narrow range.) Because of this characteristic of correlation coefficients, any time the testing situation requires that the scores be grouped closely together, the use of correlation coefficients will supply misleading information. This is because the optimal requirements for the testing situation (narrowly spread-out scores) and those for the correlation coefficients (widely distributed scored) can be in direct conflict. To take a specific example, on criterion-referenced tests (Chapter 6), scores will often be tightly clustered, and correlation coefficients will therefore be lowered. Therefore, discretion should be used in interpreting such coefficients. In such cases, a low correlation coefficient is not good evidence of unreliability, provided the appropriate guidelines to achieve reliability have been followed during the development of the data collection process.
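The effect of a narrow score range can be seen in a small Python sketch. It takes the two sets of first-occasion scores from this frame and adds the same invented retest errors (+2, -2, +2, -2, 0) to each set, so the amount of measurement error is identical in both cases. The retest scores and function name are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented retest errors, applied identically to both groups of scores.
errors = [2, -2, 2, -2, 0]

wide_first = [35, 45, 55, 65, 75]
wide_second = [score + e for score, e in zip(wide_first, errors)]

narrow_first = [61, 62, 63, 64, 65]
narrow_second = [score + e for score, e in zip(narrow_first, errors)]

r_wide = pearson_r(wide_first, wide_second)        # scores spread widely
r_narrow = pearson_r(narrow_first, narrow_second)  # scores tightly clustered
```

With the same errors in both cases, the coefficient is above .95 for the widely spread scores and below .60 for the clustered scores. The measurement process is no less consistent in the second case; only the spread of the scores has changed.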
63. Closely related to the problem in 62 is the fact that companies that design standardized tests often choose items which maximize the size of the reliability coefficients. For reasons mentioned in 62, this necessitates eliminating items which are answered correctly by nearly everyone. If everyone has legitimately mastered a given educational objective, then items relating to that objective will be eliminated from the test. This lessens the content validity of the test (discussed in the next semiprogrammed unit). When you select standardized tests, therefore, it is useful to keep this factor in mind and to seek an appropriate balance between statistical reliability and content validity.
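The item-selection practice described in Frame 63 can be sketched as follows. The Python example below computes the proportion of respondents who answer each item correctly and flags the items that nearly everyone gets right, which are the items a test developer might drop to raise the reliability coefficient. The response data, the function names, and the cutoff of .95 are all hypothetical; this is one simple way to illustrate the idea, not a description of how any particular test publisher works.

```python
def proportion_correct(responses):
    """Return, for each item, the proportion of respondents answering correctly.

    `responses` is a list of rows, one per respondent, with 1 = correct, 0 = incorrect.
    """
    n_respondents = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[item] for row in responses) / n_respondents
        for item in range(n_items)
    ]


def items_answered_by_nearly_everyone(responses, cutoff=0.95):
    """Return the positions of items whose proportion correct meets the assumed cutoff."""
    return [
        item
        for item, p in enumerate(proportion_correct(responses))
        if p >= cutoff
    ]


# Hypothetical results: five students, four items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

print(proportion_correct(responses))
print(items_answered_by_nearly_everyone(responses))
```

The flagged items show no variation among respondents, so they cannot contribute to a correlation coefficient. If they are removed, however, the objectives they cover are no longer sampled, which is the loss of content validity that the frame warns about.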
SEMIPROGRAMMED UNIT ON VALIDITY
1. Validity is the most important characteristic of a data collection process. It deals with the question of whether or not the data collection process is really measuring what it purports to measure. A data collection process is valid to the extent that a person's performance on it is really and truly an indication of the extent to which the respondent possesses the characteristic which the data collection process is attempting to measure. A data collection process is invalid to the extent that performance on the data collection process is influenced by characteristics irrelevant to the one the data collection process is trying to measure.
2. There are three sources of invalidity. First, the operational definition of the research variable might be a logically inappropriate description of the research variable. In other words, if the operational definition misses the mark, then any evidence based on this operational definition is also likely to miss the mark. Which one of the following is an example of a logically inappropriate operational definition?
a. The teacher defined "comprehending Spanish sentences" as meaning that the student would be able to state the person, tense, and voice of each verb in each sentence. (Go to 3.)
b. The teacher defined "comprehending Spanish sentences" as meaning that the student would be able to paraphrase each sentence in English. (Go to 4.)
3. Right. There are many reasons why a student might understand a sentence and yet not be able to give the person, tense, and voice of the verbs. Likewise, there would be many instances in which a person might know the person, tense, and voice of the verb without comprehending the rest of the sentence. (Go to 5.)
4. Wrong. In general, it seems safe to say that a person who can paraphrase a sentence has understood the sentence (provided the person speaks English adequately and provided the teacher requires a clear enough paraphrase). (Try 2 again.)
5. Which of the following is an example of a logically inappropriate operational definition?
a. Mr. Columbus is trying to teach his students to "appreciate the American heritage." He defines this as being able to identify the main events in American history. (Go to 6.)
b. Mr. Erikson is trying to teach his students to "appreciate the American heritage." He defines this as being able to give reasons why American culture has developed to its present state and identifying the major advantages and disadvantages in this development. (Go to 7.)
6. Right. Identifying events is not nearly synonymous with appreciating a heritage. Many students, for example, can identify events in the Pleistocene era or in the history of other countries, and it would be hard to convince teachers that these students appreciate the heritage of the Pleistocene era or of these other countries. (Go to 8.)
7. Wrong. Even though this is not a perfectly foolproof operational definition, it is much better than the other one. A student who does what this operational definition says is likely to have an appreciation of the American heritage. Of course, if the student parrots back a memorized list of reasons and advantages, then this would not be evidence of "appreciation of the American heritage." Nevertheless, the second definition seems to be better than the first. (Try 5 again.)
8. The second source of invalidity occurs when the tasks in the data collection process don't match the tasks in the operational definition. Sometimes such mismatches occur because the designer of the data collection process is simply inattentive and mistakenly fails to match the task to the operational definition. Frame 9 focuses on an example of this obvious type of mismatch. In other cases, however, the task superficially matches, but respondent characteristics or the demands of the testing situation cause the actual task as performed to differ from the operational definition. Frame 12 focuses on an example of this second type of mismatch.
9. Mr. Erikson (from Frame 5) operationally defines "appreciation of the American heritage" as being able to give reasons why the American culture has developed to its present state and identifying the major advantages and disadvantages of this development. In which of the following cases would there be a mismatch between the actual task and the operational definition?
a. Mr. Erikson has the students write an essay on this topic. He grades them purely on their ideas, with no consideration for neatness, grammatical errors, or spelling. (Go to 10.)
b. Mr. Erikson has the students type an essay on this topic. While focusing primarily on content, he takes points off for typing errors, grammatical errors, and spelling errors. (Go to 11.)
10. Wrong. There is no obvious mismatch here. The task very closely resembles the operational definition. (Try 9 again.)
11. Right. There is a very obvious mismatch here. A person's score will be based not only on adherence to the operational definition, but also on irrelevant tasks such as typing, spelling, and grammatical usage. By focusing on such irrelevant characteristics, Mr. Erikson would lower the validity of his data collection process for measuring the designated research variable.
12. Mr. Erikson (Frames 5 and 9) decides to have the students write the essays during class time and to grade the students' essays without regard to such matters as neatness, spelling ability, and grammatical consistency. Listed below are four brief descriptions of students who wrote essays. Write "yes" before the description if it matches the operational definition and "no" if it doesn't match.
a. _____ Juan's native language is Spanish, and the test is given in English. He has trouble expressing complex ideas in English, and he avoids this difficulty by using the simplest expressions possible. He therefore omits several complex ideas which he could have included had the test been written in Spanish.
b. _____ Beatrice read all the information and listened to all Mr. Erikson's lectures. However, she didn't really attach much meaning to all these ideas. Because of this, she performed quite poorly on the test.
c. _____ Roscoe understood the information as it was first presented, but he didn't review and quickly forgot what he had learned. He therefore got a low score on the test.
d. _____ Amy gets anxious when she takes tests under time pressure. Although she could have done well if she had written the test at home, she scores poorly when she takes the test during class time.
Answers:
a. No. The task does not match for Juan. The task he is actually performing could be described as "write an essay using artificially simplistic English expressions on a relatively complex topic." Juan's low score is a result of his inability to perform this more difficult task. Mr. Erikson has not collected evidence with regard to the original operational definition.
b. Yes. The task matches. Beatrice's low performance appears to be the result of her inability to perform the task described in the operational definition. To the extent that the operational definition itself is valid (Frames 5 and 7), it can be assumed that Beatrice does not appreciate the American heritage. (Why she doesn't appreciate the American heritage is irrelevant to the measurement problem.)
c. Yes. The task matches. Roscoe's low performance appears to be the result of his inability to perform the task described in the operational definition. At the time he took the test, Roscoe apparently did not appreciate the American heritage, according to this operational definition.
d. No. The task does not match for Amy. The task she is actually performing could be described as "write an essay under conditions of artificially high anxiety. . . ." Her low score is a result of her inability to perform this more difficult task. Mr. Erikson has not collected evidence with regard to the original operational definition.
If you got these four questions right, you probably understand how validity is related to the match between the task and the operational definition. Otherwise, you may wish to do some further reviewing before continuing.
13. The final source of invalidity in data collection processes is unreliability. If a data collection process lacks reliability, it will also lack validity. In other words, unless a data collection process measures something consistently, it cannot measure anything validly. Reliability doesn't guarantee validity, but it's a necessary prerequisite. A certain amount of reliability is necessary if a data collection process is to be valid.
14. Mr. Tolliver wants to find out whether his math unit teaches his pupils to solve practical problems. He writes a set of such practical problems and constructs a pretest and posttest. In which of the following settings would low reliability pose a threat to the validity of his judgments about the success of the unit?
a. He devises a 5-item test for a pretest and a similar posttest. He finds that the students average 53.5 on the pretest and 88.6 on the posttest. (Go to 15.)
b. He devises a 20-item pretest and a similar posttest. He finds that the students average 53.5 on the pretest and 88.6 on the posttest. (Go to 16.)
15. Right. This is an extremely short data collection process, and scores are likely to vary by chance because of unreliability. The shortness of the data collection process makes it likely that irrelevant factors rather than those comprising the operational definition of the research variable account for some of the difference between a person's scores on the pretest and posttest. (Technical note: This would be a much more severe problem if Mr. Tolliver were diagnosing individual students rather than group performance. This difference between reliability requirements for groups vs. individuals is discussed in the textbook on pages 98 to 99.) (Go to 17.)
16. Wrong. There is no obvious problem with reliability in this case. This test is four times as long as the first one, and therefore reliability is less likely to be a problem. (Try 14 again.)
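The link between test length and reliability in Frames 14 to 16 can be made concrete with the Spearman-Brown formula, a standard result from classical test theory that is not presented in this unit. The Python sketch below is an illustration under two stated assumptions: the added items are comparable in quality to the original ones, and the 5-item test has a reliability of .50 (a made-up figure, since the unit gives no reliability for Mr. Tolliver's tests). The function name is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Estimate reliability after a test is lengthened by `length_factor`.

    Spearman-Brown formula: k * r / (1 + (k - 1) * r), where r is the
    current reliability and k is the ratio of new length to old length.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Assumed reliability of the 5-item test.
five_item_reliability = 0.50

# The 20-item test is four times as long as the 5-item test.
twenty_item_reliability = spearman_brown(five_item_reliability, 4)
print(round(twenty_item_reliability, 2))
```

Under these assumptions the estimated reliability rises from .50 to .80 when the test is made four times as long, which is why the longer test in option b is less likely to pose a reliability problem.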
17. In order to increase the validity of your data collection processes and of the judgments you base on them, you should follow these guidelines.
a. Demonstrate that the operational definition upon which the data collection process is based is actually a logically appropriate definition of the research variable under consideration.
b. Demonstrate that the tasks which the respondent has to perform to obtain a score on the data collection process match the task suggested by the operational definition.
c. Demonstrate that the data collection process is reliable.
18. The first guideline is largely a logical process. Ask yourself what else (besides the research variable) could account for the behavior or events described in your operational definition. Then revise the operational definition. Try to devise an operational definition that has as few alternate explanations as possible. For research variables which are highly internalized and hard to define, it is often useful to devise multiple operational definitions. The use of several carefully-chosen operational definitions is a good way to rule out many of the alternate explanations. This process was discussed in Chapter 4.
19. The second guideline requires that you be alert for both the obvious accidental mismatches described in Frames 9 to 11 and for the more subtle mismatches arising from respondent characteristics and demands of the measurement setting (Frame 12). (If these respondent characteristics and demands of the setting are likely to change from occasion to occasion, then they present problems regarding reliability. However, if they are relatively permanent characteristics or demands - as in the case of Amy and Juan in Frame 12 - then they present problems regarding validity.) Careful proofreading and piloting your data collection processes with colleagues can rule out the more obvious accidental mismatches. The mismatches arising from respondent characteristics and demands of the measurement setting, however, are often more subtle. They can often be identified and eliminated only through detailed knowledge of the persons and settings involved and through the use of the technical strategies for assessing validity discussed later in this chapter.
20. Respondents like Juan and Amy in Frame 12 present difficult problems for researchers trying to collect valid data about research variables. How can Mr. Erikson collect valid data on Juan? The most obvious way is to administer the test in Spanish, but Mr. Erikson may be unable to do this. In addition, it can be argued that if Juan is going to live in American society, it's important for him to be able to express himself in English. But by requiring him to express himself in English, we would really be establishing a new objective (a new research variable) for Juan. Is Mr. Erikson qualified to help him reach this more complex goal? Is it appropriate to require Juan to ignore the original goal while he tries first to reach his new goal? These philosophical questions are important, but they are beyond the scope of educational research. The point here is that if Mr. Erikson thinks he is collecting evidence about his original research variable, he is wrong - his data collection process lacks validity. A similar situation arises with Amy. Her data collection process is invalid, but steps to overcome this invalidity may be hard for Mr. Erikson to identify.
21. The third guideline for improving validity is to demonstrate that the data collection process is reliable. The strategies discussed earlier in this chapter will be helpful in this regard. However, a note of caution is in order. It is possible to increase reliability while actually lowering the validity of a data collection process. In other words, it is emphatically NOT true that the data collection process with the highest reliability coefficient is always the most valid data collection process. This can be easily seen by expanding upon the example given in Frame 14. In which of the following cases would Mr. Tolliver's data collection process have the higher reliability?
a. Mr. Tolliver devises a 5-item test with all 5 items specifically related to the research variable.
b. Mr. Tolliver devises a 20-item test with all 20 items specifically related to a single research variable other than the one under consideration.
The answer, of course, is that b is more likely to provide the more reliable data collection process, since it samples a larger number of items from a single area. But which is more valid? Obviously, a is more valid, since these five items deal directly with the research variable, whereas none of those in b fits this essential criterion.
22. As a general rule, increased reliability will increase validity as long as the other two guidelines are followed. If these other two guidelines are not followed, then the relationship of reliability to validity becomes ambiguous. The most frequent difficulty is that reliability is often increased at the expense of violating the second guideline, thereby lowering validity.
23. Professor Robinson wants to find out how effectively his counseling students can apply a certain counseling theory. Which of the following strategies would be the more reliable procedure?
a. He could have them write a single, lengthy essay in which the students would be required to apply the theoretical principles to a hypothetical problem. (Go to 24.)
b. He could have them answer 50 multiple choice questions in which the students would have to identify specific concepts related to the theory. (Go to 25.)
24. Wrong. This is a comparatively unreliable procedure. It is based on a single question. In addition, the subjectivity of the person scoring the test could further increase unreliability. (Try 23 again.)
25. Right. Now which is more valid?
a. The essay test. (Go to 26.)
b. The multiple choice test. (Go to 27.)
26. Right. The task which the student will perform more closely matches the operational definition which would be derived from the research variable. (Go to 28.)
27. Wrong. This data collection process is a highly reliable measure of something close to (but not matching) the research variable. Success on such a criterion might be a valid measure of "comprehension of the principles underlying the theory." This may even be related to the desired outcome, but it is far from a perfect match. There are probably many persons who can understand a theory who cannot apply it. And application is the desired outcome. (Try 25 again.)
28. What should Professor Robinson do?
a. Give the essay test. (Go to 29.)
b. Give the multiple choice test. (Go to 30.)
c. Do something else. (Go to 31.)
29. Of the two data collection processes available, the essay test would certainly be the better choice. However, it would also be possible to do something else. (Go to 31.)
30. If he gave the multiple choice test, he should at least admit that he is measuring a different skill, and not the one he said he was trying to teach. (Try 28 again.)
31. There are several other possibilities. One would be to make the essay test more reliable. He could do this, perhaps, by asking several short questions instead of a single lengthy question. Likewise, if he hasn't already done so, he could add structure into the directions for the essay test and into the scoring process. Such steps often increase the reliability of essay tests without lowering validity. Finally, he could switch to some alternate format, such as the "comprehensive exercise" (briefly discussed in Chapter 6 of the textbook). Major portions of courses and textbooks on Achievement Testing are devoted to solving problems like that faced by Professor Robinson.
32. A final note regarding the relationship between reliability and validity is in order. Since reliability coefficients are relatively easy to compute, there is often an unfortunate tendency to use such statistical coefficients as the main determinant of validity. This extreme emphasis on the third guideline often occurs because the other two guidelines are more difficult to follow - especially with highly internalized educational outcomes. If it's difficult to operationally define a research variable, and if it's difficult to match good items to the operational definition, then it may seem like a good solution to make a half-hearted effort to follow these first two guidelines and then go all out for reliability. In fact, this is a very misleading and bad solution to the validity problem. A much better approach is to rely more heavily on the first two guidelines and to incorporate the third guideline into the application of the first two.
33. The American Psychological Association has defined three specific tools for estimating aspects of validity. Each of these technical tools refers to a specific aspect of the overall concept of validity. By examining them, we can gain insights which may be useful to us in constructing, administering, and interpreting various data collection processes. In addition, if you read technical manuals or look up a citation of a data collection process in the Mental Measurements Yearbook, you will find that these technical terms are often used to describe some of the major characteristics of professionally developed tests.
34. Content validity refers to the extent to which a data collection process measures a representative sample of the subject matter it is designed to measure. For example, a classroom examination should cover all the matter covered during a semester, not just that which was covered in the two weeks immediately prior to the exam. Likewise, an "Attitude Toward Sports" questionnaire should include questions on a wide range of sports, not just on one or two key athletic activities. Similarly, if a researcher wants to find out how often a child pays attention, he should watch the child during a representative sample of time intervals throughout the entire class period, not just during the five minutes immediately before the bell rings. All these examples focus on the question of how well the behaviors sampled with the data collection process actually represent the research variable with which we are concerned. Content validity is not computed statistically, but rather through a systematic effort to demonstrate that the whole range of desired outcomes has been sampled.
35. Which of the following researchers is concerned with content validity?
a. Mr. Ky has developed a questionnaire which will reveal the reading habits of the children in his Gifted Program. He is concerned that the questions might cause the children to focus too narrowly on one type of reading material at the expense of ignoring others. (Go to 36.)
b. Mrs. Flynn has developed an observational strategy to study the social interactions of children in kindergarten. She is concerned that the data collection process might be too complex for teachers to administer without serious disruption of their programs. (Go to 37.)
36. Right. Mr. Ky is concerned with whether or not the sample of items on his questionnaire adequately covers the scope of the topic in which he's interested. (Go to 38.)
37. Wrong. Mrs. Flynn is concerned about a serious problem, but this problem is not directly related to whether her data collection process covers the proper scope of the topic in which she is interested. (Try 35 again.)
38. Criterion-related validity refers to how strongly the results of a data collection process are related to the results of another measuring technique with which they should logically be related. There are two types of criterion-related validity: (a) The researcher might want to know how strongly his present technique is related to another currently existing data collection process; or (b) The researcher might want to know how strongly his present technique is related to some data collection process to be conducted in the future. In situation (a), the researcher might want to use his new data collection process as a replacement for the other technique. In situation (b), he might want to predict the subsequent criterion as accurately as possible. In both cases, criterion-related validity is computed by calculating a correlation coefficient (Chapter 13) between the data collection process being validated and the other technique. A high coefficient, of course, indicates strong criterion-related validity.
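The computation described in Frame 38 can be sketched in a few lines. The Python example below correlates scores from a data collection process being validated with scores on a criterion measure. The data are invented for illustration (a teacher's ability estimates on a 1-to-5 scale and the same students' later achievement scores), and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def criterion_validity(process_scores, criterion_scores):
    """Correlate the process being validated with the criterion measure."""
    if len(process_scores) != len(criterion_scores):
        raise ValueError("each respondent needs a score on both measures")
    return pearson_r(process_scores, criterion_scores)


# Hypothetical data for five students.
ability_estimates = [3, 5, 2, 4, 1]         # the process being validated
later_achievement = [62, 90, 55, 78, 50]    # the criterion, measured later

validity_coefficient = criterion_validity(ability_estimates, later_achievement)
print(round(validity_coefficient, 2))
```

Because the criterion in this made-up example is measured later, the coefficient corresponds to situation (b) in the frame. The same calculation applies to situation (a) when the criterion is an existing technique administered at about the same time.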
39. Which of the following researchers needs criterion-related validity?
a. Mr. Foster wants to know whether his personal estimate of student ability can predict achievement better than the standardized IQ tests. (Go to 40.)
b. Miss Sampson wants to know if the form of the test she administered to the morning students correlated with the form of the same test she administered to the afternoon students. (Go to 41.)
40. Right. Mr. Foster is interested in finding out how strongly his data collection process (his personal estimate of ability) is related to a criterion - the future achievement of the students. (Go to 42.)
41. Wrong. Miss Sampson is interested in equivalent-forms reliability, which was discussed above. (Try 39 again.)
42. Which of the following researchers is interested in criterion-related validity?
a. Officer Jones is interested in knowing how closely a traffic offender's ability to walk a straight line is related to the alcohol content in the blood. (Go to 43.)
b. Mrs. Jackson wants to know the extent to which all the items on her art test measure the same thing. (Go to 44.)
43. Right. Officer Jones is interested in finding out how strongly one measurement technique (walking a straight line) is related to a criterion - alcohol content in the blood. (Go to 45.)
44. Wrong. Mrs. Jackson is interested in internal consistency reliability, discussed earlier in this chapter. (Try 42 again.)
45. Construct validity refers to the extent to which a data collection process can be interpreted in terms of underlying psychological concepts. The researcher develops a theory about how people should respond to the data collection process if it were really measuring the desired concept. Next, the researcher administers the data collection process in such a way as to see if this is how people really do respond. The actual process of demonstrating construct validity is often complex, and it will not be discussed here.
46. Which of the following researchers is interested in construct validity?
a. Miss Young is concerned that her Shakespearean Knowledge test might give too much emphasis to the tragedies. (Go to 47.)
b. Mr. Krause is concerned that his Color Anxiety Scale might not really measure anxiety about colors. (Go to 48.)
47. Wrong. Miss Young is concerned with content validity. (Try 46 again.)
48. Right. Mr. Krause is interested in finding out whether his test really measures the concept he thinks it measures.
49. Examine each of the following brief descriptions and indicate the tools for estimating aspects of validity which the researcher needs. Choose from the following list.
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
d. None of the above.
- _____ Mrs. Dettenwanger wants to know whether a child's TV watching habits can predict his success in her Contemporary Fiction course.
- _____ Doctor Banks thinks that maybe his university's entrance exams focus too much on knowledge which would be acquired only by middle-class white applicants.
- _____ Mr. Alexandrov's school administers a battery of tests to determine a child's "optimal learning style." Mr. Alexandrov feels he can accomplish the same thing by asking the children how they prefer to learn.
- _____ Miss McCabe, the school nurse, wants to know how often the children she refers for further eye testing actually have eye problems.
- _____ Ms. Bronson wants to know if the Teacher Aptitude Test really measures teacher aptitude.
- _____ Mr. Monrow wants to know whether Billy's improved performance on the spelling test was due to an actual improvement in spelling ability or to the fact that the second test simply contained easier words.
Answers:
- b. Mrs. Dettenwanger is interested in knowing how strongly a data collection process (TV watching habits - defined in some way) is related to a criterion (success in the course).
- a. Doctor Banks wants to be sure that the entrance exam does not contain a biased sample of behaviors, giving an unfair advantage to certain applicants.
- b. Mr. Alexandrov will probably compute the correlation between his single question and the currently used test.
- b. Miss McCabe is interested in knowing how strongly her measurement (eye screening test) is related to a criterion (actual eye problems).
- c. Ms. Bronson is interested in knowing whether the test can really be interpreted in terms of the concept it claims to measure.
- d. Mr. Monrow is interested in equivalent-forms reliability. (A case could also be made for content validity.)
CROSS-REFERENCES TO OTHER CHAPTERS
Chapter 5 makes reference to the following concepts that are defined and discussed in other chapters. These are listed in the order in which they occurred in Chapter 5.
Correlation coefficients (discussed in several places throughout the chapter) are discussed in Chapter 13 on page 298.
Standard deviation (which is mentioned on page 98) is discussed in Chapter 7 on page 160.
Confidence intervals (which are mentioned on page 98) are discussed in Chapter 8 on page 181.
Standardized tests (which are mentioned on pages 98 and 99) are discussed in Chapter 6 on page 145.
Operational definitions (which are an important consideration in the discussion of validity, beginning on page 101) were covered in Chapter 4.
Unobtrusive measurement (mentioned on page 105) is further discussed on page 142.
EXAMPLES OF IMPORTANT CONCEPTS IN THIS CHAPTER
Sometimes readers want to go directly to examples of topics. Anecdotes or examples of each of these concepts can be found on the following pages of the textbook:
Reliability - pp. 89, 93, 111
Statistical estimates of reliability - pp. 97-98, 111
Validity - pp. 103, 104, 106, 111
Operational definitions - pp. 103, 104, 111
The following matching exercises focus on the key terms in this chapter. Instead of using them as matching exercises, you may find it effective to try to define each of the terms. The correct answers can be found by checking the answers to the matching exercise.
MATCHING EXERCISE "A" - RELIABILITY
Listed below are several procedures that could be used in computing reliability. Match each of these procedures with one of the following forms of reliability:
a. Test-retest reliability.
b. Equivalent-forms reliability.
c. Test-retest with equivalent forms reliability.
d. Internal consistency reliability.
e. Interscorer reliability.
f. Interobserver agreement.
MATCHING EXERCISE "B" - RELIABILITY
Listed below are several purposes of computing estimates of reliability. Match each of these purposes with one of these forms of reliability:
a. Test-retest reliability.
b. Equivalent-forms reliability.
c. Test-retest with equivalent forms reliability.
d. Internal consistency reliability.
e. Interscorer reliability.
f. Interobserver agreement.
MATCHING EXERCISE "C" - VALIDITY
Listed below are several reasons that a person might want to estimate the validity of a data collection process. Match each with one of these tools for estimating aspects of validity:
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
MATCHING EXERCISE "D" - VALIDITY
Listed below are several methods for estimating or determining the validity of a data collection process. Match each of these with one of these tools for estimating aspects of validity:
a. Content validity.
b. Criterion-related validity.
c. Construct validity.
The "Humane Attitude Scale for Children" has the following characteristics:
- It has been criticized for evaluating attitudes towards dogs and cats only, while ignoring other pets and all wild animals.
- Scores on the data collection process are not merely the result of temporary attitudes, but rather scores seem to be consistent on different occasions for the same people.
- The average score for boys was 36 points out of a possible 100; and for girls 57 out of a possible 100.
- Children who hate animals (according to a Teacher Rating Scale) actually score low on the data collection process, and children who love animals score high on it.
- It can be shown that it measures attitudes and not merely memorized information about animals.
- The reading level and response format have been shown to be suitable for third to eighth grade children.
Using the above information, rate the "Humane Attitude Scale for Children" with regard to each of the following tools for estimating aspects of validity and reliability.
a. _____ Content validity
b. _____ Concurrent validity
c. _____ Predictive validity
d. _____ Construct validity
e. _____ Test-retest reliability
f. _____ Equivalent form reliability
g. _____ Internal consistency reliability
The Vockell-Campbell Test of Racquetball Proficiency has the following characteristics:
1. It can be used to predict quite accurately how well a player will do in racquetball tournaments.
2. It does not correlate highly with performance on the Solon Test of Racquetball Proficiency.
3. It samples only a few skills rather than the huge number of skills needed to play racquetball.
4. It apparently does measure an ability which could be called "racquetball proficiency."
5. The different tasks on the test all seem to measure the same thing ("racquetball proficiency").
6. When taken on two different occasions by the same person, the scores are about the same.
Using the above information, rate the Vockell-Campbell test with regard to each of the following tools for estimating aspects of validity and reliability.
| Tool | Good | Bad | Not mentioned |
| a. Content validity | | | |
| b. Concurrent validity | | | |
| c. Predictive validity | | | |
| d. Construct validity | | | |
| e. Test-retest reliability | | | |
| f. Equivalent form reliability | | | |
| g. Internal consistency reliability | | | |
Review Quiz
Matching Exercises: Reliability
Exercise A
1. e
2. a
3. c
4. f
5. d
6. b
Exercise B
1. f
2. d
3. b
4. e
5. a
6. c
Matching Exercises: Validity
Exercise C
1. a
2. b
3. c
4. b
Exercise D
1. b
2. c
3. a
4. b
Reliability and Validity Exercise "A"
{The number in parentheses after the answer indicates the statement that supplies evidence that the test was either good or bad. For example, the test had weak content validity because Statement 1 says, "The test has been criticized for evaluating attitudes towards dogs and cats only, while ignoring other pets and all wild animals."}
a. Bad (1)
b. Good (4)
c. Not mentioned
d. Good (4 and 5)
e. Good (2)
f. Not mentioned
g. Not mentioned
Reliability and Validity Exercise "B"
{The number in parentheses after the answer indicates the statement that supplies evidence that the test was either good or bad. For example, the test had weak content validity because Statement 3 says, "It samples only a few skills rather than the huge number of skills needed to play racquetball."}
a. Bad (3)
b. Bad (2)
c. Good (1)
d. Good (4 and 5)
e. Good (6)
f. Not mentioned
g. Good (5)
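Several of the answers above rest on a correlation between two sets of scores: answer e (test-retest reliability) on scores from two occasions, and answer b (concurrent validity) on scores from two different tests. The sketch below shows one common way such a coefficient could be computed, using a Pearson correlation. The function name pearson_r and all of the scores are hypothetical, made up purely for illustration; they are not data from either test described in the exercises.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five players on two occasions
# (invented for illustration only).
first_occasion = [72, 85, 60, 90, 78]
second_occasion = [70, 88, 62, 91, 75]

test_retest_r = pearson_r(first_occasion, second_occasion)
print(round(test_retest_r, 2))
```

A coefficient near 1.0 would support a rating of "Good" for test-retest reliability. The same function applied to scores from two different tests taken by the same people would give an estimate relevant to concurrent validity, where a low coefficient would support a rating of "Bad."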