GETTING EVALUATION WRONG: The Case Against Standardized Testing
How Little Test Scores Mean
NOT LONG AGO, I am told, a
widely respected middle school teacher in Wisconsin, famous for helping
students design their own innovative learning projects, stood up at a community
meeting and announced that he "used to be" a good teacher. These days,
he explained, he just handed out textbooks and quizzed his students on
what they had memorized. The reason was very simple. He and his colleagues
were increasingly being held accountable for raising test scores. The kind
of wide-ranging and enthusiastic exploration of ideas that once characterized
his classroom could no longer survive when the emphasis was on preparing
students to take a standardized examination. Because the purveyors of Tougher
Standards had won, the students had lost.
I don't know how many teachers across the country would identify with this story - either because they have already thrown up their hands, as this man did, or because they struggle every Monday morning to try to avoid his fate. But I do know this: the issue of standardized testing is not reserved for bureaucrats and specialists. All of us with children need to make it our business to understand just how much harm these tests are doing. They are not an inevitable part of "life" or even a necessary part of school; they are a relatively recent invention that gets in the way of our kids' learning. Their impact is deep, direct, and personal. Every time we judge a school on the basis of a standardized test score - indeed, every time we permit our children to participate in these mass testing programs - we unwittingly help to make our schools just a little bit worse. In case I am being too subtle here, let me state clearly that I think standardized testing is a very bad thing, and the more familiar you become with it, the more appalled you are likely to be. I am not talking about the kind of tests invented by individual teachers for their classes but about those prepared by giant companies and taken by thousands of students across schools, districts, and even states. Similarly, I am not primarily interested in the tests used for admission to college, like the SAT - although there is plenty wrong with these exams, too.1 Of more immediate concern are those that begin much earlier, tests used all over the country, like the Iowa and Comprehensive Test of Basic Skills (ITBS and CTBS) and the Metropolitan, Stanford, and California Achievement Tests, as well as those developed for use in just one state, such as the ISAT in Illinois, the TAAS in Texas, the MEAP in Michigan, and so on.
The case against exams like these "may be as intellectually and ethically rigorous as any argument made about social policy in the past 20 years," says one writer, "but such testing continues to dominate the education system. . . . We are a nation of standardized-testing junkies."2 Estimates of how many times students in the United States sit down to take these tests every year vary from 40 million to 400 million. It is clear, though, that no other nation in the world does anything like this to its children.3 Yet despite requirements for some students to take several standardized tests a year,4 and although even young children are routinely subjected to these tests5 (in the face of explicit appeals by experts to stop), the trend, incredibly, is for even more testing. In some cases, this trend reflects a deliberate strategy, part of an educational philosophy based on getting students to memorize a bunch of basic facts and skills. Standardized tests generally go hand-in-hand with that kind of teaching - and with a system of carrot-and-stick control to mandate that kind of teaching. In other cases, the use of such tests reflects no particular endorsement of a style of instruction or testing - only a vague desire to hold schools accountable coupled with a total ignorance of other ways of achieving that goal. For example, civil rights groups and sympathetic judges who are understandably outraged by disparities among school systems in the same state (or even county) may uncritically use standardized tests to indicate how much progress has been made to "close the gap" between black and white neighborhoods or poor and rich districts - all the while apparently unaware of how much harm they are doing by legitimating and perpetuating a reliance on such testing. Standardized tests persist and proliferate for other reasons, too. First, they are enormously profitable for the corporations that prepare and grade them.
(More often than not, these companies simultaneously sell teaching materials designed to raise scores on their own tests.) Second, they appeal to school systems because they're efficient, and the worst tests are usually the most efficient. It is fast, easy, and therefore relatively inexpensive6 to administer a multiple-choice exam that arrives from somewhere else and is sent back to be graded by a machine at lightning speed. There is little incentive to replace these tests with more meaningful forms of assessment that require human beings to evaluate the quality of students' accomplishments. In the words of Norman Frederiksen, a specialist in the measurement of learning with the Educational Testing Service, "Efficient tests tend to drive out less efficient tests, leaving many important abilities untested - and untaught."7 Anyone trying to account for the popularity of standardized tests may also want to consider our cultural penchant for attaching numbers to things. One writer has called it a "prosaic mentality": a preoccupation with that which can be seen and measured.8 Any aspect of learning (or life) that resists being reduced to numbers is regarded as vaguely suspicious. By contrast, anything that appears in numerical form seems reassuringly scientific; if the numbers are getting larger over time, we must be making progress. Concepts like intrinsic motivation and intellectual exploration are hard for the prosaic mind to grasp, whereas test scores, like sales figures or votes, can be calculated and charted and used to define success and failure. The more tests we make kids take, the more precise our knowledge about who has learned well, who has taught well, which districts are in trouble, and even which schools (in this brave new world of for-profit education) will survive another day. In a broad sense, it is easier to measure efficiency than effectiveness, easier to rate how well we are doing something than to ask if what we are doing makes sense.
But the heirs of Descartes and Bacon, Skinner and Taylor, rarely make such distinctions. More to the point, they fail to see how the process of coming to understand ideas in a classroom is not always linear or quantifiable. Contrary to virtually every discussion of education by the Tougher Standards contingent, meaningful learning does not proceed along a single dimension in such a way that we can nail down the extent of improvement. In fact, as Linda McNeil has observed, "Measurable outcomes may be the least significant results of learning."9 (That sentence ought to be printed out in 36-point Helvetica and hung in the office of every person in the country involved with school reform.) To talk about what happens in schools as moving forward or backward in specifiable degrees is not only simplistic, in the sense that it fails to capture what is actually going on; it is destructive, because it can change what is going on for the worse. Once teachers and students are compelled to focus only on what lends itself to quantification, such as the number of grammatical errors in a composition or the number of state capitals memorized, the process of thinking has been severely compromised. If it is worth reflecting critically on our infatuation with numbers, it is at least as important to examine our assumptions about standardized tests in particular. Such tests are commonly justified on the basis of providing us with objective information about teaching and learning, but the precise score assigned to a student or school is meaningless until we know the content of the test and whether it is a valid measure of learning. Similarly, you can call these tests "objective" in the sense that they are scored by machines, but it was people who wrote the questions (which may be biased or murky or stupid) and people who decided to include them on the exam. 
As one writer put it, "Judgment was used in the choice of items and that judgment decided which bubble would count and which would not, and hence what the score would be. The people exercising the judgment are too far out of the picture to have faces and personalities, so it is easy to act as if they do not exist."10 Beyond the test-makers, we need to look at the test-takers. Those scientific-sounding results are actually the product of rows of real students scrunched into desks, frantically filling in bubbles. As soon as we focus on this human part of the testing process, the significance of the scores becomes dubious. For example, test anxiety has grown into a subfield of educational psychology, and its prevalence means that the tests that produce this reaction are not giving us a good picture of what many students really know and can do. The more a test is made to "count" - in terms of being the basis for promoting or retaining students, for funding or closing down schools - the more that anxiety is likely to rise and the less valid the scores become. Then there are the students who don't take the tests seriously. A friend of mine remembers neatly filling in those ovals with his pencil in such a way that they made a picture of a Christmas tree. (He was assigned to a low-level class as a result, since his score on a single test was all the evidence anyone needed to judge his capabilities.) Even those test-takers who are not quite so creative may just guess wildly, fill in ovals randomly, or otherwise blow off the whole exercise, understandably regarding it as a waste of time. In short, it may be that a good proportion of students either couldn't care less about the tests, on the one hand, or care so much that they choke, on the other. Either way, their scores aren't very meaningful.
Anyone who can relate to these descriptions of what goes through the minds of real students on test day ought to think twice before celebrating a high score, complaining about a low one, or using standardized tests to judge schools. Can tests be reliable indicators despite these factors? Perhaps, but
it is an open secret among educators that much of what the scores are indicating
is just the socioeconomic status of the students who take them. One educator
suggests we should save everyone a lot of time and money by eliminating
standardized tests, since we could get the same results by asking a single
question: "How much money does your mom make? ... OK, you're on the bottom."11
(In the case of the SATs, the scores reflect
not only family income [see pp. 262-63n1, par.4] but also the proportion
of the eligible population that actually took the test.) The larger point
is that "a ranking of states, districts, or schools by test scores is too
crude a measure to offer any insight about the quality of education" because
other factors, having nothing to do with instruction, contribute significantly
to those scores.12
The Worst Kind of Testing
Some standardized tests aren't this bad. They're worse. I have in mind the tests that by design "provide little or no information about . . . what the individual can do. They tell that one student is more or less proficient than another, but do not tell how proficient either of them is with respect to the subject matter tasks involved."13 Those are the words of a psychologist named Robert Glaser, who many years ago coined the term "norm-referenced" to describe this kind of test and contrasted it with a "criterion-referenced" test, which does compare each individual to a given standard. Believe it or not, the majority of states today rely on norm-referenced tests,14 that is, tests that aren't intended to find out how much students know. These tests were created only to find out how well your child does compared to every other child taking the test, which is usually reported as a percentile. Think for a moment about the implications of this fact. No matter how many students take the test, no matter how well or poorly they were taught, no matter how difficult the questions are, the pattern of results is guaranteed to be the same: exactly 10 percent of those who take the test will score in the top 10 percent, and half will always fall below the median. That's not because our schools are failing; that's because of what "median" means. A good score on a norm-referenced test means "better than other people," but we don't even know how much better. It could be that everyone's actual scores are all pretty similar, in which case the distinctions between them are meaningless - rather like saying I'm the tallest person on my block even though I'm only half an inch taller than the shortest person on my block. More important, even if the top 10 percent did a lot better than the bottom 10 percent, that still doesn't tell us anything at all about how well they did in absolute terms, such as how many questions they got right.
Maybe everyone did reasonably well; maybe everyone blew it. We don't know. Norm-referenced tests aren't meant to tell us how well a student did or how much of a body of knowledge was effectively learned. To use them for that purpose is, in the words of a leading authority on the subject, "like measuring temperature with a tablespoon."15 Yet they are used for exactly that purpose all across the United States.16 We've already bumped up against some of these same criticisms in the context of surveys that rank students from different countries. Exactly the same points apply to ranking schools or districts or states. A reasonably informed person would not care how her child's school did compared to other schools in the area; a reasonably conscientious journalist would not dream of publishing something so meaningless and misleading. The only thing that should count is how many questions on a test were answered correctly (assuming they measured important knowledge). By the same token, the news that your state moved up this year from thirty-seventh in the country to eighteenth says nothing about whether its schools are really improving: for all you know, the schools in your state are in worse shape than they were last year, but those in other states slid even further.17 Even that doesn't tell the whole story. When specialists sit down to design a norm-referenced test, they're not interested in making sure the questions cover what is most important for students to know. Rather, their goal is to include questions that some test-takers - not all of them, and not none of them - will get right. They don't want everyone to do well. Furthermore, they want each question to be answered correctly by the same students who get most of the other questions right. The ultimate objective, remember, is not to evaluate how well the students were taught but to separate them, to get a range of scores.
If a certain question is included in a trial test and almost everyone gets it right - or, for that matter, if almost no one gets it right - that question will likely be tossed out. Whether it is reasonable for kids to get it right is completely irrelevant. Moreover, the questions that "too many" students will answer correctly are probably those that deal with the content teachers have been emphasizing in class because they thought it was important. So norm-referenced tests are likely to include a lot of trivial stuff that isn't emphasized in school because that helps to distinguish one student from another.18 Given that scores from norm-referenced tests are widely regarded as if they said something meaningful about how our children (and their schools) are doing, they are not only dumb but dangerous. And the harm ramifies through the whole system in a variety of ways. First, these tests contribute to the already pathological competitiveness of our culture, which leads us to regard others as obstacles to our own success - with all the suspicion, envy, self-doubt, and hostility that rivalry entails. The process of assigning children to percentiles helps to ensure that schooling is more about triumphing over everyone else than about learning. Second, because every distribution of scores will contain a bottom, it will always appear that some kids are doing terribly. That, in turn, reinforces a sense that the schools are failing. Worse, it contributes to the insidious assumption that some children just can't learn - especially if the same kids always seem to fall below the median. (This conclusion, based on a misunderstanding of statistics, is then defended as "just being realistic.") Parents and teachers may come to believe this falsehood, and so too may the kids themselves. They may figure: no matter how much I improve, everyone else will probably get better too, and I'm always going to be at the bottom. Thus, why bother trying? 
Conversely, a very successful student, trained to believe that rankings are what matter, may be confident of remaining at the top and therefore have no reason to do as well as possible. (Remember: excellence and victory are two completely different goals.) For both groups, it is difficult to imagine a more powerful demotivator than norm-referenced testing. One more disturbing consequence: teachers and administrators who are determined to outsmart the test-or who are under significant pressure to bring up their school's rank - may try to adjust the curriculum in order to bolster their students' scores. (More about this later.) But if the tests emphasize relatively unimportant knowledge that's designed for sorting, then "teaching to the test" isn't going to improve the quality of education. It may have exactly the opposite effect. Even though they suffer from the more general problems with standardized testing, criterion-based exams make more sense than the norm-referenced kind. At least they're set up so everyone theoretically could do very well (or very badly); it's not a zero-sum game. But in practice these tests may be treated as though they were norm-referenced. This can happen if parents or students aren't helped to understand that a score of 80 percent refers to the proportion of questions answered correctly, leaving them to assume that it refers to a score better than 80 percent of the other test-takers.19 Worse yet, criterion-referenced tests may be turned into the norm-referenced kind if newspapers publish charts showing how every school or district ranks on the same test, thus calling attention to what is least significant. (One expert on testing suggests that if newspapers insist on publishing such a chart, they should at least run it where it belongs, in the sports section.)20 Still, the main point is that when the tests themselves have been designed
specifically as sorting devices, the harm is damn near inescapable. It
is a point that almost everyone should be able to understand, yet our children
continue to be subjected to tests like the ITBS that are both destructive
and ridiculously ill suited to the purposes for which they are used. And
they will continue to be used - and the scores will continue to be published
- until you and I stop responding to the results by saying, "Ninety-fifth
percentile! That's terrific!" or "Bottom quartile? What went wrong?" and
start responding to such numbers by saying, "Wait a minute. What difference
does that make? Do they think we're idiots?"
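The percentile argument in this section can be checked with a few lines of arithmetic. What follows is only a minimal sketch, not anything taken from a real testing program: the two five-student cohorts, the 50-question test, and the helper names `percentile_ranks` and `percent_correct` are all invented for illustration, and a percentile rank is assumed here to mean the percentage of test-takers in the same group who scored lower.

```python
# Hypothetical illustration: none of these numbers come from a real test.

def percentile_ranks(raw_scores):
    """Norm-referenced view: percent of the group scoring below each student."""
    n = len(raw_scores)
    return [100 * sum(other < s for other in raw_scores) // n for s in raw_scores]


def percent_correct(raw_scores, num_questions):
    """Criterion-referenced view: percent of questions each student got right."""
    return [100 * s // num_questions for s in raw_scores]


NUM_QUESTIONS = 50  # assumed test length

# Two invented cohorts: one in which everyone did well, one in which no one did.
strong_cohort = [46, 47, 48, 49, 50]
weak_cohort = [6, 12, 18, 24, 30]

for name, cohort in (("strong", strong_cohort), ("weak", weak_cohort)):
    print(name, "percentile ranks:", percentile_ranks(cohort))
    print(name, "percent correct: ", percent_correct(cohort, NUM_QUESTIONS))
```

Under these assumptions both cohorts produce the same list of percentile ranks, even though every student in the first group answered at least 92 percent of the questions correctly and no student in the second answered more than 60 percent. In the first group the students at the bottom and the top are only four questions apart, which is the "half an inch taller" situation described above.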
What the Tests Test
Even standardized tests that are criterion- rather than norm-referenced tend to be contrived exercises that tell us very little about the intellectual capabilities that matter most. What they primarily seem to be measuring is how much a student has crammed into his short-term memory. Lauren Resnick concedes that some standardized tests contain "isolated items that test students' critical thinking and reasoning knowledge," but she explains that they nevertheless fail to offer students the opportunity "to carry out extended analyses, to solve open-ended problems, or to display command of complex relationships, although these abilities are at the heart of higher-order competence."21 Resnick points out that what generally passes for a test of reading comprehension is a series of separate questions about short passages on random topics. These questions "rarely examine how students interrelate parts of the text and do not require justifications that support the interpretations"; indeed, the whole point is the "quick finding of answers rather than reflective interpretation." Tests of writing, meanwhile, are positively laughable: they are about memorizing the mechanics of grammar and punctuation, often requiring that students correct mistakes in isolated sentences. To state the obvious, "Recognizing other people's errors and choosing the correct alternatives are not the same processes as those needed to produce good written language."22 In mathematics, the story is much the same. An analysis of the most widely used standardized math tests found that only 3 percent of the questions required "high level conceptual knowledge" and only 5 percent tested "high level thinking skills such as problem solving and reasoning."23 Typically, the tests aim to make sure that students have memorized a series of procedures, not that they understand what they are doing.
They also end up measuring knowledge of arbitrary conventions (such as the accepted way of writing a ratio or knowing that " < " means "less than") more than a capacity for logical thinking.24 Even those parts of math tests that have names like "Concepts and Applications" are "still given in multiple-choice format, are computational in nature, and test for knowledge of basic skills through the use of traditional algorithms."25 As for science, the parts of standardized examinations devoted to this subject often amount to nothing more than a vocabulary test. Multiple-choice questions that focus on "excruciatingly boring material" fail to judge students' capacity to think and wind up driving away potential future scientists, according to the president of the National Academy of Sciences.26 The point here is not that standardized tests are too hard or too easy. The problem is not the difficulty level, per se, but that they are geared to a different, less sophisticated kind of knowledge. And the more this is so, the more teaching comes to imitate these tests as teachers are steered away from helping kids learn how to think. Indeed, the students who ace these tests are often those who are least interested in learning and most superficially engaged in what they are doing. This is not just my opinion: studies of elementary, middle school, and high school students have all found a statistical association between high scores on standardized tests and relatively shallow thinking. Not only are these examinations not about deep understanding - they seem to be about its opposite.27 Perhaps this is why, as Piaget pointed out years ago, "anyone can confirm how little the grading that results from examinations corresponds to the final useful work of people in life."28 But never mind their inability to predict what students will be able to do later; they don't even capture what students can do today. In fact, we could say that such tests fail in two directions at once.
On the one hand, they overestimate what some students know: those who score well often understand very little of the subject in question. They may be able to find a synonym or antonym for a word without being able to use it properly in a sentence. Older students may have memorized the steps of comparing the areas of two geometric figures without really understanding geometry at all. Younger children may be able to correctly write 8 next to a picture of eight ice cream cones while continuing to believe that eight of them spread out are more than eight crowded together.29 Students may even be able to "psych out" the test itself by ascertaining which kinds of answers are usually incorrect or what the writers of the test are looking for. On the other hand, standardized tests underestimate what other students know because, as any teacher can tell you, very talented kids often get low scores. It is true in writing - "countless cases of magnificent student writers whose work was labeled as 'not proficient' because it did not follow the step-by-step sequence of what the test scorers (many of whom are not educators, by the way) think good expository writing should look like."30 It is true in reading: a first-grade teacher in Ohio, frustrated that an excellent reader was being placed in a remedial class because he didn't perform well on a standardized test, showed an administrator the books this boy could read as well as an entire book he had written. This evidence was brushed aside and she was told just to look at the test scores. As she recalls the conversation, "When I pointed out that there wasn't even any reading on this so-called reading-readiness test, well then [the administrator] said maybe that was the problem - that I should spend more time getting them ready to read rather than having the kids read."31 What we have here is a double indictment of standardized testing.
"Pupils who read widely and with good comprehension may be undervalued, while pupils who perform well on isolated skill tests but who can't or don't care to read are lulled into complacency."32 The same is true in math. One group of researchers described a fifth-grader who flawlessly marched through the steps of subtracting 2 5/6 from 3 1/3, ending up quite correctly with 3/6 and then reducing that to 1/2. But successfully performing this final reduction doesn't mean he understood that the two fractions were equivalent. In fact, he remarked in an interview that 1/2 was larger than 3/6 because "the denominator is smaller so the pieces are larger." Meanwhile, one of his classmates, whose answer had been marked wrong because it hadn't been expressed in the correct terms, clearly understood the underlying concepts. Intrigued, the researchers then proceeded to interview a number of fifth-graders about another topic, division, and discovered that 41 percent had memorized the process without really grasping the idea, whereas 11 percent understood the concept but made minor errors that resulted in getting the wrong answers. A standardized test therefore would have misclassified more than half of these students.33 As disturbing as all of this may be, we can dig even deeper, looking not only at the specific questions that appear on such tests but at the format and nature of the tests themselves. It is the very features of standardized testing we take for granted that ultimately undermine their usefulness:
Put these last few points together and you have a scenario that is not merely disagreeable but ludicrously contrived. After all, how many jobs demand that employees come up with the right answer on the spot, from memory, while the clock is ticking? (I can think of one or two, but they are the exceptions that prove the rule.) How often are we forbidden to ask coworkers for help or to depend on a larger organization for support-even in a society that worships self-sufficiency? And when someone is going to judge the quality of your work, whether you are a sculptor, a lifeguard, a financial analyst, a housekeeper, a professor, a refrigerator repairman, a reporter, or a therapist, how common is it for you to be given a secret pencil-and-paper exam? Isn't it far more likely that the evaluator will look at examples of what you've already done or perhaps watch you perform your normal tasks? To be consistent, those educational critics who indignantly insist that schools should be doing more to prepare students for the real world ought to be manning the barricades to demand an end to these artificial exercises called standardized tests. Of course, anyone who reads through this list may be inclined to wonder,
"Well, how could you have a standardized test that wasn't concerned
only with right answers or wasn't secret or timed or whatever?" Indeed,
you probably couldn't. But here is a very different question: How could
you devise a way of figuring out how well students are learning, or schools
are teaching, that didn't have these features? As we'll see in the final
chapter, this question does have an answer. But it's critical that we frame
the issue in these broader terms, that this becomes our point of departure,
because only then are we free to look beyond - and solve the problems created
by - standardized tests.
Raising the Scores, Ruining the Schools
Lately, some opponents of standardized testing have invoked a bucolic saying: "You don't fatten a steer by weighing it." The point, of course, is that merely measuring something, such as students' learning, doesn't in itself lead to any change in what is being measured. This is true as far as it goes, but the metaphor is poorly chosen44 because it implies that testing has no impact on education. The reality is that it almost always does have an impact, increasingly by design. Unfortunately, the impact is usually negative. Consider the messages that standardized testing communicates to children about the nature of learning. Because a premium is placed on remembering facts in many of these tests, students may come to think that this is what really matters - and they may even come to develop a "quiz show" view of intelligence that confuses being smart with knowing a lot of stuff. Because the tests are timed, students may be encouraged to see intelligence as a function of how quickly people can do things. Because the tests often rely on a multiple-choice format, students may infer "that a right or wrong answer is available for all questions and problems" in life and "that someone else already knows the answer to [all these questions], so original interpretations are not expected; the task is to find or guess the right answer, rather than to engage in interpretive activity." 45 If we're looking for more direct harms, we could just tally up the time students and teachers waste actually taking these tests. But where our kids really pay the price is with what comes before and because of the tests. "It is the hours spent practicing types of questions that might appear on the tests and the days denying students enrichment options that are truly meaningful that make proficiency tests so harmful and invasive," as one educator put it.46 In schools around the country, the content and style of teaching are being placed in the service of the tests. 
Teachers often feel obliged to set aside other subjects for weeks at a time in order to teach test-taking skills. Sometimes the tests hijack the entire curriculum as schools are transformed into giant test-prep centers. When students will be judged on the basis of a multiple-choice test, teachers may use multiple-choice activities beforehand. It is not uncommon to find instruction "in the same format as the test rather than [in a form] used in the real world. For example, teachers reported giving up essay tests because they are inefficient in preparing students for multiple-choice tests."47 The assignments (in class and to take home) may change as well. It's not unusual to hear of schools where teachers are required to use multiple-choice formats in their teaching.48 This has aptly been called the "dumbing down" of instruction, although curiously not by the conservative critics with whom that phrase is normally associated. More striking, either because they think it is best for their students or because they have a gun to their heads, teachers will dispense with poetry and focus on prose, breeze through the Depression and linger on the Cold War, cut back on social studies to make room for more math-depending on what they think will be emphasized on the tests. They may even put all instruction on hold and spend time giving practice tests. The defenders of standardized testing don't try to deny that it forces schools to reconfigure the curriculum; indeed, they cheerfully acknowledge this. "There's nothing wrong with teaching to the test. That is what education is all about," declared Robert V. Antonucci, the former commissioner of education for Massachusetts.49 An article in the American School Board Journal prefers the euphemism "curriculum alignment" and insists that it's paying off . . . 
as measured exclusively by test scores!50 It is a relatively new idea-that tests should be used not only to measure but to mandate-but in scarcely a generation it has come to be taken for granted. Sometimes it is done deliberately, perhaps because it offers policymakers "one of the few levers on the curriculum that [they] can control."51 Other times, educators and parents simply realize that a test emphasizes isolated facts rather than critical thinking and figure the curriculum had better be retooled accordingly. Either way, the tail of testing is now wagging the educational dog. You'd think that when officials sit down to formulate an education policy, they would begin by agreeing on some broad outlines of what students ought to know and be able to do, and only then address the question of measuring how successfully this is happening. The reality, though, often seems to be exactly the opposite: "What can be measured reliably and validly becomes what is important to know," as critics of one state's reform efforts observed.52 It's rather like the old joke about the fellow who was looking around for his lost keys one night, explaining to a passerby that he was searching the sidewalk right near a streetlight not because that was where he dropped them but because "the light is better here." A more indirect effect of the same mentality can be glimpsed in Ohio, where the pressure to boost proficiency test scores has led to changes in how teachers of children from age 9 to 14 are certified by the state. Teachers have been forced to specialize in only two content areas (such as math and science), which means that the kind of departmentalization that has created such a fragmented educational experience in high school may now happen, thanks to testing pressures, as early as fourth grade.53 In some states, as the following chapter will explain, officials have turned up the heat by creating "high-stakes" testing programs. 
But even without this added pressure, it's virtually inevitable that teachers will feel pressured to change what they do. As one principal in Virginia put it, "We know what's being tested. So now we know what we have to teach."54 In fact, when teachers from forty-eight schools around the country were surveyed a few years ago, nearly all "reported spending substantial time (a week or more) giving students worksheets that review test content, giving students practice with the item formats likely to be on the test, and directly teaching test-taking strategies."55 And teachers understand the costs of doing this. In another survey, 60 percent of the math teachers and 63 percent of the science teachers described the "negative effects of a testing program on curriculum or student learning," citing the "narrowing and fragmenting of the curriculum" among other consequences.56 We've already seen some illustrations of the limits of these tests. If students get higher scores on math exams for memorizing techniques than for understanding concepts, which do you think teachers will emphasize? If a teacher sees first-graders being penalized for having spent time reading rather than being drilled on reading-readiness skills, what direction will her class take the following year? Teachers all over the country struggle with variations of this dilemma, worrying not only about their own jobs but about the short-term price their students may have to pay for more authentic learning. The choices are grim: either the teachers capitulate, or they struggle courageously to resist this, or they find another career.57 I remember visiting a school in Illinois a few years ago where the battle had already been lost. The district was under a court order to bring up its test scores and had bought a packaged program called "Success for All" to do just that (see p. 300n53). The result resembled a factory more than a place of learning, with children being exhorted to succeed and perform and achieve. 
(By contrast, words like "curiosity," "discovery," and "exploration" were nowhere to be seen or heard.) In room after room I saw children correcting punctuation, answering plot questions about story fragments, and completing worksheets full of multiplication problems and those all-too-familiar analogy questions. Of course I hadn't seen the school earlier, so I can't say for sure that its patron saint was John Dewey before it became Stanley Kaplan. In other places, though, the shift is visible to the naked eye: As ITBS testing gets closer, sixth-grade science at a school in Arizona mutates from hands-on learning to textbooks and dittos - and the total time allocated to science is sharply curtailed (and, for a while, completely eliminated) to make more time for tested subjects. And so it goes. "Everywhere we turned," one group of educators reported in 1998, "we heard stories of teachers who were being told, in the name of 'raising standards,' that they could no longer teach reading using the best of children's literature but instead must fill their classrooms and their days with worksheets, exercises, and drills." The result in any given classroom was that "children who had been excited about books, reading with each other, and talking to each other were now struggling to categorize lists of words."59 In some classes, of course, it was never otherwise. The very raison d'etre of high school advanced placement (AP) courses, for example, has always been to prepare students for a test. (There has been much discussion about who gets to take these classes60 but very little about their basic purpose and high-powered drill-and-skill method of instruction.) One writer suggests that teachers ought to just issue a formal declaration of surrender to the Educational Testing Service and be done with it.61 Even in classes less noticeably ravaged by the imperatives of test preparation, there are hidden costs - opportunities missed, intellectual roads not taken. 
For one thing, teachers are less likely to work together in teams.62 For another, in each classroom, "the most engaging questions kids bring up spontaneously-'teachable moments'-become annoyances."63 Excitement about learning pulls in one direction; covering the material that will be on the test pulls in the other. No wonder elementary school teachers in one state overwhelmingly denounced the effects of a new testing program, saying in a recent survey that "morale has sunk, practice tests are soaking up teaching time, and students are more anxious about school than ever before."64 Sometimes teachers feel they must depart from teaching testable facts in testlike fashion . . . in order to impart advice about test-taking, per se. The first thing to be said about using school for this purpose is that it is an egregious waste of our children's time-although the fault lies not with the teachers who do it but with the tests themselves and those who pay inordinate attention to the results. Second, it is educationally "harmful if students transfer [these test-taking tactics] to other classroom activities."65 We don't want kids to get in the habit of skimming a book, looking for facts they might be asked on a test, instead of really thinking about and responding to what they're reading. Third, even if clever strategies (for example, skipping to the questions first, then going back to the passage to find the answers) are effective, this means that, to some extent at least, a high test score reflects not knowledge or intelligence but good test-taking skills. If one can be successful by engaging in what some have called "legal cheating"-if we can indeed raise students' scores by teaching them tricks or by cramming them full of carefully chosen information at the last minute - this should be seen not as an endorsement of such methods but as a devastating revelation about how little these tests are really telling us. 
Linda Darling-Hammond offers this analogy: Suppose it has been decided that hospital standards must be raised, so all patients must now have their temperatures taken on a regular basis. Shortly before the thermometers are inserted, the doctors run around giving out huge doses of aspirin and lots of cold drinks. Remarkably, then, it turns out that no one is running a fever! The quality of hospital care is at an all-time high!66 What is really going on, of course, is completely different from providing good health care and assessing it accurately-just as teaching to the test is completely different from providing good instruction and assessing it accurately. "By focusing on improving test scores," two researchers warn, "only test scores, and not schools themselves, will improve."67 Notice that scores typically plummet whenever a state or district decides to administer a new test. (And the headlines read: Our schools are failing! Our students are ignorant!) After a few years, the scores begin to rise as students and teachers get used to the test. (And the headlines read: Our schools are improving! Tougher standards are effective!)68 Another kind of evidence comes from stories like the one about a junior high school in New Jersey where an intensive test-prep effort succeeded in producing the highest scores in the area-after which one third of the students required remedial classes when they got to high school.69 They weren't helped to learn; they were helped to get good scores, which did them no good and may even have done them considerable harm. What all this means can be summarized in a sentence: At best, high test scores for a given school or district are probably meaningless; at worst, they're actually bad news because of the kind of teaching that was done to produce those scores. To talk about the kind of teaching that was done is to talk about the kind of teaching that was not done. 
The first thing to go in a school or district where these tests matter a lot is a more vibrant, integrated, active, effective kind of instruction. A Cambridge, Massachusetts, teacher of seventh- and eighth-grade students70 can tick off exactly what she's had to sacrifice in order to prepare her students for that state's new test. Class meetings to build community, learn democratic skills, and solve problems together? No time. The flexibility to depart from the lesson plan and discuss important current events? Forget it; today's news won't be on the exam. A while back, this teacher had devised a remarkable unit in which every student picked an activity that he or she cared about and then proceeded to become an expert in it. Each subject, from baking to ballet, was researched intensively, described in a detailed report, and taught to the rest of the class. The idea was to hone research and writing skills as well as to help each student feel like an expert in something and to heighten everyone's appreciation for the craft involved in activities they may not have thought much about. In short, it was the kind of academic experience that people look back on years later as a highlight of their time in school. But now her students won't have the chance: "Because we have so much content material to cover, I don't have the time to do it," this teacher says ruefully. "I mean, I've got to do the Industrial Revolution because it's going to be on the test." One leading educational organization has noted that "ironically, the calls for excellence in education that have produced widespread reliance on standardized testing may have had the opposite effect-mediocrity."71 But it's important to point out that this effect shows up in some areas more than others. The noted high school reformer Ted Sizer recalls a conversation he once had with a top school official who was "proposing to test the kids until they begged for mercy." 
It turned out, you may not be surprised to learn, that this administrator sent his own kids to private schools where standardized tests were exceedingly rare if they were used at all and, incidentally, where excellence meant an emphasis on discovery rather than a "back-to-basics" mentality.72 The point isn't that this official was caught in a political faux pas for pulling his children out of the public schools he was supposed to be defending. Rather, he seemed to think the traditional approach to education, including a heavy diet of standardized testing, is for other people's children - and, as it turns out, particularly for children of color. Even apart from charges that some standardized tests are biased against minorities because of their content,73 such tests-with all the implications for teaching they carry-are more likely to be used and emphasized in schools with higher percentages of minority students.74 The result is that even people who are understandably desperate to improve inner-city schools wind up making the problem worse when they cause reform efforts to be framed in terms of improving standardized test scores. Indeed, the whole conversation about improving education in this country has been narrowed by the use of these tests. The more that scores are emphasized, as Sherman Dorn at the University of South Florida has pointed out, the less we discuss the proper goals of schooling. Now it's just a matter of finding the most efficient means for what has become the de facto goal: doing better on tests. Furthermore, we tend to stop using (or developing) other ways for evaluating classroom practices and student learning: "As long as a school or teacher has adequate test scores, what happens in the classroom is irrelevant," Dorn remarks. Similarly, poor test scores are viewed as indicators that change is needed, "no matter what happens in the classroom."75 For a parent, the implications of all this are straightforward. 
In the words of Gail Jones, a professor of education at the University of North Carolina, "The bottom line is, do you want to have a child who can take tests well or do you want to have a well-educated child?"76
4. GETTING EVALUATION WRONG (NOTES)
1. Here are four facts about the SAT that everyone should know: 1. We've come to think of it as a necessary part of going to college, but in its current (multiple-choice) form it has been used only since the 1940s. Realizing that something hasn't always been done helps to remind us that it doesn't have to be done. So does realizing that something isn't done everywhere. Canadian universities, for example, don't use the SAT or anything like it for college admission. Moreover, at least 280 U.S. colleges and universities have stopped requiring that applicants take the SAT-or its equally pernicious Midwest counterpart, the ACT. (See "ACT/SAT Optional," 1997.) The organization that released that information, FairTest, subsequently surveyed some of those colleges and found that most were pleased with the results; applicants, for example, were no less capable when test scores were not required. (See "'Test Score Optional' Admissions," 1998. For a list of colleges that no longer require SATs or ACTs, send a self-addressed stamped envelope to FairTest, 342 Broadway, Cambridge, MA, 02139, or download it from www.fairtest.org.)
2. Sacks, 1997, p.25.
3. "Few countries today give these formal examinations to students before the age of sixteen or so" (Resnick and Nolan, 1995, p.102). Moreover, "it is interesting to note that European countries, whose education systems are often touted as superior to ours, have not historically relied on standardized tests at all" (Wagner, 1998, p. 516).
4. "By the time they were ready to graduate to sixth grade last June, New York City fifth graders had taken eight standardized tests over the previous 14 months; this year's fourth-grade reading test... will take three days to administer. In Chicago, pupils took 12 standardized tests from the winter of third grade through the spring of fourth grade" (Steinberg, 1999).
5. A survey of 250 representative school districts across the U.S.
in 1995 found that more than 93 percent reported giving standardized reading tests to children before they had reached the third grade (Dougherty, 1998).
6. Compared to other forms of assessment, standardized tests are a bargain. But given the sheer scope of testing these days, they cost hundreds of millions of dollars that could instead be spent to help children learn.
9. McNeil, 1986, p. xviii. W. Edwards Deming was one of the few people in his field to understand that this is also true in the business world. He frequently commented that the most important management issues simply cannot be reduced to numbers.
10. Mitchell, 1992, p. 134. We may not know who "they" are, but "their" fallibility is brought home to us every so often, particularly when a student taking one of these tests can make a case that, say, "b" is the best answer, or at least that there is sufficient ambiguity in the question to permit several legitimate answers, but at the same time knows full well that "they" are looking for "c" and will count only that response as correct. Anyone who has ever had this experience should understand on yet another level how wide the gap is between higher achievement and higher test scores.
11. Ayers, 1993, p. 118. At the very least, reports of a school's (or district's) average test score should always be accompanied by information about the income levels of the community in question.
12. Rotberg, 1998b, p.28. A chart in this article provides striking evidence of a correlation between the child poverty rate in each state and its ranking on the NAEP math test for eighth-graders in 1992. The average poverty rate for the five states with the highest test scores was 15 percent; for the bottom five states, it was almost exactly twice that.
14. Norm-referenced tests were used by thirty-one states in 1997, two more than the year before, and they are "not going away" (Bond et al., 1998, pp.7, 9).
Worse, individual school districts often use more of these tests than the states require.
16. For example, far and away the most common method for determining whether students are "reading on grade level" is to look at whether they score above a certain percentile (such as the 50th) on a norm-referenced test. At least two out of every five districts in the country do this, compared with only one out of six that use a criterion-referenced test and fewer still that use other definitions. See Dougherty.
17. The reverse situation actually happened in late 1998 on an international scale. See p. 246n63.
18. Thus, "the better the job that teachers do in teaching important knowledge and/or skills, the less likely it is that there will be items on a standardized achievement test measuring such knowledge and/or skills" (Popham, p.12; also see 1993, pp.106-11). Furthermore, because these tests are designed to maximize "response variance" (that is, to spread out the students' scores), they also tend to include the sort of questions that "tap innate intellectual skills that are not readily modifiable in school" and information "learned outside of school." Yet the results of these tests are then used to rate the effectiveness of schools. Popham (1999, p.14) also points out that the relevance of knowledge acquired outside school, which of course is not acquired equally by all children, helps to explain the high correlation between test scores and socioeconomic status.
19. A survey of parents' understanding of a test in Michigan revealed that "an overwhelming majority. . . interpreted the criterion-referenced score as a normative percentile" (Paris et al., 1991, p.14).
22. Resnick and Resnick, 1990, pp.71-72. Their overall conclusion: "The tests most widely used to assess achievement are unfriendly to the goals of the thinking curriculum" (p.73).
24.
Constance Kamii (1989, p.157) has pointed out that standardized tests may look as though they're tapping logical-mathematical knowledge but to a significant extent are really just tapping arbitrary conventional knowledge. Two math educators give a good example from a Massachusetts test for high school students. The question reads as follows:
n   1  2  3  4  5  6
tn  3  5
25. Wood and Sellers, 1997, p.181.
26. Bruce Alberts's comments, delivered at the academy's annual meeting in May 1998, were reported in "Science Leader Criticizes Tests," 1998.
27. One study classified fifth- and sixth-graders as "actively" engaged in learning if they went back over things they didn't understand, asked questions of themselves as they read, and tried to connect what they were doing to what they had already learned; and as "superficially" engaged if they just copied down answers, guessed a lot, and skipped the hard parts. It turned out that the superficial style was positively correlated (r= .28, significant at p < .001) with composite scores on the CTBS and Metropolitan standardized tests (Meece et al., 1988). A study of middle school students, meanwhile, found that those "who value literacy activities and who are task-focused toward literacy activities" got lower scores on the CTBS reading test (Anderman, 1992). And, as already mentioned, the same pattern shows up with the SAT (see p. 262n1, par. 2).
30. Delisle, 1997, p.44. He continues: "With many of the multiple-choice questions having several 'correct' options in the eyes of creative thinkers, scores get depressed for children who see possibilities that are only visible to those with open minds." And from another source: "Good readers, for example, take lots of risks in the process of reading most materials. These risks lead to errors. Depending upon their impact on meaning, these errors may or may not be corrected by the reader.
But reading tests, which involve fairly short passages followed by trick questions and answers, require being constantly alert to precisely the kind of insignificant errors that good readers let fall by the wayside" (Calkins et al., 1998, p.47).
31. The teacher, Marcia Burchby, is quoted in Wood, 1992, p.32.
35. Robert J. Mislevy is quoted in Mitchell, 1992, p.179. Actually, it's more like twenty-first-century statistics at this point, and the psychology in question didn't come into its own until the early part of the twentieth century, but you get the idea. Dennie Wolf and her associates (1991, p.47) contend that "the technology of scoring has become one of the most powerful realizations of behaviorist views of learning and performance."
36. See Frederiksen, p.199, for a discussion of Herbert Simon's distinction between well-structured and ill-structured problems, the latter being more realistic and important, the former showing up on standardized tests.
37. Bond et al., p.10; Neill, 1997.
38. "Short-answer questions and computational exercises presented in formats that can be scored quickly and 'objectively'" represent a "typically American style of testing [that] is quite different from traditions in other countries, where more complex problem solving is the norm on both classroom and external examinations" (Schoen et al., 1999, p.446).
39. Farr is quoted in Checkley, 1997a, p.S.
40. Frederiksen, p.199. And from another source: "No multiple-choice question can be used to discover how well students can express their own ideas in their own words, how well they can marshal evidence to support their arguments, or how well they can adjust to the need to communicate for a particular purpose and to a particular audience. Nor can multiple-choice questions ever indicate whether what the student writes will be interesting to read" (Gertrude Conlan is quoted in Freedman, 1993, pp.29-30).
41.
This is roughly analogous to how norm-referenced testing is inherently objectionable but not all criterion-referenced testing is worth celebrating either. In fact, putting the two observations together, it's not hard to find criterion-referenced essay exams that require students to analyze a contrived chunk of text or cough up facts about the Victorian era. The results may not be valid, and the exercise may not be worth it. Moreover, the way these exams are scored raises even more concerns. For example, the essays written by students from two dozen states are not evaluated by educators; they are shipped off to North Carolina, where low-paid temp workers spend no more than two or three minutes reading each one. "'There were times I'd be reading a paper every 10 seconds,' " one former scorer told a reporter. Sometimes he "would only briefly scan papers before issuing a grade, searching for clues such as a descriptive passage within a narrative to determine what grade to give. 'You could skim them very quickly ... I know this sounds very bizarre, but you could put a number on these things without actually reading the paper,'" said this scorer, who, like his coworkers, was offered a "$200 bonus that kicked in after 8,000 papers" (Glovin, 1998).
43. Wiggins, 1993, p.72. "Thoughtful and deep understanding is simply not assessable in secure testing," he continues, "and we will continue to send the message to teachers that simplistic recall or application, based on 'coverage,' is all that matters-until we change the policy of secrecy" (p.92).
44. If we need an analogy that evokes the heartland, this one seems more apt: "Measuring the richness of learning by giving a standardized test is like judging chili by counting the beans" (Fern, 1998, p.20).
45. Mitchell, 1992, p.15, and Resnick and Resnick, p.73, respectively.
Lauren Resnick (1987, p.47) adds that multiple choice tests "can measure the accumulation of knowledge and can be used to examine specific components of reasoning or thinking. However, they are ill suited to assessing the kinds of integrated thinking that we call 'higher order.'"
48. For example, see Goldstein, 1990a.
49. Antonucci is quoted in Daley and Hart, 1998.
51. Freedman, 1995, p.29. I am reminded of how the management theorist Douglas McGregor once explained why corporate executives are so fond of incentive plans. He said, in effect, that they like to dangle money in front of their employees because they can-that is, because while they cannot control how people will feel about their work, they can unilaterally determine how much people are paid.
52. Jones and Whitford, p.280. Also see Frederiksen on this point.
53. Departmentalization, in turn, tends to support other problematic practices, such as giving letter grades and segregating students by alleged ability.
54. Gloria Hoffman, principal of Randolph Elementary School in Arlington, Va., was quoted in Mathews and Benning, 1999.
55. Herman and Golan, 1993, p.22. It was also common to find staff meetings, as well as conversations between principals and individual teachers, devoted to reviewing test scores and strategies for raising them.
56. Madaus et al., p.16. This survey included more than two thousand teachers of grades four through twelve in six diverse districts. Also see Herman and Golan.
57. I borrow here from the analyses of Linda McNeil and Lorrie Shepard.
58. The Arizona anecdote comes from a study reported in Smith, 1991, p.10. The Texas examples are from Kunen, 1997, and Johnston, 1998. (In response to this situation, incidentally, the director of the Texas Business and Education Coalition reportedly commented that "what the state needs is even more testing" [Johnston, p.21].)
The Illinois example comes from Smith et al., 1998; the New Jersey quotation from Glovin, 1998; the North Carolina quotation from Simmons, 1997; the Maryland example from Goldstein, 1990a; and the New York example from Hartocollis, 1999.
60. For example, see Mathews, 1998a.
62. This finding by Susan Stodolsky is cited in Noble and Smith, 1994.
63. Zemelman et al., 1998, p.218.
64. Cenziper, 1998, p. 1C. "About half [of the teachers surveyed] said they spent more than 40 percent of their time having students practice for end-of-grade tests. Almost 70 percent thought the [tests] would not improve education at their schools" and "almost half the teachers surveyed in the UNC study said preparation for the tests has decreased students' love of learning." A press release about the study revealed that two thirds of teachers have changed their teaching strategies to prepare students for the tests, with many relying more on worksheets, lectures, and tests in class than they otherwise would.
66. Darling-Hammond's analogy, which originally appeared in a 1989 article in Rethinking Schools, is paraphrased in Calkins et al., p.44.
68. Principals, politicians, and journalists routinely cite just such a predictable jump in scores as evidence that the desperate efforts to prepare students for those very tests have been successful. For example, a reporter for the Los Angeles Times asserted that a high-stakes testing program in Texas-perhaps the most educationally destructive program in the nation-"has shown signs of dramatic success," meaning that scores on the test itself have risen (Kolker, 1999).
69. The story appears in Calkins et al., p.47.
70. Kathy Greeley of Graham and Parks School. Personal communication, 1998.
71. National Association for the Education of Young Children, 1987, p.1. Piaget (p.
74) anticipated this problem decades ago: "The school examination becomes an end in itself because it dominates the teacher's concerns, instead of fostering his natural role as one who stimulates consciences and minds, and he directs all the work of the students toward the artificial result which is success on final tests."
72. Personal communication, 1998.
73. The work of FairTest (see note 1) is relevant here. Recall also that the questions on a norm-referenced test are more likely to be included not only when just some students get them right but when they are answered correctly by those who answer the other questions correctly, too. Thus, "a test item on which African-Americans do particularly well but whites do not is likely to be discarded because of the interaction of two factors: African-Americans are a minority, and African-Americans tend to score low" (Neill and Medina, 1989, p.692).
74. Virtually no one, including defenders of standardized tests, will deny this fact; it is evident to anyone who visits schools around the country. Those looking for empirical confirmation can find it in Madaus et al., pp.2, 15-16; and in Herman and Golan, who discovered a particularly striking discrepancy in how much time was spent taking practice tests in poor versus rich schools. In the most extreme cases, such as Baltimore elementary schools serving low-income African-American students, "teachers are now spending the entire school year teaching for the standardized tests." Sadder still, some teachers don't seem to mind: "Their scores went up this last time, so it's worth it," one third-grade teacher remarked (Winerip, 1999, p.40).
76. Jones is quoted in Cenziper, p. 7C.