Thursday, March 8, 2012

Steven Brill - Vol. 3 - Test Data: A False Sense of Reliability

In this ongoing series, I’ll break down Brill’s presentation in the Microsoft Auditorium at Seattle Central Library on November 1st, 2011.  Brill teaches journalism at Yale, founded Court TV, and wrote about New York City’s infamous “rubber rooms” early last decade.

When Steven Brill was asked whether he might be overgeneralizing in his criticisms of public education by focusing on just a few bad examples, his answer revealed much about how members of the PeWKoB (People Who Know Better), also known as the “meddlers,” view educational improvement.

Brill’s answer was that he is not overgeneralizing. His evidence is that in Washington state, our “performance versus expense ratio” is very bad, and our “Race to the Top” application was denied.  From these two items, Brill declares an entire state a failure in education.

This is data-driven decision making?  No wonder these guys don’t get why standardized tests are poor measures of student learning.

Before we get into the details, think about this answer.  This man, who was quite possibly setting foot in Washington for the first time in his life, castigates the entire state because of a ratio he read about and a denied federal program application–a program that many believe is not in the best interests of students.

He declares that not much good is happening here, in spite of the fact that our SAT scores have consistently ranked in the upper tier compared to other states.  In spite of the thousands of graduates who go on every year to college and occupational success all around the world.

I have learned that people who make broad, general statements such as this are usually covering for their lack of expert knowledge.  But Brill gave evidence to back up his statement.  So let’s consider his evidence.  What is this “performance versus expense” ratio he’s talking about?

The Ratio Rules You
Quite simply, this ratio is what it sounds like: How well our students perform compared to how much money is spent on them.

In other words, because we spend a certain amount of money per student, we should be seeing a certain level of performance.
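To see what a blunt instrument this ratio is, here’s a minimal sketch in Python.  The numbers are invented for illustration–I don’t have the actual figures Brill was working from:

```python
# A hypothetical "performance versus expense" ratio.
# All figures below are invented, not real state data.

def performance_expense_ratio(avg_test_score, spending_per_student):
    """Average test-score points per $1,000 spent per student."""
    return avg_test_score / (spending_per_student / 1000)

# Two imaginary states with identical student performance:
state_a = performance_expense_ratio(avg_test_score=75, spending_per_student=9500)
state_b = performance_expense_ratio(avg_test_score=75, spending_per_student=12000)

print(round(state_a, 2))  # 7.89
print(round(state_b, 2))  # 6.25
```

Same scores, but the bigger spender gets a “worse” ratio.  The number alone says nothing about whether the extra money was wasted or invested in something with a long payoff.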

A host of questions immediately come to mind:
  1. What level of performance is “good enough” to justify the expense, and who decides this?
  2. How are they determining if this level has been reached, and how many students have to reach it for us to be considered a success?  Is perfection the standard?
  3. If they’re using a test, which test, and is this test a valid measure of student performance?
  4. How do we know the money being spent directly correlates to how students perform on this test?
  5. If you declare our ratio to be bad, which Brill does, to whom are you comparing it?  Other states? Other nations? Is this a fair comparison, considering that other states and nations have different systems, different populations, and very different economies?
As you can see, the problem the meddlers like Brill refuse to acknowledge is the inherent complexity of anything as large as education.  To boil an entire system down to a single ratio is to deny all the complexity within it. 

I have found, when you talk to people who like to call education a “failure,” you find they scoff at any explanation that touches on the social complexity of the system.  But anything involving people is going to be more complicated than things like making coffee cups. 

So, relying on this ratio and ignoring the human element involved in all aspects of education enables the PeWKoB to overlook those questions I asked above, and promote policies that have little data support and no chance of significantly improving education.

So let’s delve into these questions for a bit.

5 Questions with Hard Answers
1. What level of performance is good enough?  2. Who determines this, and how?

Most people have never thought about this.  Who decides what constitutes “passing” on any standardized test, such as the HSPE?  Think about what would go into this. 

Writing the Test
First, you have to write the test.  This is a daunting, laborious task that takes years–yes, years–to complete.  We pay testing companies millions to develop these tests. 

Why does it take so long?  Because writing a quality test is much more difficult than most people realize.  You need clearly written questions.  They must be clear enough to be understood by any student, even one whose English is limited.  Remove buzzwords and language with double meanings.  Make sure enough information is given.  Make sure any diagrams or tables are clear.  You also need questions that assess what students are supposed to know, according to the standards (of which there are too many to cover in a single year in most classes).  And you need a good variety of questions for each standard.  You can’t have ten addition questions and only one subtraction problem.

To accomplish all this takes lots of manpower, lots of editing, lots of time.  But, even with all this, the test will reflect the biases of its authors.  Our state math tests are a great example of this.  Early versions of this test asked some of the most bizarre questions imaginable.  How could they get past all those rounds of editing I just described?

Because the process itself was geared to produce an incomprehensible test.  Before even beginning, they set out to write a test that assesses “problem solving,” that forever-vague but great-sounding notion common to reform mathematics.  With that mandate, they wrote a test that was doomed to fail the majority of students from the outset, because it didn’t assess what students actually do in class on a day-to-day basis. 

Eventually, passing rates scraped their way up to around 50 or 60% statewide, but never got higher.  Until, that is, we switched to End of Course (EOC) Algebra and Geometry tests.  Then, magically, test scores shot up over 20 percentage points.  At my school, scores doubled.  Yes, doubled.  And that still includes the students who score zero for failing to show up.

Yes, schools are penalized even for absent test-takers.  Did you know that?  What do you think about it?

Now, why did test scores shoot up in just one year? 

Ooh! I know! I know! Because we have great teachers!

No, silly, because the tests finally assessed more of what students actually learn in class.  I know.  It’s a groundbreaking concept.  Interesting, though, that when scores are bad, they blame the teachers, but when scores go up, all the naysayers are oddly silent.  For them, education will always be in crisis.

So, the tests were lousy for years, and the scores reflected that.  The 20-point increase in one year proves it. But Brill is using this data to say our “performance” on a terrible exam is not living up to our expenses.  Hmm. Bogus data, invalid conclusions. 

What is Passing?
All this test writing aside, let’s assume it’s a fair measure of student learning–a quality assessment.  What then?  How do we decide who passes?  Have you ever wondered about this?  Why is the passing score what it is?  How do they determine a 1, 2, 3, or 4 for each category?

This too is highly subjective.

First, you have to decide if all the questions are worth the same amount of points.  If there are any free response questions, should they be worth more?  How much more?  Second, you have to decide how to award those points.  If a question is worth 4 points, what is a 3?  What is a 2? Can a student get the wrong answer and still earn 3 out of 4?  Can they get the right answer but not get 4 out of 4?  The more complicated the question, the more complicated the scoring. 

Third, you have to determine how many points all students can be fairly expected to earn.  You need some kind of an “attainable minimum.”  And then, you have to decide if you want to make “passing” higher or lower than this minimum.  So, if a test is worth 100 points, we can call 60 a passing score.  We can call 55 a passing score.  The testing industry decides what constitutes a passing score on their tests. 

How do they determine all this? 

One way they do it is by piloting tests for a couple years before making them “count.” In other words, the first few years we had the WASL (which later became the HSPE, and which has, in part, now become the EOC exams), the students taking it knew they didn’t have to pass it. They were, in essence, informing the well-paid test writers and analysts how students will do on this type of test. 

Then, these well-paid statistical analysts crunch the numbers (including those of students who left whole questions blank because they didn’t care or didn’t understand–a very common occurrence), see where the averages are, assess which questions seemed to be too hard, too easy, or poorly written, and settle on a level that students must surpass in order to pass the test, once it counts for real.
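For the curious, here’s roughly what that number-crunching could look like, stripped to its simplest form.  This is an illustration of one possible method (a simple percentile cut), not the actual procedure any testing company uses, and the pilot scores are invented:

```python
# Illustration only: one simple way a cut score could be set from pilot data.
# Real standard-setting methods are more elaborate, but the judgment calls are the same.

def set_cut_score(pilot_scores, target_pass_rate):
    """Return the lowest score that roughly target_pass_rate of pilot
    test-takers met or beat (zeros from blank tests included)."""
    ranked = sorted(pilot_scores)
    cutoff_index = int(len(ranked) * (1 - target_pass_rate))
    return ranked[cutoff_index]

# Invented pilot scores out of 100 -- note the zeros from students who blew it off.
pilot = [0, 0, 12, 35, 41, 48, 52, 55, 58, 61, 64, 66, 70, 73, 81, 88]

print(set_cut_score(pilot, target_pass_rate=0.5))  # 58
```

Notice that the zeros drag the whole distribution down, quietly lowering the bar for everyone.  And change the target pass rate–a human choice–and the “objective” cut score moves right along with it.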

Do you see how much subjectivity goes into all this?  Remember, these are people who don’t know any students.  They never see a classroom.  But we like to delude ourselves that because it’s a state test, and especially when it’s multiple choice (because it makes scoring a lot easier), then it’s totally objective and a fair assessment of what a student knows.

It’s not enough to have grades from all the classes, to have finals, to have teacher knowledge of their students based on daily classroom interactions, ongoing formative and summative assessments, regular feedback, and class discussions.  None of this is enough.  No, we have to also have an additional test written by someone we never meet, and we give this test more weight in the determination of whether this student should graduate or not.

Further, we never get to see results of the test specific enough to see what we did wrong.  We just get a number.  Here’s your score.  Either you passed, or you suck.  All the other things you’ve done in high school mean nothing now.

Isn’t it interesting that most teachers I know can guess who in their classes will pass the state tests with reasonable accuracy?  Isn’t it also interesting that in many cases, the same students who pass these tests were already all doing well in school?  That the ones who are doing poorly in class also don’t do well on these tests? 

How much money is spent on tests, then, that aren’t telling us much we didn’t already know?

How do you realistically set a bar that determines what it means to pass a test?  I’m by no means saying it can’t be done.  Graduate schools and specialty schools do it.  There’s nothing wrong with a required minimum.

But it’s also a different thing when a law school does it to determine who is qualified to pursue a specific career in law, compared to an entire state (or nation?) doing it simply to determine if a student should graduate.

It’s one thing to assess one subject or topic.  It’s quite another to try to sum up 12 full years of educational growth in a single test.

Are we too trusting of the people determining what it means to pass?

3. Are these tests valid measures of what a student knows and has learned?

There is a science test given in our state that is mostly “biology.” I put this in quotes because, if it were a biology test, you would think that a student who hasn’t taken biology would have a hard time passing it.

Yet, at our school, we have some students who take chemistry as sophomores instead of biology.  They are still forced to take the state science test.  Last year, we gave a two-week crash course to these students on biology concepts, and on how to interpret the goofy questions and scoring methods used on the state science test.  (And by the way, we weren’t paid for the collective 40 man-hours spent on this.)

Out of about thirty such students–ones who have never taken biology–all but one passed the state science test.  All but one.  Now to be fair, these are students who could be labeled “advanced.” But still.

What does this tell you about a test that its writers say consists mostly of biology?  This test is really a reading test.  It’s a test about knowing how to put an answer the scorers will like.  It’s about jumping through the fiery hoop in such a way that you don’t get burned.

For example, they used to have a question that required students to write out a detailed procedure for a scientific investigation.  Most students failed the question.  In fact, historically, science scores have been by far the lowest across the state compared to math, reading, and writing.  Why?

The procedure question was particularly terrible.  You could write a solid procedure and still fail the question.  Why?  It goes back to the first two questions I asked.  You could fail because you didn’t answer it in the way their scoring guide says you needed to. Not because you didn’t know the concept.

So, you could include all the steps, but if you didn’t say “repeat for other trials,” you lost a point. If you didn’t have a step to “verify your results,” you lost a point.  There were several other goofy requirements, and when it all got added up, students could lose half the points even if their procedure made complete sense. 
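If you want to see how mechanical this kind of scoring is, here’s a sketch of a checklist-style scorer in Python.  The phrases and point values are invented stand-ins, not the real scoring guide:

```python
# A sketch of checklist-style rubric scoring, like the procedure question above.
# The required phrases and point values are invented for illustration.

REQUIRED_PHRASES = {
    "repeat": 1,   # stand-in for "repeat for other trials"
    "verify": 1,   # stand-in for "verify your results"
    "record": 1,   # stand-in for "record your data"
    "control": 1,  # stand-in for "control your variables"
}

def score_procedure(answer_text):
    """Award one point per required phrase found, out of 4 total."""
    text = answer_text.lower()
    return sum(points for phrase, points in REQUIRED_PHRASES.items() if phrase in text)

# A perfectly sensible procedure that never uses the magic words:
sensible = ("Measure each plant's height daily, water one group but not the "
            "other, and compare the growth of the two groups after two weeks.")

print(score_procedure(sensible))  # 0
```

A coherent procedure scores zero; a student who parrots the checklist scores four.  That’s measuring what the graders are looking for, not what the student understands.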

Some of you might be saying, “Well, they should have read the directions then.”  Ah.  But the directions didn’t say you had to have a step about verifying your results.  Nowhere was this stated, anywhere, yet students were graded down for not doing it. 

I know this because the state released old test items and scoring guides so we could help our students do better.  It was then that I realized, to put it bluntly, this is a stupid test.  This is one of the worst tests I’ve ever seen.  Most college students would fail it.  They are expecting every student across the state to write a procedure in the exact same format. 

Is this how life works?  Do real scientists do this?  No, actually.  And I know this too, because I did real research and read real science papers in college.  There is no universal procedure-writing template.  There is no single “scientific method” to which all scientists swear allegiance before entering their laboratories each day.  This is not about what a student knows. It’s not about conceptual understanding or skill mastery.

It’s about whether they know what the scorers are looking for.  And the students know this even better than we do.  Why do you think the “test prep” industry exists?  They don’t learn subject knowledge in these workshops!  Students hate these tests, because they see through the facade of legitimacy these tests purport to have.

Now, these bad procedure questions were from several years ago. Surely, these kinds of issues have been ironed out.  Surely.


I’m not so sure. And don’t call me Shirley.

I am waiting with frothing anticipation to see how my chemistry sophomores this year do on the brand new (piloted) End of Course “Biology” exam, even though they have never taken biology.

I’m dying to see this.  I’m betting the same thing will happen as last year.  I’m betting that almost all of them will pass a “Biology” test, even though they haven’t taken biology. 

And this is a major problem.  It shouldn’t be possible to pass a test pertaining to a course you’ve never taken and didn’t study for.  If it’s this easy to do, even for “advanced” students, then this is not a biology test.  It’s a hoop that some students can jump through more easily than others.

4. How do we know the money spent correlates to student achievement?

This is another huge question.  Some argue that money is overrated.  They say that we have thrown more money at education for years and look at the results.

In this, I agree with them.  Money alone is not the answer.  But why?  This is yet another immensely complex question (see? complexity....yet Brill writes this off in a single sentence).

One reason is because it depends on how the money is spent. If you spend all the money to raise teacher pay, this probably isn’t going to make a difference in student achievement.  Why would it?  What am I going to do differently whether I get paid $65,000 or $67,000 per year? What is that extra two grand going to do for my students that it couldn’t do before?

Nothing, probably. 

Interestingly, this is the exact reason why teachers oppose merit pay.  Because it won’t work. Most teachers are already working so hard, the extra pay just makes them feel a little better.  And districts that have tried this have found this exact result.  Some of my best laughs come when people say we oppose merit pay because the union tells us to.  I opposed it the first time I heard about it–because it won’t work, and it’s impossible to implement.

Personally, I’ve gotten down to about nine and a half hours per day this year.  This is a big deal for me, because last year I was over ten per day most of the year.  And sure, everyone likes higher pay, and teaching as a profession deserves a certain compensation.  But that’s a discussion for a different day.

Another reason money isn’t the answer is because, assuming you don’t spend it on higher teacher salaries, what else will you spend it on? 

And here’s where the tub overflows.

Parental contact, tutoring, summer programs, extra counselors, extra intervention specialists, extra classroom aides, essay graders, new textbooks, new computers, new software, new classroom technology, new after-school programs (of which there are dozens of possible ones), community partnering programs, internships, extra-curricular activities, clubs... 

Shall I go on?

I came up with that in about one minute.  Give me time to think, and in a room with other teachers and administrators, and see what else we come up with. 

There’s no shortage of ideas for what to do with extra funding. 

Who, exactly, is Steven Brill to say that none of these ideas–not a single one–is worth putting money into? 

So, one reason money doesn’t seem to make a difference is because some of these uses have more immediate bang for the buck than others do.  Some do work, but the payoff might not be for several years.  Some programs work better than others.  Some cost more, some cost less.  Some work in some schools, but won’t work in others.  People generally cost the most, far more than resources.  So any position that requires staffing is a big expense.

For example, at our school we have an entire position dedicated to administering all the state-mandated tests.  And he’s busy all year long.  Is Brill in favor of that use of money?

But my real question to the meddlers like Brill is this: If money doesn’t matter, why do the most successful charter schools (like Harlem Children’s Zone) spend far more per student than public schools do?  What are they spending it on?  Interesting, in fact, that most of this extra money goes to address social problems, not classroom ones.  They have “pre-birth” classes for expectant parents.  Pre-birth.  We’ll help you before you even have a kid.

That’s some money being spent there.  But at all the “failing” public schools, it’s still the teachers’ fault, of course.

Money, like technology, is simply a tool.  And like technology, it can be wasted, taken for granted, or utilized well.  Often, it gets wasted on technology, ironically.  Like, what’s up with these schools buying iPads for all their students?  Talk about wasting money.  What does Brill think of “iPads for all,” I wonder.

5. How can you compare one state to another, or one nation to another?

Finally, we have to question how this dictatorial ratio can be the basis for comparing states or nations, all of which have different cultures, economies, educational systems, and standardized tests.

Some states have very high pass rates.  Others have lower ones.  When you look into the reasons, it’s because some states have harder tests.  Yeah, I know. I was shocked too.

Do the states with harder tests have “higher” standards, then?  Not necessarily.  They might just have lousy tests.  (Eh, Washington?) 

But the divergent results we see from state to state show clearly that this whole testing obsession has little basis in reality.  We see that it doesn’t matter what kind of education system you have.  All that matters is what I brought up at the beginning: What kind of test do you use, and where is the bar for passing?

If we really want to improve our test scores, all we have to do is lower the bar a few points, and lots more students will pass.  Is this “lowering our standards”?  I know what the “reform” crowd thinks.
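The arithmetic here is trivial to demonstrate.  A quick sketch with invented scores–the point is the sensitivity, not the data:

```python
# Invented test scores out of 100.  Watch what moving the bar does.
scores = [48, 52, 55, 57, 58, 59, 61, 63, 66, 71, 74, 80, 85, 90]

def pass_rate(scores, cut):
    """Fraction of students scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)

print(f"{pass_rate(scores, cut=60):.0%}")  # 57%
print(f"{pass_rate(scores, cut=55):.0%}")  # 86%
```

Drop the bar five points and the “pass rate” jumps nearly thirty points, with not one student learning anything new.  That’s how fragile these numbers are.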

But for those who are open-minded to questions like this, ask yourself:

Q: Is it possible for a standard to be too high?  Is it possible? 

Requiring every student to pass Algebra 2 to graduate, as one district did in Tennessee a few years ago, is a really terrible idea and will just cause an increase in the dropout rate. You will never get every single student to pass Algebra 2.  It’s just too hard for some kids.  If we required every student to write a two-hundred page novel to graduate, that would be too high a standard.  So enough with this fake panic they always do when we talk about “lowering standards,” as if this is always a terrible catastrophe of an idea.

Just like not every kid can run as fast, or hit a baseball as well, or memorize, or master a trade as quickly as others, not every student can perform equally well at every subject.  This is human nature.  We’re different. 

A standard that you want every student–every student, not just the top ones–to be able to realistically attain must be lower than one that allows students into college.

We aren’t testing for college entrance exams.  We’re testing to see if you’ve met a minimum requirement.  Minimum.  That means it must be attainable.

Purpose of Tests
Now, I personally believe we shouldn’t be testing at all.  I think the entire industry is a total waste of time, money, and effort, and we’d be better off spending that in the classroom and on the subject matter.  Let grades stand for themselves. 

But, I’m willing to compromise.  End of Course tests are far preferable to these generic, statewide tests that assess every skill imaginable.  But, even still, I’ve been giving End of Course exams my whole career.  They’re called “finals.”  They’re working well enough.  The truth is, there’s more to school than tests.

Tests make much more sense as a means to get in to something, rather than get out.  You want in a specific college or grad school program?  You have to pass a test.  You want to drive a car?  Pass a test.  You want to be a teacher?  Pass (a whole bunch of) tests. 

But to require a test as a means to get out of school, this is a different proposition, because there is no specific goal, and therefore no guide for the test-writers to base their work on.

So when Brill says we’re not scoring well enough compared to how much money we’re spending, he says this in spite of the fact that we have a lousy state testing program that doesn’t accurately assess what our students know.  He doesn’t know how hard our test is compared to other states and other nations. He doesn’t know about all the specific local programs in place that do make a difference for some students.  He doesn’t know where the money gets spent.

How much gets spent on testing, for example?  This ratio, for it to have any use, is dependent on a series of assumptions.  We must assume our tests are valid and accurate assessments.  They aren’t.  We must assume they are scored well and that the bar for passing is fair.  This is in question.  We must assume that all the money being spent directly affects test scores, as if test scores are the only thing that matters.  It doesn’t because they aren’t.

Thus, we can conclude that this ratio, on which he bases his “assessment” of our entire state education system, is in fact quite useless.

Students are not bar codes.  Complexity cannot be idly dismissed.  And lots of good things are happening, even if 30% of students drop out.  That means 70% are graduating.

If perfection is your standard, then yes, we are failing.  If reality is allowed to play its part, then I think we’re doing alright.  And I know, because I see them graduate every year, as prepared as they can be for the next step in their lives.
