December 12, 2024

Computers As Graders

One of my least favorite tasks as a professor is grading papers. So there’s good news – of a sort – in J. Greg Phelan’s New York Times article from last week, about the use of computer programs to grade essays.

The computers are surprisingly good at grading – essentially as accurate as human graders, where an “accurate” grade is defined as one that correlates with the grade given by another human. To put it another way, the variance between a human grader and a computer is no greater than that between two human graders.

Eric Rescorla offers typically interesting commentary on this. He points out, first, that the lesson here might not be that computers are good at grading, but that human graders are surprisingly bad. I know how hard it is to give the thirtieth essay in the stack the careful reading it deserves. If the grader’s brain is on autopilot, you’ll get the kind of formulaic grading that a computer might be able to handle.

Another possibility, which Eric also discusses, is that there is something simple – I’ll call it the X-factor – about an essay’s language or structure that happens to correlate very well with good writing. If this is true, then a computer program that looks only for the X-factor will give “accurate” grades that correlate well with the grades assigned by a human reader who actually understands the essays. The computer’s grade will be “accurate” even though the computer doesn’t really understand what the student is trying to say.

The article even gives hints about the nature of the X-factor:

For example, a high-scoring essay almost always contains topically relevant vocabulary, a variety of sentence structures, and the use of cue terms like “in summary,” “for example,” and “because” to organize an argument. By analyzing 50 of these features in a sampling of essays on a particular topic that were scored by human beings, the system can accurately predict how the same human readers would grade additional essays on the same topic.
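The article doesn’t say how the system works internally, but the description suggests a standard recipe: extract surface features from a set of human-scored sample essays, fit a simple model, and use it to score new essays on the same topic. Here is a minimal sketch in Python; the three features are crude stand-ins for the 50 the article mentions, and every name in it is invented for illustration:

    # Sketch of a feature-based grader: fit surface features of
    # human-scored sample essays to their scores, then predict.
    import re
    import statistics
    import numpy as np

    CUE_TERMS = ["in summary", "for example", "because"]  # cue terms named in the article

    def features(essay, topic_words):
        """Crude stand-ins for the ~50 features the article mentions."""
        words = re.findall(r"[a-z']+", essay.lower())
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        return [
            sum(essay.lower().count(t) for t in CUE_TERMS),                 # cue-term use
            len(set(words) & set(topic_words)) / max(1, len(topic_words)),  # topical vocabulary
            statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,        # sentence variety
        ]

    def train(sample_essays, human_scores, topic_words):
        # Least-squares fit from features (plus a bias term) to human scores.
        X = np.array([features(e, topic_words) + [1.0] for e in sample_essays])
        weights, *_ = np.linalg.lstsq(X, np.array(human_scores, dtype=float), rcond=None)
        return weights

    def grade(weights, essay, topic_words):
        return float(np.array(features(essay, topic_words) + [1.0]) @ weights)

If surface features like these really do carry the X-factor, grades from such a model will correlate with human grades even though the model understands nothing.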

This is all very interesting, but the game will be up as soon as students and their counselors figure out what the X-factor is and how to maximize it. Then the SAT-prep companies will teach students how to crank out X-factor-maximizing essays, in some horrendous stilted writing style that only a computerized grader could love. The correlation between good writing and the X-factor will be lost, and we’ll have to switch back to human graders – or move on to the next generation of computerized graders, looking for a new improved X-factor.

Comments

  1. This is the beginning, or should I say another beginning, in my own opinion at least, of conformism as the fastest, easiest method. Always run by the people who are making the money. There may be computer geeks involved, ones like the type who tediously program supercomputers for chess games, though there have been no true allies of computer progress admitting to a deviously unhuman adaptation. I like Jim Beam; I’ll wait on the shelf for an effort that is worth a full shot. -MAN

  2. Responding to Mary Hodder. Good, well-educated labour (I’m English; I won’t follow your spelling) is emphatically NOT expensive. It may cost a lot of money (though in most capitalist countries it is grossly undervalued), but that is not the same as being expensive. A good teacher is beyond price. The return on good teaching is immeasurable. It actually lasts for generations, for parents pass it on to their children.

  3. If the grading software does indeed rely on statistical properties of the essay under test, then I see a great application for a Markov chain generator to crank out essays that correlate perfectly with the grading software’s expectations (a sketch follows below). Creating randomized essays should put a good dent in the effectiveness of plagiarism-detection software, too.

    –Bob.
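    A rough Python sketch of the Markov chain idea: train a word-level chain on essays the grader is known to score highly, then emit statistically similar text. The corpus file name is hypothetical.

        import random
        from collections import defaultdict

        def build_chain(corpus, order=2):
            """Map each `order`-word prefix to the words seen to follow it."""
            words = corpus.split()
            chain = defaultdict(list)
            for i in range(len(words) - order):
                chain[tuple(words[i:i + order])].append(words[i + order])
            return chain

        def generate(chain, length=250):
            prefix = random.choice(list(chain))
            out = list(prefix)
            while len(out) < length:
                followers = chain.get(tuple(out[-len(prefix):]))
                if not followers:  # dead end: restart from a random prefix
                    followers = chain[random.choice(list(chain))]
                out.append(random.choice(followers))
            return " ".join(out)

        # Hypothetical usage: train on high-scoring essays, harvest output.
        # chain = build_chain(open("high_scoring_essays.txt").read())
        # print(generate(chain))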

  4. I have to agree with Alex–my students are already myopically focused on superficial elements of what they see as discourse markers of academic writing. Too seldom do they attempt to engage in any substantive thought that may not conform to Turabian or Strunk & White.

  5. Seth, I think you’re absolutely right. It DOES work with humans. To my mind the real scandal here is that the human grading is probably incredibly shallow–in the interest of predictability. I’m not sure you can do this kind of mass grading and not have it be shallow.

  6. This is kind of interesting. What immediately occurred to me is that students could apply this method to tune their essays.

    Let’s assume that the students know that their essays will be graded by a program, and have access to the program themselves. Then they will be able to pre-grade their own essays and, better still, tune them to maximize scores (a sketch of such a tuning loop follows below).

    Even if the essay is subsequently graded by a human, if the correlation is strong enough, it might still work.

    How long before the Microsoft Word spelling and grammar checker becomes the spelling, grammar and essay grading checker?
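    If students can query the grader, the tuning loop imagined above is a few lines of greedy hill-climbing. A sketch, where grade() is whatever scoring function the student has access to, and the single edit move – splicing in cue terms – is invented for illustration:

        import random

        CUE_TERMS = ["because", "in summary", "for example"]

        def mutate(essay):
            """One random score-seeking edit: splice in a cue term."""
            words = essay.split()
            words.insert(random.randrange(len(words) + 1), random.choice(CUE_TERMS))
            return " ".join(words)

        def tune(essay, grade, rounds=1000):
            """Greedy hill-climbing: keep any mutation that raises the score."""
            best, best_score = essay, grade(essay)
            for _ in range(rounds):
                candidate = mutate(best)
                score = grade(candidate)
                if score > best_score:
                    best, best_score = candidate, score
            return best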

  7. This has already happened. Two years ago, in prep for the GMAT, which requires two essays, I did online test prep. The prep was specifically set up to conform exactly to the “X factor” you describe. The online prep told me that one of the three graders would be a machine, and that it would look for “because” and “in summary” etc. and that the whole thing was a sham. I did sample essay tests online every day, following the test prep’s strict format, and my final score on the test was a perfect 6, as averaged between the two humans and the one machine. Totally gamed. Cynical. And the only way they can grade millions of these stupid things for standardized tests every year.

    I don’t think I’m a great writer, but I do appreciate the personal attention I have received both as an undergraduate and now in graduate school, and I believe that I’m better at writing because of it. A really good education comes from personal attention. But good, well-educated labor is expensive, and so attempts to make education more efficient will keep happening. But the reality is that if you are going to teach people to think and learn for themselves, which are the most important things I believe you can teach, then you will have to do it personally. Whether that personal attention comes from some computational system, in the form of some sophisticated AI program, or from a person, is another issue. But grading programs like the one above only inspire students to think more cynically.

  8. I dunno. The “fake X-factor” sounds to me a lot like the standard academic writing style 🙂

    One might argue it works with humans too, per Alan Sokal’s hoax of an article written in semi-gibberish.

  9. I agree. It sounds very much like spam filtering.

    It’s actually fairly easy to counter the “fake X-factor” tactic. You have some small fraction (5-10% ought to be enough) of essays read by a human grader. People who submit totally broken essays get penalized. I suspect that this would be enough incentive to keep students in line (see the arithmetic below).
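    Back-of-the-envelope, a random audit deters gaming whenever the penalty outweighs the grade gain by more than (1 − audit rate) / audit rate – about 9x at a 10% audit. The numbers below are illustrative:

        # Expected payoff of submitting a gamed essay under random human audits.
        def expected_payoff(audit_rate, gain_if_missed, penalty_if_caught):
            return (1 - audit_rate) * gain_if_missed - audit_rate * penalty_if_caught

        # At a 10% audit rate, a penalty worth 10x the grade gain already
        # makes gaming a losing bet on average:
        print(expected_payoff(0.10, gain_if_missed=1.0, penalty_if_caught=10.0))  # -0.1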

  10. ha! sounds a lot like spam filtering 😉