May 30, 2024

Gymnastics Scores and Grade Inflation

The gymnastics scoring in this year’s Olympics has generated some controversy, as usual. Some of the controversy feels manufactured: NBC tried to create a hubbub over Nastia Liukin losing the uneven bars gold medal on the Nth tiebreaker, but top-level sporting events whose rules do not admit ties must sometimes decide contests by tiny margins.

A more interesting discussion relates to a change in the scoring system. The old 0.0 to 10.0 scale has been replaced by a new scale that adds together an “A score” measuring the difficulty of the athlete’s moves and a “B score” measuring how well the moves were performed. The B score is on the old 0 to 10 scale, but the A score is on an open-ended scale with fixed scores for each constituent move and bonuses for continuously connecting a series of moves.
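The additive structure described above can be sketched as a toy model. This is a minimal illustration, not the official code of points. The function names, the element values, the connection bonus, and the rule that execution deductions are subtracted from 10.0 (floored at zero) are all assumptions made for the example.

```python
# Toy model of the open-ended A + B scoring described above.
# All element values, bonuses, and deductions are hypothetical.

def a_score(element_values, connection_bonuses):
    """Difficulty ("A") score: a fixed value per move plus connection bonuses.
    There is no upper limit; a harder routine simply adds up to more."""
    return sum(element_values) + sum(connection_bonuses)

def b_score(deductions):
    """Execution ("B") score: assumed here to start at 10.0 and lose points for flaws."""
    return max(0.0, 10.0 - sum(deductions))

def total_score(element_values, connection_bonuses, deductions):
    """Final score is the sum of the two components."""
    return a_score(element_values, connection_bonuses) + b_score(deductions)

# A hypothetical routine: five moves, one connected pair, two small execution errors.
print(round(total_score([0.3, 0.4, 0.5, 0.5, 0.6], [0.2], [0.1, 0.3]), 2))  # prints 12.1
```

Because the A score is an unbounded sum, adding one more valued move raises the total even when execution is already flawless. That is exactly what a fixed 10.0 ceiling could not express.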

One consequence of the new system is that there is no predetermined maximum score. The old system had a maximum score, the legendary “perfect 10”, whose demise is mourned by old-school gymnastics gurus like Bela Karolyi. But of course the perfect 10 wasn’t really perfect, at least not in the sense that a 10.0 performance was unsurpassable. No matter how flawless a gymnast’s performance, it is always possible, at least in principle, to do better, by performing just as flawlessly while adding one more flip or twist to one of the moves. The perfect 10 was in some sense a myth.

What killed the perfect 10, as Jordan Ellenberg explained in Slate, was a steady improvement in gymnastic performance that led to a kind of grade inflation in which the system lost its ability to reward innovators for doing the latest, greatest moves. If a very difficult routine, performed flawlessly, rates 10.0, how can you reward an astonishingly difficult routine, performed just as flawlessly? You have to change the scale somehow. The gymnastics authorities decided to remove the fixed 10.0 limit by creating an open-ended difficulty scale.
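To make the argument concrete, here is a toy comparison. It assumes each flawless routine can be summarized by a single hypothetical “raw” number that grows with difficulty; the real rules have no such quantity. Capping that number at 10.0 makes two different routines tie, while an open scale lets the harder one win.

```python
# Toy comparison of a capped scale versus an open-ended one.
# The "raw" values are hypothetical; the point is only that a cap erases differences at the top.

CAP = 10.0

def capped(raw):
    """Old-style score: anything above the cap is reported as 10.0."""
    return min(raw, CAP)

def open_ended(raw):
    """New-style score: no ceiling, so extra difficulty keeps counting."""
    return raw

very_difficult = 10.0           # flawless, very difficult routine
astonishingly_difficult = 10.8  # flawless, even harder routine

print(capped(very_difficult) == capped(astonishingly_difficult))         # True: the cap forces a tie
print(open_ended(astonishingly_difficult) > open_ended(very_difficult))  # True: the harder routine wins
```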

There’s an interesting analogy to the “grade inflation” debate in universities. Students’ grades and GPAs have increased slowly over time, and though this is not universally accepted, there is plausible evidence that today’s students are doing better work than past students did. (At the very least, today’s student bodies at top universities are drawn from a much larger pool of applicants than before.) If you want a 3.8 GPA to denote the same absolute level of performance that it denoted in the past, and if you also want to reward the unprecedented performance of today’s very best students, then you have to expand the scale at the top somehow.

But maybe the analogy from gymnastics scores to grades is imperfect. The only purpose of gymnastics scores is to compare athletes, to choose a winner. Grades have other purposes, such as motivating students to pay attention in class, or rewarding students for working hard. Not all of these purposes require consistency in grading over time, or even consistency within a single class. Which grading policy is best depends on which goals we have in mind.

One thing is clear: any discussion of gymnastics scoring or university grading will inevitably be colored by nostalgic attachment to the artists or students of the past.


  1. I still miss the perfect 10 in gymnastics scoring. Remember the old K.I.S.S. theory (keep it simple, stupid).

  2. I would argue that we already have a two-fold grading system, like the one gymnastics now has. One part is the grade; the other is the name of the course, the university where the course was given, and so on. We do not combine these into one unified score as they do in gymnastics.

    To illustrate this, we could start grading mathematics the way they used to score gymnastics. There would be no separate courses in lower or higher mathematics, just different mathematics courses with different difficulty levels. In elementary math you could not score higher than 3.0, even with a perfect score on the test. You would have to go up to the master’s level before you could hope to score a perfect 10.0, which would mean you had completed an advanced course with a perfect score.
    But then you run into the same problem gymnastics ran into: what do you give a Ph.D.-level student who takes a mathematics course and gets a perfect score? You want to give 11.0, but the scale is fixed at a maximum of 10.0. So the solution would be to lower the maximum score in all the other courses. But what happens when a teacher wants to give a perfect grade to a student who took an even more advanced course, if one were ever taught?

    The problem comes up when you have applications for university courses that people from all over the world compete for. How do you compare a student with perfect scores from a local university in China, a student with almost perfect scores from a top Taiwanese university, a Norwegian student with middling grades, and a person from a top US university who has crappy grades? Well, you add what you know about the courses and the universities to get a unified score, or an almost objective ranking of their relative strengths.

  3. I really don’t understand all the hemming and hawing over grade inflation. There are schools, colleges, and universities that use either written evaluations or a combination of written evaluations and grades to make very nuanced statements about a given performance.

    It’s hard to make any single number reflect something as complex as performance in a class over a semester. Unlike gymnastics, though, and as you point out, grades are not a competition. Written evaluations can go a long way toward making anyone willing to read several pages much better informed about the nature of a student’s performance.

  4. The Australian judge screwed Nastia out of her gold medal in the uneven bars, Ed, and NBC had nothing to do with it.

  5. I find the gymnastics scoring controversy amusing. I’ve been watching the Olympics since at least 1976, and they’re *always* complaining about gymnastics scoring. It almost wouldn’t seem like the summer Olympics without it.

    As for coming up with a routine with an A score of 17, that wouldn’t work. Although the A scores are set beforehand, they aren’t set in stone. If you don’t actually complete one of the elements, or pause too long between what are supposed to be connected elements, they can reduce your A score as well.

  6. Sorry about using this as a category to pass on some information, but I can’t find any “contact” link.

    Just to let you know that the Amergence Group (formerly known as SunnComm) is being sued by their previous legal counsel David Kahn.

    What is interesting in relation to Freedom To Tinker is that Kahn was their lawyer at the time they planned to file a lawsuit against Alex Halderman for exposing the infamous “shift key” bypass to their copy protection.

    “David L. Kahn, a Los Angeles lawyer representing SunnComm, said on Thursday that the company plans to file a lawsuit against Mr. Halderman, although he did not say when. He said Mr. Halderman had violated a provision of the Digital Millennium Copyright Act that makes it illegal to bypass a technology designed to limit the copying of electronic material.”

    Enjoy the irony.

  7. m,

    The story that grade inflation started or accelerated during the Vietnam War years is often repeated, but as far as I know it’s not supported by the data. The data at Harvard, for example, show a slow and steady increase in average GPA since about 1930, which is as far back as data are available. (Harry Lewis gives an interesting history of grade increases at Harvard in his book “Excellence Without a Soul”.)

  8. Actually, not only did I write that Slate piece about gymnastics, I also wrote a Slate piece many years ago about grade inflation; there I argued that grade transcripts, at least in their present form, do a pretty good job of distinguishing extraordinarily good students.

  9. “I wonder if school grades should take into account the difficulty of the course. Should an “A” in calculus have the same weight as an “A” in algebra?”

    Certainly in my state in Australia this is the case.

    In our high school senior system certain subjects, like Latin or the more advanced maths subjects contribute more to your score. It used to be the case that if you wanted to be a lawyer you’d do well to avoid Legal Studies because a perfect LS score contributed less than a decent Biology score.

    At the end of the day, you’re only ever marked according to the achievements of your contemporaries, so we get ranked, with the highest attainable score being 99.95. Obviously you can’t get 100.00, because then you’d be better than 100% of students… which would include yourself.

  10. In gymnastics, we are only trying to rank competitors, and only in a single tournament. First of all, we only compare people who directly competed against each other at an event, and never care about numbers such as the “season average” or “career average” of a gymnast. Secondly, score differences are meaningless; all that matters is who came first, second and third. The problem with the “perfect 10” system was not that the numbers weren’t right, or a lack of consistency across time, but that the numbers didn’t reflect the desired ordering: not enough reward for hard routines, for example.

    On the other hand, in most universities we rarely care about ranking the students who took a particular class. Rather, course grades are used as actual numerical values (for example, we calculate GPAs and use them in job and grad school admissions and to award prizes and scholarships). I don’t think the current grading system works well, but in any case it’s not comparable to the system used in gymnastics.

  11. A significant period of grade inflation occurred during the Vietnam war when the Selective Service system decided to limit student deferments on the basis of grades. Many profs, especially those opposed to the war, did not want to feel responsible for being a contributing cause of a combat death. Fs became Cs to satisfy the deferment requirements, and Cs and Bs tended to be upgraded to As.

  12. One thing that worries me about the new A-B scoring is that you could conceivably concoct a routine with an A score of, say, 17.0, then just get up, try your best (and probably injure yourself), and beat all the mortals. Some of the very hard vaults, etc., appeared to be exceedingly dangerous.

  13. Dan S.,

    I’m certain that computer science students are better now than ten years ago, but this is a special case, as you point out.

    There’s less evidence than one would like about the question of whether students are improving over time. One confounding factor is the change in the number of students going to college.

    There’s pretty good evidence that IQ test scores are increasing over time (the “Flynn Effect”). If IQ scores correlate with academic performance, or if whatever causes the Flynn effect tends to improve academic performance too — and both of these assumptions are at least plausible — then we would expect academic performance to improve.

    Another plausible story is that economic and social changes have made college available to more people, so that some people at the top of the distribution who would not have gone to college in the past can go now. The applicant pool at Princeton is much larger and more diverse than it was in (say) 1958, including many more students from outside the northeastern U.S., not to mention the addition of women to the applicant pool around 1970.

    It’s also the case that the methods and criteria used in admissions have changed over time, in ways that are likely to have affected the quality of the student body, in one direction or the other.

    And of course grades have been rising steadily since at least the 1930s, which is itself suggestive of improvement in the student body.

    This is far from an open-and-shut case, but I think it’s more likely than not that students are improving slowly over time, and that that improvement accounts for at least a substantial part of the increase in average grades.

  14. Dan W.,

    One approach would be to tell students you’re going to grade them, thereby creating the desired incentive, but then at the end of the semester refrain from giving them grades. Of course, this trick isn’t repeatable.

  15. I’d be interested in seeing the “plausible evidence that today’s students are doing better work than past students did”. Although it’s easy to believe that today’s incoming students are far ahead of past students in their knowledge and understanding of computer science, I’m under the impression that the proficiency of incoming (and therefore, presumably, at least freshman and probably upper-year as well) college students at traditional core subjects such as English, math and history has declined steadily over the past decades, even–perhaps especially–at the top end of the scale.

    The usual explanation is that college-track education was once geared towards a much more select group of students, of whom much more could be (and was) expected, whereas today’s college-track education targets a large fraction of the population, and thus has been dumbed down accordingly. (The SAT, for example, was “re-normed” in 1995, presumably to measure more accurately differences among students at the lower tier of achievement that is of interest to most universities.)

    But I’m happy to be persuaded by evidence to the contrary. Would you happen to have any pointers to data showing that student academic achievement, either overall or at the top end, has actually improved over the last few decades?

  16. Algebra for a college student should have a very low difficulty, but Algebra for an elementary student should have a very high difficulty. Calculus for a Literature major is a different animal than Calculus for an Engineering major, too…

    It might be interesting to experiment with grading scales like this for selecting valedictorians, but when it comes to accreditation, changing things may take a long time.

  17. I love teaching. I hate grading. This is a problem, since these tend to go together.

    If students would just attend my lectures and do the projects I give out, they’d learn all sorts of good things that would help them later in their life. Without grades, though, some of them might claim to have learned more than they actually did.

    Ultimately, grades serve as a signal to external evaluators about how good one student is relative to another (which, as a side effect, adds additional motivation for the student to perform well — their future career may depend on it).

    If we could just find a better way to signal about students’ relative strengths than numeric grades, the world would be a better place.

    (Meanwhile, it’s not hard to argue that any sport that requires subjective opinions of referees is always going to have inscrutable outcomes, much like we see with the gymnastics competitions. There’s much less argument about whether Michael Phelps legitimately won all his races, as there’s very little subjectivity in the touch-pad sensors.)

  18. I like the central idea of the new system: execution is scored on a 10-point scale, and difficulty can be essentially unlimited. But it runs into trouble in making the sport accessible to the layman (not necessarily a goal they’re working towards). To my untrained eye, the difficulty scores seem almost made up. Apparently some things that look incredibly hard are actually easy and vice versa; the new system makes it more apparent that we lay people have no idea about gymnastics and really just want to see people do flips and cool-looking stuff.

    I also think the tie-breaker rule is out-and-out dumb. Dropping scores until a winner is determined? What happens if they’re tied all the way down, or tied down to one or two judges? That seems to defeat the purpose of having multiple judges in the first place, if they’re saying that any one judge’s score is fair game. The only fair options I can see are either to hold a gymnast-off, where they have to redo their routines, or to declare them tied and thus of equivalent status (two golds).

  19. I wonder if school grades should take into account the difficulty of the course. Should an “A” in calculus have the same weight as an “A” in algebra?

    Many high schools do award more points for “advanced placement” courses, where an A is worth 5 points on a 4-point scale, a B is worth 4 points, etc.

    Weighting grades based on the difficulty of the course would eliminate the incentive to prop up one’s grade with the proverbial course in basket weaving.
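The weighting scheme in the last comment can be sketched as follows. This is a hypothetical illustration. It assumes every course carries equal credit and that the one-point bonus for advanced courses applies only to grades of C or better, since the comment’s “etc.” leaves the lower grades unspecified and actual school policies vary.

```python
# Sketch of a weighted GPA in which "advanced placement" courses earn a bonus point.
# Assumptions (hypothetical): equal credit per course; the bonus applies to C or better.

BASE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
BONUS_GRADES = {"A", "B", "C"}

def grade_points(grade, advanced):
    """Points for one course; an A in an advanced course counts as 5 on the 4-point scale."""
    points = BASE_POINTS[grade]
    if advanced and grade in BONUS_GRADES:
        points += 1.0
    return points

def weighted_gpa(courses):
    """courses: list of (letter_grade, is_advanced) pairs, each worth equal credit."""
    total = sum(grade_points(g, adv) for g, adv in courses)
    return total / len(courses)

# Same letter grades, different course difficulty:
regular = [("A", False), ("B", False), ("A", False)]
advanced = [("A", True), ("B", True), ("A", False)]
print(round(weighted_gpa(regular), 2))   # prints 3.67
print(round(weighted_gpa(advanced), 2))  # prints 4.33
```

With identical letter grades, the student who took the harder courses ends up with the higher GPA. This is the incentive against the “basket weaving” strategy that the comment describes.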