reliability Archives

On December 31, some models of the Zune, Microsoft’s portable music player, went dark. The devices were unusable until the following day. Failures like this are sometimes caused by complex chains of mishaps, but this particular one is due to a single programming error that is reasonably easy to understand. Let’s take a look.

Here is the offending code (reformatted slightly), in the part of the Zune’s software that handles dates and times:

year = 1980;

while (days > 365) {
    if (IsLeapYear(year))  {
        if (days > 366)  {
            days -= 366;
            year += 1;
        }
     } else {
        days -= 365;
        year += 1;
    }
}

At the beginning of this code, the variable days is the number of days that have elapsed since January 1, 1980. Given this information, the code is supposed to figure out (a) what year it is, and (b) how many days have elapsed since January 1 of the current year. (Footnote for pedants: here “elapsed since” actually means “elapsed including”, so that days=1 on January 1, 1980.)

On December 31, 2008, days was equal to 10592. That is, 10592 days had passed since January 1, 1980. It follows that 10226 days had passed since January 1, 1981. (Why? Because there were 366 days in 1980, and 10592 minus 366 is 10226.) Applying the same logic repeatedly, we can figure out how many days had passed since January 1 of each subsequent year. We can stop doing this when the number of remaining days is less than a year — then we’ll know which year it is, and which day within that year.

This is the method used by the Zune code quoted above. The code keeps two variables, days and year, and it maintains the rule that days days have passed since January 1 of year. The procedure continues as long as there are more than 365 days remaining (“while (days > 365)“). If the current year is a leap year (“if (IsLeapYear(year))“), it subtracts 366 from days and adds one to year; otherwise it subtracts 365 from days and adds one to year.

On December 31, 2008, starting with days=10592 and years=1980, the code would eventually reach the point where days=366 and year=2008, which means (correctly) that 366 days had elapsed since January 1, 2008. To put it another way, it was the 366th day of 2008.

This is where things went horribly wrong. The code decided it wasn’t time to stop yet, because days was more than 365. (“while (days > 365)”) It then asked whether year was a leap year, concluding correctly that 2008 was a leap year. (“if (IsLeapYear(year))”) It next determined that days was not greater than 366 (“if (days > 366)“), so that no arithmetic should be performed. The code had gotten stuck: it couldn’t stop, because days was greater than 365, but it couldn’t make progress, because days was not greater than 366. This section of code would keep running forever — leaving the Zune seemingly dead in the water.

The only way out of this mess was to wait until the next day, when the computation would go differently. Fortunately, the same problem would not occur again until December 31, 2012 (the last day of the next leap year), and Microsoft has ample time to patch the Zune code by then.

What lessons can we learn from this? First, even seemingly simple computations can be hard to get right. Microsoft’s quality control process, which is pretty good by industry standards, failed to catch the problem in this simple code. How many more errors like this are lurking in popular software products? Second, errors in seemingly harmless parts of a program can have serious consequences. Here, a problem computing dates caused the entire system to be unusable for a day.

This story might help to illustrate why experienced engineers assume that any large software program will contain errors, and why they distrust anyone who claims otherwise. Getting a big program to run at all is an impressive feat of engineering. Making it error-free is too much to hope for. For the foreseeable future, software errors will be a fact of life.

[Hat tip: “itsnotabigtruck” at ZuneBoards.]

Which States have the Highest Risk of an E-Voting Meltdown?

Debugging the Zune Blackout

Contributors

Archives by Month

Which States have the Highest Risk of an E-Voting Meltdown?

Debugging the Zune Blackout

What We Discuss

Contributors

Archives by Month