October 18, 2018

Debugging the Zune Blackout

On December 31, some models of the Zune, Microsoft’s portable music player, went dark. The devices were unusable until the following day. Failures like this are sometimes caused by complex chains of mishaps, but this particular one is due to a single programming error that is reasonably easy to understand. Let’s take a look.

Here is the offending code (reformatted slightly), in the part of the Zune’s software that handles dates and times:

year = 1980;

while (days > 365) {
    if (IsLeapYear(year))  {
        if (days > 366)  {
            days -= 366;
            year += 1;
        }
    } else {
        days -= 365;
        year += 1;
    }
}

At the beginning of this code, the variable days is the number of days that have elapsed since January 1, 1980. Given this information, the code is supposed to figure out (a) what year it is, and (b) how many days have elapsed since January 1 of the current year. (Footnote for pedants: here “elapsed since” actually means “elapsed including”, so that days=1 on January 1, 1980.)

On December 31, 2008, days was equal to 10593. That is, 10593 days had passed since January 1, 1980. It follows that 10227 days had passed since January 1, 1981. (Why? Because there were 366 days in 1980, and 10593 minus 366 is 10227.) Applying the same logic repeatedly, we can figure out how many days had passed since January 1 of each subsequent year. We can stop doing this when the number of remaining days is no more than a year; then we’ll know which year it is, and which day within that year.
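The day count can be checked by adding up year lengths. Here is a minimal sketch; is_leap_year and days_through_end_of are names made up for this illustration, and is_leap_year only stands in for the Zune's IsLeapYear, whose actual implementation is not shown in this post.

```c
#include <stdbool.h>

/* Gregorian leap-year rule. This is an assumed stand-in for the
   IsLeapYear() called by the Zune code; the real one is not shown. */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Day number of December 31 of last_year, counting January 1, 1980
   as day 1 (the convention described in the article). */
int days_through_end_of(int last_year)
{
    int days = 0;
    for (int y = 1980; y <= last_year; y++)
        days += is_leap_year(y) ? 366 : 365;
    return days;
}
```

Under these assumptions, days_through_end_of(2008) is 29 * 365 + 8, since 8 of the 29 years from 1980 through 2008 are leap years.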

This is the method used by the Zune code quoted above. The code keeps two variables, days and year, and it maintains the rule that days days have passed since January 1 of year. The procedure continues as long as there are more than 365 days remaining (“while (days > 365)”). If the current year is a leap year (“if (IsLeapYear(year))”), it subtracts 366 from days and adds one to year; otherwise it subtracts 365 from days and adds one to year.

On December 31, 2008, starting with days=10593 and year=1980, the code would eventually reach the point where days=366 and year=2008, which means (correctly) that 366 days had elapsed since January 1, 2008. To put it another way, it was the 366th day of 2008.

This is where things went horribly wrong. The code decided it wasn’t time to stop yet, because days was more than 365 (“while (days > 365)”). It then asked whether year was a leap year, concluding correctly that 2008 was a leap year (“if (IsLeapYear(year))”). It next determined that days was not greater than 366 (“if (days > 366)”), so that no arithmetic should be performed. The code had gotten stuck: it couldn’t stop, because days was greater than 365, but it couldn’t make progress, because days was not greater than 366. This section of code would keep running forever, leaving the Zune seemingly dead in the water.
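To watch the loop get stuck without hanging your own machine, you can run it with an iteration cap. This is a sketch for illustration only: run_original_loop, max_steps, and the is_leap_year helper are assumptions added here, not part of the Zune code.

```c
#include <stdbool.h>

/* Assumed stand-in for the Zune's IsLeapYear(). */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Runs the loop quoted in the article, but gives up after max_steps
   iterations. Returns true if the loop finished on its own, false if
   the cap was hit. Either way, the last (year, days) state is
   written to the output parameters. */
bool run_original_loop(int days, int max_steps, int *year_out, int *days_out)
{
    int year = 1980;
    int steps = 0;
    bool finished = true;

    while (days > 365) {
        if (steps++ >= max_steps) {
            finished = false;   /* still looping: treat as stuck */
            break;
        }
        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            }
        } else {
            days -= 365;
            year += 1;
        }
    }

    *year_out = year;
    *days_out = days;
    return finished;
}
```

Called with days=10593 and any reasonable cap, this reports failure with the state frozen at year=2008, days=366. Called with days=10592 (December 30, 2008), it finishes normally at year=2008, days=365.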

The only way out of this mess was to wait until the next day, when the computation would go differently. Fortunately, the same problem would not occur again until December 31, 2012 (the last day of the next leap year), and Microsoft has ample time to patch the Zune code by then.
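For comparison, here is one way the loop could be written so that it always either makes progress or stops. This is only a sketch of a possible fix, not the patch that was actually shipped; days_to_year and is_leap_year are names chosen for this example.

```c
#include <stdbool.h>

/* Assumed stand-in for the Zune's IsLeapYear(). */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Converts a 1-based day count starting at January 1, 1980 into a
   year and a 1-based day within that year. */
void days_to_year(int days, int *year_out, int *day_of_year_out)
{
    int year = 1980;

    for (;;) {
        int year_length = is_leap_year(year) ? 366 : 365;
        if (days <= year_length)
            break;              /* the remaining days fit in this year */
        days -= year_length;    /* otherwise always consume a full year */
        year += 1;
    }

    *year_out = year;
    *day_of_year_out = days;
}
```

The difference from the original is that the stopping test uses the length of the current year, so there is no state in which the loop continues without subtracting anything.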

What lessons can we learn from this? First, even seemingly simple computations can be hard to get right. Microsoft’s quality control process, which is pretty good by industry standards, failed to catch the problem in this simple code. How many more errors like this are lurking in popular software products? Second, errors in seemingly harmless parts of a program can have serious consequences. Here, a problem computing dates caused the entire system to be unusable for a day.

This story might help to illustrate why experienced engineers assume that any large software program will contain errors, and why they distrust anyone who claims otherwise. Getting a big program to run at all is an impressive feat of engineering. Making it error-free is too much to hope for. For the foreseeable future, software errors will be a fact of life.

[Hat tip: “itsnotabigtruck” at ZuneBoards.]

Comments

  1. Michael Donnelly says:

    This kind of very localized defect is a kind of “meta-error”. That is, the reason this came to light in a production environment cannot be reversed from the code. For instance, you can’t tell if…

    …the project manager decided at the last minute to have the developer fix this, after all of the major QA had been done and the “final build” already made.

    …the developer was on his last day and didn’t have time to (or didn’t bother to) write a hard unit test. A skilled developer gets a shaky feeling down the back of his spine with a tight loop like that, but not everyone feels that fear enough to brute-force test it with all possible dates.

    …someone from another team or sub-team hacked that in real quick. Being outside the process and confident that it would work (quick unit test today worked!), he checked it into source control.

    The neat thing about errors like these is that there is very likely to be a small story behind it, but we just can’t see it from here. The not-so-neat thing is that these kinds of code defects, which are really just team defects, are also very much present in software that needs to be secure, unlike this one example.

    • Yes, there’s probably an interesting backstory here. I’d love to know what it is. It might be possible to make an educated guess by thinking carefully about the structure of the (erroneous) code and how it might have come about.
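The brute-force test mentioned in this thread is cheap to write. Below is a rough sketch, assuming the loop from the article and a standard Gregorian is_leap_year; original_loop_stalls and count_stalling_inputs are hypothetical names for this example. It flags any input for which one pass through the loop body changes nothing, since the loop then repeats the same state forever.

```c
#include <stdbool.h>

/* Assumed stand-in for the Zune's IsLeapYear(). */
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Returns true if the article's loop would spin forever when started
   with this day count (1-based, day 1 = January 1, 1980). */
bool original_loop_stalls(int days)
{
    int year = 1980;

    while (days > 365) {
        int previous_days = days;

        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            }
        } else {
            days -= 365;
            year += 1;
        }

        if (days == previous_days)
            return true;    /* no progress: the same state repeats */
    }
    return false;
}

/* Counts the stalling inputs among day counts 1..max_days. */
int count_stalling_inputs(int max_days)
{
    int count = 0;
    for (int days = 1; days <= max_days; days++)
        if (original_loop_stalls(days))
            count++;
    return count;
}
```

Run over every day through December 31, 2008, a test like this flags exactly the last day of each leap year, starting with day 366 (December 31, 1980).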

  2. tehdiplomat says:

    A small correction for the post in the paragraph that starts with:
    “This is where things went horribly wrong.”

    After checking whether days was greater than 365 and the year was a leap year, it then checked whether days was greater than 366 (if (days > 366)), not less than, as the article stated.

    However, the final statement is unfazed by this slight error. Since days == 366, no arithmetic could occur for either check (less than or greater than), and the while loop could not be broken out of.

  3. Bryan Feir says:

    And this sort of event is why common problems such as date handling should be built into libraries that get re-used, rather than being rebuilt for each new application… the problem gets found and fixed once rather than being rediscovered by each new developer who thinks he just has to get something out quickly.

    Especially with leap years and the ‘every four, but not every one hundred, but every four hundred’ decision tree that people do seem to keep getting wrong.

    See the Risks digest at http://catless.ncl.ac.uk/Risks/25.50.html#subj2.1 for more commentary; the following article in the digest goes through the same source code analysis as this posting.
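For reference, the "every four, but not every one hundred, but every four hundred" rule can be written out as an explicit decision tree. This is a plain sketch of the Gregorian rule; it is not taken from the Zune source, and is_leap_year is just a name chosen here.

```c
#include <stdbool.h>

/* Gregorian leap-year rule written as the decision tree described
   above, checking the most specific case first. */
bool is_leap_year(int year)
{
    if (year % 400 == 0)
        return true;    /* every four hundred: 1600, 2000, 2400 */
    if (year % 100 == 0)
        return false;   /* but not every one hundred: 1900, 2100 */
    return year % 4 == 0;   /* otherwise every four: 2004, 2008 */
}
```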

  4. I think it’s even more interesting that the code started its life somewhere else — Motorola / Freescale — and then was (potentially) modified and used by Microsoft. So most likely the bug wasn’t even Microsoft’s, and other Freescale hardware using this same boilerplate code suffers the same fate. And, despite good QA, obviously no one tested this scenario. I suspect they didn’t QA this code as much as what they engineered themselves, and I think it’s an important thing to consider: code reuse is often a good idea but it has security and performance implications.

  5. This raises the question–what errors exist in the super-secret DRM layer in Vista/Win7?

  6. I wonder why this bug was specific to only one version of the Zune. You’d think that date software would be common across all of them.

  7. Anonymous says:

    why is date so important on a mp3 player all we want to do is play music . . .

    • – determine if a synchronization is necessary when connected to a PC
      – music sharing features between ZUNEs have expiry dates on each of the tracks
      – current time and date displaying on the ZUNE.

  8. Anonymous says:

    Why didn’t they just compute the year and day directly, using a modulo function, instead of going through a big loop? Does no one use a DIVIDE any more?

    AND…was this open source code? If not, how did anyone get access to proprietary source code in the first place?

  9. Anonymous says:

    //Why didn’t they just compute the year and day directly, using a modulo function, instead of going through a big loop? Does no one use a DIVIDE any more?//

    One would have to take out 1461-day chunks to do that. Of course, that’s what one SHOULD do whether one is using divides or subtracts. If one isn’t chunking out 4-year blocks, the modulo function won’t work.

    BTW, if one is doing four-year chunks, how does the performance of integer divide compare with that of repeated subtracts? Even if one starts in 1980 and assumes the unit will last until 2040, there will be a maximum of 15 subtracts of 1461 days.
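A sketch of the 1461-day chunk idea follows. It assumes the same 1-based day count from January 1, 1980, and it relies on every fourth year being a leap year, which only holds from 1980 through 2099 (2100 is not a leap year). The function name days_to_year_chunked is made up for this example.

```c
/* Converts a 1-based day count starting at January 1, 1980 into a
   year and a 1-based day of year, using 4-year blocks of 1461 days.
   Assumption: the date falls in 1980..2099, where each block is one
   leap year followed by three ordinary years. */
void days_to_year_chunked(int days, int *year_out, int *day_of_year_out)
{
    int offset = days - 1;                  /* zero-based day offset */
    int year = 1980 + 4 * (offset / 1461);  /* first year of the block */

    offset %= 1461;                         /* offset within the block */
    if (offset >= 366) {
        /* Past the leading leap year: step over it, then count
           ordinary 365-day years. */
        offset -= 366;
        year += 1 + offset / 365;
        offset %= 365;
    }

    *year_out = year;
    *day_of_year_out = offset + 1;
}
```

This version has no loop at all, so it cannot hang, at the cost of a hard-coded validity range that would need its own test.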

  10. The earlier anonymous commenter makes the most accurate observation. This was not Microsoft-written code. It was written by Motorola/Freescale and built into the product by Toshiba.

    The first Zune was a modified Toshiba Gigabeat. Evidence easily found on the web shows that the same bug affected the Toshiba Gigabeat (there are just fewer of them, and Toshiba doesn’t make the news the way Microsoft does).

    The second-generation Zunes were designed by Microsoft, and don’t have this bug. I’m pretty familiar with Microsoft’s coding practices, and I can assure you, a bug this basic would never have passed code review. Date (and string) manipulations trigger automatic flags, because, as noted above, there are extremely well-tested libraries for these sorts of things that are required use.

    Microsoft’s failure was shipping code written by someone else without doing complete code reviews. This was probably due to the compressed schedule for the original Zune (conception to shipping in 8 months), and the team moving on to the next product as quickly as possible.

    If there’s a lesson here, it’s that open source != bug free.

    • I’d be interested to know if any of the Toshiba Gigabeat devices experienced this issue. Are there any of them out there that are still in use?

  11. =CrAzYG33K= says:

    Agree about the bug…
    But if this was the case.. Shouldn’t the same bug have appeared in the previous leap years as well ? (eg : 2000, 2004 etc etc..) Why didn’t it appear then?

  12. Anonymous says:

    Why don’t you address the Freescale/Toshiba heritage of the code and the player in the article? Or would that generate fewer page views than bashing Microsoft’s software practices?

    • To me, who originally wrote the code is less important than who shipped it in their product. Microsoft had the source code, and presumably they subjected it to at least part of their quality control process. As I say in the article, Microsoft’s quality control process “is pretty good by industry standards”. If you re-read the article, you’ll see I’m not Microsoft-bashing.

  13. You’d think that these things were well organised and controlled. However, I remember a fault that occurred on the early 68881 (the 68k floating-point coprocessor), which was traced back to the microcode. When we asked Motorola about it, they said, “We don’t understand the code. It was written by a consultant who no longer works for us.” When the consultant was asked, he said, “I don’t understand it either. I was on a different astral plane, going around Pluto, when I wrote it!”

  14. In which way would modulo functions help with leap years? Consider that, for instance, 2100 is not a leap year. So modulo will only take you so far.

    The Julian day (JD) is the number of days from a fixed date in the past. E.g., 31 Dec 2008 was JD 2,454,832. If you read the “Calculation” section of this wikipedia page, you will see that there is an arithmetic algorithm for calculating the date corresponding to a given JD. Since 1 Jan 1980 was JD 2,444,240, I think we can calculate the JD and use the algorithms to get the year, month, and day on our (Gregorian) calendar.

    If you want a reference, “Calendrical Calculations” by Dershowitz and Reingold is good. Load the java applet on the page below and play around. You might learn something.

    http://emr.cs.uiuc.edu/home/reingold/calendar-book/index.shtml
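As a rough illustration of the Julian day approach, here is a widely reproduced integer-only conversion, usually attributed to Fliegel and Van Flandern. It is not necessarily the exact algorithm on the page referred to above, and jd_to_gregorian is a name chosen for this sketch. The JD passed in is the integer Julian day number of the civil date (its value at noon), and all divisions are truncating integer divisions.

```c
/* Converts an integer Julian day number to a Gregorian calendar
   date. Assumes a positive JD in the modern range and a long of at
   least 32 bits, which C guarantees. */
void jd_to_gregorian(long jd, int *year_out, int *month_out, int *day_out)
{
    long l = jd + 68569;
    long n = 4 * l / 146097;            /* 400-year cycles of 146097 days */
    l = l - (146097 * n + 3) / 4;
    long i = 4000 * (l + 1) / 1461001;  /* year within the cycle */
    l = l - 1461 * i / 4 + 31;
    long j = 80 * l / 2447;             /* month index before adjustment */
    long day = l - 2447 * j / 80;
    l = j / 11;
    long month = j + 2 - 12 * l;
    long year = 100 * (n - 49) + i + l;

    *year_out = (int)year;
    *month_out = (int)month;
    *day_out = (int)day;
}
```

With the Zune's convention that days=1 on 1 Jan 1980 (JD 2,444,240), the JD for a given day count would be 2444240 + days - 1, so days=10593 maps to JD 2,454,832.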

  15. Anonymous says:

    As an embedded systems programmer, I learned to avoid modulo because it was inefficient on CPUs that did not have a divide instruction (and sometimes too inefficient even if the CPU did have one). Indeed, I still tell the story of finding a simple loop counter (going from 0 to 9) implemented with a modulo operation. The problem was that this occurred in an interrupt service routine and generated tons of code. Changing it into a simple compare-and-reset (if (++i >= 10) i = 0;) greatly shortened the ISR.

    Part of being an embedded systems programmer is knowing the system well enough to account for things like that.
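The two styles of wrap-around counter described here might look like the following. Both function names are invented for this sketch, and whether the second form is actually smaller or faster depends on the target CPU and compiler, so treat it as an illustration rather than a benchmark.

```c
/* Advances a 0..9 counter using modulo. On a CPU without a hardware
   divide, the % may compile to a call into a division routine. */
unsigned next_tick_modulo(unsigned i)
{
    return (i + 1) % 10;
}

/* Advances the same counter using only an increment, a comparison,
   and a reset. Assumes i is already in the range 0..9. */
unsigned next_tick_compare(unsigned i)
{
    if (++i >= 10)
        i = 0;
    return i;
}
```

For inputs in the range 0 to 9 the two functions return the same value, so one can replace the other without changing behavior.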

  16. Pedro Vasconcelos says:

    “Getting a big program to run at all is an impressive feat of engineering. Making it error-free is too much to hope for. ”

    This conclusion is too pessimistic and not at all true for this kind of simple programming error. An abstract interpretation of the original program would reveal that there is a possible control flow in the loop body where none of the loop variables is modified, namely, when days>365 && IsLeapYear(year) && days <= 366. Properties of this kind can even be machine verified, and may indeed be so in future mainstream programming. Check Microsoft's own Spec# research project, for example.

    http://research.microsoft.com/en-us/projects/specsharp/