Last week Skype, the popular, free Net telephony service, was unavailable for a day or two due to technical problems. Failures of big systems are always interesting and this is no exception.
We have only limited information about what went wrong. Skype said very little at first but is now opening up a little. Based on their description, it appears that the self-organization mechanism in Skype’s peer-to-peer network became unstable. Let’s unpack that to understand what it means, and what it can tell us about systems like this.
One of the surprising facts about big information systems is that the sheer scale of a system changes the engineering problems you face. When a system grows from small to large, the existing problems naturally get harder. But you also see entirely new problems that didn’t even exist at small scale – and, worse yet, this will happen again and again as your system keeps growing.
Skype uses a peer-to-peer organization, in which the traffic flows through ordinary users’ computers rather than being routed through a set of central servers managed by Skype itself. The advantage of exploiting users’ computers is that they’re available at no cost and, conveniently, there are more of them to exploit when there are more users requesting service. The disadvantage is that users’ computers tend to reboot or go offline more than dedicated servers would.
To deal with the ever-changing population of user computers, Skype has to use a clever self-organization algorithm that allows the machines to organize themselves without relying (more than a tiny bit) on a central authority. Self-organization has two goals: (1) the system must respond quickly to changed conditions to get back into a good configuration soon, and (2) the system must maintain stability as conditions change. These two goals aren’t entirely contradictory, but they are at least in tension. Responding quickly to changes makes it difficult to maintain stability, and the system must be engineered to make this tradeoff wisely in a wide range of conditions. Getting this right in a huge P2P system like Skype is tricky.
Which brings us to the story of last week’s failure, as described by Skype. On Tuesday August 14, Microsoft released a new set of patches to Windows, according to their normal monthly cycle. Many Windows machines downloaded the patch, installed it, and then rebooted. Each such machine would leave the Skype network when it shut down, then rejoin after booting. So the effect of Microsoft’s patch release was to increase the turnover in Skype’s network.
The result, Skype says, is that the network became unstable as the respond-quickly mechanism outran the maintain-stability mechanism; and the problem snowballed as the growing instability caused ever stronger (but poorly aimed) responses. The Skype service was essentially unavailable for a day or two starting on Thursday August 16, until the company could track down the problem and fix a code bug that it said contributed to the problem.
The biggest remaining mystery is why the problem took so long to develop. Microsoft issued the patch on Tuesday, and Skype didn’t get into deep trouble until Thursday. We can explain away some of the delay by noting that Windows machines might take up to a day to download the patch and reboot, but this still means it took Skype’s network at least a day to melt down. I’d love to know more about how this happened.
I would hesitate to draw too many broad conclusions from a single failure like this. Large systems of all kinds, whether centralized or P2P, must fight difficult stability problems. When a problem like this does occur, it’s a useful natural experiment in how large systems behave. I only hope Skype has more to say about what went wrong.