Last week Skype, the popular, free Net telephony service, was unavailable for a day or two due to technical problems. Failures of big systems are always interesting, and this one is no exception.
We have only limited information about what went wrong. Skype said very little at first but is now opening up a little. Based on their description, it appears that the self-organization mechanism in Skype’s peer-to-peer network became unstable. Let’s unpack that to understand what it means, and what it can tell us about systems like this.
One of the surprising facts about big information systems is that the sheer scale of a system changes the engineering problems you face. When a system grows from small to large, the existing problems naturally get harder. But you also see entirely new problems that didn’t even exist at small scale – and, worse yet, this will happen again and again as your system keeps growing.
Skype uses a peer-to-peer organization, in which the traffic flows through ordinary users’ computers rather than being routed through a set of central servers managed by Skype itself. The advantage of exploiting users’ computers is that they’re available at no cost and, conveniently, there are more of them to exploit when there are more users requesting service. The disadvantage is that users’ computers tend to reboot or go offline more than dedicated servers would.
To deal with the ever-changing population of user computers, Skype has to use a clever self-organization algorithm that allows the machines to organize themselves without relying (more than a tiny bit) on a central authority. Self-organization has two goals: (1) the system must respond quickly to changed conditions to get back into a good configuration soon, and (2) the system must maintain stability as conditions change. These two goals aren’t entirely contradictory, but they are at least in tension. Responding quickly to changes makes it difficult to maintain stability, and the system must be engineered to make this tradeoff wisely in a wide range of conditions. Getting this right in a huge P2P system like Skype is tricky.
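One way to see the responsiveness-versus-stability tension concretely is with a toy feedback loop. This is only an analogy, not Skype’s actual (proprietary) algorithm: the network repeatedly nudges its relay capacity toward a target, and a single gain parameter controls how aggressively it reacts. A modest gain converges smoothly; too large a gain over-corrects more with every step, which is the flavor of instability described in the next paragraphs.

```python
# Toy model of the responsiveness-vs-stability tradeoff in a self-organizing
# network. This is only an analogy, not Skype's actual algorithm: the network
# nudges its relay capacity toward a target each step, and "gain" controls
# how aggressively it reacts to the current shortfall.

def simulate(gain, steps=8, target=1000, start=400):
    capacity = start
    history = [capacity]
    for _ in range(steps):
        error = target - capacity
        capacity += gain * error      # respond to the current shortfall
        history.append(round(capacity))
    return history

# A modest gain converges smoothly; a gain above 2 over-corrects more with
# every step, so the "repair" itself drives the system further from target
# (negative values just mean the toy model has blown up).
print("gain=0.5:", simulate(0.5))   # settles toward 1000
print("gain=2.2:", simulate(2.2))   # oscillates with growing amplitude
```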
Which brings us to the story of last week’s failure, as described by Skype. On Tuesday August 14, Microsoft released a new set of patches to Windows, according to their normal monthly cycle. Many Windows machines downloaded the patch, installed it, and then rebooted. Each such machine would leave the Skype network when it shut down, then rejoin after booting. So the effect of Microsoft’s patch release was to increase the turnover in Skype’s network.
The result, Skype says, is that the network became unstable as the respond-quickly mechanism outran the maintain-stability mechanism; and the problem snowballed as the growing instability caused ever stronger (but poorly aimed) responses. The Skype service was essentially unavailable for a day or two starting on Thursday August 16, until the company could track down the problem and fix a code bug that it said contributed to the problem.
The biggest remaining mystery is why the problem took so long to develop. Microsoft issued the patch on Tuesday, and Skype didn’t get into deep trouble until Thursday. We can explain away some of the delay by noting that Windows machines might take up to a day to download the patch and reboot, but this still means it took Skype’s network at least a day to melt down. I’d love to know more about how this happened.
I would hesitate to draw too many broad conclusions from a single failure like this. Large systems of all kinds, whether centralized or P2P, must fight difficult stability problems. When a problem like this does occur, it’s a useful natural experiment in how large systems behave. I only hope Skype has more to say about what went wrong.
Toksyuryel, are you sure about Linux?
Interesting how this entire fiasco could have been avoided if Windows didn’t require reboots to install new software and updates. I can think of no other modern OS with this bizarre requirement: not Mac, not GNU/Linux, not BSD, not Solaris.
Well?
.doc?
Do you have this in a non-toxic format, such as html?
The most credible analysis is not from Skype, but from Julian Cain in a series of comments that he made to a GigaOm article about the outage. Julian is lead architect at Pando and, earlier, was head of Mac development for Kazaa at Sharman Networks. So he knows a lot about peer-to-peer networks, and his work at Sharman put him in a position to know quite a bit about the P2P technology that’s also used by Skype (and likely by Joost).
I’ve collected Julian’s comments in a single file here:
http://blogs.nmss.com/Julian_Cain_on_August_2007_Skype_Outage.doc
and other URLs relevant to Skype internals in this blog post
http://blogs.nmss.com/communications/2007/08/best-skype-cras.html
The Gonzales resignation is a tip-off to the true conspiracy explanation: the system was taken down to install warrantless wiretapping.
This just shows the folly of setting Windows to download and install patches automatically. I’ve never done that and I never will.
Reason one: I don’t want to be in the middle of typing, playing a game, or otherwise using the machine and suddenly have it reboot by itself. Setting it to install updates automatically basically amounts to telling the machine to crash randomly and gratuitously every few weeks. This story and some of its comments make clear how disruptive that can be, from nonfunctioning phones to delays in what are presumably time-limited exams.
Reason two: Not every so-called “critical update” is for the benefit of the user. I’d have been nailed by the evil “Windows Genuine Advantage Notifications” patch if I’d had automatic installation of updates turned on. That notorious “critical update” does not increase the computer owner’s security but actually decreases it, because a false positive in its piracy check may cause Windows XP to suddenly decide it needs reactivation. People whose Windows was genuine (e.g. it came with the computer from a known manufacturer) and was treated as genuine by Windows Genuine Advantage had WGA Notifications decide it wasn’t, and then XP would require reactivating. Given that this was right around the time Vista became available as an “up”grade, the high false-positive rate and the undisclosed ability to force reactivation have been considered “significant” by many. Worse, the patch, when inspected at the WU site, reports that it cannot be uninstalled, and of course it masquerades as a security patch for the user when it’s really a security patch for Microsoft’s cash flow; it will not improve, and may in fact degrade, the security of the affected computer as determined by its rightful owner. It’s also notable that hiding it at the WU site causes a nag screen there on subsequent manual visits, and that the patch periodically unhides itself.

Needless to say, I have Windows notify me of updates, then I pick what I want to download manually, then I decide when it’s convenient to actually install them and cause the reboot. My computer, my decision.
Of course, a nefarious patch with a completely dishonest description (e.g. the generic “fixes a problem that could allow an attacker to gain control over your computer”) won’t be as easily avoided. Even then, not automatically installing it but waiting until it’s convenient gives me at least a chance to hear horror stories start coming out about a nefarious (or even just a buggy) patch before it gets applied to my own machine.
It’s worth pointing out that not all users’ computers are equally exploited. Some are “supernodes”, which means they get used to route traffic for other users’ calls. Apparently the problem was at least partly due to running short of machines to use as supernodes.
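As a back-of-the-envelope illustration of that shortage, here is a toy simulation of a supernode pool during a synchronized reboot wave. The numbers and the promotion rule are invented for this sketch and are not Skype’s real parameters; the point is only that the pool shrinks far faster during the wave than the normal promotion rate can replace it, and then takes many steps to recover.

```python
# Toy simulation of a supernode pool during a synchronized reboot wave.
# The numbers and the promotion rule are invented for this sketch; they are
# not Skype's real parameters.

supernodes = 2_000          # peers currently acting as supernodes
PROMOTIONS_PER_STEP = 50    # stable peers promoted to supernode each step
NORMAL_CHURN = 0.01         # fraction of supernodes lost per step normally
REBOOT_CHURN = 0.20         # fraction lost per step during the reboot wave

for t in range(30):
    churn = REBOOT_CHURN if 10 <= t < 13 else NORMAL_CHURN
    supernodes = supernodes - int(supernodes * churn) + PROMOTIONS_PER_STEP
    if t % 5 == 0 or 10 <= t < 14:
        print(f"t={t:2d}  supernodes={supernodes}")
```

In this toy run the pool loses nearly half its supernodes in three steps and needs many more steps to grow back, while every displaced caller is simultaneously leaning on whatever supernodes remain.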
Bryan Feir, thank you for your wonderful story! Almost as good as a Borg assimilation tale.
Apropos of the collective, what really annoyed me was the introduction of the Borg queen in the movie Star Trek: First Contact. The original idea of an autonomous collective in the earlier TV episodes was fantastic. What self-assured collective needs a queen? I read that the filmmakers thought people could not comprehend such a radical idea. And why do we need Skype? Shouldn’t the collective be able to run the P2P phone network without queen Skype? So far we have no conclusive explanation for the outage, but it looks to me as though the centralized part of the Skype network, the queen’s hub, was part of the problem.
I looked recently for outdoor IEEE 802.11 WiFi antennas and found that even some omnidirectional antennas can receive signals from up to 3 km away. A few years ago some company (I forget the name) produced an external hard disk enclosure with various interfaces. Its most interesting feature was a built-in file server (speaking various IP protocols) that runs independently of the attached computer. A web server was just one of many additional features. This is all low-power stuff and can run 24/7. The only problem with this thing is that most telcos/ISPs block all inbound communication, in particular protocols like HTTP (web). Guess why.
The Linksys wireless router WRT54GL (essentially the old Linux-based WRT54G, not to be confused with the different newer WRT54G revisions) can be reprogrammed with additional software (uploaded into its flash memory). This little gadget can run 24/7, too, and is perfectly suited for a wireless mesh.
Now I am wondering how long it will take for people to accept that they can build an autonomous collective. Skype, over mostly wired IP, shows us that it works. But why count on Skype, an eBay company? eBay is certainly not interested in providing a free service, at least not for too long.
So, when do you get that implant with a subspace transmitter and receiver? Are you ready for the assimilation?
Always Microsoft’s fault 🙂
Maybe a tad off-topic, but has to do with automatic Windows rebooting:
I have experimented with online, proctored exams twice now. Both times, only 2-4 computers (out of 20) downloaded a patch and rebooted *during* the exam. I would have thought that the entire lab would reboot, or at least that the rebooting would occur at the same time, but the reboots were not in any way synchronized.
Moodle gets extra points for saving the current exam state and carrying on after the reboot at the point in the exam where the student left off. I must admit I was rather panicky about that first reboot, though…
This doesn’t add up in my mind. From the articles I read immediately following the issue, Skype’s problem only happened to users after they rebooted – if they didn’t reboot, they didn’t have the issue. Furthermore, Microsoft pushes out patches on a known monthly schedule.
So why did it happen this time; what was different? The information I have seen so far doesn’t point to anything being different. Or is that because they are trying to keep their design proprietary, so they’ll say just about anything to ‘satisfy’ the questions?
I’m reminded of the 1990 Martin Luther King Day outage for AT&T…
For those who don’t remember (I was working at Bell-Northern Research at the time, and got a fairly good description), what happened was that there was a bug in the network recovery procedures. The process went something like this:
* Switch A crashes due to a hardware glitch
* Switch A supervisor sends a message to switches B and C saying ‘I’m resetting, don’t forward any new calls to me.’
* Switch A reboots and resets its state
* Switch A starts accepting new calls, and sends call routing messages to B and C indicating it’s back online
* Switches B and C start rebuilding their routing tables to include switch A
* Switch A sends more call routing messages to switches B and C
* Switches B and C cannot handle call routing messages while they’re rebuilding their tables, and both crash.
* Switches B and C supervisors send messages to switches A, D, E, and F saying ‘I’m resetting, don’t forward any new calls to me.’
And the process starts rippling out from the initial crash site.
This bug had actually been present in the AT&T switch code for about a month before the conditions set it off and resulted in AT&T’s network being mostly down for hours. (I say ‘mostly’ down because you could still sometimes get calls through, and a call once connected wouldn’t disconnect even if the switches reset.) Similar sort of thing as this: the recovery procedures caused the entire network to destabilize…
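For what it’s worth, the ripple Bryan describes is easy to reproduce in a toy model. The sketch below is not the real switch code, of course; the ring topology and the “an announcement during rebuild is one message too many” rule are invented for illustration, but they capture the dynamic of a recovery message crashing a neighbour that is still rebuilding its tables.

```python
from collections import deque

# Toy model of the cascade described above: a switch that crashes and recovers
# sends routing announcements in quick succession; a neighbour hit by another
# announcement while it is still rebuilding its tables crashes too, and the
# failure ripples outward. Topology and crash rule are invented for this sketch.

N = 20                                        # number of switches in the toy network
neighbours = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

crashed_at_least_once = set()
to_process = deque([0])                       # switch 0 has the initial hardware glitch

while to_process:
    switch = to_process.popleft()
    if switch in crashed_at_least_once:
        continue
    crashed_at_least_once.add(switch)
    # After rebooting, this switch's announcements land on neighbours that are
    # still rebuilding their tables, so each neighbour crashes and repeats the process.
    for n in neighbours[switch]:
        to_process.append(n)

print(f"switches that crashed at least once: {len(crashed_at_least_once)} of {N}")
```

One faulty switch is enough to drag every switch in the toy network through at least one crash/recover cycle, which is the qualitative shape of the 1990 outage.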
I’m not really all that surprised. MS-Windows updates have been causing minor denial of service attacks on all Internet users for the past few years. Everything works a bit slower and servers will drop out a bit more often when all the MS users are loading updates.
For WWW users, it’s a small annoyance. For VoIP users it’s a bigger problem (even for regular non-self-organizing VoIP users), because ISPs don’t provide guaranteed bandwidth for VoIP; they provide average bandwidth (which gets clobbered on the day all the updates are downloading).
I can well appreciate that with nodes rebooting, PLUS available bandwidth between nodes randomly throttled back, PLUS the central servers getting many more “hello” and “goodbye” messages (and probably more “timeout” events too) as the links bounce around, a self-organizing algorithm that worked nicely under favorable conditions is likely to malfunction.
This is one of the problems of statistical multiplexing for communication routing: there are fringe cases where it simply doesn’t work. The fringe cases appear to be extremely rare when you do a theoretical analysis using the well-known normal and Poisson distributions. Out in the real world, traffic is strongly correlated with weird events like election results, MS updates and other natural disasters… and then the fringe cases are not rare anymore.
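A quick Monte Carlo sketch (with arbitrary numbers) makes the point about correlated demand: a link sized with comfortable headroom for independent arrivals essentially never overflows, while the same average load delivered in synchronized bursts, as on a patch day, overflows routinely.

```python
import random

# Sketch of why capacity planning based on independent, Poisson-like arrivals
# breaks down when demand is correlated. All numbers are arbitrary.
random.seed(0)

USERS = 10_000
P_ACTIVE = 0.02        # each user active in a given time slot, on average
CAPACITY = 260         # link sized with ~30% headroom over the mean load of 200

def overflow_rate(trials, correlated):
    overflows = 0
    for _ in range(trials):
        if correlated:
            # A shared trigger (say, a patch release) makes everyone act together:
            # with probability 0.02 all users are active, otherwise none are.
            active = USERS if random.random() < P_ACTIVE else 0
        else:
            # Independent users: a binomial draw with the same mean of 200.
            active = sum(random.random() < P_ACTIVE for _ in range(USERS))
        overflows += active > CAPACITY
    return overflows / trials

print("independent arrivals, overflow rate:", overflow_rate(1000, False))  # ~0.0
print("correlated arrivals,  overflow rate:", overflow_rate(1000, True))   # ~0.02
```

Both cases carry the same average load; only the correlated one ever blows past the link capacity.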
As for botnets and virus problems — it’s already happened.
http://www.guardian.co.uk/technology/2004/may/05/viruses.security
The trains in Sydney were out of action for something like half a day while computers sat and rebooted themselves. Most of the Sydney rail system runs on MS-Windows and you can tell because now and then you walk past a blue-screen (BSOD) or some other obvious Microsoft artifact.
This incident is also interesting because it shows the potentially disruptive power of seemingly innocuous yet widespread events such as Windows Update. Each update is, on its own, a drop, but when all updating machines are added together, those drops turn into a deluge.
It shows that mass events like Windows Update can be used to trigger targeted failures akin to denial-of-service attacks, with plausible deniability. (Microsoft or other parties could claim that the meltdown was neither their fault nor their intention.) As users become more dependent on network services, this problem will only get worse.
Of course, the real nightmare scenario is not Microsoft, Symantec, or some other company meddling with networks. It’s when the mechanisms, such as Windows Update, that they have in place are commandeered by virus writers and botnet operators.