November 19, 2017

Archives for August 2007

Why Was Skype Offline?

Last week Skype, the popular, free Net telephony service, was unavailable for a day or two due to technical problems. Failures of big systems are always interesting and this is no exception.

We have only limited information about what went wrong. Skype said very little at first but is now opening up a little. Based on their description, it appears that the self-organization mechanism in Skype’s peer-to-peer network became unstable. Let’s unpack that to understand what it means, and what it can tell us about systems like this.

One of the surprising facts about big information systems is that the sheer scale of a system changes the engineering problems you face. When a system grows from small to large, the existing problems naturally get harder. But you also see entirely new problems that didn’t even exist at small scale – and, worse yet, this will happen again and again as your system keeps growing.

Skype uses a peer-to-peer organization, in which the traffic flows through ordinary users’ computers rather than being routed through a set of central servers managed by Skype itself. The advantage of exploiting users’ computers is that they’re available at no cost and, conveniently, there are more of them to exploit when there are more users requesting service. The disadvantage is that users’ computers tend to reboot or go offline more than dedicated servers would.

To deal with the ever-changing population of user computers, Skype has to use a clever self-organization algorithm that allows the machines to organize themselves without relying (more than a tiny bit) on a central authority. Self-organization has two goals: (1) the system must respond quickly to changed conditions to get back into a good configuration soon, and (2) the system must maintain stability as conditions change. These two goals aren’t entirely contradictory, but they are at least in tension. Responding quickly to changes makes it difficult to maintain stability, and the system must be engineered to make this tradeoff wisely in a wide range of conditions. Getting this right in a huge P2P system like Skype is tricky.

Which brings us to the story of last week’s failure, as described by Skype. On Tuesday August 14, Microsoft released a new set of patches to Windows, according to their normal monthly cycle. Many Windows machines downloaded the patch, installed it, and then rebooted. Each such machine would leave the Skype network when it shut down, then rejoin after booting. So the effect of Microsoft’s patch release was to increase the turnover in Skype’s network.

The result, Skype says, is that the network became unstable as the respond-quickly mechanism outran the maintain-stability mechanism; and the problem snowballed as the growing instability caused ever stronger (but poorly aimed) responses. The Skype service was essentially unavailable for a day or two starting on Thursday August 16, until the company could track down the problem and fix a code bug that it said contributed to the problem.

The biggest remaining mystery is why the problem took so long to develop. Microsoft issued the patch on Tuesday, and Skype didn’t get into deep trouble until Thursday. We can explain away some of the delay by noting that Windows machines might take up to a day to download the patch and reboot, but this still means it took Skype’s network at least a day to melt down. I’d love to know more about how this happened.

I would hesitate to draw too many broad conclusions from a single failure like this. Large systems of all kinds, whether centralized or P2P, must fight difficult stability problems. When a problem like this does occur, it’s a useful natural experiment in how large systems behave. I only hope Skype has more to say about what went wrong.

E-Voting Ballots Not Secret; Vendors Don't See Problem

Two Ohio researchers have discovered that some of the state’s e-voting machines put a timestamp on each ballot, which severely erodes the secrecy of ballots. The researchers, James Moyer and Jim Cropcho, used the state’s open records law to get access to ballot records, according to Declan McCullagh’s story at news.com. The pair say they have reconstructed the individual ballots for a county tax referendum in Delaware County, Ohio.

Timestamped ballots are a problem because polling-place procedures often record the time or sequence of voter’s arrivals. For example, at my polling place in New Jersey, each voter is given a sequence number which is recorded next to the voter’s name in the poll book records and is recorded in notebooks by Republican and Democratic poll watchers. If I’m the 74th voter using the machine today, and the recorded ballots on that machine are timestamped or kept in order, then anyone with access to the records can figure out how I voted. That, of course, violates the secret ballot and opens the door to coercion and vote-buying.

Most e-voting systems that have been examined get this wrong. In the recent California top-to-bottom review, researchers found that the Diebold system stores the ballots in the order they were cast and with timestamps (report pp. 49-50), and the Hart (report pp. 59) and Sequoia (report p. 64) systems “randomize” stored ballots in an easily reversible fashion. Add in the newly discovered ES&S system, and the vendors are 0-for-4 in protecting ballot secrecy.

You’d expect the vendors to hurry up and fix these problems, but instead they’re just shrugging them off.

An ES&S spokeswoman at the Fleishman-Hillard public relations firm downplayed concerns about vote linking. “It’s very difficult to make a direct correlation between the order of the sign-in and the timestamp in the unit,” said Jill Friedman-Wilson.

This is baloney. If you know the order of sign-ins, and you can put the ballots in order by timestamp, you’ll be able to connect them most of the time. You might make occasional mistakes, but that won’t reassure voters who want secrecy.

You know things are bad when questions about a technical matter like security are answered by a public-relations firm. Companies that respond constructively to security problems are those that see them not merely as a PR (public relations) problem but as a technology problem with PR implications. The constructive response in these situations is to say, “We take all security issues seriously and we’re investigating this report.”

Diebold, amazingly, claims that they don’t timestamp ballots – even though they do:

Other suppliers of electronic voting machines say they do not include time stamps in their products that provide voter-verified paper audit trails…. A spokesman for Diebold Election Systems (now Premier Election Solutions), said they don’t for security and privacy reasons: “We’re very sensitive to the integrity of the process.”

You have to wonder why e-voting vendors are so much worse at responding to security flaw reports than makers of other products. Most software vendors will admit problems when they’re real, will work constructively with the problems’ discoverers, and will issue patches promptly. Companies might try PR bluster once or twice, but they learn that bluster doesn’t work and they’re just driving away customers. The e-voting companies seem to make the same mistakes over and over.

OLPC Review Followup

Last week’s review of the One Laptop Per Child (OLPC) machine by twelve-year-old “SG” was one of our most-commented-upon posts ever. Today I want to follow up on a few items.

First, the machine I got for SG was the B2 (Beta 2) version of the OLPC system, which is not the latest. Folks from the OLPC project suggest that some of the problems SG found are fixed in the latest version. They have graciously offered to send an up to date OLPC machine for SG to review. SG has agreed to try out the new machine and review it here on Freedom to Tinker.

Second, I was intrigued by the back-and-forth in the comments over SG’s gender. I had originally planned to give SG a pseudonym that revealed SG’s gender, but a colleague suggested that I switch to a gender-neutral pseudonym. Most commenters didn’t seem to assume one gender or the other. A few assumed that SG is a boy, which generated some pushback from others who found that assumption sexist. My favorite comment in this series was from “Chris,” who wrote:

Why are you assuming the review was written by a boy?
At 12 we’re only two years from 8th grade level, the rumored grail (or natural default) of our national publications. SG, you’re clearly capable of writing for most any publication in this country, you go girl! (even if you are a boy)

Third, readers seem to be as impressed as I was by the quality of SG’s writing. Some found it hard to believe that a twelve-year-old could have written the post. But it was indeed SG’s work. I am assured that SG’s parents did not edit the post but only suggested in general terms the addition of a paragraph about what SG did with the machine. I suggested only one minor edit to preserve SG’s anonymity. Otherwise what you read is what SG wrote.

Though sentences like “My expectations for this computer were, I must admit, not very high.” seem unusual for a twelve-year-old, others show a kid’s point of view. One example: “Every time you hit a key, it provides a certain amount of satisfaction of how squishy and effortless it is. I just can’t get over that keyboard.”

SG is welcome to guest blog here in the future. Kids can do a lot, if we let them.

[Update (June 2012): I can reveal now that SG is a girl: my daughter Claire Felten.]