September 24, 2017

Wall Street software failure and a relationship to voting

An article in The Register explains what happened in the August 1, 2012 Wall Street glitch that cost Knight Capital $440M, resulted in a $12M fine, and nearly bankrupted the firm (forcing it to merge with someone else). In short, there were 8 servers that handled trades; 7 of them were correctly upgraded with new software, but the 8th was not. A particular type of transaction triggered the updated code, which worked properly on the upgraded servers. On the non-upgraded server, the same transaction triggered an obsolete piece of software, which behaved altogether differently. The result was large numbers of incorrect “buy” transactions.

The bottom line is that the failure was caused by a lack of careful procedures in how the software was deployed, coupled with a poor design choice that allowed a new feature to reuse an option previously used by obsolete code. As a result, the trigger (instead of being ignored or causing an error) produced an unanticipated result.
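To make the design point concrete, here is a minimal sketch of the safer alternative: a retired option keeps its old value forever and is rejected outright, while the new feature gets a fresh value. The names `OrderFlag`, `RETIRED_OPTION`, `NEW_FEATURE` and `dispatch` are hypothetical, invented for illustration; nothing here reflects Knight Capital's actual code.

```python
from enum import Enum


class OrderFlag(Enum):
    """Hypothetical option values carried on an incoming transaction."""
    NORMAL = 1
    RETIRED_OPTION = 2  # obsolete; the value is reserved and never reassigned
    NEW_FEATURE = 3     # the new feature gets its own, previously unused value


# Options that no deployed code should act on any more.
RETIRED = {OrderFlag.RETIRED_OPTION}


def dispatch(flag_value: int) -> str:
    """Return the name of the handler to run, or raise on a bad option."""
    flag = OrderFlag(flag_value)  # raises ValueError for an unknown value
    if flag in RETIRED:
        # Fail loudly instead of silently running old behavior.
        raise ValueError(f"option {flag.name} is retired")
    return flag.name


print(dispatch(3))
try:
    dispatch(2)
except ValueError as err:
    print("rejected:", err)
```

With this arrangement, a server still running old code would see an option value it does not recognize and reject it, rather than interpreting it as the obsolete option.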

So what does this have to do with voting? It’s not hard to imagine an internet voting scheme using 8 servers, and even if the software doesn’t have security flaws per se, a botched upgrade like this might work just fine for 7/8 of the voters, and silently fail for the other 1/8. If procedures aren’t in place to check all of the systems (and such procedures apparently didn’t exist at Knight Capital), a functional check that happens to land on a correctly upgraded server would not detect the mismatch.
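A check of this kind can be sketched in a few lines: compare a digest of the build each server reports against the digest of the build that was supposed to be deployed, and list every server that differs. This is only an illustration under stated assumptions; the server names, the `build_digest` and `find_mismatches` helpers, and the idea that each server can report the bytes of its deployed build are all hypothetical.

```python
import hashlib


def build_digest(artifact: bytes) -> str:
    """Return a SHA-256 hex digest identifying a deployed build."""
    return hashlib.sha256(artifact).hexdigest()


def find_mismatches(expected: str, reported: dict[str, str]) -> dict[str, str]:
    """Return every server whose reported digest differs from the expected one."""
    return {name: digest for name, digest in reported.items() if digest != expected}


# Hypothetical fleet: 7 servers upgraded, the 8th still running the old build.
new_build = b"example build, new version"
old_build = b"example build, old version"
expected = build_digest(new_build)

fleet = {f"server{i}": build_digest(new_build) for i in range(1, 8)}
fleet["server8"] = build_digest(old_build)

stale = find_mismatches(expected, fleet)
print(sorted(stale))
```

The point is that the check covers every server, not a sample; a spot check of one or two machines would most likely have passed.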

This experience emphasizes that proper operation isn’t just a matter of building the software correctly; it also has to be fielded properly. In a way this is similar to the DC internet voting experiment. In that case there was a bug in the software, but that particular bug wouldn’t have been exploitable if it hadn’t been for a mistake in how the software was fielded: one version of a software library was replaced with a different version that had an exploitable bug. [This is not to suggest that this was the only bug in the DC voting software, or that internet voting is safe; I am just tying this to the particular exploit that happened.]

Comments

  1. If a botched upgrade just led to a silent failure, you could probably consider yourself lucky. As soon as you start allowing communication among the servers, tickling the right bug/misfeature in one server could corrupt the whole shooting match.

    (Does anyone else think immediately of the Pluribus story, where the initial fault-tolerant multiprocessor design led the machine to undo upgrades as soon as they were applied?)

  2. Jeffrey Mattox says:

    @paul: link, please, to that Pluribus story.

    • Quick search does not reveal it; you can read about the error-detection and correction system, but not any gotchas. (I had it from Severo Ornstein about 30 years ago during an interview at PARC, so that’s not much use.)