Software increasingly manages the world around us, in subtle ways that are often hard to see. Software helps fly our airplanes (in some cases, particularly military fighter aircraft, software is the only thing keeping them in the air). Software manages our cars (fuel/air mixture, among other things). Software manages our electrical grid. And, closer to home for me, software runs our voting machines and manages our elections.
Sunday’s NY Times Magazine has an extended piece about faulty radiation delivery for cancer treatment. The article details two particular fault modes: procedural screwups and software bugs.
The procedural screwups (e.g., treating a patient with stomach cancer with a radiation plan intended for somebody else’s breast cancer) are heartbreaking because they’re something that could be completely eliminated through fairly simple mechanisms. How about putting barcodes on patient armbands that are read by the radiation machine? “Oops, you’re patient #103 and this radiation plan is loaded for patent #319.”
The software bugs are another matter entirely. Supposedly, medical device manufacturers, and software correctness people, have all been thoroughly indoctrinated in the history of Therac-25, a radiation machine from the mid-80’s whose poor software engineering (and user interface design) directly led to several deaths. This article seems to indicate that those lessons were never properly absorbed.
What’s perhaps even more disturbing is that nobody seems to have been deeply bothered when the radiation planning software crashed on them! Did it save their work? Maybe you should double check? Ultimately, the radiation machine just does what it’s told, and the software than plans out the precise dosing pattern is responsible for getting it right. Well, if that software is unreliable (which the article clearly indicates), you shouldn’t use it again until it’s fixed!
What I’d like to know more about, and which the article didn’t discuss at all, is what engineering processes, third-party review processes, and certification processes were used. If there’s anything we’ve learned about voting systems, it’s that the federal and state certification processes were not up to the task of identifying security vulnerabilities, and that the vendors had demonstrably never intended their software to resist the sorts of the attacks that you would expect on an election system. Instead, we’re told that we can rely on poll workers following procedures correctly. Which, of course, is exactly what the article indicates is standard practice for these medical devices. We’re relying on the device operators to do the right thing, even when the software is crashing on them, and that’s clearly inappropriate.
Writing “correct” software, and further ensuring that it’s usable, is a daunting problem. In the voting case, we can at least come up with procedures based on auditing paper ballots, or using various cryptographic techniques, that allow us to detect and correct flaws in the software (although getting such procedures adopted is a daunting problem in its own right, but that’s a story for another day). In the aviation case, which I admit to not knowing much about, I do know they put in sanity-checking software, that will detect when the the more detailed algorithms are asking for something insane and will override it. For medical devices like radiation machines, we clearly need a similar combination of mechanisms, both to ensure that operators don’t make avoidable mistakes, and to ensure that the software they’re using is engineered properly.