November 25, 2024

Secrecy in Science

There’s an interesting dispute between astronomers about who deserves credit for discovering a solar system object called 2003EL61. Its existence was first announced by Spanish astronomers, but another team in the U.S. believes that the Spaniards may have learned about the object due to an information leak from the U.S. team.

The U.S. team’s account appears on their web page and was in yesterday’s NY Times. The short version is that the U.S. team published an advance abstract about their paper, which called the object by a temporary name that encoded the date it had been discovered. They later realized that an obscure website contained a full activity log for the telescope they had used, which allowed anybody with a web browser to learn exactly where the telescope had been pointing on the date of the discovery. This in turn allowed the object’s orbit to be calculated, enabling anybody to point their telescope at the object and “discover” it. Just after the abstract was released, the Spanish team apparently visited the telescope log website; and a few days later the Spanish team announced that they had discovered the object.

If this account is true, it’s clearly a breach of scientific ethics by the Spaniards. The seriousness of the breach depends on other circumstances which we don’t know, such as the possibility that the Spaniards had already discovered the object independently and were merely checking whether the Americans’ object was the same one. (If so, their announcement should have said that the American team had discovered the object independently.)

[UPDATE (Sept. 15): The Spanish team has now released their version of the story. They say they discovered the object on their own. When the U.S. group’s abstract, containing a name for the object, appeared on the Net, the Spaniards did a Google search for the object name. The search showed a bunch of sky coordinates. They tried to figure out whether any of those coordinates corresponded to the object they had seen, but they were unable to tell one way or the other. So they went ahead with their own announcement as planned.

This is not inconsistent with the U.S. team’s story, so it seems most likely to me that both stories are true. If so, then I was too hasty in inferring a breach of ethics, for which I apologize. I should have realized that the Spanish team might have been unable to tell whether the objects were the same.]

When this happened, the American team hastily went public with another discovery, of an object called 2003UB313 which may be the tenth planet in our solar system. This raised the obvious question of why the team had withheld the announcement of this new object for as long as they did. The team’s website has an impassioned defense of the delay:

Good science is a careful and deliberate process. The time from discovery to announcement in a scientific paper can be a couple of years. For all of our past discoveries, we have described the objects in scientific papers before publicly announcing the objects’ existence, and we have made that announcement in under nine months…. Our intent in all cases is to go from discovery to announcement in under nine months. We think that is a pretty fast pace.

One could object to the above by noting that the existence of these objects is never in doubt, so why not just announce the existence immediately upon discovery and continue observing to learn more? This way other astronomers could also study the new object. There are two reasons we don’t do this. First, we have dedicated a substantial part of our careers to this survey precisely so that we can discover and have the first crack at studying the large objects in the outer solar system. The discovery itself contains little of scientific interest. Almost all of the science that we are interested in doing comes from studying the object in detail after discovery. Announcing the existence of the objects and letting other astronomers get the first detailed observations of these objects would ruin the entire scientific point of spending so much effort on our survey. Some have argued that doing things this way “harms science” by not letting others make observations of the objects that we find. It is difficult to understand how a nine month delay in studying an object that no one would even know existed otherwise is in any way harmful to science!

Many other types of astronomical surveys are done for precisely the same reasons. Astronomers survey the skies looking for ever higher redshift galaxies. When they find them they study them and write a scientific paper. When the paper comes out other astronomers learn of the distant galaxy and they too study it. Other astronomers cull large databases such as the 2MASS infrared survey to find rare objects like brown dwarves. When they find them they study them and write a scientific paper. When the paper comes out other astronomers learn of the brown dwarves and they study them in perhaps different ways. Still other astronomers look around nearby stars for the elusive signs of directly detectable extrasolar planets. When they find one they study it and write a scientific paper….. You get the point. This is the way that the entire field of astronomy – and probably all of science – works. It’s a very effective system; people who put in the tremendous effort to find these rare objects are rewarded with getting to be the first to study them scientifically. Astronomers who are unwilling or unable to put in the effort to search for the objects still get to study them after a small delay.

This describes an interesting dynamic that seems to occur in all scientific fields – I have seen it plenty of times in computer science – where researchers withhold results from their colleagues for a while, to ensure that they get a headstart on the followup research. That’s basically what happens when an astronomer delays announcing the discovery of an object, in order to do followup analyses of the object for publication.

The argument against this secrecy is pretty simple: announcing the first result would let more people do followup work, making the followup work both quicker and more complete on average. Scientific discovery would benefit.

The argument for this kind of secrecy is more subtle. The amount of credit one gets for a scientific result doesn’t always correlate with the difficulty of getting the result. If a result is difficult to get but doesn’t earn the discoverer much credit, then there is an insufficient incentive to look for that result. The incentive is boosted if the discoverer gets an advantage in doing followup work, for example by keeping the original result secret for a while. So secrecy may increase the incentive to do certain kinds of research.

Note that there isn’t much incentive to keep low-effort / high-credit research secret, because there are probably plenty of competing scientists who are racing to do such work and announce it first. The incentive to keep secrets is biggest for high-effort / low-credit research which enables low-effort / high-credit followup work. And this is exactly the case where incentives most need to be boosted.

Michael Madison compares the astronomers’ tradeoff between publication and secrecy to the tradeoff an inventor faces between keeping an invention secret, and filing for a patent. As a matter of law, discovered scientific facts are not patentable, and that’s a good thing.

As Madison notes, science does have its own sort of “intellectual property” system that tries to align incentives for the public good. There is a general incentive to publish results for the public good – scientific credit goes to those who publish. Secrecy is sometimes accepted in cases where secret-keeping is needed to boost incentives, but the system is designed to limit this secrecy to cases where it is really needed.

But this system isn’t perfect. As the astronomers note, the price of secrecy is that followup work by others is delayed. Sometimes the delay isn’t too serious – 2003UB313 will still be plodding along in its orbit and there will be plenty of time to study it later. But sometimes delay is a bigger deal, as when an astronomical object is short-lived and cannot be studied at all later. Another example, which arises more often in computer security, is when the discovery is about an ongoing risk to the public which can be mitigated more quickly if it is more widely known. Scientific ethics tend to require at least partial publication in cases like these.

What’s most notable about the scientific system is that it works pretty well, at least within the subject matter of science, and it does so without much involvement by laws or lawyers.

RIAA, MPAA Join Internet2 Consortium

RIAA and MPAA, trade associations that include the major U.S. record and movie companies, joined the Internet2 consortium on Friday, according to a joint press release. I’ve heard some alarm about this, suggesting that this will allow the AAs to control how the next generation Internet is built. But once we strip away the hype, there’s not much to worry about in this announcement.

Despite its grand name, Internet2 is not a new network. Its main purpose has been to add some fast links to today’s Internet, to connect bandwidth-hungry universities, e.g., so that researchers at one university can explore the results of climate simulations done at a peer university. The Internet2 links carry traffic of all sorts and they use the same protocols as the rest of the Internet.

A lesser function of Internet2 is to host discussions among researchers studying specific topics. It’s good when people studying similar problems can talk to each other, as long as one group isn’t put in charge of what the other groups do. And as I understand it, the Internet2 discussions are just that – discussions – and not a top-down management structure. So it doesn’t look to me like Internet2, as a corporate body, could do much to divert the natural course of research, even if it wanted to.

Finally, Internet2 is not in a position to dictate what technology gets deployed in the future Internet. Internet2 may give birth to ideas that are then adopted by the industry; but those ideas will only be deployed if market pressures drive the industry to build them. If the AAs think that they can sit down with Internet2 and negotiate the future of the Internet, they’re sadly mistaken. But I very much doubt that that’s what they think.

So why are the AAs joining Internet2? My guess is that they joined for mostly the same reasons that other non-IT-industry corporate members did. Why did Johnson and Johnson join? Why did Ford join? Because their business strategies depend on the future of high-performance networks. The same is true of the record and movie companies. Their business models will one day center on online, digital distribution of content. It’s best for them, and probably for everybody else too, if they face that future squarely, right away. I hope their presence in Internet2 will help them see what is coming, and figure out how to adapt to it.

Acoustic Snooping on Typed Information

Li Zhuang, Feng Zhou, and Doug Tygar have an interesting new paper showing that if you have an audio recording of somebody typing on an ordinary computer keyboard for fifteen minutes or so, you can figure out everything they typed. The idea is that different keys tend to make slightly different sounds, and although you don’t know in advance which keys make which sounds, you can use machine learning to figure that out, assuming that the person is mostly typing English text. (Presumably it would work for other languages too.)

Asonov and Agrawal had a similar result previously, but they had to assume (unrealistically) that you started out with a recording of the person typing a known training text on the target keyboard. The new method eliminates that requirement, and so appears to be viable in practice.

The algorithm works in three basic stages. First, it isolates the sound of each individual keystroke. Second, it takes all of the recorded keystrokes and puts them into about fifty categories, where the keystrokes within each category sound very similar. Third, it uses fancy machine learning methods to recover the sequence of characters typed, under the assumption that the sequence has the statistical characteristics of English text.

The third stage is the hardest one. You start out with the keystrokes put into categories, so that the sequence of keystrokes has been reduced to a sequence of category-identifiers – something like this:

35, 12, 8, 14, 17, 35, 6, 44, …

(This means that the first keystroke is in category 35, the second is in category 12, and so on. Remember that keystrokes in the same category sound alike.) At this point you assume that each key on the keyboard usually (but not always) generates a particular category, but you don’t know which key generates which category. Sometimes two keys will tend to generate the same category, so that you can’t tell them apart except by context. And some keystrokes generate a category that doesn’t seem to match the character in the original text, because the key happened to sound different that time, or because the categorization algorithm isn’t perfect, or because the typist made a mistake and typed a garbbge charaacter.

The only advantage you have is that English text has persistent regularities. For example, the two-letter sequence “th” is much more common than “rq”, and the word “the” is much more common than “xprld”. This turns out to be enough for modern machine learning methods to do the job, despite the difficulties I described in the previous paragraph. The recovered text gets about 95% of the characters right, and about 90% of the words. It’s quite readable.
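To make the role of those regularities concrete, here is a minimal sketch of how one might score two candidate category-to-letter mappings against two-letter statistics. It is not the paper's method, which is more sophisticated; the sample text, the category sequence, and both mappings are made up for illustration.

```python
import math
from collections import Counter

# Tiny stand-in for a real English corpus (an assumption for illustration).
SAMPLE = "the quick brown fox jumps over the lazy dog and then the other thing"

def bigram_model(text):
    """Return a function giving the add-one-smoothed log-probability of a bigram."""
    counts = Counter(zip(text, text[1:]))
    total = sum(counts.values())
    vocab = 27 * 27  # 26 letters plus space, squared
    return lambda a, b: math.log((counts[(a, b)] + 1) / (total + vocab))

def score(categories, mapping, logp):
    """Decode the categories with the mapping and sum the bigram log-probabilities."""
    decoded = "".join(mapping[c] for c in categories)
    return sum(logp(a, b) for a, b in zip(decoded, decoded[1:])), decoded

logp = bigram_model(SAMPLE)
categories = [35, 12, 8, 14, 35, 12, 8]    # hypothetical category sequence
good = {35: "t", 12: "h", 8: "e", 14: " "}  # hypothetical mapping that yields English
bad = {35: "r", 12: "q", 8: "x", 14: " "}   # hypothetical mapping that yields gibberish

good_score, good_text = score(categories, good, logp)
bad_score, bad_text = score(categories, bad, logp)
print(good_text, good_score)
print(bad_text, bad_score)
```

The mapping that decodes to English-looking text scores higher, which is the signal a search over mappings would climb. A real attack also has to cope with noisy categories and keys that share a category.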

[Exercise for geeky readers: Assume that there is a one-to-one mapping between characters and categories, and that each character in the (unknown) input text is translated infallibly into the corresponding category. Assume also that the input is typical English text. Given the output category-sequence, how would you recover the input text? About how long would the input have to be to make this feasible?]

If the user typed a password, that can be recovered too. Although passwords don’t have the same statistical properties as ordinary text (unless they’re chosen badly), this doesn’t pose a problem as long as the password-typing is accompanied by enough English-typing. The algorithm doesn’t always recover the exact password, but it can come up with a short list of possible passwords, and the real password is almost always on this list.
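One way to picture the short list: suppose that, after training on the surrounding English text, each password keystroke comes with a few plausible characters and rough probabilities. The sketch below ranks the combinations. The per-keystroke guesses and the independence assumption are mine, not taken from the paper.

```python
import itertools
import math

# Hypothetical per-keystroke guesses: (character, probability) for each position.
per_key = [
    [("p", 0.7), ("o", 0.3)],
    [("a", 0.9), ("s", 0.1)],
    [("s", 0.6), ("d", 0.4)],
    [("s", 0.6), ("d", 0.4)],
]

def top_candidates(per_key, k):
    """Enumerate every combination and keep the k most probable strings.

    Treats keystrokes as independent, so a string's probability is the
    product of its per-character probabilities (a simplifying assumption).
    """
    scored = [
        ("".join(ch for ch, _ in combo), math.prod(p for _, p in combo))
        for combo in itertools.product(*per_key)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

shortlist = [word for word, _ in top_candidates(per_key, 5)]
print(shortlist)
```

Brute-force enumeration is fine for a toy case like this one. Longer passwords would need a smarter search, but the idea of trying candidates in order of likelihood is the same.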

This is yet another reminder of how much computer security depends on controlling physical access to the computer. We’ve always known that anybody who can open up a computer and work on it with tools can control what it does. Results like this new one show that getting close to a machine with sensors (such as microphones, cameras, power monitors) may compromise the machine’s secrecy.

There are even some preliminary results showing that computers make slightly different noises depending on what computations they are doing, and that it might be possible to recover encryption keys if you have an audio recording of the computer doing decryption operations.

I think I’ll go shut my office door now.

Aussie Judge Tweaks Kazaa Design

A judge in Australia has found Kazaa and associated parties liable for indirect copyright infringement, and has tentatively imposed a partial remedy that requires Kazaa to institute keyword-based filtering.

The liability finding is based on a conclusion that Kazaa improperly “authorized” infringement. This is roughly equivalent to a finding of indirect (i.e. contributory or vicarious) infringement under U.S. law. I’m not an expert in Australian law, so on this point I’ll refer you to Kim Weatherall’s recap.

As a remedy, the Kazaa parties will have to pay 90% of the copyright owners’ trial expenses, and will have to pay damages for infringement, in an amount to be determined by future proceedings. (According to Kim Weatherall, Australian law does not allow the copyright owners to reap automatic statutory damages as in the U.S. Instead, they must prove actual damages, although the damages are boosted somehow for infringements that are “flagrant”.)

More interestingly, the judge has ordered Kazaa to change the design of their product, by incorporating keyword-based filtering. Kazaa allows users to search for files corresponding to certain artist names and song titles. The required change would disallow search terms containing certain forbidden patterns.

Designing such a filter is much harder than it sounds, because there are so many artist names and song names. These two namespaces are so crowded that a great many common names given to non-infringing recordings are likely to contain forbidden patterns.

The judge’s order uses the example of the band Powderfinger. Presumably the modified version of Kazaa would ban searches with “Powderfinger” as part of the artist name. This is all well and good when the artist name is so distinctive. But what if the artist name is a character string that occurs frequently in names, such as “beck”, “smiths”, or “x”? (All are names of artists with copyrighted recordings.) Surely there will be false positives.

It’s even worse for song names. You would have to ban simple words and phrases, like “Birthday”, “Crazy”, “Morning”, “Sailing”, and “Los Angeles”, to name just a few. (All are titles of copyrighted recordings.)
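Here is a rough sketch of the false-positive problem. It assumes the filter does case-insensitive substring matching, which the order may or may not end up requiring. The forbidden patterns are the artist names mentioned above, and the search queries are invented.

```python
# Patterns taken from the artist-name examples above.
FORBIDDEN = ["powderfinger", "beck", "smiths", "x"]

def is_blocked(query, patterns=FORBIDDEN):
    """Block a search if any forbidden pattern occurs as a substring (an assumed rule)."""
    q = query.lower()
    return any(p in q for p in patterns)

# Invented queries: one intended target and several innocent ones.
queries = [
    "Powderfinger live",          # the intended target
    "Rebecka's home recording",   # contains "beck"
    "Blacksmiths guild podcast",  # contains "smiths"
    "Saxophone practice",         # contains "x"
    "my own lecture notes",       # contains no forbidden pattern
]

blocked = [q for q in queries if is_blocked(q)]
for q in queries:
    print("BLOCKED" if is_blocked(q) else "allowed", "-", q)
```

Matching on whole words instead of substrings would spare some of these queries, but it would still block every search containing a word like “Birthday” or “Crazy”.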

The judge’s order asks the parties to agree on the details of how a filter will work. If they can’t agree on the details, the judge will decide. Given the enormous number of artist and song names, and the crowded namespace, there are a great many details to decide, balancing over- and under-inclusiveness. It’s hard to see how the parties can agree on all of the details, or how the judge can impose a detailed design. The only hope is to appoint some kind of independent arbiter to make these decisions.

Ultimately, I think the tradeoff between over- and under-inclusiveness will prove too difficult – the filters will either fail to block many infringing files, or will block many non-infringing files, or both.

This is the same kind of filtering that Judge Patel ordered Napster to use, after she found Napster liable for indirect infringement. It didn’t work for Napster. Users just changed the spelling of artist and song names, adopting standard misspellings (e.g., “Metallica” changed to “Metalica” or “MetalIGNOREica” or the Pig Latin “Itallicamay”), or encoding the titles somehow. Napster updated its filters to compensate, but was always one step behind. And Napster’s job was easier, because the filtering was done on Napster’s own computers. Kazaa will have to try to download updates to users’ computers every time it changes its filters.
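The evasion is easy to reproduce with a toy filter. The sketch below assumes a simple case-insensitive substring check, which is not necessarily what Napster actually ran, and tests it against the misspellings mentioned above.

```python
def naive_filter(title, banned=("metallica",)):
    """Return True if the title contains a banned name (assumed matching rule)."""
    t = title.lower()
    return any(b in t for b in banned)

# The original name and the three respellings from the text.
variants = ["Metallica", "Metalica", "MetalIGNOREica", "Itallicamay"]
results = {v: naive_filter(v) for v in variants}

for name, caught in results.items():
    print(name, "-> blocked" if caught else "-> slips through")
```

Only the correctly spelled name is caught. Adding each new variant to the banned list is exactly the game of catch-up described above.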

To the judge’s credit, he acknowledges that filtering will be imprecise and might even fail miserably. So he orders only that Kazaa must use filtering, but not that the filtering must succeed in stopping infringement. As long as Kazaa makes its best effort to make the agreed-upon (or ordered) filtering scheme work, it will have satisfied the order, even if infringement goes on.

Kim Weatherall calls the judge’s decision “brave”, because it wades into technical design and imposes a remedy that requires an ongoing engagement between the parties, two things that courts normally try to avoid. I’m not optimistic about this remedy – it will impose costs on both sides and won’t do much to stop infringement. But at least the judge didn’t just order Kazaa to stop all infringement, an order with which no general-purpose communication technology could ever hope to comply.

In the end, the redesign may be moot, as the prospect of financial damages may kill Kazaa before the redesign must occur. Kazaa is probably dying anyway, as users switch to newer services. From now on, the purpose of Kazaa, in the words of the classic poster, may be to serve as a warning to others.

Back in the Saddle

Hi, all. I’m back from a lovely vacation, which included a stint camping in Sequoia / Kings Canyon National Park, beyond the reach of Internet technology. In transit, I walked right by Jack Valenti in the LA airport. He looked as healthy as ever, and more relaxed than in his MPAA days.

Blogging will resume tomorrow, once I’ve dug out sufficiently from the backlog. In the meantime, I recommend reading Kim Weatherall’s summary of the Australian judge’s decision in the Kazaa case.