November 29, 2024

Boosting

Congratulations to my Princeton colleague Rob Schapire on winning ACM’s prestigious Kanellakis Award (shared with Columbia’s Yoav Freund). The annual award is given for a contribution to theoretical computer science that has a significant practical impact. Schapire and Freund won this year for an idea called boosting, so named because it can take a mediocre machine learning algorithm and automatically “boost” it into a really good one. The basic idea is cool, and not too hard to understand.

A common type of machine learning problem involves learning how to classify objects based on examples. A learning algorithm (i.e., a computer program) is shown a bunch of example objects, each of which falls into one of two categories. Each example object has a label saying which category it is in. Each object can be labeled with an “importance weight” that tells us the relative importance of categorizing that object correctly; objects with higher weights are more important. The machine learning algorithm’s job is to figure out a rule that can be used to distinguish the two categories of objects, so that it can categorize objects that it hasn’t seen before. The algorithm isn’t told what to look for, but has to figure that out for itself.

Any fool can “solve” this problem by creating a rule that just guesses at random. That method will get about half of the cases right. But can we do better than random? And if so, how much better?

Suppose we have a machine learning algorithm that does just a little better than random guessing for some class of problems. Schapire and Freund figured out a trick for “boosting” the performance of any such algorithm. To use their method, start by using the algorithm on the example data to deduce a rule. Call this Rule 1. Now look at each example object and see whether Rule 1 categorizes it correctly. If Rule 1 gets an object right, then lower that object’s importance weight a little; if Rule 1 gets an object wrong, then raise that object’s weight a little. Now run the learning algorithm again on the objects with the tweaked weights, to deduce another rule. Call this Rule 1a.

Intuitively, Rule 1a is just like Rule 1, except that Rule 1a pays extra attention to the examples that Rule 1 got wrong. We can think of Rule 1a as a kind of correction factor that is designed to overcome the mistakes of Rule 1. What Schapire and Freund proved is that if you combine Rule 1 and Rule 1a in a certain special way, the combined rule that results is guaranteed to be more accurate than Rule 1. This trick takes Rule 1, a mediocre rule, and makes it better.

The really cool part is that you can then take the improved rule and apply the same trick again, to get another rule that is even better. In fact, you can keep using the trick over and over to get rules that are better and better.
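
For readers who want to see the mechanics, here is a minimal sketch of the reweight-and-combine loop in Python. (Schapire and Freund’s algorithm is known as AdaBoost; the weight updates and combining coefficients below follow the standard AdaBoost recipe, while weak_learn is a hypothetical stand-in for whatever mediocre learning algorithm you start with.)

    import numpy as np

    def boost(weak_learn, X, y, rounds=10):
        # y holds labels in {-1, +1}; weak_learn(X, y, w) must return a rule h
        # with h(X) in {-1, +1} that does a little better than random guessing
        # on the examples as weighted by w.
        n = len(y)
        w = np.full(n, 1.0 / n)                 # start with equal importance weights
        rules, alphas = [], []

        for _ in range(rounds):
            h = weak_learn(X, y, w)             # deduce a rule from the weighted examples
            pred = h(X)
            err = w[pred != y].sum()            # weighted error of this rule
            err = min(max(err, 1e-12), 1 - 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)   # how much to trust this rule

            w = w * np.exp(-alpha * y * pred)   # lower weights where the rule was right,
            w = w / w.sum()                     # raise them where it was wrong, renormalize

            rules.append(h)
            alphas.append(alpha)

        def combined(Xnew):
            # The final rule is a weighted vote of all the partial rules.
            return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, rules)))
        return combined

Each pass through the loop is one application of the trick: deduce a rule from the tweaked weights, measure how well it does, and fold it into the weighted vote.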

Stated this way, the idea doesn’t seem too complicated. Of course, the devil is in the details. What makes this discovery prize-worthy is that Schapire and Freund worked out the details of exactly how to tweak the weights and exactly how to combine the partial rules – and they proved that the method does indeed yield a better rule. That’s a very nice bit of computer science.

(Princeton won more of the major ACM awards than any other institution this year. Besides Rob Schapire’s award, Jennifer Rexford won the Grace Murray Hopper Award (for contributions by a computer scientist under 35) for her work on Internet routing protocols, and incoming faculty member Boaz Barak won the Doctoral Dissertation Award for some nice crypto results. Not that we’re gloating or anything….)

Computer Science Professors' Brief in Grokster

Today, seventeen computer science professors (including me) are filing an amicus brief with the Supreme Court in the Grokster case. Here is the summary of our argument, quoted from the brief:

Amici write to call to the Court’s attention several computer science issues raised by Petitioners [i.e., the movie and music companies] and amici who filed concurrent with Petitioners, and to correct certain of their technical assertions. First, the United States’ description of the Internet’s design is wrong. P2P networks are not new developments in network design, but rather the design on which the Internet itself is based. Second, a P2P network design, where the work is done by the end user’s machine, is preferable to a design which forces work (such as filtering) to be done within the network, because a P2P design can be robust and efficient. Third, because of the difficulty in designing distributed networks, advances in P2P network design – including BitTorrent and Respondents’ [i.e., Grokster’s and Streamcast’s] software – are crucial to developing the next generation of P2P networks, such as the NSF-funded IRIS Project. Fourth, Petitioners’ assertion that filtering software will work fails to consider that users cannot be forced to install the filter, that filtering software is unproven, or that users will find other ways to defeat the filter. Finally, while Petitioners state that infringers’ anonymity makes legal action difficult, the truth is that Petitioners can obtain IP addresses easily and have filed lawsuits against more than 8,400 alleged infringers. Because Petitioners seek a remedy that will hobble advances in technology, while they have other means to obtain relief for infringement, amici ask the Court to affirm the judgment below.

The seventeen computer science professors are Harold Abelson (MIT), Thomas Anderson (U. Washington), Andrew W. Appel (Princeton), Steven M. Bellovin (Columbia), Dan Boneh (Stanford), David Clark (MIT), David J. Farber (CMU), Joan Feigenbaum (Yale), Edward W. Felten (Princeton), Robert Harper (CMU), M. Frans Kaashoek (MIT), Brian Kernighan (Princeton), Jennifer Rexford (Princeton), John C. Reynolds (CMU), Aviel D. Rubin (Johns Hopkins), Eugene H. Spafford (Purdue), and David S. Touretzky (CMU).

Thanks to our counsel, Jim Tyre and Vicky Hall, for their work in turning a set of ideas and chunks of rough text into a coherent brief.

Forecast for Infotech Policy in the New Congress

Cameron Wilson, Director of the ACM Public Policy Office in Washington, looks at changes (made already or widely reported) in the new Congress and what they tell us about likely legislative action. (He co-writes the ACM U.S. Public Policy Blog, which is quite good.)

He mentions four hot areas. The first is regulation of peer-to-peer technologies. Once the Supreme Court’s decision in Grokster comes down, expect Congress to spring into action, to protect whichever side claims to be endangered by the decision. A likely focal point for this is the new Intellectual Property subcommittee of the Senate Judiciary Committee. (The subcommittee will be chaired by Sen. Orrin Hatch, who has not been shy about regulating infotech in the name of copyright. He championed the Induce Act.) This issue will start out being about P2P but could easily expand to regulate a wider class of technologies.

The second area is telecom. Sen. Ted Stevens is the new chair of the Senate Commerce Committee, and he seems eager to work on a big revision of the Telecom Act of 1996. This will be a battle royal involving many interest groups, and telecom policy wonks will be fully absorbed. Regulation of non-telecom infotech products seems likely to creep into the bill, given the technological convergence of telecom with the Internet.

The third area is privacy. The Real ID bill, which would standardize state driver’s licenses to create a de facto national ID card, is controversial but seems likely to become law. The recent ChoicePoint privacy scandal may drive further privacy legislation. Congress is likely to do something about spyware as well.

The fourth area is security and reliability of systems. Many people on the Hill will want to weigh in on this issue, but it’s not clear what action will be taken. There are also questions over which committees have jurisdiction. Many of us hope that PITAC’s report on the sad state of cybersecurity research funding will trigger some action.

As someone famous said, it’s hard to make predictions, especially about the future. There will surely be surprises. About the only thing we can be sure of is that infotech policy will get even more attention in this Congress than in the last one.

More on Ad-Blocking

I’m on the road today, so I don’t have a long post for you. (Good news: I’m in Rome. Bad news: It’s Rome, New York.)

Instead, let me point you to an interesting exchange about copyright and ad-blocking software on my course blog, in which “Archer” opens with a discussion of copyright and advertising revenue, and Harlan Yu responds by asking whether distributing Firefox AdBlock is a contributory infringement.

There’s plenty of interesting writing on the course blog. Check it out!

UPDATE (Feb. 28): Another student, “Unsuspecting Innocent,” has more on this topic.

Can P2P Nets Be Poisoned?

Christin, Weigend, and Chuang have an interesting new paper on corruption of files in P2P networks. Some files are corrupted accidentally (they call this “pollution”), and some might be corrupted deliberately (“poisoning”) by copyright owners or their agents. The paper measures the availability of popular, infringing files on the eDonkey, Overnet, Gnutella, and FastTrack networks, and simulates the effect of different pollution strategies that might be used.

The paper studied a few popular files for which corruption efforts were not occurring (or at least not succeeding). Polluted versions of these files were found, especially on FastTrack, but they weren’t a barrier to user access, because non-corrupted versions tend to have more replicas available than polluted ones do, and the systems return files with more replicas first.

They move on to simulate the effect of various pollution strategies. They conclude that a sufficiently sophisticated pollution strategy, which injects different decoy versions of a file at different times, and injects many replicas of the same decoy at the same time, would significantly reduce user access to targeted files.
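
To get a feel for why replica counts matter, here is a toy back-of-the-envelope model in Python (mine, not the paper’s simulation). It assumes a client picks among the available versions of a file in proportion to how many replicas each version has, since versions with more replicas are returned first; the function name and the numbers are made up purely for illustration.

    import random

    def fetch_success_rate(clean_replicas, decoy_versions, decoys_per_version,
                           trials=10000, seed=0):
        # Toy model: the client picks a version with probability proportional
        # to its replica count, and "succeeds" if it gets the clean version.
        rng = random.Random(seed)
        versions = [("clean", clean_replicas)]
        versions += [("decoy%d" % i, decoys_per_version) for i in range(decoy_versions)]
        total = sum(count for _, count in versions)

        hits = 0
        for _ in range(trials):
            pick = rng.randrange(total)
            for name, count in versions:
                if pick < count:
                    if name == "clean":
                        hits += 1
                    break
                pick -= count
        return hits / trials

    # Casual pollution: a couple of lightly replicated decoys barely matter.
    print(fetch_success_rate(clean_replicas=50, decoy_versions=2, decoys_per_version=5))
    # Aggressive poisoning: many distinct decoys, each heavily replicated.
    print(fetch_success_rate(clean_replicas=50, decoy_versions=20, decoys_per_version=50))

In the first scenario the clean copies swamp the decoys and most downloads succeed; in the second, heavy decoy injection drives the success rate down to a few percent.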

Some P2P programs use simple reputation systems to try to distinguish corrupted files from non-corrupted ones; the paper argues that these will be ineffective against its best pollution strategy. But the authors also note that better reputation systems could detect their sophisticated poisoning strategy.

They don’t say anything more about the arms race between reputation technologies and pollution technologies. My guess is that in the long run reputation systems will win, and poisoning strategies will lose their viability. In the meantime, though, it looks like copyright owners have much to gain from poisoning.

[UPDATE (6:45 PM): I changed the second paragraph to eliminate an error that was caused by my misreading of the paper. Originally I said, incorrectly, that the study found little if any evidence of pollution for the files they studied. In fact, they chose those files because they were not subject to pollution. Thanks to Cypherpunk, Joe Hall, and Nicolas Christin for pointing out my error.]