November 27, 2020

Boosting

Congratulations to my Princeton colleague Rob Schapire on winning ACM’s prestigious Kanellakis Award (shared with Columbia’s Yoav Freund). The annual award is given for a contribution to theoretical computer science that has a significant practical impact. Schapire and Freund won this year for an idea called boosting, so named because it can take a mediocre machine learning algorithm and automatically “boost” it into a really good one. The basic idea is cool, and not too hard to understand at a basic level.

A common type of machine learning problem involves learning how to classify objects based on examples. A learning algorithm (i.e., a computer program) is shown a bunch of example objects, each of which falls into one of two categories. Each example object has a label saying which category it is in. Each object can be labeled with an “importance weight” that tells us the relative importance of categorizing that object correctly; objects with higher weights are more important. The machine learning algorithm’s job is to figure out a rule that can be used to distinguish the two categories of objects, so that it can categorize objects that it hasn’t seen before. The algorithm isn’t told what to look for, but has to figure that out for itself.

Any fool can “solve” this problem by creating a rule that just guesses at random. That method will get a certain number of cases right. But can we do better than random? And if so, how much better?

Suppose we have a machine learning algorithm that does just a little better than random guessing for some class of problems. Schapire and Freund figured out a trick for “boosting” the performance of any such algorithm. To use their method, start by using the algorithm on the example data to deduce a rule. Call this Rule 1. Now look at each example object and see whether Rule 1 categorizes it correctly. If Rule 1 gets an object right, then lower that object’s importance weight a little; if Rule 1 gets an object wrong, then raise that object’s weight a little. Now run the learning algorithm again on the objects with the tweaked weights, to deduce another rule. Call this Rule 1a.

Intuitively, Rule 1a is just like Rule 1, except that Rule 1a pays extra attention to the examples that Rule 1 got wrong. We can think of Rule 1a as a kind of correction factor that is designed to overcome the mistakes of Rule 1. What Shapire and Freund proved is that if you combine Rule 1 and Rule 1a in a certain special way, the combined rule that results is guaranteed to be more accurate than Rule 1. This trick takes Rule 1, a mediocre rule, and makes it better.

The really cool part is that you can then take the improved rule and apply the same trick again, to get another rule that is even better. In fact, you can keep using the trick over and over to get rules that are better and better.

Stated this way, the idea doesn’t seem to complicated. Of course, the devil is in the details. What makes this discovery prize-worthy is that Schapire and Freund worked out the details of exactly how to tweak the weights and exactly how to combine the partial rules – and they proved that the method does indeed yield a better rule. That’s a very nice bit of computer science.

(Princeton won more of the major ACM awards than any other institution this year. Besides Rob Schapire’s award, Jennifer Rexford won the Grace Murray Hopper award (for contributions by an under-35 computer scientist) for her work on Internet routing protocols, and incoming faculty member Boaz Barak won the Dissertation Award for some nice crypto results. Not that we’re gloating or anything….)

Comments

  1. Do you have a link to their paper?

  2. Darren Coolidge says:

    Perhaps you can introduce genetic algorithms into the mix. You can deduce a bunch of rules from rule 1 to like rule 1a.1 rule 1a.2 … rule 1a.N. Than you can go back and pick the best ones and deduce rules from the population and produce a new population. That would be cool…

  3. Boosting link

  4. David Z says:

    Sounds like this has some very practical applications. Perhaps I’m being a bit wacky, but it’s also a bit exciting as it’s a significant and directly-applicable potential component of a future self-improving AI.

  5. If you keep applying the boosting technique over and over again, what happens asymptotically? Presumably the quality of the algorithm must converge to something or other, but what quality will it converge to? How much better can repeated application of boosting make the learning process?

  6. Yikes, what’s with the self-improving ai link? A race between nanotechnology researchers and ai researchers to see who can end the world first? Grey goo, or the Terminatrix, which will it be? 🙂