This week I devoted my time to learning Boosting methods, a family of two-class classification algos that are easy to use but excellent in performance. How does Boosting achieve this? People are still seeking the explanation, notably in terms of the 'margin' theory. In the following, I will present the basics of Boosting and, moreover, some of the efforts people have made to reveal the mystery of Boosting methods.

I. Basics of Boosting

1. The Formulation

The concept of Boosting comes from the idea of a majority vote among a series of Weak Learners (WLs). Boosting here refers to a general method of producing a very accurate prediction rule by combining several rough and moderately inaccurate rules (the Weak Learners). AdaBoost, proposed in 1995, solved many of the practical difficulties of the earlier boosting algos and is the most representative algo in the Boosting family. The basic idea of AdaBoost can be illustrated as below,

$G(x) = sgn[f(x)]= sgn[\sum_{m} \alpha_{m} G_{m}(x)]$

where $\{G_{m}(x)\}$ are a series of WLs that map $\cal{X}$ to $\{\pm 1\}$, and $\alpha_{m}$ are their respective weights in the final result, like each voter's influence in the final vote. The following question is then how to choose $\{G_{m}(x)\}$ and $\alpha_{m}$.
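As a minimal sketch of this formulation (the stumps, thresholds, and weights below are made-up for illustration, not learned by any algorithm), the final classifier is just a weighted vote:

```python
import numpy as np

# Hypothetical weak learners: decision stumps on a 1-D feature, each
# mapping x to {+1, -1}. Thresholds and weights are illustrative only.
stumps = [lambda x: np.where(x > 0.3, 1, -1),
          lambda x: np.where(x > 0.6, 1, -1),
          lambda x: np.where(x < 0.9, 1, -1)]
alphas = [0.8, 0.5, 0.4]          # weights alpha_m of each weak learner

def G(x):
    """Weighted majority vote: sign of sum_m alpha_m * G_m(x)."""
    f = sum(a * g(x) for a, g in zip(alphas, stumps))
    return np.sign(f)

x = np.array([0.1, 0.5, 0.95])
print(G(x))   # each point is classified by the sign of the weighted sum
```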

2. Calculation of $\{G_{m}(x)\}$ and $\alpha_{m}$

It has been shown that the optimization underlying AdaBoost can be formulated as an exponential loss minimization problem,

$\min_{\alpha, G()} \sum_{i} Loss [y_{i}, f(x_{i})]$.

To solve the problem, on each round AdaBoost chooses $G_{m}()$ and then sets $\alpha_{m}$ so as to minimize the above function.

The solver used by AdaBoost is, in essence, doing a kind of steepest descent search to minimize the objective. Since the objective has too many variables, the search is constrained at each round via a coordinate descent algo that solves the problem step by step. So in each step, we just solve the subproblem below,

$\min_{\alpha_{m}, G_{m}()} \sum_{i} Loss [y_{i}, f_{m-1}(x_{i}) + \alpha_{m}G_{m}(x_{i})]$,

where only one weight and function is determined in each optimization (round).

The details concerning the solution of $\{G_{m}(x)\}$ and $\alpha_{m}$ can be found in many papers, books and, more illustratively, some tutorial videos, so here we omit them. Details aside, however, there are important points to note about the solution algo.
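Although the derivation is omitted here, the resulting per-round rules are the standard AdaBoost updates (stated without proof): pick the WL with the lowest weighted error $\epsilon$, set $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$, and re-weight the examples. A minimal one-round sketch, with made-up decision stumps and data:

```python
import numpy as np

def adaboost_round(X, y, w, stumps):
    """One AdaBoost round: pick the stump with the lowest weighted error,
    set its vote weight alpha_m, and re-weight the examples."""
    errs = np.array([np.sum(w * (g(X) != y)) for g in stumps])
    m = int(np.argmin(errs))                    # 'optimal' strategy: best WL this round
    err = errs[m]
    alpha = 0.5 * np.log((1 - err) / err)       # alpha_m = (1/2) ln((1-err)/err)
    w = w * np.exp(-alpha * y * stumps[m](X))   # misclassified examples get heavier
    return m, alpha, w / w.sum()                # renormalize to a distribution

# Toy 1-D data and two hypothetical stumps (illustrative, not from the post)
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([-1, -1, -1, 1])
w = np.full(4, 0.25)
stumps = [lambda x: np.where(x > 0.5, 1, -1),
          lambda x: np.where(x > 0.2, 1, -1)]
m, alpha, w = adaboost_round(X, y, w, stumps)
print(m, alpha, w)   # the one misclassified example (x = 0.6) now carries weight 0.5
```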

• The solution process is heuristic: in each round the parameters ($\alpha_{m}, G_{m}(x), w_{i,m}$) are optimized based on the result of the latest WL $G_{m-1}(x)$. Yet the resulting solution still performs remarkably well.
• Regarding the choice of $G_{m}(x)$ in each round, there are two strategies: optimal vs. non-optimal. The optimal case always chooses the best WL in round m, while the non-optimal case just chooses one that is above a threshold. The two strategies lead to different performance of the final classification function $f(x)$.

3. The Performance : Training Error and Generalization Error

It can be proven that the training error of AdaBoost can be upper-bounded by

$\frac{1}{N} \sum_{i} I(y_{i} \neq G(x_{i})) \leq \exp(-2 \sum_{m} \gamma_{m}^{2})$.

Here $\gamma_{m} = 1/2 - \epsilon_{m}$ is the edge of the m-th WL, i.e. how much its weighted error $\epsilon_{m}$ beats random guessing. Therefore, the training error of AdaBoost decreases exponentially as the number of rounds increases, turning a bunch of WLs into a strong learner.
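As a quick numerical illustration (the per-round edges below are assumed constants, not derived from data), even tiny edges drive the bound toward zero:

```python
import numpy as np

# Bound exp(-2 * sum_m gamma_m^2) on the training error, where
# gamma_m = 1/2 - err_m is the edge of round m's WL (values illustrative).
gammas = np.full(40, 0.1)                    # 40 rounds, each WL barely beats chance
bounds = np.exp(-2 * np.cumsum(gammas ** 2))
print(bounds[0], bounds[-1])                 # bound after round 1 vs after round 40
```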

But people are more concerned with the generalization error (test error) of AdaBoost. Empirically, it is observed that the generalization error of AdaBoost continues to decrease even after the training error reaches zero. This surprising phenomenon contradicts early theoretical predictions, given by the following generalization error upper bound:

$Pr[H(x) \neq y] \leq \hat{Pr}[H(x) \neq y] + \tilde{O} \left( \sqrt{\frac{Td}{m}} \right)$,

where the first term on the right is the empirical (training) error, $T$ is the number of rounds, $d$ the VC dimension of the WL class, and $m$ the number of training examples. The bound clearly goes up as $T$ increases. So a number of researchers puzzled over this unusual finding, that a scheme as simple as AdaBoost could have such great performance in practice.

II. Explanations

1. The Loss Function

It has been proved by [Friedman, Hastie, Tibshirani, 2000] that the AdaBoost algo, in essence, minimizes the exponential loss over all examples. This is inferred by identifying the 'Additive Model' and 'Forward Stagewise' structure of AdaBoost. Omitting the proof details, we just present the final formulation:

$\min_{\alpha, G} \sum_{i} L (y_{i}, f(x_{i})) = \min_{\alpha, G} \sum_{i} \exp(-y_{i}f(x_{i}))$
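A toy check of why this loss is a sensible surrogate (the predictions below are made-up): $e^{-yf}$ upper-bounds the 0-1 error indicator, so driving the exponential loss down also drives the training error down.

```python
import numpy as np

# Exponential loss vs 0-1 training error on illustrative predictions.
y = np.array([1, 1, -1, -1])
f = np.array([2.0, 0.3, -1.0, 0.5])   # f(x_i); products y*f are 2, 0.3, 1, -0.5
exp_loss = np.mean(np.exp(-y * f))    # smooth surrogate loss
zero_one = np.mean(y * f <= 0)        # fraction misclassified
print(exp_loss, zero_one)             # exponential loss always >= 0-1 error
```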

2. Beyond the Error Rate: the Margin Theory

Beyond the exponential loss, people are still seeking explanations to uncover more of the success of AdaBoost, especially in terms of the margin theory. The margin tends to give a strong indication of a classifier's performance in practice, as established in the explanations for the SVM. Therefore, once such an explanation is found, it also becomes possible to compare AdaBoost with other popular classifiers, or to find links between them. So the question is: how can the performance of AdaBoost be explained by margin theory?

The margin of an example $(x_{i},y_{i})$ in AdaBoost is defined as below,

$\frac{ y_{i} \sum_{m} \alpha_{m} G_{m}(x_{i}) } { \sum_{m} \alpha_{m}}$
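As a toy numerical check (all WL predictions and weights below are made-up), the normalized margins can be computed directly from this definition:

```python
import numpy as np

# Normalized margin of each example under the vote: values lie in [-1, 1],
# and a positive margin means the weighted vote gets the example right.
alphas = np.array([0.8, 0.5, 0.4])        # WL weights alpha_m (illustrative)
preds = np.array([[ 1,  1, -1],           # row i = (G_1(x_i), G_2(x_i), G_3(x_i))
                  [ 1, -1,  1],
                  [-1, -1,  1]])
y = np.array([1, 1, -1])
margins = y * (preds @ alphas) / alphas.sum()
print(margins)                             # all positive: every example correctly voted
```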

Firstly, Schapire et al. gave an alternative analysis of the generalization error bound in terms of the margin theory, shown below:

$Pr[H(x) \neq y] \leq \hat{Pr}[margin_{f}(x,y) \leq \theta] + \tilde{O} \left( \sqrt{\frac{d}{m{\theta}^{2}}} \right)$, for any $\theta > 0$,

Note that this new upper bound is independent of the number of rounds $T$, and thus it explains why AdaBoost avoids overfitting.

Moreover, they proved that boosting continues to increase the margins of the training examples even after the training error reaches zero, which consequently keeps improving the generalization error bound round after round.

Then, following Rätsch and Warmuth's work, Rudin et al. studied the cyclic behavior of the weights $\alpha_{m}$, in order to find out whether AdaBoost provides a maximum margin solution and what the relationship between the two is. In particular, they stated that neither non-optimal nor optimal AdaBoost always converges to the maximum margin. Moreover, they found a sufficient condition linking cyclic behavior to the maximum margin in AdaBoost.

There are still efforts to discover what AdaBoost does in terms of margin optimization. Chunhua Shen's work has provided a generalization for a variety of Boosting methods, looking at their optimizations from the entropy view.

By comparing the primal and dual formulations of AdaBoost$_{l1}$, LPBoost (which maximizes the minimum margin) and LogitBoost (which minimizes the logistic loss), it concludes that the dual of AdaBoost$_{l1}$ is a Shannon entropy maximization (negative entropy minimization) problem, and that the others are generalizations of AdaBoost's entropy maximization problem.

Most of all, by exploring the statistical properties of the examples' margins, the author proved that the essence of AdaBoost$_{l1}$ is to maximize the average margin while minimizing the variance of the margins. When the examples are independently drawn and the set of WLs is infinite, their margins can be regarded as following a Gaussian distribution, and the above conclusion holds. But what does AdaBoost do without such constraints? It is still an open question.

3. Game theory and the Min-Max Theorem

In this section we will consider the problem from an alternative angle and discover more about the solution of AdaBoost. [Freund1996] states that the solution process of AdaBoost can be interpreted as a zero-sum repeated game in the standard form, as below:

$\bold{w}^{T}\bold{M}\bold{\alpha}$

where $\bold{w}$ and $\bold{\alpha}$ signify the weights on examples and on WLs respectively, and $\bold{M}$ is the 'correct matrix' with $M_{i,j} = I [G_{j}(x_{i}) = y_{i}]$. In game theory, the product above stands for the loss of the row player (who selects $\bold{w}$), and in turn the gain of the column player (who selects $\bold{\alpha}$).
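As a small sketch (the matrix entries and the two distributions below are made-up), the expected payoff of the game is just this bilinear product:

```python
import numpy as np

# 'Correct matrix' M: M[i, j] = 1 if WL j classifies example i correctly.
# Illustrative 3 examples x 2 weak learners.
M = np.array([[1, 0],
              [1, 1],
              [0, 1]])
w = np.array([0.2, 0.3, 0.5])        # row player: distribution over examples
alpha = np.array([0.6, 0.4])         # column player: distribution over WLs
payoff = w @ M @ alpha               # column player's gain = row player's loss
print(payoff)
```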

1) Min-Max Theorem

In each round, the row player tries to minimize its loss while the column player maximizes its gain, given each other's strategy. In AdaBoost, the game is played round after round and thus forms a repeated game (in this sense the stagewise optimization in AdaBoost is heuristic). The process stops when the game reaches a certain round limit T.

In the game, the row player would like to minimize following function,

$\min_{\bold{w}} \bold{w}^{T}\bold{M}\bold{\alpha^{*}}$

while the column player max the following,

$\max_{\alpha} \bold{w}^{*T}\bold{M}\bold{\alpha}$

Then the solution is found (game solved) when two players come to an agreement:

$\min_{\bold{w}} \bold{w}^{T}\bold{M}\bold{\alpha^{*}} = \max_{\alpha} \bold{w}^{*T}\bold{M}\bold{\alpha}$

This is exactly von Neumann's min-max theorem.
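To make the theorem concrete, here is a brute-force numerical check on a tiny made-up game matrix; a grid search over mixed strategies stands in for a proper LP solver, so this is only an illustration, not a general method.

```python
import numpy as np

# Check min_w max (w^T M) == max_a min (M a) on a 3x2 game (entries made-up).
M = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# Column player: mix probability p on WL 1, maximize the minimum payoff.
ps = np.linspace(0, 1, 1001)
maximin = max(min(M @ np.array([p, 1 - p])) for p in ps)

# Row player: grid over the probability simplex, minimize the maximum payoff.
grid = np.linspace(0, 1, 101)
minimax = min(max(np.array([w1, w2, 1 - w1 - w2]) @ M)
              for w1 in grid for w2 in grid if w1 + w2 <= 1 + 1e-12)
print(minimax, maximin)   # the two values agree: the value of the game
```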

2) Edge and Margin

The above game-theoretic model explains the roles of the row and column players in each iteration of AdaBoost. Basically, $\bold{w}$ minimizes the loss along its update, as a way of increasing the weight of the worst examples (the ones not correctly classified by the latest WL). On the other hand, $\bold{\alpha}$ maximizes the gain in order to select the best WL for the latest $\bold{w}$. Moreover, the above optimizations can be interpreted through the concepts of 'edge' and 'margin'.

The edge of a WL $j$ measures its performance (correct rate minus incorrect rate). In the above game setting, the edge of WL $j$ with respect to the example distribution $\bold{w}$ can be expressed as follows,

$Edge(j) = (\bold{w}^{T} \bold{M})_{j}$

The margin concept descends from that of the SVM. In the game, the margin of an example $i$ with respect to the WL weights $\bold{\alpha}$ can be expressed as follows,

$Margin(i) = (\bold{M} \bold{\alpha})_{i}$
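Putting the two formulas together on a made-up game matrix (the weights are illustrative, not the output of any run):

```python
import numpy as np

# Edge(j) = (w^T M)_j scores each WL; Margin(i) = (M alpha)_i scores each example.
M = np.array([[1, 0],
              [1, 1],
              [0, 1]])
w = np.array([0.5, 0.0, 0.5])        # example weights (row player)
alpha = np.array([0.5, 0.5])         # WL weights (column player)
edges = w @ M                        # one edge per weak learner
margins = M @ alpha                  # one margin per example
print(edges, margins)
```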

So from here we see that what the row and column players actually do is:

row player: minimize the maximum edge over all WLs;
column player: maximize the minimum margin over all examples.

It is this interactive process that establishes the repeated game. Unfortunately, since the process is heuristic (as always in such games), the final solution (game equilibrium) is not necessarily globally optimal. That may be one reason why AdaBoost cannot always converge to the max margin (this statement is just my own speculation).

3) Parameters Update and Learning Games

We know that both $\bold{w}$ and $\bold{\alpha}$ evolve through heuristic updates. Now let's have a look at this in more detail. Say in iteration m, $\bold{\alpha}$ updates according to the result of the game,

$\bold{\alpha}_{m} = \arg\max_{\bold{\alpha}} \min_{i} (\bold{M} (\bold{\alpha}_{m-1}+\bold{\alpha}))_{i}$

but each round it returns only a pure strategy, where the increment $\bold{\alpha}$ has only one nonzero entry (in the optimal case).

Moreover, it remains unclear to me whether the update of $\bold{w}$ obeys the min-max edge over all the WLs. Some papers claim that the process is based on regret matching in learning games. If so, can it converge to the max margin? If not, that may be another reason why AdaBoost cannot always converge to the max margin.

III. Loss Function and Margins: Two Lenses to Discover the Truth?

Above I have summarized two methods used to uncover the performance of AdaBoost. In fact, I think perhaps any two-class classifier can be evaluated under these two lenses: the Loss Function and the Margin Theory. If a new classifier catches our eye, we may check with the two lenses whether it can be explained accordingly. Once we have done so, we can establish theoretical reasoning about why that classifier does so well or so badly. In other words, the two lenses provide us two ways of understanding the essence of a classifier.

So, the remaining question is, are there any new lenses to evaluate a classifier?

REFERENCE:

1. Freund, Yoav, Robert Schapire, and N. Abe. “A Short Introduction to Boosting.” Journal of Japanese Society for Artificial Intelligence 14, no. 5 (1999): 771-780.

2. Freund, Yoav, and Robert E. Schapire. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” In Proceedings of the European Conference on Computational Learning Theory, 1995.

3. Schapire, Robert E, Yoav Freund, Peter Bartlett, and Wee Sun Lee. “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” The annals of statistics 26, no. 5 (1998): 1651-86.

4. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Additive Logistic Regression: A Statistical View of Boosting (with Discussion and a Rejoinder by the Authors).” The annals of statistics 28, no. 2 (2000): 337-407.

5. Rudin, Cynthia, Ingrid Daubechies, and Robert E. Schapire. “The Dynamics of Adaboost: Cyclic Behavior and Convergence of Margins.” J. Mach. Learn. Res. 5 (2004): 1557-95.

6. Shen, Chunhua, and Hanxi Li. “On the Dual Formulation of Boosting Algorithms.” IEEE Trans. Pattern Anal. Mach. Intell. 32, no. 12 (2010): 2216-31.

7. Freund, Yoav, and Robert E. Schapire. “Game Theory, on-Line Prediction and Boosting.” In Proceedings of the ninth annual conference on Computational learning theory, 325-32. Desenzano del Garda, Italy: ACM, 1996.

8. Schapire, Robert E, and Yoram Singer. “Improved Boosting Algorithms Using Confidence-Rated Predictions.” Machine learning 37, no. 3 (1999): 297-336.

9. Foster, Dean P., and Rakesh Vohra. “Regret in the On-Line Decision Problem.” Games and Economic Behavior 29, no. 1-2 (1999): 7-35.