A Beginner’s Guide to AdaBoost

@PatrickYoon
5 min read · Dec 3, 2021

So you’ve built your first model implementing a machine learning algorithm. However, when you run it, the results are disappointingly poor. You spend hours tuning parameters and debugging to see if anything is wrong. Eventually, you realize that a single model simply isn’t strong enough for your data. So what do you do?

Adaptive Boosting (AdaBoost) is an algorithm that can be used to improve the performance of other machine learning algorithms. The basic idea behind AdaBoost is to assign weights both to classifiers (models that categorize data into one or more “classes”) and to data points (samples), so that the classifiers are forced to concentrate on the observations that are hardest to classify correctly. The weights are adjusted at every iteration, which makes AdaBoost a sequential process. As mentioned earlier, AdaBoost combines several models to improve the final predictive performance; such a combination is known as an ensemble. AdaBoost is therefore referred to as a sequential ensemble method.

The concept behind AdaBoost is relatively simple and easy to follow. However, many explanations of the topic are filled with mathematical notation and concepts that are not easy to understand. Here is the pseudocode for AdaBoost from its developers, Yoav Freund and Robert Schapire:

Figure 1: AdaBoost pseudocode

As you can see, anyone without a deep understanding of machine learning, data science, or math could find this pseudocode quite intimidating. There are many notations and terms that the reader is assumed to know. This article will clear away the confusion and explain how AdaBoost works by breaking down each notation, formula, and term.

Knowing the Terminology

Here are some of the basic terms that you should know.

Classifier: An algorithm that uses the features of a data point to predict which class it belongs to.

Weak learners: These are classifiers that perform only slightly better than random chance.

Boosting: Combining many weak learners to create a single, highly accurate classifier.

Hypothesis function: The function that a machine learning algorithm produces to map inputs to predictions.

AdaBoost Step-By-Step

Now, let’s go over the pseudocode provided by Freund and Schapire. I will break down every notation at each step.

1) Given (x_1, y_1),…,(x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}

Notations:

(x_1, y_1): The first training sample.

(x_m, y_m): The m-th training sample.

∈: means “element of”

{-1, +1}: a set including -1 and 1

x_i ∈ X: the i-th input x_i belongs to the input space X

y_i ∈ {-1, +1}: the i-th label y_i is either -1 or +1

Now that we have broken down the notation, this first part can be simplified to: we are given a training set of m samples, where each input x_i belongs to the overall input space X and each output y_i is either -1 (the negative class) or +1 (the positive class).
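To make this concrete, here is a minimal sketch in Python/NumPy of what such a training set might look like (the values are made up purely for illustration):

```python
import numpy as np

# A toy training set with m = 6 samples.
# Each x_i is a feature vector in X; each label y_i is -1 or +1.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, +1, +1, -1, +1])
m = len(y)  # total number of training samples
```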

2) Initialize: D_1(i) = 1/m for i = 1, … , m.

Notations:

D_1 = the first distribution (denoted by the _1) of weights of all the samples

D_1(i) = the weight of the i-th training sample

Simplified: Initialize all weights of your samples to 1/m, where m is the total number of samples.
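Continuing the NumPy sketch from above, initializing the first distribution of weights is a one-liner:

```python
# D_1: every sample starts with the same weight, 1/m,
# so the initial distribution is uniform over the training set.
D = np.full(m, 1.0 / m)
```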

3) For t=1, … , T:

This denotes that this part of the code is in a loop, which essentially means that t will increase by 1 in each iteration until t = T. Each time t increases by 1, the rest of the code that is in the loop runs. Unless said otherwise, every step I explain below will be part of the loop.

4) The next three bullet points:

  • Train weak learner using distribution D_t
  • Get weak hypothesis h_t : X -> {-1, +1}
  • Aim: select h_t with low weighted error: ε = Pr_{i~D_t}[h_t(x_i) ≠ y_i]

Notations:

D_t = the t-th distribution of weights of all samples

h_t = hypothesis/classifier

ε = the weighted misclassification error of the weak hypothesis h_t under the distribution D_t

Here, you fit the t-th hypothesis to the training data (weighted by D_t) so that each prediction is either -1 or +1. You then calculate the error of that hypothesis as the probability, under D_t, that it predicts incorrectly.
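As a rough illustration (not a reference implementation), here is how this step could look in Python, using a one-level decision tree, or “stump”, as the weak learner. Scikit-learn’s fit method accepts the sample weights D_t through its sample_weight argument:

```python
from sklearn.tree import DecisionTreeClassifier

# Train a weak learner (a decision stump) on the current distribution D_t.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=D)

# h_t(x_i): each prediction is either -1 or +1.
pred = stump.predict(X)

# Weighted error: the total weight of the misclassified samples.
eps = np.sum(D[pred != y])
```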

5) Next two bullet points

  • Choose α_t = 1/2 * ln((1 - ε) / ε)
  • Update, for i = 1, … , m: D_{t+1}(i) = D_t(i) * exp(-α_t * y_i * h_t(x_i)) / Z_t

Notations:

α = weight for the classifier

exp = the exponential function, i.e. Euler’s constant e ≈ 2.71828 raised to the given power

Z_t = normalization factor, used to ensure that weights represent a true distribution

After calculating the error in the last step, we can calculate the weight for the classifier, α, using the formula ½ * ln((1 - error) / error). From this formula, we can see that the closer the error is to 0.5 (i.e. a classifier that is barely better than random guessing), the closer α is to zero, so that classifier counts for little in the final prediction; the smaller the error, the larger α becomes. α also drives the update of the next iteration’s distribution of weights. In the update formula, we multiply -α_t by the product of the predicted and actual labels (remember, they are either -1 or +1) and exponentiate. For misclassified samples that product is -1, so their weights are multiplied by e^(α_t) and grow; correctly classified samples are multiplied by e^(-α_t) and their weights shrink. An increased weight makes the next weak learner pay more attention to that specific observation, while a decreased weight means it receives less attention in the next iteration. We then run the loop again with t increased by one, and continue until a low enough training error is achieved or a predetermined number of weak learners has been trained.
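Continuing the sketch, the classifier weight and the new sample weights can be computed directly from these formulas (assuming eps is strictly between 0 and 1 so the logarithm is defined):

```python
# Classifier weight: small error -> large alpha; error near 0.5 -> alpha near 0.
alpha = 0.5 * np.log((1 - eps) / eps)

# Update the sample weights: misclassified points (where y_i * h_t(x_i) = -1)
# are multiplied by e^alpha and grow, while correct points shrink by e^(-alpha).
D = D * np.exp(-alpha * y * pred)

# Z_t: normalize so the new weights sum to 1 and form a valid distribution.
D = D / np.sum(D)
```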

6) Final Step: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))

Notations:

sign: This is known as the signum function. If the value passed in is greater than 0, it returns 1. If the value is less than 0, it returns -1. If the value equals 0, it returns 0.

Σ: The summation symbol; here, we sum over all T rounds of the loop (t = 1, … , T).

Note that we are finally outside of the loop. This is our final hypothesis/classifier. We add up the weighted predictions of every classifier from our loop and run the total through the signum function.
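If you collected every stump and its α in two lists while looping, the final hypothesis could be sketched like this (stumps and alphas are just the hypothetical names from this walkthrough, not part of any library):

```python
def adaboost_predict(X, stumps, alphas):
    # Weighted vote of all T weak learners, then take the sign.
    weighted_sum = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(weighted_sum)
```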

Summary

We have now seen how AdaBoost works by breaking down each notation and term in the original pseudocode. By explaining each step at a granular level, you can see how each component of the formula plays its part. Now you will be able to apply this algorithm to your own machine learning problems! If you want to try it out in practice, a short scikit-learn example is sketched below.
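In practice you rarely need to write the boosting loop yourself; scikit-learn, for example, ships an AdaBoostClassifier. A minimal usage example, with synthetic data purely for demonstration, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, just for demonstration.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is T, the number of weak learners (decision stumps by default).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```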
