Ensemble Learning Techniques
What are ensemble methods?
Ensemble learning is a technique in which we combine several base models to get better performance than any one of them achieves alone. Ensemble methods usually produce more accurate solutions than a single model would. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or more robust models. This is why these techniques are so widely used in Kaggle competitions, where people try to squeeze out the best possible model.
Weak Learner/Base model
In ML, whether we are doing classification or regression, choosing the model that best suits our application is a crucial step. The choice depends on several aspects, such as the number of features, the quantity of data, the presence of outliers, etc.
While building a machine learning model, we face problems like high bias and/or high variance. High bias is also known as underfitting: the model is too simple (over-generalized), so it performs poorly on the training data as well as the testing data. High variance is known as overfitting: the model also captures noise during training, so it performs well on the training data but poorly on other data, even data drawn from the same distribution as the training data. We want our models to have low bias and low variance.
A low bias and a low variance, although they most often vary in opposite directions, are the two most fundamental features expected of a model. Indeed, to be able to “solve” a problem, we want our model to have enough degrees of freedom to resolve the underlying complexity of the data we are working with, but not so many that its variance becomes high and it stops being robust. This is the well-known bias-variance tradeoff.
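For squared-error loss, the tradeoff can be written down explicitly. The symbols here are assumptions introduced for illustration: the data are generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training set. The expected error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
+ \sigma^2
$$

The noise term $\sigma^2$ cannot be reduced by any model, so only the bias and variance terms are under our control.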
Combining Weak Learners
When combining several weak learners, we do not have to use homogeneous weak learners (all of the same type). We can combine different kinds of weak learners as well.
But when choosing weak learners, we need to keep the way we aggregate them consistent with their weaknesses. If our base models have high bias and low variance, they should be aggregated with a method that reduces bias. Similarly, if our base models have high variance and low bias, they should be combined with a method that reduces variance.
Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting) or improve predictions (stacking).
Bagging stands for bootstrap aggregation. It is a parallel method in which the weak learners are fitted independently of each other. It aims to produce an ensemble model which is more robust than the individual weak learners.
First of all, what does bootstrap mean? It is a technique in which we draw M training examples from the original training set of size N with replacement, so the same example can appear more than once in a sample.
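As a minimal sketch of drawing a bootstrap sample, here is a version using only Python's standard library. The function name `bootstrap_sample` and the toy data set are made up for illustration.

```python
import random

def bootstrap_sample(data, m=None, rng=None):
    """Draw m items from data uniformly at random, with replacement."""
    rng = rng or random.Random()
    m = len(data) if m is None else m
    return rng.choices(data, k=m)

rng = random.Random(0)               # fixed seed so the sketch is repeatable
original = list(range(10))           # toy training set of size N = 10
sample = bootstrap_sample(original, m=10, rng=rng)
print(sample)                        # typically some values repeat and others are missing
```

Because we sample with replacement, M does not have to equal N, although M = N is the usual choice.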
Bootstrap samples are often used, for example, to evaluate the variance or confidence intervals of statistical estimators. By definition, a statistical estimator is a function of some observations and, so, a random variable with variance coming from these observations. In order to estimate the variance of such an estimator, we need to evaluate it on several independent samples drawn from the distribution of interest. In most cases, considering truly independent samples would require too much data compared to the amount really available. We can then use bootstrapping to generate several bootstrap samples that can be considered as being “almost-representative” and “almost-independent” (almost i.i.d. samples). These bootstrap samples allow us to approximate the variance of the estimator by evaluating its value on each of them.
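To make this concrete, here is a small sketch that approximates the standard error of the sample mean by bootstrapping and compares it with the textbook formula. The helper `bootstrap_std_error` and the synthetic data (normal draws with mean 5 and standard deviation 2) are assumptions chosen for the example.

```python
import random
import statistics

def bootstrap_std_error(data, estimator, n_resamples=1000, rng=None):
    """Approximate the standard error of `estimator` from bootstrap resamples."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))   # same size, with replacement
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]    # made-up observations

se_boot = bootstrap_std_error(data, statistics.mean, n_resamples=1000, rng=rng)
se_formula = statistics.stdev(data) / len(data) ** 0.5   # known formula for the mean
print(f"bootstrap: {se_boot:.3f}  formula: {se_formula:.3f}")
```

The two numbers should be close. The advantage of the bootstrap is that the same function works for estimators that have no simple formula, for instance `statistics.median`.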
The idea of bagging is then simple: we want to fit several independent models and “average” their predictions in order to obtain a model with a lower variance. However, we can’t, in practice, fit fully independent models because it would require too much data. So, we rely on the good “approximate properties” of bootstrap samples (representativity and independence) to fit models that are almost independent.
First, we create multiple bootstrap samples so that each new bootstrap sample acts as another (almost) independent dataset drawn from the true distribution. Then, we fit a weak learner on each of these samples and finally aggregate them so that we kind of “average” their outputs and, so, obtain an ensemble model with less variance than its components. Roughly speaking, as the bootstrap samples are approximately independent and identically distributed (i.i.d.), so are the learned base models. Then, “averaging” the weak learners' outputs does not change the expected answer but reduces its variance (just like averaging i.i.d. random variables preserves the expected value but reduces the variance).
There are plenty of ways to aggregate weak learners. For a regression problem, we can take the mean of the outputs given by the weak learners, or their median, as the final output. For a classification problem, we can use voting: the final output is the class predicted by the majority of the base learners. This is also known as hard voting.
For example, suppose we are doing binary classification with 5 base learners. It is better to use an odd number of base models here, because then a tie cannot occur. If 3 of the 5 output Yes and 2 of them output No, the final output will be Yes.
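The procedure can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the base learner is a 1-nearest-neighbour regressor on 1-D inputs (picked because it has high variance), and the names `fit_1nn`, `bagging_fit` and `bagging_predict` as well as the noisy linear data are made up.

```python
import random
import statistics

def fit_1nn(xs, ys):
    """Base learner (assumed): 1-nearest-neighbour regression on 1-D inputs."""
    points = list(zip(xs, ys))
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict

def bagging_fit(xs, ys, fit_fn, n_models=25, rng=None):
    """Fit one base model per bootstrap sample of the training set."""
    rng = rng or random.Random()
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = rng.choices(range(n), k=n)            # bootstrap indices
        models.append(fit_fn([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate by averaging the base models' outputs."""
    return statistics.mean([m(x) for m in models])

rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]          # made-up noisy linear data
models = bagging_fit(xs, ys, fit_1nn, n_models=25, rng=rng)
print(bagging_predict(models, 2.5))
```

Each model in the loop depends only on its own bootstrap sample, which is why the fits could run in parallel.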
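These aggregation rules are short enough to write out directly. The sketch below uses only the standard library; `hard_vote` is a name made up here, and the regressor outputs are invented numbers.

```python
from collections import Counter
import statistics

def hard_vote(labels):
    """Return the label predicted by the largest number of base learners."""
    return Counter(labels).most_common(1)[0][0]

votes = ["Yes", "No", "Yes", "Yes", "No"]    # outputs of 5 base classifiers
print(hard_vote(votes))                       # Yes

outputs = [2.0, 2.5, 3.0, 2.5, 9.0]           # outputs of 5 base regressors (made up)
print(statistics.mean(outputs))               # 3.8
print(statistics.median(outputs))             # 2.5
```

Note how the single outlying prediction (9.0) pulls the mean up but leaves the median untouched, which is one reason to prefer the median when a few base learners may be far off.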
Boosting methods work in the same spirit as bagging methods: we build a family of models that are aggregated to obtain a strong learner that performs better. However, unlike bagging, which mainly aims at reducing variance, boosting is a technique that consists of fitting multiple weak learners sequentially in a very adaptive way: each model in the sequence is fitted giving more importance to the observations in the dataset that were badly handled by the previous models in the sequence. Intuitively, each new model focuses its efforts on the observations that have been the most difficult to fit so far, so that we obtain, at the end of the process, a strong learner with lower bias (even if boosting can also have the effect of reducing variance). Boosting, like bagging, can be used for regression as well as for classification problems.
Because boosting is mainly focused on reducing bias, the base models most often considered for it are models with low variance but high bias. For example, if we want to use trees as our base models, we will most of the time choose shallow decision trees, only a few levels deep. Another important reason that motivates the use of low variance but high bias models as weak learners for boosting is that these models are in general less computationally expensive to fit (few degrees of freedom when parametrised). Indeed, as the computations to fit the different models can’t be done in parallel (unlike bagging), it could become too expensive to sequentially fit several complex models.
Once the weak learners have been chosen, we still need to define how they will be sequentially fitted (what information from the previous models do we take into account when fitting the current model?) and how they will be aggregated (how do we aggregate the current model with the previous ones?). Two important boosting algorithms, AdaBoost and gradient boosting, differ precisely in how they answer these two questions.
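As one concrete answer to these two questions, here is a minimal sketch of discrete AdaBoost with decision stumps on 1-D data. It is an illustration under assumptions, not a production implementation: labels are assumed to be -1/+1, and the helper names and the toy data set are made up.

```python
import math

def stump_predict(x, thr, polarity):
    """A decision stump: one threshold, predicts polarity on one side, -polarity on the other."""
    return polarity if x <= thr else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=10):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform importance
    ensemble = []                                # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        err, thr, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # say of this stump in the final vote
        ensemble.append((alpha, thr, polarity))
        # Misclassified points get heavier, correctly classified ones lighter
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]          # toy labels no single stump can fit
ensemble = adaboost_fit(xs, ys, n_rounds=3)
correct = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys))
print(f"{correct} of {len(xs)} training points classified correctly")
```

On this toy set the best single stump gets 7 of the 10 points right, while the weighted vote of three stumps fits all of them. The reweighting step is the "sequential fitting" answer, and the `alpha`-weighted vote is the "aggregation" answer.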
In stacking, we mostly combine heterogeneous models, whereas in bagging and boosting we mostly combine homogeneous models. Here we need weak learners followed by another model, the meta-model, which combines the outputs of the base models.
For example, for a classification problem, we can choose as weak learners a KNN classifier, a logistic regression and an SVM, and decide to learn a neural network as a meta-model. Then, the neural network will take as inputs the outputs of our three weak learners and will learn to return final predictions based on them.
How does stacking work?
- We split the training data into K folds, just like in K-fold cross-validation.
- The weak learner is fitted on K-1 of the parts and makes predictions on the K-th part.
- Do this for each part of the training data, so that every training example gets a prediction from a model that never saw it.
- The base model is then fitted on the whole training data set to calculate its performance on the test set.
- Repeat these steps for all base models.
- The predictions of the base models are passed to the next level as input features, and the model at that level (the meta-model) makes the final prediction.
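The steps above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the two base learners (`fit_mean`, `fit_1nn`), the interleaved fold assignment, the synthetic data, and especially the meta-model, which is reduced to a single blending weight chosen by grid search instead of a real learner such as a neural network.

```python
import random

def fit_mean(xs, ys):
    """Base learner 1 (assumed): always predicts the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu

def fit_1nn(xs, ys):
    """Base learner 2 (assumed): 1-nearest-neighbour regression on 1-D inputs."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def out_of_fold_predictions(xs, ys, fit_fn, k=5):
    """For each fold, fit on the other k-1 folds and predict the held-out fold."""
    n = len(xs)
    oof = [None] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        model = fit_fn([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            oof[i] = model(xs[i])
    return oof

def fit_stacking(xs, ys, base_fits, k=5):
    # Level 1: out-of-fold predictions become the meta-model's training inputs
    oof = [out_of_fold_predictions(xs, ys, f, k) for f in base_fits]

    # Level 2 (toy meta-model): pick the blend weight w with the lowest squared error
    def sq_err(w):
        return sum((w * a + (1 - w) * b - y) ** 2
                   for a, b, y in zip(oof[0], oof[1], ys))
    w = min((i / 20 for i in range(21)), key=sq_err)

    # Refit each base learner on the whole training set for use at prediction time
    models = [f(xs, ys) for f in base_fits]
    return lambda x: w * models[0](x) + (1 - w) * models[1](x)

rng = random.Random(0)
xs = [i / 10 for i in range(60)]
ys = [0.5 * x + rng.gauss(0, 0.5) for x in xs]     # made-up noisy data
predict = fit_stacking(xs, ys, [fit_mean, fit_1nn], k=5)
print(predict(3.05))
```

The meta-model is trained on out-of-fold predictions on purpose: if the base learners predicted on data they were fitted on, their outputs would look better than they really are and the meta-model would learn to trust them too much.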
This was just a brief overview of ensemble learning and its techniques. I hope it helps you.
Thanks for reading!