Meta- and Ensemble Learning
Ensemble learning combines multiple algorithms to produce a single prediction. Meta-learning is a subset of ensemble learning in which learning algorithms are applied to metadata about machine learning experiments to further improve predictions. In practice the two terms are often used interchangeably.
Ensembles are commonly formed from fast algorithms like decision trees, but that is not always the case.
decision tree + decision tree + decision tree = random forest
Use algorithms and configurations that are as dissimilar as possible. Empirically, ensembles have been shown to yield better results when there is significant diversity among the models, for example in the choice of algorithm or in the innate randomness of the training procedure.
Model count has a huge effect on prediction accuracy. Theoretically, using the same number of models as there are class labels gives the highest accuracy in classification.
10 => 30 => 100 decision trees in a random forest is a good place to start
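A minimal sketch of that starting point, assuming scikit-learn is installed, on a synthetic dataset: train forests of 10, 30, and 100 trees and compare their test accuracy.

```python
# Sketch (assumes scikit-learn): compare random forests of increasing size.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_trees in (10, 30, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    print(n_trees, forest.score(X_test, y_test))
```

On real data the gains usually flatten out as the forest grows, which is why stepping 10 => 30 => 100 is a cheap way to find the knee of the curve.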
Bootstrap Aggregating (Bagging)
Predictions are averaged: each model in the ensemble votes for a prediction with equal weight.
Bagging reduces variance and helps to avoid overfitting. Averaging the answers of several different experts usually gives a better prediction, and with machine learning we can train 300 experts easily and get their answers instantaneously.
For example, random forest is an ensemble model, combining multiple decision trees with bagging.
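A hedged sketch of bagging itself, again assuming scikit-learn: `BaggingClassifier` trains each base model (a decision tree by default) on a bootstrap sample and lets them vote with equal weight.

```python
# Sketch (assumes scikit-learn): bagging decision trees on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Default base model is a decision tree; 30 trees vote with equal weight,
# each trained on its own bootstrap sample of the training data.
bagging = BaggingClassifier(n_estimators=30, random_state=1)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))
```

A random forest adds one more twist on top of this: each tree also considers only a random subset of features at every split.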
Boosting
Boosting focuses on mis-classified training instances: it incrementally builds an ensemble by training each new model to emphasize the data that previous models mis-classified.
Boosting can give better results than bagging, but is more prone to overfitting.
There are three common ways to merge original and boosted models:
- Some algorithms allow combining the weights of the models.
- Models are combined using a particular cost function, e.g. majority vote.
- Other algorithms force you to create cascade models, where low-confidence predictions get redirected to the boosted models for confirmation.
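As a sketch of the reweighting idea, here is AdaBoost via scikit-learn (assumed installed) on synthetic data; each new weak learner trains on a reweighted dataset that emphasizes the samples earlier learners got wrong, and the final prediction is a weighted vote.

```python
# Sketch (assumes scikit-learn): AdaBoost on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Each successive weak learner focuses on previously mis-classified samples;
# the ensemble merges them with a weighted vote.
boosted = AdaBoostClassifier(n_estimators=50, random_state=2)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```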
Stacking
Stacking focuses on creating a combiner model: a learning algorithm is trained to combine the predictions of several other learning algorithms.
- All base models are trained with available data.
- The combiner model takes the original input plus the final outputs of all underlying models.
Logistic regression is often used as the combiner.
Stacking has empirically been shown to give the best results, but it costs more computational power and memory.
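A minimal stacking sketch, assuming scikit-learn: two base models feed a logistic regression combiner, and `passthrough=True` gives the combiner the original features in addition to the base predictions.

```python
# Sketch (assumes scikit-learn): stacking with a logistic regression combiner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

stack = StackingClassifier(
    estimators=[("forest", RandomForestClassifier(random_state=3)),
                ("svm", SVC(random_state=3))],
    final_estimator=LogisticRegression(),  # the combiner model
    passthrough=True,  # combiner sees original input plus base-model outputs
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```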
Bucket of Models
Bucket of Models introduces a model selection algorithm that chooses the model to be used for each problem.
Model selection approaches:
- Cross-validation Selection aka. Bake-off Contest: train all models with the training samples and use the one that got the highest accuracy.
- Gating: train another machine learning model to decide which of the underlying models to use.
- Landmark Learning: run all the fast algorithms in the bucket, then use their results to help determine which slow algorithms will most likely do best.
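The bake-off variant can be sketched in a few lines, assuming scikit-learn: cross-validate every model in the bucket and keep the one with the highest mean score.

```python
# Sketch (assumes scikit-learn): cross-validation selection ("bake-off").
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=4)

# The "bucket": any set of candidate models.
bucket = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=4),
    "forest": RandomForestClassifier(random_state=4),
}

# Cross-validate each candidate and keep the winner.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in bucket.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```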
Cascading
Also known as Cascading Classifiers when applied to classification problems.
A cascade is a sequence of several models, where all information collected from the output of a given model is used as additional information for the next model in the cascade.
There can also be confidence thresholds to control the cascading.
For example, a first model tells whether an ad is clearly malicious or not. If it is clearly not malicious, return the response and continue to the next sample. If the result is unclear, or the ad is clearly malicious, feed the features to all malicious-detection models to see which categories it falls into.
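A confidence-thresholded cascade can be sketched as below, assuming scikit-learn; the threshold value and the two-stage setup are illustrative, not a fixed recipe. A cheap model screens every sample, and only low-confidence cases escalate to the heavier model.

```python
# Sketch (assumes scikit-learn): a two-stage cascade with a confidence threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

# Stage 1: a cheap model screens every sample.
fast = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Stage 2: a heavier model only sees the low-confidence cases.
slow = RandomForestClassifier(random_state=5).fit(X_train, y_train)

CONFIDENCE = 0.9  # hypothetical threshold controlling the cascade

def cascade_predict(x):
    proba = fast.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= CONFIDENCE:             # stage 1 is confident: stop here
        return proba.argmax()
    return slow.predict(x.reshape(1, -1))[0]  # otherwise escalate to stage 2

preds = [cascade_predict(x) for x in X_test]
print(sum(p == t for p, t in zip(preds, y_test)) / len(y_test))
```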
Bayesian Model Averaging (BMA)
Also called Bayesian Parameter Averaging (BPA), Bayesian Model Selection (BMS) or Bayesian Adaptive Sampling (BAS).
BMA approximates the Bayes Optimal Classifier (BOC) by sampling hypotheses and combining them using Bayes' law, which makes it practically implementable, unlike the BOC itself.
This approach frequently has worse performance than other ensemble methods and requires further research.
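A heavily simplified sketch of the weighting idea, assuming scikit-learn and numpy: true BMA weights each model by its marginal likelihood given the data, which is crudely proxied here by normalized validation accuracy, so treat this only as an illustration of posterior-weighted averaging.

```python
# Sketch (assumes scikit-learn, numpy): posterior-style weighted averaging.
# NOTE: real BMA uses marginal likelihoods, not validation accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=6)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=6)

models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=6)]
for m in models:
    m.fit(X_train, y_train)

# Approximate "posterior" weight for each model, normalized to sum to 1.
weights = np.array([m.score(X_val, y_val) for m in models])
weights /= weights.sum()

# Weighted average of each model's predicted class probabilities.
avg = sum(w * m.predict_proba(X_val) for w, m in zip(weights, models))
print((avg.argmax(axis=1) == y_val).mean())
```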
Bayesian Model Combination (BMC)
BMC is an extension to Bayesian Model Averaging. Instead of sampling each model individually, it samples from the space of possible ensembles.
Usually works better than BMA and bagging, but is more computationally expensive.