### Why Use Ensembling Methods?

Ensembling methods combine multiple machine learning models to improve overall performance. The key idea is that by aggregating the predictions of several models, the ensemble can achieve better accuracy, robustness, and generalization than any single model could on its own.

### The Need for Ensembling

#### The Bias-Variance Tradeoff

- **Bias** refers to the error introduced by approximating a real-world problem, which may be complex, with a simplified model. High bias can cause the model to miss relevant relations between features and target outputs (underfitting).
- **Variance** refers to the error introduced by the model's sensitivity to the training data. High variance can cause the model to fit the random noise in the training data, resulting in poor generalization to new data (overfitting).

Individual models often struggle to balance this tradeoff effectively. Simple models tend to have high bias and low variance, while complex models tend to have low bias and high variance. Ensembling methods help strike a balance by leveraging multiple models.

#### Limitations of Base Estimators

1. **High Variance (Overfitting):**
   - Complex models like decision trees or neural networks can capture intricate patterns in the training data, but they may also capture noise, leading to poor performance on unseen data.
2. **High Bias (Underfitting):**
   - Simple models like linear regression or logistic regression may not capture the underlying complexity of the data, leading to systematic errors and poor performance.
3. **Model Instability:**
   - Some models, such as decision trees, are sensitive to changes in the training data. Small changes can result in a completely different tree being generated.
4. **Single Perspective:**
   - Different models have different strengths and weaknesses. Relying on a single model limits you to one perspective on the data.

Ensembling methods address these limitations by combining the strengths of multiple models, thereby reducing the impact of their individual weaknesses.

### Types of Ensembling Methods

1. **Bagging (Bootstrap Aggregating):**
   - Reduces variance by training multiple models on different subsets of the data and averaging their predictions. Random Forest is a popular example of a bagging method.
2. **Boosting:**
   - Reduces bias and variance by sequentially training models to correct the errors of their predecessors. AdaBoost and Gradient Boosting are common examples of boosting methods.
3. **Stacking:**
   - Combines multiple base models with a meta-model to improve predictive performance. Base models generate predictions that are used as inputs for the meta-model, which makes the final prediction.

---

## Bagging

Bagging, or Bootstrap Aggregating, aims to reduce the variance of a model by training multiple models on different subsets of the training data and averaging their predictions.

### Algorithm

1. **Bootstrap Sampling**: Generate multiple subsets of the training data by randomly sampling with replacement.
2. **Training**: Train a base model on each bootstrap sample.
3. **Aggregation**: Combine the predictions of all models by averaging (for regression) or majority voting (for classification).

A minimal from-scratch sketch of these three steps is shown below, just before the library implementation of Random Forest.

### Random Forest

Random Forest is an extension of bagging that uses decision trees as base models and introduces additional randomness by considering only a random subset of features at each split.
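#### From-Scratch Bagging Sketch

To make the three bagging steps concrete before turning to the library version, here is a minimal sketch that bags decision trees by hand: bootstrap the training set, fit one tree per sample, and take a majority vote. Variable names such as `trees` and `majority_vote` are illustrative; this is a simplified sketch of the idea, not a substitute for `RandomForestClassifier`, which additionally subsamples features at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same iris split that the library example below uses
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_estimators = 25
trees = []

# 1. Bootstrap sampling + 2. Training: fit one tree per bootstrap sample
for _ in range(n_estimators):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sample indices with replacement
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# 3. Aggregation: majority vote across the trees' predictions (shape: n_estimators x n_test)
all_preds = np.stack([tree.predict(X_test) for tree in trees])
majority_vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)

print("Hand-rolled bagging accuracy:", (majority_vote == y_test).mean())
```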
#### Python Implementation

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Results
print("Random Forest Accuracy:", model.score(X_test, y_test))
```

#### Pros and Cons

- **Pros**:
  - Reduces overfitting by averaging multiple models.
  - Handles high-dimensional data well.
  - Robust to noise and outliers.
- **Cons**:
  - Computationally expensive due to training multiple models.
  - May not perform well with small datasets.

---

## Boosting

Boosting is an ensemble method that combines multiple weak learners to create a strong learner by training each model to correct the errors of its predecessor.

### Adaptive Boosting (AdaBoost)

#### Algorithm

1. Initialize weights for all training samples.
2. Train a weak learner on the weighted dataset.
3. Increase the weights of incorrectly predicted samples so the next weak learner focuses on them.
4. Repeat steps 2–3 for each weak learner, then combine the weak learners with weighted voting.

#### Python Implementation

```python
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost (reuses the iris train/test split from the bagging example)
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Results
print("AdaBoost Accuracy:", model.score(X_test, y_test))
```

#### Pros and Cons

- **Pros**:
  - Reduces both bias and variance.
  - Works well with various types of data.
- **Cons**:
  - Sensitive to noisy data and outliers.
  - Can be slow to train due to sequential learning.

---

### Gradient Boosting

#### Algorithm

1. Initialize with a simple model (e.g., the mean value for regression).
2. Iteratively add models that fit the residual errors of the previous models.
3. Combine the models by summing their predictions.

#### Python Implementation

```python
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting (reuses the iris train/test split from the bagging example)
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Results
print("Gradient Boosting Accuracy:", model.score(X_test, y_test))
```

#### Pros and Cons

- **Pros**:
  - Highly accurate due to iterative correction of errors.
  - Handles various types of data well.
- **Cons**:
  - Computationally expensive and can be slow to train.
  - Can overfit if not properly regularized.

---

## Stacking

Stacking is an ensemble method that combines multiple base models (level-0 models) with a meta-model (level-1 model) to improve predictive performance.

### Training Base Estimators

Train multiple base models on the training data.

### Preventing Overfitting with K-Fold Cross Validation

Use K-Fold Cross Validation to train the base models and generate out-of-fold predictions for the meta-model. Training the meta-model on predictions the base models made for data they were fit on would leak information and encourage overfitting.

### Feature Augmentation

Optionally, combine the original features with the predictions of the base models to create an augmented feature set for the meta-model.

### Training Process

1. **Train Base Models**: Train multiple base models on the training data.
2. **Generate Meta-Features**: Use K-Fold Cross Validation to generate out-of-fold predictions for the training data.
3. **Train Meta-Model**: Train the meta-model on the out-of-fold predictions (meta-features).
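#### Using scikit-learn's StackingClassifier

Scikit-learn provides a built-in `StackingClassifier` that performs the cross-validated meta-feature generation described above. The sketch below reuses the iris split from the bagging section; the particular base estimators and the use of `passthrough=True` (which appends the original features to the meta-features, i.e., the feature augmentation described above) are illustrative choices, not requirements. The subsection that follows builds the same pipeline by hand with `KFold` to show what happens under the hood.

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Level-0 base models; their out-of-fold predictions become meta-features via internal K-Fold CV
base_models = [
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
]

# Level-1 meta-model trained on the meta-features (plus original features via passthrough)
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    passthrough=True,
)

stack.fit(X_train, y_train)
print("Stacking Accuracy:", stack.score(X_test, y_test))
```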
#### Python Implementation

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.base import clone
import numpy as np

# Toy sample data (for illustration only)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 1, 0, 1, 0])

# Base models (level-0) and meta-model (level-1)
base_models = [DecisionTreeClassifier(), SVC(probability=True), RandomForestClassifier()]
meta_model = LogisticRegression()

# K-Fold Cross Validation: each base model's out-of-fold predicted
# probabilities become one column of the meta-feature matrix
kf = KFold(n_splits=5)
meta_features = np.zeros((X.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    for train_idx, val_idx in kf.split(X):
        instance = clone(model)
        instance.fit(X[train_idx], y[train_idx])
        meta_features[val_idx, i] = instance.predict_proba(X[val_idx])[:, 1]

# Train meta-model on the out-of-fold predictions
meta_model.fit(meta_features, y)
```

#### Pros and Cons

- **Pros**:
  - Can leverage the strengths of multiple models.
  - Often results in better performance than individual models.
- **Cons**:
  - Complex and computationally expensive.
  - Can be prone to overfitting if not properly validated.

---

## Conclusion

### Summary of Ensembling Methods

- **Bagging**: Reduces variance by averaging multiple models trained on bootstrap samples (e.g., Random Forest).
- **Boosting**: Reduces bias and variance by sequentially correcting the errors of previous models (e.g., AdaBoost, Gradient Boosting).
- **Stacking**: Combines multiple base models with a meta-model to improve predictive performance.

Continue: [[10-ML Workflows]]