## Why Regularize?
Regularization is a technique used to prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, which helps improve its generalization to unseen data.
## What is Overfitting?
Overfitting occurs when a model learns not only the underlying pattern but also the noise in the training data. This results in high accuracy on training data but poor performance on validation and test data.
### Detection
Overfitting can be detected by comparing the model's performance on training and validation datasets. If the training performance is significantly better than the validation performance, overfitting is likely occurring.
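As a quick illustrative check (the synthetic data and the deliberately flexible degree-15 polynomial below are assumptions for demonstration), a large gap between training and validation scores signals overfitting:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Illustrative noisy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=50)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A deliberately flexible model: degree-15 polynomial regression
model = make_pipeline(PolynomialFeatures(15), LinearRegression())
model.fit(X_train, y_train)

# A large gap between the two scores suggests overfitting
print("Train R^2:     ", model.score(X_train, y_train))
print("Validation R^2:", model.score(X_val, y_val))
```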
### R-squared Score
The R-squared score ($R^2$) is a metric used to evaluate the goodness-of-fit of a regression model. It typically ranges from 0 to 1, with higher values indicating a better fit; it can be negative when a model performs worse than simply predicting the mean.
#### Formula
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
where:
- $y_i$ are the actual values.
- $\hat{y}_i$ are the predicted values.
- $\bar{y}$ is the mean of the actual values.
### Python Implementation
```python
from sklearn.metrics import r2_score
# Sample data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# R-squared score
r2 = r2_score(y_true, y_pred)
print("R-squared score:", r2)`
```
---
## The Loss Function
The loss function measures how well a model's predictions match the actual data. Common loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification.
### Mean Squared Error (MSE)
For a regression model, MSE is defined as:
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
### Python Implementation
```python
from sklearn.metrics import mean_squared_error
# Sample data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# Mean Squared Error
mse = mean_squared_error(y_true, y_pred)
print("Mean Squared Error:", mse)`
```
---
## The Regularization Term
The regularization term is added to the loss function to penalize model complexity. It discourages large coefficients, which can help prevent overfitting.
## L1 or Lasso Regularization
Lasso (Least Absolute Shrinkage and Selection Operator) regularization adds the sum of the absolute values of the coefficients, scaled by $\alpha$, to the loss function. It can shrink some coefficients exactly to zero, effectively performing feature selection.
### Formula
The Lasso loss function is:
$L = \text{MSE} + \alpha \sum_{j=1}^{p} |w_j|$
where $\alpha$ is the regularization parameter.
### Python Implementation
```python
from sklearn.linear_model import Lasso
# Sample data
X = [[1], [2], [3], [4], [5]]
y = [1, 3, 2, 3, 5]
# Lasso regression
model = Lasso(alpha=0.1)
model.fit(X, y)
y_pred = model.predict(X)
# Coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)`
```
---
## L2 or Ridge Regularization
Ridge regularization adds the sum of the squared coefficients, scaled by $\alpha$, to the loss function. It discourages large coefficients but, unlike Lasso, does not typically shrink them exactly to zero.
### Formula
The Ridge loss function is:
$L = \text{MSE} + \alpha \sum_{j=1}^{p} w_j^2$
where $\alpha$ is the regularization parameter.
### Python Implementation
```python
from sklearn.linear_model import Ridge
# Sample data
X = [[1], [2], [3], [4], [5]]
y = [1, 3, 2, 3, 5]
# Ridge regression
model = Ridge(alpha=0.1)
model.fit(X, y)
y_pred = model.predict(X)
# Coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)`
```
> [!info] Differences between L1 and L2 Regularization
| Aspect | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
| ---------------------------- | ---------------------------------------------------------- | --------------------------------------------------- |
| Model Complexity | Sparse models, can perform feature selection | Smooth models, does not perform feature selection |
| Interpretability | High (due to feature selection) | Lower (all features contribute) |
| Handling Irrelevant Features | Effective | Less effective |
| Handling Collinearity | Less effective | Effective |
| Numerical Stability | May have numerical stability issues | Improves numerical stability |
| When to use | High-dimensional data, feature selection, interpretability | Collinearity, small feature set, numerical stability |
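The feature-selection difference in the table is easy to see by fitting both penalties on the same data. The sketch below uses a synthetic dataset (an assumption for illustration) in which only a few features carry signal:
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso tends to zero out irrelevant coefficients; Ridge only shrinks them
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```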
---
## Hyperparameter Tuning
Hyperparameter tuning involves selecting the optimal values for the hyperparameters, such as the regularization parameter $\alpha$. This can be done using techniques like grid search or random search with cross-validation.
### Grid Search
The _grid search_ algorithm for hyperparameter tuning trains and evaluates the model on every combination of values from predetermined lists of hyperparameter values, then keeps the combination that makes the model perform best.
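As a minimal sketch, scikit-learn's `GridSearchCV` can tune the regularization parameter $\alpha$ over a predetermined list (the candidate values and synthetic data below are illustrative assumptions):
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Predetermined list of candidate alpha values
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# Try every value in the grid with 5-fold cross-validation
search = GridSearchCV(Lasso(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV score:", search.best_score_)
```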
### Random Search
The _random search_ algorithm works similarly, but instead of using a predetermined list of hyperparameter values, the values are randomly chosen. As with grid search, it selects the hyperparameter that performed the best.
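A comparable sketch with `RandomizedSearchCV`, sampling $\alpha$ from a log-uniform distribution instead of a fixed list (the distribution bounds and data are illustrative assumptions):
```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Sample alpha values at random from a log-uniform distribution
param_distributions = {"alpha": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(Ridge(), param_distributions, n_iter=20, cv=5,
                            scoring="neg_mean_squared_error", random_state=0)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
```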
### Bayesian Optimization
_Bayesian optimization_ is another approach to hyperparameter tuning. It uses ideas from the field of Bayesian statistics to iterate through different hyperparameter values. Each time the Bayesian optimization algorithm evaluates a new hyperparameter value, it gains more information about where it should look for the best hyperparameter value.
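A minimal sketch using the Optuna library, one of several tools for this kind of sequential, model-based search (the choice of library, search range, and data are assumptions here, not part of the original text):
```python
import optuna
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

def objective(trial):
    # Each trial proposes an alpha informed by previous evaluations
    alpha = trial.suggest_float("alpha", 1e-3, 1e2, log=True)
    return cross_val_score(Lasso(alpha=alpha), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print("Best alpha:", study.best_params["alpha"])
```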
### Genetic Algorithms
_Genetic algorithms_ are another possible hyperparameter tuning method. These work by going through several generations of hyperparameter values. Within each generation, the fittest (i.e., best-performing) hyperparameter values are slightly mutated (i.e., changed) in order to produce the next generation.
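A toy, hand-rolled sketch of the idea (not a production genetic-algorithm library): a small population of $\alpha$ values is scored by cross-validation, the fittest half is kept, and mutated copies form the next generation. All specifics below are illustrative assumptions.
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

def fitness(alpha):
    # Higher (less negative) mean CV error means a fitter candidate
    return cross_val_score(Lasso(alpha=alpha), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()

# Start with a random population of alpha values
population = rng.uniform(0.001, 10.0, size=8)

for generation in range(5):
    scores = np.array([fitness(a) for a in population])
    # Keep the fittest half as parents
    parents = population[np.argsort(scores)][-4:]
    # Slightly mutate the parents to produce the next generation
    children = np.abs(parents * rng.normal(1.0, 0.2, size=parents.shape))
    population = np.concatenate([parents, children])

best = max(population, key=fitness)
print("Best alpha found:", best)
```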
---
## Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning. It describes the tension between error from overly simplistic assumptions (bias) and error from excessive sensitivity to fluctuations in the training data (variance): making a model more flexible typically lowers bias but raises variance, and vice versa.
### High Bias
High bias occurs when a model is too simple and cannot capture the underlying patterns in the data. This results in underfitting.
### High Variance
High variance occurs when a model is too complex and captures the noise in the training data. This results in overfitting.
### Illustration
The ideal model achieves a balance between bias and variance, minimizing the total error.
### Python Implementation
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Sample data
np.random.seed(0)
X = np.sort(np.random.rand(30) * 10)
y = np.sin(X) + np.random.randn(30) * 0.5
# Models
degree = [1, 4, 15]
plt.figure(figsize=(18, 4))
for i in range(len(degree)):
    plt.subplot(1, len(degree), i + 1)
    plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
    # Fit a polynomial of the given degree to the same data
    model = make_pipeline(PolynomialFeatures(degree[i]), LinearRegression())
    model.fit(X[:, np.newaxis], y)
    y_pred = model.predict(X[:, np.newaxis])
    plt.plot(X, y_pred, color="cornflowerblue", label="model")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.title(f"Degree {degree[i]}")
    plt.legend()
plt.show()
```
---
## Conclusion
### Summary
- **Regularization:** Techniques to prevent overfitting by adding a penalty to the loss function.
- **Overfitting:** When a model learns the noise in the training data, resulting in poor performance on new data.
- **Loss Function:** Measures how well a model's predictions match the actual data.
- **L1 (Lasso) Regularization:** Adds the absolute values of the coefficients to the loss function, potentially shrinking some coefficients to zero.
- **L2 (Ridge) Regularization:** Adds the squared values of the coefficients to the loss function, discouraging large coefficients.
- **Hyperparameter Tuning:** Selecting the optimal values for hyperparameters to improve model performance.
- **Bias-Variance Tradeoff:** Balancing a model's ability to generalize to new data and accurately capture the underlying patterns in the training data.
Continue: [[09-Ensemble Methods]]