## Why Regularize?
Regularization is a technique used to prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, which helps improve its generalization to unseen data.
## What is Overfitting?
Overfitting occurs when a model learns not only the underlying pattern but also the noise in the training data. This results in high accuracy on training data but poor performance on validation and test data.
### Detection
Overfitting can be detected by comparing the model's performance on training and validation datasets. If the training performance is significantly better than the validation performance, overfitting is likely occurring.
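As a quick illustrative check (the synthetic data and the deliberately flexible degree-15 polynomial below are assumptions for demonstration), a large gap between training and validation scores signals overfitting:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Illustrative noisy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=50)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A deliberately flexible model: degree-15 polynomial regression
model = make_pipeline(PolynomialFeatures(15), LinearRegression())
model.fit(X_train, y_train)

# A large gap between the two scores suggests overfitting
print("Train R^2:     ", model.score(X_train, y_train))
print("Validation R^2:", model.score(X_val, y_val))
```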
### R-squared Score
The R-squared score ($R^2$) is a metric used to evaluate the goodness-of-fit of a regression model. It typically ranges from 0 to 1, with higher values indicating a better fit; it can be negative when a model performs worse than simply predicting the mean.
#### Formula
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
where:
- $y_i$ are the actual values.
- $\hat{y}_i$ are the predicted values.
- $\bar{y}$ is the mean of the actual values.
### Python Implementation
```python
from sklearn.metrics import r2_score
# Sample data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# R-squared score
r2 = r2_score(y_true, y_pred)
print("R-squared score:", r2)`
```
---
## The Loss Function
The loss function measures how well a model's predictions match the actual data. Common loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification.
### Mean Squared Error (MSE)
For a regression model, MSE is defined as:
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
### Python Implementation
```python
from sklearn.metrics import mean_squared_error
# Sample data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# Mean Squared Error
mse = mean_squared_error(y_true, y_pred)
print("Mean Squared Error:", mse)`
```
---
## The Regularization Term
The regularization term is added to the loss function to penalize model complexity. It discourages large coefficients, which can help prevent overfitting.
## L1 or Lasso Regularization
Lasso (Least Absolute Shrinkage and Selection Operator) regularization adds the sum of the absolute values of the coefficients, scaled by $\alpha$, to the loss function. It can shrink some coefficients exactly to zero, effectively performing feature selection.
### Formula
The Lasso loss function is:
$L = \text{MSE} + \alpha \sum_{j=1}^{p} |w_j|$
where $\alpha$ is the regularization parameter.
### Python Implementation
```python
from sklearn.linear_model import Lasso
# Sample data
X = [[1], [2], [3], [4], [5]]
y = [1, 3, 2, 3, 5]
# Lasso regression
model = Lasso(alpha=0.1)
model.fit(X, y)
y_pred = model.predict(X)
# Coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)`
```
---
## L2 or Ridge Regularization
Ridge regularization adds the sum of the squared coefficients, scaled by $\alpha$, to the loss function. It discourages large coefficients but, unlike Lasso, does not typically shrink them exactly to zero.
### Formula
The Ridge loss function is:
$L = \text{MSE} + \alpha \sum_{j=1}^{p} w_j^2$
where $\alpha$ is the regularization parameter.
### Python Implementation
```python
from sklearn.linear_model import Ridge
# Sample data
X = [[1], [2], [3], [4], [5]]
y = [1, 3, 2, 3, 5]
# Ridge regression
model = Ridge(alpha=0.1)
model.fit(X, y)
y_pred = model.predict(X)
# Coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)`
```
> [!info] Differences between L1 and L2 Regularization
| Aspect | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
| ---------------------------- | ---------------------------------------------------------- | --------------------------------------------------- |
| Model Complexity | Sparse models, can perform feature selection | Smooth models, does not perform feature selection |
| Interpretability | High (due to feature selection) | Lower (all features contribute) |
| Handling Irrelevant Features | Effective | Less effective |
| Handling Collinearity | Less effective | Effective |
| Numerical Stability | May have numerical stability issues | Improves numerical stability |
| When to use | High-dimensional data, feature selection, interpretability | Collinearity, small feature set, numerical stability |
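The feature-selection difference in the table is easy to see by fitting both penalties on the same data. The sketch below uses a synthetic dataset (an assumption for illustration) in which only a few features carry signal:
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso tends to zero out irrelevant coefficients; Ridge only shrinks them
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```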
---
## Hyperparameter Tuning
Hyperparameter tuning involves selecting the optimal values for the hyperparameters, such as the regularization parameter $\alpha$. This can be done using techniques like grid search or random search with cross-validation.
### Grid Search
The _grid search_ algorithm for hyperparameter tuning trains and evaluates the model on every combination of values from predetermined lists of hyperparameter values, then keeps the combination that makes the model perform best.
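As a minimal sketch, scikit-learn's `GridSearchCV` can tune the regularization parameter $\alpha$ over a predetermined list (the candidate values and synthetic data below are illustrative assumptions):
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Predetermined list of candidate alpha values
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# Try every value in the grid with 5-fold cross-validation
search = GridSearchCV(Lasso(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV score:", search.best_score_)
```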
### Random Search
The _random search_ algorithm works similarly, but instead of using a predetermined list of hyperparameter values, the values are randomly chosen. As with grid search, it selects the hyperparameter that performed the best.
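A comparable sketch with `RandomizedSearchCV`, sampling $\alpha$ from a log-uniform distribution instead of a fixed list (the distribution bounds and data are illustrative assumptions):
```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Sample alpha values at random from a log-uniform distribution
param_distributions = {"alpha": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(Ridge(), param_distributions, n_iter=20, cv=5,
                            scoring="neg_mean_squared_error", random_state=0)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
```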
### Bayesian Optimization
_Bayesian optimization_ is another approach to hyperparameter tuning. It uses ideas from the field of Bayesian statistics to iterate through different hyperparameter values. Each time the Bayesian optimization algorithm evaluates a new hyperparameter value, it gains more information about where it should look for the best hyperparameter value.
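A minimal sketch using the Optuna library, one of several tools for this kind of sequential, model-based search (the choice of library, search range, and data are assumptions here, not part of the original text):
```python
import optuna
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

def objective(trial):
    # Each trial proposes an alpha informed by previous evaluations
    alpha = trial.suggest_float("alpha", 1e-3, 1e2, log=True)
    return cross_val_score(Lasso(alpha=alpha), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

print("Best alpha:", study.best_params["alpha"])
```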
### Genetic Algorithms
_Genetic algorithms_ are another possible hyperparameter tuning method. These work by going through several generations of hyperparameter values. Within each generation, the fittest (i.e., best-performing) hyperparameter values are slightly mutated (i.e., changed) in order to produce the next generation.
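A toy, hand-rolled sketch of the idea (not a production genetic-algorithm library): a small population of $\alpha$ values is scored by cross-validation, the fittest half is kept, and mutated copies form the next generation. All specifics below are illustrative assumptions.
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

def fitness(alpha):
    # Higher (less negative) mean CV error means a fitter candidate
    return cross_val_score(Lasso(alpha=alpha), X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()

# Start with a random population of alpha values
population = rng.uniform(0.001, 10.0, size=8)

for generation in range(5):
    scores = np.array([fitness(a) for a in population])
    # Keep the fittest half as parents
    parents = population[np.argsort(scores)][-4:]
    # Slightly mutate the parents to produce the next generation
    children = np.abs(parents * rng.normal(1.0, 0.2, size=parents.shape))
    population = np.concatenate([parents, children])

best = max(population, key=fitness)
print("Best alpha found:", best)
```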
---
## Bias-Variance Tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning. It describes the tension between error from overly simplistic assumptions (bias) and error from excessive sensitivity to fluctuations in the training data (variance): making a model more flexible typically lowers bias but raises variance, and vice versa.
### High Bias
High bias occurs when a model is too simple and cannot capture the underlying patterns in the data. This results in underfitting.
### High Variance
High variance occurs when a model is too complex and captures the noise in the training data. This results in overfitting.
### Illustration
The ideal model achieves a balance between bias and variance, minimizing the total error.
### Python Implementation
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Sample data
np.random.seed(0)
X = np.sort(np.random.rand(30) * 10)
y = np.sin(X) + np.random.randn(30) * 0.5
# Models
degree = [1, 4, 15]
plt.figure(figsize=(18, 4))
for i in range(len(degree)):
    plt.subplot(1, len(degree), i + 1)
    plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
    # Fit a polynomial of the given degree to the same data
    model = make_pipeline(PolynomialFeatures(degree[i]), LinearRegression())
    model.fit(X[:, np.newaxis], y)
    y_pred = model.predict(X[:, np.newaxis])
    plt.plot(X, y_pred, color="cornflowerblue", label="model")
    plt.xlabel("X")
    plt.ylabel("y")
    plt.title(f"Degree {degree[i]}")
    plt.legend()
plt.show()
```
---
## Conclusion
### Summary
- **Regularization:** Techniques to prevent overfitting by adding a penalty to the loss function.
- **Overfitting:** When a model learns the noise in the training data, resulting in poor performance on new data.
- **Loss Function:** Measures how well a model's predictions match the actual data.
- **L1 (Lasso) Regularization:** Adds the absolute values of the coefficients to the loss function, potentially shrinking some coefficients to zero.
- **L2 (Ridge) Regularization:** Adds the squared values of the coefficients to the loss function, discouraging large coefficients.
- **Hyperparameter Tuning:** Selecting the optimal values for hyperparameters to improve model performance.
- **Bias-Variance Tradeoff:** Balancing a model's ability to generalize to new data and accurately capture the underlying patterns in the training data.
Continue: [[09-Ensemble Methods]]