> [!info]
> This is a more advanced continuation of [[02-Feature Engineering]]
## Filter Methods
### Variance Threshold
Variance threshold is a simple baseline approach to feature selection: it removes features whose variance falls below a chosen threshold, on the assumption that low-variance features carry little information.
#### Formula
For a given feature $x$:
$Variance(x)= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
where $\bar{x}$ is the mean of the feature.
#### Python Implementation
```python
from sklearn.feature_selection import VarianceThreshold
# Sample data
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
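# Columns 0 and 3 are constant (zero variance); columns 1 and 2 have variances of about 0.22 and 2.89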
# Variance threshold
selector = VarianceThreshold(threshold=0.2)
X_selected = selector.fit_transform(X)
print("Selected features:\n", X_selected)`
```
---
### Pearson's Correlation
Pearson's correlation measures the linear relationship between two continuous variables.
#### Formula
For two variables $X$ and $Y$:
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
#### Python Implementation
```python
import numpy as np
from scipy.stats import pearsonr
# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])
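# Y is exactly 2 * X, so the correlation below comes out to 1.0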
# Pearson correlation
correlation, _ = pearsonr(X, Y)
print("Pearson's correlation:", correlation)`
```
---
### Point Biserial Correlation Coefficient
Point biserial correlation measures the relationship between a continuous variable and a binary variable.
#### Formula
For a continuous variable $X$ and a binary variable $Y$:
$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{\frac{n_1 n_0}{n^2}}$
where:
- $\bar{X}_1$ and $\bar{X}_0$ are the means of $X$ for $Y=1$ and $Y=0$.
- $s_X$ is the (population) standard deviation of $X$.
- $n_1$ and $n_0$ are the number of occurrences of $Y=1$ and $Y=0$.
- $n$ is the total number of observations.
#### Python Implementation
```python
import numpy as np
from scipy.stats import pointbiserialr
# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([0, 1, 0, 1, 0])
# Point biserial correlation
correlation, _ = pointbiserialr(X, Y)
print("Point Biserial Correlation:", correlation)`
```
---
### Mutual Information
Mutual information measures the dependency between two variables. It is more general than correlation as it can capture non-linear relationships.
#### Formula
For two discrete variables $X$ and $Y$:
$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right)$
#### Python Implementation
```python
from sklearn.feature_selection import mutual_info_classif
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# Mutual information
mi = mutual_info_classif(X, y)
print("Mutual Information:", mi)`
```
---
### F-statistic (using f_regression from scikit-learn)
The F-statistic measures the linear relationship between features and the target variable.
#### Formula
In general, the F-statistic compares explained to unexplained variation. For f_regression, the score for feature $X_i$ is derived from its Pearson correlation $r_i$ with the target $Y$:
$F_i = \frac{r_i^2}{1 - r_i^2} (n - 2)$
where $n$ is the number of samples. (For classification targets, f_classif instead uses the one-way ANOVA F-test, i.e. variance between groups over variance within groups.)
#### Python Implementation
```python
from sklearn.feature_selection import f_regression
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# F-statistic
f_values, p_values = f_regression(X, y)
print("F-statistic values:", f_values)`
```
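As a quick cross-check of the formula above (a sketch only; the continuous target values below are made up for illustration), the score that f_regression reports for a single feature can be reproduced from its Pearson correlation with the target:
```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression
# Illustrative data with a continuous target (values are arbitrary)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
# F-statistics as computed by scikit-learn
f_values, p_values = f_regression(X, y)
# Rebuild the statistic for the first feature from its correlation with y
r, _ = pearsonr(X[:, 0], y)
n = len(y)
f_manual = r**2 / (1 - r**2) * (n - 2)
print(f_values[0], f_manual)  # the two numbers should agree
```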
---
## Wrapper Methods
### Sequential Forward Selection (SFS)
Sequential Forward Selection starts with an empty set of features and adds one feature at a time that improves the model the most until no significant improvement is observed.
#### Python Implementation
```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# Sample data as NumPy arrays (mlxtend selects feature columns by index)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 1, 0, 1, 0])
# Model
model = LogisticRegression()
# SFS: add one feature at a time (cv=0 disables cross-validation on this tiny sample)
sfs = SFS(model, k_features=1, forward=True, floating=False, scoring='accuracy', cv=0)
sfs = sfs.fit(X, y)
print("Selected features:", sfs.k_feature_idx_)
```
---
### Sequential Backward Selection (SBS)
Sequential Backward Selection starts with all features and removes one feature at a time, at each step dropping the feature whose removal hurts model performance the least, until removing more features would cause a significant drop in performance.
#### Python Implementation
```python
# SBS
sbs = SFS(model, k_features=1, forward=False, floating=False, scoring='accuracy', cv=0)
sbs = sbs.fit(X, y)
print("Selected features:", sbs.k_feature_idx_)`
```
---
### Sequential Forward Floating Selection (SFFS)
Sequential Forward Floating Selection is similar to SFS but allows for features to be added and then removed if they no longer contribute significantly.
#### Python Implementation
```python
# SFFS
sffs = SFS(model, k_features=1, forward=True, floating=True, scoring='accuracy', cv=0)
sffs = sffs.fit(X, y)
print("Selected features:", sffs.k_feature_idx_)`
```
---
### Sequential Backward Floating Selection (SBFS)
Sequential Backward Floating Selection is similar to SBS but allows for features to be removed and then added back if they become significant again.
#### Python Implementation
```python
# SBFS
sbfs = SFS(model, k_features=1, forward=False, floating=True, scoring='accuracy', cv=0)
sbfs = sbfs.fit(X, y)
print("Selected features:", sbfs.k_feature_idx_)`
```
---
### Recursive Feature Elimination (RFE)
Recursive Feature Elimination recursively removes the least important features and builds the model until the specified number of features is reached.
#### Python Implementation
```python
from sklearn.feature_selection import RFE
# Model
model = LogisticRegression()
# RFE
rfe = RFE(model, n_features_to_select=1)
rfe = rfe.fit(X, y)
print("Selected features:", rfe.support_)`
```
---
## Feature Importance
### Tree-based Feature Importance
Tree-based models can compute feature importance based on the reduction in impurity (e.g., Gini impurity) brought by each feature.
### Gini Impurity
#### Formula
Gini impurity for a node with classes $c_1, c_2, \ldots, c_k$:
$G = 1 - \sum_{i=1}^{k} p_i^2$
where $p_i$ is the probability of class $c_i$.
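For example, with two classes an evenly split node is maximally impure, while a pure node has $G = 0$:
$G_{\text{even split}} = 1 - (0.5^2 + 0.5^2) = 0.5$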
#### Python Implementation
```python
from sklearn.ensemble import RandomForestClassifier
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# Random forest model
model = RandomForestClassifier()
model.fit(X, y)
# Feature importance
importances = model.feature_importances_
print("Feature importances:", importances)`
```
---
### Aggregate Methods
Aggregate methods combine the feature importances from multiple models to provide a more robust ranking.
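A minimal sketch of one aggregation scheme, assuming we simply average the built-in importance scores of two different tree ensembles (the model choices and the plain mean are illustrative, not a prescribed recipe):
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 1, 0, 1, 0])
# Fit two different tree ensembles and collect their importance scores
models = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]
all_importances = []
for m in models:
    m.fit(X, y)
    all_importances.append(m.feature_importances_)  # scikit-learn normalizes these to sum to 1
# Aggregate by averaging across models and rank features by the mean score
mean_importance = np.mean(all_importances, axis=0)
print("Aggregated feature importances:", mean_importance)
```
Averaging raw scores is only one option; averaging ranks, or weighting each model by its validation performance, are common variants.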
---
### Permutation-based Methods
Permutation-based methods evaluate the importance of a feature by measuring the change in model performance after shuffling the feature values.
#### Python Implementation
```python
from sklearn.inspection import permutation_importance
# Permutation importance
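# ('model' here is the random forest fitted in the tree-based example above, on the same X and y)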
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print("Permutation importances:", result.importances_mean)`
```
---
### Coefficients
Linear models can use the magnitude of coefficients as a measure of feature importance.
#### Python Implementation
```python
from sklearn.linear_model import LogisticRegression
# Logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Coefficients
coefficients = model.coef_
print("Coefficients:", coefficients)
```
## Conclusion
### Summary of Feature Selection Methods
- **Filter Methods** rank features by statistical properties, independent of any model. Examples include Variance Threshold, Pearson's Correlation, Point Biserial Correlation, Mutual Information, and the F-statistic.
- **Wrapper Methods** select features based on the performance of a model trained on candidate subsets. Examples include Sequential Forward Selection, Sequential Backward Selection, Sequential Forward Floating Selection, Sequential Backward Floating Selection, and Recursive Feature Elimination.
- **Feature Importance** methods rely on model-specific importance scores. Examples include tree-based feature importance, aggregate methods, permutation-based methods, and linear-model coefficients.
Continue: [[04-Supervised Learning Regressors and Classifiers]]