> [!info]
> This is a more advanced continuation of [[02-Feature Engineering]]
## Filter Methods
### Variance Threshold
Variance threshold is a simple baseline approach to feature selection: it removes features whose variance falls below a chosen threshold, on the assumption that low-variance features carry little information.
#### Formula
For a given feature $x$:
$Variance(x)= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
where $\bar{x}$ is the mean of the feature.
#### Python Implementation
```python
from sklearn.feature_selection import VarianceThreshold
# Sample data
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
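# Columns 0 and 3 are constant (zero variance); columns 1 and 2 have variances of about 0.22 and 2.89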
# Variance threshold
selector = VarianceThreshold(threshold=0.2)
X_selected = selector.fit_transform(X)
print("Selected features:\n", X_selected)`
```
---
### Pearson's Correlation
Pearson's correlation measures the linear relationship between two continuous variables.
#### Formula
For two variables $X$ and $Y$:
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
#### Python Implementation
```python
import numpy as np
from scipy.stats import pearsonr
# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])
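# Y is exactly 2 * X, so the correlation below comes out to 1.0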
# Pearson correlation
correlation, _ = pearsonr(X, Y)
print("Pearson's correlation:", correlation)`
```
---
### Point Biserial Correlation Coefficient
Point biserial correlation measures the relationship between a continuous variable and a binary variable.
#### Formula
For a continuous variable $X$ and a binary variable $Y$:
$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{\frac{n_1 n_0}{n^2}}$
where:
- $\bar{X}_1$ and $\bar{X}_0$ are the means of $X$ for $Y=1$ and $Y=0$.
- $s_X$ is the (population) standard deviation of $X$.
- $n_1$ and $n_0$ are the number of occurrences of $Y=1$ and $Y=0$.
- $n$ is the total number of observations.
#### Python Implementation
```python
import numpy as np
from scipy.stats import pointbiserialr
# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([0, 1, 0, 1, 0])
# Point biserial correlation
correlation, _ = pointbiserialr(X, Y)
print("Point Biserial Correlation:", correlation)`
```
---
### Mutual Information
Mutual information measures the dependency between two variables. It is more general than correlation as it can capture non-linear relationships.
#### Formula
For two discrete variables $X$ and $Y$:
$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right)$
#### Python Implementation
```python
from sklearn.feature_selection import mutual_info_classif
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# Mutual information
mi = mutual_info_classif(X, y)
print("Mutual Information:", mi)`
```
---
### F-statistic (using f_regression from scikit-learn)
The F-statistic measures the linear relationship between features and the target variable.
#### Formula
In general, the F-statistic compares explained to unexplained variation. For f_regression, the score for feature $X_i$ is derived from its Pearson correlation $r_i$ with the target $Y$:
$F_i = \frac{r_i^2}{1 - r_i^2} (n - 2)$
where $n$ is the number of samples. (For classification targets, f_classif instead uses the one-way ANOVA F-test, i.e. variance between groups over variance within groups.)
#### Python Implementation
```python
from sklearn.feature_selection import f_regression
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# F-statistic
f_values, p_values = f_regression(X, y)
print("F-statistic values:", f_values)`
```
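As a quick cross-check of the formula above (a sketch only; the continuous target values below are made up for illustration), the score that f_regression reports for a single feature can be reproduced from its Pearson correlation with the target:
```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression
# Illustrative data with a continuous target (values are arbitrary)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
# F-statistics as computed by scikit-learn
f_values, p_values = f_regression(X, y)
# Rebuild the statistic for the first feature from its correlation with y
r, _ = pearsonr(X[:, 0], y)
n = len(y)
f_manual = r**2 / (1 - r**2) * (n - 2)
print(f_values[0], f_manual)  # the two numbers should agree
```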
---
## Wrapper Methods
### Sequential Forward Selection (SFS)
Sequential Forward Selection starts with an empty set of features and adds one feature at a time that improves the model the most until no significant improvement is observed.
#### Python Implementation
```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
# Sample data as NumPy arrays (mlxtend selects feature columns by index)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 1, 0, 1, 0])
# Model
model = LogisticRegression()
# SFS: add one feature at a time (cv=0 disables cross-validation on this tiny sample)
sfs = SFS(model, k_features=1, forward=True, floating=False, scoring='accuracy', cv=0)
sfs = sfs.fit(X, y)
print("Selected features:", sfs.k_feature_idx_)
```
---
### Sequential Backward Selection (SBS)
Sequential Backward Selection starts with all features and removes one feature at a time, at each step dropping the feature whose removal hurts model performance the least, until removing more features would cause a significant drop in performance.
#### Python Implementation
```python
# SBS
sbs = SFS(model, k_features=1, forward=False, floating=False, scoring='accuracy', cv=0)
sbs = sbs.fit(X, y)
print("Selected features:", sbs.k_feature_idx_)`
```
---
### Sequential Forward Floating Selection (SFFS)
Sequential Forward Floating Selection is similar to SFS but allows for features to be added and then removed if they no longer contribute significantly.
#### Python Implementation
```python
# SFFS
sffs = SFS(model, k_features=1, forward=True, floating=True, scoring='accuracy', cv=0)
sffs = sffs.fit(X, y)
print("Selected features:", sffs.k_feature_idx_)`
```
---
### Sequential Backward Floating Selection (SBFS)
Sequential Backward Floating Selection is similar to SBS but allows for features to be removed and then added back if they become significant again.
#### Python Implementation
```python
# SBFS
sbfs = SFS(model, k_features=1, forward=False, floating=True, scoring='accuracy', cv=0)
sbfs = sbfs.fit(X, y)
print("Selected features:", sbfs.k_feature_idx_)`
```
---
### Recursive Feature Elimination (RFE)
Recursive Feature Elimination recursively removes the least important features and builds the model until the specified number of features is reached.
#### Python Implementation
```python
from sklearn.feature_selection import RFE
# Model
model = LogisticRegression()
# RFE
rfe = RFE(model, n_features_to_select=1)
rfe = rfe.fit(X, y)
print("Selected features:", rfe.support_)`
```
---
## Feature Importance
### Tree-based Feature Importance
Tree-based models can compute feature importance based on the reduction in impurity (e.g., Gini impurity) brought by each feature.
### Gini Impurity
#### Formula
Gini impurity for a node with classes $c_1, c_2, \ldots, c_k$:
$G = 1 - \sum_{i=1}^{k} p_i^2$
where $p_i$ is the probability of class $c_i$.
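For example, with two classes an evenly split node is maximally impure, while a pure node has $G = 0$:
$G_{\text{even split}} = 1 - (0.5^2 + 0.5^2) = 0.5$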
#### Python Implementation
```python
from sklearn.ensemble import RandomForestClassifier
# Sample data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 1, 0, 1, 0]
# Random forest model
model = RandomForestClassifier()
model.fit(X, y)
# Feature importance
importances = model.feature_importances_
print("Feature importances:", importances)`
```
---
### Aggregate Methods
Aggregate methods combine the feature importances from multiple models to provide a more robust ranking.
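A minimal sketch of one aggregation scheme, assuming we simply average the built-in importance scores of two different tree ensembles (the model choices and the plain mean are illustrative, not a prescribed recipe):
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 1, 0, 1, 0])
# Fit two different tree ensembles and collect their importance scores
models = [RandomForestClassifier(random_state=0), GradientBoostingClassifier(random_state=0)]
all_importances = []
for m in models:
    m.fit(X, y)
    all_importances.append(m.feature_importances_)  # scikit-learn normalizes these to sum to 1
# Aggregate by averaging across models and rank features by the mean score
mean_importance = np.mean(all_importances, axis=0)
print("Aggregated feature importances:", mean_importance)
```
Averaging raw scores is only one option; averaging ranks, or weighting each model by its validation performance, are common variants.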
---
### Permutation-based Methods
Permutation-based methods evaluate the importance of a feature by measuring the change in model performance after shuffling the feature values.
#### Python Implementation
```python
from sklearn.inspection import permutation_importance
# Permutation importance
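# ('model' here is the random forest fitted in the tree-based example above, on the same X and y)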
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print("Permutation importances:", result.importances_mean)`
```
---
### Coefficients
Linear models can use the magnitude of coefficients as a measure of feature importance.
#### Python Implementation
```python
from sklearn.linear_model import LogisticRegression
# Logistic regression model
model = LogisticRegression()
model.fit(X, y)
# Coefficients
coefficients = model.coef_
print("Coefficients:", coefficients)
```
## Conclusion
### Summary of Feature Selection Methods
- **Filter Methods** rank features by statistical properties, independent of any model. Examples include Variance Threshold, Pearson's Correlation, Point Biserial Correlation, Mutual Information, and the F-statistic.
- **Wrapper Methods** select features based on the performance of a model trained on candidate subsets. Examples include Sequential Forward Selection, Sequential Backward Selection, Sequential Forward Floating Selection, Sequential Backward Floating Selection, and Recursive Feature Elimination.
- **Feature Importance** methods rely on model-specific importance scores. Examples include tree-based feature importance, aggregate methods, permutation-based methods, and linear-model coefficients.
Continue: [[04-Supervised Learning Regressors and Classifiers]]