## Regression and Classification

Supervised learning involves learning a function from labeled training data. The goal is to make predictions on new, unseen data. Supervised learning tasks can be broadly categorized into regression and classification:

- **Regression:** Predicts continuous values.
- **Classification:** Predicts discrete labels.

---

## Linear Regression

Linear regression is used to model the relationship between a dependent variable and one or more independent variables. The relationship is modeled by fitting a linear equation to observed data.

### Mathematical Formulation

For a single feature $x$, the linear regression model is:

$y = \beta_0 + \beta_1 x + \epsilon$

where:

- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
- $\epsilon$ is the error term.

The goal is to minimize the sum of squared errors (SSE), where $\hat{y}_i$ is the model's prediction for the $i$-th observation:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

### Python Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

# Linear regression model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Plotting
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.show()
```

---

## Multiple Linear Regression

Multiple linear regression models the relationship between a dependent variable and multiple independent variables.

### Mathematical Formulation

For multiple features $X = [x_1, x_2, \ldots, x_p]$, the model is:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon$

### Python Implementation

```python
# Sample data
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
y = np.array([2, 3, 4, 5, 6])

# Multiple linear regression model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Results
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

---

## Logistic Regression

Logistic regression is used for binary classification. It models the probability that a given input belongs to a particular class.

### Mathematical Formulation

The logistic function (sigmoid) is used to model the probability:

$P(y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)}}$

### Python Implementation

```python
from sklearn.linear_model import LogisticRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

# Logistic regression model
model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Results
print("Predicted labels:", y_pred)
print("Probabilities:", model.predict_proba(X))
```

---

## K-Nearest Neighbors (K-NN)

### K-NN Classifier

K-NN is a non-parametric method used for classification and regression. For classification, it assigns the class most common among the k nearest neighbors.
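The idea can be illustrated with a minimal from-scratch sketch (a rough illustration, not the library implementation) on the same toy data used in the scikit-learn example below: compute the distance from a query point to every training point, keep the k closest, and take a majority vote over their labels. The helper name `knn_predict` and the query point are made up for this sketch; the formal distance metric and the scikit-learn implementation follow.

```python
import numpy as np

# Same toy data as in the scikit-learn example below
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([0, 0, 1, 1, 1])

def knn_predict(x_query, X_train, y_train, k=3):
    """Illustrative helper: classify one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    return np.bincount(y_train[nearest]).argmax()

# Two of the three nearest neighbors of x = 2.5 have label 0, so the vote returns 0
print(knn_predict(np.array([2.5]), X_train, y_train, k=3))
```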
### Mathematical Formulation

A distance metric (e.g., Euclidean) is used to find the nearest neighbors:

$d(x, x_i) = \sqrt{\sum_{j=1}^{p} (x_j - x_{ij})^2}$

### Python Implementation

```python
from sklearn.neighbors import KNeighborsClassifier

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

# K-NN classifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
y_pred = model.predict(X)

# Results
print("Predicted labels:", y_pred)
```

---

### K-NN Regressor

For regression, K-NN predicts the average of the k nearest neighbors.

### Python Implementation

```python
from sklearn.neighbors import KNeighborsRegressor

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

# K-NN regressor
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
y_pred = model.predict(X)

# Results
print("Predicted values:", y_pred)
```

---

## Decision Trees

### Introduction

Decision trees are used for both classification and regression. They split the data into subsets based on the feature that results in the most significant information gain (classification) or variance reduction (regression).

### Mathematical Formulation

The impurity of a node can be measured using Gini impurity or entropy for classification:

- Gini = $1 - \sum_{i=1}^{n} p_i^2$
- Entropy = $-\sum_{i=1}^{n} p_i \log(p_i)$

For regression, variance reduction is used:

- Variance = $\frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2$

### Python Implementation

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y_class = np.array([0, 0, 1, 1, 1])
y_reg = np.array([1, 3, 2, 3, 5])

# Decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X, y_class)
y_class_pred = clf.predict(X)

# Decision tree regressor
reg = DecisionTreeRegressor()
reg.fit(X, y_reg)
y_reg_pred = reg.predict(X)

# Results
print("Classification:", y_class_pred)
print("Regression:", y_reg_pred)
```

---

## Normalization

Normalization rescales the data to a common scale, improving the performance and training stability of machine learning models.

### Min-Max Normalization

Scales the data to the range [0, 1]:

$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

### Z-Score Normalization

Scales the data based on the mean and standard deviation:

$x' = \frac{x - \mu}{\sigma}$

### Python Implementation

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
X = np.array([[1], [2], [3], [4], [5]])

# Min-Max normalization
min_max_scaler = MinMaxScaler()
X_min_max = min_max_scaler.fit_transform(X)

# Z-score normalization
z_score_scaler = StandardScaler()
X_z_score = z_score_scaler.fit_transform(X)

# Results
print("Min-Max Normalized Data:", X_min_max)
print("Z-Score Normalized Data:", X_z_score)
```

---

# Training, Validation, and Test Datasets

Supervised machine learning models learn patterns from labeled datasets and use them to make predictions. However, we need a systematic way to evaluate the performance of our models to ensure their predictions are accurate and useful. This is where the training, validation, and test datasets come into play.

## Training-Validation-Test Split

### Purpose of Each Set

- **Training Set:** The data used to train the model. It is the subset of the data from which the model learns the underlying patterns.
- **Validation Set:** The data used to tune the hyperparameters and evaluate the model during the training process. It helps in model selection and improvement.
- **Test Set:** The data used to evaluate the final model. It provides an unbiased evaluation of the model's performance on unseen data.

### Splitting the Data

Before fitting a machine learning model, we must split the dataset into training, validation, and test sets. A common split ratio is 70% for training, 15% for validation, and 15% for testing.

#### Python Implementation

```python
from sklearn.model_selection import train_test_split

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Split data into a training set and a temporary set
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Further split the temporary set into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set:", X_train, y_train)
print("Validation set:", X_val, y_val)
print("Test set:", X_test, y_test)
```

### Evaluating the Model

During model fitting, both the features ($X$) and the true labels ($y$) of the training set ($X_{train}, y_{train}$) are used for learning. When evaluating the performance of the model on the validation ($X_{val}, y_{val}$) or test ($X_{test}, y_{test}$) set, we pretend that we do not know the true labels. By feeding the features of the validation or test set into the trained model, we obtain predicted labels ($y_{pred}$). We can then compare the true labels ($y_{val}$ or $y_{test}$) with the predicted labels ($y_{pred}$) to get a quantitative evaluation of the model's performance.

#### Example Metrics

- Accuracy
- Precision
- Recall
- F1-Score

#### Python Implementation

```python
from sklearn.metrics import accuracy_score

# Assuming we have a trained model
y_val_pred = model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", accuracy)
```

## N-Fold Cross-Validation

### Introduction

When the dataset is too small to split effectively into training, validation, and test sets, N-fold cross-validation is a valuable technique. It ensures that every observation in the dataset appears in both the training and validation sets across the folds.

### Process

1. Split the dataset into N equal-sized chunks.
2. For each iteration (fold), use N-1 chunks as the training set and the remaining chunk as the validation set.
3. Train the model on the training set and evaluate it on the validation set.
4. Repeat the process N times, each time with a different chunk as the validation set.
5. Average the evaluation metrics across all folds to get a robust estimate of model performance.

#### Python Implementation

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Logistic regression model
model = LogisticRegression()

# Perform 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print("Cross-Validation Scores:", scores)
print("Average Cross-Validation Score:", scores.mean())
```

### Advantages of Cross-Validation

- **Better Utilization of Data:** Every observation is used for both training and validation, which is particularly useful for small datasets.
- **Reduced Overfitting:** By averaging the performance across different folds, the model's robustness and generalizability are improved.
- **Hyperparameter Tuning:** Cross-validation provides a reliable estimate of model performance for each candidate hyperparameter value, as shown in the sketch after this list.
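To make the hyperparameter-tuning point concrete, here is a minimal sketch using scikit-learn's `GridSearchCV`, which runs cross-validation for every candidate value of `n_neighbors` of a K-NN classifier on the same toy data as above. The parameter grid and the choice of K-NN are arbitrary, for illustration only.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Same toy data as in the cross-validation example above
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Candidate hyperparameter values (arbitrary, for illustration)
param_grid = {"n_neighbors": [1, 3, 5]}

# 5-fold cross-validation for each candidate value of n_neighbors
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print("Best n_neighbors:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)
```

Each candidate value is scored by its average validation performance across the folds, and the best-scoring setting is selected automatically.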
## Conclusion

Splitting a dataset into training, validation, and test sets is crucial for developing robust and reliable machine learning models. The training set allows the model to learn, the validation set helps in tuning and improving the model, and the test set provides an unbiased evaluation of the model's performance. In cases where data is limited, N-fold cross-validation serves as an effective alternative that still ensures a thorough model evaluation.

Continue: [[05-Supervised Learning Evaluation Metrics for Classification]]