## K-means Clustering

K-means clustering is a method of vector quantization that aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean.

### Algorithm

1. **Initialization**: Choose $k$ initial cluster centroids randomly.
2. **Assignment**: Assign each data point to the nearest centroid.
3. **Update**: Recompute each centroid as the mean of all points assigned to its cluster.
4. **Repeat**: Repeat the assignment and update steps until convergence (i.e., until the centroids no longer change significantly).

### Mathematical Formulation

The objective of K-means is to minimize the within-cluster sum of squares (WCSS):

$WCSS = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2$

where:

- $C_i$ is the set of points in cluster $i$.
- $\mu_i$ is the centroid of cluster $i$.

### Python Implementation

scikit-learn provides an optimized implementation; a from-scratch sketch of the assignment/update loop follows the PCA section below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# K-means clustering (n_init set explicitly; its default changed across scikit-learn versions)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)

# Cluster centroids
centroids = kmeans.cluster_centers_
print("Centroids:\n", centroids)

# Predicted clusters
labels = kmeans.labels_
print("Labels:", labels)

# WCSS (scikit-learn calls this inertia)
print("WCSS:", kmeans.inertia_)

# Plotting
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-means Clustering')
plt.show()
```

---

## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a new coordinate system in which the greatest variance lies along the first coordinate (the first principal component), the second greatest along the second coordinate, and so on.

### Algorithm

1. **Standardization**: Standardize the data to have a mean of 0 and a standard deviation of 1.
2. **Covariance Matrix**: Compute the covariance matrix of the standardized data.
3. **Eigen Decomposition**: Compute the eigenvalues and eigenvectors of the covariance matrix.
4. **Principal Components**: Select the top $k$ eigenvectors (principal components) corresponding to the top $k$ eigenvalues.
5. **Transformation**: Project the data onto the subspace defined by the top $k$ principal components.

### Mathematical Formulation

Given a dataset $X$ with zero mean, the covariance matrix $\Sigma$ is:

$\Sigma = \frac{1}{n-1} X^T X$

The eigenvalue decomposition of $\Sigma$ gives:

$\Sigma = V \Lambda V^T$

where:

- $V$ is the matrix of eigenvectors.
- $\Lambda$ is the diagonal matrix of eigenvalues.

The principal components are the columns of $V$ corresponding to the largest eigenvalues.

### Python Implementation

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Sample data
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained variance
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)

# Plotting
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA')
plt.show()
```
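To connect the `PCA` call above to the eigen-decomposition steps, here is a minimal from-scratch sketch. It assumes the data is only centered (zero mean), matching the covariance formula above rather than the full standardization of step 1; `pca_eig` is a name made up for this note, not a library function.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigen-decomposition of the covariance matrix (steps 2-5 above)."""
    # Center the data (zero mean); full standardization would also divide by std
    Xc = X - X.mean(axis=0)
    # Covariance matrix: Sigma = X^T X / (n - 1)
    Sigma = Xc.T @ Xc / (len(X) - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues descending
    V = eigvecs[:, order[:k]]                   # top-k principal components
    explained = eigvals[order[:k]] / eigvals.sum()
    return Xc @ V, explained                    # projected data, variance ratios

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_pca, explained = pca_eig(X, k=2)
print("Explained variance ratio:", explained)
```

Eigenvector signs are arbitrary, so the projected coordinates may be flipped relative to scikit-learn's output; the explained variance ratios should match.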
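Likewise, here is the from-scratch K-means sketch promised in the section above: a minimal version of the assignment/update loop (steps 1-4), assuming Euclidean distance and ignoring the empty-cluster edge case; `kmeans_lloyd` is a made-up name, not a library function.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        # (empty clusters are not handled; real implementations reseed them)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    # Final assignment and WCSS, matching the objective defined above
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    wcss = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, wcss

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
centroids, labels, wcss = kmeans_lloyd(X, k=2)
print("Centroids:\n", centroids)
print("Labels:", labels)
print("WCSS:", wcss)
```

A single run can land in a local optimum of the WCSS; restarting from several random initializations and keeping the lowest WCSS is the usual remedy, which is exactly what scikit-learn's `n_init` parameter controls.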
## Conclusion

### Summary of Unsupervised Learning Methods

- **K-means Clustering**: A partitioning method that divides data into $k$ clusters by minimizing the within-cluster sum of squares.
- **Principal Component Analysis (PCA)**: A dimensionality reduction technique that transforms data into a new coordinate system based on the directions of maximum variance (e.g., reducing 5 dimensions to 2 for easier processing).

Continue: [[08-Regularization and Hyperparameter Tuning]]