## K-means Clustering
K-means clustering is a method of vector quantization that aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean.
### Algorithm
1. **Initialization**: Choose $k$ initial cluster centroids randomly.
2. **Assignment**: Assign each data point to the nearest centroid.
3. **Update**: Calculate new centroids as the mean of all points assigned to each cluster.
4. **Repeat**: Repeat the assignment and update steps until convergence (i.e., the centroids do not change significantly).
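These four steps map directly onto NumPy. The sketch below is illustrative (the name `kmeans_sketch` and its parameters are not from any library) and assumes no cluster ever becomes empty during the updates:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # 2. Assignment: nearest centroid for every point (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```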
### Mathematical Formulation
The objective of K-means is to minimize the within-cluster sum of squares (WCSS):
$WCSS = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2$
where:
- $C_i$ is the set of points in cluster $i$.
- $\mu_i$ is the centroid of cluster $i$.
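As a small illustration of the objective, the snippet below computes the WCSS by hand for a toy assignment (the points and labels are hypothetical). scikit-learn exposes the same quantity on a fitted model as `KMeans.inertia_`:

```python
import numpy as np

# Toy data with a fixed, hypothetical cluster assignment
points = np.array([[1.0, 2.0], [1.0, 4.0], [4.0, 2.0], [4.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])

# Sum of squared distances from each point to its cluster centroid
wcss = sum(np.sum((points[labels == i] - centroids[i]) ** 2) for i in range(2))
print("WCSS:", wcss)
```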
### Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# K-means clustering
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
# Cluster centroids
centroids = kmeans.cluster_centers_
print("Centroids:\n", centroids)
# Predicted clusters
labels = kmeans.labels_
print("Labels:", labels)
# Plotting
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-means Clustering')
plt.show()
```
---
## Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a new coordinate system whose axes (the principal components) are ordered by the variance they capture: the first component captures the greatest variance, the second the greatest remaining variance, and so on.
### Algorithm
1. **Standardization**: Standardize the data to have a mean of 0 and a standard deviation of 1.
2. **Covariance Matrix**: Compute the covariance matrix of the standardized data.
3. **Eigen Decomposition**: Compute the eigenvalues and eigenvectors of the covariance matrix.
4. **Principal Components**: Select the top $k$ eigenvectors (principal components) corresponding to the top $k$ eigenvalues.
5. **Transformation**: Transform the data into the new subspace defined by the top $k$ principal components.
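These five steps map almost line-for-line onto NumPy, as in the sketch below (`pca_sketch` is an illustrative name, not a library function; it assumes no feature has zero variance):

```python
import numpy as np

def pca_sketch(X, k):
    # 1. Standardization: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data
    Sigma = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh suits the symmetric covariance matrix)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    # 4. Select the top-k eigenvectors, ordered by descending eigenvalue
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]
    # 5. Transformation: project the data onto the top-k components
    return X_std @ components
```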
### Mathematical Formulation
Given a dataset $X$ with zero mean, the covariance matrix $\Sigma$ is:
$\Sigma = \frac{1}{n-1} X^T X$
The eigenvalue decomposition of $\Sigma$ gives:
$\Sigma = V \Lambda V^T$
where:
- $V$ is the matrix of eigenvectors.
- $\Lambda$ is the diagonal matrix of eigenvalues.
The principal components are the columns of $V$ corresponding to the largest eigenvalues.
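The decomposition is easy to check numerically: applying `np.linalg.eigh` to a small (hypothetical) covariance matrix, the product $V \Lambda V^T$ reconstructs $\Sigma$:

```python
import numpy as np

# A small, hypothetical symmetric covariance matrix
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.3],
                  [0.1, 0.3, 0.5]])

eigvals, V = np.linalg.eigh(Sigma)   # eigenvalues and eigenvectors
Lambda = np.diag(eigvals)            # Lambda as a diagonal matrix

# V Lambda V^T reconstructs Sigma
print(np.allclose(V @ Lambda @ V.T, Sigma))  # True
```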
### Python Implementation
```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Sample data
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7],
              [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Explained variance
explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)
# Plotting
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA')
plt.show()
```
## Conclusion
### Summary of Unsupervised Learning Methods
- **K-means Clustering**: A partitioning method that divides data into $k$ clusters by minimizing the within-cluster sum of squares.
- **Principal Component Analysis (PCA)**: A dimensionality reduction technique that transforms data into a new coordinate system based on the directions of maximum variance (for example, projecting 5-dimensional data down to 2 dimensions for easier processing and visualization).
Continue: [[08-Regularization and Hyperparameter Tuning]]