### Multi-Instance GPUs (MIGs) in Kubernetes

#### Overview

This paper provides a comprehensive overview of supporting Multi-Instance GPUs (MIG) in Kubernetes, highlighting the challenges, design decisions, and implementation strategies.

![mig | center | 512](https://cdn.prod.website-files.com/6295808d44499cde2ba36c71/65cb84f9b6c2baa647bf553a_egx-cloud-native-core-stack-diag.jpeg)

#### Background

MIG (Multi-Instance GPU) is a feature of NVIDIA GPUs that allows a single GPU to be partitioned into multiple MIG devices. Each MIG device acts as an independent GPU with a dedicated portion of memory and compute resources. Partitioning is done in fixed-size chunks called slices. For example, the NVIDIA A100 has 8 memory slices and 7 compute slices. MIG allows for flexible GPU usage by enabling configurations such as:

- One device with 4 memory slices and 3 compute slices
- One device with 2 memory slices and 2 compute slices
- One device with 1 memory slice and 1 compute slice

This flexibility lets users size GPU resources to the specific needs of their workloads.

#### High-Level Design Decisions

1. **Static Configuration:** MIG devices are pre-configured on GPUs and are not created dynamically within the Kubernetes stack.
2. **Single GPU and Compute Instance:** Each MIG device consists of a single GPU instance and a single compute instance.
3. **Single MIG Device per Container:** Containers request a single MIG device to meet their resource needs.
4. **Node-Level Strategy Configuration:** Different strategies for exposing MIG devices are configurable at the node level.

#### Supporting MIG on Kubernetes

Two main strategies for exposing MIG devices on Kubernetes nodes are implemented: **single** and **mixed**.

##### Single Strategy

- **Node Configuration:** All GPUs on a node must be of the same type, have MIG enabled, and expose the same type of MIG device.
- **Resource Exposure:** The k8s-device-plugin exposes MIG devices using the traditional `nvidia.com/gpu` resource type. Node labels are applied to indicate the properties of the exposed MIG device.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: gpu-example
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb
```

##### Mixed Strategy

- **Node Configuration:** All GPUs on a node must be of the same type, but can be configured with a mix of MIG devices.
- **Resource Exposure:** The k8s-device-plugin exposes non-MIG GPUs as `nvidia.com/gpu` and individual MIG devices as `nvidia.com/mig-<slice_count>g.<memory_size>gb`. Node labels are applied to indicate the properties of each MIG device type.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
  - name: gpu-example
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-40GB
```

#### Discussion

- **Single Strategy:** Suitable for large deployments where nodes are dedicated to a single type of MIG device.
- **Mixed Strategy:** Suitable for smaller deployments that require flexibility in GPU resource allocation.

### Expanding Kubernetes for Machine Learning

To effectively use Kubernetes for Machine Learning (ML), especially for scaling GPU workloads, consider the following additional strategies and best practices.

#### Dynamic GPU Allocation

Dynamic GPU allocation allows GPU resources to scale automatically with workload demand. This can be achieved using Kubernetes' Horizontal Pod Autoscaler (HPA) combined with custom metrics for GPU utilization.
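The HPA cannot consume GPU utilization out of the box; a common pipeline scrapes NVIDIA's DCGM exporter with Prometheus and surfaces the result through the Prometheus Adapter. A minimal sketch of an adapter rule mapping the DCGM utilization metric to a `gpu_utilization` custom metric — the exact series and label names depend on the exporter configuration and are assumptions here:

```yaml
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_pod!=""}'
  resources:
    overrides:
      exported_namespace: {resource: "namespace"}
      exported_pod: {resource: "pod"}
  name:
    as: "gpu_utilization"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

With a rule like this in place, the HPA can target `gpu_utilization` as a Pods metric.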
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
```

#### GPU Sharing

GPU sharing allows multiple containers to use a fraction of a GPU, which is useful for inference workloads that do not require the full capacity of a GPU.

##### NVIDIA GPU Sharing

NVIDIA's device plugin for Kubernetes supports GPU sharing through time-slicing. Note that Kubernetes extended resources must be requested in whole units, so fractional requests such as `nvidia.com/gpu: 0.5` are not valid; instead, the device plugin is configured to advertise each physical GPU as multiple replicas, and each container requests one replica.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: container1
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  - name: container2
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
```

With time-slicing configured at, say, 2 replicas per GPU, the two containers above may be scheduled onto the same physical GPU.

#### Optimizing for ML Workloads

- **Data Locality:** Keep training and inference data close to the GPU nodes to minimize I/O latency.
- **Node Affinity and Anti-Affinity:** Use node affinity rules to schedule ML workloads on nodes with GPU capabilities.
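The data-locality point can be made concrete with node selection: a pod can be pinned to nodes that hold a pre-warmed local copy of its dataset. A hypothetical example — the `example.com/dataset-cache` label is an assumption for illustration, not a standard Kubernetes or NVIDIA label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  containers:
  - name: trainer
    image: my-ml-image
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    # Hypothetical label applied to nodes that cache this dataset locally
    example.com/dataset-cache: imagenet
```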
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-pod
spec:
  containers:
  - name: ml-container
    image: my-ml-image
    resources:
      limits:
        nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: Exists
```

#### Code Snippets for MIG and Kubernetes Integration

**Configuring MIG on Kubernetes Nodes:**

```bash
# Enable MIG mode on the GPU (requires a GPU reset to take effect)
sudo nvidia-smi -mig 1

# Create seven 1g.5gb GPU instances (profile 19 on an A100-40GB)
# and their corresponding compute instances
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```

**Deploying a Pod with MIG resources:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-pod
spec:
  containers:
  - name: ml-container
    image: nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB
```

#### Monitoring and Logging

- **Prometheus and Grafana:** Integrate Prometheus for monitoring GPU metrics and Grafana for visualization.
- **Logging:** Use a centralized logging stack such as ELK (Elasticsearch, Logstash, Kibana) for debugging and performance analysis.

### Conclusion

Supporting MIG in Kubernetes enhances the flexibility and efficiency of GPU resource allocation. Expanding these capabilities for ML workloads involves dynamic GPU allocation, GPU sharing, optimized scheduling, and robust monitoring and logging practices. Together, these strategies enable scalable and efficient deployment of ML models in a Kubernetes environment.

References:

- [MIGs in Kubernetes, by Kevin Klues](https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g/edit)