### Overview

High availability (HA) in Kubernetes ensures that your cluster remains operational and resilient even in the face of node failures or other disruptions. This is particularly crucial for production environments and applications requiring continuous uptime, such as those involving machine learning workloads.

### Cluster High Availability

A key feature of Kubernetes is the ability to join multiple control plane nodes with collocated etcd databases, which kubeadm facilitates. This setup enhances redundancy and fault tolerance: if one control plane node goes down, the others maintain the cluster's operations, and the failed node resynchronizes its state once it is restored.

#### Quorum and etcd

etcd is the distributed key-value store in which Kubernetes keeps all of its cluster data. For high availability, etcd requires a quorum, a majority of members that agree on the current state of the data; for a cluster of n members, quorum is ⌊n/2⌋ + 1. Three etcd instances are typically recommended, since a three-member cluster still has quorum (two of three) even if one instance fails. The etcd cluster elects a leader from the available members to coordinate operations.

#### Collocated vs. Non-Collocated Databases

There are two main approaches to deploying etcd in a high-availability setup: collocated and non-collocated databases (the kubeadm documentation calls these the "stacked" and "external" etcd topologies).

**Collocated Databases**

The simplest method is to collocate etcd instances with the control plane nodes. This can be easily achieved using kubeadm to join additional control plane nodes to the cluster. The process is similar to adding worker nodes but includes the `--control-plane` flag and a certificate key.

Example command to join a control plane node:

```sh
kubeadm join <control-plane-endpoint> --token <token> \
  --discovery-token-ca-cert-hash <hash> \
  --control-plane --certificate-key <key>
```

If a node fails, both its control plane components and its etcd instance are lost, but the remaining nodes maintain quorum and cluster functionality.

**Non-Collocated Databases**

In a more complex setup, you can use an external etcd cluster. This requires additional hardware and configuration but provides greater resilience, since the etcd cluster fails independently of the control plane nodes.

Steps for setting up a non-collocated etcd cluster:

1. **Configure the external etcd cluster**: Set up etcd on separate nodes.
2. **Generate and distribute certificates**: Manually copy the required certificates to the control plane nodes.
3. **Update the kubeadm configuration**: Modify `kubeadm-config.yaml` to specify the external etcd endpoints and certificate locations.
4. **Initialize the control planes**: Initialize each control plane node one at a time, ensuring each is fully operational before adding the next.

Example configuration in `kubeadm-config.yaml`:

```yaml
etcd:
  external:
    endpoints:
      - https://<etcd-node-1>:2379
      - https://<etcd-node-2>:2379
      - https://<etcd-node-3>:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

### Load Balancing

To ensure continuous access to the control plane, it is essential to use a load balancer. The load balancer distributes incoming API requests across the control plane nodes, providing redundancy and scalability. It should be configured for TCP pass-through so that TLS connections terminate at the API servers themselves rather than at the load balancer. Using a fully qualified domain name (FQDN) instead of an IP address as the control plane endpoint is recommended, since the name can later be repointed without reconfiguring the cluster; a minimal load balancer sketch and a bootstrap example follow.
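To make the pass-through setup concrete, here is a minimal HAProxy sketch. HAProxy is just one common choice, and the backend names and IP addresses are placeholders for illustration; a production setup would also make the load balancer itself redundant (for example with keepalived).

```
# /etc/haproxy/haproxy.cfg -- minimal sketch; names and IPs are placeholders
defaults
    mode    tcp            # TCP pass-through: TLS is not terminated here
    timeout connect 10s
    timeout client  1m
    timeout server  1m

frontend kube-apiserver
    bind *:6443
    default_backend control-plane-nodes

backend control-plane-nodes
    balance roundrobin
    option  tcp-check      # basic TCP health checks against each API server
    server cp1 10.0.0.11:6443 check
    server cp2 10.0.0.12:6443 check
    server cp3 10.0.0.13:6443 check
```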
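The FQDN comes into play when bootstrapping the cluster: the first control plane node is initialized against the load-balanced endpoint, and `--upload-certs` generates the certificate key used by the join command shown earlier. The domain below is a hypothetical placeholder.

```sh
# "k8s-api.example.com" is a hypothetical name resolving to the load balancer
kubeadm init \
  --control-plane-endpoint "k8s-api.example.com:6443" \
  --upload-certs
```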
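Returning briefly to the external etcd topology: before initializing any control plane against it, it is worth verifying that every etcd member is healthy. A sketch using `etcdctl` with the same client certificates referenced in `kubeadm-config.yaml` (the endpoints are placeholders):

```sh
# Check that every etcd member is healthy before running kubeadm init
ETCDCTL_API=3 etcdctl \
  --endpoints=https://<etcd-node-1>:2379,https://<etcd-node-2>:2379,https://<etcd-node-3>:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
  --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
  endpoint health
```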
### High Availability for Machine Learning Workloads

High availability is particularly important for machine learning workloads, which often involve long-running training jobs and real-time inference services. Kubernetes can schedule and scale GPU workloads, providing the necessary computational power and resilience.

#### Scaling GPU Workloads

Kubernetes supports GPU scheduling, allowing you to allocate GPU resources to your pods. This is crucial for training complex machine learning models and running inference workloads. Note that extended resources such as `nvidia.com/gpu` are advertised by a vendor device plugin, which must be installed on the GPU nodes.

Example of a pod requesting GPU resources:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: gpu-container
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1 # Request one GPU
```

#### High Availability in Machine Learning Production

In production, machine learning workloads require high availability to ensure continuous data processing and model serving. This involves:

- **Redundant control plane nodes**: Ensuring the control plane is resilient to failures.
- **Distributed etcd**: Using an external etcd cluster for greater fault tolerance.
- **Load balancers**: Distributing traffic across multiple control plane nodes.
- **Autoscaling**: Automatically scaling workloads and GPU nodes based on demand, using the Horizontal Pod Autoscaler (HPA) and the Cluster Autoscaler.

Example of a Horizontal Pod Autoscaler (the stable `autoscaling/v2` API replaces the deprecated `v2beta2`):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu # CPU utilization of the GPU pods; see note below
        target:
          type: Utilization
          averageUtilization: 50
```

Note that the built-in `Resource` metrics come from metrics-server, which reports only CPU and memory; scaling on actual GPU utilization requires an external or custom metrics pipeline (for example, NVIDIA DCGM metrics surfaced through a Prometheus adapter).

### Conclusion

Achieving high availability in Kubernetes involves careful planning and configuration of both the control plane and etcd. Whether you use collocated or non-collocated databases, the goal is a cluster that can withstand failures and continue to operate smoothly. For machine learning workloads, combining GPU scaling with these high-availability practices is crucial to keep model training and serving robust and efficient in production environments.

Back to Start: [[01-Intro to K8s]]