### kube-scheduler

As Kubernetes deployments grow in size and complexity, effective scheduling becomes crucial. The `kube-scheduler` is responsible for deciding which node will run each Pod, using a topology-aware algorithm.

#### How kube-scheduler Works

The scheduler tracks the nodes in the cluster, then filters and scores them to determine the best fit for each Pod. The Pod specification is sent to the kubelet on the selected node for creation. Default scheduling can be influenced through node and Pod labels, taints, tolerations, and affinities.

### Node Selection Process

#### Filtering Stage

The scheduler identifies the nodes that can run the Pod, based on its resource requirements (requests and limits) and other constraints. If no node qualifies, the Pod remains unscheduled (in the `Pending` state) until a suitable node becomes available.

#### Scoring Stage

The scheduler rates the remaining nodes to determine the best placement. Each qualifying node receives a score based on the scheduler's configuration, and the node with the highest score is chosen.

### Scheduling Configuration

You can customize the scheduler's behavior by writing a configuration file and passing its path to `kube-scheduler` with the `--config` command-line argument. Scheduling profiles allow you to configure the different stages of scheduling with plugins that modify scheduler behavior.

#### Extension Points

There are twelve extension points in the scheduling cycle where plugins can be enabled or disabled:

- queueSort
- preFilter
- filter
- postFilter
- preScore
- score
- reserve
- permit
- preBind
- bind
- postBind
- multiPoint

`multiPoint` is a shorthand field that enables or disables a plugin for every extension point it implements.

#### Example of a Custom Scheduling Profile

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: custom-scheduler
    plugins:
      preFilter:
        disabled:
          - name: '*'
      filter:
        disabled:
          - name: '*'
      postFilter:
        disabled:
          - name: '*'
```

In this example, the scheduler runs two profiles: one with the default plugins and one (`custom-scheduler`) with all filtering plugins disabled. A Pod chooses a profile through its `schedulerName` field.

### Pod Specification for Scheduling

Most scheduling constraints are expressed in the Pod specification, using fields such as:

- `nodeName`
- `nodeSelector`
- `affinity`
- `schedulerName`
- `tolerations`

#### Specifying a Node Label with nodeSelector

```yaml
spec:
  containers:
    - name: redis
      image: redis
  nodeSelector:
    net: fast
```

This configuration ensures the Pod is only scheduled on a node carrying the label `net: fast`.

### Affinity and Anti-Affinity

#### Pod Affinity Example

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: security
                operator: In
                values:
                  - S1
          topologyKey: kubernetes.io/hostname
```

This configuration ensures the Pod is scheduled on a node already running a Pod labeled `security: S1`. Note that every (anti-)affinity term requires a `topologyKey`, which defines the domain (here, the individual node) in which the rule is evaluated.

#### Pod Anti-Affinity Example

```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: security
              operator: In
              values:
                - S2
        topologyKey: kubernetes.io/hostname
```

This setting prefers nodes without Pods labeled `security: S2`, but will still schedule the Pod onto one if no other node is available.

### Node Affinity

Node affinity allows scheduling based on node labels. It is similar to `nodeSelector` but more expressive, and may eventually replace it.

#### Node Affinity Example

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: diskspeed
                operator: In
                values:
                  - quick
                  - fast
```

This configuration prefers nodes labeled `diskspeed: quick` or `diskspeed: fast`, but will schedule the Pod on any available node if no match is found.

### Taints and Tolerations

Taints prevent Pods from being scheduled on (or, with the `NoExecute` effect, remaining on) a node unless they tolerate the taint.

#### Toleration Example

```yaml
tolerations:
  - key: "server"
    operator: "Equal"
    value: "ap-east"
    effect: "NoExecute"
    tolerationSeconds: 3600
```

This configuration allows the Pod to tolerate a `server=ap-east:NoExecute` taint; with `tolerationSeconds` set, the Pod may remain on the tainted node for 3600 seconds after the taint is applied, after which it is evicted.
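A toleration only has an effect when a matching taint actually exists on a node. As a minimal sketch (assuming an illustrative node name of `node-1`), the taint above could be applied and later removed with `kubectl taint`:

```bash
# Apply the taint; Pods without a matching toleration are not scheduled
# to node-1, and running Pods without one are evicted (NoExecute)
$ kubectl taint nodes node-1 server=ap-east:NoExecute

# Remove the taint by appending a trailing hyphen
$ kubectl taint nodes node-1 server=ap-east:NoExecute-
```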
### Custom Schedulers

If the default scheduling mechanisms are insufficient, you can write your own scheduler. Custom schedulers are programmed and deployed in the cluster alongside the default one, and each Pod can specify which scheduler to use in its `schedulerName` field.

### Viewing Scheduler Information

Scheduling decisions are recorded as cluster events, which you can view with:

```bash
$ kubectl get events
```

The `Events` section of `kubectl describe pod` output shows the scheduling outcome for an individual Pod.

### Advanced Scheduling for Machine Learning

When using Kubernetes for machine learning, efficient scheduling is essential for performance and resource utilization. Kubernetes supports GPU scheduling, allowing you to leverage specialized hardware for ML workloads.

#### Using Kubernetes for ML with GPUs

GPUs are advertised to the scheduler by a vendor device plugin (for example, the NVIDIA device plugin) and are requested under a container's resource limits. To request a GPU, use a configuration like the following:

```yaml
spec:
  containers:
    - name: ml-container
      image: ml-image
      resources:
        limits:
          nvidia.com/gpu: 1
```

### Kubeflow for Machine Learning

Kubeflow is a Kubernetes-native platform for deploying, managing, and scaling machine learning models. It provides components for the various stages of the ML lifecycle, including training, serving, and monitoring.

#### Deploying Kubeflow

The installation procedure changes between Kubeflow releases, so consult the Kubeflow manifests repository for the current instructions. The general pattern is to apply the published manifests to the cluster, for example:

```bash
$ kubectl apply -f https://raw.githubusercontent.com/kubeflow/manifests/master/kubeflow/kubeflow.yaml
```

Kubeflow integrates with Kubernetes scheduling to manage ML workloads efficiently, making use of features such as GPU scheduling, autoscaling, and resource management.

### Conclusion

Effective scheduling in Kubernetes is crucial for resource management and efficient application deployment. By leveraging features like node and Pod affinity, taints, tolerations, and custom schedulers, you can fine-tune Pod placement to meet your specific needs. When deploying machine learning workloads, Kubernetes' support for GPU scheduling and platforms like Kubeflow can significantly enhance performance and scalability. A combined sketch putting several of these mechanisms together follows below.
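As a closing example, here is a minimal sketch combining several of the mechanisms above in one Pod specification. The Pod name `ml-trainer`, the `dedicated=gpu` taint, and the `accelerator` node label are illustrative assumptions, not values defined elsewhere in these notes; the container and GPU limit are taken from the earlier GPU example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer                    # illustrative name
spec:
  schedulerName: default-scheduler    # or the name of a custom scheduler
  tolerations:
    - key: "dedicated"                # illustrative taint reserving GPU nodes
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator      # illustrative node label
                operator: In
                values:
                  - nvidia
  containers:
    - name: ml-container
      image: ml-image
      resources:
        limits:
          nvidia.com/gpu: 1
```

Continue: [[12-Logging]]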