### kube-scheduler
As Kubernetes deployments grow in size and complexity, effective scheduling becomes crucial. The `kube-scheduler` is responsible for deciding which node will run each Pod, using a two-phase algorithm that first filters and then scores candidate nodes.
#### How kube-scheduler Works
The scheduler watches for Pods that have no node assigned, filters the cluster's nodes, and scores the remaining candidates to determine the best fit for each Pod. The Pod is then bound to the selected node, whose kubelet creates the containers. Default scheduling can be influenced through node or Pod labels, taints, tolerations, and affinities.
### Node Selection Process
#### Filtering Stage
The scheduler identifies nodes that can run the Pod, based primarily on the Pod's resource requests along with other constraints such as node selectors and taints. If no nodes qualify, the Pod remains unscheduled in the `Pending` state.
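For example, a container's declared requests are what the filter must fit into a node's allocatable capacity. A minimal sketch (the name, image, and values are illustrative):
```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"      # filtering checks these against node allocatable
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "512Mi"
```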
#### Scoring Stage
The scheduler rates the remaining nodes to determine the best placement. Each qualifying node receives a score based on the scheduler's configuration. The node with the highest score is chosen.
### Scheduling Configuration
You can customize the scheduler's behavior by writing a configuration file and passing its path to the `--config` flag. Scheduling profiles let you configure the different stages of scheduling with plugins that modify scheduler behavior.
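For instance, assuming the file is saved at a hypothetical path such as `/etc/kubernetes/scheduler-config.yaml`, the scheduler would be started with:
```bash
$ kube-scheduler --config=/etc/kubernetes/scheduler-config.yaml
```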
#### Extension Points
The scheduler configuration exposes twelve extension points where plugins can be enabled or disabled:
- queueSort
- preFilter
- filter
- postFilter
- preScore
- score
- reserve
- permit
- preBind
- bind
- postBind
- multiPoint
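The first eleven correspond to phases of a scheduling cycle; `multiPoint` is a convenience field that enables or disables a plugin at every extension point it implements. A minimal sketch (the plugin name `MyPlugin` is hypothetical):
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: MyPlugin   # enabled at every point the plugin registers for
```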
#### Example of Custom Scheduling Profile
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: custom-scheduler
  plugins:
    preFilter:
      disabled:
      - name: '*'
    filter:
      disabled:
      - name: '*'
    postFilter:
      disabled:
      - name: '*'
```
In this example, the scheduler runs two profiles: one with default plugins and another with all filtering plugins disabled.
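A Pod opts into a profile through its `schedulerName` field. A minimal Pod sketch selecting the second profile above (the Pod name and image are placeholders):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  schedulerName: custom-scheduler   # must match a profile's schedulerName
  containers:
  - name: app
    image: nginx
```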
### Pod Specification for Scheduling
Most scheduling decisions are made in the Pod specification using fields such as:
- `nodeName`
- `nodeSelector`
- `affinity`
- `schedulerName`
- `tolerations`
#### Specifying Node Label with nodeSelector
```yaml
spec:
  containers:
  - name: redis
    image: redis
  nodeSelector:
    net: fast
```
This configuration ensures the Pod is scheduled only on a node carrying the label `net: fast`; if no such node exists, the Pod stays `Pending`.
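The matching label can be applied to a node with `kubectl label` (the node name `worker1` is a placeholder):
```bash
$ kubectl label nodes worker1 net=fast
```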
### Affinity and Anti-Affinity
#### Pod Affinity Example
```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: kubernetes.io/hostname   # required for pod affinity terms
```
This configuration requires the Pod to be placed on a node already running a Pod labeled `security: S1`. The `topologyKey` field is mandatory for pod affinity terms; here `kubernetes.io/hostname` scopes co-location to a single node, while a key such as `topology.kubernetes.io/zone` would only require the same zone.
#### Pod Anti-Affinity Example
```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname   # required here as well
```
This setting prefers nodes that are not running Pods labeled `security: S2`, but the Pod will still be scheduled onto one if no other node is available.
### Node Affinity
Node affinity allows scheduling based on node labels. This is similar to `nodeSelector` but offers more flexibility and will eventually replace it.
#### Node Affinity Example
```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: diskspeed
            operator: In
            values:
            - quick
            - fast
```
This configuration prefers nodes with `diskspeed: quick` or `diskspeed: fast`, but will schedule the Pod on any available node if no matches are found.
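If the constraint must be hard rather than preferred, the `required` variant can be used instead; a sketch reusing the same hypothetical `diskspeed` label:
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: diskspeed
            operator: In
            values:
            - quick
```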
### Taints and Tolerations
Taints are set on nodes and prevent Pods from being scheduled there unless the Pods declare a matching toleration. Three effects are available: `NoSchedule`, `PreferNoSchedule`, and `NoExecute` (which also evicts already-running Pods that do not tolerate the taint).
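A taint matching the toleration shown below could be applied with `kubectl taint` (the node name `worker1` is a placeholder):
```bash
$ kubectl taint nodes worker1 server=ap-east:NoExecute
```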
#### Toleration Example
```yaml
tolerations:
- key: "server"
  operator: "Equal"
  value: "ap-east"
  effect: "NoExecute"
  tolerationSeconds: 3600
```
This toleration matches a taint with key `server`, value `ap-east`, and effect `NoExecute`. With `tolerationSeconds: 3600`, the Pod may keep running on the tainted node for 3600 seconds after the taint is applied, after which it is evicted.
### Custom Schedulers
If the default scheduling mechanisms are insufficient, you can write your own scheduler and deploy it in the cluster; Pods then select it through their `schedulerName` field.
### Viewing Scheduler Information
Use the following command to view scheduler events and other information:
```bash
$ kubectl get events
```
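To narrow the output to scheduling failures, the events can be filtered by reason (`FailedScheduling` is the reason the scheduler records when no node fits a Pod):
```bash
$ kubectl get events --field-selector reason=FailedScheduling
```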
### Advanced Scheduling for Machine Learning
When using Kubernetes for machine learning, efficient scheduling is essential for performance and resource utilization. Kubernetes supports GPU scheduling, allowing you to leverage powerful hardware for ML workloads.
#### Using Kubernetes for ML with GPUs
To schedule Pods that require GPUs, specify the GPU resource in your Pod's limits. GPUs are exposed to Kubernetes by a vendor device plugin (NVIDIA's advertises the `nvidia.com/gpu` resource); they must be requested in whole units and cannot be overcommitted. For example, to request one GPU:
```yaml
spec:
  containers:
  - name: ml-container
    image: ml-image
    resources:
      limits:
        nvidia.com/gpu: 1
```
### Kubeflow for Machine Learning
Kubeflow is a Kubernetes-native platform for deploying, managing, and scaling machine learning models. It provides components for various stages of the ML lifecycle, including training, serving, and monitoring.
#### Deploying Kubeflow
Kubeflow is installed from the `kubeflow/manifests` repository, typically with kustomize; the exact entry point changes between releases, so follow that repository's README. The single-command install documented there resembles the following sketch (assuming `example` is still the top-level kustomization):
```bash
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
```
Kubeflow integrates with Kubernetes scheduling to efficiently manage ML workloads, utilizing features such as GPU scheduling, auto-scaling, and resource management.
### Conclusion
Effective scheduling in Kubernetes is crucial for resource management and efficient application deployment. By leveraging features like node and pod affinity, taints, tolerations, and custom schedulers, you can fine-tune Pod placement to meet your specific needs. When deploying machine learning workloads, Kubernetes' support for GPU scheduling and platforms like Kubeflow can significantly enhance performance and scalability.
Continue: [[12-Logging]]