The Kubernetes Scheduler is a critical piece of the overall platform, but its functionality and capabilities are not widely known. Why? Because for the most part the scheduler just runs out of the box with little to no additional configuration.
So what does this thing do? It determines which server in the cluster a new pod should run on. Pretty simple, yet oh so complex. The scheduler has to very quickly answer questions like the following (a sample pod spec after the list illustrates several of these inputs):
How much resource (memory, CPU, disk) is this pod going to require?
What workers (minions) in the cluster have the resources available to manage this pod?
Are there external ports associated with this pod? If so, what hosts may already be utilizing that port?
Does the pod config have nodeSelector set? If so, which of the workers have a label fitting this requirement?
Has a weight been added to a given policy?
What affinity rules are in place for this pod?
What anti-affinity rules apply to this pod?
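To make those inputs concrete, here is a minimal pod spec (the name, image, and values are illustrative, not taken from our cluster) showing the fields the scheduler looks at: resource requests, an external host port, and a nodeSelector.

apiVersion: v1
kind: Pod
metadata:
  name: example-web                # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.9               # illustrative image
    resources:
      requests:
        cpu: 250m                  # the scheduler sums requests like these per node
        memory: 256Mi
    ports:
    - containerPort: 80
      hostPort: 8080               # external port the scheduler must check against each host
  nodeSelector:
    disktype: ssd                  # only nodes labeled disktype=ssd will be considered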
All of these questions and more are answered through two concepts within the scheduler: predicates and priority functions.
Predicates – as the name suggests, predicates are pass/fail checks that filter the cluster down to the set of hosts capable of running a given pod.
Priority functions – assign each remaining host a score between 0 and 10, with 0 being the worst fit and 10 being the best.
These two concepts combined determine where a given pod will be hosted in the cluster.
OK, so let's look at the default configuration as of Kubernetes 1.2.
{ "kind" : "Policy", "version" : "v1", "predicates" : [ {"name" : "PodFitsPorts"}, {"name" : "PodFitsResources"}, {"name" : "NoDiskConflict"}, {"name" : "MatchNodeSelector"}, {"name" : "HostName"} ], "priorities" : [ {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1}, {"name" : "ServiceSpreadingPriority", "weight" : 1} ] }
The predicates listed perform the following actions. I think they are fairly obvious, but I'm going to list their functions for posterity.
{"name" : "PodFitsPorts"} – Makes sure the pod doesn't require host ports that are already taken on a node
{"name" : "PodFitsResources"} – Ensures enough CPU and memory are available on the host for the given pod
{"name" : "NoDiskConflict"} – Makes sure that if the pod has local disk requirements, the host can fulfill them
{"name" : "MatchNodeSelector"} – If nodeSelector is set, determines which nodes carry a matching label
{"name" : "HostName"} – A pod can be pinned to a specific host by name; this predicate filters out every other node
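As a quick illustration of that last predicate, here is a hedged sketch (the pod and node names are made up) of pinning a pod to a host by name; HostName simply compares spec.nodeName against each candidate node's name.

apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod                                      # illustrative name
spec:
  nodeName: ip-10-0-1-23.us-west-2.compute.internal     # illustrative node name
  containers:
  - name: app
    image: nginx:1.9                                    # illustrative image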
Priority Functions: These get a little more interesting.
{"name" : "LeastRequestedPriority", "weight" : 1} – Favors the nodes with the most free capacity, scoring each one on the percentage of its resources that would remain unrequested after placing the pod (based on what the pod requested, not on actual usage).
{"name" : "BalancedResourceAllocation", "weight" : 1} – Favors nodes where CPU and memory would end up roughly equally committed, so one resource isn't exhausted long before the other.
{"name" : "ServiceSpreadingPriority", "weight" : 1} – Minimizes the number of pods belonging to the same service living on the same host.
So here is where things start to get really cool with the scheduler. As of v1.2, Kubernetes has built-in support for spreading pods across multiple zones (Availability Zones in AWS). This works for both GCE and AWS. We run in AWS, so I'm going to show the config for that here; set up accordingly for GCE.
All you have to do in AWS is label your workers (minions) properly and Kubernetes will handle the rest. It is a very specific set of labels you must use. Now I will say, we added a little weight to ServiceSpreadingPriority to make sure Kubernetes gave more priority to spreading pods across AZs.
kubectl label nodes <server_name> failure-domain.beta.kubernetes.io/region=$REGION
kubectl label nodes <server_name> failure-domain.beta.kubernetes.io/zone=$AVAIL_ZONE
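If you want to double-check that the labels took, describing the node will show them in its Labels section:

kubectl describe node <server_name>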
You'll notice the label looks funny. 'failure-domain' made a number of my Ops colleagues cringe when they saw it for the first time, before they understood its meaning. One of them happened to be looking at our newly created cluster and thought we already had an outage. My bad!
You will notice $REGION and $AVAIL_ZONE are variables we set.
We define $REGION in Terraform during the cluster build, but it looks like any typical AWS region.
REGION="us-west-2"
We derive the availability zone on the fly by having each EC2 instance query the AWS instance metadata API via curl. That IP address is the metadata endpoint available to every EC2 instance, so you can literally copy this command and use it.
AVAIL_ZONE=`curl http://169.254.169.254/latest/meta-data/placement/availability-zone`
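Putting those pieces together, the per-node labeling step ends up looking roughly like this sketch (the hostname-to-node-name mapping is an assumption; use whatever name your kubelets register with):

REGION="us-west-2"                          # set during cluster build (Terraform in our case)
AVAIL_ZONE=`curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone`
NODE_NAME=`hostname -f`                     # assumption: the kubelet registers with the FQDN
kubectl label nodes $NODE_NAME failure-domain.beta.kubernetes.io/region=$REGION
kubectl label nodes $NODE_NAME failure-domain.beta.kubernetes.io/zone=$AVAIL_ZONE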
IMPORTANT NOTE: If you create a custom policy for the scheduler, you MUST include everything you want in it. The DEFAULT predicates and priorities will not be applied unless you place them in the config yourself. Here is our policy.
{ "kind" : "Policy", "version" : "v1", "predicates" : [ {"name" : "PodFitsPorts"}, {"name" : "PodFitsResources"}, {"name" : "NoDiskConflict"}, {"name" : "MatchNodeSelector"}, {"name" : "HostName"} ], "priorities" : [ {"name" : "ServiceSpreadingPriority", "weight" : 2}, {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1} ] }
And within the kube-scheduler.yaml config we have:
- --policy-config-file=/path/to/customscheduler.json
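For context, that flag lives on the kube-scheduler container's command line. Here is a hedged sketch of what the surrounding static pod manifest might look like (the image tag, master address, file paths, and hostPath mount are assumptions; adjust to your deployment):

apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: gcr.io/google_containers/hyperkube:v1.2.0    # assumed image/tag
    command:
    - /hyperkube
    - scheduler
    - --master=127.0.0.1:8080                           # assumes a local apiserver endpoint
    - --policy-config-file=/path/to/customscheduler.json
    volumeMounts:
    - name: scheduler-policy
      mountPath: /path/to                               # assumed mount matching the path above
      readOnly: true
  volumes:
  - name: scheduler-policy
    hostPath:
      path: /path/to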
Alright, and if that wasn't enough, you can write your own schedulers and run them within Kubernetes. Personally, I've not had to do this, but here is a link that provides more information if you are interested.
And if you need more depth on Kubernetes scheduling, the best article I've seen written on it is from OpenShift. There you can find more information on affinity/anti-affinity, configurable predicates, and configurable priority functions.