AWS
- Keep the EC2 user-data file generic. Having to modify user-data can cause major issues, so prefer to offload runtime-based configuration to something like Ansible.
- AWS VPC CNI for EKS is not required. Although it's an interesting option, you are free to run an alternate CNI plugin.
Kubernetes
- Applications with slow startup times should have a lower scaling (HPA) threshold so they can scale quickly enough to meet load demands.
- Kubernetes clusters running multiple host sizes should taint the nodes and give pods matching tolerations so that pods run on the correct hosts.
- If automation of cloud services via Kubernetes is “in the cards”, make sure all the dependencies can also be automated.
- Example: I once automated the use of AWS ALBs, ELBs and Route53 DNS via Kubernetes. Eventually we chose to use CloudFront as well, but there was no automation for it via Kubernetes (at the time of this writing). This left us either maintaining CloudFront manually or writing Terraform for it separately.
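
To see why a lower HPA threshold helps slow-starting applications, here is a minimal sketch of the replica calculation documented for the Horizontal Pod Autoscaler, `desired = ceil(current * currentMetric / targetMetric)`. The function name and the sample numbers are illustrative assumptions, and the real controller also applies a tolerance band and stabilization windows that this sketch ignores.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Core HPA formula: ceil(current_replicas * current / target)."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Hypothetical workload: 4 pods averaging 60% CPU utilization.
# With an 80% target, ceil(4 * 60 / 80) = 3, so no scale-up happens yet.
relaxed = desired_replicas(4, 60, 80)
# With a 50% target, ceil(4 * 60 / 50) = 5, so a new pod starts warming up early.
eager = desired_replicas(4, 60, 50)
print(relaxed, eager)
```

With the lower target, the extra pod is requested while the existing pods still have headroom, which gives a slow-starting application time to become ready before load peaks.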
Language specific best practices:
Node.js
- Node.js applications expect 1 CPU and 1.5 GB of RAM by default. Make sure any application running Node.js has its QoS set to 1 CPU and 1.75 GB of RAM. Node.js apps without this run the risk of getting themselves killed, because they assume they have the default amount regardless of what is set for QoS. The alternative is to change the default resource requirements of Node.js, but many do not recommend doing this.
- Node.js applications rely on DNS more heavily when making requests because they do not cache DNS entries by default. This tends to put a significant amount of load on Kubernetes DNS.
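
The DNS point above comes down to caching lookups in the client. As a language-neutral illustration (written in Python rather than Node.js), here is a minimal sketch of a TTL cache placed in front of a resolver. `TTLDnsCache`, the 30-second TTL, the hostname, and the fake resolver are hypothetical names and values chosen for the example, not part of any Node.js or Kubernetes API.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLDnsCache:
    """Cache resolver results per hostname for ttl_seconds."""

    def __init__(self, resolve: Callable[[str], List[str]],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        # hostname -> (expiry time, resolved addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, hostname: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh: no DNS query is sent
        addresses = self._resolve(hostname)
        self._entries[hostname] = (now + self._ttl, addresses)
        return addresses


# Usage with a fake resolver that records how often "DNS" is actually hit.
calls: List[str] = []


def fake_resolve(hostname: str) -> List[str]:
    calls.append(hostname)
    return ["10.0.0.1"]


cache = TTLDnsCache(fake_resolve, ttl_seconds=30.0)
for _ in range(100):
    cache.lookup("orders.default.svc.cluster.local")
print(len(calls))  # one upstream query instead of 100
```

In a real client, `resolve` would wrap an actual lookup call, and the TTL should stay short enough that changes to service addresses are still picked up.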
Java
- Many Java applications utilize off-heap memory. Ensure QoS memory allocation for a Java app accounts for off-heap memory use.
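- Older Java versions (Java 8 before update 191, and earlier) use the host server's CPU count to determine how much CPU is available, ignoring the amount set by Docker. This can lead to heavy throttling or crashes when the app attempts to consume more CPU than it is allowed.
- Java 10+ properly detects the CPU allocated to the container; the same support was backported to Java 8 in update 191.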
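
To make the CPU detection difference concrete, here is a rough sketch of the arithmetic a container-aware runtime can do: derive a CPU count from the cgroup CFS quota and period instead of trusting the host's core count. The function name and sample values are assumptions for illustration. This is not the JVM's actual implementation, which also takes CPU shares and cpusets into account.

```python
import math


def effective_cpus(host_cpus: int, cfs_quota_us: int, cfs_period_us: int) -> int:
    """Estimate usable CPUs from a cgroup CFS quota (a quota of -1 means no limit)."""
    if cfs_quota_us <= 0 or cfs_period_us <= 0:
        return host_cpus  # no limit set: fall back to the host count
    limited = math.ceil(cfs_quota_us / cfs_period_us)
    return max(1, min(host_cpus, limited))


# Hypothetical 32-core host running a container limited to 2 CPUs
# (a quota of 200000us per 100000us period).
print(effective_cpus(32, 200_000, 100_000))  # 2
# With no quota, the host count is all there is to go on: 32 cores.
print(effective_cpus(32, -1, 100_000))  # 32
```

A runtime that sizes its GC threads and thread pools for 32 cores while the container is capped at 2 will spend much of its time throttled, which is the failure mode described above.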