
Kubernetes Autoscaling

Last Updated : 07 Dec, 2024

A central promise of the cloud and Kubernetes is elasticity: the ability to add new nodes when the existing ones fill up, and to remove them again when demand drops. Kubernetes solves this with autoscalers, components that scale resources up and down according to usage; this approach is called Kubernetes autoscaling. There are three different methods of Kubernetes autoscaling:

  • Horizontal Pod Autoscaler (HPA)
  • Vertical Pod Autoscaler (VPA)
  • Cluster Autoscaler (CA)

Before Kubernetes, teams wrote code and pushed it to physical servers in a data center, managing the resources each server needed to run the application smoothly; the alternative was deploying the code to virtual machines (VMs). VMs bring problems of their own: the hardware and software components they require are costly, and they carry some security risks. This is where Kubernetes comes in. It is an open-source platform that lets users deploy, manage, and maintain groups of containers, acting like a tool that manages multiple Docker environments together. Kubernetes (K8s) overcomes many of the problems faced with VMs.

1. Kubernetes Horizontal Pod Autoscaling (HPA)

The Horizontal Pod Autoscaler (HPA) is a controller that scales most pod-based resources up and down based on your application's workload. It does this by changing the number of replicas of your pods once certain preconfigured thresholds are crossed; for many of the applications we deploy, scaling depends on a single metric, usually CPU usage. To use HPA, we define the minimum and maximum number of pods for a particular application, along with a target utilization such as a CPU or memory percentage. Once HPA is enabled for an application, Kubernetes automatically monitors the workload and scales the pods up and down within the limits we have defined.

For example, consider an application like Airbnb running in Kubernetes. It experiences a surge of user traffic whenever there is an offer on hotel or flight bookings, and if the application is not optimized for that traffic, users may see slow response times or even downtime. With HPA, you specify a target CPU usage percentage, a minimum and maximum number of running pods, and other parameters, and Kubernetes automatically increases the number of pods to manage the increased traffic when CPU utilization reaches the specified level.

YAML code for HPA

apiVersion: autoscaling/v2
# this specifies the Kubernetes API version
kind: HorizontalPodAutoscaler
# this specifies the kind of Kubernetes object, e.g. HPA or VPA
metadata:
  name: name-of-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1   # Deployments live in apps/v1
    kind: Deployment
    name: name-of-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 40

The 'averageUtilization' field specifies the target utilization the HPA aims for when scaling the deployment. Here it is set to 40% for both CPU and memory, meaning the HPA will attempt to keep the average utilization of the deployment's pods at or below 40%. This YAML automatically scales the specified deployment between a minimum of 1 and a maximum of 10 replicas: if the average CPU or memory utilization of the containers exceeds 40%, the HPA scales the deployment up to maintain optimal performance.
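With the manifest above saved as hpa.yaml (the file name and app name are placeholders), it can be applied and watched like this:

kubectl apply -f hpa.yaml
kubectl get hpa name-of-app --watch

The same effect can be achieved imperatively with kubectl autoscale deployment name-of-app --cpu-percent=40 --min=1 --max=10, though that form only supports the CPU metric.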

Working of Horizontal Pod Autoscaler

The working of HPA can be broken down into these key steps:

  1. Metrics Collection: The HorizontalPodAutoscaler continuously monitors the resource usage (e.g., CPU, memory) of the pods in your deployment. This is typically achieved by the Kubernetes Metrics Server, which collects data at regular intervals (default: every 15 seconds).
  2. Threshold Comparison: The collected resource metrics are compared against the desired threshold (e.g., CPU usage target of 60%). If the usage exceeds the target threshold, Kubernetes determines that the application requires more resources, and HPA triggers an action to add more pods.
  3. Scaling Logic: The HPA uses this logic to decide when and how much to scale:
    • Scale Up: When resource utilization surpasses the defined threshold, HPA increases the number of pods. For instance, if the CPU utilization exceeds 70% across multiple pods, HPA might add more replicas to distribute the load evenly.
    • Scale Down: If resource utilization falls below the threshold (e.g., CPU usage drops to 30%), HPA scales down by removing some of the pods, ensuring resource efficiency during low traffic periods.
  4. Feedback Loop: HPA operates in a feedback loop. As the traffic and resource demand changes, HPA will continuously adjust the pod count in response to real-time data. This ensures the system dynamically adapts to current workloads.
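Under the hood, the replica count is computed with the formula from the Kubernetes documentation:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

For example, if 4 pods average 80% CPU against a 40% target, the HPA computes ceil(4 × 80 / 40) = 8 replicas; when multiple metrics are configured, the largest result wins.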

Limitations of HPA

The HorizontalPodAutoscaler (HPA) is great for scaling applications automatically in Kubernetes but it does have limitations that can impact its use in real-world scenarios:

  1. Limited Metric Support: HPA mainly uses CPU and memory for scaling which may not represent the true load. Applications often need to scale based on other factors like request rates or network traffic. Custom metrics can be added but this requires extra setup and complexity.
  2. Cold Starts and Delays: When HPA scales up there is a delay before new pods are ready. This can lead to performance drops when the current pods are overloaded. Pre-warming pods or planning for spikes can help but it requires more effort and resources.
  3. Reactive Scaling: HPA reacts after thresholds are breached rather than scaling proactively. This can leave your application under-provisioned during sudden traffic spikes causing poor performance. You can use predictive scaling models but that adds complexity to infrastructure.
  4. One Metric at a Time: HPA typically scales based on one metric like CPU or memory. Many applications need multiple factors like network or request rate considered together. To handle this you can use tools like KEDA but it increases operational overhead.
  5. Handling Burst Traffic: HPA struggles with burst traffic since it does not scale fast enough to handle sudden demand spikes. Using queue-based systems like RabbitMQ can help manage bursts but adds more complexity.
  6. Scaling Granularity: HPA scales pods as whole units which may be inefficient for applications that need finer control over resources like just increasing CPU. For more precise scaling the VerticalPodAutoscaler (VPA) can adjust resources for individual pods.
  7. Dependence on Metrics: HPA relies on the availability and accuracy of resource metrics. If the Metrics Server fails HPA cannot make scaling decisions which can lead to resource issues. Ensuring high availability for metrics is crucial.
  8. Fixed Scaling Intervals: HPA checks metrics at fixed intervals which can miss short traffic spikes. This can lead to delayed scaling or inefficient resource usage in dynamic environments. Adjusting the interval or combining HPA with event-driven scaling can help.
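Several of these limitations, in particular reactive scaling, burst traffic, and fixed intervals, can be softened with the optional behavior field of the autoscaling/v2 API, which tunes how aggressively HPA moves in each direction. A minimal sketch, where all values are illustrative:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
      - type: Pods
        value: 4                       # add at most 4 pods per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before shrinking
      policies:
      - type: Percent
        value: 50                      # remove at most half the pods per minute
        periodSeconds: 60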

For a practical implementation guide on how to set up autoscaling in Amazon EKS, refer to - Implementing Autoscaling in Amazon EKS

Usage and Cost Reporting with HPA

The Horizontal Pod Autoscaler (HPA) in Kubernetes helps keep applications performing optimally by adjusting the number of pod replicas based on demand to avoid over-provisioning and reduce costs. This guide explains how to monitor and report on HPA-driven usage to manage costs effectively.
Tracking HPA’s impact on costs helps avoid unnecessary expenses while capturing usage patterns to refine scaling decisions based on real data.

Setting Up Usage and Cost Reporting with HPA

  1. Define Metrics and Cost Allocation to track CPU, memory, and scaling events, with tags for accurate cost attribution.
  2. Use Monitoring Tools like Prometheus and Grafana to visualize usage patterns and compare metrics with cost data (see the sample queries below).
  3. Add Custom Metrics to tailor HPA to specific application needs and keep scaling efficient.
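As a sketch of the monitoring step, assuming kube-state-metrics is installed alongside Prometheus, a Grafana panel could plot HPA-driven replica counts against their configured bounds with PromQL queries such as:

kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="name-of-app"}
kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="name-of-app"}

Joining these series with your cloud provider's per-pod cost data then shows what each scaling event actually costs.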

Cost Optimization Tips

  1. Spot Cost Anomalies by identifying scaling events that drive up costs unexpectedly.
  2. Refine HPA with Historical Data by adjusting thresholds and cooldowns to reduce unneeded scaling.
  3. Automate Reporting to maintain insight into usage trends and make informed, cost-conscious scaling choices.

2. Kubernetes Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler (VPA) is a Kubernetes tool that automatically adjusts the CPU and memory requests and limits of containers based on their past resource utilization. Used appropriately, it helps allocate resources inside a Kubernetes cluster effectively and automatically, down to the level of individual containers. By managing a pod's resource requests and limits, VPA improves the pod's performance and efficiency, and it can lower the cost of running the application by reducing wasted resources.
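Unlike HPA, VPA does not ship with the Kubernetes core; it lives in the kubernetes/autoscaler repository. One common installation path, following that repository's instructions, is:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh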

The VPA deployment has three components namely:

  • VPA Admission Controller
  • VPA Recommender
  • VPA Updater

2.1 VPA Admission Controller

This component makes sure that any new or updated Pod spec complies with the VPA criteria before the Pod is created or changed in the cluster. The Admission Controller intercepts all pod creation and update requests and applies a set of rules, configured according to the active VPA policy, to the pod specification. It also checks that the Kubernetes resources being created or altered conform to the VPA policy.

2.2 VPA Recommender

This component suggests resource requests and limits for the individual containers in a pod, based on the resource utilization of those containers over time. The Recommender receives consumption data from the Kubernetes Metrics Server, which provides resource-usage statistics for all containers running in the cluster, and from this data it generates per-container recommendations, taking into account factors such as past usage, current limits, and the pod's requirements.

2.3 VPA Updater

This component applies the changes produced by the VPA Recommender. It continuously monitors the Recommender's suggestions and updates the Pod spec with the recommended resource requests and limits; in practice it evicts pods whose current resources deviate from the recommendation, so that they are recreated with the new values applied through the Kubernetes API server. The Updater also makes sure the updated requests and limits conform to the active VPA policy: if the new values do not satisfy the policy's requirements, it rejects the update and the pod is not changed.

YAML file for VPA:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: example-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: example-container
      minAllowed:
        cpu: 110m
        memory: 150Mi
      maxAllowed:
        cpu: 500m
        memory: 1Gi
      mode: "Auto"

Here 'resourcePolicy' specifies the resource policies the VPA should use. In this case there is a single container policy, for the container named "example-container". The minAllowed and maxAllowed fields specify the minimum and maximum allowed resource requests and limits, respectively, and mode is set to "Auto", which means the VPA will automatically adjust the container's resource requests and limits within the specified range.
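Assuming the manifest is saved as vpa.yaml, it can be applied and the Recommender's current suggestions inspected directly on the object:

kubectl apply -f vpa.yaml
kubectl describe vpa example-vpa

The Status section of the describe output lists the recommended target, lower bound, and upper bound for each container.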
 


3. Kubernetes Cluster Autoscaler (CA)

The Cluster Autoscaler is a tool that dynamically changes the number of nodes in a node pool according to the requirements of your workloads, and scales back down to a minimum size that you choose when demand is low. This increases the availability of your workloads when you need it. There is no need to add or remove nodes manually: we set a minimum and maximum size for the node pool, and the Cluster Autoscaler takes care of the rest.

For example, if your workload comprises a controller with a single replica, that replica's Pod may be rescheduled onto a new node when its current node is removed. Design your workloads to tolerate unexpected interruptions, or make sure that crucial Pods are not disrupted, before activating the Cluster Autoscaler. CA does not make scaling choices based on actual CPU or memory use; it looks only at a pod's requested and allotted CPU and memory. Because of this limitation, CA cannot identify unused compute resources that users have requested, which can leave the cluster inefficient and wasteful. CA removes nodes, down to the node pool's minimum size, when nodes are underutilized and all Pods can still be scheduled on fewer nodes, but it will not try to scale down a node hosting Pods that cannot relocate elsewhere in the cluster. CA also does not address resource shortages on nodes when pods have requested insufficient resources or left insufficient defaults in place. Explicitly requesting resources for each workload helps the Cluster Autoscaler operate as correctly as possible, as in the snippet below.
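A minimal sketch of such explicit resource requests on a container, where the names and values are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: nginx            # placeholder image
    resources:
      requests:
        cpu: 250m           # what CA and the scheduler reason about
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi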

Deploying the Cluster Autoscaler:

There is no ClusterAutoscaler object kind in the core Kubernetes API. The Cluster Autoscaler runs as an ordinary Deployment (typically in the kube-system namespace) and is configured through command-line flags. A simplified sketch for AWS, where the image tag and cluster name are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --balance-similar-node-groups
        # only manage node groups tagged for this cluster
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-kubernetes-cluster

The --node-group-auto-discovery flag tells the Cluster Autoscaler which node groups to manage: it scales only groups tagged with "k8s.io/cluster-autoscaler/enabled" and with the tag carrying the cluster's name (here my-kubernetes-cluster, the cluster the autoscaler runs in). The --balance-similar-node-groups flag specifies whether the Cluster Autoscaler should attempt to balance similar node groups when scaling the cluster. The minimum and maximum node counts themselves are set on the node groups, for example an Auto Scaling group's min/max size on AWS, rather than in this manifest.
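To verify what the autoscaler is doing, it records its view of the cluster in a status ConfigMap and in its logs:

kubectl -n kube-system describe configmap cluster-autoscaler-status
kubectl -n kube-system logs -f deployment/cluster-autoscaler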

Kubernetes HPA vs VPA

| Feature | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) |
|---|---|---|
| Purpose | Scales the number of pod replicas | Adjusts CPU and memory resources within individual pods |
| Primary Metric | CPU and memory usage, or custom metrics | CPU and memory usage |
| Use Case | Handling fluctuating demand by adding/removing pods | Optimizing resource allocation for existing pods |
| Scaling Direction | Horizontal (increases/decreases the number of pods) | Vertical (adjusts resources for existing pods) |
| Ideal For | Applications needing more instances during high demand | Applications requiring optimized resources per pod |
| Impact on Application Design | Minimal; scales out by adding more pods | May require adjustments if resources are constrained |
| Common Usage Scenarios | Web applications, microservices | Resource-intensive applications, background processing |
| Configuration Complexity | Typically straightforward | Requires tuning to avoid excessive scaling |

Conclusion

Kubernetes Autoscaling plays a crucial role in modern application deployment, enabling dynamic and efficient management of resources that respond to changing demands. By implementing Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler, organizations can ensure their applications remain resilient, responsive, and cost-effective. Autoscaling not only optimizes resource usage but also enhances application performance by allocating resources precisely when and where they’re needed. With Kubernetes Autoscaling, businesses can manage traffic surges seamlessly, reduce operational costs, and improve the overall user experience, making it an indispensable tool for any cloud-native environment.

