Horizontal Pod Autoscaler

๐Ÿš€ Use Case: Scalable Web App with Autoscaling

Imagine you're running a Node.js web application inside Kubernetes:

  • CPU-intensive under load
  • Memory is stable
  • Needs fast scale-up to handle spikes
  • Needs slow scale-down to avoid flapping

We'll build this scenario around the structure you posted.


๐Ÿง  Section-by-Section Deep Dive


๐Ÿ”— scaleTargetRef

scaleTargetRef:
  apiVersion: apps/v1
  kind: Deployment
  name: webapp-deployment
  • What this does: Points to the workload to autoscale.
  • This must be a scalable resource, like a Deployment, ReplicaSet, or StatefulSet.

๐Ÿ“ˆ metrics

Defines what to measure in order to decide when to scale.

There are 6 types supported:

1. resource (Common: CPU or Memory)

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
  • Use-case: Scale when average CPU usage across pods exceeds 70%
  • Kubernetes gets this from the metrics-server

โœ… Simple and effective for most workloads


2. containerResource

- type: ContainerResource
  container: app
  name: cpu
  target:
    type: Utilization
    averageUtilization: 80
  • Use-case: Scale based on container-specific resource usage.
  • Useful when a pod runs multiple containers, and you only care about one.

- type: External
  external:
    metric:
      name: queue_length
    target:
      type: AverageValue
      averageValue: "10"
  • Use-case: Scale based on external systems, like:

  • Cloud Pub/Sub queue length

  • Kafka lag
  • AWS SQS messages

You need a custom metrics adapter (e.g. Prometheus Adapter) to use this.


4. object

- type: Object
  object:
    describedObject:
      kind: Service
      name: my-service
      apiVersion: v1
    metric:
      name: request_rate
    target:
      type: Value
      value: "100"
  • Use-case: Scale based on a metric associated with a specific Kubernetes object, like a Service or Ingress.
  • Object metrics support target types of both Value and AverageValue.

3. pods

- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
  • Use-case: Scale based on per-pod metrics like request count, queue depth, etc.
  • Youโ€™ll use this with custom metrics.
  • They work much like resource metrics, except that they only support a target type of AverageValue.

6. type

Each metric block must define its type: one of:

  • Resource
  • Pods
  • Object
  • External
  • ContainerResource

๐Ÿ“ˆ minReplicas / maxReplicas

minReplicas: 2
maxReplicas: 10
  • Guarantees that the deployment has at least 2 and no more than 10 pods.

โš™๏ธ behavior (v2 feature โ€“ advanced scaling control)

This is new in v2, allowing you to fine-tune scaling velocity & strategy.

behavior:
  scaleUp:
    stabilizationWindowSeconds: 15
    selectPolicy: Max
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 60
    selectPolicy: Min
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60

๐ŸŸข scaleUp:

  • stabilizationWindowSeconds: Prevents too rapid scale-up

  • Only uses metrics from last 15s

  • policies: How fast to scale:

  • type: Percent โ†’ double the pods (100%)

  • type: Pods โ†’ increase by a fixed number
  • selectPolicy: Choose which policy applies if multiple match

  • Max: use the one that scales fastest

  • Min: use the slowest
  • Disabled: ignore policy (rare)

๐Ÿ”ด scaleDown:

  • Same structure as scaleUp, but usually more conservative to avoid flapping.
  • stabilizationWindowSeconds: 60 means only metrics older than 60s will be considered.

๐Ÿ“Š Summary Table

Field Description
scaleTargetRef Which workload to autoscale
minReplicas / maxReplicas Replication limits
metrics What to watch (CPU, memory, external, etc.)
behavior How aggressively to scale
scaleUp / scaleDown Fine control over scaling speed & windows

๐Ÿ“‚ Overall Purpose of behavior

The behavior section gives you precise control over how fast and how often the HPA scales your pods.

Why it exists:

  • Older HPA (v1) would just react to metrics every 15s without much intelligence โ€” this could lead to flapping (scale up, scale down rapidly).
  • With behavior, you can control:

  • How aggressively to scale up (faster response to load)

  • How conservatively to scale down (avoid instability)

๐ŸŸข scaleUp Section

scaleUp:
  stabilizationWindowSeconds: 15
  selectPolicy: Max
  policies:
  - type: Percent
    value: 100
    periodSeconds: 15

๐Ÿ”น stabilizationWindowSeconds: 15

  • This tells Kubernetes:

"When deciding to scale up, only consider the most recent decision made in the last 15 seconds."

๐Ÿ“Œ Real-world: This prevents wild fluctuations due to temporary spikes in CPU or custom metrics.

๐Ÿ’ก Example: If at 12:00:00, the autoscaler sees CPU > 70% and decides to go from 2 โ†’ 4 pods, and then at 12:00:10 it again sees high CPU, it wonโ€™t immediately try 4 โ†’ 8 until 15 seconds are up.


๐Ÿ”น selectPolicy: Max

  • If multiple policies are defined, this tells HPA to:

"Choose the policy that allows the most pods to be added."

๐Ÿง  Other values could be:

  • Min: scale by the smallest allowed number
  • Disabled: disables automatic scaling (rare)

๐Ÿ”น policies (for scaleUp)

- type: Percent
  value: 100
  periodSeconds: 15

This means:

"In any 15-second window, the number of pods can be increased by up to 100%."

โœ… Examples:

Current Pods Max Scale Up (100%) New Max Pods
2 2 4
4 4 8

So, if HPA wants to scale from 2 to 8 โ€” it wonโ€™t do it all at once.

  • It will only scale from 2 โ†’ 4 (within 15s)
  • Then 4 โ†’ 8 in the next 15s, if the high load continues.

๐Ÿ”ป scaleDown Section

scaleDown:
  stabilizationWindowSeconds: 60
  selectPolicy: Min
  policies:
  - type: Pods
    value: 1
    periodSeconds: 60

๐Ÿ”น stabilizationWindowSeconds: 60

  • HPA will only scale down if the metric (e.g., CPU) has been below target for the last 60 seconds consistently.
  • Prevents scaling down too soon after a short dip.

๐Ÿ“Œ Reason: Scaling down too fast can hurt performance if load suddenly comes back (called โ€œthrashingโ€).


๐Ÿ”น selectPolicy: Min

"Choose the most conservative (smallest) action from all matching policies."

  • In this case, we only have one policy: reduce 1 pod.

๐Ÿ”น policies (for scaleDown)

- type: Pods
  value: 1
  periodSeconds: 60

This means:

"In every 60-second window, only 1 pod can be removed."

โœ… Examples:

Current Pods Max Scale Down New Pods
10 -1 9
5 -1 4

This gives your app time to stabilize and ensures you donโ€™t downscale too aggressively.


๐Ÿ” Full Flow Visualization

Time CPU % Pods Action
00:00 80% 2 Scale up to 4 (within 15s)
00:15 85% 4 Scale up to 8
00:30 70% 8 Hold
01:00 30% 8 Still not scaled down yet (60s window not passed)
01:30 25% 8 Downscale to 7
02:30 20% 7 Downscale to 6

๐Ÿง  Why Youโ€™d Use This Config

Your goals might be:

  • Scale up fast when user traffic increases
  • Avoid aggressive scale-down, so you donโ€™t kill pods too soon and disrupt users
  • Have complete governance over autoscaling behavior

๐Ÿงพ Summary

Field Meaning
stabilizationWindowSeconds Look-back period to "stabilize" decisions (anti-flapping)
selectPolicy Choose fastest (Max) or slowest (Min) action if multiple policies apply
policies[].type Percent or Pods โ€” how to define scaling speed
policies[].value The actual change allowed (e.g. 1 pod or 100%)
policies[].periodSeconds Time window over which to enforce the policy

Sweetheart, letโ€™s break down this snippet:

resource    <ResourceMetricSource>
  name      <string> -required-
  target    <MetricTarget> -required-
    averageUtilization      <integer>
    averageValue    <Quantity>
    type    <string> -required-
    value   <Quantity>

This is a part of the metrics field in an HPA (Horizontal Pod Autoscaler) YAML.


๐Ÿง  What is resource in HPA?

It defines how HPA should scale a workload (like a Deployment) based on CPU or memory usage โ€” either as:

  • percentage-based (averageUtilization)
  • absolute value-based (averageValue)

๐Ÿ” Breaking it Down Field-by-Field

Field Description
name "cpu" or "memory" โ€” tells HPA which resource to monitor.
target.type Either: Utilization, AverageValue, or Value (less common in resource metrics).
averageUtilization A % of the requested resource (defined in your pod's resources.requests). Only used when type: Utilization.
averageValue Total usage across pods, divided by number of pods. Used with type: AverageValue.
value Not used in Resource type (used in other metric types like Object).

๐Ÿงช Example 1: Scaling on CPU Utilization

- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 75

โœ… What it does:

  • It scales the workload if average CPU usage across all pods goes above 75% of the CPU requested in the pod spec.

๐Ÿ“ฆ Real-world use case:

If your pod has:

resources:
  requests:
    cpu: "500m"

Then:

  • HPA will trigger scaling when the podโ€™s CPU usage crosses 375m (75% of 500m)

๐Ÿงช Example 2: Scaling on Average Memory Usage

- type: Resource
  resource:
    name: memory
    target:
      type: AverageValue
      averageValue: 500Mi

โœ… What it does:

  • It scales up/down if average memory usage per pod exceeds 500Mi.

๐Ÿ“ฆ Real-world use case:

Imagine a backend app that tends to crash if memory exceeds 600Mi. You want to:

  • Add pods before that happens to reduce load on each.
  • This keeps memory per pod lower than the crashing point.

โš–๏ธ Utilization vs AverageValue

Feature Utilization AverageValue
Works with % of requested resource Absolute usage value
Needs resources.requests? โœ… Yes โŒ No (but still recommended)
Best for Predictable workloads (e.g., APIs) Spiky/memory-heavy apps (e.g., ML workers)
Example 70% of cpu request memory > 500Mi

๐Ÿงช Example 3: Use Both CPU & Memory

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 75
- type: Resource
  resource:
    name: memory
    target:
      type: AverageValue
      averageValue: 500Mi

๐Ÿ’ก Use Case:

Scale up when either:

  • CPU crosses 75% of request, or
  • memory usage per pod exceeds 500Mi

โ— Common Mistakes to Avoid

Mistake Fix
You use Utilization but donโ€™t define resources.requests in your pod Always define requests.cpu or requests.memory
Use value instead of averageValue value is for external or object metrics only
Mixing Utilization with AverageValue in same target block Use separate metric blocks for each

Excellent question, sweetheart โค๏ธ โ€” letโ€™s break it down precisely:

scaleDown:
  policies:
  - type: Pods
    value: 4
    periodSeconds: 60
  - type: Percent
    value: 10
    periodSeconds: 60

โœ… Does this act as AND or OR?

It depends on the value of selectPolicy.


๐Ÿ” Default behavior (selectPolicy is not set)

If you donโ€™t explicitly set selectPolicy, Kubernetes uses:

selectPolicy: Max

So this behaves as:

Max(Pods, Percent)

๐ŸŽฏ Meaning:

  • It picks the more aggressive scale down policy.
  • In other words, more pods will be removed between the two rules.

๐Ÿ“Œ So is it AND or OR?

Itโ€™s neither a strict AND nor a strict OR.

Instead, itโ€™s a selector between multiple policies, controlled by selectPolicy, which can be one of:

selectPolicy Description
Max (default) Take the most aggressive option โ€” i.e., whichever removes more pods
Min Take the least aggressive โ€” i.e., whichever removes fewer pods
Disabled Ignore all policies โ€” use the default scaling behavior

๐Ÿงช Real Example

Letโ€™s say you have 100 pods running, and:

- type: Pods
  value: 4
- type: Percent
  value: 10

Every 60 seconds, the HPA evaluates:

Policy Value
Pods remove up to 4 pods
Percent remove up to 10 pods (10% of 100)

โžค Result with selectPolicy: Max

  • 10 pods will be removed

โžค Result with selectPolicy: Min

  • Only 4 pods will be removed

โœจ TL;DR Summary

Behavior Explanation
AND โŒ No, it doesnโ€™t wait for both to be true.
OR โŒ Not strictly โ€” it chooses one.
Max โœ… Default. Chooses the larger scale-down (more aggressive).
Min โœ… Chooses the smaller scale-down (more conservative).
Disabled โœ… Ignores all custom scale rules โ€” falls back to default Kubernetes behavior.

You're absolutely right, sweetheart โค๏ธ โ€” I mentioned tolerance in the field breakdown but didnโ€™t explain it properly earlier. Letโ€™s now dive deep into what tolerance means in HPA behavior, especially in the context of scaleUp and scaleDown.


๐Ÿ’ก What is tolerance in HPA?

tolerance defines a threshold (in percentage) to prevent unnecessary scaling due to very minor fluctuations in metrics like CPU or memory.


๐Ÿง  Why is tolerance important?

Without tolerance, your HPA could react to even tiny metric changes โ€” causing frequent pod scaling (churn), which hurts stability and performance.

So, tolerance introduces a โ€œdead zoneโ€, where small changes in metrics are ignored to maintain calmness in scaling decisions.


๐Ÿงฎ How does it work?

The tolerance value is a decimal fraction, not a percentage.

For example:

tolerance: 0.1

This means ยฑ10% leeway around the target metric.


๐Ÿ“Œ Example Use Case

Assume:

  • Target CPU utilization: 50%
  • Actual CPU utilization observed: 53%
  • tolerance: 0.1 (10%)

Then the tolerance range is:

Lower Bound = 50% - (10% of 50) = 45%
Upper Bound = 50% + (10% of 50) = 55%

โžก๏ธ Since 53% is within the 45โ€“55% range, HPA does not scale.


๐Ÿงช Realistic Scenario

Letโ€™s say your scaleUp block looks like this:

scaleUp:
  stabilizationWindowSeconds: 300
  tolerance: 0.05   # 5%
  policies:
    - type: Percent
      value: 100
      periodSeconds: 60
  selectPolicy: Max
  • Target CPU utilization = 60%
  • Observed utilization = 62%

60 ร— 0.05 = 3 โ†’ 57%โ€“63% range

โžก๏ธ Since 62% is within the 5% tolerance window, no scaling occurs.

But if usage jumps to 70%, it's outside the tolerance range, so HPA will trigger scale-up.


โœ… Summary of tolerance

Property Meaning
tolerance Fraction (e.g. 0.1) that defines "ignore zone"
Applies to Both scaleUp and scaleDown
Default value 0.1 (i.e., 10%)
Purpose Prevent noisy scaling due to tiny metric spikes
Value range Must be a decimal between 0 and 1

๐Ÿ”ง When should you tweak it?

Scenario Recommended Tolerance
Highly dynamic workloads Lower (e.g. 0.05)
Stable apps, avoid flapping Higher (e.g. 0.15)
Real-time responsiveness needed Lower (e.g. 0.02)

Complete Yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa                       # Name of the HPA object
  namespace: default                     # ๐Ÿ”ฅ Must be in the same namespace as the Deployment it targets
  labels:
    app: webapp
    environment: production

spec:
  scaleTargetRef:                        # ๐Ÿ“Œ This tells HPA *what* to scale
    apiVersion: apps/v1                  # Must match the target object
    kind: Deployment                     # Could also be StatefulSet, ReplicaSet, etc.
    name: webapp-deployment              # Target object name (must exist)
                                         # namespace: not mentioned, not a key here.
  minReplicas: 2                         # ๐ŸงŠ Lower limit of pods - ensures availability
  maxReplicas: 10                        # ๐Ÿ”ฅ Upper limit - prevents over-scaling

  metrics:
    # ๐Ÿ’ป Scale based on CPU utilization %
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization                # ๐Ÿ‘ˆ Uses percentage of CPU request (not absolute value)
        averageUtilization: 70          # If average CPU > 70%, trigger scaling

    # ๐Ÿง  Scale based on memory utilization %
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization                # ๐Ÿ‘ˆ Uses percentage of memory request
        averageUtilization: 80          # If memory > 80%, HPA considers scaling up

    # ๐Ÿง ๐Ÿง  Scale based on absolute memory usage (not %)
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue              # ๐Ÿ‘ˆ Use exact memory usage across pods
        averageValue: 500Mi             # If each pod uses over 500MiB on average โ†’ scale

  behavior:                              # ๐ŸŽ›๏ธ Fine-tune how scaling happens
    scaleUp:
      stabilizationWindowSeconds: 30     # โณ Wait this long before considering another scale-up
      tolerance: 0.05                    # ยฑ5% around the target metric
      selectPolicy: Max                  # ๐Ÿง  If multiple policies match, pick the most aggressive
      policies:
        - type: Percent
          value: 100                     # ๐Ÿ”บ Double the current replicas (100% increase)
          periodSeconds: 15              # in a 15-second window
        - type: Pods
          value: 4                       # Or add max 4 pods in a 15s window

    scaleDown:
      stabilizationWindowSeconds: 60     # โณ Delay scale-down decisions to avoid rapid drops
      selectPolicy: Min                  # ๐Ÿง  Be conservative โ€” pick the gentlest downscale
      policies:
        - type: Percent
          value: 50                      # ๐Ÿ”ป Reduce by 50% at most
          periodSeconds: 60              # Check every 60s
        - type: Pods
          value: 2                       # Or remove max 2 pods in 60s