Load Average in top Command - What It Really Means

Understanding system load metrics beyond CPU usage


Overview

Load Average is one of the most commonly monitored metrics in Linux systems administration and performance analysis. It provides a high-level view of system resource utilization over time, serving as an early warning system for potential performance bottlenecks. When you run commands like top, uptime, or htop, you’ll see three numbers representing the 1, 5, and 15-minute load averages.

However, load average is also one of the most misunderstood metrics. Many system administrators erroneously equate it directly with CPU usage, leading to incorrect interpretations like “a load of 2.0 means the CPU is 200% utilized” or “any load above 1.0 indicates an overloaded system.”

The reality is more nuanced – load average combines several factors into a single metric that reflects the “pressure” on your system. Understanding this metric correctly is essential for effective system monitoring, capacity planning, and troubleshooting in production environments.

Historical Context

The concept of load average originated in the early Unix systems of the 1970s. It was designed to provide a simple measure of system busyness that could be understood at a glance and compared over time.

The original Unix implementation only counted processes in the run queue (either running or waiting for CPU time). However, Linux later expanded this definition in the early 1990s to include processes in uninterruptible sleep states (D state), primarily those waiting for I/O operations to complete.

This difference in implementation explains why the interpretation of load average varies between Unix-like systems, and why Linux load averages tend to be higher than those on systems like FreeBSD or macOS for equivalent workloads. The change is generally attributed to kernel developer Matthias Urlichs, who argued that processes blocked on disk or swap consume resources just as runnable processes do. The decision remains debated: critics say it blurs what the metric actually measures, while others argue it gives a more complete picture of overall resource contention.



What is Load Average?

Load Average is a measurement of system demand over time, expressed as the average number of processes that are either:

  1. Running (using CPU)
  2. Runnable (waiting for CPU time)
  3. Uninterruptible Sleep (waiting for I/O operations, typically disk I/O)

Think of it as a measure of the “queue length” of processes demanding system resources. If you have a load average of 1.0, it means that over the measured time period, there was, on average, exactly one process active or waiting for resources at any given time.

The three load average numbers you typically see represent averages over different time periods:

  • 1 minute – the most recent, most volatile view of demand
  • 5 minutes – a medium-term view that smooths out brief spikes
  • 15 minutes – the long-term trend
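
A quick way to see these values is the uptime command or the /proc/loadavg file. The output below is illustrative sample output, not values your system will reproduce:

$ uptime
 14:02:11 up 12 days,  3:41,  2 users,  load average: 0.42, 0.57, 0.61

$ cat /proc/loadavg
0.42 0.57 0.61 1/253 48231

In /proc/loadavg, the first three fields are the 1-, 5-, and 15-minute averages; the fourth is currently runnable / total scheduling entities, and the fifth is the most recently created PID.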

Visualization: Understanding Load Average

Imagine a single checkout lane at a grocery store:

  • A load of 1.0 means there's exactly one customer being served at all times.
  • A load of 0.5 means the checkout lane is idle half the time.
  • A load of 2.0 means there's one customer being served and one waiting in line, on average.
  • A load of 10.0 means there's one customer being served and nine waiting in line, on average.

Now imagine you have a store with 4 checkout lanes (like a 4-core CPU):

  • A load of 4.0 means all checkout lanes are busy, but no one is waiting.
  • A load below 4.0 means some lanes are occasionally idle.
  • A load above 4.0 means customers are waiting (indicating potential performance issues).



How Load Average is Calculated

Load Average isn’t a simple arithmetic mean but an Exponential Moving Average. This means recent data carries more weight in the calculation, making it more responsive to changes.

This is also why comparing the three values against each other is useful: the 1-minute figure reacts quickly to sudden spikes, while the 15-minute figure reflects sustained load.
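
As a simplified sketch of the recurrence (the kernel samples the number of active tasks roughly every 5 seconds and applies a fixed per-interval decay factor; the real implementation uses fixed-point integer arithmetic):

# n = number of runnable + uninterruptible tasks at the sample point
load_1min  = load_1min  * e^(-5/60)  + n * (1 - e^(-5/60))
load_5min  = load_5min  * e^(-5/300) + n * (1 - e^(-5/300))
load_15min = load_15min * e^(-5/900) + n * (1 - e^(-5/900))

Because old samples decay exponentially, the 1-minute value reacts to (and recovers from) a spike quickly, while the 15-minute value changes slowly in both directions.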



Components of Load Average

Load Average is not simply CPU usage, but is calculated based on the number of processes in the following states:

State                       Counted?          Description
Running (R)                 Yes               Processes currently executing on a CPU
Runnable (R)                Yes               Processes in the run queue waiting for CPU time
Uninterruptible sleep (D)   Yes (Linux only)  Processes blocked on I/O (e.g., disk or NFS)
Sleeping / Idle (S)         No                Processes not currently demanding CPU or I/O

In essence, on Linux the Load Average represents how many processes are currently using the CPU, waiting for the CPU, or blocked in uninterruptible I/O.
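
One way to see how many processes are currently in each state (R and D are the states that feed into the Linux load average) is a quick ps one-liner such as this sketch:

# Count processes by state: R = running/runnable, D = uninterruptible sleep, S = sleeping
ps -eo state= | sort | uniq -c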


Load Average vs CPU Usage

Aspect       Load Average                                     CPU Usage
Meaning      Overall demand on the system                     Percentage of time the CPU is actually busy
Components   Running + runnable + uninterruptible processes   CPU time (%)
Unit         Number of processes                              Percentage (%)
Example      2.00 = two processes active or waiting, on average   80% = the CPU was busy 80% of the time
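
To see the two metrics side by side on a live system, something like the following works (top -bn1 runs top once in batch mode; the exact label of the CPU summary line varies slightly between top versions):

# Load average (demand) and CPU utilization (busy time) from the same moment
uptime
top -bn1 | grep '%Cpu'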

Common Misconceptions

  • "A load of 2.0 means the CPU is 200% utilized" – load counts processes, not CPU time; on a multi-core machine a load of 2.0 may be entirely healthy.
  • "Any load above 1.0 means the system is overloaded" – the threshold depends on the number of CPU cores; 1.0 per core is the more useful reference point.
  • "High load always means a CPU problem" – on Linux, processes stuck waiting on disk I/O also raise the load average, so a high value can reflect an I/O bottleneck instead.

Interpreting Load Average Values

How you interpret load average depends on your system's CPU count. The common rule of thumb is to divide the load average by the number of CPU cores: a per-core value well below 1.0 means there is headroom, a value around 1.0 means the system is fully utilized, and a sustained value above 1.0 means processes are queuing for resources.
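
A quick sketch for checking this from the shell, using nproc for the core count and the first field of /proc/loadavg for the 1-minute average:

# 1-minute load average divided by the number of CPU cores
awk -v cores="$(nproc)" '{printf "1-min load per core: %.2f\n", $1 / cores}' /proc/loadavg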

Practical Example: Analyzing System Load

Consider a 4-core server with these load averages: 2.34, 3.45, 4.56

  • 1-minute (2.34): The system is at ~58% capacity (2.34/4). There's still headroom.
  • 5-minute (3.45): The system is at ~86% capacity (3.45/4). Getting close to full utilization.
  • 15-minute (4.56): The system is at ~114% capacity (4.56/4). The server has been overloaded for some time.

This pattern suggests:

  • The system was overloaded (15-min average > 4)
  • The situation is improving (1-min average < 5-min average < 15-min average)
  • Immediate action may not be needed, but monitoring should continue

If the pattern were reversed (1-min > 5-min > 15-min), it would indicate a worsening situation requiring prompt investigation.

Is High Load Always Bad?

Not necessarily. Consider these factors:

  1. Duration: Brief spikes are normal during batch operations
  2. Composition: CPU-bound load differs from I/O-bound load (see the quick check after this list)
  3. System Response: If the system remains responsive despite high load, it may be acceptable
  4. Expected Patterns: Some applications naturally create periodic high load
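
One quick way to tell CPU-bound load from I/O-bound load is to watch the CPU columns of vmstat for a few seconds; a high "wa" (I/O wait) value with modest "us"/"sy" values suggests the load is driven by disk rather than computation:

# Sample CPU usage once per second, three times; check the us/sy/id/wa columns
vmstat 1 3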



Interpretation Tips for Production

Checkpoints

  • Is the load average consistently higher than the number of CPU cores, or only spiking briefly?
  • Is the trend across the 1-, 5-, and 15-minute values rising or falling?
  • Is the load CPU-bound or I/O-bound? (Check %iowait and processes in the D state.)
  • Is the system still responsive to users despite the high number?

Monitoring Examples
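
As a minimal sketch, the load trend can be watched live with watch, and historical values can be pulled from sar (assuming the sysstat package is installed and collecting data):

# Refresh the load averages every 5 seconds to watch the trend
watch -n 5 uptime

# Run-queue length and load averages, sampled every 5 seconds, 12 times (sysstat)
sar -q 5 12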

How is Load Average Used in Kubernetes?

In Kubernetes node autoscaling, custom metrics based on Load Average can be used alongside CPU usage criteria. For example, with a Prometheus + KEDA combination, node_load1 divided by the node's core count can serve as a scale-out criterion.



Practical Examples

$ top
load average: 1.5, 1.2, 1.0

Load is modest and has only been creeping up slightly; on a multi-core machine this is comfortably within capacity.

$ top
load average: 5.0, 4.9, 5.1

Load has been holding steady around 5; on a 4-core machine this means processes are continuously queuing for resources.



Troubleshooting High Load Average

When load average exceeds your CPU count for extended periods, here are some commands to help diagnose the issue:

# View top processes by CPU usage
top

# Sort by memory usage in top
# Press Shift+M while top is running

# View processes in tree format
pstree -p

# Check I/O operations
iostat -x 1

# View memory usage
free -h

# Check for processes in uninterruptible sleep (D state)
ps -eo state,pid,cmd | awk '$1 ~ /^D/'

# Check system activity reports over time
sar -q

The most common causes of high load average include:

  1. CPU-bound processes: Computationally intensive tasks consuming CPU cycles
  2. Memory pressure: Excessive swapping due to memory shortage
  3. I/O bottlenecks: Slow disk operations causing processes to wait
  4. Resource contention: Multiple processes competing for the same resources
  5. Runaway processes: Processes in infinite loops or with memory leaks

Pro Tip: For AWS EC2 instances, check if you're hitting CPU credit limits on burstable instance types (T2/T3/T4g). These instances can temporarily exceed their baseline performance but will throttle once credits are exhausted.



Conclusion: “Not CPU Usage, But Length of the System’s Queue”

Load Average is not simply CPU usage, but an indirect indicator of system load showing ‘how many processes were waiting for CPU or resource allocation.’

Therefore, rather than simply judging it as "high" or "low," interpret it relative to the number of CPU cores and watch how it trends over time. In practice, it's best to identify bottlenecks by looking at CPU usage, Load Average, and iowait together.

Practical Tips

  • Always divide the load average by the number of CPU cores before judging whether it is high.
  • Watch the trend: 1-min > 5-min > 15-min means load is building; the reverse means it is easing.
  • If load is high but CPU usage is low, look for I/O bottlenecks and D-state processes.
  • Alert on sustained per-core load rather than on a single absolute number.

In conclusion, Load Average is not just a number but closer to the “breathing of your system.” Being able to read this metric correctly gives you the ability to detect problem signs early and prevent failures.


