load_average_5

Ancairon · January 26, 2022, 7:56pm

OS: Linux

This alarm calculates the system load average (CPU and I/O demand) over a period of five minutes.
If you receive this alarm, your system is “overloaded”.

The alert is raised to warning when the metric exceeds 4 times the expected value, and it is cleared once the metric falls back to 3.5 times the expected value.
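
The two thresholds form a hysteresis band, so the alert does not flap while the load hovers around the limit. A minimal sketch of that logic, assuming the caller supplies the "expected value" (the post does not define it precisely; the core count is one plausible choice). The function name `load5_status` and its arguments are hypothetical and are not Netdata's implementation:

```shell
# Hypothetical helper, not Netdata's actual alert code.
# Usage: load5_status <load5> <expected> <previous_status>
# Prints WARNING or CLEAR.
load5_status() {
    awk -v cur="$1" -v expected="$2" -v prev="$3" 'BEGIN {
        ratio = cur / expected
        # Once in WARNING, stay there until the ratio is no longer above 3.5;
        # otherwise require the ratio to exceed 4 before warning.
        limit = (prev == "WARNING") ? 3.5 : 4
        status = (ratio > limit) ? "WARNING" : "CLEAR"
        print status
    }'
}
```

On a Linux host, something like `load5_status "$(cut -d' ' -f2 /proc/loadavg)" "$(nproc)" CLEAR` would evaluate the current 5-minute load against the number of available cores.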

For further information on how our alerts are calculated, please have a look at our Documentation.

What does "load average" mean?

On a Linux machine, the system load average measures the number of threads that are currently working plus those waiting to work (on the CPU, on disk, or on uninterruptible locks)¹². Simply stated, the system load average measures the number of threads that aren’t idle.
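
On Linux, the kernel exposes these figures in /proc/loadavg, where the first three fields are the 1-, 5- and 15-minute load averages. A small sketch for pulling out the 5-minute value; the function name `load5_from_line` is hypothetical, and it reads from stdin so that it can also be fed sample data:

```shell
# Hypothetical helper: print the 5-minute load average (second field)
# from a /proc/loadavg-style line read on stdin.
load5_from_line() {
    awk '{ print $2 }'
}
```

On a Linux host it would be called as `load5_from_line < /proc/loadavg`.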

What does "overloaded" mean?

Andre Lewis explains the term “overloaded” with an example in his blog post “Understanding Linux CPU Load - when should you be worried?”³
You can click on the footnote or find it in our links section.

Let’s look at a single-core system and think of its core as a single car lane on a bridge. A car represents a process in this example:

With a load average of 0.5, traffic on the bridge flows freely; the bridge is at 50% of its capacity.
If the load average is 1, the bridge is full and is utilized at 100%.
If the load average reaches 2 (remember, this is a single-core machine), one full lane of cars is crossing the bridge while another full lane of cars waits to cross.

This is how you can picture CPU load, but keep in mind that the load average also counts I/O demand, for which an analogous picture applies.
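
The bridge arithmetic can be sketched as a load-per-core percentage. This assumes each CPU core counts as one lane; the helper name `load_percent` is hypothetical:

```shell
# Hypothetical helper: express a load average as a percentage of capacity,
# treating each CPU core as one "lane" on the bridge.
# Usage: load_percent <load> <cores>
load_percent() {
    awk -v cur="$1" -v cores="$2" 'BEGIN { printf "%.0f\n", cur / cores * 100 }'
}
```

With one core, `load_percent 0.5 1` prints 50 and `load_percent 2 1` prints 200, matching the half-full bridge and the bridge with a full lane waiting. The same load of 2 on a four-core machine, `load_percent 2 4`, prints 50.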

References and Sources

Troubleshooting Section

Determine if the problem is CPU or I/O bound

First, check whether the load is caused by CPU demand or by I/O demand.

To get a report about your system statistics, use vmstat (or vmstat 1, to set a delay between updates in seconds):

```
root@netdata~ # vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0 1200384 168456  48840 1461540    4   14    65    51  334  196  3  1 95  0  0
```

The procs columns show:
r: The number of runnable processes (running or waiting for run time).
b: The number of processes blocked waiting for I/O to complete.
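
A rough way to read those two columns automatically is sketched below. The thresholds are assumptions for illustration, not rules from this post: more runnable processes than cores hints at CPU pressure, and any blocked process hints at I/O pressure. The helper name `load_hint` is hypothetical:

```shell
# Hypothetical helper: read vmstat output on stdin and inspect the r and b
# columns of the last sample row. Thresholds are illustrative assumptions.
# Usage: vmstat 1 5 | load_hint <cores>
load_hint() {
    awk -v cores="$1" '
        # Keep the r and b columns of the last numeric sample row.
        $1 ~ /^[0-9]+$/ { r = $1 + 0; b = $2 + 0 }
        END {
            c = cores + 0
            if (r > c) print "cpu: " r " runnable for " c " core(s)"
            if (b > 0) print "io: " b " blocked on I/O"
            if (r <= c && b == 0) print "ok"
        }'
}
```

Because the first row that vmstat prints reports averages since boot, piping several samples (for example `vmstat 1 5 | load_hint "$(nproc)"`) and judging by the last row gives a more current picture.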

List your currently running processes using the ps command:

The grep command keeps only the processes whose state code starts with either R (running or runnable, that is, on the run queue) or D (uninterruptible sleep, usually I/O).
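
The command itself is not shown in this post. A sketch of one that fits the description, assuming the procps version of ps found on most Linux distributions; the wrapper name `filter_rd` is hypothetical:

```shell
# Keep only lines whose first field (the process state code) begins with
# R (running or runnable) or D (uninterruptible sleep).
# Usage: ps -eo s,user,cmd | filter_rd
filter_rd() {
    grep '^[RD]'
}
```

Here `ps -eo s,user,cmd` prints the state code, owner and command line of every process, so the state code is the first character of each line. Quoting the pattern keeps the shell from expanding `[RD]` as a filename glob.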

Reduce the load by closing any of the main consumer processes that are unnecessary. We strongly advise you to double-check whether a process is needed before you close it.

Check per-process CPU/disk usage to find the top consumers

To see the processes that are the main CPU consumers, use the task manager program top like this:
```
root@netdata~ # top -o +%CPU -i
```
To see the processes that are the main disk I/O consumers, use iotop. It is a tool similar to top that monitors disk I/O usage; if you don’t have it, install it first.
```
root@netdata~ # sudo iotop
```

Note: If iotop is not installed on your machine, please refer to the install instructions.

Reduce the load by closing any of the main consumer processes that are unnecessary. We strongly advise you to double-check whether a process is needed before you close it.

andrewm4894 · January 26, 2022, 10:26pm

The alert gets raised into warning if the metric is 4 times the expected value and cleared if the value is 3.5 times the expected value.

Do you have a sort of intuition or ELI5 for “expected value” here? Curious how to think about that…

Ancairon · January 27, 2022, 8:30am

Sure!
In simple words, we have 3 load_average metrics:

load_average_1
load_average_5
load_average_15

Each of them calculates the load_average of a Linux system over a different time span.

The “expected value” is the normal (or usual) load_average for 1, 5 or 15 minutes respectively.
That value also depends on the active core count of the system.

If the load_average becomes 4 times greater than that value, something is going wrong: the system is getting overloaded and the user is warned.

Each of these alerts also has different limits. For example, load_average_1 has the highest limits, because we don’t consider short, small spikes in that metric a problem.

I hope this helped😄!
