load_average_1

Ancairon · October 20, 2021, 5:44pm

load_average_1

OS: Linux

This alert calculates the system load average (CPU and I/O demand) over the period of one minute.
If you receive this alert, it means that your system is overloaded.

What does "load average" mean

The term system load average on a Linux machine, measures the number of threads that are currently working and those waiting to work (CPU, disk, uninterruptible locks).^{1 2}
So simply stated: it measures the number of threads that aren’t idle.

What does "overloaded" mean

The term overloaded can be better illustrated using an example as Andre Lewis says in Understanding Linux CPU Load - when should you be worried?³.

We are going to take a single core CPU system and think of its core count as bridge lanes.

On a 0.5 load average, the traffic on the bridge is fine, it is at 50% of its capacity.
If the load average is at 1, then the bridge is full, and it is utilized 100%.
If the load average gets to 2 (remember we are on a single core machine), it means that there is one lane that is passing the bridge and one other full lane that waits on the side. On this example, traffic and thus cars, are processes.

So this is how you can imagine CPU load, but keep in mind that load average counts also I/O demand, so there is an analogous example there.

How we calculate the alert

On Netdata, in the load.conf file, under the health.d directory, you can see how we calculate when the alert should be raised.

First, there is load_cpu_number where it provides the load_average alerts with the core count of the machine.
In the line warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 700 : 800),
($this * 100 / $load_cpu_number) is the current system load average in %.
Last, we check if that value exceeds 700% or 800% (depending on the $status of the alert).

References and Sources

Troubleshooting Section

Determine if the problem is CPU or I/O bound

First you need to check if you are running on a CPU load or an I/O load problem.

You can use vmstat (or vmstat 1, to set a delay between updates in seconds)

root@netdata~ # vmstat 
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0 1200384 168456  48840 1461540    4   14    65    51  334  196  3  1 95  0  0

The procs column, shows;
r: The number of runnable processes (running or waiting for run time).
b: The number of processes blocked waiting for I/O to complete.

After that, you can use the ps and specifically ps -eo s,user,cmd | grep ^[RD].

The grep command will fetch the processes that their state code starts either with R (running or runnable (on run queue)) or D(uninterruptible sleep (usually IO)).

Note: It would be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.

Check per-process CPU/disk usage to find the top consumers

Use top:
```
root@netdata~ # top -o +%CPU -i
```
Here, you can see which processes are the main cpu consumers on the %CPU column.
Use iotop:
iotop is a useful tool, similar to top, used to monitor Disk I/O usage, if you don’t have it, then install it
```
root@netdata~ # sudo iotop
```
Using this, you can see which processes are the main Disk I/O consumers on the IO column.

Note: It would be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.

Topic		Replies	Views
load_average_15 Alerts	0	10668	February 1, 2022
load_average_5 Alerts	2	7071	January 27, 2022
load1 > 40. No charts can explain it Help	7	1340	February 1, 2023
20min_steal_cpu Alerts	0	21681	November 20, 2021
10min_disk_utilization Alerts	0	4803	December 2, 2021

load_average_1

load_average_1

OS: Linux

Troubleshooting Section

Related topics