load1 > 40. No charts can explain it

ben · January 28, 2023, 7:13am

Problem/Question

Server load sometimes reaches an insane level. There is no CPU/RAM or other bottleneck, at least none that I could find via Netdata. Should I ignore load completely?

ilyam8 · January 31, 2023, 4:52pm

Hi, @ben. How many CPU cores are in your system? The load might not be as insane as you say.

Load average measures the number of threads that are currently working and those waiting to work (CPU, disk, uninterruptible locks). It doesn’t always indicate a problem. It doesn’t say much (who knows if that is a CPU or IO-bound problem).
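To put a load figure in context, it can help to divide it by the number of logical CPUs. Below is a minimal Python sketch, assuming a Unix-like system; `load_per_cpu` is a hypothetical helper name used only for illustration, not something Netdata provides.

```python
import os


def load_per_cpu(load: float, n_cpus: int) -> float:
    """Return a load average divided by the number of logical CPUs."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return load / n_cpus


if __name__ == "__main__":
    try:
        # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
        load1, load5, load15 = os.getloadavg()
    except (OSError, AttributeError):
        print("load average is not available on this platform")
    else:
        # os.cpu_count() may return None, so fall back to 1.
        cpus = os.cpu_count() or 1
        print(f"load1={load1:.2f} cpus={cpus} per-cpu={load_per_cpu(load1, cpus):.2f}")
```

A per-CPU value well above 1.0 suggests more threads are running, runnable, or in uninterruptible sleep than there are CPUs. It still does not say which resource they are waiting on.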

Do you have CPU/memory/disk pressure charts? If you want to check if your server is stalled on CPU/memory/IO - always prefer pressure charts.

ben · January 31, 2023, 4:59pm

Thank you for the reply.
AMD Ryzen 9 5950X 16-Core Processor

Attaching new screenshots from today:
Notice how load doesn’t correlate with any of the pressure graphs. Or am I wrong on that?

ilyam8 · January 31, 2023, 5:13pm

System pressure charts show the percentage of time that some processes (some_pressure) or all processes (full_pressure) have been waiting due to CPU, memory, or I/O (storage) congestion. The *stall_time charts show the amount of time spent waiting.
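On Linux, pressure figures of this kind are exposed by the kernel's PSI interface as text files under /proc/pressure/ (cpu, memory, io). The sketch below parses that format in Python. It assumes a kernel with PSI enabled; `parse_pressure` and `SAMPLE` are hypothetical names, and the sample values are made up for illustration.

```python
def parse_pressure(text: str) -> dict:
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    avg* are percentages of time stalled; total is cumulative microseconds.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Illustrative sample only, not taken from the system discussed in this thread.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        try:
            with open(f"/proc/pressure/{resource}") as fh:
                print(resource, parse_pressure(fh.read()))
        except OSError:
            print(resource, "PSI not available on this system")
```

The "some" line corresponds to at least one task being stalled, and the "full" line to all non-idle tasks being stalled at the same time.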

Click on “Information” to see the description

If you see high congestion due to CPU - check CPU% charts and Applications cpu.
If you see high congestion due to memory - check ram% charts and Applications mem.

ben · January 31, 2023, 5:20pm

Yes, I understand this. But none of the pressure graphs match load. In my first screenshot, load is > 40 for the whole selected period, whereas the pressures rise only during specific parts of that period.
Basically, I can’t find any graph that rises at the same time and for the same duration as load.
In other words, if this load were related to any of the pressures, load should go down once the pressure does. But it stays the same.

Austin_Hemmelgarn · January 31, 2023, 8:01pm

Speaking from experience, because of how the load average is calculated, you can sometimes see behavior like this when the system is seeing lots of very short-lived processes created one after the other. In general, such a workload will also show a similar spike in the number of new processes (system.forks) and context switches (system.ctxt), though depending on what those short-lived processes are doing you may not see any associated spike in CPU usage (or at least, not one anywhere near as pronounced as the spike in load average) or the PSI metrics (because there may just be no actual resource contention involved).
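One way to check for this pattern outside the dashboard is to sample the kernel's cumulative fork and context-switch counters twice and compute rates. The Python sketch below assumes a Linux /proc/stat with `processes` and `ctxt` lines; `read_counters` and `rates` are hypothetical helper names.

```python
import time


def read_counters(stat_text: str) -> dict:
    """Extract cumulative 'processes' (forks) and 'ctxt' (context switches)."""
    counters = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("processes", "ctxt"):
            counters[parts[0]] = int(parts[1])
    return counters


def rates(before: dict, after: dict, seconds: float) -> dict:
    """Return per-second rates for counters present in both samples."""
    return {k: (after[k] - before[k]) / seconds for k in before if k in after}


if __name__ == "__main__":
    interval = 1.0
    try:
        with open("/proc/stat") as fh:
            first = read_counters(fh.read())
        time.sleep(interval)
        with open("/proc/stat") as fh:
            second = read_counters(fh.read())
        print(rates(first, second, interval))
    except OSError:
        print("/proc/stat is not available on this system")
```

A high forks-per-second rate alongside a high load average and modest CPU usage would be consistent with the short-lived-process explanation, though it does not prove it.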

ben · January 31, 2023, 8:13pm

@Austin_Hemmelgarn Thank you for the insight. I have had a similar experience before, but I don’t think it applies here:

I’m also including zoomed-out charts, just for reference.

ilyam8 · February 1, 2023, 10:27am

@ben, as I said:

Load average measures the number of threads that are currently working and those waiting to work (CPU, disk, uninterruptible locks). It doesn’t always indicate a problem. It doesn’t say much (who knows if that is a CPU or IO-bound problem).

Look at the number of processes chart: it explains the load, despite the "No charts can explain it" title.
Keep in mind that the number of processes is an instantaneous value, while load average and pressure are trends over N-second windows.
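The difference between an instantaneous count and a windowed trend can be sketched with a small simulation. It assumes the load average behaves as an exponentially damped moving average sampled roughly every 5 seconds, which is how Linux computes it; the function names and the 40-to-0 scenario are illustrative only.

```python
import math

SAMPLE_INTERVAL = 5.0  # assumed seconds between load samples
WINDOW = 60.0          # window of the 1-minute load average, in seconds


def step(load: float, active: float,
         interval: float = SAMPLE_INTERVAL, window: float = WINDOW) -> float:
    """Advance the damped average by one sample of 'active' threads."""
    decay = math.exp(-interval / window)
    return load * decay + active * (1.0 - decay)


def simulate(start_load: float, active_counts) -> list:
    """Return the load value after each sample in active_counts."""
    load = start_load
    history = []
    for active in active_counts:
        load = step(load, active)
        history.append(load)
    return history


if __name__ == "__main__":
    # Active threads drop from 40 to 0; follow load1 for one minute.
    for i, value in enumerate(simulate(40.0, [0] * 12), start=1):
        print(f"t={i * 5:3d}s load1~{value:.2f}")
```

Under these assumptions, a full minute after the active count drops to zero, load1 has only decayed to 40/e, about 14.7. The load chart therefore lags behind charts that show current values.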
