DSK busy 100%

Thin_Troll · May 20, 2022, 12:50pm

I looked at atop and noticed the disk load sitting between 95% and 100%.

I started analyzing. First I shut down all the projects running on this dedicated server and saw the load drop to 15-20%, so I assumed the projects were the cause. They were not: the load came back and climbed to 75-85%. In atop it was clear that whenever a kworker appeared, the disk load jumped instantly.

atop screenshots:

https://i.stack.imgur.com/r81Wr.png
https://i.stack.imgur.com/lsd8f.png
https://i.stack.imgur.com/nQ86t.png

I looked at the perf log and perf top output and saw:

https://i.stack.imgur.com/1VOxm.png
https://i.stack.imgur.com/KdXFa.png

The drives are healthy. Speed test results:

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.4319 s, 2.5 GB/s

Timing buffered disk reads: 3878 MB in 3.00 seconds = 1292.39 MB/sec

What can I do next to track down what is loading the disks to 95-100%?
The system is Debian 10 (Debian 4.19.181-1).

The problem looks similar to the one described in the closed issue on GitHub below.
Can you tell me what the possible causes are, and how and where to fix it?

github.com/Atoptool/atop

Incorrect Busy/AVIO metrics -- NVME device

opened 05:42PM - 16 Jan 19 UTC

closed 07:59AM - 03 Aug 19 UTC

martyg77

Updated to 2.4.0-1 on Arch Linux. This is the first time I see this very useful… tool expose nvme devices. The numbers I am getting from atop don't look right. (Busy and avio metrics on idle system) ![atop](https://user-images.githubusercontent.com/2901784/51266703-ce51a700-1970-11e9-84bc-7c74cae645f9.png) I can't see how my machine would be so badly misconfigured, and the system does not look/feel to have IO bottlenecks from the user perspective. I see the same behavior on kernels 4.19 and 4.20. I am attaching an iostat sample for reference. [iostat.txt](https://github.com/Atoptool/atop/files/2765366/iostat.txt) Not sure where to go with this. atop bug? kernel bug? Arch packaging problem? Broken machine? Could you point me in the right direction, please? Thanks!

cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline

Tasos_Katsoulas · May 20, 2022, 1:06pm

Hello @Thin_Troll, I am not sure what the question is here. If you are seeing abnormal behavior on your system, why not start by installing the Netdata Agent to get a much clearer view of what is going on?

Thin_Troll · May 20, 2022, 2:04pm

I am a dilettante in this matter and I don’t understand where I should poke. Everything I have done so far (the output of iotop, atop, perf top and perf log, the server's standard values, and the logs I collected at people's request using their commands) leads people to the same natural answer: the disk is broken!
But I found on GitHub that the problem may be in the Linux kernel or in a bad RAID, and I’m trying to get to the bottom of it.

If you don’t mind, could you please tell me how to install the Netdata Agent on Debian 10?

Tasos_Katsoulas · May 20, 2022, 2:16pm

You can follow the “Install Netdata with kickstart.sh | Learn Netdata” section in our docs to install the Agent.

Thin_Troll · May 20, 2022, 2:31pm

OK, I installed Netdata. It looks very nice and handy.
I have several questions:

How do I protect this page from prying eyes?
Where should I look to identify a disk problem?

I see something like this.

The same thing happens with two disks.

Tasos_Katsoulas · May 20, 2022, 2:49pm

Now I would suggest two paths; choose whichever you like.

1. Claim (connect) this Agent to the cloud, which will give you:
A. The ability to aggregate all of your dashboards in a single pane (dashboards from many agents, i.e. servers running the Netdata Agent) AND to restrict access to this local dashboard (Web server | Learn Netdata).
B. Extra troubleshooting features such as Metric Correlations | Learn Netdata and Anomaly Advisor | Learn Netdata.
2. Restrict access to local subnets only (Web server | Learn Netdata) and continue with the Agent’s local dashboard. You can also put this dashboard behind an nginx server (if you are already running one on this system) and set a password (Step 10. Set up a proxy | Learn Netdata).

Thin_Troll · May 20, 2022, 2:53pm

What can you say about the data itself?
The load reaches its maximum values. What could be causing it?

Tasos_Katsoulas · May 20, 2022, 2:57pm

Let’s go to the Applications metrics and see which app group is utilizing those disks. Relevant documentation: apps.plugin | Learn Netdata
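
As a cross-check outside the dashboard, the Linux kernel also exposes cumulative per-process I/O counters in /proc/<pid>/io. The following is a minimal Python sketch, assuming a Linux host and enough privileges to read other users' processes (normally root). The function names `parse_proc_io` and `top_io_processes` are hypothetical helpers, not part of Netdata.

```python
import os

def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters

def top_io_processes(limit=10, proc_root="/proc"):
    """Return up to `limit` (total_bytes, pid, name) tuples, largest first."""
    if not os.path.isdir(proc_root):
        return []
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "io")) as f:
                io = parse_proc_io(f.read())
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            # The process exited, or we are not allowed to read it.
            continue
        # read_bytes/write_bytes count I/O that reached the storage layer.
        total = io.get("read_bytes", 0) + io.get("write_bytes", 0)
        rows.append((total, int(entry), name))
    return sorted(rows, reverse=True)[:limit]

if __name__ == "__main__":
    for total, pid, name in top_io_processes():
        print(f"{pid:>7} {name:<20} {total} bytes")
```

The counters are totals since each process started, so a process that was busy hours ago can still rank high. Comparing two runs taken a few seconds apart gives a better picture of current activity.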

Thin_Troll · May 20, 2022, 3:10pm

This is all I found. I would like to look at and understand the config described in the wiki, but I don’t have that path.

That’s all I found.

Tasos_Katsoulas · May 20, 2022, 3:23pm

Don’t you see this under Applications?

Thin_Troll · May 20, 2022, 3:30pm

oh right

Thin_Troll · May 20, 2022, 3:59pm

I don’t see anything unusual. Could this be a monitoring error?

Tasos_Katsoulas · May 20, 2022, 4:28pm

OK, that’s strange: none of the processes is utilizing disk resources. So it may be something in the underlying mechanisms, or in the reporting mechanism for these NVMe devices (calculating wrong values).
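
For context on where a "busy" percentage comes from: monitoring tools typically derive it from the `io_ticks` counter in /proc/diskstats (milliseconds the device spent doing I/O), sampled twice. The Python sketch below works under that assumption. The function names are hypothetical, and this is not the actual code of Netdata or atop.

```python
import time

def read_io_ticks(device, text):
    """Return io_ticks (ms spent doing I/O) for `device` from /proc/diskstats text."""
    for line in text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the stat fields.
        # io_ticks is the 10th stat field, i.e. index 12 of the split line.
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)

def busy_percent(ticks_before, ticks_after, interval_ms):
    """Share of the interval during which the device reported I/O in progress."""
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * (ticks_after - ticks_before) / interval_ms

def sample_busy(device, interval_s=1.0, path="/proc/diskstats"):
    """Sample the device twice and return its busy percentage."""
    with open(path) as f:
        before = read_io_ticks(device, f.read())
    start = time.monotonic()
    time.sleep(interval_s)
    with open(path) as f:
        after = read_io_ticks(device, f.read())
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return busy_percent(before, after, elapsed_ms)

if __name__ == "__main__":
    # nvme0n1 is the device name from the first post; adjust as needed.
    try:
        print(f"nvme0n1 busy: {sample_busy('nvme0n1'):.1f}%")
    except (OSError, KeyError) as exc:
        print(f"could not sample device: {exc}")
```

Because the percentage depends only on that one counter, any tool reading it would report the same inflated value if the kernel accounted `io_ticks` incorrectly for a device, even while actual throughput stayed low.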

Thin_Troll · May 20, 2022, 4:34pm

What is the solution?

Tasos_Katsoulas · May 20, 2022, 5:13pm

@vlvkobal I see this relevant issue: “netdata/collectors/proc.plugin: wrong output for nvme disk stats, nvme disks not autodiscovered · Issue #5744 · netdata/netdata · GitHub”. We closed it with this patch: “Check device names in diskstats plugin by vlvkobal · Pull Request #10843 · netdata/netdata · GitHub”. Do you think that switching the I/O scheduler, as the user suggested in #5744, is necessary (and advised) to receive correct metrics?
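
For reference, the active I/O scheduler is the bracketed name in /sys/block/<device>/queue/scheduler, and it can be changed by writing another listed name to the same file. This is a minimal Python sketch, assuming a Linux host and root privileges for the write. The helper names are hypothetical, and whether switching the scheduler affects the reported busy values is exactly the open question here.

```python
def parse_scheduler(text):
    """Split sysfs scheduler text into (active, available_names)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the active one
            active = name
        available.append(name)
    return active, available

def scheduler_path(device):
    """Path of the sysfs scheduler file for a block device such as nvme0n1."""
    return f"/sys/block/{device}/queue/scheduler"

def set_scheduler(device, name):
    """Select scheduler `name` for `device`. Needs root; not kept across reboots."""
    with open(scheduler_path(device)) as f:
        _, available = parse_scheduler(f.read())
    if name not in available:
        raise ValueError(f"{name!r} is not offered for {device}: {available}")
    with open(scheduler_path(device), "w") as f:
        f.write(name)

if __name__ == "__main__":
    # Sample text taken from the output in the first post.
    active, available = parse_scheduler("[none] mq-deadline")
    print("active:", active, "available:", available)
```

A call such as `set_scheduler("nvme0n1", "mq-deadline")` would switch the first device from the original post. The change takes effect immediately and is lost on reboot, so it is easy to try and revert.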

Thin_Troll · May 20, 2022, 6:10pm

Well, the first link is clear: the task is to change the I/O scheduler setting.
But what should I do with the second link, the one about diskstats?
I don’t understand it at all. Do I already have the corrected files?
If not, where do I put them and how do I use them?

UPD(1): I understand that the changes need to be made in netdata/collectors/proc.plugin/proc_diskstats.c,
but how do I find where Netdata is installed?

UPD(2): I found the Netdata location, but these files are not there.

Thin_Troll · May 20, 2022, 11:02pm

Could the problem be that my snaps are 100% loaded?

vlvkobal · May 24, 2022, 3:03pm

Do you think that switching the I/O scheduler, as the user suggested in #5744, is necessary (and advised) to receive correct metrics?

I don’t know. The issue we fixed was about incorrectly collected data, not about a real disk load.
