DSK busy 100%

I was looking at atop and noticed the disk load sitting between 95% and 100%.

I started investigating by stopping all the running projects on this dedicated server. The load dropped to 15-20%, so I assumed the projects were the cause … but they were not: the load came back and climbed to 75-85%. In atop it was clear that whenever kworker appeared, the disk load jumped instantly.

atop screenshots:

  1. https://i.stack.imgur.com/r81Wr.png

  2. https://i.stack.imgur.com/lsd8f.png

  3. https://i.stack.imgur.com/nQ86t.png

    I looked at perf log and perf top and saw:


    The drives are healthy. Speed test results:

    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.4319 s, 2.5 GB/s

    Timing buffered disk reads: 3878 MB in 3.00 seconds = 1292.39 MB/sec

    What can I do next to localize the cause of the 95-100% disk load?
    System: Debian 10, kernel Debian 4.19.181-1

The problem looks similar to the one described in a closed issue on GitHub.
Can you tell me what the possible causes are, and how and where to fix it?

cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline

cat /sys/block/nvme1n1/queue/scheduler
[none] mq-deadline
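
For reference, the entry in square brackets is the scheduler currently in use, so both drives above are running with `none`. A minimal shell sketch, assuming the standard sysfs layout shown above, that prints the active scheduler for every block device (the function names `active_scheduler` and `list_schedulers` are only illustrative):

```shell
#!/bin/sh
# Extract the active scheduler (the entry in square brackets)
# from a line such as "[none] mq-deadline".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<device>: <active scheduler>" for each block device
# that exposes a scheduler file under /sys/block.
list_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}

list_schedulers
```

Switching is done by writing one of the listed names into the same file as root, for example `echo mq-deadline > /sys/block/nvme0n1/queue/scheduler`. The change does not survive a reboot, and whether it has any effect on the reported load here is not established in this thread.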

Hello @Thin_Troll, I am not sure what the question is here. If you are seeing abnormal behavior in your system, why not install the Netdata Agent to get a much clearer view of what is going on?

I am a dilettante in this matter and I don’t understand where I should poke. Everything I have done so far
(the output of iotop, atop, perf top and perf log, which all show standard values for a server,
and the logs I collected at other people’s request with their commands) leads them to the same answer: the disk is broken!
But I found on GitHub that the problem may be in the Linux kernel or in a bad RAID, and I’m trying to get to the bottom of it.

If you don’t mind, could you please tell me how to install the Netdata Agent on Debian 10?

You can follow the “Install Netdata with kickstart.sh” section in our docs (Learn Netdata) to install the Agent.

OK, I installed Netdata. It looks very nice and handy.
I have several questions:

  1. How can I protect this page from prying eyes?
  2. Where should I look to identify a disk problem?

I see something like this:

This happens on both disks.

I would suggest two paths; choose whichever you like.

  1. Claim (connect) this Agent to the cloud, which gives you:
    A. The ability to aggregate all of your dashboards in a single pane (dashboards from many agents, i.e. servers running the Netdata Agent) AND to restrict access to this local dashboard (Web server | Learn Netdata)
    B. Extra troubleshooting features such as Metric Correlations | Learn Netdata and Anomaly Advisor | Learn Netdata

  2. Restrict access to local subnets only (Web server | Learn Netdata) and continue with the Agent’s local dashboard. You can also put this dashboard behind an nginx server (if you are already running one on this system) and set a password (Step 10. Set up a proxy | Learn Netdata).
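
As a rough illustration of the second path, the fragment below shows the `[web]` section of `netdata.conf` with the two settings that control who can reach the dashboard. The address patterns are assumptions made up for the example, and the file location depends on how the Agent was installed, so treat this as a sketch rather than a ready-made config:

```
[web]
    # Option A: listen only on the loopback interface
    # (remote access then goes through a proxy or an SSH tunnel).
    bind to = 127.0.0.1

    # Option B: keep the default listener but accept clients only from
    # these patterns (example private ranges, adjust to your network).
    allow connections from = localhost 10.* 192.168.*
```

Use one option or the other as needed, and restart the Agent after editing so the change takes effect.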

What can you say about the source data?
The load reaches its maximum values. What could be causing it?

Let’s go to the Applications metrics and see which app group is utilizing those disks. Relevant documentation: apps.plugin | Learn Netdata

This is all I found. I would like to look at and understand the config mentioned in the wiki, but I don’t have this path.

that’s all i found

Don’t you see this under Applications?

oh right

I don’t see anything abnormal. Could this be a monitoring error?

OK, that’s strange: none of the processes is utilizing disk resources. So it may be something in the underlying mechanisms, or the reporting mechanism for these NVMe drives may be calculating wrong values.
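
One way to cross-check the reported figure independently of any monitoring tool is to read the kernel counters directly. The sketch below assumes the usual `/proc/diskstats` layout, where field 13 is the number of milliseconds the device has spent doing I/O, and derives a busy percentage from two readings. The function names are illustrative, and the one-second `sleep` makes the interval only approximate:

```shell
#!/bin/sh
# Busy % from two "ms spent doing I/O" readings taken interval_ms apart.
# $1 = earlier reading, $2 = later reading, $3 = interval in milliseconds
busy_percent() {
    awk -v a="$1" -v b="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", (b - a) * 100 / t }'
}

# Read field 13 (ms spent doing I/O) for one device.
# $1 = device name, $2 = diskstats-formatted file (default /proc/diskstats)
io_ticks() {
    awk -v d="$1" '$3 == d { print $13 }' "${2:-/proc/diskstats}"
}

# Sample a device twice, roughly one second apart, and print its busy %.
sample_busy() {
    t1=$(io_ticks "$1")
    sleep 1
    t2=$(io_ticks "$1")
    busy_percent "$t1" "$t2" 1000
}

# Example (on the affected server): sample_busy nvme0n1
```

If this also shows values near 100% while no process is doing noticeable I/O, the number comes from the kernel counter itself rather than from the tool displaying it. Note that a busy percentage derived this way only says the device had at least one request in flight; for devices that serve requests in parallel, such as NVMe drives, it does not by itself indicate saturation.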

What is the solution?

@vlvkobal I see a relevant issue: “netdata/collectors/proc.plugin: wrong output for nvme disk stats, nvme disks not autodiscovered” (Issue #5744, netdata/netdata on GitHub). We closed it with the patch “Check device names in diskstats plugin” by vlvkobal (Pull Request #10843, netdata/netdata on GitHub). Do you think that switching the I/O scheduler, as the user suggested in #5744, is necessary (and advised) to receive correct metrics?

Well, the first link is clear: the task is to change the I/O scheduler setting.
But what should I do with the second link, the one about diskstats?
I don’t understand it at all. Are there corrected files for me to use?
Where do I put them and how do I use them?

UPD(1): I understand that the changes need to be made in netdata/collectors/proc.plugin/proc_diskstats.c,
but how do I find where Netdata is installed?

UPD(2): I found the Netdata location, but these files are not there.

Could it be a problem that my snaps are 100% loaded?

Do you think that switching the I/O scheduler, as the user suggested in #5744, is necessary (and advised) to receive correct metrics?

I don’t know. The issue we fixed was about incorrectly collected data, not about a real disk load.