I’m curious how other people use the metrics disk.backlog, disk.util, and disk.busy in practice and whether they are producing useful results for anyone.
To recap some things from documentation and other forum posts:
disk.util and disk.backlog come from Field 10 and Field 11 of /proc/diskstats
Quoting from https://www.kernel.org/doc/Documentation/iostats.txt:
Field 9 – # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.
Field 10 – # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.
Field 11 – weighted # of milliseconds spent doing I/Os
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of milliseconds spent doing I/O since the
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.
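To make the field numbering concrete, here is a minimal Python sketch of pulling fields 9–11 out of a /proc/diskstats line (the function name and the sample values are made up for illustration; the real file has lines of the form "major minor name" followed by the 11 stat fields, so "field N" in the kernel doc is token index 2 + N):

```python
def parse_diskstats_line(line):
    # Columns: major, minor, device name, then the 11 stat fields
    # described in Documentation/iostats.txt.
    tokens = line.split()
    name = tokens[2]
    in_progress = int(tokens[11])   # field 9: I/Os currently in progress
    io_ms = int(tokens[12])         # field 10: ms spent doing I/Os
    weighted_ms = int(tokens[13])   # field 11: weighted ms ("backlog")
    return name, in_progress, io_ms, weighted_ms

# A made-up sample line with 2 I/Os in flight:
sample = "8 0 sda 123 4 5678 90 12 3 456 78 2 3400 5100"
print(parse_diskstats_line(sample))  # ('sda', 2, 3400, 5100)
```

Note that newer kernels append extra fields (discards, flushes) after field 11, but the first 14 columns keep this layout.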
Let me know if I’m getting anything wrong below:
Field 10 (utilization) will increase as long as there is an I/O in progress (including queued I/Os). But, as the man page for iostat says:
%util Percentage of elapsed time during which I/O
requests were issued to the device (bandwidth
utilization for the device). Device saturation
occurs when this value is close to 100% for devices
serving requests serially. But for devices serving
requests in parallel, such as RAID arrays and
modern SSDs, this number does not reflect their
performance limits.
So this metric probably made more sense for older disk architectures; on a modern system, a disk reported as “100% utilized” could potentially do more work without any degradation in response times. It’s unclear whether iostat gets its %util from the same source, but the concept sounds the same.
disk.busy looks like disk.util converted to milliseconds: for example, at 45% disk.util we see 450 ms of disk.busy per second of elapsed time. I’ve talked to some people who were misinterpreting this as a disk response time and giving expensive advice based on that.
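If the concept matches iostat, both numbers would fall out of the delta of field 10 over the sampling interval; a minimal sketch (function name and interval handling are my assumptions, not netdata’s actual code):

```python
def util_percent(io_ms_prev, io_ms_now, interval_ms):
    # field 10 is a cumulative counter of ms with at least one I/O
    # in flight; its delta over the interval is the "busy" time.
    busy_ms = io_ms_now - io_ms_prev
    return 100.0 * busy_ms / interval_ms

# 450 ms of busy time in a 1000 ms window -> 45% util
print(util_percent(3400, 3850, 1000))  # 45.0
```

The busy_ms delta itself would be the disk.busy value for the interval, which is why 45% util and 450 ms busy show up together.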
The backlog (field 11) sounds like it’s incremented by field 9 times the milliseconds spent doing I/O since the last update (the last update of what is unclear - I would read the code if I thought I would understand it). I would expect this to be a fairly high number on a busy system, except when there are zero I/Os in progress and field 9 goes to zero.
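For what it’s worth, my understanding is that tools can turn the field 11 counter into an average queue size by taking its delta over wall-clock time; a hedged sketch (my own naming, under that assumption):

```python
def avg_queue_size(weighted_ms_prev, weighted_ms_now, interval_ms):
    # The delta of field 11 divided by elapsed wall-clock time
    # approximates the average number of I/Os in flight
    # (queued plus in service) over the interval.
    return (weighted_ms_now - weighted_ms_prev) / interval_ms

# 2000 weighted-ms accumulated over a 1000 ms window
# -> on average 2 I/Os were outstanding
print(avg_queue_size(5100, 7100, 1000))  # 2.0
```

If that reading is right, the raw backlog counter is mostly useful as an input to this kind of rate calculation, not as a number to look at directly.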
I’m concerned that these metrics could mislead people into always concluding that a busy system is I/O-bound just because there are I/Os happening continuously (a common workload pattern in databases etc).
Some questions to wrap up:
Has anyone found value in the above metrics or are they just confusing people?
I’ve generally focused on the response time (disk.await) as a performance indicator. Is this a better approach?
Would disk.await times the queue size (it looks like this is disk.qops in netdata?) be a better way to estimate a potential backlog, with the caveat that the real drain time could be lower, since some operations are served in parallel?
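As a back-of-the-envelope illustration of what I mean (names are mine; the strictly-serial-service assumption makes this an upper bound, not a measurement):

```python
def drain_time_ms(await_ms, queue_size):
    # If every queued I/O were served one after another at the
    # current average response time, this is how long the queue
    # would take to drain. Parallel devices will finish sooner.
    return await_ms * queue_size

# 8 queued I/Os at 4 ms average response time
print(drain_time_ms(4.0, 8))  # 32.0
```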
Thank you for reading.