Interpreting disk.backlog, disk.util, and disk.busy

Hello,

I’m curious how other people use the metrics disk.backlog, disk.util, and disk.busy in practice and whether they are producing useful results for anyone.

To recap some things from documentation and other forum posts:

Utilization and backlog are derived from Field 10 and Field 11 of /proc/diskstats, respectively.

Quoting from https://www.kernel.org/doc/Documentation/iostats.txt:

Field 9 – # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.

Field 10 – # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.

Field 11 – weighted # of milliseconds spent doing I/Os
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of milliseconds spent doing I/O since the
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.

Let me know if I’m getting anything wrong below:

Field 10 (utilization) will increase as long as there is an I/O in progress (including queued I/Os). But, as the man page for iostat says:

%util Percentage of elapsed time during which I/O
requests were issued to the device (bandwidth
utilization for the device). Device saturation
occurs when this value is close to 100% for devices
serving requests serially. But for devices serving
requests in parallel, such as RAID arrays and
modern SSDs, this number does not reflect their
performance limits.

So this metric would probably have made more sense for historic disk architectures, and on modern systems, a “100% utilized” disk could potentially be doing more work without any degradation in response times. It’s unclear if iostat is getting the %util from the same source, but it sounds like the concept is the same.
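For anyone who wants to check the utilization arithmetic by hand, here is a minimal sketch in Python. It assumes utilization is simply the growth of field 10 divided by the wall-clock interval. The two sample lines are made up for a hypothetical device "sda", and the helper names are mine, not anything from netdata or iostat.

```python
def parse_diskstats_line(line):
    """Return (device, in_flight, io_ms, weighted_io_ms) from one /proc/diskstats line."""
    parts = line.split()
    # parts[0..2] are major, minor and device name, so kernel "Field N" is parts[2 + N].
    return parts[2], int(parts[11]), int(parts[12]), int(parts[13])


def util_percent(io_ms_before, io_ms_after, elapsed_ms):
    """Share of the interval during which at least one I/O was in progress."""
    return 100.0 * (io_ms_after - io_ms_before) / elapsed_ms


# Two invented samples, assumed to be taken 1000 ms apart.
sample_t0 = "8 0 sda 1000 10 80000 500 2000 20 160000 1500 0 1200 2600"
sample_t1 = "8 0 sda 1100 10 88000 560 2150 25 172000 1740 2 1650 3500"

_, _, io_ms_0, _ = parse_diskstats_line(sample_t0)
_, _, io_ms_1, _ = parse_diskstats_line(sample_t1)
print(util_percent(io_ms_0, io_ms_1, 1000))  # 45.0
```

In this made-up interval field 10 grew by 450 ms out of 1000 ms, so the device had at least one request outstanding for 45% of the time. Nothing in that number says how many requests were outstanding or how long each one took.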

disk.busy looks like disk.util converted to milliseconds: for example, at 45% disk.util we have 450 ms disk.busy. I’ve talked to some people who were misinterpreting this as a disk response time and giving expensive advice based on that.

The backlog (field 11) sounds like it’s field 9 times field 10 since the last update (the last update of what is unclear - I would read the code if I thought I would understand it). I would expect this to be a fairly high number on a busy system, except when there are zero I/Os in progress and field 9 goes to zero.
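To make the difference between field 10 and field 11 concrete, here is a toy model in Python. It follows the wording of the kernel documentation quoted above (number of I/Os in progress times milliseconds elapsed), not the kernel's actual accounting code, which is event-driven and may differ between versions. The workload segments are invented.

```python
def accumulate(segments):
    """Accumulate field-10-style and field-11-style counters.

    segments is a list of (in_flight, duration_ms) pairs describing how many
    I/Os were in progress for how long.
    """
    io_ms = 0
    weighted_ms = 0
    for in_flight, duration_ms in segments:
        if in_flight > 0:
            io_ms += duration_ms                # field 10: time with any I/O in progress
        weighted_ms += in_flight * duration_ms  # field 11: time weighted by I/Os in progress
    return io_ms, weighted_ms


# One invented second: 3 I/Os in flight for 200 ms, 1 for 100 ms, idle for 700 ms.
segments = [(3, 200), (1, 100), (0, 700)]
io_ms, weighted_ms = accumulate(segments)
print(io_ms, weighted_ms)   # 300 700
print(weighted_ms / 1000)   # 0.7
```

Under this model the weighted counter grows faster than the plain one whenever more than one request is outstanding, and dividing its growth by the interval gives the average number of requests in progress (0.7 here, against 30% utilization).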

I’m concerned that these metrics could mislead people into always concluding that a busy system is I/O-bound just because there are I/Os happening continuously (a common workload pattern in databases etc).

Some questions to wrap up:

Has anyone found value in the above metrics or are they just confusing people?

I’ve generally focused on the response time (disk.await) as a performance indicator. Is this a better approach?

Would disk.await times the queue size (it looks like this is disk.qops in netdata?) be a better approach for measuring potential backlog, with the caveat that it could be lower as some operations could be done in parallel?

Thank you for reading.

James

Hi @jrl - welcome and thanks for the interesting question!

I’m not really sure myself tbh.

I wonder if you could use a tool like stress-ng to formulate and test some of those ideas?

https://manpages.ubuntu.com/manpages/bionic/man1/stress-ng.1.html

(this YouTube video goes into stress-ng in a lot of detail, with some examples)

@Costa_Tsaousis or @Austin_Hemmelgarn any opinions or thoughts?

First, a quick bit of clarity regarding the quote from the iostat manpage. When it says ‘modern SSDs’ it’s talking about things like NVMe storage or really fancy SAS SSDs that can actually reliably do things in parallel (as compared to SATA SSDs, which often do not have particularly good support for parallelized IO operations at the device level). So even on some relatively ‘modern’ systems, the utilization metric is still useful for the same reasons that it traditionally was.


Now, as to the actual questions:

Has anyone found value in the above metrics or are they just confusing people?

I have personally found some value in the percent utilization for identifying when the system isn’t heavily impacted by disk performance. The lower that number is, the less impacted by disk performance your workload is, irrespective of whether the storage device can parallelize things or not. It’s still not an absolute measure, but it is good enough for comparison among workloads. IOW, you can still use this for performance tuning the application even if you can’t use it effectively for determining if you need better hardware.

Additionally, if you already know that storage performance is a bottleneck, then the disk backlog can give you a quick relative estimate of how bad it is in ways that simply counting IOPS or bytes transferred cannot. You can get the same kind of information by computing it from other charts, but it’s not quite ‘at a glance’ that way.

I’ve generally focused on the response time (disk.await) as a performance indicator. Is this a better approach?

If your goal is low latency, then yes, that’s a good indicator on its own.

For IOPS and/or throughput though, you’re better off looking at the actual IOPS or throughput charts and comparing to theoretical numbers for the device (hitting or exceeding the theoretical numbers generally means you’re storage-bound).
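As a rough illustration of the two views above, here is a Python sketch that derives an iostat-style average response time and an IOPS figure from two /proc/diskstats samples. It assumes await is the growth in time spent on reads and writes (fields 4 and 8) divided by the growth in completed reads and writes (fields 1 and 5). Whether netdata's disk.await is computed exactly this way is an assumption on my part, and the sample numbers are invented.

```python
def await_ms(before, after):
    """Average time per completed I/O between two samples, in milliseconds."""
    ops = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    total_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return total_ms / ops if ops else 0.0


def iops(before, after, elapsed_s):
    """Completed reads plus writes per second over the interval."""
    ops = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    return ops / elapsed_s


# Invented counters, assumed to be sampled 1 second apart.
before = {"reads": 1000, "read_ms": 500, "writes": 2000, "write_ms": 1500}
after = {"reads": 1100, "read_ms": 560, "writes": 2150, "write_ms": 1740}

print(await_ms(before, after))   # 1.2
print(iops(before, after, 1.0))  # 250.0
```

The IOPS figure is the one to hold against the device's rated numbers, while the await figure is the latency view. Flushes and discards are not included in this sketch.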

Would disk.await times the queue size (it looks like this is disk.qops in netdata?) be a better approach for measuring potential backlog, with the caveat that it could be lower as some operations could be done in parallel?

That would be a definite maybe. The problem here is stuff like flushes and discards. Both significantly inflate IO completion times, but both are also in a separate category for accounting. IOW, you also need to look at the extended IO operations charts to get a good idea of what’s going on. Flushes in particular are especially problematic, because they are accounted differently before and after any merging of requests happens.