This alert calculates the average iowait time over 10-minute intervals.
iowait is the percentage of time where there has been at least one I/O request in progress while the CPU has been idle.
At the process level, I/O means using the read and write services, such as reading data from a physical drive.
It’s important to note that while a process waits on I/O, the system can schedule other processes, but iowait is measured specifically while the CPU is idle.
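As a rough illustration of that definition, the sketch below derives an iowait percentage from two snapshots of the aggregate `cpu` line in `/proc/stat` on Linux. The function name `iowait_pct` is hypothetical, and this is not how Netdata itself collects the metric; it only shows the arithmetic (iowait ticks divided by all CPU ticks elapsed between the two samples).

```shell
# iowait_pct: read two aggregate "cpu" lines from /proc/stat on stdin
# (older sample first) and print the share of elapsed CPU time spent in iowait.
# Hypothetical helper for illustration only.
iowait_pct() {
  awk '
    $1 == "cpu" {
      total = 0
      # Fields 2..9: user nice system idle iowait irq softirq steal.
      # guest and guest_nice (fields 10, 11) are already included in user and nice.
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      if (n++ == 0) { t0 = total; w0 = $6 }
      else          { t1 = total; w1 = $6 }
    }
    END {
      # Without two samples (or with no elapsed ticks) there is nothing to divide.
      if (n < 2 || t1 == t0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (w1 - w0) / (t1 - t0)
    }'
}

# Live usage on Linux (two snapshots taken one second apart):
#   { grep '^cpu ' /proc/stat; sleep 1; grep '^cpu ' /proc/stat; } | iowait_pct
```

The function only reads from stdin, so it can be tried on saved samples as well as on a live system.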
A common example of when this alert might be triggered is when your CPU requests some data and the device responsible for it can’t deliver it fast enough. As a result, the CPU is idle at the next clock interrupt, so you encounter iowait. If this persists for some time and the average of the gathered metrics exceeds the threshold checked in the .conf file, the alert is raised because the CPU is being bottlenecked by your system’s disks. [1][2]
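The averaging-and-threshold logic described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `check_avg`, the `RAISED`/`CLEAR` labels, and the threshold of 20 in the usage line are all hypothetical, and the real thresholds live in the alert's .conf file.

```shell
# check_avg THRESHOLD: read one iowait percentage per line on stdin, print the
# average and whether it exceeds THRESHOLD. Hypothetical helper; the labels and
# the threshold are illustrative, not Netdata's actual values.
check_avg() {
  awk -v limit="$1" '
    NF { sum += $1; n++ }            # skip blank lines, accumulate samples
    END {
      if (n == 0) { print "no samples"; exit 2 }
      avg = sum / n
      printf "avg=%.1f %s\n", avg, (avg > limit ? "RAISED" : "CLEAR")
    }'
}

# Usage with an assumed threshold of 20 percent:
#   printf '10\n30\n50\n' | check_avg 20
```

A single spike does not trip the check on its own; only a sustained run of high samples pulls the average over the threshold.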
Check for main I/O related processes and hardware issues
Generally, this issue is caused by having slow hard drives that cannot keep up with the speed of your CPU. You can see the percentage of iowait by going to your node on Netdata Cloud and clicking the iowait dimension under the Total CPU Utilization chart.
You can use `vmstat` (or `vmstat 1`, to set a delay between updates in seconds):
    root@netdata~ # vmstat
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     8  0 1200384 168456  48840 1461540    4   14    65    51  334  196  3  1 95  0  0
The procs column shows:
b: The number of processes blocked waiting for I/O to complete.
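To watch only the values relevant to this alert, a small filter can pull the `b` and `wa` columns out of the `vmstat` output. The sketch below assumes the usual `vmstat` layout, where a header row starting with `r b` names the columns; the function name `vmstat_io` is hypothetical.

```shell
# vmstat_io: read vmstat output on stdin and print the "b" (blocked processes)
# and "wa" (iowait) columns for every sample row. Columns are located by header
# name rather than by fixed position. Hypothetical helper for illustration.
vmstat_io() {
  awk '
    $1 == "r" && $2 == "b" {            # header row with the column names
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("b" in col) && $1 ~ /^[0-9]+$/ {   # numeric sample row after the header
      printf "blocked=%s iowait=%s%%\n", $(col["b"]), $(col["wa"])
    }'
}

# Usage: five samples, one second apart
#   vmstat 1 5 | vmstat_io
```

A `b` value that stays above zero together with a high `wa` value suggests processes are stuck waiting on the disks.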
After that, you can use ps, specifically `ps -eo s,user,cmd | grep '^[D]'`. The grep command keeps only the processes whose state code starts with D, which means uninterruptible sleep (usually I/O).
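When many processes are in the D state, grouping them by owner can make the main consumers easier to spot. The following is a minimal sketch that post-processes the same `ps -eo s,user,cmd` output; the function name `d_state_by_user` is hypothetical.

```shell
# d_state_by_user: read `ps -eo s,user,cmd` output on stdin, keep rows whose
# state code starts with D (uninterruptible sleep), and print a per-user count.
# Hypothetical helper for illustration.
d_state_by_user() {
  awk '$1 ~ /^D/ { count[$2]++ }
       END { for (u in count) printf "%s %d\n", u, count[u] }' | sort
}

# Usage:
#   ps -eo s,user,cmd | d_state_by_user
```

The D state is often short-lived, so running the command several times gives a more reliable picture than a single snapshot.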
Note: It would be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.
- If you don’t have many processes that you can terminate (or you need them for your workflow), you would have to upgrade your system’s drives; if you have an HDD, upgrading to an SSD or an NVMe drive would make a great impact on this metric.
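Before deciding on an upgrade, it can help to confirm which drives are actually spinning disks. On Linux, the kernel exposes a `rotational` flag per block device under `/sys/block`. The sketch below reads that flag; the function name `list_rotational` and its optional directory argument are hypothetical, and virtual or RAID devices may report a value that does not reflect the underlying hardware.

```shell
# list_rotational [SYSFS_BLOCK_DIR]: print each block device and whether the
# kernel reports it as rotational (HDD) or non-rotational (SSD/NVMe).
# Hypothetical helper; the directory argument defaults to /sys/block.
list_rotational() {
  lr_dir="${1:-/sys/block}"
  for lr_file in "$lr_dir"/*/queue/rotational; do
    [ -r "$lr_file" ] || continue          # skip if the glob matched nothing
    lr_dev=${lr_file%/queue/rotational}    # strip the trailing attribute path
    lr_dev=${lr_dev##*/}                   # keep only the device name
    if [ "$(cat "$lr_file")" = "1" ]; then
      echo "$lr_dev rotational"
    else
      echo "$lr_dev non-rotational"
    fi
  done
}

# Usage:
#   list_rotational
```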
Check your database
- As another example, in a database environment, you would want to optimize your operations. Check for potential inserts on large data sets, keeping in mind that write operations take more time than read operations. You should also search for complex requests, like large joins and queries over a big data set. These can introduce iowait and need to be optimized.