over the last years of usage and contribution to netdata, I’ve came across multiple bugs and use-cases where netdata provided an invaluable help in understanding and fixing.
I’ll try to share some of them here, so everyone can see and understand the power of this tool
Server stall every 1300s
One recent bug that bit me: users complained that “randomly”, their application froze for a few seconds, then came back to normal, while the server was not really loaded.
Looking at netdata graphs, I could indeed see a recurrent IO-wait pattern every 1300seconds (precisely).
And the processes were indeed blocked for the kernel
Checking the per-CPU detail, IO wait took whole CPU time, 100% wait
Checking the time between 2 freeze, we had 1300s
And freeze were between 2 to 8 seconds
We involved the hardware constructor, and the OS Vendor, but none of them were able to track down the issue.
FInally, during a firmware upgrade, we noticed the same behavior started on newly patched servers. It was due to a certain version of the BMC firmware that issued commands to the storage controller and triggered bus scan that froze the whole system…
Kernel memory leak
Another one that was seemingly randomly triggered : the server started to show slowness, then we had
page allocation failure messages in
/var/log/messages, then the network interface reset.
CPUs spent a lot of time in kernel time
While there still was a lot of free memory…
Interestingly, the SLAB (a cache for kernel objects) was increasingly growing,
Fortunately, I contributed the slabinfo collector a few months ago, so I got the detail of the kernel objects.
Having this detail of
buffer_head we tracked it down to a bug in a driver that created a rcu_stall (kinda deadlock)
When your server goes out of RAM, it will start swapping out old pages before being considered “out of memory” and trigger OOM Killer.
While this seems a correct thing, this usally leads to an unresponsive server, as more and more process are stuck waiting for memory to be allocated, themselves needing space to be freed by sending old pages to disk.
When you have old machines that had GB of SWAP (following the old sizing rule of
min("RAM * 1.5" , 16GB), it can lead to a significant unavailability time, due to the slow disks.
Now, I simply disable swap. I’d rather have a large process killed than a “not dead but not usable” server that I had to reboot fully.
In 1 image, you can see the result of 2 commands:
sudo sysctl vm.drop_caches=3 : drop the whole cache to do some space for the swap to be retrieved in memory. This is the first drop in the blue stack of the first graph (which then starts growing again)
sudo swapoff /dev/mapper/rootvg-swap : Disabling the swap: On the 2nd graph you see free swap space dropping to 0 (no more swap allocated), then the used swap space slowly shriking to 0 (swap being retrieved from disk to memory). The 3d graph show the disk I/O operations on the swapfile when removing swap from disk and replacing it into memory.
Happy debugging with netdata !