What did netdata do for you (that no other tool could have)?

Hello everyone,

Over the last few years of using and contributing to netdata, I’ve come across multiple bugs and use cases where netdata provided invaluable help in understanding and fixing them.

I’ll share some of them here, so everyone can see and understand the power of this tool.

Server stall every 1300s

One recent bug that bit me: users complained that, “randomly”, their application froze for a few seconds and then came back to normal, while the server was not under any real load.

Looking at the netdata graphs, I could indeed see a recurrent IO-wait pattern every 1300 seconds (precisely).

And the processes were indeed blocked in the kernel.
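On the shell side, a quick way to confirm this (plain ps/awk, nothing netdata-specific) is to look for processes stuck in uninterruptible sleep:

```
# processes in "D" state: uninterruptible sleep, usually blocked on I/O inside the kernel
ps -eo state,pid,comm | awk '$1 == "D"'
```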

Checking the per-CPU detail, IO wait took up the whole CPU time: 100% wait.


Checking the time between two freezes, we measured exactly 1300s. And each freeze lasted between 2 and 8 seconds.
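To cross-check such a pattern outside of netdata, here is a minimal sketch that samples the kernel’s cumulative iowait counter from /proc/stat once per second; a stall shows up as a sudden jump between two consecutive samples:

```
# field 6 of the aggregate "cpu" line is cumulative iowait time, in USER_HZ ticks
while sleep 1; do
  awk -v t="$(date '+%F %T')" '/^cpu /{print t, "iowait_ticks=" $6}' /proc/stat
done
```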

We involved the hardware manufacturer and the OS vendor, but neither of them was able to track down the issue.

Finally, during a firmware upgrade, we noticed the same behavior starting on newly patched servers. It was due to a certain version of the BMC firmware that issued commands to the storage controller and triggered a bus scan that froze the whole system…

Kernel memory leak

Another one that was seemingly triggered at random: the server started to show slowness, then we got page allocation failure messages in /var/log/messages, and then the network interface reset.


The CPUs spent a lot of their time in kernel (system) time.
While there was still a lot of free memory…
Interestingly, the SLAB (a cache for kernel objects) kept growing.
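Even without a dedicated collector, /proc/slabinfo gives a rough view of which caches are eating memory (it is only readable by root on most systems); a quick sketch to rank them by total object size:

```
# skip the two header lines, then compute num_objs * objsize per cache and sort
sudo awk 'NR > 2 { printf "%10d KB  %s\n", $3 * $4 / 1024, $1 }' /proc/slabinfo | sort -rn | head
```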

Fortunately, I had contributed the slabinfo collector a few months earlier, so I could get the detail of the kernel objects.
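If memory serves, the collector ships with netdata but is disabled by default (it needs privileges to read /proc/slabinfo); enabling it should be a one-line change in netdata.conf:

```
# netdata.conf — enable the slabinfo collector (disabled by default)
[plugins]
    slabinfo = yes
```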


Having this detail of the filp and buffer_head caches, we tracked it down to a bug in a driver that created an rcu_stall (a kind of deadlock).
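The kernel itself usually complains loudly when it detects an RCU stall, so the kernel ring buffer is worth checking too:

```
# RCU stall warnings ("rcu_sched self-detected stall on CPU" and friends) land in the kernel log
dmesg | grep -i 'rcu.*stall'
```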

Disabling Swap

When your server runs out of RAM, it will start swapping out old pages before considering itself “out of memory” and triggering the OOM killer.
While this seems like the right thing to do, it usually leads to an unresponsive server, as more and more processes get stuck waiting for memory to be allocated, which itself requires space to be freed by writing old pages to disk.
On old machines that had many GB of swap (following the old sizing rule of min(RAM * 1.5, 16GB)), this can lead to significant unavailability, due to the slow disks.

Now, I simply disable swap. I’d rather have a large process killed than a “not dead but not usable” server that I have to fully reboot.
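Before pulling the swap out from under a live server, it is worth checking what is actually in use (standard util-linux / procps tools):

```
swapon --show   # active swap devices and how much of each is used
free -h         # overall memory/swap picture: make sure everything fits in RAM
```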
In one image, you can see the result of two commands:

- `sudo sysctl vm.drop_caches=3`: drop the whole cache to make room for the swapped pages to be brought back into memory. This is the first drop in the blue stack of the first graph (which then starts growing again).
- `sudo swapoff /dev/mapper/rootvg-swap`: disable the swap. On the second graph you see free swap space dropping to 0 (no more swap available), then the used swap space slowly shrinking to 0 (swapped pages being read back from disk into memory). The third graph shows the disk I/O operations on the swap device while it is being drained.
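To keep swap from coming back at the next boot, the change has to be made permanent as well; a minimal sketch, assuming the swap entry lives in /etc/fstab (adapt the pattern to your own entry, and keep the backup):

```
sudo swapoff -a                             # disable every active swap device now
sudo cp /etc/fstab /etc/fstab.bak           # backup before editing
sudo sed -i '/\sswap\s/s/^/#/' /etc/fstab   # comment out swap entries so they do not return at boot
```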

Happy debugging with netdata!


For me it made all my cloud data go away!
(Seriously, did that just happen to anyone else?)