What did netdata do for you (and no other tool could have)



  • Hello everyone,

over the last years of using and contributing to netdata, I’ve come across multiple bugs and use-cases where netdata provided invaluable help in understanding and fixing them.

I’ll try to share some of them here, so everyone can see and understand the power of this tool.

    Server stall every 1300s

One recent bug that bit me: users complained that, “randomly”, their application froze for a few seconds and then came back to normal, while the server was not really loaded.

Looking at the netdata graphs, I could indeed see a recurrent IO-wait pattern every 1300 seconds (precisely).
    bug_idrac_perc_system.png

And the processes were indeed blocked in the kernel
    bug_idrac_perc_process.png

Checking the per-CPU detail, IO wait was taking up all of the CPU time, 100% wait
    bug_idrac_perc_cpus.png
Checking the time between two freezes, we had exactly 1300s
    bug_idrac_perc_cpus_zoom.png
And the freezes lasted between 2 and 8 seconds
    bug_idrac_perc_cpus_zoom2.png
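
As an aside (this is a generic double-check, not part of the netdata graphs above): if you want to confirm from a shell that processes really are stuck in uninterruptible sleep during such a freeze, something like this usually does it:

    # List processes in uninterruptible sleep (D state), i.e. blocked in the
    # kernel, together with the kernel function they are waiting in.
    ps -eo state,pid,comm,wchan | awk '$1 ~ /^D/'

    # If sysrq is enabled, ask the kernel to dump the stacks of all blocked
    # tasks into the kernel log (read it back with dmesg).
    echo w | sudo tee /proc/sysrq-trigger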

We involved the hardware vendor and the OS vendor, but neither of them was able to track down the issue.

Finally, during a firmware upgrade, we noticed the same behavior starting on newly patched servers. It turned out to be a certain version of the BMC firmware that issued commands to the storage controller and triggered a bus scan that froze the whole system…

    Kernel memory leak

Another one that was seemingly triggered at random: the server started to slow down, then we got page allocation failure messages in /var/log/messages, then the network interface reset.
    kernel_driver_slab_system.png
The CPUs spent a lot of time in kernel (system) time
    kernel_driver_slab_cpus.png
While there was still a lot of free memory…
    kernel_driver_slab_ram.png
Interestingly, the SLAB (a cache for kernel objects) kept growing,
    kernel_driver_slab_slabtotal.png

Fortunately, I had contributed the slabinfo collector a few months earlier, so I had the detail of the kernel objects.
    kernel_driver_slab_slabdetail.png
With this detail on the filp and buffer_head caches, we tracked it down to a bug in a driver that caused an RCU stall (a kind of deadlock).
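
For reference, the same per-cache detail that the slabinfo collector charts can also be read manually; the commands below are just the ones I would reach for, adjust to taste:

    # Top slab caches sorted by cache size, one-shot output
    # (root is needed to see real numbers).
    sudo slabtop -o -s c | head -n 20

    # Or the raw file the collector parses:
    # cache name, active objects, total objects, object size.
    sudo awk 'NR > 2 {print $1, $2, $3, $4}' /proc/slabinfo | sort -k2 -nr | head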

    Disabling Swap

When your server runs out of RAM, it will start swapping out old pages before considering itself “out of memory” and triggering the OOM killer.
While this seems like the right thing to do, it usually leads to an unresponsive server, as more and more processes get stuck waiting for memory to be allocated, which itself requires freeing space by writing old pages out to disk.
On old machines that have GBs of swap (following the old sizing rule of min(RAM * 1.5, 16 GB)), this can lead to significant downtime because of the slow disks.

Now, I simply disable swap. I’d rather have a large process killed than a “not dead but not usable” server that I have to fully reboot.
In one image, you can see the result of two commands:
sudo sysctl vm.drop_caches=3 : drop the caches to make room for the swapped-out pages to be brought back into memory. This is the first drop in the blue stack of the first graph (which then starts growing again).
sudo swapoff /dev/mapper/rootvg-swap : disable the swap. On the second graph you see the free swap space dropping to 0 (no more swap available), then the used swap space slowly shrinking to 0 (swap being read back from disk into memory). The third graph shows the disk I/O on the swap volume while its content is moved back into memory.
    swap_disable_memory.png
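
Put together, the sequence is roughly the following (the device path is from my LVM layout, adapt it to yours; the checks at the end are just a habit of mine, not something netdata requires):

    # Drop the page cache first, so the swapped-out pages have room to come
    # back into RAM (the first drop in the cached-memory graph).
    sudo sysctl vm.drop_caches=3

    # Disable the swap volume; swapoff reads everything back from disk,
    # which is the I/O burst visible on the third graph.
    sudo swapoff /dev/mapper/rootvg-swap    # or: sudo swapoff -a

    # Verify that no swap is left.
    swapon --show
    free -h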

Happy debugging with netdata!



  • For me it made all my cloud data go away!
(seriously, did that just happen for anyone else?)

