So I got a chance to play with the brand new metrics correlation feature on Netdata Cloud and it’s been very helpful!
I know I’m not alone in getting fairly regular alerts for “ipv4 tcp resets received”. In fact I seem to get them once a day or whenever I restart the Netdata agent. So today I got another one at 07:23 AM and I decided to use the metrics correlation for that spike of TCP resets.
The results came back quickly and were illuminating.
It turns out that I see a spike of CPU usage right before this flurry of TCP resets:
There are other charts with correlated data but actually what was really interesting was how many of the charts had a gap of missing data from the exact time of the TCP resets (note that the CPU above precedes the TCP resets)…
When I scrolled down to the Applications areas I could see what was eating up the CPU time:
When I check the Netdata agent logs I can indeed see that the agent shut itself down and started back up. I’m presuming that the CPU usage by BUILD was the daily agent upgrade job.
I had started to simply live with the regular TCP reset alerts wondering if it was something being buggy on my network but still rather worried about what was causing them (some sort of network attack??). But in 5 minutes I managed to figure out what caused the spikes!!
Well done Netdata team: this is such a good feature!!
@Luis-Johnstone yep - trying to do something around alarms is defo on the list.
That’s a good idea and could be a nice place to start, just giving you some sort of basic intelligence on the alarms you already have configured. And then we could maybe even get more advanced at a later stage by picking out ‘normal’ alarms vs unusual or unexpected alarms and handling that in some smarter way (if the user were to opt into that approach, of course).
You’re most welcome. This is really nice tech.
I’m happy to provide feedback and have plenty of ideas coming from an Ops background.
@andrewm4894
Even further, you could do some analysis of the frequency of occurrence of various alerts and have an algorithm to filter out the noise automatically. Additionally, you could bubble them up to the user and get them to up or down vote whether they consider the alert or correlation to be note-worthy (not sure if the existing buttons do exactly that already).
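To make the frequency idea a bit more concrete, here is a minimal Python sketch of what I have in mind. It is not anything Netdata does today; the function name, the alert names and the `routine_per_day` threshold are all made up for illustration.

```python
from collections import Counter


def split_by_frequency(alert_names, days_observed, routine_per_day=0.5):
    """Split alert names into 'routine' (frequent) and 'unusual' (rare) sets.

    alert_names: one entry per alert that fired (hypothetical input format).
    routine_per_day: assumed cut-off; alerts firing at least this often per
    day on average are treated as routine noise.
    """
    counts = Counter(alert_names)
    routine, unusual = set(), set()
    for name, count in counts.items():
        if count / days_observed >= routine_per_day:
            routine.add(name)
        else:
            unusual.add(name)
    return routine, unusual


# Made-up history: one alert fires daily for a week, another fires once.
history = ["tcp_resets_received"] * 7 + ["disk_full"]
routine, unusual = split_by_frequency(history, days_observed=7)
```

The routine set could be muted or grouped by default, while anything in the unusual set gets bubbled up to the user for an up or down vote.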
I helped work on this feature so would love to chat with anyone about it and any other places we could use some ML and stats driven stuff to make useful features.
This is exactly the sort of little case study or example I was hoping someone would reach out with, as there is only so much ‘real world’ type testing we can do on our own data as we build features like this.
The point about frequent alarms being a sort of trigger for you to play with it is interesting too. We have been thinking that maybe we could use some notion of a “stormy period” of lots of regular alarms as a signal to generate a window automatically. We could then run the correlations automatically, or at least surface a pre-baked URL for users to hit if they want to manually kick off a suggested metric correlations run.
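As a rough illustration of the “stormy period” idea, here is a minimal Python sketch. It is only one possible reading of the idea, not how the feature works: the function name, the `min_alarms` and `span` parameters and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def find_stormy_windows(alarm_times, min_alarms=3, span=timedelta(minutes=10)):
    """Return (start, end) windows where at least min_alarms fired within span."""
    times = sorted(alarm_times)
    windows = []
    left = 0
    for right in range(len(times)):
        # Slide the left edge forward until the window fits inside span.
        while times[right] - times[left] > span:
            left += 1
        if right - left + 1 >= min_alarms:
            start, end = times[left], times[right]
            if windows and start <= windows[-1][1]:
                # Overlaps the previous stormy window, so extend it.
                windows[-1] = (windows[-1][0], end)
            else:
                windows.append((start, end))
    return windows


# Made-up alarm timestamps: a burst around 07:20 and a lone alarm at noon.
day = datetime(2021, 1, 1)
alarms = [
    day.replace(hour=7, minute=20),
    day.replace(hour=7, minute=22),
    day.replace(hour=7, minute=23),
    day.replace(hour=12, minute=0),
]
stormy = find_stormy_windows(alarms)
```

Each returned window could then be used as the highlighted time range for a suggested correlations run.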
This is Dimitris from the product team. It is great to hear from you that our “metric correlations” feature is very helpful. I was wondering if you would be willing to spend 15 minutes with our team (product/ML) to walk us through some specific use cases? Thank you very much!