So I got a chance to play with the brand new metrics correlation feature on Netdata Cloud, and it’s been very helpful!
I know I’m not alone in getting fairly regular alerts for “ipv4 tcp resets received”. In fact, I seem to get them once a day, or whenever I restart the Netdata agent. So today, when I got another one at 07:23 AM, I decided to run metrics correlation on that spike of TCP resets.
The results came back quickly and were illuminating.
It turns out that I see a spike of CPU usage right before this flurry of TCP resets:
There are other charts with correlated data, but what was really interesting was how many of the charts had a gap of missing data at the exact time of the TCP resets (note that the CPU spike above precedes the TCP resets)…
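If you wanted to look for gaps like these outside the dashboard, here is a minimal sketch. It assumes you have already exported a chart's sample timestamps (Unix seconds) into a plain list; the function name `find_gaps`, the one-second collection interval, and the sample values are made up for illustration and are not part of Netdata.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[int], expected_interval: int = 1) -> List[Tuple[int, int]]:
    """Return (last_seen, next_seen) pairs that are further apart than expected.

    expected_interval is the assumed collection interval in seconds.
    """
    gaps = []
    ordered = sorted(timestamps)
    # Compare each sample with the one that follows it.
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > expected_interval:
            gaps.append((prev, curr))
    return gaps


# Hypothetical per-second samples with a hole between 102 and 110.
samples = [100, 101, 102, 110, 111]
print(find_gaps(samples))  # [(102, 110)]
```

A gap that lines up across many unrelated charts suggests the collector itself stopped for a moment, rather than one metric misbehaving.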
When I scrolled down to the Applications areas I could see what was eating up the CPU time:
We can also see that one particular application’s memory usage drops sharply (from a steady 101 MB down to about 60 MB):
When I checked the Netdata agent logs, I could indeed see that the agent had shut itself down and started back up. I’m presuming that the CPU usage by BUILD was the daily agent upgrade job.
I had started to simply live with the regular TCP reset alerts, wondering whether something on my network was buggy, but I was still rather worried about what was causing them (some sort of network attack?). In five minutes, I managed to figure out what caused the spikes!
Well done, Netdata team: this is such a good feature!
Now to log a ticket for the TCP resets…