We have this PR to add a “ChangeFinder” python collector that does online changepoint detection using the changefinder python package.
(Note: there is a lot more info on what this is and how it works in the README for the collector)
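For anyone curious what the underlying scoring looks like, here is a minimal sketch of the changefinder package scoring a stream of values online (the parameter values below are just illustrative, not necessarily the collector's defaults):

```python
# Minimal sketch: score a stream of values online with the changefinder package.
# The r/order/smooth values are illustrative, not necessarily the collector's defaults.
import changefinder

cf = changefinder.ChangeFinder(r=0.01, order=1, smooth=15)

for x in [10, 11, 9, 10, 250, 260, 255, 253, 10, 11]:
    score = cf.update(x)  # higher score => more likely a changepoint
    print(f"value={x:>3} change_score={score:.2f}")
```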
I just wanted to make a thread for collecting feedback on it. I will add feedback in here as we dogfood it ourselves too.
Here are some initial findings from running it on a small dev node (an n1-standard-2: 2 vCPUs, 7.5 GB memory) overnight, to see what sorts of things it found and to benchmark its performance overhead and characteristics.
This is not really a measure of how useful it might be, since I am just running it on a VM that's doing nothing, but I wanted to share some of the charts that others might find useful if they want to play with this in future.
I’m just going to add charts and comments as I look at it.
So if I look at my node for the last 12 hours, I see two ‘spikes’ of changes:
Let’s zoom in on the first one:
Here I can see about 6 charts got flagged as having changes for about 30 seconds or so:
Let’s pick pgpgio and look at that as an example:
Yep, that does indeed look like a change in “out” alright:
I see changes in interrupts, ctxt and intr:
Let’s look at those charts:
Indeed, I see changes:
OK, so it looks like it is picking up changes; in this example it’s just large spikes that are going on for some reason.
So let’s look at the overhead/cost on the system:
Let’s first look at runtime:
Note: I set it to use 4 hours of scores (`n_score_samples: 14400`) for working out the percentiles, as opposed to the 3600 I was originally using while developing this. So it needs to keep the previous 14400 scores in memory to do the percentile calculation at each step, and I was curious what impact this would have on both the runtime and the RAM the collector needs.
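To make that concrete, here is a rough sketch (my own illustration, not the collector's exact code) of what that percentile step involves: keep the last `n_score_samples` raw scores and express each new score as a percentile rank within that rolling window.

```python
# Rolling-window percentile rank sketch (illustrative, not the collector's code).
from collections import deque

n_score_samples = 14400  # 4 hours of per-second raw scores
recent_scores = deque(maxlen=n_score_samples)

def score_to_percentile(raw_score):
    recent_scores.append(raw_score)
    # fraction of the window at or below the latest score, as a percentage
    return 100.0 * sum(s <= raw_score for s in recent_scores) / len(recent_scores)
```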
Looking at the runtime, I see it peaked around 80ms and then settled down to ~70ms, so it seems totally fine for this to have `update_every: 1` for the default configuration I am using (which works on all system.* charts).
Now looking at CPU overhead, I see about 5.5% of one core, so maybe 2.75% of this 2 vCPU node, going towards the python collector. That’s OK, I can live with that.
And then for RAM, I see about 90MB being used by the python.d.plugin, which here is mostly just the changefinder collector, as that’s all I have enabled.
So that’s ~2.7% CPU, ~70ms runtime, and ~90MB of RAM to run this on my node with the default config.
That’s not bad at all, and it’s definitely something that could work very well on a parent node if it was too much overhead to run on each node itself. It all depends, really, so giving as many flexible options as possible is important for things like this.
My next plans are to get the PR merged and then start dogfooding it on our Netdata parent node, side by side with the anomalies collector we are already running for each of our production Kubernetes nodes, to get a feel for its performance characteristics and how it behaves on some real-world data.
I will report back here once I have a clearer picture on all of that.
In the meantime, feel free to play around with enabling this collector; I’d love to hear any feedback!
My feeling is that it might not be as powerful for anomaly detection as the anomalies collector (which explicitly builds models that can capture much more complex patterns and changes), but it will be cheaper to run, so having it active on more charts or dimensions might still be useful in some scenarios. In particular, seeing lots of ‘changes’ at the same time, at a much higher rate than is typical for your node, can be a useful signal that something unusual enough to merit a bit of investigation may have occurred.
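To make that last idea concrete, here is a purely hypothetical sketch of flagging a burst of simultaneous per-chart changes; the chart names and threshold are made up for illustration:

```python
# Hypothetical: treat many charts flagging a change at the same time as the signal.
def unusual_burst(chart_flags, threshold=5):
    """Return True when more charts than usual flag a change at the same moment."""
    return sum(chart_flags.values()) >= threshold

flags = {"system.cpu": 1, "system.pgpgio": 1, "system.intr": 0}  # 1 = change flagged
print(unusual_burst(flags))  # False here, only two charts are flagged
```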