Just an update here as we dogfood this collector a bit internally on our Netdata Cloud prod data.
We have about 15-20 Kubernetes nodes from various services and clusters streaming to a Netdata parent node.
This Netdata parent node is just an n2-standard-4 GCP instance and seems well able to handle about 20 separate anomalies jobs - one for each node, plus some on aggregations of the nodes into meaningful groupings using the aggregator collector I am hoping to get merged.
Generally we are just using the default config like:
```yaml
{h}:
  name: '{h}'
  update_every: 3
  priority: 90
  host: '127.0.0.1:19999/host/{h}'
  protocol: 'http'
  charts_regex: 'system\..*'
  charts_to_exclude: 'system.uptime,system.entropy'
  model: 'pca'
  train_max_n: 100000
  train_every_n: 1800
  train_n_secs: 14400
  train_no_prediction_n: 10
  offset_n_secs: 120
  lags_n: 5
  smooth_n: 3
  diffs_n: 1
  contamination: 0.001
  include_average_prob: true
  reinitialize_at_every_step: true
```
Where `{h}` is the name of each host streaming to our parent node.
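Since the stanza is identical for every host apart from the name, you could just template it per host - below is a purely illustrative Python sketch (the host names and output path are made up, adapt to however you manage your python.d config):

```python
# Illustrative only - host names and output path are placeholders.
hosts = ["node-1", "node-2", "node-3"]  # hosts streaming to the parent

stanza = """\
{h}:
  name: '{h}'
  update_every: 3
  priority: 90
  host: '127.0.0.1:19999/host/{h}'
  protocol: 'http'
  charts_regex: 'system\\..*'
  charts_to_exclude: 'system.uptime,system.entropy'
  model: 'pca'
  train_max_n: 100000
  train_every_n: 1800
  train_n_secs: 14400
  train_no_prediction_n: 10
  offset_n_secs: 120
  lags_n: 5
  smooth_n: 3
  diffs_n: 1
  contamination: 0.001
  include_average_prob: true
  reinitialize_at_every_step: true

"""

# write one stanza per host into a local anomalies.conf
with open("anomalies.conf", "w") as f:
    for h in hosts:
        f.write(stanza.format(h=h))
```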
I have set update_every to 3 seconds just to be kind to my Netdata parent. Looking in the python.d charts under the Netdata Monitoring section, I see typical runtimes of around 2.3 seconds for each anomalies collector job. So I’d say an update_every of 3 might be as low as I can go given the number of anomalies collector jobs I have running, which is totally fine for what we need - I could probably even increase it to 5 or 10 and it would still be useful for us.
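As a rough back-of-the-envelope check on that (just a sketch using the numbers above, nothing scientific):

```python
# Rough headroom check: each anomalies job should finish well within its update_every window.
typical_runtime_s = 2.3   # observed runtime per anomalies collector job
update_every_s = 3        # configured collection interval

headroom_s = update_every_s - typical_runtime_s
utilisation = typical_runtime_s / update_every_s
print(f"headroom per cycle: {headroom_s:.1f}s, utilisation: {utilisation:.0%}")
# prints: headroom per cycle: 0.7s, utilisation: 77%
```

With an update_every of 5 or 10 that utilisation drops to roughly 46% or 23%, which is why bumping it up is an easy way to buy headroom.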
During the training step I was seeing it take anywhere from 10 to 40 seconds, so sometimes just after a training step there would be a gap in the anomalies charts, but that’s fine too.
Here are some charts from the Netdata parent that is doing all the work.
You can see it’s running at about 30% CPU during the prediction steps and then spiking to about 50% for a short period during a retraining step.
In terms of memory usage, I see about 650 MB going towards the python.d plugin, which is probably mostly related to the ~20 anomalies collector jobs.
So my main learning so far is that this might actually be useful for us in monitoring our production infra of a couple of Kubernetes clusters and dozens of nodes, which is great and should help us learn a lot about where it is and is not useful.
I am planning to watch it closely over the next couple of weeks, and if we find any useful anomalies from it you can be sure there will be a blog post lol.
Anyway - just wanted to share an update on here for anyone interested. I will probably post more updates as we use it further and contribute any improvements via PRs.
I’m happy to help anyone else interested in using the anomalies collector, so feel free to just ask any questions in here.