Netdata Community

Anomalies collector feedback megathread!

I just wanted to create a thread in here to capture feedback, in one place, from users of the anomalies collector:

I would love to hear all positive and negative feedback so that we can incorporate it into future versions of the collector, or even into new ones we might try to do in Go, or slightly different ones looking at more specific things like changepoint detection or online/streaming anomaly detection.

If you would rather, feel free to also just drop a mail to analytics-ml-team@netdata.cloud. Although ideally i was hoping to get some discussion going in here for all the netdata community who might be interested in these sorts of features.

I was wondering last week if it’s possible for our team to create some preconfigured custom models for popular applications like MySQL, PostgreSQL, OpenVPN, etc? This way users wouldn’t have to set up the charts_regex themselves, but instead could just say “yes” to a custom model that helps them monitor the most critical types of anomalies for that application?

If we had some smart default custom models, we could then mention them as a novel, zero-configuration way to monitor X or Y application.

I think that’s a great idea. I guess we could just add them as commented out sections in the conf. Or maybe a better way would be a sort of library of conf’s for each use case - just not sure where such stuff would live - a library of reference conf’s for each collector? Maybe some sort of sub section in the docs perhaps.

Hi guys
So I started playing with anomaly detection yesterday. I firstly added an SNMP collector to collect data from internal and internet facing interfaces on my mikrotik router.

I then created the config for anomaly detection for the SNMP collector. So far so good and will test over the weekend to see how it goes and provide feedback:

I have some questions though:

  1. It looks like the default training schedule is every 30 minutes using the previous 4 hours of data. Does it also use the current model in its retraining? Ideally you would not want to have only 4 hours' worth of data to use for ML.
  2. What happens on restart of the agent? Is the model saved or does everything start from scratch?

Hi @Morne_Supra thanks for trying it out!

  1. It will just drop and retrain a new model based on the last 4 hours every 30 minutes. So if your system has very strong time patterns each day then this could be a limiting factor in how well the model could ever do (since it only knows about the last 4 hours). For some systems you might actually want this - you want the model to evolve and adapt as the system changes and new workloads come onto it etc. But for others, maybe web servers perhaps, you want to try to take more advantage of the strong ‘time of day’ type effect. So it’s hard to really know up front what the best middle ground here is and how to make it easy for users to tweak for either setting/scenario. One thing you could do is increase train_n_secs to be 24 hours and then i’d recommend using train_max_n to cap the size of the training data. So instead the data would be samples from the last 24 hours. This could help avoid the training step being too resource expensive on the agent. But again this can depend on the agent itself and the number of charts/models you have it set up for, so there can be a bit of trial and error and exploration involved here.

  2. At restart of the agent it will all start from scratch.
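To make the train_n_secs / train_max_n suggestion above a bit more concrete, here is a minimal sketch of what "cap the training data by sampling" could look like. It is not the collector's actual code: `cap_training_rows` is a hypothetical helper, the rows are stand-in integers, and the value 20000 is an arbitrary cap picked for illustration.

```python
import random

def cap_training_rows(rows, train_max_n, seed=None):
    """Return at most train_max_n rows, sampled uniformly without replacement.

    Sampled rows are returned in their original (time) order.
    """
    if train_max_n is None or len(rows) <= train_max_n:
        return list(rows)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(rows)), train_max_n))
    return [rows[i] for i in keep]

# 24 hours of per-second data is 86,400 rows per chart.
train_n_secs = 24 * 60 * 60
rows = list(range(train_n_secs))  # stand-in for real metric rows
sample = cap_training_rows(rows, train_max_n=20000, seed=0)
print(len(rows), "->", len(sample))
```

The training cost then scales with the cap rather than with the length of the window, which is the point of pairing a longer train_n_secs with train_max_n.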

It could be cool to have a collector that just incrementally trains models over time in small batches - so it could be that every hour it somehow incrementally trains the model and maybe even pushes the new model as a new version so you could even have some model management type functionality. Some deep learning models and approaches are sometimes easier to do in this regard when you are just updating the latest weights for the model in some way. Or also some traditional ML models where you could ensemble more recent models with the ones you have already trained on a longer history of data.
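As a rough illustration of the incremental idea, here is a sketch that updates a running mean and variance in small batches (Welford's algorithm) and scores new values with a z-score. This is a deliberately simple stand-in, not the PCA model the collector uses, and `RunningStats` is a hypothetical class name.

```python
import math

class RunningStats:
    """Running mean/variance (Welford's algorithm), updated in small batches."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, batch):
        for x in batch:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def zscore(self, x):
        s = self.std()
        return 0.0 if s == 0 else (x - self.mean) / s

stats = RunningStats()
stats.update([10, 12, 11, 9, 10])  # first batch
stats.update([11, 10, 12, 9, 11])  # next batch; what was learned earlier is kept
print(stats.mean, abs(stats.zscore(30)) > 3)
```

Nothing is dropped between batches here, so the state could in principle also be versioned or written to disk after each update.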

The main idea to begin with was to try keep it as simple as we can get away with and see if the community finds any use in it.

Am super curious to hear how you get on with default settings after a couple of days of it running.

Hi @andrewm4894 , thanks for the response.

This then does not look like the right tool for me. As an example, I am currently only testing against 2 interfaces on my private home router, just to get the hang of things, but if it works, I would like to use it on more complicated stuff. Below is the setup I would like to manage/check for anomalies:

Traffic usage will be seasonal:

  1. Traffic on a Sunday might not be the same as traffic on a Saturday, compared again to traffic on a workday.
  2. Traffic during the day should be different to traffic after hours.
  3. After hours can then again be split into while I am awake compared to while I am sleeping.
  4. Traffic in week one of a month might be different to traffic for example week 2 or even the full month.
  5. Traffic in December, might be different to traffic during the rest of the year.

I think a lot of users will be interested in seasonal ML, so being able to train on data over days, months and years, will help. I am not sure how you do your training, but I would guess that it should be possible to have your first training cycle on the last x amount of hours. Then when you retrain, you train on the last 30 minutes, as per the default setting plus what is in your current model.

Then on a restart of the agent, it would be great if it could retrain on the time since last seen plus what is in the model. As an example, as I am investigating, I have to restart quite a few times to ensure that changes I make to either the input layer (SNMP) or the ML config take effect. Also, let’s say you have configured it to train on 24 hours, as in your post, and then you maybe have a server restart, you lose the previous 24 hours’ worth of training.

I see what you mean. I think it might also depend on how smooth or abrupt the transition between these periods is.

For example let’s say it’s Sunday night at 00:00. At this time the model will, by default, have been trained on data from somewhere between 19:30 to 23:30 and 20:00 to 00:00 (depending on when in the last 30 mins it did a train step, which is determined by the last time netdata was restarted, as the collector uses its own run counter for train_every_n as opposed to wall clock time). So if traffic suddenly drops from 00:00 to 01:00, say, then you might very well get a lot of false positives as soon as this big drop happens.
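A small sketch of why the restart time matters for the schedule. It assumes the run counter advances once per collection run and that a train step fires whenever the counter hits a multiple of train_every_n; `train_times` is a hypothetical helper, the dates are arbitrary, and any initial training right after startup is ignored.

```python
from datetime import datetime, timedelta

def train_times(started_at, until, train_every_n=1800, update_every=1):
    """List wall-clock times at which a run-counter-based schedule would retrain.

    Assumes one counter tick per collection run (every update_every seconds).
    """
    step = timedelta(seconds=train_every_n * update_every)
    times = []
    t = started_at + step
    while t <= until:
        times.append(t)
        t += step
    return times

midnight = datetime(2021, 3, 8, 0, 0)
for restart in (datetime(2021, 3, 7, 20, 0), datetime(2021, 3, 7, 20, 10)):
    last = train_times(restart, midnight)[-1]
    print("restarted", restart.time(), "-> last train step by midnight:", last.time())
```

Two agents restarted ten minutes apart end up with train steps that are permanently offset from each other, which is why the training window at any given wall-clock time is only known to within 30 minutes.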

However if the transition from ‘evening’ to ‘overnight’ traffic is a bit more gradual then you might not get any false positives as the model follows this more gradual evolution in daily traffic. But if some time in the night you get some traffic that looks completely different to everything else in the 4 hour window then it would see that as anomalous.

So i think the abruptness of the different traffic patterns over time will also matter a lot and again will vary system by system.

I think your use case might be better suited to a more typical approach people sometimes take, which is to build as good a forecast model as you can for each metric that takes in things like time of day, day of week, a calendar list of bank holidays and special events etc, in addition to the raw metrics data itself, and then trains a model to predict the next value or values in a window looking forward. Then if the error of the predicted metric vs the actual observed value crosses some threshold you trigger an anomaly - the expected value was very different from what we observed so it’s an anomaly.
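A toy version of that "forecast, then threshold the error" idea, using only a per-(weekday, hour) average as the forecast. Real forecast models would be far richer; the function names, the synthetic traffic, and the `n_sigmas` / `min_std` thresholds are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev

def fit_seasonal_baseline(history):
    """history: iterable of (datetime, value). Returns {(weekday, hour): (mean, std)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {k: (mean(v), pstdev(v)) for k, v in buckets.items()}

def is_anomaly(baseline, ts, value, n_sigmas=3.0, min_std=1.0):
    """Flag value if it is far from the expected value for that weekday and hour."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # nothing learned for this slot yet
    expected, std = baseline[key]
    return abs(value - expected) > n_sigmas * max(std, min_std)

# Synthetic traffic: busy 08:00-17:59, quiet otherwise, for four weeks.
start = datetime(2021, 3, 1)
history = []
for h in range(24 * 28):
    ts = start + timedelta(hours=h)
    history.append((ts, 100.0 if 8 <= ts.hour < 18 else 10.0))

baseline = fit_seasonal_baseline(history)
night = is_anomaly(baseline, datetime(2021, 3, 29, 3), 100.0)   # busy-level traffic at 03:00
noon = is_anomaly(baseline, datetime(2021, 3, 29, 12), 100.0)   # the same traffic at noon
print(night, noon)
```

The same value is flagged at 03:00 and accepted at noon, which is the seasonal behaviour a 4 hour rolling window cannot express on its own.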

I’m pretty sure this is kind of similar to what datadog or new relic might do for example and tools/platforms like anodot (which is a great company in this space btw) for anomaly detection as they have the luxury (and cost) of already having the data in the cloud somewhere off the agent where they can crunch it a bit more.

The tricky thing with the netdata agent is that this would just be too expensive to run on the agent without blowing up the thing we are trying to monitor. If we had all the raw metric data in say netdata cloud then we could totally follow a similar approach where we could just train the models in batch more at regular intervals on our own cloud infrastructure, push those trained models then down to the agent in some industry standard format and use something like onnx runtime to get the predictions and anomaly scores at each time step.

But being able to do distributed anomaly detection without having to centralize all the data in some cloud somewhere, and still be as light as possible as we can be on the agent itself (or parent node if you are streaming to a parent) is one of the design choices (or constraints :slight_smile: ) we are working under so we have to do it a little differently.

It would be cool if we could have the best of both worlds - and i think there could be some way to do the incremental training like you mention. Is for sure an idea worth exploring as we build out more ML features in both the agent and netdata cloud itself.

The point on having to restart multiple times as you investigate and impact that would have on all this is totally new to me and not something i had thought about - so great feedback for us. There is an offset_n_secs which you could set to tell it to ignore most recent n secs when training the model which could help a little here.
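To show how offset_n_secs shifts the training data, here is a small sketch of the window arithmetic using the default values from the config. `training_window` is a hypothetical helper, not a function from the collector, and 1800 is just an example of a larger offset.

```python
def training_window(now, train_n_secs=14400, offset_n_secs=120):
    """Return (after, before) unix timestamps bounding the training data."""
    before = now - offset_n_secs   # skip the most recent offset_n_secs
    after = before - train_n_secs  # then look back train_n_secs from there
    return after, before

now = 1_000_000
default_window = training_window(now)
wider_offset = training_window(now, offset_n_secs=1800)
print(default_window, wider_offset)
```

A larger offset keeps the most recent (possibly disturbed) data out of training, at the cost of the model lagging further behind the present.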

But being able to have the anomaly scores be robust in some way to the troubleshooting itself, by maintaining models on disk or finding some better way to persist things, is really great feedback and defo something we will think about more.
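One possible shape for that kind of persistence, sketched with the standard library only. This is not something the collector does today; the file name, the `max_age_secs` rule, and the dict used as a stand-in model are all assumptions.

```python
import os
import pickle
import tempfile
import time

def save_model(model, path):
    """Write the model plus a timestamp, replacing any previous file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"saved_at": time.time(), "model": model}, f)
    os.replace(tmp, path)

def load_model(path, max_age_secs=14400):
    """Return the saved model, or None if it is missing or older than max_age_secs."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if time.time() - payload["saved_at"] > max_age_secs:
        return None
    return payload["model"]

path = os.path.join(tempfile.gettempdir(), "anomalies_model_example.pkl")
save_model({"mean": 10.5, "std": 1.08}, path)  # stand-in for a fitted model object
restored = load_model(path)
print(restored)
```

On startup the collector could then try `load_model` first and only fall back to training from scratch when it returns None.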

Thanks @andrewm4894 . It is nice to speak to guys that are willing to listen and to suggest options.

Just to let you know, that if I could do things better than you guys, we would not have had this conversation, so well done in what you have done. I am always looking for a solution that limits the amount of effort I have to put into it.

I know what I want, but it is not easy to get it off the shelf. :slight_smile: and :frowning:

Just an update here as we dogfood this collector a bit internally on our netdata cloud prod data.

We have about 15-20 kubernetes nodes from various different services and clusters streaming to a netdata parent node.

This netdata parent node is just an n2-standard-4 GCP instance and seems to be well able to handle about 20 separate anomalies jobs - one for each node and then some on aggregations of the nodes into meaningful groupings using the aggregator collector i am hoping to get merged.

Generally we are just using the default config like:

{h}:
    name: '{h}'
    update_every: 3
    priority: 90
    host: '127.0.0.1:19999/host/{h}'
    protocol: 'http'
    charts_regex: 'system\..*'
    charts_to_exclude: 'system.uptime,system.entropy'
    model: 'pca'
    train_max_n: 100000
    train_every_n: 1800
    train_n_secs: 14400
    train_no_prediction_n: 10
    offset_n_secs: 120
    lags_n: 5
    smooth_n: 3
    diffs_n: 1
    contamination: 0.001
    include_average_prob: true
    reinitialize_at_every_step: true

Where h is for each host streaming to our parent node.
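If you have a lot of hosts, one way to avoid hand-writing a job per host is to render the jobs from a template. The sketch below is only an illustration: the host names are made up, the template carries just a subset of the keys shown above, and where you write the output is up to you.

```python
JOB_TEMPLATE = """{h}:
    name: '{h}'
    update_every: 3
    host: '127.0.0.1:19999/host/{h}'
    protocol: 'http'
    charts_regex: 'system\\..*'
    model: 'pca'
"""

def render_jobs(hosts):
    """Return one config job per host, separated by blank lines."""
    return "\n".join(JOB_TEMPLATE.format(h=h) for h in hosts)

conf = render_jobs(["node-a", "node-b"])  # hypothetical host names
print(conf)
```

The printed text can be pasted into the collector's conf file alongside any other keys you want to override from the defaults.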

I have set it to update_every 3 seconds just to be kind to my netdata parent. Looking in the python.d section of the Netdata Monitoring section i do see typical runtimes of each anomalies collector job of around 2.3 seconds or so. So i’d say update every of 3 might be as low as i can go given the amount of anomalies collector jobs i have running, which is totally fine for what we need - i could probably even increase it to 5 or 10 and still be useful for us.

During the training step i was seeing it take anywhere from 10-40 seconds and so sometimes just after a train step there would be a gap in the anomalies charts but that’s fine too.

Here are some charts from the netdata parent that is doing all the work.

You can see it’s running at about 30% CPU during the prediction steps and then spiking to about 50% for a little period during a retraining step.

In terms of memory usage then i see about 650 MB going towards the python plugin which is probably mostly related to the ~20 anomalies collector jobs.

So my main learnings so far are that this might actually be useful for us in monitoring our production infra of a couple of kubernetes clusters and dozens of nodes which is great and should help us learn a lot about where it is and is not useful.

I am planning to watch it closely over next couple of weeks and if we find any useful anomalies from it you can be sure there will be a blog post lol.

Anyway - just wanted to share an update on here for anyone interested. I will probably add some updates as we use it more and add any improvements via PR’s.

I’m happy to help anyone else interested in using the anomalies collector so feel free to just ask any questions in here.

Good news! I managed to get this working on my raspberry pi 4 and it didn’t even blow it up! It’s actually looking like not that much overhead with the default config.

Here is what i needed to do.

(p.s. i’m totally new to Raspberry Pi and so there might be better/easier ways to get this working - if so, please chime in)


# Step 1. update default to be python 3 (not sure if this is or is not needed)
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 10

# Step 2. first tell netdata to use python 3
cd /etc/netdata/
sudo ./edit-config netdata.conf

# scroll down to python plugin conf section and pass -ppython3 command option
[plugin:python.d]
        # update every = 1
        command options = -ppython3

# now install some of the python libraries that the anomalies collector needs but that might not be there by default

# Step 3. install llvm
sudo apt install llvm-9

# Step 4. install some stuff for numpy (you might be able to skip this step, i'm not sure)
sudo apt-get install libatlas3-base libgfortran5 libatlas-base-dev

# Step 5. install python libraries needed

# become netdata user
sudo su -s /bin/bash netdata

# install python libs and tell it where to find llvm as you pip3 install what is needed
LLVM_CONFIG=llvm-config-9 pip3 install --user llvmlite numpy==1.20.1 netdata-pandas==0.0.32 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3

# Step 6. turn on the anomalies collector
cd /etc/netdata/
sudo ./edit-config python.d.conf
# set `anomalies: no` to `anomalies: yes`

# Step 7. restart netdata
sudo systemctl restart netdata

# wait a couple of minutes and refresh the netdata dashboard and you should see the "Anomalies local" menu just above the "Netdata Monitoring" section. 

# hopefully you now have some sexy ml driven anomaly detection on your edge!
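If the "Anomalies local" menu does not show up, a quick sanity check is to confirm the libraries from step 5 can actually be imported by the same python3 the netdata user runs. The script below is just a convenience sketch (`check_modules` is a hypothetical helper); run it as the netdata user.

```python
import importlib

def check_modules(names):
    """Map each module name to its version string, 'unknown', or None if import fails."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found

# Import names for the packages installed in step 5.
REQUIRED = ["numpy", "numba", "sklearn", "pyod", "netdata_pandas"]
for name, version in check_modules(REQUIRED).items():
    print(f"{name:16} {'MISSING' if version is None else version}")
```

Anything reported as MISSING points back at step 5 (or at the LLVM setup in step 3 if it is numba that fails to import).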

In terms of overhead on the Pi itself i see something like this…

Some initial CPU bumps to around 40-50% as the models were trained and then ongoing CPU usage of about 6%:

Runtimes for the collector of about 245ms:

About 90 MB of RAM going to the models and data the collector needs to keep in memory:

So not bad at all really - i was expecting it to be a bit more expensive on the Pi itself.

My plan is to have this Pi look out my back garden and do a sort of object detection and then image classification pipeline to grab snaps of the birds in my back garden, classify them and then save to cloud storage or something. All just as a fun side project as i had never done anything on Pi. So will be interesting to see if the anomalies collector can hold up when i’m doing continual image processing and vision model inference on it.

I’ll report back once i do.

Signing off, pleasantly surprised that i was able to get this done so painlessly! (hope i have not just jinxed it)
