Anomaly Advisor - beta launch!!!!

andrewm4894 · March 30, 2022, 2:44pm

We are very excited to beta launch our new “Anomaly Advisor” feature for early adopters in the Netdata community. The Anomaly Advisor builds on the recent ML capabilities we have added to the Netdata Agent in order to easily surface potentially anomalous charts and metrics.

What is the “Anomaly Advisor”?

The Anomaly Advisor gives Netdata Cloud a new “Anomalies” tab where you can quickly scan for periods of time with elevated numbers of anomalous metrics and highlight time periods of interest to surface a sorted list of the most anomalous metrics during the highlighted window.

Here is a quick sneak peek video of the feature and here is a slightly more extended one where we run a little chaos engineering attack on some nodes and see how it plays out in the Anomaly Advisor.

Getting Started

To enable the Anomaly Advisor you must first enable ML on your nodes via a small config change in netdata.conf. Once the anomaly detection models have trained on the agent (with default settings this takes a couple of hours until enough data has been seen to train the models) you will then be able to enable the Anomaly Advisor feature in Netdata Cloud.

1. Enable ML on Netdata Agent

To enable ML on your Netdata Agent you just need to edit the [ml] section in your netdata.conf to look something like the example below.

Once done, restart Netdata with a command like sudo systemctl restart netdata for the config changes to take effect. You can find more info on restarting Netdata here.

At a minimum you just need to set enabled = yes to enable ML with default params. More details can be found in the Netdata Agent ML docs.

[ml]
    enabled = yes
    # maximum num samples to train = 14400
    # minimum num samples to train = 3600
    # train every = 3600
    # num samples to diff = 1
    # num samples to smooth = 3
    # num samples to lag = 5
    # maximum number of k-means iterations = 1000
    # dimension anomaly score threshold = 0.99
    # host anomaly rate threshold = 0.01000
    # minimum window size = 30.00000
    # maximum window size = 600.00000
    # idle window size = 30.00000
    # window minimum anomaly rate = 0.25000
    # anomaly event min dimension rate threshold = 0.05000
    # hosts to skip from training = !*
    # charts to skip from training = !* netdata.*

Note: follow this guide if you are unfamiliar with making configuration changes in Netdata.
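
If you want to double-check the edit before restarting, a small script can confirm which [ml] settings are actually active. This is only a minimal sketch: read_ml_section is a hypothetical helper written for this post, not part of Netdata, and it assumes the plain "key = value" layout shown above, where lines starting with # are commented-out defaults.

```python
def read_ml_section(conf_text: str) -> dict:
    """Return the active (uncommented) key/value pairs of the [ml] section."""
    settings = {}
    in_ml = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out defaults are ignored
        if line.startswith("[") and line.endswith("]"):
            in_ml = line[1:-1].strip() == "ml"  # track which section we are in
            continue
        if in_ml and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Illustrative config text; in practice you would read your own netdata.conf.
sample = """
[global]
    update every = 1
[ml]
    enabled = yes
    # train every = 3600
"""

ml = read_ml_section(sample)
print("ML enabled:", ml.get("enabled") == "yes")
```

Only enabled shows up in the result here, because every other line in the section is still commented out and so falls back to its default.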

2. Enable Anomaly Advisor in Netdata Cloud

To enable the Anomaly Advisor feature in Netdata Cloud itself you just need to set an anomaly_advisor feature flag to true in your browser.

Here is a short video showing how to do this.

While on Netdata Cloud, press F12 in your browser to open the developer tools. Open the “Application” tab and, under the “Local Storage” section for https://app.netdata.cloud, add a new key & value pair of anomaly_advisor & true. Once you refresh the page you should see the new “Anomalies” tab.

Notes

You can see a detailed list of notes relating to the anomaly detection capabilities of the Netdata Agent here.
If you would like to learn in more detail how the Netdata Agent anomaly detection works please check out the Netdata Agent ML docs.
The default configuration requires at least 3600 seconds (1 hour) of data and will (re)train every 3600 seconds. So after you enable ML on your node, it should take around 2 hours for the first set of models to be trained and anomaly rates to become available for use by the Anomaly Advisor in Netdata Cloud.
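
The "around 2 hours" figure is just the sum of the two settings, and the same arithmetic gives a rough wait time for other values. The sketch below is a back-of-the-envelope estimate, not how the agent schedules training: first_models_eta_seconds is a hypothetical helper, and it assumes one sample per second (update_every = 1) plus, at worst, one full "train every" interval before the first models are ready.

```python
def first_models_eta_seconds(min_samples: int = 3600,
                             train_every: int = 3600,
                             update_every: int = 1) -> int:
    """Rough upper estimate of the wait before anomaly rates are available."""
    # Time needed to collect the minimum number of samples.
    collect_seconds = min_samples * update_every
    # Assumption: allow up to one more training interval on top of that.
    return collect_seconds + train_every


print(first_models_eta_seconds() / 3600)         # defaults: 2.0 (hours)
print(first_models_eta_seconds(900, 900) / 60)   # both lowered to 900: 30.0 (minutes)
```

Lowering "minimum num samples to train" and "train every" shortens the wait, at the cost of models trained on less data.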

Feedback

We’d love to hear any feedback you have on this thread. This feature is still very much in beta and so may be subject to change. We would love the Netdata community to help us shape this feature more and contribute to its further development in the coming months.

andrewm4894 · April 5, 2022, 1:42pm

For anyone who wants to try this but would rather run ML on a parent node instead of at the edge, the configuration below shows some options.

The example below assumes 3 child nodes streaming to 1 parent node and illustrates the main ways you might configure ML: running it for the children on the parent, running it on the children themselves, or a mix of both.

# parent will run ml for itself and for child 1 and child 2.
# child 0 will run its own ml at the edge and just stream its ml charts to parent.
# child 1 will run its own ml at the edge even though parent will also run ml for it. Running ml in both places is potentially a bit wasteful, but it is possible.
# child 2 will not run ml at the edge; ml for it will run on the parent only.

# parent-ml-ml-stress-0
# run ml on all hosts apart from child-ml-ml-stress-0
[ml]
        enabled = yes
        minimum num samples to train = 900
        train every = 900
        charts to skip from training = !*
        hosts to skip from training = child-ml-ml-stress-0

# child-ml-ml-stress-0
# run ml on child-ml-ml-stress-0 and stream ml charts to parent
[ml]
        enabled = yes
        minimum num samples to train = 900
        train every = 900
        stream anomaly detection charts = yes

# child-ml-ml-stress-1
# run ml on child-ml-ml-stress-1 and stream ml charts to parent
[ml]
        enabled = yes
        minimum num samples to train = 900
        train every = 900
        stream anomaly detection charts = yes

# child-ml-ml-stress-2
# don't run ml on child-ml-ml-stress-2, it will instead run on parent-ml-ml-stress-0
[ml]
        enabled = no
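
To see which hosts the parent ends up training for, it can help to evaluate the "hosts to skip from training" value by hand. The sketch below is an approximation written for this post, not the agent's own code: it assumes the skip options follow Netdata's simple pattern rules (space-separated patterns, * as a wildcard, a leading ! to negate, first match wins), and matches_skip_list is a hypothetical helper name.

```python
import re


def _to_regex(pattern: str) -> "re.Pattern":
    # Treat '*' as "any run of characters" and everything else literally.
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def matches_skip_list(name: str, skip_list: str) -> bool:
    """Return True if name would be skipped under the assumed pattern rules."""
    for token in skip_list.split():
        negative = token.startswith("!")
        pattern = token[1:] if negative else token
        if _to_regex(pattern).fullmatch(name):
            # First matching pattern decides; a negated match means "do not skip".
            return not negative
    return False  # nothing matched, so the name is not skipped


parent_skip = "child-ml-ml-stress-0"  # value from the parent config above
hosts = ["parent-ml-ml-stress-0", "child-ml-ml-stress-0",
         "child-ml-ml-stress-1", "child-ml-ml-stress-2"]
for host in hosts:
    action = "skipped" if matches_skip_list(host, parent_skip) else "trained on parent"
    print(host, "->", action)
```

Under these assumed rules a leading !* matches every name negatively, so a value such as "!*" skips nothing and any pattern listed after it is never reached.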

hugo · April 7, 2022, 9:23am

Also a very good hands-on demo done on Cloud Native Live: Power up your machine learning - Automated anomaly detection in case anyone wants to see this in action!

andrewm4894 · April 23, 2022, 10:54am

Here is another walkthrough-style video, in which our CEO uses the Anomaly Advisor to detect a potential bug in his Raspbian OS.

andrewm4894 · May 5, 2022, 10:33am

fyi - some official docs here:

https://learn.netdata.cloud/docs/configure/machine-learning

andrewm4894 · May 18, 2022, 3:36pm

We’re live!!
