Netdata Community

Collector: [anomalies] error on loading source

Environment

Netdata (tested on):
v1.26.0
v1.26.0-312-gf5af54e0

bash-4.4$ cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)

bash-4.4$ python3 --version
Python 3.6.8

bash-4.4$ pip3 list
anyio (1.4.0)
asks (2.3.7)
asn1crypto (0.24.0)
async-generator (1.10)
attrs (20.3.0)
blivet (3.1.0)
Brlapi (0.6.7)
certifi (2020.11.8)
cffi (1.11.5)
chardet (3.0.4)
chrome-gnome-shell (0.0.0)
combo (0.1.1)
configobj (5.0.6)
contextvars (2.4)
coverage (4.5.1)
cryptography (2.3)
cupshelpers (1.0)
cycler (0.10.0)
dbus-python (1.2.4)
decorator (4.2.1)
ethtool (0.14)
gpg (1.10.0)
h11 (0.11.0)
idna (2.5)
immutables (0.14)
iniparse (0.4)
initial-setup (0.3.62.1)
isc (2.0)
joblib (0.17.0)
kiwisolver (1.3.1)
langtable (0.0.38)
llvmlite (0.33.0)
matplotlib (3.3.3)
netdata-pandas (0.0.32)
netifaces (0.10.6)
nftables (0.1)
ntplib (0.3.3)
numba (0.50.1)
numpy (1.19.4)
ordered-set (2.0.2)
outcome (1.1.0)
pandas (1.0.4)
patsy (0.5.1)
pciutils (2.3.6)
perf (0.1)
pid (2.1.1)
Pillow (8.0.1)
pip (9.0.3)
ply (3.9)
productmd (1.11)
pwquality (1.4.0)
pycairo (1.16.3)
pycparser (2.14)
pycups (1.9.72)
pycurl (7.43.0.2)
pydbus (0.6.0)
pygobject (3.28.3)
pyinotify (0.9.6)
pykickstart (3.16.10)
pyod (0.8.3)
pyOpenSSL (18.0.0)
pyparsing (2.1.10)
pyparted (3.11.0)
PySocks (1.6.8)
python-dateutil (2.6.1)
python-dmidecode (3.12.2)
python-linux-procfs (0.6)
python-meh (0.47.2)
pytz (2017.2)
pyudev (0.21.0)
pyxdg (0.25)
PyYAML (3.12)
requests (2.23.0)
requests-file (1.4.3)
requests-ftp (0.3.1)
rhnlib (2.8.6)
rpm (4.14.2)
schedutils (0.6)
scikit-learn (0.23.2)
scipy (1.5.4)
selinux (2.9)
sepolicy (1.1)
setools (4.2.2)
setroubleshoot (1.1)
setuptools (39.2.0)
simpleline (1.1)
six (1.11.0)
slip (0.6.4)
slip.dbus (0.6.4)
sniffio (1.2.0)
sortedcontainers (2.3.0)
sos (3.8)
SSSDConfig (2.2.3)
statsmodels (0.12.1)
subscription-manager (1.26.20)
suod (0.0.4)
syspurpose (1.26.20)
systemd-python (234)
threadpoolctl (2.1.0)
trio (0.16.0)
urllib3 (1.24.2)

Problem/Question

Problem starting anomalies

bash-4.4$ grep 'anomalies' /var/log/netdata/error.log
2020-12-04 14:19:57: python.d WARNING: plugin[main] : [anomalies] error on loading source : SyntaxError('invalid syntax', ('/usr/libexec/netdata/python.d/anomalies.chart.py', 82, 108, " self.charts_available = [c for c in list(requests.get(f'{self.protocol}://{self.host}/api/v1/charts').json().get('charts', {}).keys())]\n")), skipping it

What I expected to happen

Not sure; I wanted to try the anomalies collector, but it seems something is wrong.

@andrewm4894, please, can you take a look at this? I remember that on Slackware I had problems installing some packages using pip3, because the C headers were not initially available.

Can you maybe try:

grep 'python' /var/log/netdata/error.log

which should give a line from the collector saying what version of Python it is using.

Or maybe try:

sudo su -s /bin/bash netdata
python --version

Just in case, for some reason, Netdata is using Python 2 instead of 3.

I’ll try to spin up a similar machine if I can and see if I can recreate it.

P.S. Thanks for trying out the collector! I’m keen to get some feedback from the community on it.

Hmm, strange: I created a CentOS machine, installed Netdata from source, installed the Python packages as per the docs, enabled the collector, and restarted Netdata, and it seems to work for me.

In the log excerpt below, the "training complete" line indicates that it is working.

Then if I go to the dashboard, I see the initial CPU spike for that training run, followed by a smallish window while it waits to build up enough data for the lagged values before it can start making predictions. During that window you will see some error messages like the ones below in the log.

2020-12-07 11:53:14: python.d INFO: anomalies[local] : training complete in 0.24 seconds (runs_counter=1, model=pca, train_n_secs=14400, models=21, n_fit_success=21, n_fit_fails=0, after=1607327594, before=1607341994).
2020-12-07 11:53:45: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:53:51: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:53:57: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:54:03: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:54:12: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:54:21: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:54:30: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0
2020-12-07 11:54:39: python.d ERROR: anomalies[local] : update() unhandled exception: index -1 is out of bounds for axis 0 with size 0

But then you should see some CPU being used for the prediction steps, and the chart itself available once you refresh the page.

I’m not sure what might be going wrong for you. The only hint I can think of is that the Python error message itself looks like it doesn’t like the Python f-strings, which were only introduced in Python 3.6+, IIRC.
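
For reference, f-strings are a parse-time feature: on any interpreter older than 3.6, the parser rejects the line outright, so the whole module fails to load, which matches the "[anomalies] error on loading source : SyntaxError" above. A minimal illustration (the host address here is just a hypothetical local agent, not anything from the logs):

```python
import sys

# f-strings (PEP 498) were introduced in Python 3.6. On older interpreters
# the parser itself rejects the line below, so the module never loads --
# matching the "error on loading source : SyntaxError" in the log above.
assert sys.version_info >= (3, 6), "f-strings require Python 3.6+"

protocol, host = "http", "127.0.0.1:19999"  # hypothetical agent address
url = f"{protocol}://{host}/api/v1/charts"
print(url)  # http://127.0.0.1:19999/api/v1/charts
```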

I’m the ML guy, so not really an expert on any systems stuff like this, but I’m happy to try to help you debug it as best we can and to ask others at Netdata who might be more knowledgeable.

I do recall one of our developers hit a similar error message here: https://github.com/netdata/netdata/pull/10060#issuecomment-737310266

But it went away once he reinstalled from the branch and/or rebooted the machine. He said the error disappeared, but he was still not quite sure where it came from or what was driving it.

Let me know if you are still getting it and if you can find any more info on it.

One thing to try would be to just run the collector in debug mode and see if that gives any more info or detail in the error message:

# become netdata user
sudo su -s /bin/bash netdata
# run collector in debug using `nolock` option if netdata is already running the collector itself.
/usr/libexec/netdata/plugins.d/python.d.plugin anomalies debug trace nolock

Apparently Python 2 was globally set, typical in CentOS. One can change this using alternatives, but I’d rather change it just for Netdata, or have the Netdata plugin check for python3 first.

Any tips on setting the relevant environment, and how, are welcome. For now I hope python3 will work globally; yum doesn’t complain, for starters, but something else might.

To change the default python version in CentOS 8 run:

alternatives --config python

Anomalies are up and running! Alerts are being fired off! :slight_smile:

It is possible to set it in netdata.conf:

[ilyam@pc netdata]$ grep "\[plugin:python.d\]" netdata.conf -A 2
[plugin:python.d]
	# update every = 1
	# command options =

=>

[plugin:python.d]
	# update every = 1
	command options = -ppython3

@andrewm4894, we could add a Python version check to the anomalies module and log a meaningful message if the actual version is lower than the expected version.
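
A hedged sketch of what such a check could look like. The minimum version, function name, and message wording below are all assumptions for illustration, not the actual collector code:

```python
import sys

# The anomalies module uses f-strings, which need Python 3.6+.
MIN_PYTHON = (3, 6)  # assumed minimum; not read from the real module

def python_version_ok(version_info=sys.version_info):
    """Return True if the interpreter is new enough for this module."""
    if tuple(version_info[:2]) < MIN_PYTHON:
        # A meaningful log line instead of a bare SyntaxError:
        print(
            "anomalies: Python %d.%d found but %d.%d+ required; "
            "try 'command options = -ppython3' under [plugin:python.d]"
            % (version_info[0], version_info[1], MIN_PYTHON[0], MIN_PYTHON[1])
        )
        return False
    return True

print(python_version_ok())            # True on a 3.6+ interpreter
print(python_version_ok((2, 7, 18)))  # False, with a hint printed first
```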


@caleno glad you got sorted!

Let me know how you get on with the anomalies collector after it’s been running for a few days, as I’d love to hear any feedback or examples of where it was useful and where it raised false positives or missed something.

Am I OK to mark this thread as resolved?

Yes, you can mark it as resolved. I’ll stick with setting the global python version for now.

@andrewm4894
I’ll let you know if I gain some useful insights using anomalies. For now it is producing a lot of alerts from time to time, in occasional bursts. I’m not sure what it all means or whether it is useful. I guess it has to run for a while to learn the patterns of a “normal” state.

@ilyam8
I tried setting the command option but couldn’t get the anomalies to work; it seems it starts up but then gets killed.

/var/log/netdata/.local/lib/python3.6/site-packages/pyod/models/pca.py:269: RuntimeWarning: divide by zero encountered in true_divide
/var/log/netdata/.local/lib/python3.6/site-packages/numpy/core/_methods.py:202: RuntimeWarning: invalid value encountered in subtract
2020-12-08 11:58:03: python.d INFO: anomalies[local] : training complete in 0.72 seconds (runs_counter=1, model=pca, train_n_secs=14400, models=22, n_fit_success=22, n_fit_fails=0, after=1607410682, before=1607425082).
2020-12-08 11:58:15: netdata ERROR : PLUGINSD[python.d] : requested a BEGIN on chart 'netdata.runtime_alarms_local', which does not exist on host 'localhost.localdomain'. Disabling it.
2020-12-08 11:58:15: netdata INFO  : PLUGINSD[python.d] : PARSER ended
2020-12-08 11:58:15: netdata ERROR : PLUGINSD[python.d] : '/usr/libexec/netdata/plugins.d/python.d.plugin' (pid 5284) disconnected after 379 successful data collections (ENDs).
2020-12-08 11:58:15: netdata ERROR : PLUGINSD[python.d] : child pid 5284 killed by signal 15.
2020-12-08 11:58:15: netdata INFO  : PLUGINSD[python.d] : '/usr/libexec/netdata/plugins.d/python.d.plugin' (pid 5284) was killed with SIGTERM. Disabling it.
2020-12-08 11:58:15: netdata INFO  : PLUGINSD[python.d] : thread with task id 5271 finished

Running the plugin in debug mode with the python3 option as the netdata user yields:

bash-4.4$ /usr/libexec/netdata/plugins.d/python.d.plugin -ppython3 anomalies debug trace nolock

2020-12-08 12:04:05: python.d INFO: plugin[main] : using python v3
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : looking for 'python.d.conf' in ['/etc/netdata', '/usr/lib/netdata/conf.d']
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : loading '/etc/netdata/python.d.conf'
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : '/etc/netdata/python.d.conf' is loaded
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : looking for 'pythond-jobs-statuses.json' in /var/lib/netdata
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : loading '/var/lib/netdata/pythond-jobs-statuses.json'
2020-12-08 12:04:05: python.d DEBUG: plugin[main] : '/var/lib/netdata/pythond-jobs-statuses.json' is loaded
/var/log/netdata/.local/lib/python3.6/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.utils.testing module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.
  warnings.warn(message, FutureWarning)
2020-12-08 12:04:06: python.d DEBUG: plugin[main] : [anomalies] looking for 'anomalies.conf' in ['/etc/netdata/python.d', '/usr/lib/netdata/conf.d/python.d']
2020-12-08 12:04:06: python.d DEBUG: plugin[main] : [anomalies] loading '/usr/lib/netdata/conf.d/python.d/anomalies.conf'
2020-12-08 12:04:06: python.d DEBUG: plugin[main] : [anomalies] '/usr/lib/netdata/conf.d/python.d/anomalies.conf' is loaded
2020-12-08 12:04:06: python.d INFO: plugin[main] : [anomalies] built 1 job(s) configs
2020-12-08 12:04:06: python.d DEBUG: plugin[main] : anomalies[local] was previously active, applying recovering settings
2020-12-08 12:04:06: python.d INFO: plugin[main] : anomalies[local] : check success
CHART netdata.runtime_anomalies_local '' 'Execution time for anomalies_local' 'ms' 'python.d' netdata.pythond_runtime line 145000 1
DIMENSION run_time 'run time' absolute 1 1

2020-12-08 12:04:06: python.d DEBUG: anomalies[local] : started, update frequency: 1
/var/log/netdata/.local/lib/python3.6/site-packages/pyod/models/pca.py:269: RuntimeWarning: divide by zero encountered in true_divide
  cdist(X, self.selected_components_) / self.selected_w_components_,
/var/log/netdata/.local/lib/python3.6/site-packages/numpy/core/_methods.py:202: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
2020-12-08 12:04:07: python.d INFO: anomalies[local] : training complete in 0.69 seconds (runs_counter=1, model=pca, train_n_secs=14400, models=22, n_fit_success=22, n_fit_fails=0, after=1607411046, before=1607425446).
2020-12-08 12:04:07: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:07: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 1)
2020-12-08 12:04:07: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:07: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 2)
2020-12-08 12:04:08: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:08: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 3)
2020-12-08 12:04:09: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:09: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 4)
2020-12-08 12:04:10: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:10: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 5)
2020-12-08 12:04:13: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:13: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 6)
2020-12-08 12:04:16: python.d DEBUG: anomalies[local] : get_data() returned no data
2020-12-08 12:04:16: python.d DEBUG: anomalies[local] : update => [FAILED] (elapsed time: -, failed retries in a row: 7)

Thanks for the help :slight_smile:

@caleno yep, you might see a few warnings from it for a bit. I think the 2m window and some of the thresholds in here might be a bit low by default: https://github.com/netdata/netdata/blob/master/health/health.d/anomalies.conf

When you see some bursts, do the underlying charts look at least strange or different, even if they might not be full-on anomalies you actually care about? That will be the tricky part: differentiating between charts that merely look strange and charts that look strange in a way that requires immediate action.

Or it could be that the default pca model itself might not quite suit your node, so you could perhaps try the cblof or hbos model instead.

Anyway, thanks for trying it out, and I’m happy to hear any and all feedback on it so I can best understand its strengths and weaknesses.

@andrewm4894

/var/log/netdata/.local/lib/python3.6/site-packages/pyod/models/pca.py:269: RuntimeWarning: divide by zero encountered in true_divide
  cdist(X, self.selected_components_) / self.selected_w_components_,
/var/log/netdata/.local/lib/python3.6/site-packages/numpy/core/_methods.py:202: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)

There are some unhandled exceptions; are those warnings we can filter, or are they unexpected exceptions?

EDIT:

RuntimeWarning

OK, I see, it’s a warning; let’s filter them then!
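
Filtering just those specific RuntimeWarning messages can be done with the standard warnings module; a minimal sketch (the message strings are taken from the numpy/pyod output quoted earlier, and the example fires dummy warnings instead of the real numpy operations):

```python
import warnings

# Ignore only the specific RuntimeWarning messages seen in the logs,
# rather than silencing all warnings globally. filterwarnings matches
# the message as a regex against the start of the warning text.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message="divide by zero encountered")
    warnings.filterwarnings("ignore", message="invalid value encountered")
    warnings.warn("divide by zero encountered in true_divide", RuntimeWarning)
    warnings.warn("some unrelated warning", RuntimeWarning)

print(len(caught))  # 1 -> only the unrelated warning got through
```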

@andrewm4894 the collector returns no data; is that expected behavior?

Yep; for the first 10 or so steps it won’t have enough data to make the matrix of lagged and smoothed values it needs for the predict() function. It should then kick in once it has enough data.

Yeah, on the warnings: numpy, sklearn, and the various libraries can sometimes give warnings, and it can depend on the nature of the specific data itself. E.g. the warning you got should probably go away soon, as I think it is being driven by the PCA not yet having enough data to train a valid model (it might not converge internally). Was that a fresh node, by any chance?

I have suppressed some of the most common ones I had seen here: https://github.com/netdata/netdata/blob/master/collectors/python.d.plugin/anomalies/anomalies.chart.py#L26

So I can make a PR to add any more insignificant warnings we come across as people use it.

See https://github.com/netdata/netdata/blob/1cdf2851623e568d56c38604fe1d8b216380cb77/collectors/python.d.plugin/python_modules/bases/FrameworkServices/SimpleService.py#L52-L55

python.d.plugin applies a penalty to update_every after every 5 failed attempts in a row, and resets it after the first successful data collection.

This is controlled by the penalty option, which is True by default.
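
As a rough sketch of that backoff idea: the effective polling interval grows after every run of consecutive failures and resets on the first success. The constants and formula below are assumptions for illustration only; see the SimpleService link above for the real implementation:

```python
# Illustrative penalty/backoff sketch (NOT the real SimpleService code):
# every 5 consecutive failures add one penalty step to the interval,
# capped at 10 minutes.
PENALTY_EVERY = 5          # add a penalty step every 5 failed runs in a row
MAX_PENALTY_SECONDS = 600  # never back off more than 10 minutes

def effective_update_every(update_every, failed_retries):
    """Polling interval (seconds) after `failed_retries` consecutive failures."""
    penalty_steps = failed_retries // PENALTY_EVERY
    penalty = penalty_steps * update_every
    return min(update_every + penalty, MAX_PENALTY_SECONDS)

print(effective_update_every(1, 0))   # 1 -> no failures, no penalty
print(effective_update_every(1, 12))  # 3 -> two penalty steps applied
```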

PR to add the additional warnings: https://github.com/netdata/netdata/pull/10369

Yeah, I was a bit unsure of the consequences of this, and I didn’t want to get too fancy capturing the few errors at the start, as it could just add more complexity.

In reality, I think the retries penalty means it just takes maybe ~20-30 seconds or so to begin producing valid results, versus the theoretical minimum (depending on lags_n and smooth_n): with the default params of around 5 lag values + 3 smoothing values, I think it’s ~7 steps before you have enough data to make a prediction.
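
The step arithmetic above can be sketched as a back-of-the-envelope calculation. The parameter values are the defaults quoted in the discussion, not read from the collector’s config, and the overlap assumption is mine:

```python
# Rough arithmetic for how many 1-second steps are needed before the
# first prediction with the defaults mentioned above. Illustrative only,
# not the collector's actual code.
lags_n = 5    # lagged values per dimension (default quoted above)
smooth_n = 3  # smoothing window applied before lagging

steps_needed = lags_n + smooth_n - 1  # windows overlap by one step
print(steps_needed)  # 7 -> matches the "~7 steps" estimate above
```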

I was kind of thinking to just let the failures happen and pay a bit of penalty at the start, for the sake of less complexity in the code and for fear of doing anything that might mask or hide other, more serious errors.

Do you think we should do something different here, or is it OK as is?

I think it is OK; I just wanted to ensure you are aware of that penalty feature :grinning_face_with_smiling_eyes:
