Troubleshooting useless/misleading NGINX "number of seconds since the last successful data collection" alerts?

Ryan_F · April 22, 2024, 8:29pm

Thank you for any and all help. This is happening on several servers, somewhat randomly, and the alerts are false positives in the sense that the NGINX service is running properly.

Problem/Question

Getting regular alerts for NGINX:
Escalated to Critical, NGINX = 1213 seconds ago, on server.host.name
number of seconds since the last successful data collection

Relevant docs you followed/actions you took to solve the issue

Unable to find any documentation

Environment/Browser/Agent’s version etc

Packaging:
Netdata Version ____________________________________________ : v1.45.3
Installation Type __________________________________________ : binpkg-rpm
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ :
Configure Options __________________________________________ : dummy-configure-command
Default Directories:
User Configurations ________________________________________ : /etc/netdata
Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
Permanent Databases ________________________________________ : /var/lib/netdata
Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /usr/share/netdata/web
Log Files __________________________________________________ : /var/log/netdata
Lock Files _________________________________________________ : /var/lib/netdata/lock
Home _______________________________________________________ : /var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 4.18.0-513.11.1.lve.el8.x86_64
Operating System ___________________________________________ : CloudLinux
Operating System ID ________________________________________ : cloudlinux
Operating System ID Like ___________________________________ : rhel fedora centos
Operating System Version ___________________________________ : 8.9 (Anatoly Levchenko)
Operating System Version ID ________________________________ : none
Detection __________________________________________________ : /etc/os-release
Hardware:
CPU Cores __________________________________________________ : 32
CPU Frequency ______________________________________________ : 2600000000
RAM Bytes __________________________________________________ : 126162604032
Disk Capacity ______________________________________________ : 1717986918400
CPU Architecture ___________________________________________ : x86_64
Virtualization Technology __________________________________ : kvm
Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
Container __________________________________________________ : none
Container Detection ________________________________________ : systemd-detect-virt
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : none
Container Operating System ID ______________________________ : none
Container Operating System ID Like _________________________ : none
Container Operating System Version _________________________ : none
Container Operating System Version ID ______________________ : none
Container Operating System Detection _______________________ : none
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Back-filling (of higher database tiers) ____________________ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (zstd gzip)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine ___________________________________________________ : YES
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : NO
ZSTD (fast, lossless compression algorithm) ________________ : YES
zlib (lossless data-compression library) ___________________ : YES
Brotli (generic-purpose lossless compression algorithm) ____ : NO
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : YES
ebpf (monitor system calls) ________________________________ : YES
freeipmi (monitor enterprise server H/W) ___________________ : YES
nfacct (gather netfilter accounting) _______________________ : NO
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : NO
Xen VBD Error Tracking _____________________________________ : NO
Logs Management ____________________________________________ : YES
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : YES
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO

What I expected to happen

No alerts are expected, as NGINX is working properly.

ilyam8 · April 23, 2024, 9:51am

Hi, @Ryan_F.

Escalated to Critical, NGINX = 1213 seconds ago, on server.host.name

What alarm are you referring to? Can you provide the alarm/template name? The latest stable version of Netdata is v1.45.3, and the nginx last collected alarm was removed in v1.32.0 (2.5 years ago).

Ryan_F · April 23, 2024, 5:01pm

Forgive me if I did not provide enough context. We have a Slack integration:

[Escalated to Critical, NGINX = 1213 seconds ago, on server.host.name]
number of seconds since the last successful data collection

Alert:
NGINX

Chart:
nginx_local.requests

Context:
nginx.requests

Another one:

Escalated to Critical, apache_last_collected_secs = 1213 seconds ago, on server.host.name

number of seconds since the last successful data collection

Alert:
apache_last_collected_secs

Chart:
apache_local.requests

Context:
apache.requests

Based on what I’m seeing, there is an alarm that goes off at 1213 seconds on various servers for both NGINX and Apache.

If you need different information, please be more specific. Thank you.

Ryan_F · April 23, 2024, 5:03pm

Also, some more context: We made our own NGINX last collected alarm by copying Apache’s, and they’re alerting at the same time.
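For readers unfamiliar with this kind of alert: the reported value is the age of the newest collected sample, so "1213 seconds ago" means the chart has not received data for about 20 minutes. Below is a minimal sketch of that arithmetic. It assumes the alarm compares the age against multiples of the chart's update interval; the factors 5 and 60 and both function names are illustrative assumptions, not values taken from this thread.

```python
def seconds_since_last_collection(now, last_collected_t):
    """Age of the newest sample, in whole seconds (never negative)."""
    return max(0, int(now - last_collected_t))


def classify(stale_secs, update_every=1, warn_factor=5, crit_factor=60):
    """Map a staleness value to an alert status.

    warn_factor and crit_factor are hypothetical thresholds, expressed
    as multiples of the chart's update interval in seconds.
    """
    if stale_secs > crit_factor * update_every:
        return "CRITICAL"
    if stale_secs > warn_factor * update_every:
        return "WARNING"
    return "CLEAR"


# Example: a chart last updated 1213 seconds before "now".
age = seconds_since_last_collection(now=2000, last_collected_t=787)
status = classify(age)
```

Under these assumed thresholds, a chart that has been stale for 1213 seconds is well past the critical limit, which matches the escalation seen in the Slack message.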

ilyam8 · April 24, 2024, 11:45am

Are you using parent/child setup or standalone instances?

The reason for this alarm may be:

The nginx collector tries to gather metrics and keeps failing (e.g. nginx is not running or is not reachable for some reason).
The collector stopped (exited) for some reason. In that case you shouldn't see go.d.plugin in the ps faxu | grep netdata output.
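Both possibilities can be checked from the affected host while the alert is active. The following is a rough sketch, not an official Netdata tool: it looks for go.d.plugin in the ps output and tries to read the NGINX stub_status page. The URL http://127.0.0.1/stub_status is an assumption and should be replaced with whatever the collector job is configured to scrape; the function names are made up for illustration.

```python
import re
import subprocess
import urllib.request


def plugin_processes(ps_output, name="go.d.plugin"):
    """Return the lines of `ps` output that mention the plugin binary."""
    return [line for line in ps_output.splitlines()
            if name in line and "grep" not in line]


def parse_stub_status(body):
    """Extract counters from a stub_status page; None if the text does not match."""
    active = re.search(r"Active connections:\s*(\d+)", body)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", body, re.MULTILINE)
    if not (active and totals):
        return None
    accepts, handled, requests = (int(n) for n in totals.groups())
    return {"active": int(active.group(1)), "accepts": accepts,
            "handled": handled, "requests": requests}


def check(url="http://127.0.0.1/stub_status", timeout=2.0):
    """Run both checks locally and return (plugin_running, nginx_stats)."""
    ps = subprocess.run(["ps", "faxu"], capture_output=True, text=True).stdout
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            stats = parse_stub_status(resp.read().decode("utf-8", "replace"))
    except OSError:
        stats = None  # unreachable, refused, HTTP error, or timed out
    return bool(plugin_processes(ps)), stats
```

Calling check() on the host returns a pair: whether a go.d.plugin process was found, and the parsed counters (or None if the status page could not be read). If the plugin is running and the page is readable from the same host, the collector logs are the next place to look.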

Do you see Nginx/Apache charts being updated while the alarm is triggered?

Ryan_F · April 30, 2024, 4:58pm

No, the NGINX and Apache charts are not updated while these alarms are triggered. However, the alerts seem to have stopped as suddenly as they started.

We’re using a parent/child setup of Netdata.

ilyam8 · May 1, 2024, 8:27am

Then the alarm is not useless. It indicates that the data collection stopped for some reason. Did you check the logs when this happened? Checking logs when something is not working is usually a good thing to do. This is how to get logs since the last time Netdata was started on a systemd system. You can send me your logs to ilya@netdata.cloud and I will take a look.

Please also share the output of ps faxu | grep netdata while the alarm is active.
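One possible way to collect the requested logs on a systemd host is sketched below. It assumes the service unit is named netdata and that the agent's messages end up in the default systemd journal; if the agent writes to a separate journal namespace or to files under /var/log/netdata, the command needs adjusting. The keyword filter and all function names are illustrative only.

```python
import subprocess


def journal_command(invocation_id):
    """Build a journalctl call limited to one invocation of a unit."""
    return ["journalctl", "--no-pager",
            f"_SYSTEMD_INVOCATION_ID={invocation_id}"]


def collector_lines(log_text, keywords=("go.d", "nginx", "apache")):
    """Keep only the log lines that mention any of the keywords."""
    lowered = tuple(k.lower() for k in keywords)
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in lowered)]


def logs_since_last_start(unit="netdata"):
    """Return journal text for the current invocation of the given unit."""
    invocation_id = subprocess.run(
        ["systemctl", "show", "--value", "--property=InvocationID", unit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return subprocess.run(
        journal_command(invocation_id),
        capture_output=True, text=True, check=True,
    ).stdout
```

A typical use would be collector_lines(logs_since_last_start()), run on the child node around the time the alert fires, with the result attached to the reply together with the ps output.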