Hello all and a belated happy new year.
We’re running netdata on a few cheap ‘Root’ Servers (= root access to the server, dedicated CPU Cores, KVM Virtualized, no access to the KVM Hostsystem) and are observing something that I would call microfreezes on those servers without being able to pinpoint the source of this issue so i’m writing in hope that you guys could share your experience and help me with digging deeper into it. To be completely honest I really don’t know how to gain some ground with this issue.
Images do more than thousand words, so I’ll attach ttwo screenshots:
And information about out systems ( analog to Constant gaps in all charts ) :
uname -a; grep -Hv "^#" /etc/*release
Linux [...] 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.3 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.3 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal
Netdata:
netdata -W buildinfo
Version: netdata v1.32.1-13-nightly
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--without-bundled-protobuf' '--with-bundled-lws' '--with-bundled-libJudy' 'CFLAGS=' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: YES
ACLK-NG New Cloud Protocol: YES
ACLK Legacy: YES
TLS Host Verification: YES
Machine Learning: YES
Libraries:
protobuf: YES (system)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
LWS: YES static v3.2.2
mosquitto: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: NO
IPMI: YES
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: YES
Prometheus Remote Write: YES
Error log:
2022-01-07 20:16:38: netdata ERROR : ACLK_Stats : heartbeat missed 3780115 monotonic microseconds
2022-01-07 23:51:38: netdata ERROR : ACLK_Stats : heartbeat missed 2536088 monotonic microseconds
2022-01-08 15:14:13: netdata ERROR : ACLK_Stats : heartbeat missed 520038 monotonic microseconds
2022-01-08 16:00:37: netdata ERROR : ACLK_Stats : heartbeat missed 3416109 monotonic microseconds
2022-01-09 07:34:28: netdata ERROR : ACLK_Stats : heartbeat missed 623976 monotonic microseconds
2022-01-09 14:30:39: netdata ERROR : ACLK_Stats : heartbeat missed 3839931 monotonic microseconds
2022-01-10 02:00:35: netdata ERROR : ACLK_Stats : heartbeat missed 9555843 monotonic microseconds
2022-01-10 02:00:38: netdata ERROR : ACLK_Stats : heartbeat missed 1615961 monotonic microseconds
2022-01-11 14:14:21: netdata ERROR : ACLK_Stats : heartbeat missed 667975 monotonic microseconds
2022-01-11 14:19:38: netdata ERROR : ACLK_Stats : heartbeat missed 1291966 monotonic microseconds
2022-01-11 15:35:40: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-01-11 15:35:40: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-01-11 15:35:40: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:35:45: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:45: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:48: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:48: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 1.909 seconds
2022-01-11 15:35:50: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:35:53: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:53: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:54: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:54: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 2.180 seconds
2022-01-11 15:35:57: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:35:57: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:57: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:59: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:59: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 5.578 seconds
2022-01-11 15:36:04: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:36:05: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:05: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:05: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:05: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 11.155 seconds
2022-01-11 15:36:16: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:36:17: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:17: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:17: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:17: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 19.423 seconds
2022-01-11 15:36:37: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:36:37: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:37: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:38: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:38: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 45.258 seconds
2022-01-11 15:37:23: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:37:24: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:37:24: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:37:24: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:37:24: netdata INFO : ACLK_Main : Wait before attempting to reconnect in 85.835 seconds
2022-01-11 15:38:50: netdata INFO : ACLK_Main : Attempting connection now
2022-01-11 15:38:51: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:38:51: netdata INFO : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:38:52: netdata INFO : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:38:52: netdata INFO : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-01-11 15:38:52: netdata INFO : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-01-11 15:38:52: netdata INFO : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-01-11 15:38:53: netdata INFO : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-01-11 15:38:53: netdata INFO : ACLK_Main : ACLK connection successfully established
2022-01-11 15:38:53: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already. (errno 22, Invalid argument)
2022-01-12 00:17:04: netdata ERROR : ACLK_Stats : heartbeat missed 823972 monotonic microseconds
2022-01-13 18:40:41: netdata ERROR : ACLK_Stats : heartbeat missed 2311950 monotonic microseconds
2022-01-13 21:35:48: netdata ERROR : ACLK_Stats : heartbeat missed 5315906 monotonic microseconds
Netdata runs inside a Docker Container.
Our/My thoughts:
We’re able to observe this problem distributed over most of or all of our servers, independent on what kind of applications/services are running there - depending on the load (and number of requests/tasks) the volatility of the metrics increases (which is to be expected i guess).
I think we can rule out that this is netdata or docker related and rather something that affects the whole system. If this occours to our loadbalancer, our uptime monitoring reports a downtime (or long response times, depending on the duration of the freeze and also on the current load/requests i guess)
If it happens on our (not yet redundant/HA) nfs server or database, we see this also reflected to the servers trying to access them which leads me to the conclusion that this affects the whole system.
Depending on the current load (which is usually not higher than 20% of the total capacity) on the server, watchdogd reports a Soft Lockup in seemingly random processes. We’re thinking that this is only a sympthom and doesn’t help us getting to the root cause of the problem.
Since we’re trying for quite a while to get a hold of this but are unable to identify anything we’re also considering this might be a problem with the KVM hostsystem and thus out of our reach.
Any help would be hightly appreciated.
Thanks
Christian