Troubleshooting 'Microfreezes' of our servers

Hello all and a belated happy new year.

We’re running netdata on a few cheap ‘Root’ Servers (= root access to the server, dedicated CPU Cores, KVM Virtualized, no access to the KVM Hostsystem) and are observing something that I would call microfreezes on those servers without being able to pinpoint the source of this issue so i’m writing in hope that you guys could share your experience and help me with digging deeper into it. To be completely honest I really don’t know how to gain some ground with this issue.

Images do more than thousand words, so I’ll attach ttwo screenshots:

And information about out systems ( analog to Constant gaps in all charts ) :

uname -a; grep -Hv "^#" /etc/*release
Linux [...] 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04.3 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04.3 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal

Netdata:

netdata -W buildinfo
Version: netdata v1.32.1-13-nightly                                                                                                                                                                                                                                             
Configure options:  '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--without-bundled-protobuf' '--with-bundled-lws' '--with-bundled-libJudy' 'CFLAGS=' 'LDFLAGS='
Features:                                                                                                                                                                                                                                                                       
    dbengine:                   YES                                                                                                                                                                                                                                             
    Native HTTPS:               YES                                                                                                                                                                                                                                             
    Netdata Cloud:              YES                                                                                                                                                                                                                                             
    ACLK Next Generation:       YES                                                                                                                                                                                                                                             
    ACLK-NG New Cloud Protocol: YES                                                                                                                                                                                                                                             
    ACLK Legacy:                YES                                                                                                                                                                                                                                             
    TLS Host Verification:      YES                                                                                                                                                                                                                                             
    Machine Learning:           YES                                                                                                                                                                                                                                             
Libraries:                                                                                                                                                                                                                                                                      
    protobuf:                YES (system)                                                                                                                                                                                                                                       
    jemalloc:                NO                                                                                                                                                                                                                                                 
    JSON-C:                  YES                                                                                                                                                                                                                                                
    libcap:                  NO                                                                                                                                                                                                                                                 
    libcrypto:               YES                                                                                                                                                                                                                                                
    libm:                    YES                                                                                                                                                                                                                                                
    LWS:                     YES static v3.2.2
    mosquitto:               YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    NO
    IPMI:                    YES
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 YES
    Prometheus Remote Write: YES

Error log:

2022-01-07 20:16:38: netdata ERROR : ACLK_Stats : heartbeat missed 3780115 monotonic microseconds
2022-01-07 23:51:38: netdata ERROR : ACLK_Stats : heartbeat missed 2536088 monotonic microseconds
2022-01-08 15:14:13: netdata ERROR : ACLK_Stats : heartbeat missed 520038 monotonic microseconds
2022-01-08 16:00:37: netdata ERROR : ACLK_Stats : heartbeat missed 3416109 monotonic microseconds
2022-01-09 07:34:28: netdata ERROR : ACLK_Stats : heartbeat missed 623976 monotonic microseconds
2022-01-09 14:30:39: netdata ERROR : ACLK_Stats : heartbeat missed 3839931 monotonic microseconds
2022-01-10 02:00:35: netdata ERROR : ACLK_Stats : heartbeat missed 9555843 monotonic microseconds
2022-01-10 02:00:38: netdata ERROR : ACLK_Stats : heartbeat missed 1615961 monotonic microseconds
2022-01-11 14:14:21: netdata ERROR : ACLK_Stats : heartbeat missed 667975 monotonic microseconds
2022-01-11 14:19:38: netdata ERROR : ACLK_Stats : heartbeat missed 1291966 monotonic microseconds
2022-01-11 15:35:40: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-01-11 15:35:40: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-01-11 15:35:40: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:35:45: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:45: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:48: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:48: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:48: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 1.909 seconds
2022-01-11 15:35:50: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:35:53: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:53: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:54: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:54: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:54: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 2.180 seconds
2022-01-11 15:35:57: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:35:57: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:35:57: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:35:59: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:35:59: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:35:59: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 5.578 seconds
2022-01-11 15:36:04: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:36:05: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:05: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:05: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:05: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:05: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 11.155 seconds
2022-01-11 15:36:16: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:36:17: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:17: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:17: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:17: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:17: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 19.423 seconds
2022-01-11 15:36:37: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:36:37: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:36:37: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:36:38: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:36:38: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:36:38: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 45.258 seconds
2022-01-11 15:37:23: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:37:24: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:37:24: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:37:24: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud)
2022-01-11 15:37:24: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP (errno 22, Invalid argument)
2022-01-11 15:37:24: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 85.835 seconds
2022-01-11 15:38:50: netdata INFO  : ACLK_Main : Attempting connection now
2022-01-11 15:38:51: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:38:51: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-01-11 15:38:52: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-01-11 15:38:52: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-01-11 15:38:52: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-01-11 15:38:52: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-01-11 15:38:53: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-01-11 15:38:53: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-01-11 15:38:53: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already. (errno 22, Invalid argument)
2022-01-12 00:17:04: netdata ERROR : ACLK_Stats : heartbeat missed 823972 monotonic microseconds
2022-01-13 18:40:41: netdata ERROR : ACLK_Stats : heartbeat missed 2311950 monotonic microseconds
2022-01-13 21:35:48: netdata ERROR : ACLK_Stats : heartbeat missed 5315906 monotonic microseconds

Netdata runs inside a Docker Container.

Our/My thoughts:

We’re able to observe this problem distributed over most of or all of our servers, independent on what kind of applications/services are running there - depending on the load (and number of requests/tasks) the volatility of the metrics increases (which is to be expected i guess).

I think we can rule out that this is netdata or docker related and rather something that affects the whole system. If this occours to our loadbalancer, our uptime monitoring reports a downtime (or long response times, depending on the duration of the freeze and also on the current load/requests i guess)
If it happens on our (not yet redundant/HA) nfs server or database, we see this also reflected to the servers trying to access them which leads me to the conclusion that this affects the whole system.

Depending on the current load (which is usually not higher than 20% of the total capacity) on the server, watchdogd reports a Soft Lockup in seemingly random processes. We’re thinking that this is only a sympthom and doesn’t help us getting to the root cause of the problem.

Since we’re trying for quite a while to get a hold of this but are unable to identify anything we’re also considering this might be a problem with the KVM hostsystem and thus out of our reach.

Any help would be hightly appreciated.

Thanks
Christian

I’ve had the same problem on a dedicated server I manage (Ubuntu 20.04).
Running uptime would show the time since the latest such gap.

I initially thought it was a file descriptor limit issue, but increasing that didn’t help much. After 5 days of normal behavior the issue came up again.

Utilization of RAM has always been around 10-30% of the available memory and the most process intensive task is mysqld running as an application service on docker.

I had Netdata installed using the installer script.
After uninstalling it, the machine no longer freezes. I think it’s too soon to declare a victory since only about 2 days have passed since.

@chr1s Hey Christian! Somehow we missed your message, sorry for that! Are you still facing issues?

@datio Hello Alex! Thanks for reporting this! We do have scenarios with netdata running on constrained resources and still works fine so let’s see if there is any specific bug.

Is it possible to share some more details about your netdata installation (e.g version, install method) and your machine configuration?

Thanks,
Costas

My server froze again so it is certainly not netdata causing the problem.

I wonder if @chr1s pinpointed the reason those freezes happen.

Thanks @datio for your feedback, keep us informed about future challenges!

Hi @datio and @copi,

sorry for replying so late, kind of busy recently.

We weren’t able to really get any further with this and kind of came to the conclusion that it is indeed the KVM hostsystem which manages the VMs we’ve rented. Together with the hoster/provider we were able to reduce this observed problem to a minimum (they said they tweaked some kernel configuration on their hosts but no details were given) but it’s still not gone completely and we’re still unable to tell why it happens, besides guessing that it is not our fault.

It would be amazing if anyone here does have a clue how to identify and attribute these kind of problems because it’s always hard to tell a provider that their setup is having problems on VMs/dedicated/root servers. Kind of understandable, I bet most of the support tickets these guys get are related to someone messing up their own setup but how great would it be to be able to say “nothing wrong here, look at these metrics, it’s you” right away without having to turn your whole system upside down.

@datio sorry for not being able to share any helpful advice here, we’re pretty much as lost as you are, i guess. if you figure out something new we’d love to hear from you. (btw. is that dedicated server of yorus a kvm virtual machine too? possibly running at a german hoster?)

thanks and have a nice week

Hey @chr1s,

We’ve recently merged a PR related with our scheduling policy, more info can be found here: Netdata and CPUSchedulingPolicy · Issue #12254 · netdata/netdata · GitHub.

This actually allows our monitoring solution to be still operational in cases where the system is under heavy CPU stress and enable you to understand what went wrong and troubleshoot the root problem.

I would strongly suggest to update netdata on your systems. In case this is not possible you can also manually apply the changes in netdata.confand/or systemd service file.

1 Like

will try so shortly - thanks a lot for the response @copi