parent node web ui crashing every couple of days

hgarcia · May 2, 2023, 11:14pm

As the title mentions, I have a parent node whose web server goes down every 2-3 days, after which requests return 504 errors. Roughly 30 child nodes stream to it, and it sits behind an HAProxy setup that routes traffic to its default port.

When the web server is unreachable, checking the status of the service on the server shows that it is active/running, and CPU/memory/disk are not reporting high usage. Restarting the service fixes the web server, but it becomes unreachable again within a couple of days.

Since we noticed the web server going down last week, the charts on the dashboard have lost data and do not go back more than a couple of days.

Parent node

grep -Hv "^#" /etc/*release:

/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=22.04
/etc/lsb-release:DISTRIB_CODENAME=jammy
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
/etc/os-release:PRETTY_NAME="Ubuntu 22.04.1 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION_ID="22.04"
/etc/os-release:VERSION="22.04.1 LTS (Jammy Jellyfish)"
/etc/os-release:VERSION_CODENAME=jammy
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:UBUNTU_CODENAME=jammy

build info:

Version: netdata v1.38.1
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES
Debug/Developer Features:
    Trace Allocations:       NO

netdata.conf:

[db]
    mode = dbengine
    storage tiers = 3
    update every = 1
    dbengine multihost disk space MB = 1024
    dbengine page cache size MB = 64
    dbengine tier 1 update every iterations = 60
    dbengine tier 1 multihost disk space MB = 384
    dbengine tier 1 page cache size MB = 36
    dbengine tier 2 update every iterations = 60
    dbengine tier 2 multihost disk space MB = 192
    dbengine tier 2 page cache size MB = 36

stream.conf:

[key]
    enabled = yes
    default memory mode = dbengine

Logs from error.log containing ‘WEB’:

2023-05-01 09:02:52: netdata INFO  : WEB[3] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 6 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED
2023-05-01 09:04:31: netdata ERROR : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: thread 1589156 takes too long to stop, giving up...
2023-05-01 09:04:31: netdata INFO  : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 51 secs ago (signaled old receiver to stop). STATUS: ALREADY CONNECTED
2023-05-01 09:07:01: netdata INFO  : WEB[4] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 20 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED
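
In case it helps anyone reading along, here is a rough sketch of how lines like the ones above could be tallied per child and log level, to see which children reconnect most often. The regular expression is only inferred from the four lines pasted here and is an assumption about the error.log layout in this version, not a documented format; `summarize` and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above (an assumption, not an official format):
# "<timestamp>: netdata <LEVEL> : WEB[<n>] : STREAM '<host>' [...]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata "
    r"(?P<level>\w+)\s*: WEB\[(?P<worker>\d+)\] : "
    r"STREAM '(?P<host>[^']+)' .*?: (?P<msg>.*)$"
)


def summarize(lines):
    """Count WEB/STREAM log lines per (child host, log level)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the inferred pattern are skipped
            counts[(match["host"], match["level"])] += 1
    return counts


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "2023-05-01 09:02:52: netdata INFO  : WEB[3] : STREAM 'child-node-1' "
    "[receive from [<ip>]:<port>]: multiple connections for same host, old "
    "connection was used 6 secs ago (new connection not accepted). "
    "STATUS: ALREADY CONNECTED",
    "2023-05-01 09:04:31: netdata ERROR : WEB[1] : STREAM 'child-node-2' "
    "[receive from [<ip>]:<port>]: thread 1589156 takes too long to stop, "
    "giving up...",
]

for (host, level), count in sorted(summarize(SAMPLE).items()):
    print(host, level, count)
```

To run it against a real log, pass an open file object to `summarize` in place of `SAMPLE`.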

Has anyone run into this, or does anyone know of a solution to stop the web server from crashing? Any help would be greatly appreciated!

Manolis_Vasilakis · May 4, 2023, 7:11am

Hi @hgarcia !

Would it be possible to send both error.log and access.log to manolis@netdata.cloud ?

Ideally, please point out in those logs the times at which you saw the web server unreachable.
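
If the exact times are hard to reconstruct after the fact, one option is to leave a small probe running that records when the dashboard stops answering. The following is only a minimal sketch under stated assumptions: the URL you pass in is up to you (for example the HAProxy front end, or the agent directly on its default port 19999 at `/api/v1/info`), and `classify`, `probe` and `watch` are hypothetical helper names, not part of Netdata.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def classify(status):
    """Map an HTTP status code (or None for no response) to a short label."""
    if status is None:
        return "unreachable"
    if status == 504:
        return "gateway-timeout"
    if 200 <= status < 300:
        return "ok"
    return "error"


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server or proxy answered with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, and so on


def watch(url, interval=30.0, rounds=None):
    """Poll url and print a timestamped line whenever its state changes."""
    last = None
    done = 0
    while rounds is None or done < rounds:
        label = classify(probe(url))
        if label != last:
            stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} UTC {url} -> {label}")
            last = label
        done += 1
        time.sleep(interval)
```

Calling `watch("http://<parent-host>:19999/api/v1/info")` with your own host filled in would print one line each time the state flips, which gives timestamps to match against error.log and access.log.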

Since our internal web server handles both the agent’s dashboard and the incoming streaming connections, just to make sure: do both suffer after a couple of days?

It would also be interesting to see the error.log of a streaming child at that point.

Another question regarding your setup: are the children on the same local network, or in more remote locations? Also, do you use SSL in your setup?

Thanks!

hgarcia · May 4, 2023, 7:49pm

Hi @Manolis_Vasilakis, thank you for the response.

The logs requested have been sent to the email you posted.

Other than data on the graphs no longer going back further than a few days, dashboards and streaming do not appear to be impacted while the web server is down. After restarting the netdata service to get the web server back online, the graphs show data for the period when the server was unavailable. I noted the times in the email containing the logs.

To expand a little on data no longer being available past a few days: the configuration for the parent node has not changed, and since the crashes began the graphs show this message: Want to extend your history of real-time metrics? Configure Netdata's history or use the DB engine.

All nodes in the setup are in our internal network and SSL is not being used to stream between the nodes.
