As the title mentions, I have a parent node whose web server goes down every 2-3 days, after which requests to it return 504 errors. Roughly 30 child nodes stream to it, and it sits behind an HAProxy setup that routes traffic to its default port.
When the web server is unreachable, checking the status of the service on the server shows that it is active/running, and CPU/memory/disk are not reporting high usage. Restarting the service fixes the web server, but it becomes unreachable again within a couple of days.
Since we noticed the web server going down last week, the charts on the dashboard have lost data and do not go back more than a couple of days.
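One way to tell whether the 504 originates at HAProxy or at the Netdata web server itself would be to query the parent directly, bypassing the proxy. The following is a minimal sketch under assumptions: the function name `probe_parent` is made up for illustration, the host in the usage note is a placeholder, and the URL assumes the default port 19999 and the `/api/v1/info` endpoint.

```python
import urllib.error
import urllib.request


def probe_parent(url, timeout=5.0):
    """Request `url` directly (no HTTP proxy) and describe what came back.

    Returns "http <code>" when the server answered, or
    "no response: <reason>" when the connection failed or timed out.
    """
    # An empty ProxyHandler ignores any http_proxy/https_proxy environment
    # settings, so the request goes straight to the given host.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return f"http {exc.code}"
    except OSError as exc:
        # URLError, timeouts and connection resets are all OSError subclasses.
        return f"no response: {exc}"
```

Calling `probe_parent("http://<parent-ip>:19999/api/v1/info")` from the HAProxy host during an outage would show whether the parent still answers at all. A timeout here while the service reports active/running would point at the web threads rather than at the proxy.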
Parent node
grep -Hv "^#" /etc/*release:
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=22.04
/etc/lsb-release:DISTRIB_CODENAME=jammy
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
/etc/os-release:PRETTY_NAME="Ubuntu 22.04.1 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION_ID="22.04"
/etc/os-release:VERSION="22.04.1 LTS (Jammy Jellyfish)"
/etc/os-release:VERSION_CODENAME=jammy
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:UBUNTU_CODENAME=jammy
build info:
Version: netdata v1.38.1
Configure options: '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
Binary architecture: x86_64
Packaging distro:
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK: YES
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: YES
Libraries:
protobuf: YES (system)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: YES
EBPF: YES
IPMI: YES
NFACCT: YES
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: YES
Debug/Developer Features:
Trace Allocations: NO
netdata.conf:
[db]
mode = dbengine
storage tiers = 3
update every = 1
dbengine multihost disk space MB = 1024
dbengine page cache size MB = 64
dbengine tier 1 update every iterations = 60
dbengine tier 1 multihost disk space MB = 384
dbengine tier 1 page cache size MB = 36
dbengine tier 2 update every iterations = 60
dbengine tier 2 multihost disk space MB = 192
dbengine tier 2 page cache size MB = 36
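To relate these settings to how far back the charts can go, the arithmetic can be sketched as below. The tier granularities follow from the `update every` and `update every iterations` values above. The retention estimate is only a rough model: the metric count and bytes-per-sample figures are placeholders I have not measured, and treating "MB" as 1024 * 1024 bytes is an assumption.

```python
from dataclasses import dataclass

MIB = 1024 * 1024  # assumption: the "MB" settings are treated as MiB


@dataclass
class Tier:
    index: int
    iterations: int  # points of the previous tier folded into one point (1 for tier 0)
    disk_mb: int     # "multihost disk space MB" configured for this tier


# Values copied from the [db] section above.
TIERS = [Tier(0, 1, 1024), Tier(1, 60, 384), Tier(2, 60, 192)]


def granularities(update_every, tiers):
    """Seconds covered by one stored point in each tier."""
    result = []
    step = update_every
    for tier in tiers:
        step *= tier.iterations
        result.append(step)
    return result


def retention_days(disk_mb, metrics, bytes_per_sample, granularity):
    """Rough retention: disk budget divided by the write rate for all metrics."""
    bytes_per_second = metrics * bytes_per_sample / granularity
    return disk_mb * MIB / bytes_per_second / 86400


def report(metrics, bytes_per_sample):
    """Return (tier index, granularity in seconds, estimated days) per tier."""
    rows = []
    for tier, gran in zip(TIERS, granularities(1, TIERS)):
        days = retention_days(tier.disk_mb, metrics, bytes_per_sample, gran)
        rows.append((tier.index, gran, days))
    return rows
```

For example, `report(metrics=60000, bytes_per_sample=1.0)` models 30 children with a hypothetical 2000 metrics each. Because the disk budget is shared across all hosts, the tier 0 estimate shrinks as children are added.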
stream.conf:
[key]
enabled = yes
default memory mode = dbengine
Logs from error.log containing ‘WEB’:
2023-05-01 09:02:52: netdata INFO : WEB[3] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 6 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED
2023-05-01 09:04:31: netdata ERROR : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: thread 1589156 takes too long to stop, giving up...
2023-05-01 09:04:31: netdata INFO : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 51 secs ago (signaled old receiver to stop). STATUS: ALREADY CONNECTED
2023-05-01 09:07:01: netdata INFO : WEB[4] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 20 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED
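To see which children reconnect most often, the "multiple connections for same host" entries can be tallied per child and per outcome. This is a minimal sketch: the regular expression is written against the four lines above only, so other log formats may not match, and the names `PATTERN` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Matches the duplicate-connection messages shown above and captures the
# child name, the age of the old connection, and the action in parentheses.
PATTERN = re.compile(
    r"STREAM '(?P<host>[^']+)' .*?multiple connections for same host, "
    r"old connection was used (?P<age>\d+) secs ago \((?P<action>[^)]*)\)"
)


def summarize(lines):
    """Count duplicate-connection events per (child, action) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("host"), match.group("action"))] += 1
    return counts


# The log excerpt from above, used as sample input.
SAMPLE = [
    "2023-05-01 09:02:52: netdata INFO : WEB[3] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 6 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED",
    "2023-05-01 09:04:31: netdata ERROR : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: thread 1589156 takes too long to stop, giving up...",
    "2023-05-01 09:04:31: netdata INFO : WEB[1] : STREAM 'child-node-2' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 51 secs ago (signaled old receiver to stop). STATUS: ALREADY CONNECTED",
    "2023-05-01 09:07:01: netdata INFO : WEB[4] : STREAM 'child-node-1' [receive from [<ip>]:<port>]: multiple connections for same host, old connection was used 20 secs ago (new connection not accepted). STATUS: ALREADY CONNECTED",
]

for (host, action), count in sorted(summarize(SAMPLE).items()):
    print(f"{host}: {action} x{count}")
```

Running it over the full error.log (for example by passing an open file object to `summarize`) would show whether a few children account for most of the rejected reconnects.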
Has anyone run into this or know of a way to stop the web server from crashing? Any help would be greatly appreciated!