Problem/Question
The Netdata agent has suddenly started crashing on a number of my Ubuntu 20.04 instances on AWS. I have not run any updates other than the security updates that are enabled by default.
I can’t find anything in common between the instances that exhibit this problem. The Netdata error log does not show anything interesting. The kernel messages are the same on every instance:
dmesg
[4443253.505196] PD[go.d][3707964]: segfault at 0 ip 000055d4bcbb56bf sp 00007f9a2c4f8ef0 error 4 in netdata[55d4bc9f5000+44c000]
[4443253.505210] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443285.508686] PD[go.d][3708424]: segfault at 0 ip 000055c7d22756bf sp 00007f73c7c9bef0 error 4 in netdata[55c7d20b5000+44c000]
[4443285.508700] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443327.506960] PD[go.d][3708902]: segfault at 0 ip 000055726ea096bf sp 00007fc6c5c46ef0 error 4 in netdata[55726e849000+44c000]
[4443327.506972] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443369.504564] PD[go.d][3709379]: segfault at 0 ip 000055cd2f5866bf sp 00007f642a287ef0 error 4 in netdata[55cd2f3c6000+44c000]
[4443369.504581] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443401.507259] PD[go.d][3709872]: segfault at 0 ip 0000562d0a0916bf sp 00007fc202262ef0 error 4 in netdata[562d09ed1000+44c000]
[4443401.507271] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443433.510024] PD[go.d][3710328]: segfault at 0 ip 00005578503d96bf sp 00007f2f6580def0 error 4 in netdata[557850219000+44c000]
[4443433.510040] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443465.505646] PD[go.d][3710792]: segfault at 0 ip 0000560f7337a6bf sp 00007ffb092b4ef0 error 4 in netdata[560f731ba000+44c000]
[4443465.505661] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443507.502544] PD[go.d][3711408]: segfault at 0 ip 0000555c412b26bf sp 00007f7ecd727ef0 error 4 in netdata[555c410f2000+44c000]
[4443507.502557] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443539.509368] PD[go.d][3711888]: segfault at 0 ip 000055690cbf76bf sp 00007f56ac35aef0 error 4 in netdata[55690ca37000+44c000]
[4443539.509379] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443571.508165] PD[go.d][3712353]: segfault at 0 ip 00005584fdcb86bf sp 00007fdf6a096ef0 error 4 in netdata[5584fdaf8000+44c000]
[4443571.508178] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443603.521864] PD[go.d][3712819]: segfault at 0 ip 000055a4de4bb6bf sp 00007efbf29f9ef0 error 4 in netdata[55a4de2fb000+44c000]
[4443603.521875] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
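For anyone triaging these lines, here is a minimal Python sketch that parses the kernel segfault messages above and subtracts the mapping base from the instruction pointer. It helps confirm that every crash hits the same spot in the binary despite ASLR. The names `Segfault`, `parse_segfault` and `describe_error` are made up for this illustration and are not part of Netdata. The error-code decoding assumes the standard x86 page-fault bits (bit 0: protection violation, bit 1: write, bit 2: user mode).

```python
import re
from typing import NamedTuple, Optional

# Kernel x86 segfault line, with an optional "[seconds.micros]" dmesg prefix, e.g.
# "PD[go.d][3707964]: segfault at 0 ip ... sp ... error 4 in netdata[base+size]"
SEGFAULT_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s+)?"
    r"(?P<comm>.+?)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<image>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


class Segfault(NamedTuple):
    comm: str        # thread name as logged by the kernel
    pid: int
    fault_addr: int  # address whose access faulted
    offset: int      # instruction pointer minus the reported mapping base
    error: int       # raw page-fault error code


def parse_segfault(line: str) -> Optional[Segfault]:
    """Return the parsed fields, or None if the line is not a segfault message."""
    m = SEGFAULT_RE.match(line.strip())
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return Segfault(m["comm"], int(m["pid"]), int(m["addr"], 16),
                    ip - base, int(m["err"], 16))


def describe_error(code: int) -> str:
    """Decode the three low bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


if __name__ == "__main__":
    # Two of the lines from the dmesg output above.
    sample = [
        "[4443253.505196] PD[go.d][3707964]: segfault at 0 ip 000055d4bcbb56bf "
        "sp 00007f9a2c4f8ef0 error 4 in netdata[55d4bc9f5000+44c000]",
        "[4443285.508686] PD[go.d][3708424]: segfault at 0 ip 000055c7d22756bf "
        "sp 00007f73c7c9bef0 error 4 in netdata[55c7d20b5000+44c000]",
    ]
    for line in sample:
        sf = parse_segfault(line)
        if sf is not None:
            print(f"{sf.comm} pid={sf.pid} offset={sf.offset:#x} "
                  f"fault_addr={sf.fault_addr:#x} ({describe_error(sf.error)})")
```

Both sample lines give the same offset, 0x1c06bf, with a user-mode read of address 0. That is consistent with a NULL pointer dereference at one fixed location in the `netdata` binary. The offset is relative to the mapping the kernel reports in brackets, which is not necessarily the address a symbolizer such as `addr2line` expects.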
/var/log/netdata/error.log
2023-09-16 03:16:20: netdata INFO : ANALYTICS : thread created with task id 3723096
2023-09-16 03:16:20: netdata INFO : ANALYTICS : set name of thread 3723096 to ANALYTICS
2023-09-16 03:16:20: netdata INFO : P[proc] : Using now_boottime_usec() for uptime (dt is 8 ms)
2023-09-16 03:16:29: netdata INFO : PD[charts.d] : thread with task id 3722867 finished
2023-09-16 03:17:01: netdata INFO : MAIN : IEEE754: system is using IEEE754 DOUBLE PRECISION values
2023-09-16 03:17:01: netdata INFO : MAIN : TIMEZONE: using the contents of /etc/timezone
2023-09-16 03:17:01: netdata INFO : MAIN : TIMEZONE: fixed as 'Etc/UTC'
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: next: initialize ML
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 3 ms, initialize ML - next: initialize signals
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize signals - next: initialize static threads
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize static threads - next: initialize web server
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize web server - next: initialize httpd server
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, initialize httpd server - next: set resource limits
2023-09-16 03:17:01: netdata INFO : MAIN : resources control: allowed file descriptors: soft = 1024, max = 524288
2023-09-16 03:17:01: netdata INFO : MAIN : NETDATA STARTUP: in 0 ms, set resource limits - next: become daemon
2023-09-16 03:17:01: netdata INFO : MAIN : Out-Of-Memory (OOM) score is already set to the wanted value 0
2023-09-16 03:17:01: netdata INFO : MAIN : Adjusted netdata scheduling policy to batch (3), with priority 0.
2023-09-16 03:17:01: netdata INFO : MAIN : Running with process scheduling policy 'batch', nice level 19
2023-09-16 03:17:01: netdata INFO : MAIN : netdata started on pid 3723132.
...
Relevant docs you followed/actions you took to solve the issue
I have already tried rebooting, installing updates, upgrading Netdata to the latest version, and even stopping and starting the instance.
Environment/Browser/Agent’s version etc
netdata v1.42.3
netdata -W buildinfo
Packaging:
Netdata Version ____________________________________________ : v1.42.3
Installation Type __________________________________________ : binpkg-deb
Package Architecture _______________________________________ : x86_64
Package Distro _____________________________________________ :
Configure Options __________________________________________ : '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Default Directories:
User Configurations ________________________________________ : /etc/netdata
Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
Permanent Databases ________________________________________ : /var/lib/netdata
Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
Static Web Files ___________________________________________ : /var/lib/netdata/www
Log Files __________________________________________________ : /var/log/netdata
Lock Files _________________________________________________ : /var/lib/netdata/lock
Home _______________________________________________________ : /var/lib/netdata
Operating System:
Kernel _____________________________________________________ : Linux
Kernel Version _____________________________________________ : 5.15.0-1044-aws
Operating System ___________________________________________ : Ubuntu
Operating System ID ________________________________________ : ubuntu
Operating System ID Like ___________________________________ : debian
Operating System Version ___________________________________ : 20.04.6 LTS (Focal Fossa)
Operating System Version ID ________________________________ : none
Detection __________________________________________________ : /etc/os-release
Hardware:
CPU Cores __________________________________________________ : 16
CPU Frequency ______________________________________________ : 2999000000
CPU Architecture ___________________________________________ : 32810553344
RAM Bytes __________________________________________________ : 53687091200
Disk Capacity ______________________________________________ : x86_64
Virtualization Technology __________________________________ : kvm
Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
Container __________________________________________________ : none
Container Detection ________________________________________ : systemd-detect-virt
Container Orchestrator _____________________________________ : none
Container Operating System _________________________________ : none
Container Operating System ID ______________________________ : none
Container Operating System ID Like _________________________ : none
Container Operating System Version _________________________ : none
Container Operating System Version ID ______________________ : none
Container Operating System Detection _______________________ : none
Features:
Built For __________________________________________________ : Linux
Netdata Cloud ______________________________________________ : YES
Health (trigger alerts and send notifications) _____________ : YES
Streaming (stream metrics to parent Netdata servers) _______ : YES
Replication (fill the gaps of parent Netdata servers) ______ : YES
Streaming and Replication Compression ______________________ : YES (lz4)
Contexts (index all active and archived metrics) ___________ : YES
Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
Machine Learning ___________________________________________ : YES
Database Engines:
dbengine ___________________________________________________ : YES
alloc ______________________________________________________ : YES
ram ________________________________________________________ : YES
map ________________________________________________________ : YES
save _______________________________________________________ : YES
none _______________________________________________________ : YES
Connectivity Capabilities:
ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
static (Netdata internal web server) _______________________ : YES
h2o (web server) ___________________________________________ : YES
WebRTC (experimental) ______________________________________ : NO
Native HTTPS (TLS Support) _________________________________ : YES
TLS Host Verification ______________________________________ : YES
Libraries:
LZ4 (extremely fast lossless compression algorithm) ________ : YES
zlib (lossless data-compression library) ___________________ : YES
Judy (high-performance dynamic arrays and hashtables) ______ : YES (bundled)
dlib (robust machine learning toolkit) _____________________ : YES (bundled)
protobuf (platform-neutral data serialization protocol) ____ : YES (system)
OpenSSL (cryptography) _____________________________________ : YES
libdatachannel (stand-alone WebRTC data channels) __________ : NO
JSON-C (lightweight JSON manipulation) _____________________ : YES
libcap (Linux capabilities system operations) ______________ : NO
libcrypto (cryptographic functions) ________________________ : YES
libm (mathematical functions) ______________________________ : YES
jemalloc ___________________________________________________ : NO
TCMalloc ___________________________________________________ : NO
Plugins:
apps (monitor processes) ___________________________________ : YES
cgroups (monitor containers and VMs) _______________________ : YES
cgroup-network (associate interfaces to CGROUPS) ___________ : YES
proc (monitor Linux systems) _______________________________ : YES
tc (monitor Linux network QoS) _____________________________ : YES
diskspace (monitor Linux mount points) _____________________ : YES
freebsd (monitor FreeBSD systems) __________________________ : NO
macos (monitor MacOS systems) ______________________________ : NO
statsd (collect custom application metrics) ________________ : YES
timex (check system clock synchronization) _________________ : YES
idlejitter (check system latency and jitter) _______________ : YES
bash (support shell data collection jobs - charts.d) _______ : YES
debugfs (kernel debugging metrics) _________________________ : YES
cups (monitor printers and print jobs) _____________________ : YES
ebpf (monitor system calls) ________________________________ : YES
freeipmi (monitor enterprise server H/W) ___________________ : YES
nfacct (gather netfilter accounting) _______________________ : YES
perf (collect kernel performance events) ___________________ : YES
slabinfo (monitor kernel object caching) ___________________ : YES
Xen ________________________________________________________ : NO
Xen VBD Error Tracking _____________________________________ : NO
Exporters:
AWS Kinesis ________________________________________________ : NO
GCP PubSub _________________________________________________ : NO
MongoDB ____________________________________________________ : NO
Prometheus (OpenMetrics) Exporter __________________________ : YES
Prometheus Remote Write ____________________________________ : YES
Graphite ___________________________________________________ : YES
Graphite HTTP / HTTPS ______________________________________ : YES
JSON _______________________________________________________ : YES
JSON HTTP / HTTPS __________________________________________ : YES
OpenTSDB ___________________________________________________ : YES
OpenTSDB HTTP / HTTPS ______________________________________ : YES
All Metrics API ____________________________________________ : YES
Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
Trace All Netdata Allocations (with charts) ________________ : NO
Developer Mode (more runtime checks, slower) _______________ : NO
netdatacli aclk-state
ACLK Available: Yes
ACLK Version: 2
Protocols Supported: Protobuf
Protocol Used: Protobuf
MQTT Version: 5
Claimed: No
Online: No
Reconnect count: 0
Banned By Cloud: No
What I expected to happen
Netdata should not segfault.