Segfaults all of a sudden

Problem/Question

All of a sudden, the Netdata agent is crashing on a bunch of my Ubuntu 20.04 instances on AWS. I have not run any updates other than the security updates that are enabled by default.

I can’t find any commonalities between the instances that are exhibiting this problem. The Netdata error log does not show anything interesting. The kernel messages are all the same:

dmesg

[4443253.505196] PD[go.d][3707964]: segfault at 0 ip 000055d4bcbb56bf sp 00007f9a2c4f8ef0 error 4 in netdata[55d4bc9f5000+44c000]
[4443253.505210] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443285.508686] PD[go.d][3708424]: segfault at 0 ip 000055c7d22756bf sp 00007f73c7c9bef0 error 4 in netdata[55c7d20b5000+44c000]
[4443285.508700] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443327.506960] PD[go.d][3708902]: segfault at 0 ip 000055726ea096bf sp 00007fc6c5c46ef0 error 4 in netdata[55726e849000+44c000]
[4443327.506972] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443369.504564] PD[go.d][3709379]: segfault at 0 ip 000055cd2f5866bf sp 00007f642a287ef0 error 4 in netdata[55cd2f3c6000+44c000]
[4443369.504581] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443401.507259] PD[go.d][3709872]: segfault at 0 ip 0000562d0a0916bf sp 00007fc202262ef0 error 4 in netdata[562d09ed1000+44c000]
[4443401.507271] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443433.510024] PD[go.d][3710328]: segfault at 0 ip 00005578503d96bf sp 00007f2f6580def0 error 4 in netdata[557850219000+44c000]
[4443433.510040] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443465.505646] PD[go.d][3710792]: segfault at 0 ip 0000560f7337a6bf sp 00007ffb092b4ef0 error 4 in netdata[560f731ba000+44c000]
[4443465.505661] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443507.502544] PD[go.d][3711408]: segfault at 0 ip 0000555c412b26bf sp 00007f7ecd727ef0 error 4 in netdata[555c410f2000+44c000]
[4443507.502557] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443539.509368] PD[go.d][3711888]: segfault at 0 ip 000055690cbf76bf sp 00007f56ac35aef0 error 4 in netdata[55690ca37000+44c000]
[4443539.509379] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443571.508165] PD[go.d][3712353]: segfault at 0 ip 00005584fdcb86bf sp 00007fdf6a096ef0 error 4 in netdata[5584fdaf8000+44c000]
[4443571.508178] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
[4443603.521864] PD[go.d][3712819]: segfault at 0 ip 000055a4de4bb6bf sp 00007efbf29f9ef0 error 4 in netdata[55a4de2fb000+44c000]
[4443603.521875] Code: 87 00 00 00 48 8b 6f 08 48 83 fe 02 74 7d 48 8b 57 10 48 83 fe 03 0f 84 3c 01 00 00 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 <0f> be 03 84 c0 74 5a 49 8b 7c 24 38 48 85 ff 0f 84 9c 00 00 00 3c
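
For anyone reading along, the kernel lines above can be decoded a little further. What follows is a minimal sketch, not an official Netdata tool: the function name parse_segfault is made up for illustration. It relies on the kernel printing the fault address, instruction pointer, and error code in hex, with the error code using the x86 page-fault bits (bit 0 = page was present, bit 1 = write access, bit 2 = user mode). The computed offset is relative to the start of the mapping shown in brackets, which is not necessarily the start of the file.

```python
import re

# Matches kernel lines such as:
# "... segfault at 0 ip 000055d4bcbb56bf sp ... error 4 in netdata[55d4bc9f5000+44c000]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return fault details for one dmesg segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    return {
        "object": m.group("obj"),
        "fault_addr": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped region.
        "offset": int(m.group("ip"), 16) - int(m.group("base"), 16),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    sample = (
        "[4443253.505196] PD[go.d][3707964]: segfault at 0 ip 000055d4bcbb56bf "
        "sp 00007f9a2c4f8ef0 error 4 in netdata[55d4bc9f5000+44c000]"
    )
    info = parse_segfault(sample)
    print(hex(info["offset"]), info)
```

For the first line this gives offset 0x1c06bf with a user-mode read of address 0, which is a NULL pointer read. The second and third lines give the same offset despite the different load addresses, so the crash is in the same place each time. With matching debug symbols, an offset like this can be handed to a tool such as addr2line, although the mapping's file offset may need to be added first.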

/var/log/netdata/error.log

2023-09-16 03:16:20: netdata INFO  : ANALYTICS : thread created with task id 3723096
2023-09-16 03:16:20: netdata INFO  : ANALYTICS : set name of thread 3723096 to ANALYTICS
2023-09-16 03:16:20: netdata INFO  : P[proc] : Using now_boottime_usec() for uptime (dt is 8 ms)
2023-09-16 03:16:29: netdata INFO  : PD[charts.d] : thread with task id 3722867 finished
2023-09-16 03:17:01: netdata INFO  : MAIN : IEEE754: system is using IEEE754 DOUBLE PRECISION values
2023-09-16 03:17:01: netdata INFO  : MAIN : TIMEZONE: using the contents of /etc/timezone
2023-09-16 03:17:01: netdata INFO  : MAIN : TIMEZONE: fixed as 'Etc/UTC'
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: next: initialize ML
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       3 ms, initialize ML - next: initialize signals
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize signals - next: initialize static threads
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize static threads - next: initialize web server
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize web server - next: initialize httpd server
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, initialize httpd server - next: set resource limits
2023-09-16 03:17:01: netdata INFO  : MAIN : resources control: allowed file descriptors: soft = 1024, max = 524288
2023-09-16 03:17:01: netdata INFO  : MAIN : NETDATA STARTUP: in       0 ms, set resource limits - next: become daemon
2023-09-16 03:17:01: netdata INFO  : MAIN : Out-Of-Memory (OOM) score is already set to the wanted value 0
2023-09-16 03:17:01: netdata INFO  : MAIN : Adjusted netdata scheduling policy to batch (3), with priority 0.
2023-09-16 03:17:01: netdata INFO  : MAIN : Running with process scheduling policy 'batch', nice level 19
2023-09-16 03:17:01: netdata INFO  : MAIN : netdata started on pid 3723132.
...

Relevant docs you followed/actions you took to solve the issue

I already tried rebooting, installing updates, upgrading Netdata to the latest version, and even stopping and starting the instance.

Environment/Browser/Agent’s version etc

netdata v1.42.3
netdata -W buildinfo
Packaging:
    Netdata Version ____________________________________________ : v1.42.3
    Installation Type __________________________________________ : binpkg-deb
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :  
    Configure Options __________________________________________ :  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -fdebug-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /var/lib/netdata/www
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.15.0-1044-aws
    Operating System ___________________________________________ : Ubuntu
    Operating System ID ________________________________________ : ubuntu
    Operating System ID Like ___________________________________ : debian
    Operating System Version ___________________________________ : 20.04.6 LTS (Focal Fossa)
    Operating System Version ID ________________________________ : none
    Detection __________________________________________________ : /etc/os-release
Hardware:
    CPU Cores __________________________________________________ : 16
    CPU Frequency ______________________________________________ : 2999000000
    CPU Architecture ___________________________________________ : 32810553344
    RAM Bytes __________________________________________________ : 53687091200
    Disk Capacity ______________________________________________ : x86_64
    Virtualization Technology __________________________________ : kvm
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : none
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : none
    Container Operating System ID ______________________________ : none
    Container Operating System ID Like _________________________ : none
    Container Operating System Version _________________________ : none
    Container Operating System Version ID ______________________ : none
    Container Operating System Detection _______________________ : none
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (lz4)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine ___________________________________________________ : YES
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    map ________________________________________________________ : YES
    save _______________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Judy (high-performance dynamic arrays and hashtables) ______ : YES (bundled)
    dlib (robust machine learning toolkit) _____________________ : YES (bundled)
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libm (mathematical functions) ______________________________ : YES
    jemalloc ___________________________________________________ : NO
    TCMalloc ___________________________________________________ : NO
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : YES
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : NO
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO
netdatacli aclk-state
ACLK Available: Yes
ACLK Version: 2
Protocols Supported: Protobuf
Protocol Used: Protobuf
MQTT Version: 5
Claimed: No
Online: No
Reconnect count: 0
Banned By Cloud: No

What I expected to happen

Netdata should not segfault.

Hi @tuaris

Does sudo coredumpctl show any more information? Could you try sudo coredumpctl debug?

Also, did you notice the same issue with previous versions?

I was on 1.40.1 when I noticed the problem, then upgraded to v1.42.3 to see if that would solve it. No luck.
All of the EC2 instances (except for the 3 that I used for troubleshooting) are currently on 1.40.1, and so far I’ve only noticed the problem on about 6.

Where would I normally find the core dumps for Netdata?

Can you try sudo coredumpctl ? Any coredumps from netdata should be listed there.

Also, could you please share your error.log and collector.log from a crashing agent with manolis@netdata.cloud ?

Not finding any coredumps:

sudo coredumpctl 
No coredumps found.
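
One possible explanation for the empty list (an assumption, not something confirmed in this thread) is that the kernel is not handing core dumps to systemd-coredump at all. coredumpctl only lists dumps captured by systemd-coredump, and where a dump goes is decided by the kernel.core_pattern setting. A minimal sketch for checking this follows; the helper name describe_core_pattern is hypothetical.

```python
def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value by where core dumps would end up."""
    pattern = pattern.strip()
    if not pattern.startswith("|"):
        # No leading pipe: the kernel writes the dump to a file path.
        return "file"
    parts = pattern[1:].split()
    handler = parts[0] if parts else ""
    if "systemd-coredump" in handler:
        return "systemd-coredump"  # dumps are visible through coredumpctl
    if "apport" in handler:
        return "apport"  # Ubuntu's crash handler, not visible to coredumpctl
    return "other-pipe"


if __name__ == "__main__":
    # Linux-only: read the live setting from procfs.
    with open("/proc/sys/kernel/core_pattern") as f:
        print("core dumps go to:", describe_core_pattern(f.read()))
```

If the result is anything other than systemd-coredump, an empty coredumpctl list says nothing about whether dumps were produced. The core size limit that applies to the netdata service can also suppress dumps and is worth checking separately.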

I sent over the logs as requested.

Hi @tuaris thanks, got them!

So, it seems from the logs that there is a crash… right after the agent starts streaming.

I also noticed that you’re using memory mode = none. I tried replicating this with the same memory mode, streaming to a parent, but it doesn’t crash.

However, this was in a completely different environment, so I will try Ubuntu 20.04 to see if I can replicate it. Edit: Tried 22.04 → 22.04, and it seems to work OK.

Are there any other non-default settings you’ve configured?

Thanks!

I don’t think anything is too out of the ordinary. I pasted my configs below. One thing the systems having this issue have in common is that they run a Rails app using discourse/prometheus_exporter (a framework for collecting and aggregating Prometheus metrics), but that’s also on systems not having the problem.

Agents

/etc/netdata/netdata.conf:

[global]
	web files owner = root
	web files group = netdata
	bind socket to IP = 0.0.0.0
	memory mode = none
	history = 86400

[web]
    respect do not track policy = yes

[registry]
	enabled = no
	registry to announce = http://xxxxxxxxxxxxxxxxx:19999

[health]
	enabled = no

/etc/netdata/stream.conf

[stream]
	enabled = yes 
	destination =xxxxxxxxxxxxxxx
	api key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

/var/lib/netdata/cloud.d/cloud.conf

[global]
    enabled = no

Parent

/etc/netdata/netdata.conf

[global]
	web files owner = root
	web files group = netdata
	bind socket to IP = 0.0.0.0
	memory mode = dbengine
	page cache size MB = 131072
	dbengine multihost disk space = 81664	
	process scheduling policy = keep
	
[web]
	respect do not track policy = yes

[logs]
	access = none

[registry]
	enabled = yes
	registry to announce = http://xxxxxxxxxxxxxxxxxxxxxx:19999

[ml]
	enabled = no

/etc/netdata/stream.conf

[xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
	enabled = yes
	default memory mode = dbengine

/var/lib/netdata/cloud.d/cloud.conf

[global]
    enabled = no

/etc/netdata/health.d/postfix.conf

alarm: postfix_queue
on: postfix_local.qemails
class: Workload
type: Messaging
component: Postfix
every: 10s
calc: $emails
warn: $this > 4000
crit: $this > 10000
to: sysadmin
repeat: warning 1800s critical 300s
info: number of emails in the postfix queue

Also… just looked at it again today and it appears to have stopped crashing. I didn’t change anything, very odd.

[438393.905689] PD[go.d][2078107]: segfault at 0 ip 000055ed43d49992 sp 00007f831df4ee00 error 4 in netdata[55ed43b6f000+47d000]
[438393.905711] Code: 0f 84 b2 08 00 00 48 8b 55 10 48 83 f9 03 0f 84 47 21 00 00 48 8b 5d 18 48 85 f6 0f 84 97 08 00 00 48 85 d2 0f 84 8e 08 00 00 <0f> be 03 84 c0 0f 84 83 08 00 00 49 8b 7f 60 48 85 ff 0f 84 2c 1b
[438426.901568] PD[go.d][2078850]: segfault at 0 ip 000056400aea6992 sp 00007fb565ea9e00 error 4 in netdata[56400accc000+47d000]
[438426.901578] Code: 0f 84 b2 08 00 00 48 8b 55 10 48 83 f9 03 0f 84 47 21 00 00 48 8b 5d 18 48 85 f6 0f 84 97 08 00 00 48 85 d2 0f 84 8e 08 00 00 <0f> be 03 84 c0 0f 84 83 08 00 00 49 8b 7f 60 48 85 ff 0f 84 2c 1b
[438459.904693] PD[go.d][2079644]: segfault at 0 ip 0000555835211992 sp 00007eff7d4f1e00 error 4 in netdata[555835037000+47d000]
[438459.904704] Code: 0f 84 b2 08 00 00 48 8b 55 10 48 83 f9 03 0f 84 47 21 00 00 48 8b 5d 18 48 85 f6 0f 84 97 08 00 00 48 85 d2 0f 84 8e 08 00 00 <0f> be 03 84 c0 0f 84 83 08 00 00 49 8b 7f 60 48 85 ff 0f 84 2c 1b
[648123.215166] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[648123.315042] raid6: avx512x4 gen() 18835 MB/s
[648123.383041] raid6: avx512x4 xor()  8929 MB/s
[648123.451041] raid6: avx512x2 gen() 18825 MB/s
[648123.519039] raid6: avx512x2 xor() 29665 MB/s
[648123.587040] raid6: avx512x1 gen() 19001 MB/s
[648123.655040] raid6: avx512x1 xor() 26610 MB/s
[648123.723043] raid6: avx2x4   gen() 18655 MB/s
[648123.791038] raid6: avx2x4   xor()  8406 MB/s
[648123.859040] raid6: avx2x2   gen() 18050 MB/s
[648123.927037] raid6: avx2x2   xor() 19712 MB/s
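
The two sets of crash lines in this thread come from mappings of different sizes (0x44c000 earlier, 0x47d000 here), so they are presumably different builds. One way to check whether they fail on the same instruction is to compare the bytes at the <..> marker in the Code: lines, which is where the kernel marks the faulting instruction. A minimal sketch, with the helper name trapping_bytes made up for illustration:

```python
def trapping_bytes(code_line, count=3):
    """Return `count` bytes starting at the <..> marker in a dmesg 'Code:' line."""
    _, _, rest = code_line.partition("Code:")
    tokens = rest.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            # Strip the angle brackets from the marked byte and keep what follows.
            return [tok[1:-1]] + tokens[i + 1 : i + count]
    return []


if __name__ == "__main__":
    older = ("[4443253.505210] Code: 48 8b 5f 18 48 85 ed 74 66 48 85 d2 74 61 "
             "<0f> be 03 84 c0 74 5a")
    newer = ("[438393.905711] Code: 48 8b 5d 18 48 85 f6 0f 84 97 08 00 00 "
             "<0f> be 03 84 c0 0f 84 83")
    print(trapping_bytes(older), trapping_bytes(newer))
```

Both crash signatures trap on 0f be 03, which on x86-64 encodes a sign-extending one-byte load through RBX (movsbl (%rbx),%eax). Together with "segfault at 0", that is consistent with reading the first character of a NULL string pointer. It suggests, but does not prove, that both builds hit the same code path.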

Also interesting: all the “gaps” in the charts are gone too, as if nothing happened. Nice, but weird.

Hmm, yes, that’s weird …

If I understand correctly as well, the issue is with the children, right? Not the parent?

Correct, and only with a small number of them.

Hi @tuaris

Just checking in: do you still see stable operation? Thanks.

It was stable for about 3 weeks, but now it’s back to doing the same thing. It just started a few moments ago, out of nowhere.