Cloud fails when disconnecting from OpenVPN

It seems that whenever I disconnect from an OpenVPN connection (via systemd unit), the Netdata agent stops showing up as alive in the cloud dashboard. The local dashboard works fine.
The local dashboard also tells me that the agent is not currently connected to the cloud.
Interestingly, reconnecting the VPN does not resolve the issue but a restart of the Netdata service does.

The error.log does not show any activity when the issue occurs either; perhaps something important has fallen over?

When the issue occurred, I used netcat to check that port 443 on "api.netdata.cloud" and "mqtt.netdata.cloud" was reachable, and both came back as live.

(The issue reproduces every time, BTW.)

EDIT for update: I ran metric correlation against the time frame of an issue and saw something interesting. Right before the service restart I see IPv6 bandwidth being used and not IPv4, but right at and after the issue I see it has switched to IPv4 only.
It could be a red herring, but perhaps the agent is not handling a switch between IP protocol versions at run time?
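For anyone who wants to repeat the port check per IP family, here is a rough Python sketch. It is not what I ran (I used netcat); the `probe` and `report` helpers are made-up names for illustration, and it only tests whether a fresh TCP connection can be opened, not what the agent's existing connection is doing.

```python
import socket


def probe(host, port=443, family=socket.AF_UNSPEC, timeout=5.0):
    """Try a TCP connect to every address of `host` in the given family.

    Returns a list of (address, ok, error_text) tuples. If name resolution
    fails, the list holds a single (None, False, reason) entry.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [(None, False, f"resolve failed: {exc}")]

    results = []
    for fam, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
            results.append((sockaddr[0], True, ""))
        except OSError as exc:  # refused, unreachable, timed out, ...
            results.append((sockaddr[0], False, str(exc)))
    return results


def report(hosts, port=443):
    """Build one human-readable line per host, family and address."""
    lines = []
    for host in hosts:
        for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
            for addr, ok, err in probe(host, port, family):
                status = "open" if ok else f"FAILED ({err})"
                lines.append(f"{host} {label} {addr}: {status}")
    return lines
```

Printing `report(["api.netdata.cloud", "mqtt.netdata.cloud"])` before and after the VPN switch should show whether one address family stops being reachable while the other keeps working.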

EDIT 2: I have repro’d the same behaviour when switching from non-VPN to a VPN connection

Environment:
Ubuntu 22.04.1 LTS (arm64)
Edge Version 108.0.1462.46 (Official build) (64-bit)
Netdata agent:
Version: netdata v1.37.0-40-nightly
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-protobuf' 'CFLAGS=-O2 -pipe' 'LDFLAGS='
Install type: kickstart-build
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK: YES
TLS Host Verification: YES
Machine Learning: YES
Stream Compression: YES
Libraries:
protobuf: YES (bundled)
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: NO
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO
Debug/Developer Features:
Trace Allocations: NO

What do you guys need to troubleshoot this?

I’m using the standard systemd unit file for OpenVPN on Ubuntu (can provide that if needed).

What does error.log say around the time you switch the connections?

Hi @underhood, as above, the log just stops and there are no new entries during that period. I'll test today and see how long that persists.

@underhood
OK, I’ve tested this a couple of times now.
After switching networks I see nothing in the logs, and the cloud dashboard still shows the node as online, even though drilling down into the node shows that no charts can be loaded.
After about 1.5-2 minutes the cloud dashboard shows the node as not currently connected.
The logs don't print anything at first, but after around 3 minutes they sometimes print the following (note, though, that this was printed in only 2 of the 5 tests):

2022-12-27 16:46:35: netdata INFO : MAIN : METADATA: Checking dimensions starting after row 0
2022-12-27 16:46:35: netdata INFO : MAIN : METADATA: Checked 3156, deleted 0 – will resume after row 0 in 3600 seconds

When I restarted the Netdata agent service I saw the following errors, which may or may not be relevant.

2022-12-27 16:51:15: netdata ERROR : ACLK_Main : [mqtt_wss] W: poll interrupted by EINTR
2022-12-27 16:51:16: netdata ERROR : ACLK_Main : Wasn't able to gracefully shutdown ACLK in time!
2022-12-27 16:51:17: netdata ERROR : MAIN : Failed to flush dirty buffers quickly enough in dbengine instance "/var/cache/netdata/dbengine-tier2". Metric data at risk of not being stored in the database, please reduce disk load or use a faster disk.
2022-12-27 16:51:24: netdata ERROR : MAIN : RRDLABEL: Cannot reload the configuration file '/etc/netdata/netdata.conf', using labels in memory (errno 13, Permission denied)
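In case it helps pin down how long the log goes quiet, here is a small Python sketch that looks for silent stretches between entries. It assumes every entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp like the lines above; `find_gaps` is a hypothetical helper name and the 60-second threshold is an arbitrary choice.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(seconds=60)):
    """Return (previous_ts, next_ts, gap) for consecutive log entries
    whose timestamps are at least `min_gap` apart."""
    gaps = []
    prev = None
    for line in lines:
        try:
            # The first 19 characters hold the timestamp, e.g. "2022-12-27 16:46:35".
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and ts - prev >= min_gap:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

Feeding it the lines of error.log (for example via `open(path)`, with the path depending on the install) lists each quiet period together with its length.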

Let me know what you need :slight_smile: