data missing from live node

Hello,
Since yesterday, one of my two nodes in Netdata Cloud has been missing data, although the node is shown as live.

The local dashboard of this node shows all data as expected, but the Netdata Cloud connection status says: "This node is currently Not Connected to Netdata Cloud".
Re-adding the node with a different token didn’t fix it.

Thx, ToffiCap

Hi there,

Could you share with us the netdata -W buildinfo output for the node in question?

Please also share the output of netdatacli aclk-state (filter out sensitive information such as the Claimed ID, Node ID, hostname, etc.).

Tasos.
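
Since the request above asks for identifiers to be filtered out before posting, a small filter can do the redaction. This is only a minimal sketch: it assumes the field labels (Claimed Id, Node ID, mGUID, hostname) look the way they usually do in aclk-state output, and the function name redact_aclk_state and the xxx placeholder are made up for illustration.

```shell
# Redact identifying fields from `netdatacli aclk-state` output read on stdin.
# Assumption: labels are "Claimed Id:" / "Claimed ID:", "Node ID:",
# and quoted values after "mGUID:" and "hostname".
redact_aclk_state() {
    sed -E \
        -e 's/^([[:space:]]*Claimed I[dD]:).*/\1 xxx/' \
        -e 's/^([[:space:]]*Node ID:).*/\1 xxx/' \
        -e 's/(mGUID: )"[^"]*"/\1"xxx"/' \
        -e 's/(hostname )"[^"]*"/\1"xxx"/'
}

# Usage (run on the affected node):
#   netdatacli aclk-state | redact_aclk_state
```

It is still worth reading the redacted output once before posting, since other fields may also identify the node.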

Hi,

here is the output of netdata -W buildinfo:

/opt/netdata/bin$ ./netdata -W buildinfo
Version: netdata v1.34.0-167-gb531d4fd8
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' '--with-bundled-libJudy' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
    Binary architecture: x86_64
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

and the output of netdatacli aclk-state:

/opt/netdata/bin$ ./netdatacli aclk-state
ACLK Available: Yes
ACLK Version: 2
Protocols Supported: Legacy, Protobuf
Protocol Used: Protobuf
Claimed: Yes
Claimed Id: xxx
Cloud URL: https://app.netdata.cloud
Online: Yes
Reconnect count: 1
Banned By Cloud: No
Last Connection Time: 2022-05-16 02:31:26
Last Connection Time + 3 PUBACKs received: 2022-05-16 02:31:26
Last Disconnect Time: 2022-05-16 02:29:51
Received Cloud MQTT Messages: 9
MQTT Messages Confirmed by Remote Broker (PUBACKs): 151

> Node Instance for mGUID: "xxx" hostname "mint-server"
	Claimed ID: xxx
	Node ID: xxx
	Streaming Hops: 0
	Relationship: self
	Alert Streaming Status:
		Updates: 1
		Batch ID: 3
		Last Acked Seq ID: 79
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Last Submitted Seq ID: 82
	Chart Streaming Status:
		Updates: 1
		Batch ID: 3
		Min Seq ID: 1
		Max Seq ID: 3622
		Pending Min Seq ID: 0
		Pending Max Seq ID: 0
		Sent Min Seq ID: 1
		Sent Max Seq ID: 3622
		Acked Min Seq ID: 1
		Acked Max Seq ID: 3622

Thx, ToffiCap

OK, this looks fine.

  1. Restart the Agent: systemctl restart netdata
  2. Verify that the Agent’s local dashboard works (shows the metrics) at <node_ip>:19999
  3. If it does, check whether Netdata Cloud can fetch metrics for this node

If the problem persists, or the local dashboard looks abnormal:

  1. Check the Agent’s error log: /var/log/netdata/error.log
  2. Force-restart the Agent (see "Start, stop, or restart the Netdata Agent" on Learn Netdata)
  3. Check whether the problem is resolved

Finally, if this doesn’t fix your issue either, please send me the Claimed ID of the node in question in a DM so I can check the issue with the Cloud team.

Tasos.

Hi,
Thank you very much for your efforts, but none of your suggestions worked.
Not even a full uninstall and reinstall helped; afterwards the commands above give the same results.
It is odd that it worked at first.
I would like to send you a DM with the Claimed ID of the node.

Thx, ToffiCap

I have the same issue.

OK, let’s clarify one thing: does the Agent’s local dashboard display the metrics correctly?

This issue seems related to netdata/netdata issue #12915 on GitHub, "[Bug]: Node does not show data in Netdata Cloud".

Yes, the local dashboard works perfectly.

Hm, I’m just looking at my cloud dashboard and the affected node is showing me all the data again.

In error.log I found the following entries:

2022-05-17 02:28:34: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-05-17 02:28:34: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-05-17 02:28:34: netdata INFO  : ACLK_Main : Attempting connection now
2022-05-17 02:28:35: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-05-17 02:28:35: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-05-17 02:28:35: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-05-17 02:28:36: netdata INFO  : ACLK_Main : HTTPS "GET" request to "api.netdata.cloud" finished with HTTP code: 409
2022-05-17 02:28:36: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409) (errno 22, Invalid argument)
2022-05-17 02:28:36: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrAlreadyConnected", Msg:"node already connected", BlockRetry:false, Backoff:0s (-1 unset by cloud) (errno 1, Operation not permitted)
2022-05-17 02:28:36: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP
2022-05-17 02:28:36: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 1.303 seconds

The block above repeated five times, then:

2022-05-17 02:29:57: netdata INFO  : ACLK_Main : Attempting connection now
2022-05-17 02:29:58: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-05-17 02:29:58: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-05-17 02:29:58: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-05-17 02:29:58: netdata INFO  : ACLK_Main : HTTPS "GET" request to "api.netdata.cloud" finished with HTTP code: 200
2022-05-17 02:29:58: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-05-17 02:29:59: netdata INFO  : ACLK_Main : HTTPS "POST" request to "api.netdata.cloud" finished with HTTP code: 201
2022-05-17 02:29:59: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-05-17 02:29:59: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-05-17 02:29:59: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-05-17 02:29:59: netdata INFO  : ACLK_Main : Queuing status update for node=(number deleted by me), live=1, hops=0

Now everything works fine on the Cloud dashboard.

Many thanks for the excellent support and this great program, ToffiCap
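
The reconnect pattern in the log excerpts above (several 409 / ErrAlreadyConnected rejections followed by a successful 200 / 201 exchange) can be pulled out of a long error.log with a couple of filters. This is a rough sketch that assumes the log line format shown in this thread; the function names are hypothetical.

```shell
# Print "date time host http_code" for every ACLK HTTPS request found in a
# Netdata error.log read on stdin. Assumes the line layout quoted above.
aclk_http_codes() {
    sed -n -E 's/^([0-9-]+ [0-9:]+): netdata .*ACLK_Main : HTTPS "[A-Z]+" request to "([^"]+)" finished with HTTP code: ([0-9]+).*/\1 \2 \3/p'
}

# Count the lines in which the Cloud reported the node as already connected.
aclk_already_connected_count() {
    grep -c 'ErrAlreadyConnected' || true
}

# Usage (log path as given earlier in the thread; other install types may
# keep the log elsewhere):
#   aclk_http_codes < /var/log/netdata/error.log
#   aclk_already_connected_count < /var/log/netdata/error.log
```

A run of 409 codes that ends in 200 and 201, as happened here, would suggest the Cloud was still holding an earlier session for the node and accepted the Agent once that session was gone.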

Hey Tasos,

For some days now my dashboard has been showing nothing. I did everything requested in this thread, but nothing works. Is there a problem on the Cloud side?

Please see the image attached.

Thanks, Leandro

Hi there,

I can see two issues here:

  1. The obvious one: loading
  2. The first host doesn’t even have ttl-cpu-util, Avail-mmry-, etc.

Could you check the following:

These mini charts are smaller mirrors of charts that the Agent produces and Netdata Cloud presents. First, we want to make sure that you can see the detailed chart that each of these mini charts mirrors in the node’s dashboard.

How do you do that? First, identify the exact name of the mini chart.

Then navigate to the node’s dashboard and check (a simple Ctrl+F will do) whether this chart is produced.

Hi @Tasos_Katsoulas,

As you can see in the attached image, all metrics are flagged as working on the other nodes. In the second image you can see, for example, the CPUs inside the node, but I still cannot see them on the dashboard.

Any idea?


@Leandro_Aguiar I updated the comment above with a clearer explanation. Could you follow it again?

For example, here you show me the cpu.cpu_softirqs chart, which I think is not in the list of mini-chart metrics (unless you changed the names).

Thanks in advance,
Tasos.

Hey @Tasos_Katsoulas ,

You are right: on the nodes, the charts are NOT produced.

How can I add the missing charts to these nodes?

@Leandro_Aguiar excellent!

Let’s explore two more possibilities with one more check:

  1. The Agent is not producing the chart (try simply restarting the Agent and follow its logs in /var/log/netdata/error.log)
  2. The Agent produces the chart but Netdata Cloud is not querying it.

Try:

curl -X 'GET' \
  '<NODE_IP>:19999/api/v1/chart?chart=<CHART_ID>' \
  -H 'accept: application/json'

Here NODE_IP is the IP of the node in question and CHART_ID is one of the missing charts (for example, system.cpu).

In the first case we need to search the logs for common errors; in the second, it’s an upstream issue in Netdata Cloud.

Tasos.
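
To run the curl check above against several charts at once, something like the following could be used. It is a sketch under assumptions: the http:// scheme, the function names, and the reading of a non-200 code as "chart not produced" are illustrative and not taken from Netdata documentation.

```shell
# Build the Agent API URL for one chart. Port 19999 and the /api/v1/chart
# endpoint come from the curl command above; the http:// scheme is assumed.
chart_url() {
    printf 'http://%s:19999/api/v1/chart?chart=%s\n' "$1" "$2"
}

# Print "<chart_id> <http_code>" for each chart ID given after the node address.
# 200 means the Agent answered for that chart; any other code suggests the
# chart is not produced (curl prints 000 if the Agent cannot be reached).
check_charts() {
    node=$1
    shift
    for chart in "$@"; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "$(chart_url "$node" "$chart")") || true
        printf '%s %s\n' "$chart" "$code"
    done
}

# Usage:
#   check_charts <NODE_IP> system.cpu system.ram
```

If every chart answers 200 locally and the mini charts still stay empty in Netdata Cloud, that would point to the second case (the Cloud side) rather than the Agent.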

Hi @Tasos_Katsoulas, I may have found the problem but not the solution. In the logs I found these:

2022-06-16 03:37:27: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/cpu,cpuacct' (errno 13, Permission denied)
2022-06-16 03:37:27: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/blkio' (errno 13, Permission denied)
2022-06-16 03:37:27: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/memory' (errno 13, Permission denied)
2022-06-16 03:37:27: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/devices' (errno 13, Permission denied)
2022-06-16 03:37:38: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/cpu,cpuacct' (errno 13, Permission denied)
2022-06-16 03:37:38: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/blkio' (errno 13, Permission denied)
2022-06-16 03:37:38: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/memory' (errno 13, Permission denied)
2022-06-16 03:37:38: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/devices' (errno 13, Permission denied)
2022-06-16 03:37:48: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/cpu,cpuacct' (errno 13, Permission denied)
2022-06-16 03:37:48: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/blkio' (errno 13, Permission denied)
2022-06-16 03:37:48: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/memory' (errno 13, Permission denied)
2022-06-16 03:37:48: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/devices' (errno 13, Permission denied)
2022-06-16 03:37:59: netdata ERROR : MAIN : CGROUP: cannot read directory '/sys/fs/cgroup/cpu,cpuacct' (errno 13, Permission denied)

The log is much longer than this, but I think this permission-denied error is the problem.

How to solve it?
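
Because the same errors repeat every few seconds, it can help to reduce the log to the distinct directories involved before digging further. A minimal sketch, assuming the log format quoted above; the function name is hypothetical.

```shell
# List each cgroup directory that appears in a "Permission denied" line of an
# error.log read on stdin, once each, in byte order.
cgroup_denied_dirs() {
    grep 'Permission denied' \
        | grep -o '/sys/fs/cgroup/[A-Za-z0-9_,./-]*' \
        | LC_ALL=C sort -u
}

# Usage:
#   cgroup_denied_dirs < /var/log/netdata/error.log
```

Comparing the owner and mode of each listed directory (ls -ld <directory>) with the groups reported by id netdata may then show why the Agent is refused; this only narrows the search and is not a fix in itself.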

Hi @Tasos_Katsoulas, on the other node where I have the same problem, the logs are quite different:

2022-06-16 04:12:28: netdata INFO : MAIN : Creating new data and journal files in path /var/cache/netdata/dbengine
2022-06-16 04:12:28: netdata INFO : MAIN : Created data file "/var/cache/netdata/dbengine/datafile-1-0000000113.ndf".
2022-06-16 04:12:28: netdata INFO : MAIN : Created journal file "/var/cache/netdata/dbengine/journalfile-1-0000000113.njf".
2022-06-16 04:12:30: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_space._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.
2022-06-16 04:12:30: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_inodes._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.
2022-06-16 04:12:35: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_space._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.
2022-06-16 04:12:35: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_inodes._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.
2022-06-16 04:12:40: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_space._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.
2022-06-16 04:12:40: netdata ERROR : PLUGIN[diskspace slow] : Chart 'disk_inodes._home_virtfs_mymeteosales_opt' has the OBSOLETE flag set, but it is collected.

How can I solve this case as well?

Thanks.
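
The OBSOLETE messages above also repeat for the same few charts, so listing the distinct chart IDs makes it easier to report which mount points are affected. Again only a sketch, assuming the message wording quoted above and that chart IDs consist of letters, digits, dots, underscores and hyphens; the function name is hypothetical.

```shell
# Print each chart ID reported as "has the OBSOLETE flag set, but it is
# collected" in an error.log read on stdin, once each. Whatever quote
# characters surround the ID are dropped.
obsolete_but_collected() {
    grep 'has the OBSOLETE flag set, but it is collected' \
        | sed -E 's/.*Chart [^A-Za-z0-9_.-]*([A-Za-z0-9_.-]+)[^A-Za-z0-9_.-]* has the OBSOLETE.*/\1/' \
        | LC_ALL=C sort -u
}

# Usage:
#   obsolete_but_collected < /var/log/netdata/error.log
```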

For the first issue, can you show me the output of the id netdata command on this server?

For the second one, @ilyam8, any ideas about the "has the OBSOLETE flag set" errors?