Remote netdata storage/query question

We absolutely love netdata and would implement it yesterday, but the one thing concerning us is the fact that all of the data is stored on each host node running netdata and then when using netdata cloud, it’s querying each individual node.

Is there a way to have the agent pipe all data to a remote db, so that the netdata cloud/UI pulls from there instead of querying each node?

When managing hundreds of nodes, this could get quite expensive over time if multiple people are in the netdata cloud UI.

I came across this: How metrics streaming works | Learn Netdata

but it seems like netdata cloud still sends queries to each individual node.

Thank you!

Hey, @ben_g. Streaming is exactly what you need. Child Netdata agents will stream to the Parent instance and Cloud will query only the Parent. You don’t even need to claim (connect to Cloud) the children nodes. Cloud will do it automatically.

But keep in mind that you will need to tune at least the default dbengine settings of the Parent if you have hundreds of children nodes. Also, children nodes should be configured to be more lightweight, because the parent will do the heavy work: store metrics, run health queries, do ML.
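
As a rough sketch of what the streaming setup looks like, assuming the usual stream.conf layout: the child points at the parent and presents an API key, and the parent has a section named after that key. The hostname and the UUID below are placeholders, not values from this thread (generate your own key, for example with uuidgen), and the config directory may differ on your install.

```ini
# Child: /etc/netdata/stream.conf
[stream]
    enabled = yes
    # hypothetical parent address; 19999 is Netdata's default port
    destination = parent.example.com:19999
    # placeholder key; it must match a section name on the parent
    api key = 11111111-2222-3333-4444-555555555555

# Parent: /etc/netdata/stream.conf
# The section name is the API key the children present
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```

Restart the agent on both sides after editing; the children should then show up under the parent.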

but it seems like netdata cloud still sends queries to each individual node.

Correct if:

  • children nodes were claimed (connected to Cloud) explicitly (not through parent/streaming)
  • parent is not available (e.g. offline).

Thank you so much for the quick reply, that’s great news!

So basically, when installing the netdata agent, we scaffold up all of our go.d and health.d settings, but then also configure the instance as a child and it will then send metrics to the parent?

PS. Is there a job board on the community where we can post to hire a netdata expert or a way to pay netdata for help implementing this?

Thank you again

Also, I see there are a few options for Headless vs Replication.

Is there another config value we need to set in order to use headless streaming of metrics to the parent?

Thanks again, really appreciate it.

Sorry, one last question. If we need to add more parent nodes as we continue to grow, do we need all parents to talk to each other, or do we just set them up independently and it will all still go to the same netdata cloud?

I assume parent <> parent is for high availability, so for example:

25 nodes → parent1
25 nodes → parent2

Both parents just report data into netdata cloud and there’s no need to connect both parents?

@ben_g TL;DR is

Parent

It should be a VM/server with enough resources: CPU, memory, fast disks, enough disk space. At a minimum, you need to adjust the following dbengine settings in netdata.conf:

[db]
	dbengine page cache size MB = XXX (default is 32)
	dbengine extent cache size MB = XXX (default is 0)
	dbengine multihost disk space MB = XXX (default is 256)
	dbengine tier 1 multihost disk space MB = XXX (default is 128)
	dbengine tier 2 multihost disk space MB = XXX (default is 64)

The “multi-host disk space” settings will reflect your retention. These values are not specified for each child, but are a general maximum for the self + children metrics.

Increasing “cache” will make queries/response faster. Increase it if you have enough RAM.

Children

  • Change memory mode to ram. There is no need to keep metrics locally, the parent will store them. The default retention for ram is 1 hour. This means no metrics are lost if the parent is unavailable for up to 1 hour: children will replicate their metrics after reconnecting. The retention can be increased if you think 1 hour is not enough.
  • Disable ML and health. There is no need to run them locally, the parent will do it.

[db]
	mode = ram

[ml]
    enabled = no

[health]
    enabled = no
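
If one hour of buffer on the children is not enough, a child netdata.conf with a longer in-memory buffer could look roughly like this. It assumes the `retention` option under [db], which should count stored entries per dimension in ram mode (so 7200 entries is about 2 hours at the default 1-second collection interval). Treat the option name and its unit as assumptions to verify against the netdata.conf reference for your version.

```ini
[db]
    mode = ram
    # assumption: entries kept per dimension in ram mode (default 3600)
    retention = 7200
```

Keep in mind that a larger value increases the RAM used by each child.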

For instance, we have a lab with 500 children and with these db settings on one of the parents:

[db]
	dbengine multihost disk space MB = 500000
	dbengine tier 1 multihost disk space MB = 250000
	dbengine tier 2 multihost disk space MB = 125000

With this much space I get 3 months of metrics.

You can connect the parents to each other if you want. It will increase HA, but at the same time it means that you will have a copy of everything on each parent (double the amount of used space, RAM and CPU).

  • If your children are ephemeral (new nodes that come and go), set “is ephemeral node” to true in stream.conf for the API key. If you have both ephemeral and non-ephemeral nodes, use 2 API keys.
  • Claim (connect) only Parent nodes. They will maintain Agent-Cloud connection. Don’t claim children.
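
For the mixed case, a parent-side stream.conf with two API keys might look like the sketch below. Both UUIDs are placeholders; the only option taken from the advice above is “is ephemeral node”.

```ini
# Parent: stream.conf
# Key used by long-lived (permanent) children
[11111111-2222-3333-4444-555555555555]
    enabled = yes

# Key used by ephemeral children (nodes that come and go)
[aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee]
    enabled = yes
    is ephemeral node = true
```

Each child then uses whichever key matches its lifecycle in its own [stream] section.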

Keep in mind that the free version of Netdata Cloud is limited; check the limitations before you decide to connect nodes. For instance, there is a hard limit of 5 currently connected nodes, which makes it a no-go if you plan to use the community plan.

Thank you so much, this is super helpful! We will be going for the paid version of netdata, we absolutely love it.

We’re replacing site24x7 with it and it’s night and day so far.

One last question: if we’re ever concerned about the bandwidth used to ship this data, is there a setting to have agents send data only every 30 or 60 seconds, say?

I’m loving how it is so far, so hoping it’s ok data wise.

Right now, the parent has 8 cores (16 threads), 64GB RAM, and 1TB NVMe storage and seems to be humming along nicely.

Just wanted to check in on one last thing. I’m using the following settings on the parent:

[db]
	dbengine multihost disk space MB = 500000
	dbengine tier 1 multihost disk space MB = 100000
	dbengine tier 2 multihost disk space MB = 125000
	replication threads = 5

The load seems to be quite high with 48 children sending data to the parent. I read online that we should expect a much higher children-to-parent ratio, so I’m just trying to understand how the performance works.

 11:51:53 up 19 days,  3:30,  2 users,  load average: 11.97, 6.74, 4.34
root@netdata:/etc/netdata# top cd1
top - 11:51:57 up 19 days,  3:30,  2 users,  load average: 12.05, 6.84, 4.38
Tasks: 246 total,   1 running, 245 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  3.5 sy, 17.0 ni, 79.1 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
Linux 5.15.0-91-generic (netdata) 	01/19/24 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    5.82    1.31    0.03    0.00   92.85

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0            83.54   2226.13     0.00   0.00    0.11    26.65   33.29    357.97     0.00   0.00    0.12    10.75    0.01    564.26     0.00   0.00    0.40 55278.05    0.00    0.00    0.01   3.95
loop0            0.00      0.00     0.00   0.00    0.07     9.22    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    0.24    17.59    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop2            0.00      0.01     0.00   0.00    0.08    37.46    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop3            0.00      0.01     0.00   0.00    0.05    34.24    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.00     6.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1         83.54   2226.17     0.00   0.00    0.10    26.65   31.68    357.97     1.71   5.14    0.16    11.30    0.01    565.20     0.00   0.00    0.40 55360.06    0.00    0.00    0.01   3.95

One thing I noticed is that we’re alerting on basically everything, even if it’s not defined in health.d. Is there a way to do the reverse? Maybe it would also help us with resource consumption.

Basically, only alert on what we want to alert on.

Thanks!