We have a small setup with three independent networks. Each network has a netdata parent that collects metrics from some nodes. The parents forward their collected data to a “master” parent, which is our dashboard for all nodes from all networks and acts as a central alert manager.
We have a problem with the resulting network traffic: outbound traffic runs continuously at about 1 Mb or more, which is pretty costly.
We changed the “update every” setting in the [global] section of netdata.conf to 60 seconds. Traffic is now about 500 Kb or more, which is still expensive.
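For reference, a minimal sketch of that change as it would appear in netdata.conf, using only the section and value mentioned above:

```ini
# netdata.conf
[global]
    # Collect (and therefore stream) one sample per chart every 60 seconds
    # instead of the default of one per second.
    update every = 60
```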
We would like to avoid micromanaging charts by deactivating individual ones.
The question now is: Is our approach to the infrastructure smart enough? What would be an alternative? Or how could we reduce traffic further without losing metric information?
Thank you in advance for your thoughts!
Hi, there is an open PR that adds ZSTD compression. It has a better compression ratio than LZ4 and supports different compression levels, which I think will be configurable. I am not sure what the best solution would be for you, but I’d start by disabling unimportant collectors, e.g. Netdata internal metrics (“Netdata Monitoring” on the dashboard).
On all instances:
netdata monitoring = no
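As a sketch of where this goes: in the versions I have looked at, the option sits in the [plugins] section of netdata.conf, but treat the section name as an assumption and check it against your own file.

```ini
# netdata.conf, on every child and parent
# Assumption: the option belongs to the [plugins] section; verify in your version.
[plugins]
    # Stop collecting Netdata's own internal metrics
    # (the "Netdata Monitoring" section of the dashboard).
    netdata monitoring = no
```

The agent needs a restart for the change to take effect.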
Thank you! We tried a bit of this and a bit of that, but the reduction in traffic was not convincing.
So we decided to do it without a master parent for all networks until further notice. This way, each network has its own “master” parent node, and all network traffic stays within the network. The three parents act as alert managers for their respective networks.
Does that make sense from your point of view? We are new to Netdata, coming from Zabbix. The architecture is important for monitoring.
There’s a decent article on this stuff in case you haven’t already seen it:
I presume the main focus will be to turn off features at the child level (as per Ilyam’s example, or by disabling Machine Learning, for instance) and enable them only at the top-level parent.
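A minimal sketch of the Machine Learning case, assuming ML is controlled by the `enabled` option in the `[ml]` section of netdata.conf on each child. Check the option against your version before relying on it.

```ini
# netdata.conf, on each child node
# Assumption: ML is controlled by "enabled" in the [ml] section.
[ml]
    # Do not train or run anomaly detection on the child;
    # leave it enabled on the parent that should do this work.
    enabled = no
```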
All that said, 500kb/s doesn’t seem like very much traffic. You mention cost here so it sounds less like a performance issue (i.e. it’s using too much bandwidth) than a cost issue (i.e. our bandwidth costs are very high). Without meaning to be presumptuous, is there any way to reduce how much the bandwidth is costing you? I’m no expert in such matters but perhaps if you gave a more detailed idea of the setup (i.e. I presume this is a cloud deployment) someone might be able to suggest an optimisation on that side of things?
Also, you haven’t explicitly said where the data transfer of 500kb/s - 1mb/s is occurring. Are we talking about bandwidth between each child and its parent, or between each parent and the uber-parent?
Out of interest, if it’s the latter case, what is the ingress-egress ratio of netdata traffic for the normal parents?
And just a quick gotcha which you are probably aware of already but…the dashboard only streams data from the parent (in your case the uber-parent) when you are viewing the dashboard in ‘play mode’. If you tend to stick the setting on forced play mode then it’ll stream all the time and if you have a bunch of engineers doing the same thing at the same time…
Hi @kingfisher - in stream.conf there’s a directive that provides an easy way to limit which charts are streamed to the central parent. I suspect that you are not acting on most alerts, so I’d narrow the charts down to the ones you do act upon, like CPU/Memory/Disk/etc. You can filter them with wildcard patterns, like netdata_cpu* and so on.
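My best guess is that the directive meant here is `send charts matching` in the [stream] section, so treat the option name and the chart names below as assumptions to verify. As far as I know it takes Netdata simple patterns (space-separated, `*` as wildcard, `!` to negate, first match wins) rather than full regular expressions. A sketch for the parent that streams to the central parent:

```ini
# stream.conf, on the parent that forwards to the central parent
[stream]
    # Existing settings such as "enabled", "destination" and "api key" stay as they are.
    # Illustrative chart names; use the chart IDs you actually alert on.
    # The trailing "!*" drops everything that was not matched earlier.
    send charts matching = system.cpu system.ram disk_space.* !*
```

It is worth checking the traffic afterwards to confirm the filter also applies to the charts the parent relays for its children.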
The costs result from Azure Cloud egress, from the parent in the Azure network to the master parent in another network. We have roughly extrapolated VNet traffic costs of about 100€/month for our experimental Netdata setup. The ratio is about 10 ingress to 90 egress. The stream from the parent in the Azure network (8 nodes) was constant and not related to use of the GUI.
We are now quite happy with the mitigated setup. We will see.