My personal experience running Netdata in a cluster with 100+ nodes and 80+ pods per server: use it for 1s-granularity metrics, deploy a Netdata server on each node, and use a proxy to reach each server. Ideally, avoid using the Parent server. In my experience with a 100+ node cluster, the parent is barely usable. If the parent goes down, it takes a long time to come back up and show metrics, and in my experience it goes down often. It got more unstable after v1.31. Then use Prometheus for longer-term, less granular metrics.
My 2 cents.
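For anyone who wants to try the per-node setup described above, here is a minimal sketch of how a script might build the query URL for each node's Netdata server behind a proxy. The proxy base URL, the `/<node>/` path layout, and the node names are hypothetical; only the `/api/v1/data` endpoint and its `chart`, `after`, `points`, and `format` parameters come from the Netdata agent API.

```python
from urllib.parse import urlencode


def node_data_url(proxy_base, node, chart, seconds=60):
    """Build the URL for the last `seconds` of 1s data for one chart on one node.

    Assumes (hypothetically) that the proxy exposes each node's Netdata
    server under <proxy_base>/<node>/.
    """
    query = urlencode({
        "chart": chart,
        "after": -seconds,   # negative value = relative to now
        "points": seconds,   # one point per second
        "format": "json",
    })
    return f"{proxy_base.rstrip('/')}/{node}/api/v1/data?{query}"


def fan_out(proxy_base, nodes, chart, seconds=60):
    """Return one URL per node, so every node is queried directly (no parent)."""
    return {node: node_data_url(proxy_base, node, chart, seconds) for node in nodes}


if __name__ == "__main__":
    # Hypothetical proxy address and node names.
    urls = fan_out("https://proxy.example/netdata", ["node-01", "node-02"], "system.cpu")
    for node, url in urls.items():
        print(node, url)
```

Each URL can then be fetched with any HTTP client. Since every node is queried on its own, one unhealthy node does not block the others.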
Thanks for the feedback!
I'm curious: roughly how many charts or metrics was that on each node? Sounds like a fairly big setup, alright.
I know there is a fair bit of work going on on the agent side right now around data tiering to make retention easier, and also around having parents scale much better to large numbers of nodes like you have.
@Stelios_Fragkakis sounds like some potentially really useful feedback here from @data-smith
The node with the most pods has ~26,000 metrics and ~3,000 charts. Kubernetes allows us to use our resources efficiently, and many of our nodes have a lot running on them.
I also found memory mode RAM to be more stable than SAVE. I read that they both use memory-mapped files, so I’m not sure why one would be more stable than the other.
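To put the ~26,000 metrics figure in perspective, here is a rough back-of-the-envelope sketch of the memory a node might need to hold 1s samples in RAM. It assumes 4 bytes per stored sample and one hour of retention; both numbers are assumptions for illustration, not measurements from this cluster, and real usage will include overhead on top.

```python
# Assumption: each stored sample takes 4 bytes in the in-memory database.
BYTES_PER_SAMPLE = 4


def ram_bytes(metrics, retention_seconds, update_every=1,
              bytes_per_sample=BYTES_PER_SAMPLE):
    """Estimate raw sample storage for `metrics` metrics.

    retention_seconds: how much history is kept (assumed value, tune to taste).
    update_every: collection interval in seconds (1 for 1s granularity).
    """
    samples_per_metric = retention_seconds // update_every
    return metrics * samples_per_metric * bytes_per_sample


if __name__ == "__main__":
    estimate = ram_bytes(26_000, 3600)  # busiest node, one hour of 1s data
    print(f"{estimate / 1024**2:.0f} MiB")
```

Under those assumptions the busiest node lands in the hundreds of MiB for one hour of history, which is one reason to keep only short, high-resolution retention locally and push longer-term data to Prometheus.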