Still waiting (the ever-maturing Netdata)

Today I use Netdata in our development instances. Why only there?

I’ve been evaluating, and even developing for and working with, Netdata since 2016. Back in 2016 it was “something to watch,” as I presented it to our company.

Since then, I’ve used Netdata with nightly builds on our development instances only. Why? Well, occasionally Netdata has instabilities, sometimes even critical ones that impact the entire platform.

But it’s over now, right?

No. Netdata continually changes dependencies, grows and shrinks, and moves file locations (sigh), and these sorts of changes have a big impact on deployment and use.

So, we’re still waiting for this project to reasonably stabilize. Until then, dev it is!!

Also, I think removing the ability to embed Netdata Cloud into websites was a mistake, and I wish that could be an “option” if at all possible. I have made videos about how to integrate Netdata into other things, and none of that works fully anymore.

So: lots of change, causing a lot of impact, preventing a full deployment, and now making many integrations non-functional. I’ll keep watching the project closely and will continue to contribute, but IMHO there’s too much frustration here that could have been easily avoided.


Thanks @cjcox for this feedback - I am going to make sure our entire senior management team sees this!

I’m not sure exactly how to respond or what to say but just wanted to say thanks for sharing your experience.

If it’s not too much hassle, would you be able to make a list of the top changes or examples of issues you would like to see us work on or improve? Just asking so we can make a list of sorts and see which parts we already have on the roadmap and which parts we need to add or re-prioritize.


I, too, share similar concerns. Thank you @cjcox for articulating this in a constructive way. I had also been watching this project for several years before finally having an opportunity to deploy across our fleet. Less than 24 hours after completing the rollout, an agent update was released which broke connectivity to Netdata Cloud and required a fair amount of troubleshooting and effort to resolve. Not a good start.

That said, I do believe feedback like this is critical to making Netdata better. I really do want this tool to work in our environment 1) because the more mainstream paid SaaS offerings inevitably go beyond our financial means and 2) because no other tool comes remotely close to the capabilities of Netdata. I also believe you will not find a group of developers more dedicated to a project than those working on Netdata.

So I remain optimistic about Netdata and want this to become our primary monitoring platform, but we’re not quite there yet.


We REALLY appreciate your feedback, gents, and we know it’s coming from a good place. I will explain the reason for the instability at the end, but the first thing is to tell you how I would do things in your position, given the situation you have described very well.

If the allowance for errors/failure is small, I strongly suggest you have the auto-updater work only with stable releases, by using the `--stable-channel` option. Not a perfect solution, since we occasionally have to deploy patch releases, but definitely better than nightlies.
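For reference, a minimal sketch of what that looks like, assuming the standard one-line kickstart install (the download URL shown here is the currently documented one and may differ for your environment):

```bash
# Download and run the kickstart installer, pinned to the stable release channel.
# Sketch only; adjust the URL and paths to whatever your environment actually uses.
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh
sh /tmp/netdata-kickstart.sh --stable-channel
```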

If the allowance for errors/failure is extremely small or zero, I strongly suggest the following:

  • Turn off automatic updates by using the `--no-updates` option (see the sketch after this list)
  • Always run the update manually first on a staging system that is essentially identical to production
  • Configure Netdata to have only the modules/capabilities and collectors you really need (see next paragraph)
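As a sketch of the first two points, assuming the same kickstart script as above (the exact commands will vary with your install method):

```bash
# Install from the stable channel with automatic updates disabled.
# Illustrative only; the two flags are the ones mentioned above, the rest is an assumption.
sh /tmp/netdata-kickstart.sh --stable-channel --no-updates

# Later, when a new release is out, run the update manually on a staging system
# first, and only roll it out to production once it proves stable there.
```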

A very large number of installation and configuration options are available, allowing for a much smaller footprint and, of course, fewer possibilities for failure. All kickstart options are documented in “Install Netdata with kickstart.sh” on Learn Netdata. Some suggestions on things to disable to reduce the possible error surface would be similar to the ones that improve performance, but you’re a better judge of what you need.
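To illustrate the third bullet above, here is one hedged example of trimming the footprint; the plugin names, paths, and defaults shown depend on your version and install method, so treat this purely as a sketch of the idea:

```bash
# Open the main configuration with Netdata's edit-config helper
# (typically found in the Netdata config directory, e.g. /etc/netdata).
cd /etc/netdata
sudo ./edit-config netdata.conf

# Then, in netdata.conf, switch off plugin families you do not need, for example:
#
#   [plugins]
#       go.d     = no
#       python.d = no
#       charts.d = no
```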

But why is it taking so long to stabilize? We won’t hide the fact that nightly automatic updates are turned on by default to help us identify issues not covered by our QA processes. Telemetry on failures in the field and user feedback on bugs are some of the biggest contributions our community makes to improving Netdata. Are we proud of deploying by default code that’s not as perfect as we could possibly make it? Of course not. We’ve had this discussion internally several times and it comes down to a simple statement: the moment we have sufficient automated test coverage to guarantee that our users will almost never discover bugs is the moment we can default to auto-updating on stable releases.

We have a lot of work ahead of us to catch up with years of limited-to-no test automation, while making much-needed improvements and additions. We are trying to do it on our own, but it will be much faster with help from the community. Everything we do on the agent is open source, so if there’s ANYTHING you can help with to improve the situation, we’d appreciate it greatly - and we’ll send some swag over too :wink:
