Monitoring and predictive charts for the Netdata dbengine

I’ve been using the dbengine sizing calculator and it’s very handy, but it occurred to me that there is no equivalent data provided in a Netdata agent chart. To see what I mean, think of it this way: you start off with x number of agents and so on, do your calculations, and then size up your infrastructure to get the retention you want, and all is great. But then later on, for shame, things change! Now you might have a lot more nodes and/or metrics. You could kick it old school and keep going back to the calculator and updating it with new numbers, but then you miss the point of automating monitoring, right?

So one chart I have in mind shows growth over time of some consumed resource, such as disk space or memory. This would be a minimum and should be easily doable, since it’s just reporting data.
The other chart type I have in mind is unlike most Netdata charts in that it projects a cone into the future, so that we can do professional things like plan ahead for expanding memory or disk space. Don’t get me wrong: as much as I’ll miss the excitement of a service falling over because a LUN ran out of space, I think this might be useful :smiley:

So,

  1. Low-hanging fruit: Can we get a chart for the disk space used by the dbengine (unless I missed it somewhere)?
  2. General fruit-harvest: Can we get some stats which give a clear indication of current retention (can’t we just use some of the calculator logic using the goodness of real data?).
  3. Cider for all: Some cone-style charts to predict the murky future.

Hi @Luis_Johnstone ,

Thank you so much for the thorough and valuable feedback!

Before going into details, let me tell you that the team is working hard on dbengine v2.0, which will focus on:

  • further memory optimizations to reduce its footprint → this will also make sizing calculations easier and more straightforward
  • improved cache efficiency
  • refactoring and code cleanup

You could kick it old school and keep going back to the calculator and updating it with new numbers, but then you miss the point of automating monitoring, right?

Hopefully with the simplifications mentioned above this will be easier. We are aiming to bring a kind of calculator into our UI so we will probably be able to address it then.

  1. Low-hanging fruit: Can we get a chart for the disk space used by the dbengine (unless I missed it somewhere)?

In the Netdata Monitoring section we have a couple of charts on the dbengine, but in fact we don’t have one on disk space usage. This is for sure going to be included with the 2.0 release :slight_smile:
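Until then, if anyone wants a rough number today, something like the minimal sketch below works. It assumes the default cache directory /var/cache/netdata with tier directories named dbengine, dbengine-tier1, and so on; adjust for your install:

```python
# Rough on-disk footprint of the dbengine, reported per tier directory.
# ASSUMPTION: /var/cache/netdata is the default cache dir on many
# installs, with tiers under dbengine, dbengine-tier1, ...; check your
# netdata.conf if your layout differs.
from pathlib import Path

CACHE_DIR = Path("/var/cache/netdata")  # adjust for your install

def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

if __name__ == "__main__":
    for tier_dir in sorted(CACHE_DIR.glob("dbengine*")):
        print(f"{tier_dir.name}: {dir_size_bytes(tier_dir) / 1024**2:.1f} MiB")
```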

  2. General fruit-harvest: Can we get some stats which give a clear indication of current retention (can’t we just use some of the calculator logic using the goodness of real data?).

We had introduced some general stats on the Home tab of Netdata Cloud (not sure if you remember the cards I showed below), but due to a lot of moving pieces on Cloud and Agent these had to be removed. We are thinking of reworking the Home tab and improving the Node inspector, so we will bring back some information around this.
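In the meantime, the calculator logic itself is simple enough to run by hand against real numbers. A back-of-the-envelope sketch; note that the bytes-per-sample figure below is a rough assumption you should tune from observed disk growth, not an official constant:

```python
# Back-of-the-envelope retention estimate, mirroring the sizing
# calculator's idea but fed with real numbers from your install.
# ASSUMPTION: bytes_per_sample is a rough compressed on-disk cost per
# collected sample; measure your own growth rate and tune it.

def retention_days(disk_quota_mib: float,
                   concurrent_metrics: int,
                   update_every_s: float = 1.0,
                   bytes_per_sample: float = 1.0) -> float:
    """How long disk_quota_mib lasts at the current ingest rate."""
    bytes_per_day = concurrent_metrics * (86_400 / update_every_s) * bytes_per_sample
    return disk_quota_mib * 1024**2 / bytes_per_day

# Example: 2000 metrics collected every second into a 256 MiB quota.
print(f"~{retention_days(256, 2000):.1f} days of retention")
```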

  3. Cider for all: Some cone-style charts to predict the murky future.

That would be the cherry on top for sure! Let’s see where we get with the things mentioned in step 2 and what can be done.

Once again thanks for all the feedback.

@Luis_Johnstone hello!!!

ML for capacity management is something I had thought about but kinda deprioritized in my head, thinking it’s maybe less relevant for a low-latency monitoring tool like Netdata.

But you are correct that it for sure could be a thing: maybe, for a subset of charts, the ability to forecast at some higher aggregation, N steps ahead, etc. Very similar to what Grafana Cloud is doing with their ML stuff, basically letting you configure a forecast model for each metric.

As always, the issue we have is where the training/prediction compute would happen in this case. It would be fairly easy to do in the cloud if Netdata Cloud were to sample the metric a few times a day, build models on that, and then expose forecasts somewhere in NC.

Or we could just build the ability to create a forecast model into the agent itself, which users could decide to enable on a node, or on a parent if they’d rather.

I’ll defo think on this more though, as I do think it would be good to get to some sort of forecasting eventually.

E.g. if you could downsample a metric to different frequencies, so for example you pick a metric and aggregate it to hourly or daily, then you could imagine being able to fit a forecast model and predict N steps ahead (worse predictions with each step, but a cone of uncertainty like you say). I think something like this could be a good place to start.

Starting with a forecast on a 1-second metric will end up mostly being garbage in / garbage out, and the best you could do would be a small window, so it’s no use for the sort of thing you’re after. But the data is there, so it could be a nice way to give users the ability to (1) downsample a metric and then (2) enable a forecast model for these lower-frequency aggregations of the underlying metrics.
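To make that concrete, here is a toy sketch of the downsample-then-forecast idea. Nothing Netdata-specific is assumed: just a plain linear trend fit on hourly aggregates, with a residual-based band that widens with the horizon, i.e. the cone of uncertainty:

```python
# Toy "cone of uncertainty" forecast on a downsampled metric.
# A sketch only: a real implementation would use a proper model
# (ARIMA, exponential smoothing, ...), but the shape of the output
# is the point here.
import numpy as np

def forecast_cone(y: np.ndarray, steps: int, z: float = 1.96):
    """Fit a linear trend to y and forecast `steps` ahead.

    Returns (mean, lower, upper); the band grows ~sqrt(horizon),
    a common heuristic for accumulating forecast error.
    """
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    resid_std = np.std(y - (slope * t + intercept))
    horizon = np.arange(1, steps + 1)
    mean = slope * (len(y) - 1 + horizon) + intercept
    band = z * resid_std * np.sqrt(horizon)
    return mean, mean - band, mean + band

# Example: one week of hourly averages of a slowly growing metric.
rng = np.random.default_rng(0)
hourly = 100 + 0.5 * np.arange(168) + rng.normal(0, 5, 168)
mean, lo, hi = forecast_cone(hourly, steps=24)  # predict the next day
print(f"t+24h: {mean[-1]:.1f} (cone: {lo[-1]:.1f} .. {hi[-1]:.1f})")
```

Run something like this per metric on the hourly or daily aggregates, and the widening band is exactly what you would render as the cone on a chart.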

Made this GH discussion for visibility and to see if other users who might not see this post have some interest: ml based forecasting in netdata? · Discussion #14130 · netdata/netdata · GitHub

@hugo Thanks very much for the reply!

That all sounds great :slight_smile:
I would note that the general stats you mentioned are, for me, more of a sales or C-level deck type of thing. I’m not sure that people using Netdata care as much about that sort of stuff (IMO); we care about whether things are configured right, so that we’ve set aside enough resources to get the retention we want. After all, I might be storing loads of data for ages overall, but maybe one particular server with lots more collectors has a very short retention because of all the data points.

Again, thanks for the reply and I’ll hold tight. On a related note:

Do we have an ETA on this?? :smiley:

Hey @andrewm4894 :slight_smile:

Yeah, my impression is that the work being done right now on the scalability of the dbengine etc. is geared towards handling concurrent scale better, but also towards better retention (e.g. multiple tiers). As such, I expect there are strong synergies with capacity planning/modelling. Even if you think on smaller time-scales, surely it’s better to know that you’re likely to auto-scale up a cluster before it happens, right? Naturally, I was first thinking in terms of the dbengine itself, but autoscaling is a good example, as is capacity for storage appliances and so forth.

I actually think that storage capacity might be an excellent use-case for ML. I’ve used systems that gave me a prediction of when I’d run out of space on a SAN, but they all had the weakness that if for some reason you had to dump a load of data onto the SAN, say because you’re doing a P2V or something, then the predictions all became garbage. Sounds exactly like a cloud premium-type feature IMO.
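One thought on that weakness (my own sketch, not anything Netdata ships): fit the growth rate only on a recent window, so that a one-off dump ages out of the window and stops poisoning the forecast fairly quickly:

```python
# Naive "days until full" estimate that only trusts recent history,
# so a one-off dump (e.g. a P2V migration) stops skewing the forecast
# once it falls out of the window. A sketch, not a product.
import numpy as np

def days_until_full(used_gib: np.ndarray, capacity_gib: float,
                    window: int = 14) -> float:
    """Linear growth rate fitted over the last `window` daily samples."""
    recent = used_gib[-window:]
    rate_per_day, _ = np.polyfit(np.arange(len(recent)), recent, 1)
    if rate_per_day <= 0:
        return float("inf")  # flat or shrinking: never fills
    return (capacity_gib - used_gib[-1]) / rate_per_day

# Example: steady ~4 GiB/day growth, then a ~200 GiB dump on day 20.
usage = np.concatenate([np.linspace(100, 180, 20),
                        np.linspace(380, 436, 15)])
print(f"~{days_until_full(usage, 1000):.0f} days until the array is full")
```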

Yes, the cone chart with progressive uncertainty is exactly what I had in mind, because it should be much simpler to do in the short term and could provide good value to users as part of the always-free core.
Definitely agreed on downsampling, because for that sort of stuff you really don’t want fine-grained data anyway. Metrics like that are supposed to go up and down in the short term; it’s the mid-to-long term that causes issues, and the alarms handle the short term.

I’m not sure, since I don’t know how this PR relates to the dbengine v2.0 work: DBENGINE v2 by ktsaou · Pull Request #14125 · netdata/netdata · GitHub.