Monitoring and predictive charts for the Netdata dbengine

I’ve been using the dbengine sizing calculator and it’s very handy, but it occurred to me that there is no equivalent data provided in a Netdata agent chart. To see what I mean, think of it this way: you start off with x number of agents and so on, do your calculations, and then size up your infrastructure to get the retention you want, and all is great. But then later on, for shame, things change! Now you might have a lot more nodes and/or metrics. You could kick it old school and keep going back to the calculator and updating it with new numbers, but then you miss the point of automating monitoring, right?

So one chart I have in mind shows growth over time of some consumed resource, such as disk space or memory. This would be a minimum and should be easily doable, since it’s just reporting data.
The other chart type I have in mind is unlike most Netdata charts in that it projects a cone into the future, so that we can do professional things like plan ahead for expanding memory or disk space. Don’t get me wrong: as much as I’ll miss the excitement of a service falling over because a LUN ran out of space, I think this might be useful :smiley:

So,

  1. Low-hanging fruit: Can we get a chart for the disk space used by the dbengine (unless I missed it somewhere)?
  2. General fruit-harvest: Can we get some stats which give a clear indication of current retention (can’t we just use some of the calculator logic using the goodness of real data?).
  3. Cider for all: Some cone-style charts to predict the murky future.

Hi @Luis_Johnstone ,

Thank you so much for the thorough and valuable feedback!

Before going into details, let me tell you that the team is working hard on dbengine v2.0, which will focus on:

  • further memory optimizations to reduce its footprint → this will also make sizing calculations easier and more straightforward
  • improved cache efficiency
  • refactoring and code cleanup

You could kick it old school and keep going back to the calculator and updating it with new numbers, but then you miss the point of automating monitoring, right?

Hopefully with the simplifications mentioned above this will be easier. We are aiming to bring a kind of calculator into our UI so we will probably be able to address it then.

  1. Low-hanging fruit: Can we get a chart for the disk space used by the dbengine (unless I missed it somewhere)?

In the Netdata Monitoring section we have a couple of charts on the dbengine, but in fact we don’t have one on disk space usage. This is for sure going to be included with the 2.0 release :slight_smile:
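Until then, if anyone wants a rough number today, something like the minimal sketch below works. It assumes the default cache directory /var/cache/netdata with tier directories named dbengine, dbengine-tier1, and so on; adjust for your install:

```python
# Rough on-disk footprint of the dbengine, reported per tier directory.
# ASSUMPTION: /var/cache/netdata is the default cache dir on many
# installs, with tiers under dbengine, dbengine-tier1, ...; check your
# netdata.conf if your layout differs.
from pathlib import Path

CACHE_DIR = Path("/var/cache/netdata")  # adjust for your install

def dir_size_bytes(path: Path) -> int:
    """Sum the sizes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

if __name__ == "__main__":
    for tier_dir in sorted(CACHE_DIR.glob("dbengine*")):
        print(f"{tier_dir.name}: {dir_size_bytes(tier_dir) / 1024**2:.1f} MiB")
```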

  2. General fruit-harvest: Can we get some stats which give a clear indication of current retention (can’t we just use some of the calculator logic using the goodness of real data?).

We had introduced some general stats on the Home tab of Netdata Cloud (not sure if you remember the cards I showed below), but due to a lot of moving pieces on Cloud and Agent these had to be removed. We are thinking of reworking the Home tab and improving the Node inspector, so we will bring back some information around this.
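In the meantime, the calculator logic itself is simple enough to run by hand against real numbers. A back-of-the-envelope sketch; note that the bytes-per-sample figure below is a rough assumption you should tune from observed disk growth, not an official constant:

```python
# Back-of-the-envelope retention estimate, mirroring the sizing
# calculator's idea but fed with real numbers from your install.
# ASSUMPTION: bytes_per_sample is a rough compressed on-disk cost per
# collected sample; measure your own growth rate and tune it.

def retention_days(disk_quota_mib: float,
                   concurrent_metrics: int,
                   update_every_s: float = 1.0,
                   bytes_per_sample: float = 1.0) -> float:
    """How long disk_quota_mib lasts at the current ingest rate."""
    bytes_per_day = concurrent_metrics * (86_400 / update_every_s) * bytes_per_sample
    return disk_quota_mib * 1024**2 / bytes_per_day

# Example: 2000 metrics collected every second into a 256 MiB quota.
print(f"~{retention_days(256, 2000):.1f} days of retention")
```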

  3. Cider for all: Some cone-style charts to predict the murky future.

That would be the cherry on top for sure! Let’s see where we get with the things mentioned in step 2 and what can be done.

Once again thanks for all the feedback.

@Luis_Johnstone hello!!!

ML for capacity management is something I had thought about but kinda deprioritized in my head, thinking it’s maybe less relevant for a low-latency monitoring tool like Netdata.

But you are correct that it for sure could be a thing: maybe, for a subset of charts, the ability to forecast at some higher aggregation, N steps ahead, etc. Very similar to what Grafana Cloud is doing with their ML stuff, basically letting you configure a forecast model for each metric.

As always, the issue we have is where the training/prediction compute would happen in this case. It would be fairly easy to do in the cloud if Netdata Cloud were to sample the metric a few times a day, build models on that, and then expose forecasts somewhere in NC.

Or we could just build the ability to create a forecast model into the agent itself, which users could decide to enable on a node, or on a parent if they’d rather.

I’ll defo think on this more though, as I do think it would be good to get to some sort of forecasting eventually.

E.g. if you could downsample a metric to different frequencies, so for example you pick a metric and aggregate it to hourly or daily, then you could imagine being able to fit a forecast model and predict N steps ahead (worse predictions with each step, but a cone of uncertainty like you say). I think something like this could be a good place to start.

Starting with a forecast on a 1-second metric will end up mostly being garbage in / garbage out, and the best you could do would be a small window, so it’s no use for the sort of thing you’re after. But the data is there, so it could be a nice way to give users the ability to (1) downsample a metric and then (2) enable a forecast model for these lower-frequency aggregations of the underlying metrics.
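To make that concrete, here is a toy sketch of the downsample-then-forecast idea. Nothing Netdata-specific is assumed: just a plain linear trend fit on hourly aggregates, with a residual-based band that widens with the horizon, i.e. the cone of uncertainty:

```python
# Toy "cone of uncertainty" forecast on a downsampled metric.
# A sketch only: a real implementation would use a proper model
# (ARIMA, exponential smoothing, ...), but the shape of the output
# is the point here.
import numpy as np

def forecast_cone(y: np.ndarray, steps: int, z: float = 1.96):
    """Fit a linear trend to y and forecast `steps` ahead.

    Returns (mean, lower, upper); the band grows ~sqrt(horizon),
    a common heuristic for accumulating forecast error.
    """
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    resid_std = np.std(y - (slope * t + intercept))
    horizon = np.arange(1, steps + 1)
    mean = slope * (len(y) - 1 + horizon) + intercept
    band = z * resid_std * np.sqrt(horizon)
    return mean, mean - band, mean + band

# Example: one week of hourly averages of a slowly growing metric.
rng = np.random.default_rng(0)
hourly = 100 + 0.5 * np.arange(168) + rng.normal(0, 5, 168)
mean, lo, hi = forecast_cone(hourly, steps=24)  # predict the next day
print(f"t+24h: {mean[-1]:.1f} (cone: {lo[-1]:.1f} .. {hi[-1]:.1f})")
```

Run something like this per metric on the hourly or daily aggregates, and the widening band is exactly what you would render as the cone on a chart.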

Made this GH discussion for visibility and to see if other users who might not see this post have some interest: ml based forecasting in netdata? · Discussion #14130 · netdata/netdata · GitHub

@hugo Thanks very much for the reply!

That all sounds great :slight_smile:
I would note that the general stats you mentioned are, for me, more of a sales or C-level deck type of thing. I’m not sure that people using Netdata care as much about that sort of stuff (IMO); we care about whether things are configured right, so that we’ve set aside enough resources to get the retention we want. After all, I might be storing loads of data for ages overall, but maybe one particular server with lots more collectors has a very short retention because of all the data points.

Again, thanks for the reply and I’ll hold tight. On a related note:

Do we have an ETA on this?? :smiley:

Hey @andrewm4894 :slight_smile:

Yeah, my impression is that the work being done right now on the scalability of the dbengine etc. is geared towards handling concurrent scale better, but also towards better retention (e.g. multiple tiers). As such, I expect there are strong synergies with capacity planning/modelling. Even if you think on smaller time-scales, surely it’s better to know that you’re likely to auto-scale up a cluster before it happens, right? Naturally, I was first thinking in terms of the dbengine itself, but autoscaling is a good example, as is capacity for storage appliances and so forth.

I actually think that storage capacity might be an excellent use-case for ML. I’ve used systems that gave me a prediction of when I’d run out of space on a SAN, but they all had the weakness that if for some reason you had to dump a load of data onto the SAN, say because you’re doing a P2V or something, then the predictions all became garbage. Sounds exactly like a cloud premium-type feature IMO.
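One thought on that weakness (my own sketch, not anything Netdata ships): fit the growth rate only on a recent window, so that a one-off dump ages out of the window and stops poisoning the forecast fairly quickly:

```python
# Naive "days until full" estimate that only trusts recent history,
# so a one-off dump (e.g. a P2V migration) stops skewing the forecast
# once it falls out of the window. A sketch, not a product.
import numpy as np

def days_until_full(used_gib: np.ndarray, capacity_gib: float,
                    window: int = 14) -> float:
    """Linear growth rate fitted over the last `window` daily samples."""
    recent = used_gib[-window:]
    rate_per_day, _ = np.polyfit(np.arange(len(recent)), recent, 1)
    if rate_per_day <= 0:
        return float("inf")  # flat or shrinking: never fills
    return (capacity_gib - used_gib[-1]) / rate_per_day

# Example: steady ~4 GiB/day growth, then a ~200 GiB dump on day 20.
usage = np.concatenate([np.linspace(100, 180, 20),
                        np.linspace(380, 436, 15)])
print(f"~{days_until_full(usage, 1000):.0f} days until the array is full")
```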

Yes, the cone chart with progressive uncertainty is exactly what I had in mind, because it should be much simpler to do in the short term and could provide good value to users as part of the always-free core.
Definitely agreed on downsampling, because for that sort of stuff you really don’t want fine-grained data anyway. Metrics like that are supposed to go up and down in the short term; it’s the mid-to-long term that causes issues, and the alarms handle the short term.

I’m not sure, since I don’t know how this PR relates to the dbengine v2.0 work: DBENGINE v2 by ktsaou · Pull Request #14125 · netdata/netdata · GitHub.