So I’m doing some config of the Netdata agent and the changes I’ve made have not had the expected result.
I am wondering how I can get a hold of the currently configured running values for a Netdata agent.
i.e. If I edit the netdata.conf file I want to be able to see whether those settings are actually being used or not. Is this something that can be exposed via the API, perhaps?
I’ve looked at:
{hostname}:19999/api/v1/info
But that doesn’t give me the information I’m looking for; in this case, what I want to know is which mode the dbengine is running in.
I am pretty sure that for most changes you would need to restart the agent. So, you can safely assume that unless you restart the agent, no changes have propagated.
Can you tell us more about the issue that you encountered with a configuration not being applied? It must have been quite frustrating. What did you do? What did you expect?
I include a service restart after any config change to Netdata, so it won’t be that.
I set “dbengine disk space = 923” as the calculator tool says I should get about 14 days of stats with that.
But on some hosts I have data going back 5 days and others it goes back only 3 days.
Now, there are a number of possible causes:
Perhaps I didn’t configure “dbengine disk space = 923” on all hosts at the same time (certainly possible).
The value I chose is too small (unlikely as it should give me at least 5 days even if not the full 14 that I was after).
The dbengine “memory_mode” is running in a mode that requires additional configuration? (currently the mode has not been changed from stock and so I assume that it’s running in “dbengine” mode).
I can see that a service restart will give me running config (well, at least some of the config) but I’m surprised to not find a way to poll that config live.
Hey @Markos_Fountoulakis thanks for chiming in. Why do you think that @Luis_Johnstone is running legacy configuration or even better, how will a user know? Netdata Agent version?
From the documentation link I shared above there is a lot of information, including:
“For Netdata Agents earlier than v1.23.2 , the Agent on the parent node uses one dbengine instance for itself, and another instance for every child node it receives metrics from.”
So for older Netdata configurations, each host will have its own disk space allocation; thus, if the hosts harvest a different number of metrics, they will end up with different retention. Am I correct?
OK, I’ll try that different switch and comment out the old, deprecated one and then report back
The documentation around this really needs some work. I already asked some questions about it on another thread. For example, that document says that the default “page cache size” is 32 but when you use the calculator it doesn’t mention changing that value. It’s very strange to mention a value unless it’s in the context of changing it or of its being a tunable; but the doc does not explain when or why I’d want/need to change that value.
Yes, it says:
You do not need to edit the page cache size setting to store more metrics using the database engine. However, if you want to store more metrics specifically in memory , you can increase the cache size.
But then why would you increase it? Why would I want to store more or less metrics in memory? It might be something specific to setting the “memory_mode” to ‘RAM’ (for example) but then why is there a default of 32 which applies (AFAIC) to the dbengine mode?
Now, I can easily have a guess but it’s not obvious from the documentation.
Hopefully you guys take this for what it is: as constructive feedback to make the product better and not as me being pedantic (well, there’s that too).
If we’re really focused on measuring stuff, then there should be some metric in Netdata that gives me a good rule of thumb about whether the page cache needs tuning, one that I can monitor and alert on using real empirical data, such as “NetData DB engine RAM usage (netdata.dbengine_ram)”, and that is documented alongside the setting.
Thanks a lot @Luis_Johnstone for that feedback, it’s indeed valuable.
I am cc’ing @joel, who takes care of documentation here at Netdata, so that he has visibility when he returns from leave.
In regards to what you are saying, yes the documentation could have been clearer regarding why the user may want more data in memory. Perhaps @Markos_Fountoulakis can give us his expert explanation, since he is the very designer of our database engine!
Regarding the last point, well that’s interesting. We indeed are collecting metrics from dbengine, as you can see in the Netdata Monitoring section of the agent, but I am not too sure of what they mean. This dynamic modification of page cache based on real data could be something useful. cc @andrewm4894 and @ilyam8 for visibility.
Increasing the page cache size means that a larger percentage of the metrics that have been archived on disk are also available in memory.
That means that an application or the Netdata dashboard can query and visualize the metrics faster, because the database does not need to go to disk to fetch the metrics as often.
@Markos_Fountoulakis
Yes, that makes sense but doesn’t really help with tuning anything
That means that an application or the Netdata dashboard can query and visualize the metrics faster, because the database does not need to go to disk to fetch the metrics as often.
What does “faster” mean? What does too slow mean?
You might think that is down to the specific user but it’s not because you guys set the value to 32 and not 16 or some other arbitrary number.
You guys already know what “fast enough” is because you’ve got defaults that give a user experience that you were happy to go into Production with.
So the only question is which metrics represent that acceptable user experience (or a user experience that has parity with the default Netdata config).
There is going to be some measure such as average database query execution time or average graph draw time but it is measurable and it’s exactly what you need to decide to tune the product.
@OdysLam
Automation would be awesome but I’d settle for an initial iteration where it’s documented
The documentation says in the “Memory requirements” section that an additional #dimensions-being-collected x 4096 x 2 bytes are allocated to the page cache. That means an extra 2 pages per metric being gathered. In the worst case of a metric being collected every 1 second, this means you still get about 34 minutes of caching the most recent metrics in memory. As far as user experience is concerned, in practice, the user should not experience disk I/O when viewing the latest time window in the dashboard, as long as the zoom level shows fewer than 17 minutes.
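To make the arithmetic above concrete, here is a rough sketch in Python. It assumes that each collected value occupies 4 bytes inside a 4096-byte page; that size is not stated in this thread, but it is the value consistent with the 34-minute figure. The function names and the 2000-dimension example are made up for illustration.

```python
PAGE_SIZE_BYTES = 4096     # page size quoted from the "Memory requirements" section
PAGES_PER_DIMENSION = 2    # extra pages allocated per collected dimension
BYTES_PER_POINT = 4        # ASSUMPTION: size of one stored sample, not stated in the thread


def extra_page_cache_bytes(dimensions: int) -> int:
    """Extra page cache memory: dimensions x 4096 x 2 bytes."""
    return dimensions * PAGE_SIZE_BYTES * PAGES_PER_DIMENSION


def cached_seconds(update_every_s: int = 1, pages: int = PAGES_PER_DIMENSION) -> int:
    """Seconds of recent data held in `pages` pages for a single dimension."""
    points_per_page = PAGE_SIZE_BYTES // BYTES_PER_POINT
    return pages * points_per_page * update_every_s


if __name__ == "__main__":
    # Hypothetical agent collecting 2000 dimensions.
    print(f"extra memory: {extra_page_cache_bytes(2000) / 2**20:.1f} MiB")
    # Two pages at one sample per second: roughly 34 minutes.
    print(f"two pages: {cached_seconds() / 60:.1f} minutes")
    # One page: roughly 17 minutes, the zoom level mentioned above.
    print(f"one page: {cached_seconds(pages=1) / 60:.1f} minutes")
```

With a slower collection interval the same two pages cover proportionally more time, which is why one sample per second is the worst case.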
The 32 MiB figure does not mean much on its own. It was chosen so that the total memory consumption of the netdata agent stays relatively low when enabling memory mode = dbengine since the agent can be installed in very small IoT devices.
One way the 32MiB can easily become insufficient is if a dashboard user zooms out a lot (a span of hours or days visible in the dashboard) and starts scrolling up or down. This is a disk thrashing pattern for the database and will test the limits of the page cache replacement policy and size.
As far as benchmarking is concerned, we don’t have a tuning guide for multiple use cases yet. What you can see in the documentation in the “Evaluation” section is the comparison of a page cache size of 64MiB on HDD versus a page cache size of 16GiB, which gives a speedup of about 100 for that specific workload.
1. Is there a way to get the running values used by the agent as opposed to the config file?
2. Is there a metric exposed by the agent which can be used to assess whether the page cache needs tuning? If it already exists (as there are indeed a number of page cache stats under the Netdata node of the web UI) can we document what each is used for so far as sizing and performance is concerned?
This is an interesting conversation. Please keep the questions flowing, especially as @joel will eventually join the discussion and will have some ideas on improving the documentation on this complex matter.
@Luis_Johnstone about 1, you can get the running values by using the URL 127.0.0.1:19999/netdata.conf
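For anyone who wants to check a setting on several hosts without opening a browser, a minimal Python sketch follows. It assumes the agent listens on the default port 19999, and that option lines prefixed with "#" in the served file are values left at their defaults. The function names are hypothetical helpers, not part of Netdata.

```python
import re
import urllib.request

# ASSUMPTION: a local agent on the default port; adjust host and port as needed.
NETDATA_CONF_URL = "http://127.0.0.1:19999/netdata.conf"

_SECTION = re.compile(r"^\[(?P<name>.+)\]\s*$")
# "key = value", optionally prefixed by "#" (ASSUMPTION: "#" marks a default value).
_OPTION = re.compile(r"^\s*(?P<default>#)?\s*(?P<key>[^=#]+?)\s*=\s*(?P<value>.*?)\s*$")


def parse_netdata_conf(text: str) -> dict:
    """Return {section: {option: (value, is_default)}} from netdata.conf text."""
    config = {}
    section = None
    for line in text.splitlines():
        header = _SECTION.match(line.strip())
        if header:
            section = config.setdefault(header.group("name"), {})
            continue
        option = _OPTION.match(line)
        # Lines before the first section header (file comments) are ignored.
        if option and section is not None:
            is_default = option.group("default") is not None
            section[option.group("key")] = (option.group("value"), is_default)
    return config


def fetch_running_config(url: str = NETDATA_CONF_URL) -> dict:
    """Download the configuration served by the agent and parse it."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_netdata_conf(response.read().decode("utf-8"))


if __name__ == "__main__":
    running = fetch_running_config()
    for name in ("memory mode", "dbengine disk space"):
        print(name, "->", running.get("global", {}).get(name))
```

Running this against each host after a restart should show whether "dbengine disk space" is really 923 everywhere, and whether "memory mode" is still at its default.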
On 2, you can check out the netdata.page_cache_hit_ratio chart. As long as the ratio dimension is close to 100%, there is no reason to increase the page cache size.
Since this metric is a rate, it can drop to 0% when there are no queries to the database.
So I would say, if the disk is doing a lot of read I/O operations (check the chart netdata.dbengine_io_throughput and the reads dimension) and at the same time your netdata.page_cache_hit_ratio is low, then you should increase the page cache size.
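The rule of thumb above can be written down as a small check. This is only a sketch: the two threshold values are placeholders chosen for illustration, not numbers recommended by Netdata, and the inputs are expected to be read off the netdata.page_cache_hit_ratio and netdata.dbengine_io_throughput charts by whatever means you prefer.

```python
def should_increase_page_cache(
    hit_ratio_pct: float,
    disk_read_rate: float,
    low_hit_ratio_pct: float = 90.0,   # ASSUMPTION: placeholder threshold
    busy_read_rate: float = 1.0,       # ASSUMPTION: placeholder threshold, chart units
) -> bool:
    """Suggest a larger page cache only when both conditions hold.

    A low hit ratio alone is not enough, because the ratio is a rate and can
    drop to 0% simply when nothing is querying the database.
    """
    ratio_is_low = hit_ratio_pct < low_hit_ratio_pct
    disk_is_busy = disk_read_rate >= busy_read_rate
    return ratio_is_low and disk_is_busy


if __name__ == "__main__":
    # Idle agent: ratio reads 0% but there is no read I/O, so no action.
    print(should_increase_page_cache(0.0, 0.0))
    # Heavy reads with many cache misses: consider a larger page cache.
    print(should_increase_page_cache(40.0, 5.0))
```

Requiring both signals keeps an idle agent, where the ratio sits at 0%, from being flagged.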
All,
Is it worth documenting the explanations from this thread (I’m not going to be the last person with this question) or do you think the thread itself serves as sufficient documentation?
For example, I did know about that URL but I didn’t know if it was live data or just a reflection of the on-disk config.