Are health.d/ram.conf percent units useful? How can they be improved?

I am getting these alerts:

ram_available 6.78%
Details: percentage of estimated amount of RAM available for userspace processes, without causing swapping
Chart: mem.available
Family: system
Raised to warning, for 0 second

However, I am not really sure that the percentage of available RAM is a definite sign that swapping is about to start.

For systems with large amounts of RAM, say 64GB or more, even 1 percent available is still not an indication that swapping will occur, which makes percent-type units in memory alarms not very useful.
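To put numbers on this, here is a minimal sketch that converts a fixed "percent available" into the absolute amount it represents on machines of different sizes. The function name and the list of machine sizes are only illustrative.

```python
GIB = 1024 ** 3  # bytes in one GiB


def available_bytes(total_bytes: int, percent_available: float) -> float:
    """Absolute amount of RAM that a given 'percent available' corresponds to."""
    return total_bytes * percent_available / 100.0


# The same 1% means very different absolute headroom depending on total RAM.
for total_gib in (2, 16, 64, 256):
    avail = available_bytes(total_gib * GIB, 1.0)
    print(f"{total_gib:>4} GiB total, 1% available = {avail / GIB:.2f} GiB")
```

On a 64 GiB machine, 1% is still roughly 0.64 GiB of headroom, while on a 2 GiB machine it is about 20 MiB, which is why a single percentage threshold behaves so differently across systems.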

Unless we know the kernel's swappiness policy, what the kernel will choose to do (swap, or evict some cache?), whether systemd-oomd/earlyoom is present, and so on, we cannot reliably issue a warning alarm.

How can this be refactored into a more probable sign of a problem rather than just an annoying alarm, other than by lowering the percentage or silencing it altogether?

Hey @ml35, interesting thoughts. Did you have swappiness (/proc/sys/vm/swappiness) on Linux in mind? E.g. a swappiness value of zero would mean “I just disabled swap, no need for warnings”, and a higher value would mean more aggressive swapping is desired, so we should take that into account when creating the relevant alerts?

In my understanding, a vm.swappiness value of 0 doesn’t mean swapping is disabled, but rather “avoid swapping as much as possible”. The Linux kernel docs say:

This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.

A kernel developer further explains that a value of 0 for vm.swappiness is not recommended; use 1 instead if you want to avoid swapping as much as possible. That is a long, insightful article on RAM and swap that I haven’t managed to finish reading, as I first want to finish commenting here.

In my particular use case that triggered the alarm, the machine has vm.swappiness set to 20 and is running an rsync cron job. This reminds me of the proposal I made last summer for a veto configuration option: for example, I could edit the alarm and instruct it that, even if its conditions are met, it should not be raised if the rsync process is running and the time is between 07:30 and 08:00, when the script is expected to run. Has any progress been made internally, or was the idea discussed any further? I see there is some progress with the ML engine, but it does not seem to be able to handle this case.

Also, recent Linux kernels have implemented memory pressure stall information (PSI). I see Netdata is already monitoring it. Perhaps the available/used memory alarm should check memory PSI and not be raised unless the memory PSI metric also confirms a real memory deficit?
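A rough sketch of what such a PSI cross-check could look like is shown here. It parses text in the format the kernel exposes in /proc/pressure/memory and only warns when available memory is low and the "some" 60-second average also shows stalls. The function names and both thresholds (10% and 1.0) are assumptions picked for illustration, not Netdata defaults.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as the contents of /proc/pressure/memory."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]  # kind is "some" or "full"
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def should_warn(
    available_percent: float,
    psi: Dict[str, Dict[str, float]],
    percent_threshold: float = 10.0,    # illustrative assumption
    some_avg60_threshold: float = 1.0,  # illustrative assumption
) -> bool:
    """Warn only if memory is low AND PSI confirms tasks are stalling on memory."""
    low_memory = available_percent < percent_threshold
    under_pressure = psi.get("some", {}).get("avg60", 0.0) >= some_avg60_threshold
    return low_memory and under_pressure


sample = (
    "some avg10=0.00 avg60=2.50 avg300=0.10 total=12345\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
print(should_warn(6.78, parse_psi(sample)))
```

With this kind of gate, the 6.78% reading from the alert above would stay silent as long as the PSI averages are zero. On a kernel with PSI enabled, the real input would be the contents of /proc/pressure/memory.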

I have re-read your question and I think I now understand what you meant.

I think swappiness should not be an alarm silencer but rather have a weight in deciding if the alarm should be raised or ignored, by lowering the credibility index (yeah, a term I just invented right now as a feature suggestion that probably needs its own discussion: a method to give some alerts a probability ranking. Maybe this belongs to the ML abilities of Netdata?)

If you look at that in the context of the alarm veto concept I proposed in the older thread I mentioned, this would be a simple matter of adding this imaginary configuration to ram.conf:

alarm_ignored_if: process_name_exists = chrome
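To make the veto idea more concrete, here is a small sketch of the logic only, using the rsync and 07:30 to 08:00 example from earlier in this thread. Nothing here is an existing Netdata option: alarm_vetoed, its parameters, and the way the process list is supplied are all hypothetical.

```python
from datetime import time
from typing import Iterable


def alarm_vetoed(
    running_processes: Iterable[str],
    now: time,
    process_name: str = "rsync",   # hypothetical veto condition
    start: time = time(7, 30),     # hypothetical window start
    end: time = time(8, 0),        # hypothetical window end
) -> bool:
    """Return True if the alarm should be suppressed.

    The veto applies only when the named process is running AND the
    current time falls inside the expected maintenance window.
    """
    process_running = process_name in set(running_processes)
    in_window = start <= now <= end
    return process_running and in_window


print(alarm_vetoed(["sshd", "rsync"], time(7, 45)))
print(alarm_vetoed(["sshd", "rsync"], time(9, 0)))
```

Requiring both conditions keeps the veto narrow: rsync running at an unexpected hour, or low memory during the window without rsync, would still raise the alarm.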

PS: I also find a numeric value would be more useful to me than a percentage.

PS2: I also notice you are calculating the available memory. But the available memory is already estimated by the kernel and supplied in /proc/meminfo. Would it make more sense to read it directly from there instead (on systems where it is present)?
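As a sketch of reading the kernel's own estimate, the snippet below parses /proc/meminfo-style text and prefers the MemAvailable field when it exists. The fallback (MemFree + Buffers + Cached) is only a rough approximation I am assuming for kernels without MemAvailable; it is not claimed to be the formula Netdata uses. The sample values are made up.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def mem_available_kib(meminfo: Dict[str, int]) -> int:
    """Prefer the kernel's MemAvailable estimate when the field is present."""
    if "MemAvailable" in meminfo:
        return meminfo["MemAvailable"]
    # Rough fallback (assumption): free memory plus reclaimable-looking caches.
    return meminfo["MemFree"] + meminfo.get("Buffers", 0) + meminfo.get("Cached", 0)


def available_percent(meminfo: Dict[str, int]) -> float:
    """Available memory as a percentage of MemTotal."""
    return 100.0 * mem_available_kib(meminfo) / meminfo["MemTotal"]


# Made-up sample; on Linux the real text comes from /proc/meminfo.
sample = (
    "MemTotal:       65536000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    4443340 kB\n"
    "Buffers:          200000 kB\n"
    "Cached:          3000000 kB\n"
)
info = parse_meminfo(sample)
print(mem_available_kib(info), "kB available,", f"{available_percent(info):.2f}%")
```

Returning the absolute value alongside the percentage also covers the first PS: an alert could then be expressed in kB or GiB rather than as a percentage.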

I think swappiness should not be an alarm silencer but rather have a weight in deciding if the alarm should be raised or ignored, by lowering the credibility index

That was exactly what I had in mind and I really like the term; it feels like it has been there forever :)

Let us first explore what we can do within the specific context and then we can see whether we can generalise this. I also found the veto idea quite interesting. And let’s involve some more people in the discussion and then we can create a new feature request.
