Support for Minor and Major alarm status and generate alarm as SNMP trap notification

Environment

Linux - Redhat

Problem/Question

We’re currently looking for an alternative to IBM’s Netcool/SSM (a resource monitoring agent) because it has reached EOS. We came across Netdata and are evaluating it.

In “Configure health alarms”, I could see the options to set conditions for the warning and critical threshold limits.

In our existing alarm implementation, we have alarm statuses like “warning”, “minor”, “major”, “critical”, and “clear”, and we have the option to customize the alarm messages based on the status.

Also, we have the option to directly pass the alarm notifications as SNMP traps to the external backend.

What I expected to happen

  1. Options to trigger and clear minor and major alarms based on threshold limits
  2. Options to configure custom alarm messages (for both raise and clear) based on the status
  3. Options to generate an alarm as SNMP trap notification to external backend

I’m not sure if we have the ability to define custom alarm statuses, but I know it has come up in the past as something people might sometimes want or need.

I know there are some SNMP examples in here:

But I’m not quite sure if we have any example alarm configs based on this.

I do recall there may be some examples in some of these threads:

Thanks, Andrew, for the response and suggestions.

Regarding SNMP, I’ve gone through the links you shared. I could see examples for SNMP device monitoring, but couldn’t find an example that generates an alarm as an SNMP trap notification to an external backend. Please share any links that could help.

Also, I’d like to know if there is a way to define custom messages for raising and clearing an alarm.
E.g.
CPU usage has been above 85% for the last 5 minutes (currently 98%)
CPU usage has returned to normal (less than 85%)

Hey,

Sorry for the late response, I was on PTO and just returned!

In general, Netdata is not ideal for SNMP, but it can support it. SNMP is just another “collector”, a data source for Netdata. Whatever we can do with the other gathered dimensions, we can also do with SNMP values.

  1. Netdata supports three alarm statuses: CLEAR, WARNING, and CRITICAL.
  2. It is possible to do that, but you will probably need to create a custom notification method. Notifications and alarms are two different modules in the architecture. Netdata’s health module constantly evaluates alarms and raises them when needed, and it calls the notification script to issue the appropriate notification.
  3. This is not currently possible, and I don’t expect us to implement it in the near future. Perhaps it’s a great first feature for you or your team to contribute to the open source Netdata agent!

The notification script is a bash script that supports many different notification methods, including a custom one that you define yourself. In that custom notification method, you could code any logic you want for the message and then call an API to send out the notification (e.g. Twilio, PagerDuty, etc.).
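As a rough illustration of that idea, here is a minimal sketch of a custom notification method that picks a message based on the alarm status. It assumes that alarm-notify.sh sets ${host}, ${name}, ${status}, ${old_status} and ${value_string} before it calls custom_sender(), and that the recipient list arrives as the first argument. build_message is a hypothetical helper name, and the final printf only stands in for the real call to your backend.

```bash
#!/bin/bash
# Sketch of a custom notification method for health_alarm_notify.conf.
# Assumption: alarm-notify.sh has already set host, name, status,
# old_status and value_string when custom_sender() runs.

# Hypothetical helper (not part of Netdata): choose the text per status.
build_message() {
    case "${status}" in
        WARNING|CRITICAL)
            printf '%s on %s is in %s state (currently %s)\n' \
                "${name}" "${host}" "${status}" "${value_string}"
            ;;
        CLEAR)
            printf '%s on %s has returned to normal (was %s)\n' \
                "${name}" "${host}" "${old_status}"
            ;;
        *)
            printf '%s on %s changed status to %s\n' \
                "${name}" "${host}" "${status}"
            ;;
    esac
}

custom_sender() {
    local to="${1}" msg
    msg="$(build_message)"
    # Placeholder delivery step: replace this with the call to your
    # backend (an HTTP API, a CLI tool, and so on).
    printf 'to=%s %s\n' "${to}" "${msg}"
}
```

Keeping the message selection in its own function means the delivery step can be swapped without touching the wording logic.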

Some helpful documentation:


Hi OdysLam,

Thanks for the response and suggestions.

I defined custom logic and used it in a custom notification method. It works fine.

Thanks,
Vasanth

So did you create a custom notification function to issue SNMP traps? It’s a bit of a niche need, but perhaps you can share the function so maybe it can eventually become a supported notification method.

Hi Christopher,

Sure, I can share it. However, it is not a production-ready solution. It would be great if Netdata could improve and make it a standard solution.

Alarm definition health.d/cpu.conf

    alarm: 111111
       on: system.cpu
    class: Utilization
     type: System
component: CPU
       os: linux
    hosts: *
   lookup: average -30s unaligned of user,system
    units: %
    every: 10s
     warn: $this > 60
     info: CPU Usage Warning
       to: sysadmin

INI file:

[111111]
info=CPU Usage Warning
moduleId=System Resources
<Skipping...>
raisetext="CPU Usage Warning, CPU usage has been above 60% for the last 5 minutes (currently ${value}%)"
falltext="CPU Usage Warning, CPU usage has returned to normal (less than 60%)"

Wrapper script:

#!/bin/bash

inifile="</path/to/ini file>" # Replace here

# declare an associative array
declare -A conf


# Functions
#-------------

# Parse alarm translation INI file using regex matches
# Take out only the necessary information and store it in the conf array
Get_INI_Section ()
{

		local filename="$1"
		local section="$2"
		if [ -f "$filename" ] && [ -n "$section" ]; then
				block_Start="^[ \t]*\[${section}\][ \t]*$"
				block_end='^[ \t]*\[[^]]+\][ \t]*$'

				while IFS='=' read -r key val ; do
				   conf[$key]="${val}"
				done <  <( sed -nre "/${block_Start}/, /${block_end}/ {
												s/${block_Start}//;         # Skip the start of the match pattern
												s/${block_end}//;           # Skip the end of the match pattern
												s/^\s*//;                   # Trim white spaces before
												s/\s*$//;                   # Trim white spaces after
												s/\#.*//;                   # remove comments from the lines
												s/\s*=\s*/=/;               # remove spaces before and after =
												/^$/ d;                     # Delete empty lines
												p;                          # print the content
				}" "$filename" )

				#for key in "${!conf[@]}"
				#do
				#        echo "$key", "${conf[$key]}"
				#done
		else
				echo "Missing INI file and/or INI section"
				exit 1
		fi

}



# Main
#------
if [ "$#" -eq "5" ] && [ -n "$1" ] && [ -n "$2" ] && [ -n "$3" ] && [ -n "$4" ] && [ -n "$5" ] ; then
		host="$1"
		errorCode="$2"
		status="$3"
		value="$4"
		family="$5"

		Get_INI_Section "$inifile" "$errorCode"
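		# Expand ${value} inside the message templates.
		# Caution: eval executes whatever the INI file contains, so the
		# INI file must be trusted and writable only by administrators.
		conf[raisetext]=$(eval "echo ${conf[raisetext]}")
		conf[falltext]=$(eval "echo ${conf[falltext]}")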

		nOID="< Trap OID>" # replace here
		if [ "${status}" == "WARNING" ]
		then
				snmptrap -v 2c -c <Community> localhost:<port> '' ${nOID} ${nOID}.0 s "${conf[raisetext]}" # update here
		else
				snmptrap -v 2c -c <Community> localhost:<port> '' ${nOID} ${nOID}.0 s "${conf[falltext]}" # update here
				
		fi
		ec=$?
		if [ "${ec}" == "0" ]; then
		   echo "Sent notification \"${conf[info]}\" to Custom endpoint"
		   exit 0
		else
		   echo "Failed to send notification \"${conf[info]}\" to Custom endpoint"
		   exit 1
		fi


else
  echo "Missing arguments and/or arguments are not initialized"
  echo "Usage: $0 <host> <errorcode> <status> <current_value> <family>"
  exit 1
fi
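
If the templates only ever need the current value, the eval step could probably be avoided. The following is a minimal sketch, assuming the INI values are wrapped in double quotes and use the literal placeholder ${value}; expand_template is a hypothetical name, not something Netdata provides.

```bash
#!/bin/bash
# Sketch: fill the ${value} placeholder in a message template without eval.
# Assumption: the template may be wrapped in one pair of double quotes,
# as in the INI file above.
expand_template() {
    local template="$1" value="$2"
    local placeholder='${value}'   # literal text to look for, not expanded
    local out

    # Strip one leading and one trailing double quote, if present.
    template="${template#\"}"
    template="${template%\"}"

    # Replace every literal occurrence of the placeholder with the value.
    out=${template//"$placeholder"/$value}
    printf '%s\n' "$out"
}
```

For example, expand_template "${conf[raisetext]}" "${value}" would return the raise text with the current value filled in, and nothing from the INI file is executed as shell code.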

Inside custom sender function

/etc/netdata/netdata_alarm_notify.sh "${host}" "${name}" "${status}" "${value}" "${family}"
ec=$?
if [ "${ec}" == "0" ]; then
	info "Successfully sent notification to Custom endpoint"
else
	error "Failed to send notification to Custom endpoint"
fi

Thanks,
Vasanth

OK, I took a look at this. Here are a few comments that might make your life easier.
There are two things to accomplish here:

  1. Send different messages when raising severity vs when decreasing severity.
  2. Use snmptrap as a notification method.

For the first, you don’t have the logic quite right. Specifically, going to status “CRITICAL” will send conf[falltext]. We have ${status} and ${old_status}, so you’d also want to check for the transition from CRITICAL to WARNING, and probably have an additional text for that. See, for example, the code after netdata/alarm-notify.sh.in at master · netdata/netdata · GitHub
A more generic way to solve this would be a PR that adds logic similar to the html sender’s in the example above, around netdata/alarm-notify.sh.in at master · netdata/netdata · GitHub, to define the type of state transition. Then any sender could optionally do different things (like define different messages) based on the type of the transition, without checking status and old_status.
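
To make the idea concrete, here is a minimal sketch of how a state transition could be classified from the old and new status. classify_transition and the labels it prints are hypothetical, chosen for illustration; they are not taken from alarm-notify.sh.in.

```bash
#!/bin/bash
# Sketch: name the kind of transition between two alarm statuses.
# Usage: classify_transition <old_status> <new_status>
classify_transition() {
    local old="$1" new="$2"
    case "${old}:${new}" in
        WARNING:CRITICAL)                  echo "escalated" ;;
        CRITICAL:WARNING)                  echo "demoted" ;;
        WARNING:WARNING|CRITICAL:CRITICAL) echo "unchanged" ;;
        *:WARNING|*:CRITICAL)              echo "raised" ;;
        *:CLEAR)                           echo "cleared" ;;
        *)                                 echo "other" ;;
    esac
}
```

A sender could then call classify_transition "${old_status}" "${status}" and choose its text from the result, so a move to CRITICAL is never mistaken for a clear.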

For the latter, I didn’t quite get why you needed the INI file and the wrapper. The custom sender function has every parameter at its disposal and is defined in your own installation’s health_alarm_notify.conf. So you could probably achieve everything more simply if you just ran /etc/netdata/edit-config health_alarm_notify.conf on one of your nodes and copied the file over to your other deployments.
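
A minimal sketch of that simpler approach, with the snmptrap call made directly from the custom sender, might look like the following. The target, community and trap OID are placeholders to replace, the SNMPTRAP_BIN override and the send_trap helper are hypothetical, and the info/error stubs exist only so the snippet runs on its own (alarm-notify.sh provides its own versions).

```bash
#!/bin/bash
# Sketch: custom_sender() that issues an SNMPv2c trap via Net-SNMP's snmptrap.
# Assumption: alarm-notify.sh has set host, chart, name, status, value_string.

# Placeholders: replace with your trap receiver, community and OID.
SNMPTRAP_BIN="${SNMPTRAP_BIN:-snmptrap}"       # overridable for dry runs
TRAP_TARGET="${TRAP_TARGET:-localhost:162}"
TRAP_COMMUNITY="${TRAP_COMMUNITY:-public}"
TRAP_OID="${TRAP_OID:-1.3.6.1.4.1.99999.1}"    # placeholder OID

# Stand-alone stubs; skipped when the logging functions already exist.
declare -F info  >/dev/null || info()  { echo "INFO: $*"; }
declare -F error >/dev/null || error() { echo "ERROR: $*" >&2; }

# Hypothetical helper: send one trap carrying the message as a string varbind.
send_trap() {
    local message="$1"
    # The empty '' argument lets snmptrap fill in the uptime itself.
    "${SNMPTRAP_BIN}" -v 2c -c "${TRAP_COMMUNITY}" "${TRAP_TARGET}" '' \
        "${TRAP_OID}" "${TRAP_OID}.0" s "${message}"
}

custom_sender() {
    local message
    if [ "${status}" = "CLEAR" ]; then
        message="${name} on ${host} has returned to normal"
    else
        message="${name} on ${host} is ${status} (currently ${value_string})"
    fi

    if send_trap "${message}"; then
        info "sent SNMP trap for ${host}.${chart}.${name}"
    else
        error "failed to send SNMP trap for ${host}.${chart}.${name}"
    fi
}
```

Setting SNMPTRAP_BIN=echo prints the command arguments instead of sending a trap, which is a quick way to check the message text before pointing it at a real receiver.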

To summarize and suggest the way forward to our dev team, in case we want to do a PR for this soon:
Modify https://github.com/netdata/netdata/blob/master/health/notifications/alarm-notify.sh.in to:

  • Implement generically the logic found in the html sender, that permits different messages to be sent, depending on the state transition.
  • Add an snmptrap sender function that sends different messages based on the state transition (we can generalize what happens in the html sender and borrow from there).