Telegram notifications no longer working

Hi,

After an update, it seems Telegram notifications are broken on all my servers running Netdata.

Sending a test notification works, but no real notifications are sent to Telegram.

After some searching, I tried commenting out or removing delay and options: no-clear-notification, but it did not help.

I tried following the instructions here, but no luck: Not all alarms send to the Telegram

Here is the content of httpcheck.conf:

# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges
 template: httpcheck_web_service_up
 families: *
       on: httpcheck.status
    class: Utilization
     type: Web Server
component: HTTP endpoint
   lookup: average -1m unaligned percentage of success
     calc: ($this < 75) ? (0) : ($this)
    every: 5s
    units: up/down
     info: average ratio of successful HTTP requests over the last minute (at least 75%)
       to: silent

 template: httpcheck_web_service_bad_content
 families: *
       on: httpcheck.status
    class: Workload
     type: Web Server
component: HTTP endpoint
   lookup: average -5m unaligned percentage of bad_content
    every: 10s
    units: %
     warn: $this >= 10 AND $this < 40
     crit: $this >= 40
#    delay: down 5m multiplier 1.5 max 1h
     info: average ratio of HTTP responses with unexpected content over the last 5 minutes
#  options: no-clear-notification
       to: webmaster

 template: httpcheck_web_service_bad_status
 families: *
       on: httpcheck.status
    class: Workload
     type: Web Server
component: HTTP endpoint
   lookup: average -5m unaligned percentage of bad_status
    every: 10s
    units: %
     warn: $this >= 10 AND $this < 40
     crit: $this >= 40
#    delay: down 5m multiplier 1.5 max 1h
     info: average ratio of HTTP responses with unexpected status over the last 5 minutes
#  options: no-clear-notification
       to: webmaster

 template: httpcheck_web_service_timeouts
 families: *
       on: httpcheck.status
    class: Latency
     type: Web Server
component: HTTP endpoint
   lookup: average -5m unaligned percentage of timeout
    every: 10s
    units: %
     info: average ratio of HTTP request timeouts over the last 5 minutes

 template: httpcheck_no_web_service_connections
 families: *
       on: httpcheck.status
    class: Errors
     type: Other
component: HTTP endpoint
   lookup: average -5m unaligned percentage of no_connection
    every: 10s
    units: %
     info: average ratio of failed requests during the last 5 minutes

# combined timeout & no connection alarm
 template: httpcheck_web_service_unreachable
 families: *
       on: httpcheck.status
    class: Errors
     type: Web Server
component: HTTP endpoint
     calc: ($httpcheck_no_web_service_connections >= $httpcheck_web_service_timeouts) ? ($httpcheck_no_web_service_connections) : ($httpcheck_web_service_timeouts)
    units: %
    every: 10s
     warn: ($httpcheck_no_web_service_connections >= 10 OR $httpcheck_web_service_timeouts >= 10) AND ($httpcheck_no_web_service_connections < 40 OR $httpcheck_web_service_timeouts < 40)
     crit: $httpcheck_no_web_service_connections >= 40 OR $httpcheck_web_service_timeouts >= 40
#    delay: down 5m multiplier 1.5 max 1h
     info: ratio of failed requests either due to timeouts or no connection over the last 5 minutes
#  options: no-clear-notification
       to: webmaster

 template: httpcheck_1h_web_service_response_time
 families: *
       on: httpcheck.responsetime
    class: Latency
     type: Other
component: HTTP endpoint
   lookup: average -1h unaligned of time
    every: 30s
    units: ms
     info: average HTTP response time over the last hour

 template: httpcheck_web_service_slow
 families: *
       on: httpcheck.responsetime
    class: Latency
     type: Web Server
component: HTTP endpoint
   lookup: average -3m unaligned of time
    units: ms
    every: 10s
     warn: ($this > ($httpcheck_1h_web_service_response_time * 2) )
     crit: ($this > ($httpcheck_1h_web_service_response_time * 3) )
#    delay: down 5m multiplier 1.5 max 1h
     info: average HTTP response time over the last 3 minutes, compared to the average over the last hour
#  options: no-clear-notification
       to: webmaster

This was working 100% before and I never changed any options. It stopped working after an update, but I can’t understand how to fix it. Any help is welcome.

Hi @iniOr

Thanks for your report. We will try to replicate this.

In the meantime, can you check your error.log file for any indication of an error from the alarm-notify.sh script?

Can you also please post the version of Netdata you are running (/usr/sbin/netdata -W buildinfo)?

Thanks

Hi, thanks for your reply.

Here are the checks I ran on the logs. There seems to be nothing related in recent days:

# zgrep "telegram" /var/log/netdata/*.log.*
/var/log/netdata/error.log.1:2022-11-13 06:33:24: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] Get "http://127.0.0.1:9087/metrics": dial tcp 127.0.0.1:9087: connect: connection refused
/var/log/netdata/error.log.1:2022-11-13 06:33:24: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] check failed
/var/log/netdata/error.log.1:2022-11-13 06:54:20: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] Get "http://127.0.0.1:9087/metrics": dial tcp 127.0.0.1:9087: connect: connection refused
/var/log/netdata/error.log.1:2022-11-13 06:54:20: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] check failed
/var/log/netdata/error.log.8.gz:2022-11-06 07:05:41: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] Get "http://127.0.0.1:9087/metrics": dial tcp 127.0.0.1:9087: connect: connection refused
/var/log/netdata/error.log.8.gz:2022-11-06 07:05:41: go.d ERROR: prometheus[telegram_bot_for_alertmanager_local] check failed

And:

# zgrep "alarm" /var/log/netdata/*.log.*
/var/log/netdata/error.log.1:2022-11-13 06:33:24: netdata INFO  : MAIN : HEALTH [<hostname>]: Table health_log_92678554_2526_484f_92cc_600d1ee9245d, loaded 533 alarm entries, errors in 0 entries.
/var/log/netdata/error.log.1:2022-11-13 06:33:24: netdata INFO  : MAIN : Host '<hostname>' (at registry as '<hostname>') with guid '92678554-2526-484f-92cc-600d1ee9245d' initialized, os 'linux', timezone 'Europe/Berlin', tags '', program_name 'netdata', program_version 'v1.36.1', update every 1, memory mode dbengine, history entries 0, streaming disabled (to '' with api key ''), health enabled, cache_dir '/var/cache/netdata', varlib_dir '/var/lib/netdata', health_log '/var/lib/netdata/health/health-log.db', alarms default handler '/usr/libexec/netdata/plugins.d/alarm-notify.sh', alarms default recipient 'root'
/var/log/netdata/error.log.1:2022-11-13 06:33:24: python.d INFO: plugin[main] : [alarms] is disabled by default, skipping it
/var/log/netdata/error.log.1:2022-11-13 06:54:18: netdata INFO  : MAIN : HEALTH [<hostname>]: Table health_log_92678554_2526_484f_92cc_600d1ee9245d, loaded 704 alarm entries, errors in 0 entries.
/var/log/netdata/error.log.1:2022-11-13 06:54:18: netdata INFO  : MAIN : Host '<hostname>' (at registry as '<hostname>') with guid '92678554-2526-484f-92cc-600d1ee9245d' initialized, os 'linux', timezone 'Europe/Berlin', tags '', program_name 'netdata', program_version 'v1.36.1', update every 1, memory mode dbengine, history entries 0, streaming disabled (to '' with api key ''), health enabled, cache_dir '/var/cache/netdata', varlib_dir '/var/lib/netdata', health_log '/var/lib/netdata/health/health-log.db', alarms default handler '/usr/libexec/netdata/plugins.d/alarm-notify.sh', alarms default recipient 'root'
/var/log/netdata/error.log.1:2022-11-13 06:54:19: python.d INFO: plugin[main] : [alarms] is disabled by default, skipping it
/var/log/netdata/error.log.8.gz:2022-11-06 07:05:37: netdata INFO  : MAIN : HEALTH [<hostname>]: Table health_log_92678554_2526_484f_92cc_600d1ee9245d, loaded 360 alarm entries, errors in 0 entries.
/var/log/netdata/error.log.8.gz:2022-11-06 07:05:37: netdata INFO  : MAIN : Host '<hostname>' (at registry as '<hostname>') with guid '92678554-2526-484f-92cc-600d1ee9245d' initialized, os 'linux', timezone 'Europe/Berlin', tags '', program_name 'netdata', program_version 'v1.36.1', update every 1, memory mode dbengine, history entries 0, streaming disabled (to '' with api key ''), health enabled, cache_dir '/var/cache/netdata', varlib_dir '/var/lib/netdata', health_log '/var/lib/netdata/health/health-log.db', alarms default handler '/usr/libexec/netdata/plugins.d/alarm-notify.sh', alarms default recipient 'root'
/var/log/netdata/error.log.8.gz:2022-11-06 07:05:38: python.d INFO: plugin[main] : [alarms] is disabled by default, skipping it

Here is the build info:

# /usr/sbin/netdata -W buildinfo
Version: netdata v1.36.1
Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
Install type: binpkg-deb
    Binary architecture: x86_64
    Packaging distro:  
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    YES
    EBPF:                    YES
    IPMI:                    YES
    NFACCT:                  YES
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

This issue began in August. At that time I was apparently using the nightly build, so I switched to stable in the hope it would solve the problem. What I don’t get is that commenting out delay or options does nothing: the test notification works, but real ones do not.

Hi @iniOr ,

I just tested the latest nightly in my environment and could not reproduce the issue. Can you please run the following test?

bash-5.2# /usr/libexec/netdata/plugins.d/alarm-notify.sh test

# SENDING TEST WARNING ALARM TO ROLE: sysadmin
2022-11-14 21:47:38: alarm-notify.sh: INFO: sent slack notification for: hades test.chart.test_alarm is WARNING to '#myalarms'
2022-11-14 21:47:39: alarm-notify.sh: INFO: sent telegram notification for: hades test.chart.test_alarm is WARNING to '-YYYYYY'
# OK

# SENDING TEST CRITICAL ALARM TO ROLE: sysadmin
2022-11-14 21:47:40: alarm-notify.sh: INFO: sent slack notification for: hades test.chart.test_alarm is CRITICAL to '#myalarms'
2022-11-14 21:47:40: alarm-notify.sh: INFO: sent telegram notification for: hades test.chart.test_alarm is CRITICAL to '-YYYYYY'
# OK

# SENDING TEST CLEAR ALARM TO ROLE: sysadmin
2022-11-14 21:47:41: alarm-notify.sh: INFO: sent slack notification for: hades test.chart.test_alarm is CLEAR to '#myalarms'
2022-11-14 21:47:42: alarm-notify.sh: INFO: sent telegram notification for: hades test.chart.test_alarm is CLEAR to '-YYYYYY'
# OK

I received all notifications on my telegram.

I also wrote a simple alert:

 alarm: dev_dim_template
    on: system.cpu
    os: linux
lookup: sum -3s at 0 every 3 percentage foreach *
 units: %
 every: 1s
  warn: $this > 1
  crit: $this > 4

As you can see, it triggered a notification:

2022-11-14 21:43:49: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template is CRITICAL to '-YYYYYY'

Did you set all the necessary variables for Telegram inside health_alarm_notify.conf?
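One more thing worth checking, offered as a sketch rather than a confirmed diagnosis: the test above exercised the sysadmin role, while the alarms in the posted httpcheck.conf are sent to: webmaster. alarm-notify.sh accepts a role name after test, so the same test can be pointed at webmaster. The test_role helper and the NOTIFY_SCRIPT variable are names made up for this example; the script path is the one shown in the test output above.

```shell
#!/usr/bin/env bash
# Path as shown earlier in this thread; adjust it if your install differs.
NOTIFY_SCRIPT="${NOTIFY_SCRIPT:-/usr/libexec/netdata/plugins.d/alarm-notify.sh}"

# Hypothetical helper: send the test alarms to a single role.
# Without an argument it falls back to sysadmin, like the plain "test" run.
test_role() {
    local role="${1:-sysadmin}"
    if [ ! -x "$NOTIFY_SCRIPT" ]; then
        echo "alarm-notify.sh not found at $NOTIFY_SCRIPT" >&2
        return 1
    fi
    # NETDATA_ALARM_NOTIFY_DEBUG=1 makes the script print debugging output.
    NETDATA_ALARM_NOTIFY_DEBUG=1 "$NOTIFY_SCRIPT" test "$role"
}
```

Run it as the netdata user, for example test_role webmaster. If the sysadmin test reaches Telegram but the webmaster test does not, the role mapping is the first place to look.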

Best regards!

Hi thanks for your help.

As explained before, the test always works, but real notifications are no longer sent.

I really don’t get it: the test notification works (the message is sent to the correct Telegram chat using the correct Telegram bot), yet a real notification is never sent.

I set up Telegram according to the docs in the file /etc/netdata/health_alarm_notify.conf

#------------------------------------------------------------------------------
# telegram (telegram.org) global notification options

# To get your chat ID send the command /my_id to telegram bot @get_id.
# Users also need to open a query with the bot (see below).

# note: multiple recipients can be given like this:
#                  "CHAT_ID_1 CHAT_ID_2 ..."
#telegram : "<working telegram id>"
# enable/disable sending telegram messages
SEND_TELEGRAM="YES"

# Contact the bot @BotFather to create a new bot and receive a bot token.
# Without it, netdata cannot send telegram messages.
TELEGRAM_BOT_TOKEN="<token that used to work for years and that work when I test>"

# If a role's recipients are not configured, a message will be send to
# this chat id (empty = do not send a notification for unconfigured roles):
DEFAULT_RECIPIENT_TELEGRAM="<my actual telegram id>"
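The excerpt above shows only the global Telegram options. The same file also maps roles to recipients through role_recipients_telegram[...] entries, and the alarms in this thread notify the webmaster role. A minimal sketch for listing the active (uncommented) mappings follows; show_telegram_roles is a hypothetical helper name, and the default path is the one quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the uncommented Telegram role mappings.
# The default path is the one mentioned in this thread.
show_telegram_roles() {
    local conf="${1:-/etc/netdata/health_alarm_notify.conf}"
    # Match only lines that start (after optional whitespace) with the
    # variable name, so commented-out entries are skipped.
    grep -E '^[[:space:]]*role_recipients_telegram\[' "$conf"
}
```

Calling show_telegram_roles with no argument reads the file from /etc/netdata. An entry for webmaster that is empty, set to "disabled", or pointing at an old chat id would be worth correcting before looking further.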

The content of /usr/lib/netdata/conf.d/go.d/apache.conf :

# netdata go.d.plugin configuration for apache
#
# This file is in YAML format. Generally the format is:
#
# name: value
#
# There are 2 sections:
#  - GLOBAL
#  - JOBS
#
#
# [ GLOBAL ]
# These variables set the defaults for all JOBs, however each JOB may define its own, overriding the defaults.
#
# The GLOBAL section format:
# param1: value1
# param2: value2
#
# Currently supported global parameters:
#  - update_every
#    Data collection frequency in seconds. Default: 1.
#
#  - autodetection_retry
#    Re-check interval in seconds. Attempts to start the job are made once every interval.
#    Zero means not to schedule re-check. Default: 0.
#
#  - priority
#    Priority is the relative priority of the charts as rendered on the web page,
#    lower numbers make the charts appear before the ones with higher numbers. Default: 70000.
#
#
# [ JOBS ]
# JOBS allow you to collect values from multiple sources.
# Each source will have its own set of charts.
#
# IMPORTANT:
#  - Parameter 'name' is mandatory.
#  - Jobs with the same name are mutually exclusive. Only one of them will be allowed running at any time.
#
# This allows autodetection to try several alternatives and pick the one that works.
# Any number of jobs is supported.
#
# The JOBS section format:
#
# jobs:
#   - name: job1
#     param1: value1
#     param2: value2
#
#   - name: job2
#     param1: value1
#     param2: value2
#
#   - name: job2
#     param1: value1
#
#
# [ List of JOB specific parameters ]:
#  - url
#    Server URL.
#    Syntax:
#      url: http://localhost:80
#
#  - username
#    Username for basic HTTP authentication.
#    Syntax:
#      username: tony
#
#  - password
#    Password for basic HTTP authentication.
#    Syntax:
#      password: stark
#
#  - proxy_url
#    Proxy URL.
#    Syntax:
#      proxy_url: http://localhost:3128
#
#  - proxy_username
#    Username for proxy basic HTTP authentication.
#    Syntax:
#      username: bruce
#
#  - proxy_password
#    Password for proxy basic HTTP authentication.
#    Syntax:
#      username: wayne
#
#  - timeout
#    HTTP response timeout.
#    Syntax:
#      timeout: 1
#
#  - method
#    HTTP request method.
#    Syntax:
#      method: GET
#
#  - body
#    HTTP request method.
#    Syntax:
#      body: '{fake: data}'
#
#  - headers
#    HTTP request headers.
#    Syntax:
#      headers:
#        X-API-Key: key
#
#  - not_follow_redirects
#    Whether to not follow redirects from the server.
#    Syntax:
#      not_follow_redirects: yes/no
#
#  - tls_skip_verify
#    Whether to skip verifying server's certificate chain and hostname.
#    Syntax:
#      tls_skip_verify: yes/no
#
#  - tls_ca
#    Certificate authority that client use when verifying server certificates.
#    Syntax:
#      tls_ca: path/to/ca.pem
#
#  - tls_cert
#    Client tls certificate.
#    Syntax:
#      tls_cert: path/to/cert.pem
#
#  - tls_key
#    Client tls key.
#    Syntax:
#      tls_key: path/to/key.pem
#
#
# [ JOB defaults ]:
#  url: http://localhost/server-status?auto
#  timeout: 2
#  method: GET
#  not_follow_redirects: no
#  tls_skip_verify: no
#
#
# [ JOB mandatory parameters ]:
#  - name
#  - url
#
# ------------------------------------------------MODULE-CONFIGURATION--------------------------------------------------

# update_every: 1
# autodetection_retry: 0
# priority: 70000

jobs:
  - name: local
    url: http://localhost/server-status?auto

  - name: local
    url: http://127.0.0.1/server-status?auto

I also see this config, /etc/netdata/health.d/httpcheck.conf (the same file I posted in my first message). Could it interfere with the previous one?

Can you please explain why the test works but “real” alarms do not?

If we can’t find where the problem is, how do I remove Netdata and all its config?
I would then try to set it up from scratch. I suspect the default “nightly” release messed up my setup.

Hello,

The test does not read values from the database, whereas normal notifications are sent according to the values processed from it.
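Because real notifications depend on alarms actually changing state, it can help to check whether the health engine is raising anything at all, independently of Telegram. Below is a minimal sketch using the agent's v1 health API. It assumes the agent listens on the default http://localhost:19999; NETDATA_URL, health_url, raised_alarms and alarm_transitions are names made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: the agent is reachable on the default local port.
NETDATA_URL="${NETDATA_URL:-http://localhost:19999}"

# Build the URL of a v1 API endpoint, e.g. "alarms" or "alarm_log".
health_url() {
    printf '%s/api/v1/%s\n' "$NETDATA_URL" "$1"
}

# JSON list of alarms that are currently raised (WARNING or CRITICAL).
raised_alarms() {
    curl -sS "$(health_url alarms)"
}

# JSON log of recent alarm state transitions.
alarm_transitions() {
    curl -sS "$(health_url alarm_log)"
}
```

If alarm_transitions shows the httpcheck alarms moving to WARNING or CRITICAL while nothing reaches Telegram, the problem is on the notification side; if no transitions appear, the alarms themselves are not firing.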

Since the last release, we have modified Netdata internally to improve and fix some issues. @Manolis_Vasilakis, do you remember any changes that could directly affect the health main thread? I remember that in the past data corruption could create situations like this, but I do not remember anything like it in the past months.

I am developing a PR now using [db].mode = ram and I am constantly receiving alerts on my Telegram.

Best regards!

Hello,

We were discussing the issue you are having. Let me ask you a question: do you have any errors starting with "failed to send telegram notification for" inside your error.log?
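A sketch of one way to run that search is below. Note that the glob used earlier in the thread, /var/log/netdata/*.log.*, matches only rotated files such as error.log.1, so the current error.log was not searched. The helper name find_telegram_failures is made up for this example; the log directory is the one from the earlier output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: search the current and the rotated error logs.
# zgrep reads plain and gzip-compressed files alike.
find_telegram_failures() {
    local log_dir="${1:-/var/log/netdata}"
    # error.log* matches error.log, error.log.1, error.log.8.gz, ...
    zgrep "failed to send telegram notification for" "$log_dir"/error.log*
}
```

Calling find_telegram_failures with no argument searches /var/log/netdata. No output means no matching line was found in any of those files.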

Best regards!