Problem/Question
Recently I noticed netdata was using over 20 GB of memory, and upon inspection I saw that a significant share of it was allocated to the DB engine.
The increase was gradual and took about 3 months to raise alarms. I restarted netdata on the 29th, but memory usage is climbing steadily again, by around 1 GB every 24 hours.
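For anyone wanting to put a number on the growth without Grafana, here is a minimal sketch. It assumes a Linux host where the resident memory of the netdata process can be read from the `VmRSS` line of `/proc/<pid>/status`. The helper names and the sample text are hypothetical, not output from the affected node.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_mib_per_hour(rss_start_kib, rss_end_kib, elapsed_seconds):
    """Average RSS growth in MiB per hour between two samples."""
    delta_mib = (rss_end_kib - rss_start_kib) / 1024
    return delta_mib / (elapsed_seconds / 3600)


# Illustrative sample only, not real output from the affected host.
sample = "Name:\tnetdata\nVmRSS:\t  307200 kB\n"
print(parse_vmrss_kib(sample))

# Growth of 1 GiB over 24 hours, expressed as MiB per hour.
print(round(growth_mib_per_hour(0, 1024 * 1024, 24 * 3600), 1))
```

Sampling the real status file a few hours apart and feeding both values to `growth_mib_per_hour` gives a rate that can be compared between the two locations.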
Some screenshots from Grafana:
The odd thing is, we have this exact deployment in another location with the same uptime, and there netdata only uses about 300 MB. Both locations run an unmodified varnish.chart.py with identical versions of netdata and identical configurations.
The only difference I can see is that the bad location has about double the output from varnishstat (roughly 2200 lines vs 1000 lines).
I've checked the docs and used the resource calculator with the values below. It tells me 156.25 MiB in total disk space and 188 MiB in system memory. My configuration is well within these boundaries.
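The line counts above can be reproduced with a small script. This is a rough sketch, assuming `varnishstat -1` (the one-shot output mode) is available on the host and that one non-empty line corresponds to one counter. The function names are hypothetical.

```python
import shutil
import subprocess


def count_stat_lines(varnishstat_output):
    """Count non-empty lines in `varnishstat -1` style output."""
    return sum(1 for line in varnishstat_output.splitlines() if line.strip())


def current_stat_line_count():
    """Run `varnishstat -1` if it is installed; return the line count or None."""
    if shutil.which("varnishstat") is None:
        return None
    result = subprocess.run(
        ["varnishstat", "-1"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    return count_stat_lines(result.stdout)


# Prints None on hosts without a working varnishstat.
print(current_stat_line_count())
```

Running it on both locations gives a like-for-like comparison of how many counters the varnish plugin has to deal with.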
- days needed for storage: 0
- update every: 30
- Metrics collected: 20k
- streaming nodes: 0
- compression ratio: 70
- page cache size: 32
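As a sanity check on the calculator figure, here is a minimal sketch of the arithmetic. It assumes the memory estimate is the page cache plus two 4096-byte pages per collected metric. That formula is my assumption, chosen because it reproduces the 188 MiB figure, and is not confirmed as what the calculator actually does.

```python
PAGE_SIZE_BYTES = 4096   # assumed dbengine page size
PAGES_PER_METRIC = 2     # assumed number of pages kept in memory per metric


def estimated_memory_mib(page_cache_mib, metrics):
    """Estimate memory in MiB: page cache plus per-metric pages (assumed formula)."""
    per_metric_mib = metrics * PAGE_SIZE_BYTES * PAGES_PER_METRIC / (1024 * 1024)
    return page_cache_mib + per_metric_mib


# Values entered into the calculator: 32 MiB page cache, 20k metrics.
print(estimated_memory_mib(32, 20_000))

# Same estimate with twice as many metrics.
print(estimated_memory_mib(32, 40_000))
```

Under this assumption the estimate grows linearly with the metric count, so even double the metrics stays in the hundreds of MiB, far below the 20 GB observed.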
Could the extra dimensions coming from the varnish plugin be to blame? If so, shouldn’t netdata be limiting these resources?
Any help in troubleshooting this would be greatly appreciated. Thanks
Additional steps
I've since installed netdata v1.30.1 on one of the affected nodes and purged any legacy settings, but it still seems to have the same issue.
Configuration
[global]
hostname = varnish01.example.com
run as user = netdata
history = 300
process scheduling policy = idle
OOM score = 1000
update every = 30
memory mode = dbengine
page cache size = 32
dbengine disk space = 256
[web]
web files owner = root
web files group = netdata
bind to = localhost
mode = none
[plugins]
proc = yes
diskspace = yes
cgroups = no
tc = no
idlejitter = yes
enable running new plugins = no
check for new plugins every = 60
slabinfo = no
apps = yes
charts.d = no
ebpf = no
fping = no
go.d = no
ioping = no
node.d = no
perf = no
python.d = yes
[health]
enabled = no
[registry]
enabled = no
[backend]
enabled = yes
data source = average
type = graphite
destination = graphite.example.com:2003
prefix = netdata
hostname = varnish01_example_com
update every = 60
buffer on failures = 10
timeout ms = 20000
send charts matching = *
[statsd]
enabled = no
[plugin:proc]
netdata server resources = yes
/proc/pagetypeinfo = no
/proc/stat = yes
/proc/uptime = yes
/proc/loadavg = yes
/proc/sys/kernel/random/entropy_avail = yes
/proc/pressure = yes
/proc/interrupts = yes
/proc/softirqs = yes
/proc/vmstat = yes
/proc/meminfo = yes
/sys/kernel/mm/ksm = yes
/sys/block/zram = yes
/sys/devices/system/edac/mc = yes
/sys/devices/system/node = yes
/proc/net/dev = yes
/proc/net/sockstat = yes
/proc/net/sockstat6 = yes
/proc/net/netstat = yes
/proc/net/snmp = no
/proc/net/snmp6 = no
/proc/net/sctp/snmp = no
/proc/net/softnet_stat = yes
/proc/net/ip_vs/stats = yes
/sys/class/infiniband = no
/proc/net/stat/conntrack = no
/proc/net/stat/synproxy = no
/proc/diskstats = yes
/proc/mdstat = yes
/proc/net/rpc/nfsd = no
/proc/net/rpc/nfs = no
/proc/spl/kstat/zfs/arcstats = no
/sys/fs/btrfs = no
ipc = yes
/sys/class/power_supply = no
[plugin:proc:diskspace]
update every = 30
check for new mount points every = 60
[plugin:apps]
update every = 30
[plugin:python.d]
update every = 30
[netdata.statsd_metrics]
enabled = no
[netdata.statsd_useful_metrics]
enabled = no
[netdata.statsd_events]
enabled = no
[netdata.statsd_reads]
enabled = no
[netdata.statsd_bytes]
enabled = no
[netdata.statsd_packets]
enabled = no
[netdata.tcp_connects]
enabled = no
[netdata.tcp_connected]
enabled = no
[netdata.private_charts]
enabled = no
[netdata.plugin_statsd_charting_cpu]
enabled = no
[netdata.plugin_statsd_collector1_cpu]
enabled = no
Environment
CentOS 7.8
netdata 1.26.0