Recently I noticed netdata was using over 20 GB of memory, and on inspection I saw that a significant share of it was allocated to the db engine.
The increase was gradual and took about 3 months before it raised alarms. I restarted netdata on the 29th, but memory usage is again climbing steadily, by around 1 GB every 24 hours.
Some screenshots from grafana:
The odd thing is, we have this exact deployment in another location with the same uptime, and netdata only uses about 300 MB there. Both locations use an unmodified varnish.chart.py with identical versions of netdata and identical configurations.
The only difference I can see is that the bad location has about double the output from varnishstat (roughly 2200 lines versus 1000).
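To see where the extra varnishstat lines come from, the counters can be grouped by their leading prefix on each host and the two results compared. This is only a rough sketch: it assumes `varnishstat -1` prints one counter per line with a dotted name (such as `MAIN.uptime`) as the first field, and the function names `count_by_prefix` and `run_varnishstat` are made up for this example.

```python
import subprocess
from collections import Counter


def count_by_prefix(varnishstat_text):
    """Count counters per top-level prefix (the part before the first dot)."""
    counts = Counter()
    for line in varnishstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: the counter name is the first whitespace-separated field.
        prefix = fields[0].split(".", 1)[0]
        counts[prefix] += 1
    return counts


def run_varnishstat():
    """Run `varnishstat -1` once and return its text output."""
    result = subprocess.run(
        ["varnishstat", "-1"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `count_by_prefix(run_varnishstat()).most_common()` on both hosts should show which prefix accounts for the difference, for example a much larger number of per-backend counters on the bad host.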
I've checked the docs and used the resource calculator with the inputs below; it estimates 156.25 MiB of total disk space and 188 MiB of system memory. My configuration is well within these boundaries.
- days needed for storage: 0
- update every: 30
- Metrics collected: 20k
- streaming nodes: 0
- compression ratio: 70
- page cache size: 32
Could the extra dimensions coming from the varnish plugin be to blame? If so, shouldn’t netdata be limiting these resources?
Any help in troubleshooting this would be greatly appreciated. Thanks
I've since installed netdata v1.30.1 on one of the affected nodes and purged any legacy settings, but it still seems to have the same issue.
[global]
    hostname = varnish01.example.com
    run as user = netdata
    history = 300
    process scheduling policy = idle
    OOM score = 1000
    update every = 30
    memory mode = dbengine
    page cache size = 32
    dbengine disk space = 256

[web]
    web files owner = root
    web files group = netdata
    bind to = localhost
    mode = none

[plugins]
    proc = yes
    diskspace = yes
    cgroups = no
    tc = no
    idlejitter = yes
    enable running new plugins = no
    check for new plugins every = 60
    slabinfo = no
    apps = yes
    charts.d = no
    ebpf = no
    fping = no
    go.d = no
    ioping = no
    node.d = no
    perf = no
    python.d = yes

[health]
    enabled = no

[registry]
    enabled = no

[backend]
    enabled = yes
    data source = average
    type = graphite
    destination = graphite.example.com:2003
    prefix = netdata
    hostname = varnish01_example_com
    update every = 60
    buffer on failures = 10
    timeout ms = 20000
    send charts matching = *

[statsd]
    enabled = no

[plugin:proc]
    netdata server resources = yes
    /proc/pagetypeinfo = no
    /proc/stat = yes
    /proc/uptime = yes
    /proc/loadavg = yes
    /proc/sys/kernel/random/entropy_avail = yes
    /proc/pressure = yes
    /proc/interrupts = yes
    /proc/softirqs = yes
    /proc/vmstat = yes
    /proc/meminfo = yes
    /sys/kernel/mm/ksm = yes
    /sys/block/zram = yes
    /sys/devices/system/edac/mc = yes
    /sys/devices/system/node = yes
    /proc/net/dev = yes
    /proc/net/sockstat = yes
    /proc/net/sockstat6 = yes
    /proc/net/netstat = yes
    /proc/net/snmp = no
    /proc/net/snmp6 = no
    /proc/net/sctp/snmp = no
    /proc/net/softnet_stat = yes
    /proc/net/ip_vs/stats = yes
    /sys/class/infiniband = no
    /proc/net/stat/conntrack = no
    /proc/net/stat/synproxy = no
    /proc/diskstats = yes
    /proc/mdstat = yes
    /proc/net/rpc/nfsd = no
    /proc/net/rpc/nfs = no
    /proc/spl/kstat/zfs/arcstats = no
    /sys/fs/btrfs = no
    ipc = yes
    /sys/class/power_supply = no

[plugin:proc:diskspace]
    update every = 30
    check for new mount points every = 60

[plugin:apps]
    update every = 30

[plugin:python.d]
    update every = 30

[netdata.statsd_metrics]
    enabled = no

[netdata.statsd_useful_metrics]
    enabled = no

[netdata.statsd_events]
    enabled = no

[netdata.statsd_reads]
    enabled = no

[netdata.statsd_bytes]
    enabled = no

[netdata.statsd_packets]
    enabled = no

[netdata.tcp_connects]
    enabled = no

[netdata.tcp_connected]
    enabled = no

[netdata.private_charts]
    enabled = no

[netdata.plugin_statsd_charting_cpu]
    enabled = no

[netdata.plugin_statsd_collector1_cpu]
    enabled = no