Is it time for multiple streaming servers?

I have a single Netdata server on a fairly powerful system with 32 CPU cores and 256 GB of RAM. It’s set up as a streaming server to “collect” stats from about 150 clients using the stream feature. I’ve followed the guide over at Netdata daemon | Learn Netdata to get the most out of the dedicated system, but I am still having some problems.

I’ve started to see (on the server) a “too many open files” error, and Netdata just halts: it stops collecting stats and stops responding to the clients.

2023-01-28 06:31:51: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/self/mountinfo' (errno 24, Too many open files)
2023-01-28 06:31:51: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/1/mountinfo' (errno 24, Too many open files)
2023-01-28 06:31:58: netdata ERROR : WEB_SERVER[static6] : POLLFD: LISTENER: too many open files - used by this thread 1, max for this thread 42 (similar messages repeated 55466 times in the last 10 secs) (sleeping for 1000 microseconds every time this happens) (errno 24, Too many open files)
2023-01-28 06:32:06: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/self/mountinfo' (errno 24, Too many open files)
2023-01-28 06:32:06: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/1/mountinfo' (errno 24, Too many open files)
2023-01-28 06:32:08: netdata ERROR : WEB_SERVER[static4] : POLLFD: LISTENER: too many open files - used by this thread 1, max for this thread 42 (similar messages repeated 55239 times in the last 10 secs) (sleeping for 1000 microseconds every time this happens) (errno 24, Too many open files)
2023-01-28 06:32:18: netdata ERROR : WEB_SERVER[static6] : POLLFD: LISTENER: too many open files - used by this thread 1, max for this thread 42 (similar messages repeated 55291 times in the last 10 secs) (sleeping for 1000 microseconds every time this happens) (errno 24, Too many open files)
2023-01-28 06:32:21: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/self/mountinfo' (errno 24, Too many open files)
2023-01-28 06:32:21: netdata ERROR : PLUGIN[diskspace] : PROCFILE: Cannot open file '/proc/1/mountinfo' (errno 24, Too many open files)
2023-01-28 06:32:28: netdata ERROR : WEB_SERVER[static5] : POLLFD: LISTENER: too many open files - used by this thread 1, max for this thread 42 (similar messages repeated 55278 times in the last 10 secs) (sleeping for 1000 microseconds every time this happens) (errno 24, Too many open files)

Should I look into daisy-chaining multiple instances as described here: Streaming and replication | Learn Netdata? Would I be able to set up multiple proxy instances behind a load balancer?
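
If I’m reading the streaming docs right, each proxy would carry something roughly like this in its stream.conf: a [stream] section pointing upstream to the main parent, plus an API key section accepting the children, with the children then pointing at the load balancer instead of directly at the parent. This is only a sketch; the hostname and key are placeholders.

[stream]
	enabled = yes
	destination = main-parent.domain.tld:19999
	api key = XXXXXXXXXX-XXXXXXXXXXX-XXXX-XXXXXXXXXXXXX

[XXXXXXXXXX-XXXXXXXXXXX-XXXX-XXXXXXXXXXXXX]
	enabled = yes
	# just a guess; a pass-through proxy may not need to keep data in dbengine
	default memory mode = ram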

The clients are configured with this stream.conf:

[stream]
	enabled = yes 
	destination = netdata.server.domain.tld
	api key = XXXXXXXXXX-XXXXXXXXXXX-XXXX-XXXXXXXXXXXXX

And this netdata.conf:

[global]
	web files owner = root
	web files group = netdata
	bind socket to IP = 0.0.0.0
	memory mode = none
	history = 86400

[health]
	enabled = no

The Netdata “server” has this in stream.conf:

[XXXXXXXXXX-XXXXXXXXXXX-XXXX-XXXXXXXXXXXXX]
	enabled = yes
	default memory mode = dbengine

And this in netdata.conf:

[global]
	web files owner = root
	web files group = netdata
	bind socket to IP = 0.0.0.0
	memory mode = dbengine
	page cache size MB = 32768
	dbengine multihost disk space = 81664
	
	# Systemd
	process scheduling policy = keep

On the server, the Systemd unit file has:

CPUSchedulingPolicy=rr
CPUSchedulingPriority=75

Now the server is just crashing:

free(): double free detected in tcache 2
2023-01-28 07:36:00: apps.plugin ERROR : APPS_READER : Received error on stdin.
EOF found in spawn pipe.
Shutting down spawn server event loop.
Shutting down spawn server loop complete.
2023-01-28 07:36:00: go.d INFO: main[main] received terminated signal (15). Terminating...
2023-01-28 07:36:00: go.d INFO: build[manager] instance is stopped
2023-01-28 07:36:00: go.d INFO: discovery[file manager] instance is stopped
2023-01-28 07:36:00: go.d INFO: run[manager] instance is stopped
2023-01-28 07:36:00: go.d INFO: discovery[manager] instance is stopped

Hi, @tuaris. What is your Netdata version (both parent and children instances)?

The majority of the child instances are on 1.37.1. There are probably 3 or 4 on a slightly older version, but still in the 1.3x series.

Parent is 1.37.1:

Every second, Netdata collects 4,250 metrics on xxxxxxxx, presents them in 633 charts and monitors them with 57 alarms.

netdata
v1.37.1

We need the exact setup (parent version, children versions) to try to replicate the problem. I suggest you wait for v1.38.0 (which will be released next week), update all your instances, and see if it helps. Additionally, since you are getting “too many open files”, I suggest you bump the open-file limit for the parent instance.
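
For example, a minimal sketch of a systemd drop-in that raises the file descriptor limit for the parent (the exact value is up to you):

# /etc/systemd/system/netdata.service.d/limits.conf
[Service]
LimitNOFILE=65536

Then run systemctl daemon-reload and restart the netdata service.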

Ah, no problem. I use SaltStack and am able to query all systems easily.

There are exactly 130 child instances at the moment of querying for the version. All but one are using v1.37.1; a single instance that was provisioned only a few days ago is running v1.38.0.

The parent is running v1.37.1

I also noticed that the parent will occasionally cease collecting stats and start throwing alerts like “ram available = 0%” or “mysql galera cluster status = 0 status” even though the child instances are fine.

I’ll upgrade to v1.38.0 and report back on how things go.

EDIT: looks like 1.38.1 is available… updating to that instead.

Oh, and this is all on Ubuntu 20.04.5 LTS on AWS EC2 if it matters.

Not seeing much improvement with v1.38.1. The parent server is locking up, but utilization is very low. I am not getting any “too many open files” errors this time and am unsure where the problem is.

top - 18:14:08 up 11 days,  3:08,  1 user,  load average: 7.30, 9.16, 8.86
Tasks: 377 total,   1 running, 376 sleeping,   0 stopped,   0 zombie
%Cpu0  :  6.1 us,  1.9 sy,  0.0 ni, 70.9 id, 17.5 wa,  0.0 hi,  3.6 si,  0.0 st
%Cpu1  :  5.6 us,  0.7 sy,  0.0 ni, 86.0 id,  7.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu2  :  7.2 us,  2.3 sy,  0.0 ni, 73.8 id, 16.4 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  6.3 us,  1.7 sy,  0.0 ni, 72.6 id, 16.8 wa,  0.0 hi,  2.6 si,  0.0 st
%Cpu4  :  7.6 us,  1.3 sy,  0.0 ni, 83.5 id,  7.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu5  :  6.6 us,  1.3 sy,  0.0 ni, 82.7 id,  9.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu6  :  6.0 us,  1.7 sy,  0.0 ni, 71.5 id, 20.8 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  6.3 us,  1.3 sy,  0.0 ni, 83.7 id,  7.3 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu8  :  7.7 us,  1.7 sy,  0.0 ni, 71.9 id, 18.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  6.7 us,  2.3 sy,  0.0 ni, 73.2 id, 15.4 wa,  0.0 hi,  2.3 si,  0.0 st
%Cpu10 :  7.6 us,  1.7 sy,  0.0 ni, 78.2 id, 12.5 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  6.0 us,  1.3 sy,  0.0 ni, 62.2 id, 30.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  6.1 us,  4.4 sy,  0.0 ni, 55.2 id, 34.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  5.7 us,  1.7 sy,  0.0 ni, 71.7 id, 20.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu14 :  5.6 us,  3.3 sy,  0.0 ni, 49.0 id, 41.1 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu15 :  6.0 us,  2.0 sy,  0.0 ni, 69.5 id, 22.2 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu16 :  5.7 us,  2.0 sy,  0.0 ni, 72.3 id, 17.0 wa,  0.0 hi,  3.0 si,  0.0 st
%Cpu17 :  7.8 us,  1.6 sy,  0.0 ni, 67.0 id, 23.5 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  7.3 us,  1.0 sy,  0.0 ni, 85.3 id,  4.7 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu19 :  5.7 us,  1.7 sy,  0.0 ni, 80.1 id, 12.5 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  7.6 us,  1.0 sy,  0.0 ni, 86.8 id,  4.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  5.7 us,  1.7 sy,  0.0 ni, 66.7 id, 23.3 wa,  0.0 hi,  2.7 si,  0.0 st
%Cpu22 :  5.0 us,  1.7 sy,  0.0 ni, 69.6 id, 23.4 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu23 :  6.7 us,  0.7 sy,  0.0 ni, 85.7 id,  6.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu24 :  8.3 us,  1.0 sy,  0.0 ni, 85.4 id,  5.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 :  7.3 us,  1.3 sy,  0.0 ni, 79.7 id, 11.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 :  6.7 us,  1.7 sy,  0.0 ni, 83.0 id,  7.7 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu27 :  6.1 us,  1.4 sy,  0.0 ni, 80.7 id, 11.8 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 :  7.7 us,  2.0 sy,  0.0 ni, 65.2 id, 23.1 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu29 :  6.3 us,  2.0 sy,  0.0 ni, 57.3 id, 34.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :  5.7 us,  1.7 sy,  0.0 ni, 66.7 id, 25.7 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu31 :  6.0 us,  1.7 sy,  0.0 ni, 79.7 id, 12.7 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 255670.7 total,   1264.8 free, 253165.8 used,   1240.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1070.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                   
    895 netdata  -76   0  255.2g 245.6g   1.0g S 262.5  98.4  55805:55 netdata                                                                                                   
   2088 netdata  -76   0  135832  23272   2112 S   2.3   0.0 365:35.10 apps.plugin                                                                                                                     

Netdata is completely locked up and not responding to requests:

# time curl -I http://localhost:19999
^C

real	99m9.004s
user	0m0.126s
sys	0m0.126s

The volume IO isn’t saturated, but there does seem to be a lot of activity.

Total DISK READ:         6.10 M/s | Total DISK WRITE:       389.61 K/s
Current DISK READ:       6.12 M/s | Current DISK WRITE:     285.92 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                         
   1598 rt/4 netdata   182.24 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1600 rt/4 netdata   257.64 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1601 rt/4 netdata     0.00 B/s   12.57 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1602 rt/4 netdata    37.70 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1603 rt/4 netdata     0.00 B/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1604 rt/4 netdata    18.85 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1607 rt/4 netdata   414.75 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1608 rt/4 netdata     0.00 B/s   34.56 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1609 rt/4 netdata    50.27 K/s   31.42 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1610 rt/4 netdata    53.41 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1611 rt/4 netdata   433.60 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1612 rt/4 netdata   185.38 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1613 rt/4 netdata   311.06 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1614 rt/4 netdata     0.00 B/s   40.85 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1615 rt/4 netdata   323.63 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1616 rt/4 netdata   153.96 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1617 rt/4 netdata     0.00 B/s   25.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1618 rt/4 netdata     0.00 B/s   25.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1620 rt/4 netdata   326.77 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1621 rt/4 netdata   270.21 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1622 rt/4 netdata     3.14 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1623 rt/4 netdata     0.00 B/s   28.28 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1624 rt/4 netdata   166.53 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1625 rt/4 netdata     0.00 B/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1627 rt/4 netdata    91.12 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1628 rt/4 netdata   543.57 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1630 rt/4 netdata   160.24 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1632 rt/4 netdata     9.43 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1633 rt/4 netdata   172.81 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1635 rt/4 netdata   216.80 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1636 rt/4 netdata   235.65 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1637 rt/4 netdata   216.80 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1638 rt/4 netdata     0.00 B/s   84.83 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1639 rt/4 netdata     0.00 B/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1641 rt/4 netdata   113.11 K/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1642 rt/4 netdata   204.23 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1643 rt/4 netdata    21.99 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1644 rt/4 netdata   194.80 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1645 rt/4 netdata     0.00 B/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1647 rt/4 netdata   113.11 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1648 rt/4 netdata   219.94 K/s    0.00 B/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1649 rt/4 netdata     0.00 B/s    3.14 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
   1651 rt/4 netdata     0.00 B/s   12.57 K/s  ?unavailable?  netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]

Underlying volume’s performance does not appear to be the problem either:

# dd if=/dev/zero of=/var/cache/netdata/testfile.bin bs=1M count=1k conv=fdatasync; rm -f /var/cache/netdata/testfile.bin
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.39761 s, 145 MB/s

145 active children (all running v1.38.1) are connected to it as of this moment:

# ss -4tn src :19999 | awk '{print $5}' | awk -F':' '{print $1}' | sort | uniq | wc -l
147

Looks like it’s really just doing nothing?

# strace -p895
strace: Process 895 attached
pause(

Interestingly enough… alerts still seem to be functioning (thank goodness).

Hi, @tuaris. What do you mean by “The parent server is locking up”?

  • Can you disable ML and see if that helps? (See the snippet below this list.)
    • open netdata.conf
    • find the [ml] section
    • uncomment enabled and set it to no
    • restart the Netdata service
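
In other words, the relevant section of netdata.conf should end up looking roughly like this:

[ml]
	enabled = no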

Do you have K8s clients? Or any nodes with ephemeral containers (containers that frequently come and go)?

For example, if I try to curl it, it sits there forever.

I did that in late January 2023, shortly after creating this thread. There was a huge improvement, but I still see the issues I mentioned recently.

No, these are all EC2 instances. Although we provision new ones and de-provision older ones from time to time, they don’t come and go as often as containers would.

The pattern so far appears to be that roughly every 20-30 days the Netdata parent slows to a crawl and stops collecting and responding. Restarting the process seems to resolve things until the next time.

I think for now, as a workaround, I will just set up a cron job to restart the Netdata daemon every night to avoid issues.
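
Something along these lines (a rough sketch; the time of day is arbitrary):

# /etc/cron.d/netdata-restart: restart the Netdata parent nightly at 04:15
15 4 * * * root /usr/bin/systemctl restart netdata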

Perhaps I have hit the upper limits of what a single Netdata parent can handle.

I understand what the issue is. The amount of data that Prometheus needs to scrape from this one Netdata parent is massive. It takes about 140 seconds from query to end of response. I will probably need to move to Netdata pushing instead of having Prometheus pull from Netdata.

In the meantime I’ll increase my scrape interval to 5 minutes and the timeout to 200 seconds. I’m not sure if there is actually anything the Netdata devs can do or recommend to help beyond that.
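
Roughly this in prometheus.yml (a sketch; the job name and target are placeholders):

scrape_configs:
  - job_name: 'netdata-parent'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: ['prometheus_all_hosts']
    scrape_interval: 5m
    scrape_timeout: 200s
    static_configs:
      - targets: ['netdata.server.domain.tld:19999']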

Interesting.

I guess try this and see: Export metrics to Prometheus remote write providers | Learn Netdata.
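
On the Netdata side that would be roughly this in exporting.conf (a sketch; the instance name and destination are placeholders, and the URL path depends on your remote write receiver):

[prometheus_remote_write:my_receiver]
	enabled = yes
	destination = remote-write-receiver.example.com:9090
	# /api/v1/write for Prometheus itself; some adapters use /receive instead
	remote write URL path = /api/v1/write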

Seems like an interesting setup, so keep us posted on how you get on.

I also wonder: if you could pass some sort of filter to /allmetrics, maybe you could set up a few different Prometheus jobs that each scrape a subset, and time or sequence them so they don’t run at the same time. That might help.

I don’t think there is a way to filter /allmetrics based on some sort of query, but if that were a viable solution it could make a useful feature request.

Not sure myself; I need to think a bit more and maybe see what some others on the agent team think.

Oh wait, it seems like you can pass a filter to /allmetrics, so I wonder if you could just split it into two separate Prometheus scrape jobs using a filter or something like that (see the sketch after the Swagger link below).

https://editor.swagger.io/?url=https://raw.githubusercontent.com/netdata/netdata/master/web/api/netdata-swagger.yaml
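
Something like this might do it (a sketch; the filter patterns and target are placeholders, and I haven’t verified how the filter parameter interacts with the prometheus formats):

scrape_configs:
  - job_name: 'netdata-system'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: ['prometheus_all_hosts']
      filter: ['system.*']
    static_configs:
      - targets: ['netdata.server.domain.tld:19999']

  - job_name: 'netdata-apps'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: ['prometheus_all_hosts']
      filter: ['apps.*']
    static_configs:
      - targets: ['netdata.server.domain.tld:19999']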