Not seeing much improvement with v1.38.1. The parent server is locking up, but CPU utilization is very low. I am not getting any "too many open files" errors this time, and I am unsure where the problem is.
top - 18:14:08 up 11 days, 3:08, 1 user, load average: 7.30, 9.16, 8.86
Tasks: 377 total, 1 running, 376 sleeping, 0 stopped, 0 zombie
%Cpu0 : 6.1 us, 1.9 sy, 0.0 ni, 70.9 id, 17.5 wa, 0.0 hi, 3.6 si, 0.0 st
%Cpu1 : 5.6 us, 0.7 sy, 0.0 ni, 86.0 id, 7.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu2 : 7.2 us, 2.3 sy, 0.0 ni, 73.8 id, 16.4 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu3 : 6.3 us, 1.7 sy, 0.0 ni, 72.6 id, 16.8 wa, 0.0 hi, 2.6 si, 0.0 st
%Cpu4 : 7.6 us, 1.3 sy, 0.0 ni, 83.5 id, 7.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu5 : 6.6 us, 1.3 sy, 0.0 ni, 82.7 id, 9.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu6 : 6.0 us, 1.7 sy, 0.0 ni, 71.5 id, 20.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 6.3 us, 1.3 sy, 0.0 ni, 83.7 id, 7.3 wa, 0.0 hi, 1.3 si, 0.0 st
%Cpu8 : 7.7 us, 1.7 sy, 0.0 ni, 71.9 id, 18.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 6.7 us, 2.3 sy, 0.0 ni, 73.2 id, 15.4 wa, 0.0 hi, 2.3 si, 0.0 st
%Cpu10 : 7.6 us, 1.7 sy, 0.0 ni, 78.2 id, 12.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 6.0 us, 1.3 sy, 0.0 ni, 62.2 id, 30.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 6.1 us, 4.4 sy, 0.0 ni, 55.2 id, 34.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 5.7 us, 1.7 sy, 0.0 ni, 71.7 id, 20.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu14 : 5.6 us, 3.3 sy, 0.0 ni, 49.0 id, 41.1 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu15 : 6.0 us, 2.0 sy, 0.0 ni, 69.5 id, 22.2 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu16 : 5.7 us, 2.0 sy, 0.0 ni, 72.3 id, 17.0 wa, 0.0 hi, 3.0 si, 0.0 st
%Cpu17 : 7.8 us, 1.6 sy, 0.0 ni, 67.0 id, 23.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 7.3 us, 1.0 sy, 0.0 ni, 85.3 id, 4.7 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu19 : 5.7 us, 1.7 sy, 0.0 ni, 80.1 id, 12.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 7.6 us, 1.0 sy, 0.0 ni, 86.8 id, 4.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 5.7 us, 1.7 sy, 0.0 ni, 66.7 id, 23.3 wa, 0.0 hi, 2.7 si, 0.0 st
%Cpu22 : 5.0 us, 1.7 sy, 0.0 ni, 69.6 id, 23.4 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu23 : 6.7 us, 0.7 sy, 0.0 ni, 85.7 id, 6.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu24 : 8.3 us, 1.0 sy, 0.0 ni, 85.4 id, 5.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu25 : 7.3 us, 1.3 sy, 0.0 ni, 79.7 id, 11.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 6.7 us, 1.7 sy, 0.0 ni, 83.0 id, 7.7 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu27 : 6.1 us, 1.4 sy, 0.0 ni, 80.7 id, 11.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 : 7.7 us, 2.0 sy, 0.0 ni, 65.2 id, 23.1 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu29 : 6.3 us, 2.0 sy, 0.0 ni, 57.3 id, 34.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 5.7 us, 1.7 sy, 0.0 ni, 66.7 id, 25.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu31 : 6.0 us, 1.7 sy, 0.0 ni, 79.7 id, 12.7 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 255670.7 total, 1264.8 free, 253165.8 used, 1240.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1070.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
895 netdata -76 0 255.2g 245.6g 1.0g S 262.5 98.4 55805:55 netdata
2088 netdata -76 0 135832 23272 2112 S 2.3 0.0 365:35.10 apps.plugin
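To put a single number on the per-CPU iowait figures above, here is a minimal sketch that averages the `wa` column from saved top output. The function name `avg_iowait` and the file name `top_output.txt` are made up for illustration, and it assumes the `%CpuN : ... wa,` layout shown in this paste.

```shell
# Average the "wa" (iowait) value over the per-CPU lines of top output.
# Reads top's text output on stdin; only lines starting with %Cpu are used.
avg_iowait() {
  awk '
    /^%Cpu/ {
      # Find the "wa" label and take the number printed just before it.
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^wa,?$/) { sum += $(i - 1); n++ }
      }
    }
    END { if (n > 0) printf "%.1f\n", sum / n }
  '
}

# Hypothetical usage, with top's output saved to a file first:
# avg_iowait < top_output.txt
```

A high average here, combined with mostly idle `us`/`sy`, would suggest threads waiting on disk rather than doing CPU work.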
Netdata is completely locked up and not responding to requests:
# time curl -I http://localhost:19999
^C
real 99m9.004s
user 0m0.126s
sys 0m0.126s
The volume IO isn’t saturated, but there does seem to be a lot of activity.
Total DISK READ: 6.10 M/s | Total DISK WRITE: 389.61 K/s
Current DISK READ: 6.12 M/s | Current DISK WRITE: 285.92 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1598 rt/4 netdata 182.24 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1600 rt/4 netdata 257.64 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1601 rt/4 netdata 0.00 B/s 12.57 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1602 rt/4 netdata 37.70 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1603 rt/4 netdata 0.00 B/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1604 rt/4 netdata 18.85 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1607 rt/4 netdata 414.75 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1608 rt/4 netdata 0.00 B/s 34.56 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1609 rt/4 netdata 50.27 K/s 31.42 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1610 rt/4 netdata 53.41 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1611 rt/4 netdata 433.60 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1612 rt/4 netdata 185.38 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1613 rt/4 netdata 311.06 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1614 rt/4 netdata 0.00 B/s 40.85 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1615 rt/4 netdata 323.63 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1616 rt/4 netdata 153.96 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1617 rt/4 netdata 0.00 B/s 25.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1618 rt/4 netdata 0.00 B/s 25.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1620 rt/4 netdata 326.77 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1621 rt/4 netdata 270.21 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1622 rt/4 netdata 3.14 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1623 rt/4 netdata 0.00 B/s 28.28 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1624 rt/4 netdata 166.53 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1625 rt/4 netdata 0.00 B/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1627 rt/4 netdata 91.12 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1628 rt/4 netdata 543.57 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1630 rt/4 netdata 160.24 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1632 rt/4 netdata 9.43 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1633 rt/4 netdata 172.81 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1635 rt/4 netdata 216.80 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1636 rt/4 netdata 235.65 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1637 rt/4 netdata 216.80 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1638 rt/4 netdata 0.00 B/s 84.83 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1639 rt/4 netdata 0.00 B/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1641 rt/4 netdata 113.11 K/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1642 rt/4 netdata 204.23 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1643 rt/4 netdata 21.99 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1644 rt/4 netdata 194.80 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1645 rt/4 netdata 0.00 B/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1647 rt/4 netdata 113.11 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1648 rt/4 netdata 219.94 K/s 0.00 B/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1649 rt/4 netdata 0.00 B/s 3.14 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
1651 rt/4 netdata 0.00 B/s 12.57 K/s ?unavailable? netdata -D -P /var/run/netdata/netdata.pid [LIBUV_WORKER]
The underlying volume’s performance does not appear to be the problem either:
# dd if=/dev/zero of=/var/cache/netdata/testfile.bin bs=1M count=1k conv=fdatasync; rm -f /var/cache/netdata/testfile.bin
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.39761 s, 145 MB/s
145 active children (all running 1.38.1) are connected to it as of this moment:
# ss -4tn src :19999 | awk '{print $5}' | awk -F':' '{print $1}' | sort | uniq | wc -l
147
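One caveat on the count above: the pipeline also counts the `ss` header line as an entry, and any other client on port 19999 (a browser, or the `curl` from earlier) is included too. A rough sketch that only counts peers of established connections follows. The function name `count_peers` is made up here, and it assumes IPv4 output in the default `ss -tn` column layout, with the peer address in the fifth column.

```shell
# Count distinct peer IPs among ESTAB connections in `ss -tn` style output.
# Reads ss output on stdin; the header and non-established sockets are skipped.
count_peers() {
  awk '$1 == "ESTAB" { sub(/:[0-9]+$/, "", $5); print $5 }' | sort -u | wc -l
}

# Usage against the live system:
# ss -4tn src :19999 | count_peers
```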
Looks like it’s really just doing nothing?
# strace -p895
strace: Process 895 attached
pause(
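Note that `strace -p` on the process ID attaches only to the main thread, so the `pause(` above likely just shows the main thread idling while the worker threads do the actual work. A minimal sketch for checking what the other threads are doing, assuming a Linux `/proc` filesystem (the function name `thread_states` is made up for illustration):

```shell
# Summarize the scheduler state of every thread in a process (Linux /proc).
# Usage: thread_states PID
thread_states() {
  cat "/proc/$1"/task/*/status 2>/dev/null |
    awk '/^State:/ { sub(/^State:[ \t]*/, ""); print }' |
    sort | uniq -c | sort -rn
}

# thread_states 895
# strace -f -p 895   # -f attaches to all threads of the process, not only the main one
```

Many threads in state `D (disk sleep)` would point at blocked IO, while mostly `S (sleeping)` would point more towards threads waiting on a lock or on each other.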