apps.plugin doesn't seem to do anything

Most of our machines are “standard” but not recognized by kickstarter as anything in particular. (Which is true. They’re all built from source.) And most of them are still arch=i686 rather than x86_64.

It took a little futzing but I didn’t really have any trouble getting netdata compiled. I messed with kickstarter but wound up cloning the repo and building it, which worked better.

The problem I have is that apps.plugin, in particular, doesn’t seem to do anything. It sits in its command loop but never reports any data.

I compiled it with “NETDATA_INTERNAL_CHECKS” defined and that does produce more output at startup, but I’m expecting a steady stream of stats on stdout, and there’s nothing.

Without any of the rest of the daemons running, if I run “apps.plugin 1”, here’s the output after clipping off the timestamp and “comm=apps.plugin source=collector”. Note: This is from an i686 machine, but it behaved the same on x86_64.

level=info tid=3820 thread=MAIN msg="Loaded config file '/opt/netdata/etc/netdata/apps_groups.conf'"
level=info tid=3820 thread=MAIN msg="started on pid 3820"
level=debug tid=3820 thread=MAIN msg="ARAL: 'dict-items' element size 48 (requested 44 bytes), min elements per page 85 (requested 2), max elements per page 1365, max page size 65520 bytes (requested 65536) "
level=debug tid=3820 thread=MAIN msg="ARAL: 'dict-shared-items' element size 16 (requested 12 bytes), min elements per page 256 (requested 2), max elements per page 4096, max page size 65536 bytes (requested 65536) "
FUNCTION GLOBAL "processes" 10 "Detailed information on the currently running processes." "top" "members" 10
level=debug tid=3821 thread=APPS_READER msg="set name of thread 3821 to APPS_READER"
level=debug tid=3822 thread=APPS_WORK[1] msg="set name of thread 3822 to APPS_WORK[1]"

So the data header comes out, but that’s it. And then it just sits there forever. If I hit enter, I get this:

level=notice tid=3821 thread=APPS_READER msg="Received unknown command: (unset)"

so I know it’s alive.

I spent a couple of minutes with gdb and found that it was getting into the heartbeat_next() call in libnetdata/clocks/clocks.c and, as far as I could tell, never returning from that. So some sort of clock problem? It’s vmware, if that matters.

My next step was to instrument that and see where it’s getting stuck, but I thought I’d drop it in here before I burn a bunch more time on this.
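
For reference, here is a rough sketch of what I would expect a heartbeat of this kind to compute. This is an assumption about the general technique (sleep until the next multiple of the update interval), not Netdata's actual heartbeat_next() code, and the function names are made up for illustration. The point is that with a 1-second interval the computed delay should never exceed one second, so anything much longer suggests the clock arithmetic has gone wrong.

```python
USEC_PER_SEC = 1_000_000


def next_tick_delay_us(now_us, tick_us):
    """Microseconds to sleep so the next wake-up lands on a multiple of tick_us.

    Hypothetical helper: illustrates tick alignment, not Netdata's implementation.
    """
    return tick_us - (now_us % tick_us)


def looks_stuck(delay_us, tick_us):
    """Flag a computed delay that is longer than one full tick."""
    return delay_us > tick_us


tick = 1 * USEC_PER_SEC                      # "apps.plugin 1" means a 1-second interval
now = 1_703_700_000_250_000                  # arbitrary example timestamp in microseconds
delay = next_tick_delay_us(now, tick)        # time remaining until the next whole second
one_hour = 3600 * USEC_PER_SEC               # the kind of delay that would indicate a bug
```

Checking the delay against the interval with something like looks_stuck() right before the sleep call would be one cheap way to instrument this.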

(I think debugfs.plugin is doing the same thing but I haven’t dug into that one.)

Hello @eigenstr ,

Welcome to our forum.

It is possible that you are running into a permissions issue, so let me ask a few questions to help narrow it down:

  1. What are the current permissions on apps.plugin? This is the expected result when you compile from scratch:
ls -l /usr/libexec/netdata/plugins.d/apps.plugin 
-rwxr-x--- 1 root netdata 2465872 Dec 27 15:42 /usr/libexec/netdata/plugins.d/apps.plugin*
  1. Do you have normal content inside /proc?
  2. What is your Linux distribution and kernel version?
  3. What is the output when you run the collector in a terminal?
# /usr/libexec/netdata/plugins.d/apps.plugin 2> err >out

Best regards!

crux 11:55:38 SU > ls -l /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin
-rwsr-x--- 1 root netdata 581264 Dec 26 18:09 /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin*

The kernel does have capabilities support, and the setcap calls during install succeed.

crux 12:09:40 SU > getcap /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin
/opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin cap_dac_read_search,cap_sys_ptrace=ep

  1. /proc is perfectly normal. If I strace apps.plugin, the only two things it touches in /proc before going into its idle loop are reading /proc/sys/kernel/pid_max (32768) and /proc/stat (the expected blob of numbers).

  2. The distro is indeterminate, as we build everything from source. It started as Yggdrasil in 1995, though. :slight_smile: We’re running kernel version 6.6.8.

  3. The collector output is as I pasted above in the original post. That’s all there is.

I can provide the output from strace if you want it.

The main thread sits on a call to clock_nanosleep(2) with what looks like a 1-hour timeout.

The APPS_READER thread sits waiting on stdin.

The APPS_WORK[1] thread sits on a call to futex:

futex(0x80dfcf8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished …>

If I’m reading that right, it’s waiting on a realtime-clock timeout, but the timespec passed in is NULL. I’m not sure what should be happening there, but that doesn’t feel right.
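
Thinking about it more: as far as I understand futex(2), a NULL timeout just means "block until woken", which is what a worker parked on a condition variable or semaphore normally looks like in strace. So the worker may simply be waiting for a main loop that never wakes it. A minimal sketch of that pattern, where work_ready and worker are hypothetical names and not taken from apps.plugin:

```python
import threading

work_ready = threading.Event()   # hypothetical signal sent by the main loop
results = []


def worker():
    # wait() without a timeout blocks indefinitely, comparable to a
    # futex wait that is given a NULL timespec.
    work_ready.wait()
    results.append("collected")


t = threading.Thread(target=worker, daemon=True)
t.start()

# As long as the main loop never signals, the worker stays parked.
t.join(timeout=0.2)
still_waiting = t.is_alive()

# Once the main loop signals, the worker runs and finishes.
work_ready.set()
t.join(timeout=2)
finished = not t.is_alive()
```

If that reading is right, the futex call is a symptom and the long clock_nanosleep in the main thread is the thing to chase.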

Hello @eigenstr ,

I am sorry for the delay. Thank you for your answer!
The clock_nanosleep call should wait a few seconds at most, not the long time you are seeing.
Your initial conclusion looks correct, and it is possible that this is something related to i686 hosts that we missed.
If possible, please attach the strace output here or send it to me at thiago@netdata.cloud, whichever works best for you.
I have a Raspberry Pi that is also 32-bit; I will try to compile there and reproduce the issue.

Best regards!

I should point out - I’m not sure I did before - that the x86_64 build does the exact same thing.

I’ll email the traces (i686 and x86_64) since they’re pretty big to paste in here. i686 was compiled with the trace flag set; x86_64 was not.

I can get you access to at least the i686 machine in question if you need it.

Hello @eigenstr ,

Happy new year!
Thank you very much for giving me access to your host. I was able to fix the issue with Pull Request #16710, "Fix plugin running on Xeon i686" (netdata/netdata on GitHub).
I expect we will merge it in the next few hours.

Best regards!

I see that there was another commit, which I’m guessing would address this?

time=2024-01-02T13:54:33.859-05:00 comm=apps.plugin source=collector level=error errno="22, Invalid argument" tid=8477 thread=apps.plugin msg="Cannot nanosleep() for 2326140396 microseconds."

The collector has been running, but if I look at that machine, which is completely idle other than running Netdata, it has a load average of 15.

I killed apps.plugin and ran it with a timeout of 5 and saw the above error.
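
A quick sanity check on that number, as a sketch only. 2326140396 microseconds is about 2326 seconds (roughly 38.8 minutes), which is nowhere near a 5-second interval, and it is also larger than a signed 32-bit integer can hold. nanosleep() returns EINVAL when tv_nsec falls outside 0..999999999 or tv_sec is negative, so either a bad microsecond-to-timespec split or a 32-bit wraparound could produce this. I have not checked which, if either, the plugin actually does; the helper names below are made up for illustration.

```python
USEC_PER_SEC = 1_000_000
NSEC_PER_USEC = 1_000
INT32_MAX = 2**31 - 1


def usec_to_timespec(usec):
    """Split microseconds into (tv_sec, tv_nsec) with tv_nsec below one second."""
    sec, rem_usec = divmod(usec, USEC_PER_SEC)
    return sec, rem_usec * NSEC_PER_USEC


def naive_usec_to_timespec(usec):
    """Hypothetical buggy conversion: everything goes into tv_nsec."""
    return 0, usec * NSEC_PER_USEC


def timespec_is_valid(tv_sec, tv_nsec):
    """Range rule nanosleep() enforces; violating it yields EINVAL."""
    return tv_sec >= 0 and 0 <= tv_nsec <= 999_999_999


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


reported_usec = 2326140396                    # value from the error message
good = usec_to_timespec(reported_usec)        # properly split timespec
bad = naive_usec_to_timespec(reported_usec)   # out-of-range tv_nsec
wrapped = as_int32(reported_usec)             # what a signed 32-bit variable would hold
```

Here good passes timespec_is_valid() while bad does not, and wrapped comes out negative, so both failure modes are consistent with an "Invalid argument" result.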


Hello Todd,

I made a commit earlier today, but I had to step out afterwards.
I am updating the branch and recompiling it. I will monitor CPU usage to verify if there is still an issue. I will update you later today.

Best regards!