Netdata Cloud - VMware vCenter not reporting to the platform after "successful" installation

Problem/Question:

We were able to “successfully” install a Netdata sensor on our vSphere vCenter node, but it is not reporting in the Netdata Cloud platform. We have created the vsphere.conf file at /opt/netdata/etc/netdata/go.d/vsphere.conf and the go.d.conf file at /opt/netdata/etc/netdata/go.d.conf, and have verified through the go.d debugger that the collector is operational and pulling back the related data. systemctl status netdata reports:

● netdata.service - Real time performance monitoring
   Loaded: loaded (/lib/systemd/system/netdata.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2022-09-28 03:30:55 UTC; 15h ago
  Process: 14161 ExecStartPre=/bin/chown -R netdata:netdata /run/netdata (code=exited, status=0/SUCCESS)
  Process: 14160 ExecStartPre=/bin/mkdir -p /run/netdata (code=exited, status=0/SUCCESS)
  Process: 14158 ExecStartPre=/bin/chown -R netdata:netdata /opt/netdata/var/cache/netdata (code=exited, status=0/SUCCESS)
  Process: 14156 ExecStartPre=/bin/mkdir -p /opt/netdata/var/cache/netdata (code=exited, status=0/SUCCESS)
 Main PID: 14164 (netdata)
    Tasks: 69 (limit: 19660)
   Memory: 272.5M
   CGroup: /system.slice/netdata.service
           ├─14164 /opt/netdata/bin/srv/netdata -P /run/netdata/netdata.pid -D
           ├─14166 /opt/netdata/bin/srv/netdata --special-spawn-server
           ├─14571 /opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin 1
           ├─14578 /usr/bin/python3 /opt/netdata/usr/libexec/netdata/plugins.d/python.d.plugin 1
           ├─14582 /opt/netdata/usr/libexec/netdata/plugins.d/go.d.plugin 1
           └─74354 bash /opt/netdata/usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1

Sep 28 03:30:57 <URL REDACTED> [14561]: Cannot read process groups configuration file '/opt/netdata/etc/netdata/apps_groups.conf'. Will try '/opt/netdata/usr/lib/netdata/conf.d/apps_groups.conf'
Sep 28 03:31:42 <URL REDACTED> sendmail[15497]: NOQUEUE: SYSERR(netdata): can not chdir(/var/spool/clientmqueue/): Permission denied
Sep 28 03:32:08 <URL REDACTED> sendmail[15874]: NOQUEUE: SYSERR(netdata): can not chdir(/var/spool/clientmqueue/): Permission denied
Sep 28 04:20:49 <URL REDACTED> [53415]: Does not have a configuration file inside `/opt/netdata/etc/netdata/ebpf.d.conf. It will try to load stock file.
Sep 28 04:20:49 <URL REDACTED> [53415]: Your environment does not have BTF file /sys/kernel/btf//vmlinux. The plugin will work with 'legacy' code.
Sep 28 04:20:49 <URL REDACTED> [53415]: Name resolution is disabled, collector will not parser "hostnames" list.
Sep 28 04:20:49 <URL REDACTED> [53415]: The network value of CIDR 127.0.0.1/8 was updated for 127.0.0.0 .
Sep 28 04:20:49 <URL REDACTED> [53415]: Cannot read process groups configuration file '/opt/netdata/etc/netdata/apps_groups.conf'. Will try '/opt/netdata/usr/lib/netdata/conf.d/apps_groups.conf'
Sep 28 04:20:49 <URL REDACTED> [53415]: PROCFILE: Cannot open file '/proc/14561/status'
Sep 28 04:20:49 <URL REDACTED> [53415]: Cannot open /proc/14561/status

The /opt/netdata/var/log/netdata/error.log tail shows:

2022-09-28 18:32:50: netdata INFO  : ACLK_Main : Attempting connection now
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Cert Chain verify error:num=20:unable to get local issuer certificate:depth=2:/C=US/O=Internet Security Research Group/CN=ISRG Root X1 (errno 2, No such file or directory)
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : SSL_write Err: SSL_ERROR_SSL
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Couldn't write HTTP request header into SSL connection (errno 22, Invalid argument)
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Couldn't process request
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Error trying to contact env endpoint (errno 22, Invalid argument)
2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Failed to Get ACLK environment
2022-09-28 18:32:50: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.549 seconds

2022-09-28 18:32:51: netdata INFO  : ACLK_Main : Attempting connection now
2022-09-28 18:32:51: netdata LOG FLOOD PROTECTION too many logs (201 logs in 16 seconds, threshold is set to 200 logs in 1200 seconds). Preventing more logs from process 'netdata' for 1184 seconds.
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : found 1 dcs, 123 folders, 1 clusters (0 dummy), 5 hosts, 1127 vms, process took 350.603896ms
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : building : removed 922 vms (not powered on)
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : building : built 1/1 dcs, 123/123 folders, 1/1 clusters, 5/5 hosts, 205/1127 vms, process took 645.283µs
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : hierarchy : set 1/1 clusters, 5/5 hosts, 205/205 vms, process took 64.816µs
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : filtering : filtered 0/5 hosts, 0/205 vms, process took 16.207µs
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : metric lists : collected metric lists for 5/5 hosts, 205/205 vms, process took 43.293µs
2022-09-28 18:36:00: go.d INFO: vsphere[<URL REDACTED>] discovering : discovered 5/5 hosts, 205/1127 vms, the whole process took 351.56398ms
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : found 1 dcs, 123 folders, 1 clusters (0 dummy), 5 hosts, 1127 vms, process took 212.485425ms
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : building : removed 922 vms (not powered on)
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : building : built 1/1 dcs, 123/123 folders, 1/1 clusters, 5/5 hosts, 205/1127 vms, process took 444.154µs
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : hierarchy : set 1/1 clusters, 5/5 hosts, 205/205 vms, process took 55.846µs
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : filtering : filtered 0/5 hosts, 0/205 vms, process took 12.62µs
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : metric lists : collected metric lists for 5/5 hosts, 205/205 vms, process took 49.414µs
2022-09-28 18:41:00: go.d INFO: vsphere[<URL REDACTED>] discovering : discovered 5/5 hosts, 205/1127 vms, the whole process took 213.228886ms

Relevant docs you followed/actions you took to solve the issue

I have other links, but since I’m a new user I’m limited to 5.

Environment/Browser/Agent’s version etc

netdata -W buildinfo shows:

Version: netdata v1.36.0-154-g373c97d3b
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--disable-dependency-tracking' 'CFLAGS=-static -O2 -I/openssl-static/include -pipe' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
    Binary architecture: x86_64
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES
    ACLK:                       YES
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

What I expected to happen

I expected the installation to actually work when it is marked as successful, or for the installer to have error handling around this issue. We are hoping to get this fixed, as vSphere / vCenter host metrics tracking is a huge objective for our team.

Hi there,

2 questions:

  • Do you see any metrics from that node on Netdata Cloud?
  • @ilyam8 requested to see the output of a debug run of the go.d.plugin for vsphere, in case the issue is just with that plugin. Can you provide it here?

Hi,

Thank you for responding so quickly, it means a lot.

We do not see any metrics for the node in Netdata Cloud. Below is all we see.
[Screenshot: Screen Shot 2022-09-29 at 9.46.56 AM]

As for the debug run of the go.d.plugin, this is what we’re seeing. It seems to suggest that go.d.conf and vsphere.conf are correct, except for the agent interrupt I see on line 9:

[ DEBUG ] run[manager] run.go:43 tick 1
[ DEBUG ] run[manager] run.go:43 tick 2
[ DEBUG ] run[manager] run.go:43 tick 3
[ DEBUG ] run[manager] run.go:43 tick 4
[ DEBUG ] run[manager] run.go:43 tick 5
[ DEBUG ] run[manager] run.go:43 tick 6
[ DEBUG ] run[manager] run.go:43 tick 7
q[ DEBUG ] run[manager] run.go:43 tick 8
^C[ INFO  ] main[main] agent.go:104 received interrupt signal (2). Terminating...
[ INFO  ] run[manager] run.go:33 instance is stopped
[ INFO  ] discovery[manager] manager.go:98 instance is stopped
[ INFO  ] build[manager] build.go:108 instance is stopped
[ INFO  ] discovery[file manager] discovery.go:74 instance is stopped
[ INFO  ] vsphere[<URLREDACTED>] job.go:212 stopped
[ INFO  ] main[main] agent.go:137 instance is stopped
netdata [ ~/usr/libexec/netdata/plugins.d ]$ ./go.d.plugin -d -m vsphere
[ DEBUG ] main[main] main.go:113 plugin: name=go.d, version=v0.40.1
[ DEBUG ] main[main] main.go:115 current user: name=netdata, uid=996
[ INFO  ] main[main] agent.go:136 instance is started
[ INFO  ] main[main] setup.go:42 loading config file
[ INFO  ] main[main] setup.go:50 looking for 'go.d.conf' in [/opt/netdata/etc/netdata /opt/netdata/usr/lib/netdata/conf.d]
[ INFO  ] main[main] setup.go:57 found '/opt/netdata/etc/netdata/go.d.conf
[ INFO  ] main[main] setup.go:64 config successfully loaded
[ INFO  ] main[main] agent.go:140 using config: enabled 'true', default_run 'true', max_procs '0'
[ INFO  ] main[main] setup.go:69 loading modules
[ INFO  ] main[main] setup.go:88 enabled/registered modules: 1/74
[ INFO  ] main[main] setup.go:93 building discovery config
[ INFO  ] main[main] setup.go:123 looking for 'vsphere.conf' in [/opt/netdata/etc/netdata/go.d /opt/netdata/usr/lib/netdata/conf.d/go.d]
[ INFO  ] main[main] setup.go:139 found '/opt/netdata/etc/netdata/go.d/vsphere.conf
[ INFO  ] main[main] setup.go:144 dummy/read/watch paths: 0/1/0
[ INFO  ] discovery[manager] manager.go:92 registered discoverers: [file discovery: [file reader]]
[ INFO  ] discovery[manager] manager.go:97 instance is started
[ INFO  ] discovery[file manager] discovery.go:73 instance is started
[ INFO  ] discovery[file reader] read.go:41 instance is started
[ INFO  ] build[manager] build.go:107 instance is started
[ INFO  ] run[manager] run.go:32 instance is started
[ INFO  ] discovery[file reader] read.go:42 instance is stopped
[ DEBUG ] build[manager] build.go:154 received config group ('/opt/netdata/etc/netdata/go.d/vsphere.conf'): 1 jobs (added: 1, removed: 0)
[ DEBUG ] build[manager] build.go:303 building vsphere[<URLREDACTED>] job, config: map[__provider__:file reader __source__:/opt/netdata/etc/netdata/go.d/vsphere.conf autodetection_retry:0 host_include:[/*] module:vsphere name:<URLREDACTED> password:kayO&lNlWRibLf2W40Oc priority:70000 tls_skip_verify:true update_every:20 url:https://10.20.34.10 username:administrator@vsphere.local]
[ DEBUG ] run[manager] run.go:43 tick 0
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:96 discovering : starting resource discovering process
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:104 discovering : found 1 dcs, process took 5.387863ms
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:111 discovering : found 123 folders, process took 15.818147ms
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:118 discovering : found 1 clusters, process took 6.160146ms
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:125 discovering : found 5 hosts, process took 23.659266ms
[ DEBUG ] vsphere[<URLREDACTED>] discover.go:132 discovering : found 5 vms, process took 285.305572ms
[ INFO  ] vsphere[<URLREDACTED>] discover.go:142 discovering : found 1 dcs, 123 folders, 1 clusters (0 dummy), 5 hosts, 1127 vms, process took 336.91217ms
[ DEBUG ] vsphere[<URLREDACTED>] build.go:14 discovering : building : starting building resources process
[ INFO  ] vsphere[<URLREDACTED>] build.go:166 discovering : building : removed 924 vms (not powered on)
[ INFO  ] vsphere[<URLREDACTED>] build.go:25 discovering : building : built 1/1 dcs, 123/123 folders, 1/1 clusters, 5/5 hosts, 203/1127 vms, process took 486.323µs
[ DEBUG ] vsphere[<URLREDACTED>] hierarchy.go:12 discovering : hierarchy : start setting resources hierarchy process
[ INFO  ] vsphere[<URLREDACTED>] hierarchy.go:20 discovering : hierarchy : set 1/1 clusters, 5/5 hosts, 203/203 vms, process took 47.536µs
[ DEBUG ] vsphere[<URLREDACTED>] filter.go:26 discovering : filtering : starting filtering resources process
[ DEBUG ] vsphere[<URLREDACTED>] filter.go:47 discovering : filtering : removed 0 unmatched hosts
[ DEBUG ] vsphere[<URLREDACTED>] filter.go:58 discovering : filtering : removed 0 unmatched vms
[ INFO  ] vsphere[<URLREDACTED>] filter.go:31 discovering : filtering : filtered 0/5 hosts, 0/203 vms, process took 34.301µs
[ DEBUG ] vsphere[<URLREDACTED>] metric_lists.go:16 discovering : metric lists : starting resources metric lists collection process
[ INFO  ] vsphere[<URLREDACTED>] metric_lists.go:32 discovering : metric lists : collected metric lists for 5/5 hosts, 203/203 vms, process took 55.007327ms
[ INFO  ] vsphere[<URLREDACTED>] discover.go:76 discovering : discovered 5/5 hosts, 203/1127 vms, the whole process took 392.809916ms
[ INFO  ] vsphere[<URLREDACTED>] discover.go:9 starting discovery process, will do discovery every 5m0s
[ INFO  ] vsphere[<URLREDACTED>] job.go:191 check success
[ INFO  ] vsphere[<URLREDACTED>] job.go:211 started, data collection interval 20s
[ DEBUG ] run[manager] run.go:43 tick 1
[ DEBUG ] run[manager] run.go:43 tick 2
[ DEBUG ] run[manager] run.go:43 tick 3
[ DEBUG ] run[manager] run.go:43 tick 4

It looks like the agent was claimed (announced itself to the cloud), but can’t connect. I should have actually understood this from the certificate chain thing. I’m not 100% certain, but can you check the post “Certificate verification error connecting to the cloud” to see if it applies?

Then we could look again at whether the plugin does what you expect.
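
For anyone triaging the same symptom, the certificate chain error can be pulled out of error.log mechanically. The Python sketch below is only an illustration: the name parse_cert_chain_error is made up for this example, and the regular expression assumes the line layout seen in the log excerpts in this thread.

```python
import re

# Pattern for the ACLK "Cert Chain verify error" lines quoted in this thread.
# Assumption: other Netdata versions may format this line differently.
CERT_CHAIN_ERROR = re.compile(
    r"Cert Chain verify error:num=(?P<num>\d+):(?P<reason>[^:]+)"
    r":depth=(?P<depth>\d+):(?P<subject>/.*?)(?: \(errno|$)"
)


def parse_cert_chain_error(line):
    """Return verify code, reason, depth and subject, or None if the line does not match."""
    match = CERT_CHAIN_ERROR.search(line)
    if match is None:
        return None
    subject = match.group("subject")
    return {
        "num": int(match.group("num")),
        "reason": match.group("reason"),
        "depth": int(match.group("depth")),
        "subject": subject,
        # The common name is whatever follows the last "/CN=" in the subject, if present.
        "common_name": subject.rsplit("/CN=", 1)[-1] if "/CN=" in subject else None,
    }


sample = (
    "2022-09-28 18:32:50: netdata ERROR : ACLK_Main : Cert Chain verify error:"
    "num=20:unable to get local issuer certificate:depth=2:"
    "/C=US/O=Internet Security Research Group/CN=ISRG Root X1 "
    "(errno 2, No such file or directory)"
)
print(parse_cert_chain_error(sample))
```

For the line from this thread, the parsed fields are verify code 20 ("unable to get local issuer certificate"), reported at depth 2 for the certificate whose subject is ISRG Root X1.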

Hi,

Thanks so much for the response. We were curious about that as well. I’ll go ahead and do a bit more research with my team today to see if we can get to the bottom of this. Things seem a bit different on VMware’s Photon OS compared to the post you linked re: RHEL / CentOS, but it definitely gives us a lot to work off of. Thank you again - I’ll update when we do a bit more digging.

Hi,

You’re absolutely correct about it being a TLS error - but we’re really scratching our heads as to why this is happening. The error we’re seeing in the individual node dashboard’s API is:

	"cloud-enabled": true,
	"cloud-available": true,
	"agent-claimed": true,
	"aclk-available": false,
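
The four flags above can be read as a simple decision chain. The Python sketch below is a rough illustration only: the function name diagnose_cloud_state is hypothetical, and the meaning attached to each flag is a reading of the field names (for example, that "cloud-available" reflects build support), which this thread does not confirm.

```python
import json


def diagnose_cloud_state(info):
    """Map the four cloud-related flags to a short, human-readable status."""
    # Assumption: each flag means what its name suggests; missing keys count as false.
    if not info.get("cloud-enabled", False):
        return "cloud is disabled in the agent configuration"
    if not info.get("cloud-available", False):
        return "cloud support is not available in this build"
    if not info.get("agent-claimed", False):
        return "agent has not been claimed"
    if not info.get("aclk-available", False):
        return "agent is claimed but the ACLK connection is down"
    return "agent is claimed and connected"


# The values reported in this thread.
sample = json.loads(
    '{"cloud-enabled": true, "cloud-available": true,'
    ' "agent-claimed": true, "aclk-available": false}'
)
print(diagnose_cloud_state(sample))
```

With the values from this thread the chain stops at the last check, which matches the earlier reading: the agent was claimed but cannot establish the connection.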

I also looked in the cloud.conf file for any related ACLK configuration issues - but this looks like it’s configured, as you’d expect to see from the API response:

[global]
  enabled = yes
  cloud base url = https://app.netdata.cloud

The error we’re seeing in the error.log is:

2022-09-29 03:33:42: netdata INFO  : ACLK_Main : Attempting connection now
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : Cert Chain verify error:num=20:unable to get local issuer certificate:depth=2:/C=US/O=Internet Security Research Group/CN=ISRG Root X1 (errno 2, No such file or directory)
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : SSL_write Err: SSL_ERROR_SSL
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : Couldn't write HTTP request header into SSL connection (errno 22, Invalid argument)
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : Couldn't process request
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : Error trying to contact env endpoint (errno 22, Invalid argument)
2022-09-29 03:33:42: netdata ERROR : ACLK_Main : Failed to Get ACLK environment
2022-09-29 03:33:42: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 1.024 seconds

But the thing that has us a bit confused is that the vCenter (vSphere) appliance is showing proper certificate status. We have both the cross signed ISRG X1 / R3 certificate, as well as the single signature ISRG X1 certificate installed in vSphere. The other thing that is really odd is that the signing authority for app.netdata.cloud is the exact same - both are identical as far as the Issuer:
  Common Name (CN):         R3
  Organization (O):         Let’s Encrypt
  Organizational Unit (OU): (empty)

Any ideas here? We’re definitely a bit confused with this one.

I’m fairly certain that the answer is in fact in the links I referred you to. Let’s Encrypt did some funky stuff with their CA expiration dates, but it goes over my head.

Does this help? “Preparing Let’s Encrypt SSL Certificates for vCenter, NSX-T Manager and Avi Controller” (VMwire)

Hi,

Thanks so much for all your help. It does in fact look like this is related to our vCenter version and the way that their platform handles certificates. Our current vCenter version is on OpenSSL 1.0.2y-fips, which is affected. Newer versions of OpenSSL don’t have issues with the way that Let’s Encrypt handled the certificate path. We’ll have to upgrade vCenter to the version with the OpenSSL 1.1.1 fix, and then retry.

I’m fairly confident that this will address the issue we’re having, as I was seeing nearly identical behavior with an Ubuntu 16.04 box with OpenSSL 1.0.2g (which was also affected). I’ll update this thread with the fix, since it’s pretty specific to vCenter / vSphere certificate handling and OpenSSL, which is a pretty different beast from a traditional Linux deployment from the looks of it.
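
A quick way to sort hosts into "probably affected" and "probably fine" is to compare the OpenSSL version string against a cutoff. The Python sketch below is a hedged illustration: the function names are made up, and the 1.1.0 cutoff is an assumption drawn from the versions mentioned in this thread (1.0.2g and 1.0.2y affected, 1.1.1 not), not an authoritative rule.

```python
import re

# Assumption: releases older than 1.1.0 are treated as affected by the
# Let's Encrypt chain issue discussed in this thread.
ASSUMED_FIRST_UNAFFECTED = (1, 1, 0)


def parse_openssl_version(text):
    """Extract (major, minor, patch) from strings such as 'OpenSSL 1.0.2y-fips'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def likely_affected(text):
    """True if the version is older than the assumed cutoff."""
    return parse_openssl_version(text) < ASSUMED_FIRST_UNAFFECTED


for version in ("OpenSSL 1.0.2y-fips", "OpenSSL 1.0.2g", "OpenSSL 1.1.1"):
    print(version, "->", "likely affected" if likely_affected(version) else "likely fine")
```

The letter suffix (the "y" in 1.0.2y) is ignored on purpose, since only the numeric part matters for this coarse check.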

Another solution is to set up a parent on a dedicated VM with a newer Linux version and connect the parent to the cloud instead. Your Netdata agent on vCenter would then just stream its metrics to the parent.

We recommend using parents in all production deployments anyway, for data replication and availability during serious incidents.
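
As a rough sketch of what the parent setup involves, the Python snippet below prints the two stream.conf fragments, one for the child (the agent on vCenter) and one for the parent. Everything specific here is an assumption for illustration: the host name parent.example.internal is hypothetical, 19999 is Netdata's default port, the API key is a freshly generated UUID, and the helper names are made up. On this static install the file would presumably live next to the other configs under /opt/netdata/etc/netdata, but check the streaming documentation for your version.

```python
import uuid


def child_stream_conf(parent_host, api_key, port=19999):
    """stream.conf fragment for the child, which sends its metrics to the parent."""
    return (
        "[stream]\n"
        "    enabled = yes\n"
        "    destination = %s:%d\n"
        "    api key = %s\n" % (parent_host, port, api_key)
    )


def parent_stream_conf(api_key):
    """stream.conf fragment for the parent, which accepts children using this key."""
    return "[%s]\n    enabled = yes\n" % api_key


# Hypothetical values: generate one key and use it on both sides.
api_key = str(uuid.uuid4())
print(child_stream_conf("parent.example.internal", api_key))
print(parent_stream_conf(api_key))
```

Only the parent then needs a working connection to Netdata Cloud, so the older OpenSSL on the vCenter appliance stays out of the picture.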

Hi,

Christopher - this was absolutely the perfect suggestion! I was going to ask if this functionality to proxy data existed, but I’m so glad you mentioned it!

vSphere is now being monitored, and this is a significantly easier fix for our team than doing vCenter / vSphere upgrades, as well as certificate fixes due to the way their system is designed. Thank you for all your hard work - we love your product!

Travis