Hosts Unreachable due to ACLK-NG flag being ignored on auto-update

Problem/Question

Good morning Netdata people!

Unfortunately, after many days of working without a problem (see this thread for more information and the actions performed), my hosts are “Unreachable” again this morning.

EDIT: I didn’t want to hijack that thread, hence the new one. Plus, it seems to be a totally new issue.

Once again the reason (I assume) was that the ACLK-NG flag was ignored on the daily automatic update.

Environment/Browser

All are affected

What I expected to happen

Since I have set

aclk implementation = ng

in the netdata.conf file, I was expecting it to be honored by the automatic updates.

That was indeed the case for a while, but today it has reverted again, and the API info shows:

_aclk_ng_available	"false"
_aclk_legacy_available	"true"
_aclk_impl	"Legacy"

Why is this happening?
I am sorry to say it, but I am really frustrated with Netdata not working as it should and doing things on its own… Please excuse my language here, but it doesn’t look very robust :frowning:

EDIT: What I’d like to say here is that a monitoring system like Netdata should be “set it and forget it”. Instead, I always need to keep it in mind and check on it again and again.

Here is the buildinfo:

# netdata -W buildinfo
Version: netdata v1.31.0-105-nightly
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-bundled-lws' '--with-bundled-libJudy' 'CFLAGS=-O2' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
ACLK Next Generation: NO
ACLK Legacy: YES
TLS Host Verification: YES
Libraries:
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
LWS: YES static v3.2.2
mosquitto: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO

Hi @idet2, sorry for the trouble caused. We are currently in the middle of a big development effort on a new protocol for Agent<->Cloud communication. The problem here is that a new dependency, Google Protocol Buffers (protoc, protobuf), is required to build ACLK-NG. I will think about how to sort this out. For now, however, the solution is to install all the required protobuf packages.

We are moving to protobuf-based binary payloads, which have multiple benefits (e.g. decreased traffic between Cloud and Agent, and better synchronization where only differing data is sent).

Therefore protobuf will be a requirement for all Cloud users from the next release on.

I do apologize for the nightly being released without a heads-up.

Absolutely right. Sorry about that; we are working on a big redesign right now (to improve Netdata Cloud on multiple fronts). ACLK-NG is therefore under heavy development (which is why it is in nightlies only).
After the next stable release things will stabilize a lot. Unfortunately, Google Protocol Buffers are a new dependency as of the current nightlies (and the next stable release).

@underhood: I am sorry for being so critical about this, but in my opinion, even for nightly builds, this should never have passed QA/testing and gone into production.
Additionally, you should consider not doing all this “heavy development”, as you call it, on the nightly channel, since that is the “default” setting suggested to users running Netdata.
I understand the meaning of “nightly”, but as presented and suggested (as the default) by Netdata, it comes with no heads-up or other disclosure that it can be extremely error prone.

Having said that, I am on CentOS 8 systems and have installed the following packages:

  • protobuf
  • protobuf-devel
  • protobuf-compiler
  • protobuf-c
  • protobuf-c-compiler
  • protobuf-c-devel
  • protobuf-lite
  • protobuf-lite-devel

These are pretty much all the packages available for “protobuf*”, but I am still not able to install Netdata and get the following error:

Compilation Error

--- Compile netdata ---
[/tmp/netdata-updater-A3ARxVErO0/netdata-v1.31.0-111-g0813070bc]# make -j4
make: *** No rule to make target 'aclk/aclk-schemas/proto/agent/v1/connection.proto', needed by 'aclk/aclk-schemas/proto/agent/v1/connection.pb.cc'. Stop.
FAILED

Fri Jul 9 15:24:29 EEST 2021 : ERROR: FAILED TO UPDATE NETDATA : FAILED TO COMPILE/INSTALL NETDATA

What am I missing?

Are you building from source manually? Missing *.proto files would normally mean missing git submodules. If you are building manually, git submodule update --init --recursive --force should do the trick.
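For a manual build from a git checkout, one way to confirm that diagnosis before rerunning make is to check whether the file named in the error exists in the source tree. The sketch below is only an illustration: the helper names and the idea of passing the checkout directory as an argument are assumptions, while the relative path is the one from the error message above.

```python
from pathlib import Path

# Path taken from the "No rule to make target" error quoted above.
REQUIRED_PROTO = "aclk/aclk-schemas/proto/agent/v1/connection.proto"


def missing_sources(source_dir, required=(REQUIRED_PROTO,)):
    """Return the required files that are absent under source_dir."""
    root = Path(source_dir)
    return [rel for rel in required if not (root / rel).is_file()]


def advice(source_dir):
    """Suggest the submodule command when the schema files are missing."""
    if missing_sources(source_dir):
        return "run: git submodule update --init --recursive --force"
    return "schema files present"
```

Note that this only helps with a git checkout; a source tree unpacked by the updater under /tmp would not have submodule metadata to update.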

To your other comment: yes, even nightlies should always work. That is in fact a policy here. The heavy development is done on other branches, and only parts deemed stable are merged into nightlies. Sometimes something slips in nonetheless :frowning: (that is why the whole dev->nightlies->stable workflow exists).
While most users do not yet care whether they use NG or Legacy (both work for them), you are one of the few who do need ACLK-NG.

Again, after this things should be stable (we won’t be adding any more dependencies). We are also investigating the option of bundling/building protobuf on systems where it is not installed, to make the update smoother.

Hello again @underhood !

I am not building it manually! I am running the netdata-updater, just as the nightly cron job would, in order to update and get a working version again.
EDIT: The only difference from before is that I have installed the additional packages mentioned above.

So what am I missing and what should I do?

The initial installation was performed using the one-line kickstart script.

Based on that, no one should be able to run “ACLK-NG” using a nightly build… That’s why I was really surprised this passed any QA/tests :open_mouth:.

Anyway, I understand that things can go awry sometimes, but as you know, I have spent far more time dealing with Netdata over the last few months than I should have to, hence the frustration.

On the other hand, you guys are doing an excellent job and I shouldn’t be so offensive! I apologize and I’ll try to calm myself down :slight_smile: :relaxed:

By the way, I am all ears and open to suggestions for making the “Legacy ACLK” work properly again, if you would like to go back to the other thread :wink:.

I have switched all servers to the “stable” release and hopefully I will not see problems like this again in the future!

I should have gone with “stable” from the beginning.

Additionally, I have enabled ACLK-NG for it.

I now have

Netdata Build Info

# netdata -W buildinfo
Version: netdata v1.31.0
Configure options: '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=/usr/libexec' '--libdir=/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--with-aclk-ng' 'CFLAGS=-O2' 'LDFLAGS='
Features:
dbengine: YES
Native HTTPS: YES
Netdata Cloud: YES
Cloud Implementation: Next Generation
TLS Host Verification: YES
Libraries:
jemalloc: NO
JSON-C: YES
libcap: NO
libcrypto: YES
libm: YES
tcalloc: NO
zlib: YES
Plugins:
apps: YES
cgroup Network Tracking: YES
CUPS: NO
EBPF: YES
IPMI: NO
NFACCT: NO
perf: YES
slabinfo: YES
Xen: NO
Xen VBD Error Tracking: NO
Exporters:
AWS Kinesis: NO
GCP PubSub: NO
MongoDB: NO
Prometheus Remote Write: NO

I have also added the aclk implementation = ng flag to the cloud section of netdata.conf.
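A minimal sketch for double-checking that setting from a script. It assumes the section header is written `[cloud]` and that options use the `name = value` form quoted in this thread; the function name `read_option` and the sample text are made up for illustration, and this is not Netdata's own parser.

```python
def read_option(conf_text, section, option):
    """Return the value of `option` inside `[section]`, or None if absent."""
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and commented-out entries.
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current == section and "=" in line:
            name, value = line.split("=", 1)
            if name.strip() == option:
                return value.strip()
    return None


# Hypothetical excerpt; a real check would read the netdata.conf file instead
# (assumed to live under /etc/netdata, given --sysconfdir=/etc in the buildinfo).
SAMPLE = """
[global]
    # hostname = example
[cloud]
    aclk implementation = ng
"""

print(read_option(SAMPLE, "cloud", "aclk implementation"))  # prints: ng
```

This only confirms what the file requests. Whether the running agent can honor it depends on what was compiled in.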

Is there a way (by investigating logs or anything else) to determine how I am connected to the cloud (Legacy vs NG)?
If, for example, I want to revert to Legacy, my understanding is that all I have to do is remove the aforementioned flag. But how do I confirm whether the connection is then Legacy or NG?

You can run netdata -W buildinfo, or query localdashboardurl/api/v1/info, which returns a JSON payload; near the bottom there should be an aclk-impl key (it appears as _aclk_impl in the API output quoted in the first post).
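A rough sketch of the API check, using only the Python standard library. It assumes the agent's dashboard answers on http://localhost:19999 (Netdata's usual default; adjust as needed) and searches the whole JSON document for the `_aclk_impl`, `_aclk_ng_available` and `_aclk_legacy_available` keys quoted in the first post, since their exact position in the payload is not stated in this thread. The function names are illustrative.

```python
import json
from urllib.request import urlopen

ACLK_KEYS = ("_aclk_impl", "_aclk_ng_available", "_aclk_legacy_available")


def find_keys(node, wanted):
    """Walk nested dicts/lists and collect the first value seen for each wanted key."""
    found = {}
    stack = [node]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            for key, value in item.items():
                if key in wanted and key not in found:
                    found[key] = value
                stack.append(value)
        elif isinstance(item, list):
            stack.extend(item)
    return found


def aclk_status(base_url="http://localhost:19999"):
    """Fetch /api/v1/info from an agent and return its ACLK-related fields."""
    with urlopen(base_url + "/api/v1/info", timeout=5) as response:
        info = json.load(response)
    return find_keys(info, ACLK_KEYS)
```

Calling `aclk_status()` on a host would return a small dict with whichever of those keys the agent reports, which can then be compared against the setting in netdata.conf.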

The differences between the current stable and the next upcoming stable are:

  • the current stable defaults to Legacy and can have only one implementation compiled in. NG can be selected with an installer (compile-time) flag, as you know
  • the next stable will be able to have both compiled in (as the nightlies do) and makes them switchable with a config flag
  • the next stable will have an additional dependency on Google Protocol Buffers; we are working on an in-installer bundler for systems that do not have protobuf installed
  • the next stable will use NG by default and Legacy as a fallback
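To make the behaviour in the list above concrete, here is a toy model of how a configured preference and the compiled-in implementations might combine. It only illustrates the description in this post (NG preferred, Legacy as fallback) and is not Netdata's actual selection code; the function, its parameters, and the `legacy` config value are invented for the example.

```python
def select_aclk(configured, ng_available, legacy_available):
    """Pick an ACLK implementation from what was compiled in.

    configured: "ng", "legacy" (assumed value), or None when nothing is set.
    Returns "Next Generation", "Legacy", or None if neither was built.
    """
    preference = ["ng", "legacy"]  # assumed default order: NG first
    if configured == "legacy":
        preference = ["legacy", "ng"]
    available = {"ng": ng_available, "legacy": legacy_available}
    labels = {"ng": "Next Generation", "legacy": "Legacy"}
    for name in preference:
        if available[name]:
            return labels[name]
    return None


# The nightly from the first post: NG requested but not compiled in.
print(select_aclk("ng", ng_available=False, legacy_available=True))  # Legacy
```

Under this model, a build without NG ends up on Legacy even with aclk implementation = ng set, which matches the API values reported at the top of the thread.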

Since the current stable has only one ACLK compiled in at a time, and your -W buildinfo says Cloud Implementation: Next Generation, you are using NG.
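The same reasoning can be applied mechanically to the netdata -W buildinfo text. The sketch below handles only the two output layouts quoted in this thread (separate `ACLK Next Generation` / `ACLK Legacy` lines on the nightly, a single `Cloud Implementation` line on stable v1.31.0); other versions may print something different, and the function name is made up.

```python
def compiled_aclk(buildinfo_text):
    """Return the set of ACLK implementations a buildinfo dump reports as built in."""
    fields = {}
    for line in buildinfo_text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            fields[name.strip()] = value.strip()

    built = set()
    # Stable v1.31.0 layout: a single implementation, named directly.
    if "Cloud Implementation" in fields:
        built.add(fields["Cloud Implementation"])
    # Nightly layout: one YES/NO line per implementation.
    if fields.get("ACLK Next Generation") == "YES":
        built.add("Next Generation")
    if fields.get("ACLK Legacy") == "YES":
        built.add("Legacy")
    return built


# Excerpts copied from the two buildinfo dumps in this thread.
NIGHTLY = "Netdata Cloud: YES\nACLK Next Generation: NO\nACLK Legacy: YES\n"
STABLE = "Netdata Cloud: YES\nCloud Implementation: Next Generation\n"

print(compiled_aclk(NIGHTLY))  # {'Legacy'}
print(compiled_aclk(STABLE))   # {'Next Generation'}
```

A result containing a single entry means there is nothing to switch between on that build, whatever netdata.conf says.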

@underhood
OK! Great! Thanks a lot for the info provided.

So, since I compiled the current stable with the NG flag, it is using NG and cannot be switched to Legacy. As far as I understand, this will be configurable from the next stable on.
Thanks!