If my understanding is correct, Netdata probes all the supported services during startup and then monitors those that respond to that initial probe. If one of those services later misbehaves or stops working completely, Netdata recognizes that and creates alerts according to the health configuration.
Following that logic, if a service is not up and running when Netdata starts, it will be ignored, and Netdata assumes that the service isn’t present on that node.
Hence my question: can we provide a list of services that we know should be present and if not Netdata should raise an alert?
As you said, Netdata tries to detect all services that it can monitor. This kind of selection can be achieved using one or more of the following options:
1 - You can disable the plugins for services that you do not want to monitor, because by default Netdata will try to detect them.
2 - If you want to monitor a more specific service, you can define it inside apps_groups.conf; Netdata will then create dimensions for your apps/services.
3 - You can configure an alarm for your application; please take a look at this.
The problem is that we currently have no way to tell Netdata to ‘expect’ a specific collector to actually find a service to collect data from.
Even using the apps.plugin, there’s not a good way right now to reliably handle alerting on a missing service because the apps.plugin code only creates dimensions for groups it actually sees, with the net result being that there is no dimension matching the alarm expression if the service is missing when Netdata starts up.
And, TBH, it’s that startup situation that’s most problematic for potentially missing services that are supposed to be present, because depending on exact circumstances it’s possible (but extremely unlikely) for us to never end up collecting from a given service even if it actually is running.
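To make the apps.plugin point concrete, here is a toy model in Python. It is not Netdata code: the function name `evaluate_alarm`, the result strings, and the group names are all made up for illustration. It only shows why an alarm cannot fire for a group that never got a dimension.

```python
from typing import Mapping


def evaluate_alarm(dimensions: Mapping[str, float], wanted: str,
                   minimum: float = 1.0) -> str:
    """Toy alarm check: is the process count of `wanted` at least `minimum`?

    `dimensions` stands in for the dimensions a chart currently has.
    All names and result strings here are hypothetical.
    """
    if wanted not in dimensions:
        # The group was never seen, so there is nothing to evaluate.
        return "no-data"
    return "ok" if dimensions[wanted] >= minimum else "raised"


# Hypothetical chart state: 'nginx' was running at startup, 'postgres' was not.
seen_at_startup = {"nginx": 4.0}

print(evaluate_alarm(seen_at_startup, "nginx"))     # dimension exists
print(evaluate_alarm(seen_at_startup, "postgres"))  # dimension missing
```

In this model the missing service ends up as "no-data" rather than "raised", which is the gap an explicit list of expected services would close.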
Exactly what @Austin_Hemmelgarn said is my main concern too. The problematic startup situation is probably something that needs some attention. The way it is engineered now is a great out-of-the-box experience, because there is no need for configuration and Netdata works like magic - at least for the most part.
The longer you use it and the more you look behind the scenes, the more you realize that you would want to do some configuration - it’s worth the effort to get more specific results. That’s why I think a list of expected services might be worth considering - I hope.
This is honestly something I would like to see myself. One of the few issues I personally have with Netdata is that I have to start it really late in the boot process on all of my systems so it actually picks up all the services it’s supposed to be monitoring, which translates to potentially missing some things that may otherwise matter, and even then I sometimes have to restart it to pick things up.
Unfortunately, it’s not exactly trivial to implement something like this, as it would require a revision of the API external plugins use to communicate with the core. There’s been some talk about revising this API anyway for a while now for other reasons, but I don’t know when that will actually happen at this point (and even when it does, I probably won’t personally be very involved with it, as most of what I work on is the installation and packaging stuff, and much of the C code in the agent is still an arcane mystery to me).
In general you are right, but there is one small detail: python.d.plugin and go.d.plugin keep job states (basically active/failed):
ilyam@debian-s-1vcpu-1gb-fra1-01:/opt/netdata/var/lib/netdata$ ls -l | grep jobs
-rw-rw---- 1 netdata netdata 44721 Mar 3 09:17 god-jobs-statuses.json
-rw-rw---- 1 netdata netdata 92 Mar 3 09:17 pythond-jobs-statuses.json
Both plugins respect the previous job state on restart. If the job was active, both plugins apply recovery settings: they don’t give up when the check fails, and keep retrying every 30 seconds for 10 attempts (5 minutes).
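The recovery behaviour can be sketched roughly like this. This is a simplified illustration, not the actual python.d.plugin or go.d.plugin code; `retry_check` and its parameters are hypothetical names, and only the 30 seconds / 10 attempts figures come from the description above.

```python
import time
from typing import Callable


def retry_check(check: Callable[[], bool],
                attempts: int = 10,
                interval: float = 30.0,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry a failed job check.

    Assumes the initial check has already failed. Waits `interval`
    seconds before each retry and gives up after `attempts` retries,
    so the defaults cover 10 * 30 s = 5 minutes.
    """
    for _ in range(attempts):
        sleep(interval)
        if check():
            return True   # service came up, job can be started
    return False          # still failing after the recovery window


if __name__ == "__main__":
    # Hypothetical check that succeeds on the third retry; sleeping is
    # stubbed out so the demo finishes immediately.
    results = iter([False, False, True])
    print(retry_check(lambda: next(results), sleep=lambda s: None))
```

The point of passing `sleep` in is only to make the sketch easy to try without waiting five minutes.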
Hence my question: can we provide a list of services that we know should be present and if not Netdata should raise an alert?
We have the httpcheck and tcpcheck collectors, which should be enough to ensure that an HTTP/TCP endpoint is alive.
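For reference, the idea behind such a TCP check against a list of expected services looks roughly like the sketch below. It is a standalone Python illustration, not the collector's implementation or its configuration format; the function names, service names, and ports are assumptions made up for the example.

```python
import socket
from typing import Callable, Dict, List, Tuple


def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return False


def missing_services(expected: Dict[str, Tuple[str, int]],
                     probe: Callable[[str, int], bool] = tcp_alive) -> List[str]:
    """Return the names of expected services whose endpoint is not reachable."""
    return [name for name, (host, port) in expected.items()
            if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical list of services that should be present on this node.
    expected = {
        "nginx": ("127.0.0.1", 80),
        "postgres": ("127.0.0.1", 5432),
    }
    print("missing:", missing_services(expected))
```

Because the endpoints are listed explicitly, a service that was down when the check started still shows up as missing, which is the behaviour the original question asks for.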