Dogfooding feedback on a k8s cluster running Apache Airflow

I am going to share feedback as a user of Netdata monitoring our own internal Apache Airflow k8s cluster running in AWS, which is responsible for most of our internal ETL data pipelines.

1. Node count keeps growing and kills the parent

It seems that Airflow has one long-lived scheduler node and then creates multiple worker nodes each day as jobs come and go and workers need to be spun up and down.

The end result is that worker nodes that have long since gone away, never to return, end up being kept around in NC and on the Netdata parent.

E.g. I cleaned this up a week or so ago and you can see I already have 391 offline nodes:


These nodes have done their job and are never coming back. Ideally I would have a way to tell both the Netdata parent and NC itself this fact and have them just naturally "expire" such nodes after some period of being offline, maybe based on some tags I set against them.
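As far as I know Netdata doesn't expose this today, but the rule I have in mind is easy to state. A minimal Python sketch, using entirely hypothetical node records and a made-up "ephemeral" tag to mark nodes expected to come and go:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical node records: name, when the parent last saw them, and tags.
nodes = [
    {"name": "airflow-scheduler",
     "last_seen": datetime.now(timezone.utc),
     "tags": {"role": "scheduler"}},
    {"name": "airflow-worker-42",
     "last_seen": datetime.now(timezone.utc) - timedelta(days=14),
     "tags": {"ephemeral": "true"}},
]

def expired(node, ttl=timedelta(days=7), now=None):
    """A node tagged ephemeral counts as expired once offline longer than ttl."""
    now = now or datetime.now(timezone.utc)
    return node["tags"].get("ephemeral") == "true" and now - node["last_seen"] > ttl

# The long-lived scheduler survives; the stale worker is dropped.
keep = [n["name"] for n in nodes if not expired(n)]
print(keep)  # ['airflow-scheduler']
```

The point being that the TTL and which nodes it applies to are both under my control via tags, so a scheduler that goes offline unexpectedly still raises alarms rather than silently vanishing.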

2. Node names in NC don't mean much to me

It seems that by default the node name in NC is the internal host IP in AWS. The problem is this means nothing to me. Looking at the screenshot below I see two nodes; I know one is the scheduler and one is the worker, and I'd like an easy way to name them as such. Maybe some way to set a custom name in the helm chart of the agent, or even in NC itself based on host labels. E.g. even just adding a prefix or suffix would be good.
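Netdata agents do support a `[host labels]` section in netdata.conf, so one route would be labelling each role there; a sketch, assuming the helm chart gives us a way to inject different netdata.conf content for the scheduler and worker pods (which I haven't verified):

```ini
# netdata.conf on the scheduler node (hypothetical label names)
[host labels]
    role = airflow-scheduler
    team = data-eng
```

NC would then just need to let me display or rename nodes based on those labels rather than the host IP.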

For example, on AWS I see that aws:autoscaling:groupName is something like eks-single-az-main-2.... for the main scheduler node and eks-single-az-wrk-c2xlarge-2..... for the workers. It would be cool if NC had these labels and I could add a prefix or suffix onto the node name based on something like eks-*-wrk-* mapping to "airflow-worker", and similar for the scheduler.
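The mapping I'm imagining is just a small, ordered set of pattern rules. A Python sketch with hypothetical rule names and fabricated ASG suffixes (the real group names are truncated above and I'm not reproducing them):

```python
import re

# Hypothetical rules: a pattern over the autoscaling group name and the
# friendly name to apply. First match wins, so order matters.
RULES = [
    (re.compile(r"^eks-.*-wrk-.*"), "airflow-worker"),
    (re.compile(r"^eks-.*-main-.*"), "airflow-scheduler"),
]

def friendly_name(asg_name, fallback="unknown"):
    """Map an ASG name to a human-friendly node name, else a fallback."""
    for pattern, name in RULES:
        if pattern.match(asg_name):
            return name
    return fallback

print(friendly_name("eks-single-az-wrk-c2xlarge-2abc"))  # airflow-worker
print(friendly_name("eks-single-az-main-2abc"))          # airflow-scheduler
```

Something equivalent, configured inside NC against the AWS labels, would cover my case and probably many others.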

3. Parent node was crashing, I think due to the bookkeeping for thousands of offline nodes

So we don't actually have a parent right now, which really limits the metrics to just the last few hours on each agent - I believe that is the default retention in the helm chart.
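If we stand the parent back up, retention is mostly a netdata.conf tuning exercise. A rough sketch of the kind of thing I'd expect to change, hedged because the exact option names vary between Netdata versions (check the netdata.conf your build generates):

```ini
[db]
    mode = dbengine
    # Give the dbengine more disk so the parent keeps history for all
    # child nodes; option name may differ on newer versions.
    dbengine multihost disk space MB = 1024
```

The offline-node bookkeeping problem from point 1 would still need fixing first, though, or the parent will presumably fall over again.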

4. Airflow statsd metrics were not picked up by default

Airflow emits lots of custom statsd metrics by default: see "Metrics" in the Airflow documentation.

For some reason Netdata did not pick these up. I'm assuming it's some additional config we need to add in the helm chart, but I'm curious to understand it a bit more and see if there are any improvements we can make there.
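On the Airflow side, the statsd emitter has to be switched on and pointed at whatever is listening on the node; Netdata's built-in statsd server listens on UDP port 8125 by default. A sketch of the Airflow half (these options are from Airflow's [metrics] section; the Netdata/helm half is the part I'm unsure about, e.g. whether statsd_host should target the node's agent pod):

```ini
# airflow.cfg
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```

If Airflow is already emitting and Netdata still shows nothing, then the gap is presumably on the Netdata listener or k8s networking side, which is what I'd like help understanding.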

5. Ability to create some sort of "virtual" or "derived" tag in NC

The ability to create some sort of "virtual" tag in NC, defined as logic aggregating existing tags, would be very useful. Here is a concrete example. In the gif below you can see multiple "metrics…" time series; these stem from multiple ETL DAGs and tasks relating to internal business metrics pipelines kicking off around the same time. It would be very powerful to be able to group these metrics in a custom way based on the fact they all start with "metrics…", or any other custom logic I might want to apply on the fly from within NC.
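To make the idea concrete, here is a tiny Python sketch of the derived-tag logic I'd want NC to let me express, using fabricated series names (the real ones are elided in the gif):

```python
from collections import defaultdict

# Hypothetical chart/series names as they might appear in NC.
series = [
    "metrics.orders.duration",
    "metrics.revenue.duration",
    "metrics.signups.duration",
    "scheduler.heartbeat",
]

def derive_tag(name):
    """Custom grouping logic: everything starting with 'metrics.' collapses
    into one derived group; anything else keeps its first path segment."""
    return "business-metrics" if name.startswith("metrics.") else name.split(".")[0]

groups = defaultdict(list)
for s in series:
    groups[derive_tag(s)].append(s)

print(dict(groups))
```

In NC this would ideally be a saved expression (prefix match, regex, or combining existing tags) that then behaves like any other tag for grouping and filtering.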