How to collect data efficiently with a command tool that takes 1s to collect one sample?

I’m developing a plugin under python.d.plugin of the collectors directory. The class inherits from ExecutableService and executes a command to get the raw data.

Here’s the problem.

The command is hinicadm, and it can only collect metrics for one network port at a time; each collection takes about 1s.
hinicadm currenet_port_traffic -i enp98s0 -t 1
It takes about 1s and collects the metrics of network card enp98s0.
There are 3 network cards and 12 ports in total.
If I collect the metrics serially, it takes about 12s to collect the metrics of all ports. The timestamps are not precise and the collection frequency is really low.

Should I create threads in get_data to collect metrics for the 12 ports in parallel? Will this introduce much overhead?

Any good ideas?


Hey @oleotiger,

The timing couldn’t be better, since we are releasing (today perhaps? @joel) a new python.d guide!

In relation to your question, I think the best route here is to configure 12 different jobs in the configuration yaml of your collector.

So, your collector will collect data using that particular command from any network card, the network card being passed dynamically as a configuration variable. Thus, you will define 12 different jobs, each one passing a different network card.

What python.d does is function as an orchestrator. So, for each collector, it will run as many jobs as are defined in the configuration file, all in parallel.

Taking the example from the documentation:

# module defaults:
update_every : 2
priority     : 20000

local:  # job name
  update_every : 5 # job update frequency
  other_var1   : some_val # module specific variable

other_job:
  priority     : 5 # job position on dashboard
  other_var2   : val # module specific variable

In this configuration, we first define some defaults for the entire module. If we define the same variable inside a job, the job’s value is used in place of the module-wide one.
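
For the hinicadm case, the configuration could look something like the sketch below (the job names and the network_card variable are made up for illustration; your collector would read that variable itself):

# module defaults:
update_every : 5
priority     : 20000

port_enp98s0:              # one job per port
  network_card : enp98s0

port_enp98s0d1:
  network_card : enp98s0d1

# ... one job for each of the remaining ports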

Now, all you have to do to access those variables is:

  1. When initializing the framework class you use, pass the configuration argument: configuration = configuration
  2. Get the configuration variable: self.network_switch = self.configuration.get('<variable_name>', <default_value>). The default value is used if the code does not find that configuration variable in your configuration yaml.
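
Putting those two steps together, here is a minimal sketch of an ExecutableService-based collector reading such a variable (the network_card variable and the command line are assumptions for this thread, not an existing hinicadm collector):

from bases.FrameworkServices.ExecutableService import ExecutableService

ORDER = []   # chart order, omitted in this sketch
CHARTS = {}  # chart definitions, omitted in this sketch


class Service(ExecutableService):
    def __init__(self, configuration=None, name=None):
        # step 1: pass the configuration through to the framework class
        ExecutableService.__init__(self, configuration=configuration, name=name)
        self.order = ORDER
        self.definitions = CHARTS
        # step 2: read the job-specific variable, with a fallback default
        self.network_card = self.configuration.get('network_card', 'enp98s0')
        self.command = 'hinicadm currenet_port_traffic -i {0} -t 1'.format(self.network_card)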

Using our anomalies collector, created by @andrewm4894, as an example:

  1. Access configuration inside the code: netdata/anomalies.chart.py at c527863c5327123a43047bb61e7fdf5325177cbd · netdata/netdata · GitHub
  2. Configuration file: netdata/anomalies.conf at master · netdata/netdata · GitHub

To recap:

Python.d is an orchestrator which runs different modules. Each module can run different instances of itself, which are called jobs. This is used when gathering data from different sources/with different methods but for the same application. You can dynamically define parts of your code using the configuration yaml file.

If you have any further questions, I would be happy to answer!

EDIT: Welcome to our community!


+1 for everything @OdysLam says!

One small addition that I have come up against recently: what if you don’t happen to know in advance all the info you need to create the jobs in the .conf file?

For example, we have GKE nodes streaming to a parent Netdata, and if one dies or goes away for whatever reason, the node’s name changes as a new one comes online, so my conf would not know about it.

In that case, what I did was make a little script, run via a regular cron, that scrapes all the available hosts, regenerates the conf every now and then, and copies it into the relevant folder for Netdata to pick up.

Just mentioning this in case some of the params you need to pass into hinicadm might change over time, such that you need something more like a dynamic config.

Not sure if my approach of a cronjob to recreate the .conf every 30 minutes is the best way to solve this issue, but it worked for me, so just sharing in case it’s relevant here.


OMG I received a :heart: from the creator of Python.d (@ilyam8). Super honoured 🙏


I think that it’s probably time to create a proper guide for my consul-quickstart which is like a production-grade solution for dynamically editing the configuration of Netdata Agents.

@ilyam8 a question that keeps coming back to my mind is whether our collector system can play nice with more modern approaches to concurrency, such as async/await.

Also, slightly off-topic, would be support for virtualenv and poetry.

Hi @oleotiger

Well, you don’t know until you check. But it adds some overhead, yes.


each collecting takes about 1s

Is that a Huawei network device or something? I googled hinicadm: Customized Management Tool hinicadm - MZ731 NIC V100R001 User Guide 03 - Huawei.

Is it possible to gather data via SNMP?

SNMP

Oooh, boy, here we go again :sweat_smile:

It’s really a nice way to collect with multiple jobs.

But my case may be a bit different.

The number and names of the devices change from host to host. If I define all the jobs in a configuration file for every host, I think it would be inflexible and time-consuming.

I found that some plugins (maybe written in C or Go) collect with multiple jobs as well, and their configuration files are dynamically generated during the installation of Netdata.

I think it would be better if I could generate the configuration file, or start multiple jobs dynamically, from Python code (getting the device names dynamically).
Taking a step back, if the configuration files for my plugins could be generated dynamically during the installation of Netdata, I think that could be a nice way as well.

Yep, I would just work out a script to generate the conf file dynamically and then schedule it via cron as regularly as needed.
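
A rough sketch of such a generator script, assuming the ports can be discovered from /sys/class/net and that the collector reads a network_card variable per job (both are assumptions, and the conf path is just an example):

import os

CONF_PATH = '/etc/netdata/python.d/hinic.conf'  # hypothetical collector name

# discover the interfaces; replace with whatever lists the hinicadm-managed ports
ports = sorted(os.listdir('/sys/class/net'))

lines = ['# generated by cron, do not edit by hand', 'update_every: 5', '']
for port in ports:
    lines.append('port_{0}:'.format(port))           # one job per port
    lines.append('  network_card: {0}'.format(port))
    lines.append('')

with open(CONF_PATH, 'w') as conf:
    conf.write('\n'.join(lines))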

I’m pretty sure each time the collector runs it would use the conf from the latest conf file that was updated by the cron job.

It would indeed be cool if there was some notion of a dynamic conf built into Netdata. But the cron-based workaround has been working for me when needed.

You could also consider putting the logic into the collector itself, so it could dynamically change the configuration and charts etc. on the fly. E.g. every n runs, just reinitialize whatever you need. So that would be another option.

The alarms collector does this with the update_chart() function. If it sees new alarms it adds them, and if some alarms go away it removes them. I could imagine something similar as an option for what you need, so long as there is some API or endpoint you can get the info you need from.
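
A rough sketch of that pattern in a python.d collector, using the charts API’s add_dimension (the chart id, the seen_ports set and the gather helper are made up for illustration):

def get_data(self):
    data = self.gather_port_metrics()  # hypothetical helper returning {port: value}

    for port in data:
        # add a dimension on the fly the first time a port shows up
        if port not in self.seen_ports:
            self.seen_ports.add(port)
            self.charts['port_traffic'].add_dimension([port, port, 'absolute'])

    return data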

I have tried to dynamically change the charts at the start of the collector (in the check function), but the key problem is that there is no way to start multiple jobs from check to collect the metrics of different devices at once with a tool that takes about 1s to run. If there were a way to dynamically load charts and dynamically start multiple jobs, it would be better, as I wouldn’t need to change configuration files.

Yeah, I don’t think there is an easy way to manage the jobs from within the check function. So if they really need to be separate jobs (i.e. you cannot just use a longer update_every and have it all be one job that runs every 10 seconds, say, where it might take 5 seconds or so to get the data), then I think the approach of a cron to dynamically update the conf file as needed or at regular intervals is the best solution I can think of (or at least what has worked for me in the past).

I guess you could kick off other threads within the job itself, but I’m not enough of an expert in Python to say whether this is a good idea or not within the setting of a Netdata python collector.

That said, I do see some threading stuff in here: https://github.com/netdata/netdata/blob/master/collectors/python.d.plugin/couchdb/couchdb.chart.py

Maybe that approach could be adapted to what you need as another alternative.

Although @ilyam8 could correct me, I think that the intended model is for every instance to be a single thread, and we leave concurrency to the python.d plugin to manage between different jobs.

@oleotiger have you considered using something like Ansible? You could create a recipe that dynamically creates the correct configuration files, places them at the proper location, and finally starts Netdata. @joel has written a nice quick start guide on deploying Netdata with Ansible:

Another idea is to use Consul as a KV store and then consul-template to dynamically create the configuration files based on that KV store. You can dynamically edit the KV by running a shell script that updates the KV store via curl based on some dynamic conditions.

I have created a small PoC on our Community GitHub, but I intend to come back to it eventually.

Every collector that monitors dynamic elements needs to handle the list of those elements itself. Imagine if we had to reload the configuration every time a new process started or stopped running! So config changes aren’t the answer in such cases.

For SNMP specifically, a proper implementation would work with SNMP WALK. The nodejs thing is obsolete, we need at some point to find a decent Go or C library to replace that thing.

Any data collection for network ports needs to be done in C. What does hinicadm provide that our native collectors don’t? That’s what we should be working on adding to netdata. 1 sec for each interface is insane, we can do much better.

I don’t really understand the “start multiple jobs in check” approach.

Let’s take a step back.

This is what I can get from reading the topic (I could be missing something):

  • we need to collect using the hinicadm tool
  • the source of the metrics is unknown (to me), but it has several modules with ports :smile:
  • hinicadm can’t get metrics for all the ports in one run
  • it takes 1s+ to collect one port’s metrics using hinicadm

Using hinicadm to collect metrics doesn’t look good, which is why I asked about the device - is it a network switch/server? Is it possible to use SNMP to get the metrics?

If not, then we are stuck with hinicadm.

Yes, it is a thread per job. Each job has a get_data method. Python.d.plugin executes the get_data method every update_every (the update interval of the job).

Nothing stops us from using threads in our get_data. @andrewm4894 gave a correct example: the python couchdb collector creates a thread per endpoint (/_active_tasks, /_node/{node}/_system, etc.).


@oleotiger so you can follow the couchdb approach and create a thread per card (or per port). In terms of time it is faster, but it will take more CPU resources (it spawns a lot of threads, and every one of them spawns a process (hinicadm)).
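
As a rough sketch of that thread-per-port idea inside get_data (the port list and command line are placeholders, and the parsing of hinicadm’s output is left out):

from subprocess import Popen, PIPE
from threading import Thread

PORTS = ['enp98s0', 'enp98s0d1']  # hypothetical port names, discovered elsewhere


def _collect_port(port, results):
    # run hinicadm for a single port and keep its raw output, keyed by port name
    proc = Popen(['hinicadm', 'currenet_port_traffic', '-i', port, '-t', '1'],
                 stdout=PIPE, stderr=PIPE)
    out, _ = proc.communicate()
    results[port] = out.decode(errors='ignore')


def get_data():
    # one thread per port, so the ~1s hinicadm calls overlap instead of adding up
    results = {}
    threads = [Thread(target=_collect_port, args=(port, results)) for port in PORTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # python.d expects a flat {dimension_id: value} dict; real parsing of the
    # hinicadm output would go here, this only reports how many ports answered
    return {'ports_collected': len(results)}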

If hinicadm has a loop mode (probe at a specified interval until Ctrl+C), then that is better: no need to create threads on every update_every.

Using async/await is better than threads if you are familiar with the syntax.

I have read all your replies and they have really given me a lot of inspiration.

I will try the method proposed by @andrewm4894 with threads (maybe I’ll try async/await).

Hinicadm is a tool to collect data from network cards (made by Huawei). Restricted by the driver, I have to use hinicadm to collect data, and SNMP is not a possible way.

Hinicadm can be used like hinicadm currenet_port_traffic -i $port_name -t $time. Hinicadm will collect data for port $port_name every 1s for $time seconds. So I can set $time to a large number (maybe 20 years :smile:).

Is there an example written in Python that can show me how to write the get_data function with a tool that generates data in loop mode?

BTW: thank you all for your replies and help.

Hey @oleotiger,

I think what you want is the subprocess.Popen function (Python docs), and then to read periodically from the PIPE.

Here are a couple of examples that I found online:

https://www.endpoint.com/blog/2015/01/28/getting-realtime-output-using-python
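
As a rough sketch of that approach, under the assumption that hinicadm in loop mode prints one sample per line (the command line and the huge -t value are just placeholders):

from queue import Queue, Empty
from subprocess import Popen, PIPE
from threading import Thread

# start hinicadm once in loop mode and keep it running
proc = Popen(['hinicadm', 'currenet_port_traffic', '-i', 'enp98s0', '-t', '630720000'],
             stdout=PIPE, bufsize=1, universal_newlines=True)

samples = Queue()


def _reader():
    # push every line hinicadm prints onto a queue as soon as it appears
    for line in proc.stdout:
        samples.put(line.rstrip())


Thread(target=_reader, daemon=True).start()


def get_data():
    # drain whatever arrived since the last call, without ever blocking
    collected = []
    while True:
        try:
            collected.append(samples.get_nowait())
        except Empty:
            break
    # parsing the lines into metric values is left out; it depends on hinicadm's output format
    return {'samples_since_last_run': len(collected)}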

I know this method, but I’m afraid there are some situations it cannot handle.

For example, if hinicadm hangs for some reason for about 0.5 seconds, then in that second we cannot read any data even though we flush.
In the next second we will read 2 metrics on flush, one of which is the delayed metric from the previous second.

I want to find a more elegant way to read data from a looping Popen subprocess.

I am pretty sure you can read line by line, so you can have an inner loop that iterates over the lines that each read returns.

The thing is that at t1 Netdata will have 0 values and at t2 it will have 2 values, so you have to do something like:
t1 -> 0, t2 -> AVG(values)
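
Building on the loop-mode reader sketch above, that averaging could look roughly like this (still assuming one numeric sample per line, which is an assumption about hinicadm’s output):

from queue import Empty


def get_data():
    values = []
    while True:
        try:
            line = samples.get_nowait()  # samples is the queue from the reader sketch above
        except Empty:
            break
        try:
            values.append(float(line))
        except ValueError:
            continue  # skip lines that are not plain numbers
    if not values:
        return None  # nothing arrived this interval; the chart will show a gap
    # if a delayed sample and the current one arrive together, report their average
    return {'port_traffic': int(sum(values) / len(values))}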