Streaming not working when activating SSL

Environment

Debian 10

Problem/Question

Hi,
I’m trying to set up encrypted streaming between two netdata instances, one acting as a parent and the other as a child.
Everything works as expected when streaming without SSL. But when I enable SSL in the child destination configuration, the child suddenly can’t connect to the parent. It keeps trying to connect unsuccessfully.

Here is a sample of the child’s error.log :

2021-01-13 15:22:42: netdata INFO  : STREAM_SENDER[child] : STREAM slave [send to parent:19996]: connecting...
2021-01-13 15:22:42: netdata INFO  : STREAM_SENDER[child] : STREAM slave [send to parent:19996]: initializing communication...
2021-01-13 15:22:42: netdata ERROR : STREAM_SENDER[child] : SSL cannot connect with the server:  error:00000000:lib(0):func(0):reason(0)

And here is a sample of the parent’s access.log :

2021-01-13 14:30:20: 521: 9620 '[CHILD_IP]:50842' 'CONNECTED'
2021-01-13 14:30:20: 521: 9620 '[CHILD_IP]:50842' 'DISCONNECTED'
2021-01-13 14:30:25: 522: 9620 '[CHILD_IP]:50880' 'CONNECTED'
2021-01-13 14:30:25: 522: 9620 '[CHILD_IP]:50880' 'DISCONNECTED'
2021-01-13 14:30:30: 523: 9620 '[CHILD_IP]:50922' 'CONNECTED'
2021-01-13 14:30:30: 523: 9620 '[CHILD_IP]:50922' 'DISCONNECTED'

My configuration

Port 19999 is used behind an Nginx proxy with basic auth for the dashboard.
Port 19996 is open and directly used for streaming.

I use a letsencrypt certificate for SSL communications but this should not matter as I disabled certificate verification for testing.

Parent’s netdata.conf

[web]
        ssl key = /etc/letsencrypt/live/certificate/privkey.pem
        ssl certificate = /etc/letsencrypt/live/certificate/cert.pem
        bind to = *:19999=dashboard|netdata.conf^SSL=optional, *:19996=streaming^SSL=optional

Parent’s stream.conf

[111...555]
    enabled = yes
    allow from = *
    default history = 3600
    default memory mode = ram
    health enabled = yes
    default postpone alarms on connect seconds = 60

Child’s stream.conf

[stream]
    enabled = yes
    destination = parent:19996:SSL
    ssl skip certificate verification = yes
    api key = 111...555
    timeout seconds = 60
    default port = 19999
    send charts matching = *
    buffer size bytes = 1048576
    reconnect delay seconds = 5
    initial clock resync iterations = 60

Hello @lmoretba ,

Let me ask a few questions to better understand the problem you are having:

1 - Can you access the Netdata dashboard in a browser without any warning?
2 - Does Netdata have permission to access the key and certificate?
3 - What is the OpenSSL version installed on your host?
4 - Finally, when you run the following command, do you see an SSL library linked to netdata?

bash-5.1# ldd /usr/sbin/netdata | grep ssl
        libssl.so.1.1 => /lib64/libssl.so.1.1 (0x00007f696e413000)

Best regards!

Hello @Thiago_Marques_0 , thank you very much for your quick answer!

There is no problem with the Netdata dashboard, I can access it behind the Nginx proxy and it works flawlessly without giving any alert.

I made sure that Netdata has access to the certificate files but this does not change anything (as expected since certificate check is currently disabled for testing).

The OpenSSL version is the same on both servers:

$ openssl version
OpenSSL 1.1.1d  10 Sep 2019

And it seems that libssl is indeed linked to netdata

$ ldd /usr/sbin/netdata | grep ssl
	libssl.so.1.1 => /lib/x86_64-linux-gnu/libssl.so.1.1 (0x00007fa7fd1e0000)

Best regards!

Hello @lmoretba ,

Sorry for the delay; I had to deal with some personal matters and could only now work on Debian.

I have now tested streaming with Netdata on Debian 10.7 with the same OpenSSL version you reported, using an analogous configuration to try to recreate the problem. Unfortunately, I could not reproduce it. After checking my netdata.conf, I remembered that I am using memory mode = ram instead of memory mode = dbengine, and that I also set default memory mode = ram in my stream.conf, to simplify some tests I am doing during development. Could you please change the memory mode on the parent in both netdata.conf and stream.conf, and then restart your Netdata parent? I suspect that your database could be corrupted.

Best regards!

Hello @Thiago_Marques_0 ,

I just tried switching my parent’s netdata.conf to memory mode = ram and stream.conf to default memory mode = ram.
I then tried to restart both instances, but it did not change anything, so this probably rules out the database.

At this point I think that I probably made a silly mistake somewhere on my side or I forgot something. But I’m stuck figuring it out.

I tried temporarily disabling firewalls on both servers but it did not change anything.

I still don’t get why unencrypted communication with destination = parent:19996 works well, but destination = parent:19996:SSL does not.
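
One way to narrow this down without involving Netdata's own streaming client is to attempt a TLS handshake against the streaming port directly. The sketch below assumes the `openssl` command-line tool is installed on the child; `tls_probe` is a hypothetical helper name, and the check only tests whether a handshake completes, not whether the certificate is trusted.

```shell
#!/bin/sh
# Hypothetical helper: attempt a TLS handshake against host:port and
# report whether it completed. Certificate trust is NOT enforced here;
# only the handshake itself is tested.
tls_probe() {
    host="$1"
    port="$2"
    # Redirecting stdin from /dev/null makes s_client exit once the
    # handshake has finished instead of waiting for input.
    if openssl s_client -connect "$host:$port" </dev/null >/dev/null 2>&1; then
        echo "TLS handshake OK on $host:$port"
    else
        echo "TLS handshake FAILED on $host:$port"
        return 1
    fi
}
```

For the setup in this thread, that would be `tls_probe parent 19996`. If the handshake fails here too, the problem is more likely on the parent side (for example, the key or certificate could not be loaded) than in the child's stream.conf.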

Once again, thank you for your quick reply!
Best regards!

Hello,

I remember having a problem with Letsencrypt early in the project: my certificate was not recognized, because OpenSSL could not verify it against the certificates it trusted. If this is your case, you can copy the server certificate to the child machine and set the variable CApath = /etc/ssl/certs/ inside the child stream.conf. I will try to create a certificate using certbot to test whether this is the problem.

As for the configuration, this is what I have in my environment:

On Parent:

My netdata.conf:

[global]
      run as user = thiago
[web]
        ssl key = /etc/netdata/ssl/key.pem
        ssl certificate = /etc/netdata/ssl/cert.pem
        bind to = *=dashboard|registry|streaming|management|netdata.conf|badges *:20000=dashboard|registry|streaming|netdata.conf|badges|management^SSL=optional *:20001=dashboard|registry|streaming|badges|management^SSL=force unix:/tmp/netdata/netdata.sock *:20002=streaming^SSL=optional

I tested using port 20002 and everything worked.
My stream.conf:

[11111111-2222-3333-4444-555555555555]
    enabled = yes
    allow from = *
    health enabled by default = auto
    default memory mode = ram
    default postpone alarms on connect seconds = 60
    update every = 15

On child:

My stream.conf

[stream]
    enabled = yes
    destination = 192.168.0.12:20002:SSL
    ssl skip certificate verification = yes
    api key = 11111111-2222-3333-4444-555555555555

Hi again,

I’ve been working on this again today. I’ve tried different things and finally got it working, but things don’t make much sense anymore.

First, I tried using a self-signed OpenSSL certificate on the parent instead of the letsencrypt certificate.
I set the child to skip certificate verification and it did not work.

Then I added the self-signed certificate to the list of trusted certificates on the child following the instructions here. After that, SSL streaming was finally working.

That made me think that there was a problem with the letsencrypt certificate.

But I still didn’t understand why I had to add the certificate on the child server, and why skip certificate verification = yes would not work.
So I removed the certificate from the child, and now skip certificate verification behaved as expected:
if set to yes, I did not need to have the certificate on the child.

So then out of curiosity I tried using the letsencrypt certificate again, and it worked with skip certificate verification = yes, but the child could not verify it.

So at this point I’m a bit lost and I don’t really know why it did not work initially nor why it’s working now.

I also tried connecting other freshly installed netdata instances as child and they behave identically so the problem is probably on the parent side?

Yesterday I finally found out what went wrong and I feel dumb, but this still raises a few questions.

I basically forgot that with Letsencrypt certificates generated by certbot, your live directory does not contain the actual certificate files, but only symlinks which point to the real files in the archive directory.

So when I set permissions recursively on the live directory, they were not applied to the real files. Now that the permissions are fixed, everything works fine on a fresh install.
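
A quick way to spot this kind of problem is to resolve each configured path to the real file and test whether the current user can read it. This is only a minimal sketch: `check_cert_readable` is a hypothetical helper name, and it assumes a `readlink` that supports `-f` (as GNU coreutils on Debian does).

```shell
#!/bin/sh
# Hypothetical helper: follow symlinks to the real certificate file and
# report whether the user running this script can read it.
check_cert_readable() {
    path="$1"
    # readlink -f resolves every symlink in the path to the final target.
    target=$(readlink -f "$path") || {
        echo "unresolvable: $path"
        return 1
    }
    if [ -r "$target" ]; then
        echo "readable: $target"
    else
        echo "NOT readable: $target"
        return 1
    fi
}
```

To be meaningful, the check has to run as the user Netdata runs as (often `netdata`, but that is an assumption about the install), for example through `sudo -u`. On the parent in this thread it would be called on /etc/letsencrypt/live/certificate/privkey.pem and cert.pem. The directories leading to the real files also need to be traversable by that user, not just the files themselves.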

What made me realize that there might be an issue with permissions is the fact that this OpenSSL error (error:00000000:lib(0):func(0):reason(0)) also happened when I used an incorrect certificate path on purpose.

Now, it seems that when I previously tried using Letsencrypt certificates again just after using self-signed certificates, Netdata still used the self-signed certificates instead of erroring out like it should have done because it didn’t have permissions for Letsencrypt certificates?

Thank you very much @Thiago_Marques_0 for taking the time to investigate my problem on this dumb issue :’)

Hello @lmoretba ,

First, I apologize for the delay; I have been busy these days working on upcoming Netdata features.

I will run some tests with certificates and permissions so that we can give a better error message. Thank you for reporting this error.

You are welcome! It is always a pleasure to help our users! :)

Best regards!