Parent Pod: "Failed to connect to https://app.netdata.cloud, return code 6", Child Nodes Working

Problem/Question

Parent pod is crash looping with:

ls: /var/run/balena.sock: No such file or directory
ls: /var/run/docker.sock: No such file or directory
Unable to communicate with Netdata daemon, querying config from disk instead.
Unable to communicate with Netdata daemon, querying config from disk instead.
Token: ****************
Base URL: https://app.netdata.cloud
Id: [ID]
Rooms: [ROOMS]
Hostname: [PODNAME]
Proxy: 
Netdata user: netdata
Failed to connect to https://app.netdata.cloud, return code 6
Connection attempt 1 failed. Retry in 1s.
Failed to connect to https://app.netdata.cloud, return code 6
Connection attempt 2 failed. Retry in 2s.
Failed to connect to https://app.netdata.cloud, return code 6
Connection attempt 3 failed. Retry in 3s.
grep: /var/lib/netdata/cloud.d/tmpout.txt: No such file or directory
grep: /var/lib/netdata/cloud.d/tmpout.txt: No such file or directory
Failed to claim node with the following error message:"Unknown HTTP error message"

Netdata is running on a MicroK8s (1.22.6) cluster. It is provisioned via Flux using the Helm chart, with the following values:

spec:
  values:
    child:
      claiming:
        enabled: true
        rooms: [ROOMS]
      envFrom:
        - secretRef:
            name: netdata-secrets
    notifications:
      slackurl: [SLACKURL]
    parent:
      alarms:
        storageclass: nfs-hdd
        volumesize: 1Gi
      claiming:
        enabled: true
        rooms: [ROOMS]
      database:
        storageclass: nfs-ssd
        volumesize: 10Gi
      envFrom:
        - secretRef:
            name: netdata-secrets
      livenessProbe:
        failureThreshold: 10
        periodSeconds: 60
        timeoutSeconds: 10
      readinessProbe:
        failureThreshold: 10
        periodSeconds: 60
        timeoutSeconds: 10
    replicaCount: 1
  interval: 1m0s
  releaseName: netdata
  targetNamespace: netdata

The child pods all connect quickly, without any obvious issues. The parent pod is stuck crash looping.

Relevant docs you followed/actions you took to solve the issue

I’ve tried forcing the parent pod onto other nodes in the cluster, disabling database persistence, and extending the probe thresholds (as seen in the config above), but haven’t been able to get the parent pod healthy. The child pods use the same room value as the parent and pull the token from the same Kubernetes secret.

Environment/Browser/Agent’s version etc

Netdata Docker Image Version: v1.33.1

What I expected to happen

Parent pods to start and become healthy.

Hi @jotojoto1324! Welcome!

So the claim script fails to connect to the cloud. Return code 6 comes from curl (or from wget, whichever the script finds), and for curl it is CURLE_COULDNT_RESOLVE_HOST: the host name couldn’t be resolved. Is there an easy way to test that the network where the parent pod lives is OK?

There is a child pod running on the same node as the parent pod, and it doesn’t seem to have any issues. I did notice that the child pods use host networking by default and the parent doesn’t.

I will exec into the parent pod and see if I can resolve the host from there.
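For anyone following along, here is a quick sketch of such a resolution check. It uses `getent hosts`, which is present in most base images; `name.invalid` is used only as a guaranteed-unresolvable demo name (RFC 2606 reserves .invalid), standing in for what a broken lookup of app.netdata.cloud would look like:

```shell
# check() prints whether a hostname resolves; getent exits non-zero on lookup failure
check() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "resolves: $1"
  else
    echo "cannot resolve: $1"
  fi
}

check localhost      # sanity check against a name that should always resolve
check name.invalid   # .invalid never resolves, mimicking the broken-DNS symptom
```

Inside the parent pod, run `check app.netdata.cloud`; if it prints "cannot resolve", the pod’s DNS (typically CoreDNS, via the pod’s /etc/resolv.conf) is the problem rather than Netdata itself.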

I ended up discovering the issue with the help of the Kubernetes docs page “Debugging DNS Resolution”. I resolved it by implementing the fix proposed there:

This can be fixed manually by using kubelet’s --resolv-conf flag to point to the correct resolv.conf (with systemd-resolved, this is /run/systemd/resolve/resolv.conf).

Then I restarted kubelite, then CoreDNS, then the Netdata parent pod. Things look good now. Thanks for the assist (the “couldn’t resolve host” tip was very helpful).
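For reference, on MicroK8s the kubelet flags live in a snap args file, so the sequence above looks roughly like this. This is a sketch, assuming a systemd-resolved host: the paths and service names are MicroK8s defaults, and the parent pod name is a placeholder for this cluster.

```shell
# point kubelet at systemd-resolved's real resolv.conf (per the Kubernetes DNS debugging docs)
echo '--resolv-conf=/run/systemd/resolve/resolv.conf' | \
  sudo tee -a /var/snap/microk8s/current/args/kubelet

# restart kubelite (MicroK8s bundles kubelet into this single daemon)
sudo systemctl restart snap.microk8s.daemon-kubelite

# restart CoreDNS so it picks up the corrected upstream resolvers
microk8s kubectl -n kube-system rollout restart deployment/coredns

# finally, restart the crash-looping parent pod (pod name is a placeholder)
microk8s kubectl -n netdata delete pod netdata-parent-0
```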

Thank you for the follow up!! Glad you got it solved!