Unable to start Netdata child in k8s

Suggested template:

Problem/Question

I am unable to get the netdata child pod running on a specific node in my k8s cluster. It is stuck in a restart loop because the health check keeps killing it. I’ve checked the logs for error messages, but they are usually empty; one time I got the output below, but I couldn’t find anything relevant to my issue:

Creating docker group 0
addgroup: gid '0' in use
Assign netdata user to docker group 0
Could not add group docker with ID 0, its already there probably

I don’t have docker installed on any of the nodes in my cluster, so I’m assuming this is something that k8s is doing.
The child is working on the other node in my cluster, and after redeploying multiple times, the issue is isolated to this one node. The interesting part is that I had Netdata running on this node before; after I reset the cluster, it stopped working.

Relevant docs you followed/actions you took to solve the issue

  • Checked the logs, no output besides the previously mentioned error.
  • Updated helm repo (1.35 came out while I was troubleshooting this, but that update did not fix anything)
  • Cleared persistent volume contents

Environment/Browser/Agent’s version etc

kubectl version --short output:

Kustomize Version: v4.5.4
Server Version: v1.24.1

helm show chart netdata/netdata output:

appVersion: v1.34.1
description: Real-time performance monitoring, done right!
home: https://netdata.cloud/
icon: https://netdata.github.io/helmchart/logo.png
keywords:
- alerting
- metric
- monitoring
maintainers:
- email: ilya@netdata.cloud
  name: Ilya Mashchenko
- email: knatsakis@netdata.cloud
  name: Konstantinos Natsakis
- email: mansour@netdata.cloud
  name: Mansour Behabadi
- email: vryumin@gmail.com
  name: Vladimir Ryumin
- email: karuppiah7890@gmail.com
  name: Karuppiah Natarajan
- email: kamikaze.ua@gmail.com
  name: Oleksii Kravchenko
name: netdata
sources:
- https://github.com/netdata/helmchart
- https://github.com/netdata/netdata
type: application
version: 3.7.18

What I expected to happen

Netdata child should start

I remember seeing something similar before: Starting netdata official container fails with `addgroup: gid '999' in use` · Issue #6251 · netdata/netdata · GitHub

I don’t understand all of that though, so I will ask someone else to help.

Hi, @neboman11.

Could not add group docker with ID 0, its already there probably

That is not a fatal error. These logs come from Netdata’s entrypoint script.

  • Did you check kubectl -n <namespace> describe pod <podname>? Usually that command provides more information and is a good starting point for debugging an issue with a pod.
  • Do you connect the child nodes to Netdata Cloud?

Describing the pod does give me some useful information, but not enough to find the root cause. It just tells me that the pod is restarting because it’s failing a health check, and that this happens only on that one node. I would still expect something in the logs to give me a hint, but they are completely blank. I can include some of the command’s output if you want.

I do have them set up to connect to Netdata Cloud, but neither of them shows up in Cloud. The only thing I get is the parent pod, and it always shows as offline. The weird thing is that the failing node, which was originally working with Netdata, still shows up in Cloud, but only as stale.

Are you using the default liveness/readiness probe timeouts? If so, I’d try increasing both of them (e.g. timeoutSeconds 1 -> 10).
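With the netdata helm chart, that would be a values override along these lines. This is only a sketch; the key paths are assumed, so verify them against `helm show values netdata/netdata` for your chart version:

```yaml
# Hypothetical values override for the netdata chart; key paths are
# assumed here, check them with `helm show values netdata/netdata`.
child:
  livenessProbe:
    timeoutSeconds: 10   # chart default is 1
  readinessProbe:
    timeoutSeconds: 10
parent:
  livenessProbe:
    timeoutSeconds: 10
  readinessProbe:
    timeoutSeconds: 10
```

You would apply it with something like `helm upgrade netdata netdata/netdata -f probe-timeouts.yaml`.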

Yes, I was, but increasing them for both the parent and the child did not change anything. I did notice in the kubectl describe output for the child pod that it is exiting with code 143, if that helps. I’m not able to find anything useful online related to that error code, though. I definitely feel like this is somehow related to my node still appearing in Netdata Cloud as stale, since it’s the same node that is having this issue. I don’t have any idea how to troubleshoot that end of it, though.

I think you can interpret the 143 exit code as 128 + 15, and 15 is SIGTERM.
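A minimal shell sketch of that 128 + N convention, outside of Kubernetes:

```shell
# A process killed by SIGTERM (signal number 15) exits with status
# 128 + 15 = 143, the same code kubelet reports after it terminates
# a container that failed its liveness probe.
sleep 30 &            # stand-in for the container's main process
pid=$!
kill -TERM "$pid"     # what kubelet effectively sends on probe failure
wait "$pid"           # wait surfaces the exit status of the killed job
status=$?
echo "exit status: $status"
```

So exit code 143 by itself only tells you the process was asked to terminate, not why.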

Is that SIGTERM just from failing the liveness check, though? It still doesn’t explain why it fails to start in the first place, or why the node is stale in Cloud. I also don’t see any documentation on what the stale status means beyond a GitHub issue that isn’t very informative. Is there anything I can try to test or run that would help isolate the root cause, or narrow down what might be causing this?