I am unable to get the netdata child pod running on a specific node in my k8s cluster. It is stuck in a restart loop because the health check keeps killing it. I’ve checked the logs for error messages, but they are normally empty. Only once did I get the following output, and I couldn’t find anything about it that was relevant to my issue:
Creating docker group 0
addgroup: gid '0' in use
Assign netdata user to docker group 0
Could not add group docker with ID 0, its already there probably
I don’t have docker installed on any nodes in my cluster, so I’m assuming this is something that k8s is doing.
The child is working on the other node in my cluster, and after redeploying multiple times, the issue is isolated to this node. The interesting part is that I had netdata running on this node before, but after I reset the cluster, it stopped working.
Relevant docs you followed/actions you took to solve the issue
Checked the logs, no output besides the previously mentioned error.
Updated helm repo (1.35 came out while I was troubleshooting this, but that update did not fix anything)
Did you check kubectl -n <namespace> describe pod <podname>? Usually the command provides more info and a good starting point to debug an issue with a pod.
Describing the pod does give me some useful information, but not enough to know the root of the problem. It just tells me that the pod is restarting because it’s failing a health check, and that it’s happening only on a specific node. I would still think that there should be something in the logs that gives me a hint, but they are completely blank. I can include some of the output of the command if you want.
I do have them set up to connect to Netdata Cloud, but neither of them shows up in Cloud. The only thing I get is the parent pod, but it always shows as being offline. One odd thing, though: the node this is failing on, which originally worked with Netdata, still shows up in Cloud, but only as stale.
Yes, I was, but increasing them for both the parent and child did not change anything. I did notice in the output of kubectl describe of the child pod that it is exiting with code 143 if that helps any. I’m not able to find anything useful online related to that error code though. I definitely feel like this is somehow related to my node still appearing in Netdata Cloud as stale, since it’s the same node that is having this issue. I don’t have any idea how to troubleshoot that end of it though.
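For reference, exit codes above 128 conventionally mean the process was terminated by a signal, with the signal number being the exit code minus 128. A minimal sketch of that arithmetic follows; the helper name describe_exit_code is made up for illustration and is not part of kubectl or Netdata:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the common 128 + N convention.

    Hypothetical helper for illustration only.
    """
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # e.g. 143 - 128 = 15
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited with application error code {code}"


print(describe_exit_code(143))  # terminated by SIGTERM (signal 15)
```

So code 143 means the container received SIGTERM, which is the signal normally sent when something asks a container to stop. It does not by itself say who sent the signal or why.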
Is that SIGTERM just from it failing the liveness check though? It still doesn’t explain why it fails to start to begin with or why the node is stale in Cloud. I also don’t see any documentation as to what the stale status means beyond a GitHub issue that isn’t very informative. Is there anything I can try to test or run that will help isolate the root or narrow down what might be causing this?
Well, I finally managed to get netdata to run, but definitely not in a desirable configuration. I changed the liveness probe and readiness probe periods to 30000s to see if allowing the container to run would do anything, and after waiting several minutes for it to start up, I managed to get it actually running (and populating Netdata Cloud). The only problem, apart from having to set such a ridiculous configuration to get it to run, is that it has at least 1 thread on the CPU constantly pegged at 100% and occasionally runs the whole CPU at 100% (across all 4 threads). No idea what is causing it to run like this, since I managed to start the container without issue when run outside the deployment. The logs also don’t immediately show anything that would signal a cause to this issue, but at least I got it running.
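To put the probe change in perspective, here is a rough sketch of how liveness probe settings translate into the time a container gets before it is restarted. It assumes every probe fails, and it ignores probe timeouts and kubelet scheduling jitter, so treat the result as an approximation. The Probe class and function name are hypothetical, and the default values are the Kubernetes defaults, not necessarily what the Netdata Helm chart sets:

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # Kubernetes defaults; the Netdata Helm chart may override these (assumption)
    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def seconds_before_restart(probe: Probe) -> int:
    """Approximate time until a never-healthy container is restarted.

    The first probe runs after the initial delay, and each further
    consecutive failure comes one period later.
    """
    return (
        probe.initial_delay_seconds
        + (probe.failure_threshold - 1) * probe.period_seconds
    )


print(seconds_before_restart(Probe()))                      # 20
print(seconds_before_restart(Probe(period_seconds=30000)))  # 60000
```

Since startup took several minutes here, a less extreme option might be to raise the initial delay or failure threshold just enough to cover that window instead of stretching the period to 30000s.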
It would be great if the netdata team could find a solution to the CPU usage, since that is still a problem.