Kernel Panic Errors Plaguing Our Intel Servers


I hope this message finds you well. We are currently grappling with a critical issue that’s causing major disruptions to our Intel server operations – kernel panic errors. We’re reaching out to tap into the collective expertise and experience of this amazing community to help us find a solution.

For the past few days, our Intel server infrastructure has been hit by recurring kernel panic errors. These errors are leading to sudden system crashes, bringing our services to a grinding halt. The server logs provide little insight into the root cause, and our attempts to troubleshoot have been inconclusive so far.

Here is what we have tried so far:

Reviewed system logs: We’ve examined the system logs closely, but they haven’t given a clear picture of what’s triggering these errors.

Hardware diagnostics: We ran comprehensive hardware diagnostics to rule out any underlying issues with components like RAM, storage, and power supplies.

Software updates: We made sure that all software components, including the kernel and drivers, are up to date.

Thank you for taking the time to read this post, and we eagerly await your responses.

Hi, @devinmarco. Does it have something to do with Netdata? Do you mean that the problem only occurs when Netdata is running?

Hello @devinmarco,

Your message did not specify which Linux distribution or kernel version you are running, but when I worked with Lenovo years ago, they were running Ubuntu.

Do you see any messages reported when you run dmesg?
Are any core dumps listed when you run coredumpctl list?
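
In case it helps, here is a minimal sketch of how that information could be gathered into one file to attach to a reply. It assumes a systemd-based distribution with the util-linux version of dmesg. The function name collect_panic_info and the default file name panic-report.txt are made up for illustration. The journalctl query for the previous boot is an extra suggestion rather than something asked for above: messages from a boot that ended in a panic are no longer in the current dmesg buffer, and the query only returns data if the journal is stored persistently.

```shell
#!/bin/sh
# Hypothetical helper: gather basic crash context into a single report file.
# Every step tolerates failure, so a missing tool or missing permission
# leaves a note in the report instead of aborting the script.
collect_panic_info() {
    report_file="${1:-panic-report.txt}"
    {
        echo "== kernel =="
        uname -r

        echo "== distribution =="
        cat /etc/os-release 2>/dev/null || echo "(no /etc/os-release found)"

        echo "== dmesg, errors and worse (current boot) =="
        dmesg --level=emerg,alert,crit,err 2>&1 \
            || echo "(dmesg failed; it may need root)"

        echo "== core dumps =="
        if command -v coredumpctl >/dev/null 2>&1; then
            coredumpctl list --no-pager 2>&1 \
                || echo "(coredumpctl returned no entries)"
        else
            echo "(coredumpctl is not installed)"
        fi

        echo "== kernel log of the previous boot, last 200 lines =="
        if command -v journalctl >/dev/null 2>&1; then
            journalctl -k -b -1 -n 200 --no-pager 2>&1 \
                || echo "(no journal for the previous boot; it may not be persistent)"
        else
            echo "(journalctl is not installed)"
        fi
    } > "$report_file"
}
```

Running something like collect_panic_info /tmp/panic-report.txt as root shortly after the server comes back up should produce a file that answers the distribution, kernel, dmesg, and core dump questions in one place.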

Before our last release, we merged a new plugin module related to the power supply, but it only parses some files. I am not sure whether it is related to the issue you are reporting.

Best regards!