1hour_memory_hw_corrupted

OS: Linux

The Linux kernel keeps track of the state of system memory. You can find the values it tracks in
the proc man pages 1, under the /proc/meminfo section. One of the values the kernel reports is HardwareCorrupted: the amount of memory, in kibibytes (1 KiB = 1024 bytes), that the hardware has identified as having physical corruption problems and that the kernel has set aside so it is not used.
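If you want to check the value the Agent is reading, a minimal sketch of parsing it out of /proc/meminfo might look like the following. It assumes the standard `HardwareCorrupted: <value> kB` line described in the proc man pages; the function name and script layout are just illustrative.

```python
#!/usr/bin/env python3
"""Read the HardwareCorrupted value from /proc/meminfo."""

def hardware_corrupted_kib(path="/proc/meminfo"):
    # /proc/meminfo lines look like "HardwareCorrupted:     0 kB";
    # the value is reported in kibibytes despite the "kB" suffix.
    with open(path) as meminfo:
        for line in meminfo:
            if line.startswith("HardwareCorrupted:"):
                return int(line.split()[1])
    return None  # line absent, e.g. kernel built without memory-failure support

if __name__ == "__main__":
    value = hardware_corrupted_kib()
    print(f"HardwareCorrupted: {value} KiB")
```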

The Netdata Agent monitors this value. This alert indicates that memory has been marked as corrupted due to a
hardware failure. While the error is most often caused by a failing RAM chip, it can also be caused
by incorrect seating or poor contact between the memory module and its socket.

References and Sources
  1. proc man pages
  2. memtester homepage

Troubleshooting section:

Verify a bad memory module

Most of the time, uncorrectable errors will cause your system to panic and reboot or shut down. If that
has not happened, it means the tolerance level is high enough that the system did not go into
panic. You should still identify the defective module immediately.

  1. memtester is a userspace utility for testing the memory subsystem for faults. It is portable and
    should compile and work on any 32-bit or 64-bit Unix-like system. For hardware developers, memtester
    can be told to test memory starting at a particular physical address (memtester v4.1.0+). 2 A minimal way to run it from a script is sketched below.
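A small sketch of driving memtester from Python, assuming the utility is installed. The size and loop count (1024 MB for 3 loops) are purely illustrative and must fit in currently free memory; memtester generally needs root so it can lock the pages it tests.

```python
import subprocess

# Illustrative parameters: test 1024 MB of RAM for 3 loops.
# memtester normally requires root privileges to mlock() the test region,
# and the requested size must fit in currently free memory.
result = subprocess.run(
    ["memtester", "1024M", "3"],
    capture_output=True,
    text=True,
)

print(result.stdout)
# memtester exits non-zero when a test fails or it cannot run at all.
if result.returncode != 0:
    print("memtester reported a problem; review the output above")
```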

You may also receive this error as a result of incorrect seating or poor contact between the
socket and the RAM module. Check both before you consider replacing the RAM module.

We have a handful of Dell / AMD EPYC servers where HardwareCorrupted sometimes goes to 4 KiB after several days of operation. The value has never jumped above 4 KiB, and neither the iDRAC nor Linux reports any ECC errors, correctable or uncorrectable. A reboot clears the issue, but HardwareCorrupted jumps back up to 4 KiB after weeks or months. The systems have otherwise been completely stable.

The fact that the iDRAC doesn’t report the error makes it particularly hard to figure out which DIMM/slot is the culprit. We could play games with pulling the servers and grinding through sets of DIMMs, but that’s a lot of work and downtime for the servers.

The behavior is making me wonder whether it’s actually a hardware error at all. I’m still trying to think of a course of action that doesn’t involve lengthy offline testing to narrow down the issue, especially since we haven’t seen any operational impact.

Has anyone else run into this? Any ideas on how to narrow down the issue without lengthy offline testing?

For now we’ve raised the warning threshold for this alert, but it’s going to bug me.


What units is $this in? Can I use my own unit in the expression, like $this > 4KiB?