1hour_memory_hw_corrupted

OS: Linux

The Linux kernel keeps track of the state of system memory. You can find the values it tracks in
the proc man pages 1, under the /proc/meminfo section. One of the values the kernel reports is HardwareCorrupted: the amount of memory, in kibibytes (1 KiB = 1024 bytes), that the hardware has identified as having physical corruption problems and that the kernel has set aside so it is not used.
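If you want to check the value the Agent is reading, a minimal sketch of parsing it out of /proc/meminfo might look like the following. It assumes the standard `HardwareCorrupted: <value> kB` line described in the proc man pages; the function name and script layout are just illustrative.

```python
#!/usr/bin/env python3
"""Read the HardwareCorrupted value from /proc/meminfo."""

def hardware_corrupted_kib(path="/proc/meminfo"):
    # /proc/meminfo lines look like "HardwareCorrupted:     0 kB";
    # the value is reported in kibibytes despite the "kB" suffix.
    with open(path) as meminfo:
        for line in meminfo:
            if line.startswith("HardwareCorrupted:"):
                return int(line.split()[1])
    return None  # line absent, e.g. kernel built without memory-failure support

if __name__ == "__main__":
    value = hardware_corrupted_kib()
    print(f"HardwareCorrupted: {value} KiB")
```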

The Netdata Agent monitors this value. This alert indicates that memory has been marked as corrupted due to a
hardware failure. While the error is most often caused by a failing RAM chip, it can also be caused
by incorrect seating or poor contact between the memory module and its socket.

References and Sources
  1. proc man pages
  2. memtester homepage

Troubleshooting section:

Verify a bad memory module

Most of the time, uncorrectable errors will cause your system to panic and reboot or shut down. If that
has not happened, it means the tolerance level is high enough that the system did not go into
panic. You should still identify the defective module immediately.

  1. memtester is a userspace utility for testing the memory subsystem for faults. It is portable and
    should compile and work on any 32-bit or 64-bit Unix-like system. For hardware developers, memtester
    can be told to test memory starting at a particular physical address (memtester v4.1.0+). 2 A minimal way to run it from a script is sketched below.
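A small sketch of driving memtester from Python, assuming the utility is installed. The size and loop count (1024 MB for 3 loops) are purely illustrative and must fit in currently free memory; memtester generally needs root so it can lock the pages it tests.

```python
import subprocess

# Illustrative parameters: test 1024 MB of RAM for 3 loops.
# memtester normally requires root privileges to mlock() the test region,
# and the requested size must fit in currently free memory.
result = subprocess.run(
    ["memtester", "1024M", "3"],
    capture_output=True,
    text=True,
)

print(result.stdout)
# memtester exits non-zero when a test fails or it cannot run at all.
if result.returncode != 0:
    print("memtester reported a problem; review the output above")
```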

You may also receive this error as a result of incorrect seating or poor contact between the
socket and the RAM module. Check both before you consider replacing the RAM module.

We have a handful of Dell / AMD EPYC servers where HardwareCorrupted sometimes goes to 4 KiB after several days of operation. The value has never jumped above 4 KiB, and neither the iDRAC nor Linux reports any ECC errors, correctable or uncorrectable. A reboot clears the issue, but HardwareCorrupted jumps back up to 4 KiB after weeks or months. The systems have otherwise been completely stable.

The fact that the iDRAC doesn’t report the error makes it particularly hard to figure out which DIMM/slot is the culprit. We could play games with pulling the servers and grinding through sets of DIMMs, but that’s a lot of work and downtime for the servers.

The behavior is making me wonder whether it’s actually a hardware error at all. I’m still trying to think of a course of action that doesn’t involve lengthy offline testing to narrow down the issue, especially since we haven’t seen any operational impact.

Has anyone else run into this? Any ideas on how to narrow down the issue without lengthy offline testing?

For now we’ve raised the warning threshold for this alert, but it’s going to bug me.


What units is $this in? Can I use my own unit in the expression, like $this > 4KiB?