1hour_ecc_memory_uncorrectable

Tasos_Katsoulas · November 4, 2021, 4:41pm

OS: Linux

Error correction code memory (ECC memory) is a type of computer data storage that uses an error
correction code (ECC) to detect and correct n-bit data corruption that occurs in memory. ECC
memory protects against undetected memory data corruption and is used in computers where such
corruption is unacceptable, for example in some scientific and financial computing applications,
or in database and file servers. ¹

The Netdata Agent monitors the number of ECC uncorrectable errors in the last 10 minutes.
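
For a quick manual look at the same kind of counter, the Linux EDAC subsystem exposes per-controller uncorrectable error totals in sysfs. The following is a minimal sketch, assuming an EDAC driver is loaded and the counters live under `/sys/devices/system/edac/mc`; the function name `read_ue_counts` is made up for this example and this is not the Agent's own code.

```python
from pathlib import Path

# Standard EDAC sysfs location; only populated when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ue_counts(root=EDAC_ROOT):
    """Return {controller_name: uncorrectable_error_count} for each mc* directory."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        try:
            counts[mc_dir.name] = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Counter missing or unreadable: skip this controller.
            continue
    return counts


if __name__ == "__main__":
    counts = read_ue_counts()
    if not counts:
        print("No EDAC memory controllers found (is an EDAC driver loaded?)")
    for name, ue in counts.items():
        print(f"{name}: {ue} uncorrectable error(s) since the counters were last reset")
```

These counters are cumulative, so a monitoring tool has to compare successive readings to get the number of errors in a time window.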

See more on uncorrectable errors.

There are two main categories of Uncorrectable Errors (UE), as documented on
kernel.org: ²

Fatal Error, when a UE happens on a critical component of the system (for example, a piece of the kernel got corrupted by a UE). The only reliable way to avoid data corruption is to hang or reboot the machine.
Non-fatal Error, when a UE happens on an unused component, like an unused memory bank. The
system may still run, eventually replacing the affected hardware with a hot spare, if available.

See more on machine checks

Machine checks report internal hardware error conditions detected by the CPU. Uncorrected errors
typically cause a machine check (often with panic); corrected ones cause a machine check log entry.

How your machine behaves when a UE occurs depends on the tolerance level setting. The tolerance level configures how hard the kernel tries to recover, even at some risk of deadlock. Higher tolerant values trade potentially better uptime for the risk of a crash or even corruption (for tolerant >= 3). The default is 1.

0: always panic on uncorrected errors, log corrected errors

1: panic or SIGBUS on uncorrected errors, log corrected errors

2: SIGBUS or log uncorrected errors, log corrected errors

3: never panic or SIGBUS, log all errors (for testing only)
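
To see which of these levels a machine is using, the value can be read from sysfs and mapped to the descriptions above. This is a minimal sketch, assuming an x86 kernel that still exposes `/sys/devices/system/machinecheck/machinecheck0/tolerant` (some newer kernels no longer provide this file); `describe_tolerant` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed location of the setting on x86 kernels that still provide it.
TOLERANT_PATH = Path("/sys/devices/system/machinecheck/machinecheck0/tolerant")

# Descriptions copied from the list of tolerance levels above.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}


def describe_tolerant(path=TOLERANT_PATH):
    """Return (level, description), or None if the setting cannot be read."""
    try:
        level = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None
    return level, TOLERANT_LEVELS.get(level, "unknown level")


if __name__ == "__main__":
    result = describe_tolerant()
    if result is None:
        print("tolerant setting not available on this kernel")
    else:
        print(f"tolerant = {result[0]}: {result[1]}")
```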

When an error happens in a userspace process, it is also possible to kill that process and
let userspace restart it. ³

References and sources:

Troubleshooting section:

Verify a bad memory module

Most of the time, an uncorrectable error makes your system panic and then reboot or shut down. If it does not, your tolerance level is high enough to keep the system from panicking. You must identify the defective module immediately.
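
One way to narrow down the suspect module is to look at the per-DIMM EDAC counters, where the kernel and driver expose them. This is a rough sketch, assuming the `dimm*/dimm_ue_count` and `dimm*/dimm_label` files exist under `/sys/devices/system/edac/mc/mc*` on your system (older kernels and some drivers use a different layout, and a UE cannot always be attributed to a single DIMM); `dimm_ue_report` is a hypothetical helper name.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_text_or(path, default=""):
    """Read a small sysfs file, returning `default` if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def dimm_ue_report(root=EDAC_ROOT):
    """List DIMMs whose uncorrectable error counter is greater than zero."""
    report = []
    for dimm_dir in sorted(Path(root).glob("mc[0-9]*/dimm[0-9]*")):
        raw = read_text_or(dimm_dir / "dimm_ue_count")
        if not raw.isdigit() or int(raw) == 0:
            continue  # no counter here, or no uncorrectable errors recorded
        report.append({
            "controller": dimm_dir.parent.name,
            "dimm": dimm_dir.name,
            # The label is typically set by the admin or edac-utils and may be empty.
            "label": read_text_or(dimm_dir / "dimm_label") or "unknown",
            "ue_count": int(raw),
        })
    return report


if __name__ == "__main__":
    for entry in dimm_ue_report():
        print(entry)
```

An empty result does not prove the memory is healthy; it only means no per-DIMM uncorrectable errors were recorded in the layout this sketch assumes.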

memtester is a userspace utility for testing the memory subsystem for faults. It’s portable and
should compile and work on any 32 or 64-bit Unix-like system. For hardware developers, memtester
can be told to test memory starting at a particular physical address (memtester v4.1.0+). ²
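
A typical memtester run takes the amount of memory to test and the number of passes, for example `memtester 1G 3`. The sketch below wraps that invocation so it can be scripted; it assumes memtester is installed and on your PATH, and that you run it with enough privileges to lock the memory. The helper names `build_memtester_cmd` and `run_memtester` are made up for this example.

```python
import shutil
import subprocess


def build_memtester_cmd(size="1G", loops=1):
    """Build the argument list for `memtester <size> <loops>`."""
    return ["memtester", size, str(loops)]


def run_memtester(size="1G", loops=1):
    """Run memtester and return True if it exited with status 0 (no problem reported)."""
    if shutil.which("memtester") is None:
        raise FileNotFoundError("memtester is not installed or not on PATH")
    result = subprocess.run(build_memtester_cmd(size, loops))
    return result.returncode == 0


if __name__ == "__main__":
    # Test 1 GiB for 3 passes; pick a size that fits in your free memory.
    ok = run_memtester("1G", 3)
    print("memtester reported no problems" if ok else "memtester reported a problem")
```

memtester can only test memory the OS lets it allocate, so a clean run does not rule out a fault in memory that was in use elsewhere at the time.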

You may also receive this error as a result of incorrect seating or improper contact between the socket and
the RAM module. Check both before you consider replacing the RAM module.

Check for BIOS updates

You should check for critical BIOS updates on your hardware vendor’s support page.
