Unix & Linux Stack Exchange
Q&A for users of Linux, FreeBSD and other Unix-like operating systems
Latest Questions
2
votes
0
answers
87
views
Add error correction to SquashFS images as part of a backup strategy
My backup strategy currently primarily consists of daily backups of all of my machines with Borg Backup, stored on different storage devices in different locations, following the 3-2-1 strategy. These file-level backups are the important ones that matter most to me. My question is *not* about these backups, I mention them for context.
Next to Borg Backup I also sporadically create full disk image backups with `dd`. After zeroing the disk's free space (using `zerofree` on ext4, `dd if=/dev/zero` otherwise) I usually create a SquashFS image of the raw disk image (e.g. `sda.img` becomes the only file in `disks.sqfs`). This allows me to store the raw disk image in compressed form, while still allowing me to access the data without the need to decompress everything first.
These full disk image backups are stored on a single storage device (a NAS to be more precise), i.e. they don't follow the 3-2-1 strategy. Creating a second copy of the data is out of scope, simply because the images take too much space and because I consider investing in more storage a waste of money given my Borg Backup backups. So, I'm fine with losing these backups per se, but I want to protect them a little better. Thus I'm thinking about adding some sort of error correction mechanism.
I read through a lot of resources and found that Reed–Solomon error correction seems to be the way to go. It adds some overhead to the data stored and provides safety in most, even though not all cases.
**My question is the following:** How do I do that in practice? What tools are available and how would I use them in my case? I found [this 10 years old Stack Exchange question](https://unix.stackexchange.com/questions/170652/is-it-possible-to-add-error-correction-codes-bch-rs-or-etc-to-a-single-file) listing a whole bunch of tools, but many of the projects are apparently dead. Plus, they don't seem to fit my needs:
Storing the data in compressed form and yet being able to access the data without the need to decompress it first is a must-have for me. So, unless there's another solution, I'm stuck with SquashFS. However, according to the resources I read, combining ECC with compression is hard: One apparently shouldn't calculate ECCs from compressed data, but from the original data, because ECC doesn't guarantee a 100% correction and even a single remaining corruption could render all compressed data useless. However, calculating ECCs from original data and then compressing it wouldn't help either, because I might not be able to decompress the data due to the corruptions. So, apparently one needs software that does both at the same time: compression and ECC. Per `ddrescue` I found that `lzip` can actually do that by creating forward error correction (fec) files alongside compressing the data, but AFAIK I can't tell SquashFS to create these files.
So, I'm kinda stuck with this chicken-and-egg problem... How can I combine SquashFS with ECC, or is there an alternative to SquashFS that allows this?
Any suggestions?
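The concern raised above, that a single uncorrected error can make compressed data unusable, can be illustrated with a small Python sketch. It uses `zlib` as a stand-in compressor (SquashFS itself is not involved), and the helper name `survives_corruption` is made up for this illustration.

```python
import zlib


def survives_corruption(payload: bytes, offset: int) -> bool:
    """Return True if `payload` still round-trips after one compressed byte is damaged."""
    compressed = bytearray(zlib.compress(payload))
    compressed[offset] ^= 0xFF  # invert every bit of a single compressed byte
    try:
        return zlib.decompress(bytes(compressed)) == payload
    except zlib.error:
        # Invalid stream or checksum mismatch: the data cannot be restored as-is.
        return False


if __name__ == "__main__":
    # Stand-in for a disk image: highly compressible, repetitive content.
    image = b"".join(b"sector %06d\n" % i for i in range(5000))
    middle = len(zlib.compress(image)) // 2
    print("intact after corruption:", survives_corruption(image, middle))
```

This only shows the failure mode the question describes; it says nothing about which tool to use for protecting the image.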
PhrozenByte
(21 rep)
Apr 1, 2025, 07:26 PM
• Last activity: Apr 1, 2025, 07:30 PM
1
votes
1
answers
147
views
Why are edac-ctl and edac-util reporting that there are no EDAC drivers loaded when ECC memory is present?
I've installed Fedora 41 on a Dell Precision 3450 with 128GB of known-good ECC memory. I know that the ECC memory is detected, according to `lshw`, but I can't query any information with EDAC utilities. Why are EDAC drivers not loading and does this mean that the ECC feature isn't working in Linux? Or does this instead mean that it is working in hardware but that there won't be any system messages when an error occurs?
The system is running an Intel Xeon W-1370 CPU with an Intel W580 chipset, both of which are fully capable of supporting ECC memory. This is also a supported configuration by Dell.
# lshw -class memory
...
*-memory
description: System Memory
physical id: 1000
slot: System board or motherboard
size: 128GiB
capabilities: ecc
configuration: errordetection=ecc
*-bank:0
description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
vendor: 000000000000
physical id: 0
serial: 00162263
slot: DIMM3
size: 32GiB
width: 64 bits
clock: 2667MHz (0.4ns)
*-bank:1
description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
vendor: 000000000000
physical id: 1
serial: 00164294
slot: DIMM1
size: 32GiB
width: 64 bits
clock: 2667MHz (0.4ns)
*-bank:2
description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
vendor: 000000000000
physical id: 2
serial: 00162249
slot: DIMM4
size: 32GiB
width: 64 bits
clock: 2667MHz (0.4ns)
*-bank:3
description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
vendor: 000000000000
physical id: 3
serial: 00162258
slot: DIMM2
size: 32GiB
width: 64 bits
clock: 2667MHz (0.4ns)
However, I can't get any ECC information from either `edac-ctl` or `edac-util`.
# edac-ctl --status
edac-ctl: drivers not loaded.
# edac-util --status
edac-util: EDAC drivers loaded. No memory controllers found
I don't see any drivers loaded when I test with modprobe either.
# modprobe edac
modprobe: FATAL: Module edac not found in directory /lib/modules/6.12.15-200.fc41.x86_64
I do, however, see a message in `dmesg` that suggests that something has been loaded.
# dmesg | grep EDAC
[ 0.463779] EDAC MC: Ver: 3.0.0
These are all of the EDAC drivers I can find that are available on the system.
# find /lib/modules -type f -iname "*edac*"
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/amd64_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/igen6_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/skx_edac_common.ko.xz
/lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/x38_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/amd64_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/igen6_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/skx_edac_common.ko.xz
/lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/x38_edac.ko.xz
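A quick way to see whether any EDAC driver has actually registered a memory controller is to look for `mcN` directories under `/sys/devices/system/edac/mc` and read their `ce_count` and `ue_count` attributes. The following Python sketch does that; the function name `edac_error_counts` is an assumption made for this example, and an empty result only means that no controller is registered, not that ECC is inactive in hardware.

```python
from pathlib import Path


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Map each registered memory controller (mc0, mc1, ...) to its error counters."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC core not present, or sysfs not mounted
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):  # corrected / uncorrected errors
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # attribute missing or unreadable
        counts[mc.name] = entry
    return counts
```

Calling `edac_error_counts()` on a system like the one described would presumably return an empty dictionary, matching the "No memory controllers found" message.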
Zhro
(2831 rep)
Mar 18, 2025, 11:05 PM
• Last activity: Mar 19, 2025, 08:13 PM
0
votes
1
answers
32
views
Can't find sdram_scrub_rate file
I recently bought the X670E Pro RS and added some ECC memory to it. Now I want to enable memory scrubbing, but when I look in `/sys/devices/system/edac/mc/mc0/`, the `sdram_scrub_rate` file is missing. I can't find an answer online as to why. I changed the ECC setting from auto to enabled just to force it on. `lsmod | fgrep edac` reports:
amd64_edac 69632 0
edac_mce_amd 40960 1 amd64_edac
That looks correct, doesn't it?
Am I missing a driver, or is `edac-utils` required?
The question is: why is `sdram_scrub_rate` missing?
Caesar
(25 rep)
Mar 4, 2025, 07:24 PM
• Last activity: Mar 4, 2025, 09:23 PM
5
votes
2
answers
3843
views
Is ZFS safer with non-ECC RAM if you disable checksums?
I've heard about the Scrub of Death. However one can disable checksumming in ZFS datasets. If so, will that make the situation safer for a system that's not using ECC RAM?
I'm not thinking of a NAS or anything like that - more of a workstation deployment with a single drive just to use the ZFS volume management and snapshots (and no need for `fsck`) benefits. I don't even want to use redundancy.
Will a bad memory location still completely destroy my storage if I disable ZFS checksums?
unfa
(1825 rep)
Dec 14, 2017, 10:38 PM
• Last activity: Jan 17, 2025, 08:35 AM
0
votes
2
answers
72
views
Hardware Theory: Read only Scrub of HDD/SSD drive using dd
Claim
-----
If drives are capable of hardware controller correction of data upon read, then it is possible to routinely catch and repair silent data corruption by simply reading it.
Premises
--------
* Normally, when a drive (HDD or SSD) writes a sector, it also writes an ECC (checksum) for that sector.
* If later, a bit or two bits are flipped or misread, then the hardware controller can still retrieve the correct data by comparing and correcting it with the ECC.
* If the hardware controller reads a sector with bits that do not match the ECC, and the data is able to be returned read corrected, then the hardware controller may (should?) rewrite that sector data to the drive so that the bits will again match the ECC.
Do these premises appear flawed in any major way? Is there any information out there to help prove or disprove the premises in this claim?
If all of the premises here are correct, then it should be possible to help prevent silent data corruption with a simple cronjob that reads entire drives occasionally (perhaps every couple months).
dd if=/dev/sda of=/dev/null
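The same read-only pass can be sketched in Python so that read errors are recorded instead of aborting the run. This is only an illustration of the idea in the claim: the function name `read_scrub` and the 1 MiB chunk size are arbitrary choices, whether a drive rewrites a sector after a corrected read is exactly the open premise above, and buffered reads may be served from the page cache rather than from the medium.

```python
def read_scrub(path, chunk_size=1 << 20):
    """Read `path` sequentially; return (bytes_read_ok, offsets_of_failed_chunks)."""
    bytes_ok = 0
    offset = 0
    failed = []
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                chunk = dev.read(chunk_size)
            except OSError:
                # The drive reported an unreadable region: note it and skip ahead.
                failed.append(offset)
                offset += chunk_size
                dev.seek(offset)
                continue
            if not chunk:
                break  # end of device or file
            bytes_ok += len(chunk)
            offset += len(chunk)
    return bytes_ok, failed
```

For example, `read_scrub("/dev/sda")` run as root would walk the whole device; a regular file works the same way for testing.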
Sepero
(1619 rep)
Jun 29, 2024, 07:03 AM
• Last activity: Jun 30, 2024, 06:47 PM
4
votes
2
answers
464
views
Is there a filesystem that can maintain extra ECC data like raid5, but in the filesystem to make a fault-tolerant single external drive?
Normally to make a fault-tolerant or corruption-repairing filesystem, you use multiple drives and raid 5, or anything but raid 0.
There are also many ways to make a fault-tolerant archive file like dar etc.
What I am looking for is a way to make a single external ssd safer against bitrot from extended unpowered storage, yet otherwise use the drive as a normal drive, just mount and read/write files when I want like any other filesystem. Merely "when I want" can sometimes be years apart.
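For context, the RAID 5 idea mentioned above boils down to XOR parity: one extra block lets you rebuild any single block that is known to be lost. A minimal Python sketch, with made-up function names, assuming equally sized blocks:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(blocks, parity):
    """Rebuild the single block given as None from the other blocks plus parity."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    present = [block for block in blocks if block is not None]
    return xor_blocks(present + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)
    print(rebuild_missing([b"abcd", None, b"ijkl"], parity))
```

Note that parity alone does not tell you which block is bad; on a single drive that information would have to come from something like a per-block checksum.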
"normal" doesn't mean usable from Windows and Mac. Linux-only is ok.
Brian White
(161 rep)
Mar 29, 2023, 07:20 AM
• Last activity: Apr 15, 2024, 04:03 PM
-1
votes
1
answers
534
views
What software alternatives are there to ECC storage under Linux Mint and Linux Mint Debian Edition LMDE to protect against a bit flip problem?
It is known that there are other approaches besides ECC memory that can help avoid data loss due to e.g. flipping of RAM memory cells by cosmic rays (bit flip problem):
What the Bit Flip Problem is:
* https://web.archive.org/web/20230114090442/https://arstechnica.com/gadgets/2021/01/linus-torvalds-blames-intel-for-lack-of-ecc-ram-in-consumer-pcs/
Error correction procedure:
* https://web.archive.org/web/20230114220121/https://en.wikipedia.org/wiki/Error_detection_and_correction
* https://www-tecchannel-de.translate.goog/a/fehlertoleranter-speicher-schuetzt-vor-systemausfaellen-und-datenverlust,402181,4?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
BIOS implementations of RAM Mirroring:
* https://www-thomas--krenn-com.translate.goog/de/wiki/RAM_Mirroring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
* https://web.archive.org/web/20230114220407/https://www-thomas--krenn-com.translate.goog/de/wiki/RAM_Mirroring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
Software implementations:
* https://www-admin--magazin-de.translate.goog/Das-Heft/2013/12/Speicherfehler-unter-Linux-erkennen-und-beobachten?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp
SoftECC
https://web.archive.org/web/20230119082028/https://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
https://web.archive.org/web/20230114103624/https://pdos.csail.mit.edu/papers/softecc%3Addopson-meng/softecc_ddopson-meng.pdf
What software solutions, such as kernel implementations or add-on programs, will be usable under Linux Mint 21 and LMDE 5 in 2023? Possibly an end-to-end, hash-based technology comparable to ZFS RAIDZ, but for working RAM rather than hard disks.
# Hints for possible actual solutions:
# Mirrored memory support:
* https://web.archive.org/web/20230114231952/https://lwn.net/Articles/897734/
* https://web.archive.org/web/20230114232055/https://www.fujitsu.com/jp/documents/products/software/os/linux/catalog/LinuxConJapan2016-Izumi.pdf
* https://web.archive.org/web/20230114232143/https://www.phoronix.com/news/Linux-AArch64-Mirrored-Memory
* https://www.micron.com/-/media/client/global/documents/products/technical-note/nand-flash/tn2971_software_bch_ecc_on_linux.pdf
* https://linux.kernel.narkive.com/cxqgDQlR/software-based-ecc
* https://dspace.mit.edu/handle/1721.1/36769
Alfred.37
(129 rep)
Jan 14, 2023, 09:41 PM
• Last activity: Apr 9, 2024, 08:03 PM
0
votes
2
answers
233
views
Are there any filesystems with builtin data repairing via checksums?
I've read that ZFS/BtrFS have a checksum check, but they don't use it for data recovery, only for recovering data from a full local copy or a mirror copy.
On the other hand, RAR archives have supported data redundancy for a long time, with a configurable amount. The larger the amount, the higher the probability of a successful recovery. The same goes for Dvdisaster, which is able to create .ecc files with recovery data, albeit on a separate medium.
Many advanced media, like optical disks or hard disks, have a low-level ECC check implemented in a drive controller, so it's not that needed on higher levels of abstraction. But other ones, like cheap microSD cards, may lack it and are perceivably unreliable.
So, there are ECC checks on hardware level and application level, but are there any ECC-backed filesystems?
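The distinction drawn above, between a checksum that only detects damage and redundancy that can repair it, can be sketched in Python. Per-block CRC32 values (here via `zlib.crc32`) locate a damaged block but carry no information for rebuilding it. The function names and the 4096-byte block size are assumptions for this example.

```python
import zlib


def block_checksums(data, block_size=4096):
    """Compute one CRC32 per fixed-size block of `data`."""
    return [
        zlib.crc32(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, checksums, block_size=4096):
    """Return the indexes of blocks whose current CRC32 differs from the stored one."""
    current = block_checksums(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, checksums)) if now != then]
```

Knowing that block 1 is bad is enough to fetch a good copy from a mirror, which is how a checksumming filesystem uses it; without a second copy or parity data the block stays lost.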
bodqhrohro
(386 rep)
Sep 3, 2023, 09:02 PM
• Last activity: Sep 5, 2023, 02:05 PM
31
votes
3
answers
55406
views
How to tell whether RAM ECC is working?
I'm planning on getting some ECC RAM to replace the non-ECC RAM I currently have installed on my Asus M5A97 Pro motherboard (AMD 970 chipset, FX-6100 CPU).
After I install the RAM, **how do I tell whether the ECC feature of the RAM is working properly?**
I thought about `dmidecode --type memory`, which currently prints, among other things, for each RAM stick:
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
(For one, I would expect with 1 bit of ECC per byte the data width to remain 64 bits but the total width to read 72 bits.)
Can that be used for determining whether ECC is operative? Or is dmidecode too low level for that? What else could I use (except waiting and seeing if an ECC error shows up in the logs, which would indicate it's working but not that it isn't working)?
**Update:** I later thought of edac-utils. Installing them, I get `Not enabling Memory Error Detection and Correction since EDAC_DRIVER is not set`. That gave me the `edac-util` and `edac-ctl` executables. Can one of those be used for this purpose?
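The width comparison described above (64 data bits, 72 total bits with ECC) can be automated with a short Python sketch that parses saved `dmidecode --type memory` output. The function names are invented for this example, and the check only reflects what the firmware reports about the modules; it does not prove that ECC is actually active.

```python
import re

_WIDTH = re.compile(r"^\s*(Total|Data) Width:\s*(\d+) bits", re.MULTILINE)


def ecc_width_per_dimm(dmidecode_text):
    """Return (total_width, data_width) pairs, one per memory device with known widths."""
    pairs = []
    total = None
    for kind, bits in _WIDTH.findall(dmidecode_text):
        if kind == "Total":
            total = int(bits)
        elif total is not None:
            pairs.append((total, int(bits)))
            total = None
    return pairs


def looks_like_ecc(pairs):
    """True if every listed device has more total bits than data bits."""
    return bool(pairs) and all(total > data for total, data in pairs)
```

With the output shown in the question, `ecc_width_per_dimm` yields `(64, 64)` per stick, so `looks_like_ecc` is false, as expected for non-ECC modules.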
user
(29991 rep)
Jun 26, 2014, 10:14 AM
• Last activity: May 16, 2023, 01:24 PM
0
votes
2
answers
700
views
What are the self-healing file formats?
It is known that there are self-healing file systems like e.g. ZFS, Btrfs, bcachefs and self-healing RAM, like e.g. ECC RAM or corresponding software implementations, which can correct single or multiple erroneous bits.
What self-healing file formats, or projects working toward self-healing file formats, exist for standard programs?
By file format I mean, for example:
* .txt, .doc, tar.lz4, .mp4
Alfred.37
(129 rep)
Mar 7, 2023, 08:27 AM
• Last activity: Apr 7, 2023, 10:47 AM
19
votes
5
answers
8745
views
Is it possible to add error correction codes (BCH, RS or etc.) to a single file?
As far as I know, WinRAR archives may contain ECC (error correction codes), so if the archive is slightly damaged, then it can be fixed by itself.
For example, I can first encode `archives.tar` to `archives.tar.ecc`, and then upload it to my server. If the file is slightly damaged after being downloaded by the client, it can be fixed automatically by decoding `archives.tar.ecc`, without downloading the file again. I think this would be a great idea when the network connection is unstable.
I wonder whether there is any (open-source) software that runs on Linux and can meet my needs.
Any suggestions?
Kevin Dong
(1179 rep)
Nov 30, 2014, 08:08 AM
• Last activity: Mar 29, 2023, 04:54 PM
1
votes
1
answers
4447
views
Hardware error from APEI Generic Hardware Error Source (ECC RAM)
[58306.633900] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[58306.633905] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[58306.633907] {1}[Hardware Error]: event severity: corrected
[58306.633909] {1}[Hardware Error]: Error 0, type: corrected
[58306.633911] {1}[Hardware Error]: fru_text: CorrectedErr
[58306.633912] {1}[Hardware Error]: section_type: memory error
[58306.633914] {1}[Hardware Error]: node: 0 device: 44696
[58306.633916] {1}[Hardware Error]: error_type: 2, single-bit ECC
This has appeared on my Debian Xeon server with **ECC RAM**. Does it mean the RAM modules are dying, or something else, such as an error caused by software? I saw another post where the poster's OS rebooted, while mine didn't, which is why I am asking. Thank you.
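To keep track of how often such events occur, the `[Hardware Error]` lines can be tallied from saved kernel log text. The Python sketch below counts events by severity and by error type; the function names are made up for this example, and the patterns are based only on the log lines shown above.

```python
import re
from collections import Counter

_SEVERITY = re.compile(r"\[Hardware Error\]:\s*event severity:\s*(\w+)")
_ERROR_TYPE = re.compile(r"\[Hardware Error\]:\s*error_type:\s*\d+,\s*(.+)")


def count_severities(log_text):
    """Count APEI events by the value of their 'event severity' line."""
    return Counter(_SEVERITY.findall(log_text))


def count_error_types(log_text):
    """Count APEI memory errors by the description after 'error_type: N,'."""
    return Counter(match.strip() for match in _ERROR_TYPE.findall(log_text))
```

Feeding it the output of `dmesg` over time gives a rough rate of corrected errors, which is the kind of figure one would watch before deciding whether a module needs attention.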
Vlastimil Burián
(30505 rep)
Mar 17, 2022, 10:01 AM
• Last activity: Mar 17, 2022, 05:11 PM
10
votes
3
answers
3778
views
How to get error detection and correction on a single hard drive on linux (with btrfs or other methods)
One of the cool things about btrfs on linux is that it can correct bit rot if it has redundant data because of its per-block checksumming. I can get redundant data by setting up a raid1 with two disks. However, can I also get redundant data to prevent bit rot on a single disk?
I see that btrfs has a DUP option for metadata (`-m dup`) that stores two copies of the metadata on each drive. However, the documentation says that dup is not an option for data (i.e. `-d dup` is not an option). Is there a good way around this? Partition a single disk into two equal parts and raid1 them together?
Alternatively, is there another simple way to get file system level error detection and correction on linux (something like an automatic parchive for file systems)?
(I'm not interested in answers suggesting that I use two drives.)
**EDIT:** I did find this, which is a FUSE filesystem that mounts files with error correction as normal files. That said, it's a little hack/proof of concept that someone put together in 2009 and hasn't really touched since.
lnmaurer
(253 rep)
May 2, 2015, 02:50 PM
• Last activity: Apr 19, 2021, 10:07 AM
2
votes
0
answers
602
views
Identify RAM module linked to ECC error in dmesg
One of my servers is logging the following ECC errors:
[lun set 14 00:14:16 2020] {33}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[lun set 14 00:14:16 2020] {33}[Hardware Error]: It has been corrected by h/w and requires no further action
[lun set 14 00:14:16 2020] {33}[Hardware Error]: event severity: corrected
[lun set 14 00:14:16 2020] {33}[Hardware Error]: Error 0, type: corrected
[lun set 14 00:14:16 2020] {33}[Hardware Error]: fru_text: CorrectedErr
[lun set 14 00:14:16 2020] {33}[Hardware Error]: section_type: memory error
[lun set 14 00:14:16 2020] {33}[Hardware Error]: node: 0 device: 1
[lun set 14 00:14:16 2020] {33}[Hardware Error]: error_type: 2, single-bit ECC
[lun set 14 00:14:16 2020] ghes_edac: Internal error: Can't find EDAC structure
The server has the following RAM configuration:
Handle 0x0029, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Single-bit ECC
Maximum Capacity: 64 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x002A, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM CHA3
Bank Locator: BANK 0
Type: DDR4
Type Detail: Synchronous
Speed: 2133 MHz
Manufacturer: SK Hynix
Serial Number: 71929DA0
Asset Tag: 1651
Part Number: HMA82GU7MFR8N-TF
Rank: 2
Configured Clock Speed: 2133 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x002B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM CHA1
Bank Locator: BANK 1
Type: DDR4
Type Detail: Synchronous
Speed: 2133 MHz
Manufacturer: SK Hynix
Serial Number: 71929CFF
Asset Tag: 1651
Part Number: HMA82GU7MFR8N-TF
Rank: 2
Configured Clock Speed: 2133 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x002C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM CHB4
Bank Locator: BANK 2
Type: DDR4
Type Detail: Synchronous
Speed: 2133 MHz
Manufacturer: SK Hynix
Serial Number: 71929BB8
Asset Tag: 1651
Part Number: HMA82GU7MFR8N-TF
Rank: 2
Configured Clock Speed: 2133 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: DIMM CHB2
Bank Locator: BANK 3
Type: DDR4
Type Detail: Synchronous
Speed: 2133 MHz
Manufacturer: Samsung
Serial Number: 33BB5E37
Asset Tag: 1641
Part Number: M391A2K43BB1-CPB
Rank: 2
Configured Clock Speed: 2133 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: 1.2 V
How can I identify the faulty module to replace it? I think the following log line has the information I need, but I don't know how to decode it.
[lun set 14 00:14:16 2020] {33}[Hardware Error]: node: 0 device: 1
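As a starting point for matching the error to a physical module, the `dmidecode` output above can be reduced to one record per DIMM. This Python sketch extracts the fields of every `DMI type 17` entry; the function name is an assumption, and the sketch does not translate `node: 0 device: 1` into a slot, since that mapping is platform-specific and not given in the log.

```python
def memory_devices(dmidecode_text):
    """Return one dict of 'Key: value' fields per DMI type 17 (Memory Device) entry."""
    devices = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            if "DMI type 17" in line:
                # "Handle 0x002A, DMI type 17, 40 bytes" -> "0x002A"
                current = {"Handle": line.split(",")[0].split()[1]}
                devices.append(current)
            else:
                current = None  # some other table type, e.g. the memory array
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices
```

Printing `Locator`, `Bank Locator`, `Serial Number` and `Part Number` for each record gives a compact inventory to compare against the board's slot labels.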
sKo
(21 rep)
Sep 14, 2020, 10:00 AM
5
votes
1
answers
5882
views
Remove ECC warnings in system log
How can I disable these warnings about ECC? I don't have ECC memory, so I also disabled it in the BIOS, but the warnings are still printed.
[ 4.697057] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.697061] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 4.764909] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.764911] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 4.844621] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.844624] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 4.889875] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[ 4.892678] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.892681] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 4.913651] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 4.936635] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.936637] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 4.949722] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 4.980600] EDAC amd64: Node 0: DRAM ECC disabled.
[ 4.980602] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 5.028880] EDAC amd64: Node 0: DRAM ECC disabled.
[ 5.028883] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
JoKeR
(438 rep)
Apr 2, 2020, 05:55 AM
• Last activity: Sep 2, 2020, 05:20 AM
11
votes
1
answers
7521
views
How do I enable and verify ECC RAM scrubbing in Linux?
I bought my first system with ECC RAM and am trying to learn about its possibilities when it comes to alerting and maintenance in Linux. To be specific, Debian Linux on a [Super Micro H8SGL](https://www.supermicro.com/Aplus/motherboard/Opteron6000/SR56x0/H8SGL.cfm) motherboard with an [AMD Opteron 6386 SE](https://www.amd.com/en/products/cpu/6386-se) CPU and [Samsung M393B2G70QH0-YK0](https://www.samsung.com/semiconductor/dram/module/M393B2G70QH0-YK0/) DDR3 ECC RAM.
I have learnt that it is possible to [_scrub_](https://en.wikipedia.org/wiki/Memory_scrubbing) ECC RAM, which sounds like an excellent idea. ECC RAM can normally _repair_ 1-bit errors and _detect_ 2-bit errors. Scrubbing involves periodically reading RAM to preemptively repair the 1-bit errors before they end up 2-bit errors.
I also learnt that Linux supports this, but I'm having problems using it so I need some help getting started and to figure out the settings.
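The "repair 1-bit, detect 2-bit" behaviour mentioned above is what a SECDED code provides. The Python sketch below is a toy version protecting 4 data bits with an extended Hamming(8,4) code; real ECC memory protects much wider words (the 72-bit total versus 64-bit data width reported by `dmidecode` reflects that), and all names here are illustrative.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2 and 4 hold Hamming parity bits


def _syndrome(bits):
    """XOR together the positions (1..7) of all set bits; 0 for a valid codeword."""
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    return syndrome


def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is the overall parity bit."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    syndrome = _syndrome(bits)
    for parity_pos in (1, 2, 4):
        bits[parity_pos] = 1 if syndrome & parity_pos else 0
    bits[0] = sum(bits[1:]) % 2  # make the total number of set bits even
    return bits


def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = _syndrome(bits)
    odd_parity = sum(bits) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        bits[syndrome] ^= 1  # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity but nonzero syndrome: two flips
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status
```

Scrubbing matters in this picture because a second flip in the same word turns a correctable error into one that can only be detected.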
### Linux EDAC driver
From what I understand, Linux handles ECC RAM using a subsystem called EDAC and the controls for that are exposed under `/sys/devices/system/edac/`. I can see my two memory controllers here (2 node NUMA):
# ls /sys/devices/system/edac/mc/
mc0 mc1 power subsystem uevent
I can also see that the EDAC drivers are somehow loaded:
# edac-util --status
edac-util: EDAC drivers are loaded. 2 MCs detected
# lsmod | grep edac
amd64_edac_mod 36864 0
edac_mce_amd 28672 1 amd64_edac_mod
Now I want to enable scrubbing. According to the [Linux ABI documentation](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-edac) the scrub rate is exposed through the `/sys/devices/system/edac/mc/mc*/sdram_scrub_rate` file, documented as such:
> The scrubbing rate used by the memory controller is set by
> writing a minimum bandwidth in bytes/sec to the attribute file.
> The rate will be translated to an internal value that gives at
> least the specified rate.
> Reading the file will return the actual scrubbing rate employed.
> If configuration fails or memory scrubbing is not implemented,
> the value of the attribute file will be -1.
But nothing happens when I do this. Writing a sensible value (somewhere in the middle when checking the [source](https://github.com/torvalds/linux/blob/master/drivers/edac/amd64_edac.c) and the [CPU documentation](http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf)) to the file seems to work, but it always returns `0` when reading from it:
# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
0
# echo 1000000 >/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
# echo $?
0
# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
0
After digging this deep, what am I missing?
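The write-then-read check above can be scripted. This is only a sketch: it assumes the `sdram_scrub_rate` attribute behaves as the quoted ABI text says (the read-back is the rate actually employed, at least the requested one, or -1 on failure), and the helper names are made up for illustration. It must run as root against the real sysfs path.

```python
from pathlib import Path


def took_effect(requested, actual):
    """True if the read-back rate satisfies the 'at least requested' contract.

    A read-back of -1 (failure / not implemented) or 0 fails this check
    for any positive request.
    """
    return actual >= requested


def set_and_verify_scrub_rate(mc_dir, requested):
    """Write a rate in bytes/sec to <mc_dir>/sdram_scrub_rate and read it back."""
    attr = Path(mc_dir) / "sdram_scrub_rate"
    attr.write_text(f"{requested}\n")
    actual = int(attr.read_text().strip())
    return actual, took_effect(requested, actual)


def check_all(edac_root="/sys/devices/system/edac/mc", requested=1_000_000):
    """Apply the same request to every mcN directory and collect the results."""
    results = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        results[mc.name] = set_and_verify_scrub_rate(mc, requested)
    return results
```

With the behaviour shown in the transcript, `took_effect(1000000, 0)` is false for `mc0`, which at least turns "nothing happens" into a condition a script can test for.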
### BIOS ECC Configuration
I have also tried different settings in the BIOS. There is an option in the BIOS for ECC configuration, but none of its settings has any effect on the scrub rate visible from Linux.
Right now I'm trying the `User` setting, but I really can't see any difference between these.
pipe
(893 rep)
Jun 15, 2020, 04:21 PM
• Last activity: Jun 24, 2020, 07:09 AM
3
votes
0
answers
302
views
Mapping around ecc errors in Linux does not seem to work?
I get the following ECC error on a Linux box several times a day:
May 24 18:21:04 staton-nas kernel: mce: [Hardware Error]: Machine check events logged
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000040000800c2
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: TSC 1c35588953416
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: ADDR 117d228000
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: MISC 122100200020008c
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1590358864 SOCKET 0 APIC 0
May 24 18:21:04 staton-nas kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x117d228 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:1 rank:4)
The ADDR is always the same, so I’m trying to map around it with a `memmap=5M$0x117CFA8001` kernel argument.
The argument seems to be applied, because I see the following in syslog:
May 24 16:03:09 staton-nas kernel: user: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
May 24 16:03:09 staton-nas kernel: user: [mem 0x0000000100000000-0x000000117cfa8000] usable
May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117cfa8001-0x000000117d4a8000] reserved
May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117d4a8001-0x000000407fffffff] usable
but I still get the ECC errors.
Am I missing something?
Is the “ADDR 117d228000” in the EDAC syslog errors not the actual address I need to map around? Do I need to convert that to a physical address somehow?
I’m too cheap to replace a whole DIMM for a single bad bit.
The more research I do, the more convinced I become that the “memory scrubbing error” message indicates the error is coming from memory scrubbing that the hardware is doing, and that I can safely ignore it now that I have mapped around it. The OS will never actually use this memory area because I reserved it.
Can anyone confirm that?
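As a sanity check on the numbers above, the arithmetic can be sketched as follows. It assumes that the logged `ADDR` is a system physical address and that pages are 4 KiB; both are assumptions, not something the log confirms, and the helper names are illustrative.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def reserved_range(start, size):
    """Inclusive [first, last] byte range requested by memmap=<size>$<start>."""
    return start, start + size - 1


def covers(start, size, addr):
    """True if addr lies inside the range reserved by memmap=<size>$<start>."""
    first, last = reserved_range(start, size)
    return first <= addr <= last


MCE_ADDR = 0x117D228000          # "ADDR 117d228000" from the EDAC log
MEMMAP_START = 0x117CFA8001      # from memmap=5M$0x117CFA8001
MEMMAP_SIZE = 5 * 1024 * 1024    # 5M

first, last = reserved_range(MEMMAP_START, MEMMAP_SIZE)
page = MCE_ADDR >> PAGE_SHIFT            # compare with "page:0x117d228" in the log
inside = covers(MEMMAP_START, MEMMAP_SIZE, MCE_ADDR)
start_is_page_aligned = MEMMAP_START % (1 << PAGE_SHIFT) == 0
```

Under these assumptions the logged address does fall inside the reserved range, and its page number matches the `page:` field of the EDAC line. The start address is not page-aligned, which may or may not matter here.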
statop
(31 rep)
May 25, 2020, 05:07 PM
24
votes
3
answers
15293
views
Is it possible to find the physical address range of a DIMM?
I note that SMBIOS Type 20 would help here, but it's optional as of version 2.5 (2006-09-05; pp. 25, L796, and pp. 131), whereas Types 16, 17 and 19 are mandatory but don't quite help.
### Physical Memory Array (Type 16)
There is one of these structures for the entire system, explaining what is possible on this board.
Handle 0x1000, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 768 GB
Error Information Handle: Not Provided
Number Of Devices: 24
### Memory Device (Type 17)
There is one record per DIMM, describing each physical DIMM installed on the board.
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1600 MHz
Manufacturer: XXXX
Serial Number: XXXX
Asset Tag: XXXX
Part Number: XXXX
Rank: 1
Configured Clock Speed: 1333 MHz
### Memory Array Mapped Address (Type 19)
There can be multiple of these records, and each record lists a range of physical addresses.
Here is the output with two 2GB sticks:
Handle 0x1300, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x000CFFFFFFF
Range Size: 3328 MB
Physical Array Handle: 0x1000
Partition Width: 2
Handle 0x1301, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00100000000
Ending Address: 0x0012FFFFFFF
Range Size: 768 MB
Physical Array Handle: 0x1000
Partition Width: 2
And here is the output with 4 sticks (2 × 2 GB and 2 × 4 GB):
Handle 0x1300, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x000CFFFFFFF
Range Size: 3328 MB
Physical Array Handle: 0x1000
Partition Width: 2
Handle 0x1301, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00100000000
Ending Address: 0x0032FFFFFFF
Range Size: 8960 MB
Physical Array Handle: 0x1000
Partition Width: 2
Note that in the first sample output above there were two 2 GB DIMMs, but the two ranges are 3328 MB and 768 MB. With 4 DIMMs the system likewise coalesces the memory array mapped address regions into two chunks, because these records just mirror the e820 map, i.e. the valid physical memory address ranges.
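The range sizes in the Type 19 records can be recomputed from the start and end addresses. The sketch below copies those addresses from the outputs above; the function name is illustrative, and reading the gap below 4 GiB as a hole reserved for something other than RAM is an assumption.

```python
MIB = 1024 * 1024


def range_mb(start, end):
    """Size in MB of an inclusive [start, end] physical address range."""
    return (end - start + 1) // MIB


# (Starting Address, Ending Address) pairs from the two dmidecode outputs.
TWO_STICKS = [(0x00000000000, 0x000CFFFFFFF), (0x00100000000, 0x0012FFFFFFF)]
FOUR_STICKS = [(0x00000000000, 0x000CFFFFFFF), (0x00100000000, 0x0032FFFFFFF)]

sizes_two = [range_mb(s, e) for s, e in TWO_STICKS]
sizes_four = [range_mb(s, e) for s, e in FOUR_STICKS]

# Gap between the end of the first range and the 4 GiB boundary.
hole_mb = (0x00100000000 - (0x000CFFFFFFF + 1)) // MIB
```

The totals come out to the installed capacity in both cases (4096 MB and 12288 MB), and in the two-stick case the 768 MB that sits above 4 GiB equals the size of the gap below it. So the ranges describe where RAM is addressable, not where each stick lives.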
One or more Type 20 records are tied to exactly one Type 17 memory device, meaning that the entire physical range can be known:
### Example
$ sudo dmidecode -t 20
# dmidecode 2.12
SMBIOS 2.6 present.
Handle 0x002F, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x000FFFFFFFF
Range Size: 4 GB
Physical Device Handle: 0x002B
Memory Array Mapped Address Handle: 0x002E
Partition Row Position: 1
Handle 0x0030, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00100000000
Ending Address: 0x001FFFFFFFF
Range Size: 4 GB
Physical Device Handle: 0x002C
Memory Array Mapped Address Handle: 0x002E
Partition Row Position: 1
It seems possible to go from address to DIMM for EDAC (Error Detection and Correction) purposes, but not from DIMM to entire range.
Judging by its source code, mcelog also uses Type 20 for its decoding.
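Where Type 20 records are present, both directions of the lookup are straightforward. The sketch below hard-codes the two records from the example output; a real tool would have to parse them from `dmidecode -t 20`, and the function names are hypothetical. It also assumes the ranges are not interleaved across devices.

```python
# (Starting Address, Ending Address, Physical Device Handle) from the example.
TYPE20 = [
    (0x00000000000, 0x000FFFFFFFF, 0x002B),
    (0x00100000000, 0x001FFFFFFFF, 0x002C),
]


def device_for(addr, records=TYPE20):
    """Address -> handle of the Type 17 device that maps it, or None."""
    for start, end, handle in records:
        if start <= addr <= end:
            return handle
    return None


def ranges_for(handle, records=TYPE20):
    """Device handle -> list of inclusive (start, end) ranges mapped to it."""
    return [(start, end) for start, end, h in records if h == handle]
```

For instance, `device_for(0x180000000)` gives handle `0x002C`, and `ranges_for(0x002B)` gives the single range `0x0` to `0xFFFFFFFF`.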
Alun
(409 rep)
Jan 6, 2014, 06:39 AM
• Last activity: Dec 24, 2017, 04:34 AM
4
votes
2
answers
14454
views
Understanding "Hardware error from APEI Generic Hardware Error Source" error message
**Summary**: I'm trying to understand exactly what the following error message means:
[17016.923750] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[17016.923758] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[17016.923759] {4}[Hardware Error]: event severity: corrected
[17016.923761] {4}[Hardware Error]: Error 0, type: corrected
[17016.923762] {4}[Hardware Error]: fru_text: CorrectedErr
[17016.923764] {4}[Hardware Error]: section_type: memory error
**Details**:
I have a server with an `Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz` CPU that is running Arch Linux (`3.18.6-1-ARCH #1 SMP PREEMPT Sat Feb 7 08:44:05 CET 2015 x86_64 GNU/Linux`).
When I run `dmesg` I see the error that I posted above. The errors are not that frequent, but they do keep happening. For instance, the server has been up for 1 day since the last reboot, and there are 9 instances of this error in the log.
I saw another question that [asked about this error](https://unix.stackexchange.com/questions/150451/apei-generic-hardware-error) and there was an answer that suggested the problem was that the ECC memory is failing.
My questions are:
1) Is there any reference to support the idea that this error message is associated with ECC memory?
2) If I do have a failing DIMM is there a suggested way to figure out which one it is? I tried running memtest86+, but it did not report any memory errors.
3) If the OS reports ECC errors have been corrected does that really mean the DIMM is failing?
I wouldn't be so concerned if the only problem was a few messages in my log file. But I have also noticed that sometimes the server hangs unexpectedly. The machine is being used for research, so stability is not as important as it would be on a production system. Still, having the machine hang can be problematic. So I would like to know exactly what this error message means, and if I need to replace a component, it would be nice to have a way to figure out which one.
**Edit**
Currently the server has been up for 8 days without hanging and I see 148 instances of this error message in the logs. In addition I see one instance of the following message:
[671211.188084] EDAC MC0: INTERNAL ERROR: csrow value is out of range (6 >= 4)
[671211.188333] EDAC MC0: 1 CE ie31200 CE on unknown memory (channel:1 page:0x0 offset:0x0 grain:0 syndrome:0xc8)
I guess it is likely that one of the DIMMs has a problem. Still, I would be interested in any information about how to interpret these messages, in particular to figure out which DIMM is possibly failing.
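One way to get more out of these messages is to pull the location fields out of each EDAC line and tally them. The regular expression below is a sketch fitted to the log lines quoted on this page, not a documented format, and `parse_edac` is an illustrative name.

```python
import re
from collections import Counter

# "EDAC MC<n>: <count> CE|UE <message> (<key:value ...>)"
_EDAC_RE = re.compile(r"EDAC (MC\d+): (\d+) (CE|UE) (.*?) \((.*)\)\s*$")


def parse_edac(line):
    """Parse one EDAC error line into a dict, or return None if it doesn't match."""
    m = _EDAC_RE.search(line)
    if m is None:
        return None
    mc, count, kind, message, detail = m.groups()
    # Keep only key:value tokens; split on the first colon so values like
    # "0008:00c2" stay intact.
    fields = dict(tok.split(":", 1) for tok in detail.split() if ":" in tok)
    return {"mc": mc, "count": int(count), "kind": kind,
            "message": message, "fields": fields}


def errors_per_channel(lines):
    """Tally corrected/uncorrected error counts by (mc, channel)."""
    tally = Counter()
    for line in lines:
        rec = parse_edac(line)
        if rec is not None:
            tally[(rec["mc"], rec["fields"].get("channel"))] += rec["count"]
    return tally
```

Fed with the output of `dmesg`, the tally shows whether the corrected errors cluster on one channel. That narrows the search even when the driver reports "unknown memory", as it does in the line above.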
Gabriel Southern
(843 rep)
Feb 25, 2015, 02:11 AM
• Last activity: Dec 5, 2017, 09:08 PM
2
votes
0
answers
124
views
software-level error detection and correction for raw storage
If I understand data storage correctly, all storage devices are unreliable to some extent, which is why most have hardware-level abstraction layers. Hard drives use error correction. If a sector is read and ECC detects an error (whether it was from the original writing or from random bit flipping over time), ECC is used to try to recover from the error and that sector is potentially marked bad and remapped to the spare sector pool. Some hardware devices don't have any of that, though, especially things like flash memory on embedded systems, which gets accessed directly, with no hardware level error-checking layer between it and the kernel.
Does Linux provide methods, like special filesystems or logical volumes (by logical volumes, I mean things like cryptsetup or lvm2), that can deal directly with such "raw" devices, doing all of the checksumming, bad sector remapping, error correction, etc. at the software level? Would the method of error checking depend on the type or the properties of the raw storage?
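To illustrate what the detection half of such a software layer amounts to, here is a minimal sketch that keeps one CRC32 per fixed-size block and reports which blocks no longer match. The 4096-byte block size and the function names are arbitrary assumptions. This only detects damage; correcting it would need redundancy on top, which is not shown.

```python
import zlib


def block_checksums(data, block_size=4096):
    """Return one CRC32 per block_size-byte block of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def damaged_blocks(data, checksums, block_size=4096):
    """Return indices of blocks whose current CRC32 differs from the stored one."""
    current = block_checksums(data, block_size)
    return [i for i, (now, stored) in enumerate(zip(current, checksums))
            if now != stored]
```

The stored checksums would have to live somewhere that does not fail together with the data, and a layer that also remaps bad blocks needs its own mapping table. That bookkeeping is where most of the real complexity is.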
enigmaticPhysicist
(1542 rep)
Oct 25, 2016, 09:25 PM
• Last activity: Oct 26, 2016, 08:18 PM