Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

2 votes

0 answers

87 views

Add error correction to SquashFS images as part of a backup strategy

My backup strategy currently primarily consists of daily backups of all of my machines with Borg Backup, stored on different storage devices in different locations, following the 3-2-1 strategy. These file-level backups are the important ones that matter most to me. My question is *not* about these backups, I mention them for context.

Next to Borg Backup I also sporadically create full disk image backups with dd. After zeroing the disk's free space (using zerofree on ext4, dd if=/dev/zero otherwise) I usually create a SquashFS image of the raw disk image (e.g. sda.img becomes the only file in disks.sqfs). This allows me to store the raw disk image in compressed form, while still allowing me to access the data without the need to decompress everything first.

These full disk image backups are stored on a single storage device (a NAS to be more precise), i.e. they don't follow the 3-2-1 strategy. Creating a second copy of the data is out of scope, simply because they take too much space and because I consider investing in more storage a waste of money given my Borg Backup backups. So, I'm fine with losing these backups per se, but I want to protect them a little better. Thus I'm thinking about adding some sort of error correction mechanism.

I read through a lot of resources and found that Reed–Solomon error correction seems to be the way to go. It adds some overhead to the data stored and provides safety in most, even though not all cases.

**My question is the following:** How do I do that in practice? What tools are available and how would I use them in my case? I found [this 10-year-old Stack Exchange question](https://unix.stackexchange.com/questions/170652/is-it-possible-to-add-error-correction-codes-bch-rs-or-etc-to-a-single-file) listing a whole bunch of tools, but many of the projects are apparently dead. Plus, they don't seem to fit my needs:

Storing the data in compressed form and yet being able to access the data without the need to decompress it first is a must-have for me. So, unless there's another solution, I'm stuck with SquashFS. However, according to the resources I read, combining ECC with compression is hard: one apparently shouldn't calculate ECCs from compressed data, but from the original data, because ECC doesn't guarantee a 100% correction and even a single remaining corruption could render all compressed data useless. However, calculating ECCs from the original data and then compressing it wouldn't help either, because I might not be able to decompress the data due to the corruption. So, apparently one needs software that does both at the same time: compression and ECC. Per ddrescue I found that lzip can actually do that by creating forward error correction (fec) files alongside compressing the data, but AFAIK I can't tell SquashFS to create these files.

So, I'm kinda stuck with this chicken-and-egg problem... How can I combine SquashFS with ECC, or is there an alternative to SquashFS that allows this?

Any suggestions?

PhrozenByte (21 rep)

Apr 1, 2025, 07:26 PM • Last activity: Apr 1, 2025, 07:30 PM

1 votes

1 answers

147 views

Why are edac-ctl and edac-util reporting that there are no EDAC drivers loaded when ECC memory is present?

linux ecc

I've installed Fedora 41 on a Dell Precision 3450 with 128GB of known-good ECC memory. I know that the ECC memory is detected, according to `lshw`, but I can't query any information with EDAC utilities. Why are EDAC drivers not loading and does this mean that the ECC feature isn't working in Linux? Or does this instead mean that it is working in hardware but that there won't be any system messages when an error occurs?

The system is running an Intel Xeon W-1370 CPU with an Intel W580 chipset, both of which are fully capable of supporting ECC memory. This is also a supported configuration by Dell.

    # lshw -class memory
    ...
      *-memory
           description: System Memory
           physical id: 1000
           slot: System board or motherboard
           size: 128GiB
           capabilities: ecc
           configuration: errordetection=ecc
         *-bank:0
              description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
              vendor: 000000000000
              physical id: 0
              serial: 00162263
              slot: DIMM3
              size: 32GiB
              width: 64 bits
              clock: 2667MHz (0.4ns)
         *-bank:1
              description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
              vendor: 000000000000
              physical id: 1
              serial: 00164294
              slot: DIMM1
              size: 32GiB
              width: 64 bits
              clock: 2667MHz (0.4ns)
         *-bank:2
              description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
              vendor: 000000000000
              physical id: 2
              serial: 00162249
              slot: DIMM4
              size: 32GiB
              width: 64 bits
              clock: 2667MHz (0.4ns)
         *-bank:3
              description: DIMM DDR4 Synchronous 2667 MHz (0.4 ns)
              vendor: 000000000000
              physical id: 3
              serial: 00162258
              slot: DIMM2
              size: 32GiB
              width: 64 bits
              clock: 2667MHz (0.4ns)

However, I can't get any ECC information from either edac-ctl or edac-util.

    # edac-ctl --status
    edac-ctl: drivers not loaded.

    # edac-util --status
    edac-util: EDAC drivers loaded. No memory controllers found

I don't see any drivers loaded when I test with modprobe either.

    # modprobe edac
    modprobe: FATAL: Module edac not found in directory /lib/modules/6.12.15-200.fc41.x86_64

I do, however, see a message in dmesg that suggests that something has been loaded.

    # dmesg | grep EDAC
    [    0.463779] EDAC MC: Ver: 3.0.0

These are all of the EDAC drivers I can find that are available on the system.

    # find /lib/modules -type f -iname "*edac*"
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/amd64_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/igen6_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/sb_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/skx_edac.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/skx_edac_common.ko.xz
    /lib/modules/6.11.4-301.fc41.x86_64/kernel/drivers/edac/x38_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/amd64_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/igen6_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/sb_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/skx_edac.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/skx_edac_common.ko.xz
    /lib/modules/6.12.15-200.fc41.x86_64/kernel/drivers/edac/x38_edac.ko.xz


                                

Zhro (2831 rep)

Mar 18, 2025, 11:05 PM • Last activity: Mar 19, 2025, 08:13 PM

0 votes

1 answers

32 views

Can't find sdram_scrub_rate file

ecc scrub

I recently bought the X670E Pro RS and added some ECC memory to it. Now I want to enable memory scrubbing, but when I look in `/sys/devices/system/edac/mc/mc0/`, the `sdram_scrub_rate` file is missing. I can't seem to find an answer online as to why. I changed the ECC setting from auto to enabled just to force it on. `lsmod | fgrep edac` reports:

    amd64_edac             69632  0
    edac_mce_amd           40960  1 amd64_edac

That looks correct, doesn't it?

Am I missing a driver, or is edac-utils perhaps required?

The question is: why is `sdram_scrub_rate` missing?
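
To see at a glance which memory controllers expose the file, a small check along these lines could be used. It is only a sketch: the function name `check_scrub_rate` is made up for illustration, the default root is the sysfs path mentioned above, and it only reports presence; it does not explain why the file is absent.

```shell
# List each EDAC memory controller under a root directory and report
# whether it exposes sdram_scrub_rate. The root defaults to the sysfs
# path from the question; pass another directory to inspect a copy.
check_scrub_rate() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        # Skip the unexpanded glob when no controller is registered.
        [ -d "$mc" ] || continue
        if [ -e "$mc/sdram_scrub_rate" ]; then
            echo "$(basename "$mc"): sdram_scrub_rate present"
        else
            echo "$(basename "$mc"): sdram_scrub_rate missing"
        fi
    done
}
```

Calling `check_scrub_rate` without arguments inspects the live system; no output at all means no `mc*` directory exists under the root.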

Caesar (25 rep)

Mar 4, 2025, 07:24 PM • Last activity: Mar 4, 2025, 09:23 PM

5 votes

2 answers

3843 views

Is ZFS safer with non-ECC RAM if you disable checksums?

zfs ram integrity ecc

I've heard about the Scrub of Death. However, one can disable checksumming in ZFS datasets. If so, will that make the situation safer for a system that's not using ECC RAM?

I'm not thinking of a NAS or anything like that - more of a workstation deployment with a single drive just to use the ZFS volume management and snapshots (and no need for fsck) benefits. I don't want to use redundancy even.

Will a bad memory location still completely destroy my storage if I disable ZFS checksums?

unfa (1825 rep)

Dec 14, 2017, 10:38 PM • Last activity: Jan 17, 2025, 08:35 AM

0 votes

2 answers

72 views

Hardware Theory: Read only Scrub of HDD/SSD drive using dd

hard-disk usb-drive ssd ecc

Claim
-----
If drives are capable of hardware controller correction of data upon read, then it is possible to routinely catch and repair silent data corruption by simply reading it.

Premises
--------

* Normally, when a drive (HDD or SSD) writes a sector, it also writes an ECC (checksum) for that sector.
* If later, a bit or two bits are flipped or misread, then the hardware controller can still retrieve the correct data by comparing and correcting it with the ECC.
* If the hardware controller reads a sector with bits that do not match the ECC, and the data can still be returned in corrected form, then the hardware controller may (should?) rewrite that sector data to the drive so that the bits will again match the ECC.

Do these premises appear flawed in any major way? Is there any information out there to help prove or disprove the premises in this claim?

If all of the premises here are correct, then it should be possible to help prevent silent data corruption with a simple cronjob that reads entire drives occasionally (perhaps every couple months).

    dd if=/dev/sda of=/dev/null
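
If the premises hold, a sector the controller cannot correct should surface as an I/O error, so the cron job could at least report which drives did not read cleanly. A minimal sketch, assuming `dd` exits non-zero when a read fails; the function name `read_scrub` is hypothetical and not part of any tool:

```shell
# Read a device (or file) end to end, discard the data, and report
# whether dd completed without a read error.
read_scrub() {
    # bs is given in bytes (1 MiB) to avoid suffix differences between
    # dd implementations; dd's transfer statistics are silenced.
    if dd if="$1" of=/dev/null bs=1048576 2>/dev/null; then
        echo "$1: read OK"
    else
        echo "$1: read error"
    fi
}
```

It would be called as `read_scrub /dev/sda` with sufficient privileges. Whether the drive silently rewrites a sector it corrected during the read is not something this can observe.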

                                

Sepero (1619 rep)

Jun 29, 2024, 07:03 AM • Last activity: Jun 30, 2024, 06:47 PM

4 votes

2 answers

464 views

Is there a filesystem that can maintain extra ECC data like raid5, but in the filesystem to make a fault-tolerant single external drive?

filesystems ssd archive corruption ecc

Normally to make a fault-tolerant or corruption-repairing filesystem, you use multiple drives and raid 5, or anything but raid 0.

There are also many ways to make a fault-tolerant archive file like dar etc.

What I am looking for is a way to make a single external ssd safer against bitrot from extended unpowered storage, yet otherwise use the drive as a normal drive, just mount and read/write files when I want like any other filesystem. Merely "when I want" can sometimes be years apart.

"normal" doesn't mean usable from Windows and Mac. Linux-only is ok.

Brian White (161 rep)

Mar 29, 2023, 07:20 AM • Last activity: Apr 15, 2024, 04:03 PM

-1 votes

1 answers

534 views

What software alternatives are there to ECC storage under Linux Mint and Linux Mint Debian Edition LMDE to protect against a bit flip problem?

linux-mint software-rec software-raid ecc

It is known that there are other approaches besides ECC memory that can help avoid data loss due to e.g. flipping of RAM memory cells by cosmic rays (bit flip problem):

What the Bit Flip Problem is:

* https://web.archive.org/web/20230114090442/https://arstechnica.com/gadgets/2021/01/linus-torvalds-blames-intel-for-lack-of-ecc-ram-in-consumer-pcs/ 

Error correction procedure:

* https://web.archive.org/web/20230114220121/https://en.wikipedia.org/wiki/Error_detection_and_correction 

* https://www-tecchannel-de.translate.goog/a/fehlertoleranter-speicher-schuetzt-vor-systemausfaellen-und-datenverlust,402181,4?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp 

BIOS implementations of RAM Mirroring:

* https://www-thomas--krenn-com.translate.goog/de/wiki/RAM_Mirroring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp 
* https://web.archive.org/web/20230114220407/https://www-thomas--krenn-com.translate.goog/de/wiki/RAM_Mirroring?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp 

Software implementations:

* https://www-admin--magazin-de.translate.goog/Das-Heft/2013/12/Speicherfehler-unter-Linux-erkennen-und-beobachten?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp 

SoftECC

https://web.archive.org/web/20230119082028/https://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf 
https://web.archive.org/web/20230114103624/https://pdos.csail.mit.edu/papers/softecc%3Addopson-meng/softecc_ddopson-meng.pdf 

What software solutions, such as kernel implementations or add-on programs, will be usable under Linux Mint 21 and LMDE 5 in 2023? Possibly an end-to-end hashing technology comparable to ZFS RAIDZ, but for working RAM rather than for hard disks.

#  Hints for possible actual solutions:
# Mirrored memory support:

* https://web.archive.org/web/20230114231952/https://lwn.net/Articles/897734/ 
* https://web.archive.org/web/20230114232055/https://www.fujitsu.com/jp/documents/products/software/os/linux/catalog/LinuxConJapan2016-Izumi.pdf 
* https://web.archive.org/web/20230114232143/https://www.phoronix.com/news/Linux-AArch64-Mirrored-Memory 
* https://www.micron.com/-/media/client/global/documents/products/technical-note/nand-flash/tn2971_software_bch_ecc_on_linux.pdf 
* https://linux.kernel.narkive.com/cxqgDQlR/software-based-ecc 
* https://dspace.mit.edu/handle/1721.1/36769 
                                

Alfred.37 (129 rep)

Jan 14, 2023, 09:41 PM • Last activity: Apr 9, 2024, 08:03 PM

0 votes

2 answers

233 views

Are there any filesystems with builtin data repairing via checksums?

filesystems sd-card checksum ecc

I've read that ZFS/BtrFS have a checksum check, but they don't use it for data recovery, only for recovering data from a full local copy or a mirror copy.

On the other hand, RAR archives have supported data redundancy for a long time, with a configurable amount. The larger the amount, the higher the probability of a successful recovery. The same goes for Dvdisaster, which is able to create .ecc files with recovery data, albeit on a separate medium.

Many advanced media, like optical disks or hard disks, have a low-level ECC check implemented in a drive controller, so it's not that needed on higher levels of abstraction. But other ones, like cheap microSD cards, may lack it and are perceivably unreliable.

So, there are ECC checks on hardware level and application level, but are there any ECC-backed filesystems?

bodqhrohro (386 rep)

Sep 3, 2023, 09:02 PM • Last activity: Sep 5, 2023, 02:05 PM

31 votes

3 answers

55406 views

How to tell whether RAM ECC is working?

linux-kernel ram ecc

I'm planning on getting some ECC RAM to replace the non-ECC RAM I currently have installed on my Asus M5A97 Pro motherboard (AMD 970 chipset, FX-6100 CPU).

After I install the RAM, **how do I tell whether the ECC feature of the RAM is working properly?**

I thought about `dmidecode --type memory`, which currently prints, among other things, the following for each RAM stick:

	Error Information Handle: Not Provided
	Total Width: 64 bits
	Data Width: 64 bits

(For one, I would expect with 1 bit of ECC per byte the data width to remain 64 bits but the total width to read 72 bits.)
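
The width arithmetic above can be checked mechanically. This is only a sketch that reports what the firmware tables claim, not whether ECC is actually active; the function name `ecc_bits` is made up here, and it assumes input in the format printed by `dmidecode --type memory`:

```shell
# Print "Total Width - Data Width" in bits for every Memory Device
# entry read from stdin. A non-numeric width (such as "Unknown" for an
# empty slot) is treated as 0 by awk.
ecc_bits() {
    awk '
        /Total Width:/ { total = $3 }
        /Data Width:/  { print total - $3 }
    '
}
```

Typical use would be `dmidecode --type memory | ecc_bits` as root. A result of 8 corresponds to the 72-bit total width expected for ECC modules, while 0 matches the output shown above.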

Can that be used for determining whether ECC is operative? Or is dmidecode too low level for that? What else could I use (except waiting and seeing if an ECC error shows up in the logs, which would indicate it's working but not that it isn't working)?

**Update:** I later thought of edac-utils. Installing them, I get the message `Not enabling Memory Error Detection and Correction since EDAC_DRIVER is not set`. That gave me the `edac-util` and `edac-ctl` executables. Can one of those be used for this purpose?

user (29991 rep)

Jun 26, 2014, 10:14 AM • Last activity: May 16, 2023, 01:24 PM

0 votes

2 answers

700 views

What are the self-healing file formats?

file-format ecc

It is known that there are self-healing file systems like e.g. ZFS, Btrfs, bcachefs and self-healing RAM, like e.g. ECC RAM or corresponding software implementations, which can correct single or multiple erroneous bits.

What self-healing file formats exist, or which projects are working on self-healing file formats, for standard programs?

By file format I mean, for example:
* .txt, .doc, .tar.lz4, .mp4

Alfred.37 (129 rep)

Mar 7, 2023, 08:27 AM • Last activity: Apr 7, 2023, 10:47 AM

19 votes

5 answers

8745 views

Is it possible to add error correction codes (BCH, RS or etc.) to a single file?

linux tar rar ecc

As far as I know, WinRAR archives may contain ECC (error correction codes), so if the archive is slightly damaged, then it can be fixed by itself.

For example, I can first encode `archives.tar` to `archives.tar.ecc`, and then upload it to my server. If the file is slightly damaged after downloading by the client, then it can be fixed automatically without downloading the file again by decoding `archives.tar.ecc`. I think it will be a great idea if the network connection is unstable.

I wonder whether there is any (open-source) software that runs on Linux and can meet my needs.

Any suggestions?

Kevin Dong (1179 rep)

Nov 30, 2014, 08:08 AM • Last activity: Mar 29, 2023, 04:54 PM

1 votes

1 answers

4447 views

Hardware error from APEI Generic Hardware Error Source (ECC RAM)

debian memory hardware dmesg ecc

    [58306.633900] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    [58306.633905] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
    [58306.633907] {1}[Hardware Error]: event severity: corrected
    [58306.633909] {1}[Hardware Error]:  Error 0, type: corrected
    [58306.633911] {1}[Hardware Error]:  fru_text: CorrectedErr
    [58306.633912] {1}[Hardware Error]:   section_type: memory error
    [58306.633914] {1}[Hardware Error]:   node: 0 device: 44696
    [58306.633916] {1}[Hardware Error]:   error_type: 2, single-bit ECC

This has appeared on my Debian Xeon server with **ECC RAM**. Does it mean the RAM modules are dying, or could it be something else, such as an error caused by software? I saw another post whose author said their OS rebooted, while mine didn't, which is why I am asking. Thank you.

Vlastimil Burián (30505 rep)

Mar 17, 2022, 10:01 AM • Last activity: Mar 17, 2022, 05:11 PM

10 votes

3 answers

3778 views

How to get error detection and correction on a single hard drive on linux (with btrfs or other methods)

btrfs ecc

One of the cool things about btrfs on linux is that it can correct bit rot if it has redundant data because of its per-block checksumming. I can get redundant data by setting up a raid1 with two disks. However, can I also get redundant data to prevent bit rot on a single disk?

I see that btrfs has a DUP option for metadata (-m dup) that stores two copies of the metadata on each drive. However, the documentation says that dup is not an option for data (i.e. -d dup is not an option). Is there a good way around this? Partition a single disk into two equal parts and raid1 them together?

Alternatively, is there another simple way to get file system level error detection and correction on linux (something like an automatic parchive for file systems)?

(I'm not interested in answers suggesting that I use two drives.)

**EDIT:** I did find this, which is a FUSE filesystem that mounts files with error correction as normal files. That said, it's a little hack/proof of concept that someone put together in 2009 and hasn't really touched since.

lnmaurer (253 rep)

May 2, 2015, 02:50 PM • Last activity: Apr 19, 2021, 10:07 AM

2 votes

0 answers

602 views

Identify RAM module linked to ECC error in dmesg

ram dmesg ecc

One of my servers is logging the following ECC errors:

        [lun set 14 00:14:16 2020] {33}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
        [lun set 14 00:14:16 2020] {33}[Hardware Error]: It has been corrected by h/w and requires no further action
        [lun set 14 00:14:16 2020] {33}[Hardware Error]: event severity: corrected
        [lun set 14 00:14:16 2020] {33}[Hardware Error]:  Error 0, type: corrected
        [lun set 14 00:14:16 2020] {33}[Hardware Error]:  fru_text: CorrectedErr
        [lun set 14 00:14:16 2020] {33}[Hardware Error]:   section_type: memory error
        [lun set 14 00:14:16 2020] {33}[Hardware Error]:   node: 0 device: 1
        [lun set 14 00:14:16 2020] {33}[Hardware Error]:   error_type: 2, single-bit ECC
        [lun set 14 00:14:16 2020] ghes_edac: Internal error: Can't find EDAC structure

The server has the following RAM configuration:

    Handle 0x0029, DMI type 16, 23 bytes
    Physical Memory Array
            Location: System Board Or Motherboard
            Use: System Memory
            Error Correction Type: Single-bit ECC
            Maximum Capacity: 64 GB
            Error Information Handle: Not Provided
            Number Of Devices: 4
    
    Handle 0x002A, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0029
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16384 MB
            Form Factor: DIMM
            Set: None
            Locator: DIMM CHA3
            Bank Locator: BANK 0
            Type: DDR4
            Type Detail: Synchronous
            Speed: 2133 MHz
            Manufacturer: SK Hynix
            Serial Number: 71929DA0
            Asset Tag: 1651
            Part Number: HMA82GU7MFR8N-TF
            Rank: 2
            Configured Clock Speed: 2133 MHz
            Minimum Voltage: Unknown
            Maximum Voltage: Unknown
            Configured Voltage: 1.2 V
    
    Handle 0x002B, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0029
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16384 MB
            Form Factor: DIMM
            Set: None
            Locator: DIMM CHA1
            Bank Locator: BANK 1
            Type: DDR4
            Type Detail: Synchronous
            Speed: 2133 MHz
            Manufacturer: SK Hynix
            Serial Number: 71929CFF
            Asset Tag: 1651
            Part Number: HMA82GU7MFR8N-TF
            Rank: 2
            Configured Clock Speed: 2133 MHz
            Minimum Voltage: Unknown
            Maximum Voltage: Unknown
            Configured Voltage: 1.2 V
    
    Handle 0x002C, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0029
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16384 MB
            Form Factor: DIMM
            Set: None
            Locator: DIMM CHB4
            Bank Locator: BANK 2
            Type: DDR4
            Type Detail: Synchronous
            Speed: 2133 MHz
            Manufacturer: SK Hynix
            Serial Number: 71929BB8
            Asset Tag: 1651
            Part Number: HMA82GU7MFR8N-TF
            Rank: 2
            Configured Clock Speed: 2133 MHz
            Minimum Voltage: Unknown
            Maximum Voltage: Unknown
            Configured Voltage: 1.2 V
    
    Handle 0x002D, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x0029
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 16384 MB
            Form Factor: DIMM
            Set: None
            Locator: DIMM CHB2
            Bank Locator: BANK 3
            Type: DDR4
            Type Detail: Synchronous
            Speed: 2133 MHz
            Manufacturer: Samsung
            Serial Number: 33BB5E37
            Asset Tag: 1641
            Part Number: M391A2K43BB1-CPB
            Rank: 2
            Configured Clock Speed: 2133 MHz
            Minimum Voltage: Unknown
            Maximum Voltage: Unknown
            Configured Voltage: 1.2 V

How can I identify the faulty module so that I can replace it? I think the following log line has the information I need, but I don't know how to decode it.

    [lun set 14 00:14:16 2020] {33}[Hardware Error]:   node: 0 device: 1
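
As a first step, the node/device pairs can at least be pulled out of the log and tallied to see whether every error points at the same device. This is just a sketch, with `apei_node_device` as a made-up helper name; it does not solve the actual problem of mapping the device number to a DIMM slot, which depends on the platform firmware.

```shell
# Read kernel log lines on stdin and print "node=N device=D" for each
# APEI memory-error record that carries a node/device pair.
apei_node_device() {
    sed -n 's/.*\[Hardware Error]: *node: *\([0-9][0-9]*\) *device: *\([0-9][0-9]*\).*/node=\1 device=\2/p'
}
```

For example, `dmesg | apei_node_device | sort | uniq -c` would count how often each pair occurs.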


                                

sKo (21 rep)

Sep 14, 2020, 10:00 AM

5 votes

1 answers

5882 views

Remove ECC warnings in system log

linux-kernel kali-linux dmesg ecc

How can I disable these warnings about ECC? I don't have ECC memory, so I also disabled it in the BIOS, but the messages are still printed.


    [    4.697057] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.697061] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    4.764909] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.764911] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    4.844621] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.844624] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    4.889875] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
    [    4.892678] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.892681] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    4.913651] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
    [    4.936635] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.936637] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    4.949722] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
    [    4.980600] EDAC amd64: Node 0: DRAM ECC disabled.
    [    4.980602] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
    [    5.028880] EDAC amd64: Node 0: DRAM ECC disabled.
    [    5.028883] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
                    Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
                    (Note that use of the override may cause unknown side effects.)
                                

JoKeR (438 rep)

Apr 2, 2020, 05:55 AM • Last activity: Sep 2, 2020, 05:20 AM

11 votes

1 answers

7521 views

How do I enable and verify ECC RAM scrubbing in Linux?

linux-kernel memory ecc

I bought my first system with ECC RAM and am trying to learn about its possibilities when it comes to alerting and maintenance in Linux. To be specific, Debian Linux on a [Super Micro H8SGL](https://www.supermicro.com/Aplus/motherboard/Opteron6000/SR56x0/H8SGL.cfm) motherboard with an [AMD Opteron 6386 SE](https://www.amd.com/en/products/cpu/6386-se) CPU and [Samsung M393B2G70QH0-YK0](https://www.samsung.com/semiconductor/dram/module/M393B2G70QH0-YK0/) DDR3 ECC RAM.

I have learnt that it is possible to [_scrub_](https://en.wikipedia.org/wiki/Memory_scrubbing) ECC RAM, which sounds like an excellent idea. ECC RAM can normally _repair_ 1-bit errors and _detect_ 2-bit errors. Scrubbing involves periodically reading RAM to preemptively repair the 1-bit errors before they become 2-bit errors. I also learnt that Linux supports this, but I'm having problems using it, so I need some help getting started and figuring out the settings.

### Linux EDAC driver

From what I understand, Linux handles ECC RAM using a subsystem called EDAC, and the controls for that are exposed under /sys/devices/system/edac/. I can see my two memory controllers here (2 node NUMA):

    # ls /sys/devices/system/edac/mc/
    mc0 mc1 power subsystem uevent

I can also see that the EDAC drivers are somehow loaded:

    # edac-util --status
    edac-util: EDAC drivers are loaded. 2 MCs detected
    # lsmod | grep edac
    amd64_edac_mod 36864 0
    edac_mce_amd 28672 1 amd64_edac_mod

Now I want to enable scrubbing. According to the [Linux ABI documentation](https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-edac) the scrub rate is exposed through the /sys/devices/system/edac/mc/mc*/sdram_scrub_rate file, documented as such:

> The scrubbing rate used by the memory controller is set by writing a minimum bandwidth in bytes/sec to the attribute file. The rate will be translated to an internal value that gives at least the specified rate.
> Reading the file will return the actual scrubbing rate employed. If configuration fails or memory scrubbing is not implemented, the value of the attribute file will be -1.

But nothing happens when I do this. Writing a sensible value (somewhere in the middle when checking the [source](https://github.com/torvalds/linux/blob/master/drivers/edac/amd64_edac.c) and the [CPU documentation](http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf)) to the file seems to work, but it always returns 0 when reading from it:

    # cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
    0
    # echo 1000000 >/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
    # echo $?
    0
    # cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
    0

After digging this deep, what am I missing?

### BIOS ECC Configuration

I have also tried different settings in the BIOS. There is an option in the BIOS for ECC configuration, but none of its settings has any effect on the scrub rate visible from Linux.

Right now I'm trying the User setting but I really can't see any difference between these.
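The write-then-read-back check from the shell session above can also be sketched in Python. This is only an illustration: `set_scrub_rate` and `describe` are hypothetical helper names, and the only semantics assumed are the ones in the quoted ABI text (-1 on failure, otherwise the rate actually in use). What a read-back of 0 means is not covered by that text, so the sketch just reports it.

```python
from pathlib import Path


def set_scrub_rate(attr: Path, bytes_per_sec: int) -> int:
    """Write the requested bandwidth to the attribute file, then read it back."""
    attr.write_text(f"{bytes_per_sec}\n")
    return int(attr.read_text().strip())


def describe(effective: int) -> str:
    """Interpret the read-back value using only the quoted ABI description."""
    if effective == -1:
        return "configuration failed or scrubbing is not implemented"
    if effective == 0:
        return "rate reported as 0 (the case observed in the question)"
    return f"scrubbing at {effective} bytes/sec or more"
```

On a real system the `attr` argument would be something like `Path("/sys/devices/system/edac/mc/mc0/sdram_scrub_rate")`, and the write needs root.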

pipe (893 rep)

Jun 15, 2020, 04:21 PM • Last activity: Jun 24, 2020, 07:09 AM

3 votes

0 answers

302 views

Mapping around ecc errors in Linux does not seem to work?

linux memory ecc


I get the following ECC error on a Linux box several times a day:

    May 24 18:21:04 staton-nas kernel: mce: [Hardware Error]: Machine check events logged
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000040000800c2
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: TSC 1c35588953416
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: ADDR 117d228000
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: MISC 122100200020008c
    May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1590358864 SOCKET 0 APIC 0
    May 24 18:21:04 staton-nas kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x117d228 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:1 rank:4)

The ADDR is always the same, so I’m trying to map around it with a ‘memmap=5M$0x117CFA8001’ kernel argument. The argument seems to be applied, because I see the following in syslog:

    May 24 16:03:09 staton-nas kernel: user: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
    May 24 16:03:09 staton-nas kernel: user: [mem 0x0000000100000000-0x000000117cfa8000] usable
    May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117cfa8001-0x000000117d4a8000] reserved
    May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117d4a8001-0x000000407fffffff] usable

but I still get the ECC errors. Am I missing something? Is the “ADDR 117d228000” in the EDAC syslog errors not the actual address I need to map around? Do I need to convert that to a physical address somehow? I’m too cheap to replace a whole DIMM for a single bad bit.

The more research I do, the more convinced I become that the “memory scrubbing error” message indicates the error is coming from memory scrubbing that the hardware is doing, and that I can safely ignore it now that I have mapped around it. The OS will never actually use this memory area because I reserved it. Can anyone confirm that?
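As a sanity check on the numbers above, here is a small arithmetic sketch. It assumes the logged ADDR can be compared directly with the memmap range (which is the very thing the question asks about) and 4 KiB pages; `reserved_span` and `is_reserved` are hypothetical helper names.

```python
MIB = 1024 * 1024
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def reserved_span(start, size):
    """Inclusive (first, last) byte addresses covered by memmap=<size>$<start>."""
    return start, start + size - 1


def is_reserved(addr, start, size):
    """True if addr falls inside the span reserved by memmap=<size>$<start>."""
    first, last = reserved_span(start, size)
    return first <= addr <= last


error_addr = 0x117D228000            # "ADDR 117d228000" from the EDAC log
start, size = 0x117CFA8001, 5 * MIB  # memmap=5M$0x117CFA8001

first, last = reserved_span(start, size)
print(f"reserved: {first:#x}-{last:#x}")
print(f"page of error address: {error_addr >> PAGE_SHIFT:#x}")
print(f"inside reserved span: {is_reserved(error_addr, start, size)}")
```

The computed span matches the reserved range in the syslog excerpt (0x117cfa8001 to 0x117d4a8000), the page number matches the `page:0x117d228` field, and the error address does fall inside the reserved span under these assumptions.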

statop (31 rep)

May 25, 2020, 05:07 PM

24 votes

3 answers

15293 views

Is it possible to find the physical address range of a DIMM?

memory ecc smbios


I note that SMBIOS Type 20 would help here, but it's optional as of version 2.5 (2006-09-05) pp. 25, L796, and pp. 131, whereas types 16, 17 and 19 are mandatory, but don't quite help.

### Physical Memory Array (Type 16)

There is one of these structures for the entire system, explaining what is possible on this board.

    Handle 0x1000, DMI type 16, 23 bytes
    Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 768 GB
        Error Information Handle: Not Provided
        Number Of Devices: 24

### Memory Device (Type 17)

There is one record per DIMM, describing each physical DIMM installed on the board.

    Handle 0x1100, DMI type 17, 34 bytes
    Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM_A1 
        Bank Locator: Not Specified
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: XXXX
        Serial Number: XXXX
        Asset Tag: XXXX
        Part Number: XXXX 
        Rank: 1
        Configured Clock Speed: 1333 MHz

### Memory Array Mapped Address (Type 19)

There can be multiple of these records, and each record lists a range of physical addresses.

Here is the output with two 2GB sticks:

    Handle 0x1300, DMI type 19, 31 bytes
    Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x000CFFFFFFF
        Range Size: 3328 MB
        Physical Array Handle: 0x1000
        Partition Width: 2

    Handle 0x1301, DMI type 19, 31 bytes
    Memory Array Mapped Address
        Starting Address: 0x00100000000
        Ending Address: 0x0012FFFFFFF
        Range Size: 768 MB
        Physical Array Handle: 0x1000
        Partition Width: 2

And here is the output with 4 sticks; 2*2GB and 2*4GB:

    Handle 0x1300, DMI type 19, 31 bytes
    Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x000CFFFFFFF
        Range Size: 3328 MB
        Physical Array Handle: 0x1000
        Partition Width: 2

    Handle 0x1301, DMI type 19, 31 bytes
    Memory Array Mapped Address
        Starting Address: 0x00100000000
        Ending Address: 0x0032FFFFFFF
        Range Size: 8960 MB
        Physical Array Handle: 0x1000
        Partition Width: 2

Note that in the first sample output above, there were two 2 GB DIMMs, but the two ranges are 3328 MB and 768 MB. With four DIMMs, the system likewise coalesces the memory array mapped address regions into two chunks, because these records simply mirror the e820 map, i.e. the valid physical memory address ranges.
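The range sizes can be checked from the start and end addresses alone. The sketch below recomputes them for both Type 19 listings above; `range_size_mb` is a hypothetical helper, and the end addresses are treated as inclusive, as the dmidecode output suggests.

```python
MIB = 1024 * 1024


def range_size_mb(start, end):
    """Size in MB of an inclusive physical address range."""
    return (end - start + 1) // MIB


# (start, end) pairs copied from the two Type 19 listings above
two_sticks = [(0x00000000000, 0x000CFFFFFFF), (0x00100000000, 0x0012FFFFFFF)]
four_sticks = [(0x00000000000, 0x000CFFFFFFF), (0x00100000000, 0x0032FFFFFFF)]

for name, ranges in (("2 x 2GB", two_sticks), ("2 x 2GB + 2 x 4GB", four_sticks)):
    sizes = [range_size_mb(s, e) for s, e in ranges]
    print(name, sizes, "total:", sum(sizes), "MB")
```

The totals come out at 4096 MB and 12288 MB, i.e. the installed capacity in each case, which is consistent with the ranges describing the address map as a whole rather than individual DIMMs.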

One or more Type 20 records are tied to exactly one Type 17 memory device, so the entire physical range of that device can be determined:

### Example

    $ sudo dmidecode -t 20
    # dmidecode 2.12
    SMBIOS 2.6 present.
    
    Handle 0x002F, DMI type 20, 19 bytes
    Memory Device Mapped Address
    	Starting Address: 0x00000000000
    	Ending Address: 0x000FFFFFFFF
    	Range Size: 4 GB
    	Physical Device Handle: 0x002B
    	Memory Array Mapped Address Handle: 0x002E
    	Partition Row Position: 1
    
    Handle 0x0030, DMI type 20, 19 bytes
    Memory Device Mapped Address
    	Starting Address: 0x00100000000
    	Ending Address: 0x001FFFFFFFF
    	Range Size: 4 GB
    	Physical Device Handle: 0x002C
    	Memory Array Mapped Address Handle: 0x002E
    	Partition Row Position: 1

It seems possible to go from address to DIMM for EDAC (Error Detection and Correction) purposes, but not from DIMM to entire range.

The source code of mcelog shows that it also uses type 20 for its decoding.
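To illustrate the address-to-DIMM direction, here is a rough sketch that parses Type 20 records in the textual format shown in the example above and looks up which Physical Device Handle covers a given address. `parse_type20` and `find_device` are hypothetical names, the parser assumes exactly the field layout printed above, and it ignores interleaving, so treat it as an illustration rather than how mcelog does it.

```python
import re

# Trimmed copy of the `dmidecode -t 20` example above
SAMPLE = """\
Handle 0x002F, DMI type 20, 19 bytes
Memory Device Mapped Address
\tStarting Address: 0x00000000000
\tEnding Address: 0x000FFFFFFFF
\tPhysical Device Handle: 0x002B

Handle 0x0030, DMI type 20, 19 bytes
Memory Device Mapped Address
\tStarting Address: 0x00100000000
\tEnding Address: 0x001FFFFFFFF
\tPhysical Device Handle: 0x002C
"""


def parse_type20(text):
    """Collect the 'Key: value' fields of every DMI type 20 record."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if re.match(r"Handle 0x[0-9A-Fa-f]+, DMI type 20\b", line):
            current = {}
            records.append(current)
        elif line.startswith("Handle "):
            current = None  # a record of some other type starts here
        elif current is not None and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            current[key] = value
    return records


def find_device(records, addr):
    """Return the Physical Device Handle whose range contains addr, or None."""
    for rec in records:
        start = int(rec["Starting Address"], 16)
        end = int(rec["Ending Address"], 16)
        if start <= addr <= end:
            return int(rec["Physical Device Handle"], 16)
    return None


handle = find_device(parse_type20(SAMPLE), 0x150000000)
print(hex(handle) if handle is not None else "no Type 20 record covers this address")
```

The returned handle is the Type 17 handle of the DIMM, so the Locator string can then be read from the matching `dmidecode -t 17` record.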


                                

Alun (409 rep)

Jan 6, 2014, 06:39 AM • Last activity: Dec 24, 2017, 04:34 AM

4 votes

2 answers

14454 views

Understanding "Hardware error from APEI Generic Hardware Error Source" error message

logs hardware ecc


**Summary**: I'm trying to understand exactly what the following error message means:

    [17016.923750] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
    [17016.923758] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
    [17016.923759] {4}[Hardware Error]: event severity: corrected
    [17016.923761] {4}[Hardware Error]:  Error 0, type: corrected
    [17016.923762] {4}[Hardware Error]:  fru_text: CorrectedErr
    [17016.923764] {4}[Hardware Error]:   section_type: memory error

**Details**:

I have a server with an Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz CPU that is running Arch Linux (3.18.6-1-ARCH #1 SMP PREEMPT Sat Feb 7 08:44:05 CET 2015 x86_64 GNU/Linux).

When I run dmesg I see the error that I posted above. The errors are not that frequent, but they do keep happening. For instance, the server has been up for 1 day since the last reboot, and there are 9 instances of this error in the log.

I saw another question that [asked about this error](https://unix.stackexchange.com/questions/150451/apei-generic-hardware-error), and there was an answer that suggested the problem was that the ECC memory is failing.

My questions are:

1) Is there any reference to support the idea that this error message is associated with ECC memory?

2) If I do have a failing DIMM is there a suggested way to figure out which one it is?  I tried running memtest86+, but it did not report any memory errors.

3) If the OS reports ECC errors have been corrected does that really mean the DIMM is failing?  

I wouldn't be so concerned if the only problem was a few messages in my log file. But I have also noticed that the server sometimes hangs unexpectedly. The machine is being used for research, so stability is not as important as it would be for a production system. Still, having the machine hang can be problematic. So I would like to know exactly what this error message means, and if I need to replace a component, it would be nice to have a way to figure out which one.

**Edit**

Currently the server has been up for 8 days without hanging and I see 148 instances of this error message in the logs.  In addition I see one instance of the following message:

    [671211.188084] EDAC MC0: INTERNAL ERROR: csrow value is out of range (6 >= 4)
    [671211.188333] EDAC MC0: 1 CE ie31200 CE on unknown memory (channel:1 page:0x0 offset:0x0 grain:0 syndrome:0xc8)

I guess it is likely that one of the DIMMs has a problem. Still, I would be interested in any information about how to interpret these messages, in particular how to figure out which DIMM may be failing.
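For what it's worth, the `key:value` fields in the EDAC line above can be pulled apart mechanically, which at least makes the channel and syndrome easy to compare across many log entries. This is a sketch based only on the message layout visible in this log; `parse_edac_line` is a hypothetical helper and the regular expression is an assumption, not an official format description.

```python
import re

LINE = ("[671211.188333] EDAC MC0: 1 CE ie31200 CE on unknown memory "
        "(channel:1 page:0x0 offset:0x0 grain:0 syndrome:0xc8)")

# Layout assumed from the log line above:
#   EDAC MC<n>: <count> <CE|UE> <message> on <label> (<key:value ...>)
PATTERN = re.compile(r"EDAC MC(\d+): (\d+) (CE|UE) (.*?) on (.*?) \((.*)\)")


def parse_edac_line(line):
    """Split an EDAC report line into its parts, or return None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    mc, count, kind, message, label, details = m.groups()
    fields = {}
    for token in details.split():
        if ":" in token:  # skip separators such as "-"
            key, value = token.split(":", 1)
            fields[key] = value
    return {"mc": int(mc), "count": int(count), "kind": kind,
            "message": message, "label": label, "fields": fields}


info = parse_edac_line(LINE)
print(info["kind"], info["label"], info["fields"])
```

Here the label comes back as "unknown memory", so the only location hint left in this particular message is the `channel` field.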

Gabriel Southern (843 rep)

Feb 25, 2015, 02:11 AM • Last activity: Dec 5, 2017, 09:08 PM

2 votes

0 answers

124 views

software-level error detection and correction for raw storage

filesystems storage device-mapper ecc


If I understand data storage correctly, all storage devices are unreliable to some extent, which is why most have hardware-level abstraction layers. Hard drives use error correction. If a sector is read and ECC detects an error (whether it was from the original writing or from random bit flipping over time), ECC is used to try to recover from the error and that sector is potentially marked bad and remapped to the spare sector pool. Some hardware devices don't have any of that, though, especially things like flash memory on embedded systems, which gets accessed directly, with no hardware level error-checking layer between it and the kernel.

Does Linux provide methods, like special filesystems or logical volumes (by logical volumes, I mean things like cryptsetup or lvm2), that can deal directly with such "raw" devices, doing all of the checksumming, bad sector remapping, error correction, etc. at the software level? Would the method of error checking depend on the type or the properties of the raw storage?

enigmaticPhysicist (1542 rep)

Oct 25, 2016, 09:25 PM • Last activity: Oct 26, 2016, 08:18 PM
