
Debian : MPI code - Intel compiler - [Hardware Error]: Unified Memory Controller Error: DRAM ECC error

0 votes
1 answer
386 views
When I run an executable compiled with Intel mpiicc, I get the following errors after about 30 minutes of running:

    kernel:[29585.573874] [Hardware Error]: Corrected error, no action required.
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573888] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc54c00040a800611
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573891] [Hardware Error]: Unified Memory Controller Extended Error Code: 0
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573893] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
    Message from syslogd@pablo at Nov 8 09:53:25 ...
    kernel:[29585.573895] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

I am working on an AMD EPYC 7702P 64-Core Processor with 1 TB of RAM, running Debian:

    Linux pablo 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Following what I found while searching, I ran the command

    dmidecode -t memory

which gives:

    # dmidecode 3.2
    Getting SMBIOS data from sysfs.
    SMBIOS 3.2.0 present.

    Handle 0x0023, DMI type 16, 23 bytes
    Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 2 TB
        Error Information Handle: 0x0022
        Number Of Devices: 8

    Handle 0x002B, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x002A
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL A
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F701
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x002E, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x002D
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL B
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F3ED
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x0031, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x0030
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL C
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F4BA
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x0034, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x0033
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL D
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F396
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x0037, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x0036
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL E
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F67D
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x003A, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x0039
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL F
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F394
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x003D, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x003C
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL G
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F48A
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

    Handle 0x0040, DMI type 17, 84 bytes
    Memory Device
        Array Handle: 0x0023
        Error Information Handle: 0x003F
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 128 GB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: P0 CHANNEL H
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered) LRDIMM
        Speed: 2933 MT/s
        Manufacturer: Samsung
        Serial Number: 03C6F3FB
        Asset Tag: Not Specified
        Part Number: M386AAG40MMB-CVF
        Rank: 4
        Configured Memory Speed: 2933 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 128 kB
        Cache Size: None
        Logical Size: None

I don't know where these DRAM ECC errors come from. Maybe there is an incompatibility between my motherboard and CPU model, or a bad version of the Intel compiler SDK? The errors appear roughly every 5 minutes during execution.

I am using the Intel compilers, version compilers_and_libraries_2020.1.217.

**I also get the same error messages when I compile with the Open MPI version from the official Debian 10 repository.**

Perhaps I should change an option in the BIOS, but I am not sure. If anyone has an idea how to solve this issue, I would be grateful to hear it.
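
For reference, the raw MC18_STATUS value in the log above can be decoded by hand to check that it agrees with the flags the kernel printed. The following is only a minimal sketch in Python: the bit positions and the memory-error code tables are assumed from the Linux kernel's x86 MCE definitions for AMD processors, and the function name `decode_mca_status` is hypothetical, not part of any existing tool.

```python
# Sketch: decode an AMD MCA status word such as the MC18_STATUS value
# from the kernel log. Bit positions are an assumption taken from the
# Linux kernel's x86 MCE definitions; verify against AMD documentation.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "SyndV": 53,
    "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43,
}

# Tables for the "memory error" form of the low 16-bit error code
# (0000 0001 RRRR TTLL), as printed by the kernel decoder.
LEVEL = ["RESV", "L1", "L2", "L3/GEN"]
TX = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_mca_status(status):
    """Return the set flags and the decoded error code of a status word."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    result = {
        "flags": flags,
        # Bits 21:16 hold the extended error code on Scalable MCA systems.
        "extended_error_code": (status >> 16) & 0x3F,
    }
    code = status & 0xFFFF
    if (code & 0xFF00) == 0x0100:  # memory error format
        rrrr = (code >> 4) & 0xF
        result["memory_error"] = {
            "cache_level": LEVEL[code & 0x3],
            "tx": TX[(code >> 2) & 0x3],
            "mem_tx": MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV",
        }
    return result


if __name__ == "__main__":
    # Value copied from the MC18_STATUS line in the log above.
    print(decode_mca_status(0xDC2041000000011B))
```

Under these assumptions, the value decodes to a corrected ECC error (CECC set, UC clear) on a generic read with extended error code 0, which is consistent with the "Corrected error, no action required" and "mem-tx: RD" lines in the log. The Over flag indicates that at least one further error was logged before the status register was read.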
Asked by youpilat13 (1 rep)
Nov 8, 2020, 03:48 PM
Last activity: Nov 9, 2020, 01:53 PM