
What is the appropriate iommu kernel parameter for a Ryzen 5 1600 with multiple RX580 GPUs?

3 votes
0 answers
2975 views
Filling all the PCIe slots with RX580 cards prevents my PC from booting, with errors such as AMD-Vi IOTLB_INV_TIMEOUT, AER: Corrected error received, or a kernel panic. Adding the Linux kernel parameters iommu=soft and pci=noaer solves the boot issue. On Lubuntu and Ubuntu 20.04 I see these logs from the kernel drm:
00:58:47 lubu kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
00:58:47 lubu kernel: [drm] UVD and UVD ENC initialized successfully.
00:58:47 lubu kernel: [drm] VCE initialized successfully.
00:58:47 lubu kernel: [drm] Cannot find any crtc or sizes
Additionally, on Ubuntu 20.04, gdm3 floods my journalctl output with messages such as:
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): EDID vendor "GSM", prod id 19311
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): DDCModeFromDetailedTiming: 720x480 Warning: We only handle separate sync.
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Using EDID range info for horizontal sync
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Using EDID range info for vertical refresh
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Printing DDC gathered Modelines:
However, if I use the GPUs heavily, the system freezes at random on both distributions. When I check journalctl, I mostly see error logs like these:
00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=175, emitted seq=177
00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
00:59:06 lubu kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
00:59:06 lubu kernel: amdgpu: [powerplay] last message was failed ret is 65535
or
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4226, emitted seq=4228
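To see which rings time out and how far behind they are, the timeout lines above can be filtered out of a journal dump. This is only a minimal sketch: the regular expression is written against the two log lines quoted here, and the function name `ring_timeouts` is made up for illustration, so other amdgpu message formats may not match.

```python
import re

# Pattern derived from the amdgpu_job_timedout lines quoted above (an assumption:
# other kernel versions may word this message differently).
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(lines):
    """Return (ring, signaled_seq, emitted_seq) for every ring timeout line."""
    found = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            found.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
    return found


# The two timeout lines from the logs above.
sample = [
    "00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring sdma0 timeout, signaled seq=175, emitted seq=177",
    "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=4226, emitted seq=4228",
]

for ring, signaled, emitted in ring_timeouts(sample):
    # emitted - signaled is the number of submitted jobs that never completed
    print(ring, emitted - signaled)
```

In both quoted cases the ring is two jobs behind (emitted minus signaled) when the timeout fires.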
Some Linux users have reported that iommu=pt solves this problem, see https://bbs.archlinux.org/viewtopic.php?id=250297 . I've been confused by this issue for several months, and I would like to understand in depth what is going on here. So I've read the Linux kernel documentation on IOMMU tuning at https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html for the possible parameters, and https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt for their meaning. I'm not an IOMMU expert, and I find it difficult to understand terms such as GART, remapping/unmapping, and bounce buffering (SWIOTLB), as well as the interaction between amdgpu and the Linux system itself.

What I can gather: iommu=soft means that Linux uses software bounce buffering (SWIOTLB), and the default value, noforce, prevents the OS from booting on my PC. noforce means that use of the hardware IOMMU is not forced. Then there is another parameter, amd_iommu, which is obviously relevant to my hardware setup. It can take three possible values: fullflush, off, and force_isolation. Unfortunately, I don't know what its default value is.

My short question: what is the best combination of the iommu and amd_iommu parameters for my hardware, so that I can fully utilize my RX580s on the Ryzen 5?

My extra questions:
1. Does amd_iommu complement iommu? And does amd_iommu refer to the AMD Ryzen CPU or to the AMD GPU hardware? Since there is also intel_iommu, I think amd_iommu refers to the Ryzen chipset.
2. If iommu is not forced to use the hardware IOMMU, what is the impact of amd_iommu=off?

I would be very grateful for a deeper explanation and a direct link to the source code. Thanks!
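When comparing combinations of iommu, amd_iommu and pci, it helps to confirm which values the running kernel actually received. The sketch below parses a kernel command line string; the helper names and the example command line are hypothetical, and on a live system the string would come from /proc/cmdline. It splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Map each kernel parameter to its value (None for bare flags like 'quiet')."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # A later occurrence of the same key overwrites an earlier one here.
        params[key] = value if sep else None
    return params


def iommu_settings(cmdline):
    """Report the three parameters discussed in the question."""
    params = parse_cmdline(cmdline)
    return {
        name: params.get(name, "<not set>")
        for name in ("iommu", "amd_iommu", "pci")
    }


# Hypothetical command line for illustration; on a running system use:
#     with open("/proc/cmdline") as f: cmdline = f.read()
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet iommu=soft pci=noaer"
print(iommu_settings(example))
```

A parameter reported as `<not set>` means the kernel falls back to its built-in default for that option.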
Asked by ywiyogo (170 rep)
Oct 30, 2020, 09:26 AM
Last activity: Nov 5, 2020, 09:14 AM