What is the appropriate iommu kernel parameter for a Ryzen 5 1600 with multiple RX580 GPUs?
Populating all the PCIe slots with RX580 cards prevents my PC from booting, with errors such as AMD-VI IOTLB_INV_TIMEOUT, AER: Corrected error received, or a kernel panic. Adding the kernel parameters iommu=soft and pci=noaer solves the boot issue. On Lubuntu and Ubuntu 20.04 I see these log messages from the kernel drm:
00:58:47 lubu kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
00:58:47 lubu kernel: [drm] UVD and UVD ENC initialized successfully.
00:58:47 lubu kernel: [drm] VCE initialized successfully.
00:58:47 lubu kernel: [drm] Cannot find any crtc or sizes
Additionally, on Ubuntu 20.04, gdm3 floods my journal with messages like these:
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): EDID vendor "GSM", prod id 19311
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): DDCModeFromDetailedTiming: 720x480 Warning: We only handle separate sync.
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Using EDID range info for horizontal sync
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Using EDID range info for vertical refresh
14:02:36 ub20 /usr/lib/gdm3/gdm-x-session: (II) AMDGPU(0): Printing DDC gathered Modelines:
However, if I use the GPUs heavily, I experience random system freezes on both distributions. When I check journalctl afterwards, I mostly see error logs like these:
00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=175, emitted seq=177
00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
00:59:06 lubu kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
00:59:06 lubu kernel: amdgpu: [powerplay] last message was failed ret is 65535
or
kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4226, emitted seq=4228
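To see at a glance which ring timed out and how far behind it was, the timeout lines can be pulled out of the journal with a small script. This is only a minimal sketch: the function name find_ring_timeouts is hypothetical, and the pattern is based solely on the two "ring ... timeout" message formats quoted above, so other kernel versions may word the message differently.

```python
import re

# Matches messages such as:
#   ... *ERROR* ring sdma0 timeout, signaled seq=175, emitted seq=177
# The pattern is an assumption derived from the log lines quoted in this post.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return a list of (ring, signaled_seq, emitted_seq) tuples.

    `lines` is any iterable of log lines, for example the text output
    of journalctl split into lines.
    """
    results = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            results.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
    return results


if __name__ == "__main__":
    sample = [
        "00:59:06 lubu kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring sdma0 timeout, signaled seq=175, emitted seq=177",
        "00:59:06 lubu kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset begin!",
    ]
    for ring, signaled, emitted in find_ring_timeouts(sample):
        # emitted - signaled is the number of jobs still outstanding.
        print(ring, signaled, emitted, emitted - signaled)
```

In both excerpts above, the emitted sequence number is two ahead of the signaled one, on the sdma0 ring in one case and the gfx ring in the other.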
Some Linux users have reported that iommu=pt solves this kind of freeze, see https://bbs.archlinux.org/viewtopic.php?id=250297 .
I've been confused by this issue for several months, and I would like to understand in depth what is going on here. So I've read the Linux kernel documentation on IOMMU tuning at https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html for the possible parameters, and https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt for their meaning.
I'm not an IOMMU expert, so it is difficult for me to understand terms such as GART, re/unmapping, and bounce buffering (SWIOTLB), or the interaction between amdgpu and the Linux system itself. As far as I can tell, iommu=soft means that Linux uses software bounce buffering (SWIOTLB), and the default value noforce prevents the OS from booting on my PC. noforce means that use of the hardware IOMMU is not forced.
Then there is another parameter, amd_iommu, which is of course significant for my hardware setup. It accepts three possible values: fullflush, off, and force_isolation. Unfortunately, I don't know what the default value of this option is.
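To keep track of which of these parameters a given boot actually used, the kernel command line can be inspected. The following is a rough sketch, not an authoritative tool: the helper name iommu_params and the list of watched keys are my own choices, and on a running Linux system the input string would be the contents of /proc/cmdline.

```python
# Keys of interest for this question; the selection is an assumption,
# not an exhaustive list of IOMMU-related kernel parameters.
WATCHED_KEYS = ("iommu", "amd_iommu", "intel_iommu", "pci")


def iommu_params(cmdline):
    """Return {key: [values]} for the watched keys found in `cmdline`.

    `cmdline` is a kernel command line string, for example the contents
    of /proc/cmdline. Comma-separated values are split into a list, and
    a key given without "=value" maps to [""].
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in WATCHED_KEYS:
            values = value.split(",") if sep else [""]
            found.setdefault(key, []).extend(values)
    return found


if __name__ == "__main__":
    # Example string using the two parameters that let my PC boot.
    example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro iommu=soft pci=noaer"
    print(iommu_params(example))
```

A key that is absent from the result was simply not set on the command line, so the kernel's built-in default applies; the sketch cannot tell what that default is.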
My short question is: what is the best combination of the iommu and amd_iommu parameters for my hardware, so that I can fully utilize my RX580s on the Ryzen 5?
My extra questions:
1. Does amd_iommu complement iommu? Or does amd_iommu refer to the AMD Ryzen CPU or to the AMD GPU hardware? Since there is also intel_iommu, I think amd_iommu refers to the Ryzen chipset.
2. If iommu is not forced to use the hardware IOMMU, what is the impact of amd_iommu=off?
I would be very grateful for a more in-depth explanation and a direct link to the relevant source code.
Thanks!
Asked by ywiyogo
(170 rep)
Oct 30, 2020, 09:26 AM
Last activity: Nov 5, 2020, 09:14 AM