
Linux reboots with no panic when booting SMP configuration from kexec

I'm working on a project involving kexec. I have it working on some of our hardware platforms. On one platform, I'm getting sudden reboots with no panic dump during SMP setup:
[   25.219028] smpboot: CPU0: AMD EPYC 7402 24-Core Processor (family: 0x17, model: 0x31, stepping: 0x0)
[   25.228083] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[   25.237997] ... version:                0
[   25.247996] ... bit width:              48
[   25.257996] ... generic registers:      6
[   25.267996] ... value mask:             0000ffffffffffff
[   25.277996] ... max period:             00007fffffffffff
[   25.287996] ... fixed-purpose events:   0
[   25.297996] ... event mask:             000000000000003f
[   25.308059] rcu: Hierarchical SRCU implementation.
[   25.318046] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[   25.328283] smp: Bringing up secondary CPUs ...
[   25.335543] x86: Booting SMP configuration:
This platform boots fine under normal (i.e., non-kexec) circumstances. The primary and kexec kernels are built from the same codebase but linked differently. That is unlikely to be the cause, because the same setup works on an Intel platform. Command line of the kexec'd kernel:
[    0.000000] Command line: elfcorehdr=0x86000000 ro panic=5 console=ttyS0,9600 loglevel=8 numifbs=0 nf_conntrack.acct=1 nmi_watchdog=1 profile=0 root=/dev/ram0 initrd=/crashfs.gz libata.force=disable
BIOS version (possibly relevant, since the normal boot goes through the BIOS and the kexec boot does not):
Version 2.20.1275. Copyright (C) 2022 American Megatrends, Inc.
BIOS V1.05(08/26/2022)
I've tracked it down to wakeup_secondary_cpu_via_init in arch/x86/kernel/smpboot.c: the last output I get is just before the first apic_icr_write. I don't even know where to begin debugging this. Could the NMI watchdog be forcing a reboot because the only available core is hanging for some reason? That seems unlikely, since the hung core is the one that would have to perform the NMI checks.
Asked by Sarvadi (121 rep)
Sep 27, 2023, 05:00 PM