Emergency Mode Issue on SUSE Linux HPC System

0 votes
0 answers
1565 views
I have a high-performance computing (HPC) cluster with a head node and 4 worker nodes. Recently, I had to shut it down for maintenance at our data center. When I tried to power the system back on, I got the following error message:
[ 5.215623][ C14] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2)
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Give root password for maintenance (or press Control-D to continue):
and the system seems to be stuck in a loop. Initially, I pressed Ctrl+D as suggested to boot into default mode, but it just returns to the same emergency mode prompt every time.

A couple of things that might be relevant:

* An external USB device was left plugged into the back of the system when I powered it on after maintenance. I wasn't aware of it at the time, and I'm not sure whether it could be causing the issue, but it's worth mentioning.
* Each node requires two power cables plugged into the power adapter. While reconnecting them, I realized that one of the power cables for a node was not initially connected to a power source. I have since fixed this, and all nodes are now receiving power as required.

I'm not a Linux expert, so I'm a bit lost as to what could be causing this problem. I've searched for solutions online, but nothing I've tried has worked. If you have encountered a similar issue or have experience with SUSE Linux and HPC systems, I would greatly appreciate any advice on how to troubleshoot and resolve this "emergency mode" problem.
Asked by Train_Learn_2350 (1 rep)
Aug 8, 2023, 03:01 PM
Last activity: Aug 9, 2023, 11:44 AM