TCP/IP Network in Ubuntu 22.04 Becomes Unresponsive After Heavy Network Load from MPI Program
I have two identical servers running Ubuntu 22.04.3 LTS. Both systems have 2x AMD 9654 CPUs with 192 total cores and 512 GB of RAM. Each server has two 10G ethernet ports built into the motherboard. These 10G ports are configured to create a single link aggregation with netplan.
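For context, the bond is defined in netplan with a file along these lines. This is only a simplified sketch, not a copy of my actual file: the interface renaming to Ethernet-10G-1/2 is omitted, and the bond mode shown is illustrative.

network:
  version: 2
  ethernets:
    enp15s0f0: {}
    enp15s0f1: {}
  bonds:
    Bond-10G:
      interfaces: [enp15s0f0, enp15s0f1]
      dhcp4: true
      parameters:
        mode: 802.3ad   # illustrative; the actual mode in my file may differ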
The whole network configuration runs perfectly well under normal loads. Here is the output of ip a from the first server (Thor):
Thor$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: Ethernet-10G-1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c6:9b
altname enp15s0f0
3: Ethernet-10G-2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c6:9c
altname enp15s0f1
4: Bond-10G: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.203/22 brd 10.0.3.255 scope global dynamic noprefixroute Bond-10G
valid_lft 31554381sec preferred_lft 31554381sec
inet6 fe80::200:ff:fe00:4/64 scope link
valid_lft forever preferred_lft forever
Here is the output of ping from the first server to the second server (Loki) under normal conditions:
Thor$ ping loki
PING loki.elliptic.loc (10.0.1.204) 56(84) bytes of data.
64 bytes from Loki.elliptic.loc (10.0.1.204): icmp_seq=1 ttl=64 time=0.139 ms
This shows the latency is low: 139 microseconds. Both servers are connected to the same switch, a Netgear XS728T 28-port 10-Gigabit L2+ Smart Switch. I also ran a network test with iperf. The results (not shown here but available if useful) confirm a sustained bandwidth of 10.0 gigabits/second between the two hosts.
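For completeness, the iperf test was essentially the following (flags paraphrased, not copied verbatim):

Loki$ iperf -s
Thor$ iperf -c loki -t 30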
Now onto my problem. I'm a PhD student in applied math, and I use these servers to run a large-scale numerical simulation code. The simulation program uses MPI. I've tested this program and it works perfectly on 192 cores on one host at a time. I can also run it on two hosts if I use a small number of cores, e.g. 8 on each host. But when I try to use a large number of cores, the MPI program hangs because it loses TCP connections between processes. Here is example error output when I tried and failed to run it on 192 cores on each host (384 cores total), followed by a sketch of how I launch the job:
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: Thor
PID: 8076
Message: connect() to 10.0.1.204:1162 failed
Error: No route to host (113)
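For reference, the failing run was launched with a command along these lines (the binary name and the exact MCA options here are placeholders, not my literal command):

Thor$ mpirun -np 384 --host thor:192,loki:192 --mca btl self,tcp ./simulation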
Furthermore, even after the MPI program is terminated, IP networking on one or both of the servers no longer functions. I can still access the servers after the networking goes down by using an out-of-band IPMI tool. Once the TCP/IP network is dead, any ping returns "Destination Host Unreachable"; the machine can't even reach the router or the network switch in this state. The only way I've been able to restore networking in this condition is a full reboot.
I checked the output of ip a on the remote server in this condition, and it looked identical to its normal state. I'll paste it below in case I've missed something:
Loki$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: Ethernet-10G-1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c7:2b
altname enp15s0f0
3: Ethernet-10G-2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c7:2c
altname enp15s0f1
4: Bond-10G: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.204/22 brd 10.0.3.255 scope global dynamic noprefixroute Bond-10G
valid_lft 31555883sec preferred_lft 31555883sec
inet6 fe80::200:ff:fe00:5/64 scope link
valid_lft forever preferred_lft forever
I've made some small progress, which leads me to believe the problem is that the TCP network can't keep up with the load of many connections being created quickly and sending a lot of traffic. Reading through the Open MPI documentation, I saw hints that several Linux kernel parameters should be tuned to run MPI over 10-gigabit TCP/IP networks. I entered these changes into /etc/sysctl.d/21-net.conf as follows (reloading them as shown after the listing):
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.route.flush = 1
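For reference, the settings in /etc/sysctl.d/21-net.conf can be reloaded and spot-checked without a reboot, e.g.:

Thor$ sudo sysctl --system
Thor$ sysctl net.core.rmem_max net.ipv4.tcp_rmem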
Before I made these changes, I couldn't even get the program to run with 8 MPI processes on each node. After making them, it was able to run with 32 MPI processes on each node. I then made another round of changes and increased the values further, bumping net.core.rmem_max and net.ipv4.tcp_mem to their maximum value of 2^31-1. With this change, the program is able to run on 128 cores on each of the two hosts, but it still hangs when I try to use all 192 cores.
Here is one last data point. I repeated this test on two completely different computers provided by my PhD adviser. They're a bit older, with 28 CPUs each, and run Ubuntu 20.04 LTS in a completely typical configuration: one-gigabit networking without any bonding. I was able to replicate exactly the problems I see on my machines. The only difference is that the older machines buckled under the network load with just 8 MPI processes on each node.
Here is my intuition. MPI communicates between processes on the same node using shared memory, which is very fast and puts no load on the TCP/IP network. Each pair of communicating MPI processes on different nodes, however, requires its own socket and TCP connection. Running a big simulation on many cores therefore creates a huge load on the TCP/IP network. Increasing the buffer sizes helps, but it's still slow and prone to crashing the TCP network completely. I think this is something of an edge case in the HPC world, since most of the big supercomputing clusters use faster interconnects like InfiniBand. I haven't met anyone else trying to scale up to 1000 CPUs with 10-gigabit Ethernet.
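To put a rough number on that intuition: if every rank on Thor ends up with a socket to every rank on Loki (a worst-case, all-to-all assumption), a 192 + 192 run needs on the order of

Thor$ echo $((192 * 192))
36864

inter-node TCP connections, versus 8 * 8 = 64 for the small runs that work. Live connections can be counted on either host with something like:

Thor$ ss -tn state established | wc -l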
If anyone on Stack Exchange is familiar with this class of issues and has any advice, I'd be very grateful. I'm in the fourth year of a PhD program and have already spent over two solid weeks failing to fix this. Thanks a lot from a longtime member.
-Michael
Asked by Michael S. Emanuel
(41 rep)
Nov 3, 2023, 10:03 PM