Unable to run linpack on head node of cluster
1 vote, 1 answer, 51 views
I recently set up my own home cluster of 4 Raspberry Pi units, but I am having problems trying to benchmark all 4 units using Linpack.
One node is the head node, called rpislave1. It connects to the Internet and my local wifi network through its wlan0 interface, while its eth0 connects to the internal LAN that forms the cluster.
The other 3 nodes are rpislave2, rpislave3 and rpislave4. Each is connected to the head node, rpislave1, and gets its Internet access through it. To keep things simple, these 3 nodes network-boot off a flash drive connected to rpislave1.
All units have been allocated their own IP address via DHCP, keyed on their MAC address.
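(I am not including my full DHCP setup here, but the reservations are just the usual MAC-to-IP mappings. With dnsmasq, for example, they would look roughly like the lines below; the MAC addresses are placeholders, not my real ones.)
dhcp-range=192.168.50.10,192.168.50.100,12h
dhcp-host=d8:3a:dd:00:00:02,rpislave2,192.168.50.11
dhcp-host=d8:3a:dd:00:00:03,rpislave3,192.168.50.12
dhcp-host=d8:3a:dd:00:00:04,rpislave4,192.168.50.13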
Here is the /etc/hosts file for the head node:
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
127.0.1.1 cluster
192.168.50.1 rpislave1 cluster
192.168.50.11 rpislave2
192.168.50.12 rpislave3
192.168.50.13 rpislave4
All units can be accessed via ssh from rpislave1 without passwords, and they all share an NFS mount at /sharezone, which is backed by a thumbdrive attached to rpislave1.
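(There is nothing exotic about the share; the export on rpislave1 and the mount on the other nodes are along these lines, with the options only approximate.)
/etc/exports on rpislave1:
/sharezone 192.168.50.0/24(rw,sync,no_subtree_check)
/etc/fstab entry on each of the other nodes:
rpislave1:/sharezone /sharezone nfs defaults 0 0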
I am pretty happy with the learning experience and decided to benchmark the total processing power of the cluster (rpislave1, rpislave2, rpislave3 and rpislave4) using HPL, i.e. Linpack: https://www.netlib.org/benchmark/hpl/
I started off by installing OpenMPI on the head node, rpislave1, and the benchmark worked there on its own, clocking in at about 15 GFlops. Nothing to boast about, of course, but it was fun.
I then proceeded to set up Linpack and OpenMPI on rpislave2 and ran a standalone test there, and did the same on the remaining units, rpislave3 and rpislave4.
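Each standalone test was just a single-node run of the same binary, roughly like this, with HPL.dat sized for one board (P x Q = 2 x 2 so all 4 cores are used):
cd /sharezone/hpl
mpirun -np 4 /sharezone/xhpl/bin/xhpl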
So I decided to try running it across 2 nodes, rpislave1 and rpislave2.
Here is the HPL.dat I am using for the 2 nodes, although I don't think the problem lies in the HPL.dat itself.
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
40704 Ns
1 # of NBs
192 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
I even made a host file to use with it (4 slots on each node, matching the 2 x 4 = 8 process grid in HPL.dat):
user@rpislave1:/sharezone/hpl $ cat host_file
rpislave1 slots=4
rpislave2 slots=4
Here is the command I used:
time mpirun -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl
But the output I got was this:
user@rpislave1:/sharezone/hpl $ time mpirun -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 40704
NB : 192
PMAP : Row-major process mapping
P : 2
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: rpislave2
Local PID: 1574
Peer hostname: rpislave1 ([[58941,1],2])
Source IP of socket: 192.168.50.1
Known IPs of peer:
169.254.131.47
--------------------------------------------------------------------------
I have no idea what is causing this issue, but I have noticed that if I run the Linpack test on rpislave2, rpislave3 or rpislave4, or across any combination of two of them, it works without issue.
It is as if I cannot run it on the head node, rpislave1.
I have been looking around for days and trying all sorts of things. I suspect Open MPI is picking up the wlan0 interface that the head node uses to connect to the local wifi network, so I tried "--mca btl_tcp_if_exclude wlan0" and various other MCA options (a couple of the exact forms I tried are shown after the version listing below), but nothing worked. I even went through the GitHub issues, but those all seem to have been fixed already and I should have the latest patches. Here are the Open MPI package versions I have installed:
user@rpislave1:/sharezone/hpl $ sudo apt install openmpi-bin openmpi-common libopenmpi3 libopenmpi-dev
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.0-10).
libopenmpi3 is already the newest version (4.1.0-10).
openmpi-bin is already the newest version (4.1.0-10).
openmpi-common is already the newest version (4.1.0-10).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
user@rpislave1:/sharezone/hpl $
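For reference, the MCA attempts I mentioned looked roughly like the following; I tried both the exclude form and the include form (restricting the TCP BTL and the out-of-band channel to eth0), with no change in the error:
time mpirun --mca btl_tcp_if_exclude lo,wlan0 -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl
time mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl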
Does anyone have any idea what is causing the "Open MPI detected an inbound MPI TCP connection request from a peer that appears to be part of this MPI job" error? I suspect it may be related to the wlan0 interface, since the message shows this:
Known IPs of peer:
169.254.131.47
A traceroute shows this result:
user@rpislave1:/sharezone/hpl $ traceroute 169.254.131.47
traceroute to 169.254.131.47 (169.254.131.47), 30 hops max, 60 byte packets
1 rpislave1.local (169.254.131.47) 0.192 ms 0.107 ms 0.096 ms
user@rpislave1:/sharezone/hpl $
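To double-check which interface, if any, currently holds that 169.254.131.47 address on rpislave1, something like the command below can be run; I don't see any 169.254 address in the ifconfig output that follows:
ip -o -4 addr show | grep 169.254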
Here is the ifconfig output for rpislave1, the head node:
user@rpislave1:/sharezone/hpl $ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.50.1 netmask 255.255.255.0 broadcast 192.168.50.255
inet6 fe80::d314:681c:2e82:d5bc prefixlen 64 scopeid 0x20<link>
ether d8:3a:dd:1d:92:15 txqueuelen 1000 (Ethernet)
RX packets 962575 bytes 911745808 (869.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 590397 bytes 382892062 (365.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 3831 bytes 488990 (477.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3831 bytes 488990 (477.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.101.15 netmask 255.255.255.0 broadcast 192.168.101.255
inet6 2001:f40:950:b164:806e:1571:b836:23a4 prefixlen 64 scopeid 0x0<global>
inet6 fe80::1636:9990:bd05:dd05 prefixlen 64 scopeid 0x20<link>
ether d8:3a:dd:1d:92:16 txqueuelen 1000 (Ethernet)
RX packets 44632 bytes 12764596 (12.1 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 74151 bytes 13143926 (12.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
user@rpislave1:/sharezone/hpl $
I would really appreciate any help on solving this issue.
Asked by AlexChan
(21 rep)
Jul 26, 2023, 10:37 AM
Last activity: Jul 28, 2023, 09:41 AM