
Unix & Linux Stack Exchange

Q&A for users of Linux, FreeBSD and other Unix-like operating systems

Latest Questions

2 votes
1 answer
3552 views
Openmpi installation
I have just started to use Linux Mint for academic reasons and ran into an error as I was trying to install openmpi-2.0.1. I am getting following error as I am trying to make check make[4]: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers' make[3]: Leaving directory `/home/kul...
I have just started to use Linux Mint for academic reasons and ran into an error while trying to install openmpi-2.0.1. I am getting the following error when I run make check:

```
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers'
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers'
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers'
Making check in etc
make: Entering directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/etc'
make: Nothing to be done for `check'.
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/etc'
Making check in mpi/c
make: Entering directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/mpi/c'
Making check in profile
make: Entering directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/mpi/c/profile'
CC pstatus_c2f.lo
rm: cannot remove '.libs/pstatus_c2f.o': Permission denied
Assembler messages:
Fatal error: can't create .libs/pstatus_c2f.o: Permission denied
make: *** [pstatus_c2f.lo] Error 1
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/mpi/c/profile'
make: *** [check-recursive] Error 1
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi/mpi/c'
make: *** [check-recursive] Error 1
make: Leaving directory `/home/kuljeet/Downloads/openmpi-2.0.1/ompi'
make: *** [check-recursive] Error 1
```

Earlier I got this error:

```
make: Entering directory `/home/thanhnt/openmpi-1.6/ompi/debuggers'
CCLD predefined_gap_test
libtool: link: cannot find the library ../../ompi/libmpi.la' or unhandled argument ../../ompi/libmpi.la'
make: *** [predefined_gap_test] Error 1
make: Leaving directory `/home/thanhnt/openmpi-1.6/ompi/debuggers'
make: *** [check-am] Error 2
make: Leaving directory `/home/thanhnt/openmpi-1.6/ompi/debuggers'
make: *** [check-recursive] Error 1
make: Leaving directory `/home/thanhnt/openmpi-1.6/ompi'
make: *** [check-recursive] Error
```

Even after fixing the permission error above, I still got:

> libtool: error: cannot find the library '../../ompi/libmpi.la' or unhandled argument '../../ompi/libmpi.la'
>
> make: *** [predefined_gap_test] Error 1
>
> make: Leaving directory /home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers' make: *** [check-am]
>
> Error 2 make: Leaving directory /home/kuljeet/Downloads/openmpi-2.0.1/ompi/debuggers' make: *** [check-recursive] Error 1
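A minimal recovery sketch, assuming the `Permission denied` on files under `.libs/` comes from objects created by an earlier build run as root (the paths are taken from the question; adjust to your own tree):

```bash
# reclaim the build tree, then rebuild and re-run the checks as a normal user
cd ~/Downloads/openmpi-2.0.1
sudo chown -R "$USER":"$USER" .     # assumption: stale root-owned objects from a previous 'sudo make'
make distclean                      # start from a clean tree so libmpi.la is regenerated
./configure --prefix="$HOME/opt/openmpi-2.0.1"
make -j"$(nproc)"
make check
```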
Kuljeet Keshav (129 rep)
Dec 19, 2016, 05:15 PM • Last activity: Jul 17, 2025, 11:30 AM
0 votes
1 answer
1927 views
slurm: srun & sbatch different performance with the same settings
In a slurm system, when I use **srun** command to run the program. It runs very slow and seems like only one processor works. srun --pty -A free -J test -N 1 -n 1 -c 1 mpirun -np 16 $FEAPHOME8_3/parfeap/feap -log_summary lu.log But if I write a **sbatch** script, it can run very quickly and looks li...
On a Slurm system, when I use the **srun** command to run the program, it runs very slowly and it seems like only one processor works:

```
srun --pty -A free -J test -N 1 -n 1 -c 1 mpirun -np 16 $FEAPHOME8_3/parfeap/feap -log_summary lu.log
```

But if I write an **sbatch** script, it runs very quickly and it looks like all the processors work:

```
#!/bin/sh -l
#SBATCH --job-name=test
#SBATCH --account=free
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --time=6:00:00
echo ' '
echo ' ****** START OF MAIN-JOB ******'
date
srun -n 16 echo y | mpirun -np 16 $FEAPHOME8_3/parfeap/feap -log_summary lu.log
echo ' ****** END OF MAIN-JOB ******'
#End of script
```

Could anybody please tell me what's going on?
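A hedged sketch of an interactive allocation that actually reserves 16 tasks before `mpirun` starts; the assumption is that the `-N 1 -n 1 -c 1` request above confines the whole `mpirun` to a single CPU. The account, job name, and binary are taken from the question:

```bash
# reserve 16 tasks on one node, then launch MPI inside that allocation
# (assumes the MPI library is Slurm-aware, so mpirun uses the allocation)
salloc -A free -J test -N 1 -n 16
mpirun -np 16 "$FEAPHOME8_3/parfeap/feap" -log_summary lu.log
```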
Rilin Shen (33 rep)
Sep 18, 2017, 01:29 AM • Last activity: Mar 3, 2025, 05:06 PM
1 vote
3 answers
2068 views
Install MPICH on CentOS error could not determine the size of a Fortran INTEGER
I followed this [installation guide][1] to install MPICH on my machine. I got the following error while running `configure`: configure: error: Unable to configure with Fortran support because configure could not determine the size of a Fortran INTEGER. Consider setting CROSS_F77_SIZEOF_INTEGER to the le...
I followed this installation guide to install MPICH on my machine. I got the following error while running configure:

```
configure: error: Unable to configure with Fortran support because configure
could not determine the size of a Fortran INTEGER. Consider setting
CROSS_F77_SIZEOF_INTEGER to the length in bytes of a Fortran INTEGER
```

Here is the full output and config.log file. Thanks for any guidance or comment.
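A minimal sketch of two common ways around this, assuming the failure comes from `configure` not finding a usable Fortran compiler rather than a real cross-compilation setup:

```bash
# either point configure at an installed Fortran compiler...
sudo yum install gcc-gfortran          # CentOS package providing gfortran
./configure FC=gfortran F77=gfortran --prefix="$HOME/opt/mpich"

# ...or build MPICH without Fortran bindings if they are not needed
./configure --disable-fortran --prefix="$HOME/opt/mpich"
```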
Abolfazl (123 rep)
Apr 7, 2017, 10:48 AM • Last activity: Apr 29, 2024, 03:53 AM
1 vote
1 answer
187 views
MPI program doesn't start right away (Fedora)
sorry I'm not sure if this is better asked here or Stack Overflow, but this is with a fresh Fedora 39 install, and I've tested it with the packages (installed via dnf) mpich-devel, openmpi-devel and (from linux-homebrew) openmpi and I get the same problem. I guess this has to do with the system, whi...
sorry I'm not sure if this is better asked here or Stack Overflow, but this is with a fresh Fedora 39 install, and I've tested it with the packages (installed via dnf) mpich-devel, openmpi-devel and (from linux-homebrew) openmpi and I get the same problem. I guess this has to do with the system, which is why I thought it's better asked here. Basically, if I try to run an MPI program that contains MPI_Init, it does run, but it takes a very long time to actually start. It also doesn't show up in top immediately; it seems it gets delayed for some reason. When it does appear, it runs as quickly as I would expect. Here's a simple hello world MPI program in Fortran:
program hello_world
    use mpi
    implicit none
    integer :: mpi_size, mpi_rank
    integer :: ierr
    double precision :: t1, t2

    call MPI_Init(ierr)
    t1 = MPI_Wtime()

    call MPI_Comm_size(MPI_COMM_WORLD, mpi_size, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, mpi_rank, ierr)

    print*, "Hello from rank ", mpi_rank, " of ", mpi_size, "."

    t2 = MPI_Wtime()
    print*, "time for full program", t2-t1
    call MPI_Finalize(ierr)

end program hello_world
I can compile this with mpif90 (no errors) and I can run it with a delay of maybe a few minutes. I looked up what might be causing it and on similar posts it was suggested to close other programs and to disable wifi (don't know why?), but none of that helped. Here is the output for time mpirun -np 2 ./a.out
Hello from rank            0  of            2 .
 time for full program   1.3940195000000001E-002
 Hello from rank            1  of            2 .
 time for full program   1.4017349000000000E-002

real	2m27.524s
user	0m0.033s
sys	0m0.163s
For comparison, here's a much simpler hello world program,
program hello
    print *, 'Hello, World!'
end program hello
If I compile with mpif90, and time it (ran with one processor here to compare against gfortran), I get:
Hello, World!

real	1m13.201s
user	0m0.016s
sys	0m0.074s
With gfortran, (doesn't matter if I run with ./a.out or mpirun -np 1 ./a.out)
Hello, World!

real	0m0.002s
user	0m0.001s
sys	0m0.001s
It looks like it also intermittently works with no wait (enough so that I deleted this question thinking an update had fixed it, but when I ran it again the problem was back).

Edit: It looks like the problem is in MPI_Init somehow. You can put in calls like
call date_and_time(values=values)
    print *, values(5),":",values(6),":",values(7),":",values(8)
to find the time at certain lines, and the bottleneck is MPI_Init. Example:

```
Program start time: 17 : 20 : 6 : 221
After MPI_Init:     17 : 21 : 21 : 128
After MPI_Finalize: 17 : 21 : 21 : 132
```
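A small test sketch (assuming Open MPI, since both the dnf and Homebrew builds show the delay) to check whether interface probing or hostname resolution is where MPI_Init spends its time; the MCA parameters below are standard Open MPI knobs, not something from the question:

```bash
# confirm the machine's own hostname resolves instantly; a missing /etc/hosts
# entry is a classic cause of multi-minute MPI_Init stalls
time getent hosts "$(hostname)"

# restrict Open MPI to shared memory + loopback so network discovery is ruled out
mpirun --mca btl self,vader --mca oob_tcp_if_include lo -np 2 ./a.out
```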
tmph (111 rep)
Jan 12, 2024, 10:56 AM • Last activity: Apr 16, 2024, 08:30 PM
4 votes
2 answers
24927 views
mpi.h not found
I tried to compile the Hello World program in C, inside Eclipse PTP, but it gives me an error related to `mpi.h`. I have included `/usr/local/include` and `/usr/local/lib` in my paths, and also tried running a search with `find / -name mpi.h`. I still get a *No such file or directory* error. I tried...
I tried to compile the Hello World program in C, inside Eclipse PTP, but it gives me an error related to mpi.h. I have included /usr/local/include and /usr/local/lib in my paths, and also tried running a search with find / -name mpi.h. I still get a *No such file or directory* error. I tried to install mpich2, but still couldn't find mpi.h. Also:

- There is no folder inside the include directory; why is that?
- I can find mpicc at /usr/bin/mpicc

The same problem occurs when trying to compile the project as C++ code. What should I do?
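A hedged sketch of how the header is usually located outside the IDE, assuming a Debian/Ubuntu-style system where the compiler wrapper is already at /usr/bin/mpicc:

```bash
# let the wrapper compile the program; it injects the right -I/-L flags itself
mpicc hello.c -o hello

# ask the wrapper where those flags point (MPICH prints them with -show,
# Open MPI with -showme); the include directory listed there is what
# Eclipse PTP needs in its project include paths
mpicc -show 2>/dev/null || mpicc -showme

# find which package owns the wrapper and whether the matching -dev package
# (which ships mpi.h) is installed
dpkg -S "$(readlink -f "$(which mpicc)")"
```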
Dalia Shouman (57 rep)
Oct 21, 2015, 02:40 PM • Last activity: Nov 18, 2023, 11:47 AM
4 votes
0 answers
545 views
TCP/IP Network in Ubuntu 22.04 Becomes Unresponsive After Heavy Network Load from MPI Program
I have two identical servers running Ubuntu 22.04.3 LTS. Both systems have 2x AMD 9654 CPUs with 192 total cores and 512 GB of RAM. Each server has two 10G ethernet ports built into the motherboard. These 10G ports are configured to create a single link aggregation with netplan. The whole network co...
I have two identical servers running Ubuntu 22.04.3 LTS. Both systems have 2x AMD 9654 CPUs with 192 total cores and 512 GB of RAM. Each server has two 10G ethernet ports built into the motherboard. These 10G ports are configured to create a single link aggregation with netplan. The whole network configuration runs perfectly well under normal loads. Here is the output of $ip a from the first server (Thor):
Thor$ ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: Ethernet-10G-1:  mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
    link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c6:9b
    altname enp15s0f0
3: Ethernet-10G-2:  mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
    link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c6:9c
    altname enp15s0f1
4: Bond-10G:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:00:00:00:04 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.203/22 brd 10.0.3.255 scope global dynamic noprefixroute Bond-10G
       valid_lft 31554381sec preferred_lft 31554381sec
    inet6 fe80::200:ff:fe00:4/64 scope link 
       valid_lft forever preferred_lft forever
Here is the output of ping from the first server to the second server (Loki) under normal conditions:
Thor$ ping loki
PING loki.elliptic.loc (10.0.1.204) 56(84) bytes of data.
64 bytes from Loki.elliptic.loc (10.0.1.204): icmp_seq=1 ttl=64 time=0.139 ms
This shows that the latency is low, 139 microseconds. Both servers are connected to the same switch, which is a Netgear XS728T 28 port 10 Gigabit L2+ Smart Switch. I also ran a networking test with iperf. The results (not shown here but available if useful) confirm sustained bandwidth of 10.0 gigabits / second between the two hosts. Now onto my problem. I'm a PhD student in applied math, and I use these servers to run a large scale numerical simulation code. The simulation program uses MPI. I've tested this program and it works perfectly on 192 cores on one host at a time. I can also run this program on two hosts if I use a small number of cores, e.g. 8 on each host. But when I try to run it using a large number of cores, the MPI program hangs because it loses TCP connections between processes. Here is example error output when I tried and failed to run it on 192 cores on each host (384 cores total):
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: Thor
  PID:        8076
  Message:    connect() to 10.0.1.204:1162 failed
  Error:      No route to host (113)
Furthermore, even after the MPI program is terminated, IP networking on one or both of the servers doesn't function anymore. I can gain access to the servers after the networking goes down by using an out of band IPMI tool. Once the TCP/IP network is dead, a call to ping leads to an error message "destination host unreachable." The machine can't even ping the router or network switch in this state. The only way I've been able to restore it in this condition is a full reboot. I checked the output of $ip a on the remote server in this condition, and it appeared identical to me to where it started. I'll paste it below in case I've missed something:
Loki $ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: Ethernet-10G-1:  mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
    link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c7:2b
    altname enp15s0f0
3: Ethernet-10G-2:  mtu 1500 qdisc mq master Bond-10G state UP group default qlen 1000
    link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff permaddr a0:36:bc:c8:c7:2c
    altname enp15s0f1
4: Bond-10G:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:00:00:00:05 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.204/22 brd 10.0.3.255 scope global dynamic noprefixroute Bond-10G
       valid_lft 31555883sec preferred_lft 31555883sec
    inet6 fe80::200:ff:fe00:5/64 scope link 
       valid_lft forever preferred_lft forever
I've made some small progress which leads me to believe the problem is related to the TCP network being unable to keep up with the load of many connections being quickly created and sending a lot of traffic. Reading through the OpenMPI documentation, I saw some hints that that several linux kernel parameters should be tuned to run MPI with 10 gigabit TCP/IP networks. I entered these changes into /etc/sysctl.d/21-net.conf as follows:
net.core.rmem_max				= 16777216
net.core.wmem_max				= 16777216
net.ipv4.tcp_rmem				= 4096 87380 16777216
net.ipv4.tcp_wmem				= 4096 65536 16777216
net.core.netdev_max_backlog		= 30000
net.core.rmem_default			= 16777216
net.core.wmem_default			= 16777216
net.ipv4.tcp_mem				= 16777216 16777216 16777216
net.ipv4.route.flush			= 1
Before I made these changes, I couldn't even get the program to run with 8 MPI processes on each node. After making the changes, it was able to run with 32 MPI processes on each node. I made another round of changes and increased them even more, bumping net.core.rmem_max and net.ipv4.tcp_mem to their maximum value of 2^31-1. With this change, the program is able to run on 128 cores on each of the two hosts, but still hangs when I try to use all 192 cores. Here is one last data point. I repeated this test using two completely different computers provided by my PhD adviser. They're a bit older with 28 CPUs each and running Ubuntu 20.04 LTS. Everything was in a completely typical configuration: one gigabit networking without any network bonding. I was able to exactly replicate the problems I have on my machines. The only difference is that the older machines buckled under the network load with just 8 MPI processes on each node. Here is my intuition. MPI communicates between processes on the same node using shared memory. It's very fast and doesn't put any load on the TCP/IP network. Each connection between a pair of MPI processes across different nodes requires a socket and a TCP connection. When running a big simulation on many cores, this creates a huge load on the TCP/IP network. Increasing the buffer sizes helps, but it's still slow and prone to completely crashing the TCP network. I think this is something of an edge case in the HPC world, since most of the big supercomputing clusters use faster networking solutions like Infiniband. I haven't met anyone else trying to scale up to 1000 CPUs with 10 gigabit ethernet. If anyone on Stack Exchange is familiar with this class of issues and has any advice, I'd be very grateful to you. I'm in the fourth year of a PhD program and I've already spent over two solid weeks failing to fix this. Thanks a lot from a longtime member. -Michael
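A short sketch of applying and verifying the drop-in without a reboot, plus one more tunable worth checking as an assumption: with 192 ranks per node, every local rank may open its own TCP connection to every remote rank, so the default ephemeral port range and listen backlog can become limits during the connection storm at startup:

```bash
# load /etc/sysctl.d/21-net.conf (and the rest of sysctl.d) and confirm the values took
sudo sysctl --system
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# assumption: widen the ephemeral port range and the accept backlog for the
# O(ranks^2) connections opened when the job starts
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
sudo sysctl -w net.core.somaxconn=8192
```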
Michael S. Emanuel (41 rep)
Nov 3, 2023, 10:03 PM
1 vote
1 answer
699 views
NFS server seems to kick diskless nodes off
I'm currently helping setting up a lab that will use diskless nodes for some MPI and CUDA computing. The distribution of choice is CentOS 7. To set up the diskless nodes I've followed the guide [here][1]. [1]: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_admi...
I'm currently helping setting up a lab that will use diskless nodes for some MPI and CUDA computing. The distribution of choice is CentOS 7. To set up the diskless nodes I've followed the guide here . I got to boot a diskless node successfully and even run some MPI test programs. So everything works fine in terms of connectivity,firewalls,nfs exports etc. The problem is that after ~12 hours of having booted the diskless node, the main server which acts as a dhcp,tftp and nfs server seems to kick of the diskless node from the nfs service which results with the kernel: nfs: server not responding, still trying message appearing on the client. At that point I also stop getting ping replies from the diskless clients. Since the client has its root fs obtained by NFS I guess this leaves the client at a "corrupted" state only allowing me to reboot with Ctrl+Alt+Del or the machine's reset switch. No matter how much times passes the client won't connect back. Inspecting /var/log/messages on the main server I got this interesting in my opinion line: Oct 8 23:30:50 myhostname kernel: NFSD: purging unused client (clientid e87d62f6). Here is a bigger part of the log: `Oct 8 23:30:17 myhostname kernel: nfsv4 compound op ffff885c713d4080 opcnt 4 #3: 3: status 0 Oct 8 23:30:17 myhostname kernel: nfsv4 compound op #4/4: 9 (OP_GETATTR) Oct 8 23:30:17 myhostname kernel: nfsd: fh_verify(36: 01070001 00260308 00000000 996a1153 334c49c8 b8768c81) Oct 8 23:30:17 myhostname kernel: nfsv4 compound op ffff885c713d4080 opcnt 4 #4: 9: status 0 Oct 8 23:30:17 myhostname kernel: nfsv4 compound returned 0 Oct 8 23:30:17 myhostname kernel: --> nfsd4_store_cache_entry slot ffff885c72a66000 Oct 8 23:30:17 myhostname kernel: renewing client (clientid 5bbb153f/e87d62f7) Oct 8 23:30:50 myhostname kernel: NFSD: laundromat service - starting Oct 8 23:30:50 myhostname kernel: NFSD: purging unused client (clientid e87d62f6) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: cmd: remove Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: arg: 4c696e7578204e465376342e31206e6f64653033 Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: env0: (null) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: env1: (null) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: /sbin/nfsdcltrack return value: 0 Oct 8 23:30:50 myhostname kernel: NFSD: laundromat_main - sleeping for 57 seconds Oct 8 23:31:48 myhostname kernel: NFSD: laundromat service - starting Oct 8 23:31:48 myhostname kernel: NFSD: purging unused client (clientid e87d62f7) Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: cmd: remove Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: arg: 4c696e7578204e465376342e31206e76696469613031 Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: env0: (null) Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: env1: (null) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: cmd: remove Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: arg: 4c696e7578204e465376342e31206e6f64653033 Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: env0: (null) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: env1: (null) Oct 8 23:30:50 myhostname kernel: nfsd4_umh_cltrack_upcall: /sbin/nfsdcltrack return value: 0 Oct 8 23:30:50 myhostname kernel: NFSD: laundromat_main - sleeping for 57 seconds Oct 8 23:31:48 myhostname kernel: NFSD: laundromat service - starting Oct 8 23:31:48 myhostname kernel: NFSD: purging unused client (clientid e87d62f7) Oct 8 
23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: cmd: remove Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: arg: 4c696e7578204e465376342e31206e76696469613031 Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: env0: (null) Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: env1: (null) Oct 8 23:31:48 myhostname kernel: nfsd4_umh_cltrack_upcall: /sbin/nfsdcltrack return value: 0 Oct 8 23:31:48 myhostname kernel: NFSD: laundromat_main - sleeping for 90 seconds Oct 8 23:33:18 myhostname kernel: NFSD: laundromat service - starting Oct 8 23:33:18 myhostname kernel: NFSD: laundromat_main - sleeping for 90 seconds Oct 8 23:34:48 myhostname kernel: NFSD: laundromat service - starting Oct 8 23:34:48 myhostname kernel: NFSD: laundromat_main - sleeping for 90 seconds` Afterwards it just continues looping the laundromat service starting/sleeping message forever. nfsstat does not reveal anything weird on the server like badcalls etc. I've also tried to force use NFSv3 version. I got the same problem, however the purging unused client and laundromat message does not appear in the logs now (guessing it was added in v4?). Now onto some details regarding how the connectivity is. The main server features 2 network interfaces. One realtek one (which works by default with the kernel drivers) and one which is nvidia nforce and needs kmod-forcedeth from elrepo. All server services are on the nvidia-nforce card. The diskless node and the server connect via a gigabit switch (can't remember brand name/model sorry).
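A small evidence-gathering sketch, on the assumption that the `purging unused client` entries are the server reacting to a client that stopped renewing its NFSv4 lease (i.e. the network path dropped first and the NFS purge is a symptom). These are generic NFS/tcpdump tools, not part of the guide the question follows; `<client-ip>` is a placeholder for the diskless node's address:

```bash
# on the diskless client: record the exact mount options and NFS statistics
nfsstat -m
nfsstat -c

# on the server: confirm the NFS services are still registered, and watch
# traffic to the client around the time it disappears
rpcinfo -p
tcpdump -ni any host <client-ip> and port 2049
```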
user-2147482617 (21 rep)
Oct 18, 2018, 08:37 AM • Last activity: May 8, 2023, 12:34 PM
0 votes
0 answers
178 views
How to find out MPI vendor
How do I find out what MPI (MPICH, OpenMPI, etc) is installed? MPI location (returned by `which mpiexec`) is cryptic and does not hint a vendor. `mpiexec --version` returns mpiexec version 1.1.8 `mpiexec -h` also does not mention the vendor.
How do I find out what MPI (MPICH, OpenMPI, etc.) is installed? The MPI location (returned by which mpiexec) is cryptic and does not hint at the vendor. mpiexec --version returns `mpiexec version 1.1.8`, and mpiexec -h also does not mention the vendor.
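A quick identification sketch; each command is harmless if the corresponding implementation is not the one installed:

```bash
# MPICH ships a dedicated version tool; Open MPI ships ompi_info
mpichversion 2>/dev/null
ompi_info --version 2>/dev/null

# see which shared libraries the launcher is linked against
ldd "$(readlink -f "$(which mpiexec)")" | grep -Ei 'mpich|mpi'

# ask the package manager who owns the binary (pick the one for your distro)
dpkg -S "$(readlink -f "$(which mpiexec)")" 2>/dev/null
rpm  -qf "$(readlink -f "$(which mpiexec)")" 2>/dev/null
```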
user2052436 (123 rep)
Mar 4, 2023, 08:53 PM
0 votes
1 answer
139 views
mpirun kill parallel processes when lose internet connection ssh
When I'm connected via ssh and are parallel processes running and it loses the internet connection all parallel processes. When I reconnect I find the following message in log file: -------------------------------------------------------------------------- MPI_ABORT was invoked on rank 12 in communi...
When I am connected via ssh and parallel processes are running, losing the internet connection kills all the parallel processes. When I reconnect I find the following message in the log file:

```
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 12 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 15.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
0:Terminate signal was sent, status=: 15
(rank:0 hostname: pid:2953):ARMCI DASSERT fail. ../../ga-5-4/armci/src/common/signaltrap.c:SigTermHandler():477 cond:0
```

Distribution:

```
Description: Ubuntu 16.04.6 LTS
Release:     16.04
Codename:    xenial
```

How can I prevent this crash?
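A minimal sketch of keeping the run alive independently of the SSH session, using tmux (any terminal multiplexer, or nohup, works similarly); the mpirun command line below is hypothetical and stands in for whatever is normally launched:

```bash
sudo apt install tmux                        # Ubuntu 16.04 package

tmux new -s run                              # open a detachable session
mpirun -np 16 ./my_program > run.log 2>&1    # launch the job inside it (hypothetical command line)
# detach with Ctrl-b d; the session keeps running if the SSH connection drops
tmux attach -t run                           # reattach after reconnecting
```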
gvd (111 rep)
Aug 1, 2022, 08:32 PM • Last activity: Aug 1, 2022, 09:44 PM
0 votes
0 answers
193 views
MPI job stopped with ssh disconnected
I have been using MPI in THCP diskless sever. However, when I run jobs with MPI, sometimes the process die because of SSH disconnected. ``` client_loop: send disconnect: Broken pipe ``` There is no error or no error in a single job. Also if I tried to connect with SSH for each CPU after the job die,...
I have been using MPI on a THCP diskless server. However, when I run jobs with MPI, sometimes the process dies because SSH disconnects:

```
client_loop: send disconnect: Broken pipe
```

There is no such error when I run a single job. Also, if I try to connect with SSH to each CPU after the job dies, it connects normally. My source code uses the Intel MKL library, the server is composed of 16 AMD Ryzen 9 5900x CPUs with 48 GB of RAM, and I use MPICH3. How can I solve this problem? Is there a way to find the error log related to these intermittent sshd disconnections? Thank you.
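Two small, independent sketches: keepalives so the SSH session survives short network hiccups, and a detached launch so the MPI job no longer depends on the session at all. The keepalive options are standard OpenSSH client settings; the mpiexec command line is hypothetical:

```bash
# on the machine you ssh *from*: send keepalives so idle connections are not dropped
cat >> ~/.ssh/config <<'EOF'
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 5
EOF

# on the server: start the job detached from the terminal (hypothetical command line)
nohup mpiexec -n 16 ./my_mkl_program > job.log 2>&1 < /dev/null &
disown
```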
JeongYeong Kim (1 rep)
Jun 24, 2022, 01:50 AM
0 votes
1 answer
385 views
calling vim with mpiexec says "Warning: Output is not to a terminal / Warning: Input is not from a terminal"
My question is a bit technical. For specific reason, I need to call vim after mpiexec. Example : mpiexec -n 1 vim mytext.txt But this gives the following warning message: Vim: Warning: Output is not to a terminal Vim: Warning: Input is not from a terminal And then, vim does not behave naturally, my...
My question is a bit technical. For a specific reason, I need to call vim after mpiexec. Example:

```
mpiexec -n 1 vim mytext.txt
```

But this gives the following warning message:

```
Vim: Warning: Output is not to a terminal
Vim: Warning: Input is not from a terminal
```

And then vim does not behave naturally: my input commands are not interpreted well in the editor, and things are not as if I had simply done:

```
vim mytext.txt
```

Any idea on how to correctly redirect input/output from/to my launching terminal in order to be able to use vim after mpiexec?

Actually, the final goal is to debug in parallel using gdb on a specific proc and to edit functions with the vim editor from gdb. Example:

```
mpiexec -s 1 myprog : gdb myprog
```

So I am starting my program "myprog" on two processes, using gdb on the second one (which is proc 1), and redirecting stdin to proc 1 (thanks to -s 1 [see mpiexec -help]). But then, if I want to edit a function with the vim editor in gdb, I will face the same problems:

```
Vim: Warning: Output is not to a terminal
Vim: Warning: Input is not from a terminal
```

A quick solution would be to start an xterm window, but I want to avoid that approach:

```
mpiexec myprog : xterm -e gdb myprog
```

Thanks for your help. Here is my Linux distribution:

```
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
```
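A hedged alternative sketch that avoids feeding vim/gdb their stdin through mpiexec at all: start the job normally and attach gdb to the chosen rank from a second, fully functional terminal ("myprog" is the program name from the question):

```bash
# terminal 1: run both ranks normally
mpiexec -n 2 ./myprog

# terminal 2: attach to the rank you want to debug; a vim started from inside
# this gdb owns a real terminal, so the warnings go away
gdb -p "$(pgrep -n myprog)"
```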
Kiven Jecquas (1 rep)
Feb 11, 2022, 11:03 AM • Last activity: Feb 11, 2022, 01:30 PM
0 votes
1 answer
563 views
Calling mpirun from bash file does not recognize $VAR format...?
While trying to run the following .sh file I get an error from mpirun: **The file: (additional python config omitted for brevity)** NB_MPI_WORKERS=2 SEED=0 mpirun --n ${NB_MPI_WORKERS} python start.py --base_path ~/temp --seed ${SEED} **The error:** Open MPI has detected that a parameter given to a...
While trying to run the following .sh file I get an error from mpirun.

**The file** (additional python config omitted for brevity):

```
NB_MPI_WORKERS=2
SEED=0

mpirun --n ${NB_MPI_WORKERS} python start.py --base_path ~/temp --seed ${SEED}
```

**The error:**

```
Open MPI has detected that a parameter given to a command line
option does not match the expected format:

Option: n
Param: 2

This is frequently caused by omitting to provide the parameter
to an option that requires one. Please check the command line and try again.
```

I have confirmed that simply replacing **${NB_MPI_WORKERS}** with **2** does work, so I'm a bit confused about where the error is, especially since ${SEED} is working. Can anyone clarify, please? Is it actually an issue of formatting, or maybe type?

Versions:

- Linux Mint 20.2
- Open MPI 4.1.2
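A sketch of one common explanation (an assumption, since the script itself looks fine): if the .sh file has Windows line endings, `${NB_MPI_WORKERS}` expands to `2` followed by a carriage return, which Open MPI rejects as a malformed parameter, while `${SEED}` at the end of the line may appear to survive. `run.sh` is a placeholder for the actual script name:

```bash
# show invisible characters: CRLF endings appear as "^M$" at the end of each line
cat -A run.sh | head

# strip carriage returns in place (dos2unix run.sh does the same)
sed -i 's/\r$//' run.sh
```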
Mandias (43 rep)
Jan 17, 2022, 04:07 AM • Last activity: Jan 18, 2022, 04:12 PM
0 votes
0 answers
2670 views
Problem with intel MPI
I just installed intel oneApi and the intel MPI was included in it. I can now compile my fortran code with `mpiifort`, but when I try to run it with intel `mpirun`, with just `` it throws this error. ```bash libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ libi40iw-i40iw_ucreate_qp: failed to...
I just installed Intel oneAPI, and Intel MPI was included in it. I can now compile my Fortran code with mpiifort, but when I try to run it with Intel mpirun, with just `` it throws this error.
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
[1628187174.213489] [localhost:38060:0]         select.c:406  UCX  ERROR no active messages transport to : self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1628187174.213511] [localhost:38061:0]         select.c:406  UCX  ERROR no active messages transport to : self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1169)..............:
MPIDI_OFI_mpi_init_hook(1909): OFI get address vector map failed
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1169)..............:
MPIDI_OFI_mpi_init_hook(1909): OFI get address vector map failed
Now, as suggested here https://github.com/openucx/ucx/issues/4742#issuecomment-584059909 , I set the environment variable with export UCX_TLS=ud,sm,self. Now the executable runs, but it still throws this error:
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
[1628186868.761953] [localhost:35174:0]            sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1628186868.793554] [localhost:35173:0]            sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
output for ipcs -l
------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767
I don't understand what the problem is or how to fix it. Can anybody help me out here?
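A hedged sketch of forcing Intel MPI onto plainer transports, on the assumption that the i40iw/RDMA and shared-memory paths are what is failing on this node; `FI_PROVIDER`, `I_MPI_FABRICS` and `I_MPI_DEBUG` are documented Intel MPI / libfabric environment variables:

```bash
# ask the runtime to report which fabric/provider it picks
export I_MPI_DEBUG=5

# prefer shared memory plus the plain TCP libfabric provider instead of RDMA
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=tcp

mpirun -n 2 ./a.out      # hypothetical executable name
```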
Eular (253 rep)
Aug 5, 2021, 06:34 PM
0 votes
0 answers
142 views
`mpirun -n 2 ./a.x`, the two processes was stuck by epoll_wait, why?
I run a mpi progrem with `mpirun -n 2 ./a.x`. However, these two processes was stuck. And it is always get stuck and seldom(actually only once) pass through. I find follow information by `strace` and `lsof`, and what I know is that these two processes was waiting for reading or writing(?) a same fil...
I run an MPI program with mpirun -n 2 ./a.x. However, the two processes get stuck. It almost always gets stuck and has only passed through once. I found the following information with strace and lsof, and what I know is that the two processes are waiting to read from or write to(?) the same file, but it is never ready. So, how can I find out what that file is and why it is never ready to be accessed? If you have any thoughts or need anything else, please just tell me, thank you!

```
//use strace -p 31352
epoll_wait(18, [], 100, 0) = 0
epoll_wait(18, [], 100, 0) = 0
epoll_wait(18, [], 100, 0) = 0

//use strace -p 31351
epoll_wait(19, [], 100, 0) = 0
epoll_wait(19, [], 100, 0) = 0
epoll_wait(19, [], 100, 0) = 0

//use lsof -p 31352
pfci.x 31352 jslo 18u a_inode 0,13 0 11815 [eventpoll]

//use lsof -p 31351
pfci.x 31351 jslo 19u a_inode 0,13 0 11815 [eventpoll]
```
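A small inspection sketch using the PIDs from the question: the kernel exposes which file descriptors an epoll instance is watching, and a backtrace shows where inside the MPI library the wait happens:

```bash
# list the target fds registered in the epoll instance behind fd 18
cat /proc/31352/fdinfo/18

# map those target fds back to files/sockets
ls -l /proc/31352/fd/

# grab a stack trace of the stuck process without stopping it for long
gdb -p 31352 -batch -ex 'thread apply all bt'
```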
Runfeng Jin (1 rep)
Jul 9, 2021, 01:01 PM
0 votes
0 answers
142 views
Slurm 18 allocates MPI jobs to the same CPUs on the node
For some reason when launching an mpi job with SLURM on CentOS8 cluster, slurm ties mpi processes to CPUs always starting from CPU0. Say there are 128 CPU cores on a compute node. I launch mpi job asking for 64 CPUs on that node. Fine, it gets allocated on first 64 cores (1st socket) and runs there...
For some reason, when launching an MPI job with SLURM on a CentOS 8 cluster, SLURM ties MPI processes to CPUs always starting from CPU0. Say there are 128 CPU cores on a compute node. I launch an MPI job asking for 64 CPUs on that node. Fine, it gets allocated on the first 64 cores (1st socket) and runs there fine. Now if I submit another 64-CPU MPI job to the same node, SLURM places it again on the 1st socket, so CPUs 0-63 are used by both jobs, but CPUs 64-127 of the 2nd socket are not used at all. I played with various MPI parameters to no avail. The only way I was able to assign 2 jobs to different sockets is by using rank files with Open MPI. But that should not be necessary if SLURM works correctly. Consumable resources in SLURM are CR_Core. TaskPlugin=task/affinity. If I run the same 2 x MPI code on the same node without SLURM, the same Open MPI allocates CPUs correctly. What can make SLURM behave in such a bizarre way?
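A small verification/workaround sketch, assuming Slurm really does hand each job its own cores but `mpirun` re-binds its ranks starting from core 0 regardless; `./my_mpi_app` and `<jobid>` are placeholders:

```bash
# show exactly which CPU IDs Slurm assigned to each running job
scontrol show job -d <jobid> | grep -i cpu_ids

# workaround 1: let Slurm do the binding by launching with srun instead of mpirun
srun --ntasks=64 ./my_mpi_app

# workaround 2: keep mpirun but stop Open MPI from choosing its own binding
mpirun -np 64 --bind-to none ./my_mpi_app
```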
Alex P (81 rep)
Apr 14, 2021, 08:03 PM • Last activity: Apr 14, 2021, 08:37 PM
2 votes
1 answer
364 views
Very new to slurm. How to get slurm to run multiple core jobs on my linux cluster?
I've been trying to move some existing processes to a revamped linux cluster that now runs on slurm. I thought I have it done, but my problem now is trying to get multiple cores to run. Here is my submission script. ``` #!/bin/bash # #SBATCH --job-name=test_mpi #SBATCH --output=res_mpi.txt # #SBATCH...
I've been trying to move some existing processes to a revamped Linux cluster that now runs on Slurm. I thought I had it done, but my problem now is getting multiple cores to run. Here is my submission script:
#!/bin/bash
   #
   #SBATCH --job-name=test_mpi
   #SBATCH --output=res_mpi.txt
   #
   #SBATCH -n 4
   #SBATCH --time=10:00
   srun mkdir -p /tmp/tedhyu/new
  srun cp Ru13.in /tmp/tedhyu/new/lcao.in
  srun cp ~tedhyu/atom_pbe/* /tmp/tedhyu/new
  srun cd /tmp/tedhyu/new
  srun -N 1  -n 4 --chdir=/tmp/tedhyu/new  mpiexec ~tedhyu/bin/origin1_centos6.4_mpich2_quest_265c.x
When I "qstat -n" it only shows one core: Job id Username Queue Name SessID NDS TSK Memory Time Use S Time -------------------- -------- -------- -------------------- ------ ----- ----- ------ ----- - ----- 11778 tedhyu atom test_mpi -- 1 4 -- 00:10 C 00:00
node3-5/4 Here is the first few lines of my output that shows only 1 core is running:
srun: error: node3-5: tasks 0-3: Exited with exit code 1
     MPINFO::: Global Communicator        :::
     MPINFO::: Global Context = ****      :::
     MPINFO::: Global Size =       1      :::
     MPINFO::: Global Root =       0      :::
     MPINFO::: Global Rank =       0      :::
     DEV: VDW development version
Global Size should equal 4 If anyone can point me in the right direction... Thanks!!!
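A hedged rewrite sketch of the batch script: only the MPI launch itself needs a parallel launcher, while prefixing every shell command with `srun` starts it once per task (and `srun cd` cannot change the script's working directory at all). Paths and the binary name are copied from the question:

```bash
#!/bin/bash
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#SBATCH -n 4
#SBATCH --time=10:00

# plain shell commands run once, in the batch step
mkdir -p /tmp/tedhyu/new
cp Ru13.in /tmp/tedhyu/new/lcao.in
cp ~tedhyu/atom_pbe/* /tmp/tedhyu/new
cd /tmp/tedhyu/new

# a single MPI launch for all 4 tasks; with an MPICH2 binary, mpiexec inside
# the allocation (or plain 'srun ./binary' if it was built with PMI) is enough
mpiexec -n 4 ~tedhyu/bin/origin1_centos6.4_mpich2_quest_265c.x
```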
ted y (21 rep)
Jan 24, 2021, 05:10 AM • Last activity: Feb 14, 2021, 10:06 AM
0 votes
2 answers
121 views
how to manage %files section in spec without environment variables
I am creating a package for a software using `MPICH`. The binaries must be installed on the [mpich directory](https://docs.fedoraproject.org/en-US/packaging-guidelines/MPI/). This directory depends on the mpich version (`/usr/lib64/mpich-3.2/` or `/usr/lib64/mpich/`). For the `%install` section, I m...
I am creating a package for software that uses MPICH. The binaries must be installed in the [mpich directory](https://docs.fedoraproject.org/en-US/packaging-guidelines/MPI/) . This directory depends on the mpich version (/usr/lib64/mpich-3.2/ or /usr/lib64/mpich/). For the %install section, I managed with the variables MPI_BIN and MPI_LIB set by the macro. But these variables are not expanded in the %files section. How can I list the binaries in the %files section? I have read an [old post](https://unix.stackexchange.com/questions/6601/spec-files-attribute-and-shell-variables) but the solution doesn't work.
Bruno Guerraz (3 rep)
Jan 8, 2021, 10:22 AM • Last activity: Jan 8, 2021, 08:19 PM
2 votes
1 answer
7388 views
nohup ends when I close the SSH terminal and doesn't show up on ps/job when running
I have an AWS EC2 Ubuntu 20.04 instance, accessed via puTTY SSH. Running a Python process via [OpenMPI][1] through the following command **on non root**: > nohup mpirun python3 job.py When the shell is open it runs normally. Even though using `ps` and `job` doesn't show the process I can see via the...
I have an AWS EC2 Ubuntu 20.04 instance, accessed via puTTY SSH. I am running a Python process via OpenMPI through the following command **as non-root**:

> nohup mpirun python3 job.py

While the shell is open it runs normally. Even though ps and jobs don't show the process, I can see via nohup.out and the changing file system that it's working even hours after it starts. When I close the shell, the nohup process ends.

I should probably also note that when I run the nohup command above I cannot enter any more input (the cursor becomes blank), so when I ran ps and jobs above I had to open a second shell. I have never used nohup before and assume this is abnormal. I investigated the nohup.out file for errors and found nothing written.

So, TLDR:

1. Closing the SSH session (puTTY) ends the nohup process
2. ps and jobs don't list the nohup process even though I know it's running

*I can answer any more questions if needed*
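A minimal sketch of a launch that fully detaches from the puTTY session; backgrounding the job, redirecting stdin, and disowning it are the usual missing pieces when a nohup'd process still dies with the shell:

```bash
# run in the background, detach stdin, capture output, and drop shell job control
nohup mpirun python3 job.py > mpirun.log 2>&1 < /dev/null &
disown

# the process should now be visible by name even from a fresh shell
pgrep -af mpirun
```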
Sad CRUD Developer (123 rep)
Nov 25, 2020, 08:28 PM • Last activity: Nov 25, 2020, 09:43 PM
0 votes
1 answer
386 views
Debian : MPI code - Intel compiler - [Hardware Error]: Unified Memory Controller Error: DRAM ECC error
When running an executable compiled with `intel mpiicc`, I get, after 30 minutes of running, the following errors : kernel:[29585.573874] [Hardware Error]: Corrected error, no action required. Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC...
When running an executable compiled with intel mpiicc, I get, after 30 minutes of running, the following errors : kernel:[29585.573874] [Hardware Error]: Corrected error, no action required. Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573881] [Hardware Error]: CPU:2 (17:31:0) MC18_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2041000000011b Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573887] [Hardware Error]: Error Addr: 0x0000000a6c12d280 Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573888] [Hardware Error]: IPID: 0x0000009600550f00, Syndrome: 0xc54c00040a800611 Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573891] [Hardware Error]: Unified Memory Controller Extended Error Code: 0 Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573893] [Hardware Error]: Unified Memory Controller Error: DRAM ECC error. Message from syslogd@pablo at Nov 8 09:53:25 ... kernel:[29585.573895] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD I am working on a AMD EPYC 7702P 64-Core Processor with 1TB of RAM and a Debian OS : Linux pablo 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux From what I have seen, I did the command : dmidecode -t memory that gives : # dmidecode 3.2 Getting SMBIOS data from sysfs. SMBIOS 3.2.0 present. Handle 0x0023, DMI type 16, 23 bytes Physical Memory Array Location: System Board Or Motherboard Use: System Memory Error Correction Type: Multi-bit ECC Maximum Capacity: 2 TB Error Information Handle: 0x0022 Number Of Devices: 8 Handle 0x002B, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x002A Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL A Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F701 Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x002E, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x002D Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL B Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F3ED Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x0031, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x0030 Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank 
Locator: P0 CHANNEL C Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F4BA Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x0034, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x0033 Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL D Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F396 Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x0037, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x0036 Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL E Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F67D Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x003A, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x0039 Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL F Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F394 Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x003D, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x003C Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL G Type: DDR4 Type 
Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F48A Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None Handle 0x0040, DMI type 17, 84 bytes Memory Device Array Handle: 0x0023 Error Information Handle: 0x003F Total Width: 72 bits Data Width: 64 bits Size: 128 GB Form Factor: DIMM Set: None Locator: DIMM 0 Bank Locator: P0 CHANNEL H Type: DDR4 Type Detail: Synchronous Registered (Buffered) LRDIMM Speed: 2933 MT/s Manufacturer: Samsung Serial Number: 03C6F3FB Asset Tag: Not Specified Part Number: M386AAG40MMB-CVF Rank: 4 Configured Memory Speed: 2933 MT/s Minimum Voltage: 1.2 V Maximum Voltage: 1.2 V Configured Voltage: 1.2 V Memory Technology: DRAM Memory Operating Mode Capability: Volatile memory Firmware Version: Unknown Module Manufacturer ID: Bank 1, Hex 0xCE Module Product ID: Unknown Memory Subsystem Controller Manufacturer ID: Unknown Memory Subsystem Controller Product ID: Unknown Non-Volatile Size: None Volatile Size: 128 kB Cache Size: None Logical Size: None I don't know where these DRAM ECC error come from, Maybe there are incompatibilies between my motherboard, CPU model or bad version of Intel compiler SDK ? These errors appears roughly every 5 minutes during the execution. I am using the intel compilers version compilers_and_libraries_2020.1.217. **I have also the same error messages when I compile with MPI from official Open-MPI Debian 10 repository version.** I should modify maybe an option in the BIOS but I am not sure. If someone had an idea to solve this issue, this would be fine to tell it.
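A short diagnostic sketch, on the assumption that the corrected DRAM ECC errors come from the memory hardware itself (DIMM, seating, or memory controller) rather than from the compiler or MPI stack; rasdaemon and edac-utils are standard Debian packages for reading the kernel's EDAC counters:

```bash
sudo apt install rasdaemon edac-utils

# per-memory-controller corrected/uncorrected error counters
edac-util -v

# summary of logged memory errors, which usually names the DIMM label/channel
sudo ras-mc-ctl --summary
sudo ras-mc-ctl --errors
```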
youpilat13 (1 rep)
Nov 8, 2020, 03:48 PM • Last activity: Nov 9, 2020, 01:53 PM
0 votes
1 answer
1805 views
Install multiple MPI libraries and switch between them on Ubuntu
For educational purposes I'd like to set up several MPI libraries, e.g. OpenMPI, MPICH, and Intel MPI along with different backend compilers (gcc, clang, icc) on the same machine running Ubuntu 18.04.4 TLS. What is the best way to do this so as to be able to switch between them easily when I need to...
For educational purposes I'd like to set up several MPI libraries, e.g. OpenMPI, MPICH, and Intel MPI, along with different backend compilers (gcc, clang, icc), on the same machine running Ubuntu 18.04.4 LTS. What is the best way to do this so as to be able to switch between them easily when I need to see how a particular code works with one MPI library/compiler or another? So far I have only managed to select a compiler via mpicc's -cc command line argument (MPICH) or the OMPI_CC environment variable (OpenMPI). But when I install OpenMPI after MPICH, for example, mpicc from MPICH seems to get replaced with the one from OpenMPI and I basically lose access to MPICH:
$ sudo apt install mpich
$ mpicc -show
gcc -Wl,-Bsymbolic-functions -Wl,-z,relro -I/usr/include/mpich -L/usr/lib/x86_64-linux-gnu -lmpich

$ sudo apt install libopenmpi-dev
$ mpicc -show
gcc -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/include -I/usr/lib/x86_64-linux-gnu/openmpi/include -pthread -L/usr//lib -L/usr/lib/x86_64-linux-gnu/openmpi/lib -lmpi
Is it possible to have both and choose which one I currently want to use?
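A hedged sketch, assuming both libmpich-dev and libopenmpi-dev are installed on Ubuntu 18.04: Debian-based systems register the MPI wrappers and launchers through update-alternatives, so both stacks can coexist and be switched machine-wide (per shell, prepending the preferred wrapper directory to PATH works too):

```bash
# see which implementations are registered and which one is currently active
update-alternatives --display mpi
update-alternatives --display mpirun

# interactively pick MPICH or Open MPI for the compiler wrappers and for mpirun
sudo update-alternatives --config mpi
sudo update-alternatives --config mpirun
```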
mentalmushroom (103 rep)
Jul 29, 2020, 07:01 AM • Last activity: Jul 29, 2020, 07:16 AM