
#1 2011-10-24 03:28:38

JMB365
Member
Registered: 2008-01-19
Posts: 781

[Solved] ASTK: Having trouble running an MPI job

Hello,

I have set up a small cluster of 3 PCs, on each of which I compiled Code_Aster Ver_11.0.27+, and I have also verified that mpiexec works correctly, as shown below:

ssh -X ubuntu31
echo "ubuntu31" > ~/mpd.hosts
echo "ubuntu32"  >> ~/mpd.hosts
echo "ubuntu33"  >> ~/mpd.hosts
scp ~/mpd.hosts ubuntu32:/home/user
scp ~/mpd.hosts ubuntu33:/home/user

mpdboot -n 3 -f ~/mpd.hosts
mpdtrace -l
# ubuntu31_53981 (192.168.1.231)
# ubuntu32_43879 (192.168.1.232)
# ubuntu33_56404 (192.168.1.233)

mpiexec -n 3 uname -n
# ubuntu33
# ubuntu31
# ubuntu32

Also, the contents of ~/mpi_hostfile are:

cat ~/mpi_hostfile
# ubuntu31
# ubuntu32
# ubuntu33

The relevant line of the ~/.astkrc/prefs file is "mpi_hostfile : /home/user/mpi_hostfile".  However, when I submit a job through ASTK (on ubuntu33) using the test case perf010e with ncpus=1, mpi_nbcpu=3 and mpi_nbnoeud=1, I get:

<INFO> Command line 1 :
<INFO> ./asteru_mpi Python/Execution/E_SUPERV.py -eficas_path ./Python -commandes fort.1 -rep none  -num_job 9754-ubuntu33 -mode interactif -rep_outils /opt/aster/outils -rep_mat /opt/aster/NEW11.0_mpi/materiau -rep_dex /opt/aster/NEW11.0_mpi/datg -suivi_batch -memjeveux 500.0 -tpmax 900
mpiexec: unable to start all procs; may have invalid machine names
    remaining specified hosts:
        192.168.1.231 (ubuntu31.mshome)
        192.168.1.232 (ubuntu32.mshome)
EXIT_COMMAND_9827_00000017=0
<INFO> Code_Aster run ended, diagnostic : NO_RESU_FILE
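
For completeness, here is the setup in summary form (only the mpi_hostfile line is an actual keyword from the prefs file; the other three values are the options set in ASTK for this job):

# ~/.astkrc/prefs
mpi_hostfile : /home/user/mpi_hostfile

# ASTK run options for the perf010e test case
ncpus       : 1
mpi_nbcpu   : 3
mpi_nbnoeud : 1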

What am I doing wrong?  Would somebody kindly assist please?  Thanks!

Regards, JMB

PS: All output lines have been prefixed with a '#' by me.

Last edited by JMB365 (2011-10-26 17:27:53)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#2 2011-10-25 12:54:42

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Hello anybody,

I even tried putting just these entries in /etc/hosts:
192.168.1.231 ubuntu31
192.168.1.232 ubuntu32
192.168.1.233 ubuntu33

and yet I get:

mpiexec: unable to start all procs; may have invalid machine names
    remaining specified hosts:
        192.168.1.231 (ubuntu31)
        192.168.1.232 (ubuntu32)
EXIT_COMMAND_3231_00000017=0
<INFO> Code_Aster run ended, diagnostic : NO_RESU_FILE
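
A quick way to double-check that every node resolves the other two under these names is something along these lines (a sketch; it assumes the passwordless ssh that is already set up here):

for h in ubuntu31 ubuntu32 ubuntu33; do
    echo "== $h =="
    # how this node resolves the three names; each should map to 192.168.1.231-233
    ssh $h 'getent hosts ubuntu31 ubuntu32 ubuntu33'
done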

Can somebody assist me please? Thanks  -JMB

Last edited by JMB365 (2011-10-25 13:16:38)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#3 2011-10-25 16:27:09

Thomas DE SOZA
Guru
From: EDF
Registered: 2007-11-23
Posts: 3,066

Re: [Solved] ASTK: Having trouble running an MPI job

Try using mpirun instead of mpiexec.
And before trying to run inside ASTK/as_run, try to run a simple parallel program to find all the necessary arguments for the command line.
Then update your as_run config file to reflect this.

With MPICH it is likely you will need to launch an MPD ring (mpdboot) before running mpirun and close it at the end (mpdallexit). ASTK provides the means to do that, or you can launch the MPD manually beforehand.
If you still have trouble launching it, be advised that other implementations (such as OpenMPI) do not require any preliminary step (but of course still require a correct SSH key configuration).
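
For example, with MPICH2/MPD the relevant entries in the as_run configuration could look roughly like this (a sketch only; check the exact keyword names against the comments shipped in /opt/aster/etc/codeaster/asrun):

# command line built by as_run for each MPI execution
mpirun_cmd : mpirun -machinefile %(mpi_hostfile)s -np %(mpi_nbcpu)s %(program)s
# optional commands run before/after the execution (boot and close the MPD ring)
mpi_ini : mpdboot --totalnum=%(mpi_nbnoeud)s --file=%(mpi_hostfile)s
mpi_end : mpdallexit
# default hostfile (can be overridden in ~/.astkrc/prefs)
mpi_hostfile : /home/user/mpi_hostfile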

TdS


#4 2011-10-25 18:48:06

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Thomas DE SOZA wrote:

Try using mpirun instead of mpiexec.

Thanks for the suggestion.  I tried this by changing /opt/aster/etc/codeaster/asrun to use:

mpirun_cmd : /usr/local/bin/mpirun -machinefile %(mpi_hostfile)s -wdir %(wrkdir)s -n %(mpi_nbcpu)s %(program)s

as well as the same line in ~/.astkrc/prefs.  Yet I get the same error message:
"mpiexec: unable to start all procs; may have invalid machine names"
Apparently ASTK is still using mpiexec.  Also, I do not believe your suggestion would make any difference, since:

which mpirun mpiexec | xargs ls -al
lrwxrwxrwx 1 root root 10 2011-10-18 20:28 /usr/local/bin/mpiexec -> mpiexec.py
lrwxrwxrwx 1 root root  7 2011-10-18 20:28 /usr/local/bin/mpirun -> mpiexec

i.e. both link to the same mpiexec.py.

Could you give me a hint as to how to "run a simple parallel program to find all the necessary arguments to the command line"?  So far I have tried:

mpirun -machinefile ~/mpi_hostfile -wdir ./ -n 3 uname -n
# ubuntu32
# ubuntu33
# ubuntu31

which shows that mpirun works fine outside of ASTK.  Note also that the syntax I used matches the mpirun_cmd defined in the asrun and prefs files.  I am puzzled by this problem!  Thanks.

Regards, JMB

PS: I have done quite a bit of reading now on setting up mpich2 and testing it.  My suspicion is that somehow ASTK (or asteru_mpi) is not calling mpiexec correctly, since the message "mpiexec: unable to start all procs; may have invalid machine names" originates from line 403 of the source file /usr/local/bin/mpiexec.py.  Therefore, since "mpiexec -machinefile ~/mpi_hostfile -wdir ./ -n 3 uname -n" works correctly on this cluster, I am suspicious of Code_Aster's asteru_mpi.  Somebody correct me if I am wrong.
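
PPS: To see which launcher line as_run actually picks up, I suppose the simplest check is to grep every place it can be defined (a sketch; the file names are the ones from my setup):

# compare the definitions and make sure the one being read is the one that was edited
grep -n "mpirun_cmd" /opt/aster/etc/codeaster/asrun ~/.astkrc/prefs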

Last edited by JMB365 (2011-10-25 20:45:27)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#5 2011-10-25 19:00:25

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Hello,

I have a basic question.  I compiled asteru_mpi on each individual PC separately (at a time when the other two were not yet available), which means that I used:

echo `uname -n`  > /opt/aster/etc/codeaster/mpi_hostfile
echo `uname -n`  > /opt/aster/etc/codeaster/aster-mpihosts

on each individual PC in the cluster before compiling asteru/d_mpi.

Do I need to recompile asteru_mpi now that I have all 3 PCs in the cluster, after having done:

echo "ubuntu31"  >  /opt/aster/etc/codeaster/mpi_hostfile
echo "ubuntu32"  >> /opt/aster/etc/codeaster/mpi_hostfile
echo "ubuntu33"  >> /opt/aster/etc/codeaster/mpi_hostfile
cp /opt/aster/etc/codeaster/mpi_hostfile  /opt/aster/etc/codeaster/aster-mpihosts

Does that mean asteru/d_mpi need(s) to be recompiled on each PC (of a cluster) whenever a new one is added to the ring?  Could this be the reason for my troubles?  As you can see I am still baffled by my problem...  Thanks!

Regards, JMB


SalomeMeca 2021
Ubuntu 20.04, 22.04


#6 2011-10-26 00:45:46

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Hello developers,

I have done some snooping (on my cluster node ubuntu33) and I am now more convinced that the executables asteru/d_mpi (and asteru/d) have the hostname hardcoded (i.e. compiled) into them, because:

grep -R "ubuntu33" /opt/aster/NEW11.0_mpi/*
  Binary file /opt/aster/NEW11.0_mpi/asterd matches
  Binary file /opt/aster/NEW11.0_mpi/asterd_mpi matches
  Binary file /opt/aster/NEW11.0_mpi/asteru matches
  Binary file /opt/aster/NEW11.0_mpi/asteru_mpi matches
  /opt/aster/NEW11.0_mpi/config.txt.orig:ID_PERF        | id      | -     | ubuntu33

grep -R "ubuntu3[1~2]" /opt/aster/NEW11.0_mpi/*

Note that the latter grep yields nothing.  Similar results (with the corresponding names being found) are obtained on the other two nodes, "ubuntu31" and "ubuntu32".  So are the node names hardcoded into the executables, and if so, why?  Is this the cause of my problems?

Furthermore, when I mpdboot these 3 nodes outside ASTK and then submit a job via ASTK on ubuntu33, I can see after a short while that the mpd processes on the other two nodes disappear (killed by the ASTK job); the solver then complains that the other two nodes are absent (Error: "...may have invalid machine names") and exits abnormally, after which the mpd on the remaining host node also disappears.

Would somebody please provide details on how ASTK submits MPI jobs, or point me to a document that explains the steps, configuration and workings of ASTK with MPI?  I have read the document U_1.04.00 thoroughly too; it does not explain how to configure, submit or start an MPI job.  Thanks!

Regards, JMB

Last edited by JMB365 (2011-10-26 02:02:01)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#7 2011-10-26 11:54:25

Thomas DE SOZA
Guru
From: EDF
Registered: 2007-11-23
Posts: 3,066

Re: [Solved] ASTK: Having trouble running an MPI job

JMB365 wrote:

So are the node names hardcoded into the executables and why ?!!!  Is this the cause of my problems?

Clearly not. Something is wrong with your setup. Can you correctly launch a real MPI program across the 3 nodes or not?
Just use pi.c; I think it is included in the MPICH distribution (it's just a small program that computes pi).
ASTK only does what you tell it to do (e.g. possibly run a begin command, then launch the job with the run command, and finally close the job with the end command if needed).

Also, try removing the working-directory argument (-wdir); it is not needed.
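
In other words, adapting the line you posted, something like this (a sketch):

mpirun_cmd : /usr/local/bin/mpirun -machinefile %(mpi_hostfile)s -n %(mpi_nbcpu)s %(program)s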

TdS


#8 2011-10-26 13:20:04

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Thomas DE SOZA wrote:

Can you correctly launch a real MPI program across the 3 nodes or not?
Just use pi.c; I think it is included in the MPICH distribution (it's just a small program that computes pi).

Thanks for the reply.  I had tested Pi.c prior to my trials with Code_Aster and it worked just fine.  Just to double-check, I ran it again today on the three nodes, as shown below:

# Compile Pi.c on ubuntu33 and copy it to the other 2 nodes
ssh -X ubuntu33
cp ~/MyFiles/Download/CodeAster-Salome/Pi.c /tmp
cd /tmp
mpicc -o Pi Pi.c
ls -al Pi
# -rwxr-xr-x 1 ks ks 874780 2011-10-26 08:01 Pi
scp Pi ubuntu32:/tmp
scp Pi ubuntu31:/tmp

# Login into ubuntu31 and mpdboot 3 nodes
#  (just to try it differently this time rather than mpdboot'ing from ubuntu33)
ssh -X ubuntu31
cat ~/mpi_hostfile 
# ubuntu31
# ubuntu32
# ubuntu33
mpdboot -n 3 -f ~/mpi_hostfile
mpdtrace -l
# ubuntu31_50174 (192.168.1.231)
# ubuntu32_42007 (192.168.1.232)
# ubuntu33_42630 (192.168.1.233)

# Login into each of the 3 nodes and run Pi
ssh -X ubuntu33
cd /tmp
mpirun -np 3 ./Pi
# Input number of intervals:
# 10000
# 2: pi =         1.047064219529705
# 0: pi =         1.047397549530095
# 1: pi =         1.047130882863323
# pi =         3.141592651923124

ssh -X ubuntu32
cd /tmp
mpirun -np 3 ./Pi
# Input number of intervals:
# 10000
# 2: pi =         1.047064219529705
# 0: pi =         1.047397549530095
# pi =         3.141592651923124
# 1: pi =         1.047130882863323

ssh -X ubuntu31
cd /tmp
mpirun -np 3 ./Pi
# Input number of intervals:
# 10000
# 2: pi =         1.047064219529705
# 0: pi =         1.047397549530095
# 1: pi =         1.047130882863323
# pi =         3.141592651923124

(All output lines were prefixed with a '#' by me for this posting.)  Any other ideas for troubleshooting this?  Thanks.

Regards -JMB

Last edited by JMB365 (2011-10-26 13:20:42)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#9 2011-10-26 13:40:02

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Hello,

Furthermore, I tested the cpi.c that is included in the mpich2-1.3.2 examples:

ssh -X ubuntu33
cp /opt/mpich2-1.3.2/examples/cpi.c /tmp
cd /tmp
mpicc -o cpi cpi.c
scp cpi ubuntu31:/tmp
scp cpi ubuntu32:/tmp
mpdtrace -l
# ubuntu33_42630 (192.168.1.233)
# ubuntu31_50174 (192.168.1.231)
# ubuntu32_42007 (192.168.1.232)
mpirun -np 3 ./cpi
# Process 0 of 3 is on ubuntu33
# Process 1 of 3 is on ubuntu31
# Process 2 of 3 is on ubuntu32
# pi is approximately 3.1415926544231318, Error is 0.0000000008333387
# wall clock time = 0.002854

Similar results were seen when it was run on the other nodes.

Regards, JMB

Last edited by JMB365 (2011-10-26 13:45:03)


SalomeMeca 2021
Ubuntu 20.04, 22.04


#10 2011-10-26 17:29:50

JMB365
Member
Registered: 2008-01-19
Posts: 781

Re: [Solved] ASTK: Having trouble running an MPI job

Hello TdS,

I found the cause of my problem: the prefs file was incorrectly configured for mpdboot for the flavor of mpich2-1.3.2 that I am running.  It was a mix-up in the usage of the mpi_nbcpu / mpi_nbnoeud variables.  Thanks for all your assistance!
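
For anyone who hits the same thing: the node count has to go to mpdboot while the process count goes to mpirun, i.e. roughly (a sketch only; check the keyword names against your own asrun/prefs file):

# boot the MPD ring on the nodes (mpi_nbnoeud)
mpi_ini : mpdboot --totalnum=%(mpi_nbnoeud)s --file=%(mpi_hostfile)s
# launch the MPI processes (mpi_nbcpu)
mpirun_cmd : mpirun -machinefile %(mpi_hostfile)s -np %(mpi_nbcpu)s %(program)s
# close the ring when the job ends
mpi_end : mpdallexit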

Regards, JMB

Last edited by JMB365 (2011-10-26 18:14:47)


SalomeMeca 2021
Ubuntu 20.04, 22.04
