To run on a network of workstations, you must specify in some way the host names of the machines that you want to run on. This can be done in several ways, which are described in detail in the Users' Guide. We give a shorter version here.
The easiest way is to edit the file mpich/util/machines/machines.xxxx so that it contains the names of machines of architecture xxxx, one per line. The xxxx matches the arch given when mpich was configured. Then whenever mpirun is executed, the required number of hosts will be selected from this file for the run. (There is no fancy scheduling; the hosts are selected starting from the top.) To run all your MPI processes on a single workstation, just make all the lines in the file the same.
A sample machines.solaris file might look like:

    mercury
    venus
    earth
    mars
    earth
    mars

The names should be provided in the same format as is output by the hostname command. For example, if the result of hostname on earth is earth.my.edu (and similarly for the other names), then the machines file should be

    mercury.my.edu
    venus.my.edu
    earth.my.edu
    mars.my.edu
    earth.my.edu
    mars.my.edu

To run the test suite in examples/test, you need a machines file with at least five lines in it. This is for homogeneous networks. Heterogeneous networks are discussed in the Users' Guide.
For nodes that contain multiple processors, indicate the number of processors
by following the name with a colon and the number of processors. For example,
if mars in the previous example had two processors, then the machines
file should be
    mercury
    venus
    earth
    mars:2
    earth
    mars:2
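As a quick sanity check before a run, the processor counts in a machines file can be totalled. The following is only a sketch: count_slots is a hypothetical helper, not part of mpich, and it assumes that a bare host name stands for one processor and that name:n stands for n. It does not check how mpirun itself treats repeated hosts.

```shell
#!/bin/sh
# count_slots FILE: sum the processors listed in a machines file.
# Hypothetical helper (not shipped with mpich). A line "host" counts
# as 1 and a line "host:n" counts as n; blank lines are ignored.
count_slots() {
    awk -F: '
        /^[ \t]*$/ { next }                 # skip blank lines
        { total += (NF > 1 ? $2 : 1) }      # "host:n" adds n, "host" adds 1
        END { print total + 0 }
    ' "$1"
}

# Example: the six-line file with mars:2, written to a scratch file.
sample=${TMPDIR:-/tmp}/machines.sample.$$
cat > "$sample" <<'EOF'
mercury
venus
earth
mars:2
earth
mars:2
EOF
count_slots "$sample"    # prints 8
rm -f "$sample"
```

Comparing this total with the -np value you intend to give mpirun shows at a glance whether the file lists enough processors.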
Automounters are programs that dynamically make file systems available when needed. While this is very convenient, many automounters are unable to recognize the file system names that the automounter itself generates. For example, if a user accesses a file /home/me, the automounter may discover that it needs to mount this file system, and does so in /tmp_mnt/home/me. Unfortunately, if the automounter on a different system is presented with /tmp_mnt/home/me instead of /home/me, it may not be able to find the file system. This would not be such a problem if commands like pwd returned /home/me instead of /tmp_mnt/home/me; unfortunately, it is all too easy to get a path that the automounter should, but does not, recognize.
To deal with this problem, configure allows you to specify a filter
program when you configure with the option -automountfix=PROGRAM, where
PROGRAM is a filter that reads a file path from standard input, makes
any changes necessary, and writes the output to standard output.
mpirun uses this program to help it find necessary files.
By default, the value of PROGRAM is

    sed -e s@/tmp_mnt/@/@g

This uses the sed command to strip the string /tmp_mnt from the file name. Simple sed scripts like this may be used as long as they do not involve quotes (single or double) or use % (these will interfere with the shell commands in configure that do the replacements). If you need more complex processing, use a separate shell script or program.
As another example, some systems will generate paths like

    /a/thishost/root/home/username/....

which are valid only on the machine thishost, but also have paths of the form

    /u/home/username/....

that are valid everywhere. For this case, the configure option

    -automountfix='sed -e s@/a/.\*/home@/u/home@g'

will make sure that mpirun gets the proper filename.
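Before handing a filter to configure, it can be worth feeding it a sample path by hand. The sketch below wraps the two sed filters from the text in shell functions; the function names and the trailing path component prog are made up for illustration. Inside a script the sed expressions can be quoted normally, so the backslash before * is not needed here.

```shell
#!/bin/sh
# Default filter: drop the /tmp_mnt prefix added by the automounter.
fix_tmp_mnt() {
    sed -e 's@/tmp_mnt/@/@g'
}

# Second example from the text: map host-specific /a/<host>/.../home
# paths to the /u/home form that is valid everywhere.
fix_a_host() {
    sed -e 's@/a/.*/home@/u/home@g'
}

# Each filter reads a path on standard input and writes the fixed path.
echo /tmp_mnt/home/me | fix_tmp_mnt                      # prints /home/me
echo /a/thishost/root/home/username/prog | fix_a_host    # prints /u/home/username/prog
```

Note that .* is greedy, so a path with a second /home component further along would be cut at the last one; a separate script gives more control in such cases.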
When using the ch_p4 device, it is possible to speed up the process of starting jobs by using the secure server. The secure server is a program that runs on the machines listed in the machines.xxxx file and that allows programs to start faster. There are two ways to install this program: so that only one user may use it, or so that all users may use it. No special privileges are required to install the secure server for a single user's use.
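To use the secure server, follow these steps:
1. Choose a port number for the server. To see which ports are already registered with the portmapper on a host (here mysun), you can try

    rpcinfo -p mysun

2. Build the secure server with

    make serv_p4

At the end of this step, the executable for the secure server is in the same directory as the MPI scripts and executables. The name of the server is serv_p4.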
3. Start the secure server.
The script sbin/chp4_servs, invoked as

    sbin/chp4_servs -port=n -arch=$ARCH

can be used to start the secure servers. This makes use of the remote shell command (rsh, remsh, or ssh) to start the servers; if you cannot use the remote shell command, you will need to log into each system on which you want to start the secure server and start the server manually. The command to start an individual server using port 2345 is

    serv_p4 -o -p 2345 &

If you are using the secure server, use the same command line.
For example, if you had chosen a port number of 2345 and were using Solaris, then you would give the command

    sbin/chp4_servs -port=2345 -arch=solaris

The server will keep a log of its activities in a file with the name Secure_Server.Log.xxxx in the current directory, where xxxx is the process id of the process that started the server (note that the server may be running as a child of that initial process).
4. To make use of the secure servers using the ch_p4 device, you must inform mpirun of the port number. You can do this in two ways. The first is to give the -p4ssport n option to mpirun. For example, if the port is 2345 and you want to run cpi on four processors, use

    mpirun -np 4 -p4ssport 2345 cpi

The other way to inform mpirun of the secure server is to use the environment variables MPI_USEP4SSPORT and MPI_P4SSPORT. In the C-shell, you can set these with

    setenv MPI_USEP4SSPORT yes
    setenv MPI_P4SSPORT 2345

The value of MPI_P4SSPORT must be the port with which you started the secure servers. When these environment variables are set, no extra options are needed with mpirun.
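The setenv form above is specific to the C-shell. For Bourne-style shells (sh, ksh, bash), the same two variables could be set as in this sketch, again assuming the servers were started on port 2345:

```shell
#!/bin/sh
# Bourne-shell equivalent of the two setenv commands.
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=2345      # must match the port the secure servers were started with
export MPI_USEP4SSPORT MPI_P4SSPORT
```

With these exported in the current shell, a plain mpirun -np 4 cpi should pick up the port without the -p4ssport option.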
To stop the servers, their processes must be killed. This is easily done with the Scalable Unix Tools [8] with the command

    pfps -all -tn serv_p4 -and -o $LOGNAME -kill INT

Alternatively, you can log into each system and execute something like

    ps auxww | egrep "$LOGNAME.*serv_p4"

if using a BSD-style ps, or

    ps -flu $LOGNAME | egrep 'serv_p4'

if using a System V-style ps. The System V style will work only if the command name is short; the System V ps only gives you the first 80 characters of the command name, and if it was started with a long (but valid) directory path, the name of the command may have been lost.
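The ps listings only show the server processes; the process ids still have to be passed to kill. The sketch below shows one way to pull them out of BSD-style ps auxww output, where the user is the first column and the process id the second. The helper name serv_p4_pids and the sample ps lines are invented for illustration, and the column layout is an assumption that should be checked against your own ps.

```shell
#!/bin/sh
# serv_p4_pids USER: read BSD-style "ps auxww" output on standard input
# and print the process ids of USER's serv_p4 processes.
# Hypothetical helper; assumes column 1 is the user and column 2 the pid.
serv_p4_pids() {
    awk -v user="$1" '$1 == user && /serv_p4/ && !/egrep/ { print $2 }'
}

# Demonstration on made-up ps lines (not real output):
printf '%s\n' \
    'me  101 0.0 0.1 1234 567 ?     S 09:00 0:00 /home/me/mpich/bin/serv_p4 -o -p 2345' \
    'me  202 0.0 0.1 1234 567 pts/1 S 09:01 0:00 egrep me.*serv_p4' \
    'you 303 0.0 0.1 1234 567 ?     S 09:02 0:00 serv_p4 -o -p 2345' |
    serv_p4_pids me      # prints 101

# On a real system one might then run, after inspecting the list:
#   ps auxww | serv_p4_pids "$LOGNAME" | xargs kill -INT
```

The final pipeline is left as a comment on purpose: it is safer to look at the list of process ids first, and xargs may invoke kill with no arguments when the list is empty.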
An alternative approach is discussed in the section Managing the servers.
An experimental perl5 program is provided to help you manage the p4 secure servers. This program is chkserv, and is in the sbin directory. You can use this program to check that your servers are running, start up new servers, or stop servers that are running.
Before using this script, you may need to edit it; check that it has appropriate values for serv_p4, portnum, and machinelist; you may also need to set the first line to your version of perl5.
To check on the status of your servers, use

    chkserv -port 2345

To restart any servers that have stopped, use

    chkserv -port 2345 -restart

This does not restart servers that are already running; you can use this as a cron job every morning to make sure that your servers are running. Note that this uses the same remote shell command that configure found; if you can't use that remote shell command to start the process on the remote systems, you'll need to restart the servers by hand. In that case, you can use the output from chkserv -port 2345 to see which servers need to be restarted.
To stop the servers, use

    chkserv -port 2345 -kill

This contacts all running servers and tells them to exit. It does not use rsh, and can be used on any system.
This software is experimental. If you have comments or suggestions, please send them to mpi-bugs@mcs.anl.gov.
The new MPD system is described in the companion document to this one, the User's Guide for MPICH.