In the previous installment, we discussed how distributed multiprocessing architectures come to dominate the supercomputing arena. Smaller, somewhat less than "super", clusters continue to find practical applications. As we mentioned, Hadoop and Spark clusters are designed to map code to multiple computers so the code can be applied to a great deal of data distributed across many physical servers. In a Beowulf-style cluster the data is often minimal, and code is distributed so that many multiple CPUs can be working on the problem in parallel. While such clusters are famous for their application to highpower math problems, they are also useful in some simple but very practical applications. Web scraping, for example, is very time-consuming; easy, immediate benefits are achieved from setting many multiple machine in a cluster loose on web-scraping tasks.
Many amateur and homebrew clusters include Windows machines, Linux desktops, and a few Raspberry Pi's all working together in a single cluster.
The term "Beowulf" has evolved to refer to an ecosystem of applications, not just the computer cluster itself. To be truly useful, cluster computing involves management tools, security, and job scheduling applications. We shall not review such tasks here, but will focus on the distributed execution of Python code. For those interested in implementing a true Beowulf cluster, the beowulf-python provides much functionality for managment and security.
We shall focus on the minimum essentials, that is to say how to execute parallel code on distributed processors.
Message Passing Interface (MPI)
Actually, there is no law that requires us to use MPI for a cluster, and there are alternative networking protocols for distributed processing. However, we will stick to MPI because it is very popular and there are many resources to draw from. There are two freely available MPI implentations, MPICH and OpenMPI; we will illustrate MPICH. Any individual cluster, however, must commit to one or the other. MPICH and OpenMPI cannot be mixed in a single cluster.
Regardless of the implementation used, there must be some infrastructure in place to handle the network connections and security that MPI requires. Both MPICH and OpenMPI work with the secure shell, ssh. All the machines participating in a cluster will need their own set of keys so that individual "slave" machines can accept instructions from the "master" and then return results to the master.
The greatest challenge in programming for a distributed environment is managing how data is passed from one participating cluster node to another. We will ignore that complexity for the time being and focus on the simplest possible case, executing independent, self-standing code on different machines.
When you install an implementation of MPI you will get some variant of mpiexec. (Some platforms continue to use the classical command mpirun.)This command-line program initiates the execution of code on one or more machines in a cluster. For our example, we will execute a python program that simply prints the name of the machine on which it is running. In that way, we can easily confirm that the code execution has been distributed.
Note that since our simple example does not need to communicate with code on other machines, the python code does not need to reference the mpi4py library.
We could specify the machines on which we want our code to run right in the commandline itself. However, it is easier in the long run to use machinefiles, which are nothing more than lists specifying cluster nodes. A command line merely invokes the appropriate machinefile to specify the nodes. Here is an example machinefile.
Note that the machine can be specified by IP address or by machine name. In this particular example, Hypatia is the master and 10.0.0.195 (Lovelace) is the slave; this distinction is not made in the machinefile. If need be, we can specify multiple processes on each node, which would make use of more available processor cores.
This machinefile specifies running two processes in the first machine and three in the second. There is no space on either side of he colon. Here is the command line and output:
The -f parameter is the machinefile name, which in this example is in the same folder from which mpiexec is being run. The program commandline to run on multiple machines is python ~/mpistuff/mpi-001.py. It is critical to note that mpiexec does not copy the program code to the slave machine; the code must already be installed on the slave machine and full path of that code is included in the command.
As a historical aside, the machines on my network are all named after mathematicians. (Yes, I'm a geek. I admit it.) Hypatia was a famous mathematician in ancient Alexandria and Ada Lovelace was the world's first computer programmer.
Setting up a distributed processing cluster is actually pretty easy. Fun, too. The most difficult part is ensuring that the cluster nodes all have the correct security credentials so that they can communicate with each other.