In the IDeA Lab, I’ve worked with the cluster they maintain for batch processing of hundreds of MR images. They use PBS: you add your batch scripts to a queue and they get assigned to run on a node when one becomes available. We now have PBS set up in my lab as well, and here’s a summary of how to do it on Debian-flavored setups.

Setup

  1. Set up NFS. The Ubuntu wiki [1] is a good starting point - we only need the first 4 steps from their MPI guide (a minimal sketch of those steps follows this list).

    1. Set up hostnames in /etc/hosts.
    2. Install the packages.

       sudo apt-get install nfs-kernel-server   # on the master
       sudo apt-get install nfs-common          # on the nodes
      
    3. Set up the mount on master.
    4. Set up the mount on nodes.
  2. Set up the TORQUE server - the ArchLinux guide [2] is very up-to-date and can be followed directly (a rough sketch of the server-side commands also follows this list).
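
Here’s a minimal sketch of the NFS steps above, assuming the master is called master, there’s a single node called node01, and the shared directory is /mirror (the same directory the MPI guide uses); the hostnames and IPs are placeholders, so adjust them to your machines:

# /etc/hosts entries on every machine (IPs are placeholders)
#   10.0.0.1    master
#   10.0.0.2    node01

# On the master: install the server and export /mirror
sudo apt-get install nfs-kernel-server
sudo mkdir -p /mirror
echo "/mirror *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a

# On each node: install the client bits and mount the share
sudo apt-get install nfs-common
sudo mkdir -p /mirror
sudo mount master:/mirror /mirror
# or make it permanent by adding this to /etc/fstab:
#   master:/mirror  /mirror  nfs  defaults  0  0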
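
And a rough sketch of the server-side TORQUE commands the Arch guide walks you through, assuming the master’s hostname is master and the spool directory is /var/spool/torque (matching the log path below); the queue name batch is just a convention:

# On the master: initialize the server database and define a default queue
sudo pbs_server -t create
sudo qmgr -c "set server scheduling = true"
sudo qmgr -c "create queue batch queue_type = execution"
sudo qmgr -c "set queue batch enabled = true"
sudo qmgr -c "set queue batch started = true"
sudo qmgr -c "set server default_queue = batch"
sudo pbs_sched    # the simple FIFO scheduler

# On each node: point pbs_mom at the server and start it
echo "\$pbsserver master" | sudo tee /var/spool/torque/mom_priv/config
sudo pbs_mom

The list of nodes itself goes in server_priv/nodes, which is where the trouble below started.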

Troubleshooting

It took a while to figure out that the TORQUE package that ships with Ubuntu 14.04 is version 2.4 or so, which is far behind the latest offering from Adaptive Computing, who maintain the software. After monitoring the logs in /var/spool/torque/server_logs/ and tooling around with the server_priv/nodes configuration, I found that the gpus attribute isn’t supported in that version and had to remove it in order for the nodes to be created.
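
For reference, the nodes file is just one line per machine. With the 2.4-era package, the version that finally worked looked something like this (hostnames and core counts are made up):

# /var/spool/torque/server_priv/nodes
# node01 np=8 gpus=1    <- the gpus attribute is what the old package rejects
node01 np=8
node02 np=8

After restarting pbs_server, pbsnodes -a should list the nodes as free.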

Usage

  1. Put code and input data on the NFS share.
  2. Write a bash script to call code with input data as command-line arguments.
  3. Enqueue the script with different arguments (see the qsub sketch after this list).
  4. The nodes run and spit out results on the NFS share.
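
Here’s a sketch of step 3, assuming a hypothetical process.sh sitting on the share that takes a subject ID. qsub doesn’t take positional arguments for the script, but you can pass values in as environment variables with -v:

#!/bin/bash
# enqueue_all.sh -- submit one job per subject (the subject list is made up)
for subj in subj01 subj02 subj03; do
    qsub -N "proc_${subj}" -v SUBJECT="${subj}" /mirror/code/process.sh
done

Inside process.sh you then read $SUBJECT instead of $1.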

One quirk of running scripts with PBS is that, by default, it captures stdout/stderr into files in your home directory on the node the job is dispatched to. This can be changed by setting some PBS directives in the script, like so:

#!/bin/bash
# This means redirect stdout here instead -- this is on an NFS share
#PBS -o /mirror/logs/o${PBS_JOBID}.out
...

You’ll also notice there’s a PBS_* variable in there – that’s another thing about PBS scripts: you have access to a set of special environment variables [3].
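
For example, a fuller version of the hypothetical process.sh from the Usage section might look like this (a sketch; PBS_JOBID, PBS_O_WORKDIR, and PBS_O_HOST are standard variables, while the paths, SUBJECT, and my_tool are made up for illustration):

#!/bin/bash
#PBS -N demo
#PBS -o /mirror/logs/o${PBS_JOBID}.out
#PBS -e /mirror/logs/e${PBS_JOBID}.err

# PBS drops you in $HOME on the node; jump back to where qsub was run
cd "$PBS_O_WORKDIR"

echo "Job ${PBS_JOBID} on $(hostname), submitted from ${PBS_O_HOST}"
/mirror/code/bin/my_tool "$SUBJECT"    # SUBJECT comes from qsub -v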

Conclusion

NFS does funny things if UIDs/GIDs are inconsistent across your nodes, but the Ubuntu wiki mentions that you can set things up so this isn’t an issue. I actually reset UIDs/GIDs manually to make NFS work before I realized this; I’ll report back once I get the ID mapping working.
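
Until then, the manual fix is just making the numbers match, something like this (the username and IDs are made up):

# Check the numeric IDs on every machine
id labuser            # e.g. uid=1001(labuser) gid=1001(labuser)

# On any node where they differ, reset them to match the master
sudo groupmod -g 1001 labuser
sudo usermod -u 1001 labuser
# (files owned by the old UID/GID then need a chown -R)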

Our lab setup is mostly uniform with Ubuntu 14.04 installs except for a stray box that runs Linux Mint. This was tricky because compiling dependency libraries and putting them on NFS required using the oldest common gcc version (4.6). I still don’t know how to make CMake pull in all dependent libraries on installation as it does when you create a distributable bundle on OS X, but at least it works if you stick all your libraries on the NFS to begin with.

The next step for my research is to pile the data onto the NFS share and set it running. I still have to set up a few things on the nodes, and maybe create an alternate queue that assigns fewer nodes, because one of the command-line utilities runs for tens of minutes on average, takes up a lot of memory, and can potentially run off and crash - it would be rude to use up everyone’s cores during the day with that nonsense.
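
If I do set up that alternate queue, it’s more qmgr; one way to keep the heavy jobs from eating all the cores is to cap how many run at once with max_running, something like this (the queue name and limits are made up):

# A separate queue that only lets a couple of the heavy jobs run at once
sudo qmgr -c "create queue heavy queue_type = execution"
sudo qmgr -c "set queue heavy max_running = 2"
sudo qmgr -c "set queue heavy enabled = true"
sudo qmgr -c "set queue heavy started = true"

# Submit to it with: qsub -q heavy /mirror/code/process.sh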

  1. https://help.ubuntu.com/community/MpichCluster
  2. https://wiki.archlinux.org/index.php/TORQUE
  3. https://wiki.hpcc.msu.edu/display/hpccdocs/Advanced+Scripting+Using+PBS+Environment+Variables