Cluster-Handbook/Torque
Torque
Torque is an open source resource manager based on the original PBS project (http://www.pbsworks.com/). It starts, deletes, and monitors jobs, and thereby provides the functions a scheduler needs in order to manage them. Torque ships with its own scheduler (pbs_sched), but other schedulers can be used as well. Torque is flexible enough to handle tasks such as room scheduling, but it is mostly used in clusters. The sections below describe how to install and configure Torque for simple jobs on a cluster. To install the latest version of Torque, do not use the Ubuntu package; use the package from the following website instead: http://www.adaptivecomputing.com/products/open-source/torque/.
Download Torque
Download the files on the master (here we used version 4.1.4).
$ sudo wget http://adaptive.wpengine.com/resources/downloads/torque/torque-4.1.4.tar.gz
Unzip file and navigate to the directory
$ tar -xzvf torque-4.1.4.tar.gz
$ cd torque-4.1.4/
It is best to stay in this directory for the configuration and installation steps.
Configure and install the package on the master
Set Directory
By default, make install installs all files in /usr/local/bin, /usr/local/lib, /usr/local/sbin, /usr/local/include, and /usr/local/man.
You can also specify a different folder for the files by appending --prefix=$directoryname to ./configure. If you do not want to change anything, you can skip this step.
Set Library Folder
Create a new file: /etc/ld.so.conf.d/torque.conf
$ sudo nano /etc/ld.so.conf.d/torque.conf
Write the path to the libraries into this file. With the default settings it is /usr/local/lib (if /home was given as the prefix, it would be /home/lib). Then run the following command:
$ sudo ldconfig
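The relationship between the install prefix and the library path can be sketched as a small helper. This is only an illustration: the function name write_torque_ldconf is hypothetical, and the demonstration writes to a temporary file rather than to /etc so that it can be tried without root rights.

```shell
# Write "<prefix>/lib" into an ld.so.conf.d-style file.
# $1: install prefix (default /usr/local)
# $2: target file (default /etc/ld.so.conf.d/torque.conf)
write_torque_ldconf() {
    prefix="${1:-/usr/local}"
    conf="${2:-/etc/ld.so.conf.d/torque.conf}"
    printf '%s/lib\n' "$prefix" > "$conf"
}

# Demonstration against a temporary file; nothing under /etc is touched.
demo_conf="$(mktemp)"
write_torque_ldconf /usr/local "$demo_conf"
cat "$demo_conf"    # prints /usr/local/lib
```

On the real system the function would be called as root without the second argument, followed by ldconfig.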
Perform Configure
To run configure, you first have to install build-essential, libssl-dev, and libxml2-dev with this command:
$ sudo apt-get install build-essential libssl-dev libxml2-dev
If you run ./configure, you will get an error saying that libxml2-devel is not installed. This is a bug in Torque and can be fixed with the following steps:
First, two lines in the configure.ac file need to be changed (see screenshot).
$ sudo nano configure.ac
The minus marks the line that needs to be changed, and the plus shows how the line should read after the change. Because the file has many lines, it is best to search for a keyword from the line to be changed.
After that execute autoconf:
$ sudo autoconf
and change the configure file:
sudo nano configure
Again, look for the line marked in yellow and, at its end (red rectangle), change the -1 to -l.
Now you can run ./configure and it should finish without errors.
sudo ./configure
Finally, run make and make install.
sudo make
sudo make install
By default, make install creates the directory /var/spool/torque. This directory is referred to as TORQUE_HOME. Various subfolders are created there that are used to configure and run the program.
Install Torque on the Nodes
Create packages
Torque can build packages that contain the configuration and can then be installed on the nodes. Use the make command for this.
make packages
The packages are stored in torque-4.1.4/ and must be copied from there to a shared directory that the nodes can access. In our case this is the /home directory.
For example:
cp torque-package-mom-linux-i686.sh /home
On the nodes only the mom-linux package is needed. All others are optional.
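If several packages are to be shared, the copy step can be wrapped in a loop. The following is a minimal sketch; the function name copy_torque_packages is hypothetical, and the example call assumes the sources were unpacked in the home directory.

```shell
# Copy all Torque packages from the build directory to a shared directory.
# $1: build directory (e.g. torque-4.1.4/), $2: shared directory (e.g. /home)
copy_torque_packages() {
    src="$1"
    dest="$2"
    for pkg in "$src"/torque-package-*.sh; do
        # Skip the unexpanded pattern when no package exists.
        [ -e "$pkg" ] || continue
        cp "$pkg" "$dest"/ && echo "copied $(basename "$pkg")"
    done
}

# Example call (assumed paths):
# copy_torque_packages "$HOME/torque-4.1.4" /home
```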
Install Package
On the node, navigate to the directory into which you copied the package and install it with the following command:
./torque-package-mom-linux-i686.sh --install
Configure Torque
Initialise serverdb
The directory TORQUE_HOME/server_priv contains the configuration and information that the pbs_server service uses. To initialise the file serverdb, run the following command:
sudo ./torque.setup
Then the pbs_server needs a restart.
sudo qterm
sudo pbs_server
The server properties can be displayed with the following command:
sudo qmgr -c 'p s'
Specify Nodes
The pbs_server has to know which computers in the network are nodes. To tell it, create a new file named nodes in the directory TORQUE_HOME/server_priv:
sudo nano nodes
In this file, the nodes are listed by name. Normally it is sufficient to write just the names into the file, but you can also set special properties for each node. The syntax is:
NodeName[:ts] [np=] [gpus=] [properties]
[:ts]: This option marks the node as timeshared. Such nodes are listed by the server but are not allocated any jobs.
[np=]: This option specifies how many virtual processors the node has.
[gpus=]: This option specifies how many GPUs the node has.
[properties]: This option lets you enter a name that identifies the node. It must start with a letter.
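To illustrate the syntax, the following sketch writes a sample nodes file. The host names node01 to node03, the processor counts, and the property bigmem are hypothetical. The file is written to a temporary location here, whereas in a real setup it would be TORQUE_HOME/server_priv/nodes.

```shell
# Sample nodes file; all host names and values are made up for illustration.
nodes_file="$(mktemp)"
cat > "$nodes_file" <<'EOF'
node01 np=4
node02 np=4 gpus=1
node03 np=8 bigmem
EOF

# Show the result.
cat "$nodes_file"
```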
The number of processors can also be detected automatically:
sudo qmgr -c 'set server auto_node_np = True'
This sets the server attribute auto_node_np to True.
Configure Nodes
To configure the nodes, the file config has to be created in the directory TORQUE_HOME/mom_priv:
sudo nano config
This file is identical on all nodes and should read as follows:
Furthermore, the line $usecp *:/home /home must be written into it. This ensures that the output files of finished jobs are stored in a specific directory (here the shared /home). Otherwise an error will occur when running the command tracejob.
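A node configuration along these lines can be sketched as below. Only the $usecp line is taken from the text. The $pbsserver line and the host name master are assumptions about a typical setup, not a reproduction of the file used here. The quoted heredoc delimiter keeps the shell from expanding the $ names.

```shell
# Sketch of a mom config file, written to a temporary location.
# In a real setup the target would be TORQUE_HOME/mom_priv/config.
mom_config="$(mktemp)"
cat > "$mom_config" <<'EOF'
$pbsserver master
$usecp *:/home /home
EOF

# Show the result.
cat "$mom_config"
```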
Execute Job
Run Services
For a job to run, at least four services must be started. On the master these are pbs_server, pbs_sched, and trqauthd; on the nodes it is pbs_mom:
sudo pbs_server
sudo pbs_sched
sudo trqauthd
sudo pbs_mom
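To check which of these services are running on a given machine, a small helper like the following can be used. This is a sketch under the assumption that pgrep is available; the function name check_torque_services is hypothetical. On the master, pbs_mom is expected to be reported as not running, and on a node the three master services are.

```shell
# Report for each Torque service whether a process with that exact name exists.
check_torque_services() {
    for svc in pbs_server pbs_sched trqauthd pbs_mom; do
        if pgrep -x "$svc" >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: not running"
        fi
    done
}

check_torque_services
```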
Run Job
The command qsub [file name], executed on the master, starts a job. To run a job, you need a Bash script. In the example used here, the script prints the date, waits 10 seconds, and then prints the date again. The result is then stored in the directory on the master from which the job was started.
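A job of the kind described can be sketched as follows. The file name testjob.sh and its temporary location are hypothetical. The submission is guarded so that the snippet only calls qsub where Torque is actually installed.

```shell
# Create a small job script: print the date, wait 10 seconds, print it again.
job_dir="$(mktemp -d)"
job_script="$job_dir/testjob.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
date
sleep 10
date
EOF

# Submit the job only if qsub is available on this machine.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_script"
else
    echo "qsub not found; run this on the master" >&2
fi
```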
Useful Commands
There are some Torque commands with which you can trace running jobs and which are very useful for troubleshooting.
The command pbsnodes -a, executed on the master, shows whether a node is active or not. The command qstat displays a list of running or finished jobs.
There you can see the number of each job, which node it uses, and whether the job has been started, is in progress, or has already ended.
A very useful command for debugging is:
tracejob [job number]
This Torque command searches and summarizes the log files of the pbs_server, the mom, and the scheduler, which gives a quick overview.