Building a Beowulf Cluster/Printable version


Building a Beowulf Cluster

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Building_a_Beowulf_Cluster

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Introduction

Do you need to run big simulations or experiments with many repetitions (e.g. bootstrapping)? Then you need a lot of computational power. And you don't want to spend a fortune on one big, expensive computer?

Many computations are called embarrassingly parallel because each run or iteration is independent of the others, so in principle all of them could run at the same time. For such workloads, instead of spending a fortune on one huge machine, you can buy many cheap computers and have them compute in parallel. The computers are connected in a network and run Linux; such a setup is called a Beowulf cluster.

In the next chapter we look at an example hardware configuration with actual prices (in euros). Subsequent chapters deal with installation and configuration. Finally, we come to running processes in parallel using Matlab and R.


Hardware

The hardware assembly deserves a lot of attention: there are many trade-offs to make and speed bottlenecks to consider. Take care to settle on an experienced vendor who can give you good recommendations.

Here is the list of hardware we assembled at the artificial olfaction lab of the Institute for Bioengineering of Catalonia. Prices are as of May 2008 and do not include taxes.

 
Figure: rack containing the 8 computers, a KVM, and a switch (behind the KVM).

We bought 8 computers with these parameters:

  • Intel Core 2 Quad Q6600, 2.4 GHz, FSB 1066, 8 MB cache
  • Intel X38 / Intel ICH9R chipset
  • 4 GB DDR3-1066 RAM (2 × 2 GB)
  • 2 × PCI Express x16, 1 × PCI Express x1, 2 × PCI-X and 1 × PCI
  • Marvell 88E8056 dual Gigabit LAN controller
  • Realtek ALC882M 7.1-channel sound
  • 6 USB 2.0 ports and 1 IEEE 1394
  • 512 MB GeForce 8400 GS PCI-e graphics card
  • 160 GB SATA II hard disk (3 Gb/s)
  • DVD R/W, multi-card reader/writer
  • 19-inch rack computer case, 4U, with front lock
  • 550 W power supply (Server Guru)

They cost 923 euros each.

In order to operate 8 computers efficiently you need a monitor that you can connect to all machines, better yet a KVM switch (keyboard, video, mouse). A KVM with 8 ports cost us 850 euros. The KVM, as the name says, serves as keyboard, monitor, and mouse for all machines, and did us great service during configuration because it allows fast switching between the computers.

We also need to connect the computers to each other: a 16-port 10/100/1000 switch comes at 162 euros.

We want to put all the hardware somewhere where it is safe and well cooled: a rack (see picture). The rack in the picture holds 42 units of 19-inch equipment and cost us 500 euros.

Don't be surprised to be charged extra for cables, screws, and multi-outlet power strips. Rails allow you to slide your computers in and out like drawers. Additional cost: about 700 euros.

On top of that comes VAT (18% in Spain), which adds about 1,500 euros.

Total cost of the Beowulf: about 11,090 euros.

In order to power your computers, you need electrical lines that can support them. The configuration above draws about 4.5 kW (the 8 machines with 550 W power supplies alone account for roughly 4.4 kW), and we had to wait about 2 months for technicians to upgrade the capacity (that's Spain).

The photo shows the computer hardware mounted in the rack: the 8 computers flank the KVM in the middle, with the switch behind it.

Beginner's Experiment Cluster


This cluster may not be the fastest one out there, but it is small and cheap enough for a single beginner to build and operate:

Bill of materials for a single node:

  • System board: Raspberry Pi 4, at $35. This single-board computer packs a quad-core ARM Cortex-A72 processor at 1.5 GHz, 1 GB of RAM, and Gigabit Ethernet. Mounting holes on the board let you design the mechanical structure of the final system easily using standard mounting hardware.
  • Memory card: Any class 10 microSDHC card with at least 4 GB capacity will work. At least one seller sells a reasonably good class 10 4 GB microSDHC card at $5.
  • Power supply: A reasonably good one is available on Amazon for about $14.

This means about $54 per node. Other housekeeping parts include an Ethernet switch (about $50 for an 8-port one). So for an investment of roughly $430 (7 nodes at $54 each plus the switch) you get a 7-node cluster to experiment with.



Installation, Configuration, and Administration

In this chapter, you will find the basics of installation, configuration, and administration. Most important here are: 1. network setup with DHCP, and 2. sharing files over the network with the Network File System (NFS). After covering these, we come to some more general aspects of administration.

Most Beowulfs use a network structure of a master (also called the network head) and slaves, so that computing jobs are distributed from the master to the slaves. All machines of the cluster are connected to the switch; the head additionally interfaces with an external network. Master and slaves share user data over the network. It is easiest to install only the master and one slave (the so-called golden slave) in a first step. In a later chapter we come to cloning, which is the process of creating machines that are identical to the golden slave. As an alternative to cloning, one may choose diskless booting.
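As a minimal sketch of how this sharing of user data could be set up with NFS (the 192.168.1.0/24 subnet and the host name master are placeholders, not values from the setup described here):

# on the master, in /etc/exports:
/home 192.168.1.0/255.255.255.0(rw,sync,no_subtree_check)

master# exportfs -ra

# on each slave, one line in /etc/fstab:
master:/home  /home  nfs  defaults  0  0

Every user then sees the same home directory on every node, which is convenient because jobs started from the master find the same files on every slave.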

In practice, the master (or head node) has two network interfaces (say eth0 and eth1): one connected to the outside world and one connected to the cluster intranet via the network switch. All other computers (the slaves or nodes) are connected to the switch. To start processes on the cluster, a user logs in on the master and spawns processes from there onto the slaves.
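A minimal sketch of the head node's network configuration, in the Debian style of /etc/network/interfaces (the interface roles, addresses, and the 192.168.1.0/24 intranet are assumptions for illustration only):

# /etc/network/interfaces on the master (illustrative)
# eth0: external interface, address from the outside network
auto eth0
iface eth0 inet dhcp

# eth1: cluster intranet, behind the switch
auto eth1
iface eth1 inet static
    address 192.168.1.254
    netmask 255.255.255.0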



Parallelization of computation

As the principal goal of having the cluster is to run programs in parallel on different machines, I installed message-passing systems for distributed-memory applications. There are two common ones: the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI).

For scientific computing we can use languages such as C/C++ and Fortran, or high-level computing platforms; here we will look at GNU R and Matlab. R can spawn parallel jobs using either PVM or MPI. Matlab comes with an implementation of MPI (more precisely, MPICH2).

Note that for PVM you need to enable password-less ssh access (see previous section) from the server to all clients. Also, for both PVM and MPI (including Matlab's mdce), you have to remove the host names from the loop-back line (the one with 127.0.0.1) of the /etc/hosts file and just put localhost there instead. Then you need a text file listing all machines you wish to use for computing; call it pvmhosts for PVM and mpihosts for MPI.
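For illustration, the relevant files could look like this (the host names and intranet addresses are placeholders):

# /etc/hosts on every machine: the loop-back line names only localhost
127.0.0.1      localhost
192.168.1.254  master
192.168.1.1    node1
192.168.1.2    node2
# ... and so on for the remaining nodes

# pvmhosts / mpihosts: one machine per line
node1
node2
node3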



Cloning of Slaves

Once the master and one slave (the golden slave) are installed and configured, we want to scale this configuration up to more slaves by replicating the exact configuration of the golden slave.

Installing and configuring the OS on each machine manually is cumbersome and prone to error. However, the nodes are identical, so why not just copy everything we need? This process is called cloning. We first set up a so-called golden node (or model node) and then transfer its system to the other slave machines. Each new node gets one new entry in the head node's DHCP server configuration (/etc/dhcpd.conf) and in its /etc/hosts file, as sketched below.
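For example, adding a node could mean one host stanza in the DHCP configuration and one matching line in /etc/hosts (the MAC address, IP address, and host name here are of course placeholders):

# in /etc/dhcpd.conf on the master: one host entry per slave
host node2 {
    hardware ethernet 00:11:22:33:44:55;   # MAC of the slave's intranet NIC (placeholder)
    fixed-address 192.168.1.2;
}

# in /etc/hosts on the master:
192.168.1.2    node2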

As preparation, make sure that /etc/fstab and /boot/grub/menu.lst contain no disk-specific identifiers (such as the UUID or label of a particular hard disk), as these differ between nodes. All hardware should be addressed by its device path under /dev, which you can see in the output of the mount command.
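For example, an /etc/fstab that refers to disks by their /dev path clones without changes, whereas UUID- or label-based entries would be tied to one physical disk (the entries below are illustrative only):

# /etc/fstab using plain device paths
/dev/hda1   /      ext3   defaults,errors=remount-ro   0   1
/dev/hda5   none   swap   sw                           0   0
# avoid disk-specific forms such as UUID=... or LABEL=... here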

I used low-level reads and writes with dd, piped to and from netcat on the machine to clone from and the machine to clone to respectively, as described in a howto. In short, we clone using dd (convert and copy) and netcat (nc).

On the golden slave (or an identical clone) you run:

node1# dd if=/dev/hda conv=sync,noerror bs=64k | nc -l 5000

On the completely blank soon-to-be-slave you run:

node2# nc 192.168.1.1 5000 | dd of=/dev/hda bs=64k

where 192.168.1.1 is the IP address of the golden slave. This presupposes that the disk of the soon-to-be slave is at least as big as the disk in the golden slave.

This took several hours (it reported 158 GB read/written), but it worked.
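The dd/nc pipeline gives no feedback while it runs. One way to check progress on the sending node (GNU dd prints its I/O statistics when it receives the USR1 signal) is to open a second terminal and run:

node1# kill -USR1 $(pidof dd)

dd then prints the number of records and bytes copied so far on its own terminal.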


Profiling and Benchmarking

It is important to know how long applications take and whether they parallelize at all (or just run locally), so this is covered in this chapter. Also, computers that overheat will not last long, so we also look at how to monitor and control the temperature.
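As a first rough illustration of both kinds of check (the program name and host file are placeholders, mpirun is shown with Open MPI-style options, and the sensors command assumes the lm-sensors package is installed):

master# time mpirun -np 8 --hostfile mpihosts ./my_simulation
node1# top
node1# sensors

time reports the wall-clock time of the parallel run, top on a slave shows whether the job is really executing there, and sensors reads the CPU temperatures.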
