OpenMP/Tasks

If you followed along with the previous chapter and did the exercises, you'll have discovered that the sum function we developed was quite unstable. (To be fair, we set up the data to show the instability.) In fact, the naive way of summing has $O(n)$ round-off error: the error is proportional to the number of elements we sum. We can fix this by changing the summation algorithm. There are three candidates for a stable sum algorithm:

Sort the number from small to large, then sum them as before. Problem: this changes the complexity of the problem from linear to $O(n\log n)$ .
Use Kahan's algorithm. While this algorithm takes linear time and is very stable, it is quite slow in practice and harder to parallelize then our third alternative, which is
divide-and-conquer recursion. The round-off error for such an algorithm is $O(\log n)$ , i.e. proportional to the logarithm of the number of elements.

The basic divide and conquer summation is very easy to express in C:

float sum(const float *a, size_t n)
{
    // base cases
    if (n == 0) {
        return 0;
    }
    else if (n == 1) {
        return *a;
    }

    // recursive case
    size_t half = n / 2;
    return sum(a, half) + sum(a + half, n - half);
}

If you use this definition of sum in the program we developed for the previous chapter, you'll see that it produces exactly the expected result. But this algorithm doesn't have a loop, so how do we make a parallel version using OpenMP?

We'll use the tasks construct in OpenMP, treating the problem as task-parallel instead of data parallel. In a first version, the task-recursive version of sum looks like

float sum(const float *a, size_t n)
{
    // base cases
    if (n == 0) {
        return 0;
    }
    else if (n == 1) {
        return 1;
    }

    // recursive case
    size_t half = n / 2;
    float x, y;

    #pragma omp parallel
    #pragma omp single nowait
    {
        #pragma omp task shared(x)
        x = sum(a, half);
        #pragma omp task shared(y)
        y = sum(a + half, n - half);
        #pragma omp taskwait
        x += y;
    }
    return x;
}

We introduced two tasks, each of which sets a variable that is declared shared with the other task. If we did not declare the variables shared, each task would set its own local variable, then throw away the results. We then wait for the tasks to complete with #pragma omp taskwait and combine the recursive results.

You may be surprised by the #pragma omp parallel followed immediately by #pragma omp single nowait. The thing is that the first pragma causes all of the threads in the pool to execute the next block of code. The single directive causes all threads but one (usually the first to encounter the block) to not execute it, while the nowait turns off the barrier on the single; there's already a barrier on the enclosing parallel region, to which the other threads will rush.

Unfortunately, if you actually try to run this code, you'll find that it's still not extremely fast. The reason is that the tasks are much too fine-grained: near the bottom of the recursion tree, ${\tfrac {n}{2}}$ invocations are splitting two-element arrays into subtasks that process one element each. We can solve this problem by introducing, apart from the base and recursive cases, an "intermediate case" for the recursion which is recursive, but does not involve setting up parallel tasks: if the recursion hits a prespecified cutoff, it will no longer try to set up tasks for the OpenMP thread pool, but will just do the recursive sum itself.

Exercise: introduce the additional case in the recursion and measure how fast the program is. Don't peek ahead to the next program, because it contains the solution to this exercise.

Now we effectively have two recursions rolled into one: one with a parallel recursive case, and a serial one. We can disentangle the two to get better performance, by doing less checks at each level. We also separate the parallel setup code to a driver function.

#include <stddef.h>

#define CUTOFF 100  // arbitrary

static float parallel_sum(const float *, size_t);
static float serial_sum(const float *, size_t);

float sum(const float *a, size_t n)
{
    float r;

    #pragma omp parallel
    #pragma omp single nowait
    r = parallel_sum(a, n);
    return r;
}

static float parallel_sum(const float *a, size_t n)
{
    // base case
    if (n <= CUTOFF) {
        return serial_sum(a, n);
    }

    // recursive case
    float x, y;
    size_t half = n / 2;

    #pragma omp task shared(x)
    x = parallel_sum(a, half);
    #pragma omp task shared(y)
    y = parallel_sum(a + half, n - half);
    #pragma omp taskwait
    x += y;

    return x;
}

static float serial_sum(const float *a, size_t n)
{
    // base cases
    if (n == 0) {
        return 0.;
    }
    else if (n == 1) {
        return a[0];
    }

    // recursive case
    size_t half = n / 2;
    return serial_sum(a, half) + serial_sum(a + half, n - half);
}

This technique works better when the code inside the parallel tasks spends more time computing and less time doing memory accesses, because those may need to be synchronized between processors.