Summing floating point numbers

For our first parallel program, we turn to an age-old problem: summing an array of floating point numbers. The basic algorithm to solve this problem is so simple that it allows us to focus on OpenMP features rather than algorithmic details, but we'll see in a bit that the problem is actually less trivial than it appears at first.

Without further ado, here's a sequential algorithm to sum a list of floating point numbers:

#include <stddef.h>  // for size_t

float sum(const float *a, size_t n)
{
    float total = 0.;

    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

As far as algorithms go, this one is as simple as it gets. Put the definition above in a file isum.c (iterative sum).

If you have experience dealing with floating point numbers, you might be tempted to make total a double instead of a float for added precision. Don't do that just yet, as we'll solve the precision issue in a different way in a bit.
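To see why precision is a concern in the first place: a float carries only about 7 significant decimal digits, so adding a small number to a sufficiently large running total can have no effect at all. A quick standalone illustration:

#include <stdio.h>

int main(void)
{
    float big = 16777216.f;             // 2^24: beyond this, float skips odd integers
    // big + 1.f rounds back down to big, so the 1.0 is silently lost
    printf("%f\n", (big + 1.f) - big);  // prints 0.000000
    return 0;
}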

To test our algorithm, we need a driver program, which we'll put in a file main.c:

#include <stdio.h>
#include <stdlib.h>

float sum(const float *, size_t);

#define N 1000000  // we'll sum this many numbers

int main()
{
    float *a = malloc(N * sizeof(float));
    if (a == NULL) {
        perror("malloc");
        return 1;
    }

    // fill the array a so that the sum should come out to (approximately) 1
    for (size_t i = 0; i < N; i++) {
        a[i] = .000001;
    }

    printf("%f\n", sum(a, N));
    free(a);
    return 0;
}

And finally we need some way to build this program. On Linux/Unix/OS X, the following Makefile should do the job. It assumes you're using GCC.

# C99 extensions are not necessary for OpenMP, but very convenient
CFLAGS = -fopenmp -Wall -std=c99
LDFLAGS = -fopenmp

OBJS = main.o isum.o

# when copy-pasting the following, be aware that the indent must be a tab, not spaces
sum: $(OBJS)
        $(CC) $(LDFLAGS) -o sum $(OBJS)

Now compile the program with make sum (make's built-in rules take care of compiling the .c files, picking up CFLAGS), then run it and see how fast it is with a tool such as time, e.g. time ./sum. If it's too fast to measure reliably, consider changing N to a larger number, or run sum in a loop instead of just once.
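If you'd rather measure from inside the program, OpenMP provides a wall-clock timer, omp_get_wtime(). A rough sketch of a timing helper (the name time_sum and the repeat count are inventions for this example):

#include <stddef.h>  // for size_t
#include <omp.h>     // for omp_get_wtime

float sum(const float *, size_t);

// returns the average wall-clock time of REPS calls to sum(), in seconds
double time_sum(const float *a, size_t n)
{
    enum { REPS = 100 };    // arbitrary; raise it if the runs are too short
    volatile float result;  // volatile keeps the calls from being optimized away
    double start = omp_get_wtime();

    for (int r = 0; r < REPS; r++) {
        result = sum(a, n);
    }
    return (omp_get_wtime() - start) / REPS;
}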

A parallel way of summing

We had to go through a bit of setup, but now we're ready to make a parallel sum algorithm for floating point numbers. Here it is:

#include <stddef.h>  // for size_t

float sum(const float *a, size_t n)
{
    float total = 0.;

    #pragma omp parallel for reduction(+:total)
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

#pragma omp parallel for turns the loop into a parallel loop: if you have two cores, OpenMP will (probably) use two threads that each run half of the iterations. The reduction(+:total) clause declares that total is a reduction variable: each thread gets its own private copy, initialized to zero, and sums its share of the array into that copy. When the threads are done, OpenMP adds the private copies back into total.
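
Under the hood, this behaves roughly as if each thread summed into a private variable and the threads then combined their partial sums one at a time. A hand-written sketch of that idea, for illustration only (the reduction version above is the one to actually use):

#include <stddef.h>  // for size_t

float sum(const float *a, size_t n)
{
    float total = 0.;

    #pragma omp parallel
    {
        float partial = 0.;  // private to each thread

        // split the iterations among the threads of the enclosing region
        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            partial += a[i];
        }

        // combine the partial sums one thread at a time
        #pragma omp critical
        total += partial;
    }
    return total;
}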

Put the version with the reduction clause in isum.c and recompile. Now run the program. Do you get the same output as before?

Exercise: run the program with various settings for the environment variable OMP_NUM_THREADS, which controls the size of the thread pool that OpenMP uses. Try 1, 2, 4 and 8. Do you see the same results for each setting? Now try an absurdly large number of threads, e.g. 16000. How does this affect performance?
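
If you want to check how many threads the runtime will actually use for a given OMP_NUM_THREADS setting, you can query it. A small standalone sketch using the standard routines omp_get_max_threads() and omp_get_thread_num():

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // upper bound on the number of threads of the next parallel region
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // every thread in the pool announces itself
        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}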
Exercise: the dot product of two vectors is the sum of products of their respective entries, a · b = a₁b₁ + a₂b₂ + ⋯ + aₙbₙ. Adapt the sum function to a dot function for computing dot products in parallel.
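
If you get stuck, here is one possible solution sketch, a direct adaptation of sum (the name dot and the two-array signature are just one reasonable choice):

#include <stddef.h>  // for size_t

// dot product of a and b, using the same parallel reduction as sum
float dot(const float *a, const float *b, size_t n)
{
    float total = 0.;

    #pragma omp parallel for reduction(+:total)
    for (size_t i = 0; i < n; i++) {
        total += a[i] * b[i];
    }
    return total;
}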