A motivating example

Assume we need to parallelize the following sequential code.

void VectorAdd(int *a, int *b, int N)
{
  for (int i = 0; i < N; i++)
  {
    a[i] = a[i] + b[i];
  }
}

One approach is to assign each thread a subset of the iterations.

#include <omp.h> // for OpenMP library

void VectorAddParallel(int *a, int *b, int N)
{
  omp_set_num_threads(24); // request 24 threads
  #pragma omp parallel
  {
    int id, Nthrds, istart, iend;
    id = omp_get_thread_num();      // this thread's id, 0 .. Nthrds-1
    Nthrds = omp_get_num_threads(); // actual number of threads spawned
    istart = id * N / Nthrds;       // first iteration for this thread
    iend = (id + 1) * N / Nthrds;   // one past this thread's last iteration

    if (id == Nthrds - 1)
      iend = N; // last thread also handles any leftover iterations

    for (int i = istart; i < iend; i++)
    {
      a[i] = a[i] + b[i];
    }
  }
}

The call omp_set_num_threads(24) in the code above tells the runtime to spawn 24 threads when execution enters the parallel region (the scope of #pragma omp parallel). However, there is no guarantee that we will be given exactly the number of threads we ask for, due to resource limitations and environment constraints. Therefore, we call omp_get_num_threads() within the parallel region to find out the actual number of threads spawned. We then compute the starting and ending iterations for each thread: for example, the thread with id 0 (returned by omp_get_thread_num()) executes iterations 0 through N/Nthrds - 1. Each thread gets a private instance of the loop index variable, as well as private instances of all variables declared within the parallel region.
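
As a minimal sketch of this behavior (assuming a hypothetical standalone program; the output message is purely illustrative), the following prints the number of threads actually granted:

#include <stdio.h>
#include <omp.h> // for OpenMP library

int main(void)
{
  omp_set_num_threads(24); // request 24 threads; the runtime may grant fewer
  #pragma omp parallel
  {
    // id is declared inside the region, so each thread gets its own copy
    int id = omp_get_thread_num();
    if (id == 0) // let a single thread report the team size
      printf("requested 24 threads, got %d\n", omp_get_num_threads());
  }
  return 0;
}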

We can simplify the above function with OpenMP's for worksharing construct, which may be combined with the parallel directive as #pragma omp parallel for.

#include <omp.h> // for OpenMP library

void VectorAddParallelSimplified(int *a, int *b, int N)
{
  #pragma omp parallel for // spawn threads and divide the iterations among them
  for (int i = 0; i < N; i++)
  {
    a[i] = a[i] + b[i];
  }
}

The compiler directive #pragma omp for has no parallelizing effect unless it is used within a parallel region; encountered outside one, the loop simply runs sequentially on a single thread. It is therefore safe to assume that code which calls OpenMP library functions but never uses #pragma omp parallel anywhere is not parallelized by OpenMP.
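
To illustrate, here is a hypothetical VectorAddSplit (the name is ours, not from the text above) that writes the same computation with the two directives separated; the #pragma omp for distributes iterations only because it appears inside the enclosing parallel region:

#include <omp.h> // for OpenMP library

void VectorAddSplit(int *a, int *b, int N)
{
  #pragma omp parallel // spawn a team of threads
  {
    #pragma omp for // divide the loop iterations among the team
    for (int i = 0; i < N; i++)
    {
      a[i] = a[i] + b[i];
    }
  }
}

With GCC or Clang, all of the functions above compile with the -fopenmp flag, e.g. gcc -fopenmp vecadd.c.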