ROSE Compiler Framework/autoPar

This is a tool which can automatically insert OpenMP pragmas into input serial C/C++ codes.

Diagram explaining how the autoPar programming tool works

Overview

edit

This tool is an implementation of automatic parallelization using OpenMP. It can automatically insert OpenMP directives into input serial C/C++ codes. For input programs with existing OpenMP directives, the tool will double check the correctness when the right option is turned on.

  • The source files are currently located in rose/projects/autoParallelization.
  • A standalone executable program (named autoPar ) is generated and installed to the installation tree of ROSE (under ROSE INS/bin).
  • Test input files are located at rose/projects/autoParallelization/tests
  • You can test the tool in rose_build_tree/projects/autoParallelization by typing "make check"
  • There is a section in ROSE manual: 12.7 Automatic Parallelization pdf
  • It is used to explore semantics-aware automatic parallelization, as described in our papers:
    • A workshop paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore, In Proceedings of the 5th international Workshop on OpenMP: Evolving OpenMP in An Age of Extreme Parallelism (Dresden, Germany, June 3–05, 2009). pdf
    • A journal version of the paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions, Journal of Parallel Programming, Accepted in Jan. 2010 pdf

Similar to ROSE, autoPar is released under the BSD license.

Installation and configuration

edit

autoPar is released as part of the ROSE compiler. Please install ROSE by following the instructions at

  • Installation
  • type make install-core -j8 to install core ROSE libraries and prebuilt tools.

An executable file autoPar is created during the process of building ROSE. Users can use it to parallelize their codes. autoPar will be installed to the path specified by –prefix=ROSE INSTALLATION PATH. Here are some instructions.

During configuration, you might want to specify a prefix path, such as –prefix=/home/youraccount/opt/roserev-8345. Then, follow the ROSE installation guide to build and install ROSE, often by typing make and make install.

The next step is to set environment variables for search paths for ROSE executables (including autoPar ) and shared libraries. Assume you are using bash, put the following lines into a file (e.g. set.rose) and source it (typing source set.rose)

ROSE_INS=/home/youraccount/opt/rose-rev-8345
export ROSE_INS
PATH=$ROSE_INS/bin:$PATH
export PATH
LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

Now you can use autoPar to parallelize your code. Assume you have a sequential file: file.c, just type autoPar -c file.c. An output file rose file.c will be generated. If autoPar can find anything parallelizable, the output file should have OpenMP pragmas generated.

Alternative: using the virtual machine image

edit

If you don't want to install everything from scratch, you can download the vmware image of ROSE and directly use the installed copy of ROSE inside of it.

More information about this is at ROSE_Compiler_Framework/Virtual_Machine_Image.


Within the virtual machine, you can cd into

 /home/demo/buildrose/projects/autoParallelization

and type

 make check

You will see autoPar is used to process input files.

Command line options

edit

[~build64/projects/autoParallelization/tests]../autoPar—help

AUTOPAR(1)                           ROSE Command-line Tools                           AUTOPAR(1)

Name
    autoPar - This tool automatically inserts OpenMP directives into sequential codes.

Synopsis
    autoPar switches files...

Description
    This tool is an implementation of automatic parallelization using OpenMP. It can automatically insert OpenMP directives into input serial C/C++
    codes. For input programs with existing OpenMP directives, the tool will double check the correctness when requested.

Switches
  General switches
    --assert how
      Determines how a failed assertion behaves. The choices are "abort", "exit" with a non-zero value, or "throw" a
      Rose::Diagnostics::FailedAssertion exception. The default behavior depends on how ROSE was configured.

    --help; -h
      Show this documentation.

    --log config
      Configures diagnostics. Use "--log=help" and "--log=list" to get started.

    --threads n
      Number of threads to use for algorithms that support multi-threading. The default is 0. A value of zero means use the same number of
      threads as there is hardware concurrency (or one thread if the hardware concurrency can't be determined).

    --version; -V
      Shows the dotted quad ROSE version and then exits. See also --version-long, which prints much more information.

    --version-long
      Shows version information for ROSE and various dependencies and then exits. The shorter --version switch shows only the dotted quad of the
      ROSE library itself.

  autoPar's switches
    These switches control the autoPar tool.

    --[rose:autopar:]annot string
      Specify annotation file for semantics of abstractions

    --[rose:autopar:]dumpannot
      Dump annotation file content for debugging purposes.

    --[rose:autopar:]enable_debug
      Enable the debugging mode.

    --[rose:autopar:]enable_diff
      Compare user defined OpenMP pragmas to auto parallelization generated ones.

    --[rose:autopar:]enable_distance
      Report the absolute dependence distance of a dependence relation preventing parallelization.

    --[rose:autopar:]enable_patch
      Enable generating patch files for auto parallelization

    --[rose:autopar:]failure_report string
      Specify the report file for logging files autoPar cannot process

    --[rose:autopar:]keep_going
      Allow auto parallelization to keep going if errors happen

    --[rose:autopar:]no_aliasing
      Assuming no pointer aliasing exists.

    --[rose:autopar:]success_report string
      Specify the report file for logging files autoPar can process

    --[rose:autopar:]unique_indirect_index
      Assuming all arrays used as indirect indices have unique elements (no overlapping)

  ROSE builtin switches
    --rose:help
      Show the old-style ROSE help.

0.9.9.123                          Friday October 13 10:16:56 2017 

Additional useful rose flags

  • -rose:skipfinalCompileStep // skip invoking the backend compiler to compile the transformed code, this is useful to workaround some bugs
  • --edg:no_warnings // suppress warnings from the EDG C++ frontend

Example usage

edit

Testing input files can be found at https://github.com/rose-compiler/rose-develop/tree/master/projects/autoParallelization/tests

The corresponding generated testing output files can be found at: https://github.com/chunhualiao/autoPar-demo

We provide two samples below:

Without using annotations

edit

Input:

/* Only the inner loop can be parallelized
 */
void foo()
{
 int n=100, m=100;
 double b[n][m];
 int i,j;
 for (i=0;i<n;i++)
  for (j=0;j<m;j++)
   b[i][j]=b[i-1][j-1];
}

Command line and screen output

../autoPar -rose:C99—edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -rose:autopar:enable_patch -I../../../../sourcetree/src/frontend/SageIII -c ../../../../sourcetree/projects/autoParallelization/tests/inner_only.c

Enabling generating patch files for auto parallelization ...
Assuming all arrays used as indirect indices have unique elements (no overlapping) ...

Unparallelizable loop at line:8 due to the following dependencies:
2*2 TRUE_DEP; commonlevel = 2 CarryLevel = (0,0) Is precise SgPntrArrRefExp:b[i][j]@10:14->SgPntrArrRefExp:b[i - 1][j - 1]@10:21 == -1;* 0;||* 0;== -1;||::

Automatically parallelized a loop at line:9

Output:

/* Only the inner loop can be parallelized
 */
#include "omp.h" 

void foo()
{
 int n = 100;
 int m = 100;
 double b[n][m];
 int i;
 int j;
 for (i = 0; i <= n - 1; i += 1) {
  
#pragma omp parallel for private (j) firstprivate (n,m,i)
  for (j = 0; j <= m - 1; j += 1) 
   b[i][j] = b[i - 1][j - 1];
 }
}

Another input:

int i,j;
int a[100][100];
void foo()
{
 for (i=0;i<100;i++)
  for (j=0;j<100;j++)
   a[i][j]=a[i][j]+1;
}

Output:

#include "omp.h" 
int i;
int j;
int a[100UL][100UL];

void foo()
{
 
#pragma omp parallel for private (i,j)
 for (i = 0; i <= 100 - 1; i += 1) {
  
#pragma omp parallel for private (j) firstprivate (i)
  for (j = 0; j <= 100 - 1; j += 1) 
   a[i][j] = (a[i][j] + 1);
 }
}

Using annotations

edit

Annotation files provide additional program information which is traditionally hard for compilers to extract, such as aliasing, side effect information.

Multiple annotations files can be loaded in one command line

../autoPar --edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -I/home/liao6/rose/freshmaster/sourcetree/src/frontend/SageIII -c -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/floatArray.annot -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/funcs.annot -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/Index.annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/interp1_elem2.C

Assuming all arrays used as indirect indices have unique elements (no overlapping) ...

Automatically parallelized a loop at line:13


Input

//#include <A++.h>
#include "simpleA++.h"

void interpolate1D(class floatArray &fineGrid,class floatArray &coarseGrid)
{
 int _var_0,i;
 int _var_1;
 int fineGridSize = fineGrid.length(0);
 int coarseGridSize = coarseGrid.length(0);
// Interior fine points
 class Range If(2,_var_1 = (fineGridSize - 2),2);
 class Range Ic(1,(coarseGridSize - 1),1);
 for (i = 1; i < _var_0; i += 1) {
  fineGrid.elem(i) =fineGrid.elem(i)+1;
 }
}

Annotation file

Output file

//#include <A++.h>
#include "simpleA++.h"
#include "omp.h"

void interpolate1D(class floatArray &fineGrid,class floatArray &coarseGrid)
{
 int _var_0;
 int i;
 int _var_1;
 int fineGridSize = fineGrid . length (0);
 int coarseGridSize = coarseGrid . length (0);
// Interior fine points
 class Range If(2,_var_1 = fineGridSize - 2,2);
 class Range Ic(1,coarseGridSize - 1,1);

#pragma omp parallel for private (i) firstprivate (_var_0)
 for (i = 1; i <= _var_0 - 1; i += 1) {
  fineGrid . elem (i) = fineGrid . elem (i) + 1;
 }
}

Generating patches

edit

autoPar accepts an option -rose:autopar:enable_patch to generate a patch file. So users want to examine the directives to be inserted first. Then users can apply the inspected patch to their application.

Input file: https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/jacobi_seq.c

Command line:

  • autoPar -rose:C99 --edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -rose:autopar:enable_patch -c jacobi_seq.c

Screen output

Enabling generating patch files for auto parallelization ...
Assuming all arrays used as indirect indices have unique elements (no overlapping) ...

Automatically parallelized a loop at line:83

Unparallelizable loop at line:84 due to the following dependencies:
1*1 OUTPUT_DEP; commonlevel = 1 CarryLevel = (0,0) Is precise SgPntrArrRefExp:f[i][0]@89:6->SgPntrArrRefExp:f[i][0]@89:6 <= -1;||::
1*1 OUTPUT_DEP; commonlevel = 1 CarryLevel = (0,0) Is precise SgPntrArrRefExp:f[i][0]@89:6->SgPntrArrRefExp:f[i][0]@89:6 <= -1;||::

Automatically parallelized a loop at line:143

Automatically parallelized a loop at line:144

Automatically parallelized a loop at line:147

Automatically parallelized a loop at line:148

Automatically parallelized a loop at line:185

Automatically parallelized a loop at line:186

Generated patch file: cat jacobi_seq.c.patch

diff -ar /home/liao6/workspace/autoPar/sourcetree/projects/autoParallelization/tests/jacobi_seq.c rose_jacobi_seq.c
0a1
> #include <omp.h>
82a83
> #pragma omp parallel for private (xx,yy,i,j) firstprivate (n,m)
142a143
> #pragma omp parallel for private (i,j)
143a144
> #pragma omp parallel for private (j)
146a147
> #pragma omp parallel for private (resid,i,j) reduction (+:error)
147a148
> #pragma omp parallel for private (resid,j) reduction (+:error) firstprivate (omega,ax,ay,b)
184a185
> #pragma omp parallel for private (xx,yy,temp,i,j) reduction (+:error)
185a186
> #pragma omp parallel for private (xx,yy,temp,j) reduction (+:error) firstprivate (dx,dy)

After inspecting the patch, you can apply the patch to the original source file

  • patch jacobi_seq.c -i jacobi_seq.c.patch -o jacobi_patched.c

Checking Correctness

edit

AutoPar examines pre-existing OpenMP directives in an application and verifies that they have correctly accounted for private, reductions and other OMP data-sharing attributes.

Sample input

#include <stdio.h>
#include <omp.h>
 
int main(int argc, char *argv[]) {
 int N = 20;
 int total ;
 int i,j;
 
#pragma omp parallel for
 for (j=0;j<N;j++) {
  for (i=0;i<N;i++) {
   total += i+j ;
  }
 }
 printf("%d\n", total) ;
 return 0;
}

The code above contains a real OpenMP bug someone struggled with while trying to add OMP annotations to their code, submitted here: http://openmp.org/forum/viewtopic.php?f=3&t=821

Running the autoPar OpenMP directive checker produces this result:

% autoPar -rose:unparse_tokens -rose:autopar:unique_indirect_index -rose:autopar:enable_diff -fopenmp -c OMPcheck.cc
<<<<<<<<<
user defined   : #pragma omp parallel for
--------
OMPcheck.cc:11
compiler generated:#pragma omp parallel for private (i) reduction (+:total)
>>>>>>>> 

The output above from autoPar indicates that the OMP pragma is missing an OpenMP 'private' declaration for the variable 'i', and a reduction annotation for the variable 'total'.

Algorithm

edit

autoPar is designed to handle both conventional loops operating on primitive arrays and modern applications using high-level abstractions. The parallelizer uses the following algorithm: The loops may contain variables of either primitive data types or STL container types, or both.

overview

edit

1. Preparation and Preprocessing

  • (Optional) Read a specification file for known abstractions and semantics.
  • (Optional) Apply optional custom transformations based on input code semantics, such as converting tree traversals to loop iterations on memory pools.
  • (Optional) if unique index indices semantics are assumed, turn on normalization of indirect array access (void uniformIndirectIndexedArrayRefs() )
    • indirect_loop_index = array_Y [current_loop_index] ;array_X[indirect_loop_index] ..; becomes array_X [array_Y [current_loop_index]] ..
  • Normalize loops, including those using iterators.
  • Apply constant folding to simplify the normalized code
  • Find candidate array computation loops with canonical forms (for omp for) or loops and functions operating on individual elements (for omp task).

2. For each candidate:

  • Skip the target if there are function calls without known semantics or side effects.
  • Call dependence analysis and liveness analysis.
  • Classify OpenMP variables (autoscoping) to be private, firstprivate, lastprivate, reduction variables. recognize references to the current element, and find order independent write accesses.
  • Eliminate dependencies associated with autoscoped variables, those involving only the current elements, and output dependencies caused by order-independent write accesses.
  • Insert the corresponding OpenMP constructs if no dependencies remain.

The key idea of the algorithm is to capture dependencies within a target and eliminate them later on as much as possible based on various rules. Parallelization is safe if there are no remaining dependencies.

Canonical loops

edit

In page 51 or (59/320): section 2.6, OpenMP 4.0 (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf), it is defined as

for (init-expr; test-expr; incr-expr) structured-block

init-expr One of the following:

var = lb
integer-type var = lb
random-access-iterator-type var = lb
pointer-type var = lb

test-expr One of the following:

var relational-op b
b relational-op var

relational-op One of the following:

<
<=
>
>=

incr-expr One of the following:

++var
var++
--var
var--
var += incr
var -= incr
var = var + incr
var = incr + var
var = var - incr

var One of the following:

A variable of a signed or unsigned integer type.
For C++, a variable of a random access iterator type.
For C, a variable of a pointer type.
If this variable would otherwise be shared, it is implicitly made private in the loop
construct. This variable must not be modified during the execution of the for-loop  other
than in incr-expr. Unless the variable is specified lastprivate on the loop
construct, its value after the loop is unspecified.

lb and b: Loop invariant expressions of a type compatible with the type of var.

incr: A loop invariant integer expression.

Dependence analysis

edit

Dependence analysis is the basis for the parallelizer to decide whether a loop is parallelizable. autoPar invokes the dependence analysis from the loop optimizer, which implements algorithms proposed in [35, 36] to effectively transform both perfectly nested loops and non-perfectly nested loops. An extended direction matrix (EDM) dependence representation is used to cover non-common loop nests that surround only one of the two statements in order to handle non-perfectly nested loops. For array accesses within loops, a Gaussian elimination algorithm is used to solve a set of linear integer equations of loop induction variables.


This step relies on the dependence analysis implemented in

  • src/midend/programTransformation/loopProcessing/depGraph

A simplified interface function is provided in

https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/autoParSupport.C

//Compute dependence graph for a loop, using ArrayInterface and ArrayAnnoation
 LoopTreeDepGraph* ComputeDependenceGraph(SgNode* loop, ArrayInterface*, ArrayAnnotation* annot);

Side effect analysis

edit

Dependence analysis relies on the variable references information collected from side effect analysis

The analysis's source header file is midend/programTransformation/loopProcessing/depInfo/DepInfoAnal.h

  • bool AnalyzeStmtRefs( AstInterface& fa, const AstNodePtr& n, CollectObject<AstNodePtr> &wRefs, CollectObject<AstNodePtr> &rRefs);

Several SageInterface functions are provided to make this analysis more accessible to developers



// Collect all read and write references within stmt, which can be a function, a scope statement, or a single statement. Note that a reference can be both read and written, like i++.
bool SageInterface::collectReadWriteRefs (SgStatement *stmt, std::vector< SgNode * > &readRefs, std::vector< SgNode * > &writeRefs)


//Collect unique variables which are read or written within a statement. Note that a variable can be both read and written. The statement can be either of a function, a scope, or a single line statement.
bool SageInterface::collectReadWriteVariables (SgStatement *stmt, std::set< SgInitializedName * > &readVars, std::set< SgInitializedName * > &writeVars)

// Collect read only variables within a statement. The statement can be either of a function, a scope, or a single line statement.
void SageInterface::collectReadOnlyVariables (SgStatement *stmt, std::set< SgInitializedName * > &readOnlyVars)

//Collect read only variable symbols within a statement. The statement can be either of a function, a scope, or a single line statement. 
void SageInterface::collectReadOnlySymbols (SgStatement *stmt, std::set< SgVariableSymbol * > &readOnlySymbols)

Autoscoping/variable classification

edit

This step categories variables for their data-sharing attributes, based on their live-in (before the execution of a loop) and live-out (after the execution of a loop) analysis results. For instance, a private variable inside a loop is neither live-in nor live-out of the loop, which means the variable is immediately killed (redefined) inside the loop and then used inside the loop somehow, but is never going to be used anywhere after the loop. All loop index variables are also classified as OpenMP private variables to an avoid possible race condition. On the other hand, shared variables are live at both the beginning and the end of the loop. Firstprivate and lastprivate variables are live at either only the beginning or only the end of the loop, respectively.

Reduction variables are handled specially to maximize the opportunities for parallelization. A typical reduction operation inside a loop, such as sum = sum + a[i], causes a loop-carried output dependence, a loop-carried anti-dependence, and a loop independent anti-dependence. We use an idiom recognition analysis to capture such typical operations and exclude the associated loop-carried dependences when deciding if a loop is parallelizable.


This step relies on Liveness analysis, implemented in

  • src/midend/programAnalysis/defUseAnaysis/LivenessAnalysis.cpp
  • There is a SageInterface function to call this analysis:
//!Get liveIn and liveOut variables for a for loop
void SageInterface::getLiveVariables(LivenessAnalysis * liv, SgForStatement* loop, std::set<SgInitializedName*>& liveIns, std::set<SgIn
itializedName*> & liveOuts)


Algorithm: private and reduction variables cause dependences (being written). firstprivate and lastprivate variables are never being written in the loop (no dependences)

       live-in   live-out
shared      Y      Y   no written, no dependences: no special handling, shared by default 
private      N      N   written (depVars), need privatization: depVars- liveIns - liveOuts 
firstprivate   Y      N   liveIns - LiveOus - writtenVariables
lastprivate    N      Y   liveOuts - LiveIns 
reduction     Y      Y   depVars Intersection (liveIns Intersection liveOuts)

To recognize reduction variables: RecognizeReduction()

 /* 
  * Algorithms:
  *  for each scalar candidate which are both live-in and live-out for the loop body
  *  and which is not the loop invariant variable.
  *  Consider those with only 1 or 2 references
  *  1 reference
  *   the operation is one of x++, ++x, x--, --x, x binop= expr
  *  2 references belonging to the same operation
  *   operations: one of x= x op expr, x = expr op x (except for subtraction)
  * Also according to the specification.
  * x is not referenced in exp
  * expr has scalar type (no array, objects etc)
  * x: scalar only, aggregate types (including arrays), pointer types and reference types may not appear in a reduction clause.
  * op is not an overloaded operator, but +, *, -, &, ^ ,|, &&, ||
  * binop is not an overloaded operator but: +, *, -, &, ^ ,| 
  *
  */

Semantic specification via annotations

edit

We use a separated file to store annotations for input code. The file essentially specifies the semantics of the input code so the tool can leverage to parallelize the code.

Some annotation files (*.annot) can be found in the project directory at https://github.com/rose-compiler/rose-develop/tree/master/projects/autoParallelization/tests

Sample annotations are:

operator atoll(const char * val)
  {
  modify none; read{val}; alias none;
  }

class InternalIndex { has_value { stride = this.stride; base = this.base; length = this.length; } }
class Index { has_value { stride = this.stride; base = this.base; length = this.length; } }
class Range { has_value { stride = this.stride; base = this.base; length = this.length; } }

operator Index::Index( int lb, int l, int step)
   {
    modify none; read {lb, l, step}; alias none;
    restrict_value { result = { base = lb; length = l; stride = step; } };
   }
operator Range::Range( int lb, int ub, int step)
   {
    modify none; read {lb, ub, step}; alias none;
    restrict_value { result = { base = lb; length = (ub-lb+1)/step; stride = step; } };
   }
operator InternalIndex::InternalIndex( int lb, int l, int step)
   {
    modify none; read {lb, l, step}; alias none;
    restrict_value { result = { base = lb; length = l; stride = step; } };
   }

operator operator+ ( const InternalIndex & lhs, int x )
   {
    modify none; read {lhs, x}; alias none;
  restrict_value { result = {stride = lhs.stride; base = lhs.base + x;
                length = lhs.length}; };
   }
operator operator- ( const InternalIndex & lhs, int x )
   {
    modify none; read {lhs, x}; alias none;
  restrict_value { result = {stride = lhs.stride; base = lhs.base - x;
                length = lhs.length}; };
   }

Array abstractions:

class floatArray {
  array {
   dimension = 6;
   length(i) = this.Array_Descriptor.Array_Domain.Size[i];
   elem(i:dim:1:dimension) = this(i$dim);
   reshape(i:dim:1:dimension) = this.resize(i$dim);
   };
   array_optimize {
   define {
    float* _pointer = this.getDataPointer();
    int _size:dim:1:dimension = this.Array_Descriptor.Array_Domain.Size[dim-1];
    int _stride:dim:1:dimension = this.Array_Descriptor.Array_Domain.Stride[dim-1];
    int _length:dim:1:dimension = this.Array_Descriptor.Array_Domain.getLength(dim-1)
   }
   length(i) = _length$(i+1);
   elem(i:dim:1:dimension) =
      _pointer[i$1 + repeat(x,2,dimension, i$x * _stride$(x-1) * _size$(x-1))];
   };
   has_value { dimension = 6; length_0 = this.length(0);
         length_1 = this.length(1); length_2 = this.length(2); };
 }

operator floatArray::operator() (int index)
       {
          modify none; read {index};
          alias none;
          restrict_value { this = { dimension = 1; } };
          inline { this.elem(index) };
       }

operator floatArray::operator() ( const InternalIndex& index)
       {
          modify none; read {this, index};
          alias { (result, this) };
          restrict_value
            { this = { dimension = 1; };
             result = {dimension = 1; length_0 = index.length;}; };
          construct_array (this) {
            dimension = 1;
            length(0) = index.length;
            elem(i) = this.elem(i * index.stride + index.base);
          };
        }
operator floatArray::operator()(const InternalIndex& index1,
                const InternalIndex& index2)
       {
          modify none; read {this, index1, index2};
          alias { (result, this) };
          restrict_value
            { this = { dimension = 2; };
             result = {dimension = 2; length_0 = index1.length;
                  length_1 = index2.length};
            };
          construct_array (this) {
            dimension = 2;
            length(0) = index1.length; length(1) = index2.length;
            elem(i,j) = this.elem(i * index1.stride + index1.base,
                       j * index2.stride + index2.base);
          };
        }
operator floatArray::operator()(const InternalIndex& index1,
                const InternalIndex& index2,
                const InternalIndex& index3)
       {
          modify none; read {this, index1, index2, index3};
          alias { (result, this) };
          restrict_value
            { this = { dimension = 3; };
             result = {dimension = 3; length_0 = index1.length;
                  length_1 = index2.length; length_2 = index3.length; };
            };
          construct_array (this) {
            dimension = 3;
            length(0) = index1.length; length(1) = index2.length;
            length(2) = index3.length;
            elem(i,j,k) = this.elem(i * index1.stride + index1.base,
                       j * index2.stride + index2.base,
                       k * index3.stride + index3.base);
          };
        }

operator floatArray::operator= (const floatArray& that)
       {
          modify {this}; read {that}; alias none;
          restrict_value {
           result = { dimension = that.dimension; length_0 = that.length_0;};
          };
          modify_array (this) {
            dimension = that.dimension;
            length(i) = that.length(i);
            elem(i:dim:1:dimension) = that.elem(i$dim);
          };
        }
operator floatArray::operator= (float val)
       {
          modify {this}; read {val}; alias none;
          modify_array (this) {
            elem(i:dim:1:dimension) = val;
          };
        }

operator floatArray::getLength( int dim)
       {
          modify none; read {dim}; alias none;
          inline { this.length(dim) };
       }

Dependence elimination

edit

Not all dependences prevent us from parallelizing loops. We exclude some unharmful dependences in our consideration.

void DependenceElimination(SgNode* sg_node, LoopTreeDepGraph* depgraph, std::vector<DepInfo>& remainings, OmpSupport::OmpAttribute* att, std::map<SgNode*, bool> & indirect_table, ArrayInterface* array_interface/*=0*/, ArrayAnnotation* annot/*=0*/)

Algorithm, eliminate the following dependencies

  • caused by locally declared variables: already private to each iteration
  • either source or sink variable is thread local variable
  • ignore a dependence relationship with empty source/sink nodes
  • commonlevel ==0, no common enclosing loops
  • carry level !=0, loop independent,
  • at least one array reference in a dependence relationship, but dependence type is SCALAR_DEP or SCALAR_BACK_DEP
  • dependencies caused by autoscoped variables (private, firstprivate, lastprivate, reduction, etc.)
  • dependencies caused by a pair of indirect indexed array reference, if we know indirect index arrays have unique elements.


Note that why carry level must be 0 to prevent parallelization in our analysis

  • This is valid since we run dep analysis on each level of loop.
  • For a current loop to be considered, only the dependences carried by the current loop matters.
  • The current loop is always the outermost level loop , carry level id is 0.

An example with debugging info

edit

autoPar -rose:autopar:enable_debug sourcetree/projects/autoParallelization/tests/jacobi_seq.c

Search for "this dep relation cannot be eliminated. saved into remaining depedence set." to find loops with loop-carried dependency which cannot be eliminated.

// Input code from Jacobi, expected to generate
// #pragma omp parallel for private(i,j,xx,yy,temp) reduction(+:error)

185  for (i = 0; i < n; i++)
186   for (j = 0; j < m; j++)
187    {
188     xx = -1.0 + dx * (i - 1);
189     yy = -1.0 + dy * (j - 1);
190     temp = u[i][j] - (1.0 - xx * xx) * (1.0 - yy * yy);
191     error = error + temp * temp;
192    }


Debug: ComputeDependenceGraph() dumps the dependence graph for the loop at line :185
dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:xx = - 1.0 + dx *(i - 1); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise
	SgVarRefExp:xx@188:2->
	SgVarRefExp:xx@188:2
== 0;* 0;||* 0;== 0;||::

dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise
	SgVarRefExp:xx@188:2->
	SgPntrArrRefExp:u[i][j]@190:13
* 0;* 0;||* 0;* 0;||::

dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise
	SgVarRefExp:xx@188:2->
	SgVarRefExp:xx@190:26
== 0;* 0;||* 0;== 0;||::

dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise
	SgVarRefExp:xx@188:2->
	SgVarRefExp:xx@190:31
== 0;* 0;||* 0;== 0;||::

..

dep SgExprStatement:error = error + temp * temp; SgExprStatement:error = error + temp * temp; 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise
	SgVarRefExp:error@191:2->
	SgVarRefExp:error@191:10
== 0;* 0;||* 0;<= -1;||::
...
dep SgExprStatement:error = error + temp * temp; SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise
	SgVarRefExp:temp@191:25->
	SgVarRefExp:temp@190:2
== 0;* 0;||* 0;<= -1;||::

dep SgExprStatement:error = error + temp * temp; SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise
	SgVarRefExp:error@191:2->
	SgPntrArrRefExp:u[i][j]@190:13
* 0;* 0;||* 0;* 0;||::
--------------------------------------------------------
Live-in variables for loop:186
error
::m
::n
Live-out variables for loop before line:193
error
::m
::n
Final Live-in variables for loop:
error
::m
::n
Final Live-out variables for loop:
error
::m
::n
Debug after CollectVisibleVaribles ():
0x7f1640ef0678 ::n
0x7f1640ef07c0 ::m
0x7f1640ef1200 ::dx
0x7f1640ef1348 ::dy
0x7f1640ef2680 j
0x7f1640ef27c8 xx
0x7f1640ef2910 yy
0x7f1640ef2a58 temp
0x7f1640ef2ba0 error

Debug after CollectVariablesWithDependence():
0x7f1640ef27c8 xx
0x7f1640ef2910 yy
0x7f1640ef2a58 temp
0x7f1640ef2ba0 error

Debug dump private:
0x7f1640ef27c8 xx
0x7f1640ef2910 yy
0x7f1640ef2a58 temp
0x7f1640ef2538 i
0x7f1640ef2680 j

Debug dump lastprivate:
Debug: A candidate used twice:error
Debug dump firstprivate:

Entering DependenceElimination ()
----------------------------------
Eliminating a dep relation due to at least one autoscoped variables
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise
	SgVarRefExp:xx@188:2->
	SgVarRefExp:xx@188:2
== 0;* 0;||* 0;== 0;||::

Eliminating a dep relation due to scalar dep type for at least one array variable
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise
	SgVarRefExp:xx@188:2->
	SgPntrArrRefExp:u[i][j]@190:13
* 0;* 0;||* 0;* 0;||::

Eliminating a dep relation due to at least one autoscoped variables
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise
	SgVarRefExp:xx@188:2->
	SgVarRefExp:xx@190:26
== 0;* 0;||* 0;== 0;||::

Eliminating a dep relation due to scalar dep type for at least one array variable
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise
	SgVarRefExp:yy@189:2->
	SgPntrArrRefExp:u[i][j]@190:13
* 0;* 0;||* 0;* 0;||::

..
Eliminating a dep relation due to at least one autoscoped variables
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise
	SgVarRefExp:yy@190:44->
	SgVarRefExp:yy@189:2
== 0;* 0;||* 0;<= -1;||::

...

Eliminating a dep relation due to scalar dep type for at least one array variable
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise
	SgPntrArrRefExp:u[i][j]@190:13->
	SgVarRefExp:error@191:2
* 0;* 0;||* 0;* 0;||::

...

Eliminating a dep relation due to at least one autoscoped variables
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise
	SgVarRefExp:error@191:2->
	SgVarRefExp:error@191:10
== 0;* 0;||* 0;<= -1;||::

Eliminating a dep relation due to at least one autoscoped variables
0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise
	SgVarRefExp:temp@191:18->
	SgVarRefExp:temp@190:2
== 0;* 0;||* 0;<= -1;||::


Exiting DependenceElimination ()
----------------------------------

Automatically parallelized a loop at line:185
debug patch generation: attaching OMP att to sg_node SgForStatement at line 185

Annotations encoding semantics

edit

Used to encode semantic information for types (class) and their operations (operator)

Source files: https://github.com/rose-compiler/rose-develop/tree/master/src/midend/astUtil/annotation

// annotations are separated by ;
<annot> ::= <annot1> 
      | <annot1>;<annot>

// two main categories: class and operator 
<annot1> ::= class <cls annot>
     | operator <op annot>

//---------class------------
// If this is an array-like class? 
// e.g: class MyArray:
<cls annot> ::= <clsname>:<cls annot1>;
<cls annot1>::= <cls annot2> 
       | <cls annot2> <cls annot1>
//from bool TypeDescriptor:: read(istream& in) in AnnotDescriptors.C
// type name can have const modifier, with qualifiers, and of reference or pointer type
<clasname>::= [const] name1[::name2[...]][&|*]
 
// array information, inheritable or not
<cls annot2>::= <arr annot>
       | inheritable <arr annot>
       | has-value { <val def> } 

//Array information: dimension, length, element, reshape 
<arr annot>::= is-array{ <arr def>}
       | is-array{define{<stmts>}<arr def>}  // define loop-invariant operations that should be executed before enumerating array elements

<arr def> ::= <arr attr def> 
       | <arr attr def> <arr def>

<arr attr def> ::= <arr attr>=<expression>;
<arr attr> ::= dim      // at most 'dim' dimensions
       | len (<param>) // length of each dimension: parameter is used to specify which dimension
       | elem(<paramlist>) //i$x:0:dim-1 ==> $x is from 0 to dim-1 ==> list of parameters i_0, i_1, i_dim-1
       | reshape(<paramlist>)

//Data members for types/classes
//Could have expressions
<val def> ::= <name>; 
       | <name>;<val def>
       | <name> = <expression> ;
       | <name> = <expression> ; <val def>

//---------operator------------
// side effects: mod, read, alias
<op annot> ::= <opdecl> : <op annot1> ;
<op annot1> ::=<op annot2> 
       | <op annot2> <op annot1>
<op annot2> ::= modify <namelist>         // side effects on scalar:e.g. modify none
       | new-array (<aliaslist>){<arr def>} // side effects on arrays: creational 
       | modify-array (<name>) {<arr def>} // side effects on arrays: modifying only
       | restrict-value {<val def list>}  // How to get member values from this operation's semantics( may be from parameters values)
       | read <namelist>          // e.g.: read {a,b,c}
       | alias <nameGrouplist>       // e.g.: alias none
       | allow-alias <nameGrouplist>
       | inline <expression>        // this operator is semantically equal to another operator/expression, used to substitute an operation with another

// from OperatorDescriptors.h NameGropu, OperatorAliasDescriptor
<nameGrouplist> e.g: {(x,y,z), (a,b), (m,n)}

examples

edit

Arrays and array optimizations (rewrite to raw arrays)

class floatArray {

  array { // high level array abstraction with range from 1 to dimension
   dimension = 6; // max dimension
   length(i) = this.Array_Descriptor.Array_Domain.Size[i]; // length of each dimension
//i:dim:1:dimension/ i$dim means a list of parameters i_1, i_2, .., i_dimension?
// i:dim:1:dimension also defines a range variable dim, which is reused later in i$dim
   elem(i:dim:1:dimension) = this(i$dim); 
   reshape(i:dim:1:dimension) = this.resize(i$dim);
   };

   array_optimize { // low level array with range from 0 to dimension-1
   define {
    float* _pointer = this.getDataPointer();
    int _size:dim:1:dimension = this.Array_Descriptor.Array_Domain.Size[dim-1];
    int _stride:dim:1:dimension = this.Array_Descriptor.Array_Domain.Stride[dim-1];
    int _length:dim:1:dimension = this.Array_Descriptor.Array_Domain.getLength(dim-1)
   }
   length(i) = _length$(i+1);
   elem(i:dim:1:dimension) =
      _pointer[i$1 + repeat(x,2,dimension, i$x * _stride$(x-1) * _size$(x-1))];
   };

   has_value { dimension = 6; length_0 = this.length(0); 
         length_1 = this.length(1); length_2 = this.length(2); };
 }

dump ()--------------
arrays: 

 floatArray : {dimension=6; length=((i:::):[](.(.(.(this,Array_Descriptor),Array_Domain),Size),i)); elem=((i:dim:1:dimension):this($(i,dim))) } reshape = ((i:dim:1:dimension):FunctionPtrCall(.(this,resize),$(i,dim)))

array optimizations: 

 floatArray : 
{float* _pointer:::=FunctionPtrCall(.(this,getDataPointer));
int _size:dim:1:dimension=[](.(.(.(this,Array_Descriptor),Array_Domain),Size),+(-1dim));
int _stride:dim:1:dimension=[](.(.(.(this,Array_Descriptor),Array_Domain),Stride),+(-1dim));
int _length:dim:1:dimension=FunctionPtrCall(.(.(.(this,Array_Descriptor),Array_Domain),getLength),+(-1dim))
}

{dimension=; length=((i:::):$(_length,+(1i))); elem=((i:dim:1:dimension):[](_pointer,+($(i,1)repeat(x,2,dimension,*($(i,x)$(_stride,+(-1x))$(_size,+(-1x))))))) 
} 


Function side effect annotation: variable sets for read, modify, and alias


operator VectorXY::VectorXY(double xx, double yy)
{
 modify none; read {xx,yy}; 
}


// side effects annotation 
operator floatArray::operator() ( const InternalIndex& index) 
       {
          modify none; read {this, index};
        }
// Dump(): mangled_function_name: (all variables): {modified variable }
 floatArray::operator()_constInternalIndex& : (this,index):{}

// another annotation, side effects, alias, value
operator Index::Index( int lb, int l, int step) 
   {
    modify none; read {lb, l, step}; alias none; 
    restrict_value { result = { base = lb; length = l; stride = step; } };
   }

// Dump() 
read: 
 Index::Index_int_int_int : (lb,l,step):{l,lb,step}


// allow_alias: used only for a function operating on two or more parameters? tell if parameters can be alias to each other?

operator interpolate2D(floatArray & fineGrid,floatArray & coarseGrid) 
  {
  allow_alias none;
  }

has_value and restrict _value annotations: what is "this" and "result" ?

// has_value: Only used for within class annotation, 
class floatArray {
..
   has_value { dimension = 6; length_0 = this.length(0); 
         length_1 = this.length(1); length_2 = this.length(2); };
class Range { 
   has_value { stride = this.stride; base = this.base; length = this.length; } 
}


//Dump:
floatArray : {dimension:6;length_0:FunctionPtrCall(.(this,length),0);length_1:FunctionPtrCall(.(this,length),1);length_2:FunctionPtrCall(.(this,length),2)}
}

 Range : {base:.(this,base);length:.(this,length);stride:.(this,stride)}


// restrict_value: infer the value of member variable from the semantics of current operations: used for operation annotations. could have inferred information for one "this" and multiple "result" depending on function parameters

operator floatArray::operator() (int index) 
       { ..
 restrict_value { this = { dimension = 1; } }; // this element access operator using one integer parameter can tell us that the current object (this) has a dimension value equaling to 1
}

operator floatArray::operator()(const InternalIndex& index1,
                const InternalIndex& index2,
                const InternalIndex& index3) 
       {
         restrict_value
            { this = { dimension = 3; };
             result = {dimension = 3; length_0 = index1.length;
                  length_1 = index2.length; length_2 = index3.length; };
            };

}

c function annotations

edit
operator sqrt( double val)
  {
   modify none; read{val}; alias none;
  }

operator sqrt( float val)
  {
   modify none; read{val}; alias none;
  }

operator abs(int val)
  {
  modify none; read{val}; alias none;
  }


Publications

edit
  • old technical report: Chunhua Liao, Daniel J Quinlan, Jeremiah J Willcock, and Thomas Panas. Automatic parallelization using OpenMP based on STL semantics. Technical report, Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2008.
  • A workshop paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore, In Proceedings of the 5th international Workshop on OpenMP: Evolving OpenMP in An Age of Extreme Parallelism (Dresden, Germany, June 3–05, 2009). pdf
  • A journal version of the paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions, Journal of Parallel Programming, Accepted in Jan. 2010 pdf