ROSE Compiler Framework/Arithmetic intensity measuring tool


Overview

A tool that helps measure the arithmetic intensity (floating-point operations per byte of memory traffic) of loops. It does so by

statically estimating the floating-point operations and load/store bytes per iteration for user-specified loops
instrumenting the loops with statements that capture loop iteration counts and accumulate FLOPs and memory footprints (load/store bytes)
having users run the instrumented code to generate the final reports.

Quick information

tool location: https://github.com/rose-compiler/rose-develop/tree/master/projects/ArithmeticMeasureTool
testing: type "make check" within the corresponding build tree

Download and Installation

We recommend obtaining the tool from the rose-develop repository to get the latest updates.

https://github.com/rose-compiler/rose-develop

The first step is to download and install ROSE as usual.

Latest instructions: http://rosecompiler.org/ROSE_HTML_Reference/installation.html

Then

cd rose-build-tree/projects/ArithmeticMeasureTool
make && make install

An executable file named measureTool will then be installed within ROSE_INSTALLATION_PATH/bin

Now prepare your environment so the tool can be invoked

# set.rose file,  source it to set up environment variables
ROSE_INS=/home/liao6/workspace/masterDevClean/install
export ROSE_INS

PATH=$ROSE_INS/bin:$PATH
export PATH

LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

Command line options

List

-help: print help information
-debug: enable debugging mode, which prints progress and internal results to the screen
-annot your_annotation_file: accept user-specified function side effect annotations to complement compiler analysis
-static-counting-only: a special execution mode in which the tool scans all loop bodies and writes the counting results into a report file
-report-file your_report_file.txt: specify your own report file name; otherwise the default file ai_tool_report.txt is used
-use-algorithm-v2: use the second version of the algorithm in static counting mode, a bottom-up synthesized traversal to count FLOPs; still under development

Function side effect annotation

Compiler analysis cannot determine the side effects of all functions, either because the library source code is unavailable or because pointer use in the source code is too complex. To solve this problem, the tool accepts a function side effect annotation file via the -annot option.

Annotation file format

operator abs(int val)
   {
    modify none; read{val}; alias none;
   }
operator max(double val1, double val2)
   {
    modify none; read{val1, val2}; alias none;
   }

example command line

measureTool -c -annot /path/to/functionSideEffect.annot your_input.c

Execution mode 1: static analysis only

This is a special mode in which the tool only finds all loops and counts FLOPs for the loop bodies. The reported numbers are for a single iteration only.

The load/store bytes are represented in two ways

expression format: such as 3*sizeof(float) + 5*sizeof(double)
evaluated final integer values: 52

The result is written to a text report file.
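The relation between the two byte representations above can be illustrated with a minimal sketch. The function name example_load_store_bytes is hypothetical and not part of the tool; it only evaluates the example expression on the machine that compiles it.

```c
#include <stddef.h>

/* Hypothetical helper (not part of measureTool): evaluates the example
   expression 3*sizeof(float) + 5*sizeof(double) from the text. */
size_t example_load_store_bytes(void)
{
  /* The expression form stays symbolic in the report; the compiler
     turns it into an integer for the target machine. */
  return 3 * sizeof(float) + 5 * sizeof(double);
}
```

On a platform where float is 4 bytes and double is 8 bytes, the function returns 52, which matches the evaluated integer form shown above.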

Example use

./measureTool -c -static-counting-only -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot -I. ../../../sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c

Excerpt of the generated report. Note that the loop at line 129 has two floating-point additions and two floating-point multiplications. It loads 0 bytes and stores one double element (usually 8 bytes), so the final arithmetic intensity (AI) is 4/8 = 0.5 ops/byte.

Content of generated report file: ai_tool_report.txt


----------Floating Point Operation Counts---------------------

SgForStatement@
/home/liao6/workspace/ExReDi/ai_tool/sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c:129:10
        fp_plus:2
        fp_minus:0
        fp_multiply:2
        fp_divide:0
        fp_total:4
----------Memory Operation Counts---------------------

        Loads: NULL
        Loads int: 0
        Stores:1 * sizeof(double )
        Store int: 8
----------Arithmetic Intensity---------------------
AI=0.5

Right now

AI is set to -1.0 if it is uninitialized
AI is set to 9999.9 if the byte count is zero (division by zero)
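The AI calculation and the two special values can be sketched as follows. This is an illustration only: the function and macro names are hypothetical, and dividing by the sum of load and store bytes is an assumption that is consistent with the report excerpt above (4 FLOPs, 0 load bytes, 8 store bytes, AI=0.5).

```c
/* Special values taken from the text above. */
#define AI_UNINITIALIZED  (-1.0)   /* AI before any result is computed */
#define AI_DIVIDE_BY_ZERO 9999.9   /* AI when no bytes are moved */

/* Hypothetical helper (not the tool's API): FLOPs per byte moved.
   Assumes the denominator is load bytes plus store bytes. */
double arithmetic_intensity(unsigned long flops,
                            unsigned long load_bytes,
                            unsigned long store_bytes)
{
  unsigned long total_bytes = load_bytes + store_bytes;

  if (total_bytes == 0)
    return AI_DIVIDE_BY_ZERO;
  return (double) flops / (double) total_bytes;
}
```

With the numbers from the report excerpt, arithmetic_intensity(4, 0, 8) returns 0.5.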

User pragma to verify results

In this mode, the translator can verify the tool-generated results by comparing the results to what is indicated by pragmas in the input code.

The user-provided pragma has the form

#pragma aitool fp_plus(10) fp_minus(10) fp_multiply(10) fp_divide (10) fp_total(40)
for () ...


void error_check ( )
{
  int i,j;
  double xx,yy,temp,error;

  dx = 2.0 / (n-1);
  dy = 2.0 / (m-1);
  error = 0.0 ;

#pragma aitool fp_plus(3) fp_minus(3) fp_multiply(6)
  for (i=0;i<n;i++)
    for (j=0;j<m;j++)
    {
      xx = -1.0 + dx * (i-1);
      yy = -1.0 + dy * (j-1);
      temp  = u[i][j] - (1.0-xx*xx)*(1.0-yy*yy);
      error = error + temp*temp;
    }
  error = sqrt(error)/(n*m);
  printf("Solution Error :%E \n",error);                                                                                                  
}

fp_total is required while the clauses of other kinds of FP operations are optional.
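As a smaller illustration of the pragma form, here is a hypothetical kernel with hand-counted operations, including the required fp_total clause. The function weighted_sum is not taken from the tool's test suite, and the counts are a manual estimate of one addition and one multiplication per iteration; a regular C compiler simply ignores the unknown pragma.

```c
/* Hypothetical kernel: each iteration performs one FP addition and
   one FP multiplication, so the hand-counted total is 2. */
double weighted_sum(const double *a, const double *b, int n)
{
  double sum = 0.0;
  int i;

#pragma aitool fp_plus(1) fp_multiply(1) fp_total(2)
  for (i = 0; i < n; i++)
    sum = sum + a[i] * b[i];

  return sum;
}
```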

Execution mode 2: analyze and instrument the code

This is the default mode.

Manually instrument the input code

The tool currently relies on user-added code instrumentation, using the following steps:

declare four global counters with specific variable names, which the tool will later recognize
add chiterations = .. before each loop whose FLOPs and load/store bytes you want to count
print out the results: printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);

#include <stdio.h>
#define SIZE 10

// Instrumentation 1: add a few global variables
unsigned long int chiterations = 0;
unsigned long int chloads = 0;
unsigned long int chstores = 0;
unsigned long int chflops = 0;

double ref[2] = {9.2, 5.4};
double coarse[SIZE][SIZE][SIZE];
int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0, iboxlo0 = 0, iboxhi1 = SIZE-1, iboxhi0 = SIZE-1;
  int var;
  int ic1=0, ic0=0;
  int ip0 = ic0 * ref[0];
  int ip1 = ic1 * ref[1];
  double coarseSum = 0.0;
  int ii1, ii0;

  for (var =0; var < SIZE ; var++)
  {
    //Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1< iboxhi1 +1; ic1++)
      for (ic0 = iboxlo0; ic0< iboxhi0 +1; ic0++)
      {
        int ibreflo1 = 0, ibreflo0 = 0, ibrefhi1 = SIZE-1, ibrefhi0 = SIZE-1;
        //Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1< ibrefhi1 +1; ii1++)
          for (ii0 = ibreflo0; ii0< ibrefhi0 +1; ii0++)
          {
            coarseSum = coarseSum +  coarse[ii1][ii0][ii1] +(ip0 + ii0) + (ip1 + ii1)  + var;
          }
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
  }
  //Instrumentation 4: print out results
  printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);
  return 0;
}

Use the tool to transform the code

./measureTool -c -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot nestedloops.c

The tool will

count the FLOPs and load store bytes for the specified loop
add counter accumulation statements, using different counters for different loops

#include <stdio.h>
#define SIZE 10
// Instrumentation 1: add a few global variables
unsigned long chiterations = 0;
unsigned long chloads = 0;
unsigned long chstores = 0;
unsigned long chflops = 0;
double ref[2] = {(9.2), (5.4)};
double coarse[10][10][10];

int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0;
  int iboxlo0 = 0;
  int iboxhi1 = 10 - 1;
  int iboxhi0 = 10 - 1;
  int var;
  int ic1 = 0;
  int ic0 = 0;
  int ip0 = (ic0 * ref[0]);
  int ip1 = (ic1 * ref[1]);
  double coarseSum = 0.0;
  int ii1;
  int ii0;
  unsigned long chiterations_1;
  unsigned long chiterations_2;
  for (var = 0; var < 10; var++) {
//Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations_2 = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
      for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
        int ibreflo1 = 0;
        int ibreflo0 = 0;
        int ibrefhi1 = 10 - 1;
        int ibrefhi0 = 10 - 1;
//Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations_1 = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
          for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
            coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
          }
        }
/*       aitool generated Loads counting statement ... */
        chloads = chloads + chiterations_1 * (1 * sizeof(double ));
/*       aitool generated FLOPS counting statement ... */
        chflops = chflops + chiterations_1 * 4;
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
    }
/*       aitool generated Stores counting statement ... */
    chstores = chstores + chiterations_2 * (1 * sizeof(double ));
/*       aitool generated FLOPS counting statement ... */
    chflops = chflops + chiterations_2 * 1;
  }
//Instrumentation 4: pass in loop iteration for the loop to be counted
  printf("chflops =%lu chloads =%lu chstores=%lu\n",chflops,chloads,chstores);
  return 0;
}

Compile and run the transformed code

gcc -O3 rose_nestedloops.c -o nestedloops.out

./nestedloops.out

The result looks like

chflops =401000 chloads =800000 chstores=8000
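These numbers can be checked by hand. The sketch below replays only the counter updates of the transformed code, without the actual computation. The struct and function names are hypothetical, and the iteration counts follow from SIZE being 10 in the example.

```c
#include <stddef.h>

/* Hypothetical container for the three reported counters. */
struct ai_counters {
  unsigned long flops;
  unsigned long loads;
  unsigned long stores;
};

/* Replays the counter accumulation statements of the transformed
   nestedloops.c for a given SIZE (10 in the example). */
struct ai_counters replay_counters(unsigned long size)
{
  struct ai_counters c = {0, 0, 0};
  unsigned long inner_iters = size * size;  /* chiterations_1 */
  unsigned long outer_iters = size * size;  /* chiterations_2 */
  unsigned long var, ic;

  for (var = 0; var < size; var++) {
    /* The ic1/ic0 loop nest runs size*size times per value of var. */
    for (ic = 0; ic < size * size; ic++) {
      c.loads += inner_iters * (1 * sizeof(double));
      c.flops += inner_iters * 4;
    }
    c.stores += outer_iters * (1 * sizeof(double));
    c.flops += outer_iters * 1;
  }
  return c;
}
```

For size 10 and an 8-byte double, this gives 10 * (100 * 400 + 100) = 401000 FLOPs, 10 * 100 * 800 = 800000 load bytes, and 10 * 800 = 8000 store bytes, which agrees with the output above.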

Limitations

The tool does not support Fortran loops with function calls for now

ROSE's Fortran procedure/routine representation is not accurate enough (missing parameter type info.) to hook up with function side effect annotations designed to match C/C++ functions.

Internals

Execution mode variable running_mode

e_analysis_and_instrument
e_static_counting

FP operations

class FPCounters: public AstAttribute {}; to store analysis results

void CountFPOperations() from src/ai_measurement.cpp

   Rose_STL_Container<SgNode*> nodeList = NodeQuery::querySubTree(input, V_SgBinaryOp);
    for (Rose_STL_Container<SgNode *>::iterator i = nodeList.begin(); i != nodeList.end(); i++)
    {
      fp_operation_kind_enum op_kind = e_unknown;
//      bool isFPType = false;
      // check operation type
      SgBinaryOp* bop= isSgBinaryOp(*i);
      switch (bop->variantT())
      {
        case V_SgAddOp:
        case V_SgPlusAssignOp:
          op_kind = e_plus;
          break;
        case V_SgSubtractOp:
        case V_SgMinusAssignOp:
          op_kind = e_minus;
          break;
        case V_SgMultiplyOp:
        case V_SgMultAssignOp:
          op_kind = e_multiply;
          break;
        case V_SgDivideOp:
        case V_SgDivAssignOp:
          op_kind = e_divide;
          break;
        default:
          break;
      } //end switch
...

}
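The classification step in the excerpt can be illustrated without ROSE by a plain C sketch. The enum op_variant is a simplified, hypothetical stand-in for the ROSE variants named above, and the sketch assumes that every listed operator already has a floating-point type; it does not show how the tool itself checks operand types.

```c
#include <stddef.h>

/* Simplified stand-ins for the ROSE variants in the excerpt. */
enum op_variant {
  OP_ADD, OP_PLUS_ASSIGN,
  OP_SUBTRACT, OP_MINUS_ASSIGN,
  OP_MULTIPLY, OP_MULT_ASSIGN,
  OP_DIVIDE, OP_DIV_ASSIGN,
  OP_OTHER
};

struct fp_counts {
  int plus, minus, multiply, divide, total;
};

/* Tallies operators by kind, mirroring the switch statement above. */
struct fp_counts tally_fp_ops(const enum op_variant *ops, size_t n)
{
  struct fp_counts c = {0, 0, 0, 0, 0};
  size_t i;

  for (i = 0; i < n; i++) {
    switch (ops[i]) {
      case OP_ADD:
      case OP_PLUS_ASSIGN:
        c.plus++;
        break;
      case OP_SUBTRACT:
      case OP_MINUS_ASSIGN:
        c.minus++;
        break;
      case OP_MULTIPLY:
      case OP_MULT_ASSIGN:
        c.multiply++;
        break;
      case OP_DIVIDE:
      case OP_DIV_ASSIGN:
        c.divide++;
        break;
      default:
        continue;  /* not an arithmetic operator: skip it */
    }
    c.total++;
  }
  return c;
}
```

Feeding it two additions and two multiplications reproduces the fp_plus:2, fp_multiply:2, fp_total:4 counts of the jacobi.c report excerpt.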

Load/Store bytes

The main functions are defined in ai_measurement.cpp:

std::pair <SgExpression*, SgExpression*> CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars /* = true */, bool includeIntType /* = true */)
SgExpression* calculateBytes (std::set<SgInitializedName*>& name_set, SgStatement* lbody, bool isRead)

Both return expressions that calculate the value rather than the actual values, since sizeof(type) is machine dependent.

Configuration

By default: only array references are counted. Scalars are ignored.

Algorithm

Call side effect analysis to find the variables that are read and written; some references may trigger both read and write accesses. If the analysis succeeds, proceed; otherwise a warning is issued.
Accesses to the same array or scalar variable are grouped into one read (or write) access: e.g. array[i][j], array[i][j+1], array[i][j-1], etc. are counted as a single access.
Group accesses by type: accesses of the same type increment the same counter, which shortens the resulting expression.
Iterate over the results to generate an expression such as 2*sizeof(float) + 5*sizeof(double).
As an approximation, the analysis is kept simple and assumes no function calls.

    // Obtain per-iteration load/store bytes calculation expressions                                                           
    // excluding scalar types to match the manual version                                                                      
    //CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars = true, bool includeIntType = true);                      
    std::pair <SgExpression*, SgExpression*> load_store_count_pair = CountLoadStoreBytes (loop_body, false, true);             
    // chstores=chstores+chiterations*8                                                                                        
    if (load_store_count_pair.second!= NULL)                                                                                   
    {                                                                                                                          
      SgExprStatement* store_byte_stmt = buildCounterAccumulationStmt("chstores", new_iter_var_name, load_store_count_pair.second, scope);   
      insertStatementAfter (loop, store_byte_stmt);                                                                            
      attachComment(store_byte_stmt,"      aitool generated Stores counting statement ...");                                   
    }                                                                                                                          
    // handle loads stmt 2nd so it can be inserted as the first after the loop                                                 
    // build  chloads=chloads+chiterations*2*8                                                                                 
    if (load_store_count_pair.first != NULL)                                                                                   
    {                                                                                                                          
      SgExprStatement* load_byte_stmt = buildCounterAccumulationStmt("chloads", new_iter_var_name, load_store_count_pair.first, scope);      
      insertStatementAfter (loop, load_byte_stmt);                                                                             
      attachComment(load_byte_stmt,"      aitool generated Loads counting statement ...");                                     
    }
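The grouping rule of the algorithm (several references to the same array count as one access) can be sketched in plain C. The struct array_ref and the function unique_access_bytes are hypothetical; the real tool works on ROSE AST nodes and returns an expression rather than a number.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one array reference in a loop body. */
struct array_ref {
  const char *name;   /* name of the referenced variable */
  size_t elem_size;   /* element size, e.g. sizeof(double) */
};

/* Sums element sizes, counting each distinct variable name once. */
size_t unique_access_bytes(const struct array_ref *refs, size_t n)
{
  size_t total = 0;
  size_t i, j;

  for (i = 0; i < n; i++) {
    int seen = 0;

    /* Check whether an earlier reference used the same variable. */
    for (j = 0; j < i; j++) {
      if (strcmp(refs[i].name, refs[j].name) == 0) {
        seen = 1;
        break;
      }
    }
    if (!seen)
      total += refs[i].elem_size;
  }
  return total;
}
```

Three references to a double array named array plus one reference to a float array therefore yield 1*sizeof(double) + 1*sizeof(float) bytes per iteration.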

Nested loops

Scientific applications usually have nested loops. Naive instrumentation causes two problems:

double counting of the nested loop body
the chiterations = .. statement is used for all loop levels, so the inner loop's chiterations overwrites the value set for the outer loop.

Solutions

The translator uses a bottom-up traversal order: it processes inner loops first, then outer loops.
To avoid double counting FP operations within nested loops, all visited FP operation expressions are stored in a lookup table. Later counting checks whether an operation has already been counted and, if so, skips it.
To avoid double counting variables used in nested loops when counting an outer loop body, the handling differs slightly from that of FP operation expressions: all variables counted in inner loops are excluded when counting an outer loop. Note: a variable a is excluded entirely, rather than flagging individual references to a and excluding only those references later.
- Note: static counting mode does not perform this exclusion, since redundant execution is not a concern there. The loop body's FLOPs are still counted for both inner and outer loops if they are nested.
Rewrite chiterations = .. to chiterations_loopId = .. so that each loop has its own iteration count variable.

   // global chiterations is changed to two local variables: each for one loop
  unsigned long chiterations_1;
  unsigned long chiterations_2;
  for (var = 0; var < 10; var++) {
//Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations_2 = ((1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0) * 1);
    for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
      for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
        int ibreflo1 = 0;
        int ibreflo0 = 0;
        int ibrefhi1 = 10 - 1;
        int ibrefhi0 = 10 - 1;
//Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations_1 = ((1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0) * 1);
        for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
          for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
            coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
          }
        }
/*       aitool generated Loads counting statement ... */
        chloads = chloads + chiterations_1 * (1 * sizeof(double ));
/*       aitool generated FLOPS counting statement ... */
        chflops = chflops + chiterations_1 * 4;
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
    }
/*       aitool generated Stores counting statement ... */
    chstores = chstores + chiterations_2 * (1 * sizeof(double ));
/*       aitool generated FLOPS counting statement ... */
    chflops = chflops + chiterations_2 * 1;
  }
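The lookup-table idea for FP operations can be sketched as follows. This is a simplified illustration: operations are identified by small integer ids instead of AST nodes, and MAX_OPS and count_new_ops are hypothetical names.

```c
#include <stddef.h>

#define MAX_OPS 64  /* hypothetical capacity of the lookup table */

/* Counts the operations in ops[] that are not yet marked in visited[],
   and marks them so that an enclosing loop will skip them later. */
int count_new_ops(const int *ops, size_t n, unsigned char visited[MAX_OPS])
{
  int count = 0;
  size_t i;

  for (i = 0; i < n; i++) {
    int id = ops[i];

    if (id < 0 || id >= MAX_OPS || visited[id])
      continue;  /* out of range or already counted by an inner loop */
    visited[id] = 1;
    count++;
  }
  return count;
}
```

Processing the inner loop first (four additions, ids 0 to 3) and then the outer loop body, which contains the same four operations plus one multiplication (id 4), gives counts of 4 and 1. These are the per-iteration factors in the two generated chflops statements.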

Testing

run all builtin tests

make check

run tests for static analysis only

make check-static

Manual testing

[liao6@tux322:~/workspace/ExReDi/ai_tool.git/translator]m && ./measureTool -c -annot ./src/functionSideEffect.annot -I. ./test/jacobi-v3.c