Optimizing C++/Code optimization/Pipeline

The conditional jump machine language instructions (aka branches), may be generated by many C++ language features, among which there are the if-else, for, while, do-while, and switch-case statements, and by boolean and conditional expressions operators.

Modern processors handle branches efficiently only if they can predict them. In case of prediction error, the steps already done by the pipeline on the following instructions are useless and the processor must restart from the branch destination instruction.

The branch prediction is based on the previous iterations on the same instruction. If the branches follow a regular pattern, the prediction are successful.

The best cases are those in which a branch instruction has always the same effect; in such cases, the prediction is almost always correct. The worst case is that in which the branch instruction has an outcome pattern the opposite of the branch prediction behavior. As predictors usually assume reoccurrence of results, this would be a branch that outcome is always the opposite of the previous outcome. In this case the prediction is never correct and the pipeline is always interrupted. Not all processors use such prediction behavior, with some simple ones always assuming no jump will occur. The case where the branch instruction has a random outcome results in the prediction being on average correct half of the times. However random distribution means that actual results will vary from being always right to always wrong in a Gaussian shaped distribution.

In bottlenecks, the hard-to-predict branches should be avoided. If a branch is predicted very badly, even replacing it with a rather slow sequence of instructions may result in a speed up.

In this section, some techniques are presented to replace branches with equivalent instructions.

Integer interval check

If you have to check whether an integer number i is between two integer numbers min_i and max_i included, and you are sure that min_i <= max_i, use the following expression:

unsigned(i – min_i) <= unsigned(max_i – min_i)

In the given conditions, the above formula is equivalent to the following, more intuitive formula:

min_i <= i && i <= max_i

The former formula performs two differences and one comparison, while the latter formula performs no difference and two comparisons. For pipelined processors, comparisons are slower than differences, because they imply a branch. ^{[dubious – discuss]}

In addition, if min_i is a constant expression with zero value, the two differences disappear.

In particular, to check whether an integer i is a valid index for an array of size elements, that is to perform array bounds checking, use the following expression:

unsigned(i) < unsigned(size)

Obviously, if the expressions are already of an unsigned type, conversions are unneeded.

The binary look-up table

Instead of a conditional expression in which both cases are constants, use a look-up table with two-places.

If you have a statement like the following, where c and d represent constant expressions, and b represents a boolean expression:

a = b ? c : d;

that is equivalent to the following code:

if (b) a = c;
else a = d;

try to replace it with the following code, equivalent but perhaps faster:

static const type lookup_table[] = { d, c };
a = lookup_table[b];

The conditional expression is compiled into a branch. If such a branch is not well predicted, it takes longer than the lookup-table.

This technique may be applied also to a sequence of conditional expressions. For example, instead of the following code:

a = b1 ? c : b2 ? d : b3 ? e : f;

that is equivalent to the following code:

if (b1) a = c;
else if (b2) a = d;
else if (b3) a = e;
else a = f;

try to see if the following code is faster:

static const type lookup_table[] = { f, e, d, d, c, c, c, c };
a = lookup_table[b1 * 4 + b2 * 2 + b3];

Early address calculation

Try to calculate the value of a pointer or iterator somewhat before when you need to access the referenced object.

For example, in a loop, the following statement:

a = *++p;

may be a bit less efficient than the following:

a = *p++;

In the first case, the value of the pointer or iterator is calculated just before it is used to access the referenced object, while in the second case it is computed in the previous iteration. In a pipelined processor, in the second case, the increment of the pointer may be performed simultaneously with the access of the referenced object, while in the first case the two operations must be serialized.