Optimizing C++/Writing efficient code/Performance improving features

Some features of the C++ language, if properly used, allow to increase the speed of the resulting software.

In this section guidelines to exploit such features are presented.

The most efficient types

When defining an object to store an integer number, use the int or the unsigned int type, except when a longer type is needed; when defining an object to store a character, use the char type, except when the wchar_t type is needed; and when defining an object to store a floating point number, use the double type, except when the long double type is needed. If the resulting aggregate object is of medium or large size, replace each integer type with the smallest integer type that is long enough to contain it (but without using bit-fields) and replace the floating point types with the float type, except when greater precision is needed.

The int and unsigned int types are, by definition, the most efficient ones available on the platform that can hold at least a 16-bit range. If you only need 8-bit width and are compiling for an 8-bit processor then char might be more efficient, but otherwise one of the int types is likely to be the most efficient type you can use.

The double type is two to three times less efficient than the float type, but it has greater precision.

Some processor types handle signed char objects more efficiently, while others handle unsigned char objects more efficiently. Therefore, both in C and in C++, the char type, which differs from the signed char type, was defined as the most efficient character type for the target processor.

The char type can contain only small character sets; typically up to a maximum of 255 distinct characters. To handle bigger character sets, you should use the wchar_t type, although that is less efficient.

In the case of numbers contained in a medium or large aggregate object or in a collection that will be probably be of medium or large size, it is better to minimize the size in bytes of the aggregate object or collection. This can be done by replacing primitives larger than word size with those that are word size for the processor. A short actually takes up the same amount of memory as word size even though the field size of a short is less.

Bit-fields can also be used to minimize the size of aggregate objects, but as their handling is slower this can be counterproductive. Therefore, postpone their introduction until the optimization stage.

Function-objects

Instead of passing a function pointer as an argument to a function, pass a function-object (or, if using the C++11 standard, a lambda expression).

For example, if you have the following array of structures:

struct S {
    int a, b;
};
S arr[n_items];

… and you want to sort it by the b field, you could define the following comparison function:

bool compare(const S& s1, const S& s2) {
    return s1.b < s2.b;
}

… and pass it to the standard sort algorithm:

std::sort(arr, arr + n_items, compare);

However, it is probably more efficient to define the following function-object class (aka functor):

struct Comparator {
    bool operator()(const S& s1, const S& s2) const {
        return s1.b < s2.b;
    }
};

… and pass a temporary instance of it to the standard sort algorithm:

std::sort(arr, arr + n_items, Comparator());

Function-objects are usually expanded inline and are therefore as efficient as in-place code, while functions passed by pointers are rarely inlined. Lambda expressions are implemented as function-objects, so they have the same performance.

`qsort` and `bsearch` functions

Instead of the qsort and bsearch C standard library functions, use the std::sort and std::lower_bound C++ standard library functions.

The former two functions require a function pointer as an argument, whereas the latter two may take a function-object argument (or, using the C++11 standard, a lambda expression). Pointers to functions are often not expanded inline and are therefore less efficient than function-objects, which are almost always inlined.

Encapsulated collections

Encapsulate (using a class) a collection that is accessible from several compilation units.

At design time, it is difficult to decide which data structure will have optimal performance when the software is used. At optimization time, performance can be measured and it can be seen whether changes to the container type result in improvements, for example changing from std::vector to std::list. Such implementation changes can however propagate to users of the code.

If a collection is private to one compilation unit, implementation changes will only impact the source code of that unit and encapsulation of the collection is unnecessary. If, however, the collection is not private (in other words, it is directly accessible from other compilation units) an implementation change could result in extensive change being necessary. To make such optimization feasible, therefore, encapsulate the collection in a class whose interface does not change when the container implementation is changed.

STL containers already use this principle, but certain operations are still available only for some containers (like operator[], existing for std::vector, but not for std::list).

STL container usage

When using an STL container, if several equivalent expressions have the same performance, choose the more general expression.

For instance, call a.empty() instead of a.size() == 0, call iter != a.end() instead of iter < a.end(), and call distance(iter1, iter2) instead of iter2 - iter1. The former expressions are valid for every container type, while the latter are valid only for some. The former are also no less efficient than the latter and may even be more efficient. For example, to get the size of a linked list the list must be traversed, whereas to see that it is empty is a constant time operation.

Unfortunately, it is not always possible to write code that is equally correct and efficient for every type of container. Nevertheless, decreasing the number of statements that are dependent on the container type will decrease the number of statements that must be changed if the type of the container is later changed.

Choice of the default container

When choosing a variable-length container, if in doubt, choose a vector.

For a data-set with a small number of elements, vector is the most efficient variable-length container for any operation.

For larger collections, other containers may become more efficient for certain operations, but vector still has the lowest space overhead (as long as there is no excess capacity) and the greatest locality of reference.

Inlined functions

If your compiler allows whole program optimization and automatic inline-expansion of functions, use such options and do not declare any functions inline. If such compiler features are not available, declare suitable functions as inline in a header; suitable functions contain no more than three lines of code and have no loops.

Inline function-expansion avoids the function call overhead. The overhead grows as the number of function arguments increases. In addition, since inline code is near to the caller code, it has better locality of reference. And because the intermediate code generated by the compiler for inlined functions is merged with the caller code, it can be optimized more easily by the compiler.

Expanding inline a tiny function, such as a function containing only a simple assignment or a simple return statement, can result in a decrease in the size of the generated machine code.

Conversely, every time a function containing substantial code is inlined the machine code is duplicated and the total size of the program increases. Increasing the size of the program also will likely decrease the performance of your instruction cache, increasing latency.

Inlined code is more difficult to profile. If a non-inlined function is a bottleneck, it can be found by the profiler. But if the same function is inlined wherever it is called, its run-time is scattered among many functions and the bottleneck cannot be detected by the profiler.

For functions containing substantial amounts of code, only performance critical ones should be declared inline during optimization.

Symbols representation

To represent internal symbols, use enumerations instead of strings.

For example, instead of the following code:

const char* const directions[] = { "North", "South", "East", "West" };

use the following code:

enum directions { North, South, East, West };

An enumeration is implemented as an integer. Compared to an integer, a string occupies more space and is slower to copy and compare. (In addition, using strings instead of integers to represent internal state may introduce string comparison errors in code that deals with multiple locales.)

`if` and `switch` statements

If you have to compare an integer value with a set of constant values, instead of a sequence of if statements, use a switch statement.

For example, instead of the following code:

if (a[i] == 1) f();
else if (a[i] == 2) g();
else if (a[i] == 5) h();

write the following code:

switch (a[i]) {
case 1: f(); break;
case 2: g(); break;
case 5: h(); break;
}

Compilers may exploit the regularity of switch statements to apply some optimizations, in particular if the guideline "Case values for switch statements" in this section is applied.

Case values of `switch` statements

As constants for switch statements cases, use compact sequences of values, that is, sequences with no gaps or with few small gaps.

When compiling a switch statement whose case values comprise most of the values in an integer interval, instead of generating a sequence of if statements, an optimizing compiler will generate a jump-table. The table is an array containing the start address of the code for each case. When executing the switch statement, the table is used to jump to the code associated with the case number.

For example, the following C++ code:

switch (i) {
case 10:
case 13:
    func_a();
    break;
case 11:
    func_b();
    break;
}

probably generates machine code corresponding to the following pseudo-code:

// N.B.: This is not C++ code
static address jump_table[] = { case_a, case_b, end, case_a };
unsigned int index = i - 10;
if (index > 3) goto end;
goto jump_table[index];
case_a: func_a(); goto end;
case_b: func_b();
end:

Instead, the following C++ code:

switch (i) {
case 100:
case 130:
    func_a();
    break;
case 110:
    func_b();
    break;
}

probably generates machine code corresponding to the following code:

if (i == 100) goto case_a;
if (i == 130) goto case_a;
if (i == 110) goto case_b;
goto end;
case_a: func_a(); goto end;
case_b: func_b();
end:

For so few cases, there is probably little difference between the two situations, but as the case count increases, the former code becomes more efficient, as it performs only one computed goto instead of a sequence of branches.

Case order in `switch` statement

In switch statements, put typical cases first.

If the compiler does not use a jump-table, cases are evaluated in order of appearance; therefore, fewer comparisons are performed for the more frequent cases.

Grouping function arguments

In a loop that calls a function with more arguments than there are registers, consider passing a struct or object instead.

For example, instead of the following code:

for (int i = 0; i < 1000; ++i) {
    f(i, a1, a2, a3, a4, a5, a6, a7, a8);
}

consider writing the following:

struct {
    int i;
    type a1, a2, a3, a4, a5, a6, a7, a8;
} s;
s.a1 = a1; s.a2 = a2; s.a3 = a3; s.a4 = a4;
s.a5 = a5; s.a6 = a6; s.a7 = a7; s.a8 = a8;
for (int i = 0; i < 1000; ++i) {
    s.i = i;
    f(s);
}

If all function arguments can be placed directly into processor registers, the arguments can be passed and manipulated quickly. If there are more arguments than available registers, those arguments that could not be placed into registers will be pushed onto the stack at the start of every function call and removed from the stack at the end of the call. If a structure or object is passed, a register may be used and after initialization of the structure or object, only those parts of the structure or object that change between successive calls must be updated.

Compilers vary in the number of registers used for function arguments. Relying on the number used by any particular compiler version is unwise. Assuming that 4 registers are used is reasonable.

Use of container member functions

To search for an element in a container, use a container member function instead of an STL algorithm.

If a container provides a member function that duplicates a generic STL algorithm it is because the member function is more efficient.

For example, to search a std::set object, you can use the std::find generic algorithm, or the std::set::find member function. The former has linear complexity (O(n)), while the latter has logarithmic complexity (O(log(n))).

Search in sorted sequences

To search a sorted sequence, use the std::lower_bound, std::upper_bound, std::equal_range, or std::binary_search generic algorithms.

Given that all the cited algorithms use a logarithmic complexity (O(log(n))) binary search, they are faster than the std::find algorithm, which uses a linear complexity (O(n)) sequential scan.

`static` member functions

In every class, declare every member function that does not access the non-static members of the class as static .

In other words, declare all the member functions that you can as static.

In this way, the implicit this argument is not passed.

Optimizing C++/Writing efficient code/Performance improving features

The most efficient types

Function-objects

qsort and bsearch functions

Encapsulated collections

STL container usage

Choice of the default container

Inlined functions

Symbols representation

if and switch statements

Case values of switch statements

Case order in switch statement

Grouping function arguments

Use of container member functions

Search in sorted sequences

static member functions

`qsort` and `bsearch` functions

`if` and `switch` statements

Case values of `switch` statements

Case order in `switch` statement

`static` member functions