x86 Disassembly/Calling Conventions

Calling Conventions

A calling convention is a standardized scheme for how functions are implemented and called at the machine level. It specifies how a compiler sets up a call to a subroutine. In theory, code from any compiler can be linked together, so long as all the functions use the same calling convention. In practice, however, this is not always the case.

Calling conventions specify how arguments are passed to a function, how return values are passed back out of a function, how the function is called, and how the function manages the stack and its stack frame. In short, the calling convention specifies how a function call in C or C++ is converted into assembly language. Needless to say, there are many ways for this translation to occur, which is why it's so important to specify certain standard methods. If these standard conventions did not exist, it would be nearly impossible for programs created using different compilers to communicate and interact with one another.

There are three major calling conventions that are used with the C language on 32-bit x86 processors: STDCALL, CDECL, and FASTCALL. In addition, there is another calling convention typically used with C++: THISCALL.^[1] There are other calling conventions as well, including PASCAL and FORTRAN conventions, among others.

Other processors, such as AMD64 processors (also called x86-64 processors), each have their own calling convention.^[2]^[3]

Notes on Terminology

There are a few terms that we are going to be using which are mostly common sense, but which are worthy of stating directly:

Passing arguments: "passing arguments" is a way of saying that the calling function writes data in the place where the called function will look for it. Arguments are passed before the call instruction is executed.

Right-to-Left and Left-to-Right: These describe the order in which arguments are passed to the subroutine, in terms of the high-level code. For instance, the following C function call:

MyFunction(a, b);

will generate the following code if passed Left-to-Right:

push a
push b
call _MyFunction

and will generate the following code if passed Right-to-Left:

push b
push a
call _MyFunction
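
The difference between the two orders can be modeled in C with a tiny downward-growing stack. This is only an illustrative sketch: the Stack type and the helper functions are hypothetical names made up for this example, not part of any compiler or ABI.

```c
#include <stddef.h>

/* Hypothetical model of a downward-growing stack of 32-bit slots. */
typedef struct {
    int slots[16];
    size_t sp;          /* index of the top of the stack; 16 means empty */
} Stack;

void stack_init(Stack *s) { s->sp = 16; }

/* push: move the stack pointer down one slot, then store the value */
void push(Stack *s, int value) { s->slots[--s->sp] = value; }

/* peek: read the slot `depth` entries above the top (0 = top of stack) */
int peek(const Stack *s, size_t depth) { return s->slots[s->sp + depth]; }

/* Right-to-Left: the last argument is pushed first, so `a` ends up on top. */
void pass_right_to_left(Stack *s, int a, int b) { push(s, b); push(s, a); }

/* Left-to-Right: the first argument is pushed first, so `b` ends up on top. */
void pass_left_to_right(Stack *s, int a, int b) { push(s, a); push(s, b); }
```

With Right-to-Left passing, the first argument always sits at the top of the stack, next to where the return address will be pushed, so the callee can find it without knowing how many arguments follow.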

Return value: Some functions return a value, and that value must be received reliably by the function's caller. The called function places its return value in a place where the calling function can get it when execution returns. The called function stores the return value before executing the ret instruction.

Cleaning the stack: When arguments are pushed onto the stack, eventually they must be popped back off again. Whichever function, the caller or the callee, is responsible for cleaning the stack must reset the stack pointer to eliminate the passed arguments.

Calling function (the caller): The "parent" function that calls the subroutine. Execution resumes in the calling function directly after the subroutine call, unless the program terminates inside the subroutine.

Called function (the callee): The "child" function that gets called by the "parent."

Name Decoration: When C code is translated to assembly code, the compiler will often "decorate" the function name by adding extra information that the linker will use to find and link to the correct functions. For most calling conventions, the decoration is very simple (often only an extra symbol or two to denote the calling convention), but in some extreme cases (notably C++ member functions, which typically use the "thiscall" convention), the names are "mangled" severely.

Entry sequence (the function prologue): a few instructions at the beginning of a function, which prepare the stack and registers for use within the function.

Exit sequence (the function epilogue): a few instructions at the end of a function, which restore the stack and registers to the state expected by the caller, and return to the caller. Some calling conventions clean the stack in the exit sequence.

Call sequence: a few instructions in the middle of a function (the caller) which pass the arguments and call the called function. After the called function has returned, some calling conventions have one more instruction in the call sequence to clean the stack.

Standard C Calling Conventions

Most C compilers for 32-bit x86 use the CDECL calling convention by default, but most also allow the programmer to specify another convention via a specifier keyword. These keywords are not part of the ISO/ANSI C standard, so you should always check your compiler documentation for implementation specifics.

If a calling convention other than CDECL is to be used, or if CDECL is not the default for your compiler, and you want to manually use it, you must specify the calling convention keyword in the function declaration itself, and in any prototypes for the function. This is important because both the calling function and the called function need to know the calling convention.
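
As a hedged illustration of keeping the prototype and the definition in agreement, the sketch below hides the compiler-specific keyword behind a macro. The macro name CALLCONV and the function AddTwo are hypothetical; the keyword spellings are the ones used by Microsoft compilers (__stdcall) and by GCC-compatible compilers on 32-bit x86 (the stdcall attribute). On any other target the macro expands to nothing.

```c
/* CALLCONV is a hypothetical macro name chosen for this example. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define CALLCONV __stdcall
#elif defined(__GNUC__) && defined(__i386__)
#  define CALLCONV __attribute__((stdcall))
#else
#  define CALLCONV   /* no convention keyword is applied on this target */
#endif

/* Prototype: this is where the caller learns the convention. */
int CALLCONV AddTwo(int a, int b);

/* Definition: the callee is compiled with the same convention. */
int CALLCONV AddTwo(int a, int b)
{
    return a + b;
}
```

If the prototype and the definition disagreed, the caller and the callee could each assume the other one cleans the stack, or both could clean it.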

CDECL

In the CDECL calling convention the following holds:

Arguments are passed on the stack in Right-to-Left order, and return values are passed in eax.
The calling function cleans the stack. This allows CDECL functions to have variable-length argument lists (aka variadic functions). For this reason the number of arguments is not appended to the name of the function by the compiler, and the assembler and the linker are therefore unable to determine if an incorrect number of arguments is used.

Variadic functions usually have special entry code, generated by the va_start() and va_arg() macros from <stdarg.h>.
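
A minimal sketch of a variadic function using the standard <stdarg.h> macros is shown here. The function name SumInts is hypothetical. Note that the callee only knows how many arguments to read because the caller tells it, which is one reason the caller is left in charge of cleaning the stack.

```c
#include <stdarg.h>

/* Sums the `count` int arguments that follow the fixed parameter. */
int SumInts(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);          /* start reading just past `count` */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int); /* fetch the next int argument */
    va_end(args);

    return total;
}
```

A call such as SumInts(3, 1, 2, 3) passes four values in total: the count and the three numbers to add.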

Consider the following C function:

int __cdecl MyFunction1(int a, int b)
{
  return a + b;
}

and the following function call:

 x = MyFunction1(2, 3);

These would produce the following assembly listings, respectively:

_MyFunction1:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
add eax, edx
pop ebp
ret

and

push 3
push 2
call _MyFunction1
add esp, 8

When translated to assembly code, CDECL functions are almost always prepended with an underscore (that's why all previous examples have used "_" in the assembly code).
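
To make the stack traffic in the CDECL listings concrete, here is a toy simulation in C that mirrors them instruction by instruction. It is only a sketch: the Machine type and its helpers are hypothetical, the return address is a dummy value, and nothing here reflects how a real CPU or compiler is implemented.

```c
#include <stdint.h>

/* Hypothetical toy machine: 64 bytes of stack, esp/ebp as byte offsets. */
typedef struct {
    int32_t mem[16];   /* stack memory, one element per 4-byte slot */
    uint32_t esp;      /* stack pointer (byte offset into mem) */
    uint32_t ebp;      /* frame pointer (byte offset into mem) */
    int32_t eax, edx;
} Machine;

static void m_push(Machine *m, int32_t v) { m->esp -= 4; m->mem[m->esp / 4] = v; }
static int32_t m_pop(Machine *m) { int32_t v = m->mem[m->esp / 4]; m->esp += 4; return v; }
static int32_t m_load(const Machine *m, uint32_t addr) { return m->mem[addr / 4]; }

/* Mirrors the body of _MyFunction1. */
static void callee_MyFunction1(Machine *m)
{
    m_push(m, (int32_t)m->ebp);        /* push ebp            */
    m->ebp = m->esp;                   /* mov ebp, esp        */
    m->eax = m_load(m, m->ebp + 8);    /* mov eax, [ebp + 8]  */
    m->edx = m_load(m, m->ebp + 12);   /* mov edx, [ebp + 12] */
    m->eax += m->edx;                  /* add eax, edx        */
    m->ebp = (uint32_t)m_pop(m);       /* pop ebp             */
    (void)m_pop(m);                    /* ret: discard the return address */
}

/* Mirrors the call sequence; returns esp afterwards so it can be checked. */
uint32_t call_cdecl(Machine *m, int32_t a, int32_t b)
{
    m_push(m, b);                      /* push the second argument      */
    m_push(m, a);                      /* push the first argument       */
    m_push(m, 0);                      /* call: push a dummy return address */
    callee_MyFunction1(m);
    m->esp += 8;                       /* add esp, 8 (caller cleans up) */
    return m->esp;
}
```

Starting with esp at 64, call_cdecl(&m, 2, 3) leaves 5 in eax and esp back at 64. The first argument is found at [ebp + 8] because the saved ebp and the return address occupy the two slots below it.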

STDCALL

STDCALL, also known as "WINAPI" (and a few other names, depending on where you are reading it) is used almost exclusively by Microsoft as the standard calling convention for the Win32 API. Since STDCALL is strictly defined by Microsoft, all compilers that implement it do it the same way.

STDCALL passes arguments right-to-left, and returns the value in eax. (The Microsoft documentation erroneously claimed that arguments are passed left-to-right, but this is not the case.)
The called function cleans the stack, unlike CDECL. This means that STDCALL doesn't allow variable-length argument lists.

Consider the following C function:

int __stdcall MyFunction2(int a, int b)
{
   return a + b;
}

and the calling instruction:

 x = MyFunction2(2, 3);

These will produce the following respective assembly code fragments:

_MyFunction2@8:
push ebp
mov ebp, esp
mov eax, [ebp + 8]
mov edx, [ebp + 12]
add eax, edx
pop ebp
ret 8

and

push 3
push 2
call _MyFunction2@8

There are a few important points to note here:

In the function body, the ret instruction has an (optional) argument that indicates how many bytes to pop off the stack when the function returns.
STDCALL functions are name-decorated with a leading underscore, followed by the function name, an @, and then the number of bytes of arguments passed on the stack. On 32-bit x86 this number is always a multiple of 4, because each stack argument occupies at least one 4-byte slot.
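
The decoration rule can be written down as a short C helper. This is a sketch under a simplifying assumption: every argument is taken to occupy exactly one 4-byte stack slot. The function name DecorateStdcall is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Builds the STDCALL-decorated form of `name` for a function taking
 * `arg_count` arguments, assuming one 4-byte slot per argument.
 * Returns the number of characters snprintf would have written.
 */
int DecorateStdcall(char *out, size_t out_size, const char *name, unsigned arg_count)
{
    return snprintf(out, out_size, "_%s@%u", name, arg_count * 4u);
}
```

For MyFunction2 with its two int arguments, this yields _MyFunction2@8, matching the label in the listing above.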

FASTCALL

The FASTCALL calling convention is not completely standard across all compilers, so it should be used with caution. In FASTCALL, the first 2 or 3 arguments of 32 bits or smaller are passed in registers: Microsoft and GCC compilers use ecx and edx, while Borland compilers use eax, edx, and ecx. Additional arguments, and arguments larger than 4 bytes, are passed on the stack, often in Right-to-Left order (similar to CDECL). In the Microsoft and GCC variants the called function cleans any stack arguments, as in STDCALL; check your compiler documentation for other implementations.

Because of the ambiguities, it is recommended that FASTCALL be used only in situations with 1, 2, or 3 32-bit arguments, where speed is essential.

The following C function:

int __fastcall MyFunction3(int a, int b)
{
   return a + b;
}

and the following C function call:

x = MyFunction3(2, 3);

will produce the following assembly code fragments for the called and the calling functions, respectively:

@MyFunction3@8:
push ebp
mov ebp, esp ;many compilers create a stack frame even if it isn't used
add eax, edx ;a is in eax, b is in edx
pop ebp
ret

and

;the calling function
mov eax, 2
mov edx, 3
call @MyFunction3@8

The name decoration for FASTCALL prepends an @ to the function name, and follows the function name with @x, where x is the number (in bytes) of arguments passed to the function.

Many compilers still produce a stack frame for FASTCALL functions, especially in situations where the FASTCALL function itself calls another subroutine. However, if a FASTCALL function doesn't need a stack frame, optimizing compilers are free to omit it.

Commonly, the GCC and Microsoft FASTCALL conventions place parameters one and two in ecx and edx, respectively, and push any remaining parameters onto the stack. Calling MyFunction3 using this convention would look like:

;the calling function
mov ecx, 2
mov edx, 3
call @MyFunction3@8

C++ Calling Convention

C++ requires that non-static methods of a class be called on an instance of the class. To ensure that a pointer to the object is passed to the function, such methods use their own calling convention: THISCALL.

THISCALL

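In THISCALL, as implemented by Microsoft compilers, the pointer to the class object (the this pointer) is passed in ecx, the arguments are passed Right-to-Left on the stack, and the return value is passed in eax. For functions with a fixed number of arguments, the called function cleans the stack.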

For instance, the following C++ statement:

 MyObj.MyMethod(a, b, c);

would produce the following assembly code:

lea ecx, MyObj ;ecx = address of the object (the this pointer)
push c
push b
push a
call _MyMethod

At least, it would look like the assembly code above if it weren't for name mangling.
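
One way to think about THISCALL is that a method is an ordinary function with one extra, hidden pointer argument. The C sketch below spells that hidden argument out explicitly. The names MyObjType, MyObjType_MyMethod, and self are hypothetical, and a C compiler will pass self like any other argument rather than in ecx; the point is only to show what the callee receives.

```c
/* Hypothetical C stand-in for a C++ class with one data member. */
typedef struct {
    int base;
} MyObjType;

/*
 * C equivalent of a method `int MyMethod(int a, int b, int c)`.
 * The `self` parameter plays the role of the hidden `this` pointer
 * that THISCALL delivers in ecx.
 */
int MyObjType_MyMethod(MyObjType *self, int a, int b, int c)
{
    return self->base + a + b + c;
}
```

When reading disassembly, a function that dereferences ecx before ever writing to it is a strong hint that ecx carries a this pointer.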

Name Mangling

Because of the complexities inherent in function overloading, C++ functions are heavily name-decorated to the point that people often refer to the process as "Name Mangling." Unfortunately C++ compilers are free to do the name-mangling differently since the standard does not enforce a convention. Additionally, other issues such as exception handling are also not standardized.

Since every compiler does the name-mangling differently, this book will not spend much time discussing the specifics of the algorithm. Note, however, that in many cases it is possible to determine which compiler created an executable by examining the specifics of its name-mangling format.

Here are a few general remarks about THISCALL name-mangled functions:

They are recognizable on sight because of their complexity when compared to CDECL, FASTCALL, and STDCALL function name decorations.
They sometimes include the name of the function's class.
They almost always include the number and type of the arguments, so that overloaded functions can be differentiated by the arguments passed to them.

Here is an example of a C++ class and member function declaration:

 class MyClass {
 public:
   int MyFunction(int a) { return a; }
 };

And here is the resultant mangled name, as produced by Microsoft Visual C++:

?MyFunction@MyClass@@QAEHH@Z
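
Even without decoding the full scheme, the function and class names can be pulled out of a simple name of this shape. The following C sketch does only that: it assumes the input looks like "?Function@Class@@..." and ignores the encoded type information that follows. The function name SplitSimpleMangledName is hypothetical, and the sketch does not handle nested classes, namespaces, templates, operators, or other compilers' formats.

```c
#include <stddef.h>
#include <string.h>

/*
 * Extracts the function and class names from "?Function@Class@@...".
 * Returns 1 on success, 0 if the input does not have that shape or an
 * output buffer is too small.
 */
int SplitSimpleMangledName(const char *mangled,
                           char *func, size_t func_size,
                           char *cls, size_t cls_size)
{
    const char *func_start, *func_end, *cls_start, *cls_end;

    if (mangled[0] != '?')
        return 0;
    func_start = mangled + 1;
    func_end = strchr(func_start, '@');      /* end of the function name */
    if (func_end == NULL)
        return 0;
    cls_start = func_end + 1;
    cls_end = strstr(cls_start, "@@");       /* end of the class name */
    if (cls_end == NULL)
        return 0;
    if ((size_t)(func_end - func_start) >= func_size ||
        (size_t)(cls_end - cls_start) >= cls_size)
        return 0;

    memcpy(func, func_start, (size_t)(func_end - func_start));
    func[func_end - func_start] = '\0';
    memcpy(cls, cls_start, (size_t)(cls_end - cls_start));
    cls[cls_end - cls_start] = '\0';
    return 1;
}
```

Applied to the name above, the sketch reports MyFunction as the function and MyClass as the class.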

Extern "C"

In a C++ source file, functions placed in an extern "C" block are guaranteed not to be mangled. This is done frequently when libraries are written in C++, and the functions need to be exported without being mangled. Even though the program is written in C++ and compiled with a C++ compiler, some of the functions might therefore not be mangled and will use one of the ordinary C calling conventions (typically CDECL).
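
A common way to arrange this is a header that can be included from both C and C++ code. The sketch below shows the usual pattern; the function name MyLib_Add is hypothetical. The __cplusplus macro is defined only by C++ compilers, so a C compiler never sees the extern "C" lines.

```c
/* Header-style declarations; the names here are hypothetical. */
#ifdef __cplusplus
extern "C" {            /* seen only by a C++ compiler */
#endif

int MyLib_Add(int a, int b);

#ifdef __cplusplus
}
#endif

/* Definition; in a real project this would live in a separate source file. */
int MyLib_Add(int a, int b)
{
    return a + b;
}
```

Note that extern "C" only turns off C++ name mangling. The simple decorations of the C calling conventions, such as a leading underscore, may still be applied.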

Note on Name Decorations

We've been discussing name decorations in this chapter, but the fact is that in pure disassembled code there typically are no names whatsoever, especially not names with fancy decorations. The assembly stage removes all these readable identifiers, and replaces them with the binary locations instead. Function names really only appear in two places:

Listing files produced during compilation
In export tables, if functions are exported

When disassembling raw machine code, there will be no function names and no name decorations to examine. For this reason, you will need to pay more attention to the way parameters are passed, the way the stack is cleaned, and other similar details.
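
As a rough illustration of that kind of reasoning, the C sketch below guesses who cleaned the stack from three numbers a reader might collect by hand from a disassembly. All names are hypothetical, and the heuristic is deliberately crude: it cannot distinguish FASTCALL or THISCALL, and optimized code may defer or merge stack adjustments.

```c
typedef enum { CONV_CDECL_LIKE, CONV_STDCALL_LIKE, CONV_UNKNOWN } ConvGuess;

/*
 * ret_imm:       immediate operand of the callee's ret (0 for a plain ret)
 * caller_adjust: bytes added to esp right after the call (0 if none)
 * pushed_bytes:  bytes of arguments pushed before the call
 */
ConvGuess GuessConvention(unsigned ret_imm, unsigned caller_adjust, unsigned pushed_bytes)
{
    if (pushed_bytes == 0)
        return CONV_UNKNOWN;             /* nothing to clean, nothing to learn */
    if (ret_imm == pushed_bytes && caller_adjust == 0)
        return CONV_STDCALL_LIKE;        /* callee cleaned the stack */
    if (ret_imm == 0 && caller_adjust == pushed_bytes)
        return CONV_CDECL_LIKE;          /* caller cleaned the stack */
    return CONV_UNKNOWN;
}
```

For the CDECL example in this chapter the inputs would be (0, 8, 8), and for the STDCALL example they would be (8, 0, 8).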

While we haven't covered optimizations yet, suffice it to say that optimizing compilers can even make a mess out of these details. Functions which are not exported do not necessarily need to maintain standard interfaces, and if it is determined that a particular function does not need to follow a standard convention, some of the details will be optimized away. In these cases, it can be difficult to determine what calling conventions were used (if any), and it is even difficult to determine where a function begins and ends. This book cannot account for all possibilities, so we try to show as much information as possible, with the knowledge that much of the information provided here will not be available in a true disassembly situation.