# x86 Disassembly/Optimization Examples

## Example: Optimized vs Non-Optimized Code

The following example is adapted from an algorithm presented in Knuth(vol 1, chapt 1) used to find the greatest common denominator of 2 integers. Compare the listing file of this function when compiler optimizations are turned on and off.

``` /*line 1*/
int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
{
int q, r; /*q is the quotient, r is the remainder*/
while(1)
{
q = m / n; /*find q and r*/
r = m % n;
if(r == 0) /*if r is 0, return our n value*/
{
return n;
}
m = n; /*set m to the current n value*/
n = r; /*set n to our current remainder value*/
} /*repeat*/
}
```

Compiling with the Microsoft C compiler, we generate a listing file using no optimization:

``` PUBLIC	_EuclidsGCD
_TEXT	SEGMENT
_r\$ = -8	; size = 4
_q\$ = -4	; size = 4
_m\$ = 8	; size = 4
_n\$ = 12	; size = 4
_EuclidsGCD PROC NEAR
; Line 2
push	ebp
mov	ebp, esp
sub	esp, 8
\$L477:
; Line 4
mov	eax, 1
test	eax, eax
je	SHORT \$L473
; Line 6
mov	eax, DWORD PTR _m\$[ebp]
cdq
idiv	DWORD PTR _n\$[ebp]
mov	DWORD PTR _q\$[ebp], eax
; Line 7
mov	eax, DWORD PTR _m\$[ebp]
cdq
idiv	DWORD PTR _n\$[ebp]
mov	DWORD PTR _r\$[ebp], edx
; Line 8
cmp	DWORD PTR _r\$[ebp], 0
jne	SHORT \$L479
; Line 10
mov	eax, DWORD PTR _n\$[ebp]
jmp	SHORT \$L473
\$L479:
; Line 12
mov	ecx, DWORD PTR _n\$[ebp]
mov	DWORD PTR _m\$[ebp], ecx
; Line 13
mov	edx, DWORD PTR _r\$[ebp]
mov	DWORD PTR _n\$[ebp], edx
; Line 14
jmp	SHORT \$L477
\$L473:
; Line 15
mov	esp, ebp
pop	ebp
ret	0
_EuclidsGCD ENDP
_TEXT	ENDS
END
```

Notice how there is a very clear correspondence between the lines of C code, and the lines of the ASM code. the addition of the "; line x" directives is very helpful in that respect.

Next, we compile the same function using a series of optimizations to stress speed over size:

```cl.exe /Tceuclids.c /Fa /Ogt2
```

and we produce the following listing:

``` PUBLIC	_EuclidsGCD
_TEXT	SEGMENT
_m\$ = 8	; size = 4
_n\$ = 12       ; size = 4
_EuclidsGCD PROC NEAR
; Line 7
mov	eax, DWORD PTR _m\$[esp-4]
push	esi
mov	esi, DWORD PTR _n\$[esp]
cdq
idiv	esi
mov	ecx, edx
; Line 8
test	ecx, ecx
je	SHORT \$L563
\$L547:
; Line 12
mov	eax, esi
cdq
idiv	ecx
; Line 13
mov	esi, ecx
mov	ecx, edx
test	ecx, ecx
jne	SHORT \$L547
\$L563:
; Line 10
mov	eax, esi
pop	esi
; Line 15
ret	0
_EuclidsGCD ENDP
_TEXT	ENDS
END
```

As you can see, the optimized version is significantly shorter then the non-optimized version. Some of the key differences include:

• The optimized version does not prepare a standard stack frame. This is important to note, because many times new reversers assume that functions always start and end with proper stack frames, and this is clearly not the case. EBP isnt being used, ESP isnt being altered (because the local variables are kept in registers, and not put on the stack), and no subfunctions are called. 5 instructions are cut by this.
• The "test EAX, EAX" series of instructions in the non-optimized output, under ";line 4" is all unnecessary. The while-loop is defined by "while(1)" and therefore the loop always continues. this extra code is safely cut out. Notice also that there is no unconditional jump in the loop like would be expected: the "if(r == 0) return n;" instruction has become the new loop condition.
• The structure of the function is altered greatly: the division of m and n to produce q and r is performed in this function twice: once at the beginning of the function to initialize, and once at the end of the loop. Also, the value of r is tested twice, in the same places. The compiler is very liberal with how it assigns storage in the function, and readily discards values that are not needed.

## Example: Manual Optimization

The following lines of assembly code are not optimized, but they can be optimized very easily. Can you find a way to optimize these lines?

```mov	eax, 1
test	eax, eax
je	SHORT \$L473
```

The code in this line is the code generated for the "while( 1 )" C code, to be exact, it represents the loop break condition. Because this is an infinite loop, we can assume that these lines are unnecessary.

"mov eax, 1" initializes eax.

the test immediately afterwards tests the value of eax to ensure that it is nonzero. because eax will always be nonzero (eax = 1) at this point, the conditional jump can be removed along whith the "mov" and the "test".

The assembly is actually checking whether 1 equals 1. Another fact is, that the C code for an infinite FOR loop:

``` for( ; ; )
{
...
}
```

would not create such a meaningless assembly code to begin with, and is logically the same as "while( 1 )".

## Example: Trace Variables

Here are the C code and the optimized assembly listing from the EuclidGCD function, from the example above. Can you determine which registers contain the variables r and q?

``` /*line 1*/
int EuclidsGCD(int m, int n) /*we want to find the GCD of m and n*/
{
int q, r; /*q is the quotient, r is the remainder*/
while(1)
{
q = m / n; /*find q and r*/
r = m % n;
if(r == 0) /*if r is 0, return our n value*/
{
return n;
}
m = n; /*set m to the current n value*/
n = r; /*set n to our current remainder value*/
} /*repeat*/
}
```
``` PUBLIC	_EuclidsGCD
_TEXT	SEGMENT
_m\$ = 8	; size = 4
_n\$ = 12       ; size = 4
_EuclidsGCD PROC NEAR
; Line 7
mov	eax, DWORD PTR _m\$[esp-4]
push	esi
mov	esi, DWORD PTR _n\$[esp]
cdq
idiv	esi
mov	ecx, edx
; Line 8
test	ecx, ecx
je	SHORT \$L563
\$L547:
; Line 12
mov	eax, esi
cdq
idiv	ecx
; Line 13
mov	esi, ecx
mov	ecx, edx
test	ecx, ecx
jne	SHORT \$L547
\$L563:
; Line 10
mov	eax, esi
pop	esi
; Line 15
ret	0
_EuclidsGCD ENDP
_TEXT	ENDS
END
```

At the beginning of the function, eax contains m, and esi contains n. When the instruction "idiv esi" is executed, eax contains the quotient (q), and edx contains the remainder (r). The instruction "mov ecx, edx" moves r into ecx, while q is not used for the rest of the loop, and is therefore discarded.

## Example: Decompile Optimized Code

Below is the optimized listing file of the EuclidGCD function, presented in the examples above. Can you decompile this assembly code listing into equivalent "optimized" C code? How is the optimized version different in structure from the non-optimized version?

``` PUBLIC	_EuclidsGCD
_TEXT	SEGMENT
_m\$ = 8	; size = 4
_n\$ = 12       ; size = 4
_EuclidsGCD PROC NEAR
; Line 7
mov	eax, DWORD PTR _m\$[esp-4]
push	esi
mov	esi, DWORD PTR _n\$[esp]
cdq
idiv	esi
mov	ecx, edx
; Line 8
test	ecx, ecx
je	SHORT \$L563
\$L547:
; Line 12
mov	eax, esi
cdq
idiv	ecx
; Line 13
mov	esi, ecx
mov	ecx, edx
test	ecx, ecx
jne	SHORT \$L547
\$L563:
; Line 10
mov	eax, esi
pop	esi
; Line 15
ret	0
_EuclidsGCD ENDP
_TEXT	ENDS
END
```

Altering the conditions to maintain the same structure gives us:

``` int EuclidsGCD(int m, int n)
{
int r;
r = m % n;
if(r != 0)
{
do
{
m = n;
r = m % r;
n = r;
}while(r != 0)
}
return n;
}
```

It is up to the reader to compile this new "optimized" C code, and determine if there is any performance increase. Try compiling this new code without optimizations first, and then with optimizations. Compare the new assembly listings to the previous ones.

## Example: Instruction Pairings

Q
Why does the dec/jne combo operate faster than the equivalent loopnz?
A
The dec/jnz pair operates faster than a loopnz for several reasons. First, dec and jnz pair up in the different modules of the netburst pipeline, so they can be executed simultaneously. Top that off with the fact that dec and jnz both require few cycles to execute, while the loopnz (and all the loop instructions, for that matter) instruction takes more cycles to complete. Loop instructions are rarely seen output by good compilers.

## Example: Avoiding Branches

Below is an assembly version of the expression `c ? d : 0`. There is no branching in the code, so how does it work?

```; ecx = c and edx = d
; eax will contain c ? d : 0 (eax = d if c is not zero, otherwise eax = 0)
neg	ecx
sbb	eax, eax
and	eax, edx
ret
```

This is an example of using various arithmetic instructions to avoid branching. The neg instruction sets the carry flag if c is not zero; otherwise, it clears the carry flag. The next line depends on this. If the carry flag is set, then sbb results in `eax = eax - eax - 1 = 0xffffffff`. Otherwise, `eax = eax - eax = 0`. Finally, performing an and on this result ensures that if ecx was not zero in the first place, eax will contain edx, and zero otherwise.

## Example: Duff's Device

What does the following C code function do? Is it useful? Why or why not?

```void MyFunction(int *arrayA, int *arrayB, int cnt)
{
switch(cnt % 6)
{
while(cnt != 0)
{
case 0:
arrayA[--cnt] = arrayB[cnt];
case 5:
arrayA[--cnt] = arrayB[cnt];
case 4:
arrayA[--cnt] = arrayB[cnt];
case 3:
arrayA[--cnt] = arrayB[cnt];
case 2:
arrayA[--cnt] = arrayB[cnt];
case 1:
arrayA[--cnt] = arrayB[cnt];
}
}
}
```

This piece of code is known as a Duff's device or "Duff's machine". It is used to partially unwind a loop for efficiency. Notice the strange way that the `while()` is nested inside the `switch` statement? Two arrays of integers are passed to the function, and at each iteration of the while loop, 6 consecutive elements are copied from `arrayB` to `arrayA`. The switch statement, since it is outside the while loop, only occurs at the beginning of the function. The modulo is taken of the variable `cnt` with respect to 6. If cnt is not evenly divisible by 6, then the modulo statement is going to start the loop off somewhere in the middle of the rotation, thus preventing the loop from causing a buffer overflow without having to test the current count after each iteration.

Duff's Device is considered one of the more efficient general-purpose methods for copying strings, arrays, or data streams.