x86 Disassembly/Code Obfuscation
Code Obfuscation
editCode Obfuscation is the act of making the assembly code or machine code of a program more difficult to disassemble or decompile. The term "obfuscation" is typically used to suggest a deliberate attempt to add difficulty, but many other practices will cause code to be obfuscated without that being the intention. Software vendors may attempt to obfuscate or even encrypt code to prevent reverse engineering efforts. There are many different types of obfuscations. Notice that many code optimizations (discussed in the previous chapter) have the side-effect of making code more difficult to read, and therefore optimizations act as obfuscations.
What is Code Obfuscation?
editThere are many things that obfuscation could be:
- Encrypted code that is decrypted prior to runtime.
- Compressed code that is decompressed prior to runtime.
- Executables that contain Encrypted sections, and a simple decrypter.
- Code instructions that are put in a hard-to read order.
- Code instructions which are used in a non-obvious way.
This chapter will try to examine some common methods of obfuscating code, but will not necessarily delve into methods to break the obfuscation.
Interleaving
editOptimizing Compilers will engage in a process called interleaving to try and maximize parallelism in pipelined processors. This technique is based on two premises:
- That certain instructions can be executed out of order and still maintain the correct output
- That processors can perform certain pairs of tasks simultaneously.
x86 NetBurst Architecture
editThe Intel NetBurst Architecture divides an x86 processor into 2 distinct parts: the supporting hardware, and the primitive core processor. The primitive core of a processor contains the ability to perform some calculations blindingly fast, but not the instructions that you or I am familiar with. The processor first converts the code instructions into a form called "micro-ops" that are then handled by the primitive core processor.
The processor can also be broken down into 4 components, or modules, each of which is capable of performing certain tasks. Since each module can operate separately, up to 4 separate tasks can be handled simultaneously by the processor core, so long as those tasks can be performed by each of the 4 modules:
- Port0
- Double-speed integer arithmetic, floating point load, memory store
- Port1
- Double-speed integer arithmetic, floating point arithmetic
- Port2
- memory read
- Port3
- memory write (writes to address bus)
So for instance, the processor can simultaneously perform 2 integer arithmetic instructions in both Port0 and Port1, so a compiler will frequently go to great lengths to put arithmetic instructions close to each other. If the timing is just right, up to 4 arithmetic instructions can be executed in a single instruction period.
Notice however that writing to memory is particularly slow (requiring the address to be sent by Port3, and the data itself to be written by Port0). Floating point numbers need to be loaded to the FPU before they can be operated on, so a floating point load and a floating point arithmetic instruction cannot operate on a single value in a single instruction cycle. Therefore, it is not uncommon to see floating point values loaded, integer values be manipulated, and then the floating point value be operated on.
Non-Intuitive Instructions
editOptimizing compilers frequently will use instructions that are not intuitive. Some instructions can perform tasks for which they were not designed, typically as a helpful side effect. Sometimes, one instruction can perform a task more quickly than other specialized instructions can.
The only way to know that one instruction is faster than another is to consult the processor documentation. However, knowing some of the most common substitutions is very useful to the reverser.
Here are some examples. The code in the first box operates more quickly than the one in the second, but performs exactly the same tasks.
Example 1
Fast
xor eax, eax
Slow
mov eax, 0
Example 2
Fast
shl eax, 3
Slow
push edx
push 8
mul dword [esp]
add esp, 4
pop edx ;# edx is not preserved by "mul"
Sometimes such transformations could be made to make the analysis more difficult:
Example 3
Fast
push $next_instr
jmp $some_function
$next_instr:...
Slow
call $some_function
Example 4
Fast
pop eax
jmp eax
Slow
retn
Common Instruction Substitutions
edit- lea
- The lea instruction has the following form:
lea dest, (XS:)[reg1 + reg2 * x]
Where XS is a segment register (SS, DS, CS, etc...), reg1 is the base address, reg2 is a variable offset, and x is a multiplicative scaling factor. What lea does, essentially, is load the memory address being pointed to in the second argument, into the first argument. Look at the following example:
mov eax, 1
lea ecx, [eax + 4]
Now, what is the value of ecx? The answer is that ecx has the value of (eax + 4), which is 5. In essence, lea is used to do addition and multiplication of a register and a constant that is a byte or less (-128 to +127).
Now, consider:
mov eax, 1
lea ecx, [eax+eax*2]
Now, ecx equals 3.
The difference is that lea is quick (because it only adds a register and a small constant), whereas the add and mul instructions are more versatile, but slower. lea is used for arithmetic in this fashion very frequently, even when compilers are not actively optimizing the code.
- xor
- The xor instruction performs the bit-wise exclusive-or operation on two operands. Consider then, the following example:
mov al, 0xAA
xor al, al
What does this do? Let's take a look at the binary:
10101010 ;10101010 = 0xAA xor 10101010 -------- 00000000
The answer is that "xor reg, reg" sets the register to 0. More importantly, however, is that "xor eax, eax" sets eax to 0 faster (and the generated code instruction is smaller) than an equivalent "mov eax, 0".
- mov edi, edi
- On a 64-bit x86 system, this instruction clears the high 32-bits of the rdi register.
- shl, shr
- left-shifting, in binary arithmetic, is equivalent to multiplying the operand by 2. Right-shifting is also equivalent to integer division by 2, although the lowest bit is dropped. in general, left-shifting by spaces multiplies the operand by , and right shifting by spaces is the same as dividing by . One important fact is that resulting number is an integer with no fractional part present. For example:
mov al, 31 ; 00011111
shr al, 1 ; 00001111 = 15, not 15.5
- xchg
- xchg exchanges the contents of two registers, or a register and a memory address. A noteworthy point is the fact that xchg operates faster than a move instruction. For this reason, xchg will be used to move a value from a source to a destination, when the value in the source no longer needs to be saved.
As an example, consider this code:
mov ebx, eax
mov eax, 0
Here, the value in eax
is stored in ebx
, and then eax
is loaded with the value zero. We can perform the same operation, but using xchg
and xor
instead:
xchg eax, ebx
xor eax, eax
It may surprise you to learn that the second code example operates significantly faster than the first one does.
Obfuscators
editThere are a number of tools on the market that will automate the process of code obfuscation. These products will use a number of transformations to turn a code snippet into a less-readable form, although it will not affect the program flow itself (although the transformations may increase code size or execution time).
Code Transformations
editCode transformations are a way of reordering code so that it performs exactly the same task but becomes more difficult to trace and disassemble. We can best demonstrate this technique by example. Let's say that we have 2 functions, FunctionA and FunctionB. Both of these two functions are comprised of 3 separate parts, which are performed in order. We can break this down as such:
FunctionA()
{
FuncAPart1();
FuncAPart2();
FuncAPart3();
}
FunctionB()
{
FuncBPart1();
FuncBPart2();
FuncBPart3();
}
And we have our main program, that executes the two functions:
main()
{
FunctionA();
FunctionB();
}
Now, we can rearrange these snippets to a form that is much more complicated (in assembly):
main:
jmp FAP1
FBP3: call FuncBPart3
jmp end
FBP1: call FuncBPart1
jmp FBP2
FAP2: call FuncAPart2
jmp FAP3
FBP2: call FuncBPart2
jmp FBP3
FAP1: call FuncAPart1
jmp FAP2
FAP3: call FuncAPart3
jmp FBP1
end:
As you can see, this is much harder to read, although it perfectly preserves the program flow of the original code. This code is much harder for a human to read, although it isn't hard at all for an automatic disassembler (such as IDA Pro) to read.
Opaque Predicates
editAn Opaque Predicate is a predicate inside the code, that cannot be evaluated during static analysis. This forces the attacker to perform a dynamic analysis to understand the result of the line. Typically this is related to a branch instruction that is used to prevent in static analysis the understanding which code path is taken.
Code Encryption
editCode can be encrypted, just like any other type of data, except that code can also work to encrypt and decrypt itself. Encrypted programs cannot be directly disassembled. However, such a program can also not be run directly because the encrypted opcodes cannot be interpreted properly by the CPU. For this reason, an encrypted program must contain some sort of method for decrypting itself prior to operation.
The most basic method is to include a small stub program that decrypts the remainder of the executable, and then passes control to the decrypted routines.
Disassembling Encrypted Code
editTo disassemble an encrypted executable, you must first determine how the code is being decrypted. Code can be decrypted in one of two primary ways:
- All at once. The entire code portion is decrypted in a single pass, and left decrypted during execution. Using a debugger, allow the decryption routine to run completely, and then dump the decrypted code into a file for further analysis.
- By Block. The code is encrypted in separate blocks, where each block may have a separate encryption key. Blocks may be decrypted before use, and re-encrypted again after use. Using a debugger, you can attempt to capture all the decryption keys and then use those keys to decrypt the entire program at once later, or you can wait for the blocks to be decrypted, and then dump the blocks individually to a separate file for analysis.