x86 Disassembly/Disassemblers and Decompilers

What is a Disassembler?

edit

In essence, a disassembler is the exact opposite of an assembler. Where an assembler converts code written in an assembly language into binary machine code, a disassembler reverses the process and attempts to recreate the assembly code from the binary machine code.

Since most assembly languages have a one-to-one correspondence with underlying machine instructions, the process of disassembly is relatively straight-forward, and a basic disassembler can often be implemented simply by reading in bytes, and performing a table lookup. Of course, disassembly has its own problems and pitfalls, and they are covered later in this chapter.

Many disassemblers have the option to output assembly language instructions in Intel, AT&T, or (occasionally) HLA syntax. Examples in this book will use Intel and AT&T syntax interchangeably. We will typically not use HLA syntax for code examples, but that may change in the future.

x86 Disassemblers

edit

Here we are going to list some commonly available disassembler tools. Notice that there are professional disassemblers (which cost money for a license) and there are freeware/shareware disassemblers. Each disassembler will have different features, so it is up to you as the reader to determine which tools you prefer to use.

Online Disassemblers

edit
ODA
is a free, web-based disassembler for a wide variety of architectures. You can use "Live View" to see how code is disassembled in real time, one byte at a time, or upload a file. The site is currently in beta release but will hopefully only get better with time.
http://www.onlinedisassembler.com

Commercial Windows Disassemblers

edit
IDA Pro
is a professional disassembler that is expensive, extremely powerful, and has a whole slew of features. The downside to IDA Pro is that it costs $515 US for the standard single-user edition. As such this wikibook will not consider IDA Pro specifically because the price tag is exclusionary. Freeware versions do exist; see below.
Relyze Desktop
is an interactive software reverse engineering tool that lets you disassemble, decompile and diff x86, x64, ARM32 and ARM64 software.
https://www.relyze.com/overview.html
Hopper Disassembler
is a reverse engineering tool for the Mac, that lets you disassemble, decompile and debug 32/64bits Intel Mac executables. It can also disassemble and decompile Windows executables.
http://www.hopperapp.com
OBJ2ASM
is an object file disassembler for 16 and 32 bit x86 object files in Intel OMF, Microsoft COFF format, Linux ELF or Mac OS X Mach-O format.
http://www.digitalmars.com/ctg/obj2asm.html
PE Explorer
is a disassembler that "focuses on ease of use, clarity and navigation." It isn't as feature-filled as IDA Pro and carries a smaller price tag to offset the missing functionality: $130
http://www.heaventools.com/PE_Explorer_disassembler.htm
W32DASM (Win32dasm)
W32DASM was an excellent 16/32 bit disassembler for Windows, it seems it is no longer developed. the latest version available is from 2003. the website went down and no replacement went up.
http://www.softpedia.com/get/Programming/Debuggers-Decompilers-Dissasemblers/WDASM.shtml
Binary Ninja
Binary Ninja is a commercial, cross-platform (Linux, OS X, Windows) reverse engineering platform with aims to offer a similar feature set to IDA at a much cheaper price point. A precursor written in python is open source and available at https://github.com/Vector35/deprecated-binaryninja-python. Introductory pricing is $99 for student/non-commercial use, and $399 for commercial use.
https://binary.ninja/
Hiew
x86-64 disassembler & assembler. Single license pricing is $19, and $199 with lifetime updates.
hiew.ru

Commercial Freeware/Shareware Windows Disassemblers

edit
OllyDbg
OllyDbg is one of the most popular disassemblers recently. It has a large community and a wide variety of plugins available. It emphasizes binary code analysis. Supports x86 instructions only (no x86_64 support for now, although it is on the way).
http://www.ollydbg.de/ (official website)
http://www.openrce.org/downloads/browse/OllyDbg_Plugins (plugins)
http://www.ollydbg.de/odbg64.html (64 bit version)

Free Windows Disassemblers

edit
Capstone
Capstone is an open source disassembly framework for multi-arch (including support for x86, x86_64) & multi-platform with advanced features.
http://www.capstone-engine.org/
Zydis
Fast and lightweight x86/x86-64 decoder library. It does not offer disassembler features such as linear sweep or recursive disassembling.
https://github.com/zyantific/zydis
Objconv
A command line disassembler supporting 16, 32, and 64 bit x86 code. Latest instruction set (SSE4, AVX, XOP, FMA, etc.), several object file formats, several assembly syntax dialects. Windows, Linux, BSD, Mac. Intelligent analysis.
IDA 3.7
A DOS GUI tool that behaves very much like IDA Pro, but is considerably more limited. It can disassemble code for the Z80, 6502, Intel 8051, Intel i860, and PDP-11 processors, as well as x86 instructions up to the 486.
IDA Pro Freeware
Behaves almost exactly like IDA Pro, but disassembles only Intel x86 opcodes and is Windows-only. It can disassemble instructions for those processors available as of 2003. Free for non-commercial use.
BORG Disassembler
BORG is an excellent Win32 Disassembler with GUI.
http://www.caesum.com/
HT Editor
An analyzing disassembler for Intel x86 instructions. The latest version runs as a console GUI program on Windows, but there are versions compiled for Linux as well.
http://hte.sourceforge.net/
diStorm64
diStorm is an open source highly optimized stream disassembler library for 80x86 and AMD64.
http://ragestorm.net/distorm/
crudasm
crudasm is an open source disassembler with a variety of options. It is a work in progress and is bundled with a partial decompiler.
http://sourceforge.net/projects/crudasm9/
BeaEngine
BeaEngine is a complete disassembler library for IA-32 and intel64 architectures (coded in C and usable in various languages : C, Python, Delphi, PureBasic, WinDev, masm, fasm, nasm, GoAsm).
https://github.com/BeaEngine/beaengine
Visual DuxDebugger
is a 64-bit debugger disassembler for Windows.
http://www.duxcore.com/products.html
BugDbg
is a 64-bit user-land debugger designed to debug native 64-bit applications on Windows.
http://www.pespin.com/
DSMHELP
Disassemble Help Library is a disassembler library with single line Epimorphic assembler. Supported instruction sets - Basic,System,SSE,SSE2,SSE3,SSSE3,SSE4,SSE4A,MMX,FPU,3DNOW,VMX,SVM,AVX,AVX2,BMI1,BMI2,F16C,FMA3,FMA4,XOP.
http://dsmhelp.narod.ru/ (in Russian)
ArkDasm
is a 64-bit interactive disassembler and debugger for Windows. Supported processor: x64 architecture (Intel x64 and AMD64)
http://www.arkdasm.com/
SharpDisam
is a C# port of the udis86 x86 / x86-64 disassembler
http://sharpdisasm.codeplex.com/
CFF Explorer
Special fields description and modification (.NET supported), utilities, rebuilder, hex editor, import adder, signature scanner, signature manager, extension support, scripting, disassembler, dependency walker etc.
ntcore.com
bddisasm
fast, lightweight, x86/x64 instruction decoding library.
github.com/bitdefender/bddisasm

Unix Disassemblers

edit

Many of the Unix disassemblers, especially the open source ones, have been ported to other platforms, like Windows (mostly using MinGW or Cygwin). Some Disassemblers like otool ([OS X) are distro-specific.

Capstone
Capstone is an open source disassembly framework for multi-arch (including support for x86, x86_64) & multi-platform (including Mac OSX, Linux, *BSD, Android, iOS, Solaris) with advanced features.
http://www.capstone-engine.org/
Bastard Disassembler
The Bastard disassembler is a powerful, scriptable disassembler for Linux and FreeBSD.
http://bastard.sourceforge.net/
ndisasm
NASM's disassembler for x86 and x86-64. Works on DOS, Windows, Linux, Mac OS X and various other systems.
udis86
Disassembler Library for x86 and x86-64
http://udis86.sourceforge.net/
Zydis
Fast and lightweight x86/x86-64 disassembler library.
https://github.com/zyantific/zydis
Objconv
See above.
ciasdis
The official name of ciasdis is computer_intelligence_assembler_disassembler. This Forth-based tool allows to incrementally and interactively build knowledge about a code body. It is unique that all disassembled code can be re-assembled to the exact same code. Processors are 8080, 6809, 8086, 80386, Pentium I en DEC Alpha. A scripting facility aids in analyzing Elf and MSDOS headers and makes this tool extendable. The Pentium I ciasdis is available as a binary image, others are in source form, loadable onto lina Forth, available from the same site.
http://home.hccnet.nl/a.w.m.van.der.horst/ciasdis.html
objdump
comes standard, and is typically used for general inspection of binaries. Pay attention to the relocation option and the dynamic symbol table option.
gdb
comes standard, as a debugger, but is very often used for disassembly. If you have loose hex dump data that you wish to disassemble, simply enter it (interactively) over top of something else or compile it into a program as a string like so: char foo[] = {0x90, 0xcd, 0x80, 0x90, 0xcc, 0xf1, 0x90};
lida linux interactive disassembler
an interactive disassembler with some special functions like a crypto analyzer. Displays string data references, does code flow analysis, and does not rely on objdump. Utilizes the Bastard disassembly library for decoding single opcodes. The project was started in 2004 and remains dormant to this day.
http://lida.sourceforge.net
dissy
This program is a interactive disassembler that uses objdump.
http://code.google.com/p/dissy/
EmilPRO
replacement for the deprecated dissy[check spelling] disassembler.
http://github.com/SimonKagstrom/emilpro
x86dis
This program can be used to display binary streams such as the boot sector or other unstructured binary files.
ldasm
LDasm (Linux Disassembler) is a Perl/Tk-based GUI for objdump/binutils that tries to imitate the 'look and feel' of W32Dasm. It searches for cross-references (e.g. strings), converts the code from GAS to a MASM-like style, traces programs and much more. Comes along with PTrace, a process-flow-logger. Last updated in 2002, available from Tucows.
http://www.tucows.com/preview/59983/LDasm
llvm
LLVM has two interfaces to its disassembler:
llvm-objdump
Mimics GNU objdump.
llvm-mc
See the LLVM blog. Example usage:
$ echo '1 2' | llvm-mc -disassemble -triple=x86_64-apple-darwin9
addl %eax, (%rdx)
$ echo '0x0f 0x1 0x9' | llvm-mc -disassemble -triple=x86_64-apple-darwin9
sidt (%rcx)
$ echo '0x0f 0xa2' | llvm-mc -disassemble -triple=x86_64-apple-darwin9
cpuid
$ echo '0xd9 0xff' | llvm-mc -disassemble -triple=i386-apple-darwin9
fcos
otool
OS X's object file displaying tool.
edb
A cross platform x86/x86-64 debugger.
https://github.com/eteran/edb-debugger
bddisasm
fast, lightweight, x86/x64 instruction decoding library.
github.com/bitdefender/bddisasm
rasm2
radare2 disassembler and assembler tool. Includes x86.nz library with support for x86/x86-64.

Disassembler Issues

edit

As we have alluded to before, there are a number of issues and difficulties associated with the disassembly process. The two most important difficulties are the division between code and data, and the loss of text information.

Separating Code from Data

edit

Since data and instructions are all stored in an executable as binary data, the obvious question arises: how can a disassembler tell code from data? Is any given byte a variable, or part of an instruction?

The problem wouldn't be as difficult if data were limited to the .data section (segment) of an executable (explained in a later chapter) and if executable code were limited to the .code section of an executable, but this is often not the case. Data may be inserted directly into the code section (e.g. jump address tables, constant strings), and executable code may be stored in the data section (although new systems are working to prevent this for security reasons). AI programs, LISP or Forth compilers may not contain .text and .data sections to help decide, and have code and data interspersed in a single section that is readable, writable and executable, Boot code may even require substantial effort to identify sections. A technique that is often used is to identify the entry point of an executable, and find all code reachable from there, recursively. This is known as "code crawling".

Many interactive disassemblers will give the user the option to render segments of code as either code or data, but non-interactive disassemblers will make the separation automatically. Disassemblers often will provide the instruction AND the corresponding hex data on the same line, shifting the burden for decisions about the nature of the code to the user. Some disassemblers (e.g. ciasdis) will allow you to specify rules about whether to disassemble as data or code and invent label names, based on the content of the object under scrutiny. Scripting your own "crawler" in this way is more efficient; for large programs interactive disassembling may be impractical to the point of being unfeasible.

The general problem of separating code from data in arbitrary executable programs is equivalent to the halting problem. As a consequence, it is not possible to write a disassembler that will correctly separate code and data for all possible input programs. Reverse engineering is full of such theoretical limitations, although by Rice's theorem all interesting questions about program properties are undecidable (so compilers and many other tools that deal with programs in any form run into such limits as well). In practice a combination of interactive and automatic analysis and perseverance can handle all but programs specifically designed to thwart reverse engineering, like using encryption and decrypting code just prior to use, and moving code around in memory.

Lost Information

edit

User defined textual identifiers, such as variable names, label names, and macros are removed by the assembly process. They may still be present in generated object files, for use by tools like debuggers and relocating linkers, but the direct connection is lost and re-establishing that connection requires more than a mere disassembler. Especially small constants may have more than one possible name. Operating system calls (like DLLs in MS-Windows, or syscalls in Unices) may be reconstructed, as their names appear in a separate segment or are known beforehand. Many disassemblers allow the user to attach a name to a label or constant based on his understanding of the code. These identifiers, in addition to comments in the source file, help to make the code more readable to a human, and can also shed some clues on the purpose of the code. Without these comments and identifiers, it is harder to understand the purpose of the source code, and it can be difficult to determine the algorithm being used by that code. When you combine this problem with the possibility that the code you are trying to read may, in reality, be data (as outlined above), then it can be even harder to determine what is going on. Another challenge is posed by modern optimising compilers; they inline small subroutines, then combine instructions over call and return boundaries. This loses valuable information about the way the program is structured.

Decompilers

edit

Akin to Disassembly, Decompilers take the process a step further and actually try to reproduce the code in a high level language. Frequently, this high level language is C, because C is simple and primitive enough to facilitate the decompilation process. Decompilation does have its drawbacks, because lots of data and readability constructs are lost during the original compilation process, and they cannot be reproduced. Since the science of decompilation is still young, and results are "good" but not "great", this page will limit itself to a listing of decompilers, and a general (but brief) discussion of the possibilities of decompilation. Compared to disassemblers a decompiler generates code that doesnot require that one is familiar at the processor at hand. It may even be that the decompiled code can be compiled on a different processor, or give a reasonable starting point to reproduce the program on a different processor.

Decompilation: Is It Possible?

edit

In the face of optimizing compilers, it is not uncommon to be asked "Is decompilation even possible?" To some degree, it usually is. Make no mistake, however: an optimizing compiler results in the irretrievable loss of information. An example is in-lining, as explained above, where code called is combined with its surroundings, such that the places where the original subroutine is called cannot even be identified. An optimizer that reverses that process is comparable to an artificial intelligence program that recreates a poem in a different language. So perfectly operational decompilers are a long way off. At most, current Decompilers can be used as simply an aid for the reverse engineering process leaving lots of arduous work.

Common Decompilers

edit
Hex-Rays Decompiler
Hex-Rays is a commercial decompiler. It is made as an extension to popular IDA-Pro disassembler. It is currently the only viable commercially available decompiler which produces usable results. It supports both x86 and ARM architecture.
http://www.hex-rays.com/products/decompiler/index.shtml
ILSpy
ILSpy is an open source .NET assembly browser and decompiler.
https://github.com/icsharpcode/ILSpy
DCC
DCC is likely one of the oldest decompilers in existence, dating back over 20 years. It serves as a good historical and theoretical frame of reference for the decompilation process in general (Mirrors: [1][2]). As of 2015, DCC is an active project. Some of the latest changes include fixes for longstanding memory leaks and a more modern Qt5-based front-end.
RetDec
The Retargetable Decompiler is a freeware web decompiler that takes in ELF/PE/COFF binaries in Intel x86, ARM, MIPS, PIC32, and PowerPC architectures and outputs C or Python-like code, plus flow charts and control flow graphs. It puts a running time limit on each decompilation. It produces nice results in most cases.
https://github.com/avast/retdec
Reko
a modular open-source decompiler supporting both an interactive GUI and a command-line interface. Its pluggable design supports decompilation of a variety of executable formats and processor architectures (8- , 16- , 32- and 64-bit architectures as of 2015). It also supports running unpacking scripts before actual decompilation. It performs global data and type analyses of the binary and yields its results in a subset of C++.
http://sourceforge.net/projects/decompiler
https://github.com/uxmal/reko
C4Decompiler
C4Decompiler is an interactive, static decompiler under development (Alpha in 2013). It performs global analysis of the binary and presents the resulting C source in a Windows GUI. Context menus support navigation, properties, cross references, C/Asm mixed view and manipulation of the decompile context (function ABI).
http://www.c4decompiler.com
Boomerang Decompiler Project
Boomerang Decompiler is an attempt to make a powerful, retargetable decompiler. So far, it only decompiles into C with moderate success.
http://boomerang.sourceforge.net/
Reverse Engineering Compiler (REC)
REC is a powerful "decompiler" that decompiles native assembly code into a C-like code representation. The code is half-way between assembly and C, but it is much more readable than the pure assembly is. Unfortunately the program appears to be rather unstable.
http://www.backerstreet.com/rec/rec.htm
ExeToC
ExeToC decompiler is an interactive decompiler that boasted pretty good results in the past.
http://sourceforge.net/projects/exetoc
snowman
Snowman is an open source native code to C/C++ decompiler. Supports ARM, x86, and x86-64 architectures. Reads ELF, Mach-O, and PE file formats. Reconstructs functions, their names and arguments, local and global variables, expressions, integer, pointer and structural types, all types of control-flow structures, including switch. Has a nice graphical user interface with one-click navigation between the assembler code and the reconstructed program. Has a command-line interface for batch processing.
https://derevenets.com
Ghidra
Ghidra is a reverse engineering package that includes a decompiler. It was written by the NSA for internal work, and apparently released because they didn't want to have to re-train every new person they hired. It is written in Java.


A General view of Disassembling

edit

8 bit CPU code

edit

Most embedded CPUs are 8-bit CPUs.[1]

Normally when a subroutine is finished, it returns to executing the next address immediately following the call instruction.

However, assembly-language programmers occasionally use several different techniques that adjust the return address, making disassembly more difficult:

  • jump tables,
  • calculated jumps, and
  • a parameter after the call instruction.

jump tables and other calculated jumps

edit

On 8-bit CPUs, calculated jumps are often implemented by pushing a calculated "return" address to the stack, then jumping to that address using the "return" instruction. For example, the RTS Trick uses this technique to implement jump tables (branch table).

parameters after the call instruction

edit

Instead of picking up their parameters off the stack or out of some fixed global address, some subroutines provide parameters in the addresses of memory that follow the instruction that called that subroutine. Subroutines that use this technique adjust the return address to skip over all the constant parameter data, then return to an address many bytes after the "call" instruction. One of the more famous programs that used this technique is the "Sweet 16" virtual machine.

The technique may make disassembly more difficult.

A simple example of this is the write() procedure implemented as follows:

; assume ds = cs, e.g like in boot sector code
start:
        call write       ; push message's address on top of stack
        db   "Hello, world",0dh,0ah,00h
; return point
        ret              ; back to DOS

write proc near
        pop  si          ; get string address
        mov  ah,0eh      ; BIOS: write teletype
w_loop:
        lodsb            ; read char at [ds:si] and increment si
        or   al,al       ; is it 00h?
        jz   short w_exit
        int  10h         ; write the character
        jmp  w_loop      ; continue writing
w_exit:
        jmp  si
write   endp
        end start

A macro-assembler like TASM will then use a macro like this one:

_write macro message
       call write
       db message
       db 0
_write endm

From a human disassembler's point of view, this is a nightmare, although this is straightforward to read in the original Assembly source code, as there is no way to decide if the db should be interpreted or not from the binary form, and this may contain various jumps to real executable code area, triggering analysis of code that should never be analysed, and interfering with the analysis of the real code (e.g. disassembling the above code from 0000h or 0001h won't give the same results at all).

However a half-decent tool with possibilities to specifiy rules, and heuristic means to identify texts will have little trouble.

32 bit CPU code

edit

Most 32-bit CPUs use the ARM instruction set.[1][2][3]

Typical ARM assembly code is a series of subroutines, with literal constants scattered between subroutines. The standard prolog and epilog for subroutines is pretty easy to recognize.

A brief list of disassemblers

edit
  • ciasdis "an assembler where the elements opcode, operands and modifiers are all objects, that are reusable for disassembly." For 8080 8086 80386 Alpha 6809 and should be usable for Pentium 68000 6502 8051.
  • radare, the reverse engineering framework includes open-source tools to disassemble code for many processors including x86, ARM, PowerPC, m68k, etc. several virtual machines including java, msil, etc., and for many platforms including Linux, BSD, OSX, Windows, iPhoneOS, etc.
  • IDA, the Interactive Disassembler ( IDA Pro ) can disassemble code for a huge number of processors, including ARM Architecture (including Thumb and Thumb-2), ATMEL AVR, INTEL 8051, INTEL 80x86, MOS Technologies 6502, MC6809, MC6811, M68H12C, MSP430, PIC 12XX, PIC 14XX, PIC 18XX, PIC 16XXX, Zilog Z80, etc.
  • objdump, part of the GNU binutils, can disassemble code for several processors and platforms. binutils is an important part of the toolchain as it provides the linker, assembler and other utilties (like objdump) to manipulate executables on the target platform, and is available for most popular platforms.
    • For OS X/BSD systems, there is a rough equivalent called otool in the XCode kit.
  • Disassemblers at Curlie lists a huge number of disassemblers
  • Program transformation wiki: disassembly lists many highly recommended disassemblers
  • search for "disassemble" at SourceForge shows many disassemblers for a variety of CPUs.
  • Hopper is a disassembler that runs on OS-X and disassembles 32/64-bit OS-X and windows binaries.
  • The University of Queensland Binary Translator (UQBT) is a reusable, component-based binary-translation framework that supports CISC, RISC, and stack-based processors.

Further reading

edit
  1. a b Jim Turley. "The Two Percent Solution". 2002.
  2. Mark Hachman. "ARM Cores Climb Into 3G Territory". 2002. "Although Intel and AMD receive the bulk of attention in the computing world, ARM’s embedded 32-bit architecture, ... has outsold all others."
  3. Tom Krazit. "ARMed for the living room". "ARM licensed 1.6 billion cores [in 2005]". 2006.

← Assemblers and Compilers | Disassembly Examples →