Understanding software that doesn*t want to be understood

Report
UNDERSTANDING SOFTWARE THAT
DOESN’T WANT TO BE UNDERSTOOD
REVERSE ENGINEERING OBFUSCATED BINARIES
Saumya Debray
The University of Arizona
Tucson, AZ 85721
The Problem
 Rapid analysis and understanding of malware
code essential for swift response to new threats
‒ Malicious software are usually heavily obfuscated
against analysis
 Existing approaches to reverse engineering such
code are primitive
‒ not a lot of high-level tool support
‒ requires a lot of manual intervention
‒ slow, cumbersome, potentially error-prone
 Delays development of countermeasures
Goals
Develop automated techniques for analysis and
reverse engineering of obfuscated binaries
 semantics-based
‒ output is functionally equivalent to, but simpler than,
the input program
 generality
‒ should work on any obfuscation
 even ones we haven’t thought of yet!
‒ should minimize assumptions about obfuscations
Challenges
 can’t make assumptions about obfuscations
‒ what do we leverage for deobfuscation?
‒ distinguishing code we care about from code we don’t
 how do we know which instructions we care about?
 scale
‒ “needle in haystack”
 no. of instructions executed increases by 270x (VMprotect) to
4300x (Themida) [Lau 2008]
 anti-analysis defenses
‒ runtime unpacking
‒ anti-emulation, anti-debug checks
Our Approach
 no obfuscation-specific assumptions
‒ treat programs as input-to-output transformations
‒ use semantics-preserving transformations to simplify
execution traces
Taint analysis
(bit-level)
map flow of values
from input to output
Semanticspreserving
transformations
simplify logic of
input-to-output
transformation
Control flow
reconstruction
reconstruct logic of
simplified computation
control flow
graph
input
program
 dynamic analysis to handle runtime unpacking
Ex 1:Emulation-based Obfuscation
bytecode logic (data)
input program
random seed
Obfuscator
mutation
engine
emulator (code)
 examination of the code reveals only the
emulator’s logic
‒ actual program logic embedded in byte code
 lots of “chaff” during execution
‒ separating emulator logic from payload logic tricky
 emulators can be nested
Ex 2:Return-Oriented Programs (ROP)
 Originally designed to bypass anti-code-injection
defenses
‒ stitches together existing code fragments ( “gadgets” ),
e.g., in system libraries
 Logic can be difficult to discern
‒ gadgets are typically scattered across many different
functions and/or libraries
‒ gadgets can overlap in memory in weird ways
‒ control flow structures (if-else, loops, function calls) are
typically implemented using non-standard idioms
Example 1 (emulation-obfuscation)
factorial (Themida)
Example 2 (ROP)
factorial
o
original
ROP
Interactions between Obfuscations
Example: Unpacking + Emulation
unpack
input
output
output
instructions “tainted” as propagating
values from input to output
input-to-output
computation
(further simplified)
used to construct control flow graph
input
unpack
execution trace
Results
 Ex. 1. binary search : Themida
original
obfuscated (cropped)
deobfuscated
Results
 Ex. 2. Hunatcha (drive infection code) : ExeCryptor
original
obfuscated (cropped)
deobfuscated
Results
 Ex. 3. fibonacci: ROP
original
obfuscated
deobfuscated
Results
 Ex. 4. Win32/Kryptik.OHY: Code Virtualizer
obfuscated
multiple layers of runtime
code generation
unpacking code
the CFG shown materializes
incrementally
initial unpacker is
emulation-obfuscated
deobfuscated
Results: CFG Similarity
100
Similarity with original program (%)
90
80
70
60
50
OBFUSCATED
DEOBFUSCATED
40
30
20
10
0
Programs
Lessons and Issues
 Static vs. dynamic analysis
‒ multiple layers of runtime code generation/unpacking
limits utility of static analysis
‒ dynamic analysis can run into problems of scale
 O(n2) algorithms impractical ; even O(n log n) can be problematic
 trade memory space for execution time/complexity
 code coverage — multi-path exploration?
 Taint propagation
‒ byte/word-level analyses may not be precise enough
 we use (enhanced) bit-level taint propagation
 Simplified trace → CFG: NP-hard
‒ semantic considerations?
Conclusions
 Rapid analysis and understanding of malware
code essential for swift response to new threats
‒ need to deal with advanced code obfuscations
‒ obfuscation-specific solutions tend to be fragile
 We describe a semantics-based framework for
automatic code deobfuscation
‒ no assumptions about the obfuscation(s) used
‒ promising results on obfuscators (e.g., Themida) not
handled by prior research
ADDITIONAL MATERIAL
Semantics-based simplification
 Quasi-invariant locations: locations that have the
same value at each use.
 Our transformations (currently):
‒ Arithmetic simplification
 adaptation of constant folding to execution traces
 consider quasi-invariant locations as constants
 controlled to avoid over-simplification
‒ Data movement simplification
 use pattern-driven rules to identify and simplify data movement.
‒ Dead code elimination
 need to consider implicit destinations, e.g., condition code flags.

similar documents