Chapter 4
 Three tasks for anti-virus software
1. Detection --- infected or not? Provably undecidable…
2. Identification --- which virus? May be separate from detection, depending on detection method used
3. Disinfection --- remove the virus
Detection: Static Methods
 Generic
o Detects known and unknown viruses
o For example, anomaly detection
 Virus-specific
o Detects known viruses
o For example, signature detection
 Static --- virus code not running
 Dynamic --- virus code running
Detection Outcomes
 Also can have a ghost positive
 Virus remnant “detected”
o But virus is no longer there
 How can this happen?
o Previous disinfection was incomplete
Static Detection
 Detection without running virus code
 Three approaches…
1. Scanners
o Signature-based
2. Static heuristics
o Look for “virus-like” code
3. Integrity checkers
o Hash/checksum
 On-demand
o Files scanned when you say so
 On-access
o Constant scanning in background
o Whenever file is accessed, it’s scanned
Signature Scanning
o Viruses represented by “signature”
o Signature == pattern of bits in a virus (might include wildcards)
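The idea of a bit-pattern signature with wildcards can be sketched in a few lines; the 3-byte signature below is invented for illustration, with `None` standing in for a one-byte wildcard:

```python
def matches_at(data, sig, pos):
    """True if the signature matches data at offset pos.
    sig is a sequence of byte values; None is a one-byte wildcard."""
    if pos + len(sig) > len(data):
        return False
    return all(b is None or data[pos + i] == b for i, b in enumerate(sig))

def scan(data, sig):
    """Naive single-pattern scan: return every matching offset."""
    return [p for p in range(len(data) - len(sig) + 1) if matches_at(data, sig, p)]

# Hypothetical signature: 0xEB, any byte, 0x90
print(scan(bytes([0x00, 0xEB, 0x42, 0x90, 0x90]), [0xEB, None, 0x90]))  # [1]
```

Scanning like this, one signature at a time, is exactly what does not scale --- hence the multi-pattern algorithms on the following slides.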
 “Hundreds of thousands of signatures”
 Not feasible to scan one-by-one
o Multiple pattern search
o Efficiency is critical
 We look in detail at several algorithms
Algorithm: Aho-Corasick
 Developed 1975, for bibliographic search
 Based on finite automaton (graph)
o Circles are search states
o Edges are transitions
o Double circles are final states/output
 And a failure function
o What to do when no suitable transition
o I.e., where to resume “matching”
Algorithm: Aho-Corasick
 When virus scanning, search for virus signature, which is a bit string
 For simplicity, illustrate algorithm using English words
 For our example, scan for any of the following words:
o hi, hips, hip, hit, chip
Algorithm: Aho-Corasick
 How to construct automaton?
o And failure function
 Build the automaton --- next slide
o A “trie”, also known as a “prefix tree”
 Then determine failure function
o Two slides ahead
Algorithm: Aho-Corasick
 State labels added in order
o States closest to root get smallest labels
Aho-Corasick: Failure Function
 Depth 1 nodes
o Fail goes back to start state
 For other states
o Go back to earliest place where search can resume
o Pseudo-code is in the book
 The bottom line…
 Linear search that can find multiple patterns
o Like searching in parallel for all related patterns
 Efficient representation of automaton is the challenge
o Both time and space issues
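The trie-plus-failure-function idea can be sketched as follows (this is my own compact sketch, not the book’s pseudo-code), using the example words above:

```python
from collections import deque

def build(patterns):
    """Build the Aho-Corasick automaton: goto (trie edges), failure links,
    and output sets; state 0 is the start state."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                         # trie ("prefix tree") construction
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)                          # double circle: final state
    q = deque(goto[0].values())                  # depth-1 states fail to start
    while q:                                     # BFS fills remaining failure links
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:       # earliest place to resume matching
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]               # matches that end here via failure
    return goto, fail, out

def search(text, goto, fail, out):
    """One linear pass; reports (start, pattern) for every match."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits

automaton = build(["hi", "hips", "hip", "hit", "chip"])
print(sorted(search("chips", *automaton)))
# [(0, 'chip'), (1, 'hi'), (1, 'hip'), (1, 'hips')]
```

One pass over "chips" finds all four overlapping matches --- the "searching in parallel" effect.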
Algorithm: Veldman
 Linear search on “reduced” signatures
o Sequential search on reduced set
 From each signature, select 4 adjacent non-wildcard bytes
o Want as many signatures as possible to have each selected 4-byte pattern
 Then use 2 hash tables to filter…
o Hash tables: 1st 2 bytes & 2nd 2 bytes
Algorithm: Veldman
 Example: suppose the following 5 signatures
o blar?g, foo, greep, green, agreed
 Select 4-byte patterns, no wildcards:
Algorithm: Veldman
 Hashes act as filters
 Test things that pass thru both filters
o In this example, get things like “grar”
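A rough sketch of the two-filter idea. Assumptions of mine: each signature contributes its first run of 4 adjacent non-wildcard characters (the real algorithm chooses the 4-byte patterns to maximize sharing), and signatures shorter than 4 bytes, like foo, would need separate handling:

```python
def pick_four(sig):
    """First run of 4 adjacent non-wildcard characters ('?' is the wildcard).
    Simplification: the real algorithm picks patterns to maximize sharing."""
    for i in range(len(sig) - 3):
        if '?' not in sig[i:i + 4]:
            return sig[i:i + 4]
    return None                         # too short: needs separate handling

def build_filters(signatures):
    first, second = {}, {}              # hash tables on 1st 2 bytes & 2nd 2 bytes
    for sig in signatures:
        four = pick_four(sig)
        if four is None:
            continue                    # e.g., "foo" is only 3 bytes
        first.setdefault(four[:2], []).append(sig)
        second.setdefault(four[2:], []).append(sig)
    return first, second

def candidates(data, first, second):
    """Positions passing both filters; only these need a full signature check."""
    return [(i, data[i:i + 4]) for i in range(len(data) - 3)
            if data[i:i + 2] in first and data[i + 2:i + 4] in second]

first, second = build_filters(["blar?g", "foo", "greep", "green", "agreed"])
print(candidates("xxgrarxx", first, second))   # [(2, 'grar')]
```

"grar" passes both filters (gr from gree, ar from blar) yet matches no signature, which is why every filter hit still needs a full check.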
Algorithm: Veldman
 Veldman allows for wildcards and complex signatures
o Aho-Corasick does not
 But both algorithms analyze every byte of input
 Is it possible to do better?
o That is, can we skip some of the input?
Algorithm: Wu-Manber
 Like Veldman’s algorithm
o But can skip over bytes that can’t possibly match
o Faster, improved performance
 Illustrate algorithm with same signatures used for Veldman’s:
o blar?g, foo, greep, green, agreed
Algorithm: Wu-Manber
 Calculate MINLEN
o Min length of any pattern substring
 Two hash tables
o SHIFT --- number of bytes that can safely be skipped
o HASH --- mapping to signatures
 Input bytes denoted b1, b2, …, bn
 Start at b_MINLEN and consider byte pairs
Algorithm: Wu-Manber
 Example: suppose hash tables are…
 Here, start scanning at b_MINLEN
Algorithm: Wu-Manber
 How to construct hash tables?
 It’s a 4-step process
o Calculate MINLEN
o Initialize SHIFT table
o Fill SHIFT table
o Fill HASH table
Algorithm: Wu-Manber
 Calculate MINLEN
o Minimum number of adjacent, non-wildcard bytes in any signature
 For this example, we have
o blar?g
o greep
o agreed
 So we have MINLEN = 3
Algorithm: Wu-Manber
 Extract MINLEN pattern substrings
o blar?g
o greep
o agreed
 Extract all distinct 2-byte sequences
o bl, la, fo, oo, gr, re, ag
 If input pair is not one of these, safe to skip MINLEN - 1 bytes
Algorithm: Wu-Manber
 Initialize SHIFT table to MINLEN – 1
 For 2-byte pairs: bl, la, fo, oo, gr, re, ag
o Denote as xy
o Let qxy be rightmost ending position of xy in any pattern substring
o For example, gr in agr and gre, but bl in bla
o So, qgr = 3 while qbl = 2
o Then set SHIFT[xy] = MINLEN – qxy
 Note: wildcard matches everything…
Algorithm: Wu-Manber
 If SHIFT[xy] = MINLEN – qxy = 0
o Then we are at right edge of a pattern
 So, set HASH[xy] to all signatures with pattern substring ending in xy
 For example
o HASH[gr] → agreed
o HASH[re] → greep, green
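The table construction and skip-scan can be sketched as follows. An assumption of mine: each signature’s MINLEN substring is taken from its leading run of non-wildcard characters, which is enough to reproduce the SHIFT/HASH entries above, though a real implementation differs in details:

```python
def build_tables(signatures):
    """Build MINLEN, SHIFT, and HASH as on the slides; '?' is the wildcard.
    Simplification: use the leading non-wildcard run of each signature."""
    runs = [sig.split('?')[0] for sig in signatures]
    minlen = min(len(r) for r in runs)            # here: len("foo") == 3
    shift, hash_tbl = {}, {}
    for sig, run in zip(signatures, runs):
        sub = run[:minlen]                        # MINLEN pattern substring
        for q in range(2, minlen + 1):            # q = ending position of pair xy
            xy = sub[q - 2:q]
            shift[xy] = min(shift.get(xy, minlen - 1), minlen - q)
            if minlen - q == 0:                   # right edge of a pattern
                hash_tbl.setdefault(xy, []).append(sig)
    return minlen, shift, hash_tbl

def sig_match(text, sig, pos):
    return (0 <= pos and pos + len(sig) <= len(text) and
            all(c == '?' or text[pos + j] == c for j, c in enumerate(sig)))

def scan(text, signatures):
    minlen, shift, hash_tbl = build_tables(signatures)
    i, hits = minlen - 1, []                      # start at b_MINLEN (0-based)
    while i < len(text):
        xy = text[i - 1:i + 1]
        s = shift.get(xy, minlen - 1)             # unseen pair: safe maximum skip
        if s:
            i += s                                # skip bytes that can't match
            continue
        for sig in hash_tbl.get(xy, []):          # SHIFT == 0: consult HASH
            if sig_match(text, sig, i + 1 - minlen):
                hits.append((i + 1 - minlen, sig))
        i += 1
    return hits

sigs = ["blar?g", "foo", "greep", "green", "agreed"]
print(scan("xxagreedxx", sigs))   # [(2, 'agreed')]
print(scan("blarig", sigs))       # [(0, 'blar?g')] --- '?' matched 'i'
```

On "xxagreedxx" the scan jumps from b3 straight past "xa", lands on "gr" with SHIFT 0, and only then does a full check --- most input bytes are never examined.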
Algorithm: Wu-Manber
 Here, we illustrated simplest form of the algorithm
 More advanced forms can handle 10s of thousands of signatures
 Worst case performance is terrible
o Sequential search thru every byte of input for every signature…
 But tests show it’s good in practice
 How can we know if scanner works?
 Test on live viruses?
o Might not be a good idea
 EICAR standard antivirus test file?
o Not too useful either
 So, what to do?
o Author doesn’t have any suggestions!
Improving Performance
 “Grunt scanning” --- scan everything
o Slow slow slow
 Search only beginning and end of files
 Scan code entry point
o And points reachable from entry point
 If position of virus in file is known…
o Make it part of the “signature”
 Limit scans to size of virus(es)
Improving Performance
 Only scan certain types of files
o Not so viable today
 Only rescan files that have changed
o How to detect change?
o Where to store this info? Cache? Database? Tagged to file?
o Updates to signatures? Must rescan…
o How to checksum efficiently?
Improving Performance
 How to checksum efficiently?
o Checksumming entire file might take longer than scanning it
o Only checksum parts that are scanned
 How to avoid checksum tampering?
o Encrypt? Where to store the key?
o Checksum the checksums?
o Other?
Improving Performance
 Improve the algorithm
o Maybe tailor algorithms to file type
 Optimize
o May be of limited value
 Other?
Static Heuristics
 Like having an expert look at code…
 Look for “virus-like” code
o Static, so we don’t execute the code
 Two-step process
o Gather data
o Analyze data
Static Heuristics
 What data to gather?
 “Short signatures” or boosters
o Junk code
o Decryption loop
o Self-modifying code
o Undocumented API calls
o Unusual/non-compiler instructions
o Strings containing obscenities or “virus”
 Stopper --- something a virus would not do
Static Heuristics
 Other heuristics include…
 Length of code
o Too short? May be appended virus
 Statistical analysis of instructions
o Handwritten assembly
o Encrypted code
 Might look for signature heuristics
o Common characteristics of signatures
Static Heuristics
 Analysis
 May be simple…
o Weighted sum of various factors
o Unusual opcodes, etc.
 …or more sophisticated
o Machine learning (HMM, neural nets, etc.)
o Data mining
o Heuristic search (genetic algorithm, etc.)
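The simple weighted-sum analysis might be sketched like this; the feature names, weights, and threshold are all invented for illustration:

```python
# All weights and the threshold below are made-up illustrative values.
BOOSTERS = {"junk_code": 2, "decryption_loop": 4, "self_modifying_code": 5,
            "undocumented_api_calls": 3, "non_compiler_instructions": 2}
STOPPERS = {"shows_dialog_box": -4}     # things a virus would likely not do

def heuristic_score(features, threshold=6):
    """Weighted sum of boosters and stoppers; flag file if sum >= threshold."""
    score = sum(BOOSTERS.get(f, 0) + STOPPERS.get(f, 0) for f in features)
    return score, score >= threshold

print(heuristic_score({"decryption_loop", "self_modifying_code"}))  # (9, True)
print(heuristic_score({"junk_code", "shows_dialog_box"}))           # (-2, False)
```

Tuning the weights and threshold is exactly where the false-positive/false-negative trade-off lives.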
Integrity Checkers
 Look for unauthorized change to files
 Start with 100% clean files
 Compute checksums/hashes
 Store checksums
 Recompute checksums and compare
o If they differ, a change has occurred
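The whole snapshot/recompute/compare loop fits in a few lines with a cryptographic hash; a minimal sketch (the JSON database format is just an illustration):

```python
import hashlib
import json

def sha256_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def snapshot(paths, db_path):
    """Record hashes of known-clean files (must start from 100% clean files)."""
    with open(db_path, "w") as db:
        json.dump({p: sha256_file(p) for p in paths}, db)

def verify(db_path):
    """Recompute and compare; return the files whose contents changed."""
    with open(db_path) as db:
        stored = json.load(db)
    return [p for p, h in stored.items() if sha256_file(p) != h]
```

Note that an integrity checker only reports that a change occurred, not what caused it --- the tampering and key-storage questions on the earlier performance slides apply to the database itself.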
Integrity Checkers
 Three types of integrity checkers
 Offline --- recompute checksums periodically (e.g., once/week)
 Self-checking --- modify file to check itself when run
o Essentially, a beneficial “virus”
o For example, virus scanner self-checks
 Integrity shell --- OS performs checksum before file executed
Detection: Dynamic Methods
 Detection based on running the code
o Observe the “behavior”
 Two types of dynamic methods
o Behavior monitors/blockers
o Emulation
Behavior Monitor/Blocker
 Monitor program as it runs
 Watch for “suspicious” behavior
 What is suspicious?
o It’s too far from “normal”
 What is normal?
o A statistical measure --- mean (average)
 How far is too far?
o Depends on variance, standard deviation
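That statistical notion of “too far from normal” can be sketched as a simple z-score test; the monitored metric and the cutoff k are illustrative choices, not from the book:

```python
from statistics import mean, stdev

def is_anomalous(history, value, k=3.0):
    """Flag a measurement more than k standard deviations from the
    historical mean --- 'too far' depends on the observed variance."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * sigma

# e.g., file writes per minute observed for some program
normal = [10, 11, 9, 10, 10, 11, 9, 10]
print(is_anomalous(normal, 30))   # True --- far outside normal variation
print(is_anomalous(normal, 11))   # False
```

The choice of k directly trades false positives against missed detections, which is the fundamental tension of the next few slides.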
Behavior Monitor/Blocker
 “Normal” monitored in 3 ways…
1. Actions that are permitted
o White list, positive detection
2. Actions that are not permitted
o Black list, negative detection
3. Some combination of these two
 Analogies to immune system
o Distinguish self from non-self
Behavior Monitor/Blocker
 “Care must be taken… because anomalous behavior does not automatically imply viral behavior”
o That’s an understatement!
 This is the fundamental problem in anomaly detection
o Potential for lots of false positives
Behavior Monitor/Blocker
 Look for short “dynamic signatures”
o Like signature detection, but input string generated dynamically
 But what to monitor?
 Infection-like behavior?
o Open an exe for read/write
o Read code start address from header
o Write start address to header
o Seek to end of exe, append to exe, etc.
Behavior Monitor/Blocker
 How to reduce false positives?
o Consider “ownership” --- some apps get more leeway (e.g., browser clearing cache)
 How to prevent damage?
o “Dynamic” implies code actually running…
o System undo capability?
 How long to monitor?
o Monitoring increases overhead
o Can virus outlast monitor?
Emulation
 Execute code, but not for real…
 Instead, emulate execution
 Emulation can provide all of the info gotten thru code execution
o But much safer
 “Execute” code in emulator
o Gather info for static/dynamic signatures, or…
o Behavior blocker stuff applies too
Emulation
 Emulation and polymorphic detection
o Let virus decrypt itself
o Then use ordinary signature scan
 When has decryption occurred?
o Use some heuristics…
o Execution of code that was modified (decrypted), or in such a memory location
o More than N bytes of modified code, etc.
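The “modified code being executed” heuristic can be sketched over an emulator’s event trace; the trace format and the byte threshold N below are invented for illustration:

```python
def looks_decrypted(trace, n_bytes=16):
    """Given emulator events ('write', addr) and ('exec', addr), flag likely
    self-decryption: execution reaches an address the code itself modified,
    after at least n_bytes of memory have been rewritten."""
    written = set()
    for event, addr in trace:
        if event == "write":
            written.add(addr)
        elif event == "exec" and addr in written and len(written) >= n_bytes:
            return True                  # now run an ordinary signature scan
    return False

# A decryptor loop writes 32 bytes, then jumps into them
trace = [("write", a) for a in range(0x1000, 0x1020)] + [("exec", 0x1000)]
print(looks_decrypted(trace))                                  # True
print(looks_decrypted([("write", 0x1000), ("exec", 0x1000)]))  # False
```

The threshold keeps ordinary self-modifying tricks (a single patched byte) from triggering the scan prematurely.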
Emulator Anatomy
 Emulate by single-stepping thru code?
o Easily detected by viruses (???)
o Danger of virus “escaping” emulator
 “A more elaborate emulation mechanism is needed”
o Why?
 Conceptually, 5 parts to an emulator
o Next slide please…
Emulator Anatomy
 5 parts to new-and-improved emulator
1. CPU emulation --- nothing more to say
2. Memory emulation
3. Hardware and OS emulation
4. Emulation controller
5. Extra analyses
Memory Emulation
 This could be difficult…
o 32-bit addressing, so 4GB of “memory”
 Do we need to emulate all of this?
o No, most apps only use a small amount
 Keep track of memory that’s modified and where it is located
o Only need to deal with memory that is modified by a specific app/virus
Hardware/OS Emulation
 Use stripped-down, fake OS, due to…
o Copyright issues
o Size
o Startup time
o Emulator needs additional monitoring
 What about OS system calls?
o Return faked/fixed values
o Don’t faithfully emulate some low-level stuff
Emulation Controller
 When does emulation stop?
o Can’t expect to run code to completion…
 Use heuristics to decide when to stop
o Number of instructions?
o Amount of time?
o Threshold on percent of instructions that modify memory?
o “Stoppers”? E.g., assume virus wouldn’t write output before being malicious
Emulator: Extra Analyses
 Post-emulation analysis
 For example, look at histogram of instructions
o Does it match a typical polymorphic?
o Does it match a metamorphic family?
 Other examples of post-emulation analysis…
If at First You Don’t Succeed
 Emulation controller may re-invoke emulator for the following reasons
o Rerun with different CPU parameters
o Test interrupt handlers
o Test multiple possible entry points
o Test for self-replication on “goat” files
o Test untaken branches in code
o Test “unused” memory locations
Emulator Optimizations
 Improve performance, reduce size and/or complexity
o Use the real file system (with caution)
o “Data” files must be checked for malware, use lots of stoppers
o Cache state --- if match is found to previous (non-virus) run, go to next file
 Cache register values, size, stack pointer and contents, number of writes, checksums, etc.
Comparison of Techniques
 Recall, the techniques considered…
1. Scanning
2. Static heuristics
3. Integrity checking
4. Behavior blocking
5. Emulation
Comparison of Techniques
 Scanning
 Pros:
o Precise ID of malware
 Cons:
o Requires up-to-date signatures
o Cannot detect new/unknown malware
Comparison of Techniques
 Static heuristics
 Pros:
o Detect known and unknown malware
 Cons:
o Detected malware not identified
o False positives
Comparison of Techniques
 Integrity checking
 Pros:
o Can be efficient and fast
o Detect known and unknown malware
 Cons:
o Detected after infection & not identified
o Can’t detect virus in new/modified files
o Heavy burden on users/admins
Comparison of Techniques
 Behavior blocking
 Pros:
o Known and unknown malware detected
 Cons:
o Probably won’t identify malware
o High overhead
o False positives
o Malware runs on system before detected
Comparison of Techniques
 Emulation
 Pros:
o Known, unknown, polymorphic detection
o Malware executed in “safe” environment
 Cons:
o Slow
o Malware might outlast emulator
o Might not provide identification
Detection: Bottom Line
 Static analysis is fast
o Good approach when it works
 Dynamic analysis can “peel away a layer of obfuscation”
o Dynamic analysis is relatively costly
Verification, Quarantine, and Disinfection
 What to do after virus detected?
1. Verify that it really is a virus
2. Quarantine infected code
3. Disinfect --- remove infection
 These are done rarely, so can be slow and costly in comparison to detection
Verification
 After detection comes verification
 Why verify?
o Secondary test needed due to short, general signature, or…
o …no signature, due to detection method
 Behavior, heuristic, emulation, etc.
o Do not usually provide identification
 Virus writer might try to make virus look like some other virus
Verification
 How to verify?
 “X-ray” the virus
 If encrypted, decrypt it, or frequency analysis might suffice
o Like simple substitution cipher
 Extract info/stats, etc.
Verification
 After x-ray analysis…
o Longer virus-specific signatures
o Checksum all or part of virus
o Call special-purpose verification code
 Note that these probably won’t work on (good) metamorphic code
Quarantine
 Isolate detected virus from system
o Then ask user if it’s OK to disinfect
o Or do further analysis of virus
 How to quarantine virus?
o Copy to a “quarantine” directory?
o Hide it in “invisible” location?
o Encrypt it?
Disinfection
 Disinfect == remove infection
 Not always possible to return file to its original state
o E.g., file might have been overwritten
 Disinfection
 Delete the infected file
o Pros and cons?
 Disinfection
 Restore files from backup
o Pros and cons?
 Use virus-specific info
o Info may be found automatically --- compare infected files with uninfected
o E.g., appended virus changes start address, appends itself to file, etc.
o Like a chosen plaintext attack
 Disinfection
 Use virus-behavior specific info
o E.g., prepended virus changes header
 Save some info about files
o Header info, for example
o Then changed parts can be restored
o Integrates well with integrity checker
o Restore parts until checksum matches…
 Disinfection
 Use the virus to disinfect
o Stealth virus may give original code
 Generic disinfection
o Virus may restore code when executed
o Might be dangerous to run virus code…
o …emulation is a better strategy, maybe even disinfect as part of detection
Virus Databases
 What to put in a virus database?
o Name of virus?
o Characteristics of virus?
o Signatures?
o Encrypted/hashed signatures?
o Disinfection info?
o Other info?
Virus Databases
 How to update database/signatures?
o Push or pull?
o Automatic or manual?
o How often to update?
o How to distribute updates?
o Distribute entire database or deltas?
 Also must be able to update AV software itself
Virus Updates
 Update process is a BIG target
o AV’s machines that distribute updates
o Insider attack at AV site
o Trick user into getting “AV” from attacker
o Man-in-the-middle attack on communications between user/AV
Virus Description Languages
 AV vendors have specialized virus description languages
 2 examples given in the book
Short Subjects
 A few quick points…
 Anti-stealth techniques
 Macro viruses
 Compiler optimizations and detection
Anti-Stealth Techniques
 Recall, stealth viruses hide their presence
 Anti-stealth as part of AV?
o Detect and disable stealth --- check that OS calls go to right place
o Bypass usual OS features --- direct calls to BIOS, for example
Macro Virus Detection
 Macro viruses tricky to detect
o Macros are in source code
o Easy to change source
o Robust execution when errors occur
 So, any change can create a new virus
 AV might create a new virus
o E.g., incomplete disinfection
 Macro virus can infect other macros
Macro Viruses
 One redeeming feature…
 They operate in a restricted domain
o So easier to determine “normal”
o Reduces number of false positives
 Most/all are not parasitic
o More like companion viruses
 All the usual detection techniques can be applied
Macro Viruses: Disinfection
 Delete all macros in infected document
 Delete all associated macros
 Delete macro if in doubt (heuristic)
 Emulation to find all macros used by infected macro, and delete them
 Basic idea?
o Err on side of caution/deletion
 Macro viruses not so common today
Compiler Optimization
 Compilers use similar techniques as AV
 “Optimizing compiler” for detection??
o Constant propagation --- reduces variables
o Dead code elimination (code executed, but not needed)
o Polymorphics may have lots of dead code
 If used, efficiency could be an issue
o Compilers extensively studied
o Bad cases well-known, so virus writers might take advantage of these
