Simple Substitution Distance
and Metamorphic Detection
Gayathri Shanmugam
Richard M. Low
Mark Stamp
The Idea
• Metamorphic malware “mutates” with each infection
• Measuring software similarity is one method of detection
• But how to measure similarity?
  o Lots of relevant previous work
• Here, an unusual and interesting distance measure is considered

Simple Substitution Distance
• We treat each metamorphic copy as if it is an “encrypted” version of “base” virus
  o Where the cipher is a simple substitution
• Why simple substitution?
  o Easy to work with, fast algorithm to solve
• Why might this work?
  o Simple substitution cryptanalysis gives results that match family statistics
  o Accounts for modifications to files similar to some common metamorphic techniques

Motivation
• Given a simple substitution ciphertext where plaintext is English…
  o If we cryptanalyze using English language statistics, we expect a good score
  o If we cryptanalyze using, say, French language statistics, we expect a not-so-good score
• We can obtain opcode statistics for a metamorphic family
  o Using simple substitution cryptanalysis, a virus of same family should score well…
  o …but a benign exe should not score as well
  o Assuming statistics of these families differ

Metamorphic Techniques
• Many possible morphing strategies
• Here, briefly consider
  o Register swapping
  o Garbage code insertion
  o Equivalent substitution
  o Transposition
  o Formal grammar mutation
• At a high level: substitution, transposition, insertion, and deletion

Register Swap
• Register swapping
  o E.g., replace EBX register with EAX, provided EAX not in use
• Very simple and used in some of first metamorphic malware
• Not very effective
  o Why not?

Garbage Insertion
• Garbage code insertion
• Two cases:
  o Dead code: inserted, but not executed (we can simply JMP over dead code)
  o Do-nothing instructions: executed, but have no effect on program (like NOP or ADD EAX,0)
• Relatively easy to implement
• Effective at breaking signatures
• Changes the opcode statistics

Code Substitution
• Equivalent instruction substitution
  o For example, can replace SUB EAX,EAX with XOR EAX,EAX
• Does not need to be 1 for 1 substitution
  o That is, can also include insertion/deletion
• Unlimited number of substitutions
  o And can be very effective
• Somewhat difficult to implement

Transposition
• Transposition
  o Reorder instructions that have no dependency
• For example,
      MOV R1,R2
      ADD R3,R4
  can be reordered as
      ADD R3,R4
      MOV R1,R2
• Can be highly effective
• But can be difficult to implement
  o Sometimes applied only to subroutines

Formal Grammar Mutation
• Formal grammar mutation
• View morphing engine as a nondeterministic automaton
  o Allow transitions between any symbols
  o Apply formal grammar rules
• Obtain many variants, high variation
• Really just a formalization of other approaches, not a separate technique

Previous Work
• Easy to prove that “good” metamorphic code is immune to signature detection
  o Why?
• But many successes detecting hacker-produced metamorphic malware…
  o HMM/PHMM/machine learning
  o Graph-based techniques
  o Statistics (chi-squared, naïve Bayes)
  o Structural entropy
  o Linear algebraic techniques

Topic of This Research
• Measure similarity using simple substitution distance
• We “decrypt” suspect file using statistics from a metamorphic family
  o If decryption is good, we classify it as a member of the same metamorphic family
  o If decryption is poor, we classify it as NOT a member of the given family

Simple Substitution Cipher
• Simple substitution is one of the oldest and simplest means of encryption
• A fixed key used to substitute letters
  o For example, Caesar’s cipher, substitute letter 3 positions ahead in alphabet
  o In general, any permutation can be key
• Simple substitution cryptanalysis?
  o Statistical analysis of ciphertext

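The cipher described above can be sketched in a few lines. This is only an illustration: the helper names (`caesar_key`, `encrypt`, `decrypt`) are made up here, and the alphabet is assumed to be the 26 uppercase English letters.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_key(shift=3):
    """Key for a Caesar cipher: each letter maps to the one `shift` places ahead."""
    return ALPHABET[shift:] + ALPHABET[:shift]

def encrypt(plaintext, key):
    """Apply a simple substitution; `key` may be any permutation of ALPHABET."""
    table = str.maketrans(ALPHABET, key)
    return plaintext.upper().translate(table)

def decrypt(ciphertext, key):
    """Invert the substitution by mapping key letters back to ALPHABET."""
    table = str.maketrans(key, ALPHABET)
    return ciphertext.upper().translate(table)

print(encrypt("ATTACK", caesar_key()))   # DWWDFN
```

Any shuffled copy of `ALPHABET` works as a key in the same functions, which is the "any permutation can be key" case.
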
Simple Substitution Cryptanalysis

• Suppose you observe the ciphertext
PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQW
AXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVX
GTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZH
VFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJ
TODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOT
HPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCF
HQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIX
PFHXAFQHEFZQWGFLVWPTOFFA

• Analyze frequency counts…
• Likely that ciphertext “F” represents “E”
  o And so on, at least for common letters

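A frequency-based first guess at the key might look like the sketch below. The ordering in `ENGLISH_BY_FREQUENCY` is one commonly quoted ranking and is an assumption here, since the exact order varies by corpus. The function name is also hypothetical.

```python
from collections import Counter

# Assumed ranking of English letters, most frequent first (varies by corpus).
ENGLISH_BY_FREQUENCY = "ETAOINSRHLDCUMFPGWYBVKXJQZ"

def initial_key_guess(ciphertext):
    """Pair ciphertext letters with plaintext letters by frequency rank."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    # The most common ciphertext letter is guessed to be "E", and so on.
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

print(initial_key_guess("FFFFBBBA"))   # {'F': 'E', 'B': 'T', 'A': 'A'}
```

Such a guess is usually right only for the few most common letters, which is why it serves as a starting point for a search rather than as the answer.
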
Simple Substitution Cryptanalysis
• Can automate the cryptanalysis
  1. Make initial guess for key using frequency counts
  2. Compute oldScore
  3. Modify key by swapping adjacent elements
  4. Compute newScore
  5. If newScore > oldScore, let oldScore = newScore
  6. Else unswap key elements
  7. Goto 3
• How to compute score?
  o Number of dictionary words in putative plaintext?
  o Much better to use English digraph statistics

Jakobsen’s Algorithm
• Method on previous slide can be slow
  o Why?
• Jakobsen’s algorithm uses similar idea, but is fast and efficient
  o Ciphertext is only decrypted once
  o So algorithm is (essentially) independent of length of message
  o Then, only matrix manipulations required

Jakobsen’s Algorithm: Swapping
• Assume plaintext is English, 26 letters
• Let K = k1,k2,k3,…,k26 be putative key
  o And let “|” represent “swap”
• Then we swap elements as follows
  [figure: swap schedule, not preserved]
• Restart this swapping from the beginning whenever the score improves

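The swap schedule itself was a figure. As it is usually described for Jakobsen's method (an assumption here, not taken from the slide), elements at distance 1 are swapped first, then distance 2, and so on up to distance 25. A sketch with 0-based indices and a made-up function name:

```python
def swap_schedule(n=26):
    """Yield index pairs (i, j) to swap: distance 1 first, then 2, and so on."""
    for distance in range(1, n):
        for i in range(n - distance):
            yield i, i + distance

pairs = list(swap_schedule(26))
print(len(pairs))    # 325, i.e. 26 choose 2
print(pairs[:3])     # [(0, 1), (1, 2), (2, 3)]
```

One full pass with no improvement visits every pair of key positions exactly once, which is where the count of 26 choose 2 comes from.
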
Jakobsen’s Algorithm: Swapping
• Minimum swaps is 26 choose 2, or 325
• Maximum is unbounded
• Each swap requires a score computation
• Average number of swaps, experimentally:
  o Ciphertext of length 500, average 1050 swaps
  o Ciphertext of length 8000, average just 630 swaps
• So, work depends on length of ciphertext
  o More ciphertext, better scores, fewer swaps

Jakobsen’s Algorithm: Scoring
• Let D = {dij} be digraph distribution corresponding to putative key K
• Let E = {eij} be digraph distribution of English language
• These matrices are 26 x 26
• Compute score as
  [equation not preserved]

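The score formula on this slide did not survive extraction. One distance that fits the definitions of D and E, and that Jakobsen's method is generally described as using, is the sum of absolute differences of corresponding entries. Treat the sketch below as an assumption, not as the slide's formula. With this choice a lower score means a better match.

```python
def score(D, E):
    """Sum of |d_ij - e_ij| over all entries of two equally sized matrices."""
    return sum(
        abs(d - e)
        for d_row, e_row in zip(D, E)
        for d, e in zip(d_row, e_row)
    )

print(score([[1, 2], [3, 4]], [[1, 0], [5, 4]]))   # 4
```
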
Jakobsen’s Algorithm
• So far, nothing fancy here
  o Could see all of this in a CS 265 assignment
• Jakobsen’s trick: Determine new D matrix from old D without decrypting
• How to do so?
  o It turns out that swapping elements of K swaps corresponding rows and columns of D
• See example on next slides…

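The row-and-column claim can be checked numerically. The sketch below uses a 10-symbol alphabet and random data. All names are illustrative, and the convention that `key[c]` gives the putative plaintext symbol for ciphertext symbol `c` is an assumption made for this example.

```python
import random

def digraph_counts(symbols, n):
    """Count adjacent pairs in a sequence of symbol indices 0..n-1."""
    D = [[0] * n for _ in range(n)]
    for a, b in zip(symbols, symbols[1:]):
        D[a][b] += 1
    return D

def decrypt(cipher, key):
    """key[c] is the putative plaintext symbol for ciphertext symbol c."""
    return [key[c] for c in cipher]

def swap_rows_and_columns(D, a, b):
    """Return a copy of D with rows a, b exchanged and columns a, b exchanged."""
    D = [row[:] for row in D]
    D[a], D[b] = D[b], D[a]
    for row in D:
        row[a], row[b] = row[b], row[a]
    return D

n = 10
rng = random.Random(0)
cipher = [rng.randrange(n) for _ in range(200)]
key = list(range(n))
rng.shuffle(key)

old_D = digraph_counts(decrypt(cipher, key), n)

# Swap two key elements, then rebuild D the slow way by decrypting again.
x, y = 0, 1
a, b = key[x], key[y]          # the two plaintext symbols that trade places
key[x], key[y] = key[y], key[x]
slow_D = digraph_counts(decrypt(cipher, key), n)

# The shortcut: no decryption, just permute the old matrix.
fast_D = swap_rows_and_columns(old_D, a, b)
print(slow_D == fast_D)        # True
```

The shortcut touches only the n x n matrix, so its cost does not grow with the length of the message.
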
Swapping Example
• To simplify, suppose 10 letter alphabet
  E, T, A, O, I, N, S, R, H, D
• Suppose you are given the ciphertext
  TNDEODRHISOADDRTEDOAHENSINEOAR
  DTTDTINDDRNEDNTTTDDISRETEEEEEAA
• Frequency counts given by
  [table: ciphertext frequency counts, not preserved]

Swapping Example
• We choose the putative key K
  [table: putative key K, not preserved]
• The corresponding putative plaintext is
  AOETRENDSHRIEENATE
  RIDTOHSOTRINEAAEAS
  OEENOTEOAAAEESHNA
  TTTTTII
• Corresponding digraph distribution D
  [table: digraph matrix D, not preserved]

Swapping Example
• Suppose we swap first 2 elements of K
• Then decrypt using new K
• And compute digraph matrix for new K
  [figure: previous key K and new key K, not preserved]

Swapping Example
• Old D matrix vs new D matrix
  [figure: old and new D matrices, not preserved]
• What do you notice?
• So what’s the point here?
• This is good!

Jakobsen’s Algorithm

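The algorithm figure on this slide was not preserved. The following is a minimal sketch of the hill climb as described on the preceding slides, under several assumptions: the score is the sum of absolute differences (lower is better), the swap schedule goes by increasing distance, and `key[p]` names the ciphertext symbol assumed to stand for plaintext symbol `p`. The function name and conventions are illustrative.

```python
def jakobsen(C, E, key):
    """Hill climb on `key` so that the permuted matrix C best matches E.

    C[x][y]: digraph counts (or frequencies) of the ciphertext symbols.
    E[p][q]: expected digraph statistics of the plaintext language.
    key[p]:  ciphertext symbol currently assumed to stand for plaintext p.
    Returns (best_key, best_score); a lower score means a closer match.
    """
    n = len(key)
    key = list(key)

    def score(k):
        # D[p][q] = C[k[p]][k[q]], so D is never rebuilt from the message.
        return sum(abs(C[k[p]][k[q]] - E[p][q])
                   for p in range(n) for q in range(n))

    best = score(key)
    a, b = 0, 1                      # try swapping key[a] with key[a + b]
    while b < n:
        i, j = a, a + b
        key[i], key[j] = key[j], key[i]
        trial = score(key)
        if trial < best:
            best = trial
            a, b = 0, 1              # improvement: restart the schedule
        else:
            key[i], key[j] = key[j], key[i]   # no gain: undo the swap
            a += 1
            if a + b >= n:           # end of this distance, move to the next
                a, b = 0, b + 1
    return key, best
```

Each trial costs on the order of n squared operations on the matrices, whatever the message length. As with any hill climb, the result can be a local optimum, not the true key.
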
Proposed Similarity Score
• Extract opcode sequences from collection of (family) viruses
  o All viruses from same metamorphic family
• Determine n most common opcodes
  o Symbol n+1 used for all “other” opcodes
• Use resulting digraph statistics to form matrix E = {eij}
  o Note that matrix is (n+1) x (n+1)

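Building E from a set of family opcode sequences could be sketched as follows. The function names are hypothetical, symbols are numbered from 0 so the "other" symbol is index n instead of n+1, and raw counts are kept without normalization. All of these are assumptions made for illustration.

```python
from collections import Counter

def top_opcodes(sequences, n):
    """Return the n most common opcodes across all family sequences."""
    counts = Counter(op for seq in sequences for op in seq)
    return [op for op, _ in counts.most_common(n)]

def to_symbols(sequence, common):
    """Map opcodes to 0..n-1 by rank; every other opcode becomes symbol n."""
    index = {op: i for i, op in enumerate(common)}
    other = len(common)
    return [index.get(op, other) for op in sequence]

def family_digraph_matrix(sequences, n):
    """Build the (n+1) x (n+1) matrix E of digraph counts for a family."""
    common = top_opcodes(sequences, n)
    size = len(common) + 1
    E = [[0] * size for _ in range(size)]
    for seq in sequences:
        symbols = to_symbols(seq, common)
        for a, b in zip(symbols, symbols[1:]):
            E[a][b] += 1
    return common, E

family = [["mov", "add", "mov", "push"], ["mov", "add", "jmp"]]
common, E = family_digraph_matrix(family, 2)
print(common)   # ['mov', 'add']
print(E)        # [[0, 2, 1], [1, 0, 1], [0, 0, 0]]
```

Digraphs are counted within each file only, so the last opcode of one virus is never paired with the first opcode of the next.
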
Scoring a File
• Given an executable we want to score…
• Extract its opcode sequence
• Use opcode digraph stats to get D = {dij}
  o This matrix is also (n+1) x (n+1)
• Initial “key” K chosen to match monograph stats of virus family
  o Most frequent opcode in exe maps to most frequent opcode in virus family, etc.
• Score based on distance between D and E
  o “Decrypt” D and score how closely it matches E
  o Jakobsen’s algorithm used for “decryption”

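The choice of initial key can be sketched as a frequency ranking of the file's symbols. This assumes the family's symbols are already numbered by rank (0 for the most common), so that `key[r]` is the file symbol paired with the family's rank-r symbol. The function name and this numbering are illustrative assumptions.

```python
from collections import Counter

def initial_key(exe_symbols, size):
    """Order the file's symbols 0..size-1 from most to least frequent."""
    counts = Counter(exe_symbols)
    # Sort by descending count; break ties by symbol value for determinism.
    return sorted(range(size), key=lambda s: (-counts[s], s))

print(initial_key([2, 2, 2, 0, 0, 1], 3))   # [2, 0, 1]
```

Here symbol 2 is the most frequent in the file, so it is matched with the family's most frequent opcode, and so on down the ranking.
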
Example
• Suppose only 5 common opcodes in family viruses (in descending frequency)
• Extract following sequence from an exe
• Initial “key” is
• And “decrypt” is
  [figures: opcode list, extracted sequence, key, and decrypt, not preserved]

Example
• Given “decrypt”
• Form D matrix
• After swap
  o And so on…

Scoring Algorithm

Quantifying Success
• Consider these 2 scatterplots of scores
• Which is better (and why)?

ROC Curves
• Plot true-positive rate vs false-positive rate
  o As “threshold” varies
• Curve nearer 45-degree line is bad
• Curve nearer upper-left is better

ROC Curves
• Use ROC curves to quantify success
• Area under the ROC curve (AUC)
  o Probability that randomly chosen positive instance scores higher than a randomly chosen negative instance
• AUC of 1.0 implies ideal detection
• AUC of 0.5 means classification is no better than flipping a coin

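The probabilistic reading of AUC translates directly into code. This is a brute-force sketch with a made-up function name. Counting a tie as half a win is a common convention and is assumed here.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

print(auc([0.9, 0.8], [0.3, 0.1]))   # 1.0
print(auc([0.9, 0.2], [0.5, 0.1]))   # 0.75
```

If the score is a distance, so that family members are expected to score lower than benign files, negate the scores (or swap the two arguments) before calling the function.
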
Parameter Selection
• Tested the following parameters
  o Opcode matrix size
  o Scoring function
  o Normalization
  o Swapping strategy
• None significant, except matrix size
  o So we only give results for matrix size

Opcode Matrix Size
• Obtained following results
  [figure: results by matrix size, not preserved]
• So, ironically, we use 26 x 26 matrix

Test Data
• Tested the following metamorphic families
  o G2: known to be weak
  o NGVCK: highly metamorphic
  o MWOR: highly metamorphic and stealthy
• MWOR “padding ratios” of 0.5 to 4.0
• For G2 and NGVCK
  o 50 files tested, Cygwin utilities for benign files
• For each MWOR padding ratio
  o 100 files tested, Linux utilities for benign files
• 5-fold cross validation in each experiment

NGVCK and G2 Graphs
MWOR Score Graphs
MWOR ROC Curves
MWOR AUC Statistics
Efficiency
Conclusions
+ Simple substitution score, good results for challenging metamorphics
+ Scoring is fast and efficient
+ Applicable to other types of malware
- Requires opcodes

Related Work
• Recently, we generalized Jakobsen’s algorithm to “combination” cipher
• Simple substitution column transposition (SSCT)
• Uses multiple D matrices
  o One D matrix for each column
  o Enables easy column manipulations
  o Overall, fast and effective SSCT attack

SSCT
• SSCT for malware detection
• This might be stronger malware score
  o Why?
• Finding good test data is an issue
  o Can we find/make data where SSCT outperforms simple substitution score?
• Currently studying this problem

Homophonic Substitution
• Homophonic substitution allows more than one ciphertext symbol for each plaintext symbol
  o Easy to encrypt, but harder to break than simple substitution. Why?
• Previous student developed Jakobsen-like algorithm for homophonic substitution
  o Uses a nested hill climb approach
• This could be tested on malware

HMM
• A different way to attack simple substitution ciphers?
• Train an HMM (of course!)
  o Let A be 26 x 26, English digraph stats
  o Then train, without updating A matrix
  o Resulting B matrix is the key
  o Can work for homophonic case too
• Any problems with this?

HMM with Random Restarts
• HMM requires lots of data to converge
• Often, we don’t have lots of data
• In such cases, try random restarts
  o HMM should converge with less data if we start closer to the solution
  o Try enough random restarts, might start close enough to converge
• How many random restarts?

HMM with Random Restarts
• Could be applied to malware detection
  o However, slow and expensive
• More relevant for cryptanalysis
• Zodiac 340 cipher, for example
  o This has previously been analyzed using millions of random restarts

References
• G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013
• A. Dhavare, R.M. Low, and M. Stamp, Efficient cryptanalysis of homophonic substitution ciphers, Cryptologia, 37(3):250-281, 2013
• T. Berg-Kirkpatrick and D. Klein, Decipherment with a million random restarts, http://www.cs.berkeley.edu/~tberg/papers/emnlp2013.pdf