### Hashing Concepts

```Hashing Concepts
CSC 485/585
Objectives

Define Hashing and Hash Values.
Explain the common uses of Hashes within the field of
Computer Forensics:
  Data Authentication
  Data Reduction
  File Identification
Explain the limitations of Hashes.
What is a Hash Function?

A hash function is any well-defined procedure or mathematical
function which converts a large, possibly variable-sized amount
of data into a small datum. The values returned by a hash
function are called hash values, hash codes, hash sums, or simply
hashes.
Cryptographic Hash Functions

A cryptographic hash function is a deterministic procedure that takes
an arbitrary block of data and returns a fixed-size bit string, the
(cryptographic) hash value, such that an accidental or intentional
change to the data will change the hash value. The data to be encoded is
often called the "message", and the hash value is sometimes called the
message digest or simply digest.

The ideal cryptographic hash function has three main properties:
 it is infeasible to find a message that has a given hash,
 it is infeasible to modify a message without changing its hash,
 it is infeasible to find two different messages with the same hash.

MD5 and SHA-1 are the most commonly used cryptographic hash
functions (a.k.a. algorithms) in the field of Computer Forensics.
MD5

MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash
value.

The 128-bit MD5 hashes (also termed message digests) are represented as a sequence of 32 hexadecimal
digits (16 bytes). The following demonstrates a 40-byte ASCII input and the corresponding MD5 hash:

MD5 of “This is an example of an MD5 Hash Value.” = 3413EE4F01F2A0AA17664088E79CF5C2

Even a small change in the message will result in a completely different hash. For example, changing the
period at the end of the sentence to an exclamation mark:

MD5 of “This is an example of an MD5 Hash Value!” = B872D23A7D14B6EE3B390A58C17F21A8
SHA-1

SHA stands for Secure Hash Algorithm.

SHA-1 produces a 160-bit digest from a message and is represented as a sequence
of 40 hexadecimal digits (20 bytes).

Just like MD5, even a small change in a message will result in a completely different
hash. For example:

SHA1 of “This is a test.” =
AFA6C8B3A2FAE95785DC7D9685A57835D703AC88

SHA1 of “This is a pest.” =
FE43FFB3C844CC93093922D1AAC44A39298CAE11
Statistics


The MD5 hash algorithm - for an ideal 128-bit hash, the chance of
2 arbitrary files having the same MD5 hash value is 1 in 2 to the
128th power = 1 in 3.4028236692093846346337460743177e+38 or
1 in 340 billion billion billion billion.
The SHA-1 hash algorithm - for an ideal 160-bit hash, the chance of
2 arbitrary files having the same SHA-1 hash value is 1 in 2 to the
160th power = 1 in 1.4615016373309029182036848327163e+48 or
1 in....a REALLY big number!
What do CF Examiners use Hashes for?

Data Authentication
  To prove two things are the same.
Data Reduction
  To exclude many “known” files from the hundreds of thousands of
  files you have to look at.
File Identification
  To find a needle in a haystack.
Data Authentication



One of the most important issues a computer forensic
examiner faces is being able to “authenticate” digital
evidence.
This is done via Chain of Custody, Documentation, and Hash
values.
Using MD5 or SHA-1 hashing tools, an examiner should be
able to verify that data has not changed. A hash of the
acquired data must be identical to a hash of the original
evidence.
Data Authentication





Calculating a “hash value” for any block of data (e.g. a file, an entire disk, a
partition, etc.) can be accomplished as a stand-alone task or simultaneously
with the acquisition process (by most tools).
Calculating the “hash value” of an entire disk is done by reading all data on
the disk, running it through the desired algorithm, and generating a hash of
all data read. The examiner then typically documents the resulting hash
value.
The resulting “hash value” is a hash of the data READ from the disk, not
necessarily a hash of the data WRITTEN to your target disk during the
acquisition process.
Input/Output errors and bad sector errors encountered during the
acquisition process will affect the resulting hash value.
An examiner should run a verification process after acquisition to ensure
that the original hash value calculated while reading the original data
matches the hash value of the data written out to your target disk.
FTK Imager
(Hashes calculated without acquiring drive)
WinHex Specialist
(Hash calculated without acquiring drive)
Linux – md5sum & sha1sum
(using Helix3-2009R1)
FTK Imager
(Hashes calculated (and verified) as part of acquisition process)
Data Authentication

Considerations:



Drives will start to fail as they get older, resulting in “bad
sectors”. Bad sectors = inability to obtain matching hash
values when comparing a hash of the original disk to the hash
of a forensic image of the data read from the disk.
The more time a disk spends spinning, the greater the chance of disk
failure(s). To calculate a hash value of a drive, you must read all
data on the disk. To acquire a forensic image, you must read all
data on the disk.
If your imaging tool does not simultaneously capture a hash
value as part of the data acquisition process, consider whether
the risk of doubling the time the disk spends spinning to obtain a
pre-acquisition hash value is appropriate given that your primary
objective is to obtain the data.
Data Authentication

In the previous slides, we looked at hashing an entire drive. Using hashes, an
examiner can also verify that a specific file or any block of data has not changed.

Hash individual file(s) with FTK Imager, WinHex, md5summer, and many other
hashing tools.
Data Authentication

Note that although these graphic files look identical, a single modified byte will result in hash
values that do not match.
Data Authentication

When hashing individual files:





Changing filename or extension does NOT change hash value.
Changing Modified, Accessed, Created dates does NOT change hash value.
Changing file system attributes (read-only, hidden, system, etc.) does NOT
change hash value.
Changing ANYTHING within the file contents DOES change the hash value of
the file.

For files that contain “Metadata”, such as MS Word documents, changes within the Metadata DO change
the contents of the file and therefore change the hash value of the file.

For example, if you opened an MS Word document, made no changes to the contents of the file and
just re-saved the file, MS Word would update the dates saved within the Metadata and the actual
raw content of the overall Word document would change and therefore generate a different hash
value.
Cropping a graphic, changing the resolution, saving as another graphic format
(BMP to JPEG), or any other change that may not necessarily change the visual
depiction of the picture, WILL change the raw contents of the file and therefore
will change the hash value of the file.
Data Authentication

NOTE:



Although we just told you that changing a filename or other “non-content” of a
file does not change the hash value of the file….
Such a “non-content” change DOES make a change to the FAT directory entry,
MFT entry, or other file system component that holds the filename, MAC dates,
attributes, etc. and therefore DOES change the data on the file system that holds
the file in question.
Therefore a change of a filename, MAC date, file attribute, etc. DOES NOT
change the hash value of the file, but it DOES change the hash value of the disk
on which the file is stored.
Data Reduction





As the storage capacity of disks grows, so does the number of files a
computer forensic examiner must examine.
A typical hard drive containing a Windows installation, software
etc. will contain well over a hundred thousand files.
Large databases containing hash values of “known” files can be used
by a forensic examiner to reduce the number of files he or she must
analyze.
Files that are known to be part of the operating system and/or
installed software applications are likely not going to contain
evidence.
By excluding all known operating system files and files from known
software applications, an examiner is left with mostly user-created
files to review for potential evidence.
Data Reduction





Using forensic software tools, an examiner calculates the hash value
of all files on a disk.
Then the examiner uses the software tool to compare the
calculated hash values against all of the hash values within a known
hash database to identify any matching hash values.
The examiner can then exclude from view any files with hash values
matching those in the database.
The examiner can also exclude from view any files that are
duplicates of each other according to their hash values, further
reducing the number of files in view.
This process, called “Data Reduction”, can save the examiner from
analyzing many thousands of unnecessary files.
Data Reduction

Hash Databases:




National Software Reference Library (NSRL) –
Reference Data Sets (RDS) - NIST
HashKeeper (LE, Military and Government only) - NDIC
Known File Filter (KFF) – AccessData, Inc.
Self-generated or shared databases
NIST NSRL (RDS)
Forensic Tool Kit - KFF
File Identification



Quickly identifying a specific “notable” file or files amongst the
hundreds of thousands of files on a disk can also be accomplished by
use of hash databases: finding the needle in the haystack!
Instead of using a database of known “ignorable” files such as OS
files, databases containing hash values of known “notable” files can
be utilized.
Examples of common “notable” files are:
  Child Pornography and other contraband images
  Hacker Tools
  Viruses, Trojans and other Malware
The examiner can search by hash value and flag any files with hash
values matching those in the “notable” database.
Limitations


A mismatched hash value only tells you something changed, not
what changed!
When using MD5, SHA-1 or other standard cryptographic
hashes to identify known files, only EXACT matches will result
in success.
  When files are slightly modified, standard hashing will not identify
  similar files.
  “Fuzzy Hashing”, as implemented in the tool ssdeep, uses a concept
  called context triggered piecewise hashing to identify files that have
  similar pieces but may not be entirely identical.
Hash “collisions” have been demonstrated for both MD5 and SHA-1, and
some argue that stronger (more collision-resistant) hash algorithms,
such as those in the SHA-2 family, should be used in computer forensics.
```
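The MD5 and SHA-1 slides state that the digests are fixed-size and that a one-character change produces a completely different value. A minimal sketch of that behaviour, using Python's standard `hashlib` module, is shown here. The helper names `hex_digest` and `differing_positions` are illustrative and not part of any forensic tool, and the sketch assumes the message is plain ASCII text with no trailing newline.

```python
import hashlib


def hex_digest(algorithm: str, message: str) -> str:
    """Return the uppercase hexadecimal digest of an ASCII text message."""
    data = message.encode("ascii")
    return hashlib.new(algorithm, data).hexdigest().upper()


def differing_positions(first: str, second: str) -> int:
    """Count the hex-digit positions at which two equal-length digests differ."""
    return sum(1 for a, b in zip(first, second) if a != b)


if __name__ == "__main__":
    for algorithm in ("md5", "sha1"):
        original = hex_digest(algorithm, "This is a test.")
        modified = hex_digest(algorithm, "This is a pest.")
        # MD5 digests are 32 hex digits long, SHA-1 digests are 40.
        print(algorithm, "digest length:", len(original), "hex digits")
        print("  original:", original)
        print("  modified:", modified)
        print("  positions that differ:", differing_positions(original, modified))
```

Running the script prints both digests for each algorithm, so the values can be compared by eye against a hashing tool of your choice.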
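The Data Authentication slides describe reading all of the data once, running it through the chosen algorithm, and then verifying that the copy matches the original. The following is a rough sketch of that idea for ordinary files, again using only the Python standard library. It is not how FTK Imager or WinHex work internally; the function names, the 1 MiB chunk size, and the temporary file names are all assumptions made for illustration. The demo also shows the point made about renaming: the filename changes but the hash does not, while a single changed byte does change it.

```python
import hashlib
import os
import shutil
import tempfile

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large images fit in memory


def hash_file(path, algorithms=("md5", "sha1")):
    """Read the file once and return {algorithm: hex digest} for each algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify_copy(source_path, copy_path):
    """Return True when every digest of the copy matches the source."""
    return hash_file(source_path) == hash_file(copy_path)


def demo():
    """Copy a file, rename the copy, then flip one byte; report each comparison."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "evidence.bin")
        image = os.path.join(workdir, "image.dd")
        with open(original, "wb") as handle:
            handle.write(b"evidence" * 512)
        shutil.copyfile(original, image)
        after_copy = verify_copy(original, image)

        # Renaming changes file system metadata, not file contents.
        renamed = os.path.join(workdir, "renamed.txt")
        os.rename(image, renamed)
        after_rename = verify_copy(original, renamed)

        # Flip every bit of the first byte: the contents now differ.
        with open(renamed, "r+b") as handle:
            first = handle.read(1)
            handle.seek(0)
            handle.write(bytes([first[0] ^ 0xFF]))
        after_edit = verify_copy(original, renamed)
        return after_copy, after_rename, after_edit


if __name__ == "__main__":
    print(demo())
```

Computing MD5 and SHA-1 in the same pass reflects the concern raised in the Considerations slide: each additional full read of a fragile disk adds risk, so one read that feeds several algorithms is preferable to one read per algorithm.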
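The Data Reduction and File Identification slides describe the same mechanism used in two directions: hashes matching a "known ignorable" set are hidden, duplicates are hidden, and hashes matching a "notable" set are flagged. A simplified sketch of that triage logic follows. The in-memory sets stand in for real databases such as the NSRL RDS or KFF, whose actual file formats are not parsed here, and the function names and sample file names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the lowercase MD5 hex digest of a block of data."""
    return hashlib.md5(data).hexdigest()


def triage(files, known_ignorable, known_notable):
    """Split files into (to_review, flagged) by comparing MD5 digests.

    files            maps a file name to its raw contents (bytes)
    known_ignorable  set of digests for known OS/application files
    known_notable    set of digests for files that must be flagged
    """
    seen = set()
    to_review = []
    flagged = []
    for name, content in files.items():
        digest = md5_hex(content)
        if digest in known_notable:
            flagged.append(name)      # File Identification: exact match found
            continue
        if digest in known_ignorable or digest in seen:
            continue                  # Data Reduction: known file or duplicate
        seen.add(digest)
        to_review.append(name)
    return to_review, flagged


if __name__ == "__main__":
    sample_files = {
        "kernel32.dll": b"operating system file",
        "notes.txt": b"user notes",
        "notes_copy.txt": b"user notes",
        "tool.exe": b"hacker tool",
    }
    ignorable = {md5_hex(b"operating system file")}
    notable = {md5_hex(b"hacker tool")}
    print(triage(sample_files, ignorable, notable))
```

Because the comparison is on exact digests, a notable file that has been altered by even one byte would land in the review list rather than the flagged list, which is the limitation the final slide describes.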