Using Identity Credential Usage Logs to
Detect Anomalous Service Accesses
ACM DIM 2009, Chicago, IL, 2009
Daisuke Mashima
Dr. Mustaque Ahamad
College of Computing
Georgia Institute of Technology
Atlanta, GA, USA
Increasing Risk of Identity Theft
• Variety of online identity credentials
– Passwords, certificates, SSN, credit card
number, etc.
– Loss and theft are possible and common
• Consequence of online identity theft
– Impersonation
– Disclosure of sensitive information
– Financial loss
To counter such threats…
• Online service providers are required to
– Analyze huge volumes of log records to identify
suspicious service accesses
– Investigate identified records extensively
• In reality…
– Significant reliance on human experts
– Logs are not processed in real time
• An automated mechanism to monitor identity
usage (service accesses) is desired.
Outline
• Observations from real data sets
• Our approach
• Anomaly-based risk scoring scheme
• Preliminary evaluation
• Conclusion / Future Work
Buzzport Access Log
380484533347391, 24/08/2007 14:07:05, 24/08/2007 14:18:46
380484533347391, 27/08/2007 08:01:14, 27/08/2007 08:02:54
380484533347391, 27/08/2007 08:04:36, 27/08/2007 08:16:05
380484533347391, 27/08/2007 12:05:36, 27/08/2007 12:18:15
380484533347391, 31/08/2007 14:31:43, 31/08/2007 14:38:08
• Records contain only
– (Anonymized) User ID
– Login timestamp
– Logout timestamp
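A minimal sketch of reading one such record, assuming each record is a single comma-separated line with day/month/year timestamps as in the sample rows. The names `AccessRecord` and `parse_record` are hypothetical and not part of the original system.

```python
from datetime import datetime
from typing import NamedTuple

# Assumed layout: "<user id>, <login timestamp>, <logout timestamp>"
TIME_FORMAT = "%d/%m/%Y %H:%M:%S"


class AccessRecord(NamedTuple):
    user_id: str
    login: datetime
    logout: datetime


def parse_record(line: str) -> AccessRecord:
    """Split one log line into its three fields and parse the timestamps."""
    user_id, login, logout = [field.strip() for field in line.split(",")]
    return AccessRecord(
        user_id,
        datetime.strptime(login, TIME_FORMAT),
        datetime.strptime(logout, TIME_FORMAT),
    )


record = parse_record("380484533347391, 24/08/2007 14:07:05, 24/08/2007 14:18:46")
session_seconds = (record.logout - record.login).total_seconds()
```

With only these three fields, everything used for profiling has to be derived from the timestamps themselves.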
Another data set
• Log records from the portal of an online trading
company
• The following items are available:
– User ID
– Coarse Action Type (Login / Logout)
– Timestamp
– IP Address
– Organization Name etc.
Observations and Considerations
• Available information is quite limited.
– Typical fraud detection systems rely on much
richer information
• Data are not labeled.
– Supervised techniques are not available.
• Limited types of events can be observed.
– Schemes relying on event sequence or state
transition have limited applicability.
Our Approach
• Utilize attributes derived from an individual
identity usage record
– Timestamp (day-of-week etc.), IP address, etc.
– Focus on categorical attributes
• Build user profile based on occurrence
frequency of each attribute value
• Determine risk scores based on frequency
information
User Profile Management
• Defined as a frequency distribution of attribute
values (categories)
– One profile for one attribute
– Multiple profiles can be defined per user.
• Day-of-week profile, hour-of-day profile, and so forth…
• Updated upon receipt of each log record
– Simply increment occurrence counters corresponding
to the attribute values in the record
• Data aging can be easily implemented
– Periodically multiply all counters by a
decay factor
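The profile described above can be sketched as a set of per-category counters. This is an illustrative sketch only; the class name `Profile` and its method names are assumptions, not the authors' implementation.

```python
class Profile:
    """Frequency distribution over the categories of one attribute."""

    def __init__(self, categories):
        # One occurrence counter per category, e.g. 0..6 for day of week.
        self.counts = {category: 0.0 for category in categories}

    def update(self, value):
        """Increment the counter for the attribute value seen in a log record."""
        self.counts[value] += 1.0

    def age(self, decay_factor):
        """Data aging: multiply every counter by the decay factor."""
        for category in self.counts:
            self.counts[category] *= decay_factor

    def relative_frequency(self, value):
        """Share of (aged) observations that fall into this category."""
        total = sum(self.counts.values())
        return self.counts[value] / total if total > 0 else 0.0


# One profile per attribute; a user may hold several of them.
day_of_week_profile = Profile(range(7))
for observed_day in [0, 0, 0, 4]:
    day_of_week_profile.update(observed_day)
```

Aging scales all counters equally, so it leaves the relative frequencies unchanged at that moment; its effect is that records added afterwards count more than older ones.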
Base Score and Weight
• Base score represents how unlikely an
observed access is for the user.
– BaseScore = -log (RelativeFrequency)
• Score weight quantifies the “effectiveness”
of each attribute for profiling.
– When an attribute characterizes a user’s identity
usage pattern well, the weight should be high.
• How can we quantify it?
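As a small illustration of the base score formula, the sketch below uses the natural logarithm and a small floor for values never seen before. Both the log base and the `floor` parameter are assumptions; the slides do not specify either.

```python
import math


def base_score(relative_frequency, floor=1e-6):
    """BaseScore = -log(RelativeFrequency).

    The floor (an assumption) keeps the score finite when a value has
    never been observed and its relative frequency is 0.
    """
    return -math.log(max(relative_frequency, floor))


common_access = base_score(0.5)    # value seen in half of the records
rare_access = base_score(0.01)     # value seen in 1% of the records
```

Rare attribute values yield higher base scores than common ones, which is the intended direction for a risk score.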
Score Weight
• Use the “distance” between the frequency
distribution and the uniform distribution as the weight
– Bhattacharyya Distance etc.
– Data aging is necessary.
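One possible way to compute such a weight is the Bhattacharyya distance between the profile and a uniform distribution over the same categories. The sketch below assumes the standard definition, the negative log of the sum of sqrt(p_i * q_i); the function name is hypothetical.

```python
import math


def bhattacharyya_distance_to_uniform(counts):
    """Distance between a profile's distribution and the uniform one.

    Returns 0 for a perfectly uniform profile and grows as accesses
    concentrate in a few categories.
    """
    total = sum(counts)
    if total <= 0:
        return 0.0  # empty profile: no evidence, so no weight
    uniform = 1.0 / len(counts)
    coefficient = sum(math.sqrt((c / total) * uniform) for c in counts)
    # Clamp against rounding so the distance is never negative.
    return -math.log(min(1.0, coefficient))


flat_weight = bhattacharyya_distance_to_uniform([1, 1, 1, 1])
peaked_weight = bhattacharyya_distance_to_uniform([5, 0, 0, 0])
```

A user who logs in at all hours gets a weight near zero for the hour-of-day profile, so that attribute contributes little to that user's risk score.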
[Figure: relative frequency (0 to 0.25) by hour of day]
Score Aggregation
• A Sub Score (the product of a base score and
the corresponding weight) is computed for
each profile.
• How can we combine Sub Scores?
– Pick the MAX of Sub Scores
– Weighted sum of Sub Scores
– Others?
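The two aggregation options listed above can be sketched as follows. The mixing coefficients passed to the weighted sum are an assumption made for illustration, since the slides do not say how they would be chosen.

```python
def sub_score(base, weight):
    """Sub Score = base score x score weight, for one profile."""
    return base * weight


def aggregate_max(sub_scores):
    """Report the single most anomalous profile."""
    return max(sub_scores)


def aggregate_weighted_sum(sub_scores, coefficients):
    """Combine all profiles; coefficients are hypothetical tuning knobs."""
    return sum(c * s for c, s in zip(coefficients, sub_scores))


# Illustrative (base score, weight) pairs for three profiles.
scores = [sub_score(2.0, 0.5), sub_score(4.0, 0.25), sub_score(1.0, 3.0)]
risk_by_max = aggregate_max(scores)
risk_by_sum = aggregate_weighted_sum(scores, [0.5, 0.25, 0.25])
```

Taking the maximum flags an access as soon as any one attribute looks unusual, while a weighted sum requires several attributes to deviate before the score rises sharply.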
Setting of Experiments
• Buzzport data set
• Profiling attributes
– Week of month (5 categories)
– Day of week (7 categories)
– Hour of Day (24 categories)
• Scale Sub Scores into [0, 100)
• Use MAX of 3 Sub Scores as output
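A sketch of deriving the three profiling attributes from a login timestamp. The week-of-month rule used here (days 1 to 7 are week 0, days 8 to 14 are week 1, and so on) is an assumption that happens to yield 5 categories; the slides do not state the rule actually used.

```python
from datetime import datetime


def week_of_month(ts):
    """0..4, assuming weeks are counted in 7-day blocks from day 1."""
    return (ts.day - 1) // 7


def day_of_week(ts):
    """0 (Monday) .. 6 (Sunday)."""
    return ts.weekday()


def hour_of_day(ts):
    """0..23."""
    return ts.hour


def extract_attributes(ts):
    """Categorical attribute values for one log record's timestamp."""
    return {
        "week_of_month": week_of_month(ts),
        "day_of_week": day_of_week(ts),
        "hour_of_day": hour_of_day(ts),
    }


attributes = extract_attributes(datetime(2007, 8, 24, 14, 7, 5))
```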
Trends of Risk Scores
Trends of Risk Scores with Data Aging
• Decay Factor = 0.5 is applied monthly.
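As a worked illustration of monthly aging, assume the decay is applied once at each month boundary and let n_v(m) denote the number of times value v was observed during month m (a symbol introduced here for illustration). The aged counter for v in month t is then:

```latex
c_v(t) = \sum_{m \le t} 0.5^{\,t-m} \, n_v(m)
```

An access recorded three months ago therefore contributes 0.5^3 = 0.125 of a current access to the profile.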
False Positive / True Positive Analysis
• Randomly pick 5 users with different access
frequencies
• Split each user’s log records into two:
– Test data: the last month
– Training data: all earlier records
• Analyze the False Positive rate by using the
same user’s training data and test data
• Analyze the True Positive rate by using different
users’ data sets (a.k.a. cross profiling)
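The training/test split can be sketched as below. Treating "last month" as the 30 days before the user's latest record is an assumption; a calendar-month split would be an equally plausible reading. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def split_last_month(timestamps, days=30):
    """Return (training, test): test holds the final `days` days of records."""
    ordered = sorted(timestamps)
    if not ordered:
        return [], []
    cutoff = ordered[-1] - timedelta(days=days)
    training = [t for t in ordered if t <= cutoff]
    test = [t for t in ordered if t > cutoff]
    return training, test


logins = [
    datetime(2007, 6, 1, 9, 0, 0),
    datetime(2007, 7, 15, 9, 0, 0),
    datetime(2007, 8, 20, 9, 0, 0),
    datetime(2007, 8, 31, 9, 0, 0),
]
training, test = split_last_month(logins)
```

For cross profiling, a profile built from one user's training records would be scored against another user's records, which stand in for an impostor's accesses.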
False Positive / True Positive Results
* Each user’s threshold is determined based on the score range of the training data.
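A sketch of how a per-user threshold and the two rates could be computed. Using the maximum training score as the threshold is an assumption; the slide only says the threshold is based on the score range of the training data. The score lists are made-up numbers for illustration, not results from the experiments.

```python
def threshold_from_training(training_scores):
    """Assumed rule: the highest risk score seen in the training data."""
    return max(training_scores)


def alarm_rate(scores, threshold):
    """Fraction of accesses whose risk score exceeds the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > threshold) / len(scores)


# Illustrative scores only.
own_training_scores = [10, 20, 35]
own_test_scores = [12, 30, 40, 18]        # same user, last month
other_user_scores = [50, 60, 20, 90]      # different user, cross profiling

threshold = threshold_from_training(own_training_scores)
false_positive_rate = alarm_rate(own_test_scores, threshold)
true_positive_rate = alarm_rate(other_user_scores, threshold)
```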
Time / Storage Cost
• Measured on a Linux PC with an Intel Core 2
Duo E6600 and 3 GB RAM
• Average time per record: 5 ms
– Good enough for real-time processing
• Storage space per user: 1.4 KB
– Potential to accommodate a large number of
users
Conclusion
• Defined design principles for risk scoring
based on identity usage logs
• Proposed a way to compute anomaly-based
risk scores in real time
• Presented a prototype system using timestamp
information and showed that it has
reasonably good accuracy
Future Work
• Investigate other attributes (e.g., location)
• Conduct detailed experiments
– Evaluate with other data sets
– Find the optimum configuration
• Integrate into other security mechanisms
Thank you very much.
Questions?
[email protected]
http://www.cc.gatech.edu/~mashima