trace clustering

Report
Trace Clustering in Process Mining
M. Song, C.W. Gunther, and W.M.P. van der Aalst
Han-na Yang
Introduction
□
The major application of process mining

□
□
Discovery ⇒ extraction of abstract process knowledge from event logs
Real-life business processes are flexible

Spaghetti model

Single cases differ significantly from one another = ‘Diversity’

Discovering actual process which is being executed is valuable.
Solution for diversity of cases


Measure the similarity of cases and use the information to divide the set of cases
into more homogeneous subsets.
Trace clustering
Running Example
Activity identifier
Case identifier
□
Originator
Repair process of products within an electronic company that makes navigation
and mobile phones

Case: a specific row

Trace: the sequence of events within a case

Events: represented by the case identifier, activity identifier, and originator
Running Example
Navigation
system
Mobile
Phone
Repair is
Canceled
□
Trace clustering can support the identification of process variants
corresponding to homogenous subsets of cases
Trace profiles
□
□
In the trace clustering approach, each case is characterized by a defined
set of items, i.e., specific features which can be extracted from the
corresponding trace.
Items for comparing traces are organized in trace profiles, each
addressing a specific perspective of the log
Trace profiles
□
Information in Event log

WorkflowLog


ProcessInstance




a case
AuditTrailEntry


group any number of process elements
events
WorkflowModelElement

name of event

mandatory event attribute
EventType

identify lifecycle transitions

mandatory event attribute
Timestamp, Originator

optional data fields
Trace profiles
□
Profile

□
□
A set of related items which describe the trace from a specific perspective
Every item is a metric ⇒ we can consider a profile with n items to be a
function, which assigns to a trace a vector (i1, i2, … in)
Profiling a log can be described as measuring a set of traces with a
number of profiles, resulting in an aggregate vector

Resulting vectors can subsequently be used to calculate the distance between any
two traces, using a distance metric
Trace profiles
Clustering Methods - Distance Measures
□
Distance Measures

□
To calculate the similarity between cases
Three distance measures

n: the number of items extracted from the process log

case cj: corresponds to the vector (ij1, ij2, … ijn)

ijk: the number of appearance of item k in the case j
Clustering Methods – Clustering Algorithm
□
K-means clustering


□
A method of cluster analysis
aims to partition n observations into k clusters in which each observation belongs
to the cluster with the nearest mean.
QT (quality threshold) clustering

A method of partitioning data

invented for gene clustering

requires more computing power than k-means

but does not require specifying the number of clusters a priori

predictable - always returns the same result when run several times.

guided by a quality threshold(determines the maximum diameter of clusters)
Clustering Methods – Clustering Algorithm
□
Agglomerative hierarchical clustering

Gradually generate clusters by merging nearest traces

Smaller clusters are merged into large ones

Example: we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to
determine which elements to merge in a cluster. Usually, we want to take the two
closest elements, according to the chosen distance.
Clustering Methods – Clustering Algorithm
□
The Self-Organizing Map (SOM)

Used to map high dimensional data onto low dimensional spaces

grouping similar cases close together in certain areas of the value range

can be used to portray complex correlations in statistical data.

Example: World Bank statistics of countries in 1992.





39 indicators describing various quality-oflife factors were used
Countries that had similar values of the
indicators place near each other on the map
different clusters were automatically encoded
with different bright colors
each country was assigned a color describing
its poverty type in relation to other countries
The poverty structures of the world: each
country on the geographic map has been
colored according to its poverty type.
Case study
□
□
ProM

Support various process mining algorithm

Implemented the trace clustering plug-in in ProM
Process log from AMC hospital in Amsterdam, Netherlands

619 gynecological oncology patients

52 diagnostic activities

3,574 events, 34 departments are involved
(treated in 2005, 2006)
= 619 cases
Case study
□
Process model for all cases obtained using the Heuristic Miner
Case study
cluster
(1,2)
352
cluster (3,1)
113
□
The result obtained by applying the trace clustering plug-in in ProM
□
The cases in the same cell = belong to the same cluster
Case study
□
Result for cluster (1,2)



352 cases (more than half of the
entire cases)
Only 11 activities ⇒ Simple
Patients who are diagnosed by
another hospital and are referred
to the AMC hospital for
treatment
Case study
□
Result for cluster (3,1)



113 cases
Complex as the original process
model
Patients who are not diagnosed
and need more complex and
detailed diagnostic activities
Conclusion
□
Process mining techniques can deliver valuable, factual insights into
how processes are being executed in real life

□
Trace clustering operates on the event log level

□
Important for analyzing flexible environments
Improve the results of any process mining algorithm
Both the approach and implementation are straightforward to extend

Ex: By adding domain-specific profiles or further clustering algorithm

similar documents