TimeCleanser: A Visual Analytics Approach for
Data Cleansing of Time-Oriented Data
Theresia Gschwandtner, Wolfgang Aigner, Silvia Miksch, Johannes Gärtner,
Simone Kriglstein, Margit Pohl, Nik Suchy
Motivation
Overview
TimeCleanser: special quality checks for time-induced problems
Evaluation of TimeCleanser
Results
Derived design principles
Conclusion
TimeCleanser – Special Focus on Time
Time
Point in time
Interval
Time-Oriented Values
Values that change over time
Multiple Data Sets
sales per hour
TimeCleanser – Design and Development
Tight collaboration with software company
Working at the company site for over 3 years
Problem analysis, design, and implementation
Frequent feedback sessions
Evaluation
Requirement Analysis
taxonomy of time-oriented quality problems
[Gschwandtner et al., 2012]
real life experience of
partner company
TimeCleanser
Time Checks – Examples
Intervals
Same durations
Minimum and maximum duration
Obligatory gaps, e.g., break in the night
[Figure: timeline with a gap between 8pm and 7am]
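The interval checks listed above can be sketched roughly as follows. This is an illustrative sketch only, not TimeCleanser's implementation: the function names `duration_outliers` and `overlaps_nightly_gap`, the sample data, and the thresholds are assumptions, and the gap boundaries are taken from the 8pm and 7am labels on the slide.

```python
from datetime import datetime, timedelta

def duration_outliers(intervals, min_len, max_len):
    """Return the intervals whose length lies outside [min_len, max_len]."""
    return [(s, e) for s, e in intervals if not (min_len <= e - s <= max_len)]

def overlaps_nightly_gap(start, end, gap_from=20, gap_to=7):
    """True if [start, end) touches the nightly gap gap_from:00 to gap_to:00."""
    if start.hour >= gap_from or start.hour < gap_to:
        return True  # the interval already starts inside the gap
    gap_begin = start.replace(hour=gap_from, minute=0, second=0, microsecond=0)
    return end > gap_begin  # the interval runs into the evening gap

# Hypothetical sample data (arbitrary date).
day = datetime(2014, 3, 3)
intervals = [
    (day.replace(hour=9), day.replace(hour=17)),             # regular shift
    (day.replace(hour=19), day.replace(hour=21)),            # runs past 8pm
    (day.replace(hour=12), day.replace(hour=12, minute=5)),  # suspiciously short
]

bad_durations = duration_outliers(
    intervals, timedelta(minutes=30), timedelta(hours=10)
)
in_gap = [iv for iv in intervals if overlaps_nightly_gap(*iv)]
```

Here `bad_durations` holds only the five-minute interval and `in_gap` only the interval that runs past 8pm.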
Time Checks – Examples
Temporal range
IDs should cover same temporal range (with some
tolerance), e.g., different departments
[Figure: entries of IDs A and B along a shared time axis]
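A minimal sketch of the temporal range check, assuming each ID is summarized by its first and last time stamp. The name `range_mismatches`, the department IDs, and the seven-day tolerance are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

def range_mismatches(ranges, tolerance):
    """ranges maps an ID to its (first, last) time stamp.

    An ID is flagged when it starts later than the earliest start, or ends
    earlier than the latest end, by more than `tolerance`.
    """
    earliest = min(first for first, _ in ranges.values())
    latest = max(last for _, last in ranges.values())
    return sorted(
        key
        for key, (first, last) in ranges.items()
        if first - earliest > tolerance or latest - last > tolerance
    )

# Hypothetical departments and dates.
ranges = {
    "A": (datetime(2014, 1, 1), datetime(2014, 12, 31)),
    "B": (datetime(2014, 1, 2), datetime(2014, 12, 31)),  # one day late: tolerated
    "C": (datetime(2014, 3, 1), datetime(2014, 12, 31)),  # starts two months late
}
flagged = range_mismatches(ranges, tolerance=timedelta(days=7))
```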
Time-Oriented Value Checks – Examples
Valid minimum and maximum values within a given temporal range
e.g., sales of one hour
[Figure: values plotted over time]
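One way to read this check is: aggregate the values per temporal unit, then compare each aggregate against valid bounds. The sketch below does this for sales per hour; the helper names, the sample records, and the bounds 0 and 1000 are assumptions, not values from the talk.

```python
from collections import defaultdict
from datetime import datetime

def hourly_totals(records):
    """Sum (time stamp, value) records per full hour."""
    totals = defaultdict(float)
    for stamp, value in records:
        totals[stamp.replace(minute=0, second=0, microsecond=0)] += value
    return dict(totals)

def out_of_bounds(totals, low, high):
    """Return the hours whose total lies outside [low, high]."""
    return {hour: t for hour, t in totals.items() if not (low <= t <= high)}

# Hypothetical sales records.
records = [
    (datetime(2014, 3, 3, 9, 15), 120.0),
    (datetime(2014, 3, 3, 9, 40), 80.0),
    (datetime(2014, 3, 3, 10, 5), 5000.0),  # implausibly high for one hour
    (datetime(2014, 3, 3, 11, 30), -20.0),  # negative sales
]
totals = hourly_totals(records)
suspicious = out_of_bounds(totals, low=0.0, high=1000.0)
```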
Valid value sequences, e.g., ready – start – operate – end
[Figure: sequence of values X, Y, Z plotted over time]
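A value sequence check can be sketched as a lookup of allowed successors. The states come from the example ready, start, operate, end; the assumption that a new cycle may begin with ready after end, and the names `ALLOWED_NEXT` and `invalid_transitions`, are illustrative only.

```python
# Allowed successor states; "end" -> "ready" is an assumption for illustration.
ALLOWED_NEXT = {
    "ready": {"start"},
    "start": {"operate"},
    "operate": {"end"},
    "end": {"ready"},
}

def invalid_transitions(values, allowed=ALLOWED_NEXT):
    """Return (index, previous, current) for each transition not in `allowed`."""
    return [
        (i, prev, cur)
        for i, (prev, cur) in enumerate(zip(values, values[1:]), start=1)
        if cur not in allowed.get(prev, set())
    ]

# Hypothetical log: the second cycle skips "start".
log = ["ready", "start", "operate", "end", "ready", "operate", "end"]
problems = invalid_transitions(log)
```

In this log, the only flagged transition is the jump from ready directly to operate at index 5.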
Multiple Data Sets Checks – Examples
Data should cover same temporal range (with some tolerance)
Contain intervals of same length
Contain time stamps of same precision
[Figure: data sets A and B on a shared time axis, with example time stamps 8:02 and 9:01 versus 8:00 and 9:00]
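The precision check could be approximated by inferring the finest time unit that actually occurs in each data set. This is a sketch under assumptions: the function name `stamp_precision` is hypothetical, and the assignment of the slide's example time stamps (8:02, 9:01 and 8:00, 9:00) to data sets A and B is a guess.

```python
from datetime import time

def stamp_precision(stamps):
    """Return the finest unit used by any time stamp: second (or finer), minute, or hour."""
    if any(t.second or t.microsecond for t in stamps):
        return "second"
    if any(t.minute for t in stamps):
        return "minute"
    return "hour"

# Assumed assignment of the slide's example values to the two data sets.
data_set_a = [time(8, 2), time(9, 1)]
data_set_b = [time(8, 0), time(9, 0)]

same_precision = stamp_precision(data_set_a) == stamp_precision(data_set_b)
```

A heuristic like this can only suggest a mismatch: a data set recorded at minute precision may contain only full hours by coincidence, so the result still needs a human judgment.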
Summary - Checks
Syntax Checks
Time Checks
Valid overall temporal range
Durations/interval length
Missing time point or interval
Entries for different IDs cover same temporal range
Time-Oriented Value Checks
Valid minimum and maximum values within a given temporal range
Values which do not change for too long
Valid timing of values
Valid value sequences
Valid intervals between subsequent values
Multiple Data Sets
Cover same temporal range
Contain intervals of equal length
Contain time stamps of same precision
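As one example from this list, the check for values which do not change for too long can be sketched as a run-length scan. The function name `long_constant_runs`, the sample readings, and the limit of three repetitions are assumptions for illustration.

```python
from itertools import groupby

def long_constant_runs(values, max_run):
    """Return (start_index, value, length) for each run longer than max_run."""
    runs, index = [], 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        if length > max_run:
            runs.append((index, value, length))
        index += length
    return runs

# Hypothetical sensor readings: the value 4 repeats five times in a row.
readings = [3, 4, 4, 4, 4, 4, 7, 7, 2]
stuck = long_constant_runs(readings, max_run=3)
```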
Visualizations
Overview of values over time
Difference plot of subsequent data values
Heatmap of interval lengths and data values
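The data behind a difference plot of subsequent values is simple to compute. The sketch below is not taken from TimeCleanser; the names `successive_differences` and `jumps` and the threshold are hypothetical. It derives the differences and marks unusually large jumps.

```python
def successive_differences(values):
    """Difference between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]

def jumps(values, threshold):
    """Indices of values that differ from their predecessor by more than threshold."""
    return [
        i
        for i, d in enumerate(successive_differences(values), start=1)
        if abs(d) > threshold
    ]

# Hypothetical series with a single spike at index 3.
series = [10, 12, 11, 40, 13, 14]
diffs = successive_differences(series)
spikes = jumps(series, threshold=10)
```

A single outlier shows up as two large differences of opposite sign (here at indices 3 and 4), which is the pattern such a plot makes easy to spot.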
Evaluation – Focus Group
Participants:
2 data analysts of our partner company (target users)
2 HCI experts
Session:
2 scenarios (GPS data and working hours)
Tasks:
1. Remove syntax errors
2. Check interval lengths
3. Check plausibility of velocity values (GPS data set only)
4. Check validity of working hours and of weekly profiles (working hours data set only)
Audio and video recording
Design Principle 1: Data cleansing is a sequential task with loops
[Diagram, built up over several slides: correct syntax, assign semantic roles (From, To, Value, Differentiator), overview, zoom & filter, analysis & details on demand; shown alongside the labels time values, data values, sequences, and multiple data sets]
Design Principle 2: Complex quality
problems are best spotted with visualizations
‘You get a picture of the data set, not only of erroneous
entries, but also of how the data looks like and how it
should look like.’
Design Principle 3: Visualizations and raw
data tables are complementary
Design Principle 4: Algorithmic means are
suited to identify precisely definable errors
Design Principle 5: Original data needs to
be preserved
Correct data right away for further processing
Confer with customers later
Quickly undo changes
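One possible way to meet these requirements is to keep the original rows immutable and record corrections as a separate, undoable list. The class `CleansingSession` below is a hypothetical sketch of that idea, not the mechanism used in TimeCleanser.

```python
class CleansingSession:
    """Keeps the original rows untouched and applies corrections on top."""

    def __init__(self, rows):
        self._original = tuple(rows)  # never modified
        self._edits = []              # stack of (index, new_value)

    def correct(self, index, new_value):
        self._edits.append((index, new_value))

    def undo(self):
        """Drop the most recent correction, if any."""
        if self._edits:
            self._edits.pop()

    def original(self):
        return list(self._original)

    def current(self):
        """Original rows with all recorded corrections applied in order."""
        rows = list(self._original)
        for index, new_value in self._edits:
            rows[index] = new_value
        return rows

# Hypothetical rows with one malformed time stamp.
session = CleansingSession(["08:00", "8.30", "09:00"])
session.correct(1, "08:30")
after_fix = session.current()
session.undo()
after_undo = session.current()
```

Because corrections live in their own list, the cleansed view is available right away while the original stays available for later discussion with the customer.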
Design Principles – Summary
1. Data cleansing is a sequential task with loops
2. Complex quality problems are best spotted with visualizations
3. Visualizations and raw data tables are complementary
4. Algorithmic means are suited to identify precisely definable errors
5. Original data needs to be preserved
Negative Points and Possible New
Features
More interactive features would be necessary (HCI experts)
Synchronized zooming for multiple visualizations
Linking and brushing between visualizations and data tables
Statistics about string lengths to support the detection of
outliers
Use of wildcards and regular expressions for filter
functionality
A one-page statistical summary of the data set (e.g., minimum,
maximum, average, distribution)
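Two of the requested features, string length statistics and regular expression filters, could look roughly like the sketch below. These are features proposed in the focus group, not existing functionality; the function names, the sample strings, and the 20% share threshold are assumptions.

```python
import re
from collections import Counter

def length_outliers(strings, min_share=0.2):
    """Return strings whose length occurs in less than min_share of all entries."""
    counts = Counter(len(s) for s in strings)
    rare = {n for n, c in counts.items() if c / len(strings) < min_share}
    return [s for s in strings if len(s) in rare]

def regex_filter(strings, pattern):
    """Keep only the strings that match the pattern as a whole."""
    compiled = re.compile(pattern)
    return [s for s in strings if compiled.fullmatch(s)]

# Hypothetical time stamp column with one entry missing its leading zero.
stamps = ["08:00", "08:15", "08:30", "8:45", "09:00", "09:15"]
odd_lengths = length_outliers(stamps)
well_formed = regex_filter(stamps, r"\d{2}:\d{2}")
```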
Conclusion
Very close collaboration with target users
Systematic list of data quality checks
Sequence of cleansing steps
Design principles for data cleansing support (with special
focus on time-oriented data)
Need for visualizations in complex error detection and
cleansing tasks
Results – Topics which were discussed
Traditional methods
Workflow
Advantages of TimeCleanser
Attitudes towards visualizations
Intertwinedness of analytical and visual methods
Negative points and possible new features
 Design principles
Syntax Checks
Time-Oriented Value Checks
Evaluation – Questions
(1) Does the prototype help the target users to perform data cleansing tasks?
(2) Is an integration of visualization methods useful?
(3) What are the advantages and disadvantages in comparison with the data cleansing methods they have used so far?
(4) For which tasks are visualization methods, common data cleansing analysis methods, and a combination of both suitable?
(5) Which interaction methods for the visualizations are useful to support users' working steps to perform data cleansing tasks?
Lessons Learned
1. Automatic methods are preferred in cases which are easily defined
2. Visualizations are superior when judging plausibility
3. Analysts appreciated the use of visualizations as an interactive analysis tool
4. Efficient connection of visualizations to raw data and a side by side display is important
TimeCleanser – Design and Development
Tight collaboration with software company
Working at the company site for over 3 years
Problem analysis, design, and implementation – CEO, data analysts,
software developers, VA experts
Frequent feedback sessions – CEO, VA experts, software developers
Evaluation – data analysts, HCI experts
Design Principle 1: Data cleansing is a sequential task with loops
Additions to Shneiderman's Visual Information Seeking Mantra [Shneiderman, 1996]:
‘correct syntax first, assign semantic roles, overview, zoom and filter, then analysis and details-on-demand’
Additions to Keim's Visual Analytics mantra [Keim et al., 2008]:
‘correct syntax first – assign semantic roles – overview – analyse – show the important – zoom, filter and analyse further – details on demand’