Data Quality Assessment and Measurement
Laura Sebastian-Coleman, Ph.D., IQCP
Optum Data Management
EDW April 2014 – AM5 April 28, 8:30 – 11:45
Agenda
Welcome and Thank You for attending!
Agenda
• Introductory materials
– Abstract
– Information about Optum and about me
• Presentation sections will follow the outline in the abstract (details in a moment…)
– Challenges of measuring data quality
– DQ Assessment in context
» Initial Assessment Deep Dive
– Defining DQ Requirements
– Using measurement for improvement
• Discussion / Questions
Ground Rules
• I will try to stick to the agenda.
• But the purpose of being here is to learn from each other, so questions are welcome at any point.
• I will balance between the two.
Proprietary and Confidential. Do not distribute.
Abstract: Data Quality Assessment and Measurement
Experts agree that to improve data quality, you must be able to measure data
quality. But determining what and how to measure is often challenging. The
purpose of this tutorial is to provide participants with a comprehensive and
adaptable approach to data quality assessment.
• The challenges of measuring data quality and how to address them.
• DQ assessment in context: Understand the goals and measurement activities and deliverables associated with initial assessment, in-line measurement and control, and periodic reassessment of data. Review a template for capturing results of data analysis from these processes.
• Initial Assessment: Review an approach to initial assessment that allows capture of important observations about the condition of data.
• Defining DQ requirements: Learn how to define measurable characteristics of data and establish requirements for data quality. Review a template designed to solicit and document clear expectations related to specific dimensions of quality.
• Using measurement for improvement: Share examples of measurements that contribute to the ongoing improvement of data quality.
About Optum
• Optum is a leading information and technology-enabled health services business dedicated to helping make the health system work better for everyone.
• With more than 35,000 people worldwide, Optum delivers intelligent, integrated solutions that modernize the health system and help to improve overall population health.
• Optum solutions and services are used at nearly every point in the health care system, from provider selection to diagnosis and treatment, and from network management, administration and payments to the innovation of better medications, therapies and procedures.
• Optum clients and partners include those who promote wellness, treat patients, pay for care, conduct research and develop, manage and deliver medications.
• With them, Optum is helping to improve the delivery, quality and cost effectiveness of health care.
About me
• 10+ years of experience in data quality in the health care industry
• Have worked in banking, manufacturing, distribution, commercial insurance, and academia. These experiences have influenced my understanding of data, quality, and measurement.
• Published Measuring Data Quality for Ongoing Improvement (2013).
• Influences on my thinking about data:
– The challenge of how to measure data quality. Addressing this challenge, I have focused on the concept of measurement itself. Any problem of measurement is a microcosm of the general challenge of data definition and collection.
– The demands of data warehousing; specifically, integrating data from different sources, processing it so that it is prepared for consumption, and helping make it understandable.
• My thinking about data governance has been influenced by my position within an IT organization.
– DAMA says governance is a business function. But I think IT needs to step up as well.
– IT takes care of data. Technical and non-technical people would be better off if we all recognized IT as data stewards and if IT acted responsibly to steward data.
– The quality of data (esp. in large data assets) depends on data management practices, which are IT’s responsibility. (It depends on other things, too, but data management is critical.)
– Complex systems require monitoring and control to detect unexpected changes.
Challenges of Measuring the Quality of Data
Overview: Challenges of Measuring Data Quality
• Lack of consensus about the meaning of key concepts. Specifically,
– Data
– Data Quality
– Measurement/Assessment
» The only way to address a lack of consensus about meaning is to propose definitions and work toward consensus. In the next few slides, we will go in depth into the meaning of these terms.
• To start: Sometimes the term data quality is used to refer both to the condition of the data and to the activities necessary to support the production of high quality data. I separate these into
» The quality of the data / the condition of data
» Data quality activities: those required to produce and sustain high quality data
• Lack of clear goals and deliverables for the data assessment process
» These we will discuss in detail in DQ Assessment in Context.
• Lack of a methodology for defining “requirements”, “expectations” and other criteria for the quality of data. These criteria are necessary for measurement.
» This challenge we will discuss in detail in Defining Data Quality Requirements.
Assumptions about Data and Data Quality
• In today’s world, data is both valuable and complex.
• The processes and systems that produce data are also complex.
• Many organizations struggle to get value out of their data because
– They do not understand their data very well.
– They do not trust the systems that produce it.
– They think the quality of their data is poor – though they can rarely quantify
data quality.
• Poor data quality is not solely a technology problem – but we often
– Blame technology for the condition of data and
– Jump to the conclusion that tools can solve DQ problems. They don’t.
• Technology is required to manage data and to automate DQ measurement –
without automation, comprehensive measurement is not possible. There’s too
much data.
• Data is something people create.
– It does not just exist out in the world to be collected or gathered.
– To understand data requires understanding how data is created.
• Poor data quality results from a combination of factors related to processes,
communications, and systems within and between organizations.
Assumptions, continued….
• Given the importance of data in most organizations, ALL employees have a stewardship role, just as all employees have an obligation not to waste other resources.
• Given how embedded data production is in non-technical processes, ALL employees contribute to the condition of data.
– Raising awareness of how they contribute will help improve the quality of data.
• Sustaining high quality data requires data management, not just technology management.
– Data management, like all forms of management, includes knowing what resources you have and using those resources to reach goals and meet objectives.
– Technology should be a servant, not a master; a means, not an end; a tail, not a dog.
• Producing high quality data requires a combination of technical and business skills (including management skills), knowledge, and vision.
– No one can do it alone.
• Better data does not happen by magic. It takes work.
• People make data. People can make better data.
• Why don’t they?
What we want data to be
Reasonable
Reliable
Rational
Ready to use
A bit technical, but basically comprehensible
How data sometimes seems
Powerful. Packed with knowledge.
But threatening. And ambiguous.
And for those reasons, interesting…
And, of course, somewhat magical.
Still… it is difficult to tell whose side data is on; whether it is good or evil.
What data seems to be turning into
Big Data is BIG – monstrous, even.
And also powerful & threatening.
Moving faster than we can control.
Neither rational nor ready to use.
And yet … a potential weapon. If only it would behave.
Definition: Data
• Data’s Latin root is dare, past participle of “to give.” Data means “something given.” In math and engineering, the terms data and givens are used interchangeably.
• The New Oxford American Dictionary (NOAD) defines data as “facts and statistics collected together for reference or analysis.”
• ASQ defines data as “A set of collected facts” and identifies two kinds of numerical data: “measured or variable data … and counted or attribute data.”
• ISO defines data as “re-interpretable representation of information in a formalized manner suitable for communication, interpretation, or processing” (ISO 11179).
• Observations about the concept of data
– Data tries to tell the truth about the world (“facts”)
– Data is formal – it has a shape
– Data’s function is representational
– Data is often about quantities, measurements, and other numeric representations (“facts”)
– Things are done with data: reference, analysis, interpretation, processing
Data
• Data: Abstract representations of selected characteristics of real-world objects, events, and concepts, expressed and understood through explicitly definable conventions related to their meaning, collection, and storage.
• Each piece is important
– abstract representations – Not “reality” itself.
– of selected characteristics of real-world objects, events, and concepts – Not every characteristic.
– expressed through explicitly definable conventions related to their meaning, collection, and storage – Defined in ways that encode meaning. Choices about how to encode are influenced by the ways that data will be created, used, stored, and accessed.
– and understood through these conventions – Interpreted through decoding.
• These concepts are clearly at work in systems of measurement and in the scientific concept of data – something that you plan for (designing an experiment) and test for both veracity (are the measurements correct?) and purpose (are the measurements telling me what I need to know?).
Definition: Data Quality
Data Quality / Quality of Data:
• The level of quality of data represents the degree to which data meets the
expectations of data consumers, based on their intended use of the data.
• Data also serves a semiotic function (it serves as a sign of something other than itself). So data quality is also directly related to the perception of how well data effects (brings about) this representation.
Observations:
• High-quality data meets expectations for use and for representational
effectiveness to a greater degree than low-quality data.
• Assessing the quality of data requires understanding those expectations and
determining the degree to which the data meets them. Assessment requires
understanding
– The concepts the data represents
– The processes that created data
– The systems through which the data is created
– The known and potential uses of the data
Data Quality Activities
• The data quality practitioner’s primary function is to help an organization
improve and sustain the quality of its data so that it gets optimal value from its
data.
• Activities that improve and sustain data quality include:
– Defining / documenting quality requirements for data
– Measuring data to determine the degree to which data meets these
requirements
– Identifying and remediating root causes of data quality issues
– Monitoring the quality of data in order to help sustain quality
– Partnering with business process owners and technology owners to
improve the production, storage, and use of an organization’s data
– Advocating for and modeling a culture committed to quality
• Assessment of the condition of data and ongoing measurement of that
condition are central to the purpose of a data quality program.
Measurement is always about comparing two things….
Definition: Measurement
• Measurement: The act of ascertaining the size, amount, or degree of something.
– Measuring always involves comparison. Measurements are the results of
comparison.
– Measurement most often includes a means to quantify the comparison.
• Observation: Measurement is both simple and complex.
– Simple because we do it all the time and our brains are hard-wired to
understand unknown parts of our world in terms of things we know.
– Complex because, for those things we have not measured before, we
often do not have a basis for comparison, the tools to execute the
comparison, or the knowledge to evaluate the results.
» If you don’t believe me, imagine trying to understand “temperature” in a
world without thermometers.
– Measuring the quality of data is perceived as complex or difficult, because
we often do not know what we can or should compare data against.
Assessment goes further than measurement
Assessment is not just about comparison… it’s about drawing conclusions.
Drawing conclusions depends on understanding implications and how to act on them.
Definition: Assessment
• Assessment is the process of evaluating or estimating the nature, ability, or quality of a thing.
• Data quality assessment is the process of evaluating data to identify errors and understand their implications (Maydanchik, 2007).
Observations about assessment
• Like measurement, assessment requires comparison.
• Further, assessment implies drawing a conclusion about—evaluating—the object of the assessment, whereas measurement does not always.
• But as with data quality measurement, with data assessment we do not always know what we are comparing data against. For example, how do we know what is wrong? What constitutes an “error”?
Measurement/Assessment
Measurement is knowing that the temperature outside is 30 degrees F below zero.
Assessment is knowing that it’s cold outside.
You can act on the implications of an assessment: Get a coat! Or, better yet, stay inside.
Benefits of Measurement
• Objective, repeatable way of characterizing the condition of the thing being measured.
• For measurement to work, people must understand the meaning of the measurement.
• A beginning point for change / improvement of the thing that needs improvement.
• A means of confirming improvement has taken place.
Data Quality Assessment in Context
Overview: DQ Assessment in Context
• Goals:
– Understand the goals and measurement activities and deliverables associated with
» Initial assessment
» In-line measurement and control
» Periodic reassessment of data
– Review a template for capturing results of data analysis from these processes.
Order of information
• Challenges of data quality assessment
• Overview of the DQAF: Data Quality Assessment Framework
– What the DQAF is
– The data quality dimensions it includes
– Relation of DQAF measurement types to data quality dimensions and to specific measurements
– Objects of measurement and the data quality lifecycle
• Context diagrams and deliverables
• Template review
Data Quality Assessment
• Ideally, data quality assessment enables you to describe the condition of data in relation
to particular expectations, requirements, or purposes in order to draw a conclusion about
whether it is suitable for those expectations, requirements, or purposes.
– A big challenge: Few organizations articulate expectations related to the expected
condition or quality of data. So at the beginning of an assessment process, these
expectations may not be known or fully understood. The assessment process includes
uncovering and defining expectations.
• We envision the process as linear….
• But in most cases, it is iterative and sometimes requires multiple iterations….
Data Quality Assessment
• Data assessment includes evaluation of how effectively data represents the objects, events, and concepts it is designed to represent.
– If you cannot understand how the data works, it will appear to be of poor quality.
• Data Assessment is usually conducted in relation to a set of dimensions of quality that
can be used to guide the process, esp. in the absence of clear expectations:
– How complete the data is
– How well it conforms to defined rules for validity, integrity, and consistency
– How it adheres to defined expectations for presentation
• Deliverables from an assessment include observations, implications, and
recommendations.
– Observations: What you see
– Implications: What it means
– Recommendations: What to do about it
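The dimension-guided checks above can be sketched in a few lines of Python. This is a minimal illustration, not production profiling code; the field names, the valid-value domain, and the date pattern are all hypothetical.

```python
import re

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"member_id": "M001", "gender": "F", "birth_date": "1980-04-12"},
    {"member_id": "M002", "gender": "U", "birth_date": "1979-13-01"},
    {"member_id": "M003", "gender": None, "birth_date": "1991-06-30"},
]

def completeness(recs, field):
    """Share of records in which the field is populated."""
    return sum(r[field] is not None for r in recs) / len(recs)

def validity(recs, field, domain):
    """Share of populated values that fall within the defined domain."""
    values = [r[field] for r in recs if r[field] is not None]
    return sum(v in domain for v in values) / len(values)

def presentation(recs, field, pattern):
    """Share of populated values that match the expected format."""
    values = [r[field] for r in recs if r[field] is not None]
    return sum(bool(re.fullmatch(pattern, v)) for v in values) / len(values)

print(completeness(records, "gender"))            # 2 of 3 populated
print(validity(records, "gender", {"F", "M"}))    # "U" fails the domain check
# Note: "1979-13-01" passes the format check even though month 13 is
# impossible -- conformance to presentation rules is not the same as validity.
print(presentation(records, "birth_date", r"\d{4}-\d{2}-\d{2}"))
```

Observations (what you see) come out of such checks directly; implications and recommendations still require an analyst to interpret the numbers.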
DQAF– Data Quality Assessment Framework
• A descriptive taxonomy of measurement types designed to help people measure the quality of their data and use measurement results to manage data.
– Conceptual and technology-independent (i.e., it is not a tool)
– Generic – it can be applied to any data
• Initially defined in 2009 by a multi-disciplinary team from Optum and UHC seeking to establish an effective approach for ongoing measurement of data quality. Basis for Measuring Data Quality for Ongoing Improvement.
– Focuses on objective characteristics of data within five quality dimensions:
» Completeness
» Timeliness
» Validity
» Consistency
» Integrity
– Defines measurement types that
» Measure characteristics important to most uses of data (i.e., related to the basic meaning of the data)
» Represent a reasonable level of IT stewardship of data. That is, types that enable data management.
Using the DQAF
• The intention of the DQAF was to provide a comprehensive description of ways to measure. I will describe it this way.
• But it does not have to be applied comprehensively.
• It can be applied to one attribute or rule.
• The goal is to implement an optimal set of specific measurements in a specific system (i.e., implementing all the types should never be the goal of any system).
• Implementing an optimal set of specific measurements requires:
– Understanding the criticality and risk of data within a system.
– Associating critical data with measurement types.
– Building the types that will best serve the system by
» Providing data consumers a level of assurance that data is sound based on defined expectations
» Providing data management teams information that confirms that data moves through the system in expected condition
Using the DQAF
• The different kinds of assessment are related to each other.
– Initial assessment drives the process by separating data that meets expectations from data that does not and helping identify at-risk and critical data for ongoing measurement.
– Monitoring and periodic measurement identify data that may cease to meet expectations and data for which there are improvement opportunities.
• The concept of data quality dimensions provides the initial organizing principle
behind the DQAF:
Data Quality Dimension: A data quality dimension is a general, measurable category for a
distinctive characteristic (quality) possessed by data. Data quality dimensions function in
the way that length, width, and height function to express the size of a physical object. They
allow understanding of quality in relation to a scale and in relation to other data measured
against the same scale. Data quality dimensions can be used to define expectations (the
standards against which to measure) for the quality of a desired dataset, as well as to
measure the condition of an existing dataset. Dimensions provide an understanding of why
we measure. For example, to understand the level of completeness, validity, and integrity
of data.
DQAF Terminology
• Measurement Type:
– Within the DQAF, a measurement type is a subcategory of a dimension of data quality that allows for a repeatable pattern of measurement to be executed against any data that fits the criteria required by the type, regardless of specific data content.
– The measurement results of a particular measurement type can be stored in the same data structure regardless of the data content.
– Measurement types describe how measurements are taken, including what data to collect, what comparisons to make, and how to identify anomalies. For example, all measurements of validity can be executed in the same way. Regardless of specific content, validity measurements include collection of data and comparison of values to a specified domain.
• Specific Metric:
– A specific metric describes particular data that is measured and the way in which it is measured.
– Specific metrics describe what is measured. For example, a metric to measure the validity of procedure codes on a medical claim table, or one to measure the validity of ZIP codes on a customer address table.
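The distinction can be made concrete in code: one function embodies the measurement type (validity against a defined domain) and returns results in a single structure, while each call instantiates a specific metric. This is a hedged sketch; the tables, columns, and domains named here are hypothetical.

```python
def measure_validity(values, valid_domain):
    """Measurement type: collect values and compare each to a defined
    domain. The result row has the same shape regardless of content."""
    total = len(values)
    valid = sum(v in valid_domain for v in values)
    return {"total": total, "valid": valid,
            "pct_valid": valid / total if total else None}

# Specific metric: validity of procedure codes on a medical claim table
claim_codes = ["99213", "99214", "ABCDE"]
print(measure_validity(claim_codes, {"99213", "99214", "99215"}))

# Specific metric: validity of ZIP codes on a customer address table
zip_codes = ["55101", "55102", "5510"]
print(measure_validity(zip_codes, {"55101", "55102", "55103"}))
```

Because both metrics return the same result structure, their results can be stored in one table and trended over time, which is the point of defining the type once.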
Dimensions, Measurement Types, Specific Metrics
Example Measurement Types
DQAF Dimension Definitions
Completeness: Completeness is a dimension of data quality. As used in the DQAF, completeness implies having all the necessary or appropriate parts; being entire, finished, total. A data set is complete to the degree that it contains required attributes and a sufficient number of records, and to the degree attributes are populated in accord with data consumer expectations. For data to be complete, at least three conditions must be met: the data set must be defined so that it includes all the attributes desired (width); the data set must contain the desired amount of data (depth); and the attributes must be populated to the extent desired (density). Each of these secondary dimensions of completeness would be measured differently.

Timeliness: Timeliness is a dimension of data quality related to the availability and currency of data. As used in the DQAF, timeliness is associated with data delivery, availability, and processing. Timeliness is the degree to which data conforms to a schedule for being updated and made available. For data to be timely, it must be delivered according to schedule.

Validity: Validity is a dimension of data quality, defined as the degree to which data conforms to stated rules. As used in the DQAF, validity is differentiated from both accuracy and correctness. Validity is the degree to which data conform to a set of business rules, sometimes expressed as a standard or represented within a defined data domain.

Consistency: A dimension of data quality. As used in the DQAF, consistency can be thought of as the absence of variety or change. Consistency is the degree to which data conform to an equivalent set of data, usually a set produced under similar conditions or a set produced by the same process over time.

Integrity: Integrity is a dimension of data quality. As used in the DQAF, integrity refers to the state of being whole and undivided or the condition of being unified. Integrity is the degree to which data conform to data relationship rules (as defined by the data model) that are intended to ensure the complete, consistent, and valid presentation of data representing the same concepts. Integrity represents the internal consistency of a data set.
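The three secondary dimensions of completeness (width, depth, density) can each be computed directly. A minimal sketch, assuming a hypothetical expected schema and expected row count:

```python
# Hypothetical expectations for the data set being assessed
expected_columns = {"member_id", "gender", "birth_date", "zip"}
expected_rows = 4

dataset = [
    {"member_id": "M1", "gender": "F",  "birth_date": "1980-04-12"},
    {"member_id": "M2", "gender": None, "birth_date": "1975-01-03"},
    {"member_id": "M3", "gender": "M",  "birth_date": None},
]

actual_columns = set(dataset[0])

# Width: are all desired attributes present in the data set's definition?
width = len(actual_columns & expected_columns) / len(expected_columns)

# Depth: does the data set contain the desired amount of data?
depth = len(dataset) / expected_rows

# Density: are the attributes populated to the extent desired?
populated = sum(v is not None for row in dataset for v in row.values())
density = populated / (len(dataset) * len(actual_columns))

print(width, depth, density)
```

Each ratio answers a different question, which is why the definition insists the three conditions be measured differently rather than rolled into one number.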
DQAF Terminology
• Assessment Category:
– In the DQAF, an assessment category is a way of grouping measurement types based on where in the data life cycle the assessment is likely to be taken.
– Assessment categories pertain to both the frequency of the measurement (periodic or in-line) and the type of assessment involved (control, measurement, assessment).
– They include: initial assessment, process control, in-line measurement, periodic measurement, and periodic assessment.
• Measurement (or Assessment) Activities:
– Measurement activities describe the goals and actions associated with work carried out within an assessment category. Measurement activities differ depending on when, within the data lifecycle, they are carried out and against which objects of measurement.
– Measurement activities correspond closely with DQAF measurement types.
• Object of Measurement:
– In the DQAF, objects of measurement are groupings of measurement types based on whether types focus on process or content, or on a particular part of a process (e.g., receipt of data) or kind of content (e.g., the data model).
– Content-related objects of measurement include: the data model, content based on row counts, content of amount fields, date content, aggregated date content, summarized content, cross-table content (row counts, aggregated dates, amount fields, chronology), and overall database content.
– Process-related objects of measurement include: receipt of data, condition of data upon receipt, adherence to schedule, and data processing.
Context of Data Quality Assessment

Assessment categories, their goals, and their measurement activities:

• Initial One-Time Assessment – Goal: Gain understanding of the data and data environment
– Understand business processes represented by the data
– Review and understand processing rules
– Assess completeness, consistency, and integrity of the data model
– Assess the condition of data (profile data)
– Assess sufficiency of metadata and reference data
– Assess data criticality; define measurement requirements
• Automated Process Controls and In-line Measurement – Goal: Manage data within and between data stores with controls and ongoing measurement
– Ensure correct receipt of data
– Inspect initial condition of data
– Take pre-processing data set size and timing measurements
– Measure data content completeness
– Measure data validity
– Measure data set content
– Measure data consistency
– Take post-processing data set size and timing measurements
• Periodic Measurement – Goal: Manage data within a data store
– Measure cross-table integrity of data
– Assess overall sufficiency of database content
– Assess effectiveness of measurements and controls

All categories rest on support processes and skills.
Functions in Assessment: Collect, Calculate, Compare, Conclude
Use DQAF dimensions and measurement types to help with this process.
Results of Data Assessment
The following three slides associate deliverables with each of the measurement activities.
Through these deliverables….
• Metadata is produced, including:
– Expectations related to the quality of data, based on dimensions of quality
– Objective description of the condition of data compared to those expectations
– Documentation of the relation of data’s condition to processes and systems –
rules, risks, relationships
• Data and process improvement opportunities can be identified and quantified,
so that decisions can be made about which ones to address.
Initial Assessment
In-Line Measurement & Control
Periodic Measurement
Initial Assessment:
Capturing Observations and Conclusions about the Condition of Data
Deep Dive on Initial Assessment: Data Analysis Results Template
• One of the challenges in data quality measurement is a lack of clear goals and deliverables for the data assessment process. I hope the preceding materials can help you clarify your goals for any measurement activities within an assessment.
• The Data Analysis Template should help you formulate your deliverable.
• Components
– Analysis Protocol Checklist
– Observation Sheet
– Supporting components – purpose and usage, content overview, definitions of terms, etc.
– Summarized analysis questions
Show template now…
Analysis Protocol Checklist
• A tool to enable analysts to execute data profiling in a consistent way.
• Describes the actions that should be taken during any data analysis sequence.
• Includes prompts and questions that help guide analysts in discovering
potential risks within source data.
• Although the list includes a set of discrete actions that can be described individually, many of these can be executed simultaneously; for example, when reviewing the cardinality of a small set of valid values, analysts can and should be assessing the reasonability of the distribution of values.
• The checklist ensures that nothing is missed when data is profiled.
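The cardinality-oriented steps in such a checklist can be approximated in a few lines. This is a rough sketch using generic Python rather than any particular profiling tool, and the column data is hypothetical:

```python
from collections import Counter

# Hypothetical columns from a source table being profiled
columns = {
    "record_type": ["A", "A", "B", "A", "B"],
    "source_flag": ["Y", "Y", "Y", "Y", "Y"],
    "member_id":  ["M1", "M2", "M3", "M4", "M5"],
}

# Profile each column as a frequency distribution of its values
profile = {name: Counter(values) for name, values in columns.items()}

# Sort columns by cardinality, low to high
by_cardinality = sorted(profile, key=lambda c: len(profile[c]))

# Flag constant columns (cardinality = 1) and review value
# distributions where cardinality is 2 or 3
for name in by_cardinality:
    card = len(profile[name])
    if card == 1:
        print(f"{name}: constant value {next(iter(profile[name]))!r}")
    elif card in (2, 3):
        print(f"{name}: distribution {dict(profile[name])}")
```

The point of the checklist is that an analyst runs the same sequence against every source, so the interesting part is the prompts it attaches to each result, not the mechanics of the counts.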
Examples from Analysis Protocol Checklist
Protocol 1 – Complete Data Set -- overall content
Action (Tool Use): Review the overall set of files to be included in profiling. Complete the Profiling metadata tab. Identify any limitations or risks.
Analysis object: Review the overall set of files to be included in profiling. Make initial comparisons between Source ERD, UDW model, and source metadata. Review the Profiling metadata tab. Identify any limitations or risks.

Protocol 2 – Column -- Cardinality Analysis
Action (Tool Use): Sort Data Values by Minimum Value, low to high (Column Analysis).
Analysis object: Review columns where Min & Max Values are Null.

Protocol 3 – Column -- Cardinality Analysis
Action (Tool Use): Sort Cardinality, low to high (Column Analysis).
Analysis object: Analyze all other columns where Cardinality = 1.

Protocol 4 – Column -- Cardinality Analysis
Action (Tool Use): Sort Cardinality, low to high (Column Analysis). Use the View Details button for a 2nd level of drill-down; select the Frequency Distribution tab.
Analysis object: Analyze all columns where Cardinality = 2.

Protocol 5 – Column -- Value Distribution Analysis
Action (Tool Use): Sort Cardinality, low to high (Column Analysis). Use the View Details button for a 2nd level of drill-down; select the Frequency Distribution tab.
Analysis object: Review Value Distribution for columns where Cardinality = 2 or 3.
Examples from Analysis Protocol Checklist

Protocol 11 -- Column -- Data Class Analysis: Dates
Purpose: If a column contains date information, there are inherent date formats, as well as physical and logical ranges of valid values. Is the Data Type consistent with Date? Which date fields are nullable? Does this make sense? Is there a dependency or other relatedness between multiple date columns? Between the date column and other "non-date" columns? Do the Min and Max dates make sense? Are the column names consistent with a Date? Why or why not? When necessary, reassign the Selected Data Class. Review all other characteristics for each column.

Protocol 12 -- Column -- Data Class Analysis: Quantities
Purpose: Is the Data Type consistent with a Quantity? Are the values consistent with Quantity? Are the column names consistent with Quantity? Does the source distinguish between fields related to dollar amounts and fields related to other kinds of quantity data? Does the data format make sense for this column? (ex: dollars have a precision of 2 decimals; counts of members are integers) When necessary, reassign the Selected Data Class. Review all other characteristics for each column.

Protocol 13 -- Column -- Data Class Analysis: Code
Purpose: The Code class represents instances where there is a finite set of valid values, as in those from a code table. Does the source also supply the code tables? (See Structure for referential integrity between reference and core data.) Is the cardinality consistent with what is known about the specific Code set? Is the column name consistent with a code? What are the values? Are there invalid values present? Are there expected values which are missing? When necessary, reassign the Selected Data Class. Review all other characteristics for each column.

Protocol 14 -- Column -- Data Class Analysis: Unknown
Purpose: What is the Cardinality of the column? Based on what you observe about the column, how should it be classified? Why? When necessary, reassign the Selected Data Class. Review all other characteristics for each column.
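A rough illustration of the data-class assignment these protocols describe — the heuristics here are hypothetical, not the DQAF's actual classification rules:

```python
import re

def infer_data_class(values):
    """Assign a data class -- Date, Quantity, Code, or Unknown --
    using simple illustrative heuristics."""
    vals = [v for v in values if v not in ("", None)]
    if not vals:
        return "Unknown"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in vals):
        return "Date"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals):
        return "Quantity"
    # a small, repeating set of values suggests a code table
    if len(set(vals)) <= max(3, len(vals) // 10):
        return "Code"
    return "Unknown"
```

As the protocols note, the inferred class should still be reviewed against column names, cardinality, and the source's own documentation before reassignment.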
Examples from Analysis Protocol Checklist

Protocol 43 -- Structure -- Record Types
Purpose: Identify patterns in the population of records based on status codes, record type codes, dates, or other critical fields that may be used to differentiate records of different kinds. Determine how many record types appear to be present in the data and whether records can be classified based on the type of transaction or type of data present in the record.

Protocol 44 -- Structure -- Change over time
Purpose: Determine how records representing variations of the same information can be understood in relation to each other. For example, one record may be an original claim; another may be an adjustment to the same claim.

Protocol 45 -- Structure -- State Data
Purpose: Based on record types and change over time, determine if we can characterize the different states of data in the data set and the events that might trigger a new record or an update to an existing record.

Protocol 46 -- Structure -- Age of Records
Purpose: Identify any content or structural differences between older and more recent records. Older records may have been produced under different business processes.

Protocol 49 -- Structure -- Source Naming Conventions, General
Purpose: Review findings from overall column analysis to identify any general naming conventions.

Protocol 50 -- Structure -- Naming Convention Consistency
Purpose: Review findings from overall column analysis to identify any inconsistencies or peculiarities in source naming conventions.
Observation List
• Designed to capture discrete, specific observations for knowledge-sharing purposes.
• Observations can be made at the column, table, file, or source level.
• Observations will be used to inform other people about the condition of data and will be repurposed as metadata. Observations should be formulated with these ends in mind.
• Each observation is recorded and associated with a relevance category, so that its importance is understood and can be confirmed.
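A minimal sketch of how such an observation record might be structured for reuse as metadata. The field names follow the list above; the sample values are taken from the example slides, and the structure itself is an assumption, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One discrete observation, shaped for reuse as metadata."""
    column_name: str
    checklist_step: str
    observation_type: str   # e.g., Finding, Informational
    category: str           # e.g., Naming Conventions, Data Content
    observation: str
    relevance: str

obs = Observation(
    column_name="PROCEDURE_MODIFIER_2",
    checklist_step="Nullability, General",
    observation_type="Informational",
    category="Limited use field",
    observation="Null=98.3%",
    relevance="Unexpected percentage NULL",
)
```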
Example Observations

Column Name: PROCEDURE_MODIFIER_2
Checklist Step: Naming Conventions - Consistency
Observation Type: Finding
Observational Category: Naming Conventions
Observation: Appears to be same data as SRV_CDE_MOD, PROCEDURE_MODIFIER_3, and proc_mod_4_cd; why a different naming convention? Confirm.

Column Name: PROCEDURE_MODIFIER_2
Checklist Step: Nullability, General
Observation Type: Informational
Observational Category: Limited use field
Observation: Null=98.3%

Column Name: DIAGNOSIS_IDENTIFIER
Checklist Step: Nullability, General
Observation Type: Finding
Observational Category: Related Fields Business Rule
Observation: Null=24.6%; Values: 1-9, A, B, C. Are these expected values? Any missing values? Reasonable for this data set? The nulls seem to be related to PAY_RULE_INDIC=null. Is Diagnosis needed for all procedures?
Relevance: PAY_RULE_INDIC

Column Name: PAY_RULE_INDIC
Checklist Step: Value Dist, Valid Values
Observation Type: Informational
Observational Category: Data Content
Observation: Null=24.6%; Values: P=52.4%, C=21.3%, U=1.8%. Are these expected values? Any missing values? Reasonable for this data set?
Relevance: Unexpected percentage NULL
Example Observations

Column Name: FIRST_DATE_OF_SRVC
Checklist Step: Value Dist, Range of Values
Observation Type: Finding
Observational Category: Source Edit Checks
Observation: First DOS, no reasonability checks in source. Values from 1915, the 1920's, the 1970's, 2020, later in 2013. Future dates allowed.
Relevance: Distribution of values is questionable

Column Name: RENDERING_PROVIDER_NPI, rndr_prov_npi_new_id
Checklist Step: Related Columns
Observation Type: Risk
Observational Category: Business Rule
Observation: Two fields both appear to contain rendering provider NPI. Which should be used?
Relevance: Related Fields -- RENDERING_PROVIDER_NPI and rndr_prov_npi_new_id

Column Name: DETAIL_RECORD_NUM
Checklist Step: Field Length
Observation Type: Informational
Observational Category: Structure, Keys
Observation: Defined=2; actual=3. Sizing opportunity.
Relevance: Significant difference between actual and defined field lengths
Relevance Categories
• High cardinality where low is expected
• Low cardinality where high is expected
• Inconsistent data type
• Inconsistent format
• Multiple formats in one field
• Difference between inferred and defined field lengths
• Component for compound Key
• Foreign Key
• Identifying relationships
• Non-identifying relationships
• Logic for Natural Keys
• Source Natural Key
• Source Primary Key
• Multiple default values
• Unexpected percentage of records defaulted
• Default value differs from target
• Default value differs from Source documentation
• Granularity differs from Source documentation
• Granularity differs from target
• Inconsistent granularity
• 100% NULL, Not OK -- field is not populated but should be
• 100% NULL but OK -- field is not populated but is not expected to be
• Related Columns within a table
• Related Columns across tables
• Invalid values present in column
• Non-standard values for a standardized code
• Value set differs from source documentation
• Value set differs from target
• Data populated as expected
• Distribution of values appears unreasonable
• Distribution of values is questionable
• High frequency values are unexpected
• Low frequency values are unexpected
• Unexpected percentage NULL
• Inconsistent Naming Conventions
• Naming Convention differs from target
Summarized Analysis
• Tracks high-level findings, at the attribute or rule level.
• Contains Yes / No questions so that analysts can reach conclusions and roll up findings.
• Questions are based on the dimensions of quality in the DQAF, and they can be associated with measurement types in the DQAF.
Summarized Analysis – Sample questions

Field Name: Does cardinality of values make sense? (Y, N, N/A)
Field Definition: Cardinality refers to the number of items in a set. For data measurement results, cardinality means the number of values returned from the core data. A high cardinality is expected on fields like procedure code, which have a large set of valid values; whereas fields like gender codes have only a few values and therefore will return only a few rows. This field captures high-level reasonability related to cardinality. Unreasonable conditions include: high cardinality where low is expected, low cardinality where high is expected, inconsistent cardinality. Note that cardinality is not always the number of ROWS returned. If a measurement is broken down by a source system code, then it will return a set of rows for each source, and the reasonableness of cardinality must be understood at the source level.

Field Name: Are default values present? (Y/N)
Field Definition: Valid values = Y/N/CND [Cannot Determine]. The requirements process should capture whether population of a field is optional or mandatory. This question asks whether default values are actually present in the data. In most cases, defaults are valid values. However, depending on other expectations, a high level of defaults may be a problem. If there is more than one functional default value, capture that fact in the observation sheet. For Match Rate consistency measurements, the presence of default values indicates that records have not matched.

Field Name: Is level of default population reasonable? (Y, N, N/A, CND)
Field Definition: This field captures high-level reasonability related to the level of defaulted records. Unreasonable levels would occur when business rules indicate a field should be populated under specific conditions and it is defaulted instead. For Match Rate Consistency measurements, this field should be used to capture whether the level of non-matches is reasonable. The general intention of match processes is to return 100% of records. If defaults are present, we should document the factors that have caused them. In some cases, these may be reasonable. For example, if a field defaults for particular sources because source data is not yet available, but the field is populated for sources for which data is available, then the overall level is reasonable. If there are known issues associated with the data, then the answer to this question is N, and issues should be noted in the observations.
Summarized Analysis – Sample questions

Field Name: Are invalid values present? (Y, N, N/A)
Field Definition: Values = Y/N. The data model should record whether NULL is allowed or not. If it is allowed, then the answer is Y. We need to record this information in order to ensure we report accurately on levels of validity.

Field Name: Does the level of invalid values represent a problem? (Y, N, CND)
Field Definition: For all instances where the answer is Y to the question "Are invalid values present?", determine whether the level of invalid values presents a risk. This question can be answered based on either the distinct number of invalid values, the percentage of rows containing invalid values, or both. A special case: when the only invalid value is NULL and NULL is populated at a reasonable level, the answer here should be No.

Field Name: Does distribution of values make sense, based on what the data represents? (Y, N, CND)
Field Definition: This field captures high-level reasonability based on knowledge of what the field represents. Some assessment can be based on common sense (in a large set of claim data, a high level of procedure codes for office visits makes sense, whereas a large set for heart transplants would be questionable). Other assessment can be made based on similar data in other systems, or on defined rules in standard documentation. Questions should be directed at business process SMEs.

Field Name: Are there identifiable patterns based on dates data was delivered or processed (e.g., trends, spikes, etc.)? (Y, N, CND)
Field Definition: Valid Values = Y, N, CND [Cannot determine]. Baseline assessment of many consistency measures will focus on determining the degree to which there is a legitimate expectation that data will be consistent. In order to draw a conclusion, data needs to be looked at over time. Assessment of validity measures includes not only the level of validity, but also changes over time. Analysts should also identify any known events (e.g., project releases, the introduction of new sources, etc.) that have an impact on the consistency of data. Seeing changes over time requires graphing the data. Do not answer this question until data has been reviewed graphically.
Summarized Results – Samples

Field Name: Are there patterns based on age of records, status, type codes, or any other fields related to the content of the records? (Y, N, CND)
Field Definition: This field prompts analysts to look across the record set at any fields that might provide a deeper understanding of the reasonability of the data.

Field Name: Are there any other issues related to data content? (Y, N)
Field Definition: This field is intended as a catch-all for any problems that may have been identified or questions that may have been raised in the course of analysis that do not fit into one of the previous categories. When such issues and questions are identified, we should not only address them for themselves; we should also review them for possible improvements to the data assessment process and this template.

Field Name: Overall Reasonability: Does result content make sense based on what we know of the data? (Y, N)
Field Definition: This field should capture analysis of reasonability at the highest level. For validity and consistency measurements, it should be populated based on analysis using the Data Content Protocol. The pink fields that follow summarize a series of observations based on that protocol. They should be populated before completing the high-level field.

Field Name: Is this Data meeting Quality expectations? (Y, N, CND)
Field Definition: Like the reasonability field, the quality-expectation field captures a high-level conclusion. In most cases, it should be filled out after the other data content analysis has been completed.
Results of Assessment
• I had previously asserted that few organizations articulate expectations related to the expected condition or quality of data.
• The assessment process includes uncovering and defining expectations.
• The assessment allows you to establish facts about the data and to answer questions about its condition:
  – Is the data reasonable?
    » Is it complete?
    » Is it valid?
    » Is there integrity between related tables?
  – Is data in the condition it needs to be in for use?
  – If not, what is not acceptable about the data?
• The results should be in a sharable form: "data-tized" observations.
• From these findings, data consumers can define measurable requirements for the expected condition of data.
Defining Data Quality Requirements

Overview: Defining Data Quality Requirements
Defining DQ requirements: Learn how to define measurable characteristics of data and establish requirements for data quality. Review a template designed to solicit and document clear expectations related to specific dimensions of quality.

Order of Ideas
• Definitions of terms – requirement, data quality requirement, expectations, risks
• Input for the requirements process
• Asking questions
• Capturing output
Data Quality Requirements
• A requirement defines a thing or action that is necessary to fulfill a purpose.
• Data quality requirements describe characteristics of data necessary for it to be of high quality.
• Data quality content requirements define the expected condition of data in terms of quality characteristics, such as completeness, validity, consistency, and integrity.
• Data quality measurement requirements define how a particular characteristic should be measured.
• The DQ requirements process should identify data quality expectations and risks in order to make recommendations for how to measure the quality of data.
  – Expectations are based on business processes and rules and what specific data is designed to represent.
  – Risks can be associated with business processes that produce data, source systems that supply data to a downstream data asset, or technical processes related to data movement and storage within the downstream data asset (e.g., transformation rules within ETL).
  – Expectations and risks can be expressed in relation to dimensions of quality (i.e., the data is considered complete / valid / consistent, if…).
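As a sketch, expectations like these can be turned into executable measurements: one predicate per expectation, reporting the fraction of rows that conform. The rule names and sample rows below are illustrative assumptions, not part of any template.

```python
def measure(rows, rules):
    """Apply one predicate per named expectation; return the fraction
    of rows meeting each."""
    return {name: sum(1 for r in rows if rule(r)) / len(rows)
            for name, rule in rules.items()}

rows = [
    {"zip": "06101", "gender": "F"},
    {"zip": "ABCDE", "gender": "M"},
    {"zip": "90210", "gender": "X"},
]
rules = {
    "completeness: zip populated": lambda r: bool(r["zip"]),
    "validity: zip is 5 digits": lambda r: r["zip"].isdigit() and len(r["zip"]) == 5,
    "validity: gender in code set": lambda r: r["gender"] in {"F", "M", "U"},
}
scores = measure(rows, rules)
```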
Assessment / Requirements Relationship
• We envision the requirements-to-assessment process as linear.
• But in most cases it is iterative, sometimes over multiple iterations, because the assessment process includes uncovering and defining expectations. Likewise, the requirements process often entails assessment.

[Diagram: We think requirements start here → or you figure them out here → but requirements may start here.]
Input to the Requirements Process
• Data content requirements
• Business process flows
• Business rules
• Entity and attribute definitions
• Source data model
• Target data model
• Source-to-target mapping specification or functional specification
• Transformation rules
• Profiling / data assessment results
• Any other documentation that provides information about business expectations for the data

Show requirements template now….
Which data to focus on
Measurements inform you about the condition of the data within a system. But most organizations do not have the capacity to measure everything. Nor is it beneficial to do so. Measurement requirements should focus on critical and at-risk data.
More on this in a few slides…
Risks in Business or Technical Processes
High risk = candidate measurement

Field Name: Business Process complexity (high, medium, low)
Field Definition: This field should assess the level of complexity within the business process that creates the data. The purpose of this evaluation is to identify risks associated with data production.

Field Name: Population rule Complexity (high, medium, low)
Field Definition: This field should record a high-level assessment of the complexity of the population of a field. Complexity provides a way to assess technical risk associated with data population. Complexly populated fields are candidates for consistency measurements. Valid values include: High, Medium, Low. Fields that require only formatting changes or are a direct move are considered low complexity. Fields that require minor transformations are medium complexity. Fields that are derived from multiple inputs are high complexity. Population rule complexity will be populated only when input to the DQ Measurement Requirements template includes the mapping spreadsheet / functional specification.

Field Name: Column Category (direct move field, Amount field, indicator, match process, other derivation, codified data in a core table, ref table field, system-generated field)
Field Definition: This field is used to characterize the kind of data in the column, using categories that are helpful in determining whether to take a measurement and in deciding which type of measurement to apply. Values are based largely on the guidelines for standard measurement processes. Valid values include: Match Process, Codified Data, Ref Table field, UDW system-generated field. The Column Category will be populated only when input to the DQ Measurement Requirements template includes the mapping spreadsheet / functional specification.
Risks in Business or Technical Processes
High risk anywhere here = candidate measurement

Field Name: Are there any Known Risks within the business process that produces this data? (Y, N, CBD)
Field Definition: This field should capture whether there are any known limitations of the business process that produces the data. If there is no information about known risks, then populate with CBD (cannot be determined).

Field Name: Describe known business process risks
Field Definition: This field should describe known limitations of the business process that produces the data. If this information is documented elsewhere, include the link or reference to that information. If additional space is needed, create another tab.

Field Name: Are there any Known Risks or issues associated with this data in the source systems which supply it? (Y, N, CBD)
Field Definition: This field should capture whether there are any known limitations of the data within the systems that supply it. Risks can be associated with direct or originating systems. If there is no information about known risks, then populate with CBD (cannot be determined).

Field Name: Describe known source system risks
Field Definition: This field should describe known risks of the systems that supply the data. If this information is documented elsewhere, include the link or reference to that information. If additional space is needed, create another tab.
Expectations for Completeness of Column Population
High Criticality = Candidate measurement

Field Name: Attribute Criticality (high, medium, low)
Field Definition: This field should record a high-level assessment of an attribute's criticality for business purposes. The analyst should populate it based on data knowledge and common sense. The draft assessment should be reviewed by the business. Highly critical fields are candidates for consistency measurements. Valid values include: High, Medium, Low. High indicates that the data is very critical. Attributes may be critical in and of themselves, or they may be critical because they serve as input into derivation processes.

Field Name: Population expectation: (Mandatory vs. Optional)
Field Definition: This field should record whether a field is expected always to be populated (mandatory) or whether population is not always expected (optional). Ideally, information about optionality should be captured in the model and obtained via the MID. Valid values include: Mandatory, Optional, TBD.

Field Name: If optional, identify the conditions under which the field is populated
Field Definition: For fields where the population is not mandatory, this field will record the conditions under which the data will not be populated (or under which it will be populated, whichever is simpler to express).

Field Name: Defaults allowed? (Y/N)
Field Definition: If the data quality requirements process is executed as part of development, this field indicates whether defaults are allowed. Such information should be captured in the model. If the process is executed against existing data and actual data is available for inspection, the field should record whether defaults are present.

Field Name: Standard Default value?
Field Definition: In this field, capture the specific value that is expected to be populated when the field is defaulted.

Field Name: Under what conditions are defaults allowed?
Field Definition: For fields where defaults are allowed, this field will record the conditions under which the data will or might be defaulted.
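Profiling default population the way the template asks — percent defaulted, plus any multiple functional defaults — might be sketched as below. The set of default tokens is an assumption for illustration; real sources define their own.

```python
from collections import Counter

DEFAULTS = {"", None, "UNK", "1900-01-01"}  # assumed default tokens

def default_profile(values):
    """Percent of values defaulted, plus the distinct default tokens
    actually seen (surfacing multiple 'functional defaults')."""
    seen = Counter(v for v in values if v in DEFAULTS)
    pct = round(sum(seen.values()) / len(values) * 100, 1)
    return pct, sorted(str(k) for k in seen)

pct, tokens = default_profile(["A", "UNK", "", "B", "UNK"])
```

Finding more than one token in the result is exactly the "multiple default values" condition the relevance categories flag.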
Expectations for Validity of Column Population
Clear criteria for validity = clear criteria for measurement

Field Name: Criteria for Validity?
Field Definition: This field captures all defined criteria for validity. For example, it may name a range of valid values, a source of valid values, or a rule that is associated with determining validity. The criterion for a DIAGNOSIS Code field might be: ICD Diagnosis codes valid at the time of the claim; or See DIAGNOSIS_CODE Table. It is not the intention of this field to capture valid values or to duplicate information in code tables.

Field Name: Business Rules Associated with the population of the field
Field Definition: This field captures any business rules that are associated with the field. For example, if a health condition is related to a workers compensation claim, then the workers comp indicator must be 'Y' and the workers comp claim number must be populated. Often rules state the relationship between fields, so ensure that you note the information for both fields.

Field Name: Other Expectations based on business processes
Field Definition: This field should include any additional data quality expectations shared by business SMEs or other data consumers. For example, whether fields are related to each other, whether there are differences in population based on different types of records, etc.

Field Name: Action if data does not meet expectations (Keep/Reject)
Field Definition: This field records what action the business wants the data store to take if data does not meet the expectations for population, validity, or defined business rules. Valid values are to keep the data and allow the record to be inserted despite the defect, or to reject the record.
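Checking actual values against documented criteria for validity might look like this sketch, using a code set as the criterion (the values are illustrative):

```python
from collections import Counter

def validity_check(values, valid_codes):
    """Compare actual values to a documented code set; return the
    invalid values with counts and the percent of rows invalid."""
    counts = Counter(values)
    invalid = {v: n for v, n in counts.items() if v not in valid_codes}
    pct_invalid = round(sum(invalid.values()) / len(values) * 100, 1)
    return invalid, pct_invalid

invalid, pct = validity_check(["P", "C", "U", "P", "Z"], {"P", "C", "U"})
```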
Compare documented requirements to Results of Data Analysis

Field Name: Percentage of records not populated [NULL or other Default value]
Field Definition: If profiling information is available, record the percentage of defaults for the column.

Field Name: Test of population conditions for optional fields (under what conditions is the field populated/not populated?)
Field Definition: For fields where the population is not mandatory, this field will test documented conditions under which the data will not be populated (or under which it will be populated, whichever is simpler to express).

Field Name: Defaults present? (Y/N)
Field Definition: This field records whether or not default values are present. Valid values include: Y - Defaults are present, N - Defaults are not present, Multi - More than one functional default is present, and CND - Cannot determine whether defaults are present.

Field Name: Unexpected defaults? (Y/N) (default different from documentation, more than one functional default, etc.)
Field Definition: In this field, capture whether there are any unexpected characteristics related to rows where the field is defaulted. For example, are any values other than the standard default being used to default the field (functional defaults)? Is the standard value not being used at all? Is there more than one value being populated when the field is defaulted?

Field Name: Default percentage reasonable? (Y/N)
Field Definition: This field should be populated with the analyst's assessment of whether the level of defaulted data is reasonable based on an understanding of what the data represents. If the response is No, then the reasons for drawing this conclusion should be recorded in the "Observations on existing population" field.
Compare documented requirements to Results of Data Analysis

Field Name: Criteria for validity met? (Y/N)
Field Definition: This field should record whether actual values in the data meet the documented expectations for validity. For example, if the criteria for a Procedure Code field stipulate that the field should contain only industry-standard codes, but it also has homegrown procedure codes, then it has not met the criteria for validity.

Field Name: Observations on existing population
Field Definition: This field should capture any observations on existing data. For example, why the default level is unreasonable, what the functional defaults are, whether the distribution of values in the column is reasonable, etc. If there is a need to record multiple observations, consider creating an observation tab, based on the Baseline Assessment template; or if there is a distinct category of observation, add a column to this template to capture it.

Field Name: Known issues from data analysis / profiling
Field Definition: This field should indicate whether previous analysis identified any known issues with the data. The field does not need to detail those issues, but should reference other documents.

Field Name: Risk Assessment
Field Definition: This field captures a high-level assessment of the risk associated with the attribute, based on knowledge of the source or gained through analysis. Valid values: High, Medium, Low.
Measurement ROI
Once you have identified risks and assessed criticality, you can associate any data element or rule with one of four quadrants. High-risk, high-criticality data is the data that is worth monitoring.
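One plausible way to encode the quadrant logic; the action labels are a reading of the slide, not a prescribed API:

```python
def measurement_priority(criticality, risk):
    """Map a criticality/risk pairing to an action; only high-risk,
    high-criticality data earns ongoing monitoring."""
    if criticality == "high" and risk == "high":
        return "ongoing monitoring"
    if criticality == "high":
        return "periodic reassessment"
    if risk == "high":
        return "assess; revisit if criticality grows"
    return "no routine measurement"
```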
Measurement Decision Tree – Criticality
[Diagram: Initial assessment yields assessed data sets, split by whether data meets expectations and by criticality. Critical data that does not meet expectations feeds improvement projects; critical data that meets expectations feeds ongoing monitoring; less critical data feeds periodic reassessment.]
Measurement Decision Tree – Criticality / Risk
[Diagram: Initial assessment yields assessed data sets, branched by criticality (critical vs. less-critical data) and risk. High-risk branches lead to risk mitigation and ongoing monitoring; low-risk branches lead to periodic reassessment.]
Results of Requirements Process
• Similar to those of the assessment process.
• A set of assertions about the expected condition of the data, focused on:
  – Completeness
  – Consistency
  – Validity
  – Integrity
• These can be defined as measurements. For example: for customer address, ZIP code must always be valid for addresses in the US; measure/monitor the level of invalid ZIP codes.
• They can be shared with data consumers to ensure there is knowledge of the actual condition of the data and consensus about the expected condition.
• This information is valuable as metadata.
• Patterns of requirements can be associated with DQAF measurement types so that common processes can be set up to take sets of similar measurements.

Show examples from filled-in template…
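The ZIP code requirement above, expressed as a measurement, might be sketched like this. It is format-level validity only; a real measure would consult a postal reference file, and the sample rows are illustrative.

```python
import re

US_ZIP = re.compile(r"\d{5}(-\d{4})?")

def invalid_zip_rate(addresses):
    """Fraction of US rows whose ZIP fails a format check."""
    us = [a for a in addresses if a.get("country") == "US"]
    bad = sum(1 for a in us if not US_ZIP.fullmatch(a.get("zip", "")))
    return bad / len(us) if us else 0.0

rate = invalid_zip_rate([
    {"country": "US", "zip": "06101"},
    {"country": "US", "zip": "6101"},     # invalid: four digits
    {"country": "CA", "zip": "K1A 0B1"},  # non-US, excluded
])
```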
Measurement from requirements through support
Using Measurement for Improvement
Using Measurements for Improvement
Goal: Share examples of measurements that contribute to the ongoing improvement of data quality.
• Member Coverage to Medical Claim Match Process: complex derivation across data domains.
• Earliest Service Date derivation: complex derivation within a data domain.
• Member Reconciliation process: comparison of Source and Target data.
• All three show the benefit of initial assessment and ongoing measurement / monitoring.
Member Coverage ID population on Medical Claim Data
• Member Coverage ID is populated through a lookup to member data.
• Business Rule: Each medical claim should be associated with one and only one member with medical coverage at the time of first service.
• The field was populated at only 88% (12% defaulted), and the population rate was declining. The low point was 83% population.
• The root cause appeared to be overlapping timelines in the Member Coverage data. This problem was addressed and the rate improved, but then leveled out at ~88%.
• An additional root cause was identified: the population of Earliest Service Date on the Medical Claim records.
Measurement Results Before and After Logic Change
Keep this graph in mind. We will come back to it.
Earliest Service Date on Medical Claim Data
• The attribute was being populated in an existing data warehouse table.
• Assessment showed 13% of records had defaulted dates (1/1/1900).
• Background: Claims are stored at two levels:
  – The Header or Event level contains data related to ALL services rendered through the claim (member and subscriber information, provider information, information on how the claim was submitted, status of the claim, date range over which the set of services was provided).
  – The Service or Line level includes details related to each service (procedure codes, service dates, type of service line, places of service, etc.).
• There can be one or many service records for each header record.
• Populating Earliest Service Date on the Header record requires a lookup to the service records.
• Review of derivation rules showed that the logic was being applied only to a subset of service lines, based on the service line type code.
• The solution included extending this logic to all service lines.
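The corrected derivation — take the minimum service date across all of a header's service lines, regardless of line type — can be sketched as below; the field names are illustrative, not the warehouse's actual columns.

```python
from datetime import date

def derive_earliest_service_date(service_lines):
    """Earliest service date per claim header, across ALL line types."""
    earliest = {}
    for line in service_lines:
        hdr, svc = line["claim_id"], line["service_date"]
        if hdr not in earliest or svc < earliest[hdr]:
            earliest[hdr] = svc
    return earliest

lines = [
    {"claim_id": "C1", "service_date": date(2013, 3, 5), "line_type": "MED"},
    {"claim_id": "C1", "service_date": date(2013, 3, 1), "line_type": "LAB"},
    {"claim_id": "C2", "service_date": date(2013, 4, 2), "line_type": "MED"},
]
earliest = derive_earliest_service_date(lines)
# C1 now takes 2013-03-01 even though it came from a LAB line
```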
Measurement Results Before and After Logic Change
Measurement Results Before and After the Second Logic Change
This is the graph I asked you to keep in mind.
[Control chart: % MBR COV ID Populated, as of March 2014. Individual values plotted by Month.Year; center line X-bar = 88.7, UCL = 93.60, LCL = 83.80, with several out-of-control points flagged.]
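Control limits like the chart's UCL and LCL can be reproduced with standard individuals (XmR) chart arithmetic: X-bar plus or minus 2.66 times the average moving range, where 2.66 is the standard SPC constant for individual values. The sample values below are illustrative, not the slide's data.

```python
def individuals_chart_limits(values):
    """Individuals (XmR) chart limits: X-bar +/- 2.66 * average
    moving range."""
    xbar = sum(values) / len(values)
    mrs = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(mrs) / len(mrs)
    return xbar - 2.66 * mr_bar, xbar, xbar + 2.66 * mr_bar

lcl, xbar, ucl = individuals_chart_limits([88, 90, 89, 88, 90])
```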
Source / Target Reconciliation Measurement
The flow chart shows a generic comparison. In the case of the measurement we were taking, there was expected to be exact correspondence between Source and Target records.
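An exact-correspondence reconciliation like this one can be sketched as a set comparison over record identifiers (the IDs below are illustrative):

```python
def reconcile(source_ids, target_ids):
    """Exact-correspondence reconciliation: in target but not source
    is 'overstated'; in source but missing from target is 'understated'."""
    src, tgt = set(source_ids), set(target_ids)
    return {"overstated": sorted(tgt - src),
            "understated": sorted(src - tgt)}

r = reconcile(source_ids=["A", "B", "C"], target_ids=["B", "C", "D"])
```

Trending the sizes of the two result lists over load dates gives exactly the overstated/understated charts that follow.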
Overstated in Target – Records active in Target System / Inactive in Source
[Control chart: Count Over, where OVR_UNDR_IND is set to '2', plotted by LOAD_DT.]
When there is an overstatement, the problem should self-correct. (See 7/2011 and 11/2011.) In 2013, it did not self-correct.
Root Cause: The target system did not receive de-activation records, so the same records keep getting measured.
Understated in Target – Target System is Missing Records
[Control chart: Count Under, where OVR_UNDR_IND is set to '1', plotted by LOAD_DT.]
If this number spikes, the target system is missing records.
Root Cause: An out-of-date exclusion rule was preventing records from being loaded.
Parting Thoughts
• Know your data.
• If you don't already know your data, get to know it through assessment and measurement.
THANK YOU!
Questions?
Contact information
Laura Sebastian-Coleman, Ph.D., IQCP
[email protected]
860 221 0422
