Topic 2
Data Quality
Introduction
Today, most organizations use data in two ways:
Transactional/operational use (“running the business”)
and,
Analytic use (“improving the business”)
Both usage scenarios rely on high-quality information
This suggests the need for processes to ensure that
data is of sufficient quality to meet those needs
It is therefore of great value to any enterprise to
incorporate a data quality program
Such a program includes processes for assessing, measuring,
reporting on, reacting to, and controlling the risks
associated with poor data quality
Information Value and Data
Quality Improvement
There are different ways of looking at information
value.
The simplest approaches:
Consider the cost of acquisition (i.e., the data is worth
what we paid for it)
Its market value (i.e., what someone is willing to pay for
it)
But in an environment where data is created, stored,
processed, exchanged, shared, aggregated, and reused,
perhaps the best approach for understanding
information value is its utility – the expected value to
be derived from the information (what we can get from
the information)
Data Quality
Data quality is often defined as “fitness for use”, i.e. an
evaluation of the extent to which the data serve the
purposes of the user
Another definition: “Data quality is about having
confidence in the quality of the data that you record
and the data you use”
Data quality is divided into four dimensions: accuracy,
timeliness, completeness, and consistency (Ballou and
Pazer, 1985)
[Link] learning/lecture/eqLb8/data-quality
Data Quality
Dimensions contributing to data quality
Data Quality
Accurate – refers to how closely the data correctly
captures what it is designed to capture
E.g., each data field is defined so that it is clear what type
of data is to be recorded; for example, a DOB must follow
the agreed format dd/mm/yyyy
Complete – data that has all those items required to
measure intended activity or event
Legible – data that the intended users will find easy to
read and understand
Relevant – meets the need of the information users
Reliable – data is collected consistently over time and
reflects the true facts
Data Quality
Timely – data is collected within a reasonable agreed
time period
Valid – data is recorded in accordance with applicable rules
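Several of these dimensions can be checked mechanically at entry time. A minimal sketch, assuming hypothetical field names and rules (dd/mm/yyyy for DOB, a five-digit postcode) that stand in for whatever an organization actually agrees on:

```python
import re
from datetime import datetime

def check_record(record):
    """Check one record against a few quality dimensions.
    Field names and rules are illustrative assumptions, not a standard."""
    issues = []
    # Valid/accurate: DOB must follow the agreed dd/mm/yyyy format
    try:
        datetime.strptime(record.get("dob", ""), "%d/%m/%Y")
    except ValueError:
        issues.append("dob: not in dd/mm/yyyy format")
    # Complete: every required field must be present and non-empty
    for field in ("name", "dob", "postcode"):
        if not record.get(field):
            issues.append(f"{field}: missing")
    # Valid: postcode must match a simple rule (five digits here)
    if record.get("postcode") and not re.fullmatch(r"\d{5}", record["postcode"]):
        issues.append("postcode: fails validity rule")
    return issues

print(check_record({"name": "Ann Lee", "dob": "07/1990/12", "postcode": "40150"}))
# -> ['dob: not in dd/mm/yyyy format']
```

A real system would attach one such rule set to every data field defined in the data dictionary.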
Data Quality Issues
Missing values
Duplicate data
Noise
Invalid Data
Outliers
[Link] learning/lecture/tp2m0/addressing-data-quality-issues
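The issues above can be detected with simple statistics; a minimal sketch on a made-up list of sensor readings, flagging outliers as values more than two standard deviations from the mean (an assumed rule of thumb, not a universal definition):

```python
from statistics import mean, stdev

readings = [20.1, 19.8, None, 20.3, 20.1, 98.6, 20.0, 20.1]  # made-up data

# Missing values: positions holding no reading at all
missing = [i for i, v in enumerate(readings) if v is None]
values = [v for v in readings if v is not None]

# Duplicate data: exact repeats of a value
seen, duplicates = set(), set()
for v in values:
    if v in seen:
        duplicates.add(v)
    seen.add(v)

# Outliers: values more than 2 standard deviations from the mean
m, s = mean(values), stdev(values)
outliers = [v for v in values if abs(v - m) > 2 * s]

print(missing, sorted(duplicates), outliers)  # -> [2] [20.1] [98.6]
```

Note that an exact-duplicate reading may be legitimate; whether it is noise or invalid data depends on what the field is supposed to capture.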
Impacts of poor-quality data
Poor-quality data carries negative effects for business
users through:
reduced customer satisfaction
increased running costs
inefficient decision-making processes
lower performance
lowered employee job satisfaction
increased operational costs, since time and other resources are
spent detecting and correcting errors.
A Gartner report stated that the average organization
loses $8.2 million annually through poor-quality data; 22
percent of respondents estimated their annual losses at $20
million, and 4 percent reported losses of $100 million.
Causes of Poor Data Quality
Manual data entry
People mistype. They choose the wrong entry from a list. They
enter the right data value into the wrong box.
Given complete freedom on a data field, those who enter data
have to go from memory. Is the vendor named Grainger, WW
Granger, or W. W. Grainger?
Information obfuscation (unclear information)
If a field is not available, an alternate field is often used. This
can lead to such data quality issues as having Tax ID numbers
in the name field or contact information in the comments
field.
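One common remedy for the vendor-name problem above is a canonicalization table consulted at entry time. A minimal sketch; the alias list is a hypothetical example built around the Grainger case:

```python
def normalize_vendor(name):
    """Map free-text vendor entries to one canonical form.
    The alias table is a hypothetical example, not real master data."""
    aliases = {
        "grainger": "W. W. Grainger",
        "ww grainger": "W. W. Grainger",
        "w. w. grainger": "W. W. Grainger",
        "w w grainger": "W. W. Grainger",
    }
    # Lowercase and collapse punctuation spacing so "W.W." and "W. W." compare equal
    key = " ".join(name.lower().replace(".", ". ").split())
    # Unknown spellings (e.g. the misspelled "WW Granger") pass through
    # unchanged so they can be flagged for manual review
    return aliases.get(key, name)

print(normalize_vendor("WW Grainger"))  # -> W. W. Grainger
```

In practice the table is maintained as reference data and extended whenever a new variant is spotted.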
After the Merger
Mergers usually happen fast and are unforeseen by IT
departments.
Mergers can result in a loss of expertise when key people leave
midway through the project to seek new ventures.
[Link] lems-%20wp_en_dq_top_10_dq_problems.pdf
Solutions
Monitoring
Make public the results of poorly entered data and praise those who enter data
correctly.
Real-time Validation
In addition to form validation, data quality tools can be implemented to
validate addresses, e-mail addresses and other important information as it is
entered.
Communication
Regular communication and a well-documented metadata model will make the
process of change much easier.
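Real-time validation can be sketched as checks that run the moment a value is entered, before it reaches the database. The rules below are simplified illustrations, not complete address validation or full RFC 5322 e-mail parsing:

```python
import re

# Light-weight e-mail shape check (illustrative only; full RFC 5322
# validation and mailbox verification need dedicated tooling)
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def validate_on_entry(field, value):
    """Reject obviously malformed values at entry time."""
    if field == "email" and not EMAIL_RE.match(value):
        return False, "email: invalid format"
    if field == "postcode" and not value.strip():
        return False, "postcode: required"
    return True, "ok"

print(validate_on_entry("email", "alice@example"))      # fails: no dot in domain
print(validate_on_entry("email", "alice@example.com"))  # passes
```

The point of running this at entry time, rather than in a later cleansing pass, is that the person who knows the correct value is still at the keyboard.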
Root Cause
Analysis
What is Root Cause Analysis?
A process of determining the causes that led to a
nonconformance, event or undesirable condition, and
identifying corrective actions to prevent recurrence
which, when implemented, restore the status quo or
establish a desired effect.
Purpose
Root Cause Analysis helps to identify what, how, and why
something happened, thus preventing recurrence.
Root causes are underlying, are reasonably identifiable,
can be controlled by management and allow for the
generation of recommendations.
The process involves data collection, cause charting, root
cause identification, recommendation generation and
implementation.
Only when you are able to determine why an event or
failure occurred will you be able to specify workable
corrective measures.
Root Cause Analysis 14
Understanding Root Causes
To fix a problem it must be clearly defined. In many
cases the symptom is identified rather than the
underlying problem.
For example, buying expired milk is not an inspection
failure; it is a recall-system failure.
Questions to ask are:
What is the scope of the problem?
What else is affected by the problem?
How often does it occur?
What impact will this have on the larger population?
Determining Root Causes
Four steps you can use to identify the Root Cause
Data Collection & Prioritization
Pareto Analysis
Cause Charting
Cause and Effect Diagram (Fishbone)
Root Cause Identification
Recommendation Generation and Implementation
Data Collection
Data collection provides information and an understanding of causal
factors.
Good data collection techniques involve:
Data Types – Attribute or Discrete
Good/Bad, Counts or Percentages
([Link] discrete-and-continuous-data-types)
Planning – When, Who, How, Stratification
Check Sheets - Consistency of Data Collection
Measurement System Analysis
Ensure the data collection process is “Repeatable and
Reproducible”
Pareto Chart
A Pareto chart is a graphical tool to prioritize multiple problems in a process
so you can focus on areas where the largest opportunities exist.
Pareto charts are a type of bar chart in which the horizontal axis represents
categories of interest.
By ordering the bars from largest to smallest, a Pareto chart can help you
determine which of the defects comprise the “vital few”, and which are the
“trivial many.”
The Pareto principle states that 80% of the effect is generated by 20% of the
causes. We want to focus on the 20%.
Sample Pareto Chart:
Processing Errors
[Figure: Pareto chart of processing errors, with bars ordered by count and a cumulative-percent line]
Category     Count  Percent  Cum %
Exception       73     58.9   58.9
HHG             18     14.5   73.4
TQ/TA           13     10.5   83.9
GHS              8      6.5   90.3
AT New Res       7      5.6   96.0
Other            5      4.0  100.0
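The percent and cumulative-percent columns follow directly from the raw counts; a short sketch of the arithmetic, using the category names from the chart:

```python
# Reproduce the cumulative-percent arithmetic behind a Pareto chart.
counts = {"Exception": 73, "HHG": 18, "TQ/TA": 13, "GHS": 8,
          "AT New Res": 7, "Other": 5}

total = sum(counts.values())  # 124 errors in total
# Pareto charts order categories from largest to smallest count
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

cum = 0.0
for category, count in ordered:
    pct = 100 * count / total
    cum += pct
    print(f"{category:12s} {count:4d} {pct:5.1f} {cum:6.1f}")
```

The running `cum` column is what identifies the "vital few": here the top category alone accounts for roughly 59% of all processing errors.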
Cause and Effect Diagram
(Also Called Fishbone)
What
A tool to represent the relationship between an effect
(problem) and its potential causes by category type.
When
Carried out when a root cause needs to be determined.
Why
To help ensure that a balanced list of ideas has been
generated during brainstorming.
To determine the real cause of
the problem versus a symptom.
To refine brainstormed ideas into more detailed causes.
Example: Fishbone Diagram
[Figure: fishbone diagram. Effect: Too many price adjustments at check-out]
Cause categories (the 6 Ms): Material, Machine, Methods, Measurements, Manpower, Mother Nature
Sample causes: master customer discount table not up to date; discovery of different
discount rates occurs too late in the process; billing process not accurate; computer
screens require too many “jumps”; computer screen updates; product shortages; power
failures; incomplete training on common complaints; not enough staffing during peak
times; unfamiliarity with procedures; notification of absence for vacation; marketing
metrics counterproductive; management policies.
Root Cause Identification
Asking the right questions will help address the
actual problem and not the symptoms.
Types of questions to ask:
What is the scope of the problem?
How many problems are there?
What is affected by the problem?
How often does the problem occur?
Root Cause Identification
Tools used to assist with Root Cause Identification:
Data Analysis
Pareto Charts
Fishbone Diagrams
5 Why Technique
Brainstorming
Affinity Diagrams
[Link]/9781138889255/Appendix_A.pdf
Root Cause Identification
Reduce the list of potential root causes
Rank root causes using Pareto Analysis
(Statistical)
Rank the items in order of significance
(Organizational)
Identify the items with the most significant
impact
Time
Cost
Manpower
Root Cause Identification
Confirm potential root causes relate to the
overall problem
Validate/Verify that root causes identified
have a causal relationship with the desired
output
Ensure the legitimacy of the measurement
system
Ensure results are repeatable and reproducible
Note: If you cannot state the problem simply,
you do not fully understand the problem.
Addressing the Root Cause(s)
Conduct Value Add Analysis
Ensure that items identified will
add value to the organization or
customer
Ensure that the items are
required by regulation or policy
Confirm that the item does not
add value and is not needed or
required
Recommendation
Implementation
Things to consider prior to implementation:
Determine the impact the root causes will
have on critical inputs (X)
Estimate impact of the root cause on over-all
output (Y)
Recommendation
Implementation (Management)
Implement recommendations based on:
Significance to organizational goals and
objectives
Availability of personnel, finances or other
essential resources
Complexity of the implementation
Evaluate controls required to maintain corrective
actions after implementation.
Definition of the 5 Whys
The 5 Whys is an iterative question-asking
technique used to explore the cause-and-
effect relationships underlying a
particular problem.
The primary goal of the technique is to
determine the root cause of a defect or
problem. (The "5" in the name derives
from an empirical observation on the
number of iterations typically required to
resolve the problem.)
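As a sketch, a 5 Whys chain can be recorded as ordered question-answer pairs in which each answer prompts the next "why". The scenario below (a late report) is entirely hypothetical:

```python
# A hypothetical 5 Whys chain; the last answer is treated as the root cause.
whys = [
    ("Why was the report late?", "The data load finished late."),
    ("Why did the load finish late?", "It reran after a failure."),
    ("Why did it fail?", "An input file had an invalid date format."),
    ("Why was the format invalid?", "The source system changed its export format."),
    ("Why was the change missed?", "No format validation runs on incoming files."),
]

for i, (question, answer) in enumerate(whys, start=1):
    print(f"Why #{i}: {question} -> {answer}")

root_cause = whys[-1][1]
print("Root cause:", root_cause)
```

Note how the chain stops at a cause management can control (adding validation to the intake process) rather than at the symptom (the late report).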
Benefits of the 5 Whys
Help identify the root cause of a problem.
Determine if there is a relationship
between different root causes of a
problem.
Simplicity; easy to complete without
statistical analysis.
Effective when problems involve human
factors or interactions.
Table Top Exercise
Problem Statement 1: You have to spend more and
more money on your utility bills.
Problem Statement 3: You frequently arrive to work
late in the mornings and you are faced with disciplinary
action if you don’t correct it immediately.
Problem Statement 4: You do not have enough money
to retire comfortably.
Data Quality
Management Plan
The Data Quality Management
Process
The process of data quality management is composed of
four main steps, which can be organized in a continuous
loop, as shown in the following figure.
Data Quality Management
[Figure: continuous loop: Data Definition → Data Quality Assessment → Problem Resolution → Data Quality Monitoring → back to Data Definition]
Data Definition
In this step the data describing the business of the
undertaking must be appropriate and complete. The
definition of the data involves the identification of data
requirements that fulfil this criterion. Data requirements
should contain a proper description of the individual items
and their relationships.
Data Quality Assessment
Data quality assessment involves validating the data
according to the three criteria: appropriateness,
completeness, and accuracy. The assessment should
consider the channel through which data is collected
and elaborated, whether through internal systems,
external third parties, or publicly available electronic
sources.
Problem resolution
The problems that are identified during the assessment
of the data quality are addressed in this phase. It is
important to document data limitations and justify the
remedies applied to poor data.
Data Quality Monitoring
Data quality monitoring involves monitoring the
performance of the associated IT systems, based on
data quality performance indicators. Data quality
monitoring involves two dimensions: quantitative and
qualitative.
Data Quality Tools
A data quality tool set comprises much more than technology:
it also includes roles and organizational structures, processes
for monitoring, measuring and remediating data quality
issues, and links to broader information governance
activities via data-quality-specific policies.
Example:
Profiling
Parsing and standardization
Generalized “cleansing”
Matching
Monitoring
Data Quality Tools
Profiling - The analysis of data to capture statistics
(metadata) that provide insight into the quality of data
and help to identify data quality issues.
Discover metadata of the source database, including value
patterns and distributions, key candidates, foreign-key
candidates, and functional dependencies
Parsing - The decomposition of text fields into
component parts and the formatting of values into
consistent layouts based on industry standards, local
standards (for example, postal authority standards for
address data), user-defined business rules, and
knowledge bases of values and patterns.
Parsing: Breaking a data block into smaller chunks by
following a set of rules, so that it can be more easily
interpreted, managed, or transmitted
Read more: [Link]
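Profiling as described above can be sketched with simple per-column statistics over a source table. The table, column names and sample values below are hypothetical:

```python
from collections import Counter

rows = [  # hypothetical source table with deliberately dirty values
    {"id": 1, "country": "MY", "age": "34"},
    {"id": 2, "country": "MY", "age": ""},
    {"id": 3, "country": "SG", "age": "x2"},
    {"id": 3, "country": "SG", "age": "29"},
]

for column in ("id", "country", "age"):
    values = [r[column] for r in rows]
    filled = [v for v in values if v != ""]
    print(column,
          "distinct:", len(set(values)),
          "completeness:", f"{len(filled)/len(values):.0%}",
          "top:", Counter(values).most_common(1))

# A column whose distinct count equals the row count is a key candidate;
# here "id" fails that test (value 3 repeats), flagging a likely defect.
```

Commercial profilers extend the same idea to value patterns, foreign-key candidates and functional dependencies, but the metadata they capture starts from counts like these.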
Data Quality Tools
Generalized “cleansing” – The modification of data
values to meet domain restrictions, integrity constraints
or other business rules that define when the quality of
data is sufficient for an organization.
Matching - Identifying, linking or merging related
entries within or across sets of data.
Monitoring - Deploying controls to ensure that data
continues to conform to business rules that define data
quality for the organization.
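Matching can be illustrated with a basic string-similarity link between two lists, using Python's standard `difflib`. The customer names and the 0.7 threshold are illustrative assumptions; production tools use tuned, attribute-specific matching rules:

```python
from difflib import SequenceMatcher

# Two hypothetical customer lists to be linked (record matching)
crm = ["W. W. Grainger", "Acme Supplies", "Northwind Traders"]
erp = ["WW Grainger", "ACME Supplies Inc", "Contoso Ltd"]

def similarity(a, b):
    """Normalized string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # assumed cut-off; real tools tune this per attribute
for name in erp:
    best = max(crm, key=lambda c: similarity(name, c))
    if similarity(name, best) >= THRESHOLD:
        print(f"{name!r} matched to {best!r}")
    else:
        print(f"{name!r} has no confident match")
```

Pairs that score near the threshold are typically routed to a data steward for manual review rather than merged automatically.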
Data Quality Tools
Enrichment - Enhancing the value of internally held
data by appending related attributes from external
sources (for example, consumer demographic attributes
and geographic descriptors)
Data Cleaning
No matter how efficient the process of data entry,
errors will still occur and therefore data validation and
correction cannot be ignored.
Purpose
To detect and fix defects and errors
To identify the basic causes of errors, and use that
information to improve the data entry process
Data Cleaning
The process may include:
format checks
completeness checks
reasonableness checks
limit checks
review of the data to identify outliers (geographic, statistical,
temporal or environmental) or other errors,
assessment of data by subject area experts (e.g. taxonomic
specialists)
filling in missing values
smoothing noisy data
identifying or removing outliers, and
resolving inconsistencies
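The checks listed above can be sketched for a single record. The field names and the temperature bounds below are assumptions chosen for illustration:

```python
def clean_checks(row):
    """Run basic cleaning checks on one hypothetical survey row."""
    problems = []
    # Completeness check: all required fields present and non-empty
    for field in ("site", "date", "temp_c"):
        if field not in row or row[field] in ("", None):
            problems.append(f"completeness: {field} missing")
    # Format check: date shaped as YYYY-MM-DD
    # (a shallow shape test only; it does not validate calendar values)
    date = str(row.get("date", ""))
    if len(date) != 10 or date[4] != "-" or date[7] != "-":
        problems.append("format: date not YYYY-MM-DD")
    # Limit / reasonableness check: air temperature within plausible bounds
    try:
        t = float(row.get("temp_c"))
        if not -60 <= t <= 60:
            problems.append("limit: temp_c outside -60..60")
    except (TypeError, ValueError):
        problems.append("format: temp_c not numeric")
    return problems

print(clean_checks({"site": "A1", "date": "2024-01-02", "temp_c": 212}))
# -> ['limit: temp_c outside -60..60']
```

Each flagged problem then feeds the framework's later steps: correct the value, document the error instance, and adjust the entry procedure that let it through.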
Data Cleaning
Data cleaning framework (Maletic & Marcus, 2000)
Define and determine error types
Search and identify error instances;
Correct the errors;
Document error instances and error types; and
Modify data entry procedures to reduce future errors.
Data Cleaning
Four methods:
Correct
Filter
Detect and Report
Prevent
What tools can be used for data cleaning?
Discuss the issues related to data cleaning.
Data Quality Assessment
Purpose - to identify the quality of the data in the
identified business activity
The assessment results determine the accuracy,
completeness, consistency, precision, reliability,
temporal reliability, uniqueness and validity of the data
What metrics/indicators/criteria are used to
measure data quality?
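Several of these dimensions can be expressed as simple ratios over a data set. A minimal sketch, assuming a hypothetical customer extract where `id` should be unique and `email` should be present and well formed:

```python
records = [  # hypothetical customer extract with deliberate defects
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},             # incomplete
    {"id": 2, "email": "b@y.com"},      # duplicate id
    {"id": 3, "email": "not-an-email"}, # invalid
]

n = len(records)
# Completeness: share of records with a non-empty email
completeness = sum(1 for r in records if r["email"]) / n
# Validity: share of records whose email passes a crude shape test
validity = sum(1 for r in records
               if "@" in r["email"] and "." in r["email"].split("@")[-1]) / n
# Uniqueness: share of distinct ids among all records
uniqueness = len({r["id"] for r in records}) / n

print(f"completeness={completeness:.0%} "
      f"validity={validity:.0%} uniqueness={uniqueness:.0%}")
# -> completeness=75% validity=50% uniqueness=75%
```

Tracked over time, ratios like these become the data quality performance indicators used in the monitoring step of the management loop.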
Sources
• Loshin, David. Evaluating the Business Impacts of Poor
Data Quality. Knowledge Integrity, Inc.
• Top 10 Root Causes of Data Quality Problems. Talend
White Paper, Talend Open Integration Solutions.
• Upshur, Terry. Institute of Internal Auditors (IIA).
Director of Support Services, Office of the Inspector
General, U.S. House of Representatives.
• Neri, Massimiliano. Meeting the Data Quality Management
Challenges of Solvency. White Paper, Moody’s Analytics,
May 2011.
• Friedman, Ted. Magic Quadrant for Data Quality Tools.
Gartner, 2012.