11/5/15
Data Profiling
Helena Galhardas
DEI/IST
References
Slides Data Profiling course, Felix Naumann,
Trento, July 2015
Z. Abedjan, L. Golab, F. Naumann, Profiling
Relational Data: A Survey, VLDBJ 2015
T. Papenbrock et al., Data Profiling with
Metanome, demo paper, VLDB 2015
Definition Data Profiling
Data profiling is the process of examining the
data available in an existing data source [...] and
collecting statistics and information about that
data.
Wikipedia 09/2013
Data profiling refers to the activity of creating
small but informative summaries of a database.
Ted Johnson, Encyclopedia of Database Systems
Data profiling is the set of activities and
processes to determine the metadata about a
given dataset.
Profiling in Spreadsheets
[Figure: spreadsheet screenshots annotated with "Column labels" and "Number of rows". Source: Felix Naumann, Data Profiling, Trento 2015]
Many interesting questions remain
What are the possible primary keys and foreign keys?
Phone
firstname, lastname, street
Are there any functional dependencies?
zip -> city
race -> voting behavior
Which columns correlate?
Date-of-Birth and first name
State and last name
What are frequent patterns in a column?
ddddd
dd aaaa St
Results of data profiling
Encompasses several methods to examine
datasets and produce metadata
Simple results to compute:
Number of null and distinct values in a column
Data type of a column
Most frequent patterns of data values in a column
More difficult results to compute involve several
columns:
Inclusion dependencies
Functional dependencies, etc
Challenges
Managing the input
Decide which profiling tasks to execute on which
parts of the data
Performing the computation
Computational complexity depends on the
number of rows, and the number of columns;
sorting is a typical operation
Managing the output
Meaningfully interpret the profiling results; usually
performed by database and domain experts
Existing technology
SQL queries and spreadsheet browsing
Dedicated tools or components
E.g., IBM Information Analyzer, Microsoft SQL Server
Integration Services, Informatica Data Explorer
Innovative ways to handle the challenges
E.g., using indexes, parallel processing
Methods to deliver approximate results
E.g., by profiling samples
Narrowing the discovery process to certain
columns or tables
E.g., verifying inclusion dependencies on user-
suggested pairs of columns
Typical data profiling procedure
1. User specifies data to be profiled and
chooses type of metadata to be generated
2. Tool computes the metadata in batch mode
(using SQL queries or specialized
algorithms)
Can last minutes or hours
3. Tool displays results in a vast collection of
tabs, tables, charts, and other visualizations
Discovered results can be translated into rules
or constraints to be enforced in a subsequent
data cleaning step
Use Cases for Data Profiling
Data cleaning
Data profiling results can be used to measure/monitor the quality of a dataset
Data exploration
To gain insight into new datasets: simple ad-hoc SQL queries return simple statistics (e.g., number of
distinct values)
Automated data profiling is required
Database management
Basic statistics gathered by a DBMS: number of values, number of non-null values, etc
Optimizer uses these statistics to estimate selectivity of operators and perform query
optimization
Database reverse engineering
To identify relations and attributes, domain semantics, foreign keys and cardinalities
Result: ER model or logical schema to assist experts in maintaining, integrating and querying
the DB
Data integration
For finding semantically correct correspondences between elements of two schemata (schema
matching)
Cross-DB inclusion dependencies suggest which tables may be combined with a join operation
Big Data analytics
Profiling as preparation and for initial insights
Important to determine which data to mine, how to import it into various tools and how to
interpret the results
Data profiling as preparation for any other data management task
Types of storage of input data
Relational database
Data profiling methods can make use of SQL
queries and indexes
CSV file
Data profiling methods need to create their own
data structures in memory or on disk
Mixed approach
Data originally in the database are read once and
processed further outside the database
The type of storage for input data has an
impact on the performance of the data
profiling algorithms and tools
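For the relational-database case, simple column statistics can be pushed down to the DBMS as SQL aggregates. The sketch below assumes an in-memory SQLite database; the table `person`, its columns, and the helper `profile_column_sql` are hypothetical names chosen for illustration.

```python
import sqlite3

def profile_column_sql(conn, table, column):
    """Let the database compute basic statistics for one column.

    `table` and `column` are interpolated into the SQL text, so they must
    come from trusted schema metadata, never from user input.
    """
    query = f"""
        SELECT COUNT(*),
               COUNT({column}),
               COUNT(DISTINCT {column}),
               MIN({column}),
               MAX({column})
        FROM {table}
    """
    row = conn.execute(query).fetchone()
    keys = ("num_rows", "num_non_null", "num_distinct", "min_value", "max_value")
    return dict(zip(keys, row))

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Ana", "1000"), ("Rui", "1000"), ("Eva", None), ("Luis", "2000")],
)
stats = profile_column_sql(conn, "person", "zip")
```

For a CSV file, the same statistics would have to be computed over data structures built by the profiling code itself.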
Data profiling vs. data mining
Data profiling gathers technical metadata to support
data management
Data mining and data analytics discover non-obvious
results to support business management with new
insights
Data profiling results: information about columns and
column sets
Data mining results: information about rows or row
sets
clustering, summarization, association rules,
Recommendation or classification are not related to data
profiling
Outline
Data profiling tasks
Data profiling tools
Visualization
Classification of Traditional Data Profiling Tasks
Data profiling
  Single column
    Cardinalities
    Patterns and data types
    Value distributions
  Multiple columns
    Uniqueness
      Key discovery
      Conditional
      Partial
    Inclusion dependencies
      Foreign key discovery
      Conditional
      Partial
    Functional dependencies
      Conditional
      Partial
Data profiling tasks and their primary uses
Single column profiling
Analysis of individual columns in a given
table
Most basic form of data profiling
Assumption: All values are of same type
Assumption: All values have some common
properties to be discovered
Discover data types
Often part of the basic statistics gathered by DBMS
Complexity: Number of values/rows
Cardinalities
Number of values (number of rows)
Length of values in terms of characters
Number of distinct values
Number of NULLs
MIN and MAX value
Useful for
Query optimization
Categorization of attribute
Relevance of attribute
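The cardinalities listed above can be computed in a single pass over a column. This is a minimal sketch, assuming the column is a Python list in which `None` stands for NULL; `column_cardinalities` is a hypothetical helper name.

```python
def column_cardinalities(values):
    """Basic cardinalities of one column; None represents NULL."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "num_rows": len(values),
        "num_nulls": len(values) - len(non_null),
        "num_distinct": len(set(non_null)),
        "min_value": min(non_null) if non_null else None,
        "max_value": max(non_null) if non_null else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
    }

# Hypothetical column of city names
cities = ["Lisboa", "Porto", None, "Porto"]
card = column_cardinalities(cities)
```

A column whose number of distinct values equals its number of rows is a candidate for a key; one with very few distinct values is likely categorical.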
Data completeness
Finding disguised missing values (e.g., when
using web forms including fields whose values
must be chosen from pull-down lists)
9999-999 for the zip code
Alabama for the USA state
Methods: determine the distribution of values
and find out that disguised missing values are
occurring much more often
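The distribution-based method above can be sketched as follows. The rule used here (flag a value when its count exceeds `factor` times the median count) is an assumed heuristic for illustration, not a method prescribed by the slides; `suspicious_values` is a hypothetical name.

```python
from collections import Counter
from statistics import median

def suspicious_values(values, factor=5.0):
    """Return values that occur far more often than a typical value.

    Assumed heuristic: a value is suspicious when its frequency exceeds
    `factor` times the median frequency of all distinct values.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return []
    typical = median(counts.values())
    return [v for v, c in counts.most_common() if c > factor * typical]

# Hypothetical zip-code column with a disguised missing value
zips = ["9999-999"] * 20 + ["1000-001", "1049-001", "4000-123", "2000-011"]
flagged = suspicious_values(zips)
```

Flagged values are only candidates: a legitimately frequent value would also be reported, so a domain expert still has to confirm them.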
Data types and value patterns
Discovering the basic type of a column (in order of increasing difficulty):
String vs. number
String vs. number vs. date
SQL data types (CHAR, INT, DECIMAL, ...)
Extracting frequent patterns observed in the
data of a column:
Regular expressions (\d{3})-(\d{3})-(\d{4})-(\d+)
Finding the meaning of a column (semantic
domain)
Address, phone, email, first name
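A simple way to extract value patterns such as ddddd is to abstract every character to its class and count the resulting strings. The sketch below assumes string values; the character classes (d for digit, a for letter) and the function names are illustrative choices, and the type check only distinguishes numbers from strings.

```python
import re
from collections import Counter

# Plain integers and decimals, with an optional sign
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")

def value_pattern(value):
    """Abstract a string: digits become 'd', letters 'a', the rest is kept."""
    return "".join(
        "d" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )

def frequent_patterns(values, top=3):
    """Most frequent abstracted patterns among the non-null values."""
    patterns = Counter(value_pattern(v) for v in values if v is not None)
    return patterns.most_common(top)

def basic_type(values):
    """Guess 'number' if every non-null value looks numeric, else 'string'."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    if all(NUMBER.fullmatch(v) for v in non_null):
        return "number"
    return "string"
```

Rare patterns in an otherwise regular column are good candidates for data quality problems.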
Value distributions
Probability distribution for numeric values
Detect whether data follows some well-known distribution
Determine that distribution function for data values
If no specific/useful function detectable: histograms
[Figures: normal distributions; Laplace distributions]
Histograms
Determine (and display) value frequencies for value intervals or for
individual values
Estimation of probability distribution for continuous variables
[Figure: histogram of a grade distribution]
Useful for
Query optimization
Outlier detection
Visualize distribution
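As an illustration of value frequencies per interval, here is a minimal equi-width histogram. It assumes numeric values and a fixed number of buckets; equi-width bucketing is one possible choice, and `equi_width_histogram` is a hypothetical name.

```python
def equi_width_histogram(values, num_buckets):
    """Return (low, high, count) triples for equally wide value intervals."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single degenerate bucket
        return [(lo, hi, len(values))]
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls into the last bucket
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical grades on a 0-20 scale
grades = [0, 5, 9, 10, 12, 14, 15, 18, 20, 20]
hist = equi_width_histogram(grades, 4)
```

Buckets with very low counts far from the bulk of the data point to possible outliers.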
Multi-column data profiling
Covers multiple columns simultaneously
Identifies inter-value dependencies and column
similarities
Identifies correlations between values through
frequent patterns or association rules
Complexity: Number of columns and number of
values
Correlations and association rules
Correlation analysis reveals related numeric
columns (e.g., salary and age in relation
Employees)
Naive method: compute pairwise correlations
among all pairs of columns
Association rules: denote relationships or patterns
between attribute values among columns
Ex: Employees(emp-nb, dept, position, allowance)
{dept=finance, position=manager} -> {allowance=$1000}
Algorithms: Apriori, FP-growth
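An association rule such as the Employees example is usually judged by its support and confidence. The sketch below only evaluates one given rule on rows stored as dictionaries; it does not mine rules the way Apriori or FP-growth do, and the sample rows are hypothetical.

```python
def rule_support_confidence(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Both sides are dicts mapping attribute names to required values.
    """
    def matches(row, condition):
        return all(row.get(k) == v for k, v in condition.items())

    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(1 for r in rows
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / len(rows) if rows else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical Employees instance
employees = [
    {"dept": "finance", "position": "manager", "allowance": 1000},
    {"dept": "finance", "position": "manager", "allowance": 1000},
    {"dept": "finance", "position": "clerk", "allowance": 200},
    {"dept": "sales", "position": "manager", "allowance": 800},
]
support, confidence = rule_support_confidence(
    employees,
    {"dept": "finance", "position": "manager"},
    {"allowance": 1000},
)
```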
Clustering
To segment the records into homogeneous
groups using a clustering algorithm
Records that do not fit any cluster are flagged
as outliers
May indicate data quality problems
Algorithms: K-means, for example
Dependencies
Metadata that describe relationships
among columns
Discovery of primary keys with the help of unique
column combinations
Discovery of foreign keys with the help of inclusion
dependencies
Functional dependencies
Complexity: Number of columns and number of
values
Several algorithms for detecting dependencies
Uniqueness and keys
Set of columns R.X that contain only unique
value combinations
(Primary) key candidate
No null values
Uniqueness and non-null in one instance do not
imply key: only a human can specify keys
Algorithms: Gordian, DUCC, SWAN
Useful for
Schema design, data integration, indexing,
optimization
Inverse: non-uniques are duplicates
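A naive check for unique column combinations can be written directly from the definition. This sketch enumerates combinations level by level and is exponential in the number of columns; it is not Gordian, DUCC, or SWAN, which prune the search space. Rows are assumed to be dictionaries, and the function names are hypothetical.

```python
from itertools import combinations

def is_unique(rows, columns):
    """True if the column combination has no NULLs and no repeated values."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if any(v is None for v in key) or key in seen:
            return False
        seen.add(key)
    return True

def minimal_unique_combinations(rows, columns, max_size=2):
    """Naively enumerate minimal unique column combinations up to max_size."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already found unique combination
            if any(set(f) <= set(combo) for f in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found

# Hypothetical instance
people = [
    {"first": "Ana", "last": "Silva", "phone": "1"},
    {"first": "Ana", "last": "Costa", "phone": "2"},
    {"first": "Rui", "last": "Silva", "phone": "3"},
]
candidates = minimal_unique_combinations(people, ["first", "last", "phone"])
```

The result lists key candidates for this instance only; whether any of them is a real key remains a design decision.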
Inclusion dependencies (IND) and
foreign keys (FKs)
R.A ⊆ S.B
All values in R.A are also present in S.B
R.A1,...,R.An ⊆ S.B1,...,S.Bn:
All value combinations in R.A1,...,R.An are also present
in S.B1,...,S.Bn
Prerequisite for foreign key:
Used across relations
Used across databases
But again: Discovery on a given instance, only user can specify
for schema
Algorithms for IND detection: Spider, BINDER
INDs useful for
suggesting how to join two relations
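The unary IND definition above translates directly into a set-containment test. The sketch below checks every column pair between two tables, which is the naive approach and not what Spider or BINDER do; tables are assumed to be dictionaries mapping column names to value lists, and NULLs in the dependent column are ignored (an assumption).

```python
def ind_holds(dependent_values, referenced_values):
    """True if every non-null dependent value also occurs in the referenced column."""
    referenced = set(referenced_values)
    return all(v in referenced for v in dependent_values if v is not None)

def unary_inds(r_columns, s_columns):
    """All pairs (A, B) such that R.A is included in S.B, by exhaustive testing."""
    return [
        (a, b)
        for a, a_values in r_columns.items()
        for b, b_values in s_columns.items()
        if ind_holds(a_values, b_values)
    ]

# Hypothetical tables
orders = {"customer_id": [1, 2, 2, None]}
customers = {"id": [1, 2, 3], "age": [30, 41, 25]}
found = unary_inds(orders, customers)
```

Each discovered pair is a foreign-key candidate and a hint for how the two relations could be joined.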
Functional dependencies
X -> A
whenever two records have the same X values, they
also have the same A values, where X is a set of
attributes
E.g., street, number -> zip-code
Algorithms for detecting FDs: TANE, FUN, FD-Mine, etc
Useful for
Schema design
Normalization
Keys
Data cleansing
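Checking whether a given FD holds on an instance follows directly from the definition: group the records by their X values and verify that each group has a single A value. This sketch validates one candidate FD only; discovery algorithms such as TANE search the space of candidates. Rows are assumed to be dictionaries, and the sample data is hypothetical.

```python
def fd_holds(rows, lhs, rhs):
    """True if records that agree on all `lhs` columns also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Remember the first rhs value per lhs value; any other value violates the FD
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

# Hypothetical address records
addresses = [
    {"street": "Rua A", "number": 1, "zip": "1000-001"},
    {"street": "Rua A", "number": 1, "zip": "1000-001"},
    {"street": "Rua A", "number": 2, "zip": "1000-002"},
    {"street": "Rua B", "number": 1, "zip": "4000-123"},
]
```

With this data, `fd_holds(addresses, ["street", "number"], "zip")` is true, while `fd_holds(addresses, ["street"], "zip")` is false because Rua A maps to two zip codes.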
Partial dependencies
Real datasets contain exceptions to the rule so dependencies
can be relaxed
Aka approximate dependencies: hold for a subset of records
INDs and FDs that do not perfectly hold
For all but 10 of the tuples
Only for 80% of the tuples
Only for 1% of the tuples
Also for patterns, types, uniques, and other constraints
Useful for
Data cleansing
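One common way to quantify how far an FD is from holding is the minimum fraction of tuples that must be removed so that it holds exactly (often called the g3 error). The sketch below computes it for a single candidate FD; the threshold at which a dependency counts as "partial" is left to the user, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def fd_violation_ratio(rows, lhs, rhs):
    """Minimum fraction of rows to remove so that lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    # Within each lhs group, keep only the most frequent rhs value
    kept = sum(max(counter.values()) for counter in groups.values())
    return (len(rows) - kept) / len(rows) if rows else 0.0

# Hypothetical instance: one record spells the city differently
records = [
    {"zip": "1000", "city": "Lisboa"},
    {"zip": "1000", "city": "Lisboa"},
    {"zip": "1000", "city": "Lisboa"},
    {"zip": "1000", "city": "Lisbon"},
    {"zip": "4000", "city": "Porto"},
]
ratio = fd_violation_ratio(records, ["zip"], "city")
```

Here the FD zip -> city holds for all but one of five tuples; the violating record is a natural candidate for cleansing.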
Conditional dependencies
Given a partial IND or FD: for which part of the data does it hold?
Example: conditional unique column combination
street is unique for all records with city = Lisbon
Expressed as a condition over the attributes of the
relation
Problems:
Infinite possibilities of conditions
Interestingness:
Many distinct values: less interesting
Few distinct values: surprising condition, high coverage
Useful for
Integration: cross-source cINDs
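The conditional unique column combination from the example can be tested by restricting the instance to the records that satisfy the condition. This sketch checks one given condition and reports its coverage; it does not search the space of possible conditions, and the names and data are hypothetical.

```python
def unique_under_condition(rows, columns, condition):
    """Check uniqueness of `columns` among rows matching `condition`.

    Returns (holds, coverage), where coverage is the fraction of all rows
    that satisfy the condition.
    """
    selected = [r for r in rows
                if all(r[k] == v for k, v in condition.items())]
    keys = [tuple(r[c] for c in columns) for r in selected]
    holds = len(keys) == len(set(keys))
    coverage = len(selected) / len(rows) if rows else 0.0
    return holds, coverage

# Hypothetical instance: street repeats in Porto but not in Lisbon
places = [
    {"city": "Lisbon", "street": "Rua A"},
    {"city": "Lisbon", "street": "Rua B"},
    {"city": "Porto", "street": "Rua C"},
    {"city": "Porto", "street": "Rua C"},
]
holds, coverage = unique_under_condition(places, ["street"], {"city": "Lisbon"})
```

Reporting coverage alongside the result helps rank conditions by interestingness.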
Research data profiling tools
Bellman: Column statistics, column similarity, candidate
key discovery
Potter's Wheel: Column statistics (including value
patterns)
Data Auditor: CFD and CIND discovery
RuleMiner: Denial constraint discovery
MADlib: Simple column statistics
Profiler: visual data profiler tool
Metanome: in a few slides
Commercial data profiling tools
IBM InfoSphere Information Analyzer
[Link]
Oracle Enterprise Data Quality
[Link]
Talend Data Quality
[Link]
Ataccama DQ Analyzer
[Link]
SAP BusinessObjects Data Insight and SAP BusinessObjects Information Steward
[Link]
[Link]
Informatica Data Explorer
[Link]
Microsoft SQL Server Integration Services Data Profiling Task and Viewer
[Link]
Trillium Software Data Profiling
[Link]
CloverETL Data Profiler
[Link]
OpenRefine
[Link]
and many more
Often packaged with data quality / data cleansing software
Very long feature lists
Num rows
Single column primary key discovery
Min value length
Multi-column primary key discovery
Median value length
Single column IND discovery
Max value length
Inclusion percentage
Avg value length
Single-column FK discovery
Precision of numeric values
Scale of numeric values
Multi-column IND discovery
Quartiles
Multi-column FK discovery
Basic data types
Value overlap (cross domain analysis)
Num distinct values ("cardinality")
Single-column FD discovery
Percentage null values
Multi-column FD discovery
Data class and data type
Text profiling
Uniqueness and constancy
Single-column frequency histogram
Multi-column frequency histogram
Pattern discovery (Aa9)
Soundex frequencies
Benford Law Frequency
Screenshots from Talend Data Quality
Screenshots for
IBM Information Analyzer
Typical Shortcomings of Tools
(and research methods)
Usability
Complex to configure
Results complex to view and interpret
Scalability
Main-memory based
SQL based DBMS
Efficiency
Coffee, Lunch, Overnight
Functionality
Restricted to simplest tasks
Restricted to individual columns or small column sets
Realistic key candidates vs. further use-cases
SAP R/3 schema has many tables with up to 16 columns as key
Interpretation of profiling results: that's the big one
Metanome
Extensible profiling platform that incorporates
several state-of-the-art metadata discovery
algorithms
Goals:
To provide novel profiling algorithms from research
To perform comparative evaluations
To support developers in building/testing new algorithms
Typical users:
Database administrators and IT professionals
Developers and researchers
See in: [Link]
[Link]
Design Goals
Simplicity
Should be easy to set up and use
Extensibility
New algorithms and datasets should be easy
to add to the system
Standardization
All common tasks, tooling, input parsing, result
handling should be provided
Flexibility
Make as few restrictions as possible to the
algorithms
Metanome architecture
[Figure: Metanome architecture. Labels: algorithm execution; algorithm configuration; result management; result presentation; algorithm jars (SWAN, SPIDER, DUCC); input sources (DB2, MySQL, txt, csv, xml); configuration, measurements, results]
Most important tasks
Input parsing
Build an abstraction around input sources; specific formats are
irrelevant to profiling algorithms
Handles relational databases/files/tables, JSON/RDF/XML files
Output processing
Standardize the output formats depending on the type of metadata the
algorithm discovers
Most important metadata supported: unique column combinations,
INDs, FDs, order dependencies, basic statistics
Parameterization handling
Defines the parameterization of algorithms through the configuration
variables exposed by the profiling algorithms (set by the user)
Temporary data management
Provides dedicated temp-files for storing temporary data written by
profiling algorithms
Profiling algorithms
A profiling algorithm needs to implement a given
set of light-weight interfaces
Work autonomously: they are treated as foreign
code modules that manage themselves
providing maximum flexibility for their design
Algorithms supported:
UCCs: DUCC
INDs: MIND, SPIDER, BINDER
FDs: TANE, FUN, FD_MINE, etc
ODs: ORDER
Snapshot visualization of results
Snapshot different visualization
techniques
Motivation
Human in the loop for data profiling and data
cleansing.
Advanced visualization techniques
Beyond bar-charts and pie-charts
Interactive visualization
Support users in visualizing data, profiling results
Support any action taken upon the results
Cleansing, sorting, ...
Re-profile and visualize immediately
Profiler: Integrated Statistical Analysis
and Visualization for Data Quality Assessment
http://[Link]/files/[Link]
[Link]
http://[Link]/GapminderMedia/wp-uploads/[Link]
Next Lecture
Introduction to Data Warehouse