Data Integration Concepts, Processes,
and Techniques
Concepts of Data Integration Processes
Lesson Objectives
• Explain diagrams for refresh processing and
typical tasks
• Discuss difficulties of initial population of a data
warehouse
• Understand tradeoffs and constraints in
managing refresh processing
Extract, Transform, Load (ETL)
Transform
• This stage applies a series of rules to the data extracted from the source to derive the data for loading into the end target
• Selecting only certain columns to load
• Translating coded values (e.g., the source stores 1 for male and 2 for female, but the target stores M and F)
• Encoding free-form values (e.g., mapping "Male" to "M")
• Deriving a new calculated value
• Sorting
• Joining data from multiple sources (e.g., lookup, merge) and de-duplicating the data
• Aggregation (e.g., summarizing multiple rows of data: total sales for each store, for each region, etc.)
Transform (continued)
• Generating surrogate-key values
• Transposing or pivoting
• Splitting a column into multiple columns
• Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
• Applying any form of simple or complex data validation
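A few of the transformations listed above can be sketched in plain Python. This is a minimal illustration, not a definitive ETL implementation: the rows, field names, and the code mapping are all hypothetical.

```python
from collections import defaultdict
from itertools import count

# Hypothetical extracted rows; every field name and value is illustrative.
extracted = [
    {"store": "S1", "region": "West", "gender_code": 1, "qty": 2, "unit_price": 5.0, "note": "x"},
    {"store": "S1", "region": "West", "gender_code": 2, "qty": 1, "unit_price": 8.0, "note": "y"},
    {"store": "S2", "region": "East", "gender_code": 1, "qty": 4, "unit_price": 2.5, "note": "z"},
]

GENDER_CODES = {1: "M", 2: "F"}    # translating coded values (assumed mapping)
surrogate_keys = count(start=1)    # generating surrogate-key values


def transform(row):
    """Apply column selection, code translation, and a derived value to one row."""
    return {
        "sale_key": next(surrogate_keys),
        "store": row["store"],                        # only selected columns are kept
        "region": row["region"],
        "gender": GENDER_CODES.get(row["gender_code"], "U"),
        "amount": row["qty"] * row["unit_price"],     # derived calculated value
    }


transformed = [transform(r) for r in extracted]

# Aggregation: total sales for each store.
totals = defaultdict(float)
for r in transformed:
    totals[r["store"]] += r["amount"]

print(dict(totals))
```

The "note" column is dropped because it is not selected, and unknown gender codes fall back to "U", which is an assumption of this sketch.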
Motivation for Data Integration
• Add value to disparate data sources by data transformations
• Find single source of truth for decision making
• Populating and maintaining a warehouse is complex
• Overcome challenges
– Large volumes of data
– Widely varying formats & units of measure
– Different update frequencies
– Missing data
– Lack of common identifiers
• Critical success factor for data warehouse projects
– Initially populating a data warehouse &
– Periodically refreshing a warehouse as data sources change
• Significant investments in effort, hardware, and software
Data Sources
• Internal Data Sources
– procured and consolidated from different branches
within your organization
• Examples: purchase orders from the sales team, transactions from accounting,
pre-orders from inventory management, and leads from marketing
• External Data Sources
– Not collected by your organization.
– Obtained from a source outside of your
organization.
• Examples include purchasing a list from a list broker or gaining
access to a proprietary database
Business Analyst Perspective
[Figure: factors (qualitative variables) Location, Marketplace, Management, and Compensation related to the outcome variable (quantitative variable) Employee Turnover]
Periodic Refresh Processing of DWH
• Valid time lag: the difference between the occurrence of an event in the real world (valid time)
and the storage of the event in an operational database (transaction time)
• Load time lag: the difference between transaction time
and the storage of the event in a data warehouse (load time)
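The two lags are simple differences between timestamps. A minimal sketch, with hypothetical timestamps for a single event:

```python
from datetime import datetime

# Hypothetical timestamps for one event.
valid_time = datetime(2024, 3, 1, 9, 0)        # event occurs in the real world
transaction_time = datetime(2024, 3, 1, 9, 5)  # event stored in the operational database
load_time = datetime(2024, 3, 2, 2, 0)         # event stored in the data warehouse

valid_time_lag = transaction_time - valid_time
load_time_lag = load_time - transaction_time

print("valid time lag:", valid_time_lag)
print("load time lag:", load_time_lag)
```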
Periodic Refresh Workflow of DWH
1. Retrieve data from each individual source system
2. Move the extracted data to the staging area
3. Standardize and improve the quality of the extracted data
4. Record the results of the cleaning process, perform completeness and reasonableness checks, and handle exceptions
5. Merge the separately cleaned sources into one source, removing inconsistencies
6. Record the results of the merging process, perform completeness and reasonableness checks, and handle exceptions
7. Propagate the integrated change data to fact and dimension tables, materialized views, stored data cubes, and data marts
8. Notify user groups and administrators
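The refresh workflow can be sketched as a chain of small functions. This is only an illustration under strong simplifying assumptions: sources are in-memory lists, the only cleaning rule is name standardization, the only audit check is a missing id, and the "warehouse" is a dictionary. All function and field names are hypothetical.

```python
def extract(source):
    """Retrieve data from one source system (here, an in-memory list)."""
    return list(source)


def clean(rows):
    """Standardize extracted data: trim and upper-case the name field."""
    return [{**r, "name": r["name"].strip().upper()} for r in rows]


def audit(rows, stage, log):
    """Record results, run a completeness check, and set exceptions aside."""
    complete = [r for r in rows if r.get("id")]
    log.append(f"{stage}: {len(rows)} rows, {len(rows) - len(complete)} incomplete")
    return complete


def merge(cleaned_sources):
    """Merge cleaned sources into one, keeping the first row seen per id."""
    merged = {}
    for rows in cleaned_sources:
        for r in rows:
            merged.setdefault(r["id"], r)
    return list(merged.values())


def refresh(sources):
    log = []
    staged = [extract(s) for s in sources]                      # extraction and staging
    cleaned = [audit(clean(rows), "cleaning", log) for rows in staged]
    integrated = audit(merge(cleaned), "merging", log)
    warehouse = {r["id"]: r for r in integrated}                # propagation (stand-in table)
    log.append(f"notify: {len(warehouse)} rows loaded")         # notification
    return warehouse, log


sources = [
    [{"id": 1, "name": " ann "}, {"id": None, "name": "bob"}],
    [{"id": 1, "name": "ANN"}, {"id": 2, "name": "cy"}],
]
warehouse, log = refresh(sources)
print(log)
```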
Data Quality
• An essential characteristic that determines the reliability of
data for making decisions
• Data is “high quality” if it is “fit” for its intended uses in
operations, decision making, and planning
• Tools to ensure Data Quality
– Data Profiling - initially assessing the data to understand its quality challenges
– Data Standardization - a business rules engine that ensures that data conforms
to quality rules
– Geocoding - for name and address data. Corrects data to Worldwide postal
standards
– Matching or Linking - similar, but slightly different records can be aligned
– Monitoring - keeping track of data quality over time and reporting variations
– Batch & Real time
Data Quality (7 Sources of poor data quality)
• Entry Quality
– Did the information enter the system correctly at the origin?
– Incorrect phone number/email address
– Cost of entry problems depends on use
• If used for informational purposes then its cost is low
• If used for marketing & driving new sales then its cost is significant
• Process Quality
– Was the integrity of the information maintained during processing
through the system?
• May result from a system crash, lost file or any technical occurrence
– Source of the problem needs to be identified for remediation
• Identification Quality
– Are two similar objects identified correctly to be the same or different?
Data Quality (Sources of poor data quality)
• Integration Quality (Quality of completeness)
– Is all the known information about an object integrated to the point of
providing an accurate representation of the object?
– Example: It might be important for an auto claims adjuster to know that
a customer is also a high-value life insurance customer
– It creates a need to develop MDM (Master Data Management)
• Enables the process of identifying records from multiple systems that refer to the
same entity. Records will then be consolidated.
• Usage Quality
– Is the information used and interpreted correctly at the point of access?
– Occurs due to lack of access to legacy source documentation or
subject matter experts, forcing data warehouse experts to guess the
meaning and use of certain data elements
– Requires thorough documentation, robust metadata, and a data
governance program
Data Quality (Sources of poor data quality)
• Aging Quality
– Has enough time passed that the validity of the information can no
longer be trusted?
• (1) Maintaining a former customer's address for more than five years is probably not
useful. If customers haven't been heard from in several years despite marketing
efforts, how can we be certain they still live at the same address?
• (2) Maintaining customer address information for a homeowner's insurance claim
may be necessary and even required by law.
– Decisions need to be made by the business owners
• Organizational Quality
– Can the same information be reconciled between two systems based
on the way the organization constructs and views the data?
– More an organizational issue than a technical one
• For example, marketing tries to "tie" their calculations to finance, although the reporting systems of
the two departments are quite different
– The biggest challenge to reconciliation is getting the various departments to
agree that their A equals the other's B equals the other's C plus D.
Data Quality (Record Linkage – Example)
Data Profiling
• Process of examining data available from an existing information source
and collecting statistics or informative summaries about that data
OR
• A process of developing information about data instead of information from
data.
– Utilizes statistical variables
– Metadata
• Clarifies
– Structure, content, relationships, derivation rules of the data
– Metadata about data – to discover illegal values, misspellings, missing values,
varying value representation, duplicates
• Performed several times with varying intensity:
– (1) soon after the candidate source systems are identified
– (2) prior to the dimensional modeling process
– (3) after the data has been loaded into the staging area
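A basic column profile collects the kind of summary statistics described above. The sketch below is a minimal illustration on hypothetical rows; the `profile` function and its output keys are assumptions, not the behavior of any particular profiling tool.

```python
from collections import Counter

# Hypothetical staged rows with typical quality problems.
rows = [
    {"customer_id": "C1", "state": "WA", "age": "34"},
    {"customer_id": "C2", "state": "wa", "age": ""},
    {"customer_id": "C2", "state": "Wash.", "age": "abc"},
    {"customer_id": "C3", "state": None, "age": "51"},
]


def profile(rows, column):
    """Summarize one column: missing values, distinct values, duplicates, non-numeric values."""
    values = [r.get(column) for r in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        "non_numeric": sorted(v for v in counts if not v.isdigit()),
    }


for column in ("customer_id", "state", "age"):
    print(column, profile(rows, column))
```

Here the profile surfaces a duplicate identifier (C2), three representations of the same state, and a missing and an illegal age value.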
Data Profiling Tools
• MS SQL Server Data Profiling Tool
• Statistical Analysis System (SAS)
Initial Data Warehouse Load
[Figure: discover data quality problems, then resolve them]
• Major development activity
• More open-ended than refresh processing, with time requirements that are
difficult to estimate
• Use profiling tools to discover data quality problems
• The initial population process should be performed for each
major extension of the data warehouse.
Primary Objective of managing the
refresh process
• The primary objective in managing the refresh
process is to determine the refresh frequency for
each data source and set detailed refresh
schedules.
Refresh Processing Decision Making
• Data timeliness depends on the sensitivity of decision making to the currency of the data
• Some decisions are very time sensitive, such as inventory decisions: minimize inventory
carrying costs by stocking goods as close as possible to the time needed.
• Other decisions are not so time sensitive. For example, the decision to close a poor
performing store would typically be done using data over a long period of time.
• Net refresh benefit is defined as the value of data timeliness minus the cost of refresh.
[Figure: refresh costs, timeliness importance, and constraints feed into managing refresh frequency and schedules]
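The net refresh benefit can be compared across candidate refresh frequencies. A minimal sketch; the frequencies and the monetary values are hypothetical and serve only to show the arithmetic.

```python
def net_refresh_benefit(timeliness_value, refresh_cost):
    """Net refresh benefit = value of data timeliness minus cost of refresh."""
    return timeliness_value - refresh_cost


# Hypothetical candidates: frequency -> (value of timeliness, cost of refresh).
candidates = {
    "daily": (900.0, 700.0),
    "weekly": (500.0, 100.0),
    "monthly": (150.0, 25.0),
}

benefits = {freq: net_refresh_benefit(*vc) for freq, vc in candidates.items()}
best = max(benefits, key=benefits.get)
print(benefits, "best:", best)
```

With these assumed numbers, daily refresh gives the most timely data but weekly refresh gives the highest net benefit.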
Refresh Constraints
• Source access constraints can be due to legacy technology with restricted scalability
• Integration constraints often involve identification of common entities
• Consistency constraints involve usage of the same time period in change data
• Completeness constraints involve inclusion of changed data from each data source
• Availability constraints involve conflicts between online availability and warehouse loading
[Figure: constraints to satisfy: source access, integration, consistency, completeness, availability]
Change Data Concepts
Lesson Objectives
• Explain the types of data sources involved in data
integration
• Provide examples of typical data quality problems
encountered during data integration
• Reflect on the relationship between type of
change data and data quality
Basics of Change Data
• Derived from internal and external data sources
• Used to populate and refresh a data warehouse
– Insert rows in fact and dimension tables (common)
– Update rows in dimension tables (less common)
• Challenges
– Difficult to make changes to source systems, especially
external systems
– Lack of SQL access and descriptive (meta) data
especially for legacy data
Cooperative Change Data
Logged Change Data
Queryable Change Data
• Queryable Change Data: comes directly from a data source via a query.
• Requires timestamping in the data source.
• Since few data sources contain timestamps for all data, queryable change
data usually is augmented with other kinds of change data.
• Queryable change data is most applicable for fact tables using fields such as
order date, shipment date, and hire date that are stored in operational data
sources.
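Queryable change data amounts to selecting rows whose timestamp is later than the last refresh. The sketch below uses an in-memory SQLite database; the `orders` table, its columns, and the refresh date are hypothetical.

```python
import sqlite3

# Hypothetical operational source with a timestamp column (order_date).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_no INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-03-01", 50.0), (2, "2024-03-02", 75.0), (3, "2024-03-03", 20.0)],
)

last_refresh = "2024-03-01"  # assumed date of the previous warehouse refresh

# ISO-formatted dates compare correctly as text, so a simple > filter works.
changes = conn.execute(
    "SELECT order_no, order_date, amount FROM orders "
    "WHERE order_date > ? ORDER BY order_no",
    (last_refresh,),
).fetchall()

print(changes)
```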
Snapshot Change Data
• Snapshot change data involves periodic dumps of
source data.
• To derive change data, a
difference operation uses the
two most recent snapshots.
• The result of a difference
operation is called a delta.
• Snapshots are the only form of
change data without
requirements on a source
system.
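A difference operation over two snapshots can be sketched with dictionaries keyed by a primary key. The snapshot contents and the shape of the returned delta are assumptions made for illustration.

```python
def snapshot_delta(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and return the delta."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


# Hypothetical snapshots of a customer table: key -> city.
previous = {"C1": "Redmond", "C2": "Seattle", "C3": "Tacoma"}
current = {"C1": "Bothell", "C2": "Seattle", "C4": "Spokane"}

delta = snapshot_delta(previous, current)
print(delta)
```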
Change Data Classification
Data Quality Problems
• Multiple identifiers
• Different units
• Missing values
• Text data with different components and formats
• Conflicting data
• Different update times
Data Cleaning Tasks
Lesson Objectives
• Explain the three types of data cleaning tasks
• Provide examples depicting data cleaning tasks
• Reflect on the tedious nature of data cleaning
Parsing
• Locates and separates individual data elements
in text
• Studied in computer science for decades
• Regular expressions for pattern specification
• Natural Language Processing (NLP)
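Parsing with a regular expression can be sketched as follows. The address layout (number, street, city, two-letter state, five-digit postal code) and the sample string are assumptions; real address data is far less regular.

```python
import re

# Assumed layout: "<number> <street>, <city>, <ST> <5-digit postal code>".
ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<postal>\d{5})$"
)


def parse_address(text):
    """Separate an address string into its elements, or return None if it does not fit."""
    match = ADDRESS.match(text.strip())
    return match.groupdict() if match else None


parsed = parse_address("15580 NE 31st Street, Redmond, WA 98052")
print(parsed)
print(parse_address("Redmond WA"))
```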
Parsing Example
Correcting Values
Missing values
- Default value for inapplicable values
• For example, missing values for an order without an
employee can be replaced with a default value indicating a
web order.
- Typical value: average or median for numeric fields, mode for
non-numeric fields
- Complex processing for predicting values using
relationships to other fields, using data mining algorithms
Conflicting values
- More recent value
- More credible source, as judged by domain experts
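The simpler correction rules above can be sketched in a few lines. The order rows, the "WEB" default, and the date-stamped conflict pairs are hypothetical.

```python
from statistics import median, mode

# Hypothetical orders with missing values.
orders = [
    {"employee": "E1", "amount": 100.0, "channel": "phone"},
    {"employee": None, "amount": None, "channel": "web"},
    {"employee": "E2", "amount": 300.0, "channel": None},
    {"employee": "E3", "amount": 200.0, "channel": "phone"},
]

typical_amount = median(o["amount"] for o in orders if o["amount"] is not None)
typical_channel = mode(o["channel"] for o in orders if o["channel"] is not None)

corrected = [
    {
        "employee": o["employee"] or "WEB",  # default value for an inapplicable field
        "amount": o["amount"] if o["amount"] is not None else typical_amount,
        "channel": o["channel"] or typical_channel,
    }
    for o in orders
]


def resolve_conflict(a, b):
    """Each argument is (value, ISO update date); keep the more recent value."""
    return max(a, b, key=lambda pair: pair[1])[0]


city = resolve_conflict(("Redmond", "2019-05-01"), ("Bothell", "2023-08-15"))
print(corrected, city)
```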
Correction Example
Detailed investigations, possibly conducted using search services,
can resolve some cases of unknown values and conflicting values.
Standardization
• Applies conversion routines to transform data
into preferred formats
• Uses standard business rules; custom business rules
can also be developed.
• Common standardizations:
– Unit of measure transformations
– Standard abbreviations (state names, titles, street
types)
• In addition, data standardization services can be
purchased for names, addresses, and product
details, although customization may be
necessary.
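Both common standardizations can be sketched as lookup-driven conversion routines. The abbreviation table and the choice of kilograms as the preferred unit are assumptions; a production rule set would be much larger.

```python
# Assumed business rules: preferred abbreviations and conversion factors to kilograms.
ABBREVIATIONS = {"street": "St", "avenue": "Ave", "place": "Pl", "northeast": "NE"}
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def standardize_street(text):
    """Replace street words with their standard abbreviations."""
    words = text.replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in words)


def to_kilograms(value, unit):
    """Convert a weight to the preferred unit of measure (kilograms)."""
    return value * TO_KILOGRAMS[unit]


print(standardize_street("16517 78th Place Northeast"))
print(to_kilograms(10, "lb"))
```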
Standardization Example
This example extends the previous corrected example with
standardization.
Pattern Matching with Regular
Expressions
Lesson Objectives
• Explain the three major elements of regular
expressions
• Practice with regular expressions
• Reflect on the complexity and limitations of
regular expressions
Regular Expressions (regex)
[Figure: a search expression is made up of literals, metacharacters, and escape sequences]
• A literal is any character used in a search expression or target string.
• A metacharacter is one or more special characters that have a unique meaning and
are NOT used as literals in a search expression, for example, the character ^
(circumflex or caret) is a metacharacter.
• An escape sequence turns off the special meaning of a metacharacter so that it is
matched as a literal. In a regular expression an escape sequence involves placing the
metacharacter \ (backslash) in front of the metacharacter that we want to use as a
literal.
Pattern Matching
Search expression: ^[a-z]+\.com$
Target string: [Link]
Match result: [Link]
• Meta characters: ^ (caret or circumflex), [, ], +, -, \, $
• Literals: c, o, m, a, z, .
• Escape sequence: \.
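The search expression above can be tried with Python's `re` module. The target strings here are hypothetical stand-ins, since the original target is not shown; the last line illustrates what changes when the escape sequence is dropped.

```python
import re

pattern = re.compile(r"^[a-z]+\.com$")

# Hypothetical target strings.
targets = ["example.com", "examplexcom", "Example.com", "my.example.com"]
results = {t: bool(pattern.search(t)) for t in targets}
print(results)

# Without the backslash, the period is a metacharacter matching any character.
unescaped = bool(re.search(r"^[a-z]+.com$", "examplexcom"))
print(unescaped)
```

Only the first target matches: the second lacks a literal period, the third starts with an uppercase letter, and the fourth has extra text between the first period and "com".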
Common meta characters
• Iteration or quantifier: ?  *  +  {n}  {n,m}
• Position: .  ^  $
• Other: [ ]  [^]  \  |
Search expression: ^[a-z]+\.com$
Meta Character Summary
Metacharacter  Type         Meaning
?              Iteration    Matches preceding character 0 or 1 time
*              Iteration    Matches preceding character 0 or more times
+              Iteration    Matches preceding character 1 or more times
{n}            Iteration    Matches preceding character exactly n times
{n,m}          Iteration    Matches preceding character at least n times and at most m times
[]             Range        Matches one of enclosed characters one time
^              Position     Matches at the beginning of the target string; only has meaning as the first character in a regular expression
^              Range        Negation of search pattern if ^ is inside []. Hyphen inside square brackets defines a range of characters.
$              Position     Matches at the end of the target string; only has meaning as the last character in a regular expression
.              Position     Matches any character except a newline character at the specified position only
|              Alternation  Matches either pattern to the left or right of the | character
()             Group        Groups for matching parts of target strings
Meta Character Examples I
• The following six examples each use multiple target strings.
Search expression: “colou?r”
  Target strings: “color”, “colour”
  Evaluation: Matches both target strings
Search expression: “tre*”
  Target strings: “tree”, “tread”, “trough”
  Evaluation: Matches all three target strings
  Details: Matches the preceding character 0 times in the third target string
Search expression: “tre+”
  Target strings: “tree”, “tread”, “trough”
  Evaluation: Does not match the third target string
  Details: Does not match the third target string because the third character is o
Search expression: “[abcd]”
  Target strings: “dog”, “fond”, “pen”
  Evaluation: Matches first two strings but not the third string
  Details: Does not match the third target string because it does not contain one of the letters inside the square brackets
Search expression: “[0-9]{3}-[0-9]{4}”
  Target strings: “123-4567”, “1234-567”
  Evaluation: Matches first string but not the second string
  Details: The first range must be matched three times, and the second range, four times
Search expression: “ba{2,3}b”
  Target strings: “baab”, “baaab”, “bab”, “baaaab”
  Evaluation: Matches first two strings but not the last two strings
  Details: The preceding character, a, must be matched between two and three times
Meta Characters II
• The following examples show search expressions using position, iteration, and alternation meta characters.
Search expression: “^win”
  Target strings: “erwin”, “window”
  Evaluation: Matches second string but not first string
  Details: Does not match the first target string because win does not appear at the beginning of the target string
Search expression: “win$”
  Target strings: “erwin”, “window”
  Evaluation: Matches first string but not second string
  Details: Does not match the second target string because win does not appear at the end of the target string
Search expression: “[^0-9]+”
  Target strings: “123”, “abc”, “a456”
  Evaluation: Matches the second and third target strings
  Details: The caret inside the square brackets negates the enclosed character range, matching any non-digit
Search expression: “abc.e*”
  Target strings: “fabc”, “fabcd”, “fabcee”
  Evaluation: Matches the second and third target strings
  Details: The period, a positional meta character in the search expression, requires a character following abc, so the search expression does not match the first target string
Search expression: “dog|cat|frog”
  Target strings: “a dog”, “cat friend”, “frogman”
  Evaluation: Matches all three target strings
  Details: The alternation meta characters, that is, the vertical bars, match all three target strings, as each one contains one of the choices dog, cat, or frog
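The meta character examples can be checked mechanically. The sketch below assumes that "matches" means the expression is found anywhere in the target string, which corresponds to `re.search` in Python.

```python
import re

# Search expressions and target strings taken from the meta character examples.
cases = {
    r"colou?r": ["color", "colour"],
    r"tre*": ["tree", "tread", "trough"],
    r"tre+": ["tree", "tread", "trough"],
    r"[abcd]": ["dog", "fond", "pen"],
    r"[0-9]{3}-[0-9]{4}": ["123-4567", "1234-567"],
    r"ba{2,3}b": ["baab", "baaab", "bab", "baaaab"],
    r"^win": ["erwin", "window"],
    r"win$": ["erwin", "window"],
    r"[^0-9]+": ["123", "abc", "a456"],
    r"abc.e*": ["fabc", "fabcd", "fabcee"],
    r"dog|cat|frog": ["a dog", "cat friend", "frogman"],
}

results = {
    expr: [bool(re.search(expr, target)) for target in targets]
    for expr, targets in cases.items()
}

for expr, outcome in results.items():
    print(expr, outcome)
```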
More Complex Examples
Field          Search Expression
User name      ^[a-z0-9_-]{3,16}$
Hex value      ^#?([a-f0-9]{6}|[a-f0-9]{3})$
Email address  ^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$
Web address    ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
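Two of the expressions above can be used as simple field validators. The test values are hypothetical, and the sketch covers only the user name and hex value expressions.

```python
import re

USER_NAME = re.compile(r"^[a-z0-9_-]{3,16}$")
HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")


def is_valid(pattern, value):
    """Return True if the whole value satisfies the search expression."""
    return pattern.search(value) is not None


user_checks = {v: is_valid(USER_NAME, v) for v in ["data_user-01", "ab", "Bad Name"]}
hex_checks = {v: is_valid(HEX_VALUE, v) for v in ["#a3c113", "a3c", "#a3c11", "#g3c113"]}

print(user_checks)
print(hex_checks)
```

Because both expressions accept only lowercase letters, uppercase input is rejected unless the value is lowercased first or a case-insensitive flag is used.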
Regular expression testing sites
- [Link]
- [Link]
- [Link]
- [Link]
Matching and Consolidation
Entity Matching
• Identifies common entities from separate data
sources when no reliable common identifier
exists.
• Difficult matching process: no common identifier
• Data mining problem
• Also known as record linkage, entity identification,
and entity resolution
• Many approaches
• Improve data quality for better matching results
Matching Example
Field        Source 1               Source 2
First name   Aimee                  Aimee
Middle name  Christina              C.
Last name    Parker                 Parker-Lewis
Job title    Product Manager        Prod. Mgr.
Firm         Microsoft Corporation  Microsoft
Street       15580 NE 31st Street   16517 78th Place NE
City         Redmond                Bothell
State        WA                     WA
Postal Code  98052                  98020
Country      USA                    USA
Source 1: pre-marriage name and work address
Source 2: marital name and home address
Merging Example
This example shows a possible result of merging records from the previous matching
example.
Target
First name   Aimee
Middle name  Christina
Last name    Parker-Lewis
Job title    Product Manager
Firm         Microsoft Corporation
Street       16517 78th Place NE
City         Bothell
State        WA
Postal Code  98020
Country      USA
Note: use latest name (married)
Entity Matching Applications
• Marketing: combine customers from different companies after merger
• Law enforcement: link crimes to individuals
• Fraud detection: filing fraudulent tax returns with different SSN
• Health care: combining health records from same individuals treated at different clinics
Entity Matching Outcomes
Predicted       Actual: Match    Actual: Non Match
Match           True match       False match
Possible Match  Investigation    Investigation
Non Match       False non match  True non match
• The rows represent predictions and the columns represent actual results of
matching two records for duplication.
• A true match involves a predicted match and an actual match allowing
the two records to be combined correctly.
• The possible match situations involve predictions without enough
certainty to indicate a match or non match.
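The outcome table can be sketched as a lookup on a prediction and the actual result. How predictions are produced is not specified in the table; the similarity score and the two thresholds used here are hypothetical.

```python
def predict(score, lower=0.4, upper=0.8):
    """Turn a similarity score into a prediction using two assumed thresholds."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non match"
    return "possible match"


OUTCOMES = {
    ("match", True): "true match",
    ("match", False): "false match",
    ("non match", True): "false non match",
    ("non match", False): "true non match",
}


def outcome(score, actually_same):
    """Classify one record pair according to the outcome table."""
    predicted = predict(score)
    if predicted == "possible match":
        return "investigation"  # not enough certainty either way
    return OUTCOMES[(predicted, actually_same)]


print(outcome(0.9, True), outcome(0.9, False), outcome(0.6, True), outcome(0.1, True))
```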
Consolidation
• Matched entities can be merged or linked.
• Merging matched records
• Linking matched records
– Households: linking combines individuals with family and other social relationships.
– Transactions: all accounts and transactions are associated with the same person.
Household Consolidation
[Figure: George Smith, Janet Smith, Karen Smith, and Thomas Smith linked into one household]
Household consolidation involves linking records from individuals living in the same
household.
Transaction Linking
[Figure: Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97 linked to one person]
In transaction linking, all accounts and transactions are associated with the same
person.
Quasi Identifiers & Distance Functions
for Entity Matching
Quasi Identifiers
• Used in entity matching : entity matching algorithms use
quasi identifiers to compensate for missing common
identifiers.
• Almost unique in combination
• In a study published in 2000, Sweeney demonstrated that
87% of the US population can be identified by a
combination of gender, birth date, and postal code.
• Examples
– Name components
– Location components
– Profession
– Birthdate
– Race
Distance Functions
• Poor data quality, such as missing values and unknown update
times, complicates choices for quasi identifiers.
• Entity matching approaches use distance functions to determine if
quasi identifiers in two entities indicate the same entity.
• Numeric quasi identifiers: determine the amount of space between
records or values
– Determine distance between combinations of quasi identifier values
– Determine distance between two quasi identifier values
• Text quasi identifiers: text distance
– Important for quasi identifiers containing text
– Examples: name and location components, which differ in spelling, length,
and context.
– Distance functions for text are used to compare quasi identifiers with these
differences.
– Have many applications outside of entity matching, such as spelling
correction.
Edit Distance
• Used for comparing relatively short text
values occurring in entity matching applications.
• Very common distance function for text
• Operations to transform one text value into another
– Delete a character
– Insert a character
– Substitute one character for another
• Edit distance is defined as the minimal number
of operations to transform a source text value
into a target text value.
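Edit distance with these three operations can be computed by dynamic programming. The sketch below is one common formulation (often called Levenshtein distance) that gives each operation a cost of 1; it is offered as an illustration rather than the only way to compute it.

```python
def edit_distance(source, target):
    """Minimal number of deletions, insertions, and substitutions to turn source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            substitute = previous[j - 1] + (s_char != t_char)
            current.append(min(delete, insert, substitute))
        previous = current
    return previous[-1]


print(edit_distance("Saturday", "Sunday"))
print(edit_distance("kitten", "sitting"))
```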
Edit Distance Example
Transform “Saturday” into “Sunday”
Sequence 1 (three operations):
1. Sturday (delete “a”)
2. Surday (delete “t”)
3. Sunday (substitute “n” for “r”)
Sequence 2 (four operations):
1. Suturday (substitute “u” for “a”)
2. Sunurday (substitute “n” for “t”)
3. Sunrday (delete “u”)
4. Sunday (delete “r”)
The first sequence is preferred because it contains fewer operations.
Quiz
• What is the edit distance between “Break”
and “Trick”?
• 5
• 2
• 3
• 4
Phonetic Distance Functions
• Many applications in law enforcement to account
for different name spellings but similar
pronunciations.
• Words with the same pronunciation should have
the same phonetic value.
• Phonetic distance functions mainly code words into
standard consonant sounds.
• Two phonetic distance functions are widely
available in DBMSs and data integration tools
– Soundex: 6 consonant sounds
– Metaphone: 16 consonant sounds
– Metaphone, with more consonant sounds, was
developed as an improvement to Soundex.
Phonetic Matching Examples
• Soundex
– Soundex(Assistance) = A223
– Soundex(Assistants) = A223
• Metaphone
– Metaphone(Assistance) = ASSTNS
– Metaphone(Assistants) = ASSTNTS
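The Soundex values above can be reproduced with a short function. This is a sketch of the commonly described American Soundex rules (keep the first letter, code consonants into six groups, skip repeated codes, pad to four characters); implementations in DBMSs may differ in edge cases. Metaphone is not sketched here because its rule set is much larger.

```python
# Six consonant groups of Soundex, mapped to digits.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            previous = code
    return (result + "000")[:4]


print(soundex("Assistance"), soundex("Assistants"))
```

Both words receive the same code, so a comparison on Soundex values treats them as phonetically equal.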
Examples from W3Schools
- Soundex examples from
[Link]
- Metaphone examples from
[Link]