EXTRACT, TRANSFORM, & LOAD
Prof. Navneet Goyal
BITS, Pilani
• Sources used for this lecture
– Ralph Kimball, Joe Caserta, The Data
Warehouse ETL Toolkit: Practical Techniques for
Extracting, Cleaning, Conforming and
Delivering Data
Introduction
• Extract
– What data do you want in the DW?
• Transform
– In what form do you want the extracted data to be in the DW?
• Load
– Load the transformed, extracted data into the DW
Introduction
• Extract
– Extract relevant data
• Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
• Load
– Load data into DW
– Build aggregates, etc.
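
To make the three steps concrete, a minimal Python sketch of one extract-transform-load pass follows. All table and column names (customers, dim_customer, cust_id, ...) are invented for illustration, and in-memory SQLite stands in for the operational source and the DW.

import sqlite3

# In-memory SQLite stands in for the source and target databases (an assumption).
src = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (cust_id INTEGER, cust_name TEXT, city TEXT)")
src.execute("INSERT INTO customers VALUES (7, '  john smith ', 'pilani')")
dw.execute("CREATE TABLE dim_customer (customer_key INTEGER, cust_id INTEGER, cust_name TEXT, city TEXT)")

# Extract: pull only the relevant data, not everything the source holds.
rows = src.execute("SELECT cust_id, cust_name, city FROM customers").fetchall()

# Transform: cleanse values and build a surrogate key for the DW.
transformed = [
    (key, cust_id, name.strip().title(), city.upper())
    for key, (cust_id, name, city) in enumerate(rows, start=1)
]

# Load: write the transformed rows into the warehouse dimension table.
dw.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?)", transformed)
dw.commit()
print(dw.execute("SELECT * FROM dim_customer").fetchall())
# [(1, 7, 'John Smith', 'PILANI')]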
ETL System
• Back room or “Green room” of the DW
• Analogy - Kitchen of a restaurant
– A restaurant’s kitchen is designed for efficiency,
quality & integrity
– Throughput is critical when the restaurant is
packed
– Meals coming out should be consistent and
hygienic
– Skilled chefs
– Patrons not allowed inside
• A dangerous place to be in: sharp knives and hot plates
• Trade secrets
ETL Design & Development
• Most challenging problem faced by the DW
project team
• 70% of the risk & effort in a DW project
comes from ETL
• Has 34 subsystems!!
• Not a one-time effort!
– Initial load
– Subsequent loads (periodic refresh of the DW)
• Automation is critical!
Back Room Architecture
• ETL processing happens here
• Ensures availability of the right data from point A to point B, with appropriate transformations applied at the appropriate time
• ETL tools are largely automated, but are
still very complex systems
General ETL Requirements
• Productivity support
– Basic development environment capabilities like code library management, check-in/check-out, version control, etc.
• Usability
– Must be as usable as possible
– GUI based
– System documentation: developers should easily capture
information about processes they are creating
– This metadata should be available to all
– Data compliance
• Metadata Driven
– Services that support ETL process must be metadata
driven
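
As a rough sketch of what "metadata driven" can mean in practice, the snippet below derives a target row purely from a field-mapping table; FIELD_MAP and the column names in it are hypothetical, and a real tool would read such mappings from a metadata repository rather than hard-code them.

# Hypothetical column-level mapping; in a real ETL tool this would
# come from the metadata repository, not be hard-coded.
FIELD_MAP = {
    "cust_nm": {"target": "customer_name", "transform": str.strip},
    "ord_dt":  {"target": "order_date",    "transform": lambda s: s[:10]},
}

def apply_mapping(source_row):
    """Build a target row driven entirely by the metadata in FIELD_MAP."""
    return {
        spec["target"]: spec["transform"](source_row[src_col])
        for src_col, spec in FIELD_MAP.items()
        if src_col in source_row
    }

print(apply_mapping({"cust_nm": "  Jane Doe ", "ord_dt": "1998-08-05 10:30"}))
# {'customer_name': 'Jane Doe', 'order_date': '1998-08-05'}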
General ETL Requirements
• Business needs – users’ information requirements
• Compliance – must provide proof that the data reported is not
manipulated in any way
• Data Quality – garbage in garbage out!!
• Security – do not publish data widely to all decision makers
• Data Integration – Master Data Management (MDM); conforming dimensions and facts
• Data Latency – huge effect on ETL architecture
– Use efficient data processing algorithms, parallelization, and
powerful hardware to speed up batch-oriented data flows
– If the requirement is real-time, the architecture must switch from batch to micro-batch or stream-oriented processing
• Archiving & Lineage – must for compliance & security reasons
– After every major activity of the ETL pipeline, writing the data to disk (staging) is recommended
– All staged data should be archived
Choice of Architecture
Tool Based ETL
• Simpler, cheaper & faster development
• People with business skills but limited technical skills can use it
• Automatically generates metadata
• Automatically generates data lineage & data dependency analysis
• Offers in-line encryption & compression capabilities
• Manages complex load balancing across servers
Choice of Architecture
Hand-Coded ETL
• Quality ensured through exhaustive unit testing
• Better metadata
• Requirement may be for simple file-based processes, not database stored procedures
• Use of existing legacy routines
• Use of in-house programmers
• Unlimited flexibility
• Cheaper
Middleware & Connectivity Tools
• Provide transparent access to source systems
in heterogeneous computing environments
• Expensive but often prove invaluable because
they provide transparent access to DBs of
different types, residing on different platforms
• Examples:
– IBM: DataJoiner
– Oracle: Transparent Gateway
– SAS: SAS/Connect
– Sybase: Enterprise Connect
Extract
• Extract
– What data do you want in the DW?
• Remember that not all data generated by
operational systems is required in the DW
– For example: credit card no. is mandatory in
operational systems but is not required in a DW
• Typically implemented as a GUI
• Table names and their schemas are displayed for extraction
• Tables are selected and then their
attributes are selected
Extraction Tools
• Many tools are available in the market
• Tool selection is tedious
• Choice of tool depends on following factors:
– Source system platform and DB
– Built-in extraction or duplication functionality
– Batch windows of the operational systems
Extraction Methods
• Bulk Extractions
– Entire DW is refreshed periodically
– Heavily taxes the network connections between
the source & target DBs
– Easier to set up & maintain
• Change-based Extractions
– Only data that have been newly inserted or
updated in the source systems are extracted &
loaded into the DW
– Places less stress on the network but requires more
complex programming to determine when a new
DW record must be inserted or when an existing
DW record must be updated
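
A minimal sketch of a change-based extraction follows. It assumes the source table carries a last_updated column and the DW table has a unique key to upsert against (both assumptions; real schemas vary). SQLite's upsert syntax stands in for whatever the target database provides.

import sqlite3

src = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 100.0, "2024-01-01"), (2, 250.0, "2024-01-15")])
dw.execute("CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, amount REAL, loaded_at TEXT)")

last_run_ts = "2024-01-10"   # watermark persisted from the previous ETL run

# Extract only rows inserted/updated since the last run: far less
# network load than a bulk refresh of the entire DW.
changed = src.execute(
    "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
    (last_run_ts,)).fetchall()

# Upsert: insert a new DW record, or update it if it already exists.
for order_id, amount, ts in changed:
    dw.execute(
        "INSERT INTO fact_orders VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount=excluded.amount, loaded_at=excluded.loaded_at",
        (order_id, amount, ts))
dw.commit()
print(dw.execute("SELECT * FROM fact_orders").fetchall())  # [(2, 250.0, '2024-01-15')]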
Transformation Tools
• Transform extracted data into the
appropriate format, data structure, and
values that are required by the DW
• Features provided:
– Field splitting & consolidation
– Standardization
• Abbreviations, date formats, data types, character
formats, time zones, currencies, metric systems,
product keys, coding, etc.
– Deduplication
• A major, non-trivial task
Examples of transformations (Source System → DW):

• Field Splitting
– Source: Address Field: #123 ABC Street, XYZ City 1000, Republic of MN
– DW: No: 123; Street: ABC; City: XYZ; Country: Republic of MN; Postal Code: 1000

• Field Consolidation
– Source: System A: Customer title: President; System B: Customer title: CEO
– DW: Customer title: President & CEO

• Standardization
– Source: Order Date: 05 August 1998; Order Date: 08/08/98
– DW: Order Date: 05 August 1998; Order Date: 08 August 1998

• Deduplication
– Source: System A: Customer Name: John W. Smith; System B: Customer Name: John William Smith
– DW: Customer Name: John William Smith
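
As a concrete illustration, here is a minimal Python sketch of two of the transformations above: date standardization and name-based deduplication. The matching rule for duplicates (same first and last token) is deliberately naive, which underlines why real deduplication is a major, non-trivial task.

from datetime import datetime

def standardize_date(raw):
    """Normalize a few assumed source date formats to 'DD Month YYYY'."""
    for fmt in ("%d %B %Y", "%d/%m/%y"):   # formats seen in the examples above
        try:
            return datetime.strptime(raw, fmt).strftime("%d %B %Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def dedupe(names):
    """Keep the most complete variant per (first, last) name pair."""
    best = {}
    for name in names:
        parts = name.split()
        key = (parts[0], parts[-1])        # e.g. ('John', 'Smith')
        if key not in best or len(name) > len(best[key]):
            best[key] = name
    return list(best.values())

print(standardize_date("08/08/98"))                        # 08 August 1998
print(dedupe(["John W. Smith", "John William Smith"]))     # ['John William Smith']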
Mission of ETL team
To build the back room of the DW
– Deliver data most effectively to end-user tools
– Add value to the data in the cleaning &
conforming steps
– Protect & document the lineage of data (data
provenance)
Mission of ETL team
The back room must support 4 key steps
– Extracting data from original sources
– Quality assuring & cleaning data
– Conforming the labels & measures in the data to
achieve consistency across the original sources
– Delivering the data in a physical format that can
be used by query tools and report writers
ETL Data Structures
Data Flow: Extract → Clean → Conform → Deliver
• Back room of a DW is often called the data
staging area
• Staging means ‘writing to disk’
• ETL team needs a number of different data
structures for all kinds of staging needs
To stage or not to stage
• The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect
To stage or not to stage
• A conflict between
– getting the data from the operational systems
as fast as possible
– having the ability to restart without repeating
the process from the beginning
• Reasons for staging
– Recoverability: stage the data as soon as it
has been extracted from the source systems
and immediately after major processing
(cleaning, transformation, etc).
– Backup: can reload the data warehouse from
the staging tables without going to the
sources
– Auditing: staged data preserves the lineage between the source data and the underlying transformations, providing a point of comparison before and after each major step
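
The recoverability point can be sketched in a few lines: every major step stages its output to disk, so a failed run restarts from the last completed stage instead of re-extracting from the sources. The file names and step functions below are illustrative.

import json
import os

def run_stage(name, step_fn, data):
    """Run one ETL step, staging its output to disk for restartability."""
    staged = f"stage_{name}.json"          # illustrative staging file name
    if os.path.exists(staged):             # step finished in an earlier run:
        with open(staged) as f:            # reuse the staged result
            return json.load(f)
    result = step_fn(data)
    with open(staged, "w") as f:           # staging = writing to disk
        json.dump(result, f)
    return result

rows = run_stage("extract", lambda _: [{"id": 1, "name": " ann "}], None)
rows = run_stage("clean", lambda rs: [{**r, "name": r["name"].strip()} for r in rs], rows)
print(rows)  # [{'id': 1, 'name': 'ann'}]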
Designing the Back Room
• The back room is owned by the ETL team
– no indexes, no aggregations, no presentation
access, no querying, no service level agreements
• Users are not allowed in the back room for
any reason
– Back room is a “construction” site
• Reports cannot access data in the back room
– tables can be added or dropped without notifying
the user community
– Controlled environment
Designing the Back Room (contd…)
• Only ETL processes can read/write in the
back room (ETL developers must capture
table names, update strategies, load
frequency, ETL jobs, expected growth and
other details about the staging area)
• The back room consists of both RDBMS
tables and data files
Data Structures in the ETL System
• Flat files
– fast to write, append to, sort and filter (grep), but slow to update, access or join (see the sketch after this list)
• XML Data Sets
– Used as a medium of data transfer between
incompatible data sources
• Relational Tables
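
To illustrate the flat-file trade-off named above, a short sketch: appends and filters are single sequential passes, while an in-place update forces rewriting the whole file. The file name and record layout are illustrative.

# Appending is cheap: bytes go straight to the end of the file.
with open("stage_extract.txt", "a", encoding="utf-8") as f:
    f.write("1001|John Smith|2024-05-01\n")

# Filtering is one sequential scan, like grep.
with open("stage_extract.txt", encoding="utf-8") as f:
    recent = [line for line in f if "|2024-05" in line]
print(recent)

# An in-place update, by contrast, means reading, modifying, and
# rewriting the entire file: flat files have no indexed access path.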
Coming up next …
• 34 subsystems of ETL