DW M4 L1 - ETL Introduction

The document outlines the Extract, Transform, Load (ETL) process essential for data warehousing, detailing the steps of data extraction, transformation to the desired format, and loading into the data warehouse. It emphasizes the complexity of ETL systems, the importance of automation, and the need for data quality, compliance, and security. Additionally, it discusses various ETL architectures, tools, and the mission of the ETL team in managing data effectively.

EXTRACT, TRANSFORM, & LOAD

Prof. Navneet Goyal


BITS, Pilani
• Sources used for this lecture
– Ralph Kimball, Joe Caserta, The Data
Warehouse ETL Toolkit: Practical Techniques for
Extracting, Cleaning, Conforming and
Delivering Data
Introduction
• Extract
– What data do you want in the DW?
• Transform
– In what form do you want the extracted data in
the DW?
• Load
– Load the transformed data into the
DW
Introduction

• Extract
– Extract relevant data
• Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
• Load
– Load data into DW
– Build aggregates, etc.
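
The three steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the source rows, the table names (`customer_dim`, `sales_fact`, `sales_by_customer`) and the cleansing rules are all assumptions made for the example, and an in-memory SQLite database stands in for the DW.

```python
import sqlite3

# Hypothetical operational rows: (customer name, amount as text, card number)
SOURCE_ROWS = [
    (" alice ", "10.50", "4111-xxxx"),
    ("Bob", "n/a", "5500-xxxx"),
    ("alice", "4.50", "4111-xxxx"),
]

def extract(rows):
    """Extract: keep only the fields the DW needs (the card number is dropped)."""
    return [(name, amount) for name, amount, _card in rows]

def transform(rows):
    """Transform: cleanse values and build surrogate keys."""
    keys, facts = {}, []
    for name, amount in rows:
        name = name.strip().title()      # cleansing: trim and normalise case
        try:
            value = float(amount)
        except ValueError:
            continue                     # cleansing: reject unparseable amounts
        key = keys.setdefault(name, len(keys) + 1)   # one surrogate key per customer
        facts.append((key, value))
    return keys, facts

def load(conn, keys, facts):
    """Load: write dimension and fact rows, then build an aggregate table."""
    conn.execute("CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales_fact (customer_key INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)",
                     [(key, name) for name, key in keys.items()])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", facts)
    conn.execute("CREATE TABLE sales_by_customer AS "
                 "SELECT customer_key, SUM(amount) AS total "
                 "FROM sales_fact GROUP BY customer_key")
    conn.commit()

conn = sqlite3.connect(":memory:")       # stands in for the DW
keys, facts = transform(extract(SOURCE_ROWS))
load(conn, keys, facts)
totals = conn.execute(
    "SELECT customer_key, total FROM sales_by_customer ORDER BY customer_key"
).fetchall()
```

The two spellings of the same customer collapse to one surrogate key, and the row with an unparseable amount is rejected during cleansing before any key is built for it.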
ETL System
• Back room or “Green room” of the DW
• Analogy - Kitchen of a restaurant
– A restaurant’s kitchen is designed for efficiency,
quality & integrity
– Throughput is critical when the restaurant is
packed
– Meals coming out should be consistent and
hygienic
– Skilled chefs
– Patrons not allowed inside
• Dangerous place to be in – sharp knives and hot
plates
• Trade secrets
ETL Design & Development
• Most challenging problem faced by the DW
project team
• 70% of the risk & effort in a DW project
comes from ETL
• Has 34 subsystems!!
• Not a one time effort!
– Initial load
– Subsequent loads (periodic refresh of the DW)
• Automation is critical!
Back Room Architecture
• ETL processing happens here
• Availability of right data from point A to
point B with appropriate transformations
applied at the appropriate time
• ETL tools are largely automated, but are
still very complex systems
General ETL Requirements
• Productivity support
– Basic development environment capabilities like code
library management, check in/check out, version control
etc.
• Usability
– Must be as usable as possible
– GUI based
– System documentation: developers should easily capture
information about processes they are creating
– This metadata should be available to all
– Data compliance
• Metadata Driven
– Services that support ETL process must be metadata
driven
General ETL Requirements
• Business needs – Users’ information requirements
• Compliance – must provide proof that the data reported is not
manipulated in any way
• Data Quality – garbage in garbage out!!
• Security – do not publish data widely to all decision makers
• Data Integration – Master Data Management System (MDM).
Conforming dimensions and facts
• Data Latency – huge effect on ETL architecture
– Use efficient data processing algorithms, parallelization, and
powerful hardware to speed up batch-oriented data flows
– If the requirement is for Real-time, then architecture must make a
switch from batch to microbatch or stream-oriented
• Archiving & Lineage – must for compliance & security reasons
– After every major activity of the ETL pipeline, writing the data to disk
(staging) is recommended
– All staged data should be archived
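
The switch from batch to microbatch mentioned under Data Latency can be pictured as follows. This is only a sketch: `microbatches` and `process` are hypothetical names, the batch size is arbitrary, and a simple range stands in for a stream of incoming records.

```python
from itertools import islice

def microbatches(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def process(batch):
    """Placeholder for the transform-and-load work applied to one batch."""
    return sum(batch)

events = range(1, 8)    # stands in for a stream of incoming records
results = [process(batch) for batch in microbatches(events, 3)]
```

Because each small batch is processed as soon as it fills, data reaches the DW sooner than it would if the pipeline waited for the whole batch window.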
Choice of Architecture → Tool-Based ETL
• Simpler, cheaper & faster development
• People with business skills but limited
technical skills can use it
• Automatically generates metadata
• Automatically generates data lineage &
data dependency analysis
• Offers in-line encryption & compression
capabilities
• Manage complex load balancing across
servers
Choice of Architecture → Hand-Coded ETL
• Quality of tool by exhaustive unit testing
• Better metadata
• The requirement may be only file-based
processes, not database stored procedures
• Use of existing legacy routines
• Use of in-house programmers
• Unlimited flexibility
• Cheaper
Middleware & Connectivity Tools
• Provide transparent access to source systems
in heterogeneous computing environments
• Expensive but often prove invaluable because
they provide transparent access to DBs of
different types, residing on different platforms
• Examples:
– IBM: Data Joiner
– Oracle: Transparent Gateway
– SAS: SAS/Connect
– Sybase: Enterprise Connect
Extract
• Extract
– What data do you want in the DW?
• Remember that not all data generated by
operational systems is required in the DW
– For example: credit card no. is mandatory in
operational systems but is not required in a DW
• Typically implemented as a GUI
• Table names and their schemas are displayed
for extraction
• Tables are selected and then their
attributes are selected
Extraction Tools
• Many tools are available in the market
• Tool selection is tedious
• Choice of tool depends on the following factors:
– Source system platform and DB
– Built-in extraction or duplication functionality
– Batch windows of the operational systems
Extraction Methods
• Bulk Extractions
– Entire DW is refreshed periodically
– Heavily taxes the network connections between
the source & target DBs
– Easier to set up & maintain
• Change-based Extractions
– Only data that have been newly inserted or
updated in the source systems are extracted &
loaded into the DW
– Places less stress on the network but requires more
complex programming to determine when a new
DW record must be inserted or when an existing
DW record must be updated
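
The difference between the two methods can be sketched as below. The example assumes the source rows carry a last-modified timestamp (`updated_at`), which is one common way to detect changes but is not something the lecture prescribes; the row contents and function names are hypothetical.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp column.
source = [
    {"id": 1, "name": "Ann",  "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Ben",  "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "Cara", "updated_at": datetime(2024, 1, 9)},
]
# Current DW contents, keyed by source id.
warehouse = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Benjamin"}}

def bulk_extract(rows):
    """Bulk extraction: take every row, every time."""
    return list(rows)

def change_based_extract(rows, last_run):
    """Change-based extraction: only rows inserted or updated since the last run."""
    return [row for row in rows if row["updated_at"] > last_run]

def apply_changes(dw, changes):
    """Decide per row whether it is a new DW record or an update to an existing one."""
    inserted, updated = [], []
    for row in changes:
        (updated if row["id"] in dw else inserted).append(row["id"])
        dw[row["id"]] = {"id": row["id"], "name": row["name"]}
    return inserted, updated

last_run = datetime(2024, 1, 3)
changes = change_based_extract(source, last_run)
inserted, updated = apply_changes(warehouse, changes)
```

Only two of the three rows travel over the network here, at the cost of the extra insert-or-update logic in `apply_changes`.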
Transformation Tools
• Transform extracted data into the
appropriate format, data structure, and
values that are required by the DW
• Features provided:
– Field splitting & consolidation
– Standardization
• Abbreviations, date formats, data types, character
formats, time zones, currencies, metric systems,
product keys, coding, etc.
– Deduplication
• A major, non-trivial task
Source System | Type of transformation | DW
Address Field: #123 ABC Street, XYZ City 1000, Republic of MN | Field Splitting | No: 123; Street: ABC; City: XYZ; Country: Republic of MN; Postal Code: 1000
System A, Customer title: President; System B, Customer title: CEO | Field Consolidation | Customer title: President & CEO
Order Date: 05 August 1998; Order Date: 08/08/98 | Standardization | Order Date: 05 August 1998; Order Date: 08 August 1998
System A, Customer Name: John W. Smith; System B, Customer Name: John William Smith | Deduplication | Customer Name: John William Smith
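
The four transformation types above can be sketched with the same sample values. Everything here is illustrative: the regular expressions assume exactly the three-line address layout shown, the list of accepted date formats is an assumption, and the name-matching rule is deliberately naive (real deduplication is far harder, as noted earlier).

```python
import re
from datetime import datetime

def split_address(address):
    """Field splitting: break a three-line address into separate fields."""
    line1, line2, country = [part.strip() for part in address.splitlines()]
    number, street = re.fullmatch(r"#(\d+)\s+(.+?)\s+Street", line1).groups()
    city, postal_code = re.fullmatch(r"(.+?)\s+City\s+(\d+)", line2).groups()
    return {"No": number, "Street": street, "City": city,
            "Country": country, "Postal Code": postal_code}

def consolidate_titles(*titles):
    """Field consolidation: merge titles from several systems into one field."""
    return " & ".join(titles)

DATE_FORMATS = ("%d %B %Y", "%d/%m/%y")   # assumed input formats (day first)

def standardize_date(text):
    """Standardization: rewrite any accepted format as 'DD Month YYYY'."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d %B %Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def same_person(a, b):
    """Naive match: same first and last name, middle names agree on initials."""
    parts_a, parts_b = a.replace(".", "").split(), b.replace(".", "").split()
    if (parts_a[0], parts_a[-1]) != (parts_b[0], parts_b[-1]):
        return False
    mid_a, mid_b = parts_a[1:-1], parts_b[1:-1]
    return len(mid_a) == len(mid_b) and all(x[0] == y[0] for x, y in zip(mid_a, mid_b))

def deduplicate(names):
    """Deduplication: keep one entry per person, preferring the fuller spelling."""
    kept = []
    for name in names:
        for i, existing in enumerate(kept):
            if same_person(name, existing):
                kept[i] = max(existing, name, key=len)
                break
        else:
            kept.append(name)
    return kept

address = split_address("#123 ABC Street\nXYZ City 1000\nRepublic of MN")
title = consolidate_titles("President", "CEO")
dates = [standardize_date(d) for d in ("05 August 1998", "08/08/98")]
customers = deduplicate(["John W. Smith", "John William Smith"])
```

Month-name parsing and formatting with `%B` depend on the process locale; the sketch assumes the default English month names.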
Mission of ETL team
• To build the back room of the DW
– Deliver data most effectively to end user tools
– Add value to the data in the cleaning &
conforming steps
– Protect & document the lineage of data (data
provenance)
Mission of ETL team
• The back room must support 4 key steps
– Extracting data from original sources
– Quality assuring & cleaning data
– Conforming the labels & measures in the data to
achieve consistency across the original sources
– Delivering the data in a physical format that can
be used by query tools and report writers
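
As an illustration of the conforming step, the sketch below maps differing labels and units from two source systems onto one agreed form. The label table, the unit factors and the record layout are assumptions invented for the example.

```python
# Hypothetical label mappings and unit factors agreed for the DW.
GENDER_LABELS = {"M": "Male", "male": "Male", "F": "Female", "female": "Female"}
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def conform(record):
    """Return the record with a conformed label and a conformed measure (kg)."""
    return {
        "gender": GENDER_LABELS[record["gender"]],
        "weight_kg": round(record["weight"] * TO_KILOGRAMS[record["unit"]], 3),
    }

system_a = {"gender": "M", "weight": 2500, "unit": "g"}
system_b = {"gender": "male", "weight": 2.5, "unit": "kg"}
conformed = [conform(system_a), conform(system_b)]
```

After conforming, the two records are identical, so queries that span both sources compare like with like.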
ETL Data Structures

Data Flow
Extract → Clean → Conform → Deliver
• Back room of a DW is often called the data
staging area
• Staging means ‘writing to disk’
• ETL team needs a number of different data
structures for all kinds of staging needs
To stage or not to stage
• The decision to store data in a physical staging
area versus processing it in memory is
ultimately the choice of the ETL architect
To stage or not to stage
• A conflict between
– getting the data from the operational systems
as fast as possible
– having the ability to restart without repeating
the process from the beginning
• Reasons for staging
– Recoverability: stage the data as soon as it
has been extracted from the source systems
and immediately after major processing
(cleaning, transformation, etc).
– Backup: can reload the data warehouse from
the staging tables without going to the
sources
– Auditing: lineage between the source data
and the underlying transformations before
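
The recoverability argument can be made concrete with a small sketch: each step writes its output to disk, and a restarted run reuses whatever is already staged instead of going back to the source. The helper `run_step`, the JSON file layout and the two toy steps are hypothetical, chosen only to keep the example short.

```python
import json
import tempfile
from pathlib import Path

def run_step(name, func, data, staging_dir, log):
    """Run one step unless its staged output already exists on disk."""
    staged = Path(staging_dir) / f"{name}.json"
    if staged.exists():
        log.append(f"{name}: reused")
        return json.loads(staged.read_text())
    result = func(data)
    staged.write_text(json.dumps(result))    # stage immediately after the step
    log.append(f"{name}: ran")
    return result

def extract(_):
    """Stands in for a slow read from the operational system."""
    return [" 10", "x", "32 "]

def clean(rows):
    """Keep only values that parse as integers."""
    return [int(row) for row in rows if row.strip().isdigit()]

def run_pipeline(staging_dir, log):
    data = run_step("extract", extract, None, staging_dir, log)
    return run_step("clean", clean, data, staging_dir, log)

with tempfile.TemporaryDirectory() as staging_dir:
    first_log, second_log = [], []
    first = run_pipeline(staging_dir, first_log)
    # Simulate a run that failed after extraction: the cleaned output is missing.
    (Path(staging_dir) / "clean.json").unlink()
    second = run_pipeline(staging_dir, second_log)
```

On the second run the extract step is skipped because its staged file is still present, which is the trade the slide describes: extra disk writes in exchange for not repeating the process from the beginning.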
Designing the Back Room
• The back room is owned by the ETL team
– no indexes, no aggregations, no presentation
access, no querying, no service level agreements
• Users are not allowed in the back room for
any reason
– Back room is a “construction” site
• Reports cannot access data in the back room
– tables can be added, or dropped without notifying
the user community
– Controlled environment
Designing the Back Room (contd…)
• Only ETL processes can read/write in the
back room (ETL developers must capture
table names, update strategies, load
frequency, ETL jobs, expected growth and
other details about the staging area)
• The back room consists of both RDBMS
tables and data files
Data Structures in the ETL System

• Flat files
– fast to write, append to, sort and filter (grep)
but slow to update, access or join
• XML Data Sets
– Used as a medium of data transfer between
incompatible data sources
• Relational Tables
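
The trade-off between flat files and relational tables can be seen in a short sketch. The file name, the `orders` table and its columns are assumptions for illustration; SQLite stands in for the staging RDBMS.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "orders.csv"

    # Flat file: appending is a single sequential write.
    with path.open("a", newline="") as f:
        csv.writer(f).writerows([[1, "open"], [2, "open"]])

    # Flat file: filtering is one sequential pass, much like grep.
    with path.open(newline="") as f:
        open_ids = [row[0] for row in csv.reader(f) if row[1] == "open"]

    # Flat file: updating one record means reading and rewriting the whole file.
    with path.open(newline="") as f:
        rows = [[oid, "shipped" if oid == "2" else status]
                for oid, status in csv.reader(f)]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    flat_rows = path.read_text().split()

# Relational table: the same update touches only the matching row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "open"), (2, "open")])
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 2")
table_rows = conn.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
```

The flat file also loses type information (every value comes back as text), which is part of why updates and joins are better left to relational tables.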
Coming up next …

• 34 subsystems of ETL
