ETL ARCHITECTURE
Data Warehousing Center of Excellence
12/08/2004
ETL Architecture Options - Overview
ETL Architecture Options - Guidelines
Provide information about the ETL tool.
Explain the different components of the ETL tool.
Provide an architecture diagram of the ETL tool and describe how the different components of the architecture interact with each other.
Explain the different plug-ins available with the ETL tool, and explain the command line interface if one is available.
Make sure that all the tool-related concepts that you will be using throughout the document are well explained here.
Explain the factors to be considered while suggesting the ETL infrastructure. These factors can be tool dependent or client-environment dependent, or they can be based on best practices from Wipro's experience.
Provide different architecture options by which the tool can be configured and used effectively in the organization; the options may be based on the repository setup. Provide pros and cons for each option. Limit the number of options to a maximum of four, and then recommend the one that best suits the client's requirements and environment.
Make sure that a geographically distributed architecture is handled if required, and also make sure that the suggested architecture is flexible enough to expand as needed.
Provide different options for disaster recovery solutions and high-availability system solutions.
ETL Architecture Options - Guidelines
Architecture Option 1
Explain the architecture with a diagram and provide pros and cons. Explain how this option can be implemented in the client environment. This might be a distributed architecture (explained in the ETL Framework session).
Architecture Option 2
Explain the architecture with a diagram and provide pros and cons. Explain how this option can be implemented in the client environment. This might be a centralized architecture (explained in the ETL Framework session).
Architecture Option 3
Explain the architecture with a diagram and provide pros and cons. Explain how this option can be implemented in the client environment. This might be a geographically distributed architecture (explained in the ETL Framework session).
Recommendation for ETL Infrastructure
Explain the recommended architecture option for the ETL infrastructure and provide justification. Explain how the client can benefit most from the recommended architecture option in their current environment. Also provide a small roadmap for the client regarding how they should grow with this option.
ETL Design Considerations
ETL Design Considerations
Modularity
Consistency
Flexibility
Speed
Heterogeneity
Meta Data Management
ETL Design Considerations
Modularity
ETL systems should contain modular elements, which encourages reuse and makes them easy to modify when implementing changes.
Consistency
ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction (either commit or rollback), as sketched after this list.
Flexibility
It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications.
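To make the consistency point concrete, here is a minimal sketch of treating an entire load as one transaction, assuming Oracle-style PL/SQL; stg_orders and dw_orders are hypothetical staging and warehouse tables.

```sql
-- Minimal sketch: the whole load commits or rolls back as one unit.
BEGIN
  INSERT INTO dw_orders (order_key, order_date, amount)
    SELECT order_key, order_date, amount
    FROM   stg_orders;
  COMMIT;               -- the entire load becomes visible at once
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK;           -- any failure undoes the entire load
    RAISE;
END;
/
```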
ETL Design Considerations
Speed
ETL systems should be as fast as possible.
Heterogeneity
ETL systems should be able to work with a wide variety of data in different formats
ETL Architectures - Types
ETL Architectures - Types
Based on the number of source systems
Homogeneous Architecture
Heterogeneous Architecture
Based on the number of ETL processes
Traditional Architecture
Conformed Architecture
ETL Architectures - Types
Homogeneous Architecture
A homogenous architecture for an ETL system is one that involves only a single
source system and a single target system
Features
Single data source
Data is extracted from a single source system, such as an OLTP system.
Rapid development
Development is rapid because there is only one data format for each record type.
Light data transformation
No data transformations are required, since the incoming data is often in a format usable in the data warehouse.
Light structural transformation
Because the data comes from a single source, the amount of structural change, such as table alteration, is also very light.
Simple research requirements
The research effort to locate data is generally simple: if the data is in the source system, it can be used; if it is not, it cannot.
ETL Architectures - Types
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from
multiple sources
Features
Multiple data sources
More complex development
The development effort required to extract the data is increased because
there are multiple source data formats for each record type.
Significant data transformation
Data transformations are required as the incoming data is often not in a
format usable in the data warehouse.
Heterogeneous Architecture
Substantial research requirements to identify and match data elements
(Diagram: Heterogeneous Architecture)
ETL Architectures - Types
Traditional ETL Architecture
Create individual ETL processes for each source system.
A traditional ETL architecture would create one ETL process to perform all of the logic necessary to transform the source data into its target destination.
ETL Architectures - Types
Traditional ETL Architecture
Advantages
There are fewer ETL processes to create and maintain
Disadvantages
Each individual ETL process redundantly performs many of the same actions.
During development, this results in an increase in the amount and complexity of ETL code to create and test.
Upon implementation, modifications require more effort, as changes potentially must be made in multiple places.
ETL Architectures - Types
Conformed ETL Architecture
There is an intermediary definition of the data entity, which is a conformed table.
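A minimal sketch of the idea, with hypothetical table and column names: each source-specific process writes into one conformed table, and a single shared post-conform process loads the target dimension.

```sql
-- Conformed table: a common intermediary structure for all source systems
CREATE TABLE conformed_customer (
  source_system  VARCHAR2(10),
  customer_code  VARCHAR2(30),
  customer_name  VARCHAR2(100)
);

-- Pre-conform processes: one small, source-specific load per system
INSERT INTO conformed_customer
  SELECT 'CRM', crm_id, cust_name FROM crm_customers;
INSERT INTO conformed_customer
  SELECT 'ERP', erp_no, name_txt FROM erp_clients;

-- Post-conform process: written once, reused for every source
INSERT INTO dim_customer (customer_key, source_system, customer_code, customer_name)
  SELECT dim_customer_seq.NEXTVAL, source_system, customer_code, customer_name
  FROM   conformed_customer;
```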
(Diagram: Conformed table)
Advantages
Modularization of ETL processes
The creation of smaller, less complex ETL processes makes troubleshooting problems and creating future enhancements easier.
Reusability of post-conform processes
Enforcing referential integrity (looking up foreign key assignments) and performing inserts and/or updates to the final target is made simple.
The conform approach avoids redundant processing and ETL logic, which reduces the chance of error and ultimately improves data quality.
Extensibility
The conformed architecture allows us to add new source systems.
Disadvantages
There are more objects and processes that must be created and maintained.
ETL Architectures - Types
Based on Repository
There are three types of ETL architectures based on the repository:
Single Domain Distributed Repository Architecture
Single Domain Single Repository Architecture
Distributed Domain Architecture
Single Domain Distributed Repository Architecture
Single Domain Single Repository Architecture
Distributed Domain Architecture
ETL Process Flows
ETL Process Flow Types
TRANSFORM THEN LOAD
LOAD THEN TRANSFORM
TRANSFORM WHILE LOADING
ETL Process Flow Types
TRANSFORM THEN LOAD
The data is manipulated outside the database to cleanse and sort it, with the result loaded into the database.
The main risks and disadvantages of transform-then-load are:
If the data is transformed outside the database, the external tools used to prepare the data may not scale as effectively as the database does and will become a bottleneck.
Depending on the architecture, the external mechanism has to control the progress of the ETL process and provide recovery and restart ability.
ETL Process Flow Types
LOAD THEN TRANSFORM
The raw data from the source system is copied into staging tables in the database, where it is cleansed and then loaded into the warehouse tables.
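As a rough illustration (all names hypothetical; a database link stands in for whatever copy mechanism is used), the raw rows land in a staging table first and are then cleansed inside the database:

```sql
-- Step 1: copy the raw source rows into a staging table unchanged
INSERT INTO stg_sales_raw
  SELECT * FROM sales@source_db;

-- Step 2: cleanse/transform in SQL, then load the warehouse table
INSERT INTO fact_sales (sale_date, product_code, amount)
  SELECT TRUNC(sale_ts),                 -- normalize timestamps to dates
         UPPER(TRIM(product_code)),      -- standardize codes
         NVL(amount, 0)                  -- default missing amounts
  FROM   stg_sales_raw;
COMMIT;
```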
The main risks and disadvantages of load-then-transform are:
Extra disk storage for the staging tables is required.
The transformation process is interrupted by storing not only intermediate results but also the original raw data from the source systems in the database.
ETL Process Flow Types
TRANSFORM WHILE LOADING
The raw data is selected directly from a stream of data from the production system or flat files, transformed by applying one or more table functions to it, and then written to the database.
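In Oracle this is typically done with a pipelined table function. The sketch below uses hypothetical types and tables (sales_row_t, ext_sales, fact_sales) and streams rows from source to target without materializing a staging copy:

```sql
CREATE TYPE sales_row_t AS OBJECT (
  sale_date     DATE,
  product_code  VARCHAR2(30),
  amount        NUMBER
);
/
CREATE TYPE sales_tab_t AS TABLE OF sales_row_t;
/
CREATE OR REPLACE FUNCTION clean_sales(p_rows SYS_REFCURSOR)
  RETURN sales_tab_t PIPELINED IS
  v_date DATE;
  v_code VARCHAR2(30);
  v_amt  NUMBER;
BEGIN
  LOOP
    FETCH p_rows INTO v_date, v_code, v_amt;
    EXIT WHEN p_rows%NOTFOUND;
    -- the transformation happens here, while the rows stream past
    PIPE ROW (sales_row_t(v_date, UPPER(TRIM(v_code)), NVL(v_amt, 0)));
  END LOOP;
  RETURN;
END;
/
-- Transform and load in a single statement
INSERT INTO fact_sales (sale_date, product_code, amount)
  SELECT * FROM TABLE(clean_sales(CURSOR(
    SELECT sale_date, product_code, amount FROM ext_sales)));
```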
ETL Extraction Methodology
Important considerations for extraction
The extraction method to choose depends heavily on the source system and on the business needs in the target data warehouse environment.
The estimated amount of data to be extracted and the stage in the ETL process also impact the decision of how to extract, from both a logical and a physical perspective.
Methods to Extract data
Logical Extraction Method
There are two kinds of logical extraction:
Full Extraction
Incremental Extraction
Physical Extraction Method
There are two kinds of physical extraction:
Online Extraction
Offline Extraction
Logical Extraction Method
Full Extraction
The data is extracted completely from the source system
Incremental Extraction
Only the data that has changed since a well-defined event back in history
will be extracted.
There are two ways to accomplish this:
Change data capture technique
Entire tables extraction
Change data capture technique
Extract only the most recently changed data
Entire tables extraction
Tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data (see the queries below).
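Two illustrative queries, assuming hypothetical control and extract tables:

```sql
-- Change data capture style: pull only rows stamped after the last extract
SELECT *
FROM   orders
WHERE  last_update_ts > (SELECT last_extract_ts
                         FROM   etl_control
                         WHERE  table_name = 'ORDERS');

-- Entire-table style: diff today's extract against yesterday's copy
SELECT * FROM orders_extract_today
MINUS
SELECT * FROM orders_extract_yesterday;   -- yields new or changed rows (not deletes)
```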
Physical Extraction Methods
Online Extraction
The data is extracted directly from the source system itself.
Offline Extraction
The data is not extracted directly from the source system but is staged
explicitly outside the original source system
Extract data in two ways
Extraction Using Data Files
Most database systems provide mechanisms for exporting or unloading data from the internal database format into flat files.
Extraction Via Distributed Operation
Using distributed-query technology, one database (e.g., Oracle) can directly query tables located in various different source systems, such as another Oracle database.
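A minimal sketch of distributed extraction over an Oracle database link (the link name, credentials, and tables are hypothetical):

```sql
-- Define a link to the remote source system
CREATE DATABASE LINK src_erp
  CONNECT TO etl_user IDENTIFIED BY etl_pwd
  USING 'erp_tns_alias';

-- One statement both extracts remotely and lands the data locally
CREATE TABLE stg_customers AS
  SELECT * FROM customers@src_erp;
```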
Transportation Methodology
Transportation
Transportation is the operation of moving data from one system to another system
The most common requirements for transportation are in moving data from
A source system to a staging database or a data warehouse database
A staging database to a data warehouse
A data warehouse to a data mart
Three basic choices for transporting data in warehouses
Transportation Using Flat Files
Transportation Through Distributed Operations
Transportation Using Transportable Tablespaces
Transportation Using Flat Files
The most common method for transporting data is by the transfer of flat
files, using mechanisms such as FTP or other remote file system access
protocols.
Advantages
When source systems and data warehouses use different operating systems and database systems, flat files are the simplest way to exchange data between heterogeneous systems with minimal transformations.
When transporting data between homogeneous systems, flat files are often the most efficient and easiest-to-manage mechanism for data transfer.
Transportation Through Distributed Operations
Distributed queries, either with or without gateways, can be an effective mechanism for extracting data.
These mechanisms also transport the data directly to the target systems, thus providing both extraction and transportation in a single step.
Transportation Using Transportable Tablespaces
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another.
Disadvantage
Source and target systems must be running Oracle8i (or higher), must be running the same operating system, and must use the same character set.
Loading Methodology
Loading Mechanisms
The process of writing the data into the target database.
You can use the following mechanisms for loading a warehouse (an external-table sketch follows the list):
SQL*Loader
External Tables
OCI and Direct Path APIs
Export/Import
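As one example, a sketch of the external-table mechanism (the directory path, file, and column names are hypothetical): the flat file becomes queryable, and the load is ordinary SQL.

```sql
CREATE DIRECTORY etl_dir AS '/data/etl/incoming';

CREATE TABLE ext_sales (
  sale_dt       VARCHAR2(10),
  product_code  VARCHAR2(30),
  amount        NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('sales.csv')
);

-- Load the warehouse table with plain SQL against the file
INSERT INTO fact_sales (sale_date, product_code, amount)
  SELECT TO_DATE(sale_dt, 'YYYY-MM-DD'), product_code, amount
  FROM   ext_sales;
```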
STAGING AREA
Definition
A place where raw data is brought in, cleaned, combined, archived, and exported to
one or more data marts.
It is also used to get data ready for loading into a presentation server.
Architecture of a Data Warehouse with a Staging Area
Data Staging Area Roles
Integrates data from many application source systems so there is one common, system-wide enterprise view of the data.
Distributes data to data marts.
Characteristics of Staging Area
Facilitates moving data from different sources on different schedules.
Provides a place to check data cleanliness and correctness.
Is the enterprise-wide integration of data.
Data is stored at the lowest level of detail available.
Need For Staging Area
Data used in the data warehouse is extracted from the data sources, cleansed and
transformed into the data warehouse schema.
The data is checked for consistency and referential integrity.
Promotes effective data warehouse management.
Data transformation in the data source systems can interfere with OLTP
performance.
Allows DW professionals to assess data quality problems before the data is loaded into the warehouse.
Conformed dimensions must be permanently housed in the data staging areas as
flat files.
Holds data for emergency recovery operations.
Source of the most atomic transactional data.
Creating Staging Area
Create tables and other database objects to support the data extraction, cleansing, and transformation operations required to prepare the data for loading into the data warehouse.
We can create a separate database for the data staging area, or we can create these items in the data warehouse database.
Creating Staging Area
It should include (see the sketch after this list):
tables to contain the incoming data
tables to aid in implementing surrogate keys
tables to hold transformed data
The design will depend on:
the diversity of data sources
the degree of transformation necessary to organize the data for DW loading
the consistency of the incoming data
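A minimal sketch of such staging objects, with hypothetical names:

```sql
-- Incoming data, loosely typed so loads rarely reject rows
CREATE TABLE stg_customer_in (
  source_system  VARCHAR2(10),
  customer_code  VARCHAR2(30),
  customer_name  VARCHAR2(100)
);

-- Surrogate key support: a generator plus a production-key map
CREATE SEQUENCE customer_key_seq;
CREATE TABLE customer_key_map (
  source_system  VARCHAR2(10),
  customer_code  VARCHAR2(30),
  customer_key   NUMBER,
  CONSTRAINT pk_customer_map PRIMARY KEY (source_system, customer_code)
);

-- Transformed data, shaped like the warehouse target and ready to load
CREATE TABLE stg_customer_out (
  customer_key   NUMBER,
  customer_name  VARCHAR2(100)
);
```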
Data Staging Techniques
Surrogate key creation and maintenance
Processing Slowly Changing Dimensions
Combining from Separate Sources
Data Cleaning
Processing Names and Addresses
Validating One-to-One and One-to-Many Relationships
Fact Processing
Aggregate Processing
Data Staging Techniques
Surrogate key creation and maintenance
Create a surrogate key; every DW key should be a surrogate key.
Processing Slowly Changing Dimensions (see the sketch after this list)
Type 1: Overwrite the Value
Rewriting history - no history is kept.
Type 2: Add a Dimension Row
Keeping track of history.
Type 3: Add a Dimension Column
Keeping track of history in parallel: push the changed value down into an "old" attribute field.
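Illustrative statements for the three SCD treatments, assuming a hypothetical dim_customer with eff_from/eff_to/current_flag columns for Type 2 and an old_customer_name column for Type 3 (the :code and :new_name binds stand in for the incoming values):

```sql
-- Type 1: overwrite the value (history is lost)
UPDATE dim_customer
SET    customer_name = :new_name
WHERE  customer_code = :code;

-- Type 2: expire the current row, then add a new dimension row
UPDATE dim_customer
SET    eff_to = SYSDATE, current_flag = 'N'
WHERE  customer_code = :code AND current_flag = 'Y';

INSERT INTO dim_customer (customer_key, customer_code, customer_name,
                          eff_from, eff_to, current_flag)
VALUES (customer_key_seq.NEXTVAL, :code, :new_name, SYSDATE, NULL, 'Y');

-- Type 3: push the changed value down into an "old" attribute column
UPDATE dim_customer
SET    old_customer_name = customer_name,
       customer_name     = :new_name
WHERE  customer_code = :code;
```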
Data Staging Techniques
Combining from Separate Sources
Dimensions are derived from several sources; the merge operation is based on some criteria.
Data Cleaning
Data cleaning may involve checking the spelling of an attribute or
checking the membership of an attribute in a list
Processing Names and Addresses
Names and addresses are cleaned and put into standardized formats.
Data Staging Techniques
Fact Processing
The incoming fact records will have production keys, not data warehouse keys. The current, correct correspondence between a production key and the data warehouse key must be looked up at load time (see the lookup sketch after this slide).
Aggregate Processing
Each load of new fact records requires that aggregates be calculated or
augmented. It is very important to keep the aggregates synchronous with the
base data at every instant
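A sketch of the load-time lookup (hypothetical names): the join swaps each production key for the current surrogate key.

```sql
INSERT INTO fact_sales (customer_key, sale_date, amount)
  SELECT d.customer_key,        -- warehouse surrogate key
         s.sale_date,
         s.amount
  FROM   stg_sales    s
  JOIN   dim_customer d
    ON   d.customer_code = s.customer_code  -- production key match
   AND   d.current_flag  = 'Y';             -- current dimension row only
```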
Data Staging Techniques
Validating One-to-One and One-to-Many Relationships (see the queries after this slide)
If two attributes in a dimension have a one-to-one relationship, validate it by sorting the dimension records on one of the attributes. Each attribute value in the sorted column must have exactly one value in the other column. The check must then be reversed by sorting on the second column and repeating.
A many-to-one relationship (e.g., zip code to state) can be validated by sorting on the "many" attribute and verifying that each value has a unique value on the "one" attribute.
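The same checks can be phrased as set-based queries instead of sorts; these sketches assume a hypothetical dim_geo table with the attributes named below:

```sql
-- One-to-one check between attr_a and attr_b: run in both directions;
-- any row returned is a violation
SELECT attr_a FROM dim_geo GROUP BY attr_a HAVING COUNT(DISTINCT attr_b) > 1;
SELECT attr_b FROM dim_geo GROUP BY attr_b HAVING COUNT(DISTINCT attr_a) > 1;

-- Many-to-one check (e.g., zip code to state): each "many" value
-- must map to exactly one "one" value
SELECT zip_code FROM dim_geo GROUP BY zip_code HAVING COUNT(DISTINCT state) > 1;
```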
Data Staging Storage Types
Flat Files
Relational tables
OODBMS - object-oriented database management system
Is Data Staging Relational?
Whether the data staging area is relational, or has more to do with sequential processing of flat files, is an open question. Ralph Kimball [1] concludes that "most data staging activities are not relational, but rather they are sequential processing. If your incoming data is in flat-file format, you should finish your data staging processes as flat files before loading it into a relational database." He also states that if both the source and target databases are relational, it may be appropriate to retain this format and not convert to flat files.
Data Staging Components
Data Staging Application Server
Data Staging Repository
Metadata and Meta Model Repository
Data Staging Components
Data Staging Application Server
Temporarily stores and transforms data extracted from OLTP data sources.
Data Staging Repository
The data store archive (repository) of the results of extraction, transformation, and loading activity. The archival repository stores cleaned, transformed records and attributes for later loading into data marts and data warehouses.
Data Staging Components
Metadata and Meta Model Repository
The data staging process is driven in an essential way by metadata,
including business rules.
Metadata is used along with administrative tools to guide data extractions,
transformations, archiving, and loading to target data mart and data
warehouse schemas.
Staging scenarios
Two staging scenarios
Scenario 1: a data staging tool is available.
The data is already in a database. The data flow is set up so that it comes out of the source system, moves through the transformation engine, and into a staging database.
Staging scenarios
Scenario 2
In the second scenario, begin with a mainframe legacy system. Then extract the sought-after data into a flat file, move the file to a staging server, transform its contents, and load the transformed data into the staging database.
General Data Staging Requirements
Productivity support
The data staging services need to provide basic development environment capabilities such as code library management with check-in/check-out, version control, and production and development system builds.
Usability
The data staging system must be as usable as possible; it should have a graphical user interface.
General Data Staging Requirements
System Documentation
The data staging system needs to provide a way for developers to easily capture information about the processes that they are creating.
Metadata driven
Metadata is used along with administrative tools to guide data extractions,
transformations, archiving, and loading to target data mart and data warehouse
schemas.
Staging Metadata
Data Staging Metadata
The metadata needed to get the data into a staging area and prepare it for loading into one or more data marts includes:
Data acquisition information
Dimension table Management
Transformation and Aggregation
Audit, Job Logs and documentation
DBMS metadata
Front Room Metadata
Data Staging Metadata
Data acquisition information
Data transmission scheduling
File usage in the data staging area including duration, volatility and
ownership.
Data Staging Metadata
Dimension table Management
Definitions of conformed dimensions and conformed facts.
Job specifications for joining sources, stripping out fields, and looking up
attributes.
Slowly changing dimension policies for each incoming descriptive
attribute.
Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory.
Yesterday's copy of the production dimensions to use as the basis for the diff compare.
Data Staging Metadata
Transformation and Aggregation
Data cleaning specifications.
Data enhancement and mapping transformations.
Transformations required for data mining.
Target schema designs, source-to-target data flows, and target data
ownership.
DBMS load scripts.
Aggregate definitions.
Aggregate usage statistics, base table usage statistics, and potential aggregates.
Aggregate modification logs.
Data Staging Metadata
Audit, Job Logs and Documentation
Data lineage and audit records (where exactly did this record come from, and when?).
Data transformation run-time logs, success summaries, and time stamps.
Data transformation software version numbers.
Security settings for extract files, extract software, and extract metadata.
Security settings for data transmission (e.g., passwords, certifications).
Data staging area archive logs and recovery procedures.
Data staging archive security settings.
Data Staging Metadata
DBMS metadata
DBMS system table contents.
Partition settings.
Indexes.
Disk striping specifications.
Processing hints.
DBMS-level security privileges and grants.
View definitions.
Stored procedures and SQL administrative scripts.
DBMS backups, backup procedures, and backup security.
Data Staging Metadata
Front room metadata
Business names and descriptions for columns, tables, and groupings.
Query and report definitions.
Join specification tool settings.
Network security user privilege profiles.
Usage and access maps for data elements, tables, views, and reports.