
ETL Review

The document discusses extract, transform, and load (ETL) software. It describes what ETL is, the extraction, transformation, and loading processes, considerations for ETL tools and vendors, and uses of ETL for spatial data warehousing.

Uploaded by

kodanda
Copyright
© Attribution Non-Commercial (BY-NC)

ETL Software

Joanna Frazier Abhishek Sengupta Chris Kadlec Erik Shepard Susan Kost Brian Strok Ivan Vasquez

What is ETL?
Short for extract, transform, and load.

Three database functions that are combined into one tool to pull data out of one database and place it into another database. ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format or type to another.
http://www.pcwebopedia.com/TERM/E/ETL.html

Side Note:
It should be noted that ETL is not 3 well-defined steps. We are breaking them up and presenting a theoretical view for ease of understanding before bringing them together and showing you how this method actually works in the real business world.

Extraction
Data needs to be taken from some data source so that it can be put into the data warehouse. This happens in one of two ways:
1. Some code at the data source exports the data to be used.
2. Some external program takes the data from the source.

Extraction (cont)
- If the data is exported, it is typically exported into a text file that can then be brought into an intermediary database.
- If the data is extracted from the source, it is typically transferred directly into an intermediary database.
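The two extraction paths can be sketched as follows. This is a minimal illustration only: the `customers` and `staging_customers` tables, the sample rows, and the function names are assumptions made up for the example, and SQLite stands in for whatever intermediary database is actually used.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for a real source system.
SOURCE_ROWS = [(1, "Acme", "GA"), (2, "Globex", "NY")]

def make_staging():
    """Create a hypothetical intermediary (staging) database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_customers (id INTEGER, name TEXT, state TEXT)")
    return conn

def export_to_text(rows):
    """Path 1, step 1: code at the source exports the data as delimited text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def load_text(text, staging):
    """Path 1, step 2: the text file is brought into the intermediary database."""
    rows = [(int(r[0]), r[1], r[2]) for r in csv.reader(io.StringIO(text))]
    staging.executemany("INSERT INTO staging_customers VALUES (?, ?, ?)", rows)
    staging.commit()

def extract_directly(source, staging):
    """Path 2: an external program reads the source and writes straight to staging."""
    rows = source.execute("SELECT id, name, state FROM customers").fetchall()
    staging.executemany("INSERT INTO staging_customers VALUES (?, ?, ?)", rows)
    staging.commit()
```

Either path ends in the same staging table, which is where the transformation work described next would start.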

Data Transformation
- Designs the process and develops the utilities and programming that allow the data warehouse to be initially loaded and maintained
- Locates, extracts, conditions, scrubs, and loads data onto the data warehouse platform
- Physical database design must be available before loading can be performed

Data Transformation
3 major steps

- Data Cleansing
- Data Integration
- Other Transformations (includes replacement of codes, derived values, calculating aggregates)
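As a rough illustration of the "other transformations" item, the sketch below replaces a code with its description, computes a derived value, and calculates an aggregate. The lookup table, field names, and function names are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical lookup used to replace terse codes with readable values.
STATE_CODES = {"GA": "Georgia", "NY": "New York"}

def transform(rows):
    """Replace codes and add a derived value to each row."""
    out = []
    for row in rows:
        out.append({
            "state": STATE_CODES.get(row["state"], "Unknown"),  # replacement of codes
            "quantity": row["quantity"],
            "unit_price": row["unit_price"],
            "total": row["quantity"] * row["unit_price"],  # derived value
        })
    return out

def aggregate_by_state(rows):
    """Calculate an aggregate: the sum of derived totals per state."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["state"]] += row["total"]
    return dict(totals)
```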

Data Cleansing
Dirty Data
- Dummy Values
- Absence of Data
- Cryptic Data
- Contradicting Data
- Inappropriate Use of Address Lines
- Reused Primary Keys
- Non-unique Identifiers
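A few of these dirty-data categories can be detected mechanically. The sketch below checks for dummy values, absent data, and non-unique identifiers; the placeholder set, field names, and function name are assumptions for illustration, and the other categories (cryptic or contradicting data, for instance) generally need domain rules that are not shown here.

```python
from collections import Counter

# Assumed placeholder values; a real list would come from profiling the source data.
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "XXXXX"}

def find_dirty_rows(rows, key="id", required=("name", "ssn")):
    """Return (row_index, field, problem) tuples for three kinds of dirty data."""
    issues = []
    key_counts = Counter(row.get(key) for row in rows)
    for i, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                issues.append((i, field, "absent"))
            elif str(value).strip().upper() in DUMMY_VALUES:
                issues.append((i, field, "dummy value"))
        if key_counts[row.get(key)] > 1:
            issues.append((i, key, "non-unique identifier"))
    return issues
```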

Data Integration
2 Major Problems
- Data that should be related but cannot be (may arise due to non-unique primary keys or, more often, the absence of primary keys)
- Data that is inadvertently related but should not be (occurs when fields or records are reused for multiple purposes)
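The first problem can be surfaced with a simple key check before any join is attempted. This is a sketch under assumed record shapes (`orders` and `customers` dictionaries sharing a `customer_id` field, all hypothetical); it does not address the second problem, which usually requires knowing how fields were reused.

```python
def integration_problems(orders, customers, key="customer_id"):
    """Find orders that cannot be reliably related to a customer.

    unmatched: the key is absent or has no matching customer.
    ambiguous: the key matches more than one customer (non-unique key).
    """
    by_key = {}
    for customer in customers:
        by_key.setdefault(customer[key], []).append(customer)
    unmatched = [o for o in orders if o[key] not in by_key]
    ambiguous = [o for o in orders if len(by_key.get(o[key], [])) > 1]
    return unmatched, ambiguous
```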

Loading
- The populating of tables that presentation applications will use to make data available to users
- One of the most critical operations in any warehouse, yet often neglected

Loading (cont)
The loading process can be broken down into 2 different types:
- Initial Load
- Continuous Load (loading over time)

Initial Load
Consists of populating tables in the warehouse schema and verifying data readiness
Examples:
- DTS (Data Transformation Services)
- bcp utility (bulk copy program)
- SQL*Loader
- Native database languages (T-SQL, PL/SQL, etc.)
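The tools listed above are vendor utilities; the sketch below does not use any of them. It only illustrates the two halves of an initial load, populating a table and then verifying readiness, using Python's built-in `sqlite3` module. The `sales_fact` table and the row-count check are assumptions chosen for the example.

```python
import sqlite3

def initial_load(conn, rows):
    """Populate a hypothetical warehouse table, then verify the load."""
    conn.execute(
        "CREATE TABLE sales_fact ("
        "sale_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    # Readiness check (assumed criterion): every source row arrived.
    loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
    if loaded != len(rows):
        raise RuntimeError(f"expected {len(rows)} rows, loaded {loaded}")
    return loaded
```

A real readiness check would likely go further, for example reconciling control totals against the source system.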

Continuous Loads
- Must be scheduled and processed in a specific order to maintain integrity, completeness, and a satisfactory level of trust
- Should be the most carefully planned step in data warehousing, or can lead to:
  - Error duplication
  - Exaggeration of inconsistencies in data
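One way to picture ordered, repeatable loads is sketched below. The table names, the `LOAD_ORDER` list, and the choice of `INSERT OR IGNORE` to skip already-loaded rows are all assumptions for illustration; a production loader would also need to handle changed rows, which this sketch does not.

```python
import sqlite3

# Assumed ordering: parent (dimension) tables load before child (fact) tables.
LOAD_ORDER = ["dim_customer", "fact_sales"]

def make_warehouse():
    """Create a tiny hypothetical warehouse with one dimension and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE fact_sales ("
        "sale_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id), "
        "amount REAL)"
    )
    return conn

def load_batch(conn, batch):
    """Load one batch in LOAD_ORDER; rows whose key already exists are skipped."""
    with conn:  # the whole batch commits or rolls back together
        for table in LOAD_ORDER:
            rows = batch.get(table, [])
            if not rows:
                continue
            placeholders = ", ".join("?" * len(rows[0]))
            conn.executemany(
                f"INSERT OR IGNORE INTO {table} VALUES ({placeholders})", rows
            )
```

Because existing keys are skipped, re-running the same batch after a failure does not duplicate rows.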

Continuous Loads (cont)


- Must be during a fixed batch window (usually overnight)
- Must maximize system resources to load data efficiently in the allotted time
  - Ex. Red Brick Loader can validate, load, and index up to 12GB of data per hour on an SMP system

Additional Aspects of Loader


Should be able to perform:
- Aggregations: build on past data (SUM, MODIFY, APPEND, UPDATE, etc.)
- Filtering: additional cleaning and filtering based on user instructions
- Integrity: ensure data to be loaded meets integrity constraints previously established
- Index Building: create indexes associated with the data being loaded
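The four loader capabilities can be sketched in one small function. Everything here is hypothetical: the `daily_sales` and `region_totals` tables, the integrity rule (region present, amount non-negative), and the `keep` callback standing in for user filtering instructions.

```python
import sqlite3

def load_with_checks(conn, rows, keep=lambda row: True):
    """Filter, check integrity, load, update aggregates, and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS daily_sales (day TEXT, region TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS region_totals (region TEXT PRIMARY KEY, total REAL)")
    accepted, rejected = [], []
    for row in rows:
        _, region, amount = row
        if not keep(row):  # filtering based on user instructions
            continue
        if region is None or amount is None or amount < 0:  # assumed integrity rule
            rejected.append(row)
            continue
        accepted.append(row)
    with conn:
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", accepted)
        for _, region, amount in accepted:
            # Aggregation that builds on totals from earlier loads.
            conn.execute("INSERT OR IGNORE INTO region_totals VALUES (?, 0)", (region,))
            conn.execute(
                "UPDATE region_totals SET total = total + ? WHERE region = ?",
                (amount, region),
            )
        # Index building for the data just loaded.
        conn.execute("CREATE INDEX IF NOT EXISTS idx_daily_sales_region ON daily_sales(region)")
    return accepted, rejected
```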

Questions to Ask
- Data Source Connectivity: Oracle, Sybase, Informix, mainframe (CICS), flat files
- Functionality: pre-built transformations
- Metadata: open architecture, reporting capability, extensibility
- Performance: engine driven, code generator, bulk loading, "data never touches the ground", multi-threaded processes
- Administration: versioning, debugging, auditing

More Questions to Ask


- Backup and Disaster Recovery: restart logic, error detection
- Modeling Tool Connectivity: ERwin, PowerDesigner
- Ease of Use: GUI interface, intuitive design, integrated toolset
- Programming Language Supported: VB, C, C++, COBOL
- Support: 24x7, devoted staff levels

ETL Vendors
- Ascential
- SAS
- NCR Teradata
- IBM
- Oracle
- Vality
- Firstlogic

ETL Tool Set


- Purchase or Grow Your Own
- 100s of Vendors: www.dwinfocenter.org/clean.html
- Pricing Varies Widely
- Trend: Included as part of other initiatives
  - CRMs
  - NCR's Teradata

Data Warehouses
Oracle, Red Brick, DB2, Prism, Sybase, Teradata, Informix, Microsoft SQL Server

Pricing Trends
Costs
- FireSpout, ETL Engine: start at $150K
- MetaRecon Enterprise: Server Package $250K, Client Package $50K

Pricing Trends
IBM, DB2, 7
- ASPs and other partners pay with a percentage of revenue received from customers once the solution is running, on a per-subscriber or per-transaction basis.
- Still offer a per-user base pricing model.
- The majority of database purchases are sold with an accompanying application and will still be done this way.
Formation 1.4
- Informix databases, Red Brick Warehouse, Oracle8 Server, Microsoft SQL Server databases
- $7500 per processor for the Formation Flow Engine

No use of ETL Tools


- Start immediately
- Any logic set can be programmed
Disadvantages
- Many programs to build
- Transformation logic is complex
- Lengthy program build process
- No automatic metadata generation
- Maintenance: constant changes
- Infrastructure is very expensive
www.nyoug.org/dwetl_ny.pdf

Use ETL Tools


- Enables rapid application development (RAD)
- Allows easy maintenance
- Generates metadata automatically
- Reduces development costs
Disadvantages
- Learning curve
- Some limits to logic capabilities
www.nyoug.org/dwetl_ny.pdf

ETL for Spatial Data Warehousing


What is spatial data warehousing?
- Spatial data warehousing is the aggregation of discrete spatial databases in a single repository, along with associated value-added tabular datasets.
- The data often come from disparate sources, e.g. roads from the Department of Transportation, rivers and lakes from the Department of Natural Resources, etc.

Spatial Data Types


Functionally, there are two principal spatial data types:
- Vector: Geometric data such as points, lines, and polygons. Examples would be roads, contour lines, schools, etc.
- Raster: Continuous or image data. An example would be aerial photography.
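The difference between the two types shows up in how they are stored. The classes below are a minimal sketch, not any GIS library's data model; the class names, the upper-left `origin` convention, and `value_at` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (x, y)

# Vector data: explicit geometry built from coordinates.
@dataclass
class Point:
    location: Coordinate  # e.g. a school

@dataclass
class Line:
    vertices: List[Coordinate]  # e.g. a road or contour line

@dataclass
class Polygon:
    ring: List[Coordinate]  # e.g. a county boundary

# Raster data: a regular grid of cell values, such as image pixels.
@dataclass
class Raster:
    origin: Coordinate  # assumed to be the upper-left corner of the grid
    cell_size: float
    cells: List[List[float]]  # cells[row][col], row 0 at the top

    def value_at(self, x: float, y: float) -> float:
        """Look up the cell that contains the point (x, y)."""
        col = int((x - self.origin[0]) / self.cell_size)
        row = int((self.origin[1] - y) / self.cell_size)
        return self.cells[row][col]
```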

Demonstration
Georgia 2000 Information System

- The Georgia 2000 Information System aggregates spatial and tabular data from a wide variety of sources.
- The foundation of Georgia 2000 is the map data, for example political boundaries, roads, water features, facilities, locations, etc.
- Tabular data is value-added to the map data with information such as spending patterns per county, etc.

Problems unique to spatial data warehousing


- Coordinate System (Projection)
- Geometric Errors: Misalignment of geometric features
- Geometric Errors: Distortions of photography due to camera angle, height displacement, etc.
- Topological Errors: Little pieces of unidentified areas called slivers. Can account in total for large areas.
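As one illustration of a topological check, slivers can be flagged by area. The sketch below uses the shoelace formula on simple (non-self-intersecting) polygon rings; the `max_area` threshold and function names are assumptions, and real tools often combine area with a shape measure such as area-to-perimeter ratio.

```python
def polygon_area(ring):
    """Area of a simple polygon given as a list of (x, y) vertices (shoelace formula)."""
    area = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

def find_slivers(polygons, max_area):
    """Return polygons smaller than max_area and the total area they cover."""
    slivers = [p for p in polygons if polygon_area(p) < max_area]
    return slivers, sum(polygon_area(p) for p in slivers)
```

Summing the sliver areas makes the slide's last point measurable: individually tiny pieces can add up to a large total.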

- ETL for spatial data warehousing involves systematic corrections of geometric, topological, or coordinate system problems.
- Another type of spatial data can be produced from a process called geocoding, in which points are located along a network (for example, a street network).
- The quality of the underlying tabular data used as input affects the quality of geocoding.
- Correcting this tabular data for good results from geocoding requires the same types of ETL as traditional data warehousing does.
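A common way to locate a point along a street network is address-range interpolation, sketched below as an assumption about how such geocoding might work rather than the method used in the demonstration. The function names, the straight-segment simplification, and the address format are all hypothetical. The cleaning step shows why tabular quality matters: an address that cannot be parsed cannot be placed.

```python
def clean_address(raw):
    """Normalize a raw address string into (house_number, street_name)."""
    parts = raw.strip().upper().split()
    if not parts or not parts[0].isdigit():
        raise ValueError(f"cannot parse house number from {raw!r}")
    return int(parts[0]), " ".join(parts[1:])

def geocode_along_segment(house_number, low, high, start, end):
    """Interpolate a point on a straight street segment from its address range.

    low/high are the house numbers at the segment's start and end points.
    """
    if not low <= house_number <= high:
        raise ValueError("house number outside the segment's address range")
    fraction = 0.0 if high == low else (house_number - low) / (high - low)
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return (x, y)
```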

Demonstrations
