ETL Review
Joanna Frazier Abhishek Sengupta Chris Kadlec Erik Shepard Susan Kost Brian Strok Ivan Vasquez
What is ETL?
Short for extract, transform, and load.
Three database functions that are combined into one tool to pull data out of one database and place it into another. ETL is used to migrate data from one database to another, to form data marts and data warehouses, and to convert databases from one format or type to another.
http://www.pcwebopedia.com/TERM/E/ETL.html
Side Note:
It should be noted that ETL is not three well-defined steps. We break them apart and present a theoretical view for ease of understanding, before bringing them back together and showing how this method actually works in the real business world.
Extraction
Data needs to be taken from some data source so that it can be put into the data warehouse. To do this, either:
1. Some code at the data source exports the data to be used, or
2. Some external program takes the data from the source.
Extraction (cont)
If the data is exported, it is typically exported into a text file that can then be brought into an intermediary database. If the data is extracted from the source, it is typically transferred directly into an intermediary database.
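The export-then-stage path above can be sketched in a few lines. This is a minimal illustration, not any vendor's tool: the pipe-delimited file, the sample rows, and the staging table name are all made up, and sqlite3 stands in for the intermediary database.

```python
import csv
import sqlite3

# Hypothetical export: the source system writes its rows to a delimited text file.
rows = [("1001", "Smith", "GA"), ("1002", "Jones", "SC")]
with open("customers.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["cust_id", "name", "state"])
    writer.writerows(rows)

# Bring the text file into an intermediary (staging) database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_customers (cust_id TEXT, name TEXT, state TEXT)")
with open("customers.txt", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO stage_customers VALUES (?, ?, ?)", reader)

count = conn.execute("SELECT COUNT(*) FROM stage_customers").fetchone()[0]
print(count)  # 2
```

The same staging table would then feed the transformation steps described next.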
Data Transformation
- Designs the process and develops the utilities and programming that allow the data warehouse to be initially loaded and maintained
- Loads data onto the data warehouse platform
- Physical database design must be available before loading can be performed
Data Transformation
Three major steps:
- Data Cleansing
- Data Integration
- Other Transformations (includes replacement of codes, derived values, calculating aggregates)
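The "other transformations" bullet covers code replacement, derived values, and aggregates; a toy sketch of all three follows. The store codes, lookup table, and 7% tax rate are invented for illustration.

```python
# Toy staged sales rows: (store_code, amount).
staged = [("ATL", 120.0), ("ATL", 80.0), ("SAV", 50.0)]
code_lookup = {"ATL": "Atlanta", "SAV": "Savannah"}  # assumed code table

transformed = []
for code, amount in staged:
    transformed.append({
        "store": code_lookup[code],                   # replacement of codes
        "amount": amount,
        "amount_with_tax": round(amount * 1.07, 2),   # derived value (assumed rate)
    })

# Calculating aggregates: total sales per store for a summary table.
totals = {}
for row in transformed:
    totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]

print(totals)  # {'Atlanta': 200.0, 'Savannah': 50.0}
```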
Data Cleansing
- Dirty Data
- Dummy Values
- Absence of Data
- Cryptic Data
- Contradicting Data
- Inappropriate Use of Address Lines
- Reused Primary Keys
- Non-unique Identifiers
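A few of these problems (dummy values, absent data, cryptic codes) lend themselves to simple rule-based cleansing. The sentinel list, the status-code mapping, and the sample record below are all assumptions for the sketch, not a standard.

```python
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", ""}  # assumed sentinel values

def cleanse(record):
    """Return a cleaned copy of a record dict."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        # Dummy values and absent data become explicit NULLs.
        if value is None or str(value).upper() in DUMMY_VALUES:
            value = None
        cleaned[field] = value
    # Cryptic data: decode a one-letter status code (assumed mapping).
    status_map = {"A": "active", "I": "inactive"}
    if cleaned.get("status") in status_map:
        cleaned["status"] = status_map[cleaned["status"]]
    return cleaned

print(cleanse({"ssn": "999-99-9999", "name": " Smith ", "status": "A"}))
# {'ssn': None, 'name': 'Smith', 'status': 'active'}
```

Problems like reused primary keys and contradicting data usually need source-system knowledge rather than generic rules.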
Data Integration
Two major problems:
- Data that should be related but cannot be (may arise due to non-unique primary keys or, more often, the absence of primary keys)
- Data that is inadvertently related but should not be (occurs when fields or records are reused for multiple purposes)
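One common remedy for non-unique natural keys is a warehouse-assigned surrogate key that keeps unrelated rows apart while remembering which system each row came from. The source names and customer numbers below are hypothetical.

```python
# Two sources reuse customer numbers, so the natural key alone is not unique.
source_a = [{"cust_no": 1, "name": "Smith"}]
source_b = [{"cust_no": 1, "name": "Jones"}]  # same number, different person

surrogate = 0
integrated = {}
for system, rows in (("A", source_a), ("B", source_b)):
    for row in rows:
        surrogate += 1
        # The surrogate key is the warehouse identity; (system, cust_no)
        # preserves the lineage back to the operational source.
        integrated[surrogate] = {"source": (system, row["cust_no"]), **row}

print(len(integrated))  # 2 distinct warehouse customers, not 1
```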
Loading
The populating of tables that presentation applications will use to make data available to users. Among the most critical operations in any warehouse, yet often neglected.
Loading (cont)
Initial Load
- Consists of populating the tables in the warehouse
Continuous Loads
- Must be scheduled and processed in a specific order to maintain integrity, completeness, and a satisfactory level of trust
- Should be the most carefully planned step in data warehousing, or can lead to:
  - Error duplication
  - Exaggeration of inconsistencies in the data
- Usually run overnight; must maximize system resources to load data efficiently in the allotted time
- Ex. Red Brick Loader can validate, load, and index up to 12GB of data per hour on an SMP system
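The "specific order" requirement usually means dimension tables load before the fact tables that reference them, so foreign keys always resolve. A minimal sketch, with sqlite3 standing in for the warehouse and invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    amount REAL)""")

def nightly_load(dim_rows, fact_rows):
    # Dimensions first, then facts: loading in the other order would
    # leave fact rows pointing at stores the warehouse does not know yet.
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", dim_rows)
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", fact_rows)
    conn.commit()

nightly_load([(1, "Atlanta")], [(1, 99.5)])
total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 99.5
```

A real continuous load would also handle restartability and rejected rows, which the sketch omits.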
Questions to Ask
- Data Source Connectivity: Oracle, Sybase, Informix, mainframe (CICS), flat files
- Functionality: pre-built transformations
- Metadata: open architecture, reporting capability, extensibility
- Performance: engine driven, code generator, bulk loading, "data never touches the ground", multi-threaded processes
- Administration: versioning, debugging, auditing, error detection
- Modeling Tool Connectivity: ERwin, PowerDesigner
- Ease of Use: GUI interface, intuitive design, integrated toolset
- Programming Languages Supported: VB, C, C++, COBOL
- Support: 24x7, devoted staff levels
ETL Vendors
Ascential
SAS
NCR Teradata
IBM
Oracle
Vality
Firstlogic
Data Warehouses
Oracle, Red Brick, DB2, Prism, Sybase, Teradata, Informix, Microsoft SQL Server
Pricing Trends
Costs
- FireSpout ETL Engine: starts at $150K
- MetaRecon Enterprise: server package $250K, client package $50K
Pricing Trends
IBM DB2 7: ASPs and other partners pay with a percentage of the revenue received from customers once the solution is running, on a per-subscriber or per-transaction basis. A per-user base pricing model is still offered. The majority of database purchases are sold with an accompanying application and will continue to be done this way.
Formation 1.4: works with Informix databases, Red Brick Warehouse, Oracle8 Server, and Microsoft SQL Server databases. $7,500 per processor for the Formation Flow Engine.
Advantages
- Rapid Application Development (RAD)
- Allows easy maintenance
- Generates metadata automatically
- Reduces development costs
Disadvantages
- Learning curve
- Some limits to logic capabilities
www.nyoug.org/dwetl_ny.pdf
The aggregation of discrete spatial databases into a single repository, along with associated value-added tabular datasets. These often come from disparate data sources, e.g. roads from the Department of Transportation, rivers and lakes from the Department of Natural Resources, etc.
Data Types
- Vector: geometric data such as points, lines, and polygons (e.g. roads, contour lines, schools)
- Raster: continuous or image data (e.g. aerial photography)
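The vector/raster distinction can be made concrete with plain data structures; the coordinates and cell values below are invented, and real GIS software uses far richer formats.

```python
# Vector data: discrete geometries as coordinate lists (a toy road and school).
road = {"type": "LineString", "coords": [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)]}
school = {"type": "Point", "coords": [(1.5, 0.2)]}

# Raster data: a continuous surface as a grid of cell values (a toy 3x3 image).
raster = [
    [10, 12, 11],
    [13, 15, 14],
    [11, 12, 10],
]

print(len(road["coords"]), raster[1][1])  # 3 vertices; center cell value 15
```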
Demonstration
Georgia 2000 Information System
Combines spatial and tabular data from a wide variety of sources. The foundation of Georgia 2000 is the map data: for example, political boundaries, roads, water features, facilities, locations, etc. Tabular data is value-added to the map data with information such as spending patterns per county.
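The "value-added" join above amounts to attaching tabular attributes to map features through a shared key. A sketch with county FIPS codes as the key; the codes, names, and spending figures are illustrative only.

```python
# Map data keyed by county FIPS code (hypothetical feature records).
counties = {"13121": {"name": "Fulton"}, "13067": {"name": "Cobb"}}

# Value-added tabular data keyed by the same FIPS code.
spending = {"13121": 4200.0, "13067": 3100.0}

# Join the tabular attribute onto each spatial feature.
for fips, feature in counties.items():
    feature["spending_per_capita"] = spending.get(fips)

print(counties["13121"]["spending_per_capita"])  # 4200.0
```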
Errors in Geometric Features
- Geometric Errors: distortions of photography due to camera angle, height displacement, etc.
- Topological Errors: small pieces of unidentified area called slivers, which can account in total for large areas.
Spatial ETL requires systematic corrections of geometric, topological, or coordinate-system problems. Another type of spatial data can be produced by a process called geocoding, in which points are located along a network (for example, a street network). The quality of the underlying tabular data used as input affects the quality of the geocoding, and correcting this tabular data for good results requires the same types of ETL as traditional data warehousing.
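The tabular cleanup that feeds geocoding is classic ETL cleansing, e.g. standardizing addresses before matching them against the street network. The suffix table, reference network, and coordinates below are invented; real geocoders use much larger standards.

```python
# Assumed abbreviation table for street suffixes.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def standardize(address):
    """Uppercase, collapse whitespace, and normalize street suffixes."""
    parts = address.upper().split()
    return " ".join(SUFFIXES.get(p, p) for p in parts)

# A tiny reference network: standardized address -> coordinates.
network = {"100 PEACHTREE ST": (33.754, -84.39)}

def geocode(raw_address):
    # Matching succeeds only after the input is cleansed to the same form.
    return network.get(standardize(raw_address))

print(geocode("100 Peachtree Street"))  # matches despite the unabbreviated suffix
```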
Demonstrations