ETL Review
Joanna Frazier Abhishek Sengupta Chris Kadlec Erik Shepard Susan Kost Brian Strok Ivan Vasquez
What is ETL?
Short for extract, transform, and load.
Three database functions that are combined into one tool to pull data out of one database and place it into another. ETL is used to migrate data from one database to another, to form data marts and data warehouses, and to convert databases from one format or type to another.
http://www.pcwebopedia.com/TERM/E/ETL.html
Side Note:
It should be noted that ETL is not three well-defined steps. We break them apart and present a theoretical view for ease of understanding, before bringing them back together and showing how this method actually works in the real business world.
Extraction
Data needs to be taken from some data source so that it can be put into the data warehouse. To do this, either:
1. Some code at the data source exports the data to be used, or
2. Some external program takes the data from the source.
Extraction (cont)
If the data is exported, it is typically exported into a text file that can then be brought into an intermediary database. If the data is extracted from the source, it is typically transferred directly into an intermediary database.
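The export-then-stage path above can be sketched in a few lines. This is a minimal illustration, not any vendor's tool: the pipe-delimited file, the sample rows, and the staging table name are all made up, and sqlite3 stands in for the intermediary database.

```python
import csv
import sqlite3

# Hypothetical export: the source system writes its rows to a delimited text file.
rows = [("1001", "Smith", "GA"), ("1002", "Jones", "SC")]
with open("customers.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(["cust_id", "name", "state"])
    writer.writerows(rows)

# Bring the text file into an intermediary (staging) database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_customers (cust_id TEXT, name TEXT, state TEXT)")
with open("customers.txt", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO stage_customers VALUES (?, ?, ?)", reader)

count = conn.execute("SELECT COUNT(*) FROM stage_customers").fetchone()[0]
print(count)  # 2
```

The same staging table would then feed the transformation steps described next.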
Data Transformation
- Designs the process and develops the utilities and programming that allow the data warehouse to be initially loaded and maintained
- Loads data onto the data warehouse platform
- Physical database design must be available before loading can be performed
Data Transformation
Three major steps:
- Data Cleansing
- Data Integration
- Other Transformations (includes replacement of codes, derived values, calculating aggregates)
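The "other transformations" bullet covers code replacement, derived values, and aggregates; a toy sketch of all three follows. The store codes, lookup table, and 7% tax rate are invented for illustration.

```python
# Toy staged sales rows: (store_code, amount).
staged = [("ATL", 120.0), ("ATL", 80.0), ("SAV", 50.0)]
code_lookup = {"ATL": "Atlanta", "SAV": "Savannah"}  # assumed code table

transformed = []
for code, amount in staged:
    transformed.append({
        "store": code_lookup[code],                   # replacement of codes
        "amount": amount,
        "amount_with_tax": round(amount * 1.07, 2),   # derived value (assumed rate)
    })

# Calculating aggregates: total sales per store for a summary table.
totals = {}
for row in transformed:
    totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]

print(totals)  # {'Atlanta': 200.0, 'Savannah': 50.0}
```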
Data Cleansing
- Dirty Data
- Dummy Values
- Absence of Data
- Cryptic Data
- Contradicting Data
- Inappropriate Use of Address Lines
- Reused Primary Keys
- Non-unique Identifiers
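A few of these problems (dummy values, absent data, cryptic codes) lend themselves to simple rule-based cleansing. The sentinel list, the status-code mapping, and the sample record below are all assumptions for the sketch, not a standard.

```python
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", ""}  # assumed sentinel values

def cleanse(record):
    """Return a cleaned copy of a record dict."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        # Dummy values and absent data become explicit NULLs.
        if value is None or str(value).upper() in DUMMY_VALUES:
            value = None
        cleaned[field] = value
    # Cryptic data: decode a one-letter status code (assumed mapping).
    status_map = {"A": "active", "I": "inactive"}
    if cleaned.get("status") in status_map:
        cleaned["status"] = status_map[cleaned["status"]]
    return cleaned

print(cleanse({"ssn": "999-99-9999", "name": " Smith ", "status": "A"}))
# {'ssn': None, 'name': 'Smith', 'status': 'active'}
```

Problems like reused primary keys and contradicting data usually need source-system knowledge rather than generic rules.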
Data Integration
Two major problems:
- Data that should be related but cannot be (may arise due to non-unique primary keys or, more often, the absence of primary keys)
- Data that is inadvertently related but should not be (occurs when fields or records are reused for multiple purposes)
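One common remedy for non-unique natural keys is a warehouse-assigned surrogate key that keeps unrelated rows apart while remembering which system each row came from. The source names and customer numbers below are hypothetical.

```python
# Two sources reuse customer numbers, so the natural key alone is not unique.
source_a = [{"cust_no": 1, "name": "Smith"}]
source_b = [{"cust_no": 1, "name": "Jones"}]  # same number, different person

surrogate = 0
integrated = {}
for system, rows in (("A", source_a), ("B", source_b)):
    for row in rows:
        surrogate += 1
        # The surrogate key is the warehouse identity; (system, cust_no)
        # preserves the lineage back to the operational source.
        integrated[surrogate] = {"source": (system, row["cust_no"]), **row}

print(len(integrated))  # 2 distinct warehouse customers, not 1
```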
Loading
The populating of tables that presentation applications will use to make data available to users. Among the most critical operations in any warehouse, yet often neglected.
Loading (cont)
Initial Load
- Consists of populating the tables in the warehouse
Continuous Loads
- Must be scheduled and processed in a specific order to maintain integrity, completeness, and a satisfactory level of trust
- Should be the most carefully planned step in data warehousing, or can lead to:
  - Error duplication
  - Exaggeration of inconsistencies in the data
- Usually run overnight; must maximize system resources to load data efficiently in the allotted time
- Ex. Red Brick Loader can validate, load, and index up to 12GB of data per hour on an SMP system
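The "specific order" requirement usually means dimension tables load before the fact tables that reference them, so foreign keys always resolve. A minimal sketch, with sqlite3 standing in for the warehouse and invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    amount REAL)""")

def nightly_load(dim_rows, fact_rows):
    # Dimensions first, then facts: loading in the other order would
    # leave fact rows pointing at stores the warehouse does not know yet.
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", dim_rows)
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", fact_rows)
    conn.commit()

nightly_load([(1, "Atlanta")], [(1, 99.5)])
total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 99.5
```

A real continuous load would also handle restartability and rejected rows, which the sketch omits.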
Questions to Ask
- Data Source Connectivity: Oracle, Sybase, Informix, mainframe (CICS), flat files
- Functionality: pre-built transformations
- Metadata: open architecture, reporting capability, extensibility
- Performance: engine driven, code generator, bulk loading, "data never touches the ground", multi-threaded processes
- Administration: versioning, debugging, auditing, error detection
- Modeling Tool Connectivity: ERwin, PowerDesigner
- Ease of Use: GUI interface, intuitive design, integrated toolset
- Programming Languages Supported: VB, C, C++, COBOL
- Support: 24x7, devoted staff levels
ETL Vendors
Ascential
SAS
NCR Teradata
IBM
Oracle
Vality
Firstlogic
Data Warehouses
Oracle, Red Brick, DB2, Prism, Sybase, Teradata, Informix, Microsoft SQL Server
Pricing Trends
Costs
- FireSpout ETL Engine: starts at $150K
- MetaRecon Enterprise: server package $250K, client package $50K
Pricing Trends
IBM DB2 7: ASPs and other partners pay with a percentage of the revenue received from customers once the solution is running, on a per-subscriber or per-transaction basis. A per-user base pricing model is still offered. The majority of database purchases are sold with an accompanying application and will continue to be done this way.
Formation 1.4: works with Informix databases, Red Brick Warehouse, Oracle8 Server, and Microsoft SQL Server databases. $7,500 per processor for the Formation Flow Engine.
Advantages
- Rapid Application Development (RAD)
- Allows easy maintenance
- Generates metadata automatically
- Reduces development costs
Disadvantages
- Learning curve
- Some limits to logic capabilities
www.nyoug.org/dwetl_ny.pdf
The aggregation of discrete spatial databases into a single repository, along with associated value-added tabular datasets. These often come from disparate data sources, e.g. roads from the Department of Transportation, rivers and lakes from the Department of Natural Resources, etc.
Data Types
- Vector: geometric data such as points, lines, and polygons (e.g. roads, contour lines, schools)
- Raster: continuous or image data (e.g. aerial photography)
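The vector/raster distinction can be made concrete with plain data structures; the coordinates and cell values below are invented, and real GIS software uses far richer formats.

```python
# Vector data: discrete geometries as coordinate lists (a toy road and school).
road = {"type": "LineString", "coords": [(0.0, 0.0), (1.0, 0.5), (2.0, 0.5)]}
school = {"type": "Point", "coords": [(1.5, 0.2)]}

# Raster data: a continuous surface as a grid of cell values (a toy 3x3 image).
raster = [
    [10, 12, 11],
    [13, 15, 14],
    [11, 12, 10],
]

print(len(road["coords"]), raster[1][1])  # 3 vertices; center cell value 15
```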
Demonstration
Georgia 2000 Information System
Combines spatial and tabular data from a wide variety of sources. The foundation of Georgia 2000 is the map data: for example, political boundaries, roads, water features, facilities, locations, etc. Tabular data is value-added to the map data with information such as spending patterns per county.
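The "value-added" join above amounts to attaching tabular attributes to map features through a shared key. A sketch with county FIPS codes as the key; the codes, names, and spending figures are illustrative only.

```python
# Map data keyed by county FIPS code (hypothetical feature records).
counties = {"13121": {"name": "Fulton"}, "13067": {"name": "Cobb"}}

# Value-added tabular data keyed by the same FIPS code.
spending = {"13121": 4200.0, "13067": 3100.0}

# Join the tabular attribute onto each spatial feature.
for fips, feature in counties.items():
    feature["spending_per_capita"] = spending.get(fips)

print(counties["13121"]["spending_per_capita"])  # 4200.0
```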
Errors in Geometric Features
- Geometric Errors: distortions of photography due to camera angle, height displacement, etc.
- Topological Errors: small pieces of unidentified area called slivers, which can account in total for large areas.
Spatial ETL requires systematic corrections of geometric, topological, or coordinate-system problems. Another type of spatial data can be produced by a process called geocoding, in which points are located along a network (for example, a street network). The quality of the underlying tabular data used as input affects the quality of the geocoding, and correcting this tabular data for good results requires the same types of ETL as traditional data warehousing.
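The tabular cleanup that feeds geocoding is classic ETL cleansing, e.g. standardizing addresses before matching them against the street network. The suffix table, reference network, and coordinates below are invented; real geocoders use much larger standards.

```python
# Assumed abbreviation table for street suffixes.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def standardize(address):
    """Uppercase, collapse whitespace, and normalize street suffixes."""
    parts = address.upper().split()
    return " ".join(SUFFIXES.get(p, p) for p in parts)

# A tiny reference network: standardized address -> coordinates.
network = {"100 PEACHTREE ST": (33.754, -84.39)}

def geocode(raw_address):
    # Matching succeeds only after the input is cleansed to the same form.
    return network.get(standardize(raw_address))

print(geocode("100 Peachtree Street"))  # matches despite the unabbreviated suffix
```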
Demonstrations