Unit 01 Data Warehousing

Uploaded by

Piyush

Unit 01.

Data Warehousing
Prof. Jayanand
Contents:
•Data Warehouse: Basic Concepts,
•A Multitiered Architecture,
•Enterprise Warehouse,
•Data Mart, DATA Warehouse
•Extraction, Transformation, and Loading,
•Metadata Repository.
Data Warehouse: Basic Concepts
• Def:
• A data warehouse is a subject-oriented, integrated, time-variant, non-volatile
collection of data that is used primarily in organizational decision making. (Bill
Inmon in 1990)
• A single, complete and consistent store of data obtained from a variety of
different sources, made available to end users in a form they can understand
and use in a business context.
Understanding a Data Warehouse
• A data warehouse is a database, which is kept separate from the
organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization to
analyze its business.
• A data warehouse helps executives to organize, understand, and use their
data to take strategic decisions.
• Data warehouse systems help in the integration of diverse application
systems.
• A data warehouse system helps in consolidated historical data analysis.
Data Warehouse Features
• Subject Oriented − A data warehouse is subject oriented because it
provides information around a subject rather than the organization's
ongoing operations. These subjects can be product, customers, suppliers,
sales, revenue, etc. A data warehouse does not focus on the ongoing
operations, rather it focuses on modelling and analysis of data for decision
making.
• Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
• Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information
from the historical point of view.
• Non-volatile − Non-volatile means the previous data is not erased when
new data is added to it. A data warehouse is kept separate from the
operational database, and therefore frequent changes in the operational
database are not reflected in the data warehouse.
Why a Data Warehouse is Separated from
Operational Databases?
• A data warehouse is kept separate from operational databases due
to the following reasons −
• An operational database is constructed for well-known tasks and workloads
such as searching particular records, indexing, etc. In contrast, data
warehouse queries are often complex and they present a general form of
data.
• Operational databases support concurrent processing of multiple
transactions. Concurrency control and recovery mechanisms are required for
operational databases to ensure robustness and consistency of the database.
• An operational database query allows both read and modify operations, while
an OLAP query needs only read-only access to stored data.
• An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
Data Warehouse Applications
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
| Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
|---|---|---|
| 1 | It involves historical processing of information. | It involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | It is used to analyze the business. | It is used to run the business. |
| 4 | It focuses on information out. | It focuses on data in. |
| 5 | It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model. |
| 6 | It is subject oriented. | It is application oriented. |
| 7 | It contains historical data. | It contains current data. |
| 8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data. |
| 9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data. |
| 10 | The number of users is in hundreds. | The number of users is in thousands. |
| 11 | The number of records accessed is in millions. | The number of records accessed is in tens. |
| 12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB. |
| 13 | It is highly flexible. | It provides high performance. |
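The contrast in the table can be made concrete with a small sketch. It assumes a hypothetical `sales` table in an in-memory SQLite database (the table, columns, and rows are invented for illustration): the OLTP-style statements read and modify one current record, while the OLAP-style query only reads and summarizes many records.

```python
import sqlite3

# Hypothetical sales table; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, year INTEGER, "
    "region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 2022, "North", 100.0),
        (2, 2022, "South", 150.0),
        (3, 2023, "North", 200.0),
        (4, 2023, "South", 50.0),
    ],
)

# OLTP-style access: modify and read a single, current record.
conn.execute("UPDATE sales SET amount = 120.0 WHERE order_id = 1")
oltp_row = conn.execute(
    "SELECT amount FROM sales WHERE order_id = 1"
).fetchone()

# OLAP-style access: read-only summary over all records, grouped by year.
olap_rows = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

print(oltp_row)
print(olap_rows)
```

In a real deployment the two workloads would normally run on separate systems; a single database is used here only to keep the sketch short.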
A Multitiered Architecture
Refer Written notes or Refer Alex Berson
• A data warehouse is an environment, not a product. It is based on a
relational database management system that functions as the central
repository for informational data.
• The central information repository is surrounded by a number of key
components designed to make the environment functional, manageable
and accessible.
• The source data for the data warehouse comes from operational
applications.
• The data entered into the data warehouse is transformed into an integrated
structure and format.
• The transformation process involves conversion, summarization, filtering
and condensation.
Seven Major components
1. Sourcing, Acquisition, Clean up, and Transformation Tools
2. Meta data
3. Data warehouse database
4. Data marts
5. Access tools
6. Data warehouse admin and management
7. Information delivery system
Sourcing, Acquisition, Clean up, and
Transformation Tools
• They perform conversions, summarization, key changes, structural
changes and condensation.
• The data transformation is required so that the information can be
used by decision support tools.
• The transformation produces programs, control statements, JCL code,
COBOL code, UNIX scripts, and SQL DDL code etc., to move the data
into data warehouse from multiple operational systems.
• Functionalities:
• Removing unwanted data from operational db
• Converting to common data names and attributes
• Calculating summaries and derived data
• Establishing defaults for missing data
• Accommodating source data definition changes
• Issues:
• Database heterogeneity: different DBMSs are in use
• Data heterogeneity: the way data is defined differs between sources
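Several of the functionalities listed above can be sketched in a few lines. This is a minimal illustration, not a real transformation tool: the field names, the `NAME_MAP` and `DEFAULTS` tables, and the derived `total` field are all assumptions made up for the example.

```python
def transform(record, name_map, defaults):
    """Rename source fields to common names, fill defaults, add a derived value."""
    # Keep only mapped fields (drops unwanted data) and convert to common names.
    out = {name_map[k]: v for k, v in record.items() if k in name_map}
    # Establish defaults for missing or empty values.
    for field, default in defaults.items():
        if out.get(field) is None:
            out[field] = default
    # Calculate derived data from the cleaned fields.
    out["total"] = out["quantity"] * out["unit_price"]
    return out


# Hypothetical mapping from one source system's names to common names.
NAME_MAP = {"cust_nm": "customer_name", "qty": "quantity", "price": "unit_price"}
DEFAULTS = {"customer_name": "UNKNOWN", "quantity": 0, "unit_price": 0.0}

row = transform(
    {"cust_nm": None, "qty": 3, "price": 2.5, "debug_flag": "x"},
    NAME_MAP,
    DEFAULTS,
)
print(row)
```

Keeping the name mapping and defaults in data rather than in code is one way to accommodate source definition changes: a renamed source column then needs only a new `NAME_MAP` entry.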
Data warehouse database
• This is the central part of the data warehousing environment.
• This is implemented based on RDBMS technology
Meta data
• It is data about data. It is used for maintaining, managing and using
the data warehouse.
• It is classified into two:
• Technical Meta data:
• Business Meta data:
Technical Meta data
• It contains information about data warehouse data used by
warehouse designers and administrators to carry out development and
management tasks.
• It includes-
• Information about data stores
• Transformation descriptions, that is, the mapping methods from operational db to
warehouse db
• Warehouse object and data structure definitions for target data
• The rules for data clean up and data enhancement
Business Meta data
• It contains information that gives users an easy-to-understand perspective
of the information stored in the data warehouse.
• It includes:
• Subject areas, and info object types including queries, reports, images, video,
audio clips etc.
• Internet home pages
• Info related to the info delivery system
• Data warehouse operational info such as ownerships, audit trails etc.
• Meta data helps the users to understand content and find the data.
• Meta data is stored in a separate data store, known as the
information directory or Meta data repository, which helps to
integrate, maintain and view the contents of the data warehouse.
Technical requirements of metadata
repository
• Should be a gateway to the data warehouse environment
• It should support easy distribution and replication of content for high
performance and availability
• Should be searchable by business oriented key words
• It should act as a launch platform for end user to access data and
analysis tools
• It should support the sharing of information
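As a rough sketch of what such a repository holds, the example below stores a mix of technical and business metadata per warehouse table and supports search by business keyword. The class names, fields, and sample entries are hypothetical and chosen only to mirror the items listed in these notes.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    table: str            # warehouse object being described
    source: str           # technical: operational source of the data
    transformation: str   # technical: mapping rule applied during loading
    description: str      # business: plain-language meaning
    owner: str            # business: data ownership
    keywords: set = field(default_factory=set)


class MetadataRepository:
    """Toy in-memory directory of warehouse objects (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.table] = entry

    def search(self, keyword):
        """Return names of tables tagged with a business keyword."""
        kw = keyword.lower()
        return sorted(
            e.table
            for e in self._entries.values()
            if kw in {k.lower() for k in e.keywords}
        )


repo = MetadataRepository()
repo.register(MetadataEntry(
    "fact_sales", "orders_db.orders", "sum order lines per day",
    "Daily sales totals", "Sales dept", {"revenue", "sales"},
))
repo.register(MetadataEntry(
    "dim_customer", "crm.customers", "merge duplicate customers",
    "One row per customer", "Marketing dept", {"customer"},
))
print(repo.search("Revenue"))
```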
Access tools
• Its purpose is to provide info to business users for decision making.
• There are five categories:
1. Data query and reporting tools
2. Application development tools
3. Executive info system tools (EIS)
4. OLAP tools
5. Data mining tools
• Data query and reporting tools
• Query and reporting tools are used to generate queries and reports.
• There are two types of reporting tools:
• Production reporting tools, used to generate regular operational reports.
• Desktop report writers, inexpensive desktop tools designed for end users.
• Managed query tools: used to generate SQL queries. They place a meta layer
of software between users and databases, which offers point-and-click
creation of SQL statements.
• Application development tools: a graphical data access
environment which integrates OLAP tools with the data warehouse and
can be used to access all db systems. Ex. Visual Basic etc.
• OLAP tools are used to analyze the data in multidimensional and
complex views. To enable multidimensional properties they use MDDB
and MRDB, where MDDB refers to a multidimensional database and
MRDB refers to a multirelational database. Used for sales forecasting,
marketing campaigns, etc.
• Data mining tools are used to discover knowledge from the data
warehouse data. They can also be used for data visualization and data
correction purposes.
• Data visualization: displaying and looking at data. It is a method of
presenting the output that goes beyond pie charts and includes 3D imaging,
video, etc.
Data marts
• Departmental subsets that focus on selected subjects.
• They are independent and used by a dedicated user group.
• They are used for rapid delivery of enhanced decision support functionality
to end users.
• A data mart is used in the following situations:
• Extremely urgent user requirements
• The absence of a budget for a full-scale data warehouse strategy
• The decentralization of business needs
• Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so
while designing it the organization has to pay more attention to
system scalability, consistency and manageability issues
2. Data integration
Data warehouse admin and management
• The management of data warehouse includes:
• Security and priority management
• Monitoring updates from multiple sources
• Data quality checks
• Managing and updating meta data
• Auditing and reporting data warehouse usage and status
• Purging data
• Replicating, sub setting and distributing data
• Backup and recovery
Information delivery system
• It delivers warehouse data to one or more destinations according to a
specified scheduling algorithm.
Data Warehouse Models/Types
• Enterprise Data Warehouse
• ODS (Operational Data Store)
• Data Mart
Enterprise Data Warehouse
• An enterprise warehouse collects all the information and the subjects
spanning an entire organization
• It provides enterprise-wide data integration.
• The data is integrated from operational systems and external
information providers.
• This information can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.
ODS (Operational Data Store)
• An operational data store (ODS) is an alternative to having
operational decision support system (DSS) applications access data
directly from the database that supports transaction processing (TP).
• While both the ODS and the data warehouse require a significant amount of
planning, the ODS tends to focus on the operational requirements of a
particular business process (for example, customer service), and on the
need to allow updates and propagate those updates back to the source
operational system from which the data elements were obtained.
• The data warehouse, on the other hand, provides an architecture for
decision makers to access data to perform strategic analysis, which
often involves historical and cross-functional data and the need to
support many applications.
Data Mart
• A data mart is a repository of data that is designed to serve a
particular community of knowledge workers.
• Data marts enable users to retrieve information for single
departments or subjects, improving the user response time.
• Because data marts catalog specific data, they often require less
space than enterprise data warehouses, making them easier to search
and cheaper to run
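One simple way to picture a data mart as a departmental subset is a view over a larger warehouse table. The sketch below assumes a hypothetical `warehouse_sales` table and a `marketing_mart` view in SQLite; note that it shows a mart derived from a warehouse, whereas the independent marts described earlier are built directly from source systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise-wide table covering several departments.
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "banner", 300.0),
        ("marketing", "flyer", 40.0),
        ("finance", "audit", 900.0),
    ],
)

# The mart exposes only the rows and columns one user group needs.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT item, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT item, amount FROM marketing_mart ORDER BY item"
).fetchall()
print(mart_rows)
```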
Terms in data warehousing
• Metadata:
• Metadata is simply defined as data about data.
• The data that are used to represent other data is known as metadata.
• For example, the index of a book serves as metadata for the contents of the
book.
• In other words, we can say that metadata is the summarized data that leads
us to the detailed data.
• In terms of data warehouse, we can define metadata as follows −
• Metadata is a road-map to data warehouse.
• Metadata in data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system
to locate the contents of a data warehouse.
• Metadata Repository:
• Business metadata − It contains the data ownership information,
business definition, and changing policies.
• Operational metadata − It includes currency of data and data
lineage. Currency of data refers to the data being active, archived,
or purged. Lineage of data means history of data migrated and
transformation applied on it.
• Data for mapping from operational environment to data
warehouse − metadata includes source databases and their
contents, data extraction, data partition, cleaning, transformation
rules, data refresh and purging rules.
• The algorithms for summarization − It includes dimension
algorithms, data on granularity, aggregation, summarizing, etc
Data Cube
• A data cube helps us represent data in multiple dimensions. It is
defined by dimensions and facts.
• The dimensions are the entities with respect to which an enterprise
preserves the records.
• Suppose a company wants to keep track of sales records with the help
of a sales data warehouse with respect to time, item, branch, and
location. These dimensions allow the company to keep track of monthly
sales and of the branch at which each item was sold. There is a table
associated with each dimension, known as a dimension table. For example,
the "item" dimension table may have attributes such as item_name,
item_type, and item_brand.
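The sales example above can be sketched as a list of fact records keyed by the four dimensions, with a helper that aggregates a measure over any chosen subset of dimensions. The records, the `units_sold` measure, and the `roll_up` helper are assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical facts: one measure (units_sold) described by four dimensions.
facts = [
    {"time": "2023-01", "item": "pen", "branch": "B1", "location": "Pune", "units_sold": 10},
    {"time": "2023-01", "item": "pen", "branch": "B2", "location": "Pune", "units_sold": 5},
    {"time": "2023-02", "item": "pen", "branch": "B1", "location": "Pune", "units_sold": 7},
    {"time": "2023-02", "item": "ink", "branch": "B1", "location": "Pune", "units_sold": 2},
]


def roll_up(records, dims, measure):
    """Sum a measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        totals[key] += rec[measure]
    return dict(totals)


monthly_sales = roll_up(facts, ["time"], "units_sold")
sales_by_item_and_branch = roll_up(facts, ["item", "branch"], "units_sold")
print(monthly_sales)
print(sales_by_item_and_branch)
```

Each choice of dimensions corresponds to one view of the cube; dropping a dimension from `dims` summarizes over it.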
Extraction, Transformation, and Loading (ETL)
• You need to load your data warehouse regularly so that it can serve
its purpose of facilitating business analysis.
• The process of extracting data from source systems and bringing it
into the data warehouse is commonly called ETL, which stands for
extraction, transformation, and loading.
Extraction
• During extraction, the desired data is identified and extracted from many
different sources, including database systems and applications.
• Very often, it is not possible to identify the specific subset of interest.
More data than necessary then has to be extracted, and the relevant data is
identified at a later point in time.
• Depending on the source system's capabilities (for example, operating system
resources), some transformations may take place during this extraction process.
• The size of the extracted data varies from hundreds of kilobytes up to gigabytes,
depending on the source system and the business situation.
• The same is true for the time delta between two (logically) identical extractions:
the time span may vary between days/hours and minutes to near real-time.
• Web server log files, for example, can easily grow to hundreds of megabytes in a
very short period of time.
• Designing and creating the extraction process is often one of the most
time-consuming tasks in the ETL process and, indeed, in the entire
data warehousing process.
• The source systems might be very complex and poorly documented,
and thus determining which data needs to be extracted can be
difficult.
• Normally the data has to be extracted not just once but several
times, in a periodic manner, to supply all changed data to the data
warehouse and keep it up-to-date.
Designing Extraction process means making
decisions about
• Which extraction method do I choose?
• This influences the source system, the transportation process, and the
time needed for refreshing the warehouse.
• How do I provide the extracted data for further processing?
Extraction Methods
• Logical Extraction Methods
• Full Extraction:
• The data is extracted completely from the source system.
• Because this extraction reflects all the data currently available on the source
system, there's no need to keep track of changes to the data source since the
last successful extraction
• Incremental Extraction
• At a specific point in time, only the data that has changed since a well-defined
event back in history is extracted (the delta change).
• Ex. Oracle's Change Data Capture (CDC) mechanism can extract and maintain
such delta information.
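The difference between the two logical methods can be sketched as follows. The example assumes each source row carries a hypothetical `updated_at` timestamp and that the time of the last successful extraction is remembered; this is a simple timestamp-based approach for illustration and is not how Oracle's CDC works internally.

```python
from datetime import datetime


def full_extract(rows):
    """Full extraction: take everything, no change tracking needed."""
    return list(rows)


def incremental_extract(rows, last_extracted_at):
    """Incremental extraction: take only rows changed since the last run."""
    return [r for r in rows if r["updated_at"] > last_extracted_at]


# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]

last_run = datetime(2024, 1, 1, 12, 0)
everything = full_extract(source_rows)
delta = incremental_extract(source_rows, last_run)
print(len(everything), [r["id"] for r in delta])
```

The trade-off is visible even at this scale: the full extract needs no bookkeeping but moves every row, while the incremental extract moves less data but depends on the source reliably recording when rows change.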
Extraction Methods
• Physical Extraction Methods:
• Depending on the chosen logical extraction method and the capabilities and
restrictions on the source side, the extracted data can be physically extracted
by two mechanisms.
• The data can either be extracted online from the source system or from an
offline structure
• Online Extraction: The data is extracted directly from the source
system itself
• Offline Extraction: The data is not extracted directly from the source
system but is staged explicitly outside the original source system.
Transformation
• After data is extracted, it has to be physically transported to the
target system or to an intermediate system for further processing.
• Depending on the chosen way of transportation, some
transformations can be done during this process, too.
• For example, a SQL statement which directly accesses a remote target
through a gateway can concatenate two columns as part of the
SELECT statement.
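The column concatenation mentioned above can be shown with a short sketch. A local in-memory SQLite database stands in for the remote system reached through a gateway, and the `customers` table and its columns are hypothetical; the point is only that the transformation happens inside the SELECT that moves the data.

```python
import sqlite3

# Local stand-in for the remote database; table and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],
)

# The || operator concatenates the two columns while the rows are read,
# so the data arrives already transformed.
full_names = [
    row[0]
    for row in conn.execute(
        "SELECT first_name || ' ' || last_name FROM customers ORDER BY last_name"
    )
]
print(full_names)
```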
Loading
• Once all the data has been cleansed and transformed into a structure
consistent with the data warehouse requirements, data is ready for
loading into the data warehouse.
• The initial load of the data warehouse consists of populating the
tables in the data warehouse schema and then checking that the data
is ready for use.
• Designing and maintaining the ETL process is often considered one of
the most difficult and resource-intensive portions of a data
warehouse project.
• Many data warehousing projects use ETL tools to manage this
process.
• Oracle Warehouse Builder (OWB), for example, provides ETL
capabilities and takes advantage of inherent database abilities.
• Other data warehouse builders create their own ETL tools and
processes, either inside or outside the database.
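A minimal sketch of an initial load, assuming a hypothetical `fact_sales` table in SQLite: the rows are inserted in one transaction and a simple row-count check then confirms that the table is populated as expected. Real ETL tools perform far more extensive checks; the function name and schema here are invented for the example.

```python
import sqlite3


def initial_load(conn, rows):
    """Populate the (hypothetical) fact_sales table and verify the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    # The connection context manager commits on success, rolls back on error.
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
    # For an initial load into an empty table, the counts should match.
    return loaded == len(rows)


conn = sqlite3.connect(":memory:")
load_ok = initial_load(conn, [(1, 10.0), (2, 20.5), (3, 7.25)])
print(load_ok)
```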
END of UNIT 01

References:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
Third Edition, Elsevier Publication
