UNIT-1 -DWDM

The document provides an overview of data warehousing, emphasizing its importance for decision support in organizations. It discusses the architecture, components, and processes involved in building a data warehouse, including data sourcing, transformation, and management. Additionally, it highlights the significance of metadata and the various tools used for data access and analysis.

Uploaded by

thenvithi

Data Warehousing and Data Mining

UNIT – I
Data Warehousing

Prepared by
[Link]
Assistant Professor
PG & Research Department of Computer Science & Data Analytics
Tiruppur Kumaran College for Women – Tirupur.

1
Introduction

2
What is Data Warehousing
Data Warehousing is an architectural construct of information
systems that provides users with current and historical decision
support information that is hard to access or present in traditional
operational data stores

The need for data warehousing

•Business perspective
–In order to survive and succeed in today’s highly
competitive global environment
•Decisions need to be made quickly and correctly
•The amount of data doubles every 18 months, which affects
response time and the sheer ability to comprehend its content
•Rapid changes

3
Business Problem Definition
Providing organizations with a sustainable competitive advantage in areas such as:

 Customer retention

 Sales and customer service

 Marketing

 Risk assessment and fraud detection

4
Business problem and data warehousing
Classified into
Retrospective analysis:
Focuses on the issues of past and present events.

Predictive analysis:
Focuses on certain events or behavior based on historical
information. Further classified into
Classification:
Used to classify database records into a number of
predefined classes based on certain criteria.
Clustering:
Used to segment a database into subsets or clusters based on
a set of attributes

IFETCE/CSE/III YEAR/VI
SEM/IT6702/DWDM/PPT/UNIT-1/ VER 1.2 5
Association:
Identifies affinities among the collection of items, as reflected in the
examined records.
Sequencing:
Helps identify patterns over time, allowing, for example, an analysis
of customers' purchases during separate visits.
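
The association idea above can be sketched as a simple co-occurrence count. This is only a minimal illustration: the baskets and the `pair_counts` helper are hypothetical and are not part of the original slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction records: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

def pair_counts(baskets):
    """Count how often each unordered pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that occur together most often are the candidate affinities; real association tools also weigh such counts against how common each item is on its own.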

Operational and Informational Data Store

Operational Data
Focuses on transactional functions such as bank card
withdrawals and deposits
•Detailed
•Updateable
•Reflects current values

6
Informational Data
Informational data is organized around subjects such as
customer, vendor, and product, and answers questions such as
"What are the total sales today?"
It focuses on providing answers to problems posed by
decision makers.
• Summarized

•Nonupdateable

Operational data store.

An operational data store (ODS) is an architectural concept
that supports day-to-day operational decision support and contains
current-value data propagated from operational applications.

7
A data warehouse is a subject-oriented, integrated,
nonvolatile, time-variant collection of data in support of
management's decisions. [WH Inmon]
Subject Oriented
Data is organized around major subjects, such as customer, vendor, and
product, to help analyze the data.
Integrated
The data in the data warehouse is loaded from different
sources that store the data in different formats and focus on
different aspects of the subject.

8
Nonvolatile
Nonvolatile means that, once entered into the warehouse,
data should not change.
Time Variant
Provides information from a historical perspective.

9
Data Warehouse Architecture
Seven data warehouse components
 Data sourcing, cleanup, transformation, and migration tools

 Metadata repository

 Warehouse/database technology

 Data marts

 Data query, reporting, analysis, and mining tools

 Data warehouse administration and management

 Information delivery system

1. Data Warehousing Components

12
Data Warehousing Components
Operational data and processing are completely separate
from data warehouse processing.
Data Warehouse Database
It is an important component (marked as 2 in the diagram) of
the warehouse environment.
In addition to transaction operations, it must support ad hoc query
processing and flexible user view creation, including
aggregation, multiple joins, and drill-down. Approaches include:
• Parallel relational database designs that require a parallel
computing platform.
• New index structures that speed up a traditional RDBMS.
• Multidimensional databases (MDDBs) that are based on
proprietary database technology or implemented using an already
familiar RDBMS.

13
Sourcing, Acquisition, Cleaning, and Transformation tools

 Removing unwanted data from operational database

 Converting to common data names and definitions

• Calculating summaries and derived data.

• Establishing defaults for missing data.

 Accommodating source data definition changes.
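
The cleanup and transformation tasks listed above can be sketched for a single record as follows. All field names, the default value, and the `clean_record` helper are assumptions made for illustration; the slides do not prescribe any particular tool or schema.

```python
# Hypothetical mapping from source field names to common warehouse names.
FIELD_MAP = {"cust_nm": "customer_name", "amt": "amount", "qty": "quantity"}
DEFAULTS = {"quantity": 1}      # assumed default for missing data
UNWANTED = {"session_token"}    # assumed operational-only field

def clean_record(source):
    """Apply the cleanup steps to one operational record (a dict)."""
    record = {}
    for name, value in source.items():
        if name in UNWANTED:
            continue                                # remove unwanted data
        record[FIELD_MAP.get(name, name)] = value   # convert to common names
    for name, default in DEFAULTS.items():
        if record.get(name) is None:
            record[name] = default                  # default for missing data
    # Derived data: a value computed from other fields.
    record["total"] = record["amount"] * record["quantity"]
    return record

row = clean_record(
    {"cust_nm": "Mala", "amt": 250.0, "qty": None, "session_token": "x"}
)
print(row)
```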

14
Metadata
 data about data
 Used for building, maintaining, and using the data warehouse
 Classified into
Technical metadata
 Information about data sources
• Transformation descriptions, i.e., the mapping methods from
operational databases into the warehouse and the algorithms used to
convert, enhance, or transform data.
 Warehouse objects and data structure definitions for data targets.
 The rules used to perform data cleanup and data enhancement.

15
 Data mapping operations when capturing data from source
systems and applying to the target warehouse database.
• Access authorization, backup history, archive history,
information delivery history, data acquisition history, data access,
etc.
Business metadata
Gives perspective of the information stored in the data warehouse
 Subject areas and information object type, including queries,
reports, images, video, and / or audio clips.
 Internet home pages.
 Other information to support all data warehouse components.
 Data warehouse operational information e.g., data history,
ownership, extract, audit trail, usage data.

16
Access Tools

The tools are divided into five main groups.

 Data query and reporting tools

 Application development tools

 Executive information system (EIS) tools

 On-line analytical processing tools

 Data mining tools

17
Query and reporting tools
This category can be further divided into two groups.

 Reporting tools

 Managed query tools

Managed query tools shield end users from the complexities
of SQL and database structures by inserting a metalayer between
users and the database.

Applications

Applications developed using a language for the users

18
OLAP
Based on the concepts of multidimensional database
Data mining
The process of discovering meaningful new correlations, patterns, and
trends by digging into (mining) large amounts of data stored in the
warehouse, using artificial-intelligence (AI), statistical, and
mathematical techniques.
Discover knowledge. The goal of knowledge discovery is to
determine the following things.
 Segmentation
 Classification
 Association
 Preferencing
Visualize data. Prior to any analysis, the goal is to “humanize” the
mass of data analysts must deal with and find a clever way to display
the data.
Correct data. While consolidating massive databases, many enterprises
find that the data is not complete and invariably contains erroneous
and contradictory information. Data mining techniques can help
identify and correct problems in the most consistent way possible.
Data visualization
Presenting the output of all the previously mentioned tools
Colors, shapes, 3-D images, sound, and virtual reality
Data Marts
A data store that is subsidiary to the data warehouse.
It is a partition of data created for the use of a dedicated
group of users.
It may be placed on the data warehouse database rather than kept as a
separate store of data.
In most instances, however, the data mart is a physically separate store of
data and normally resides on a separate database server.
Data Warehouse administration and Management
 Managing data warehouse includes
 Security and priority management
• Monitoring updates from multiple sources
• Data quality checks
• Managing and updating metadata
• Auditing and reporting data warehouse usage and status
• Replicating, subsetting, and distributing data
• Backup and recovery
 Data warehouse storage management
21
Information delivery system

• The information delivery system distributes warehouse-stored
data and other information objects to other data warehouses
and end-user products such as spreadsheets and local databases.

• Delivery of information may be based on the time of day or on the
completion of an external event.

22
2. Building a Data Warehouse

23
Building a Data Warehouse
Business considerations
Return on Investment
Approach
The information scope of the data warehouse varies with the
business requirements, business priorities, and magnitude of the
problem
Two data warehouses
Marketing
Personnel
 The top-down approach
Building an enterprise data warehouse with subset data marts.
 The bottom-up approach
Results in developing individual data marts, which are then
integrated into the enterprise data warehouse.

24
Organizational issues
A data warehouse implementation is not truly a technological
issue; rather, it should be more concerned with identifying and
establishing information requirements, the data sources that fulfill these
requirements, and timeliness.
Design considerations
A data warehouse’s design point is to consolidate data from
multiple, often heterogeneous, sources into a query database. The
main factors include
 Heterogeneity of data sources, which affects data conversion,
quality, timeliness
 Use of historical data, which implies that data may be “old”.
 Tendency of databases to grow very large

25
Data content
A data warehouse may contain detailed data, but the data is
cleaned up and transformed to fit the warehouse model, and
certain transactional attributes of the data are filtered out.
Metadata
A data warehouse design should ensure that there is a
mechanism that populates and maintains the metadata
repository, and that all access paths to the data warehouse
have metadata as an entry point.
Data distribution
One of the challenges when designing a data warehouse is
to know how the data should be divided across multiple
servers and which users should get access to which types of
data.

26
 The data placement and distribution design should consider
several options, including data distribution by subject area,
location, or time.
Tools
• Each tool takes a slightly different approach to data
warehousing and often maintains its own version of the metadata,
which is placed in a tool-specific, proprietary metadata
repository.
• The designers have to make sure that all selected
tools are compatible with the given data warehouse
environment and with each other.

27
Performance considerations
Rapid query processing is a highly desired feature that should
be designed into the data warehouse.
Design the warehouse database to avoid most of the
expensive operations, such as multitable searches and joins.
Nine decisions in the design of data warehouse
1. Choosing the subject matter.
2. Deciding what a fact table represents.
3. Identifying and confirming the dimensions.
4. Choosing the facts.
5. Storing precalculations in the fact table.
6. Rounding out the dimension tables.
7. Choosing the duration of the database.
8. The need to track slowly changing dimensions.
9. Deciding the query priorities and the query modes

28
Technical Considerations
 The hardware platform that would house the data warehouse
 The database management system that supports the warehouse
database.
 The communications infrastructure that connects the
warehouse, data marts, operational systems, and end users.
 The hardware platform and software to support the metadata
repository.
• The systems management framework that enables centralized
management and administration of the entire environment.
Hardware platforms
An important consideration when choosing a data warehouse server is
its capacity for handling the
volumes of data required by decision support applications,
some of which may require a significant amount of historical
data.

29
• This capacity requirement can be quite large.
• A data warehouse residing on the mainframe is best suited
for situations involving large amounts of data.
• The data warehouse server has to be able to support large data
volumes and complex query processing.
Balanced approach.
 An important design point when selecting a scalable
computing platform is the right balance between all computing
components
Data warehouse and DBMS specialization
• The requirements for the data warehouse DBMS are
performance, throughput, and scalability, because the database is
large and complex ad hoc queries must be processed in a
relatively short time.
• Some databases have been optimized specifically for data
warehousing.

30
Communications infrastructure
Bringing access to corporate data directly to the desktop carries
cost and effort: communications networks may have to be expanded, and new
hardware and software may have to be purchased.
Implementation Considerations
Data warehouse implementation requires the integration of
many products within a data warehouse.
 The steps needed to build a data warehouse are as follows.
 Collect and analyze business requirements.
 Create a data model and a physical design for the data
warehouse.
 Define data warehouse.
 Choose the database technology and platform for the warehouse.
 Extract the data from the operational databases, transform it,
clean it up, and load it into the database.
 Choose the database access and reporting tools.
 Choose database connectivity software.
 Choose data analysis and presentation software.
 Update the data warehouse.
Access tools
A suite of tools is needed to handle all possible data
warehouse access needs, and the selection of tools is based on the
definition of different types of access to the data:
 Simple tabular form reporting.
 Ranking.
 Multivariable analysis.
 Time series analysis.
 Data visualization, graphing, charting and pivoting.
 Complex textual search.

 Statistical analysis.
 Artificial intelligence techniques for testing of hypothesis, trend
discovery, definition and validation of data cluster and
segments.
 Information mapping
 Ad hoc user-specified queries
 Predefined repeatable queries
 Interactive drill-down reporting and analysis.
 Complex queries with multitable joins, multilevel sub queries,
and sophisticated search criteria.
Data extraction, cleanup, transformation and migration
When selecting data extraction tools, the ability to transform, consolidate,
integrate, and repair the data should be considered.

33
 A field-level data examination for the transformation of data
into information is needed.
 The ability to perform data-type and character-set translation is
a requirement when moving data between incompatible
systems.
• The capability to create summarization, aggregation, and
derivation records and fields is very important.
• The data warehouse database management system should be able to
perform the load directly from the tool, using the native API
available with the RDBMS.
 Vendor stability and support for the product are items that must
be carefully evaluated.
Data placement strategies
As a data warehouse grows, there are at least two options for
data placement. One is to put some of the data in the data
warehouse onto other storage media, e.g., WORM, RAID, or
photo-optical technology.

34
The second option is to distribute the data in the data
warehouse across multiple servers
Data replication
Placing data that is relevant to a particular workgroup in a localized
database can be a more affordable solution than data
warehousing.
Replication technology creates copies of databases on a
periodic basis, so that data entry and data analysis can be
performed separately.
Metadata
Metadata is the roadmap to the information stored in the
warehouse
The metadata has to be available to all warehouse users in
order to guide them as they use the warehouse.

35
User sophistication levels
• Casual users

• Power users.

• Experts

Integrated Solutions
A number of vendors participate in data warehousing by
providing a suite of services and products that go beyond one
particular component of the data warehouse.

36
Digital Equipment Corp. Digital has combined the data
modeling, extraction, and cleansing capabilities of Prism
Warehouse Manager with the copy management and data
replication capabilities of Digital’s ACCESSWORKS family of
database access servers, giving users the ability to
build and use information warehouses.

Hewlett-Packard. Hewlett-Packard’s client/server-based HP open
warehouse comprises multiple components, including a data
management architecture, the HP-UX operating system, HP
9000 computers, warehouse management tools, and the HP
information Access query tool.

37
 IBM. The IBM information warehouse framework consists of an
architecture; data management tools; OS/2, AIX, and MVS
operating systems; hardware platforms, including mainframes
and servers; and a relational DBMS (DB2).

• Sequent. Sequent Computer Systems Inc.’s DecisionPoint
Program is a decision support program for the delivery of data
warehouses dedicated to on-line complex query processing
(OLCP). Using graphical interfaces users query the data
warehouse by pointing and clicking on the warehouse data item
they want to analyze. Query results are placed on the program’s
clipboard for pasting onto a variety of desktop applications, or
they can be saved on to a disk.

38
Benefits of Data Warehousing

Data warehouse usage includes

 Locating the right information

 Presentation of Information (reports, graphs).

 Testing of hypothesis

 Sharing and the analysis

39
Tangible benefits

• Product inventory turnover is improved.

• Costs of product introduction are decreased with improved selection
of target markets.

• More cost-effective decision making is enabled by the increased
quality and flexibility of market analysis available through
multilevel data structures, which may range from detailed to highly
summarized.

• Enhanced asset and liability management means that a data
warehouse can provide a “big picture” of enterprise-wide
purchasing and inventory patterns.
40
Intangible benefits

The intangible benefits include.

• Improved productivity, by keeping all required data in a single
location.

• Reduced redundant processing.

• Enhanced customer relations through improved knowledge of
individual requirements and trends.

• Enabling business process reengineering.

41
3. Mapping the Warehouse to a Multiprocessor Architecture

42
Mapping the Warehouse to a Multiprocessor Architecture
Relational Database Technology for Data Warehouse
The Data warehouse environment needs
• Speed-up
• Scale-up
Parallel hardware architectures, parallel operating systems,
and parallel database management systems provide what the
warehouse environment requires.
Types of parallelism
Interquery parallelism
Threads (or processes) handle multiple requests at the same time.
Intraquery parallelism
Scan, join, sort, and aggregation operations are executed
in parallel.

43
Intraquery parallelism can be done in either of two ways

Horizontal parallelism

The database is partitioned across multiple disks, and parallel
processing occurs within a specific task that is performed
concurrently on different sets of data.

Vertical parallelism

Output from one task (e.g., scan) becomes input to
another task (e.g., join) as soon as records become available.
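
The two forms of intraquery parallelism can be contrasted in a small sketch. The partitions, the filter condition, and the function names are hypothetical. A Python thread pool and generators only illustrate the data flow; they do not reproduce how a parallel DBMS schedules work, and the generator pipeline hands over one record at a time without running the stages concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split into three partitions (as if on three disks).
partitions = [[5, 12, 7], [20, 3], [8, 15, 1]]

def scan_and_sum(partition):
    """The same task (scan + aggregate) applied to one partition."""
    return sum(value for value in partition if value > 4)

# Horizontal parallelism: one task runs concurrently on different data sets.
with ThreadPoolExecutor(max_workers=3) as pool:
    horizontal_total = sum(pool.map(scan_and_sum, partitions))

# Vertical parallelism (pipelining): each stage consumes records
# as soon as the previous stage produces them.
def scan(rows):
    for row in rows:
        yield row

def select(rows):
    for row in rows:
        if row > 4:
            yield row

all_rows = [value for part in partitions for value in part]
vertical_total = sum(select(scan(all_rows)))
print(horizontal_total, vertical_total)
```

Both approaches compute the same aggregate; they differ only in whether the work is split by data (horizontal) or by operation (vertical).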

44
Data Partitioning

Spreads data from database tables across multiple disks so
that I/O operations such as read and write can be performed in
parallel.

Random partitioning

Includes data striping across multiple disks on a
single server. Another option for random partitioning is
round-robin partitioning, in which each new record is placed on the
next disk assigned to the database.
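
Round-robin placement can be sketched in a few lines. The record labels and the `round_robin_partition` helper are hypothetical, and disks are modelled as plain lists.

```python
def round_robin_partition(records, disk_count):
    """Place each new record on the next disk in turn."""
    disks = [[] for _ in range(disk_count)]
    for position, record in enumerate(records):
        disks[position % disk_count].append(record)
    return disks

disks = round_robin_partition(["r1", "r2", "r3", "r4", "r5"], 3)
print(disks)
```

The load is spread evenly, but finding a given record later means searching every disk, which is the weakness that intelligent partitioning addresses.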

[Figure: response time compared across four cases (Case 1 to Case 4): serial RDBMS, horizontal parallelism (data partitioning), and vertical parallelism (query decomposition).]

Intelligent partitioning
The DBMS knows where a specific record is located and does
not waste time searching for it across all disks.
• Hash partitioning. A hash algorithm is used to calculate the
partition number (hash value) based on the value of the partitioning
key for each row.
• Key range partitioning. Rows are placed and located in the
partitions according to the value of the partitioning key (all rows
with a key value from A to K are in partition 1, L to T are in
partition 2, etc.).

• Schema partitioning. An entire table is placed on one disk, another
table is placed on a different disk, etc. This is useful for small
reference tables that are more effectively used when replicated in
each partition rather than spread across partitions.

• User-defined partitioning. This method allows
a table to be partitioned on the basis of a user-defined expression.
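
Hash and key range partitioning can be sketched as two small functions. The use of `zlib.crc32`, the partition count, and the third key range are assumptions made for illustration; the slides do not say which hash algorithm a DBMS uses or how keys after T are assigned.

```python
import zlib

def hash_partition(key, partition_count):
    """Hash partitioning: derive the partition number from the key's hash."""
    # zlib.crc32 is used only because it is deterministic across runs;
    # it is a stand-in, not the hash function of any particular DBMS.
    return zlib.crc32(key.encode("utf-8")) % partition_count

def key_range_partition(key):
    """Key range partitioning: A to K -> 1, L to T -> 2, the rest -> 3."""
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3  # assumed third partition for the remaining keys

names = ["Karthik", "Mala", "Visu"]
range_partitions = [key_range_partition(name) for name in names]
hash_partitions = [hash_partition(name, 4) for name in names]
print(range_partitions, hash_partitions)
```

In both cases the partition can be recomputed from the key alone, so the DBMS can go straight to the right disk instead of searching all of them.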
47
Database Architectures for
Parallel Processing

Shared-memory architecture - multiple processors share the main
memory space, as well as mass storage (e.g., hard disk drives).

Shared-disk architecture - each node has its own
main memory, but all nodes share mass storage, usually
a storage area network.

Shared-nothing architecture - each node has its
own mass storage as well as main memory.
Database Architecture for Parallel Processing

Shared-memory architecture - SMP (symmetric multiprocessors)

Multiple database components executing SQL statements
communicate with each other by exchanging messages and
data via the shared memory.

49
Scalability can be achieved through process-based multitasking
or thread-based multitasking.

[Figure: shared-memory architecture. Processor units (PUs) are connected through an interconnection network to a global shared memory.]

50
Shared-disk architecture
The entire database is shared between RDBMS servers, each of
which runs on a node of a distributed-memory system.
Each RDBMS server can read, write, update, and delete
records from the same shared database.
Implemented by using a distributed lock manager (DLM).
Disadvantages.
Because all nodes read and update the same data, the
RDBMS and its DLM have to spend a lot of resources
synchronizing multiple buffer pools.
It may have to handle significant message traffic in a highly
utilized RDBMS environment.

51
Advantages.

It reduces performance bottlenecks resulting from data skew
(an uneven distribution of data) and can significantly increase
system availability.

It eliminates the memory access bottleneck typical of large
SMP systems and helps reduce DBMS dependency on data
partitioning.

52
[Figure 4.3: Distributed-memory shared-disk architecture. Processor units (PUs), each with local memory, are connected by an interconnection network and share a global disk subsystem.]

53
Shared-nothing architecture
Each processor has its own memory and disk, and communicates
with other processors by exchanging messages and data over the
interconnection network.

Disadvantages.

It is the most difficult to implement.
It requires a new programming paradigm.

[Figure: shared-nothing architecture. Processor units (PUs), each with local memory, are connected by an interconnection network.]

54
Combined architecture
• A combined hardware architecture could be a cluster of SMP
nodes.
• A combined parallel DBMS architecture should support
interserver parallelism of distributed-memory MPPs and
intraserver parallelism of SMP nodes.

Parallel RDBMS features

 Scope and techniques of parallel DBMS operations
 Optimizer implementation
 Application transparency
 The parallel environment
 DBMS management tool

55
4. DBMS Schemas for Decision Support

56
DBMS Schemas for Decision Support
Data Layout for best access
Multidimensional Data Model
Star Schema
Two groups: facts and dimensions
Facts are the core data elements being analyzed,
e.g., items sold.
Dimensions are attributes about the facts,
e.g., date of purchase.
The star schema is designed to overcome the limitations of
the two-dimensional relational model.
DBA Viewpoint
The fact table contains raw facts. The facts are typically
additive and are accessed via dimensions.

57
58
The dimension tables contain a non-compound primary key
and are heavily indexed.
Dimension tables appear in constraints and GROUP BY
clauses, and are joined to the fact tables using foreign key
references.
Once the star schema database is defined and loaded,
queries can answer both simple and complex questions.
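
A tiny star schema can be sketched with Python's built-in `sqlite3` module. The table names, column names, and sample rows are all hypothetical; the point is only to show a fact table joined to dimension tables through foreign keys, with the dimensions appearing in the constraint and the GROUP BY clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    items_sold INTEGER  -- additive fact
);
INSERT INTO dim_date VALUES (1, 'Jan'), (2, 'Jan'), (3, 'Feb');
INSERT INTO dim_product VALUES (10, 'Pen'), (20, 'Book');
INSERT INTO fact_sales VALUES (1, 10, 5), (2, 10, 3), (2, 20, 4), (3, 20, 6);
""")

# Dimensions supply the constraint (WHERE) and the grouping (GROUP BY);
# the fact table supplies the additive measure.
query = """
SELECT d.month, p.name, SUM(f.items_sold)
FROM fact_sales AS f
JOIN dim_date AS d ON f.date_id = d.date_id
JOIN dim_product AS p ON f.product_id = p.product_id
WHERE p.name = 'Pen'
GROUP BY d.month, p.name
"""
rows = conn.execute(query).fetchall()
print(rows)
```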
Potential Performance Problems with star schemas
The star schema suffers the following performance problems.
Indexing
A multipart key presents some problems in the star schema model
(e.g., day -> week -> month -> quarter -> year).
• It requires multiple metadata definitions (one for each component)
to design a single table.
• Since the fact table must carry all key components as part of its
primary key, addition or deletion of levels in the hierarchy
requires physical modification of the affected table, a
time-consuming process that limits flexibility.

Level Indicator

The dimension table design includes a level-of-hierarchy
indicator for every record.

If the user is not aware of the level indicator, or its
values are incorrect, an otherwise valid query may return a
totally invalid answer.
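
The level indicator pitfall can be shown with a small sketch. The dimension rows, the fact rows, and the `total_sales` helper are hypothetical; they assume detail and precomputed aggregate rows share one fact table, distinguished only by the level stored in the dimension.

```python
# Hypothetical time dimension: one row per period, tagged with its level.
time_dim = {
    1: {"period": "2024-01-01", "level": "day"},
    2: {"period": "2024-01-02", "level": "day"},
    3: {"period": "2024-01", "level": "month"},
}
# Fact rows: (time key, sales). Key 3 is a precomputed monthly aggregate.
facts = [(1, 100), (2, 150), (3, 250)]

def total_sales(level=None):
    """Sum sales, optionally restricted to one level of the hierarchy."""
    return sum(
        amount
        for key, amount in facts
        if level is None or time_dim[key]["level"] == level
    )

unconstrained = total_sales()         # ignores the level indicator
monthly_only = total_sales("month")   # constrained to one level
print(unconstrained, monthly_only)
```

The unconstrained query mixes daily rows with the monthly aggregate and counts the same sales twice, which is the kind of invalid answer described above.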

60
Alternative to using the level indicator is the snowflake schema
Aggregate fact tables are created separately from detail tables
Snowflake schema contains separate fact tables for each level of
aggregation
Other problems with the star schema design
Pairwise Join Problem
Joining five tables requires joining the first two tables, then joining
the result with the third table, and so on.
The intermediate result of every join operation is used to
join with the next table.
Selecting the best order of pairwise joins can rarely be solved
in a reasonable amount of time.
A five-table query has 5! = 120 possible join orders.

61
This problem is so serious that some databases will not run a
query that tries to join too many tables.
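
The growth in the number of join orders can be checked directly. The table names below are hypothetical placeholders.

```python
from itertools import permutations
from math import factorial

tables = ["fact", "dim_a", "dim_b", "dim_c", "dim_d"]  # hypothetical names

# Every ordering of the tables is one candidate sequence of pairwise joins.
join_orders = list(permutations(tables))
order_count = len(join_orders)
print(order_count, factorial(len(tables)), factorial(10))
```

Five tables already give 120 orderings, and ten tables give 3,628,800, which is why an optimizer cannot simply try them all.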
STARjoin and STARindex
A STARjoin is a high-speed, single-pass, parallelizable
multitable join introduced by Red Brick’s RDBMS.

STARindexes accelerate join performance.

STARindexes are created on one or more foreign key columns
of a fact table.

A traditional multicolumn index references a single table, whereas a
STARindex can reference multiple tables.

With multicolumn indexes, if a query’s WHERE clause does
not constrain all the columns in the composite index, the index
cannot be fully used unless the specified columns are a leading
subset.

62
The STARjoin using a STARindex can efficiently join the
dimension tables to the fact table without the penalty of generating
the full Cartesian product.
The STARjoin algorithm generates a Cartesian
product only in regions where there are rows of interest and bypasses
generating Cartesian products over regions where there are no
rows.
Bit mapped Indexing
SYBASE IQ
Overview.
When data is loaded into SYBASE IQ, it converts all data into a
series of bitmaps, which are then highly compressed and stored
on disk.
SYBASE IQ indexes do not point to data stored elsewhere; all
data is contained in the index structure.

63
Data Cardinality.

Bitmap indexes are used for queries against low-cardinality
data, that is, data in which the total number of potential values is
relatively low.

For low-cardinality data, each distinct value has its own
bitmap index consisting of a bit for every row in the table.

The SYBASE IQ high-cardinality index starts at 1000 distinct
values.
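
The idea of one bitmap per distinct value can be sketched with Python integers used as bit strings. This is a generic illustration, not SYBASE IQ's actual storage format; the sample columns and helper names are hypothetical.

```python
def build_bitmaps(column):
    """Return {value: bitmap}, where bit i is set if row i holds the value."""
    bitmaps = {}
    for row_number, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps

def matching_rows(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

gender = ["M", "M", "F", "M"]
dept = ["CSE", "MECH", "CSE", "CSE"]
gender_index = build_bitmaps(gender)
dept_index = build_bitmaps(dept)

# WHERE gender = 'M' AND dept = 'CSE' becomes a bitwise AND of two bitmaps.
result = matching_rows(gender_index["M"] & dept_index["CSE"])
print(result)
```

With only a few distinct values per column, a handful of bitmaps covers the whole column, and combining conditions is a cheap bitwise operation.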

64
Emp-Id   Gender   Last Name   First Name   Address
104345   M        Karthik     Ramasamy     10, North street
104567   M        Visu        Pandian      12, Pallavan street
104788   F        Mala        Prathap      123, Koil street

[Figure: bitmap representation of the records above (Record 1, Record 2, ..., Record N). Bit pattern shown: 1 1 0 0 0 1 1 0 0 1 0 1 1 1 0]

65
Index Types.
SYBASE IQ provides five index techniques. One is a
default index called the Fast Projection index, and another is either
a low- or high-cardinality index.
Performance.
SYBASE IQ technology achieves very good performance on
ad hoc queries for several reasons.
• Bitwise technology. This allows rapid response to queries
containing various data types and supports data aggregation and
grouping.
• Compression. SYBASE IQ uses sophisticated algorithms to
compress data into bitmaps, so it can hold more data in
memory, minimizing expensive I/O operations.

66
[Figure: traditional row-based processing versus SYBASE IQ column-wise processing, shown on a table with columns E-Id, Gender, and Name.]

67
 Optimized memory-based processing. Columnwise
processing.

 Low Overhead.

 Large Block I/O.

 Operating-system-level parallelism.

 Prejoin and ad hoc join Capabilities.

68
Shortcomings of Indexing.

Some of the tradeoffs of the SYBASE IQ are as follows

 No Updates.

 Lack of core RDBMS features.

 Less advantage for planned queries.

 High memory Usage.

69
Column Local Storage
Performance in the data warehouse environment can be
improved by storing data in memory column-wise rather than
one row at a time, where each row is viewed and accessed as a
single record.

Emp-id   Emp-Name   Dept    Salary
1004     Suresh     CSE     15000
1005     Mani       MECH    25000
1006     Sara       CIVIL   23000

Row-wise storage:
1004 Suresh CSE 15000
1005 Mani MECH 25000
1006 Sara CIVIL 23000

Column-wise storage:
1004 1005 1006
Suresh Mani Sara
CSE MECH CIVIL
15000 25000 23000
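
The same employee rows can be turned into a column-wise layout in a short sketch. The lowercase column names are hypothetical identifiers chosen for the code; the values come from the table above.

```python
rows = [
    (1004, "Suresh", "CSE", 15000),
    (1005, "Mani", "MECH", 25000),
    (1006, "Sara", "CIVIL", 23000),
]
column_names = ["emp_id", "emp_name", "dept", "salary"]

# Column-wise layout: one list per column instead of one tuple per row.
columns = dict(zip(column_names, (list(col) for col in zip(*rows))))

# An aggregate over one column touches only that column's values.
average_salary = sum(columns["salary"]) / len(columns["salary"])
print(columns["salary"], average_salary)
```

A query such as an average salary reads a single list here, whereas the row-wise layout would have to visit every field of every record.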

70
Complex Data types

The warehouse environment must support complex data types
such as text, image, full-motion video, sound, and large
objects called binary large objects (BLOBs), in addition to simple
types such as alphanumeric.

71
5. Data Extraction, Cleanup, and Transformation Tools

72
Data Extraction, Cleanup, and Transformation Tools
Tools Requirements

The tools that enable sourcing of the proper data
contents and formats from operational and external data stores into
the data warehouse have to perform a number of important tasks,
including:

• Data transformation from one format to another on the basis of
possible differences between the source and the target platform.

• Data transformation and calculation based on the application of
business rules that force certain transformations.

73
Vendor Approaches
The integrated solutions can fall into one of the categories
described below
Code generators
Database data replication tools
Rule-driven dynamic transformation engines capture data
from source systems at user-defined intervals, transform the data,
and then send and load the results into a target environment,
typically a data mart
Access to Legacy Data
Many organizations develop middleware solutions that
manage the interaction between the new applications and growing
data warehouses on the one hand and back-end legacy systems on the
other.
A three-layer architecture defines how applications are
partitioned to meet both near-term integration and long-term
migration objectives.
 The data layer provides data access and transaction services for
management of corporate data assets.
 The process layer provides services to manage automation and
support for current business process.
 The user layer manages user interaction with process and /or data
layer services.
Vendor Solutions
Prism Solutions
Provides a comprehensive solution for data warehousing by
mapping source data to a target database management system to
be used as a warehouse.

75
Warehouse Manager generates code to extract and integrate
data, create and manage metadata, and build a subject-oriented,
historical base.
SAS Institute
SAS tools serve all data warehousing functions.
Its data repository function can be used to build the
informational database.
SAS Data Access Engines serve as extraction tools to
combine common variables, transform data representation forms
for consistency, consolidate redundant data, and use business
rules to produce computed values in the warehouse.
SAS engines can work with hierarchical and relational
databases and sequential files.

Carleton Corporation’s PASSPORT and MetaCenter.
PASSPORT.
PASSPORT is a sophisticated metadata-driven, data-mapping
and data-migration facility.
The product consists of two components.
The first, which is mainframe-based, collects the file, record,
or table layouts for the required inputs and outputs and converts
them to the Passport Data Language (PDL).
PASSPORT Workbench runs as a client on various PC
platforms in the three-tiered environment, including OS/2 and
Windows.

Overall, PASSPORT offers
 A metadata dictionary at the core of the process.
 Robust data conversion, migration, analysis, and auditing
facilities.
 The PASSPORT Workbench that enables project development on
a workstation, with uploading of the generated application to the
source data platform.
 Native interfaces to existing data files and RDBMS, helping users
to leverage existing legacy applications and data.
 A comprehensive fourth-generation specification language and
the full power of COBOL.
The MetaCenter.
The MetaCenter, developed by Carleton Corporation in
partnership with Intellidex System, Inc., is an integrated tool
suite that is designed to put users in control of the data
warehouse.

It is used to manage
 Data extraction
 Data transformation
 Metadata capture
 Metadata browsing
 Data mart subscription
 Warehouse control center functionality
 Event control and notification
Vality Corporation
Vality Corporation’s Integrity data reengineering tool is used
to investigate, standardize, transform, and integrate data from
multiple operational systems and external sources.

Its applications include
 Data audits
 Data warehouse and decision support systems
 Customer information files and householding applications
 Client/server business applications such as SAP, Oracle, and
Hogan
 System consolidations
 Rewrites of existing operational systems
Transformation Engines
Informatica
Informatica’s product, the PowerMart suite, captures
technical and business metadata on the back end that can be
integrated with the metadata in front-end partners’ products.
PowerMart creates and maintains the metadata repository
automatically.

It consists of the following components
 PowerMart Designer is made up of three integrated
modules: Source Analyzer, Warehouse Designer, and
Transformation Designer.
 PowerMart Server runs on a UNIX or Windows NT
platform.
 The Information Server Manager is responsible for
configuring, scheduling, and monitoring the Information Server.
 The Information Repository is the metadata integration hub
of the Informatica PowerMart Suite.
Constellar
The Constellar Hub is designed to handle the movement
and transformation of data for both data migration and data
distribution in an operational system, and for capturing
operational data for loading a data warehouse.

Constellar employs a hub and spoke architecture to manage
the flow of data between source and target systems.
The hubs perform data transformation based on rules
defined and developed using Migration Manager.
Each of the spokes represents a data path between a
transformation hub and a data source or target.
A hub and its associated sources and targets can be installed
on the same machine, or may run on separate networked
computers.
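A hub and spoke flow of this kind can be sketched generically. This is not Constellar's API: the `Hub` class, its method names, and the uppercase-country rule are hypothetical, and spokes are modelled as plain Python callables.

```python
class Hub:
    """Applies transformation rules to rows arriving over source spokes."""
    def __init__(self, rules):
        self.rules = rules            # list of functions: row -> row
        self.source_spokes = []       # callables that return an iterable of rows
        self.target_spokes = []       # callables that accept one row

    def add_source(self, reader):
        self.source_spokes.append(reader)

    def add_target(self, writer):
        self.target_spokes.append(writer)

    def run(self):
        for read in self.source_spokes:
            for row in read():
                for rule in self.rules:
                    row = rule(row)
                for write in self.target_spokes:
                    write(row)


orders = [{"id": 1, "country": "in"}, {"id": 2, "country": "us"}]
warehouse = []

# Assumed rule for illustration: normalise country codes to upper case.
hub = Hub(rules=[lambda r: {**r, "country": r["country"].upper()}])
hub.add_source(lambda: orders)
hub.add_target(warehouse.append)
hub.run()
print(warehouse)
```

Sources and targets only know the hub, so adding a new target is one `add_target` call rather than a new point-to-point link from every source.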

6. Metadata

Metadata
The metadata contains
 The location and description of warehouse system and data
components.
 Names, definitions, structure, and content of the warehouse and
end-user views.
 Identification of authoritative data sources.
 Integration and transformation rules used to populate the data
warehouse; these include the mapping method from operational
databases into the warehouse, and algorithms used to convert,
enhance, or transform data.
 Integration and transformation rules used to deliver data to
end-user analytical tools.
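As a rough illustration of the items listed above, warehouse metadata could be held in simple records like the following. The class and field names and the sample table are assumptions for this sketch, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    name: str
    definition: str
    data_type: str
    authoritative_source: str   # system of record for this value
    transformation_rule: str    # how the value is derived from the source

@dataclass
class TableMetadata:
    name: str
    location: str
    description: str
    columns: list = field(default_factory=list)

    def lineage(self):
        """Map each column to its source and the rule that populates it."""
        return {c.name: (c.authoritative_source, c.transformation_rule)
                for c in self.columns}

sales = TableMetadata(
    name="sales_fact",
    location="warehouse.dw.sales_fact",
    description="One row per order line",
    columns=[
        ColumnMetadata("amount", "Order line value", "DECIMAL(12,2)",
                       "orders.ORD_AMT", "ORD_AMT / 100"),
    ],
)
print(sales.lineage())
```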

Metadata Interchange Initiative

A standard developed for the metadata interchange format
and its support mechanism.

The goals of the standard include

 Creating a vendor-independent, industry-defined application
programming interface (API) for metadata.

 Allowing users to build tool configurations that meet their needs
and to incrementally adjust those configurations as necessary to
add or subtract tools without impact on the interchange standards
environment.
Metadata Interchange Standard framework.

The components of the Metadata Interchange Standard
Framework are

 The Standard Metadata Model, which refers to the ASCII file
format used to represent the metadata that is being exchanged.

 The Standard Access Framework, which describes the minimum
number of API functions a vendor must support.

 Tool Profile, which is provided by each tool vendor. The Tool
Profile is a file that describes what aspects of the interchange
standard metamodel a particular tool supports.
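The interplay between an ASCII interchange file and a tool profile can be sketched as follows. The `TYPE|value` line format, the element type names, and both functions are hypothetical; they are not the file format or the API defined by the standard.

```python
# Metadata held by an exporting tool, tagged with assumed element types.
metadata = [
    ("TABLE", "sales_fact"),
    ("COLUMN", "sales_fact.amount"),
    ("TRANSFORMATION", "amount = ORD_AMT / 100"),
]

# A tool profile declares which element types the importing tool supports.
import_tool_profile = {"TABLE", "COLUMN"}

def export_ascii(items):
    """Serialize metadata as plain ASCII text, one 'TYPE|value' line per item."""
    return "\n".join(f"{kind}|{value}" for kind, value in items)

def import_ascii(text, profile):
    """Read the interchange text, keeping only element types in the profile."""
    accepted, skipped = [], []
    for line in text.splitlines():
        kind, value = line.split("|", 1)
        (accepted if kind in profile else skipped).append((kind, value))
    return accepted, skipped

text = export_ascii(metadata)
accepted, skipped = import_ascii(text, import_tool_profile)
print(accepted)
print(skipped)
```

Reporting the skipped elements, rather than dropping them silently, lets a user see what a given tool combination cannot exchange.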
[Figure: Metadata Interchange Standard framework, showing Tool 1
through Tool 4 (each with a Tool Profile), the User Configuration,
the Standard Access Framework (Standard API), and the Standard
Metadata Model.]

Metadata Repository

The metadata itself is housed in and managed by the
metadata repository.

Metadata repository management software can be used to
map the source data to the target database, generate code for data
transformations, integrate and transform the data, and control
the movement of data to the warehouse.

Metadata Management

Metadata define all data elements and their attributes, data
sources and timing, and the rules that govern data use and data
transformations.

The metadata also has to be available to all warehouse users
in order to guide them as they use the warehouse.

A well-thought-through strategy for collecting, maintaining,
and distributing metadata is needed for a successful data
warehouse implementation.

Metadata Trends

The process of integrating external and internal data into
the warehouse faces a number of challenges

 Inconsistent data formats

 Missing or invalid data

 Different levels of aggregation

 Semantic inconsistency (e.g., the same code may mean
different things from different suppliers of data)

 Unknown or questionable data quality and timeliness
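Several of the challenges above (inconsistent formats, missing or invalid data, semantic inconsistency) can be handled by a cleaning step like the sketch below. The supplier names, status codes, and accepted date formats are invented for illustration.

```python
from datetime import datetime

# Assumed per-supplier code tables: the same code means different things.
CODE_MAPS = {
    "supplier_a": {"1": "ACTIVE", "2": "CLOSED"},
    "supplier_b": {"1": "CLOSED", "2": "ACTIVE"},
}
# Assumed date formats seen across sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(text):
    """Try each known format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(record):
    """Return the standardized record plus a list of detected problems."""
    problems = []
    date = parse_date(record.get("date") or "")
    if date is None:
        problems.append("missing or invalid date")
    status = CODE_MAPS.get(record["supplier"], {}).get(record.get("status"))
    if status is None:
        problems.append("unknown status code")
    return {"supplier": record["supplier"], "date": date, "status": status}, problems

rows = [
    {"supplier": "supplier_a", "date": "1998-03-05", "status": "1"},
    {"supplier": "supplier_b", "date": "05/03/1998", "status": "1"},
    {"supplier": "supplier_b", "date": "", "status": "9"},
]
results = [clean(r) for r in rows]
for cleaned, problems in results:
    print(cleaned, problems)
```

The first two rows carry the same date and the same code "1", yet resolve to different statuses because each supplier's code table is applied separately.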
