Unit5 DM&DW

Uploaded by

srp27391

III BCA

Data Warehouse

Unit 5 : Data Warehouse


Introduction
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and that usually resides at a single site.

• Data warehouses generalize and consolidate data in multidimensional space.
• The construction of data warehouses involves data cleaning, data integration,
and data transformation, and can be viewed as an important preprocessing step
for data mining.
• Moreover, data warehouses provide online analytical processing (OLAP) tools
for the interactive analysis of multidimensional data of varied granularities,
which facilitates effective data generalization and data mining.
• An ordinary database typically stores megabytes to gigabytes of data for a
specific purpose. Data at the terabyte scale is typically stored in a data
warehouse.
• Metadata repository : Metadata are data about data. When used in a data
warehouse, metadata are the data that define warehouse objects.
Definition of Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.

• The definition presents the major features (characteristics) of a data
warehouse. The four keywords (subject-oriented, integrated, time-variant,
and nonvolatile) distinguish data warehouses from other data repository
systems, such as relational database systems, transaction processing systems,
and file systems.
Key features of data warehouse are :

▪ Subject-oriented : A data warehouse is organized around major subjects
such as customer, supplier, product, and sales.
▪ Integrated : A data warehouse is usually constructed by integrating
multiple heterogeneous sources, such as relational databases, flat files, and
online transaction records.
▪ Time-variant : Data are stored to provide information from an historic
perspective (e.g., the past 5–10 years). Every key structure in the data
warehouse contains, either implicitly or explicitly, a time element.

Govt. first Grade College, Shimoga 1


▪ Nonvolatile : A data warehouse is always a physically separate data store.
Due to this separation, a data warehouse does not require transaction
processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of
data.
Differences between Operational Database Systems and Data Warehouses

• The major task of online operational database systems is to perform online
transaction and query processing. These systems are called online transaction
processing (OLTP) systems.
• Data warehouse systems, on the other hand, serve users or knowledge workers
in the role of data analysis and decision making. These systems are known as
online analytical processing (OLAP) systems.

The major distinguishing features of OLTP and OLAP are summarized as follows:

• Users and system orientation: An OLTP system is customer-oriented and
is used for transaction and query processing by clerks, clients, and information
technology professionals. An OLAP system is market-oriented and is used for
data analysis by knowledge workers, including managers, executives, and
analysts.
• Data contents: An OLTP system manages current data that, typically, are
too detailed to be easily used for decision making. An OLAP system manages
large amounts of historic data and provides facilities for summarization and
aggregation.
• Database design: An OLTP system usually adopts an entity-relationship
(ER) data model and an application-oriented database design. An OLAP
system typically adopts either a star or a snowflake model and a subject-
oriented database design.
• View: An OLTP system focuses mainly on the current data within an
enterprise or department, without referring to historic data or data in different
organizations. In contrast, an OLAP system often spans multiple versions of a
database schema, due to the evolutionary process of an organization.
• Access patterns: The access patterns of an OLTP system consist mainly of
short, atomic transactions. Such a system requires concurrency control and
recovery mechanisms. However, accesses to OLAP systems are mostly read-
only operations although many could be complex queries.
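
The difference in access patterns can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns, and the sample rows are hypothetical, and a real OLTP system and a real warehouse would be separate systems rather than one in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

# OLTP-style access: a short, atomic transaction that touches a single record.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute(
        "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
        ("South", 2023, 120.0),
    )

# A few more hypothetical rows so that the analytical query has data to scan.
with conn:
    conn.executemany(
        "INSERT INTO orders (region, year, amount) VALUES (?, ?, ?)",
        [("South", 2024, 80.0), ("North", 2023, 200.0), ("North", 2024, 50.0)],
    )

# OLAP-style access: a read-only query that scans and aggregates many rows.
totals_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall()
)
print(totals_by_region)
```

The write is small and needs transactional guarantees, while the analytical query changes nothing but reads every row.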


Data Warehousing: A Multitiered Architecture


Data warehouses often adopt a three-tier architecture, as presented in
Figure 3.1.

1. The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data
into the bottom tier from operational databases or other external sources.
2. The middle tier is an OLAP server that is typically implemented using either
a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that
maps operations on multidimensional data to standard relational operations)
or a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server
that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction,
and so on).

Data Warehouse Modeling


Data warehouse modeling refers to the process and techniques used to design
the structure of a data warehouse.

• Data modeling refers to the process of designing and managing the data model
within a data warehouse platform.
• It consists of designing an appropriate database schema so that the data can
be stored in a form that is useful to the user.
• Data warehouse modeling uses models, schemas, measures, and concept
hierarchies to design the structure of the data warehouse.

❖ Data Warehouse Models
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
• Enterprise warehouse: An enterprise warehouse collects all of the information
about subjects spanning the entire organization.
• Data mart: A data mart contains a subset of corporate-wide data that is of
value to a specific group of users. The scope is confined to specific selected
subjects. For example, a marketing data mart may confine its subjects to
customer, item, and sales.
• Virtual warehouse: A virtual warehouse is a set of views over operational
databases. For efficient query processing, only some of the possible summary
views may be materialized. A virtual warehouse is easy to build but requires
excess capacity on operational database servers.
❖ Schemas for Multidimensional Data Models

The most popular data model for a data warehouse is a multidimensional model.
Such a model can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.

• Star schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains :
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.
The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.

• Snowflake schema: The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized, thereby further splitting
the data into additional tables. The resulting schema graph forms a shape
similar to a snowflake.

• Fact constellation: Sophisticated applications may require multiple fact tables
to share dimension tables. This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact constellation.
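
As a minimal sketch of the star schema described above, the following uses Python's built-in sqlite3 module. The table names (`fact_sales`, `dim_time`, `dim_item`, `dim_location`), their columns, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one key per dimension plus the numeric measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_item VALUES (1, 'Laptop', 'Acme'), (2, 'Phone', 'Acme');
INSERT INTO dim_location VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 1000.0, 2), (1, 2, 2, 300.0, 1),
    (2, 1, 2, 500.0, 1),  (2, 2, 1, 600.0, 2);
""")

# A typical star join: the fact table is joined to the dimensions used for grouping.
rows = conn.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t     ON f.time_key = t.time_key
    JOIN dim_location l ON f.location_key = l.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()
print(rows)
```

In a snowflake variant, a dimension such as `dim_location` would be split further (for example, into separate city and country tables); in a fact constellation, a second fact table would reuse the same dimension tables.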

❖ Measures: Their Categorization and Computation


A data cube measure is a numerical function that can be evaluated at each
point in the data cube space.

• A measure value is computed for a given point by aggregating the data
corresponding to the respective dimension-value pairs defining the given point.
• Measures can be organized into three categories (i.e., distributive, algebraic,
holistic), based on the kind of aggregate functions used.

1. Distributive : These measures can be computed in a distributive manner.
This means that the result can be obtained by partitioning the data, computing
the measure on each partition, and then combining the results.
• Examples : sum(), min(), and max() are distributive aggregate functions
2. Algebraic : These measures are composed of multiple distributive measures
combined using algebraic formulas. They can be computed by breaking them
down into basic distributive measures and then combining the results using
an algebraic formula.
• For example, avg() (average) can be computed by sum()/count(), where both
sum() and count() are distributive aggregate functions
3. Holistic : These measures require access to all the data to compute the result
and cannot be broken down into smaller pieces for intermediate computation.
• Common examples of holistic functions include median(), mode(), and
rank().
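
The three categories can be illustrated with a small sketch. The partitioned sales values below are hypothetical; the point is only to show which measures can be combined from per-partition results and which cannot.

```python
from statistics import median

# Hypothetical sales amounts, split into two partitions (e.g., two stores).
partitions = [[10, 20, 30], [40, 50]]
all_values = [v for part in partitions for v in part]

# Distributive: per-partition results are combined with the same function.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
maximum = max(max(p) for p in partitions)

# Algebraic: avg() is derived from two distributive measures, sum() and count().
count = sum(len(p) for p in partitions)
average = total / count

# Holistic: the median needs all values at once. Combining per-partition
# medians generally does not give the true median.
true_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)

print(total, maximum, average, true_median, median_of_medians)
```

Here the median of the per-partition medians differs from the true median, which is why holistic measures are expensive to compute over large cubes.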
❖ Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.

• A concept hierarchy consists of a set of nodes organized in a tree, where the
nodes represent values of an attribute, known as concepts.
• The hierarchies allow the user to summarize the data at various levels.

Figure: (a) a hierarchy for location; (b) a lattice for time
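
A concept hierarchy for location can be sketched as a pair of mappings. The cities, provinces, sales figures, and the function names `generalize` and `summarize` are hypothetical and serve only to show how data is summarized at different levels.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> province/state -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}

def generalize(city, level):
    """Map a city to the requested level of the hierarchy."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")

def summarize(sales_by_city, level):
    """Total the sales at the chosen level of the hierarchy."""
    totals = defaultdict(int)
    for city, amount in sales_by_city.items():
        totals[generalize(city, level)] += amount
    return dict(totals)

sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200}
print(summarize(sales_by_city, "country"))
```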

Data Cube and OLAP :


Data cube :
A data cube is a grouping of data in a multidimensional matrix.

• In data warehousing, we generally deal with multidimensional data models,
as the data is described by multiple dimensions and multiple attributes.
• This multidimensional data is represented in the data cube, as the cube
represents a high-dimensional space.
• The data cube pictorially shows how the different attributes of the data are
arranged in the data model.

Data cube classification:

The data cube can be classified into two categories:

• Multidimensional data cube: It stores large amounts of data in a
multidimensional array. It gains efficiency by keeping an index of each
dimension, and is therefore able to retrieve data quickly.
• Relational data cube: It stores large amounts of data in relational tables.
Each relational table represents the dimensions of the data cube. It is slower
than a multidimensional data cube.
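
The two storage styles can be contrasted in a small sketch. The dimension members, sales values, and the helper names `md_lookup` and `relational_lookup` are hypothetical; a real relational cube would normally use database indexes rather than the plain scan shown here.

```python
# Hypothetical dimension members; each index maps a member to an array position.
time_index = {"Q1": 0, "Q2": 1}
item_index = {"Laptop": 0, "Phone": 1}

# Multidimensional cube: a dense 2-D array addressed by (time, item).
md_cube = [[0.0] * len(item_index) for _ in time_index]

# Relational cube: one row per non-empty cell.
relational_cube = [
    ("Q1", "Laptop", 1000.0),
    ("Q1", "Phone", 300.0),
    ("Q2", "Laptop", 500.0),
]

# Fill the array from the same cells so both representations hold the same data.
for quarter, item, sales in relational_cube:
    md_cube[time_index[quarter]][item_index[item]] = sales

def md_lookup(quarter, item):
    """Direct array addressing through the dimension indexes."""
    return md_cube[time_index[quarter]][item_index[item]]

def relational_lookup(quarter, item):
    """Scan the rows for a matching cell, as an unindexed table would."""
    for q, i, sales in relational_cube:
        if (q, i) == (quarter, item):
            return sales
    return 0.0

print(md_lookup("Q2", "Laptop"), relational_lookup("Q2", "Laptop"))
```

The array gives constant-time access but also allocates space for empty cells, while the table stores only the cells that exist.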

Advantages of data cubes:

• Multi-dimensional analysis: Data cubes enable multi-dimensional analysis
of business data, allowing users to view data from different perspectives and
levels of detail.
• Interactivity: Data cubes provide interactive access to large amounts of data,
allowing users to easily navigate and manipulate the data to support their
analysis.
• Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling
fast and efficient querying and aggregation of data.
• Data aggregation: Data cubes support complex calculations and data
aggregation, enabling users to quickly and easily summarize large amounts of
data.
• Improved decision-making: Data cubes provide a clear and comprehensive
view of business data, enabling improved decision-making and business
intelligence.
• Accessibility: Data cubes can be accessed from a variety of devices and
platforms, making it easy for users to access and analyze business data from
anywhere.

Disadvantages of data cube:

• Complexity: OLAP systems can be complex to set up and maintain, requiring
specialized technical expertise.
• Data size limitations: OLAP systems can struggle with very large data sets
and may require extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large
amounts of data, especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can
affect the accuracy of OLAP analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level
solutions, due to the need for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing business
needs and may require significant effort to modify or extend.

OLAP :

OLAP stands for Online Analytical Processing, which is a technology that enables
multi-dimensional analysis of business data.

▪ It provides interactive access to large amounts of data and supports complex
calculations and data aggregation. OLAP is used to support business
intelligence and decision-making processes.

Types of OLAP Systems:

1. MOLAP (Multidimensional OLAP): Uses specialized storage to handle
multidimensional data and provide fast query performance.
▪ MOLAP uses array-based multidimensional storage engines for
multidimensional views of data.
2. ROLAP (Relational OLAP): Leverages relational databases to store data
and performs on-the-fly aggregation.
▪ ROLAP servers are placed between the relational back-end server
and the client front-end tools.
3. HOLAP (Hybrid OLAP): Combines the capabilities of MOLAP and
ROLAP to balance the benefits of both.
▪ A HOLAP server may allow large volumes of detailed data to be
stored in a relational database, while aggregations are kept in a
separate MOLAP store.

Characteristics of OLAP system

Fast :

The system is targeted to deliver most responses to the user within about five
seconds, with the simplest analyses taking no more than one second and very few
taking more than 20 seconds.

Analysis :

The system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the
target user. Although some pre-programming may be required, it is not acceptable
if all application definitions have to be programmed in advance: the user must be
able to define new ad hoc calculations as part of the analysis and to report on
the data in any desired way.

Shared :

The system implements all the security requirements for confidentiality
(possibly down to cell level) and, if multiple write access is required,
concurrent update locking at an appropriate level. Not all applications require
users to write data back, but for the increasing number that do, the system must
be able to handle multiple updates in an appropriate, secure manner.

Multidimensional :

This is the basic requirement. An OLAP system must provide a multidimensional
conceptual view of the data, including full support for hierarchies, as this is
certainly the most logical way to analyze businesses and organizations.

Information :

The system should be able to hold all the data needed by the applications. Data
sparsity should be handled in an efficient manner.

Data cube Operations


Data cube operations are key concepts in OLAP (Online Analytical Processing)
systems, which are used for analyzing data in a multidimensional space.

• These operations allow users to manipulate and analyze the data cube to derive
meaningful insights.

• These operations allow users to navigate through data cubes in a flexible and
interactive way, enabling detailed data analysis.

Here are the main operations:


❖ Roll-up :

The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.

• When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube. For example, consider a sales data cube
containing only the two dimensions location and time. Roll-up may be
performed by removing, say, the time dimension, resulting in an aggregation
of the total sales by location, rather than by location and by time.
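
Both forms of roll-up can be sketched on a tiny cube held in a dictionary. The cell values, the city-to-country mapping, and the function names are hypothetical and only illustrate the two techniques named above.

```python
from collections import defaultdict

# Hypothetical cube cells: (location, time) -> total sales.
cube = {
    ("Chicago", "Q1"): 100, ("Chicago", "Q2"): 150,
    ("Toronto", "Q1"): 200, ("Toronto", "Q2"): 50,
    ("Vancouver", "Q1"): 30, ("Vancouver", "Q2"): 20,
}
CITY_TO_COUNTRY = {"Chicago": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_by_dimension_reduction(cells):
    """Remove the time dimension: aggregate sales by location only."""
    out = defaultdict(int)
    for (location, _time), sales in cells.items():
        out[location] += sales
    return dict(out)

def roll_up_by_hierarchy(cells, mapping):
    """Climb the location hierarchy (city -> country), keeping the time dimension."""
    out = defaultdict(int)
    for (location, time), sales in cells.items():
        out[(mapping[location], time)] += sales
    return dict(out)

print(roll_up_by_dimension_reduction(cube))
print(roll_up_by_hierarchy(cube, CITY_TO_COUNTRY))
```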

❖ Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to
more detailed data.

• For example, drill-down can be performed by descending the time hierarchy
from the level of quarter to the more detailed level of month. The resulting
data cube details the total sales per month rather than summarizing them by
quarter.
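
A minimal sketch of drill-down along the time hierarchy follows. The monthly figures and the helper names `by_quarter` and `drill_down` are hypothetical. Note the assumption that the detailed monthly records are still available: the finer level cannot be reconstructed from the quarterly totals alone.

```python
from collections import defaultdict

# Hypothetical detailed records kept in the warehouse: (month, sales).
monthly_sales = [
    ("Jan", 40), ("Feb", 35), ("Mar", 25),
    ("Apr", 60), ("May", 50), ("Jun", 40),
]
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

def by_quarter(records):
    """Summarized view: total sales per quarter."""
    totals = defaultdict(int)
    for month, sales in records:
        totals[MONTH_TO_QUARTER[month]] += sales
    return dict(totals)

def drill_down(records, quarter):
    """Descend from one quarter to its months, using the detailed records."""
    return {m: s for m, s in records if MONTH_TO_QUARTER[m] == quarter}

print(by_quarter(monthly_sales))
print(drill_down(monthly_sales, "Q1"))
```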

❖ Slice and dice:
• Slice : The slice operation performs a selection on one dimension of the given
cube, resulting in a subcube. This operation filters out the portions that are
not needed: when the user needs only a particular value of a dimension for
analysis, rather than everything in it, a slice selects just that value.
• Dice : The dice operation defines a subcube by performing a selection on two or
more dimensions.

Figure: Slice operation Figure: Dice operation
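
Slice and dice can be sketched as filters over the cells of a cube. The three-dimensional cube below and the function names `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical cube cells: (location, time, item) -> sales.
cube = {
    ("Chicago", "Q1", "Laptop"): 100, ("Chicago", "Q1", "Phone"): 40,
    ("Chicago", "Q2", "Laptop"): 150, ("Toronto", "Q1", "Laptop"): 200,
    ("Toronto", "Q2", "Phone"): 50,
}

def slice_cube(cells, time):
    """Select one value on the time dimension; the result has one dimension fewer."""
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == time}

def dice_cube(cells, locations, times):
    """Select on two dimensions at once; the result keeps all three dimensions."""
    return {k: v for k, v in cells.items() if k[0] in locations and k[1] in times}

print(slice_cube(cube, "Q1"))
print(dice_cube(cube, {"Chicago"}, {"Q1", "Q2"}))
```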

❖ Pivot (rotate):

Pivot (also called rotate) is a visualization operation that rotates the data axes
in view in order to provide an alternative presentation of the data.

• It may involve swapping the rows and columns or moving one of the row
dimensions into the column dimensions.
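
For a two-dimensional view, pivoting amounts to swapping the axes. The labels and values below are hypothetical, and the `pivot` helper is only a sketch of the idea.

```python
# Hypothetical 2-D view: rows are locations, columns are quarters.
row_labels = ["Chicago", "Toronto"]
col_labels = ["Q1", "Q2", "Q3"]
view = [
    [100, 150, 120],
    [200, 50, 90],
]

def pivot(rows, cols, table):
    """Swap the row and column axes; the cell values themselves are unchanged."""
    rotated = [list(column) for column in zip(*table)]
    return cols, rows, rotated

new_rows, new_cols, rotated = pivot(row_labels, col_labels, view)
print(new_rows, new_cols, rotated)
```

Only the presentation changes: each sales value still belongs to the same (location, quarter) pair.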


Multidimensional Data Model


The multidimensional data model is a method for organizing data in the database
so that its contents are properly arranged and assembled.

• It represents data in the form of data cubes.
• Data cubes allow data to be modeled and viewed from many dimensions and
perspectives.
• Data warehouses and OLAP tools are based on a multidimensional data model.
This model views data in the form of a data cube.
• A data cube allows data to be modeled and viewed in multiple dimensions. It
is defined by dimensions and facts.
Dimensions :

• Dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
• Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension.
• For example, a dimension table for item may contain the attributes item name,
brand, and type.
• Dimension tables can be specified by users or experts, or automatically
generated and adjusted based on data distributions.
Facts :

• A multidimensional data model is typically organized around a central theme,
like sales, for instance. This theme is represented by a fact table. Facts are
numerical measures. Think of them as the quantities by which we want to
analyze relationships between dimensions.
• Examples of facts for a sales data warehouse include dollars sold (sales amount
in dollars), units sold (number of units sold), and amount budgeted.
• The fact table contains the names of the facts, or measures, as well as keys to
each of the related dimension tables.


Every project should follow these stages when building a multidimensional
data model :

Stage 1 : Gathering data from the client : In the first stage, the correct data
is collected from the client.

Stage 2 : Grouping different segments of the system : In the second stage, all
the data is identified and classified into the sections it belongs to, so that
the model can be built and applied step by step without problems.

Stage 3 : Identifying the dimensions : In this stage, the main factors are
identified from the user’s point of view. These factors are also known as
“dimensions”.

Stage 4 : Preparing the factors and their respective qualities : In the fourth
stage, the factors identified in the previous stage are used to identify their
related qualities. These qualities are also known as “attributes” in the
database.

Stage 5 : Finding the facts for the factors listed previously and their
qualities : In the fifth stage, the facts are separated and distinguished from
the factors collected so far.

Stage 6 : Building the schema to place the data, with respect to the
information collected in the stages above : In the sixth stage, a schema is
built on the basis of the data collected previously.


Data cube implementation


The implementation of a data cube in a data warehouse involves several key
steps to design, build, and utilize the cube for multi-dimensional data analysis.

Data cubes can be classified into two main categories :

1. Multidimensional data cube – This type of data cube is based on the concept
of dimensions and measures.
2. Relational data cube – This type of data cube is based on the relational
database model and represents data in tables with rows and columns.

Here's a detailed overview:

1. Requirement Analysis
Understand the business requirements to determine what dimensions and
measures are needed. This includes identifying the key metrics and
dimensions of analysis, such as time, geography, and product categories.
2. Schema Design :
Design the database schema to support the data cube. This typically
involves creating a star schema or a snowflake schema.
• Star Schema: Central fact table connected to multiple dimension tables.
• Snowflake Schema: A normalized form of the star schema where dimension
tables are further broken down into related tables.
3. ETL Process
Extract, Transform, and Load (ETL) data into the data warehouse:
• Extract: Retrieve data from source systems.
• Transform: Cleanse, format, and prepare data.
• Load: Insert data into the fact and dimension tables in the data warehouse.
4. Create the Data Cube
• Use OLAP tools to create the data cube.
• Tools like Microsoft SQL Server Analysis Services (SSAS), Oracle OLAP,
and IBM Cognos are commonly used.
5. Build the Cube:
• Select Measures: Choose columns from the fact table (e.g., SalesAmount,
UnitsSold).
• Define Dimensions: Specify the dimension tables and create hierarchies
(e.g., Year > Quarter > Month > Day).
• Process the Cube: Build and populate the cube with data

6. Querying the Data Cube

Once the cube is created and processed, you can query it using MDX
(Multidimensional Expressions) or other supported query languages.
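
The ETL step (step 3 above) can be sketched with the Python standard library. The CSV extract, the `fact_sales` table, and the cleaning rules (skip rows with a missing amount, normalize city names) are hypothetical; a real pipeline would read from operational systems and apply the organization's own rules.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this comes from an operational system.
SOURCE_CSV = """order_id,city,amount
1, chicago ,100.50
2,Toronto,
3,TORONTO,75.25
"""

def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: drop rows with a missing amount and normalize city names."""
    cleaned = []
    for rec in records:
        amount = rec["amount"].strip()
        if not amount:
            continue  # data cleaning: skip incomplete rows
        cleaned.append((int(rec["order_id"]), rec["city"].strip().title(), float(amount)))
    return cleaned

def load(conn, rows):
    """Load: insert the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (order_id INTEGER, city TEXT, amount REAL)"
    )
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT order_id, city, amount FROM fact_sales ORDER BY order_id"
).fetchall()
print(loaded)
```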

OLAP implementation
When implementing an OLAP system, there are a few key considerations to keep
in mind:

1.Data Model Design: Carefully design the data model to align with the analytical
requirements of the organization. This includes defining dimensions, hierarchies, and
measures.

2.Data Integration: Ensure seamless integration of data from various sources into
the OLAP database. This may involve data extraction, transformation, and loading
(ETL) processes.

3.Scalability and Performance: Plan for scalability and performance optimizations as
the volume of data and user queries increase over time. This may involve partitioning
data, optimizing aggregations, and using caching mechanisms.

4.Vision and strategy development

To choose and implement the most suitable solution, your team needs to define
its business objectives first. This is one of the most important steps, as only a
clear understanding of what you need and how to get it can lead to success. The
next stage is to define the strategy.

5.Data preparation

First of all, it is important to learn as much as possible about the system.
Before preparing your data for transfer into the new system, check the
characteristics of OLAP data:

▪ OLAP data is summarized;
▪ OLAP data is more departmentalized than data warehouse data, which
serves corporate-wide needs;
▪ The system stores and uses less data than a data warehouse.

Make sure these conditions suit you, and then start your data preparation.

6.Vendor and platform choice

Taking all the information and your requirements into account, choose a
vendor and an OLAP system. The choice should be made according to the kind
of system:

▪ ROLAP
▪ MOLAP
▪ HOLAP

7.Review : Summarize all the requirements, data and steps before implementation.

OLAP implementation steps :

Step 1 : Dimensional modeling

Step 2 : Selecting the data to be moved into the OLAP system

Step 3 : Extracting data for the OLAP system

Step 4 : Loading data into the OLAP server

Step 5 : Data aggregation and computation of derived data

Step 6 : Implementing the OLAP application on the desktop

Step 7 : Organizing user training
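
The data aggregation in Step 5 is often done by precomputing totals for combinations of dimensions. The sketch below computes every group-by over three hypothetical dimensions; the records and the function name `precompute_aggregates` are assumptions for illustration, and real OLAP servers usually materialize only a selected subset of these aggregates.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("location", "time", "item")  # hypothetical dimension names

# Hypothetical base facts: one dict per detailed record.
facts = [
    {"location": "Chicago", "time": "Q1", "item": "Laptop", "sales": 100},
    {"location": "Chicago", "time": "Q2", "item": "Phone", "sales": 40},
    {"location": "Toronto", "time": "Q1", "item": "Laptop", "sales": 200},
]

def precompute_aggregates(records):
    """Compute total sales for every subset of the dimensions (2**3 = 8 group-bys)."""
    cuboids = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for rec in records:
                totals[tuple(rec[d] for d in dims)] += rec["sales"]
            cuboids[dims] = dict(totals)
    return cuboids

cuboids = precompute_aggregates(facts)
print(cuboids[("location",)])
print(cuboids[()])
```

The empty tuple of dimensions holds the grand total, and the full tuple holds the base cells; the number of group-bys doubles with each added dimension, which is why the choice of what to precompute matters.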

OLAP Software
OLAP (Online Analytical Processing) software is a critical component in the
field of data warehousing, enabling complex analytical and ad-hoc queries with a
rapid execution time.

Here’s an overview of OLAP software in the context of data warehouses:

Key Concepts

Multidimensional Data Models:

Cubes: OLAP organizes data into cubes instead of traditional tables. A cube is
a multi-dimensional array of data, and each dimension represents a different
attribute (e.g., time, geography, product).

Dimensions and Measures: Dimensions are the perspectives or entities with
respect to which an organization wants to keep records, and measures are the
numerical data being tracked (e.g., sales amount, profit).

Key Features:

▪ Fast Query Performance: OLAP systems are optimized for quick query
execution to support real-time analysis.
▪ Complex Calculations: Supports complex calculations and aggregations, such
as SUM, AVG, COUNT, etc.
▪ Data Drilling: Allows users to drill down into details or roll up to higher-level
summaries.
▪ Slicing and Dicing: Enables data to be viewed from different perspectives by
slicing along dimensions or dicing subsets of the cube.

Benefits

• Enhanced Data Analysis


• Improved Decision Making
• Flexibility

OLAP software in data warehousing provides a powerful framework for analyzing
complex datasets, facilitating advanced reporting, and supporting strategic
decision-making processes in organizations.
