UNIT - 1 - Datawarehouse & Data Mining

The document provides an overview of data warehousing and data mining. It discusses key concepts such as what a data warehouse is, its characteristics of being subject-oriented, integrated, time-variant and non-volatile. It also describes the typical components of a data warehouse including the data warehouse database, ETL tools, metadata, query tools and data marts. Finally, it discusses mapping a data warehouse to a multiprocessor architecture and different types of database parallelism.

Uploaded by

Gamer Bhagvan

KCA012: Data Warehousing & Data Mining

UNIT-1
Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from
different operations in the company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in
the following way: "A data warehouse is a subject-oriented, integrated, time-
variant and non-volatile collection of data in support of management's decision
making process".
 A data warehouse is constructed by integrating data from multiple
heterogeneous sources.
 A data warehouse is a database, which is kept separate from the
organization's operational database.
 It possesses consolidated historical data, which helps the organization to
analyze its business.
 A data warehouse helps executives to organize, understand, and use their
data to take strategic decisions.
 Data warehouse systems help integrate a diversity of application
systems.
 A data warehouse is an information system that contains historical and
cumulative data from single or multiple sources. It simplifies the reporting and
analysis process of the organization.
 A data warehouse system supports analysis of consolidated historical data.
It is also a single version of truth for any company for decision
making and forecasting.

Characteristics of Data warehouse

 Subject-Oriented
 Integrated
 Time-variant
 Non-volatile

 Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing
operations. These subjects can be products, customers, suppliers, sales,
revenue, etc. A data warehouse does not focus on the ongoing operations;
rather, it focuses on modeling and analysis of data for decision making.
 Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
 Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information
from the historical point of view.
 Non-volatile − Non-volatile means the previous data is not erased when new
data is added to it. A data warehouse is kept separate from the operational
database, and therefore frequent changes in the operational database are not
reflected in the data warehouse.

DATA WAREHOUSE COMPONENTS

The data warehouse is based on an RDBMS server, which is a central information
repository surrounded by some key components that make the entire
environment functional, manageable and accessible.

There are mainly five components of Data Warehouse:

Data Warehouse Database:
The central database is the foundation of the data warehousing environment. This
database is implemented on RDBMS technology. However, this kind of
implementation is constrained by the fact that a traditional RDBMS is
optimized for transactional database processing and not for data warehousing. For
instance, ad-hoc queries, multi-table joins and aggregates are resource intensive and
slow down performance.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
The data sourcing, transformation, and migration tools are used for performing all
the conversions, summarizations, and all the changes needed to transform data into
a unified format in the data warehouse. They are also called Extract, Transform
and Load (ETL) Tools.
Metadata
The name metadata suggests some high-level technological concept. However, it
is quite simple. Metadata is data about data which defines the data warehouse. It is
used for building, maintaining and managing the data warehouse. Metadata can be
classified into the following categories:
1. Technical Metadata: This kind of metadata contains information about the
warehouse which is used by data warehouse designers and administrators.
2. Business Metadata: This kind of metadata contains detail that gives end-users
an easy way to understand the information stored in the data warehouse.

Query Tools
One of the primary objectives of data warehousing is to provide information to
businesses to make strategic decisions. Query tools allow users to interact with the
data warehouse system.
These tools fall into four different categories:
1. Query and reporting tools
2. Application Development tools
3. Data mining tools
4. OLAP tools

Data Marts
A data mart is an access layer which is used to get data out to the users. It is
presented as an alternative to a large data warehouse, as it takes less time and
money to build. However, there is no standard definition of a data mart; it differs
from person to person.

BUILDING A DATA WAREHOUSE

In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area
2. Transforming the transactional data
3. Loading the transformed data into a dimensional database
4. Building pre-calculated summary values to speed up report generation
5. Building (or purchasing) a front-end reporting tool

Extracting Transactional Data:

A large part of building a DW is pulling data from various data sources and
placing it in a central storage area.

Transforming Transactional Data:

An equally important and challenging step after extracting is transforming and
relating the data extracted from multiple sources.

Creating a Dimensional Model:
The third step in building a data warehouse is coming up with a dimensional
model. Most modern transactional systems are built using the relational model,
and the relational database is highly normalized.

Loading the Data:

After you've built a dimensional model, it's time to populate it with the data in
the staging database. This step only sounds trivial. It might involve combining
several columns together or splitting one field into several columns.

Generating Pre-calculated Summary Values:

The next step is generating the pre-calculated summary values, which are
commonly referred to as aggregations. This step has been tremendously simplified
by SQL Server Analysis Services (or OLAP Services).

Building (or Purchasing) a Front-End Reporting Tool

After you've built the dimensional database and the aggregations you can decide
how sophisticated your reporting tools need to be. If you just need the drill-down
capabilities, and your users have Microsoft Office 2000 on their desktops, the
Pivot Table Service of Microsoft Excel 2000 will do the job.

MAPPING THE DATA WAREHOUSE TO A MULTIPROCESSOR ARCHITECTURE
The functions of a data warehouse are based on relational database technology,
which is implemented in a parallel manner. There are two advantages of having
parallel relational database technology for a data warehouse:
 Linear speed-up: refers to the ability to increase the number of processors to
reduce response time proportionally.
 Linear scale-up: refers to the ability to provide the same performance on the same
requests as the database size increases, provided processors are added in proportion.
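
The two measures can be expressed as simple ratios. The sketch below uses the commonly used definitions (speed-up as elapsed time on one processor divided by elapsed time on n processors, scale-up as elapsed time of the small job on the small system divided by elapsed time of the n-times larger job on the n-times larger system). The timings are made-up numbers, not measurements.

```python
def speedup(time_one_cpu, time_n_cpus):
    """Elapsed time on 1 processor divided by elapsed time on n processors."""
    return time_one_cpu / time_n_cpus

def scaleup(time_small_system, time_large_system):
    """Elapsed time of the original job divided by elapsed time of the
    n-times larger job run on an n-times larger system."""
    return time_small_system / time_large_system

# Hypothetical timings in seconds.
n = 4
s = speedup(120.0, 30.0)     # equals n, so the speed-up is linear
u = scaleup(120.0, 120.0)    # equals 1.0, so the scale-up is linear
print(s == n, u)
```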
Types of parallelism:
 Inter-query parallelism: different server threads or processes
handle multiple requests at the same time.
 Intra-query parallelism: this form of parallelism decomposes a serial
SQL query into lower-level operations such as scan, join, sort, etc. These
lower-level operations are then executed concurrently.

Intra-query parallelism can be done in either of two ways:
 Horizontal parallelism: the database is partitioned across multiple disks,
and parallel processing occurs within a specific task that is performed
concurrently on different processors against different sets of data.
 Vertical parallelism: this occurs among different tasks. All query
components such as scan, join, sort, etc. are executed in parallel in a pipelined
fashion. In other words, the output from one task becomes the input to
another task.
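
The difference between the two forms can be sketched as follows. This is an illustration only: the partitions and the filter are hypothetical, a thread pool stands in for separate processors, and the generator pipeline models only the data flow between tasks (a real engine would run each stage on its own processor).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split across three "disks" (partitions).
partitions = [[5, 12, 7], [20, 3], [9, 15, 1]]

def scan_filter(partition):
    """One task (scan plus filter) run against one partition."""
    return [v for v in partition if v > 6]

# Horizontal parallelism: the same task runs concurrently on different partitions.
with ThreadPoolExecutor(max_workers=3) as pool:
    filtered_parts = list(pool.map(scan_filter, partitions))
horizontal_result = sorted(v for part in filtered_parts for v in part)

# Vertical parallelism: different tasks form a pipeline, and each row flows
# from scan to filter to projection as soon as it is produced.
def scan(parts):
    for part in parts:
        yield from part

def keep_large(rows):
    return (v for v in rows if v > 6)

def double(rows):
    return (v * 2 for v in rows)

pipeline_result = list(double(keep_large(scan(partitions))))
print(horizontal_result, pipeline_result)
```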

Types of DBMS Parallelism

1. Data Partitioning: Data partitioning is the key component for effective parallel
execution of database operations. Data partitioning can be done in two ways:
 Random partitioning: includes random data striping across multiple disks
on a single server. Another option for random partitioning is round-robin
partitioning, in which each record is placed on the next disk assigned
to the database.
 Intelligent partitioning: assumes that the DBMS knows where a specific
record is located and does not waste time searching for it across all disks.
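
A small sketch may make the contrast concrete. It assumes five hypothetical records and three disks, and it uses a key-modulo rule as one possible form of intelligent partitioning (range or hash schemes are other options). The function names are illustrative and do not belong to any DBMS.

```python
from collections import defaultdict
from itertools import cycle

# Hypothetical (key, item) records and number of disks.
records = [(101, "laptop"), (102, "phone"), (103, "tablet"),
           (104, "camera"), (105, "printer")]
NUM_DISKS = 3

def round_robin(rows, n):
    """Round-robin placement: each record goes to the next disk in turn."""
    disks = defaultdict(list)
    for disk, row in zip(cycle(range(n)), rows):
        disks[disk].append(row)
    return disks

def key_partition(rows, n):
    """Key-based placement: the record key alone determines the disk."""
    disks = defaultdict(list)
    for row in rows:
        disks[row[0] % n].append(row)
    return disks

def lookup(disks, key, n):
    """With key-based placement only one disk has to be searched."""
    return next(row for row in disks[key % n] if row[0] == key)

rr = round_robin(records, NUM_DISKS)
kp = key_partition(records, NUM_DISKS)
print(dict(rr))
print(lookup(kp, 104, NUM_DISKS))
```

With round-robin placement the load is spread evenly, but a lookup by key may have to search every disk; key-based placement trades some balance for direct access.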

2. Database architectures for parallel processing

There are three DBMS software architecture styles for parallel processing:
 Shared memory or shared everything architecture
 Shared disk architecture
 Shared nothing architecture

2.1 Shared Memory Architecture:
Tightly coupled shared memory systems, illustrated in the following figure, have the
following characteristics:
 Multiple PUs (processing units) share memory.
 Each PU has full access to all shared memory through a common bus.
 Communication between nodes occurs via shared memory.
 Performance is limited by the bandwidth of the memory bus.
 It is simple to implement and provides a single system image; an example is
implementing an RDBMS on an SMP (symmetric multiprocessor).

2.2 Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in
the following figure, have the following characteristics:
 Each node consists of one or more PUs and associated memory.
 Memory is not shared between nodes.
 Communication occurs over a common high-speed bus.
 Each node has access to the same disks and other resources.
 A node can be an SMP if the hardware supports it.
 The bandwidth of the high-speed bus limits the number of nodes (scalability) of
the system.
 A Distributed Lock Manager (DLM) is required.

2.3 Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems
only one CPU is connected to a given disk. If a table or database is located on that
disk, access depends entirely on the PU which owns it.
Shared nothing systems are concerned with access to disks, not access to memory.
Adding more PUs and disks can improve scale up.

Draw the 3-tier data warehouse architecture. Explain ETL process.

Generally, a data warehouse adopts a three-tier architecture. Following are the
three tiers of the data warehouse architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use the back-end tools and
utilities to feed data into the bottom tier. These back-end tools and utilities
perform the extract, clean, load, and refresh functions.
Middle Tier − In the middle tier, we have the OLAP server, which can be
implemented in either of the following ways.
By Relational OLAP (ROLAP), which is an extended relational database
management system. ROLAP maps the operations on multidimensional data to
standard relational operations.
By the Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top Tier − This tier is the front-end client layer. This layer holds the query tools
and reporting tools, analysis tools and data mining tools.
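
The ROLAP idea of mapping multidimensional operations to relational ones can be sketched with sqlite3. In this hedged example a roll-up becomes a GROUP BY and a slice becomes a WHERE clause. The sales table, its rows, and the helper names roll_up and slice_ are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, quarter TEXT, item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Delhi", "Q1", "phone", 100.0),
    ("Delhi", "Q2", "phone", 150.0),
    ("Mumbai", "Q1", "phone", 80.0),
    ("Mumbai", "Q1", "laptop", 300.0),
])

def roll_up(conn, dimensions):
    """Roll-up: aggregate the measure over the dimensions that are kept."""
    cols = ", ".join(dimensions)  # column names come from trusted code only
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

def slice_(conn, column, value):
    """Slice: fix one dimension to a single value (a WHERE clause)."""
    sql = f"SELECT city, quarter, item, amount FROM sales WHERE {column} = ?"
    return conn.execute(sql, (value,)).fetchall()

by_city = roll_up(conn, ["city"])
q1_rows = slice_(conn, "quarter", "Q1")
print(by_city)
print(q1_rows)
```

A MOLAP server would instead keep such aggregates in a multidimensional array structure rather than computing them through SQL.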

Difference between Database System and Data Warehouse:

Database System:
A database system is used in the traditional way of storing and retrieving data. The
major task of a database system is to perform query processing. These systems are
generally referred to as online transaction processing (OLTP) systems. They are used
for the day-to-day operations of an organization.

Data Warehouse:
A data warehouse is the place where a huge amount of data is stored. It is meant for
users or knowledge workers in the role of data analysis and decision making.
These systems are supposed to organize and present data in different formats and
different forms in order to serve the needs of a specific user for a specific purpose.
These systems are referred to as online analytical processing (OLAP) systems.

Database System | Data Warehouse
It supports operational processes. | It supports analysis and performance reporting.
Capture and maintain the data. | Explore the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated by scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.
Flat relational. | Multidimensional.

MULTIDIMENSIONAL DATA MODEL
The multidimensional data model stores data in the form of a data cube. Mostly, data
warehousing supports two- or three-dimensional cubes.
A data cube allows data to be viewed in multiple dimensions. Dimensions are
entities with respect to which an organization wants to keep records. For example,
in a store's sales records, dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations. A multidimensional
database helps to provide data-related answers to complex business queries quickly
and accurately. Data warehouses and Online Analytical Processing (OLAP) tools
are based on a multidimensional data model. OLAP in data warehousing enables
users to view data from different angles and dimensions.

The multidimensional data model is a method for organizing the data in a
database so that its contents are well arranged and assembled.
The multidimensional data model allows users to pose analytical
questions associated with market or business trends, unlike relational databases,
which allow users to access data in the form of queries. It lets users
receive answers to their requests rapidly, because the data can be built and
examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional
databases, which show multiple dimensions of the data to users.

Working on a Multidimensional Data Model

The following stages should be followed by every project for building a
Multi-Dimensional Data Model:
Stage 1: Assembling data from the client: In the first stage, correct data is
collected from the client. Usually, software professionals explain to the client
the range of data that can be obtained with the selected technology, and then
collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the
data is recognized and classified into the respective sections it belongs to,
which makes the model easier to apply step by step.
Stage 3: Noticing the different proportions: The third stage is the basis on
which the design of the system rests. In this stage, the main factors are
recognized from the user's point of view. These factors are known as
"dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the
fourth stage, the factors recognized in the previous stage are used
to identify their related qualities. These qualities are known as "attributes"
in the database.
Stage 5: Finding the actuality of factors which are listed previously and their
qualities: In the fifth stage, the facts are separated and
differentiated from the factors collected so far. These facts
play a significant role in the arrangement of a Multi-Dimensional Data
Model.
Stage 6: Building the schema to place the data, with respect to the information
collected from the steps above: In the sixth stage, a schema is built on the basis
of the data collected previously.
For example:
1. Let us take the example of a firm. The revenue and cost of a firm can be analyzed
on the basis of different factors such as the geographical location of the firm's
workplace, the products of the firm, the advertisements done, the time taken to
develop a product, etc.

Let us take the example of the data of a factory which sells products per quarter in
Bangalore. The data is represented in the table given below:

In the presentation above, the factory's sales for Bangalore are shown along the time
dimension, which is organized into quarters, and the item dimension, which is
sorted according to the kind of item sold. The facts here are represented
in rupees (in thousands).
Now, if we wish to view the sales data in three dimensions, it is represented
in the diagram given below.

Let us consider the data according to item, time and location (like Kolkata, Delhi,
and Mumbai). Here is the table:

This data can be represented conceptually in the form of three dimensions, as
shown in the image below:

Advantages of Multi-Dimensional Data Model

The following are the advantages of a multi-dimensional data model:
 A multi-dimensional data model is easy to handle.
 It is easy to maintain.
 Its performance is better than that of normal databases (e.g. relational
databases).
 The representation of data is better than in traditional databases. That is
because multi-dimensional databases are multi-viewed and carry
different types of factors.
 It is workable on complex systems and applications, unlike simple
one-dimensional database systems.
Disadvantages of Multi-Dimensional Data Model
The following are the disadvantages of a Multi-Dimensional Data Model:
 The multi-dimensional data model is slightly complicated in nature, and it
requires professionals to recognize and examine the data in the database.
 During the work of a Multi-Dimensional Data Model, when the system
caches, there is a great effect on the working of the system.
 Because it is complicated in nature, the databases are generally dynamic
in design.

Data Cube
A data cube enables data to be modeled and viewed in several dimensions. It is
defined by dimensions and facts. Dimensions are the perspectives or
entities with respect to which an organization wants to keep records.
Data is grouped or combined in multidimensional matrices called data
cubes. The data cube method has a few alternative names or variants, such
as "multidimensional databases," "materialized views," and "OLAP (On-Line
Analytical Processing)."
For example, a relation with the schema sales(part, supplier, customer,
sale-price) can be materialized into a set of eight views as shown in the figure,
where psc indicates a view consisting of aggregate function values (such as
total-sales) computed by grouping on the three attributes part, supplier, and
customer, p indicates a view composed of the corresponding aggregate function
values calculated by grouping on part alone, etc.
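
The eight views correspond to the 2^3 subsets of the three grouping attributes. The sketch below materializes them over a few hypothetical rows. The row values, the helper name materialize, and the label "none" for the empty grouping (the grand total) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of sales(part, supplier, customer, sale_price).
sales = [
    ("bolt", "acme", "c1", 10.0),
    ("bolt", "acme", "c2", 15.0),
    ("nut", "acme", "c1", 5.0),
    ("nut", "zeta", "c2", 7.0),
]
ATTRS = ("part", "supplier", "customer")

def materialize(rows, group_by):
    """Total sale_price for every combination of the group_by attributes."""
    idx = [ATTRS.index(a) for a in group_by]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

# One view per subset of attributes: psc, ps, pc, sc, p, s, c and the grand total.
views = {}
for k in range(len(ATTRS), -1, -1):
    for group_by in combinations(ATTRS, k):
        name = "".join(a[0] for a in group_by) or "none"
        views[name] = materialize(sales, group_by)

print(sorted(views))
print(views["p"])
```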

A data cube is created from a subset of attributes in the database.
The model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.
A multidimensional data model is organized around a central theme, such as sales
or transactions.

Example: In the 2-D representation, we will look at the All Electronics sales data
for items sold per quarter in the city of Vancouver. The measure displayed is
dollars sold (in thousands).

3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For
example, suppose we would like to view the data according to time and item as well as
location, for the cities Chicago, New York, Toronto, and Vancouver. The
measure displayed is dollars sold (in thousands). These 3-D data are shown in the
table. The 3-D data of the table are represented as a series of 2-D tables.

Let us suppose that we would like to view our sales data with an additional fourth
dimension, such as supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the
lowest level of summarization is called the base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time,
item, location, and supplier dimensions.

The figure shows a 4-D data cube representation of sales data, according to the
dimensions time, item, location, and supplier. The measure displayed is dollars
sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is
known as the apex cuboid. In this example, this is the total sales, or dollars sold,
summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids
making up a 4-D data cube for the dimensions time, item, location, and supplier. Each
cuboid represents a different degree of summarization.

Question: Explain star, snowflakes and fact constellation schema.
OR
SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL
A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Much like a database, a data warehouse also needs to maintain a
schema. A database uses the relational model, while a data warehouse uses the Star,
Snowflake, and Fact Constellation schemas.
Star Schema
 Each dimension in a star schema is represented with only one dimension
table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the
four dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four
dimensions.
 The fact table also contains the measures, namely dollars sold and units sold.

Star Schema
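
A star schema like the one described can be sketched in sqlite3 as one fact table surrounded by four dimension tables. The column names beyond the keys and the two measures, and all sample rows, are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds one key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO time_dim VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO item_dim VALUES (1, 'phone', 'BrandA'), (2, 'laptop', 'BrandB');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Vancouver'), (2, 'Toronto');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 600.0, 3), (1, 2, 1, 2, 900.0, 1), (2, 1, 1, 1, 400.0, 2);
""")

# A typical star query: join the fact table to the dimensions it needs, then aggregate.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```

Every dimension is reached with a single join from the fact table, which is what gives the schema its star shape.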

Snowflake Schema
 Some dimension tables in the snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike in the star schema, the dimension tables in a snowflake schema are
normalized. For example, the item dimension table of the star schema is
normalized and split into two dimension tables, namely the item and supplier
tables.
 Now the item dimension table contains the attributes item_key, item_name,
type, brand, and supplier_key.
 The supplier key is linked to the supplier dimension table. The supplier
dimension table contains the attributes supplier_key and supplier_type.

Snowflake Schema
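
The item and supplier split described above can be sketched as follows. The table names and sample rows are hypothetical, and the fact table is reduced to one key and one measure to keep the example short. The point is that reaching supplier_type now takes two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The supplier attributes are normalized out of the item dimension.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
CREATE TABLE sales_fact (
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL
);
INSERT INTO supplier_dim VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_dim VALUES
    (10, 'phone', 'electronics', 'BrandA', 1),
    (11, 'laptop', 'electronics', 'BrandB', 1),
    (12, 'desk', 'furniture', 'BrandC', 2);
INSERT INTO sales_fact VALUES (10, 500.0), (11, 900.0), (12, 200.0), (10, 100.0);
""")

# fact -> item -> supplier: the extra join is the price of normalization.
by_supplier_type = conn.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i     ON f.item_key = i.item_key
    JOIN supplier_dim s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
print(by_supplier_type)
```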
Fact Constellation Schema
 A fact constellation has multiple fact tables. It is also known as a galaxy
schema.
 The following diagram shows two fact tables, namely sales and shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars cost and
units shipped.
 It is also possible to share dimension tables between fact tables. For
example, the time, item, and location dimension tables are shared between the
sales and shipping fact tables.

Fact Constellation Schema
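
The sharing of dimension tables between fact tables can be sketched as below. The schema is deliberately cut down to two shared dimensions, and the measure name units_shipped and all rows are assumptions for illustration rather than a reproduction of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimensions.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Two fact tables referencing the same dimensions.
CREATE TABLE sales_fact    (time_key INTEGER, item_key INTEGER, units_sold INTEGER);
CREATE TABLE shipping_fact (time_key INTEGER, item_key INTEGER, units_shipped INTEGER);
INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO item_dim VALUES (1, 'phone');
INSERT INTO sales_fact VALUES (1, 1, 5), (2, 1, 7);
INSERT INTO shipping_fact VALUES (1, 1, 4), (2, 1, 7);
""")

# Because both fact tables use time_dim, their measures can be lined up per quarter.
rows = conn.execute("""
    SELECT t.quarter,
           (SELECT SUM(units_sold) FROM sales_fact s WHERE s.time_key = t.time_key),
           (SELECT SUM(units_shipped) FROM shipping_fact h WHERE h.time_key = t.time_key)
    FROM time_dim t
    ORDER BY t.quarter
""").fetchall()
print(rows)
```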

Give the difference between star and fact constellation multidimensional data
model.

Data Warehouse Applications

Data warehouses are widely used in the following fields −
 Financial services
 Banking services
 Consumer goods
 Retail sectors
 Controlled manufacturing

Data Mart
A data mart is focused on a single functional area of an organization and contains a
subset of the data stored in a data warehouse.
A data mart is a condensed version of a data warehouse and is designed for use by a
specific department, unit or set of users in an organization, e.g., marketing,
sales, HR or finance. It is often controlled by a single department in an
organization.

Types of Data Mart

There are three main types of data mart:
Dependent: Dependent data marts are created by drawing data from a central data
warehouse, which is itself fed from operational sources, external sources, or both. A
dependent data mart is a logical or physical subset of a larger data warehouse. These
data marts depend on the data warehouse and extract the essential records from it. In
this technique, because the data warehouse creates the data mart, there is no need for
data mart integration. It is also known as a top-down approach.

Independent: An independent data mart is created without the use of a central data
warehouse. In this approach, because all the data marts are designed independently,
integration of the data marts is required. It is also termed a bottom-up
approach, as the data marts are integrated to develop a data warehouse.

Hybrid: This type of data mart can take data from data warehouses or operational
systems. It allows us to combine input from sources other than a data warehouse.
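
As a rough sketch, a dependent data mart defined as a logical subset can be modeled as a view over the warehouse, restricted to one department. The table warehouse_facts, the view sales_mart and the rows are hypothetical. A physical data mart would copy the rows into its own tables instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL);
INSERT INTO warehouse_facts VALUES
    ('Sales', 'revenue', 1200.0), ('Sales', 'orders', 40.0),
    ('HR', 'headcount', 25.0), ('Finance', 'budget', 5000.0);
-- Dependent data mart: a logical subset drawn from the warehouse.
CREATE VIEW sales_mart AS
    SELECT metric, value FROM warehouse_facts WHERE department = 'Sales';
""")

mart_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric").fetchall()
print(mart_rows)
```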

Common questions

Maintaining a data warehouse poses several challenges, which can be addressed as follows: 1. **Data Integration and Consistency**: Integrating diverse data sources can lead to inconsistencies. This challenge can be addressed by employing robust ETL processes that include thorough data validation, cleansing, and transformation steps to ensure consistent and accurate data across the warehouse . 2. **Scalability**: As data volumes grow, maintaining performance is challenging. Implementing shared nothing architecture can enhance scalability, enabling easy addition of nodes to handle more data without bottlenecks . 3. **System Performance**: Optimizing query performance in the face of complex analytical queries is challenging. Using data cubes to pre-compute aggregations and leveraging OLAP tools can improve query response times and reduce computational stress on the system . 4. **Metadata Management**: Managing metadata is crucial for maintaining the data warehouse’s inner workings. Implementing automated metadata management tools can ensure efficient documentation and data lineage, which are vital for auditing and data governance . 5. **Security**: Protecting sensitive data within the warehouse is essential. Adopting stringent access control policies and data encryption methods can mitigate potential security risks. By addressing these challenges with appropriate strategies and tools, data warehouses can maintain their integrity, performance, and security over time.

The multidimensional data model offers several advantages and some disadvantages in data warehousing: **Advantages**: 1. **Performance**: Multidimensional models allow for better performance in complex queries involving aggregation or summarization, due to pre-calculated structures like cubes . 2. **Ease of Use**: Users find it straightforward to understand data presented in cubes with dimensions that align with business perspectives, like products or time periods . 3. **Multiple Views**: The model supports various angles and dimensions, allowing deeper insights and facilitating powerful analytical processing . 4. **Efficiency**: Multidimensional data aligns well with OLAP tools, thus enhancing query response time and analytical capabilities . **Disadvantages**: 1. **Complexity**: The multidimensional model can be complicated to design and maintain as it requires specialized knowledge and understanding of the domain . 2. **Resource-intensive**: Processing and maintaining multiple dimensions, especially in large datasets, can be demanding on resources, requiring additional computation power and storage . 3. **Flexibility Limitations**: Once established, changing dimensions or hierarchical structures can be challenging, making adaptability tougher compared to more flexible relational models . These factors must be carefully considered when selecting a data model to best suit the analytical needs of a given data warehouse system.

OLAP (Online Analytical Processing) technology enhances data analysis within a data warehouse by providing users with the capability to perform complex calculations and rapid data exploration. Its key functionalities include: 1. **Multidimensional Views**: OLAP allows users to view data across multiple dimensions—for instance, analyzing sales data by product, region, and time simultaneously—enabling comprehensive data exploration . 2. **Complex Analytical Queries**: OLAP supports complex queries involving aggregations, calculations, and datasets relationships, allowing users to derive insights through detailed numerical analysis . 3. **Drill-Down and Roll-Up**: Enables users to navigate through various levels of data granularity, such as drilling down to more detailed data or rolling up to higher summary data, which supports in-depth analysis and trend identification . 4. **Slice and Dice**: This functionality allows users to select and analyze specific subsets of data, providing flexibility to view data patterns and correlations from different angles without changing the data structure . OLAP contributes significantly to empowering users to perform advanced data analyses efficiently, leveraging the data warehouse's capabilities to drive business intelligence and strategic decision-making.

A data warehouse consists of several key components that contribute to its overall functionality: 1. **Data Warehouse Database**: This is the central repository implemented on RDBMS technology. It stores consolidated historical data critical for analysis but can be limited by the traditional RDBMS optimization for transactional processing rather than data warehousing . 2. **ETL Tools**: These include Sourcing, Acquisition, Clean-up, and Transformation Tools responsible for converting and consolidating data from diverse sources into a unified format, simplifying the reporting and analysis process . 3. **Metadata**: This is the "data about data" that defines the data warehouse. Technical metadata assists designers and administrators, while business metadata helps end-users understand the stored information for decision-making . 4. **Query Tools**: They enable interaction with the data warehouse and are crucial for strategic decisions. These tools include query and reporting tools, application development tools, data mining tools, and OLAP tools . 5. **Data Marts**: Considered an access layer for retrieving data, data marts can be built quickly and cost-effectively and are used when a full-scale data warehouse is unnecessary . These components work in tandem to ensure the effective implementation, maintenance, and utility of a data warehouse.

ETL (Extract, Transform, Load) tools are vital for creating an effective data warehouse environment. These tools: 1. **Extract Data**: ETL tools extract data from various sources, maintaining consistency by pulling in structured and unstructured data from multiple, often heterogeneous, data sources . 2. **Transform Data**: This involves converting extracted data into a consistent format or structure, which includes data cleansing, summarization, and normalization, ensuring integration and compatibility across the warehouse . 3. **Load Data**: Once transformed, data is loaded into the data warehouse for storage, analysis, and further processing. This stage is critical for updating the warehouse with new or changing data without disrupting operational processes . By automating these steps, ETL tools significantly streamline the data warehousing process, reducing manpower and error rates, while ensuring that the data warehouse maintains accurate and current historical data for analysis and decision-making.

The shared nothing architecture improves scalability in data warehousing systems more effectively than other architectures due to its design:

1. **Independence of Nodes**: Each node in a shared nothing system has its own processor and disk storage, eliminating resource sharing, which allows for straightforward scaling by adding additional nodes.
2. **Elimination of Bottlenecks**: Unlike shared memory or shared disk architectures where a common resource can become a bottleneck, the shared nothing architecture's independence mitigates such constraints, improving system scalability.
3. **Performance Optimization**: Since each node operates independently, it can be finely tuned for its respective workload, thereby optimizing performance for query processing and data operations.
4. **Efficient Resource Allocation**: The localized processing and storage allow for better resource assignment and management, facilitating efficient use of system resources as the demands on the warehouse grow.

By addressing these aspects, shared nothing architecture effectively supports the scaling of data warehousing systems, making it appropriate for handling large volumes of data.
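A rough sketch of the idea, assuming hash partitioning on a text key: each node keeps its own rows, computes a local aggregate, and only the small partial results are combined. The `Node` class, the use of `zlib.crc32` as the partitioning hash, and the sample sales data are illustrative choices, not part of any particular product.

```python
import zlib


class Node:
    """One shared-nothing node: private storage, no shared memory or disk."""

    def __init__(self):
        self.rows = []

    def local_sum(self):
        return sum(amount for _, amount in self.rows)


def node_for(key, node_count):
    # Deterministic hash partitioning; crc32 is one simple choice among many.
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows, node_count):
    nodes = [Node() for _ in range(node_count)]
    for key, amount in rows:
        nodes[node_for(key, node_count)].rows.append((key, amount))
    return nodes


def total_sales(nodes):
    # Each node aggregates its own partition; only the partial sums are combined.
    return sum(node.local_sum() for node in nodes)


sales = [("north", 100), ("south", 250), ("east", 75), ("west", 125)]
three_nodes = distribute(sales, 3)
four_nodes = distribute(sales, 4)
```

Adding a node changes only how rows are partitioned; the query logic in `total_sales` stays the same, which is the scaling property described above.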

The primary differences between a traditional database system and a data warehouse pertain to their usage and data management capabilities:

1. **Purpose**: Database systems are designed for day-to-day operations and are transaction-oriented, supporting online transaction processing (OLTP). In contrast, data warehouses are used for data analysis and decision making, supporting online analytical processing (OLAP).
2. **Data Scope**: Databases usually contain current data necessary for ongoing operations, while data warehouses store large volumes of historical data (multiple years) intended for analysis and insight generation.
3. **Data Updating and Verification**: Databases update data upon transaction occurrences with immediate verification, whereas data warehouses update data on scheduled processes, often incorporating data verification post hoc.
4. **Data Structuring**: Databases are application-oriented and often highly detailed, whereas data warehouses are subject-oriented and often summarize data across various dimensions.
5. **Underlying Models**: Database systems typically utilize ER-based, flat relational models, whereas data warehouses use multidimensional models, such as star or snowflake schemas, to facilitate faster query performance for analytical tasks.

These differences highlight the tailored adaptations of databases for operational tasks and data warehouses for analytical functions.
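To make the star schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are hypothetical; the point is the shape of the query, where one central fact table is joined to small dimension tables and summarized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 15.0), (1, 11, 5.0);
""")

# Typical analytical (OLAP-style) query: summarize facts along two dimensions.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()
conn.close()
```

An OLTP workload would instead insert or update individual `fact_sales` rows one transaction at a time; the warehouse query reads many rows and returns a summary.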

Metadata serves a crucial role in a data warehouse by providing essential information that defines its structure and contents, aiding in effective management in several ways:

1. **Technical Metadata**: Provides detailed information on data sources, data transformations, and storage structures, which aids warehouse designers and administrators in building, maintaining, and managing the warehouse infrastructure.
2. **Business Metadata**: Assists end-users by providing contextual information about warehouse data, offering descriptions that make it easier for users to understand data contents and usage for decision-making.
3. **Facilitating Integration**: Metadata ensures integration across diverse data sources by maintaining logical mappings, transformations, and relationships of data, fostering consistency and reducing errors across the data flow process.
4. **Documentation and Auditability**: Metadata offers comprehensive documentation of how data is transformed and subsequently stored, maintaining data lineage and supporting audit requirements by tracking any changes over time.

Overall, metadata acts as a critical enabler for the effective functioning of a data warehouse, bridging the technical and business aspects to support structured data analysis and governance.
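As a small illustration of how technical and business metadata can sit side by side, the sketch below keeps one record per warehouse column. The `ColumnMetadata` and `Catalog` classes, their fields, and the `net_revenue` example are hypothetical names invented for this example, not a standard metadata format.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    name: str
    source: str           # technical: where the value comes from
    transformation: str   # technical: how it was derived
    description: str      # business: meaning for end-users


@dataclass
class Catalog:
    columns: dict = field(default_factory=dict)

    def register(self, meta):
        self.columns[meta.name] = meta

    def lineage(self, name):
        """Return a one-line lineage trail: source, transformation, target column."""
        meta = self.columns[name]
        return f"{meta.source} -> {meta.transformation} -> {name}"


catalog = Catalog()
catalog.register(ColumnMetadata(
    name="net_revenue",
    source="orders.gross_amount",
    transformation="gross_amount - discount",
    description="Revenue after discounts, in the reporting currency",
))
```

An administrator would read the `source` and `transformation` fields, while an end-user would mostly rely on `description`; `catalog.lineage("net_revenue")` gives the audit trail for that column.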

Data cubes facilitate complex queries in data warehousing environments by providing a precomputed, multidimensional view of data which allows for efficient processing of analytical queries:

1. **Multidimensional Structure**: Data cubes enable data to be organized and stored in multiple dimensions, such as time, location, and product, allowing queries to access and compute aggregated results like sums or averages quickly.
2. **Optimized Aggregations**: Queries that require aggregated data from multiple dimensions (e.g., sales per region over time) can retrieve this data faster due to precomputed aggregations facilitated by cubes, reducing the need for resource-intensive computations at query time.
3. **Enhanced OLAP Operations**: Data cubes support various OLAP operations such as drill-down, roll-up, slicing, and dicing, which allow analysts to explore the data seamlessly from multiple perspectives and granularities.
4. **Improved Query Response Time**: Since cubes store aggregated data, queries that would take a long time over raw tables can be answered much faster, which improves interactive data exploration.

These features make data cubes an essential tool for enabling fast and efficient processing of complex analytical queries within data warehouses.
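The precomputation idea can be sketched as follows, assuming three dimensions (year, region, item) and a single SUM measure. The `ALL` marker, the `build_cube` helper, and the sample facts are assumptions for illustration; production OLAP servers use far more compact storage than a Python dictionary.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker meaning "aggregated over this dimension"


def build_cube(facts):
    """Precompute SUM(amount) for every combination of kept or rolled-up dimensions."""
    cube = defaultdict(float)
    for year, region, item, amount in facts:
        # Each fact contributes to 2**3 = 8 cells: every dimension is kept or rolled up.
        for key in product((year, ALL), (region, ALL), (item, ALL)):
            cube[key] += amount
    return cube


facts = [
    (2023, "North", "Pen", 10.0),
    (2023, "South", "Pen", 20.0),
    (2024, "North", "Ink", 5.0),
    (2024, "North", "Pen", 15.0),
]
cube = build_cube(facts)

roll_up_2024 = cube[(2024, ALL, ALL)]        # roll-up: region and item aggregated away
slice_north = cube[(ALL, "North", ALL)]      # slice: fix one dimension value
drill_down = cube[(2024, "North", "Pen")]    # drill-down: finest granularity
grand_total = cube[(ALL, ALL, ALL)]
```

Every query above is a single dictionary lookup because the sums were computed when the cube was built, at the cost of storing up to eight cells per distinct fact.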

A three-tier data warehouse architecture enhances data management and accessibility through its distinct layers:

1. **Bottom Tier**: This is the database server where the data warehouse is stored. It uses back-end tools and utilities to feed data, performing functions like extraction, cleaning, loading, and refreshing. This separation of the database ensures data integrity and optimizes storage.
2. **Middle Tier**: The OLAP server operates in this layer. It can be implemented using Relational OLAP (ROLAP), which maps multidimensional operations onto relational operations, or Multidimensional OLAP (MOLAP), which works on multidimensional data and operations directly. This tier processes analytical queries efficiently.
3. **Top Tier**: This is the front-end client layer, comprising query, reporting, analysis, and data mining tools, which allow end-users to interact with the data warehouse, enhancing decision-making capabilities.

Together, these tiers support a streamlined flow of data and information, supporting complex analytical processing and offering robust tools for business intelligence.



Characteristics of Data warehouse

 Subject Oriented − A data warehouse is subject oriented because it provides

DATA WAREHOUSE COMPONENTS

There are mainly five components of Data Warehouse:

BUILDING A DATA WAREHOUSE

Extracting Transactional Data:

Transforming Transactional Data:

Loading the Data:

Generating Pre calculated Summary Values:

Building (or Purchasing) a Front-End Reporting Tool

MAPPING THE DATA WAREHOUSE TO A MULTIPROCESSOR

Types of DBMS Parallelism

2. Data base architectures of parallel processing

2.2 Shared Disk Architecture

Generally, a data warehouse adopts a three-tier architecture. Following are the

Difference between Database System and Data Warehouse:

Database System Data Warehouse

Working on a Multidimensional Data Model

Advantages of Multi-Dimensional Data Model

Fact Constellation Schema

Data Warehouse Applications

Types of Data Mart

Common questions

What are the potential challenges in maintaining a data warehouse, and how can these challenges be addressed?

Discuss the advantages and disadvantages of using a multidimensional data model in data warehousing.

How does OLAP technology enhance data analysis within a data warehouse, and what are its key functionalities?

What are the essential components of a data warehouse, and how do they contribute to its functionality?

In what ways do ETL tools contribute to creating an effective data warehouse environment?

How does the shared nothing architecture improve scalability in data warehousing systems compared to other architectures?

What are the differences between a traditional database system and a data warehouse, particularly in terms of their usage and data management capabilities?

What role does metadata play in a data warehouse, and how does it aid in managing the data warehouse effectively?

How do data cubes facilitate complex queries in data warehousing environments?

How does a three-tier data warehouse architecture enhance data management and accessibility?
