Advanced Database Management Systems Chapter 9- Datawarehouse & DataMining

G3 CS&IT
INTRODUCTION TO DATA WAREHOUSING & DATA MINING

Introduction
The reason for collecting, storing, and managing data is to generate information that becomes
the basis for rational decision making. Decision support systems (DSSs) were originally
developed to facilitate the decision-making process. However, as the complexity and range of
information requirements increased, so did the difficulty of extracting all the necessary
information from the data structures found in an operational database. Therefore, a new data
storage facility, called a data warehouse, was developed. The data warehouse extracts or
obtains its data from operational databases as well as from external sources, providing a more
comprehensive data pool.
In parallel with data warehouses, new ways to analyze and present decision support
data were developed. Online analytical processing (OLAP) provides advanced data analysis
and presentation tools (including multidimensional data analysis). Data mining employs
advanced statistical tools to analyze the wealth of data now available through data
warehouses and other sources and to identify possible relationships and anomalies.

BUSINESS INTELLIGENCE (BI)


• The need for data analysis led to the development of business intelligence, a
comprehensive and integrated decision support framework within organizations.
• Business intelligence (BI) is a term used to describe a comprehensive, cohesive, and
integrated set of tools and processes used to capture, collect, integrate, store, and analyze
data with the purpose of generating and presenting information used to support business
decision making.
• BI is a framework that allows a business to transform data into information, information
into knowledge, and knowledge into wisdom.
• BI is not a product by itself, but a framework of concepts, practices, tools, and
technologies that help a business better understand its core capabilities, provide snapshots
of the company situation, and identify key opportunities to create competitive advantage.
• BI involves the following general steps:
1. Collecting and storing operational data.
2. Aggregating the operational data into decision support data.
3. Analyzing decision support data to generate information.
4. Presenting such information to the end user to support business decisions.
5. Making business decisions, which in turn generate more data that is collected, stored,
etc. (restarting the process).
6. Monitoring results to evaluate outcomes of the business decisions (providing more
data to be collected, stored, etc.).
• The focus of traditional information systems was on operational automation and reporting;
in contrast, BI tools focus on the strategic and tactical use of information.
• Implementing BI requires various technologies and components. The basic components
that form the BI infrastructure make up the BI architecture. There are four basic
components that all BI environments should provide.

Mrs.Pravicha.M.T Page 1
COMPONENT: DESCRIPTION

ETL TOOLS: Data extraction, transformation, and loading (ETL) tools collect, filter,
integrate, and aggregate operational data to be saved into a data store optimized for
decision support. For example, to determine the relative market share by selected product
lines, you require data on competitors' products. Such data can be located in external
databases provided by industry groups or by companies that market the data. As the name
implies, this component extracts the data, filters the extracted data to select the relevant
records, and packages the data in the right format to be added to the data store component.

DATA STORE: The data store is optimized for decision support and is generally represented
by a data warehouse or a data mart. The data store contains two main types of data:
business data and business model data. The business data are extracted from the operational
database and from external data sources. The business data are stored in structures that are
optimized for data analysis and query speed. The external data sources provide data that
cannot be found within the company but that are relevant to the business, such as stock
prices, market indicators, marketing information, and competitors' data. Business models are
generated by special algorithms that model the business to identify and enhance the
understanding of business situations and problems.

DATA QUERY AND ANALYSIS TOOLS: This component performs data retrieval, data analysis, and
data-mining tasks using the data in the data store. This component is used by the data
analyst to create the queries that access the database. Depending on the implementation, the
query tool accesses either the operational database or, more commonly, the data store. This
tool advises the user on which data to select and how to build a reliable business data
model. This component is generally represented in the form of an OLAP tool.

DATA PRESENTATION AND VISUALIZATION TOOLS: This component is in charge of presenting the
data to the end user in a variety of ways. This component is used by the data analyst to
organize and present the data. This tool helps the end user select the most appropriate
presentation format, such as summary report, map, pie or bar graph, or mixed graphs. The
query tool and the presentation tool are the front end to the BI environment.

SOME OF THE BUSINESS INTELLIGENCE TOOLS & VENDORS

TOOL: DESCRIPTION, FOLLOWED BY SAMPLE VENDORS

Portals: Portals provide a unified, single point of entry for information distribution.
Portals are a Web-based technology that uses a Web browser to integrate data from multiple
sources into a single Web page. Many different types of BI functionality can be accessed
through a portal.
Sample vendors: Oracle Portal, Actuate, Microsoft.

Data warehouses (DW): The data warehouse is the foundation on which a BI infrastructure is
built. Data is captured from the OLTP system and placed in the DW on near-real-time basis.
BI provides company-wide integration of data and the capability to respond to business
issues in a timely manner.
Sample vendors: Microsoft, Oracle, IBM/Cognos, MicroStrategy.

OLAP tools: Online analytical processing provides multidimensional data analysis.
Sample vendors: Cognos/IBM, Oracle, Microsoft.

Data-mining tools: Tools that provide advanced statistical analysis to uncover problems and
opportunities hidden within business data.
Sample vendors: MicroStrategy Intelligence, MS Analytics Services.

Data visualization: Tools that provide advanced visual analysis and techniques to enhance
understanding of business data.
Sample vendors: iDashboards.

Note: The BI environment exists to support the manager; it does not replace the management
function. If the manager fails to ask the appropriate questions, problems will not be identified
and solved, and opportunities will be missed. In spite of the very powerful BI presence, the
human component is still at the centre of business technology.

OPERATIONAL DATA (OD) V/S DECISION SUPPORT DATA (DSD)


From the data analyst’s point of view, decision support data differ from operational data in
three main areas: time span, granularity, and dimensionality.
1. Time span: Operational data cover a short time frame, whereas DSD tend to cover a longer
time frame. For example, managers are seldom interested in a specific sales invoice to
customer X; rather, they tend to focus on sales generated during the last month, the last
year, or the last five years.
2. Granularity (level of aggregation): DSD must be presented at different levels of
aggregation, from highly summarized to near-atomic. For example, if managers must
analyze sales by region, they must be able to access data showing the sales by region, by
city within the region, by store within the city within the region, and so on. In that case,
summarized data to compare the regions is required, and also data in a structure that
enables a manager to drill down, or decompose, the data into more atomic components
(that is, finer-grained data at lower levels of aggregation). In contrast, when you roll up
the data, you are aggregating the data to a higher level.
3. Dimensionality: Operational data focus on representing individual transactions rather than
on the effects of the transactions over time. In contrast, data analysts tend to include many
data dimensions and are interested in how the data relate over those dimensions. For
example, an analyst might want to know how product X fared relative to product Z during
the past six months by region, state, city, store, and customer. In that case, both place and
time are part of the picture.
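
The drill-down and roll-up operations described under granularity can be illustrated with a minimal Python sketch. The sales rows, the region/city/store hierarchy values, and the `aggregate` helper are all hypothetical and exist only for illustration; a real decision support database would do this work inside the DBMS.

```python
from collections import defaultdict

# Hypothetical atomic sales facts: (region, city, store, amount).
SALES = [
    ("North", "Oslo", "Store 1", 120.0),
    ("North", "Oslo", "Store 2", 80.0),
    ("North", "Bergen", "Store 3", 50.0),
    ("South", "Rome", "Store 4", 200.0),
]


def aggregate(rows, depth):
    """Sum amounts grouped by the first `depth` levels of the hierarchy.

    depth=1 groups by region, depth=2 by (region, city),
    depth=3 by (region, city, store).
    """
    totals = defaultdict(float)
    for *dims, amount in rows:
        totals[tuple(dims[:depth])] += amount
    return dict(totals)


by_region = aggregate(SALES, 1)  # rolled up to the most summarized level
by_city = aggregate(SALES, 2)    # drilled down one level
by_store = aggregate(SALES, 3)   # near-atomic level
```

Moving from `by_region` to `by_city` is a drill down; moving in the opposite direction is a roll up. Every level is computed from the same atomic rows, only the grouping depth changes.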
Some other comparisons with respect to data analyst:

DECISION SUPPORT DATABASE REQUIREMENTS
A decision support database is a specialized DBMS tailored to provide fast answers to
complex queries. There are four main requirements for a decision support database: the
database schema, data extraction and loading, the end-user analytical interface, and database
size.
Database Schema: The decision support database schema must support complex
(non-normalized) data representations. The queries must be able to extract multidimensional
time slices. The decision support database schema must also be optimized for query
(read-only) retrievals. To optimize query speed, the DBMS must support features such as
bitmap indexes and must be enhanced to support the non-normalized and complex structures.

Data Extraction and Filtering: The DBMS must support advanced data extraction and
data-filtering tools. The data extraction capabilities should allow batch and scheduled data
extraction and should support different data sources: flat files and hierarchical, network,
and relational databases, as well as multiple vendors. Data-filtering capabilities must
include the ability to check for inconsistent data or data validation rules. Finally, to
filter and integrate the operational data into the decision support database, the DBMS must
support advanced data integration, aggregation, and classification.

End-User Analytical Interface: The decision support DBMS must support advanced
data-modelling and data presentation tools. These tools are used:
• To define the nature and extent of business problems.
• To generate the necessary queries to retrieve the appropriate data from the decision
support database.
• To evaluate the query results with data analysis tools supported by the decision support
DBMS.
• To optimize the query for speedy processing.
The end-user analytical interface is one of the most critical DBMS components for such
issues. When properly implemented, an analytical interface permits the user to navigate
through the data to simplify and accelerate the decision-making process.

Database Size: Decision support databases tend to be very large; gigabyte and terabyte
ranges are not unusual. The decision support database typically contains redundant and
duplicated data to improve data retrieval and simplify information generation. Therefore,
the DBMS must be capable of supporting very large databases (VLDBs). To support a VLDB
adequately, the DBMS might be required to use advanced hardware, such as multiple disk
arrays, and to support multiple-processor technologies, such as a symmetric multiprocessor
(SMP) or a massively parallel processor (MPP).

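
The schema requirement above names bitmap indexes as a feature that speeds up read-only queries. As a rough illustration of the idea only, the Python sketch below keeps one bit vector per distinct column value and answers a two-condition query with a single bitwise AND. The rows and the helper names `build_bitmap_index` and `matching_rows` are assumptions made for this example; they do not describe how any particular DBMS implements its indexes.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector.

    Bit i is set when row i holds that value.
    """
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_rows(bitmap):
    """Return the row positions whose bit is set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical decision support rows.
ROWS = [
    {"region": "North", "product": "X"},
    {"region": "South", "product": "X"},
    {"region": "North", "product": "Z"},
    {"region": "North", "product": "X"},
]

region_idx = build_bitmap_index(ROWS, "region")
product_idx = build_bitmap_index(ROWS, "product")

# "region = North AND product = X" becomes one bitwise AND of two bitmaps.
hits = matching_rows(region_idx["North"] & product_idx["X"])
```

This kind of index suits columns with few distinct values and data that is rarely updated, which matches the read-only profile described above.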
The complex information requirements and the ever-growing demand for sophisticated data
analysis sparked the creation of a new type of data repository. This repository contains data in
formats that facilitate data extraction, data analysis, and decision making. This data
repository is known as a data warehouse and has become the foundation for a new
generation of decision support systems.

THE DATA WAREHOUSE


Bill Inmon, the acknowledged “father” of the data warehouse, defines the term as “an
integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support
for decision making.”
Integrated: The data warehouse is a centralized, consolidated database that integrates data
derived from the entire organization and from multiple sources with diverse formats. Data
integration implies that all business entities, data elements, data characteristics, and
business metrics are described in the same way throughout the enterprise. To avoid the
potential format tangle, the data in the data warehouse must conform to a common format
acceptable throughout the organization. This integration can be time-consuming, but once
accomplished, it enhances decision making and helps managers better understand the
company’s operations.

Subject-oriented: Data warehouse data are arranged and optimized to provide answers to
questions coming from diverse functional areas within a company. Data warehouse data are
organized and summarized by topic, such as sales, marketing, finance, distribution, and
transportation. For each topic, the data warehouse contains specific subjects of interest:
products, customers, departments, regions, promotions, and so on.

Time-variant: Warehouse data represent the flow of data through time. They are also
time-variant in the sense that once data are periodically uploaded to the data warehouse,
all time-dependent aggregations are recomputed. The data warehouse contains a time ID that
is used to generate summaries and aggregations by week, month, quarter, year, and so on.
Once the data enter the data warehouse, the time ID assigned to the data cannot be changed.

Non-volatile: Once data enter the data warehouse, they are never removed, and so the data
warehouse is always growing. Because the data in the warehouse represent the company’s
history, the operational data, representing the near-term history, are always added to it.

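
The time-variant and non-volatile characteristics can be pictured together with a small Python sketch: rows are only ever appended, each row keeps the time ID it was loaded with, and the monthly totals are recomputed from all rows after every load. The class name `WarehouseFacts` and its methods are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict
from datetime import date


class WarehouseFacts:
    """Append-only fact store; each row keeps the time ID it was loaded with."""

    def __init__(self):
        # (time_id, amount) pairs. There is deliberately no update or delete.
        self._rows = []

    def load(self, time_id, amounts):
        """Periodic upload: add new facts stamped with an immutable time ID."""
        self._rows.extend((time_id, amount) for amount in amounts)

    def totals_by_month(self):
        """Recompute the time-dependent aggregation from all stored rows."""
        totals = defaultdict(float)
        for time_id, amount in self._rows:
            totals[(time_id.year, time_id.month)] += amount
        return dict(totals)

    def __len__(self):
        return len(self._rows)


warehouse = WarehouseFacts()
warehouse.load(date(2024, 1, 15), [100.0, 50.0])
warehouse.load(date(2024, 1, 31), [25.0])
warehouse.load(date(2024, 2, 1), [10.0])
monthly = warehouse.totals_by_month()
```

The same pattern extends to week, quarter, or year by changing the grouping key derived from the time ID.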
THE ETL PROCESS
A data warehouse is usually a read-only database optimized for data analysis and query
processing. Typically, data are extracted from various sources and are then transformed and
integrated (in other words, passed through a data filter) before being loaded into the data
warehouse. As mentioned, this process of extracting, transforming, and loading the
aggregated data into the data warehouse is known as ETL.
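
As a minimal sketch of the extract, transform, and load steps just described, the Python example below pulls records from two made-up sources with different formats, filters and unifies them, aggregates them by product and month, and loads the result into an in-memory SQLite table. The source data, the `monthly_sales` table, and the function names are assumptions for illustration; production ETL tools handle far more sources and validation rules.

```python
import sqlite3

# Hypothetical operational records from two systems with different formats.
ORDERS_SYSTEM_A = [
    {"sku": "X", "qty": 2, "unit_price": 10.0, "date": "2024-01-05"},
    {"sku": "Z", "qty": 0, "unit_price": 5.0, "date": "2024-01-06"},  # invalid
]
ORDERS_SYSTEM_B = [
    ("x", "3", "10.00", "2024-01-09"),
    ("Z", "1", "5.00", "2024-02-01"),
]


def extract():
    """Extract: read every source and yield (sku, qty, price, date) tuples."""
    for record in ORDERS_SYSTEM_A:
        yield record["sku"], record["qty"], record["unit_price"], record["date"]
    yield from ORDERS_SYSTEM_B


def transform(records):
    """Transform: filter bad rows, unify formats, aggregate by (sku, month)."""
    totals = {}
    for sku, qty, price, day in records:
        qty, price = int(qty), float(price)
        if qty <= 0:  # data filter: drop rows that fail validation
            continue
        key = (sku.upper(), day[:7])  # common format: upper-case SKU, YYYY-MM
        totals[key] = totals.get(key, 0.0) + qty * price
    return [(sku, month, amount) for (sku, month), amount in sorted(totals.items())]


def load(conn, rows):
    """Load: write the aggregated rows into the decision support table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_sales (sku TEXT, month TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
result = conn.execute(
    "SELECT sku, month, amount FROM monthly_sales ORDER BY sku, month"
).fetchall()
```

Keeping the three steps as separate functions mirrors the way the process is described: each step can be scheduled, tested, or replaced without touching the others.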

DATA MART
Creating a data warehouse requires time, money, and considerable managerial effort.
Therefore, many companies begin their foray into data warehousing by focusing on more
manageable data sets that are targeted to meet the special needs of small groups within the
organization. These smaller data stores are called data marts. A data mart is a small, single-
subject data warehouse subset that provides decision support to a small group of people. In
addition, a data mart could also be created from data extracted from a larger data warehouse
with the specific function to support faster data access to a target group or function. That is,
data marts and data warehouses can coexist within a business intelligence environment. Data
marts can serve as a test vehicle for companies exploring the potential benefits of data
warehouses. By gradually migrating from data marts to data warehouses, a specific
department’s decision support needs can be addressed within a reasonable time frame (six
months to one year) as opposed to the longer time frame usually required to implement a
data warehouse (one to three years). The only difference between a data mart and a data
warehouse is the size and scope of the problem being solved.

TWELVE RULES THAT DEFINE A DATA WAREHOUSE
In 1994, William H. Inmon and Chuck Kelley created 12 rules defining a data warehouse.
1. The data warehouse and operational environments are separated.
2. The data warehouse data are integrated.
3. The data warehouse contains historical data over a long time.
4. The data warehouse data are snapshot data captured at a given point in time.
5. The data warehouse data are subject oriented.
6. The data warehouse data are mainly read-only with periodic batch updates from
operational data. No online updates are allowed.
7. The data warehouse development life cycle differs from classical systems development.
The data warehouse development is data-driven; the classical approach is process-driven.
8. The data warehouse contains data with several levels of detail: current detail data, old
detail data, lightly summarized data, and highly summarized data.
9. The data warehouse environment is characterized by read-only transactions to very large
data sets. The operational environment is characterized by numerous update transactions
to a few data entities at a time.
10. The data warehouse environment has a system that traces data sources, transformations,
and storage.
11. The data warehouse’s metadata are a critical component of this environment. The
metadata identify and define all data elements. The metadata provide the source,
transformation, integration, storage, usage, relationships, and history of each data
element.
12. The data warehouse contains a chargeback mechanism for resource usage that enforces
optimal use of the data by end users.
ONLINE ANALYTICAL PROCESSING (OLAP)
The need for more intensive decision support prompted the introduction of a new generation
of tools. Those new tools, called online analytical processing (OLAP), create an advanced
data analysis environment that supports decision making, business modelling, and operations
research. OLAP systems share four main characteristics:
• They use multidimensional data analysis techniques.
• They provide advanced database support.
• They provide easy-to-use end-user interfaces.
• They support the client/server architecture.
OLAP ARCHITECTURE
OLAP operational characteristics can be divided into three main modules:
• Graphical user interface (GUI).
• Analytical processing logic.
• Data-processing logic.
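
One way to picture the three modules is the short Python sketch below, in which a small sales cube is queried by a data-processing function, summarized by an analytical function, and formatted by a text-only stand-in for the GUI. The cube contents, the dimension names, and the functions `fetch_cells`, `pivot`, and `render` are hypothetical and are not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> sales amount.
CUBE = {
    ("X", "North", "2024-01"): 100.0,
    ("X", "South", "2024-01"): 40.0,
    ("Z", "North", "2024-01"): 70.0,
    ("X", "North", "2024-02"): 90.0,
}
DIMS = ("product", "region", "month")


def fetch_cells(**fixed):
    """Data-processing logic: return the cells matching fixed dimension values."""
    selected = {}
    for key, value in CUBE.items():
        cell = dict(zip(DIMS, key))
        if all(cell[dim] == wanted for dim, wanted in fixed.items()):
            selected[key] = value
    return selected


def pivot(cells, row_dim, col_dim):
    """Analytical processing logic: total the measure over two dimensions."""
    table = defaultdict(lambda: defaultdict(float))
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    for key, value in cells.items():
        table[key[r]][key[c]] += value
    return {row: dict(cols) for row, cols in table.items()}


def render(table):
    """Stand-in for the GUI: format the pivot table as text lines."""
    return [
        f"{row}: " + ", ".join(f"{col}={val:.1f}" for col, val in sorted(cols.items()))
        for row, cols in sorted(table.items())
    ]


january = fetch_cells(month="2024-01")  # a slice of the cube along the time dimension
report = render(pivot(january, "product", "region"))
```

Fixing one dimension and viewing the remaining two is the basic multidimensional operation; the same functions can pivot on any other pair of dimensions.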

DATA MINING
The purpose of data analysis is to discover previously unknown data characteristics,
relationships, dependencies, or trends. Such discoveries then become part of the information
framework on which decisions are built. A typical data analysis tool relies on the end users to
define the problem, select the data, and initiate the appropriate data analyses to generate the
information that helps model and solve problems that the end users uncover.
Some current BI environments now support various types of automated alerts. The
alerts are software agents that constantly monitor certain parameters, such as sales indicators
and inventory levels, and then perform specified actions (send e-mail or alert messages, run
programs, and so on) when such parameters reach predefined levels. In contrast to the
traditional (reactive) BI tools, data mining is proactive. Instead of having the end user define
the problem, select the data, and select the tools to analyze the data, data-mining tools
automatically search the data for anomalies and possible relationships, thereby identifying
problems that have not yet been identified by the end user.
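
The automated alerts mentioned above can be sketched in a few lines of Python: each alert watches one parameter and runs an action when the value crosses a predefined level. The `Alert` class, the parameter names, and the threshold values are assumptions for illustration; a real BI environment would send e-mail or start programs instead of appending to a list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    """Fires an action when a monitored parameter crosses a predefined level."""

    parameter: str
    threshold: float
    action: Callable[[str, float], None]
    trigger_below: bool = True  # True: fire at or below; False: at or above

    def check(self, readings):
        value = readings.get(self.parameter)
        if value is None:  # parameter not reported in this round of readings
            return False
        if self.trigger_below:
            fired = value <= self.threshold
        else:
            fired = value >= self.threshold
        if fired:
            self.action(self.parameter, value)
        return fired


messages = []


def notify(parameter, value):
    """Hypothetical action: record a message instead of sending e-mail."""
    messages.append(f"ALERT {parameter}={value}")


alerts = [
    Alert("inventory_level", 20, notify),
    Alert("daily_sales", 1000, notify, trigger_below=False),
]
readings = {"inventory_level": 12, "daily_sales": 640}
fired = [alert.check(readings) for alert in alerts]
```

Note that the levels are still defined in advance by a person, which is why such alerts belong to the reactive tools; data mining differs in that it searches for the conditions itself.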
In other words, data mining refers to the activities that analyze the data, uncover
problems or opportunities hidden in the data relationships, form computer models based on
their findings, and then use the models to predict business behaviour—requiring minimal
end-user intervention. In short, data-mining tools initiate analyses to create knowledge. Such
knowledge can be used to address any number of business problems. For example, banks and
credit card companies use knowledge-based analysis to detect fraud, thereby decreasing
fraudulent transactions.
It is difficult to provide a precise list of characteristics of data-mining tools. Many
variations exist because there are no established standards that govern the creation of data-
mining tools. In spite of the lack of precise standards, data mining is subject to four general
phases:
1. Data preparation.
2. Data analysis and classification.
3. Knowledge acquisition.
4. Prognosis.
In the data preparation phase, the main data sets to be used by the data-mining operation
are identified and cleansed of any data impurities. Because the data in the data warehouse
are already integrated and filtered, the data warehouse usually is the target set for
data-mining operations.
The data analysis and classification phase studies the data to identify common data
characteristics or patterns. During this phase, the data-mining tool applies specific algorithms
to find:
• Data groupings, classifications, or sequences.
• Data dependencies, links, or relationships.
• Data patterns, trends, and deviations.
The knowledge acquisition phase uses the results of the data analysis and classification
phase. During the knowledge acquisition phase, the data-mining tool (with possible
intervention by the end user) selects the appropriate modelling or knowledge acquisition
algorithms. The most common algorithms used in data mining are based on neural networks,
decision trees, rule induction, and genetic algorithms, among others.
Although many data-mining tools stop at the knowledge-acquisition phase, others
continue to the prognosis phase. In that phase, the data-mining findings are used to predict
future behaviour and forecast business outcomes. An example of a data-mining finding:
Sixty-five percent of customers who did not use a particular credit card in the last six months
are 88 percent likely to cancel that account. The complete set of findings can be represented
in a decision tree, a neural net, a forecasting model, or a visual presentation interface that is
used to project future events or results.
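
To make the four phases more concrete, the Python sketch below walks a tiny, invented customer data set through them: incomplete records are dropped, the share of inactive customers who cancelled is measured, that share is kept as a simple rule, and the rule is applied to new customers. The records, the six-month cutoff as a parameter, and the function names are hypothetical, and the resulting figure is unrelated to the percentages quoted above. Real tools use the algorithms listed earlier, not a single hand-written rule.

```python
# Hypothetical customer records: (months_since_last_use, cancelled).
RAW = [
    (7, True), (8, True), (9, True), (6, False),
    (1, False), (2, False), (0, False), (3, True),
    (None, False),  # incomplete record
]


def prepare(records):
    """Data preparation: drop records with missing values."""
    return [record for record in records if record[0] is not None]


def rule_confidence(records, inactive_months=6):
    """Analysis: share of customers inactive for `inactive_months` or more
    who cancelled their account."""
    outcomes = [cancelled for months, cancelled in records if months >= inactive_months]
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def predict_cancel(months_since_last_use, confidence, inactive_months=6, cutoff=0.5):
    """Prognosis: flag a customer when the rule applies and is strong enough."""
    return months_since_last_use >= inactive_months and confidence >= cutoff


clean = prepare(RAW)                 # phase 1
confidence = rule_confidence(clean)  # phases 2 and 3: the acquired rule
at_risk = predict_cancel(10, confidence)  # phase 4
```

In this made-up data, three of the four inactive customers cancelled, so the rule carries a confidence of 0.75 and a customer inactive for ten months is flagged.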

The future is expected to bring databases that not only store data and various statistics
about data usage, but also have the ability to learn about and extract knowledge from the
stored data. Such database management systems, also known as inductive or intelligent
databases, are the focus of intense research in many laboratories. Although those databases
have yet to lay claim to substantial commercial market penetration, DBMS-integrated data-
mining tools have proliferated in the data warehousing database market.
