Data Warehousing and Data Mining
The first question that arises is: why do we need a Data Warehouse, and why spend so much money and time on it, when we could point BI tools directly at the transaction systems? There are many limitations to that approach, and enterprises gradually came to understand the need for a Data Warehouse. Let us look at some of the characteristics that make a Data Warehouse so important for Business Analytics.
• Subject-Oriented
• Integrated
• Time-Variant
• Non-Volatile
Database System: A database system is used for the traditional way of storing and retrieving data. The major task of a database system is to perform query processing. These systems are generally referred to as online transaction processing (OLTP) systems, and they support the day-to-day operations of an organization.
Data Warehouse: A data warehouse is the place where a huge amount of data is stored. It is meant for users or knowledge workers in the role of data analysis and decision making. These systems organize and present data in different formats and different forms in order to serve the needs of specific users for specific purposes. They are referred to as online analytical processing (OLAP) systems.
Difference between Database System and Data Warehouse:
Database System | Data Warehouse
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
ER based. | Star/Snowflake.
A data mart is a focused subset of a data warehouse, built for a particular department or subject area.
Reasons for creating a data mart
o Creates a collective view of data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a
comprehensive data warehouse
o It contains only essential business data and is less cluttered.
Designing
The design step is the first in the data mart process. This phase
covers all of the functions from initiating the request for a data
mart through gathering data about the requirements and
developing the logical and physical design of the data mart.
Constructing
This step involves creating the physical database and the logical
structures associated with the data mart to provide fast and
efficient access to the data.
Populating
This step includes all of the tasks related to getting data from
the source, cleaning it up, modifying it to the right format and
level of detail, and moving it into the data mart.
Accessing
This step involves putting the data to use: querying the data,
analyzing it, creating reports, charts and graphs and publishing
them.
Managing
This step involves managing the data mart over its lifetime. In this step, ongoing management functions are performed, such as controlling user access, monitoring and tuning performance, and refreshing the data.
Single-Tier Architecture
The figure shows that the only layer physically available is the source
layer. In this method, data warehouses are virtual. This means
that the data warehouse is implemented as a multidimensional
view of operational data created by specific middleware, or an
intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are issued against operational data after the middleware interprets them. In this way, the queries affect transactional workloads.
Two-Tier Architecture
Fact Tables
A fact table is the table in a star schema that contains the facts (measures) and is connected to the dimension tables. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.
A fact table might hold either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). A fact table generally contains facts with the same level of aggregation.
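To make the structure concrete, here is a hedged sketch of a tiny star schema using SQLite; the table and column names (sales_fact, date_dim, product_dim, and so on) are invented for illustration, not a prescribed design.

# Minimal star-schema sketch: two dimension tables and one fact table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes.
cur.execute("""CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY,
    full_date TEXT, month TEXT, year INTEGER)""")
cur.execute("""CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT)""")

# The fact table: foreign keys to the dimensions plus numeric measures.
# Its primary key is a composite of the dimension foreign keys.
cur.execute("""CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    units_sold INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (date_key, product_key))""")
conn.commit()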
o Simpler Queries –
Features:
Advantages:
1. Concatenated Key
2. Additive Measures
3. Degenerate Dimensions
4. Fact Table Grain
The level or depth of the data that is stored in the fact table is known as the grain of the table. An efficient fact table should be designed at the highest level of detail.
5. Sparse Data
Some data records in the fact table include attributes with null
values or measurements, indicating that they do not provide any
data.
6. Shrunken Rollup Dimensions
These are dimensions that are a subdivision of the rows and columns of the base dimension.
7. Outrigger dimensions
2. Attribute Values
3. Normalization
Introduction
The Meta Data Repository is responsible for storing domains
and their master data models. The models stored within this
service are consulted for different tasks such as data
validation. The meta models are also used by the transformer
to map the incoming data onto the Open Integration Hub
standard.
Technologies used
• Node.js
• MongoDB
• JSON Schema
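As an illustration of how a stored meta model can drive data validation, here is a hedged Python sketch using the jsonschema package; the "person" schema and record are invented examples, not Open Integration Hub master data models.

# Validate an incoming record against a stored meta model (JSON Schema).
from jsonschema import validate, ValidationError

person_schema = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["firstName", "lastName"],
}

incoming_record = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

try:
    validate(instance=incoming_record, schema=person_schema)
    print("record conforms to the stored meta model")
except ValidationError as err:
    print("validation failed:", err.message)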
• Data extraction:
o get data from multiple, heterogeneous, and external
sources
• Data cleaning:
o detect errors in the data and rectify them when possible
• Data transformation:
o convert data from legacy or host format to warehouse
format
• Load:
o sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
• Refresh:
o propagate the updates from the data sources to the warehouse
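A minimal, illustrative sketch of this extract-clean-transform-load flow in Python; the records, column names, and cleaning rules are invented assumptions, not a real pipeline.

import sqlite3

def extract():
    """Extract: in a real pipeline this would pull from heterogeneous sources;
    here a few in-memory records stand in for the source systems."""
    return [
        {"region": " east ", "amount": "120.5"},
        {"region": "WEST", "amount": ""},        # dirty record (missing amount)
        {"region": "east", "amount": "79.5"},
    ]

def clean_and_transform(rows):
    """Clean: drop rows with missing amounts; Transform: cast to warehouse types."""
    out = []
    for r in rows:
        if not r.get("amount"):
            continue  # rectify by skipping unusable records
        out.append((r["region"].strip().upper(), float(r["amount"])))
    return out

def load(rows, conn):
    """Load: summarize by region and write into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_summary (region TEXT, total REAL)")
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(clean_and_transform(extract()), conn)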
UNIT 2
MultiDimensional Data Model
The multidimensional data model is a method used for organizing data in a database, with a good arrangement and assembly of the contents of the database.
The multidimensional data model allows users to ask analytical questions associated with market or business trends, unlike relational databases, which allow users to access data only in the form of queries. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases. The model is used to show multiple dimensions of the data to users.
• Data Consolidation
• Data Cleaning
• Data Integration
• Data Storage
• Data Transformation
• Data Analysis
• Data Reporting
• Data Mining
• Performance Optimization
Finance and accounting
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Production
o Production planning
o Defect analysis
Characteristics of OLAP
Fast
Share
It means that the system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
Roll-Up
The roll-up operation (also known as drill-up or aggregation operation) performs aggregation on a data cube by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of the city to the level of the country.
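A small pandas sketch of the roll-up idea, aggregating from the city level of the location hierarchy up to the country level; the figures are invented for illustration.

import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "dollars_sold": [605, 825, 854, 1087],
})

# City-level view (the lower level of the location hierarchy).
by_city = sales.groupby(["country", "city", "quarter"])["dollars_sold"].sum()

# Roll-up: climb the location hierarchy from city to country.
rolled_up = sales.groupby(["country", "quarter"])["dollars_sold"].sum()
print(rolled_up)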
Drill-Down
Slice
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
Pivot
Types of OLAP
ROLAP Architecture:
o Database server.
o ROLAP server.
o Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding the need to structure the data around the desired view.
Advantages
MOLAP Architecture
o Database server.
o MOLAP server.
o Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP
structure has limited capabilities to dynamically create
aggregations or to evaluate results which have not been pre-
calculated and stored.
Advantages
Disadvantages
Limited in the amount of information it can handle: Because
all calculations are performed when the cube is built, it is not
possible to contain a large amount of data in the cube itself.
Disadvantages of HOLAP
Implementation Guidelines
Ensure quality: Only records that have been cleaned and meet the level of quality accepted by the organization should be loaded into the data warehouse.
Let suppose we would like to view the sales data with a third
dimension. For example, suppose we would like to view the data
according to time, item as well as the location for the cities
Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in thousands). These 3-D data are shown
in the table. The 3-D data of the table are represented as a series
of 2-D tables.
Let us suppose that we would like to view our sales data with an
additional fourth dimension, such as a supplier.
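As a hedged illustration of viewing 3-D data as a series of 2-D tables, the pandas sketch below groups an invented dataset by location and prints one time-by-item table per city; the figures are illustrative only.

import pandas as pd

data = pd.DataFrame({
    "location": ["Chicago", "Chicago", "New York", "New York",
                 "Toronto", "Toronto", "Vancouver", "Vancouver"],
    "time":     ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":     ["home entertainment"] * 8,
    "dollars_sold": [854, 943, 1087, 1130, 605, 680, 818, 894],
})

# One 2-D (time x item) table per value of the third dimension, location.
for city, slice_2d in data.groupby("location"):
    print(f"--- location = {city} ---")
    print(slice_2d.pivot(index="time", columns="item", values="dollars_sold"))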
UNIT 3
Data warehouses:
Data Repositories:
Object-Relational Database:
1. Data Collection
The collection of raw data is the first step of the data processing
cycle. The raw data collected has a huge impact on the output
produced. Hence, raw data should be gathered from defined and
accurate sources so that the subsequent findings are valid and usable.
Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.
2. Data Preparation
3. Data Input
4. Data Storage
The last step of the data processing cycle is storage, where data
and metadata are stored for further use. This allows quick access
and retrieval of information whenever needed. Effective
data storage is necessary for compliance with GDPR (data
protection legislation).
Tight Coupling
Loose Coupling
With loose coupling, data is most effectively kept in the actual source databases. This approach provides an interface that takes a query from the user, transforms it into a format that the source database can understand, and then sends the query directly to the source databases to obtain the result.
Integration tools
Data Aggregation
Data Generalization
INTRODUCTION:
Advantages:
Disadvantages:
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it
involves collecting and analyzing large amounts of
data, which can include sensitive information about
individuals.
2. Complexity: KDD can be a complex process that
requires specialized skills and knowledge to implement
and interpret the results.
3. Unintended consequences: KDD can lead to
unintended consequences, such as bias or
discrimination, if the data or models are not properly
understood or used.
4. Data Quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process,
requiring significant investments in hardware,
software, and personnel.
Data Mining architecture
Data Mining refers to the detection and extraction of new
patterns from the already collected data. Data mining is the
amalgamation of the fields of statistics and computer science,
aiming to discover patterns in incredibly large datasets and
then transform them into a comprehensible structure for later
use.
• Helps companies to get knowledge-based information about their customers.
• Assists companies to optimize their production.
• These operations are performed on the data cube and typically involve aggregate functions.
• Variable
• Quantitative Variable
• Qualitative Variable
Characterization
Association
Classification
Prediction
TID Items
1 Bread, Milk
• Support(s) –
The number of transactions that include items from both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together, as a percentage of all transactions.
• Support(X⇒Y) = freq(X ∪ Y) / (total number of transactions) –
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
• Conf(X⇒Y) = Supp(X ∪ Y) / Supp(X) –
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X⇒Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the frequency (support) of {Y}.
• Lift(X⇒Y) = Conf(X⇒Y) / Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
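A short worked example of these three measures on a small, invented set of transactions, for the hypothetical rule {Bread} ⇒ {Milk}:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Bread"}, {"Milk"}
supp_xy = support(X | Y)                 # Supp(X ∪ Y)
confidence = supp_xy / support(X)        # Conf(X⇒Y) = Supp(X ∪ Y) / Supp(X)
lift = confidence / support(Y)           # Lift(X⇒Y) = Conf(X⇒Y) / Supp(Y)
print(f"support={supp_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")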
Multilevel Association Rule in data mining
Multilevel Association Rule :
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.
Using uniform minimum support for all levels:
• When a uniform minimum support threshold is used, the search procedure is simplified, because the same threshold is applied at every level of abstraction.
Group-based support –
The group-wise threshold values for support and confidence are input by the user or an expert. The groups are selected based on product price or item set, because the expert often has insight into which groups are more important than others.
Quantitative Association Rules:
• Numeric attributes are dynamically discretized.
Example:
age(X, "20..25") ∧ income(X, "30K..41K") ⇒ buys(X, "Laptop Computer")
2. Grid for tuples:
Using distance-based discretization with clustering –
This is a dynamic discretization process that considers the distance between data points. It involves a two-step mining process:
• Perform clustering to find the intervals of the attributes involved.
• Obtain association rules by searching for groups of clusters that occur together.
Classification vs. Prediction
Key factors:
Neural Network:
Features of Backpropagation:
4. Working of Backpropagation:
5. Neural networks use supervised learning to generate output vectors from the input vectors that the network operates on. The network compares the generated output with the desired (target) output and computes an error when the two do not match. It then adjusts the weights according to this error to move the output toward the desired result.
6. Backpropagation Algorithm:
• α = learning rate.
Training Algorithm:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 10.
Step 3: For each training pair, do steps 4 to 9 (feed-forward).
Step 4: Each input unit xi receives the input signal and transmits it to all units in the hidden layer.
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals to calculate its net input
zinj = v0j + Σ xi vij   (i = 1 to n),
applies its activation function zj = f(zinj), and sends this signal to all units in the layer above, i.e., the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals
yink = w0k + Σ zj wjk   (j = 1 to a)
and applies its activation function to calculate the output signal
yk = f(yink)
Backpropagation of Error:
Step 6: Each output unit yk (k = 1 to m) receives a target pattern tk corresponding to the input pattern, and its error term is calculated as:
δk = (tk – yk) f′(yink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from the units in the layer above:
δinj = Σ δk wjk   (k = 1 to m)
The error information term is calculated as:
δj = δinj f′(zinj)
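A compact numpy sketch of one feed-forward and backpropagation step for a single-hidden-layer network with sigmoid activations; the layer sizes, learning rate, and toy data are assumptions chosen for illustration, not a prescribed implementation.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, alpha = 3, 4, 2, 0.5   # alpha = learning rate

# Small random initial weights (Step 1); v: input->hidden, w: hidden->output.
v = rng.normal(scale=0.1, size=(n_in, n_hidden))
v0 = np.zeros(n_hidden)
w = rng.normal(scale=0.1, size=(n_hidden, n_out))
w0 = np.zeros(n_out)

def f(x):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(net):    # derivative of the sigmoid at the net input
    s = f(net)
    return s * (1.0 - s)

x = np.array([0.2, 0.7, -0.1])   # one training input
t = np.array([1.0, 0.0])         # its target output

# Feed-forward (Steps 4-5).
z_in = v0 + x @ v
z = f(z_in)
y_in = w0 + z @ w
y = f(y_in)

# Backpropagation of error (Steps 6-7).
delta_k = (t - y) * f_prime(y_in)          # output-layer error terms
delta_in_j = delta_k @ w.T                 # summed deltas reaching each hidden unit
delta_j = delta_in_j * f_prime(z_in)       # hidden-layer error terms

# Weight updates using the error terms and the learning rate.
w += alpha * np.outer(z, delta_k)
w0 += alpha * delta_k
v += alpha * np.outer(x, delta_j)
v0 += alpha * delta_j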
Types of Backpropagation
Advantages:
Disadvantages:
Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, consider tossing a coin: we get either heads or tails, and the probability of either outcome is 50%. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on those outcomes.
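Formally, Bayes' theorem relates the prior and posterior degrees of belief, where H is the hypothesis and E is the evidence:

P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}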
Bayesian network:
A Bayesian network is a type of Probabilistic Graphical Model (PGM) that is used to compute uncertainties using the concept of probability. Also known as Belief Networks, Bayesian networks model uncertainties using Directed Acyclic Graphs (DAGs).
• 2.Confidence:
3. Lift:
Lift measures the strength of a rule. It is the ratio of the observed support to the expected support if X and Y were independent of each other:
Lift(X⇒Y) = Supp(X ∪ Y) / (Supp(X) × Supp(Y))
It has three possible ranges of values:
• If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.
• If Lift > 1: the antecedent and the consequent are positively correlated; they occur together more often than expected.
• If Lift < 1: the antecedent and the consequent are negatively correlated; they occur together less often than expected.
Association rule learning can be divided into the following types of algorithms:
1. Apriori Algorithm:
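The Apriori algorithm finds frequent itemsets level by level, using the property that every subset of a frequent itemset must also be frequent. Below is a minimal, hedged sketch of that idea (frequent-itemset generation only, no rule generation); the transactions and the minimum support of 0.6 are invented for illustration.

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
    {"Bread", "Milk", "Beer"},
]
min_support = 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate generation: join frequent (k-1)-itemsets, keep size-k unions.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune using the Apriori property, then filter by minimum support.
    frequent = [c for c in candidates
                if all(frozenset(s) in all_frequent for s in combinations(c, k - 1))
                and support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

for itemset in all_frequent:
    print(set(itemset), round(support(itemset), 2))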
Classifier Accuracy
Evaluating and estimating the accuracy of classifiers is important in that it allows one to evaluate how accurately a given classifier will label future data, that is, data on which the classifier has not been trained.
Cross-Validation
In k-fold cross-validation, the initial data are
randomly partitioned into k mutually exclusive
subsets or “folds,” D1, D2,....., Dk, each of
approximately equal size.
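A short sketch of the partitioning step of k-fold cross-validation; the tiny dataset and k = 3 are invented for illustration.

import random

data = list(range(12))        # stand-in for 12 training tuples
k = 3

random.seed(42)
random.shuffle(data)          # random partition into mutually exclusive folds
folds = [data[i::k] for i in range(k)]

for i in range(k):
    test_fold = folds[i]
    train = [x for j, fold in enumerate(folds) if j != i for x in fold]
    # In round i, the model would be trained on `train` and tested on `test_fold`.
    print(f"fold {i}: test={sorted(test_fold)}")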
Bootstrapping
Unlike the accuracy estimation methods
mentioned above, the bootstrap method samples
the given training tuples uniformly with
replacement.
That is, each time a tuple is selected, it is equally
likely to be selected again and re-added to the
training set.
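A minimal sketch of drawing one bootstrap sample of size n with replacement; the data is invented, and the out-of-bag tuples illustrate what the commonly used .632 bootstrap would hold out for testing.

import random

random.seed(7)
training = list(range(10))                       # 10 training tuples
bootstrap_sample = [random.choice(training) for _ in range(len(training))]

# Tuples never drawn form the test set for this bootstrap round.
out_of_bag = [t for t in training if t not in bootstrap_sample]
print("bootstrap sample:", bootstrap_sample)
print("out-of-bag test tuples:", out_of_bag)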
Bagging
We first take an intuitive look at how bagging
works as a method of increasing accuracy.
Boosting
We now look at the ensemble method of
boosting. As in the previous section, suppose
that as a patient, you have certain symptoms.
In a linear regression equation, each term is one of only two kinds:
o The constant
o A parameter multiplied by an independent variable (IV)
You then build the equation by only adding the terms together. These rules limit the form to just one type:
Dependent variable = constant + parameter * IV + parameter * IV + ...
Nonlinear equations can take many different forms; two examples are the power model and the Weibull growth model.
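As a hedged illustration (these are common textbook parameterizations, not necessarily the exact forms the original table listed), the two models can be written as:

\text{Power: } \theta_1 X^{\theta_2}
\qquad
\text{Weibull growth: } \theta_1 + (\theta_2 - \theta_1)\, e^{-\theta_3 X^{\theta_4}}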
Data Mining – Cluster Analysis
• INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data
mining that groups similar data points together. The goal of
cluster analysis is to divide a dataset into groups (or clusters)
such that the data points within each group are more similar to
each other than to data points in other groups. This process is
often used for exploratory data analysis and can help identify
patterns or relationships within the data that may not be
immediately obvious. There are many different algorithms
used for cluster analysis, such as k-means, hierarchical
clustering, and density-based clustering. The choice of
algorithm will depend on the specific requirements of the
analysis and the nature of the data being analyzed.
Properties of Clustering :
1. Clustering Scalability: Nowadays there are vast amounts of data, so clustering algorithms must deal with huge databases. To handle extensive databases, the clustering algorithm should be scalable; if it is not, the appropriate result cannot be obtained and the outcome may be wrong.
2. High Dimensionality: The algorithm should be able to
handle high dimensional space along with the data of small
size.
3. Algorithm Usability with multiple data kinds: Different
kinds of data can be used with algorithms of clustering. It
should be capable of dealing with different types of data like
discrete, categorical and interval-based data, binary data etc
4. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability reflects how easily the results can be understood.
Clustering Methods:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to make partitions on the data
in order to form clusters. If “n” partitions are done on “p”
objects of the database then each partition is represented by a
cluster and n < p. The two conditions which need to be
satisfied with this Partitioning Clustering Method are:
• One object should belong to only one group.
• There should be no group without even a single object in it.
In the partitioning method, there is one technique called
iterative relocation, which means the object will be moved from
one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical
decomposition of the given set of data objects is created. We
can classify hierarchical methods and will be able to know the
purpose of classification on the basis of how the hierarchical
decomposition is formed. There are two types of approaches
for the creation of hierarchical decomposition, they are:
• Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each data object forms its own separate group. Thereafter, the method keeps merging the objects or groups that are close to one another, meaning that they exhibit similar properties. This merging process continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known as the top-down approach. Initially, all of the objects are placed in a single cluster, which is then repeatedly split into smaller clusters until the termination condition holds.
Applications of clustering include market research, pattern recognition, data analysis, and image processing:
• It helps marketers to find the distinct groups in their customer base.
Output:
A dataset of K clusters
Method:
Method:
1. Randomly assign K objects from the dataset (D) as initial cluster centres (C).
2. (Re)assign each object to the cluster to which it is most similar, based on the mean values of the objects in the cluster.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until no change occurs.
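A compact, pure-Python sketch of the k-means procedure described above; the 2-D toy points and K = 2 are invented for illustration.

import math
import random

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
K = 2

def distance(a, b):
    return math.dist(a, b)

def mean(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

random.seed(1)
centres = random.sample(points, K)          # Step 1: random initial centres

while True:
    # Step 2: assign each object to its nearest centre.
    clusters = [[] for _ in range(K)]
    for p in points:
        nearest = min(range(K), key=lambda i: distance(p, centres[i]))
        clusters[nearest].append(p)
    # Step 3: recompute the cluster means.
    new_centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    # Step 4: stop when the centres no longer change.
    if new_centres == centres:
        break
    centres = new_centres

print("final centres:", centres)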
Directly density-reachable: a point i is directly density-reachable from a point k if i belongs to NEps(k), the Eps-neighborhood of k, and k is a core point (its Eps-neighborhood contains at least MinPts points).
Density connected: two points are density-connected if there is a third point from which both are density-reachable.
DBSCAN
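DBSCAN groups together points that are densely packed (core points and the points density-reachable from them) and marks low-density points as noise. A hedged usage sketch, assuming scikit-learn is available; the toy points and the eps and min_samples values are illustrative only.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.2, 7.9], [8.1, 8.3],
              [25.0, 80.0]])                      # the last point is an outlier

model = DBSCAN(eps=0.8, min_samples=2).fit(X)
print(model.labels_)   # cluster labels; -1 marks noise points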
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces a significant ordering of the database with respect to its density-based clustering structure. The cluster ordering contains information equivalent to density-based clusterings obtained across a broad range of parameter settings. OPTICS methods are beneficial for both automatic and interactive cluster analysis, including determining an intrinsic clustering structure.
DENCLUE
Web structure mining is used to discover the link structure of hyperlinks, that is, to identify the relationships between linked web pages within a site or across the wider link network. In web structure mining, the web is considered as a directed graph, with the web pages as vertices connected by hyperlinks. The most important application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm. PageRank characterizes a page as highly relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined. For example, web structure mining can help organizations to determine the network between two commercial sites.
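A small power-iteration sketch of the PageRank idea on an invented four-page link graph; the damping factor of 0.85 is the commonly cited default.

graph = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping, n = 0.85, len(graph)
rank = {page: 1.0 / n for page in graph}

for _ in range(50):            # iterate until ranks (approximately) converge
    new_rank = {}
    for page in graph:
        incoming = sum(rank[p] / len(links)
                       for p, links in graph.items() if page in links)
        new_rank[page] = (1 - damping) / n + damping * incoming
    rank = new_rank

print({p: round(r, 3) for p, r in rank.items()})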
Classification:
Association rules: