KM Notes Unit-3
Digital Notes
[Department of Computer Applications]
1. Data Mining
Data mining is the process of examining data from various perspectives to uncover hidden patterns
and categorize them into useful information. The data is collected and assembled in particular
areas such as data warehouses, and the results support efficient analysis, data mining algorithms,
and decision making, with the eventual aim of cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends
and patterns that go beyond simple analysis procedures. It uses complex mathematical algorithms
to segment the data and evaluate the probability of future events. Data mining is also known as
Knowledge Discovery in Databases (KDD).
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a
particular data set, with an objective. It covers various types of services such as text mining,
web mining, audio and video mining, pictorial data mining, and social media mining. The work is
done with software that may be simple or highly specialized. Outsourcing data mining can get the
work done faster and at lower operating cost.
A data mining architecture has several components, including the Data Warehouse, Data Mining
Engine, Pattern Evaluation, User Interface, and Knowledge Base.
Data Warehouse:
A data warehouse is a repository that stores information collected from multiple sources under a
unified schema. Information stored in a data warehouse is critical to an organization's
decision-making process.
Pattern Evaluation:
The Pattern Evaluation module is responsible for finding various patterns with the help of the
Data Mining Engine.
User Interface:
The User Interface provides communication between the user and the data mining system. It allows
the user to use the system easily even without detailed knowledge of the system.
Knowledge Base:
The Knowledge Base consists of data that is very important in the process of data mining. It
provides input to the data mining engine and guides the engine in the process of pattern search.
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records,
and columns from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
Data Repositories:
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases in which an organization keeps various kinds of
information.
Object-Relational Database:
A combination of the object-oriented database model and the relational database model is called
the object-relational model. It supports classes, objects, inheritance, etc.
One of the primary objectives of the object-relational data model is to close the gap between the
relational database and the object-oriented practices frequently used in many programming
languages, for example, C++, Java, and C#.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo (roll back)
a database transaction if it is not completed properly. Although this was once a distinctive
capability, today most relational database systems support transactional database activities.
1 Data cleaning -
The first step in the Knowledge Discovery Process is data cleaning, in which noise and
inconsistent data are removed.
2 Data Integration -
The second step is data integration, in which multiple data sources are combined.
3 Data Selection -
The next step is data selection, in which data relevant to the analysis task are retrieved from
the database.
4 Data Transformation -
In Data Transformation, data are transformed into forms appropriate for mining by performing
summary or aggregation operations.
5 Data Mining -
In Data Mining, data mining methods (algorithms) are applied in order to extract data patterns.
6 Pattern Evaluation -
In Pattern Evaluation, the patterns that represent knowledge are identified based on interestingness measures.
7 Knowledge Presentation -
In Knowledge Presentation, the discovered knowledge is presented to the user using knowledge
representation techniques.
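The seven steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the "total sales per item" pattern are all assumptions made for the example, not part of any particular KDD tool.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"customer": "c1", "item": "milk", "amount": 40},
    {"customer": "c2", "item": "bread", "amount": None},  # noisy: missing amount
    {"customer": "c3", "item": "milk", "amount": 45},
]
store_b = [
    {"customer": "c4", "item": "Milk ", "amount": 42},    # inconsistent spelling
    {"customer": "c5", "item": "eggs", "amount": 60},
]

def clean(records):
    """Step 1: drop records with a missing amount and normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate the amounts per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5: a deliberately trivial 'pattern': items ranked by total sales."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_total):
    """Step 6: keep only patterns that pass a simple interestingness threshold."""
    return [(item, total) for item, total in patterns if total >= min_total]

def present(patterns):
    """Step 7: render the result for the user."""
    return [f"{item}: {total}" for item, total in patterns]

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["item", "amount"])
report = present(evaluate(mine(transform(data)), min_total=50))
print(report)
```

In a real project the mining step would be an actual algorithm (classification, clustering, and so on); the point here is only the order in which the steps feed each other.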
1. Classification:
Classification assigns data items to predefined classes. It is used to retrieve important and
relevant information about data and metadata.
2. Clustering:
Clustering analysis is a data mining technique that identifies data items that are similar to
each other. This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship
between variables. It is used to identify the likelihood of a specific variable, given the presence
of other variables.
4. Association Rules:
This data mining technique helps to find associations between two or more items. It discovers
hidden patterns in the data set.
5. Outlier detection:
This data mining technique refers to the observation of data items in the data set that do not
match an expected pattern or expected behavior. It can be used in a variety of domains, such as
intrusion detection, fraud detection, and fault detection. Outlier detection is also called
outlier analysis or outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction
data over a certain period.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trends, sequential
patterns, clustering, and classification. It analyzes past events or instances in the right
sequence to predict a future event.
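Two of the techniques listed above, association rules and outlier detection, are easy to illustrate in a few lines. The sketch below uses the usual support and confidence measures for a rule and a simple z-score test for outliers. The transactions, the values, and the threshold of 2 standard deviations are assumptions chosen for the example; real tools use more elaborate algorithms.

```python
from statistics import mean, pstdev

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Association rule bread -> milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

# Outlier detection on a list of hypothetical transaction amounts.
outliers = zscore_outliers([10, 12, 11, 13, 12, 11, 95])

print(rule_support, round(rule_confidence, 2), outliers)
```

Here the rule bread -> milk holds in half of all transactions, and the amount 95 is flagged because it lies far from the other values.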
Data mining is a cost-effective and efficient solution compared to other statistical data
applications.
Data mining helps with the decision-making process.
It facilitates automated prediction of trends and behaviors as well as automated discovery
of hidden patterns.
It can be implemented in new systems as well as existing platforms.
It is a speedy process that makes it easy for users to analyze huge amounts of data in less
time.
4.1 How does the Multidimensional Data Model work?
Like most systems, the multidimensional data model is built by following a preset sequence of
steps. This keeps the design consistent with industry practice and allows database structures
that have already been built to be reused. A project should go through all of the steps below to
construct a multidimensional data model.
Gathering the requirements from the client
Categorizing the various modules of the system
Identifying the dimensions on which the system needs to be designed
Drafting the real-time dimensions and their corresponding properties
Identifying the facts from the listed dimensions and their properties
Constructing the schema to hold the data, based on the information gathered in the above steps
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, and location. These dimensions allow the store to keep track of things such
as the monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimension table, which describes the dimension further. For
example, a dimension table for item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts, or measures, as well as keys to each of the related dimension tables.
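The relationship between the fact table and its dimension tables can be sketched with plain Python dictionaries. The attribute names item_name, brand, and type come from the text; the surrogate ids, the other attribute names, the measure name rupee_sold, and all values are assumptions made for illustration.

```python
# Dimension tables: one row per member, keyed by a hypothetical surrogate id.
item_dim = {
    1: {"item_name": "Keyboard", "brand": "BrandA", "type": "accessory"},
    2: {"item_name": "Mobile", "brand": "BrandB", "type": "phone"},
}
time_dim = {1: {"quarter": "Q1"}, 2: {"quarter": "Q2"}}
location_dim = {1: {"city": "Delhi"}, 2: {"city": "Mumbai"}}

# Fact table: keys into each dimension table plus a numerical measure.
sales_fact = [
    {"time_id": 1, "item_id": 1, "location_id": 1, "rupee_sold": 120},
    {"time_id": 1, "item_id": 2, "location_id": 1, "rupee_sold": 300},
    {"time_id": 2, "item_id": 2, "location_id": 2, "rupee_sold": 250},
]

def describe(fact):
    """Resolve the keys of one fact row against the dimension tables."""
    return {
        "quarter": time_dim[fact["time_id"]]["quarter"],
        "item_name": item_dim[fact["item_id"]]["item_name"],
        "city": location_dim[fact["location_id"]]["city"],
        "rupee_sold": fact["rupee_sold"],
    }

def total_by(attribute):
    """Sum the measure grouped by one descriptive attribute."""
    totals = {}
    for fact in sales_fact:
        key = describe(fact)[attribute]
        totals[key] = totals.get(key, 0) + fact["rupee_sold"]
    return totals

sales_by_city = total_by("city")
sales_by_item = total_by("item_name")
print(sales_by_city, sales_by_item)
```

The fact table stays narrow (keys and measures only), while the descriptive attributes live once in the dimension tables.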
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the type of item sold).
The fact or measure displayed is rupee_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the data
according to time and item is also considered by location, for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table, where they are represented as a series
of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in
the figure.
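The two views described above, a series of 2D tables and a single 3D cube, hold the same cells. A minimal sketch, assuming the cube is stored as a dictionary keyed by (location, quarter, item); the item types and the sales figures are invented for the example.

```python
# Hypothetical cube cells keyed by (location, quarter, item).
# Values stand for rupee_sold in thousands; the numbers are made up.
cube = {
    ("Delhi", "Q1", "Mobile"): 605,
    ("Delhi", "Q2", "Mobile"): 680,
    ("Delhi", "Q1", "Modem"): 825,
    ("Mumbai", "Q1", "Mobile"): 818,
    ("Chennai", "Q1", "Mobile"): 340,
    ("Kolkata", "Q1", "Modem"): 435,
}

def table_for(cube, location):
    """Return the 2D (quarter x item) table for a single location."""
    table = {}
    for (loc, quarter, item), value in cube.items():
        if loc == location:
            table.setdefault(quarter, {})[item] = value
    return table

def as_tables(cube):
    """View the 3D cube as a series of 2D tables, one per location."""
    locations = sorted({loc for loc, _, _ in cube})
    return {loc: table_for(cube, loc) for loc in locations}

tables = as_tables(cube)
for location, table in tables.items():
    print(location, table)
```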
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense
and sparse data sets.
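The storage problem with sparse data can be seen by comparing an array that allocates every cell with a mapping that stores only the non-empty cells. The sketch below is not the two-level scheme of any particular MOLAP server; the cube size and the cell values are assumptions chosen to make the effect visible.

```python
# Hypothetical cube of 4 locations x 4 quarters x 4 items.
dims = (4, 4, 4)

# Sparse representation: only the cells that actually hold data.
sparse_cells = {(0, 0, 1): 605, (1, 2, 3): 825, (3, 3, 0): 14}

def to_dense(cells, dims):
    """Array-style storage: every cell is allocated, empty ones hold 0."""
    x, y, z = dims
    dense = [[[0] * z for _ in range(y)] for _ in range(x)]
    for (i, j, k), value in cells.items():
        dense[i][j][k] = value
    return dense

def utilization(cells, dims):
    """Fraction of the allocated dense cells that actually hold data."""
    x, y, z = dims
    return len(cells) / (x * y * z)

dense = to_dense(sparse_cells, dims)
dense_utilization = utilization(sparse_cells, dims)
print(f"{len(sparse_cells)} of {dims[0] * dims[1] * dims[2]} cells used "
      f"({dense_utilization:.1%})")
```

The dense array gives direct access by index, which suits dense regions; the mapping avoids allocating empty cells, which suits sparse regions.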
Hybrid OLAP
Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers can store large volumes of detailed
information, while the aggregations are stored separately in a MOLAP store.
OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data. The operations are:
Roll-up
Drill-down
Slice
Dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
When roll-up is performed, one or more dimensions from the data cube are removed.
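Both forms of roll-up can be sketched on a small cube stored as a dictionary. The city-to-country hierarchy, the cell values, and the function names below are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical cells keyed by (city, quarter, item).
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q1", "Modem"): 825,
    ("Vancouver", "Q1", "Mobile"): 1087,
    ("Chicago", "Q1", "Mobile"): 854,
}

# Assumed concept hierarchy for the location dimension: city < country.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

def roll_up_location(cube, hierarchy):
    """Climb the location hierarchy: aggregate cities into countries."""
    result = defaultdict(int)
    for (city, quarter, item), value in cube.items():
        result[(hierarchy[city], quarter, item)] += value
    return dict(result)

def drop_dimension(cube, axis):
    """Dimension reduction: sum over one axis and remove it from the key."""
    result = defaultdict(int)
    for key, value in cube.items():
        result[key[:axis] + key[axis + 1:]] += value
    return dict(result)

by_country = roll_up_location(cube, city_to_country)
without_item = drop_dimension(cube, axis=2)
print(by_country)
print(without_item)
```

In the first case the cube keeps three dimensions but at a coarser level; in the second case the item dimension disappears entirely.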
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.
The following diagram illustrates how drill-down works −
Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year."
On drilling down, the time dimension is descended from the level of quarter to the level
of month.
When drill-down is performed, one or more dimensions from the data cube are added.
It navigates the data from less detailed data to highly detailed data.
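Drilling down from quarter to month can be sketched as follows. The sketch assumes the base data is kept at month granularity, since the detailed cells cannot be recovered from the quarterly totals alone; the figures and names are invented.

```python
from collections import defaultdict

# Hypothetical base data at month granularity, keyed by (month, item).
monthly_sales = {
    ("Jan", "Mobile"): 200,
    ("Feb", "Mobile"): 180,
    ("Mar", "Mobile"): 225,
    ("Apr", "Mobile"): 210,
}

# Part of the time hierarchy from the text: month < quarter.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def quarterly_view(monthly):
    """The less detailed view: monthly cells aggregated per quarter."""
    result = defaultdict(int)
    for (month, item), value in monthly.items():
        result[(month_to_quarter[month], item)] += value
    return dict(result)

def drill_down(monthly, quarter):
    """Step down from quarter to month: the monthly cells behind one quarter."""
    return {
        key: value
        for key, value in monthly.items()
        if month_to_quarter[key[0]] == quarter
    }

summary = quarterly_view(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
print(summary)
print(q1_detail)
```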
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1".
It will form a new sub-cube by selecting one or more dimensions.
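A slice on time = "Q1" can be sketched on a dictionary-based cube. The cell values and the helper name slice_cube are assumptions made for the example.

```python
# Hypothetical cells keyed by (location, time, item).
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Mobile"): 680,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q3", "Modem"): 512,
}
AXES = ("location", "time", "item")

def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and drop it from the key."""
    axis = AXES.index(dimension)
    return {
        key[:axis] + key[axis + 1:]: cell
        for key, cell in cube.items()
        if key[axis] == value
    }

q1 = slice_cube(cube, "time", "Q1")
print(q1)
```

The resulting sub-cube has one dimension fewer: its cells are keyed by location and item only.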
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
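The three selection criteria above can be applied to a dictionary-based cube as follows. The criteria are taken from the text; the cell values, the extra cells, and the helper name dice are assumptions made for the example.

```python
# Hypothetical cells keyed by (location, time, item).
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q3", "Mobile"): 680,
    ("Vancouver", "Q2", "Modem"): 825,
    ("Vancouver", "Q2", "Phone"): 14,
    ("Chicago", "Q1", "Modem"): 440,
}
AXES = ("location", "time", "item")

def dice(cube, criteria):
    """Keep the cells that satisfy every criterion; no dimension is dropped."""
    def matches(key):
        return all(
            key[AXES.index(dimension)] in allowed
            for dimension, allowed in criteria.items()
        )
    return {key: cell for key, cell in cube.items() if matches(key)}

sub_cube = dice(cube, {
    "location": {"Toronto", "Vancouver"},
    "time": {"Q1", "Q2"},
    "item": {"Mobile", "Modem"},
})
print(sub_cube)
```

Unlike slice, the sub-cube keeps all three dimensions; each one is merely restricted to the listed values.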
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot
operation.
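Pivoting a 2D view amounts to swapping its rows and columns. A minimal sketch, assuming the view is stored as a dictionary of rows; the items, cities, and figures are invented.

```python
# Hypothetical 2D view: rows are items, columns are locations.
view = {
    "Mobile": {"Toronto": 605, "Vancouver": 1087},
    "Modem": {"Toronto": 825, "Vancouver": 968},
}

def pivot(table):
    """Rotate the axes: rows become columns and columns become rows."""
    rotated = {}
    for row, cells in table.items():
        for column, value in cells.items():
            rotated.setdefault(column, {})[row] = value
    return rotated

rotated = pivot(view)
print(rotated)
```

The data itself is unchanged; pivoting twice returns the original view.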
OLAP vs OLTP
2 OLAP systems are used by knowledge workers such as executives, managers, and analysts.
  OLTP systems are used by clerks, DBAs, or database professionals.
millions.
11 OLAP database size is from 100 GB to 1 TB.
   OLTP database size is from 100 MB to 1 GB.
References:
1. Decision support system, EIS, 2000.
2. W. H. Inmon, "Building the Data Warehouse", Wiley, 1998.
3. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Harcourt India, 2001.
4. https://www.javatpoint.com/data-mining
5. http://www.lastnightstudy.com/Show?id=30/What-is-Data-Mining?
6. https://www.includehelp.com/data-warehouse/multidimensional-data-model.aspx
7. https://www.tutorialspoint.com/dwh/dwh_olap.htm