Data Warehousing and Data Mining

The document discusses the necessity of data warehouses for business analytics, highlighting their role as a single source of truth that enhances data quality, consistency, and query performance. It contrasts data warehouses with traditional database systems, emphasizing their focus on historical data and decision-making rather than daily operations. Additionally, it outlines the architecture, characteristics, and implementation steps of data warehouses and data marts, underscoring their importance in managing large volumes of data for strategic business insights.

NEED OF DATA WAREHOUSE

The first question that arises is: why build a Data Warehouse,
and why spend so much money and time on it, when you could
connect BI tools directly to the transaction systems? There are
many limitations to that approach, and enterprises gradually
came to understand the need for a Data Warehouse. Let's look
at some of the points that make using a Data Warehouse so
important for Business Analytics.

• It serves as a Single Source of Truth for all the data within
the company. Using a Data Warehouse eliminates the
following issues:
o Data quality issues
o Unstable data in reports
o Data inconsistency
o Low query performance
• A Data Warehouse gives the ability to quickly run analysis
on huge volumes of data.
• If there is any change in the structure of the data in the
operational or transactional databases, it will not break the
business reports running on top of it, because those
databases are not directly connected to the BI or reporting
tools.
• Cloud Data Warehouses (such as Amazon Redshift and
Google BigQuery) offer the added advantage that you need
not invest in them upfront. Instead, you pay as you go as
the size of your data increases.
• When companies want to make data available for everyone,
they will understand the need for a Data Warehouse. You
can expose the data within the company for analysis while
hiding certain sensitive information (such as PII – Personally
Identifiable Information about your customers or partners).
• There is always a need for a Data Warehouse as the
complexity of queries increases and users need faster query
processing. Transactional databases store data in a
normalized form, whereas fast query processing is achieved
with the denormalized data that is available in a Data
Warehouse.

• 1) Business users: Business users require a data warehouse
to view summarized data from the past. Since these people
are non-technical, the data may be presented to them in an
elementary form.
• 2) Store historical data: A data warehouse is required to
store time-variant data from the past. This input is used for
various purposes.
• 3) Make strategic decisions: Some strategies may depend
upon the data in the data warehouse. So, the data warehouse
contributes to making strategic decisions.
• 4) For data consistency and quality: By bringing data from
different sources to a common place, the user can effectively
bring uniformity and consistency to the data.
• 5) High response time: The data warehouse has to be ready
for somewhat unexpected loads and types of queries, which
demands a significant degree of flexibility and quick
response time.
• DEFINITION
• A Data Warehouse is a relational database management
system (RDBMS) construct designed to meet the
requirements of decision support rather than of transaction
processing systems. It can be loosely described as any
centralized data repository which can be queried for
business benefit. It is a database that stores information
oriented to satisfying decision-making requests. It is a
group of decision support technologies targeted at
enabling the knowledge worker (executive, manager,
and analyst) to make better and faster decisions.
So, Data Warehousing supports architectures and tools
for business executives to systematically organize,
understand, and use their information to make
strategic decisions.
• A Data Warehouse environment contains an extraction,
transformation, and loading (ETL) solution, an online
analytical processing (OLAP) engine, customer
analysis tools, and other applications that handle the
process of gathering information and delivering it to
business users.
• What is a Data Warehouse?
• A Data Warehouse (DW) is a relational database that
is designed for query and analysis rather than
transaction processing. It includes historical data
derived from transaction data from single and
multiple sources.
• A Data Warehouse provides integrated, enterprise-
wide, historical data and focuses on providing
support for decision-makers for data modeling and
analysis.
• A Data Warehouse is a group of data specific to the
entire organization, not only to a particular group of
users.
• It is not used for daily operations and transaction
processing but used for making decisions.
• A Data Warehouse can be viewed as a data system
with the following attributes:
• It is a database designed for investigative tasks, using
data from various applications.
• It supports a relatively small number of clients with
relatively long interactions.
• It includes current and historical data to provide a
historical perspective of information.
• Its usage is read-intensive.
• It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-


variant store of information in support of management's
decisions."

Characteristics of Data Warehouse

Subject-Oriented

A data warehouse targets the modeling and analysis of data
for decision-makers. Therefore, data warehouses typically
provide a concise and straightforward view around a particular
subject, such as customer, product, or sales, instead of the
organization's ongoing operations as a whole. This is done by
excluding data that are not useful concerning the subject and
including all data needed by the users to understand the subject.

Integrated

A data warehouse integrates various heterogeneous data
sources like RDBMSs, flat files, and online transaction records. It
requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions,
attribute types, etc., among the different data sources.
Time-Variant

Historical information is kept in a data warehouse. For example,
one can retrieve data from 3 months, 6 months, 12 months, or
even older from a data warehouse. This contrasts with a
transaction system, where often only the most current data is
kept.

Non-Volatile

The data warehouse is a physically separate data store, into
which data is transformed from the source operational RDBMS.
Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed. It
usually requires only two procedures for data access: initial
loading of data and access to data. Therefore, the DW does not
require transaction processing, recovery, and concurrency
capabilities, which allows for a substantial speedup of data
retrieval. Non-volatile means that, once entered into the
warehouse, data should not change.

Difference between Database System and Data Warehouse

Database System: A database system is used in the traditional
way of storing and retrieving data. The major task of a database
system is to perform query processing. These systems are
generally referred to as online transaction processing (OLTP)
systems. They are used for the day-to-day operations of an
organization.

Data Warehouse: A data warehouse is the place where a huge
amount of data is stored. It is meant for users or knowledge
workers in the role of data analysis and decision making. These
systems are supposed to organize and present data in different
formats and forms in order to serve the needs of a specific user
for a specific purpose. These systems are referred to as online
analytical processing (OLAP) systems.

Difference between Database System and Data Warehouse:
Database System                              | Data Warehouse
It supports operational processes.           | It supports analysis and performance reporting.
Capture and maintain the data.               | Explore the data.
Current data.                                | Multiple years of history.
Data is balanced within the scope of this    | Data must be integrated and balanced from
one system.                                  | multiple systems.
Data is updated when a transaction occurs.   | Data is updated on scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB.                                | 100 GB to TB.
ER based.                                    | Star/Snowflake.
Application oriented.                        | Subject oriented.
Primitive and highly detailed.               | Summarized and consolidated.
Flat relational.                             | Multidimensional.

What is Data Mart?

A Data Mart is a subset of an organizational data store,
generally oriented to a specific purpose or primary data subject,
which may be distributed to support business needs. Data Marts
are analytical data stores designed to focus on particular
business functions for a specific community within an
organization. Data marts are usually derived from subsets of
data in a data warehouse, though in the bottom-up data
warehouse design methodology, the data warehouse is created
from the union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence
(BI) applications. BI is used to gather, store, access, and analyze
records. A data mart can be used by smaller businesses to
utilize the data they have accumulated, since it is less expensive
than implementing a full data warehouse.
Reasons for creating a data mart
o Creates collective data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential users are more clearly defined than in a
comprehensive data warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts

o Dependent Data Marts
o Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset or a physical subset
of a larger data warehouse. According to this technique, the
data marts are treated as subsets of a data warehouse. In this
technique, a data warehouse is created first, from which further
data marts can be created. These data marts are dependent on
the data warehouse and extract the essential records from it.
In this technique, since the data warehouse creates the data
marts, there is no need for data mart integration. It is also
known as the top-down approach.

Independent Data Marts

The second approach is Independent Data Marts (IDM). Here,
independent data marts are created first, and then a data
warehouse is designed using these multiple independent data
marts. In this approach, as all the data marts are designed
independently, the integration of data marts is required. It is
also termed the bottom-up approach, as the data marts are
integrated to develop the data warehouse.
Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design
the schema, construct the physical storage, populate the data
mart with data from source systems, access it to make informed
decisions, and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase
covers all of the functions from initiating the request for a data
mart through gathering data about the requirements and
developing the logical and physical design of the data mart.

1. Gathering the business and technical requirements


2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data
mart.

Constructing

This step contains creating the physical database and logical
structures associated with the data mart to provide fast and
efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures, such
as tablespaces, associated with the data mart.
2. Creating the schema objects, such as tables and indexes,
described in the design step.
3. Determining how best to set up the tables and access
structures.

Populating

This step includes all of the tasks related to getting data from
the source, cleaning it up, modifying it to the right format and
level of detail, and moving it into the data mart.

It involves the following tasks:

1. Mapping data sources to target data structures
2. Extracting data
3. Cleansing and transforming the data
4. Loading data into the data mart
5. Creating and storing metadata

Accessing

This step involves putting the data to use: querying the data,
analyzing it, creating reports, charts, and graphs, and publishing
them.

It involves the following tasks:

1. Setting up an intermediate layer (meta layer) for the front-end
tool to use. This layer translates database operations and object
names into business terms so that end-users can interact with
the data mart using words which relate to the business
functions.
2. Setting up and managing database structures, like summarized
tables, which help queries submitted through the front-end tools
execute rapidly and efficiently.

Managing

This step involves managing the data mart over its lifetime. In
this step, management functions are performed, such as:

1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.
Data Warehouse Architecture

A data warehouse architecture is a method of defining the
overall architecture of data communication, processing, and
presentation that exists for end-client computing within the
enterprise. Each data warehouse is different, but all are
characterized by standard vital components.

Production applications such as payroll, accounts payable,
product purchasing, and inventory control are designed for
online transaction processing (OLTP). Such applications gather
detailed data from day-to-day operations.

Data Warehouse applications are designed to support the user's
ad-hoc data requirements, an activity recently dubbed online
analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand
or via OLTP applications. In contrast, a warehouse database is
updated from operational systems periodically, usually during
off-hours. As OLTP data accumulates in production databases, it
is regularly extracted, filtered, and then loaded into a dedicated
warehouse server that is accessible to users. As the warehouse
is populated, it must be restructured: tables are de-normalized,
data is cleansed of errors and redundancies, and new fields and
keys are added to reflect the needs of the user for sorting,
combining, and summarizing data.

Properties of Data Warehouse Architectures

1. Separation: Analytical and transactional processing should be
kept apart as much as possible.

2. Scalability: Hardware and software architectures should be
simple to upgrade as the data volume that has to be managed
and processed, and the number of users' requirements that have
to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate
new operations and technologies without redesigning the whole
system.

4. Security: Monitoring accesses is necessary because of the
strategic data stored in the data warehouse.

5. Administerability: Data warehouse management should not
be complicated.

Types of Data Warehouse Architectures

Single-Tier Architecture

Single-tier architecture is rarely used in practice. Its purpose is
to minimize the amount of data stored; to reach this goal, it
removes data redundancies.

The figure shows that the only layer physically available is the
source layer. In this method, data warehouses are virtual. This
means that the data warehouse is implemented as a
multidimensional view of operational data created by specific
middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional
processing. Analysis queries are submitted to operational data
after the middleware interprets them. In this way, queries affect
transactional workloads.

Two-Tier Architecture

The requirement for separation plays an essential role in defining
the two-tier architecture for a data warehouse system, as shown
in the figure.

Although it is typically called two-layer architecture to highlight
a separation between physically available sources and data
warehouses, it in fact consists of four subsequent data flow
stages:

1. Source layer: A data warehouse system uses heterogeneous
sources of data. That data is originally stored in corporate
relational databases or legacy databases, or it may come from
information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be
extracted, cleansed to remove inconsistencies and fill gaps,
and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation,
and Loading (ETL) tools can combine heterogeneous schemata
and extract, transform, cleanse, validate, filter, and load source
data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically
centralized, individual repository: the data warehouse. The data
warehouse can be directly accessed, but it can also be used as
a source for creating data marts, which partially replicate data
warehouse contents and are designed for specific enterprise
departments. Meta-data repositories store information on
sources, access procedures, data staging, users, data mart
schemas, and so on.
4. Analysis layer: In this layer, the integrated data is efficiently
and flexibly accessed to issue reports, dynamically analyze
information, and simulate hypothetical business scenarios.
Three-Tier Architecture

The three-tier architecture consists of the source layer
(containing multiple source systems), the reconciled layer,
and the data warehouse layer (containing both data
warehouses and data marts). The reconciled layer sits between
the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a
standard reference data model for the whole enterprise. At the
same time, it separates the problems of source data extraction
and integration from those of data warehouse population. In
some cases, the reconciled layer is also used directly to better
accomplish some operational tasks, such as producing daily
reports that cannot be satisfactorily prepared using the
corporate applications, or generating data flows to feed external
processes periodically so as to benefit from cleaning and
integration.
What is Star Schema?

A star schema is the elementary form of a dimensional model, in
which data are organized into facts and dimensions. A fact is an
event that is counted or measured, such as a sale or a login. A
dimension includes reference data about the fact, such as date,
item, or customer.

A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest
data warehouse schema. It is known as a star schema because
the entity-relationship diagram of this schema resembles a star,
with points diverging from a central table. The center of the
schema consists of a large fact table, and the points of the star
are the dimension tables.
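To make the layout concrete, here is a minimal, hedged sketch of a star schema using Python's built-in sqlite3 module. The table and column names (sales_fact, dim_date, dim_item, dim_customer, and so on) are illustrative assumptions, not something prescribed by the text above.

```python
import sqlite3

# A minimal star schema sketch: one central fact table whose foreign keys
# point at denormalized dimension tables (date, item, customer).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    day INTEGER, month INTEGER, quarter INTEGER, year INTEGER
);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, item_type TEXT, brand TEXT
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT, city TEXT, country TEXT
);
-- The fact table: foreign keys to every dimension plus numeric measures.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES dim_date(date_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (date_key, item_key, customer_key)  -- composite key of all FKs
);
""")

# A typical analytical query joins the fact table to the dimensions it needs.
query = """
SELECT d.year, i.item_type, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_item i ON f.item_key = i.item_key
GROUP BY d.year, i.item_type;
"""
print(conn.execute(query).fetchall())   # [] until rows are loaded
```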

Fact Tables

A fact table is a table in a star schema that contains facts and is
connected to dimensions. A fact table has two types of columns:
those that contain facts and those that are foreign keys to the
dimension tables. The primary key of a fact table is generally a
composite key made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that
have been aggregated (fact tables that contain aggregated facts
are often called summary tables). A fact table generally contains
facts at the same level of aggregation.

Characteristics of Star Schema

The star schema is very suitable for data warehouse database
design because of the following features:

o It creates a denormalized database that can quickly provide
query responses.
o It provides a flexible design that can be changed easily or
added to throughout the development cycle, and as the
database grows.
o It parallels in design how end-users typically think of and
use the data.
o It reduces the complexity of metadata for both developers
and end-users.
Advantages of Star Schema:

o Simpler queries –
The join logic of a star schema is quite simple in comparison
to the join logic needed to fetch data from a highly
normalized transactional schema.

o Simplified business reporting logic –
In comparison to a highly normalized transactional schema,
the star schema simplifies common business reporting logic,
such as period-over-period and as-of reporting.

o Feeding cubes –
The star schema is widely used by OLAP systems to design
OLAP cubes efficiently. In fact, major OLAP systems deliver
a ROLAP mode of operation which can use a star schema as
a source without designing a cube structure.

Disadvantages of Star Schema –
1. Data integrity is not enforced well, since the schema is in a
highly denormalized state.
2. It is not as flexible in terms of analytical needs as a
normalized data model.
3. Star schemas don't support many-to-many relationships
between business entities – at least not easily.

Features:

Central fact table: The star schema revolves around a central
fact table that contains the numerical data being analyzed.
This table contains foreign keys that link to the dimension
tables.

Dimension tables: Dimension tables contain descriptive
attributes about the data being analyzed. These attributes
provide context to the numerical data in the fact table. Each
dimension table is linked to the fact table through a foreign
key.

Denormalized structure: A star schema is denormalized, which
means that redundancy is allowed in the schema design to
improve query performance. This is because it is easier and
faster to join a small number of tables than a large number of
tables.

Simple queries: The star schema is designed to make queries
simple and fast. Queries can be written in a straightforward
manner by joining the fact table with the appropriate dimension
tables.
Snowflake Schema

A Snowflake Schema is a type of multidimensional model used
for data warehouses. A snowflake schema contains the fact
table, dimension tables, and one or more additional tables for
each dimension table. The snowflake schema is a normalized
form of the star schema, which reduces redundancy and saves
significant storage. The trade-off is that, because the dimensions
are split across more tables, queries generally need more joins
than in a star schema and can therefore be more complex.
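As a hedged illustration of how snowflaking normalizes a dimension (the names dim_item, dim_brand, and sales_fact are assumptions carried over from the earlier star schema sketch, not part of the text), the item dimension could be split like this:

```python
import sqlite3

# Snowflaked version of the item dimension: the brand attributes are moved
# into their own table, and dim_item now references it, so reading an item's
# brand attributes requires one extra join compared to the star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_brand (
    brand_key    INTEGER PRIMARY KEY,
    brand_name   TEXT,
    manufacturer TEXT
);
CREATE TABLE dim_item (
    item_key  INTEGER PRIMARY KEY,
    item_name TEXT,
    item_type TEXT,
    brand_key INTEGER REFERENCES dim_brand(brand_key)  -- normalized out
);
CREATE TABLE sales_fact (
    item_key     INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL
);
""")

# Sales by manufacturer now needs fact -> item -> brand (two joins),
# whereas in the star schema the brand columns lived directly in dim_item.
query = """
SELECT b.manufacturer, SUM(f.dollars_sold)
FROM sales_fact f
JOIN dim_item  i ON f.item_key  = i.item_key
JOIN dim_brand b ON i.brand_key = b.brand_key
GROUP BY b.manufacturer;
"""
print(conn.execute(query).fetchall())
```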

Advantages:

Reduced data redundancy: The snowflake schema reduces data
redundancy by normalizing dimensions into multiple tables,
resulting in more efficient use of storage space.

Improved performance for some queries: Because the
normalized dimension tables are smaller, queries that touch
only part of a dimension can sometimes be answered faster.

Scalability: The snowflake schema is scalable, making it suitable
for large data warehousing projects with complex hierarchies.

Disadvantages:

Increased complexity: The snowflake schema can be more
complex to implement and maintain due to the additional tables
needed for the normalized dimensions.

Reduced query performance: The increased number of joins in
the snowflake schema can result in reduced query performance,
particularly for queries that require data from multiple
dimensions.

Data integrity: It can be more difficult to maintain data integrity
in a snowflake schema due to the additional relationships
between tables.
Characteristics of the Snowflake Schema:
• A snowflake schema is permitted to have dimension tables
joined to other dimension tables
• A snowflake schema typically has one fact table only
• A snowflake schema creates normalized dimension tables
• The normalized schema reduces the disk space required for
running and managing this data warehouse
• The snowflake schema offers an easier way to implement a
dimension
• What Is a Data Warehouse Schema?

• We can think of a data warehouse schema as a blueprint or


an architecture of how data will be stored and managed. A
data warehouse schema isn’t the data itself, but the
organization of how data is stored and how it relates to
other data within the data warehouse.

In the past, data warehouse schemas were often strictly
enforced across an enterprise, but in modern implementations,
where storage is increasingly inexpensive, schemas have become
less constrained. Despite this loosening, or sometimes total
abandonment, of data warehouse schemas, knowledge of the
foundational schema designs can be important both for
maintaining legacy resources and for creating modern data
warehouse designs that learn from the past.
• The basic components of all data warehouse schemas are
fact and dimension tables. Different combinations of these
two central elements compose almost the entirety of all data
warehouse schema designs.
• What Is a Galaxy Schema?
• The Galaxy Schema, also known as a Fact Constellation
Schema, acts as the next iteration of the data warehouse
schema. Unlike the Star Schema and Snowflake Schema, the
Galaxy Schema uses multiple fact tables connected with
shared, normalized dimension tables. A Galaxy Schema can be
thought of as star schemas interlinked and completely
normalized, avoiding any kind of redundancy or inconsistency
of data.

Characteristics of the Galaxy Schema:

• The Galaxy Schema is multidimensional, making it a strong
design consideration for complex database systems
• The Galaxy Schema reduces redundancy to near zero as a
result of normalization
• The Galaxy Schema is known for high data quality and
accuracy and lends itself to effective reporting and analytics
(a small sketch of shared fact tables follows this list)
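As a hedged sketch of the idea of shared dimensions (the names sales_fact, shipping_fact, dim_date, and dim_item are illustrative assumptions): in a fact constellation, two fact tables reference the same conformed dimension tables rather than keeping private copies.

```python
import sqlite3

# Galaxy / fact constellation sketch: two fact tables (sales and shipping)
# share the same dimension tables instead of each having its own copies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);

CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES dim_date(date_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    dollars_sold REAL
);
CREATE TABLE shipping_fact (
    date_key INTEGER REFERENCES dim_date(date_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_shipped INTEGER
);
""")

# Because both fact tables share dim_date and dim_item, sales and shipments
# can be compared along exactly the same dimension values.
```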

Schemas Used in Data Warehouses: Star, Galaxy, and


Snowflake

What Is a Star Schema in a Data Warehouse?

The star schema in a data warehouse is historically one of the
most straightforward designs. This schema follows some distinct
design parameters, such as only permitting one central table and
a handful of single-dimension tables joined to it. In following
these design constraints, a star schema can resemble a star, with
one central table and, say, five dimension tables joined to it
(which is where the star schema got its name).

The star schema is known for creating denormalized dimension
tables – a database structuring strategy that organizes tables to
introduce redundancy for improved performance. Denormalization
deliberately introduces redundancy into the dimensions so long
as it improves query performance.

Difference between Fact Tables and Dimension Tables

What is a Fact Table?

The fact table is a table that stores the measurements taken
against the attributes of the dimension tables. It contains
quantitative data that has been denormalized. It essentially
holds the data that must be evaluated. Fact tables typically have
two kinds of columns: foreign keys that allow them to be joined
with dimension tables, and the values or data that have to be
evaluated. It is largely made up of numbers. It expands
vertically, with more records and fewer attributes.

Characteristics of Fact Table

There are various characteristics of the fact table. Some main
characteristics of the fact table are as follows (a small worked
example of additive and semi-additive measures follows this
list):

1. Concatenated Key

The fact table includes a concatenated key, which is the
concatenation of the primary keys of all the dimension tables.
The fact table's concatenated key must uniquely identify each
row.

2. Additive Measures

Fact table attributes might be fully additive, semi-additive, or
non-additive. Fully additive measures can be summed across all
dimensions. Semi-additive measures can be summed across some
dimensions but not others (for example, an account balance can
be summed across accounts but not across time). Non-additive
measures cannot be meaningfully summed across any dimension
(for example, ratios or unit prices).

3. Degenerate Dimensions

Degenerate dimensions are dimension attributes (such as an
order or invoice number) that are stored in the fact table itself
and have no separate dimension table of their own.

4. Fact Table Grain

The level of detail of the data that is stored in the fact table is
known as the grain of the table. An efficient fact table should be
designed at the highest practical level of detail (the most atomic
grain).

5. Sparse Data

Some data records in the fact table include attributes with null
values or measurements, indicating that they do not provide any
data.

6. Shrunken Rollup Dimensions

These are dimensions that are built from a subset of the rows
and columns of a base dimension.

7. Outrigger Dimensions

Outrigger dimensions are dimensions that contain a reference
to another dimension table.
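A minimal sketch of the additivity distinction, using made-up numbers (the figures and column names are assumptions for illustration only):

```python
# Fully additive: units_sold can be summed across every dimension.
# Semi-additive: account_balance can be summed across accounts, not across time.
# Non-additive: unit_price (a ratio) should not be summed at all; recompute it.

rows = [
    # (month, account, units_sold, account_balance, unit_price)
    ("Jan", "A", 10, 500.0, 5.0),
    ("Jan", "B", 20, 300.0, 4.0),
    ("Feb", "A", 15, 450.0, 5.5),
    ("Feb", "B", 25, 350.0, 4.5),
]

# Additive: total units over all months and accounts is meaningful.
total_units = sum(r[2] for r in rows)                    # 70

# Semi-additive: total balance per month (summed over accounts) is meaningful...
jan_balance = sum(r[3] for r in rows if r[0] == "Jan")   # 800.0
# ...but summing January's and February's balances for one account is not;
# balances over time are usually averaged, or the last value is taken.

# Non-additive: an average price must be derived from additive facts,
# not by summing unit_price values directly.
avg_price = sum(r[2] * r[4] for r in rows) / total_units
print(total_units, jan_balance, round(avg_price, 2))
```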

What is a Dimension Table?

The dimension table is an important part of the star schema.
A dimension table includes the dimensions along which the
values of the attributes are taken in the fact table. Dimension
tables are comparatively small; they usually hold a few thousand
rows, though their size can expand occasionally. These tables
are connected to the fact table via foreign keys, and these
dimension tables are denormalized. The dimension table is
hierarchical in nature and expands horizontally.

Characteristics of Dimension Table

There are various characteristics of the dimension table. Some
main characteristics of the dimension table are as follows:

1. Attributes and Keys

Each dimension table must include a primary key that uniquely
identifies every record of the table. It is typically observed that
a dimension table has many attributes. As a result, it looks
wide, and when you create a dimension table, you will find that
it spreads horizontally.

2. Attribute Values

The values of attributes in dimension tables are rarely numeric;
instead, the values of attributes are usually in textual format.

3. Normalization

A dimension table is not normalized, because normalization
splits the data and produces new tables, which reduces query
execution efficiency, since the query must travel through these
extra tables when it needs to retrieve measurements from the
fact table for any equivalent attribute in the dimension table.

4. Relation Among Attributes

Attributes in the dimension table are often not directly related
to one another, although they are all part of the same dimension
table.

Concept Hierarchy in Data Mining

In data mining, a concept hierarchy refers to the organization of
data into a tree-like structure, where each level of the hierarchy
represents a concept that is more general than the level below
it. This hierarchical organization of data allows for more
efficient and effective data analysis, as well as the ability to
drill down to more specific levels of detail when needed.
Concept hierarchies are used to organize and classify data in a
way that makes it more understandable and easier to analyze.
The main idea is that the same data can have different levels of
granularity or detail, and that by organizing the data in a
hierarchical fashion, it becomes easier to understand and
analyze (a small sketch follows the hierarchy types below).
1. Schema Hierarchy: Schema Hierarchy is a type of
concept hierarchy that is used to organize the schema
of a database in a logical and meaningful way, grouping
similar objects together. A schema hierarchy can be
used to organize different types of data, such as tables,
attributes, and relationships, in a logical and
meaningful way. This can be useful in data
warehousing, where data from multiple sources needs
to be integrated into a single database.
2. Set-Grouping Hierarchy: Set-Grouping Hierarchy is a
type of concept hierarchy that is based on set theory,
where each set in the hierarchy is defined in terms of its
membership in other sets. Set-grouping hierarchy can
be used for data cleaning, data pre-processing and data
integration. This type of hierarchy can be used to
identify and remove outliers, noise, or inconsistencies
from the data and to integrate data from multiple
sources.
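As a small, hedged illustration of a concept hierarchy (the city-to-country mapping and the generalize helper below are invented for the example), the same data can be viewed at different levels of generality:

```python
# A simple location concept hierarchy: city -> province/state -> country.
HIERARCHY = {
    "Vancouver": ("British Columbia", "Canada"),
    "Toronto":   ("Ontario",          "Canada"),
    "Chicago":   ("Illinois",         "USA"),
    "New York":  ("New York",         "USA"),
}

def generalize(city: str, level: str) -> str:
    """Map a city to the more general concept at the requested level."""
    province, country = HIERARCHY[city]
    return {"city": city, "province": province, "country": country}[level]

sales = [("Vancouver", 120), ("Toronto", 80), ("Chicago", 200), ("New York", 150)]

# Rolling raw city-level data up to the country level of the hierarchy.
by_country: dict[str, int] = {}
for city, amount in sales:
    key = generalize(city, "country")
    by_country[key] = by_country.get(key, 0) + amount

print(by_country)   # {'Canada': 200, 'USA': 350}
```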

Need of Concept Hierarchy in Data Mining

There are several reasons why a concept hierarchy is useful in


data mining:
1. Improved Data Analysis: A concept hierarchy can help
to organize and simplify data, making it more
manageable and easier to analyze. By grouping similar
concepts together, a concept hierarchy can help to
identify patterns and trends in the data that would
otherwise be difficult to spot. This can be particularly
useful in uncovering hidden or unexpected insights that
can inform business decisions or inform the
development of new products or services.
2. Improved Data Visualization and Exploration: A
concept hierarchy can help to improve data
visualization and data exploration by organizing data
into a tree-like structure, allowing users to easily
navigate and understand large and complex data sets.
This can be particularly useful in creating interactive
dashboards and reports that allow users to easily drill
down to more specific levels of detail when needed.
3. Improved Algorithm Performance: The use of a
concept hierarchy can also help to improve the
performance of data mining algorithms. By organizing
data into a hierarchical structure, algorithms can more
easily process and analyze the data, resulting in faster
and more accurate results.

Applications of Concept Hierarchy

There are several applications of concept hierarchy in data


mining, some examples are:
1. Data Warehousing: Concept hierarchy can be used in
data warehousing to organize data from multiple
sources into a single, consistent and meaningful
structure. This can help to improve the efficiency and
effectiveness of data analysis and reporting.
2. Business Intelligence: Concept hierarchy can be used
in business intelligence to organize and analyze data in
a way that can inform business decisions. For example,
it can be used to analyze customer data to identify
patterns and trends that can inform the development of
new products or services.
3. Online Retail: Concept hierarchy can be used in online
retail to organize products into categories,
subcategories, and sub-subcategories; this helps
customers find the products they are looking for
more quickly and easily.
Meta Data Repository

Introduction
The Meta Data Repository is responsible for storing domains
and their master data models. The models stored within this
service are consulted for different tasks such as data
validation. The meta models are also used by the transformer
to map the incoming data onto the Open Integration Hub
standard.

Technologies used
• Node.js
• MongoDB
• JSON Schema

Purpose of the Meta Data Repository


If we talk about metadata in this context, we mean the
description of the domains and their corresponding Master
Data Models. An Open Integration Hub Master Data Model
(OMDM) describes the data of a certain domain in sufficient
depth to map and synchronize the specific data of multiple
applications in that domain. The metadata delivers all the
information a user or customer needs to work with data within
a specific domain.

The domain models are specified by special workgroups. Please
see the specific domain model repository for further information
on a domain and its master data model.
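The repository itself is a Node.js/MongoDB service; as a rough, hedged sketch of what validating incoming records against a stored master data model looks like, here is a Python example using the jsonschema package. The "address" model shown is an invented example, not an actual Open Integration Hub domain model.

```python
from jsonschema import validate, ValidationError

# A toy master data model expressed as JSON Schema, standing in for a
# model that the Meta Data Repository would store for some domain.
ADDRESS_MODEL = {
    "type": "object",
    "properties": {
        "street":  {"type": "string"},
        "city":    {"type": "string"},
        "country": {"type": "string"},
    },
    "required": ["street", "city", "country"],
}

incoming = {"street": "Main St 1", "city": "Berlin", "country": "DE"}

try:
    # Data validation against the stored meta model, as described above.
    validate(instance=incoming, schema=ADDRESS_MODEL)
    print("record conforms to the master data model")
except ValidationError as err:
    print("rejected:", err.message)
```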

What is a Metadata Repository?

A metadata repository is a software tool that stores descriptive


information about the data model used to store and
share metadata. Metadata repositories combine diagrams and
text, enabling metadata integration and change. A successful
repository allows an organization to create a high-level
conception or map of its data, while also providing better data
usage across systems.

Data Warehouse Back-End Tools and Utilities

• Data extraction:
o get data from multiple, heterogeneous, and external
sources

• Data cleaning:
o detect errors in the data and rectify them when possible

• Data transformation:
o convert data from legacy or host format to warehouse
format

• Load:
o sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions

• Refresh:
o propagate the updates from the data sources to the
warehouse (a minimal ETL sketch follows this list)
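The back-end steps above can be strung together in a very small pipeline. The following is a hedged sketch, not a production ETL tool: the file name sales.csv, the cleaning rule, and the target table are all assumptions made for illustration.

```python
import csv
import sqlite3

def extract(path: str):
    """Extraction: read raw rows from an external source (a CSV file here)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def clean_and_transform(rows):
    """Cleaning + transformation: drop bad rows and convert to warehouse format."""
    for row in rows:
        try:
            amount = float(row["amount"])          # verify/rectify the measure
        except (KeyError, ValueError):
            continue                               # discard records we cannot fix
        yield (row["date"].strip(), row["item"].strip().upper(), amount)

def load(rows, conn):
    """Load: consolidate into the warehouse table and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (date TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_item ON sales(item)")
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(clean_and_transform(extract("sales.csv")), warehouse)  # one refresh run
```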

UNIT - 2

Multidimensional Data Model

The Multidimensional Data Model is a method used for
organizing data in the database, with good arrangement and
assembly of the contents of the database.

The Multidimensional Data Model allows users to ask analytical
questions associated with market or business trends, unlike
relational databases, which allow users to access data only in
the form of queries. It lets users rapidly receive answers to
their requests by creating and examining the data comparatively
quickly. OLAP (online analytical processing) and data
warehousing use multidimensional databases. It is used to show
multiple dimensions of the data to users.

Working on a Multidimensional Data Model

The Multidimensional Data Model works on the basis of
pre-decided steps. The following stages should be followed by
every project for building a Multidimensional Data Model:

Stage 1: Assembling data from the client: In the first stage, a
Multidimensional Data Model collects the correct data from the
client. Mostly, software professionals give the client clarity about
the range of data which can be gained with the selected
technology, and collect the complete data in detail.

Stage 2: Grouping different segments of the system: In the
second stage, the Multidimensional Data Model recognizes and
classifies all the data into the respective section it belongs to,
and also makes it problem-free to apply step by step.

Stage 3: Noticing the different proportions: The third stage is
the basis on which the design of the system rests. In this stage,
the main factors are recognized according to the user's point of
view. These factors are also known as "dimensions".

Stage 4: Preparing the actual-time factors and their respective
qualities: In the fourth stage, the factors which are recognized
in the previous step are used further for identifying the related
qualities. These qualities are also known as "attributes" in the
database.

Features of multidimensional data models:

• Measures: Measures are numerical data that can be analyzed
and compared, such as sales or revenue. They are typically
stored in fact tables in a multidimensional data model.
• Dimensions: Dimensions are attributes that describe the
measures, such as time, location, or product. They are
typically stored in dimension tables in a multidimensional
data model.
• Cubes: Cubes are structures that represent the
multidimensional relationships between measures and
dimensions in a data model. They provide a fast and efficient
way to retrieve and analyze data.
• Aggregation: Aggregation is the process of summarizing data
across dimensions and levels of detail. This is a key feature of
multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity.
• Drill-down and roll-up: Drill-down is the process of moving
from a higher-level summary of data to a lower level of
detail, while roll-up is the opposite process of moving from a
lower level of detail to a higher-level summary. These features
enable users to explore data in greater detail and gain
insights into the underlying patterns.
• Hierarchies: Hierarchies are a way of organizing dimensions
into levels of detail. For example, a time dimension might be
organized into years, quarters, months, and days. Hierarchies
provide a way to navigate the data and perform drill-down
and roll-up operations. (A small cube sketch follows this list.)
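A hedged, toy illustration of these pieces (the cell values and dimension members are invented): a cube can be modeled as measure values keyed by one member per dimension, and aggregation collapses one or more of those dimensions.

```python
from collections import defaultdict

# A tiny 3-dimensional cube: (year, city, item) -> dollars_sold (the measure).
cube = {
    (2023, "Vancouver", "Phone"):  100,
    (2023, "Vancouver", "Laptop"): 250,
    (2023, "Toronto",   "Phone"):  180,
    (2024, "Vancouver", "Phone"):  120,
    (2024, "Toronto",   "Laptop"): 300,
}

def aggregate(cube, keep):
    """Aggregate (roll up) the measure over every dimension not in `keep`.

    `keep` is a set of dimension positions: 0=year, 1=city, 2=item.
    """
    out = defaultdict(int)
    for cell, value in cube.items():
        out[tuple(cell[i] for i in sorted(keep))] += value
    return dict(out)

# Roll-up: drop the item dimension, view the measure by (year, city).
print(aggregate(cube, keep={0, 1}))
# Roll-up further: total sales per year only.
print(aggregate(cube, keep={0}))
```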

Advantages of Multi Dimensional Data Model

The following are the advantages of a multi-dimensional data


model :
• A multi-dimensional data model is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases
(e.g. relational databases).
• The representation of data is better than traditional
databases. That is because the multi-dimensional
databases are multi-viewed and carry different types of
factors.
• It is workable on complex systems and applications,
contrary to the simple one-dimensional database
systems.
• The compatibility of this type of database is an advantage
for projects having lower bandwidth for maintenance staff.

Disadvantages of the Multidimensional Data Model

The following are the disadvantages of a Multidimensional
Data Model:
• The multidimensional data model is slightly complicated in
nature, and it requires professionals to recognize and
examine the data in the database.
• While working with a Multidimensional Data Model, a
system crash has a great effect on the working of the
system.
• It is complicated in nature, due to which the databases are
generally dynamic in design.
• The path to achieving the end product is complicated most
of the time.
• As the Multidimensional Data Model involves complicated
systems made up of a large number of databases, the system
is very vulnerable when there is a security breach.

Functions of Data warehouse

A data warehouse is a collection of data that is organized


to provide various functions for managing and analyzing
data. Some of the important functions of a data
warehouse are −

• Data Consolidation
• Data Cleaning
• Data Integration
• Data Storage
• Data Transformation
• Data Analysis
• Data Reporting
• Data Mining
• Performance Optimization

What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a
category of software technology which enables analysts,
managers, and executives to gain insight into information
through fast, consistent, interactive access to a wide variety of
possible views of data that has been transformed from raw
information to reflect the real dimensionality of the enterprise
as understood by the user.

Who uses OLAP and Why?

OLAP applications are used across a variety of functions in an
organization.

Finance and accounting:

o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and Marketing

o Sales analysis and forecasting


o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide


business users with a data model more intuitive to them than a
tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is
usually difficult to achieve using tabular models.

How OLAP Works?

Fundamentally, OLAP has a very simple concept. It pre-calculates


most of the queries that are typically very hard to execute over
tabular databases, namely aggregation, joining, and grouping.
These queries are calculated during a process that is usually
called 'building' or 'processing' the OLAP cube. This process
typically happens overnight, and by the time end users get to
work, the data will have been updated.
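A hedged sketch of this pre-calculation idea using sqlite3 (the sales table and the nightly summary table are assumptions for the example): the expensive GROUP BY is run once during "processing", and later queries read the small summary table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, item TEXT, dollars REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2023, "Phone", 100.0), (2023, "Laptop", 250.0),
     (2024, "Phone", 120.0), (2024, "Laptop", 300.0)],
)

# 'Building' the cube: run the expensive aggregation once (e.g., overnight)
# and materialize the result as a summary table.
conn.execute("""
    CREATE TABLE sales_by_year_item AS
    SELECT year, item, SUM(dollars) AS total_dollars
    FROM sales
    GROUP BY year, item
""")

# At query time, end users hit the pre-aggregated table, which is much
# cheaper than re-scanning and re-grouping the detailed fact rows.
print(conn.execute(
    "SELECT total_dollars FROM sales_by_year_item WHERE year = 2024 AND item = 'Phone'"
).fetchone())
```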

1) Multidimensional Conceptual View: This is the central
feature of an OLAP system. By requiring a multidimensional
view, it is possible to carry out methods like slice and dice.

2) Transparency: Make the technology, the underlying data
repository, the computing operations, and the dissimilar nature
of source data totally transparent to users. Such transparency
helps to improve the efficiency and productivity of the users.

3) Accessibility: It provides access only to the data that is
actually required to perform the particular analysis, presenting
a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous
physical data stores and perform any necessary transformations.
The OLAP operations should sit between data sources (e.g.,
data warehouses) and an OLAP front-end.

6) Generic Dimensionality: An OLAP method should treat each
dimension as equivalent in both its structure and operational
capabilities. Additional operational capabilities may be granted
to selected dimensions, but such additional functions should be
grantable to any dimension.

8) Multiuser Support: OLAP tools must provide concurrent data
access, data integrity, and access security.

9) Unrestricted Cross-dimensional Operations: It provides the
ability for the system to recognize dimensional hierarchies and
perform roll-up and drill-down functions within a dimension or
across dimensions.

Characteristics of OLAP

Fast

It means that the system is targeted to deliver most responses
to the user within about five seconds, with the most elementary
analyses taking no more than one second and very few taking
more than 20 seconds.

Share

It means that the system implements all the security
requirements for confidentiality and, if multiple write access is
needed, concurrent update locking at an appropriate level. Not
all applications need users to write data back, but for the
increasing number that do, the system should be able to handle
multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. An OLAP system must provide a
multidimensional conceptual view of the data, including full
support for hierarchies, as this is certainly the most logical
method to analyze businesses and organizations.
Benefits of OLAP

OLAP holds several benefits for businesses:

1. OLAP helps managers in decision-making through the
multidimensional views of data that it efficiently provides,
thus increasing their productivity.
2. OLAP applications are self-sufficient owing to the inherent
flexibility of the underlying organized databases.
3. It facilitates the simulation of business models and problems
through extensive management of analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to
support a reduction in the application backlog, faster data
retrieval, and a reduction in query drag.

OLAP Operations in the Multidimensional Data Model

In the multidimensional model, the records are organized into
various dimensions, and each dimension includes multiple levels
of abstraction described by concept hierarchies. This
organization provides users with the flexibility to view data from
various perspectives. A number of OLAP data cube operations
exist to demonstrate these different views, allowing interactive
querying and searching of the records at hand. Hence, OLAP
supports a user-friendly environment for interactive data
analysis.

Consider the OLAP operations which are to be performed on
multidimensional data. The figure shows a data cube for sales of
a shop. The cube contains the dimensions location, time, and
item, where location is aggregated with respect to city values,
time is aggregated with respect to quarters, and item is
aggregated with respect to item types.

Roll-Up

The roll-up operation (also known as drill-up or the aggregation
operation) performs aggregation on a data cube by climbing up
a concept hierarchy, i.e., by dimension reduction. Roll-up is like
zooming out on the data cube. The figure shows the result of a
roll-up operation performed on the dimension location. The
hierarchy for location is defined as the order street < city <
province or state < country. The roll-up operation aggregates the
data by ascending the location hierarchy from the level of the
city to the level of the country.

Drill-Down

The drill-down operation (also called roll-down) is the reverse
of roll-up. Drill-down is like zooming in on the data cube. It
navigates from less detailed data to more detailed data.
Drill-down can be performed by either stepping down a concept
hierarchy for a dimension or adding additional dimensions.

The figure shows a drill-down operation performed on the
dimension time by stepping down a concept hierarchy which is
defined as day < month < quarter < year. Drill-down occurs by
descending the time hierarchy from the level of the quarter to
the more detailed level of the month.

Slice

A slice is a subset of the cube corresponding to a single value
for one or more members of the dimensions. For example, a
slice operation is executed when the user wants a selection on
one dimension of a three-dimensional cube, resulting in a two-
dimensional slice. So, the slice operation performs a selection on
one dimension of the given cube, thus resulting in a subcube.

Dice

The dice operation defines a subcube by performing a selection
on two or more dimensions.

Pivot

The pivot operation is also called rotation. Pivot is a
visualization operation which rotates the data axes in view to
provide an alternative presentation of the data. It may involve
swapping the rows and columns, or moving one of the row
dimensions into the column dimensions.
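A hedged sketch of these operations with pandas (the DataFrame contents and column names are invented for illustration):

```python
import pandas as pd

# Detailed sales records: dimensions (quarter, city, country, item) plus one measure.
df = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto", "Chicago", "Chicago"],
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "item":    ["Phone", "Phone", "Laptop", "Laptop", "Phone", "Laptop"],
    "dollars": [100, 80, 250, 300, 120, 180],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = df.groupby(["quarter", "country"])["dollars"].sum()

# Slice: fix a single value on one dimension (quarter = Q1).
slice_q1 = df[df["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = df[(df["quarter"] == "Q1") & (df["item"].isin(["Phone"]))]

# Pivot: rotate the axes so cities become rows and items become columns.
pivot = df.pivot_table(index="city", columns="item", values="dollars", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")
```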

Types of OLAP

There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on


relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based


on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both
relational and multidimensional techniques.

Relational OLAP (ROLAP) Server

These are intermediate servers which stand in between a


relational back-end server and user frontend tools.

They use a relational or extended-relational DBMS to save and


handle warehouse data, and OLAP middleware to provide
missing pieces.

ROLAP servers contain optimization for each DBMS back end,


implementation of aggregation navigation logic, and additional
tools and services.

ROLAP technology tends to have higher scalability than MOLAP


technology.

ROLAP systems work primarily from the data that resides in a


relational database, where the base data and dimension tables
are stored as relational tables. This model permits the
multidimensional analysis of data.

Relational OLAP Architecture

ROLAP architecture includes the following components:

o Database server
o ROLAP server
o Front-end tool

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP
technology segment in the market. This method allows multiple
multidimensional views of two-dimensional relational tables to
be created, avoiding the need to structure records around the
desired view.

Some products in this segment have supported strong SQL
engines to handle the complexity of multidimensional analysis.
This includes creating multiple SQL statements to handle user
requests, being 'RDBMS' aware, and also being capable of
generating the SQL statements based on the optimizer of the
DBMS engine. (A sketch of such SQL generation follows.)
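A hedged, much-simplified sketch of what "generating SQL from a multidimensional request" can look like (the build_rollup_sql helper and the star schema names are assumptions, not an actual ROLAP engine):

```python
def build_rollup_sql(dimensions: list[str], measure: str,
                     fact_table: str = "sales_fact") -> str:
    """Translate a multidimensional request (group by these dimensions,
    aggregate this measure) into a relational GROUP BY statement."""
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, SUM({measure}) AS total_{measure} "
        f"FROM {fact_table} "
        f"GROUP BY {dims};"
    )

# A ROLAP server would emit something like this against the star schema,
# instead of reading a pre-built multidimensional cube.
print(build_rollup_sql(["year", "item_type"], "dollars_sold"))
# SELECT year, item_type, SUM(dollars_sold) AS total_dollars_sold
# FROM sales_fact GROUP BY year, item_type;
```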

Advantages

Can handle large amounts of information: The data size
limitation of ROLAP technology depends on the data size of the
underlying RDBMS. So, ROLAP itself does not restrict the amount
of data.

Can leverage functionalities inherent in the relational database:
RDBMSs already come with a lot of features, so ROLAP
technologies (which work on top of the RDBMS) can leverage
these functionalities.
Disadvantages

Performance can be slow: Because each ROLAP report is
essentially one or more SQL queries against the relational
database, the query time can be long if the underlying data size
is large.

Limited by SQL functionalities: ROLAP technology relies upon
developing SQL statements to query the relational database,
and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

A MOLAP system is based on a native logical model that directly


supports multidimensional data and operations. Data are stored
physically into multidimensional arrays, and positional
techniques are used to access them.

One of the significant distinctions of MOLAP against ROLAP is
that data are summarized and stored in an optimized format in
a multidimensional cube, instead of in a relational database. In
the MOLAP model, data are structured into proprietary formats
according to the client's reporting requirements, with the
calculations pre-generated on the cubes.

MOLAP Architecture

o Database server.
o MOLAP server.
o Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP
structure has limited capabilities to dynamically create
aggregations or to evaluate results which have not been pre-
calculated and stored.

Applications requiring iterative and comprehensive time-series


analysis of trends are well suited for MOLAP technology (e.g.,
financial analysis and budgeting).

Advantages

Excellent Performance: A MOLAP cube is built for fast


information retrieval, and is optimal for slicing and dicing
operations.

Can perform complex calculations: All calculations have been
pre-generated when the cube is created. Hence, complex
calculations are not only possible, but they return quickly.

Disadvantages
Limited in the amount of information it can handle: Because
all calculations are performed when the cube is built, it is not
possible to contain a large amount of data in the cube itself.

Requires additional investment: Cube technology is generally


proprietary and does not already exist in the organization.
Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP
into a single architecture. HOLAP systems store the larger
quantities of detailed data in relational tables, while the
aggregations are stored in pre-calculated cubes. HOLAP can also
drill through from the cube down to the relational tables for
detailed data. Microsoft SQL Server 2000, for example, provides
a hybrid OLAP server.
Advantages of HOLAP

1. HOLAP provide benefits of both MOLAP and ROLAP.


2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only
stores the aggregate information on the OLAP server and
the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.

Disadvantages of HOLAP

1. HOLAP architecture is very complicated because it supports


both MOLAP and ROLAP servers.

Data Warehouse Implementation


1. Requirements analysis and capacity planning: The first
process in data warehousing involves defining enterprise needs,
defining architectures, carrying out capacity planning, and
selecting the hardware and software tools. This step will contain
be consulting senior management as well as the different
stakeholder.

2. Hardware integration: Once the hardware and software has


been selected, they require to be put by integrating the servers,
the storage methods, and the user software tools.

3. Modeling: Modelling is a significant stage that involves


designing the warehouse schema and views. This may contain
using a modeling tool if the data warehouses are sophisticated.

4. Physical modeling: For the data warehouses to perform


efficiently, physical modeling is needed. This contains designing
the physical data warehouse organization, data placement, data
partitioning, deciding on access techniques, and indexing.

5. Sources: The information for the data warehouse is likely to


come from several data sources. This step contains identifying
and connecting the sources using the gateway, ODBC drives, or
another wrapper.

6. ETL: The data from the source system will require to go


through an ETL phase. The process of designing and
implementing the ETL phase may contain defining a suitable ETL
tool vendors and purchasing and implementing the tools. This
may contains customize the tool to suit the need of the
enterprises.
7. Populate the data warehouses: Once the ETL tools have
been agreed upon, testing the tools will be needed, perhaps
using a staging area. Once everything is working adequately, the
ETL tools may be used in populating the warehouses given the
schema and view definition.

8. User applications: For the data warehouses to be helpful,


there must be end-user applications. This step involves designing and implementing the applications required by the end users.

Implementation Guidelines

Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with the corporate strategy and business goals. The purpose of the project must be defined before the project begins.

6. Business plan: The financial costs (hardware, software, and


peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditure and benefits can become the only source of information, subverting the project.

7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.

What is Data Cube?


When data is grouped or combined into multidimensional matrices, the result is called a Data Cube. The data cube method has a few alternative
names or a few variants, such as "Multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain


expensive computations that are frequently inquired.

For example, a relation with the schema sales (part, supplier,


customer, and sale-price) can be materialized into a set of eight
views as shown in fig, where psc indicates a view consisting of
aggregate function value (such as total-sales) computed by
grouping three attributes part, supplier, and
customer, p indicates a view composed of the corresponding
aggregate function values calculated by grouping part alone, etc.
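
As an illustration of this view materialization, the following is a minimal Python sketch (using pandas and hypothetical sample data, not taken from the text) that builds all eight group-by views of sales(part, supplier, customer, sale_price):

# A minimal sketch (hypothetical data): materializing the eight group-by
# views of sales(part, supplier, customer, sale_price) with pandas.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "part":       ["p1", "p1", "p2", "p2"],
    "supplier":   ["s1", "s2", "s1", "s2"],
    "customer":   ["c1", "c1", "c2", "c2"],
    "sale_price": [100, 150, 200, 250],
})

dims = ["part", "supplier", "customer"]
views = {}
for k in range(len(dims) + 1):
    for group in combinations(dims, k):
        if group:                       # e.g. ('part',) -> view "p", ('part','supplier','customer') -> "psc"
            view = sales.groupby(list(group), as_index=False)["sale_price"].sum()
        else:                           # the empty grouping: one grand total
            view = sales["sale_price"].sum()
        views["".join(d[0] for d in group) or "all"] = view

print(views["psc"])   # total sales by part, supplier, customer
print(views["p"])     # total sales by part alone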

A data cube is created from a subset of attributes in the database.


Specific attributes are chosen to be measure attributes, i.e., the
attributes whose values are of interest. Other attributes are
selected as dimensions or functional attributes. The measure
attributes are aggregated according to the dimensions.
Example: In the 2-D representation, we will look at the All
Electronics sales data for items sold per quarter in the city of
Vancouver. The measure displayed is dollars sold (in thousands).
3-Dimensional Cuboids

Let us suppose we would like to view the sales data with a third
dimension. For example, suppose we would like to view the data
according to time, item as well as the location for the cities
Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown
in the table. The 3-D data of the table are represented as a series
of 2-D tables.
Let us suppose that we would like to view our sales data with an
additional fourth dimension, such as a supplier.

In data warehousing, the data cubes are n-dimensional. The


cuboid which holds the lowest level of summarization is called
a base cuboid.

Indexing OLAP data:


Requirements on an indexing method
1. Symmetric partial match queries:

• Most of the OLAP queries can be expressed conceptually as a


partial range query where associated with one or more
dimensions of the cube is a union of range of values and we
need to efficiently retrieve data corresponding to this range.
• The extreme case is where the size of the range is one for all
dimensions giving us a point query. The range could be
continuous, for instance, time between Jan '94 to July '94" or
discontinuous, for instance, first month of every year" and
product IN (soap, shirts, shoes).

2. Indexing at multiple levels of aggregation:


• Most OLAP databases pre-compute multiple groupbys
corresponding to different levels of aggregations of the base
cube. For instance, groupbys could be computed at the
<product-store> level, <product-time> and <time> level for a
base cube with dimensions product, store and time. It is
equally important to index the summarized data.
• An issue that arises here is whether to build separate index trees for different levels of aggregation or whether to add special values to
the dimension and index precomputed summaries along with
the base level data. For instance, if we index sales for
<product-year>, then we can store total-sales at the
<product> level by simply adding an additional value for the
year dimension corresponding to "ALL" years and storing
total-sales for each product there (a brief sketch of this idea follows this list).
3. Multiple traversal orders:
• B-trees are commonly used in OLTP systems to retrieve data
sorted on the indexed attribute. This is often a cheaper
alternative to doing external sorts. OLAP databases, because
of the large number of group-bys that they perform, can also
benefit from using indices to sort data fast. The challenge is in
allowing such a feature over multiple attributes instead of a
single one and also for any permutation of any subset of the
attributes.

4. Efficient batch update:

• OLAP databases have the advantage that frequent point


updates as in OLTP data is uncommon. However, the update
problem cannot be totally ignored. Making batch updates
efficient is absolutely necessary. It is not uncommon for
multinational organizations to update data as high as four
times a day since daily data from different locations of the world appear at different times. On the
other hand, these updates are clustered by region and time.
This property can be exploited in localizing changes and
making updates faster.
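
The idea of storing precomputed summaries by adding an "ALL" member to a dimension, as described in point 2 above, can be sketched with pandas; the products, years, and figures below are hypothetical:

# Sketch (hypothetical data): keeping pre-aggregated totals alongside base
# data by adding an "ALL" member to the year dimension.
import pandas as pd

sales = pd.DataFrame({
    "product": ["soap", "soap", "shirts", "shirts"],
    "year":    [1993, 1994, 1993, 1994],
    "sales":   [10, 12, 30, 28],
})

# margins=True adds an "ALL" row/column holding the precomputed totals,
# so <product, "ALL"> stores total sales per product across all years.
cube = pd.pivot_table(sales, values="sales", index="product",
                      columns="year", aggfunc="sum",
                      margins=True, margins_name="ALL")
print(cube)
print(cube.loc["soap", "ALL"])   # total sales for soap over all years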

UNIT-3

What is Data Mining?

The process of extracting information from huge sets of data to identify patterns, trends, and useful insights that allow a business to make data-driven decisions is called Data Mining.

In other words, Data Mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information. This information is collected and assembled in particular areas such as data warehouses and analyzed with efficient data mining algorithms to support decision-making and other data requirements, eventually cutting costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of
future events. Data Mining is also called Knowledge Discovery of
Data (KDD).

Data Mining is a process used by organizations to extract specific


data from huge databases to solve business problems. It
primarily turns raw data into useful information.

Types of Data Mining


Relational Database:

A relational database is a collection of multiple data sets formally


organized by tables, records, and columns from which data can
be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which
facilitates data searchability, reporting, and organization.

Data warehouses:

A Data Warehouse is the technology that collects the data from


various sources within the organization to provide meaningful
business insights. The huge amount of data comes from multiple
places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision-making for
a business organization. The data warehouse is designed for the
analysis of data rather than transaction processing.

Data Repositories:

The Data Repository generally refers to a destination for data


storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure.
For example, a group of databases, where an organization has
kept various kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and


relational database model is called an object-relational model. It
supports Classes, Objects, Inheritance, etc.
Advantages of Data Mining

o The Data Mining technique enables organizations to obtain


knowledge-based data.
o Data mining enables organizations to make lucrative
modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an
organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.

Disadvantages of Data Mining

o There is a probability that the organizations may sell useful


data of customers to other organizations for money. As per
the report, American Express has sold credit card purchases
of their customers to other organizations.
o Many data mining analytics tools are difficult to operate and need advanced training to work with.
o Different data mining instruments operate in distinct ways
due to the different algorithms used in their design.
Therefore, the selection of the right data mining tools is a
very challenging task.
o Data mining techniques are not completely precise, so they may lead to severe consequences in certain conditions.
Data Processing in Data Mining

Data processing is collecting raw data and translating it into usable information. The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format. It is usually performed in a step-by-step process by a team of data scientists and data engineers in an organization.

Data processing is carried out automatically or manually. Nowadays, most data is processed automatically with the help of computers, which is faster and gives accurate results. The data can be converted into different forms, graphic as well as audio, depending on the software used and the data processing methods.

After that, the collected data is processed and translated into a desirable form as per requirements, useful for performing tasks. The data is acquired from Excel files, databases, text files, and unorganized data such as audio clips, images, GPRS, and video clips.

Stages of Data Processing

1. Data Collection

The collection of raw data is the first step of the data processing
cycle. The raw data collected has a huge impact on the output
produced. Hence, raw data should be gathered from defined and
accurate sources so that the subsequent findings are valid and usable.
Raw data can include monetary figures, website cookies, profit/loss
statements of a company, user behavior, etc.

2. Data Preparation

Data preparation or data cleaning is the process of sorting and


filtering the raw data to remove unnecessary and inaccurate
data. Raw data is checked for errors, duplication,
miscalculations, or missing data and transformed into a suitable
form for further analysis and processing. This ensures that only
the highest quality data is fed into the processing unit.

3. Data Input

In this step, the raw data is converted into machine-readable


form and fed into the processing unit. This can be in the form of
data entry through a keyboard, scanner, or any other input
source.

4. Data Storage

The last step of the data processing cycle is storage, where data
and metadata are stored for further use. This allows quick access
and retrieval of information whenever needed. Effective data storage is also necessary for compliance with data protection legislation such as GDPR.

Types of Data Processing


There are different types of data processing based on the source
of data and the steps taken by the processing unit to generate
an output. There is no one size fits all method that can be used
for processing raw data.

1. Batch Processing: In this type of data processing, data is


collected and processed in batches. It is used for large
amounts of data. For example, the payroll system.
2. Single User Programming Processing: It is usually done
by a single person for his personal use. This technique is
suitable even for small offices.
3. Multiple Programming Processing: This technique allows
simultaneously storing and executing more than one
program in the Central Processing Unit (CPU). Data is
broken down into frames and processed using two or more
CPUs within a single computer system. It is also known as
parallel processing. Further, the multiple programming
techniques increase the respective computer's overall
working efficiency. A good example of multiple
programming processing is weather forecasting.

4. Time-sharing Processing: This is another form of online data


processing that facilitates several users to share the resources
of an online computer system. This technique is adopted when
results are needed swiftly. Moreover, as the name suggests, this
system is time-based. Following are some of the major
advantages of time-sharing processing, such as:
o Several users can be served simultaneously.
o All the users have an almost equal amount of processing
time.
o There is a possibility of interaction with the running program.

Data Cleaning in Data Mining

Data cleaning is a crucial process in Data Mining. It plays an important part in the building of a model. Data cleaning is an essential step, but it is often neglected. Data quality is the main issue in quality information management. Data quality problems can occur anywhere in information systems, and these problems are addressed by data cleaning.

Data cleaning is fixing or removing incorrect, corrupted,


incorrectly formatted, duplicate, or incomplete data within a
dataset. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. When combining
multiple data sources, there are many opportunities for data to
be duplicated or mislabeled.

Methods of Data Cleaning


1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has several attributes with missing values.
2. Fill the missing value: This approach is also not very
effective or feasible. Moreover, it can be a time-consuming
method. In the approach, one has to fill in the missing value.
This is usually done manually, but it can also be done by
attribute mean or using the most probable value.
3. Binning method: This approach is very simple to understand. The smoothing of sorted data is done using the values around it. The data is divided into several segments of equal size, and then different methods are applied to smooth each segment (a brief sketch follows this list).
4. Regression: The data is smoothed with the help of a regression function. The regression can be linear
or multiple. Linear regression has only one independent
variable, and multiple regressions have more than one
independent variable.
5. Clustering: This method mainly operates on the group.
Clustering groups the data in a cluster. Then, the outliers
are detected with the help of clustering. Next, the similar
values are then arranged into a "group" or a "cluster".
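
Below is a brief Python sketch of the binning method, using hypothetical values: the sorted data is split into equal-size bins and each value is replaced by its bin mean (smoothing by bin means is only one of the possible smoothing choices).

# Sketch (hypothetical values): smoothing noisy data by equal-size bins,
# replacing every value in a bin with the bin mean.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 2)] * len(bin_values))

print(smoothed)   # e.g. [9.0, 9.0, 9.0, 9.0, 22.75, ...]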
Usage of Data Cleaning in Data Mining

Here are the following usages of data cleaning in data mining,


such as:

o Data Integration: Since it is difficult to ensure quality in


low-quality data, data integration has an important role in
solving this problem. Data Integration is the process of
combining data from different data sets into a single one.
This process uses data cleansing tools to ensure that the
embedded data set is standardized and formatted before
moving to the final destination.
o Data Migration: Data migration is the process of moving data from one system to another, one format to another, or one application to another. While the data is on the move, it is important to maintain its quality, security, and consistency, to ensure that the resultant data has the correct format and structure without any discrepancies at the destination.
o Data Transformation: Before the data is uploaded to a
destination, it needs to be transformed. This is only possible
through data cleaning, which considers the system criteria
of formatting, structuring, etc. Data transformation
processes usually include using rules and filters before
further analysis. Data transformation is an integral part of
most data integration and data management processes.
Data cleansing tools help to clean the data using the built-
in transformations of the systems.

Data Debugging in ETL Processes: Data cleansing is crucial to


preparing data during extract, transform, and load (ETL) for
reporting and analysis. Data cleansing ensures that only high-
quality data is used for decision-making and analysis.

What is Data Integration?

It has been an integral part of data operations because data can


be obtained from several sources. It is a strategy that integrates
data from several sources to make it available to users in a single
uniform view. The sources being integrated can include multiple databases, data cubes, or flat files. Data fusion merges data from various
diverse sources to produce meaningful results. The consolidated
findings must exclude inconsistencies, contradictions,
redundancies, and inequities.

Data integration is important because it gives a uniform view of


scattered data while also maintaining data accuracy. It assists the data mining program in mining meaningful information, which in turn helps executives and managers make strategic decisions for the enterprise's benefit.

Data Integration Approaches

There are mainly two types of approaches for data integration.


These are as follows:

Tight Coupling

It is the process of using ETL (Extraction, Transformation, and


Loading) to combine data from various sources into a single
physical location.
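
A minimal pandas sketch of the tight-coupling idea follows; the source tables and column names are hypothetical. Data from two sources is transformed to a uniform schema and then loaded into one combined table.

# Sketch (hypothetical sources/columns): a tiny ETL-style tight coupling,
# mapping two differently named source schemas onto one uniform view.
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ann", "Bob"]})
billing = pd.DataFrame({"customer": [1, 2], "amount_due": [120.0, 80.5]})

# Transform: rename columns to a single agreed-upon schema.
crm_u = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
billing_u = billing.rename(columns={"customer": "customer_id", "amount_due": "balance"})

# Load: combine into one physical table (the integrated view).
warehouse = crm_u.merge(billing_u, on="customer_id", how="outer")
print(warehouse)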

Loose Coupling

With loose coupling, the data is kept in the actual source databases. This approach provides an interface that takes a query from the user, transforms it into a format that the source databases can understand, and then sends the query directly to the source databases to obtain the result.

Integration tools

There are various integration tools in data mining. Some of them


are as follows:

On-premise data integration tool

An on-premise data integration tool integrates data from local


sources and connects legacy databases using middleware
software.

Open-source data integration tool

If you want to avoid pricey enterprise solutions, an open-source


data integration tool is the ideal alternative. However, you will be responsible for the security and privacy of the data if you're using the tool.

Cloud-based data integration tool

A cloud-based data integration tool may provide


an 'integration platform as a service'.

Data Transformation in Data Mining

Raw data is difficult to trace or understand. That's why it needs


to be preprocessed before retrieving any information from it.
Data transformation is a technique used to convert the raw data
into a suitable format that efficiently eases data mining and
retrieves strategic information. Data transformation includes
data cleaning techniques and a data reduction technique to
convert the data into the appropriate form.

Data transformation is an essential data preprocessing technique


that must be performed on the data before data mining to
provide patterns that are easier to understand.

Data Transformation Techniques

There are several data transformation techniques that can help


structure and clean up the data before analysis or storage in a
data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data
reduction and data cleaning.
Data Smoothing

Data smoothing is a process that is used to remove noise from


the dataset using some algorithms. It allows for highlighting
important features present in the dataset. It helps in predicting
the patterns. When collecting data, it can be manipulated to
eliminate or reduce any variance or any other noise form.

The concept behind data smoothing is that it will be able to


identify simple changes to help predict different trends and
patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for
finding patterns that they wouldn't see otherwise.

o Binning: This method splits the sorted data into the


number of bins and smoothens the data values in each bin
considering the neighborhood values around it.
o Regression: This method identifies the relation among two
dependent attributes so that if we have one attribute, it can
be used to predict the other attribute.
o Clustering: This method groups similar data values to form clusters. The values that lie outside a cluster are
known as outliers.
Attribute Construction

In the attribute construction method, new attributes are constructed from the existing attributes to produce a data set that eases data mining. New attributes are created from the given attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
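
A short pandas sketch of attribute construction, using hypothetical attributes: a new attribute "area" is derived from the existing "height" and "width" attributes.

# Sketch (hypothetical attributes): constructing a new attribute "area"
# from the existing attributes "height" and "width".
import pandas as pd

rooms = pd.DataFrame({"height": [3.0, 2.8], "width": [4.0, 5.5]})
rooms["area"] = rooms["height"] * rooms["width"]   # constructed attribute
print(rooms)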

Data Aggregation

Data collection or aggregation is the method of storing and


presenting data in a summary format. The data may be obtained
from multiple data sources to integrate these data sources into
a data analysis description. This is a crucial step since the
accuracy of data analysis insights is highly dependent on the
quantity and quality of the data used.

Data Generalization

It converts low-level data attributes to high-level data attributes


using concept hierarchy. This conversion from a lower level to a
higher conceptual level is useful to get a clearer picture of the
data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
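
A minimal sketch of data generalization follows, assuming a hypothetical city-to-country concept hierarchy: low-level values are replaced by higher-level concepts and identical tuples are merged with their counts accumulated.

# Sketch (hypothetical concept hierarchy): generalizing city -> country and
# merging identical tuples while accumulating their counts.
import pandas as pd

customers = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Chicago", "New York", "Toronto"],
})
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

customers["country"] = customers["city"].map(city_to_country)  # climb hierarchy
generalized = customers.groupby("country").size().reset_index(name="count")
print(generalized)   # Canada: 3, USA: 2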
Advantages of Data Transformation
o Better Organization: Transformed data is easier for both
humans and computers to use.
o Improved Data Quality: There are many risks and costs
associated with bad data. Data transformation can help
your organization eliminate quality issues such as missing
values and other inconsistencies.
o Perform Faster Queries: You can quickly and easily retrieve transformed data thanks to it being stored and standardized in a source location.
o Better Data Management: Businesses are constantly
generating data from more and more sources. If there are
inconsistencies in the metadata, it can be challenging to
organize and understand it. Data transformation refines
your metadata, so it's easier to organize and understand.

Disadvantages of Data Transformation

While data transformation comes with a lot of benefits, still there


are some challenges to transforming data effectively, such as:

o Data transformation can be expensive. The cost is


dependent on the specific infrastructure, software, and
tools used to process data. Expenses may include licensing,
computing resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive.
Performing transformations in an on-premises data
warehouse after loading or transforming data before
feeding it into applications can create a computational
burden that slows down other operations. If you use a
cloud-based data warehouse, you can do the
transformations after loading because the platform can
scale up to meet demand.

Data Reduction in Data Mining

INTRODUCTION:

Data reduction is a technique used in data mining to reduce


the size of a dataset while still preserving the most important
information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the
dataset contains a large amount of irrelevant or redundant
information.

There are several different data reduction techniques that


can be used in data mining, including:

1. Data Sampling: This technique involves selecting a


subset of the data to work with, rather than using the
entire dataset. This can be useful for reducing the size
of a dataset while still preserving the overall trends
and patterns in the data.
2. Dimensionality Reduction: This technique involves
reducing the number of features in the dataset, either
by removing features that are not relevant or by
combining multiple features into a single feature.
3. Data Compression: This technique involves using
techniques such as lossy or lossless compression to
reduce the size of a dataset.
4. Data Discretization: This technique involves
converting continuous data into discrete data by
partitioning the range of possible values into intervals
or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that are most relevant to the task at hand (a brief sketch of two of these techniques follows this list).
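
The following is a brief sketch, on hypothetical data, of two of the techniques above: data sampling and data discretization (binning a numeric attribute into intervals).

# Sketch (hypothetical data): two of the reduction techniques above --
# data sampling and data discretization (binning ages into intervals).
import pandas as pd

df = pd.DataFrame({"age": [23, 31, 37, 45, 52, 58, 64, 70]})

sample = df.sample(frac=0.5, random_state=0)          # data sampling
df["age_group"] = pd.cut(df["age"],                    # data discretization
                         bins=[20, 40, 60, 80],
                         labels=["young", "middle", "senior"])
print(sample)
print(df)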
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. For example, imagine that the information you gathered for your analysis for the years 2012 to 2014 includes the revenue of your company every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that the resulting data reports total sales per year instead of per quarter.
2. Dimension reduction:
Whenever we come across data that is weakly important, we keep only the attributes required for our analysis. This reduces data size as it eliminates outdated or redundant features.
3. Data Compression:
The data compression technique reduces the size of the files
using different encoding mechanisms (Huffman Encoding &
run-length Encoding). We can divide it into two types based
on their compression techniques.
• Lossless Compression –

Encoding techniques (such as Run Length Encoding) allow a simple and modest data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data (see the sketch after this list).
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with mathematical models or smaller representations of the data. For parametric methods, only the model parameters need to be stored; non-parametric methods include clustering, histograms, and sampling.
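
A tiny sketch of lossless compression using run-length encoding, as mentioned in the data compression method above; the encoded pairs can be decoded back into the exact original sequence.

# Sketch: run-length encoding, a simple lossless compression scheme --
# the original sequence can be restored exactly from the encoded pairs.
def rle_encode(values):
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([v, 1])       # start a new run
    return encoded

def rle_decode(encoded):
    return [v for v, count in encoded for _ in range(count)]

data = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
packed = rle_encode(data)                # [['A', 3], ['B', 2], ['C', 4]]
assert rle_decode(packed) == data        # lossless: exact original restored
print(packed)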

Advantages:

1. Improved efficiency: Data reduction can help to


improve the efficiency of machine learning algorithms
by reducing the size of the dataset. This can make it
faster and more practical to work with large datasets.
2. Improved performance: Data reduction can help to
improve the performance of machine learning
algorithms by removing irrelevant or redundant
information from the dataset. This can help to make
the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to
reduce the storage costs associated with large
datasets by reducing the size of the data.

Disadvantages:

1. Loss of information: Data reduction can result in a loss


of information, if important data is removed during the
reduction process.
2. Impact on accuracy: Data reduction can impact the
accuracy of a model, as reducing the size of the
dataset can also remove important information that is
needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it
harder to interpret the results, as removing irrelevant
or redundant information can also remove context that
is needed to understand the results.
Data Mining Task Primitives

A data mining task can be specified in the form of a data mining


query, which is input to the data mining system. A data mining
query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the
data mining system during discovery to direct the mining
process or examine the findings from different angles or depths.
The data mining primitives specify the following,

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern
evaluation.
5. Representation for visualizing the discovered patterns.

List of Data Mining Task Primitives

The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in


which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (the relevant attributes
or dimensions).

In a relational database, the set of task-relevant data can be


collected via a relational query involving operations like
selection, projection, join, and aggregation.

The kind of knowledge to be mined


This specifies the data mining functions to be performed, such as
characterization, discrimination, association or correlation
analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.

The background knowledge to be used in the discovery


process

This knowledge about the domain to be mined is useful for


guiding the knowledge discovery process and evaluating the
patterns found. Concept hierarchies are a popular form of
background knowledge, which allows data to be mined at
multiple levels of abstraction.

Concept hierarchy defines a sequence of mappings from low-


level concepts to higher-level, more general concepts.

The interestingness measures and thresholds for pattern


evaluation

Different kinds of knowledge may have different interestingness measures. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. For example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are
below user-specified thresholds are considered uninteresting.

The expected representation for visualizing the discovered


patterns

This refers to the form in which discovered patterns are to be


displayed, which may include rules, tables, cross tabs, charts,
graphs, decision trees, cubes, or other visual representations.
Users must be able to specify the forms of presentation to be
used for displaying the discovered patterns. Some
representation forms may be better suited than others for
particular kinds of knowledge.

KDD Process in Data Mining


In the context of computer science, “Data Mining” can be
referred to as knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology, and data
dredging. Data Mining also known as Knowledge Discovery in
Databases, refers to the nontrivial extraction of implicit,
previously unknown and potentially useful information from
data stored in databases.
The need of data mining is to extract useful information from
large datasets and use it to make predictions or better
decision-making. Nowadays, data mining is used in almost all
places where a large amount of data is stored and processed.
For examples: Banking sector, Market Basket Analysis,
Network Intrusion Detection.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that
involves the extraction of useful, previously unknown, and
potentially valuable information from large datasets. The KDD
process is an iterative one and may require multiple iterations of its steps to extract accurate knowledge from the data. The following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as removal of noisy and irrelevant data
from collection.
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or
variance error.
3. Cleaning with Data discrepancy detection and Data
transformation tools.
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common source (a data warehouse). Data integration is performed using data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
Data selection is defined as the process where data relevant to
the analysis is decided and retrieved from the data
collection. For this, we can use neural networks, decision trees, Naive Bayes, clustering, and regression methods.
Data Transformation
Data Transformation is defined as the process of transforming
data into appropriate form required by mining procedure. Data
Transformation is a two step process:
1. Data Mapping: Assigning elements from the source to the destination to capture transformations.
2. Code generation: Creation of the actual
transformation program.
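
The cleaning, integration, selection, and transformation steps above can be sketched on toy data with pandas; the tables, columns, and code-to-segment mapping below are hypothetical.

# Sketch (hypothetical tables/columns): the KDD steps above on toy data --
# cleaning, integration, selection, and a simple mapping transformation.
import pandas as pd

orders = pd.DataFrame({"cust": ["c1", "c2", "c2", None],
                       "amount": [50.0, 20.0, None, 10.0]})
customers = pd.DataFrame({"cust": ["c1", "c2"], "segment_code": ["A", "B"]})

cleaned = orders.dropna()                               # data cleaning
integrated = cleaned.merge(customers, on="cust")        # data integration
selected = integrated[["segment_code", "amount"]]       # data selection
transformed = selected.assign(                          # data transformation
    segment=selected["segment_code"].map({"A": "retail", "B": "wholesale"})
)
print(transformed)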
Advantages of KDD
1. Improves decision-making: KDD provides valuable
insights and knowledge that can help organizations
make better decisions.
2. Increased efficiency: KDD automates repetitive and
time-consuming tasks and makes the data ready for
analysis, which saves time and money.
3. Better customer service: KDD helps organizations
gain a better understanding of their customers’ needs
and preferences, which can help them provide better
customer service.
4. Fraud detection: KDD can be used to detect
fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
5. Predictive modeling: KDD can be used to build
predictive models that can forecast future trends and
patterns

Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it
involves collecting and analyzing large amounts of
data, which can include sensitive information about
individuals.
2. Complexity: KDD can be a complex process that
requires specialized skills and knowledge to implement
and interpret the results.
3. Unintended consequences: KDD can lead to
unintended consequences, such as bias or
discrimination, if the data or models are not properly
understood or used.
4. Data Quality: KDD process heavily depends on the
quality of data, if data is not accurate or consistent, the
results can be misleading
5. High cost: KDD can be an expensive process,
requiring significant investments in hardware,
software, and personnel.
Data Mining architecture
Data Mining refers to the detection and extraction of new
patterns from the already collected data. Data mining is the
amalgamation of the field of statistics and computer science
aiming to discover patterns in incredibly large datasets and
then transform them into a comprehensible structure for later
use.

The main components of the data mining architecture are described below:
1. Data Sources: Database, World Wide
Web(WWW), and data warehouse are parts of data
sources. The data in these sources may be in the form
of plain text, spreadsheets, or other forms of media
like photos or videos. WWW is one of the biggest
sources of data.
2. Database Server: The database server contains the
actual data ready to be processed. It performs the task
of handling data retrieval as per the request of the
user.
3. Data Mining Engine: It is one of the core components
of the data mining architecture that performs all kinds
of data mining techniques like association,
classification, characterization, clustering, prediction,
etc.
4. Pattern Evaluation Modules: They are responsible for
finding interesting patterns in the data and sometimes
they also interact with the database servers for
producing the result of the user requests.
5. Graphic User Interface: Since the user cannot fully
understand the complexity of the data mining process so
graphical user interface helps the user to communicate
effectively with the data mining system.
6. Knowledge Base: Knowledge Base is an important part
of the data mining engine that is quite beneficial in
guiding the search for the result patterns. Data mining
engines may also sometimes get inputs from the
knowledge base. This knowledge base may contain data
from user experiences. The objective of the knowledge
base is to make the result more accurate and reliable.
Types of Data Mining architecture:
1. No Coupling: The no coupling data mining
architecture retrieves data from particular data
sources. It does not use the database for retrieving the
data which is otherwise quite an efficient and accurate
way to do the same. The no coupling architecture for
data mining is poor and only used for performing very
simple data mining processes.
2. Loose Coupling: In a loose coupling architecture, the data mining system retrieves data from a database and stores the results back in those systems. This architecture is mainly used for memory-based data mining.
3. Semi-Tight Coupling: It tends to use various
advantageous features of the data warehouse
systems. It includes sorting, indexing, and
aggregation. In this architecture, an intermediate
result can be stored in the database for better
performance.
4. Tight coupling: In this architecture, a data warehouse
is considered one of its most important components
whose features are employed for performing data
mining tasks. This architecture provides scalability,
performance, and integrated information
Advantages of Data Mining:
• Assists in preventing future adversities by accurately predicting future trends.


• Contributes to the making of important decisions.

• Compresses data into valuable information.

• Provides new trends and unexpected patterns.

• Helps to analyze huge data sets.

• Aids companies to find, attract and retain customers.

• Helps the company to improve its relationship with

the customers.
• Assists Companies to optimize their production

according to the likability of a certain product thus


saving costs to the company.
Disadvantages of Data Mining:
• Excessive work intensity requires high-performance

teams and staff training.


• The requirement of large investments can also be

considered a problem as sometimes data collection


consumes many resources that suppose a high cost.
• Lack of security could also put the data at huge risk, as

the data may contain private customer details.


• Inaccurate data may lead to the wrong output.

• Huge databases are quite difficult to manage.

Basic approaches for Data generalization (DWDM)


Data Generalization is the process of summarizing data by
replacing relatively low level values with higher level
concepts. It is a form of descriptive data mining.
There are two basic approaches of data generalization :
1. Data cube approach :
• It is also known as OLAP approach.

• It is an efficient approach as it is helpful to make the

past selling graph.


• In this approach, computation and results are stored in

the Data cube.


• It uses Roll-up and Drill-down operations on a data

cube.
• These operations typically involve aggregate

functions, such as count(), sum(), average(), and max().


• These materialized views can then be used for

decision support, knowledge discovery, and many


other applications.

2. Attribute oriented induction :


• It is an online data analysis, query oriented and
generalization based approach.
• In this approach, we perform generalization on the basis of the different values of each attribute within the relevant data set. After that, identical tuples are merged and their respective counts are accumulated in order to perform aggregation.
• It performs off-line aggregation before an OLAP or
data mining query is submitted for processing.
• The attribute oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, online data analysis technique.
• It is not limited to particular measures nor categorical
data.
• The attribute oriented induction approach uses two methods:
(i). Attribute removal.
(ii). Attribute generalization.
Analytical characterization
Analytical characterization is used to help identify weakly relevant or irrelevant attributes. We can exclude these unwanted attributes when preparing our data for mining.
Why Analytical Characterization?
Analytical Characterization is a very important activity
in data mining due to the following reasons;
Due to the limitations of OLAP tools in handling complex objects.
Due to the lack of automated generalization, we must explicitly tell the system which attributes are irrelevant and must be removed, and similarly, which attributes are relevant and must be included in the class characterization.
Class Comparison Methods in Data Mining

In many applications, users may not be interested in having a


single class or concept described or characterized but rather
would prefer to mine a description comparing or distinguishing
one class (or concept) from other comparable classes (or
concepts). Class discrimination or comparison (hereafter referred
to as class comparison) mines descriptions that distinguish a
target class from its contrasting classes. Notice that the target
and contrasting classes must be comparable because they share
similar dimensions and attributes. For example, the three classes,
person, address, and item, are not comparable.

The previous sections' discussions on class characterization


handle multilevel data summarization and characterization in a
single class. For example, the sales in the last three years are
comparable classes, and so are computer science students versus
physics students. The techniques developed can be extended to
handle class comparison across several comparable classes.

Class Comparison Methods and Implementation


1. Data Collection: The set of relevant data in the database
and data warehouse is collected by query Processing and
partitioned into a target class and one or a set of
contrasting classes.
2. Dimension relevance analysis: If there are many
dimensions and analytical comparisons are desired, then
dimension relevance analysis should be performed. Only
the highly relevant dimensions are included in the further
analysis.
3. Synchronous Generalization: The process of
generalization is performed upon the target class to the
level controlled by the user or expert specified dimension
threshold, which results in a prime target class relation or
cuboid. The concepts in the contrasting class or classes are
generalized to the same level as those in the prime target
class relation or cuboid, forming the prime contrasting class
relation or cuboid.

4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, charts, and rules. This presentation usually includes a "contrasting" measure (such as count%) that reflects the comparison between the target and contrasting classes. As desired, the user can adjust the comparison description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes.

For example, a task specification for such a comparison might include:
o attributes = name, gender, program, birth_place, birth_date, residence, phone_no, and GPA
o Gen(ai) = concept hierarchies on attributes ai
o Ui = attribute analytical thresholds for attributes ai
o Ti = attribute generalization thresholds for attributes ai
o R = attribute relevance threshold

Statistical Methods in Data Mining

Data mining refers to extracting or mining knowledge from


large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.
Theoreticians and practitioners are continually seeking
improved techniques to make the process more efficient, cost-
effective, and accurate. Any situation can be analyzed in two
ways in data mining:
• Statistical Analysis: In statistics, data is collected,

analyzed, explored, and presented to identify patterns


and trends. Alternatively, it is referred to as
quantitative analysis.
• Non-statistical Analysis: This analysis provides
generalized information and includes sound, still
images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main characteristics of that data. Graphs or numbers summarize the data. Average, Mode, SD (Standard Deviation), and Correlation are some of the commonly used descriptive statistical measures (illustrated in the sketch after this list).
• Inferential Statistics: The process of drawing
conclusions based on probability theory and
generalizing the data. By analyzing sample statistics,
you can infer parameters about populations and make
models of relationships within data.
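
A small pandas sketch, on hypothetical values, of the descriptive measures named above (average, mode, standard deviation, and correlation):

# Sketch (hypothetical values): the descriptive statistics mentioned above.
import pandas as pd

df = pd.DataFrame({"hours_studied": [2, 4, 4, 6, 8],
                   "exam_score":    [55, 60, 62, 70, 85]})

print(df["exam_score"].mean())                    # average
print(df["hours_studied"].mode().iloc[0])         # mode (4)
print(df["exam_score"].std())                     # standard deviation
print(df["hours_studied"].corr(df["exam_score"])) # correlation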
There are various statistical terms that one should be aware
of while dealing with statistics. Some of these are:
• Population
• Sample

• Variable

• Quantitative Variable

• Qualitative Variable

Data Mining - Query Language


The Data Mining Query Language (DMQL) was proposed
by Han, Fu, Wang, et al. for the DBMiner data mining
system. The Data Mining Query Language is actually
based on the Structured Query Language (SQL). Data
Mining Query Languages can be designed to support ad
hoc and interactive data mining. This DMQL provides
commands for specifying primitives. The DMQL can work
with databases and data warehouses as well. DMQL can
be used to define data mining tasks. Particularly we
examine how to define data warehouses and data marts
in DMQL.

Syntax for Specifying the Kind of Knowledge

Here we will discuss the syntax for Characterization,


Discrimination, Association, Classification, and
Prediction.

Characterization

The syntax for characterization is −

mine characteristics [as pattern_name]


analyze {measure(s) }

Association

The syntax for Association is−

mine associations [ as {pattern_name} ]


{matching {metapattern} }

Classification

The syntax for Classification is −

mine classification [as pattern_name]


analyze classifying_attribute_or_dimension

Prediction

The syntax for prediction is −

mine prediction [as pattern_name]


analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
UNIT-4
Association Rule

Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Support Count (σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – An itemset whose support is greater than
or equal to minsup threshold.
Association Rule – An implication expression of the form X ->
Y, where X and Y are any 2 itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –

• Support(s) –
The number of transactions that include items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together as a percentage of all transactions.
• Support(X => Y) = σ(X ∪ Y) / (total number of transactions) –
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
• Conf(X => Y) = Supp(X ∪ Y) / Supp(X) –
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the frequency (support) of {Y}.
• Lift(X => Y) = Conf(X => Y) / Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association (see the sketch after this list).
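
The metrics above can be computed directly from the transaction table shown earlier; the following sketch evaluates the rule {Milk, Diaper} -> {Beer}.

# Sketch: computing support, confidence and lift for the rule
# {Milk, Diaper} -> {Beer} over the transaction table shown above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)                 # 2/5 = 0.4
conf = support(X | Y) / support(X)    # 0.4 / 0.6 = 0.67
lift = conf / support(Y)              # 0.67 / 0.6 = 1.11
print(supp, conf, lift)

With this data the rule has support 0.4, confidence of about 0.67, and lift of about 1.11, so the itemsets appear together slightly more often than expected under independence.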
Multilevel Association Rule in data mining
Multilevel Association Rule :
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.
Using uniform minimum support for all levels :
• When a uniform minimum support threshold is used, the search procedure is simplified.
• The method is also simple, in that users are required to specify only a single minimum support threshold.
• The same minimum support threshold is used when mining at each level of abstraction (for example, for mining from "computer" down to "laptop computer"). With a single threshold, both "computer" and "laptop computer" may be found frequent, while a rarer item such as "desktop computer" is not.
Need for Multilevel Association Rules :
• Sometimes at the low data level, data does not show
any significant pattern but there is useful information
hiding behind it.
• The aim is to find the hidden information in or

between levels of abstraction.


Approaches to multilevel association rule mining :
1. Uniform Support(Using uniform minimum support for
all level)
2. Reduced Support (Using reduced minimum support at
lower levels)
3. Group-based Support(Using item or group based
support)
4. Uniform Support –
When a uniform minimum support threshold is used, the search methodology is simplified. The technique is also simple in that users need to specify only a single minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. This provides the motivation for the following approach.
5. Reduced Support –
For mining multilevel associations with reduced support, there are several alternative search strategies, as follows.
6. Level-by-level independence –
This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether its parent node is found to be frequent.
7. Level-cross filtering by single item –
An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent. In other words, we investigate a more specific association from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.
8. Level-cross filtering by k-itemset –
A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.

Group-based support –
The group-wise threshold values for support and confidence are input by the user or an expert. The groups are selected based on product price or item sets, because the expert often has insight as to which groups are more important than others.

Data Mining Multidimensional Association Rule


In this article, we are going to discuss Multidimensional
Association Rule. Also, we will discuss examples of each.
Let’s discuss one by one.
Multidimensional Association Rules :
In multidimensional association rules, attributes can be categorical or quantitative.
• Quantitative attributes are numeric and incorporate an ordering.
• Numeric attributes should be discretized.
• A multidimensional association rule consists of more than one dimension.
• Example – buys(X, "IBM Laptop Computer") => buys(X, "HP Inkjet Printer")
Using static discretization of quantitative attributes :
• Discretization is static and occurs prior to mining.
• Discretized attributes are treated as categorical.
• Use the Apriori algorithm to find all k-frequent predicate sets (this requires k or k+1 table scans). Every subset of a frequent predicate set must also be frequent.
Example –
If, in a data cube, the 3-D cuboid (age, income, buys) is frequent, this implies that (age, income), (age, buys), and (income, buys) are also frequent.
Note –
Data cubes are well suited for mining since they make mining faster. The cells of an n-dimensional data cuboid correspond to the predicate sets.
2. Using dynamic discretization of quantitative attributes :
• Known as mining quantitative association rules.
• Numeric attributes are dynamically discretized.
Example –
1. age(X, "20..25") ∧ income(X, "30K..41K") => buys(X, "Laptop Computer")
2. Grid for tuples :
Using distance-based discretization with clustering –
This is a dynamic discretization process that considers the distance between data points. It involves a two-step mining process, as follows.
• Perform clustering to find the intervals involved.
• Obtain association rules by searching for groups of clusters that occur together.
The resulting rules may satisfy –
• Clusters in the rule antecedent are strongly associated with clusters in the consequent.
• Clusters in the antecedent occur together.
• Clusters in the consequent occur together.

What is Constraint-Based Association Mining?

A data mining procedure can uncover thousands of rules from a given set of information, most of which end up being unrelated or uninteresting to the users. Users often have a good sense of which "direction" of mining can lead to interesting patterns and of the "form" of the patterns or rules they would like to discover.

Therefore, a good heuristic is to have the users define such intuitions or expectations as constraints to restrict the search space. This strategy is called constraint-based mining.

Constraint-based algorithms use constraints to decrease the search space in the frequent itemset generation step (the association rule generation step is the same as that of exhaustive algorithms).

The most general constraint is the minimum support threshold. Including a constraint in the mining phase can significantly reduce the exploration space, because it defines a boundary inside the search space lattice beyond which exploration is not needed.

The importance of constraints is clear − they create only association rules that are interesting to users. The method reduces the rule space so that the remaining rules satisfy the constraints.

Constraint-based clustering discovers clusters that satisfy user-defined preferences or constraints. Depending on the characteristics of the constraints, constraint-based clustering may adopt rather different approaches.

The constraints can include the following which are as


follows −

Knowledge type constraints − These define the type of knowledge to be mined, such as association or correlation.

Data constraints − These define the set of task-relevant data.

Dimension/level constraints − These define the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining.

Classification and Predication in Data Mining


What is Classification?

Classification is to identify the category or the class label of a new


observation. First, a set of data is used as training data. The set
of input data and the corresponding outputs are given to the
algorithm. So, the training data set includes the input data and
their associated class labels. Using the training dataset, the
algorithm derives a model or the classifier. The derived model
can be a decision tree, mathematical formula, or a neural
network. In classification, when unlabeled data is given to the
model, it should find the class to which it belongs. The new data
provided to the model is the test data set.

Classification is the process of classifying a record. One simple


example of classification is to check whether it is raining or not.
The answer can either be yes or no. So, there is a particular
number of choices. Sometimes there can be more than two
classes to classify. That is called multiclass classification.

The bank needs to analyze whether giving a loan to a particular


customer is risky or not. For example, based on observable data
for multiple loan borrowers, a classification model may be
established that forecasts credit risk. The data could track job
records, homeownership or leasing, years of residency, number,
type of deposits, historical credit ranking, etc. The goal would be
credit ranking, the predictors would be the other characteristics,
and the data would represent a case for each consumer. In this
example, a model is constructed to find the categorical label. The
labels are risky or safe.
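
A minimal scikit-learn sketch of such a credit-risk classifier follows, assuming scikit-learn is available; the borrower features and training records below are hypothetical and greatly simplified.

# Sketch (hypothetical, oversimplified borrower data): training a decision
# tree that labels loan applicants as "risky" or "safe".
from sklearn.tree import DecisionTreeClassifier

# Features: [years_employed, owns_home (1/0), credit_score]
X_train = [[1, 0, 580], [7, 1, 720], [3, 0, 640], [10, 1, 760], [2, 0, 600]]
y_train = ["risky", "safe", "risky", "safe", "risky"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

new_applicant = [[5, 1, 700]]
print(model.predict(new_applicant))   # e.g. ['safe']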

How does Classification Work?

The functioning of classification with the assistance of the bank


loan application has been mentioned above. There are two
stages in the data classification system: classifier or model
creation and classification classifier.

1. Developing the Classifier or model creation: This level is


the learning stage or the learning process. The classification
algorithms construct the classifier in this stage. A classifier
is constructed from a training set composed of the records
of the database and their corresponding class labels. Each record that makes up the training set belongs to a category or class. We may also refer to these records as samples, objects, or data points.
2. Applying classifier for classification: The classifier is used
for classification at this level. The test data are used here to
estimate the accuracy of the classification algorithm. If the
consistency is deemed sufficient, the classification rules can
be expanded to cover new data records. It includes:
o Sentiment Analysis: Sentiment analysis is highly
helpful in social media monitoring. We can use it to
extract social media insights. We can build sentiment
analysis models to read and analyze misspelled words
with advanced machine learning algorithms. Well-trained models provide consistently accurate results in a fraction of the time.

o Document Classification: We can use document classification to organize documents into sections according to their content. Document classification refers to text classification; we can classify the words in the entire document with the help of machine learning classification algorithms.

Data Classification Process: The data classification process can be categorized into five steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential details that we store.
o Label the data using data labelling.
o Use the results to improve protection and compliance.
o Treat classification as a continuous process, because data keeps changing.

What is Data Classification Lifecycle?

The data classification life cycle produces an excellent structure


for controlling the flow of data to an enterprise. Businesses need
to account for data security and compliance at each level. With
the help of data classification, we can perform it at every stage,
from origin to deletion. The data life-cycle has the following
stages, such as:
1. Origin: Sensitive data is produced in various formats, including emails, Excel, Word, Google documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-house protection policies and compliance rules.
3. Storage: Here, we have the obtained data, including access
controls and encryption.
4. Sharing: Data is continually distributed among agents,
consumers, and co-workers from various devices and
platforms.
5. Archive: Here, data is eventually archived within an
industry's storage systems.
6. Publication: Through the publication of data, it can reach
customers. They can then view and download in the form
of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in classification, the training dataset contains the inputs and corresponding numerical output values. The algorithm derives the model or a predictor according to the training dataset. The model should find a numerical output when the new data is given. Unlike in classification, this method does not have a class label. The model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house depending on facts such as the number of rooms, the total area, etc., is an example of prediction.

Classification and Prediction Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities, such as:

1. Data Cleaning: Data cleaning involves removing the noise


and treatment of missing values. The noise is removed by
applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with
the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also have irrelevant
attributes. Correlation analysis is used to know whether any
two given attributes are related.
3. Data Transformation and reduction: The data can be
transformed by any of the following methods.
o Normalization: The data is transformed using
normalization. Normalization involves scaling all
values for a given attribute to make them fall within a
small specified range. Normalization is used when the
neural networks or the methods involving
measurements are used in the learning step.
o Generalization: The data can also be transformed by
generalizing it to the higher concept. For this purpose,
we can use the concept hierarchies.

Classification vs. Prediction

Classification: Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known.
Prediction: Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

Classification: In classification, the accuracy depends on finding the class label correctly.
Prediction: In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

Classification: In classification, the model can be known as the classifier.
Prediction: In prediction, the model can be known as the predictor.

Classification: A model or the classifier is constructed to find the categorical labels.
Prediction: A model or a predictor will be constructed that predicts a continuous-valued function or ordered value.

Classification: For example, the grouping of patients based on their medical records can be considered a classification.
Prediction: For example, we can think of prediction as predicting the correct treatment for a particular disease for a person.

Decision Tree Induction

Decision Tree is a supervised learning method used in data mining for


classification and regression methods. It is a tree that helps us in decision-
making purposes. The decision tree creates classification or regression models
as a tree structure. It separates a data set into smaller subsets, and at the same
time, the decision tree is steadily developed. The final tree is a tree with the
decision nodes and leaf nodes. A decision node has at least two branches. The leaf nodes show a classification or decision; we cannot split leaf nodes any further. The uppermost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can deal with both categorical and numerical data.

Key factors:

Entropy:

Entropy refers to a common way to measure impurity. In a decision tree, it measures the randomness or impurity in data sets.

Information Gain:

Information Gain refers to the decline in entropy after the dataset is split. It is also called Entropy Reduction. Building a decision tree is all about discovering attributes that return the highest information gain.
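As an illustration of these two measures, the short Python sketch below computes entropy and the information gain of a candidate split; the toy labels and the helper function names are invented for this example and are not taken from any particular library.

# A minimal sketch (not from the original text) showing how entropy and
# information gain could be computed for a candidate split.
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a set of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Entropy reduction achieved by splitting the parent into the given children."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        (len(child) / total) * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: splitting "play" labels on a hypothetical attribute.
parent = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]    # a perfect split
print(entropy(parent))                  # 1.0 (maximally impure)
print(information_gain(parent, split))  # 1.0 (entropy reduced to 0)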

Why are decision trees useful?

It enables us to analyze the possible consequences of a decision


thoroughly.

It provides us with a framework to measure the values of outcomes and the probability of achieving them.

It helps us to make the best decisions based on existing data and our best assumptions.

Advantages of using decision trees:

A decision tree does not need scaling of information.


Missing values in data also do not influence the process of building a decision tree to any considerable extent.

A decision tree model is automatic and simple to explain to the


technical team as well as stakeholders.

Compared to other algorithms, decision trees need less exertion


for data preparation during pre-processing.

A decision tree does not require a standardization of data.
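As a brief illustration of the ideas above, the following sketch builds a small decision tree with scikit-learn (an assumed library choice, since the text does not name one); the loan-applicant features and labels are invented purely for demonstration.

# An illustrative sketch, assuming scikit-learn is available.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical loan-applicant records: [income_in_thousands, years_at_job]
X = [[25, 1], [40, 3], [60, 5], [80, 10], [30, 2], [90, 12]]
y = ["risky", "risky", "safe", "safe", "risky", "safe"]  # class labels

tree = DecisionTreeClassifier(criterion="entropy")  # split on information gain
tree.fit(X, y)

print(tree.predict([[55, 4]]))  # classify a new, unlabeled applicant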

Backpropagation in Data Mining


• Backpropagation is an algorithm that propagates errors from the output nodes back to the input nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many applications of neural networks in data mining, such as character recognition and signature verification.

Neural Network:

Neural networks are an information processing paradigm


inspired by the human nervous system. Just like in the human
nervous system, we have biological neurons in the same way
in neural networks we have artificial neurons, artificial neurons
are mathematical functions derived from biological neurons.
The human brain is estimated to have about 10 billion neurons,
each connected to an average of 10,000 other neurons. Each
neuron receives a signal through a synapse, which controls the effect of the signal on the neuron.
Backpropagation:

Backpropagation is a widely used algorithm for training


feedforward neural networks. It computes the gradient of the
loss function with respect to the network weights. It is much more efficient than naively computing the gradient with respect to each weight individually. This efficiency makes it possible to
use gradient methods to train multi-layer networks and update
weights to minimize loss; variants such as gradient descent or
stochastic gradient descent are often used.
The backpropagation algorithm works by computing the
gradient of the loss function with respect to each weight via
the chain rule, computing the gradient layer by layer, and
iterating backward from the last layer to avoid redundant
computation of intermediate terms in the chain rule.

Features of Backpropagation:

1. It is a gradient descent method, as used in the case of a simple perceptron network with a differentiable unit.
2. It differs from other networks in the process by which the weights are calculated during the learning period of the network.
3. Training is done in three stages:
• the feed-forward of the input training pattern

• the calculation and backpropagation of the error

• the updating of the weights
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the network operates on. The network compares the generated output to the desired output and generates an error report if they do not match. It then adjusts the weights according to the error report to obtain the desired output.

Backpropagation Algorithm:

Step 1: Inputs X arrive through the preconnected path.
Step 2: The input is modeled using real weights W. The weights are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer, through the hidden layer, to the output layer.
Step 4: Calculate the error in the outputs:
Backpropagation Error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights so that the error is reduced.
Step 6: Repeat the process until the desired output is achieved.
Parameters :
• x = input training vector, x = (x1, x2, …, xn).

• t = target vector, t = (t1, t2, …, tm).

• δk = error at output unit k.

• δj = error at hidden unit j.

• α = learning rate.

• v0j = bias of hidden unit j.

Training Algorithm :
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3 to 10.
Step 3: For each training pair, do steps 4 to 9 (feed-forward).
Step 4: Each input unit receives the input signal xi and transmits it to all the units in the layer above (the hidden units).
Step 5: Each hidden unit zj (j = 1 to a) sums its weighted input signals to calculate its net input,
zinj = v0j + Σ xi vij   (i = 1 to n),
applies the activation function, zj = f(zinj), and sends this signal to all units in the layer above, i.e., the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals,
yink = w0k + Σ zj wjk   (j = 1 to a),
and applies its activation function to calculate the output signal,
yk = f(yink).
Backpropagation Error :
Step 6: Each output unit yk (k = 1 to m) receives a target pattern corresponding to the input pattern, and its error term is calculated as:
δk = (tk – yk) f′(yink)
Step 7: Each hidden unit zj (j = 1 to a) sums its delta inputs from the units in the layer above,
δinj = Σ δk wjk   (k = 1 to m),
and its error information term is calculated as:
δj = δinj f′(zinj)
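The following compact NumPy sketch walks through the same feed-forward and backpropagation-of-error steps for a single hidden layer with a sigmoid activation; the array names mirror the notation above (V, W, v0, w0, alpha), while the input, target, and network size are invented for illustration.

# A numerical sketch of one-hidden-layer backpropagation (made-up data).
import numpy as np

def f(a):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def f_prime(a):      # derivative of the sigmoid
    s = f(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = np.array([0.5, 0.1, 0.4])                  # input vector (n = 3)
t = np.array([1.0])                            # target vector (m = 1)
V, v0 = rng.normal(size=(3, 2)), np.zeros(2)   # input -> hidden weights and bias
W, w0 = rng.normal(size=(2, 1)), np.zeros(1)   # hidden -> output weights and bias
alpha = 0.5                                    # learning rate

for _ in range(1000):
    # Feed-forward
    z_in = v0 + x @ V
    z = f(z_in)
    y_in = w0 + z @ W
    y = f(y_in)
    # Backpropagation of error
    delta_k = (t - y) * f_prime(y_in)      # error term at the output units
    delta_in_j = W @ delta_k               # summed delta inputs at hidden units
    delta_j = delta_in_j * f_prime(z_in)   # error term at the hidden units
    # Weight and bias updates
    W += alpha * np.outer(z, delta_k); w0 += alpha * delta_k
    V += alpha * np.outer(x, delta_j); v0 += alpha * delta_j

print(y)   # after training, the output approaches the target of 1.0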

Need for Backpropagation:

Backpropagation is “backpropagation of errors” and is very


useful for training neural networks. It’s fast, easy to implement,
and simple. Backpropagation does not require any parameters
to be set, except the number of inputs. Backpropagation is a
flexible method because no prior knowledge of the network is
required.

Types of Backpropagation

There are two types of backpropagation networks.


• Static backpropagation: Static backpropagation is a
network designed to map static inputs for static
outputs. These types of networks are capable of
solving static classification problems such as OCR
(Optical Character Recognition).
• Recurrent backpropagation: Recurrent backpropagation is another network used for fixed-
point learning. Activation in recurrent backpropagation
is feed-forward until a fixed value is reached. Static
backpropagation provides an instant mapping, while
recurrent backpropagation does not provide an instant
mapping.

Advantages:

• It is simple, fast, and easy to program.


• Apart from the number of inputs, there are no other
parameters to tune.
• It is Flexible and efficient.
• No need for users to learn any special functions.

Disadvantages:

• It is sensitive to noisy data and irregularities. Noisy data


can lead to inaccurate results.
• Performance is highly dependent on input data.
• Training can be time-consuming.
• The matrix-based approach is preferred over the mini-batch approach.
• Data Mining Bayesian Classifiers
• In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is the same as that of some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of liver illness based on the individual's eating habits and working efficiency. Although most people who eat healthily and exercise consistently have a lower probability of liver disease, they may still develop it due to other factors, for example, consumption of high-calorie street food or alcohol abuse. Determining whether an individual's eating routine is healthy or their workout efficiency is sufficient is also subject to interpretation, which in turn may introduce uncertainties into the learning problem.

According to Bayes' theorem, P(X/Y) = P(Y/X) P(X) / P(Y), where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the


occurrence of event X is given that Y is true.

P(Y/X) is a conditional probability that describes the


occurrence of event Y is given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y


independently of each other. This is known as the marginal
probability.

Bayesian interpretation:
In the Bayesian interpretation, probability determines a "degree
of belief." Bayes theorem connects the degree of belief in a
hypothesis before and after accounting for evidence. For
example, let us consider the example of a coin. If we toss a coin, then we get either heads or tails, and the probability of occurrence of each is 50%. If the coin is flipped a number of times, and the outcomes are observed, the degree
of belief may rise, fall, or remain the same depending on the
outcomes.

o P(X), the prior, is the initial degree of belief in X.

o P(X/Y), the posterior, is the degree of belief after having accounted for Y.

o The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from conditional probability:

P(X/Y) = P(X⋂Y) / P(Y) and P(Y/X) = P(X⋂Y) / P(X),

where P(X⋂Y) is the joint probability of both X and Y being true. Because both expressions contain the same joint probability, equating them gives P(X/Y) P(Y) = P(Y/X) P(X), which rearranges to Bayes' theorem.
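As a small worked example of the theorem, the sketch below combines an assumed prior, likelihood, and marginal probability into a posterior; all of the numbers are invented purely for illustration.

# A tiny numeric sketch of Bayes' theorem with invented numbers.
p_x = 0.01                # P(X): prior probability of the hypothesis
p_y_given_x = 0.90        # P(Y/X): likelihood of the evidence if X is true
p_y_given_not_x = 0.05    # P(Y/not X)

# Marginal probability of the evidence: P(Y) = P(Y/X)P(X) + P(Y/not X)P(not X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior: P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))   # ~0.154: the degree of belief after seeing Y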

Bayesian network:
A Bayesian Network is a Probabilistic Graphical Model (PGM) used to compute uncertainties by means of probability. Generally known as Belief Networks, Bayesian Networks model uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to show a Bayesian Network,


and like some other statistical graph, a DAG consists of a set of
nodes and links, where the links signify the connection between
the nodes.

A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
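To tie the Bayesian idea back to classification, the following sketch trains a naive Bayes classifier with scikit-learn (an assumed library choice, not something prescribed by the text); the health-habit features and risk labels are invented for demonstration.

# An illustrative naive Bayes sketch with invented data.
from sklearn.naive_bayes import GaussianNB

# Hypothetical records: [healthy_meals_per_week, workouts_per_week]
X = [[14, 5], [3, 0], [10, 4], [2, 1], [12, 3], [4, 0]]
y = ["low risk", "high risk", "low risk", "high risk", "low risk", "high risk"]

model = GaussianNB()
model.fit(X, y)

print(model.predict([[8, 2]]))         # most probable class for a new person
print(model.predict_proba([[8, 2]]))   # posterior probabilities for each class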
Associative Classification in Data Mining
Data mining is the process of discovering and extracting hidden
patterns from different types of data to help decision-makers
make decisions. Associative classification is a common
classification learning method in data mining, which applies
association rule detection methods and classification to create
classification models.
Association Rule learning in Data Mining:
Association rule learning is a machine learning method for
discovering interesting relationships between variables in
large databases. It is designed to detect strong rules in the
database based on some interesting metrics. For any given
multi-item transaction, association rules aim to obtain rules
that determine how or why certain items are linked.
• In Classification analysis, it is mostly used to question,

make decisions, and predict behavior.


• In Clustering analysis, it is mainly used when no

assumptions are made about possible relationships in


the data.
• In Regression analysis, it is used when we want to predict the value of a continuous dependent variable from a set of independent variables.
• How does Association Rule Learning work?

• Association rule learning is a type of unsupervised

learning technique that checks for the dependency of


one data item on another data item and maps
accordingly so that it can be more profitable. It is based
on different rules to discover the interesting relations
between variables in the database. The association rule
learning is one of the very important concepts of
machine learning, and it is employed in Market Basket
analysis, Web usage mining, continuous production, etc.
Here market basket analysis is a technique used by the
various big retailer to discover the associations between
items.
• Association rule learning works on the concept of If and

Else Statement, such as if A then B.


• 1. Support:
• Support indicates how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X over T transactions, it can be written as:
• Supp(X) = Freq(X) / T

• 2.Confidence:

• Confidence indicates how often the rule has been found

to be true. Or how often the items X and Y occur


together in the dataset when the occurrence of X is
already given. It is the ratio of the transaction that
contains X and Y to the number of records that contain X.
• Confidence = Freq(X,Y) / Freq(X)

3. Lift:
Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other, and it has three possible ranges of values (a small worked sketch of support, confidence, and lift follows the list below):
Lift = Supp(X,Y) / (Supp(X) * Supp(Y))
• If Lift= 1: The probability of occurrence of antecedent

and consequent is independent of each other.


• Lift>1: It determines the degree to which the two

itemsets are dependent to each other.


• Lift<1: It tells us that one item is a substitute for other

items, which means one item has a negative effect on


another.
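The following short sketch works through support, confidence, and lift for the rule {bread} -> {butter} on a handful of invented transactions; the item names and numbers are made up purely to illustrate the three formulas.

# A small worked sketch of support, confidence, and lift.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
    {"milk", "jam"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

supp_x = support({"bread"})
supp_y = support({"butter"})
supp_xy = support({"bread", "butter"})

confidence = supp_xy / supp_x             # Freq(X, Y) / Freq(X)
lift = supp_xy / (supp_x * supp_y)        # > 1: bread and butter co-occur often

print(supp_xy, confidence, round(lift, 2))   # 0.6 1.0 1.67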
• Types of Association Rule Learning:

• Association rule learning is commonly divided into three algorithms; two widely used ones are described below:
• 1.Apriori Algorithm:

• This algorithm uses frequent datasets to generate

association rules. It is designed to work on the databases


that contain transactions. This algorithm uses a breadth-
first search and Hash Tree to calculate the itemset
efficiently. It is mainly used for market basket analysis
and helps to understand the products that can be bought
together. It can also be used in the healthcare field to
find drug reactions for patients.
• 2.Eclat Algorithm:

• Eclat algorithm stands for Equivalence Class

Transformation. This algorithm uses a depth-first search


technique to find frequent itemsets in a transaction
database. It performs faster execution than Apriori
Algorithm.
Applications of Association Rule Learning:
It has various applications in machine learning and data
mining. Below are some popular applications of association
rule learning:
• Market Basket Analysis: It is one of the popular

examples and applications of association rule mining.


This technique is commonly used by big retailers to
determine the association between items.
• Medical Diagnosis: With the help of association rules, patients can be diagnosed and treated more effectively, as the rules help in identifying the probability of illness for a particular disease.
• Protein Sequence: The association rules help in

determining the synthesis of artificial Proteins.


• It is also used for the Catalog Design and Loss-leader

Analysis and many more other applications.


• Types of Associative Classification:

• There are different types of Associative Classification

Methods, Some of them are given below.


• 1. CBA (Classification Based on Associations): It uses
association rule techniques to classify data, which proves
to be more accurate than traditional classification
techniques. It has to face the sensitivity of the minimum
support threshold. When a lower minimum support
threshold is specified, a large number of rules are
generated.
• 2. CMAR (Classification based on Multiple Association

Rules): It uses an efficient FP-tree, which consumes less


memory and space compared to Classification Based on
Associations. The FP-tree will not always fit in the main
memory, especially when the number of attributes is
large.
• 3. CPAR (Classification based on Predictive Association
Rules): Classification based on predictive association
rules combines the advantages of association
classification and traditional rule-based classification.
Classification based on predictive association rules uses
a greedy algorithm to generate rules directly from
training data. Furthermore, classification based on
predictive association rules generates and tests more
rules than traditional rule-based classifiers to avoid
missing important rules.
Classifier Accuracy
Evaluating & estimating the accuracy of classifiers is
important in that it allows one to evaluate how
accurately a given classifier will label future data, that is, data on which the classifier has not been trained.

For example, suppose you used data from


previous sales to train a classifier to predict
customer purchasing behavior.

You would like an estimate of how accurately the


classifier can predict the purchasing behavior of
future customers, that is, future customer data
on which the classifier has not been trained.
Accuracy estimates help in the comparison of different classifiers.

Methods To Find Accuracy Of The


Classifiers
• Holdout Method
• Random Subsampling
• K-fold Cross-Validation
• Bootstrap Methods

Estimating The Classifier Accuracy


Holdout & Random Subsampling
The holdout method is what we have alluded
to so far in our discussions about accuracy.

In this method, the given data are randomly


partitioned into two independent sets, a training
set, and a test set.

Typically, two-thirds of the data are allocated to


the training set, and the remaining one-third is
allocated to the test set.

The training set is used to derive the model,


whose accuracy is estimated with the test set.

The estimate is pessimistic because only a


portion of the initial data is used to derive the
model.

Random subsampling is a variation of the


holdout method in which the holdout method is
repeated k times.

The overall accuracy estimate is taken as the


average of the accuracies obtained from each
iteration. (For prediction, we can take the
average of the predictor error rates.)

Cross-Validation
In k-fold cross-validation, the initial data are
randomly partitioned into k mutually exclusive
subsets or “folds,” D1, D2,....., Dk, each of
approximately equal size.

Training and testing are performed k times. In


iteration i, partition Di is reserved as the test set,
and the remaining partitions are collectively used
to train the model.

That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.

Unlike the holdout and random subsampling


methods above, here, each sample is used the
same number of times for training and once for
testing.

For classification, the accuracy estimate is the overall


number of correct classifications from the k
iterations, divided by the total number of tuples
in the initial data.

For prediction, the error estimate can be


computed as the total loss from the k iterations,
divided by the total number of initial tuples.
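A brief sketch of the k-fold procedure described above is shown below; scikit-learn and the iris dataset are assumptions used only for illustration, and the accuracy estimate is the overall fraction of correct classifications across all k iterations.

# A k-fold cross-validation sketch (assuming scikit-learn).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
k = 10
correct = 0

for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    correct += np.sum(model.predict(X[test_idx]) == y[test_idx])

print(correct / len(y))   # accuracy = total correct classifications / total tuples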

Bootstrapping
Unlike the accuracy estimation methods
mentioned above, the bootstrap method samples
the given training tuples uniformly with
replacement.
That is, each time a tuple is selected, it is equally
likely to be selected again and readded to the
training set.

For instance, imagine a machine that randomly


selects tuples for our training set. In sampling
with replacement, the machine is allowed to
select the same tuple more than once.

Ensemble Methods - Increasing The


Accuracy
Are there general strategies for improving
classifier and predictor accuracy?

YES, Bagging and boosting are two such


techniques.

Bagging
We first take an intuitive look at how bagging
works as a method of increasing accuracy.

For ease of explanation, we will assume at first


that our model is a classifier. Suppose that you
are a patient and would like to have a diagnosis
made based on your symptoms.

Instead of asking one doctor, you may choose to


ask several. If a certain diagnosis occurs more
than any of the others, you may choose this as
the final or best diagnosis.

That is, the final diagnosis is made based on a


majority vote, where each doctor gets an equal
vote. Now replace each doctor by a classifier,
and you have the basic idea behind bagging.

Intuitively, a majority vote made by a large


group of doctors may be more reliable than a
majority vote made by a small group.

Given a set, D, of d tuples, bagging works as


follows.

For iteration i (i = 1, 2,..., k), a training set, Di,


of d tuples is sampled with replacement from the
original set of tuples, D.

Note that the term bagging stands for bootstrap


aggregation.
A classifier model, Mi, is learned for each training
set, Di.

To classify an unknown tuple, X, each classifier,


Mi, returns its class prediction, which counts as
one vote.

The bagged classifier, M, counts the votes and


assigns the class with the most votes to X.

Bagging can be applied to the prediction of


continuous values by taking the average value of
each prediction for a given test tuple.
The bagged classifier often has significantly
greater accuracy than a single classifier derived
from D, the original training data.
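The sketch below mirrors the bagging procedure just described: k bootstrap samples, one classifier per sample, and a majority vote; scikit-learn, the iris data, and the choice of k are assumptions made for illustration.

# A rough bagging sketch: bootstrap samples plus a majority vote.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
k, d = 15, len(X)

models = []
for _ in range(k):
    idx = rng.integers(0, d, size=d)      # sample d tuples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = [int(m.predict([x])[0]) for m in models]   # one vote per classifier
    return Counter(votes).most_common(1)[0][0]         # class with most votes

print(bagged_predict(X[0]), y[0])   # bagged prediction vs. true label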

Boosting
We now look at the ensemble method of
boosting. As in the previous section, suppose
that as a patient, you have certain symptoms.

Instead of consulting one doctor, you choose to


consult several.

Suppose you assign weights to the value or


worth of each doctor’s diagnosis, based on the
accuracies of previous diagnoses they have
made.

The final diagnosis is then a combination of the


weighted diagnoses. This is the essence behind
boosting.

In boosting, weights are assigned to each


training tuple.

A series of k classifiers is iteratively learned.


After a classifier Mi is learned, the weights are
updated to allow the subsequent classifier, Mi+1,
to “pay more attention” to the training tuples
that were misclassified by Mi.
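As a short illustration, the sketch below uses AdaBoost, one common boosting algorithm, via scikit-learn (both are assumptions; the text describes the general idea rather than this specific implementation).

# A boosting sketch using AdaBoost with its default weak base classifiers.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# k weak classifiers are learned in sequence; misclassified tuples get higher
# weight so the next classifier "pays more attention" to them, and the final
# prediction is a weighted vote of all k classifiers.
boosted = AdaBoostClassifier(n_estimators=25)
boosted.fit(X, y)
print(boosted.score(X, y))   # training accuracy of the weighted ensemble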

The Difference between Linear and Nonlinear Regression


Models

The difference between linear and nonlinear regression models


isn’t as straightforward as it sounds. You’d think that linear
equations produce straight lines and nonlinear equations model
curvature. Unfortunately, that’s not correct. Both types of
models can fit curves to your data—so that’s not the defining
characteristic. In this post, I’ll teach you how to identify linear
and nonlinear regression models.

The difference between nonlinear and linear


is the “non.” OK, that sounds like a joke, but, honestly, that’s
the easiest way to understand the difference. First, I’ll define
what linear regression is, and then everything else must be
nonlinear regression. I’ll include examples of both linear and
nonlinear regression models.

Linear Regression Equations

A linear regression model follows a very particular form. In


statistics, a regression model is linear when all terms in the
model are one of the following:

o The constant
o A parameter multiplied by an independent variable
(IV)

Then, you build the equation by only adding the terms together.
These rules limit the form to just one type:

Dependent variable = constant + parameter * IV + … +


parameter * IV

Statisticians say that this type of regression equation is linear in


the parameters. However, it is possible to model curvature with
this type of model. While the function must be linear in the
parameters, you can raise an independent variable by an
exponent to fit a curve. For example, if you square an
independent variable, the model can follow a U-shaped curve.

While the independent variable is squared, the model is


still linear in the parameters. Linear models can also contain log
terms and inverse terms to follow different kinds of curves and
yet continue to be linear in the parameters.
The regression example below models the relationship
between body mass index (BMI) and body fat percent. In a
different blog post, I use this model to show how to make
predictions with regression analysis. It is a linear model that
uses a quadratic (squared) term to model the curved
relationship.


Nonlinear Regression Equations

I showed how linear regression models have one basic


configuration. Now, we’ll focus on the “non” in nonlinear! If a
regression equation doesn’t follow the rules for a linear model,
then it must be a nonlinear model. It’s that simple! A nonlinear
model is literally not linear.

The added flexibility opens the door to a huge number of


possible forms. Consequently, nonlinear regression can fit an
enormous variety of curves. However, because there are so
many candidates, you may need to conduct some research to
determine which functional form provides the best fit for your
data.

Below, I present a handful of examples that illustrate the


diversity of nonlinear regression models. Keep in mind that
each function can fit a variety of shapes, and there are many
nonlinear functions. Also, notice how nonlinear regression
equations are not comprised of only addition and multiplication!
In the table, thetas are the parameters, and Xs are the
independent variables.

Nonlinear equation — Example form

Power: Y = θ1 * X^θ2

Weibull growth: Y = θ1 + (θ2 − θ1) * exp(−θ3 * X^θ4)
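The sketch below contrasts the two ideas on synthetic data: a quadratic fit that is still linear in its parameters, and a power model whose parameter appears as an exponent; NumPy, SciPy, and the generated data are assumptions used only for illustration.

# Linear-in-parameters vs. nonlinear regression on made-up data.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 2.0 * x**1.5 + rng.normal(scale=1.0, size=x.size)   # curved, noisy data

# Linear regression (linear in the parameters), fitting y = b0 + b1*x + b2*x^2.
b2, b1, b0 = np.polyfit(x, y, deg=2)
print("linear-in-parameters fit:", round(b0, 2), round(b1, 2), round(b2, 2))

# Nonlinear regression: the parameter theta2 appears as an exponent.
def power_model(x, theta1, theta2):
    return theta1 * x**theta2

(theta1, theta2), _ = curve_fit(power_model, x, y, p0=(1.0, 1.0))
print("nonlinear fit:", round(theta1, 2), round(theta2, 2))   # close to 2 and 1.5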
Data Mining – Cluster Analysis
• INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups. This process is often used for exploratory data analysis and can help identify patterns or relationships within the data that may not be immediately obvious. There are many different algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice of algorithm will depend on the specific requirements of the analysis and the nature of the data being analyzed.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering algorithms must be able to deal with huge databases. To handle extensive databases, the clustering algorithm should be scalable; if it is not, we may not get appropriate results, which would lead to wrong conclusions.
2. High Dimensionality: The algorithm should be able to
handle high dimensional space along with the data of small
size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based, and binary data.
4. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability reflects how easily the results are understood.
Clustering Methods:
• Partitioning Method
• Hierarchical Method

• Density-based Method

• Grid-Based Method

• Model-Based Method

• Constraint-based Method
Partitioning Method: It is used to make partitions on the data
in order to form clusters. If “n” partitions are done on “p”
objects of the database then each partition is represented by a
cluster and n < p. The two conditions which need to be
satisfied with this Partitioning Clustering Method are:
• One object should belong to only one group.

• There should be no group without even a single object.
In the partitioning method, there is one technique called
iterative relocation, which means the object will be moved from
one group to another to improve the partitioning
Hierarchical Method: In this method, a hierarchical
decomposition of the given set of data objects is created. We
can classify hierarchical methods and will be able to know the
purpose of classification on the basis of how the hierarchical
decomposition is formed. There are two types of approaches
for the creation of hierarchical decomposition, they are:
• Agglomerative Approach: The agglomerative
approach is also known as the bottom-up approach.
Initially, each object forms its own separate group. Thereafter, the method keeps on merging
the objects or the groups that are close to one another
which means that they exhibit similar properties. This
merging process continues until the termination
condition holds.
• Divisive Approach: The divisive approach is also

known as the top-down approach. In this approach, we


would start with the data objects that are in the same
cluster. The group of individual clusters is divided into
small clusters by continuous iteration. The iteration
continues until the condition of termination is met or
until each cluster contains one object.
• Density-Based Method: The density-based method
mainly focuses on density. In this method, the given
cluster will keep on growing continuously as long as the
density in the neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
• Grid-Based Method: In the Grid-Based method a grid is

formed using the object together,i.e, the object space is


quantized into a finite number of cells that form a grid
structure. One of the major advantages of the grid-based
method is fast processing time and it is dependent only
on the number of cells in each dimension in the quantized
space. The processing time for this method is much faster
so it can save time.
Applications Of Cluster Analysis:
• It is widely used in image processing, data analysis, and

pattern recognition.
• It helps marketers to find the distinct groups in their

customer base and they can characterize their customer


groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal
and plant taxonomies and identifying genes with the
same capabilities.
• It also helps in information discovery by classifying
documents on the web.

Advantages of Cluster Analysis:


1. It can help identify patterns and relationships within a
dataset that may not be immediately obvious.

2. It can be used for exploratory data analysis and can


help with feature selection.

3. It can be used to reduce the dimensionality of the


data.

4. It can be used for anomaly detection and outlier


identification.

5. It can be used for market segmentation and customer


profiling.
Disadvantages of Cluster Analysis:
1. It can be sensitive to the choice of initial conditions
and the number of clusters.

2. It can be sensitive to the presence of noise or outliers


in the data.

3. It can be difficult to interpret the results of the


analysis if the clusters are not well-defined.
4. It can be computationally expensive for large datasets.

5. The results of the analysis can be affected by the


choice of clustering algorithm used.

6. It is important to note that the success of cluster


analysis depends on the data, the goals of the
analysis, and the ability of the analyst to interpret the
results.
Partitioning Method (K-Mean) in Data Mining
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analysts to specify the number of clusters that has to be generated for the clustering methods. In the partitioning method, when a database (D) contains multiple (N) objects, the partitioning method constructs a user-specified number (K) of partitions of the data, in which each partition represents a cluster and a particular region. There are many algorithms that come under the partitioning method; some of the popular ones are K-Mean, PAM (K-Medoids), and the CLARA algorithm (Clustering Large Applications). In this article, we will be seeing the working of the K-Mean algorithm in detail.

K-Mean (A centroid based Technique): The K-Mean algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters.
Input:
K: The number of clusters in which the dataset has
to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster whose centre it is most similar to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
4. Repeat steps 2 and 3 until no change occurs.

Example: Suppose we want to group the visitors to a website using just their age, as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
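A brief sketch of applying k-means to these ages is shown below; scikit-learn and the choice of K = 3 are assumptions made only for illustration.

# K-means on the ages above (assuming scikit-learn), with K = 3.
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)

for label, centre in enumerate(kmeans.cluster_centers_.ravel()):
    members = ages.ravel()[kmeans.labels_ == label]
    print(f"cluster {label}: centre ~ {centre:.1f}, members = {members.tolist()}")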

What is Density-based clustering?


Density-Based Clustering refers to one of the most popular
unsupervised learning methodologies used in model building
and machine learning algorithms. The data points in the region
separated by two clusters of low point density are considered as
noise. The surroundings with a radius ε of a given object are
known as the ε neighborhood of the object. If the ε
neighborhood of the object comprises at least a minimum
number, MinPts of objects, then it is called a core object.

Density-Based Clustering - Background

There are two different parameters to calculate the density-based clustering:

EPS: It is considered as the maximum radius of the neighborhood.

MinPts: MinPts refers to the minimum number of points in an Eps neighborhood of that point.

NEps(i) : { k belongs to D and dist(i, k) <= Eps }

Directly density reachable:

A point i is considered directly density reachable from a point k with respect to Eps, MinPts if

i belongs to NEps(k)

Core point condition:

| NEps(k) | >= MinPts


Density reachable:

A point i is density reachable from a point j with respect to Eps, MinPts if there is a chain of points p1, …, pn with p1 = j and pn = i such that pi+1 is directly density reachable from pi.

Density connected:

A point i is said to be density connected to a point j with respect to


Eps, MinPts if there is a point o such that both i and j are
considered as density reachable from o with respect to Eps and
MinPts.
Working of Density-Based Clustering

Suppose a set of objects is denoted by D'. We can say that an object i is directly density reachable from the object j only if it is located within the ε neighborhood of j, and j is a core object.

An object i is density reachable from the object j with respect to ε and MinPts in a given set of objects, D', only if there is a chain of objects p1, …, pn with p1 = j and pn = i such that pi+1 is directly density reachable from pi with respect to ε and MinPts.

Density-Based Clustering Methods

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of


Applications with Noise. It depends on a density-based notion of
cluster. It also identifies clusters of arbitrary size in the spatial
database with outliers.
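An illustrative sketch of DBSCAN follows, using scikit-learn (an assumed library choice); eps plays the role of the Eps radius and min_samples the role of MinPts from the definitions above, and the points are invented so that two dense regions and one noise point emerge.

# A DBSCAN sketch on a few invented 2-D points.
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],      # a dense region -> one cluster
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],      # another dense region
    [4.5, 4.5],                              # isolated point -> noise
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; label -1 marks noise points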

OPTICS
OPTICS stands for Ordering Points To Identify the Clustering
Structure. It gives a significant order of database with respect to
its density-based clustering structure. The order of the cluster
comprises information equivalent to the density-based
clustering related to a long range of parameter settings. OPTICS
methods are beneficial for both automatic and interactive cluster
analysis, including determining an intrinsic clustering structure.

DENCLUE

DENCLUE (DENsity-based CLUstEring) is a density-based clustering method proposed by Hinneburg and Keim. It enables a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and it works well for data sets with a large amount of noise.

What is Web Mining?

Web mining can widely be seen as the application of adapted


data mining techniques to the web, whereas data mining is
defined as the application of the algorithm to discover patterns
on mostly structured data embedded into a knowledge
discovery process. Web mining has a distinctive property to
provide a set of various data types. The web has multiple aspects
that yield different approaches for the mining process, such as
web pages consist of text, web pages are linked via hyperlinks,
and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.

There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information,


knowledge from the web page content. In web content mining, each
web page is considered as an individual document. The individual can
take advantage of the semi-structured nature of web pages, as HTML
provides information that concerns not only the layout but also logical
structure. The primary task of content mining is data extraction, where
structured data is extracted from unstructured websites. The objective
is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to
distinguish topics on the web. For Example, if any user searches for a
specific task on the search engine, then the user will 2. Web
Structured Mining:

Web structure mining can be used to discover the link structure of hyperlinks. It is used to identify the relationships between web pages that are linked by information or by direct hyperlinks. In Web Structure Mining, an
individual considers the web as a directed graph, with the web
pages being the vertices that are associated with hyperlinks. The
most important application in this regard is the Google search
engine, which estimates the ranking of its outcomes primarily
with the PageRank algorithm. It characterizes a page to be
exceptionally relevant when frequently connected by other
highly related pages. Structure and content mining
methodologies are usually combined. For example, web
structured mining can be beneficial to organizations to regulate
the network between two commercial sites.
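As a minimal illustration of the PageRank idea mentioned above, the sketch below runs power iteration on a tiny made-up link graph; the damping factor, the graph, and the number of iterations are assumptions, and this is not Google's actual implementation.

# A simplified PageRank sketch on an invented link graph.
damping = 0.85
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilise
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))  # C and A end up highest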

3. Web Usage Mining:

Web usage mining is used to extract useful data, information,


knowledge from the weblog records, and assists in recognizing
the user access patterns for web pages. When mining the usage of web resources, the individual considers the records of requests made by visitors of a website, which are often collected as web server logs. While the content and structure of the collection of
web pages follow the intentions of the authors of the pages, the
individual requests demonstrate how the consumers see these
pages. Web usage mining may disclose relationships that were
not proposed by the creator of the pages.

Some of the methods to identify and analyze the web usage


patterns are given below:

OLAP (Online Analytical Processing):

OLAP accomplishes a multidimensional analysis of advanced


data.

OLAP can be accomplished on various parts of log related data


in a specific period.

OLAP tools can be used to infer important business intelligence metrics.

Application of Web Mining:


Web mining has an extensive application because of various uses
of the web. The list of some applications of web mining is given
below.

o Marketing and conversion tool


o Data analysis on website and application accomplishment.
o Audience behavior analysis
o Advertising and campaign accomplishment analysis.
o Testing and analysis of a site.
o What is Spatial Data Mining?
o The emergence of spatial data and extensive usage of spatial
databases has led to spatial knowledge discovery. Spatial data
mining can be understood as a process that determines some
exciting and hypothetically valuable patterns from spatial
databases.
o Several tools are there that assist in extracting information from
geospatial data. These tools play a vital role for organizations
like NASA, the National Imagery and Mapping Agency
(NIMA), the National Cancer Institute (NCI), and the United
States Department of Transportation (USDOT) which tends to
make big decisions based on large spatial datasets.
o Earlier, some general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were used. These tools were utilized to analyze large commercial databases, and they were mainly designed for understanding the buying patterns of customers from the database.
o However, such general-purpose tools do not account for the special properties of spatial data, such as:
o spatial relationships among the variables,
o spatial structure of errors,
o observations that are not independent,
o spatial autocorrelation among the features,
o non-linear interaction in feature space.
Spatial data must have latitude or longitude, UTM easting or
northing, or some other coordinates denoting a point's location
in space. Beyond that, spatial data can contain any number of
attributes pertaining to a place. You can choose the types of
attributes you want to describe a place. Government websites
provide a resource by offering spatial data, but you need not be
limited to what they have produced. You can produce your own.

Spatial data mining tasks

Classification:

Classification determines a set of rules which find the class of the


specified object as per its attributes.

Association rules:

Association rules determine rules from the data sets and describe patterns that frequently occur in the database.
Characteristic rules:

Characteristic rules describe some parts of the data set.
