UNIT - I
Data mining access of a database differs from this traditional access in several ways:
• Query:
The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what
he wants to see.
• Data:
The data accessed is usually a different version from that of the original operational database. The data have
been cleansed and modified to better support the mining process.
• Output:
The output of the data mining query probably is not a subset of the database. Instead it is the output of some
analysis of the contents of the database.
Data mining involves many different algorithms to accomplish different tasks.
All of these algorithms attempt to fit a model to the data.
The algorithms examine the data and determine a model that is closest to the characteristics of the
data being examined.
Data mining algorithms can be characterized as consisting of three parts:
• Model:
The purpose of the algorithm is to fit a model to the data.
• Preference:
Some criterion must be used to select one model over another.
• Search:
All algorithms require some technique to search the data.
A predictive model makes a prediction about values of data using known results found from different data.
Predictive modeling is often based on the use of historical data.
A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model
serves as a way to explore the properties of the data examined, not to predict new properties.
BASIC DATA MINING TASKS
1. Classification:
Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the
classes are determined before examining the data.
2. Regression:
Regression is used to map a data item to a real-valued prediction variable. It assumes that the target data fit into
some known type of function (e.g., linear or logistic).
3. Time Series Analysis:
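The classification and regression tasks can be illustrated with a minimal Python sketch. The heights, labels, and the linear model below are invented for illustration, and 1-nearest-neighbour stands in for the many classification algorithms in use:

```python
# Hypothetical toy data: classify a person as "short", "medium", or "tall"
# from height (in inches) using 1-nearest-neighbour -- a minimal sketch of
# the classification task, not a production algorithm.

training = [(60, "short"), (66, "medium"), (68, "medium"), (74, "tall")]

def classify(height):
    """Return the class label of the closest training example."""
    return min(training, key=lambda ex: abs(ex[0] - height))[1]

# Regression, by contrast, predicts a real value; the slope and intercept
# here are illustrative assumptions mapping age to height:
def predict_height(age):
    return 2.5 * age + 30.0   # assumed linear model for illustration

print(classify(61))        # closest to 60 -> "short"
print(predict_height(10))  # 2.5*10 + 30 = 55.0
```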
With time series analysis, the value of an attribute is examined as it varies over time.
The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.).
A time series plot is used to visualize the time series.
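As a small illustrative sketch of time series analysis, a simple moving average over an evenly spaced (e.g. daily) series smooths out fluctuations to expose the trend; the readings below are invented:

```python
# Sketch of time series analysis: a simple moving average smooths an
# evenly spaced series to expose the trend. Data values are made up.

def moving_average(series, window):
    """Average each run of `window` consecutive points."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

daily = [10, 12, 11, 15, 14, 18, 17]   # hypothetical daily readings
print(moving_average(daily, 3))        # smoothed series, 2 points shorter
```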
4. Prediction:
Many real-world data mining applications can be seen as predicting future data states based on past and
current data.
Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the
prediction model, although the prediction task is a type of prediction model.)
The difference is that prediction is predicting a future state rather than a current state. Here we are referring
to a type of application rather than to a type of data mining modeling approach.
Prediction applications include flood forecasting, speech recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or regression techniques, other approaches
may be used as well.
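One simple way to predict a future state, assuming an approximately linear trend, is to fit a least-squares line to past observations and extrapolate; the data points here are invented for illustration:

```python
# Prediction sketch: fit a least-squares line to past observations and
# extrapolate the next value. Data are invented; a real application
# (flood levels, etc.) would use domain data and richer models.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4]          # past time points
ys = [2.0, 4.1, 5.9, 8.0]  # past observations
m, b = fit_line(xs, ys)
print(m * 5 + b)           # extrapolated value for time point 5
```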
5. Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined by the data
alone.
Clustering is alternatively referred to as unsupervised learning or segmentation.
It can be thought of as partitioning or segmenting the data into groups that might or might not be disjoint.
The clustering is usually accomplished by determining the similarity among the data on predefined attributes.
The most similar data are grouped into clusters.
Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created
clusters.
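A minimal k-means sketch (one-dimensional, k = 2) shows how clusters emerge from the similarity of the data alone; the points, the choice of k, and the initial centres are illustrative assumptions:

```python
# Minimal k-means sketch (1-D, k=2): groups are not predefined but
# emerge from similarity among the data themselves.

def kmeans(points, centers, rounds=10):
    for _ in range(rounds):
        # assign each point to its nearest centre
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # move each centre to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centers, clusters = kmeans(points, [0.0, 10.0])
print(centers)   # roughly [1.0, 8.0] for these points
```

A domain expert would still have to decide what the two groups mean, which is exactly the interpretation issue noted above.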
6. Summarization
Summarization maps data into subsets with associated simple descriptions. It is also called characterization or
generalization.
7. Link Analysis
Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of
uncovering relationships among data.
The best example of this type of application is to determine association rules.
An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that are frequently purchased
together
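Support and confidence, the two measures usually attached to association rules, can be computed directly over a hypothetical set of market-basket transactions (the item names are invented):

```python
# Sketch of association-rule measures over hypothetical transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Support of (lhs and rhs) divided by support of lhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))       # appears in 2 of 4 transactions
print(confidence({"bread"}, {"milk"}))  # rule: bread -> milk
```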
8. Sequence Discovery
Sequential analysis, or sequence discovery, is used to determine sequential patterns in data. These patterns are based
on a time sequence of actions.
DEFINITION 1.1.
Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.
DEFINITION 1.2.
Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.
1. Selection:
The data needed for the data mining process may be obtained from many different and heterogeneous data sources.
2. Preprocessing:
The data to be used by the process may have incorrect or missing data.
There may be anomalous data from multiple sources involving different data types and metrics.
There may be many different activities performed at this time.
Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining
tools).
3. Transformation:
Data from different sources must be converted into a common format for processing.
Some data may be encoded or transformed into more usable formats.
Data reduction may be used to reduce the number of possible data values being considered.
4. Data mining:
Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired
results.
5. Interpretation/evaluation:
How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent
on it.
Various visualization and GUI strategies are used at this last step.
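Steps 2 and 3 above (preprocessing and transformation) can be sketched on a toy attribute: supply a missing value with the mean of the known values, then min-max normalise into a common [0, 1] range. The records are invented for illustration:

```python
# Toy sketch of KDD preprocessing + transformation on one attribute.

raw = [4.0, None, 8.0, 6.0]       # one attribute with a missing value

# Preprocessing: supply missing data (here, with the mean of known values)
known = [v for v in raw if v is not None]
mean = sum(known) / len(known)
cleaned = [v if v is not None else mean for v in raw]

# Transformation: min-max normalisation into a common [0, 1] range
lo, hi = min(cleaned), max(cleaned)
normalised = [(v - lo) / (hi - lo) for v in cleaned]
print(normalised)                 # [0.0, 0.5, 1.0, 0.5]
```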
Visualization refers to the visual presentation of data. Visualization techniques include:
• Graphical:
Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric:
Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based:
Using figures, colors, or other icons can improve the presentation of the results.
• Pixel-based:
With these techniques each data value is shown as a uniquely colored pixel.
• Hierarchical:
These techniques hierarchically divide the display area (screen) into regions based on data values.
• Hybrid:
The preceding approaches can be combined into one display.
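In the spirit of the graphical techniques above, even a plain-text bar chart conveys category counts; the categories and counts below are invented:

```python
# Minimal text "visualization": a bar chart drawn with '#' characters.

counts = {"short": 3, "medium": 7, "tall": 2}

def bar_chart(data):
    width = max(len(k) for k in data)      # align the labels
    return "\n".join(f"{k.ljust(width)} | {'#' * v}"
                     for k, v in data.items())

print(bar_chart(counts))
```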
KDD VERSUS DATA MINING
• Definition:
KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and
relationships in data. Data mining refers to a process of extracting useful and valuable information or patterns from
large data sets.
• Objective:
KDD aims to find useful knowledge from data; data mining aims to extract useful information from data.
• Techniques used:
KDD: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and
knowledge representation and visualization. Data mining: association rules, classification, clustering, regression,
decision trees, neural networks, and dimensionality reduction.
• Output:
KDD produces structured information, such as rules and models, that can be used to make decisions or predictions.
Data mining produces patterns, associations, or insights that can be used to improve decision-making or understanding.
• Focus:
KDD focuses on the discovery of useful knowledge, rather than simply finding patterns in data. Data mining focuses
on the discovery of patterns or relationships in data.
• Role of domain expertise:
Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data,
and interpreting the results. It is less critical in data mining, as the algorithms are designed to identify patterns
without relying on prior knowledge.
DATA MINING ISSUES
There are many important implementation issues associated with data mining:
1. Human interaction:
Since data mining problems are often not precisely stated, interfaces may be needed with both domain and
technical experts. Technical experts are used to formulate the queries and assist in interpreting the results. Users
are needed to identify training data and desired results.
2. Overfitting:
When a model is generated that is associated with a given database state it is desirable that the model also fit
future database states. Overfitting occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of the training database. For
example, a classification model for an employee database may be developed to classify employees as short,
medium, or tall. If the training database is quite small, the model might erroneously indicate that a short person is
anyone under five feet eight inches because there is only one entry in the training database under five feet eight.
In this case, many future employees would be erroneously classified as short. Overfitting can arise under other
circumstances as well, even though the data are not changing.
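The employee example can be caricatured in a few lines: a cutoff learned from a tiny training set (heights in inches, all invented) goes on to classify future employees in a way the full population would not support:

```python
# Overfitting sketch: with one "short" training example, the learned
# cutoff reflects that single entry rather than the true distribution.

def learn_short_threshold(train):
    """Naively call anyone below the tallest 'short' example short."""
    return max(h for h, label in train if label == "short")

tiny_train = [(67, "short"), (70, "medium"), (74, "tall")]
threshold = learn_short_threshold(tiny_train)   # 67 inches = 5 ft 7 in

# With so little training data the cutoff is arbitrary, so future
# employees of ordinary height are all swept into the "short" class:
future = [66, 65, 64]
misclassified = [h for h in future if h < threshold]
print(threshold, misclassified)
```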
3. Outliers:
There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue
with very large databases. If a model is developed that includes these outliers, then the model may not behave well for
data that are not outliers.
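One common, simple way to flag outliers, offered here as an illustrative sketch rather than a prescribed method, is to mark values more than two standard deviations from the mean; the data and the cutoff of 2 are assumptions:

```python
# Outlier sketch: flag values more than `cutoff` standard deviations
# from the mean. Such entries may be excluded before model building.

def outliers(values, cutoff=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > cutoff * std]

data = [10, 11, 9, 10, 12, 10, 11, 50]
print(outliers(data))   # 50 lies far from the rest
```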
4. Interpretation of results :
Currently, data mining output may require experts to correctly interpret the results, which might otherwise be
meaningless to the average database user.
5. Visualization of results:
To easily view and understand the output of data mining algorithms, visualization of the results is helpful.
6. Large datasets:
The massive datasets associated with data mining create problems when applying algorithms designed for small
datasets. The cost of many modeling algorithms grows exponentially with dataset size, making them too inefficient for
large datasets. Sampling and parallelization are effective tools to attack this scalability problem.
7. High dimensionality:
A conventional database schema may be composed of many different attributes. The problem here is that not all
attributes may be needed to solve a given data mining problem. In fact, the use of some attributes may interfere
with the correct completion of a data mining task. The use of other attributes may simply increase the overall
complexity and decrease the efficiency of an algorithm. This problem is sometimes referred to as the
curse of dimensionality, meaning that there are many attributes (dimensions) involved and it is difficult to determine
which ones should be used. One solution to this high dimensionality problem is to reduce the number of attributes,
which is known as dimensionality reduction. However, determining which attributes are not needed is not always easy
to do.
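A crude form of dimensionality reduction, given only as an illustrative sketch, is to drop attributes with (near-)zero variance, since a constant attribute cannot help discriminate records; the table and threshold are invented, and real systems use richer criteria such as feature selection or PCA:

```python
# Dimensionality-reduction sketch: keep only attributes whose values
# actually vary across the records.

def variance(column):
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)

# rows are records; columns are attributes
table = [
    [1.0, 5.0, 3.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 2.0],
]
columns = list(zip(*table))
keep = [i for i, col in enumerate(columns) if variance(col) > 1e-9]
print(keep)   # attribute 1 is constant, so only attributes 0 and 2 remain
```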
8. Multimedia data:
Most previous data mining algorithms are targeted to traditional data types (numeric, character, text, etc.). The
use of multimedia data, such as that found in GIS databases, complicates or invalidates many proposed algorithms.
9. Missing data:
During the preprocessing phase of KDD, missing data may be replaced with estimates. This and other approaches
to handling missing data can lead to invalid results in the data mining step.
10. Irrelevant data:
Some attributes in the database might not be of interest to the data mining task being developed.
11. Noisy data:
Some attribute values might be invalid or incorrect. These values are often corrected before running data mining
applications.
12. Changing data:
Databases cannot be assumed to be static. However, most data mining algorithms do assume a static database. This
requires that the algorithm be completely rerun anytime the database changes.
13. Integration:
The KDD process is not currently integrated into normal data processing activities. KDD requests may be treated as
special, unusual, or one-time needs. This makes them inefficient, ineffective, and not general enough to be used on an
ongoing basis. Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.
14. Application:
Determining the intended use for the information obtained from the data mining function is a challenge. Indeed, how
business executives can effectively use the output is sometimes considered the more difficult part, not the running of the
algorithms themselves. Because the data are of a type that has not previously been known, business practices may have
to be modified to determine how to effectively use the information uncovered.
DATA MINING METRICS
Measuring the effectiveness or usefulness of a data mining approach is not always straightforward. In fact, different
metrics could be used for different techniques and also based on the interest level. From an overall business or
usefulness perspective, a measure such as return on investment (ROI) could be used. ROI examines the difference
between what the data mining technique costs and what the savings or benefits from its use are. Of course, this would
be difficult to measure because the return is hard to quantify. It could be measured as increased sales, reduced
advertising expenditure, or both. In a specific advertising campaign implemented via targeted catalog mailings, the
percentage of catalog recipients and the amount of purchase per recipient would provide one means to measure the
effectiveness of the mailings.
In this text, however, we use a more computer science/database perspective to measure various data mining
approaches. We assume that the business management has determined that a particular data mining application be
made. They subsequently will determine the overall effectiveness of the approach using some ROI (or related)
strategy. Our objective is to compare different alternatives to implementing a specific data mining task. The metrics
used include the traditional metrics of space and time based on complexity analysis. In some cases, such as accuracy
in classification, more specific metrics targeted to a data mining task may be used.
Social implications of DATA MINING
The integration of data mining techniques into normal day-to-day activities has become commonplace. We are
confronted daily with targeted advertising, and businesses have become more efficient through the use of data mining
activities to reduce costs. Data mining adversaries, however, are concerned that this information is being obtained at
the cost of reduced privacy. Data mining applications can derive much demographic information concerning
customers that was previously not known or hidden in the data. The unauthorized use of such data could result in the
disclosure of information that is deemed to be confidential.
We have recently seen an increase in interest in data mining techniques targeted to such applications as fraud
detection, identifying criminal suspects, and prediction of potential terrorists. These can be viewed as types of
classification problems. The approach that is often used here is one of "profiling" the typical behavior or
characteristics involved. Indeed, many classification techniques work by identifying the attribute values that
commonly occur for the target class. Subsequent records will be then classified based on these attribute values. Keep in
mind that these approaches to classification are imperfect. Mistakes can be made. Just because an individual makes a
series of credit card purchases that are similar to those often made when a card is stolen does not mean that the card is
stolen or that the individual is a criminal.
Users of data mining techniques must be sensitive to these issues and must not violate any privacy directives or
guidelines.
DATA MINING from a database perspective
The study of data mining from a database perspective involves looking at all types of data mining applications and
techniques.
However, we are interested primarily in those that are of practical interest.
While our interest is not limited to any particular type of algorithm or approach, we are concerned about the
following implementation issues:
• Scalability:
Algorithms that do not scale up to perform well with massive real-world datasets are of limited application. Related
to this is the fact that techniques should work regardless of the amount of available main memory.
• Real-world data:
Real-world data are noisy and have many missing attribute values. Algorithms should be able to work even in the
presence of these problems.
• Update:
Many data mining algorithms work with static datasets. This is not a realistic assumption.
• Ease of use:
Although some algorithms may work well, they may not be well received by users if they are difficult to use or
understand.
These issues are crucial if applications are to be accepted and used in the workplace. Throughout the text we will
mention how techniques perform in these and other implementation categories.