Unit 1: Data Mining Tasks
Data mining is often defined as finding hidden information in a database.
Traditional database queries (Figure ) access a database using a well-defined query stated in a language such as SQL.
Data mining access of a database differs from this traditional access in several ways:
• Query: The query might not be well formed or precisely stated. The data miner might not even be
exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of the original operational database. The
data have been cleansed and modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset of the database. Instead it is the
output of some analysis of the contents of the database.
A predictive model makes a prediction about values of data using known results found from different data.
Predictive model data mining tasks include classification, regression, time series analysis, and
prediction.
A descriptive model identifies patterns or relationships in data
A descriptive model serves as a way to explore the properties of the data examined, not to
predict new properties.
Clustering, summarization, association rules, and sequence discovery are usually viewed as
descriptive in nature.
Classification
Classification maps data into predefined groups or classes. It is often referred to as supervised
learning because the classes are determined before examining the data. Two examples of
classification applications are determining whether to make a bank loan and identifying credit
risks.
Pattern recognition is a type of classification where an input pattern is classified into one of
several classes based on its similarity to these predefined classes.
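The idea that classes are fixed before the data are examined can be sketched in code. The class profiles, feature values, and the nearest-class-mean rule below are all invented for illustration; they are not a method prescribed by the text.

```python
# Hypothetical illustration of classification: assign a loan applicant
# to one of two predefined classes ("low risk" / "high risk") by
# finding the closest class profile. All numbers are invented.

def classify(income, debt, class_means):
    """Return the label of the class whose mean profile is closest
    (squared Euclidean distance) to the applicant's (income, debt)."""
    best_class, best_dist = None, float("inf")
    for label, (mean_income, mean_debt) in class_means.items():
        dist = (income - mean_income) ** 2 + (debt - mean_debt) ** 2
        if dist < best_dist:
            best_class, best_dist = label, dist
    return best_class

# The classes and their profiles are fixed in advance -- this is what
# makes classification "supervised".
means = {"low risk": (80.0, 10.0), "high risk": (30.0, 40.0)}
print(classify(75.0, 12.0, means))  # → low risk
```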
Regression
Regression is used to map a data item to a real-valued prediction variable.
Regression assumes that the target data fit some known type of function (e.g., linear, logistic) and then determines the best function of this type that models the given data.
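For the linear case, "determining the best function of this type" amounts to a least-squares fit. A minimal sketch, with invented data points:

```python
# Fit y = a*x + b by ordinary least squares, the simplest instance of
# regression: assume a linear function, then find the best a and b.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4]          # invented sample data
ys = [2.1, 3.9, 6.2, 7.8]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # → 1.94 0.15
```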
In the accompanying example plots of three series X, Y, and Z, the plots for Y and Z show similar behaviour, while X appears to have less volatility.
Prediction
Prediction involves forecasting future data states based on past and current data. It can be viewed as a type of classification; the difference is that prediction concerns a future state rather than a current state. Prediction applications include flood prediction, speech recognition, machine learning, and pattern recognition.
Clustering
Clustering is similar to classification except that the groups are not predefined, but rather defined
by the data alone.
Clustering is alternatively referred to as unsupervised learning. A special type of clustering is called segmentation: with segmentation, a database is partitioned into disjoint groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering.
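The contrast with classification can be made concrete with k-means, a standard clustering algorithm (not one singled out by the text); here the two group centres emerge from the data alone. The one-dimensional values and initial centroids are invented:

```python
# A minimal k-means sketch (k = 2) on one-dimensional data. No classes
# are predefined: the groups are discovered from the data itself.

def kmeans_1d(points, centroids, iterations=10):
    clusters = [[], []]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[], []]
        for p in points:
            idx = min(range(2), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]   # invented data
cents, groups = kmeans_1d(points, [0.0, 5.0])
print(cents)  # → [1.5, 11.0]
```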
Summarization
Summarization maps data into subsets with associated simple descriptions. Summarization is also called
characterization or generalization. It extracts or derives representative information about the database.
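A simple associated description of this kind can be as basic as descriptive statistics over an attribute. A sketch, with invented salary data:

```python
# Summarization sketch: derive a simple representative description
# (count, mean, min, max) of a numeric attribute. Data are invented.

def summarize(values):
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

salaries = [42_000, 55_000, 61_000, 48_000]
print(summarize(salaries))
```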
Association Rules
An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that are
frequently purchased together.
A common example is the use of association rules in market basket analysis.
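The core of market basket analysis is counting item sets that co-occur in transactions. A minimal sketch that finds frequent item pairs by support count; the baskets and the support threshold are invented:

```python
# Count item pairs that co-occur across transactions and keep those
# meeting a minimum support count -- the first step toward deriving
# association rules. Transactions are invented for illustration.

from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=3))  # → {('bread', 'milk'): 3}
```

A rule such as "bread ⇒ milk" would then be derived from a frequent pair whose conditional frequency (confidence) is high enough.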
Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns
are based on a time sequence of actions. These patterns are similar to associations in that data (or events)
are found to be related, but the relationship is based on time.
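The time-based relationship can be sketched by counting how often one event is followed by another within each entity's time-ordered history. The event sequences below are invented:

```python
# Sequence-discovery sketch: count how often event A is later followed
# by event B within each customer's time-ordered history (not
# necessarily immediately). Histories are invented for illustration.

from collections import Counter

def followed_by(sequences):
    counts = Counter()
    for seq in sequences:
        seen = set()
        for event in seq:
            for earlier in seen:
                counts[(earlier, event)] += 1
            seen.add(event)
    return counts

histories = [
    ["browse", "cart", "buy"],
    ["browse", "buy"],
    ["cart", "browse", "buy"],
]
counts = followed_by(histories)
print(counts[("browse", "buy")])  # → 3
```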
Visualization techniques can be used to display the results of data mining:
• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based: Using figures, colors, or other icons can improve the presentation of the
results.
• Pixel-based: With these techniques each data value is shown as a uniquely colored
pixel.
• Hierarchical: These techniques hierarchically divide the display area (screen) into
regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.
Visualization tools can be used to summarize data as a data mining technique itself. The data
mining process itself is complex. The algorithms must be carefully applied to be effective.
Discovered patterns must be correctly interpreted and properly evaluated to ensure that the
resulting information is meaningful and accurate.
The current evolution of data mining functions and products is the result of years of influence
from many disciplines, including databases, information retrieval, statistics, algorithms, and
machine learning.
Table shows developments in the areas of artificial intelligence (AI), information retrieval (IR),
databases (DB), and statistics (Stat) leading to the current view of data mining.
• The primary objective of data mining is to describe some characteristics of a set of data by a general model; this approach can be viewed as a type of compression.
• An ongoing direction of data mining research is how to define a data mining query and whether
a query language (like SQL) can be developed to capture the many different types of data mining
queries.
• Uncovering hidden information about the data in a large database can be viewed as a type of approximation.
• When dealing with large databases, developing an abstract model can be thought of as a type of search problem, where database size affects efficiency.
The various data mining problems can be viewed from several different perspectives, based on the viewpoint and background of the researcher or developer:
1. Human interaction:
Interfaces may be needed with both domain and technical experts.
Technical experts are used to formulate the queries and assist in interpreting the
results.
Users are needed to identify training data and desired results.
2. Overfitting:
Overfitting occurs when the model does not fit future states. This may be caused by
assumptions that are made about the data or may simply be caused by the small size of
the training database.
Example:-
A classification model for an employee database may be developed to classify
employees as short, medium, or tall.
If the training database is quite small, the model might erroneously indicate that a short
person is anyone under five feet eight inches because there is only one entry in the
training database under five feet eight. In this case, many future employees would be
erroneously classified as short.
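The height example above can be sketched directly. The naive learning rule and the heights (in inches) are invented for illustration; the point is only that a threshold derived from one training example generalizes badly:

```python
# Overfitting sketch: a "short" threshold learned from a tiny training
# set. With only one example labeled short (5'8" = 68 in), the model
# decides that anyone at or under 68 inches is short.

def learn_short_threshold(training_heights):
    # Naive rule (invented for the sketch): "short" is anything up to
    # the tallest training example labeled short.
    return max(h for h, label in training_heights if label == "short")

training = [(68, "short"), (70, "medium"), (74, "tall")]
threshold = learn_short_threshold(training)

# A 5'7" (67 in) future employee is classified as short, as would be
# many people most observers would call medium.
print(67 <= threshold)  # → True
```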
3. Outliers:
There are often many data entries that do not fit nicely into the derived model. This
becomes even more of an issue with very large databases.
4. Interpretation of results:
Currently, data mining output may require experts to correctly interpret the results, which
might otherwise be meaningless to the average database user.
5. Visualization of results:
To easily view and understand the output of data mining algorithms, visualization of the
results is helpful.
6. Large datasets:
The massive datasets associated with data mining create problems when applying
algorithms designed for small datasets.
7. High dimensionality:
The dimensionality curse refers to the fact that many attributes (dimensions) are involved, and it is difficult to determine which ones should be used.
One solution to this high dimensionality problem is to reduce the number of
attributes, which is known as dimensionality reduction.
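One simple form of dimensionality reduction (a deliberately crude stand-in for techniques such as PCA, which the text does not specify) is to keep only the k attributes with the highest variance. The data matrix and k below are invented:

```python
# Dimensionality-reduction sketch: rank attributes (columns) by
# variance and keep the k most variable ones. Data are invented.

def variance(column):
    m = sum(column) / len(column)
    return sum((v - m) ** 2 for v in column) / len(column)

def top_k_attributes(rows, k):
    columns = list(zip(*rows))            # attribute-major view
    ranked = sorted(range(len(columns)),
                    key=lambda i: variance(columns[i]), reverse=True)
    return sorted(ranked[:k])             # indices of the kept attributes

rows = [
    (1.0, 100.0, 5.0),
    (1.1, 250.0, 5.0),
    (0.9, 175.0, 5.0),
]
# The constant third attribute (variance 0) is dropped.
print(top_k_attributes(rows, k=2))  # → [0, 1]
```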
8. Multimedia data:
The use of multimedia data, such as that found in GIS databases, complicates or invalidates
many proposed algorithms.
9. Missing data:
During the preprocessing phase of KDD, missing data may be replaced with estimates.
This and other approaches to handling missing data can lead to invalid results in the data
mining step.
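The simplest such estimate is mean imputation, sketched below with invented age data. As the text notes, this can bias the mining step, since the imputed values understate the attribute's true variability:

```python
# Missing-data sketch: replace missing values (None) in a numeric
# attribute with the mean of the known values. Data are invented.

def impute_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

ages = [25, None, 35, 40, None]
print(impute_mean(ages))  # both gaps filled with the mean of 25, 35, 40
```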
10. Irrelevant data:
Some attributes in the database might not be of interest to the data mining task being
developed.
11. Noisy data:
Some attribute values might be invalid or incorrect. These values are often corrected
before running data mining applications.
12. Changing data: Databases cannot be assumed to be static. However, most data mining
algorithms do assume a static database. This requires that the algorithm be completely
rerun anytime the database changes.
13. Integration: The KDD process is not currently integrated into normal data processing
activities. KDD requests may be treated as special, unusual, or one-time needs. This
makes them inefficient, ineffective, and not general enough to be used on an ongoing
basis.
14. Application: Determining the intended use for the information obtained from the data
mining function is a challenge. Indeed, how business executives can effectively use the
output is sometimes considered the more difficult part, not the running of the algorithms
themselves.