DATA MINING

Data mining involves the non-trivial extraction of useful information from large datasets, utilizing techniques such as classification, clustering, and regression. It has applications across various fields including business, science, and e-commerce, driven by the increasing volume and complexity of data. Key tasks in data mining include anomaly detection, association rule discovery, and sequential pattern discovery, all aimed at uncovering meaningful patterns and insights from data.


A. DEFINITION
Various definitions:
 Non-trivial extraction of nuggets from large amounts of data.
 Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
 Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Data mining is not:


 Generating multidimensional cubes of a relational table
 Searching for a phone number in a phone book
 Searching for keywords on Google
 Generating a histogram of salaries for different age groups
 Issuing SQL query to a database, and reading the reply

Data mining is:


 Finding groups of people with similar hobbies
 Answering questions such as “Are chances of getting cancer higher if you live near a power line?”

Prediction Methods versus Description Methods


Prediction methods: use some variables to predict unknown or future values of the same or other variables.
Description methods: find human-interpretable patterns that describe the data.

B. APPLICATIONS / REASONS FOR PREVALENCE


• Business
– Wal-Mart logs nearly 20 million transactions per day.
• Astronomy
– Telescopes are collecting large amounts of data (e.g. SDSS).
• Space
– NASA is collecting petabytes of data from satellites.
• Physics
– High-energy physics experiments are expected to generate 100 to 1,000 terabytes in the next decade.
• Retailers
– Scanner data is much more accurate than other means.
• E-Commerce
– Rich data on consumer browsing.
• Science
– The accuracy of sensors is improving.

Reasons for prevalence:
• The gap between data and analysts is increasing.
• Hidden information is not always evident.
• High cost of human labor.
• Much of the data is never analyzed at all.

C. AREAS DATA MINING DRAWS IDEAS FROM;


Machine Learning, Pattern Recognition, Statistics, and Database systems for applications that
have;
– Enormity of data
– High dimensionality of data
– Heterogeneous data
– Unstructured data

D. DATA MINING TASKS / TECHNIQUES


• Classification (predictive)
• Clustering (descriptive)
• Association Rule Discovery (descriptive)
• Sequential Pattern Discovery (descriptive)
• Regression (predictive)
• Deviation Detection (predictive)

(i) Regression
Predict the value of a given continuous valued variable based on the values of other variables,
assuming a linear or non-linear model of dependency.
• Extensively studied in the fields of Statistics and Neural Networks.
• Examples;
– Predicting sales numbers of a new product based on advertising expenditure.
– Predicting wind velocities based on temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
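As an illustrative sketch (not part of the original notes), a simple linear model can be fitted with the closed-form least-squares formulas; the advertising spend and sales figures below are invented:

```python
def fit_line(xs, ys):
    # Closed-form least-squares fit of y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Advertising expenditure vs. sales numbers (made-up data).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 5.0, 7.1, 8.9]
a, b = fit_line(spend, sales)
print(round(a, 2), round(b, 2))  # → 1.95 1.15
```

The fitted slope and intercept can then predict sales for a new level of spend, which is exactly the "predict a continuous variable from other variables" task described above.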

(ii) Association rule discovery


Given a set of transactions, each of which is a set of items, find all rules (X -> Y) that satisfy user-specified minimum support and confidence constraints.

Example
- Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

Example
– {Bread} -> {Peanut Butter}
– {Jelly} -> {Peanut Butter}

Applications
– Cross selling and up selling
– Supermarket shelf management

Some rules discovered


– Bread -> Peanut Butter

• support=60%, confidence=75%
– Peanut Butter -> Bread
• support=60%, confidence=100%
– Jelly -> Peanut Butter
• support=20%, confidence=100%
– Jelly -> Milk
• support=0%
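These support and confidence figures can be reproduced on a toy basket data set; the five transactions below are invented so that the Bread -> Peanut Butter rule comes out at support 60% and confidence 75%:

```python
# Each transaction is the set of items in one basket (invented data).
transactions = [
    {"Bread", "Peanut Butter"},
    {"Bread", "Peanut Butter", "Milk"},
    {"Bread", "Jelly", "Peanut Butter"},
    {"Milk", "Beer"},
    {"Bread", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction also containing rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"Bread", "Peanut Butter"}))           # → 0.6
print(confidence({"Bread"}, {"Peanut Butter"}))      # → 0.75
```

The same two functions reproduce the other rules above, e.g. Jelly -> Peanut Butter has support 0.2 and confidence 1.0 on this data.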

Example in supermarket’s shelf management


Goal: To identify items that are bought together by a reasonable fraction of customers so that they can be shelved appropriately based on business goals.
• Data used: point-of-sale data collected with barcode scanners, used to find dependencies among products.
• Example
– If a customer buys Jelly, then he is very likely to buy Peanut Butter.
– So don’t be surprised if you find Peanut Butter next to Jelly on a supermarket aisle. Also, salsa next to tortilla chips.

(iii) Classification
Given a set of records (called the training set),
– Each record contains a set of attributes. One of the attributes is the class
• Find a model for the class attribute as a function of the values of other attributes
• Goal: Previously unseen records should be assigned to a class as accurately as possible
– Usually, the given data set is divided into training and test set, with training set used to build
the model and test set used to validate it. The accuracy of the model is determined on the test set.
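The train/test workflow above can be sketched with a deliberately simple 1-nearest-neighbour classifier; the single feature (a hypothetical customer "interaction score") and all its values are invented:

```python
def nearest_neighbor(train, x):
    # Predict the label of the training record whose feature value is closest to x.
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

# (interaction score, class) pairs; the class attribute is {buy, didn't buy}.
train = [(1.0, "didn't buy"), (1.5, "didn't buy"), (8.0, "buy"), (9.0, "buy")]
test = [(2.0, "didn't buy"), (8.5, "buy")]

# Build the "model" from the training set, then measure accuracy on the test set.
accuracy = sum(nearest_neighbor(train, x) == label for x, label in test) / len(test)
print(accuracy)  # → 1.0
```

Real classifiers use many attributes and more sophisticated models, but the split into a training set (to build the model) and a test set (to validate it) is the same.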

Example (direct marketing)


– Use the profiles of customers along with their {buy, didn’t buy} decision. The latter becomes the class attribute.
– The profile information may consist of demographic, psychographic and company-interaction attributes:
• Demographic – age, gender, geography, salary
• Psychographic – hobbies
• Company interaction – recency, frequency, monetary value
– Use this information as input attributes to learn a classifier model.

Example (fraud detection)


– Label past transactions as {fraud, fair} transactions.
This forms the class attribute
– Learn a model for the class of transactions
– Use this model to detect fraud by observing credit card transactions on an account

(iv) Clustering
Determine object groupings such that objects within the same cluster are similar to each other,
while objects in different groups are not.
Unlike in classification, the classes are not known in advance.

Example (market segmentation)


– Collect different attributes of customers based on their geographical and lifestyle related
information
– Find clusters of similar customers
– Measure the clustering quality by observing the buying patterns of customers in the same
cluster vs. those from different clusters

Example (document clustering)


To find groups of documents that are similar to each other based on important
terms appearing in them
• Approach: To identify frequently occurring terms in each document. Form a similarity
measure based on frequencies of different terms. Use it to generate clusters
• Gain: Information Retrieval can utilize the clusters to relate a new document or search term to
clustered documents
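The grouping idea can be sketched with a minimal one-dimensional k-means; the customer "spend" values and the initial centers are invented, and real implementations use many attributes and smarter initialization:

```python
def kmeans_1d(points, centers, iters=10):
    # Minimal 1-D k-means: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups of customer spend values (made-up data).
spend = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans_1d(spend, [0.0, 5.0])
print([round(c, 1) for c in centers])  # → [1.0, 9.0]
```

Points within a cluster end up close to their center (similar to each other), while the two centers stay far apart, matching the definition of clustering above.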

(v) Deviation / Anomaly Detection


• Some data objects do not comply with the general behavior or model of the data. Data
objects that are different from or inconsistent with the remaining set are called outliers
• Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity.
• Goal of Deviation / Anomaly Detection is to detect significant deviations from normal
behavior

Given a set of n data points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with the remaining data.
• This can be viewed as two sub-problems:
– Define what data can be considered inconsistent in a given data set.
– Find an efficient method to mine the outliers so defined.

Example (Credit Card Fraud Detection)


• Goal: To detect fraudulent credit card transactions
• Approach:
– Based on past usage patterns, develop a model for authorized credit card transactions
– Check for deviations from the model before authenticating new credit card transactions
– Hold payment and verify authenticity of “doubtful” transactions by other means
(phone call, etc.)
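A minimal sketch of such deviation detection uses a z-score rule: flag any amount more than two standard deviations from the mean. The transaction amounts and the threshold are illustrative choices, not part of the notes:

```python
from statistics import mean, stdev

# Transaction amounts on one account, with one suspicious value (made-up data).
amounts = [20.0, 35.0, 25.0, 30.0, 22.0, 500.0]

m, s = mean(amounts), stdev(amounts)
# Flag amounts whose z-score exceeds 2 as "doubtful".
outliers = [a for a in amounts if abs(a - m) / s > 2]
print(outliers)  # → [500.0]
```

Production systems model richer usage patterns (merchant, location, time of day), but the core step is the same: score how far a new transaction deviates from the learned model of normal behavior.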

(vi) Sequential Pattern Discovery:


• Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Example
Telecommunication alarm logs
– (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) -> (Fire_Alarm)
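A minimal sketch of measuring one such sequential dependency: count the fraction of alarm logs in which a fire alarm occurs after a rectifier alarm. The three logs below are invented:

```python
# Each log is one object's timeline of events, in order (made-up data).
logs = [
    ["inverter_problem", "rectifier_alarm", "fire_alarm"],
    ["rectifier_alarm", "fire_alarm"],
    ["rectifier_alarm", "power_ok"],
]

def follows(log, a, b):
    # True if event b occurs somewhere after the first occurrence of event a.
    return a in log and b in log[log.index(a) + 1:]

support = sum(follows(log, "rectifier_alarm", "fire_alarm") for log in logs) / len(logs)
print(round(support, 2))  # → 0.67
```

Full sequential pattern mining algorithms (e.g. over time windows and multi-event patterns) generalize this counting step, but the notion of "event B follows event A often enough" is the same.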

E. DATA SETS
(i) Contents of data sets
 Attributes (describe objects): also called variable, field, characteristic, feature or observation
 Objects (have attributes): also called record, point, case, sample, entity or item
 Data set: a collection of objects

(ii) Data types


Continuous
Discrete (integers)
Binary
Ordinal: takes specific values, and order is important, e.g. class (1, 2, 3)
Nominal: takes specific values, and order is not important, e.g. gender (“male”, “female”)
Interval: differences between values are meaningful, but there is no true zero point, e.g. temperature in °C
Ratio: both differences and ratios are meaningful, with a true zero point, e.g. length, counts

(iii) Data sets issues


Noise and outliers
Noise: random modification of original values.
Outliers: a small number of points with characteristics different from the rest of the data.
Missing values
Duplicate data
Inconsistent values

(iv) Preprocessing
What preprocessing step can or should we apply to the data to make it more suitable for data
mining?
Aggregation
Sampling
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Discretization and Binarization

Attribute Transformation

(I) Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object).
For example, merging daily sales figures to obtain monthly sales figures.
Why aggregation? Data reduction: smaller data sets allow the use of more expensive algorithms.
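The daily-to-monthly example can be sketched directly, assuming dates are stored as "YYYY-MM-DD" strings; the sales figures are invented:

```python
from collections import defaultdict

# Daily sales records as (date, amount) pairs (made-up data).
daily = [("2024-01-03", 120.0), ("2024-01-15", 80.0), ("2024-02-01", 200.0)]

monthly = defaultdict(float)
for date, amount in daily:
    monthly[date[:7]] += amount  # aggregate by the "YYYY-MM" prefix

print(dict(monthly))  # → {'2024-01': 200.0, '2024-02': 200.0}
```

Three daily records become two monthly ones, which is the data reduction the notes mention.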

(II) Sampling
Sampling is the process of understanding characteristics of data or models based on a subset of
the original data. It is used extensively in all aspects of data exploration and mining.
Why sampling? Obtaining the entire set of “data of interest” is often too expensive or time-consuming, and may not even be necessary (and hence a waste of resources).
A sample is representative for a particular operation if it results in approximately the same outcome as if the entire data set were used.
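As a sketch of simple random sampling, the snippet below checks that a sample's mean approximates the population mean; the population is just the integers 0–999, chosen for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1000))
sample = random.sample(population, 100)  # random sampling without replacement

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
# A representative sample yields approximately the same mean as the full data.
print(pop_mean, round(sample_mean, 1))
```

Here the sample is 10% of the data, yet its mean lands close to the population mean, illustrating why operating on a sample is often good enough.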

(III) Dimension reduction


 Curse of dimensionality: Data analysis becomes significantly harder as the dimensionality
of the data increases.
 Determining dimensions (or combinations of dimensions) that are important for modeling
 Why dimensionality reduction?
o Many data mining algorithms work better if the dimensionality of data (i.e.
number of attributes) is lower.
o Also, allows the data to be more easily visualized.
o If dimensionality reduction eliminates irrelevant features or reduces noise, then
quality of results may improve.
o This can lead to a more understandable model.
 Redundant features duplicate much or all of the information contained in one or more other attributes.
o E.g. the purchase price of a product and the sales tax paid contain the same information.
 Irrelevant features contain no information that is useful for the data mining task at hand.
o E.g. student ID numbers would be irrelevant to the task of predicting their GPA.
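The price/sales-tax redundancy can be demonstrated by computing the Pearson correlation between the two features; the prices and the flat 8% tax rate below are invented:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Purchase price vs. sales tax at a flat 8% rate: perfectly redundant.
price = [10.0, 20.0, 30.0, 40.0]
tax = [p * 0.08 for p in price]
print(round(pearson(price, tax), 3))  # → 1.0
```

A correlation of 1.0 means one feature carries no information beyond the other, so either can be dropped without losing anything, reducing the dimensionality by one.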

(IV)Feature creation
 Sometimes a small number of new attributes can capture the important information in a data set much more efficiently than the original attributes.
 Also, the number of new attributes is often smaller than the number of original attributes, so we get the benefits of dimensionality reduction.
 Three general methodologies:
o Feature Extraction
o Mapping the Data to a New Space
o Feature Construction

Feature extraction

 One approach to dimensionality reduction is feature extraction: the creation of a new, smaller set of features from the original set of features.
 For example, consider a set of photographs, where each photograph is to be classified as containing a human face or not.
 The raw data is a set of pixels, and as such is not suitable for many classification algorithms.
 However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges or areas correlated with the presence of human faces, then a broader set of classification techniques can be applied to the problem.

Mapping the Data to a New Space


 Sometimes, a totally different view of the data can reveal important and interesting features.
 Example: applying a Fourier transform to time series data to detect periodic patterns.

Feature Construction
 Sometimes features have the necessary information, but not in the form necessary for the
data mining algorithm. In this case, one or more new features constructed out of the original
features may be useful.
 Example: suppose there are two attributes that record the volume and mass of a set of objects.
 Suppose there exists a classification model based on the material of which the objects are constructed.
 Then a density feature (mass divided by volume) constructed from the original two features would help classification.
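The volume/mass example can be sketched directly; the measurements below are invented, with two of the three objects sharing a material so that their densities coincide:

```python
# Hypothetical measurements: volume (cm^3) and mass (g) of three objects.
volumes = [2.0, 1.0, 3.0]
masses = [15.8, 2.7, 23.7]

# Constructed feature: density = mass / volume, characteristic of the material.
densities = [m / v for m, v in zip(masses, volumes)]
print([round(d, 2) for d in densities])  # → [7.9, 2.7, 7.9]
```

Neither volume nor mass alone separates the materials, but the constructed density feature makes objects 1 and 3 (same material) immediately identifiable.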

Discretization and Binarization


 Discretization is the process of converting a continuous attribute to a discrete attribute.
 A common example is rounding off real numbers to integers.
 Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to convert continuous attributes into categorical and/or binary attributes.
 It is straightforward to convert categorical attributes into discrete or binary attributes.
 Transforming a continuous attribute into a categorical attribute involves:
o Deciding how many categories to have.
o Deciding how to map the values of the continuous attribute to these categories.
One such method is the entropy-based method.
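A sketch of one common scheme, equal-width binning (the entropy-based method mentioned above chooses split points differently); the age values are invented:

```python
def discretize(values, n_bins):
    # Equal-width binning: map each value to a bin index in 0..n_bins-1,
    # clamping the maximum value into the last bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 15, 22, 39, 44, 67]
print(discretize(ages, 3))  # → [0, 0, 0, 1, 1, 2]
```

This answers both sub-decisions above in the simplest possible way: the number of categories is fixed in advance, and values map to bins of equal width over the attribute's range.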
