DATA MINING

Data mining involves the non-trivial extraction of useful information from large datasets, utilizing techniques such as classification, clustering, and regression. It has applications across various fields including business, science, and e-commerce, driven by the increasing volume and complexity of data. Key tasks in data mining include anomaly detection, association rule discovery, and sequential pattern discovery, all aimed at uncovering meaningful patterns and insights from data.


A. DEFINITION
Various definitions:
 Non-trivial extraction of nuggets from large amounts of data.
 Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
 Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Data mining is not:


 Generating multidimensional cubes of a relational table
 Searching for a phone number in a phone book
 Searching for keywords on Google
 Generating a histogram of salaries for different age groups
 Issuing SQL query to a database, and reading the reply

Data mining is:


 Finding groups of people with similar hobbies
 Answering questions such as “Are chances of getting cancer higher if you live near a power line?”

Prediction Methods versus Description Methods


Prediction methods: use some variables to predict unknown or future values of the same or other variables.
Description methods: find human-interpretable patterns that describe the data.

B. APPLICATIONS / REASONS FOR PREVALENCE


• Business
– Wal-Mart logs nearly 20 million transactions per day.
• Astronomy
– Telescopes are collecting large amounts of data (e.g. SDSS).
• Space
– NASA is collecting petabytes of data from satellites.
• Physics
– High-energy physics experiments are expected to generate 100 to 1,000 terabytes in the next decade.
• Retailers
– Scanner data is much more accurate than other means.
• E-Commerce
– Rich data on consumer browsing.
• Science
– The accuracy of sensors is improving.

Reasons for prevalence:
• The gap between data and analysts is increasing.
• Hidden information is not always evident.
• High cost of human labor.
• Much of the data is never analyzed at all.

C. AREAS DATA MINING DRAWS IDEAS FROM;


Machine Learning, Pattern Recognition, Statistics, and Database systems for applications that
have;
– Enormity of data
– High dimensionality of data
– Heterogeneous data
– Unstructured data

D. DATA MINING TASKS / TECHNIQUES


• Classification (predictive)
• Clustering (descriptive)
• Association Rule Discovery (descriptive)
• Sequential Pattern Discovery (descriptive)
• Regression (predictive)
• Deviation Detection (predictive)

(i) Regression
Predict the value of a given continuous valued variable based on the values of other variables,
assuming a linear or non-linear model of dependency.
• Extensively studied in the fields of Statistics and Neural Networks.
• Examples;
– Predicting sales numbers of a new product based on advertising expenditure.
– Predicting wind velocities based on temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
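As an illustrative sketch (not part of the original notes), a simple linear model can be fitted with the closed-form least-squares formulas; the advertising spend and sales figures below are invented:

```python
def fit_line(xs, ys):
    # Closed-form least-squares fit of y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Advertising expenditure vs. sales numbers (made-up data).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.1, 5.0, 7.1, 8.9]
a, b = fit_line(spend, sales)
print(round(a, 2), round(b, 2))  # → 1.95 1.15
```

The fitted slope and intercept can then predict sales for a new level of spend, which is exactly the "predict a continuous variable from other variables" task described above.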

(ii) Association rule discovery


Given a set of transactions, each of which is a set of items, find all rules (X -> Y) that satisfy user-specified minimum support and confidence constraints.

Example
- Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

Example
– {Bread} -> {Peanut Butter}
– {Jelly} -> {Peanut Butter}

Applications
– Cross selling and up selling
– Supermarket shelf management

Some rules discovered


– Bread -> Peanut Butter

• support=60%, confidence=75%
– Peanut Butter -> Bread
• support=60%, confidence=100%
– Jelly -> Peanut Butter
• support=20%, confidence=100%
– Jelly -> Milk
• support=0%
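These support and confidence figures can be reproduced on a toy basket data set; the five transactions below are invented so that the Bread -> Peanut Butter rule comes out at support 60% and confidence 75%:

```python
# Each transaction is the set of items in one basket (invented data).
transactions = [
    {"Bread", "Peanut Butter"},
    {"Bread", "Peanut Butter", "Milk"},
    {"Bread", "Jelly", "Peanut Butter"},
    {"Milk", "Beer"},
    {"Bread", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, the fraction also containing rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"Bread", "Peanut Butter"}))           # → 0.6
print(confidence({"Bread"}, {"Peanut Butter"}))      # → 0.75
```

The same two functions reproduce the other rules above, e.g. Jelly -> Peanut Butter has support 0.2 and confidence 1.0 on this data.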

Example in supermarket’s shelf management


Goal: To identify items that are bought together by a reasonable fraction of customers so that they can be shelved appropriately based on business goals.
• Data used: point-of-sale data collected with barcode scanners, used to find dependencies among products.
• Example
– If a customer buys Jelly, then he is very likely to buy Peanut Butter.
– So don’t be surprised if you find Peanut Butter next to Jelly on a supermarket aisle. Also, salsa next to tortilla chips.

(iii) Classification
Given a set of records (called the training set),
– Each record contains a set of attributes. One of the attributes is the class
• Find a model for the class attribute as a function of the values of other attributes
• Goal: Previously unseen records should be assigned to a class as accurately as possible
– Usually, the given data set is divided into training and test set, with training set used to build
the model and test set used to validate it. The accuracy of the model is determined on the test set.
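The train/test workflow above can be sketched with a deliberately simple 1-nearest-neighbour classifier; the single feature (a hypothetical customer "interaction score") and all its values are invented:

```python
def nearest_neighbor(train, x):
    # Predict the label of the training record whose feature value is closest to x.
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

# (interaction score, class) pairs; the class attribute is {buy, didn't buy}.
train = [(1.0, "didn't buy"), (1.5, "didn't buy"), (8.0, "buy"), (9.0, "buy")]
test = [(2.0, "didn't buy"), (8.5, "buy")]

# Build the "model" from the training set, then measure accuracy on the test set.
accuracy = sum(nearest_neighbor(train, x) == label for x, label in test) / len(test)
print(accuracy)  # → 1.0
```

Real classifiers use many attributes and more sophisticated models, but the split into a training set (to build the model) and a test set (to validate it) is the same.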

Example (direct marketing)


– Use the profiles of customers along with their {buy, didn’t buy} decision. The latter becomes the class attribute.
– The profile information may consist of demographic, psychographic and company-interaction attributes:
• Demographic – age, gender, geography, salary
• Psychographic – hobbies
• Company interaction – recency, frequency, monetary value
– Use this information as input attributes to learn a classifier model.

Example (fraud detection)


– Label past transactions as {fraud, fair} transactions.
This forms the class attribute
– Learn a model for the class of transactions
– Use this model to detect fraud by observing credit card transactions on an account

(iv) Clustering
Determine object groupings such that objects within the same cluster are similar to each other,
while objects in different groups are not.
Unlike in classification, the classes are not known in advance.

Example (market segmentation)


– Collect different attributes of customers based on their geographical and lifestyle related
information
– Find clusters of similar customers
– Measure the clustering quality by observing the buying patterns of customers in the same
cluster vs. those from different clusters

Example (document clustering)


To find groups of documents that are similar to each other based on important
terms appearing in them
• Approach: To identify frequently occurring terms in each document. Form a similarity
measure based on frequencies of different terms. Use it to generate clusters
• Gain: Information Retrieval can utilize the clusters to relate a new document or search term to
clustered documents
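The grouping idea can be sketched with a minimal one-dimensional k-means; the customer "spend" values and the initial centers are invented, and real implementations use many attributes and smarter initialization:

```python
def kmeans_1d(points, centers, iters=10):
    # Minimal 1-D k-means: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups of customer spend values (made-up data).
spend = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans_1d(spend, [0.0, 5.0])
print([round(c, 1) for c in centers])  # → [1.0, 9.0]
```

Points within a cluster end up close to their center (similar to each other), while the two centers stay far apart, matching the definition of clustering above.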

(v) Deviation / Anomaly Detection


• Some data objects do not comply with the general behavior or model of the data. Data
objects that are different from or inconsistent with the remaining set are called outliers
• Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity.
• Goal of Deviation / Anomaly Detection is to detect significant deviations from normal
behavior

Given a set of n data points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with the remaining data.
• This can be viewed as two sub-problems:
– Define what data can be considered inconsistent in a given data set.
– Find an efficient method to mine the outliers so defined.

Example (Credit Card Fraud Detection)


• Goal: To detect fraudulent credit card transactions
• Approach:
– Based on past usage patterns, develop a model for authorized credit card transactions
– Check for deviations from the model before authenticating new credit card transactions
– Hold payment and verify authenticity of “doubtful” transactions by other means
(phone call, etc.)
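A minimal sketch of such deviation detection uses a z-score rule: flag any amount more than two standard deviations from the mean. The transaction amounts and the threshold are illustrative choices, not part of the notes:

```python
from statistics import mean, stdev

# Transaction amounts on one account, with one suspicious value (made-up data).
amounts = [20.0, 35.0, 25.0, 30.0, 22.0, 500.0]

m, s = mean(amounts), stdev(amounts)
# Flag amounts whose z-score exceeds 2 as "doubtful".
outliers = [a for a in amounts if abs(a - m) / s > 2]
print(outliers)  # → [500.0]
```

Production systems model richer usage patterns (merchant, location, time of day), but the core step is the same: score how far a new transaction deviates from the learned model of normal behavior.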

(vi) Sequential Pattern Discovery:


• Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Example
Telecommunication alarm logs
– (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) -> (Fire_Alarm)
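A minimal sketch of measuring one such sequential dependency: count the fraction of alarm logs in which a fire alarm occurs after a rectifier alarm. The three logs below are invented:

```python
# Each log is one object's timeline of events, in order (made-up data).
logs = [
    ["inverter_problem", "rectifier_alarm", "fire_alarm"],
    ["rectifier_alarm", "fire_alarm"],
    ["rectifier_alarm", "power_ok"],
]

def follows(log, a, b):
    # True if event b occurs somewhere after the first occurrence of event a.
    return a in log and b in log[log.index(a) + 1:]

support = sum(follows(log, "rectifier_alarm", "fire_alarm") for log in logs) / len(logs)
print(round(support, 2))  # → 0.67
```

Full sequential pattern mining algorithms (e.g. over time windows and multi-event patterns) generalize this counting step, but the notion of "event B follows event A often enough" is the same.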

E. DATA SETS
(i) Contents of data sets
 Attributes (describe objects): also called variable, field, characteristic, feature or observation
 Objects (have attributes): also called record, point, case, sample, entity or item
 Data set: a collection of objects

(ii) Data types


Continuous
Discrete (integers)
Binary
Ordinal: takes specific values, and order is important, e.g. class (1, 2, 3)
Nominal: takes specific values, and order is not important, e.g. gender (“male”, “female”)
Interval: differences between values are meaningful, but there is no true zero point, e.g. temperature in °C
Ratio: both differences and ratios are meaningful, with a true zero point, e.g. length, counts

(iii) Data sets issues


Noise and outliers
Noise: random modification of original values.
Outliers: a small number of points with characteristics different from the rest of the data.
Missing values
Duplicate data
Inconsistent values

(iv) Preprocessing
What preprocessing step can or should we apply to the data to make it more suitable for data
mining?
Aggregation
Sampling
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Discretization and Binarization

Attribute Transformation

(I) Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or object).
For example, merging daily sales figures to obtain monthly sales figures.
Why aggregation? Data reduction: smaller data sets allow the use of more expensive algorithms.
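The daily-to-monthly example can be sketched directly, assuming dates are stored as "YYYY-MM-DD" strings; the sales figures are invented:

```python
from collections import defaultdict

# Daily sales records as (date, amount) pairs (made-up data).
daily = [("2024-01-03", 120.0), ("2024-01-15", 80.0), ("2024-02-01", 200.0)]

monthly = defaultdict(float)
for date, amount in daily:
    monthly[date[:7]] += amount  # aggregate by the "YYYY-MM" prefix

print(dict(monthly))  # → {'2024-01': 200.0, '2024-02': 200.0}
```

Three daily records become two monthly ones, which is the data reduction the notes mention.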

(II) Sampling
Sampling is the process of understanding characteristics of data or models based on a subset of
the original data. It is used extensively in all aspects of data exploration and mining.
Why sampling? Obtaining the entire set of “data of interest” is often too expensive or time-consuming, and may not even be necessary (and hence a waste of resources).
A sample is representative for a particular operation if it results in approximately the same outcome as if the entire data set were used.
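As a sketch of simple random sampling, the snippet below checks that a sample's mean approximates the population mean; the population is just the integers 0–999, chosen for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1000))
sample = random.sample(population, 100)  # random sampling without replacement

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
# A representative sample yields approximately the same mean as the full data.
print(pop_mean, round(sample_mean, 1))
```

Here the sample is 10% of the data, yet its mean lands close to the population mean, illustrating why operating on a sample is often good enough.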

(III) Dimension reduction


 Curse of dimensionality: Data analysis becomes significantly harder as the dimensionality
of the data increases.
 Determining dimensions (or combinations of dimensions) that are important for modeling
 Why dimensionality reduction?
o Many data mining algorithms work better if the dimensionality of data (i.e.
number of attributes) is lower.
o Also, allows the data to be more easily visualized.
o If dimensionality reduction eliminates irrelevant features or reduces noise, then
quality of results may improve.
o This can lead to a more understandable model.
 Redundant features duplicate much or all of the information contained in one or more other attributes.
o E.g. the purchase price of a product and the sales tax paid contain the same information.
 Irrelevant features contain no information that is useful for the data mining task at hand.
o E.g. student ID numbers would be irrelevant to the task of predicting their GPA.
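The price/sales-tax redundancy can be demonstrated by computing the Pearson correlation between the two features; the prices and the flat 8% tax rate below are invented:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Purchase price vs. sales tax at a flat 8% rate: perfectly redundant.
price = [10.0, 20.0, 30.0, 40.0]
tax = [p * 0.08 for p in price]
print(round(pearson(price, tax), 3))  # → 1.0
```

A correlation of 1.0 means one feature carries no information beyond the other, so either can be dropped without losing anything, reducing the dimensionality by one.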

(IV)Feature creation
 Sometimes a small number of new attributes can capture the important information in a data set much more efficiently than the original attributes.
 Also, the number of new attributes is often smaller than the number of original attributes, so we get the benefits of dimensionality reduction.
 Three general methodologies:
o Feature Extraction
o Mapping the Data to a New Space
o Feature Construction

Feature extraction

 One approach to dimensionality reduction is feature extraction: the creation of a new, smaller set of features from the original set of features.
 For example, consider a set of photographs, where each photograph is to be classified as containing a human face or not.
 The raw data is a set of pixels, and as such is not suitable for many classification algorithms.
 However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges or areas correlated with the presence of human faces, then a broader set of classification techniques can be applied to the problem.

Mapping the Data to a New Space


 Sometimes, a totally different view of the data can reveal important and interesting features.
 Example: applying a Fourier transform to time series data to detect periodic patterns.

Feature Construction
 Sometimes features have the necessary information, but not in the form necessary for the
data mining algorithm. In this case, one or more new features constructed out of the original
features may be useful.
 Example: suppose there are two attributes that record the volume and mass of a set of objects.
 Suppose there exists a classification model based on the material of which the objects are constructed.
 Then a density feature (mass divided by volume) constructed from the original two features would help classification.
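The volume/mass example can be sketched directly; the measurements below are invented, with two of the three objects sharing a material so that their densities coincide:

```python
# Hypothetical measurements: volume (cm^3) and mass (g) of three objects.
volumes = [2.0, 1.0, 3.0]
masses = [15.8, 2.7, 23.7]

# Constructed feature: density = mass / volume, characteristic of the material.
densities = [m / v for m, v in zip(masses, volumes)]
print([round(d, 2) for d in densities])  # → [7.9, 2.7, 7.9]
```

Neither volume nor mass alone separates the materials, but the constructed density feature makes objects 1 and 3 (same material) immediately identifiable.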

Discretization and Binarization


 Discretization is the process of converting a continuous attribute to a discrete attribute.
 A common example is rounding off real numbers to integers.
 Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to convert continuous attributes into categorical and/or binary attributes.
 It is straightforward to convert categorical attributes into discrete or binary attributes.
 Transforming a continuous attribute into a categorical attribute involves:
o Deciding how many categories to have.
o Deciding how to map the values of the continuous attribute to these categories.
One such method is the entropy-based method.
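A sketch of one common scheme, equal-width binning (the entropy-based method mentioned above chooses split points differently); the age values are invented:

```python
def discretize(values, n_bins):
    # Equal-width binning: map each value to a bin index in 0..n_bins-1,
    # clamping the maximum value into the last bin.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 15, 22, 39, 44, 67]
print(discretize(ages, 3))  # → [0, 0, 0, 1, 1, 2]
```

This answers both sub-decisions above in the simplest possible way: the number of categories is fixed in advance, and values map to bins of equal width over the attribute's range.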
