DATA MINING
A. DEFINITION
Various definitions:
Non-trivial extraction of nuggets from large amounts of data.
Non-trivial extraction of implicit, previously unknown and potentially useful information
from data.
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data
in order to discover meaningful patterns.
(i) Regression
Predict the value of a given continuous valued variable based on the values of other variables,
assuming a linear or non-linear model of dependency.
• Extensively studied in the fields of Statistics and Neural Networks.
• Examples:
– Predicting sales numbers of a new product based on advertising expenditure.
– Predicting wind velocities based on temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
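The first example above (sales versus advertising expenditure) can be sketched with a one-variable linear model fitted by ordinary least squares. This is only an illustration: the figures in ad_spend and units_sold are made up, and fit_line is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

# Predict the (continuous) sales value for an unseen expenditure level
predicted_sales = slope * 6.0 + intercept
print(f"sales = {slope:.2f} * spend + {intercept:.2f}")
print(f"predicted sales at spend 6.0: {predicted_sales:.2f}")
```

A non-linear model of dependency would replace the straight line with another function, but the idea of predicting a continuous value from other variables stays the same.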
(ii) Association rule discovery
Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that predict the occurrence of an item based on occurrences of
other items.
Example
– {Bread} -> {Peanut Butter}
– {Jelly} -> {Peanut Butter}
Applications
– Cross selling and up selling
– Supermarket shelf management
– Bread -> Peanut Butter
• support=60%, confidence=75%
– Peanut Butter -> Bread
• support=60%, confidence=100%
– Jelly -> Peanut Butter
• support=20%, confidence=100%
– Jelly -> Milk
• support=0%
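Support and confidence can be computed directly from a list of transactions. The sketch below assumes the usual definitions: support of a rule is the fraction of transactions containing all of its items, and confidence is that support divided by the support of the left-hand side. The five transactions are an assumption, chosen so that they are consistent with the figures quoted above; the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    # Confidence is undefined when the antecedent never occurs
    conf = both / base if base > 0 else None
    return both, conf


# Assumed market-basket data (not given in the notes)
transactions = [
    {"Bread", "Jelly", "Peanut Butter"},
    {"Bread", "Peanut Butter"},
    {"Bread", "Milk", "Peanut Butter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

rules = [
    ({"Bread"}, {"Peanut Butter"}),
    ({"Peanut Butter"}, {"Bread"}),
    ({"Jelly"}, {"Peanut Butter"}),
    ({"Jelly"}, {"Milk"}),
]

for lhs, rhs in rules:
    s, c = rule_stats(transactions, lhs, rhs)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support={s:.0%}, confidence={c:.0%}")
```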
(iii) Classification
Given a set of records (called the training set),
– Each record contains a set of attributes. One of the attributes is the class.
• Find a model for the class attribute as a function of the values of other attributes
• Goal: Previously unseen records should be assigned to a class as accurately as possible
– Usually, the given data set is divided into a training set and a test set: the training set is
used to build the model and the test set to validate it. The accuracy of the model is determined
on the test set.
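The train/test procedure can be sketched as follows. A 1-nearest-neighbour rule stands in for the model here; it is only one possible classifier, chosen because it fits in a few lines. The records, the 70/30 split and all names are assumptions made for the illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two attribute tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train, attributes):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    _, label = min(train, key=lambda rec: sq_dist(rec[0], attributes))
    return label


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test if predict(train, attrs) == label)
    return correct / len(test)


# Hypothetical records: (attributes, class)
records = [((x, y), "low") for x in range(5) for y in range(5)]
records += [((x, y), "high") for x in range(10, 15) for y in range(10, 15)]

# Shuffle, then keep 70% for building the model and 30% for validating it
rng = random.Random(0)
rng.shuffle(records)
split = len(records) * 7 // 10
train_set, test_set = records[:split], records[split:]

test_accuracy = accuracy(train_set, test_set)
print(f"{len(train_set)} training records, {len(test_set)} test records")
print(f"accuracy on the test set: {test_accuracy:.0%}")
```

The accuracy is measured only on records the model has not seen, which is the point of holding out a test set.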
(iv) Clustering
Determine object groupings such that objects within the same cluster are similar to each other,
while objects in different groups are not.
Unlike classification, the classes are not known in advance.
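One common way to find such groupings is k-means, sketched below. The notes do not name a specific algorithm, so this is an assumed choice; the points, the number of clusters and the initial centroids are also made up for the illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points]


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the old centroid if a cluster ends up empty
            new_centroids.append(mean_point(members) if members else old)
        centroids = new_centroids
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical unlabeled points forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

No class attribute is supplied; the cluster labels are produced by the algorithm from similarity alone.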
(v) Outlier detection
Given a set of n data points or objects, and k, the expected number of outliers, find the
top k objects that are considerably dissimilar, exceptional or inconsistent with the remaining data.
• This can be viewed as two sub problems.
– Define what data can be considered as inconsistent in a given data set.
– Find an efficient method to mine the outliers so defined.
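The two sub-problems can be made concrete with a small sketch. Here "inconsistent" is defined, as an assumption, by the distance from an object to its nearest neighbour, and the mining step is a brute-force ranking that is simple rather than efficient. The data and function name are hypothetical.

```python
def top_k_outliers(points, k):
    """Return the k points whose nearest neighbour is farthest away."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def score(i):
        # Sub-problem 1: outlier score = distance to the nearest other point
        return min(dist(points[i], q) for j, q in enumerate(points) if j != i)

    # Sub-problem 2: rank all objects by score (brute force, O(n^2))
    ranked = sorted(range(len(points)), key=score, reverse=True)
    return [points[i] for i in ranked[:k]]


# Hypothetical data: a tight group plus two isolated objects
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
          (9.0, 9.0), (1.0, 0.8), (-6.0, 4.0)]

outliers = top_k_outliers(points, k=2)
print(outliers)
```

For large n, an index structure or a sampling scheme would be needed to avoid comparing every pair of objects.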
(vi) Sequential pattern discovery
Example
Telecommunication alarm logs
– (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) -> (Fire_Alarm)
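A rule of this form says that the events on the left, occurring in that order, tend to be followed by the event on the right. A basic building block for mining such rules is a check of whether an ordered pattern occurs in a log. The sketch below assumes a log is a time-ordered list of sets of simultaneous alarms; the Door_Open alarm and both logs are made up.

```python
def contains_sequence(event_sequence, pattern):
    """True if each itemset of `pattern` is a subset of some element of
    `event_sequence`, with the matched elements appearing in the same order."""
    position = 0
    for itemset in pattern:
        # Advance to the next element that contains the whole itemset
        while position < len(event_sequence) and not itemset <= event_sequence[position]:
            position += 1
        if position == len(event_sequence):
            return False
        position += 1  # the next itemset must match a strictly later element
    return True


pattern = [
    {"Inverter_Problem", "Excessive_Line_Current"},
    {"Rectifier_Alarm"},
    {"Fire_Alarm"},
]

# Hypothetical alarm logs, oldest event first
log_a = [
    {"Inverter_Problem", "Excessive_Line_Current"},
    {"Door_Open"},
    {"Rectifier_Alarm"},
    {"Fire_Alarm"},
]
log_b = [
    {"Fire_Alarm"},
    {"Inverter_Problem", "Excessive_Line_Current"},
    {"Rectifier_Alarm"},
]

print(contains_sequence(log_a, pattern))
print(contains_sequence(log_b, pattern))
```

Counting how many logs contain the pattern gives its support, in the same spirit as for association rules.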
E. DATA SETS
(i) Contents of data sets
Attributes (describe objects)
Variable, field, characteristic, feature or observation
Objects (have attributes)
Record, point, case, sample, entity or item
Data Set
Collection of objects
(iv) Preprocessing
What preprocessing steps can or should we apply to the data to make it more suitable for data
mining?
Aggregation
Sampling
Dimensionality Reduction
Feature Subset Selection
Feature Creation
Discretization and Binarization
Attribute Transformation
(I) Aggregation
Aggregation refers to combining two or more attributes (or objects) into a single attribute (or
object).
For example, merging daily sales figures to obtain monthly sales figures.
Why aggregation? Data reduction: the smaller data set allows the use of more expensive algorithms.
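The daily-to-monthly example can be sketched as below. The sales figures are made up, and monthly_totals is a hypothetical helper that assumes dates are given as ISO strings (YYYY-MM-DD).

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Sum (date, amount) pairs by month; dates are 'YYYY-MM-DD' strings."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        month = date[:7]  # 'YYYY-MM' prefix identifies the month
        totals[month] += amount
    return dict(totals)


# Hypothetical daily sales figures
daily_sales = [
    ("2024-01-30", 120.0),
    ("2024-01-31", 80.0),
    ("2024-02-01", 50.0),
    ("2024-02-02", 70.0),
    ("2024-02-03", 30.0),
]

monthly = monthly_totals(daily_sales)
print(monthly)
```

Five daily objects are reduced to two monthly objects, which is the data reduction referred to above.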
(II) Sampling
Sampling is the process of understanding characteristics of data or models based on a subset of
the original data. It is used extensively in all aspects of data exploration and mining.
Why sampling? Obtaining the entire set of “data of interest” may be too expensive or time consuming.
Obtaining the entire set of data may also be unnecessary (and hence a waste of resources).
A sample is representative for a particular operation if it results in approximately the
same outcome as if the entire data set was used.
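Representativeness can be checked for one particular operation, here the mean. The sketch below draws a simple random sample without replacement from synthetic data and compares the two results; the data, the sample size of 500 and the seed are arbitrary assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)

# Synthetic "entire data set": 10,000 values around 50
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Simple random sample without replacement (5% of the data)
sample = rng.sample(population, 500)

population_mean = mean(population)
sample_mean = mean(sample)
print(f"mean of all data: {population_mean:.2f}")
print(f"mean of sample:   {sample_mean:.2f}")
```

If the two means are close, the sample is representative for this operation. The same sample may still be unrepresentative for another operation, for example finding the maximum.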
(IV) Feature creation
Sometimes, a small number of new attributes can capture the important information in a
data set much more efficiently than the original attributes.
Also, the number of new attributes is often smaller than the number of original
attributes. Hence, we get the benefits of dimensionality reduction.
Three general methodologies:
o Feature Extraction
o Mapping the Data to a New Space
o Feature Construction
Feature extraction
Feature Construction
Sometimes features have the necessary information, but not in the form necessary for the
data mining algorithm. In this case, one or more new features constructed out of the original
features may be useful.
For example, suppose two attributes record the volume and mass of a set of objects, and
a classification model is based on the material of which the objects are made.
A density feature (mass divided by volume) constructed from the original two features
would then help classification.
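The density example can be sketched as follows. The measurements and attribute names are hypothetical; the point is only that the constructed feature separates materials where neither mass nor volume alone does.

```python
def add_density(records):
    """Return copies of the records with a constructed density feature."""
    return [
        dict(r, density_g_per_cm3=r["mass_g"] / r["volume_cm3"])
        for r in records
    ]


# Hypothetical objects described only by mass and volume
objects = [
    {"mass_g": 270.0, "volume_cm3": 100.0},
    {"mass_g": 787.0, "volume_cm3": 100.0},
    {"mass_g": 54.0, "volume_cm3": 20.0},
]

with_density = add_density(objects)
for r in with_density:
    print(r)
```

The first and third objects differ in both mass and volume but share the same density, so a classifier using the new feature can group them as the same material.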