unit-III
• Mining Applications
• Major Issues in Data Mining
What is Data Mining?
• Data mining is the extraction of interesting
patterns or knowledge from huge amounts of
data.
• Patterns must satisfy the following criteria:
• Non-trivial
• Implicit
• Previously unknown
• Potentially useful
What is Data Mining?
Alternative names of Data Mining:
• Knowledge discovery (mining) in databases (KDD)
• Knowledge Extraction
• Data/pattern analysis
• Data archeology,
• Data dredging
• Information harvesting
• Business intelligence
• Note: simple (deductive) query processing and
expert systems or small ML/statistical programs
are generally not considered data mining.
KDD (Knowledge Discovery from Data)
Data Mining as a Step in the KDD Process
Knowledge discovery as a process consists of an
iterative sequence of the following steps:
Data cleaning:
It can be applied to remove noise and correct
inconsistencies in the data.
Data integration:
Data integration merges data from multiple sources
into a coherent data store, such as a data
warehouse.
Data selection:
where data relevant to the analysis task are
retrieved from the database.
Data Mining as a Step in the KDD Process
Data transformation:
where data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations.
For example, normalization may improve the
accuracy and efficiency of mining algorithms
involving distance measurements.
Data mining:
an essential process where intelligent methods are
applied in order to extract data patterns.
Data Mining as a Step in the KDD Process
Pattern evaluation:
To identify the truly interesting patterns
representing knowledge based on some
interestingness measures.
Knowledge presentation:
where visualization and knowledge
representation techniques are used to present
the mined knowledge to the user.
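The KDD steps above can be sketched as a tiny, hypothetical Python pipeline (the record layout and attribute values here are made up for illustration; a real KDD process runs over a database or data warehouse):

```python
# Toy records; the second one is incomplete (noisy).
raw = [{"age": 25, "income": 30000},
       {"age": None, "income": 45000},
       {"age": 40, "income": 90000}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, attrs):
    """Data selection: keep only attributes relevant to the mining task."""
    return [{a: r[a] for a in attrs} for r in records]

def transform(records, attr, lo, hi):
    """Data transformation: min-max normalize one attribute to [0, 1]."""
    return [{**r, attr: (r[attr] - lo) / (hi - lo)} for r in records]

data = transform(select(clean(raw), ["income"]), "income", 30000, 90000)
print(data)   # [{'income': 0.0}, {'income': 1.0}]
```

The mining, pattern-evaluation, and presentation steps would then operate on `data`; they are omitted here because they depend on the chosen mining task.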
Data Mining Tasks
• Descriptive tasks: find human-interpretable patterns
that describe the data.
• Descriptive tasks are:
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
• Predictive tasks: use some variables to predict
unknown or future values of other variables.
Predictive tasks are:
• Classification
• Regression
• Time series analysis
Descriptive Functions
• Class/Concept Description:
Class/concept description refers to summarizing and
characterizing the data associated with a specified class
or concept.
– For example, in a company, the classes of items for sale include
computers and printers, and concept descriptions of customers
include big spenders and budget spenders.
[Figure: data plot with noise/outlier points labeled]
Outlier analysis by box plot
• A box plot is a simple way of representing statistical
data in which a rectangle is drawn to span the
second and third quartiles, usually with a vertical
line inside to indicate the median value.
• The lower and upper quartiles are shown as
horizontal lines on either side of the rectangle.
• To draw a box plot, arrange the data in increasing
order of value.
• Then calculate Q1, Q2, Q3 and the interquartile
range IQR = Q3 − Q1, and draw the box plot.
Outlier analysis by box plot
Drawing a box plot:
Data (sorted): 10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Lower quartile Q1 = 14.4, Median Q2 = 14.6, Upper quartile Q3 = 14.9
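These quartiles can be computed, and the usual 1.5 × IQR outlier fences checked, with a short Python sketch using the median-of-halves method implied by the worked example above:

```python
def quartiles(data):
    """Compute Q1, Q2 (median), Q3 via the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        mid = len(v) // 2
        return v[mid] if len(v) % 2 else (v[mid - 1] + v[mid]) / 2

    # For odd n, the median itself is excluded from both halves.
    return median(xs[: n // 2]), median(xs), median(xs[(n + 1) // 2 :])

values = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
          14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
q1, q2, q3 = quartiles(values)
iqr = q3 - q1
# Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers.
outliers = [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q2, q3)   # 14.4 14.6 14.9
print(outliers)     # [10.2, 15.9, 16.4]
```

Here 10.2 falls below Q1 − 1.5·IQR ≈ 13.65 and 15.9, 16.4 fall above Q3 + 1.5·IQR ≈ 15.65, so they would be drawn as individual points beyond the whiskers.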
Example: observed counts for gender vs. preferred reading

Preferred Reading | Male | Female | Total
Fiction           |  250 |    200 |   450
Non-fiction       |   50 |   1000 |  1050
Total             |  300 |   1200 |  1500

Expected count: e = (count_gender × count_preferred_reading) / N
Expected counts (from the formula above):

Preferred Reading | Male | Female | Total
Fiction           |   90 |    360 |   450
Non-fiction       |  210 |    840 |  1050
Degrees of freedom = (rows − 1) × (columns − 1)
                   = (2 − 1) × (2 − 1)
                   = 1
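Putting the observed table, the expected-count formula, and the degrees of freedom together, the chi-square statistic can be computed directly (a sketch in Python; the counts are those from the tables above):

```python
# Observed counts: rows = preferred reading, columns = gender.
observed = [[250, 200],   # fiction:     male, female
            [50, 1000]]   # non-fiction: male, female

row_totals = [sum(row) for row in observed]        # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 1200]
n = sum(row_totals)                                # 1500

# Expected count e_ij = (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]
# expected == [[90.0, 360.0], [210.0, 840.0]]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 2), df)   # 507.94 1
```

Since 507.94 far exceeds 3.841, the 5% critical value of the chi-square distribution with 1 degree of freedom, the hypothesis that gender and preferred reading are independent is rejected.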
Chi-Square Table (for the test of independence)
Example 2:A researcher might want to know if there is a significant
association between the variables gender and soft drink choice (Coke and
Pepsi were considered). The null hypothesis would be,
Ho: There is no significant association between gender and soft drink choice.
(Gender and preferred soft drink choice are independent.)
Significance level: 5%
Data Transformation
In data transformation, the data are transformed or
consolidated into forms appropriate for mining.
Strategies used for data transformation are:
1. Smoothing: remove noise from the data.
2. Attribute construction: a new set of attributes is
generated from the given ones.
3. Aggregation: summarization of data values.
4. Normalization: attribute values are scaled into a small
range (e.g., −1 to +1, or 0 to 1).
5. Discretization: data are divided into discrete intervals
(e.g., 0–100, 101–200, 201–300).
6. Concept hierarchy generation: low-level values are
replaced by higher-level concepts.
For example, an address hierarchy is: street < city < state < country.
Data Transformation by Discretization
In data discretization, data are divided into discrete intervals.
Discretization techniques can be categorized by how they are
performed:
1. Supervised discretization uses class information.
2. Unsupervised discretization uses top-down splitting or
bottom-up merging, without prior class information.
3. Common techniques are:
• Binning methods
• Histogram analysis
• Cluster analysis
• Decision tree analysis
• Correlation analysis
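As an illustration of unsupervised discretization, equal-width binning splits the value range into k intervals of equal size (a sketch; the price data below is a made-up example):

```python
def equal_width_bins(values, k):
    """Unsupervised discretization: assign each value to one of k
    equal-width intervals spanning [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Bin index per value; the maximum value goes into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))   # [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

With range 4–34 and k = 3, the interval width is 10, giving bins [4, 14), [14, 24), and [24, 34]; supervised methods would instead place the cut points using class labels.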
Data Transformation by Normalization
Normalization changes the unit of measurement, like
metres to kilometres.
For better performance, data should be scaled into a small
range such as [−1, +1] or [0, 1].
The following methods can be used for data normalization:
1. Min-max normalization
2. Z-score (zero-mean) normalization
3. Normalization by decimal scaling
Example:
Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
Map income $73,600 to the range [0.0,1.0].
Here MIN_A = 12,000, MAX_A = 98,000,
NEW_MIN_A = 0, NEW_MAX_A = 1, and V = 73,600
V’ = [(73,600 − 12,000)/(98,000 − 12,000)] × (1.0 − 0) + 0
   = 0.716
So 73,600 will be represented as 0.716 in the new range [0, 1].
Example:
Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and $16,000,
respectively. With z-score normalization, a value of $73,600
for income is transformed to :
V’=(73,600−54,000)/ 16,000
= 1.225
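The three normalization methods listed above can be sketched in Python; the min-max and z-score calls reproduce the worked income examples:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map v from [min_a, max_a]
    into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: center on the mean, scale by the
    standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Decimal scaling: divide by 10^j, where j is the smallest
    integer such that the largest absolute value maps below 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(73600, 98000))            # 0.736
```

For decimal scaling, the maximum absolute income 98,000 requires j = 5, so every value is divided by 100,000; the 98,000 ceiling here is taken from the min-max example for continuity.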