unit-III

Data mining is the process of extracting interesting patterns from large datasets, serving as a critical step in the Knowledge Discovery in Databases (KDD) process. It encompasses various functionalities, including descriptive and predictive tasks, and has applications in numerous fields such as business intelligence, fraud detection, and healthcare. Major challenges in data mining include handling diverse data types, ensuring privacy, and improving efficiency and scalability.


Data Mining

• Definition and overview
• Mining as a step in the KDD process
• Mining functionalities
  – Descriptive
  – Predictive
• Mining applications
• Major issues in data mining
What is Data Mining?
• Data mining is the extraction of interesting patterns or knowledge from huge amounts of data.
• The patterns must satisfy the following criteria:
  • Non-trivial
  • Implicit
  • Previously unknown
  • Potentially useful
What is Data Mining?
Alternative names of Data Mining:
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• Business intelligence
• Query processing
• Expert systems or small ML/statistical programs
KDD (Knowledge Discovery from Data)
Data Mining as a step in process of KDD
Knowledge discovery as a process consists of an
iterative sequence of the following steps:
Data cleaning:
It can be applied to remove noise and correct
inconsistencies in the data.
Data integration:
Data integration merges data from multiple sources
into a coherent data store, such as a data
warehouse.
Data selection:
Data relevant to the analysis task are retrieved from the database.
Data Mining as a step in process of KDD
Data transformation:
where data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations.
For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.

Data mining:
an essential process where intelligent methods are
applied in order to extract data patterns.
Data Mining as a step in process of KDD
Pattern evaluation:
To identify the truly interesting patterns
representing knowledge based on some
interestingness measures.
Knowledge presentation:
where visualization and knowledge
representation techniques are used to present
the mined knowledge to the user.
Data Mining Tasks
• Descriptive tasks: find human-interpretable patterns that describe the data.
• Descriptive tasks are:
  • Class/Concept description
  • Mining of frequent patterns
  • Mining of associations
  • Mining of correlations
  • Mining of clusters
• Predictive tasks: use some variables to predict unknown or future values of other variables.
• Predictive tasks are:
  • Classification
  • Regression
  • Time series analysis
Descriptive Function
• Class/Concept Description:
  Class/Concept refers to the data that are associated with a specific class or concept.
  – For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders.
• Data Characterization − summarizing the data of the target class in terms of its general characteristics or features.
• Data Discrimination − a comparison of the general features of target-class data objects against the general features of objects from one or multiple contrasting classes.
Descriptive Function
Mining of Frequent Patterns
Frequent patterns are those patterns that
occur frequently in transactional data.
• Frequent Item Set − It refers to a set of items
that frequently appear together, for example,
milk and bread.
Descriptive Function
Association rule learning is a popular and well
researched method for discovering interesting
relations between items in large databases.
Buys(X, “computer”) → Buys(X, “software”) [support = 1%, confidence = 50%]
A support of 1% means that computer and software are bought together in 1% of all transactions analyzed.
A confidence of 50% means that if a customer buys a computer, there is a 50% chance that he or she also buys software.
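As a rough illustration (not part of the original slides), the sketch below computes support and confidence for such a rule over a toy list of market-basket transactions; the transactions themselves are invented.

# Minimal sketch: support and confidence of the rule "computer -> software"
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"milk", "bread"},
    {"computer", "software", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))      # 0.5 in this toy data
print(confidence({"computer"}, {"software"}, transactions))  # about 0.67 here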
Descriptive Function
Mining of Correlations:
• This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute–value pairs or between two item sets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters:
• Cluster refers to a group of similar kind of objects. Cluster
analysis refers to forming group of objects that are very
similar to each other but are highly different from the
objects in other clusters.
Classification and Prediction
• Classification is the process of finding a model that
describes the data classes or concepts.
• The purpose is to be able to use this model to predict the
class of objects whose class label is unknown.
• This derived model is based on the analysis of sets of
training data.
• The derived model can be presented in the following forms

• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
IF-THEN rules
IF age(X, “Youth”) AND income(X, “high”)THEN class(X, “A”)
IF age(X, “Youth”) AND income(X, “low”) THEN class(X, “B”)
IF age(X, “middle-aged”) THEN class(X, “C”)
IF age(X, “Senior”) THEN class(X, “C”)
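As an aside (not part of the slides), such IF-THEN rules translate directly into program logic; the sketch below is a minimal, illustrative encoding in Python.

def classify(age, income):
    """Return the class label for a customer described by (age, income)."""
    if age == "youth" and income == "high":
        return "A"
    if age == "youth" and income == "low":
        return "B"
    if age in ("middle-aged", "senior"):
        return "C"
    return None  # no rule fires

print(classify("youth", "high"))   # A
print(classify("senior", "high"))  # C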
Business intelligence (BI) is the set of techniques and tools
for the transformation of raw data into meaningful and useful
information for business analysis purposes.
Data Mining Structure and
Components
Components of Data mining system
1. Database, data warehouse, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
2. Database or data warehouse server: The database
or data warehouse server is responsible for fetching
the relevant data, based on the user's data mining
request.
3. Knowledge base: This is the domain knowledge that
is used to guide the search, or evaluate the
interestingness of resulting patterns
Components of Data mining system
4. Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules
for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.
5. Pattern evaluation module: This component typically
employs interestingness measures and interacts with the
data mining modules so as to focus the search towards
interesting patterns.
6. Graphical user interface: This module communicates
between users and the data mining system, allowing the
user to interact with the system by specifying a data
mining query or task
Issues in Data Mining(Challenges)
1. Mining Methodology
2. User Interaction
3. Efficiency and Scalability
4. Diversity of Database Types
5. Data Mining and Society
Mining Methodology
• Handling uncertainty, noise and incompleteness in data.
• Mining various new kinds of knowledge: mining covers a wide spectrum of data analysis tasks, from data characterization and discrimination to association and correlation analysis, classification, clustering, regression, etc.
• Mining knowledge in multidimensional space (MD data cubes).
• Data mining is an interdisciplinary effort (e.g., involving NLP).
• Boosting the power of discovery in a networked environment: most data reside in linked and interconnected environments, e.g., the Web, RDBMSs, files and documents.
• Pattern evaluation and constraint-guided mining: which patterns are interesting may vary from user to user.
User Interaction
• Users play important roles in the data mining process; an interesting area is how the user interacts with the data mining system. This includes the following:
• Interactive mining: it is important to build flexible user interfaces and an exploratory mining environment, facilitating the user's interaction with the system.
• Incorporation of background knowledge.
• Ad hoc data mining and query languages.
• Presentation and visualization of data mining results.
Efficiency and Scalability
Efficiency and scalability are always considered when comparing mining algorithms:
• Efficiency and scalability of data mining algorithms.
• Parallel, distributed and incremental mining algorithms: partition the data into pieces, then analyze the pieces in parallel or incrementally.
• Cloud computing and cluster computing
Diversity of Data Types
• The wide diversity of data types brings challenges to data mining, which include:
• Handling complex data types, i.e., unstructured, semi-structured and structured data, including temporal data, biological data, spatial data, hypertext, multimedia, software code, web data and social networks.
• Mining dynamic, networked and global data repositories.
Data Mining and Society
• A major issue in mining is how the mining process impacts society and how to preserve the privacy of individuals. This includes the following challenges:
• Social impact of data mining: how mining can benefit society, and how to guard against its misuse.
• Privacy-preserving data mining, e.g., avoiding disclosure of personal information.
• Invisible data mining: we cannot expect everyone to learn and master data mining technology, so mining should be built invisibly into everyday systems.
Data Mining Applications
• Business Intelligence
• Search Engine
• Advertising
• Web Mining
• Text Mining
• Surveillance
• Astrology
• Financial Data Analysis
• Retail Industry
• Biological Data Analysis
• Telecommunication Industry
• Intrusion Detection
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
• Weather Forecasting
• Health Sector
Data Warehousing Applications
• Consumer Goods
• Distribution
• Finance and Banking
• Cross Industry
• Finance – General
• Government and Education
• Health Care
• Hospitality
• Insurance
• Manufacturing and Distribution
• Marketing
• Multi-Industry
• Retailers
• Services
• Sports
• Telephone
• Transportation
• Utilities
Forms of Data Pre-processing
• Requirements
• Quality Data
• Major Task in data Pre-processing
• Data cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data discretization
Requirement of Data Pre-processing
• Pre-processing is required before the data mining task because real-world data are generally:
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
• Noisy: containing errors or outliers.
• Inconsistent: containing discrepancies in codes or names.
Quality Data
• Reasons behind dirty data are noise, redundancy and inconsistency.
• Quality data satisfy the requirements of their intended use.
• Quality data satisfy the following:
  – Accuracy: noiseless or error-free
  – Completeness: no missing attribute values
  – Consistency: no discrepancies
  – Timeliness: all records updated within the deadline
  – Believability: trusted data
  – Interpretability: easy to understand
Major Task in data Pre-processing
• Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies.
• Data integration: using multiple databases, data cubes, or
files.
• Data transformation: normalization and aggregation.
• Data reduction: reducing the volume but producing the same
or similar analytical results.
• Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Data Cleaning

1. Handling Data with Missing Values
2. Handling Noisy Data (Data Smoothing)
3. Handling Inconsistent Data
Methods used to handle data with missing values:
• Ignore the tuple: usually done when the class label is missing.
• Use the attribute mean (or the majority nominal value) to fill in the missing value.
• Use the attribute mean (or the majority nominal value) of all samples belonging to the same class.
• Fill in missing values manually.
• Use a global constant (e.g., "UNKNOWN" or ∞) for missing values.
• Predict the missing value with a learning algorithm: treat the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayesian inference or decision tree induction) to predict the missing value.
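A small sketch (not part of the slides; it assumes the pandas library and the column names are invented) of two of these strategies: filling a numeric attribute with its overall mean, and filling it with the mean of samples belonging to the same class.

import pandas as pd

df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples belonging to the same class.
df["income_class_mean"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)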
Handling Noisy Data (Data Smoothing)

Outliers are identified and noisy data are smoothed out using the following methods:
1. Binning Methods
2. Regression
3. Clustering and outliers
4. Computer and Human Inspection
Binning Methods
This method smooths a sorted set of values by consulting its neighbourhood, i.e., the closest values. Because only neighbouring values are consulted, binning performs local smoothing.
The steps are as follows:
1. Sort the attribute values.
2. Partition them into equal-size bins (the last bin may be smaller).
3. Smooth the data by replacing every value in a bin with any one of the following:
   I. The mean of the bin
   II. The median of the bin
   III. The closest boundary value of the bin
Binning Methods Example:
Smooth the following price list using binning methods:
4, 8, 15, 21, 25, 28, 34, 21, 24. Let the bin size be 3 (given).
Step 1: sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Step2: Partition data into bins of given size-3:
Bin1: 4, 8, 15
Bin2: 21, 21,24
Bin3: 25, 28, 34
Step3: Data smoothing using mean value:
Bin1: 9, 9, 9 [mean of 4,8,15 is 9]
Bin2: 22, 22,22 [mean of 21,21,24 is 22]
Bin3: 29, 29, 29 [mean of 25,28,34 is 29]
Binning Methods Example:
Bin1: 4, 8, 15
Bin2: 21, 21,24
Bin3: 25, 28, 34
Step3: Data smoothing using median value:
Bin1: 8, 8, 8 [median of 4,8,15 is 8]
Bin2: 21, 21,21 [median of 21,21,24 is 21]
Bin3: 28, 28, 28 [median of 25,28,34 is 28]
Binning Methods Example:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Step 3: Data smoothing using the closest boundary value:
Bin1: 4, 4, 15 [4 and 15 are the boundaries, so 8 is replaced by the closest boundary, 4]
Bin2: 21, 21, 24 [21 and 24 are the boundaries; every value already equals a boundary]
Bin3: 25, 25, 34 [25 and 34 are the boundaries, so 28 is replaced by the closest boundary, 25]
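A compact sketch (not from the slides) that reproduces the worked example above: sort, partition into bins of size 3, and smooth by mean, median, or closest boundary.

from statistics import mean, median

def smooth_by_bins(values, bin_size, method="mean"):
    data = sorted(values)
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_ = data[i:i + bin_size]
        if method == "mean":
            rep = [round(mean(bin_))] * len(bin_)
        elif method == "median":
            rep = [median(bin_)] * len(bin_)
        else:  # "boundary": replace each value by the closest bin boundary
            lo, hi = bin_[0], bin_[-1]
            rep = [lo if v - lo <= hi - v else hi for v in bin_]
        smoothed.extend(rep)
    return smoothed

prices = [4, 8, 15, 21, 25, 28, 34, 21, 24]
print(smooth_by_bins(prices, 3, "mean"))      # [9, 9, 9, 22, 22, 22, 29, 29, 29]
print(smooth_by_bins(prices, 3, "median"))    # [8, 8, 8, 21, 21, 21, 28, 28, 28]
print(smooth_by_bins(prices, 3, "boundary"))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]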
Regression Methods
• Regression and log-linear models can be used to approximate the given data.
• In simple linear regression, the data are modeled to fit a straight line between two attributes, so that one attribute can be used to predict the other.
• Multiple regression extends this to more than two attributes and can also be used to smooth out noise.
• Y = aX + b, where X and Y are attributes and a, b are the regression coefficients.
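A brief sketch (not part of the slides; the sample points are invented) of fitting Y = aX + b by least squares with NumPy and using the fitted line as the smoothed values.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

# np.polyfit returns the slope a and intercept b of the best-fit line.
a, b = np.polyfit(X, Y, deg=1)
Y_smoothed = a * X + b   # replace noisy Y values with values on the line

print(a, b)          # slope close to 2, intercept close to 0 for this data
print(Y_smoothed)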
Clustering and Outliers
• Clustering is the process of grouping a set of data values into multiple groups such that objects within a group have high similarity to one another and high dissimilarity to objects in other groups.
• Data that fall outside the clusters are treated as outliers and can be removed or ignored.
[Figure: clusters of data points; points lying outside the clusters are marked as NOISE or OUTLIERS]
Outlier analysis by box plotter
• A simple way of representing statistical data on a
plot in which a rectangle is drawn to represent
the second and third quartiles usually with a
vertical line inside to indicate the median value.
• The lower and upper quartiles are shown as
horizontal lines either side of the rectangle.
• For a box plot, arrange the data in increasing order of value.
• Then calculate Q1, Q2, Q3 and the inter-quartile range IQR = Q3 − Q1, and draw the box plot.
Outlier analysis by box plotter
Drawing a box plot.
Example: draw a box plot for the data below:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Q2 = 14.6 (median of the whole data set)
Q1 = 14.4 (lower quartile: median of the left half of the data set)
Q3 = 14.9 (upper quartile: median of the right half of the data set)
[Figure: box plot with the box spanning Q1 = 14.4 to Q3 = 14.9, the median line at Q2 = 14.6, and whiskers extending to 10.2 and 16.4]
Then the IQR is given by:
IQR = Q3 − Q1 = 14.9 − 14.4 = 0.5
Outliers are any points x satisfying x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR:
x < 14.4 − 0.75 = 13.65, or x > 14.9 + 0.75 = 15.65
Here 10.2 lies below 13.65, and 15.9 and 16.4 lie above 15.65.
The outliers of the data set are: 10.2, 15.9, 16.4
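A quick sketch (not part of the slides) that applies the same 1.5 × IQR rule with NumPy; note that NumPy interpolates quartiles, so Q3 may come out slightly different from the hand computation, while the outliers identified are the same.

import numpy as np

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q2, q3, iqr)   # interpolated quartiles, close to 14.4, 14.6, 14.9
print(outliers)          # [10.2, 15.9, 16.4]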


Computer and Human Inspection
• Suspicious values are detected by computer and then checked by a human manually.
• Outliers can thus be identified through a combination of computer and human inspection. An outlier pattern may be informative or may be garbage.
• Patterns whose surprise content is above a threshold are output to a list; a human can then sort through the patterns in the list to identify the actual garbage.
• This is much faster than manually searching through the entire database.
• The patterns identified as garbage can then be excluded from subsequent data mining.
• Example: identifying outliers in a handwritten character database used for classification, where a wide variety of character forms occur.
Handling Inconsistent Data
• Inconsistency means discrepancies in codes or names. It is an important issue to consider during data integration.
• The main reasons for data inconsistency are:
 Faulty hardware or software for data collection.
 Human or computer error.
 Errors during data transmission.
 Technology limitations (data generated faster than they can be received).
 Inconsistency in naming conventions (02/05/2018 may be read as 5 February 2018 or 2 May 2018).
 Duplicate tuples in the database.
Data Integration
• It is the process of merging data from multiple, heterogeneous data sources.
• It is necessary to maintain consistency and avoid redundancy in the resulting data set.
• This helps improve the accuracy and speed of the subsequent mining process.
• Major challenges in integration process are:
1. Entity identification problem
2. Tuple Duplication
3. Data value conflict detection and resolutions
4. Redundancy and correlation analysis
Entity Identification Problem
• This refers to how real-world entities from different databases are matched up.
  – Example: customer_id in one relation and customer_no in another relation.
• During integration, special attention is needed while merging data from various sources.
  – Example: a discount attribute may be defined differently in different databases.
• Metadata, which define the name, meaning, data type and range of values of each attribute, help resolve these problems.
Tuple Duplications
• In addition to detecting redundancy, duplicate tuples should be removed to maintain data consistency.
• Inconsistencies often arise between duplicates when only some of the occurrences are updated.
• Example: a customer name may be repeated with a different address against each occurrence.
Data value conflict detection and resolutions

• Data integration also involves the detection and resolution of conflicting data values for the same real-world entity.
• Attribute values for the same entity may differ across sources.
• This may be due to differences in scaling, encoding or representation.
• Example: a university grading system may record marks as a percentage, as letter grades (A/B/C), or under a credit system.
Redundancy and correlation analysis

• Inconsistencies in attribute values or naming cause redundancy in the data set. An attribute may be redundant if it can be derived from another attribute or set of attributes.
• Example: the annual revenue of an organization may be derivable from other attributes or sources, so carrying it over carelessly during integration can create redundancy.
• Redundancy can be detected by correlation analysis, which detects how strongly two or more attributes depend on each other.
• The chi-square (χ²) test can be applied to detect how strongly one (nominal) attribute implies another.
Example:
A group of 1500 people is surveyed. The contingency table below gives the observed frequencies of preferred reading by gender. Perform a chi-square (χ²) test of the hypothesis that "gender and preferred reading are independent" at significance level 0.001.

Observed frequencies:

                            Gender
                        Male    Female    Total
Preferred  Fiction       250       200      450
Reading    Non-fiction    50      1000     1050
           Total         300      1200     1500


The expected frequency (e) of each cell can be computed by the formula:
e = (count(gender) × count(preferred_reading)) / N

Expected frequencies:

                            Gender
                        Male    Female    Total
Preferred  Fiction        90       360      450
Reading    Non-fiction   210       840     1050
           Total         300      1200     1500

Degrees of freedom = (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1
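As a quick check (not part of the slides), the sketch below computes the expected frequencies and the chi-square statistic for this table with NumPy.

import numpy as np

observed = np.array([[250.0, 200.0],    # fiction:     male, female
                     [ 50.0, 1000.0]])  # non-fiction: male, female

row_totals = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)   # 300, 1200
n = observed.sum()                                 # 1500

expected = row_totals @ col_totals / n             # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2)   # about 507.9 for this table

The statistic comes out to about 507.9. With 1 degree of freedom, the critical χ² value at the 0.001 significance level is 10.828; since 507.9 far exceeds it, the hypothesis that gender and preferred reading are independent is rejected, i.e., the two attributes are strongly correlated.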
Chi-Square Table (for the test of independence)
Example 2: A researcher might want to know if there is a significant association between the variables gender and soft drink choice (Coke and Pepsi were considered). The null hypothesis would be:
H0: There is no significant association between gender and soft drink choice (gender and preferred soft drink are independent).
Significance level: 5%.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies used for data transformation are:
1. Smoothing: remove the noise.
2. Attribute construction: a new set of attributes is generated.
3. Aggregation: summarization of data values.
4. Normalization: attribute values are scaled into a new range (−1 to +1, 0 to 1, etc.).
5. Discretization: data are divided into discrete intervals (e.g., 0–100, 101–200, 201–300, etc.).
6. Concept hierarchy generation, where attribute values are organized from lower to higher conceptual levels. For example, an address hierarchy is: street < city < state < country.
Data Transformation by discretization
In data discretization, data are divided into discrete intervals. Discretization can be categorized based on how it is performed:
1. Supervised discretization uses class information.
2. Unsupervised discretization uses top-down or bottom-up splitting and does not use prior class information.
3. Other techniques are:
• Binning methods
• Histogram
• Cluster Analysis
• Decision tree analysis
• Correlation analysis
Data Transformation by Normalization
Normalization can change the unit of measurement, e.g., from metres to kilometres.
For better performance, data are usually scaled into a small range such as [−1, +1] or [0, 1].
Following methods can be used for data normalization:
1. Min-Max Normalization
2. Z-Score(zero mean) Normalization
3. Normalization By decimal Scaling
Example:
Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. Map income $73,600 to the range [0.0, 1.0].
Min-max normalization: v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Here min_A = 12,000, max_A = 98,000, new_min_A = 0, new_max_A = 1 and v = 73,600:
v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
So 73,600 is represented as 0.716 in the new range [0, 1].
Example:
Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, v' = (v − mean_A) / std_A, a value of $73,600 for income is transformed to:
v' = (73,600 − 54,000) / 16,000 = 1.225
So the new representation of 73,600 is 1.225.
Example:
Suppose that the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, v' = v / 10^j for the smallest integer j such that max(|v'|) < 1; we therefore divide each value by 1,000 (i.e., j = 3, making 986 into 0.986), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
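A compact sketch (not from the slides) of the three normalization methods, checked against the worked income examples above.

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = len(str(int(max_abs)))   # number of integer digits of max_abs (assumes max_abs >= 1)
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))                 # 0.716
print(round(z_score(73600, 54000, 16000), 3))                 # 1.225
print(decimal_scaling(917, 986), decimal_scaling(-986, 986))  # 0.917 -0.986

For the practice data below (200, 300, 400, 600, 1000), min-max normalization to [0, 1] gives 0, 0.125, 0.25, 0.5 and 1.0.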
Normalization Practice
Use the three methods below to normalize the
following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting
min = 0 and max = 1
(b) z-score normalization
(c) Decimal Normalization
Data reduction
• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume.
• The reduced representation closely maintains the integrity of the original data.
• Following Strategies used for data reduction :
1. Dimensionality reduction
2. Attribute subset selection
3. Numerosity reduction
4. Data cube aggregation
5. Discretization and concept hierarchy generation
6. Decision Tree
7. Data Compression
Dimensionality reduction
• In dimensionality reduction, data encoding or transformations
are applied to obtain a reduced representation of the original
data.
• If the original data can be reconstructed from the reduced data
without any loss of information, the data reduction is called
lossless.
• If we can reconstruct only an approximation of the original
data, then the data reduction is called lossy.
• Two popular and effective methods of lossy dimensionality
reduction:
1. Wavelet transforms
2. Principal components analysis
Wavelet transforms
• The discrete wavelet transform(DWT) is a linear signal
processing technique that, when applied to a data vector X,
transforms it to a numerically different vector X’ of wavelet
coefficients.
• The usefulness lies in the fact that the wavelet-transformed data can be truncated.
• A compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients.
• For example, all wavelet coefficients larger than some user-
specified threshold can be retained. All other coefficients are
set to 0
Principal components analysis
• Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The procedure is as given below:
1. The input data are normalized.
2. PCA computes k orthonormal vectors that provide a basis
for the normalized input data.
3. The principal components are sorted in order of
decreasing “significance” or strength.
4. The size of the data can be reduced by eliminating the
weaker components, that is, those with low variance.
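A minimal NumPy sketch (not from the slides; the sample matrix is invented) of the steps above: center the data, compute orthonormal component vectors from the covariance matrix, sort them by significance, and keep only the strongest k.

import numpy as np

def pca(X, k):
    """Project the rows of X onto the k strongest principal components."""
    X_centered = X - X.mean(axis=0)            # step 1: normalize (center) the data
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 2: orthonormal vectors
    order = np.argsort(eigvals)[::-1]          # step 3: sort by significance (variance)
    components = eigvecs[:, order[:k]]         # step 4: keep the strongest k
    return X_centered @ components             # reduced representation

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
print(pca(X, k=1))   # each 2-D point reduced to a single coordinate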
Attribute Subset Selection

• Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
• Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of attributes.
• Attribute subset selection includes the following techniques (a small code sketch of stepwise forward selection follows this list):
1. Stepwise forward selection:
The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and
added to the reduced set. At each subsequent iteration or step,
the best of the remaining original attributes is added to the set.
Attribute Subset Selection
2. Stepwise backward elimination:
The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination:
The stepwise forward selection and backward elimination
methods can be combined so that, at each step, the procedure
selects the best attribute and removes the worst from among
the remaining attributes.
4. Decision tree induction:
When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data. All
attributes that do not appear in the tree are assumed to be
irrelevant. The set of attributes appearing in the tree form the
reduced subset of attributes.
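An illustrative sketch of stepwise forward selection (technique 1 above). The scoring function is a hypothetical stand-in for whatever relevance measure is used (e.g., information gain or cross-validated accuracy), and the attribute weights are invented.

def forward_selection(attributes, score, max_attrs):
    """Greedily add the best remaining attribute until max_attrs are chosen."""
    selected = []                       # start with an empty reduced set
    remaining = list(attributes)
    while remaining and len(selected) < max_attrs:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)           # add the best remaining attribute
        remaining.remove(best)
    return selected

# Toy usage: pretend the score is just a fixed relevance weight per attribute.
weights = {"age": 0.9, "income": 0.7, "zip": 0.1, "name": 0.0}
subset = forward_selection(weights, lambda attrs: sum(weights[a] for a in attrs), 2)
print(subset)   # ['age', 'income']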
Numerosity reduction
• Numerosity reduction reduces the data volume by choosing alternative, 'smaller' forms of data representation.
• These techniques may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored instead of the actual data. Regression and log-linear models are examples.
• Nonparametric methods for storing reduced
representations of the data include :
1. Histograms
2. Clustering
3. Sampling.
4. Data Cube aggregation
Histograms
• A histogram for an attribute, A, partitions the data
distribution of A into disjoint subsets, or buckets.
• If each bucket represents only a single attribute-
value/frequency pair, the buckets are called singleton
buckets.
• There are several partitioning rules, including the following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform (such as a width of $10 for each bucket).
• Equal-frequency (or equidepth): In an equal-frequency
histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data
samples).
Histograms
• Example :The following data are a list of prices of commonly
sold items at AllElectronics (rounded to the nearest dollar):
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
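As a rough sketch (not from the slides), the code below buckets this price list both ways: equal-width buckets via NumPy's histogram, and equal-frequency (equi-depth) buckets that each hold the same number of values.

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width histogram: 3 buckets of uniform width over the price range.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries, uniform width
print(counts)   # number of prices falling in each bucket

# Equal-frequency (equi-depth): each bucket holds roughly the same number of values.
sorted_prices = sorted(prices)
depth = len(sorted_prices) // 4
buckets = [sorted_prices[i:i + depth] for i in range(0, len(sorted_prices), depth)]
print([(b[0], b[-1], len(b)) for b in buckets])   # (low, high, count) per bucket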
Sampling
Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
1. Simple random sample without replacement (SRSWOR) of size s:
This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s:
This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
Sampling
3. Cluster sample: If the tuples in D are grouped into M
mutually disjoint “clusters,” then an SRS of s clusters can be
obtained, where s < M. For example, tuples in a database are
usually retrieved a page at a time, so that each page can be
considered a cluster
4. Stratified sample: If D is divided into mutually disjoint parts
called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. This helps ensure a
representative sample, especially when the data are skewed.
For example, a stratified sample may be obtained from
customer data, where a stratum is created for each customer
age group. In this way, the age group having the smallest
number of customers will be sure to be represented.
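A brief sketch (not part of the slides) of the four sampling schemes using Python's random module; the data set D, the page size of 20 and the odd/even strata are all invented for illustration.

import random

D = list(range(1, 101))        # pretend these are 100 tuples
s = 10

srswor = random.sample(D, s)               # without replacement
srswr = random.choices(D, k=s)             # with replacement (may repeat tuples)

# Cluster sample: group D into "pages" of 20 tuples, then sample whole pages.
pages = [D[i:i + 20] for i in range(0, len(D), 20)]
cluster_sample = random.sample(pages, 2)

# Stratified sample: sample within each stratum (here: odd vs. even tuples).
strata = {"odd": [t for t in D if t % 2], "even": [t for t in D if t % 2 == 0]}
stratified = {name: random.sample(group, 3) for name, group in strata.items()}

print(srswor, srswr, cluster_sample, stratified, sep="\n")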
Data Cube Aggregation
• It is use to summarize multidimensional data cube.
• Data aggregation is any process in which information is
gathered and expressed in a summary form, for purposes
such as statistical analysis.
• A common aggregation purpose is to get more information
about particular groups based on specific variables such as
age, profession, sales or income.
Discretization and Concept Hierarchy
Generation for Numerical Data
• Discretization and concept hierarchy generation, where raw data
values for attributes are replaced by ranges or higher conceptual
levels.
• Data discretization is a form of numerosity reduction that is very
useful for the automatic generation of concept hierarchies.
• Discretization and concept hierarchy generation are powerful tools
for data mining, in that they allow the mining of data at multiple
levels of abstraction. Strategies are:
1. Binning
2. Histogram
3. Entropy based discretization
4. Chi square merged method
5. Cluster analysis
6. Discretization by intuitive partitioning
Discretization and Concept Hierarchy
Generation for Numerical Data
1. Binning: Binning is a top-down splitting technique
based on a specified number of bins.
These methods are also used as discretization
methods for numerosity reduction and concept
hierarchy generation.
2. Histogram analysis: Like binning, histogram analysis
is an unsupervised discretization technique because
it does not use class information. Histograms
partition the values for an attribute, A, into disjoint
ranges called buckets.
Discretization and Concept Hierarchy
Generation for Numerical Data
3. Entropy-based discretization: Entropy-based
discretization is a supervised, top-down splitting technique.
• It explores class distribution information in its calculation and
determination of split-points (data values for partitioning an
attribute range).
• To discretize a numerical attribute, A, the method selects the
value of A that has the minimum entropy as a split-point, and
recursively partitions the resulting intervals to arrive at a
hierarchical discretization. Such discretization forms a concept
hierarchy for A.
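As an illustration (not the slides' algorithm verbatim; the attribute values and class labels are invented), the sketch below picks a single split point for a numeric attribute by minimizing the expected (weighted) entropy; applying it recursively to the resulting intervals would yield the hierarchical discretization described above.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the split point of A with minimum expected (weighted) entropy."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint between neighbours
        left = [l for v, l in pairs if v <= split]
        right = [l for v, l in pairs if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best[1]:
            best = (split, info)
    return best[0]

A = [5, 7, 8, 20, 22, 25]
cls = ["low", "low", "low", "high", "high", "high"]
print(best_split(A, cls))   # 14.0: the point that separates the two classes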
Discretization and Concept Hierarchy
Generation for Numerical Data
4. Interval Merging by χ² Analysis:
• Initially, each distinct value of a numerical attribute A
is considered to be one interval.
• χ² tests are performed for every pair of adjacent
intervals.
• Adjacent intervals with the least χ² values are
merged together, because low χ² values for a pair
indicate similar class distributions.
• This merging process proceeds recursively until a
predefined stopping criterion is met.
Discretization and Concept Hierarchy
Generation for Numerical Data
5. Cluster analysis: A clustering algorithm can be applied to
discretize a numerical attribute, A, by partitioning the values
of A into clusters or groups.
Clustering takes the distribution of A into consideration, as
well as the closeness of data points, and therefore is able to
produce high-quality discretization results.
Discretization and Concept Hierarchy
Generation for Numerical Data
6. Discretization by intuitive partitioning:
Intuitive partitioning generates a concept hierarchy for the attribute price, where an interval ($X ... $Y] denotes the range from $X (exclusive) to $Y (inclusive).
Concept Hierarchy Generation for Categorical
Data
• Categorical data are discrete data. Categorical attributes have
a finite (but possibly large) number of distinct values, with no
ordering among the values. Examples include geographic
location, job category, and item type.
• There are several methods for the generation of concept
hierarchies for categorical data.
1. Specification of a partial ordering of attributes explicitly at
the schema level by users or Experts:
2. Specification of a portion of a hierarchy by explicit data
grouping:
3. Specification of a set of attributes, but not of their partial
ordering:
Specification of a partial ordering of attributes
explicitly at the schema level by users or experts:
• A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the
schema level.
• For example, a relational database or a dimension location of
a data warehouse may contain the following group of
attributes: street, city, province or state, and country.
• A hierarchy can be defined by specifying the total ordering
among these attributes at the schema level, such as:
street < city < province or state < country.
Specification of a portion of a hierarchy by
explicit data grouping:
• In a large database, it is unrealistic to define an entire concept
hierarchy by explicit value enumeration. We can easily specify
explicit groupings for a small portion of intermediate-level
data.
• For example, after specifying that province and country form a
hierarchy at the schema level, a user could define some
intermediate levels manually, such as
“{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada”
and
“{British Columbia, prairies Canada} ⊂ Western Canada”.
Specification of a set of attributes, but not of
their partial ordering:
• Consider the following observation that since higher-level
concepts generally cover several subordinate lower-level
concepts, an attribute defining a high concept level (e.g.,
country) will usually contain a smaller number of distinct
values than an attribute defining a lower concept level (e.g.,
street).
• The attribute with the most distinct values is placed at the
lowest level of the hierarchy.
• The lower the number of distinct values an attribute has, the
higher it is in the generated concept hierarchy.
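A tiny sketch (not from the slides; the attribute values are invented) of this heuristic: count the distinct values of each attribute and order the attributes accordingly, placing the attribute with the most distinct values at the bottom of the hierarchy.

location = {
    "country": ["Canada", "Canada", "USA", "Canada", "Canada"],
    "state":   ["BC", "Ontario", "New York", "BC", "BC"],
    "city":    ["Vancouver", "Toronto", "Buffalo", "Victoria", "Vancouver"],
    "street":  ["1st Ave", "Bloor St", "Main St", "Fort St", "2nd Ave"],
}

# Fewest distinct values -> highest level; most distinct values -> lowest level.
hierarchy = sorted(location, key=lambda attr: len(set(location[attr])))
print(" < ".join(reversed(hierarchy)))   # street < city < state < country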
