
UNIT-I

What Is Data Mining?

Def1: Data mining refers to the process or method that extracts or “mines” interesting knowledge or
patterns from large amounts of data.

Def2: Data mining is the task of discovering interesting patterns from large amounts of data, where the
data can be stored in databases, data warehouses, or other information repositories.

Data Mining Functionalities—What Kinds of Patterns Can Be Mined?


Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.

Predictive mining tasks perform inference on the current data in order to make predictions.

1. Concept/Class Description: Characterization and Discrimination:


Data can be associated with classes or concepts. For example, in the All Electronics store, classes of
items for sale include computers and printers, and concepts of customers include big spenders and
budget spenders. It can be useful to describe individual classes and concepts in summarized, concise,
and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.

 Characterization is a summarization of the general characteristics or features of a target class
of data (the class under study). For example, a data mining system should be able to produce a
description summarizing the characteristics of customers who spend more than $1,000 a year at All
Electronics. The result could be a general profile of the customers, such as that they are 40–50 years
old, employed, and have excellent credit ratings.
 Data discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. The target and contrasting
classes can be specified by the user, and the corresponding data objects retrieved through
database queries. For example, the user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose sales decreased by at
least 30% during the same period.

2. Mining Frequent Patterns, Associations, and Correlation:


Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many
kinds of frequent patterns, including item sets, subsequences, and substructures. A frequent item set
typically refers to a set of items that frequently appear together in a transactional data set, such as milk
and bread.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC,
followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure
can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with
item sets or subsequences.
Association rule mining discovers rules showing attribute-value conditions that occur frequently together in a
given set of data. For example, a data mining system may find association rules like

major(X, “computing science”) ⇒ owns(X, “personal computer”) [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12%
(support) major in computing science and own a personal computer. There is a 98% probability
(confidence, or certainty) that a student in this group owns a personal computer.
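As a rough illustration of how support and confidence are computed, the following Python sketch counts itemset occurrences over a small, hypothetical transaction list (the items and transactions are made up for illustration):

```python
# Minimal sketch: support and confidence for the rule {milk} => {bread}
# over a small, hypothetical list of transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

n = len(transactions)
count_milk = sum(1 for t in transactions if "milk" in t)
count_both = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_both / n              # fraction of all transactions containing both items
confidence = count_both / count_milk  # fraction of milk-transactions that also contain bread

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```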

3. Classification and Prediction:


Classification differs from prediction in that the former constructs a set of models (or
functions) that describe and distinguish data classes or concepts, whereas the latter predicts some
missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for
prediction: Classification is used for predicting the class label of data objects and prediction is typically
used for predicting missing numerical data values.

4. Cluster Analysis: Clustering analyzes data objects without consulting a known class label. The objects are clustered
or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy of classes that group
similar events together.
5. Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more interesting than the
more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.

6. Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association, classification,
or clustering of time-related data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
Classification of Data Mining Systems:
Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database
systems, statistics, machine learning, visualization, and information science. Moreover, depending on
the data mining approach used, techniques from other disciplines may be applied, such as neural
networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or
high-performance computing.

Kinds of databases mined:

A data mining system can be classified according to the kinds of databases mined. Database systems can
be classified according to different criteria (such as data models, or the types of data or applications
involved)
Kinds of knowledge mined: Data mining systems can be categorized according to the kinds of
knowledge they mine, that is, based on data mining functionalities, such as characterization,
discrimination etc.

Kinds of techniques utilized: Data mining systems can be categorized according to the underlying data
mining techniques employed

Applications adapted: Data mining systems can also be categorized according to the applications they
adapt. For example, data mining systems may be tailored specifically for finance, telecommunications
etc.
Data Mining Task Primitives:
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed. A data mining task can be specified in the form of a data mining query, which
is input to the data mining system.

A data mining query is defined in terms of data mining task primitives. These primitives allow the user
to interactively communicate with the data mining system during discovery in order to direct the
mining process, or examine the findings from different angles or depths.

Integration of a Data Mining System with a Database or Data Warehouse:


The possible schemes for integrating a data mining system with a database or data warehouse system are as
follows.

• No coupling: The data mining system uses sources such as flat files to obtain the initial data set to
be mined since no database system or data warehouse system functions are implemented as part of the
process. Thus, this architecture represents a poor design choice.

• Loose coupling: The data mining system is not integrated with the database or data warehouse
system beyond their use as the source of the initial data set to be mined, and possible use in storage of
the results. Thus, this architecture can take advantage of the flexibility, efficiency and features such as
indexing that the database and data warehousing systems may provide. However, it is difficult for loose
coupling to achieve high scalability and good performance with large data sets as many such systems
are memory-based.

• Semi tight coupling: Some of the data mining primitives such as aggregation, sorting or
precomputation of statistical functions are efficiently implemented in the database or data warehouse
system, for use by the data mining system during mining-query processing. Also, some frequently used
intermediate mining results can be precomputed and stored in the database or data warehouse system,
thereby enhancing the performance of the data mining system.

• Tight coupling: The database or data warehouse system is fully integrated as part of the data
mining system and thereby provides optimized data mining query processing. Thus, the data mining
subsystem is treated as one functional component of an information system. This is a highly desirable
architecture as it facilitates efficient implementations of data mining functions, high system
performance, and an integrated information processing environment.

Major Issues in Data Mining :


Issues can be classified as Follows.

1) Mining methodology and user interaction issues.


2) Performance issues.
3) Issues relating to the diversity of database types.

1. Mining methodology and user interaction issues:

• Mining different kinds of knowledge in databases:


Different users are interested in different kinds of knowledge and will require a wide range of data
analysis and knowledge discovery tasks such as data characterization, discrimination, association,
classification, clustering, trend and deviation analysis, and similarity analysis. Each of these tasks will
use the same database in different ways and will require different data mining techniques.

• Interactive mining of knowledge at multiple levels of abstraction:


Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for
patterns, providing and refining data mining requests based on returned results. The user can then
interactively view the data and discover patterns at multiple granularities and from different angles.

• Incorporation of background knowledge:


Background knowledge, or information regarding the domain under study such as integrity constraints
and deduction rules, may be used to guide the discovery process and allow discovered patterns to be
expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data
mining process or judge the interestingness of discovered patterns.

• Data mining query languages and ad hoc data mining:


Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval.

• Presentation and visualization of data mining results:


Discovered knowledge should be expressed in high-level languages, visual representations, or other
expressive forms so that the knowledge can be easily understood and directly usable by humans.

• Handling noisy or incomplete data:


The data stored in a database may reflect noise, exceptional cases, or incomplete data
objects. When mining data regularities, these objects may confuse the process, causing the knowledge
model constructed to overfit the data.

• Pattern evaluation:

The interestingness problem: A data mining system can uncover thousands of patterns.
Many of the patterns discovered may be uninteresting to the given user, either because they represent
common knowledge or lack novelty.

2. Performance issues: These include efficiency, scalability, and parallelization of data mining
algorithms.

Efficiency and scalability of data mining algorithms: To effectively extract information


from a huge amount of data in databases, data mining algorithms must be efficient and scalable.

3. Issues relating to the diversity of database types:

• Handling of relational and complex types of data: Because relational databases and data
warehouses are widely used, the development of efficient and effective data mining systems for such
data is important.

• Mining information from heterogeneous databases and global information systems.

Descriptive Data Summarization:


• Descriptive data summarization, which serves as a foundation for data preprocessing.

• Descriptive data summarization helps us study the general characteristics of the data and identify the
presence of noise or outliers, which is useful for successful data cleaning and data integration.

• For many data preprocessing tasks, users would like to learn about data characteristics regarding both
central tendency and dispersion of the data.

1. Measuring the Central Tendency:


The most common and most effective numerical measure of the “center” of a set of data is
the (arithmetic) mean. Let x1, x2, ..., xN be a set of N values or observations of some
attribute, such as salary. The mean of this set of values is

mean = (x1 + x2 + ... + xN) / N

This corresponds to the built-in aggregate function average (avg() in SQL) provided in relational
database systems.

The weighted arithmetic mean (or weighted average) attaches a weight wi to each value xi and is computed as

weighted mean = (w1x1 + w2x2 + ... + wNxN) / (w1 + w2 + ... + wN)

A holistic measure is a measure that must be computed on the entire data set as a whole.
It cannot be computed by partitioning the given data into subsets and merging the values obtained for
the measure in each subset.
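A minimal NumPy sketch of the measures above, using hypothetical salary values and weights:

```python
import numpy as np

# Hypothetical salary values (in thousands) and weights (e.g., employees per salary level)
salaries = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
weights  = np.array([ 2,  3,  1,  4,  2,  2,  1,  1,  2,  1,  1,  1])

mean = salaries.mean()                                  # arithmetic mean, like avg() in SQL
weighted_mean = np.average(salaries, weights=weights)   # weighted arithmetic mean
median = np.median(salaries)                            # a holistic measure of the center

print(mean, weighted_mean, median)
```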

2. Measuring the Dispersion of Data:


• The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The most common measures of data dispersion are the range, the five-number summary (based on
quartiles), the interquartile range, and the standard deviation.

• Let x1, x2, ..., xN be a set of observations for some attribute. The range of the set is the difference
between the largest (max()) and smallest (min()) values.

• The most commonly used percentiles other than the median are quartiles. The first quartile, denoted
by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.

This distance is called the interquartile range (IQR) and is defined as

IQR = Q3 − Q1

Variance and Standard Deviation:
The variance of N observations x1, x2, ..., xN is

σ² = (1/N) Σ (xi − mean)²

where mean is the arithmetic mean of the values. The standard deviation σ is the square root of the variance.
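A minimal NumPy sketch of these dispersion measures on hypothetical values:

```python
import numpy as np

# Hypothetical observations of an attribute such as price
x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

data_range = x.max() - x.min()        # range = max - min
q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # interquartile range
variance = x.var()                    # population variance (1/N form)
std_dev = x.std()                     # standard deviation = sqrt(variance)

print(data_range, q1, q3, iqr, variance, std_dev)
```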


3. Graphic Displays of Basic Descriptive Data Summaries:
These include histograms, quantile plots, q-q plots, scatter plots, and loess curves.
Such graphs are very helpful for the visual inspection of your data.

A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle
whose height is equal to the count or relative frequency of the values at the bucket.

A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate


distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it
allows the user to view whether there is a shift in going from one distribution to another.
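A small Matplotlib sketch of two such displays, a histogram and a q-q plot, on two hypothetical samples of unit prices:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
branch1 = rng.normal(100, 20, 500)   # hypothetical unit prices at one branch
branch2 = rng.normal(110, 25, 500)   # hypothetical unit prices at another branch

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: partition the values of branch1 into equal-width buckets
ax1.hist(branch1, bins=10)
ax1.set_title("Histogram of unit price (branch 1)")

# q-q plot: quantiles of branch1 against the corresponding quantiles of branch2
q = np.linspace(0.01, 0.99, 99)
ax2.scatter(np.quantile(branch1, q), np.quantile(branch2, q), s=10)
ax2.set_title("q-q plot: branch 1 vs. branch 2")

plt.tight_layout()
plt.show()
```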
Introduction to KDD process:
The knowledge discovery (KDD) process, illustrated in the given figure, is iterative and interactive and comprises
nine steps. The process is iterative at each stage, implying that moving back to previous steps
may be required. The process also involves considerable judgment, in the sense that one cannot give a single
formula or a complete scientific categorization of the correct decisions for each step and
application type. Thus, it is necessary to understand the process and the different requirements and
possibilities at each stage.
The process begins with determining the KDD objectives and ends with the implementation of the
discovered knowledge. At that point the loop is closed, and active data mining starts.
Subsequently, changes would need to be made in the application domain, for example, offering different
features to cell phone users in order to reduce churn. This closes the loop, the effects are then
measured on the new data repositories, and the KDD process is run again. Following is a concise description of
the nine-step KDD process, beginning with a managerial step.

Data Cleaning:
Data cleaning is defined as removal of noisy and irrelevant data from collection.
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data transformation tools.

Data Integration:
Data integration is defined as the combination of heterogeneous data from multiple sources into a common
store (a data warehouse). Data integration is carried out using data migration tools, data synchronization tools,
and the ETL (Extract, Transform, Load) process.

Data Selection:
Data selection is defined as the process in which data relevant to the analysis are decided upon and retrieved
from the data collection. For this we can use neural networks, decision trees, Naive Bayes,
clustering, and regression methods.

Data Transformation:
Data Transformation is defined as the process of transforming data into appropriate form required by
mining procedure. Data Transformation is a two step process:
Pattern Evaluation:
Pattern Evaluation is defined as identifying the truly interesting patterns representing knowledge, based
on given interestingness measures. An interestingness score is computed for each pattern, and
summarization and visualization are used to make the results understandable to the user.
Data pre-processing in data mining:
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:

1. Ignore the tuples:


This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually,
by attribute mean or the most probable value.
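For instance, a small pandas sketch (with hypothetical columns and values) of filling missing values with the attribute mean:

```python
import pandas as pd
import numpy as np

# Hypothetical customer data with missing entries
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "income": [28000, 32000, np.nan, 45000, 30000],
})

# Fill each missing numeric value with the mean of its attribute
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled)
```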

(b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments (bins) of equal size, and then various methods are applied to smooth each segment.
Each segment is handled separately: one can replace all values in a segment by its mean,
or boundary values can be used to complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having several
independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall
outside the clusters.

2. Data Transformation:

This step is taken in order to transform the data into forms appropriate for the mining process. This
involves the following ways:

1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Construction (Attribute Selection):
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual
levels.

4. Concept Hierarchy Generation:


Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.

3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be
done using various techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).

Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are
high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF).

Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and systematic sampling.

Clustering: This involves grouping similar data points together into clusters. Clustering is often used
to reduce the size of the dataset by replacing similar data points with a representative centroid. It can
be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It
can be done using techniques such as wavelet compression, JPEG compression, and gzip
compression.
Data Cleaning Process:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.

1. Missing values:
The various methods for handling the problem of missing values in data tuples include:

(a) Ignoring the tuple:


This is usually done when the class label is missing (assuming the mining task
involves classification or description). This method is not very effective unless the tuple contains
several attributes with missing values. It is especially poor when the percentage of missing values per
attribute varies considerably.

(b) Manually filling in the missing value:


In general, this approach is time-consuming and may not be a reasonable task for large
data sets with many missing values, especially when the value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value:
Replace all missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may
mistakenly think that they form an interesting concept, since they all have a value in common — that of
“Unknown.” Hence, although this method is simple, it is not recommended.

(d) Using the attribute mean for quantitative (numeric) values or attribute mode
for categorical (nominal) values:
For example, suppose that the average income of All Electronics customers is $28,000.
Use this value to replace any missing values for income.

(e) Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the
given tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as that of the given
tuple.

(f) Using the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using Bayesian formalism,
or decision tree induction. For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
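A sketch of approach (f), predicting a missing income value from other attributes with a decision tree; the data, column names, and the use of scikit-learn are illustrative assumptions:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical customer data; one income value is missing
df = pd.DataFrame({
    "age":        [25, 32, 40, 35, 28, 50],
    "years_empl": [2,  8, 15, 10,  4, 25],
    "income":     [28000, 36000, 52000, 45000, None, 70000],
})

known = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Train on the complete tuples, then predict the most probable value for the missing one
model = DecisionTreeRegressor(random_state=0)
model.fit(known[["age", "years_empl"]], known["income"])
df.loc[missing.index, "income"] = model.predict(missing[["age", "years_empl"]])
print(df)
```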

2. Noisy Data:
Noise is a random error or variance in a measured variable. Given a numerical attribute such
as, say, price, how can we “smooth” out the data to remove the noise?

Let’s look at the following data smoothing techniques:

• Binning:

• Regression:

• Clustering:

Binning:
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.

• In smoothing by bin means, each value in a bin is replaced by the mean value of the bin; alternatively,
smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as
the bin boundaries, and each bin value is then replaced by the closest boundary value.
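A small sketch of the three binning variants (bin means, bin medians, bin boundaries) on a sorted list of hypothetical prices, using equal-size bins:

```python
import numpy as np

# Hypothetical sorted prices, split into 3 equal-size bins
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = np.array_split(prices, 3)

smoothed_means   = [np.full(len(b), b.mean()) for b in bins]      # smoothing by bin means
smoothed_medians = [np.full(len(b), np.median(b)) for b in bins]  # smoothing by bin medians
# Smoothing by bin boundaries: replace each value by the closer of the bin's min/max
smoothed_bounds  = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print([list(b) for b in smoothed_means])
print([list(b) for b in smoothed_medians])
print([list(b) for b in smoothed_bounds])
```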

Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be
used to predict the other.
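A sketch of regression-based smoothing using NumPy's least-squares line fit on hypothetical paired observations:

```python
import numpy as np

# Hypothetical paired observations: one attribute (x) and a noisy attribute (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Fit the "best" straight line y = w*x + b and use it as the smoothed version of y
w, b = np.polyfit(x, y, deg=1)
y_smoothed = w * x + b
print(w, b, y_smoothed)
```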

Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
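A sketch of clustering-based outlier detection with scikit-learn's KMeans; treating very small clusters as outliers is a simplistic heuristic chosen for illustration, and the values are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D attribute values with one suspicious point (980)
values = np.array([21, 22, 24, 25, 48, 50, 52, 55, 980]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
labels = km.labels_
counts = np.bincount(labels)

# Values that end up in very small clusters (here: a single member) are treated
# as falling "outside" the main clusters, i.e., as candidate outliers.
outliers = values[counts[labels] == 1].ravel()
print(outliers)   # expected: [980]
```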

Data Integration:
* Data integration involves combining data from several disparate sources, which are stored using
various technologies, to provide a unified view of the data.

* The latter initiative is often called a data warehouse.

* It merges the data from multiple data stores (data source).

* It includes multiple databases, data cubes or flat files.
* Metadata, correlation analysis, data conflict detection and resolution of semantic heterogeneity
contribute towards smooth data integration.

Advantages:
1. Independence.

2. Faster query processing.

3. Complex query processing.


4. Advanced data summarization & storage possible.

5. High volume data processing.

Disadvantages:
1. Latency (since data needs to be loaded using ETL).

2. Costlier (data localization, infrastructure, security).

There are a number of issues to consider during data integration.

1. Schema Integration.

2. Redundancy.

3. Detection and resolution of data value conflicts.

Schema integration:
Matching up real-world entities from multiple data sources is referred to as the entity identification
problem.

Redundancy:
* It is another important issue.

* An attribute may be redundant if it can be “derived” from another table, such as annual revenue.

* Some redundancies can be detected by correlation analysis.
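For numeric attributes, such redundancy can be checked with the correlation coefficient; a quick NumPy sketch on hypothetical columns:

```python
import numpy as np

# Hypothetical attributes: monthly_revenue and annual_revenue (derivable from each other)
monthly_revenue = np.array([10.0, 12.5, 9.0, 15.0, 11.0])
annual_revenue  = monthly_revenue * 12

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(r)   # close to 1.0 => highly correlated, so one attribute may be redundant
```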

Detection and resolution of data value conflicts:


* A third important issue in data integration is the detection and resolution of data value conflicts.

* For the same real-world entity, attribute values from different sources may differ. This may be due to
differences in representation, scaling, or encoding.

* An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute
in another.

Data Transformation:
* In data transformation, the data are transformed or consolidated into forms appropriate for mining.

* Data transformation can involve

1. Smoothing.

2. Aggregation.

3. Generalization.

4. Normalization.
5. Attribute construction.

Smoothing:
Which works to remove the noise from data. Such techniques include binning, clustering and
regression.

Aggregation:
Where summary or aggregation operations are applied to the data.

Generalization:
Where low-level or “primitive” data are replaced by higher-level concepts through the use
of concept hierarchies.

Normalization:
Where the attribute data are scaled so as to fall within a specified range, such as -1.0 to 1.0 or
0.0 to 1.0.

Attribute construction:
Where new attributes are constructed and added from the given set of attributes to help the
mining process.

There are many methods for data normalization:

* Min-Max normalization.

* Z-Score normalization.

* Normalization by decimal scaling.

Min – Max Normalization:


It performs a linear transformation on the original data. Suppose that minA and maxA are
the minimum and maximum values of attribute A. Min-max normalization maps a value v of A to
v’ in the new range [new_minA, new_maxA] by computing

v’ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

Z – Score Normalization:
The Z – Score normalization a value of an attribute A are normalized based on the mean
and standard deviation of A. A value v of A is normalized to v’.

Normalization by Decimal Scaling:


Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal points moved depends on the maximum absolute value of A. A
value v of A is normalized to v’ by computing

v’ = v / 10^j

where j is the smallest integer such that max(|v’|) < 1.
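A NumPy sketch of all three normalization methods on a hypothetical income attribute:

```python
import numpy as np

# Hypothetical income values
v = np.array([12000.0, 73600.0, 54000.0, 98000.0, 30000.0])

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization using the mean and standard deviation of the attribute
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer such that max(|v'|) < 1
# (simple heuristic for positive values like these)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal, sep="\n")
```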

Data Reduction:
Data reduction strategies obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same or similar analytical results.

A warehouse may store terabytes of data, so complex data analysis/mining may take a very
long time to run on the complete data set; hence we need data reduction techniques.

Data reduction strategies:


1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation.

1. Data cube aggregation:


where aggregation operations are applied to the data in the construction of a data cube.
For example, Figure 2.14 shows a data cube for multidimensional analysis of sales data with respect to
annual sales per item type for each All Electronics branch. Each cell holds an aggregate data value,
corresponding to a data point in multidimensional space.
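As a rough sketch of this kind of aggregation, a pandas pivot table can roll individual sales up to one aggregate cell per year, item type, and branch (all column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical sales transactions at the finest granularity
sales = pd.DataFrame({
    "branch":    ["A", "A", "B", "B", "A", "B"],
    "item_type": ["TV", "TV", "TV", "PC", "PC", "PC"],
    "year":      [2023, 2023, 2023, 2024, 2024, 2024],
    "amount":    [400, 350, 500, 900, 950, 870],
})

# Aggregate to one cell per (year, item_type, branch), as in a data cube
cube = sales.pivot_table(values="amount", index=["year", "item_type"],
                         columns="branch", aggfunc="sum", fill_value=0)
print(cube)
```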

2. Attribute subset selection:


where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

 Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced
set. At each step, the best of the remaining original attributes is added to the set (a minimal sketch of
this greedy strategy follows this list).
 Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
 Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the procedure selects
the best attribute and removes the worst from among the remaining attributes.
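A minimal sketch of greedy stepwise forward selection, using a decision tree and cross-validated accuracy as the evaluation measure (the data set, cutoff, and scoring choice are illustrative assumptions, with scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_attrs = X.shape[1]

selected, remaining = [], list(range(n_attrs))
clf = DecisionTreeClassifier(random_state=0)

# Stepwise forward selection: start from the empty set and repeatedly add the
# attribute that gives the best cross-validated accuracy.
for _ in range(2):  # stop after selecting 2 attributes (arbitrary cutoff)
    scores = {a: cross_val_score(clf, X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```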

3. Dimensionality reduction:
In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data reduction is called
lossless. If, instead, we can reconstruct only an approximation of the original data, then the data
reduction is called lossy.

a) Wavelet Transforms:
The discrete wavelet transform(DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we consider
each tuple as an n-dimensional data vector.

b) Principal Components Analysis:


Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method),
searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
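A brief scikit-learn sketch of PCA-based dimensionality reduction (the data set is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 tuples, n = 4 dimensions

# Project the data onto k = 2 orthogonal principal component vectors
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```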

4. Numerosity reduction:
where the data are replaced or estimated by alternative, smaller data representations such
as parametric models (which need store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of histograms.

a) Regression and Log-Linear Models:


Regression and log-linear models can be used to approximate the given data. In (simple)
linear regression, the data are modeled to fit a straight line
y = wx + b
where the coefficients w (slope) and b (y-intercept) are the regression parameters.

b) Histograms:
Histograms use binning to approximate data distributions and are a popular form of data
reduction.
o Equal-width: In an equal-width histogram, the width of each bucket range is uniform (such as
the width of $10 for the buckets in Figure 2.19).
o Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are created so
that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the
same number of contiguous data samples).
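A NumPy sketch contrasting equal-width and equal-frequency bucket boundaries on hypothetical prices:

```python
import numpy as np

# Hypothetical item prices
prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 21, 25, 28, 30])

# Equal-width histogram: 3 buckets of uniform width
counts_w, edges_w = np.histogram(prices, bins=3)

# Equal-frequency (equidepth) histogram: bucket edges taken from quantiles,
# so each bucket holds roughly the same number of values
edges_f = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
counts_f, _ = np.histogram(prices, bins=edges_f)

print("equal-width edges:", edges_w, "counts:", counts_w)
print("equal-frequency edges:", edges_f, "counts:", counts_f)
```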

c) Clustering:
Clustering techniques consider data tuples as objects. They partition the objects into groups or
clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other
clusters.

d) Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be
represented by a much smaller random sample (or subset) of the data.
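A quick pandas sketch of simple random and stratified sampling (the data frame and class column are hypothetical; grouped sampling assumes a reasonably recent pandas version):

```python
import pandas as pd

# Hypothetical customer tuples with a class attribute
df = pd.DataFrame({
    "income": range(1000, 1100),
    "risk":   (["low"] * 70) + (["high"] * 30),
})

# Simple random sample without replacement: keep 10% of the tuples
srs = df.sample(frac=0.1, random_state=0)

# Stratified sample: 10% drawn from each class, preserving the low/high proportions
stratified = df.groupby("risk", group_keys=False).sample(frac=0.1, random_state=0)

print(len(srs), stratified["risk"].value_counts().to_dict())
```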

5. Discretization and concept hierarchy generation:


where raw data values for attributes are replaced by ranges or higher conceptual levels.
Data discretization is a form of Numerosity reduction that is very useful for the automatic generation of
concept hierarchies.
Data Discretization and Concept Hierarchy Generation:
In discretization and concept hierarchy generation, raw data values for attributes are
replaced by ranges or higher conceptual levels.
Concept hierarchies for numerical attributes can be constructed automatically based on data
discretization.

Binning:
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction and concept
hierarchy generation.

Histogram Analysis:
Like binning, histogram analysis is an unsupervised discretization technique because it does not
use class information. Histograms partition the values for an attribute, A, into disjoint ranges called
buckets.

Entropy-Based Discretization:
Entropy is one of the most commonly used discretization measures. It explores class
distribution information in its calculation and determination of split-points (data values for partitioning
an attribute range).
The basic method for entropy-based discretization of an attribute A within the set is as follows:

1. Each value of A can be considered as a potential interval boundary or split-point (denoted
split_point) to partition the range of A. That is, a split-point for A can partition the tuples in D into two
subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating
a binary discretization.
2. Entropy-based discretization, as mentioned above, uses information regarding the class label of
tuples. To explain the intuition behind entropy-based discretization, we must take a glimpse at
classification. Suppose we want to classify the tuples in D by partitioning on attribute A and
some split-point. Ideally, we would like this partitioning to result in an exact classification of the
tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1
will fall into one partition, and all of the tuples of class C2 will fall into the other partition.
However, this is unlikely. For example, the first partition may contain many tuples of C1, but
also some of C2. How much more information would we still need for a perfect classification,
after this partitioning?
This amount is called the expected information requirement for classifying a tuple in D based on
partitioning by A. It is given by

InfoA(D) = (|D1| / |D|) · Entropy(D1) + (|D2| / |D|) · Entropy(D2)

where D1 and D2 correspond to the tuples in D satisfying the conditions A ≤ split_point and
A > split_point, respectively, and |D| is the number of tuples in D. The entropy function for a given set
is calculated based on the class distribution of the tuples in the set. For example, given m classes
C1, C2, ..., Cm, the entropy of D1 is

Entropy(D1) = − Σ pi log2(pi),   summed over i = 1, ..., m

where pi is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci
in D1 by |D1|.
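A minimal Python sketch of entropy-based split-point selection that implements the expected-information formula above (the attribute values and class labels are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a class distribution: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    """Try each value of A as a split point and keep the one with the
    smallest expected information requirement Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        split = pairs[i][0]
        left  = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        if not right:
            continue
        info = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        if info < best_info:
            best_point, best_info = split, info
    return best_point, best_info

# Hypothetical attribute values and two classes C1/C2
ages    = [23, 25, 30, 35, 40, 45, 50, 55]
classes = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]
print(best_split(ages, classes))   # expected split at age 35
```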

ChiMerge:
ChiMerge is a discretization method. The discretization methods that we have studied up to this point have all
employed a top-down, splitting strategy. This contrasts with ChiMerge, which employs a bottom-up
approach by finding the best neighboring intervals and then merging these to form larger intervals,
recursively.
Cluster Analysis:
Cluster analysis is a popular data discretization method. A clustering algorithm can be
applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.
Clustering takes the distribution of A into consideration, as well as the closeness of data points, and
therefore is able to produce high-quality discretization results.
