DM-unit 1
Def1: Data mining refers to the process or method that extracts or “mines” interesting knowledge or
patterns from large amounts of data.
Def2: Data mining is the task of discovering interesting patterns from large amounts of data, where the
data can be stored in databases, data warehouses, or other information repositories.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
4. Clustering analyzes data objects without consulting a known class label. The objects are clustered
or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy of classes that group
similar events together.
5. Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
However, in some applications such as fraud detection, the rare events can be more interesting than the
more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
6. Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association, classification,
or clustering of time-related data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
Classification of Data Mining Systems:
Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database
systems, statistics, machine learning, visualization, and information science. Moreover, depending on
the data mining approach used, techniques from other disciplines may be applied, such as neural
networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or
high-performance computing.
Kinds of databases mined: A data mining system can be classified according to the kinds of databases
mined. Database systems can themselves be classified according to different criteria (such as data models,
or the types of data or applications involved).
Kinds of knowledge mined: Data mining systems can be categorized according to the kinds of
knowledge they mine, that is, based on data mining functionalities, such as characterization,
discrimination etc.
Kinds of techniques utilized: Data mining systems can be categorized according to the underlying data
mining techniques employed
Applications adapted: Data mining systems can also be categorized according to the applications they
are adapted to. For example, data mining systems may be tailored specifically for finance, telecommunications,
etc.
Data Mining Task Primitives:
Each user will have a data mining task in mind, that is, some form of data analysis that he or she would
like to have performed. A data mining task can be specified in the form of a data mining query, which
is input to the data mining system.
A data mining query is defined in terms of data mining task primitives. These primitives allow the user
to interactively communicate with the data mining system during discovery in order to direct the
mining process, or examine the findings from different angles or depths.
A data mining system can be integrated with a database or data warehouse system using the following coupling schemes:
• No coupling: The data mining system uses sources such as flat files to obtain the initial data set to
be mined, since no database system or data warehouse system functions are implemented as part of the
process. Thus, this architecture represents a poor design choice.
• Loose coupling: The data mining system is not integrated with the database or data warehouse
system beyond their use as the source of the initial data set to be mined, and possible use in storage of
the results. Thus, this architecture can take advantage of the flexibility, efficiency and features such as
indexing that the database and data warehousing systems may provide. However, it is difficult for loose
coupling to achieve high scalability and good performance with large data sets as many such systems
are memory-based.
• Semi tight coupling: Some of the data mining primitives such as aggregation, sorting or
precomputation of statistical functions are efficiently implemented in the database or data warehouse
system, for use by the data mining system during mining-query processing. Also, some frequently used
intermediate mining results can be precomputed and stored in the database or data warehouse system,
thereby enhancing the performance of the data mining system.
• Tight coupling: The database or data warehouse system is fully integrated as part of the data
mining system and thereby provides optimized data mining query processing. Thus, the data mining
subsystem is treated as one functional component of an information system. This is a highly desirable
architecture as it facilitates efficient implementations of data mining functions, high system
performance, and an integrated information processing environment.
Major issues in data mining:
1. Mining methodology and user interaction issues:
• Pattern evaluation (the interestingness problem): A data mining system can uncover thousands of patterns.
Many of the patterns discovered may be uninteresting to the given user, either because they represent
common knowledge or lack novelty.
2. Performance issues: These include efficiency, scalability, and parallelization of data mining
algorithms.
• Handling of relational and complex types of data: Because relational databases and data
warehouses are widely used, the development of efficient and effective data mining systems for such
data is important.
• Descriptive data summarization helps us study the general characteristics of the data and identify the
presence of noise or outliers, which is useful for successful data cleaning and data integration.
• For many data preprocessing tasks, users would like to learn about data characteristics regarding both
central tendency and dispersion of the data.
The most common numerical measure of the “center” of a data set is the arithmetic mean. Let x1, x2, ..., xN
be a set of N values; the mean of this set is (x1 + x2 + ... + xN) / N.
This corresponds to the built-in aggregate function, average (avg() in SQL), provided in relational
database systems.
When each value xi is associated with a weight wi, the measure is called the weighted arithmetic mean or the
weighted average: (w1·x1 + w2·x2 + ... + wN·xN) / (w1 + w2 + ... + wN).
A holistic measure is a measure that must be computed on the entire data set as a whole; it cannot be
computed by partitioning the given data into subsets and merging the values obtained for the measure in
each subset. The median is an example of a holistic measure.
• Let x1, x2, ..., xN be a set of observations for some attribute. The range of the set is the difference
between the largest (max()) and smallest (min()) values.
• The most commonly used percentiles other than the median are quartiles. The first quartile, denoted
by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile.
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or
buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle
whose height is equal to the count or relative frequency of the values at the bucket.
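The following is a minimal Python sketch (standard library only) of how the measures above could be computed for a small numeric attribute and summarized with an equal-width histogram; the values and bucket count are invented for illustration.

    import statistics

    # Hypothetical observations for a numeric attribute such as price
    values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    mean = statistics.mean(values)                 # central tendency (avg() in SQL)
    median = statistics.median(values)             # a holistic measure: needs the whole data set
    q1, q2, q3 = statistics.quantiles(values, n=4) # quartiles: 25th, 50th, 75th percentiles
    value_range = max(values) - min(values)        # range = max() - min()

    # Equal-width histogram with 3 buckets: each bucket's height is its count
    width = (max(values) - min(values)) / 3
    buckets = [0, 0, 0]
    for v in values:
        idx = min(int((v - min(values)) / width), 2)  # clamp the maximum into the last bucket
        buckets[idx] += 1

    print(mean, median, (q1, q3), value_range, buckets)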
Data Cleaning:
Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
1. Cleaning in the case of missing values.
2. Cleaning noisy data, where noise is a random error or variance in a measured variable.
3. Cleaning with data discrepancy detection and data transformation tools.
Data Integration:
Data integration is defined as combining heterogeneous data from multiple sources into a common
source (data warehouse). Data integration uses data migration tools, data synchronization tools,
and the ETL (Extract, Transform, Load) process.
Data Selection:
Data selection is defined as the process where data relevant to the analysis is decided and retrieved
from the data collection. For this we can use neural networks, decision trees, naive Bayes,
clustering, and regression methods.
Data Transformation:
Data transformation is defined as the process of transforming data into the form required by the
mining procedure. Data transformation is a two-step process:
1. Data mapping: assigning elements from the source base to the destination to capture the transformations.
2. Code generation: creation of the actual transformation program.
Pattern Evaluation:
Pattern evaluation is defined as identifying strictly increasing patterns representing knowledge based
on given interestingness measures. It finds the interestingness score of each pattern and
uses summarization and visualization to make the data understandable by the user.
Data pre-processing in data mining:
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments of equal size, and then various methods are performed to complete the task.
Each segment is handled separately: all values in a segment can be replaced by the segment mean,
or the boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function.The regression used
may be linear (having one independent variable) or multiple (having multiple
independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they
will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This
involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (e.g., -1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be
done using various techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features are
high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant
analysis (LDA), and non-negative matrix factorization (NMF); a small PCA sketch follows this list.
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to
reduce the size of the dataset while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and systematic sampling.
Clustering: This involves grouping similar data points together into clusters. Clustering is often used
to reduce the size of the dataset by replacing similar data points with a representative centroid. It can
be done using techniques such as k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information.
Compression is often used to reduce the size of the dataset for storage and transmission purposes. It
can be done using techniques such as wavelet compression, JPEG compression, and gzip
compression.
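As a rough illustration of the feature-extraction idea above, the sketch below projects a small data set onto its top principal components using NumPy's SVD; the array contents and the choice of two components are arbitrary assumptions made only for the example.

    import numpy as np

    # Toy data: 6 tuples described by 4 numeric attributes (values are arbitrary)
    X = np.array([
        [2.5, 2.4, 0.5, 1.1],
        [0.5, 0.7, 2.2, 2.9],
        [2.2, 2.9, 0.4, 1.0],
        [1.9, 2.2, 0.6, 1.3],
        [3.1, 3.0, 0.3, 0.9],
        [0.6, 0.9, 2.0, 3.1],
    ])

    # Principal component analysis (PCA) via singular value decomposition
    X_centered = X - X.mean(axis=0)            # center each attribute
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    k = 2                                      # keep the two strongest components
    X_reduced = X_centered @ Vt[:k].T          # lower-dimensional representation (6 x 2)

    print(X_reduced.shape)                     # (6, 2): same tuples, fewer derived attributes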
Data Cleaning Process:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
1. Missing values:
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: usually done when the class label is missing; it is not very effective
when a tuple contains several attributes with missing values.
(b) Filling in the missing value manually: time-consuming and may not be feasible for large data sets.
(c) Using a global constant to fill in the missing value: all missing attribute values are replaced by the
same constant, such as a label like “Unknown”.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode
for categorical (nominal) values:
For example, suppose that the average income of All Electronics customers is $28,000.
Use this value to replace any missing values for income.
(e) Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the
given tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as that of the given
tuple.
(f) Using the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using Bayesian formalism,
or decision tree induction. For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
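A small pandas sketch of strategies (d) and (e) above; the column names (income, credit_risk) and the values are invented for illustration, and pandas is assumed to be available.

    import pandas as pd

    # Hypothetical customer data with missing income values
    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "low"],
        "income":      [30000, None, 22000, None, 26000],
    })

    # (d) Replace missing income with the overall attribute mean
    df["income_mean"] = df["income"].fillna(df["income"].mean())

    # (e) Replace missing income with the mean income of the same credit-risk class
    df["income_class_mean"] = df["income"].fillna(
        df.groupby("credit_risk")["income"].transform("mean")
    )

    print(df)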
2. Noisy Data:
Noise is a random error or variance in a measured variable. Given a numerical attribute such
as, say, price, how can we “smooth” out the data to remove the noise?
• Binning:
• Regression:
• Clustering:
Binning:
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.
• In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly,
smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as
the bin boundaries; each bin value is then replaced by the closest boundary value.
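A minimal Python sketch of smoothing by bin means and by bin boundaries over equal-frequency (equidepth) bins; the price values and bin size are invented for illustration.

    # Sorted data for price (values chosen for illustration)
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bin_size = 3  # equal-frequency (equidepth) bins of 3 values each

    bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

    # Smoothing by bin means: every value in a bin is replaced by the bin mean
    by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

    # Smoothing by bin boundaries: each value is replaced by the closest boundary
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(by_means)    # e.g. [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_bounds)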
Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be
used to predict the other.
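A quick sketch of the idea: fit the “best” least-squares line to two attributes with NumPy and use the fitted values as the smoothed version of the noisy attribute; the data points are invented.

    import numpy as np

    # Two attributes; y is noisy and will be smoothed by its fitted values
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

    # Linear regression: find the best line y = w*x + b so x can predict y
    w, b = np.polyfit(x, y, deg=1)
    y_smoothed = w * x + b

    print(w, b)
    print(y_smoothed)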
Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
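One way this might look in code, using density-based clustering (DBSCAN from scikit-learn, an assumed dependency) so that points belonging to no cluster are flagged as outliers; the data and parameters are invented.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense groups of points plus one isolated point
    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                  [20.0, 20.0]])   # falls outside every cluster

    labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

    # DBSCAN labels points that belong to no cluster with -1
    outliers = X[labels == -1]
    print(outliers)   # the isolated point [20, 20]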
Data Integration:
* Data integration involves combining data from several disparate sources, which are stored using
various technologies, to provide a unified view of the data.
* It may include multiple databases, data cubes, or flat files.
* Metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity
contribute towards smooth data integration.
Advantages:
1. Independence.
Disadvantages:
1. Latency (since data needs to be loaded using ETL).
Issues to consider during data integration:
1. Schema integration.
2. Redundancy.
Schema integration:
How can equivalent real-world entities from multiple sources be matched up? This is referred to as the
entity identification problem.
Redundancy:
* It is another important issue.
* An attribute may be redundant if it can be “derived” from another table, such as annual revenue.
* The same real-world entity may have different attribute values in different sources. This may be due to
differences in representation, scaling, or encoding.
* An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute
in another.
Data Transformation:
* In data transformation, the data are transformed or consolidated into forms appropriate for mining.
1. Smoothing.
2. Aggregation.
3. Generalization.
4. Normalization.
5. Attribute construction.
Smoothing:
Works to remove noise from the data. Such techniques include binning, clustering, and
regression.
Aggregation:
Where summary or aggregation operations are applied to the data.
Generalization:
Where low-level or “primitive” data are replaced by higher-level concepts through the use of
concept hierarchies.
Normalization:
Where the attribute data are scaled so as to fall within a specified range, such as -1.0 to 1.0 or
0.0 to 1.0.
Attribute construction:
Where new attributes are constructed and added from the given set of attributes to help the
mining process.
Normalization methods include:
* Min-max normalization, which performs a linear transformation that maps a value of A from its original
range [min_A, max_A] onto a new range such as [0.0, 1.0].
* Z-score normalization.
Z-Score Normalization:
In z-score normalization, the values of an attribute A are normalized based on the mean and standard
deviation of A. A value v of A is normalized to v' = (v - mean_A) / std_A, where mean_A and std_A are the
mean and standard deviation of A.
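A brief NumPy sketch of both normalization methods; the attribute values and target range are invented for the example.

    import numpy as np

    values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # an income-like attribute

    # Min-max normalization to the range [0.0, 1.0]:
    #   v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    new_min, new_max = 0.0, 1.0
    minmax = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

    # Z-score normalization:
    #   v' = (v - mean_A) / std_A
    zscore = (values - values.mean()) / values.std()

    print(minmax)
    print(zscore)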
Data Reduction:
Data reduction strategies obtain a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
A warehouse may store terabytes of data, so complex data analysis/mining may take a very long
time to run on the complete data set; this is why data reduction techniques are needed.
2. Attribute subset selection:
Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced
set. At each step, the best of the remaining attributes is added to the set.
Stepwise backward elimination: The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the procedure selects
the best attribute and removes the worst from among the remaining Attributes.
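A rough Python sketch of the greedy shape of stepwise forward selection (backward elimination is the mirror image); the scoring function, here mean absolute correlation with a target, and all data are stand-ins chosen only for illustration, not a prescribed method.

    import numpy as np

    def score(X, cols, y):
        # Stand-in attribute-subset score: mean |correlation| with the target
        if not cols:
            return 0.0
        return np.mean([abs(np.corrcoef(X[:, c], y)[0, 1]) for c in cols])

    def forward_selection(X, y, k):
        # Start with an empty set; at each step add the best remaining attribute
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda c: score(X, selected + [c], y))
            selected.append(best)
            remaining.remove(best)
        return selected

    # Tiny illustrative data: first two attributes informative, the last one noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)
    print(forward_selection(X, y, k=2))   # likely [0, 1]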
3. Dimensionality reduction:
In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data reduction is called
lossless. If, instead, we can reconstruct only an approximation of the original data, then the data
reduction is called lossy.
a) Wavelet Transforms:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when
applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we consider
each tuple as an n-dimensional data vector.
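A tiny sketch of the idea behind the DWT using one level of the simplest (Haar) wavelet; truncating the small detail coefficients would give a compressed, lossy representation. The data vector is invented.

    import numpy as np

    def haar_dwt_step(x):
        # One level of the Haar DWT: pairwise averages and differences
        x = np.asarray(x, dtype=float)
        approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth / approximation coefficients
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
        return approx, detail

    # An 8-value data vector (a tuple treated as an n-dimensional vector)
    x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
    approx, detail = haar_dwt_step(x)
    print(approx)   # combined with detail, the output has the same length as x
    print(detail)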
4. Numerosity reduction:
where the data are replaced or estimated by alternative, smaller data representations such
as parametric models (which need store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of histograms.
b) Histograms:
Histograms use binning to approximate data distributions and are a popular form of data
reduction.
o Equal-width: In an equal-width histogram, the width of each bucket range is uniform (such as
a width of $10 for each bucket).
o Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are created so
that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the
same number of contiguous data samples).
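A short NumPy sketch contrasting the two bucket types; the price list and bucket count are invented.

    import numpy as np

    prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 15, 15, 18, 20, 21, 25, 30])
    n_buckets = 3

    # Equal-width: bucket boundaries are evenly spaced over the value range
    width_edges = np.linspace(prices.min(), prices.max(), n_buckets + 1)
    width_counts, _ = np.histogram(prices, bins=width_edges)

    # Equal-frequency (equidepth): boundaries chosen so bucket counts are roughly equal
    freq_edges = np.quantile(prices, np.linspace(0, 1, n_buckets + 1))
    freq_counts, _ = np.histogram(prices, bins=freq_edges)

    print(width_edges, width_counts)   # uniform widths, varying counts
    print(freq_edges, freq_counts)     # varying widths, roughly uniform counts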
c) Clustering:
Clustering techniques consider data tuples as objects. They partition the objects into groups or
clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other
clusters.
d) Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be
represented by a much smaller random sample (or subset) of the data.
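A small pandas sketch of simple random sampling and stratified sampling as data reduction; the data frame, column names, and sampling fraction are invented, and pandas is assumed to be available.

    import pandas as pd

    # Hypothetical data set of 1,000 customers in three regions
    df = pd.DataFrame({
        "customer_id": range(1000),
        "region": ["north", "south", "west"] * 333 + ["north"],
    })

    # Simple random sample without replacement: 5% of the tuples
    srs = df.sample(frac=0.05, random_state=0)

    # Stratified sample: 5% from each region, preserving the region proportions
    stratified = df.groupby("region", group_keys=False).sample(frac=0.05, random_state=0)

    print(len(srs), stratified["region"].value_counts())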
Binning:
Binning is a top-down splitting technique based on a specified number of bins.
These methods are also used as discretization methods for numerosity reduction and concept
hierarchy generation.
Histogram Analysis:
Like binning, histogram analysis is an unsupervised discretization technique because it does not
use class information. Histograms partition the values for an attribute, A, into disjoint ranges called
buckets.
Entropy-Based Discretization:
Entropy is one of the most commonly used discretization measures. It explores class
distribution information in its calculation and determination of split-points (data values for partitioning
an attribute range).
The basic method for entropy-based discretization of an attribute A within the set is as follows:
1. Each value of A can be considered as a potential interval boundary or split-point (denoted split_point)
to partition the range of A. That is, a split-point for A can partition the tuples in D into two
subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating
a binary discretization.
2. Entropy-based discretization, as mentioned above, uses information regarding the class label of
tuples. To explain the intuition behind entropy-based discretization, we must take a glimpse at
classification. Suppose we want to classify the tuples in D by partitioning on attribute A and
some split-point. Ideally, we would like this partitioning to result in an exact classification of the
tuples. For example, if we had two classes, we would hope that all of the tuples of, say, class C1
will fall into one partition, and all of the tuples of class C2 will fall into the other partition.
However, this is unlikely. For example, the first partition may contain many tuples of C1, but
also some of C2. How much more information would we still need for a perfect classification,
after this partitioning?
This amount is called the expected information requirement for classifying a tuple in D based on
partitioning by A. It is given by

Info_A(D) = (|D1| / |D|) * Entropy(D1) + (|D2| / |D|) * Entropy(D2),

where D1 and D2 correspond to the tuples in D satisfying the conditions A ≤ split_point and A >
split_point, respectively; |D| is the number of tuples in D, and so on. The entropy function for a given set
is calculated based on the class distribution of the tuples in the set. For example, given m classes,
C1, C2, ..., Cm, the entropy of D1 is

Entropy(D1) = -(p1*log2(p1) + p2*log2(p2) + ... + pm*log2(pm)),

where pi is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1
by |D1|, the total number of tuples in D1.
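A compact Python sketch of how a split point could be chosen by minimizing this expected information; the attribute values and class labels are invented for the example.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a class-label list: -sum(p_i * log2(p_i))
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def expected_info(values, labels, split):
        # Info_A(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2)
        d1 = [l for v, l in zip(values, labels) if v <= split]
        d2 = [l for v, l in zip(values, labels) if v > split]
        n = len(labels)
        return len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)

    # Toy attribute A and class labels (C1/C2); every value is a candidate split point
    A      = [1, 2, 3, 4, 5, 6, 7, 8]
    labels = ["C1", "C1", "C1", "C2", "C2", "C2", "C2", "C1"]

    best = min(set(A), key=lambda s: expected_info(A, labels, s))
    print(best, expected_info(A, labels, best))   # split point minimizing the expected information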