Unit 1
SEMESTER : V
UNIT-I
2. Kinds of Data
3. Kinds of Patterns
7. Data Visualization
8. Measuring Data Similarity and Dissimilarity
9. Data Preprocessing
Data Mining is the process of extracting information from huge data sets to identify patterns, trends, and useful knowledge that allow a business to take data-driven decisions. It is the process of investigating hidden patterns of information from various perspectives and categorizing them into useful data, which is collected and assembled in particular areas such as data warehouses. This supports efficient analysis with data mining algorithms, helps decision making and other data requirements, and eventually leads to cost cutting and revenue generation.
Data mining is the act of automatically searching large stores of data to find trends and patterns that go beyond simple analysis procedures. Data mining uses complex mathematical algorithms to segment the data and to evaluate the probability of future events. Data mining is also called Knowledge Discovery from Data (KDD).
The knowledge discovery process is shown in the figure as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to users)
Data mining can be applied to any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, and transactional data.
Database Data
Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
Data Warehouses
Suppose a successful international company has branches all around the world. Each branch has its own set of databases. The president of the company has asked you to provide an analysis of the company’s sales per item type per branch for the third quarter.
To facilitate decision making, the data in a data warehouse are organized around major subjects.
Transactional Data
In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
It can be useful to describe individual classes and concepts in summarized, concise, and yet
precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms; (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes); or (3) both data characterization and discrimination.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns, including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as Computer and Software. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
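As a small illustration of support counting (a sketch not taken from the text; the transactions and the minimum-support threshold below are invented), the following Python snippet finds which pairs of items appear together frequently in a toy transactional data set:

from itertools import combinations

# Hypothetical transactional data set: each transaction is a set of purchased items.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "memory card"},
    {"software", "printer"},
]
min_support = 2  # treat an itemset as frequent if it occurs in at least 2 transactions

items = sorted({item for t in transactions for item in t})
for candidate in combinations(items, 2):
    support = sum(1 for t in transactions if set(candidate) <= t)
    if support >= min_support:
        print(candidate, "support =", support)   # e.g. ('computer', 'software') support = 2

A full frequent-pattern miner such as Apriori prunes candidates level by level, but the underlying idea of counting support against the transactions is the same.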
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules.
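As a minimal sketch (the attribute names, thresholds, and class labels below are invented for illustration, not taken from the text), a small decision tree for a buys-computer style decision can be written directly as IF-THEN classification rules in Python:

# Each root-to-leaf path of the tree becomes one IF-THEN classification rule.
def classify(age, student, credit_rating):
    if age <= 30:                          # test on the attribute "age"
        return "buys" if student else "does not buy"
    elif age <= 40:                        # middle-aged customers: leaf node
        return "buys"
    else:                                  # age > 40: further test on "credit_rating"
        return "buys" if credit_rating == "fair" else "does not buy"

print(classify(age=25, student=True, credit_rating="fair"))        # buys
print(classify(age=45, student=False, credit_rating="excellent"))  # does not buy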
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest-neighbor classification. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, it is most often used for numeric prediction, where regression analysis is the statistical methodology commonly applied.
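As a simple example of numeric prediction (a sketch with made-up data points, not a method described in the text), a least-squares line can be fitted and then used to predict a continuous value:

# Fit y = a*x + b to made-up (x, y) pairs, then predict a missing numeric value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print("predicted value at x = 5:", round(a * 5 + b, 2))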
Cluster Analysis
Clustering analyzes data objects without consulting known class labels. The objects are grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity; that is, objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.
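A simple way to see this in code (one possible approach with invented values; the two-standard-deviation cutoff is an arbitrary choice, not a rule from the text) is to flag values that lie far from the mean of the data set:

# Flag values that deviate from the mean by more than two standard deviations.
values = [12, 14, 13, 15, 14, 13, 98, 12, 15]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

outliers = [v for v in values if abs(v - mean) > 2 * std]
print("outliers:", outliers)   # [98]

In a fraud-detection setting, such rare, extreme values are exactly the cases an analyst would want to inspect rather than discard.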
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data, distinct features
of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
As a highly application-driven domain, data mining has incorporated many techniques from other
domains such as statistics, machine learning, pattern recognition, database and data warehouse
systems, information retrieval, visualization, algorithms, high-performance computing, and
many application domains. The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive applications. In this
section, we give examples of several disciplines that strongly influence the development of data
mining methods.
Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes.
Machine learning
It investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples. Machine learning is a fast-growing discipline.
Supervised learning
It is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
Unsupervised learning
It is essentially a synonym for clustering. The learning process is unsupervised since the input
examples are not class labeled. Typically, we may use clustering to discover classes within the
data. For example, an unsupervised learning method can take, as input, a set of images of
handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to
the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the
learned model cannot tell us the semantic meaning of the clusters found.
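The following sketch (invented one-dimensional values, with k fixed at 2) shows the idea: a k-means style procedure groups similar values without ever seeing a class label.

# 1-D k-means with k = 2 on made-up values; no class labels are used.
values = [1.0, 1.2, 0.8, 8.9, 9.1, 9.4]
centers = [values[0], values[-1]]               # naive initialisation

for _ in range(10):                             # a few refinement iterations
    clusters = {0: [], 1: []}
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    centers = [sum(pts) / len(pts) if pts else centers[i] for i, pts in clusters.items()]

print("cluster centers:", [round(c, 2) for c in centers])   # roughly [1.0, 9.13]

The two clusters found correspond to the two groups of values, but, as with the handwritten digits, the algorithm itself cannot say what the groups mean.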
Semi-supervised learning
It is a class of machine learning techniques that make use of both labeled and unlabeled
examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
Data mining is not an easy task, as the algorithms used can become very complex, and data are not always available in one place; they need to be integrated from various heterogeneous data sources. These factors also give rise to a number of issues in practice.
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or
real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
Calendar dates are another example. For instance, the years 2002 and 2010 are eight years
apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
7. Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical representation. Data visualization has been used extensively in many applications, for example, at work for reporting, managing business operations, and tracking progress of tasks.
More popularly, we can take advantage of visualization techniques to discover data relationships that are otherwise not easily observable by looking at the raw data.
Nowadays, people also use data visualization to create fun and interesting graphics.
8. Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing.
A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.
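For two such objects, dissimilarity is often computed as a distance. A minimal sketch (the attribute values below are invented; income is in thousands):

import math

# Two hypothetical customer objects described by (income, age).
x = (55.0, 32.0)
y = (61.0, 35.0)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # straight-line distance
manhattan = sum(abs(a - b) for a, b in zip(x, y))                # city-block distance
print("Euclidean:", round(euclidean, 2), "Manhattan:", manhattan)

Smaller distances mean the objects are more alike; in practice the attributes are usually normalized first so that one attribute (such as income) does not dominate the distance.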
9. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining, as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
Why is Data preprocessing important?
Preprocessing of data is mainly to check the data quality. The quality can be checked by the following:
Accuracy: To check whether the data entered is correct or not.
Consistency: To check whether the same data is kept consistently in all the places where it appears.
The major steps involved in data preprocessing are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5. Data Discretization
10. Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets, and it also replaces missing values. There are some techniques in data cleaning:
Handling noisy data:
Binning: This method is used to smooth or handle noisy data. First, the data are sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing data in a bin.
Smoothing by bin mean: In this method, the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: In this method, the values in the bin are replaced by the median value of the bin.
Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
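A small worked sketch of these three smoothing methods (the price values below are a common textbook-style toy example, partitioned into equal-depth bins of size 3):

# Sorted values partitioned into equal-depth bins, then smoothed three ways.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

by_mean = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundary = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print("bins:       ", bins)          # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print("by mean:    ", by_mean)       # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]
print("by median:  ", by_median)     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print("by boundary:", by_boundary)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]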
Regression: This is used to smooth the data and helps to handle data when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for our analysis.
Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
11. Data integration:
Data integration is the process of combining data from multiple sources into a single, unified dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; for example, the attribute values from one database may differ from another database, such as the date format differing between “MM/DD/YYYY” and “DD/MM/YYYY”.
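A minimal sketch of resolving such a conflict during integration (the dates are invented, and the formats are assumed to be known in advance):

from datetime import datetime

db1_date = "03/25/2023"   # stored as MM/DD/YYYY in one source
db2_date = "25/03/2023"   # stored as DD/MM/YYYY in another source

# Convert both to one canonical representation (ISO 8601) before merging.
canonical1 = datetime.strptime(db1_date, "%m/%d/%Y").date().isoformat()
canonical2 = datetime.strptime(db2_date, "%d/%m/%Y").date().isoformat()
print(canonical1, canonical2)   # 2023-03-25 2023-03-25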
12. Data reduction:
This process helps in reducing the volume of the data, which makes analysis easier yet produces the same or almost the same result. This reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This process is necessary for real-world applications as the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set can be reduced. It combines and merges the attributes of the data without losing their original characteristics, which also helps to reduce storage space and computation time. When the data are highly dimensional, a problem called the “curse of dimensionality” occurs.
Numerosity reduction: In this method, the representation of the data is made smaller by reducing the volume. There will not be any loss of data in this reduction.
Data compression: Transforming the data into a compressed form is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression reduces the data size by removing only unnecessary information.
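Lossless compression can be demonstrated with Python's standard zlib module (a minimal sketch; the repeated string is just an illustration of highly redundant data):

import zlib

original = b"data mining " * 100            # highly redundant data compresses well
compressed = zlib.compress(original)

print("original size:", len(original), "compressed size:", len(compressed))
assert zlib.decompress(compressed) == original   # lossless: nothing is lost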
Data transformation:
The change made in the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. There are some methods of data transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can find even a simple change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated for the data analysis description. This is an important step, since the accuracy of the results depends on the quantity and quality of the data; when the quality and the quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example ranging from -1.0 to 1.0.
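A short sketch combining both ideas (the ages, the target range, and the interval labels are invented for illustration):

# Min-max normalization of made-up ages into the range [-1.0, 1.0],
# followed by discretization of the same values into concept labels.
ages = [18, 25, 33, 47, 52, 61]

lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) * 2.0 - 1.0 for a in ages]
print([round(v, 2) for v in normalized])         # values now lie in [-1.0, 1.0]

def to_label(age):
    if age < 30:
        return "young"
    elif age < 50:
        return "middle-aged"
    return "senior"

print([to_label(a) for a in ages])               # interval/concept labels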
Data Discretization:
Data discretization transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem.