
SUBJECT NAME : DATA MINING

SUBJECT CODE : CECS54A

SEMESTER : V

UNIT-I

1. What is Data Mining

2. Kinds of Data

3. Kinds of patterns

4. Technologies used for Data Mining

5. Major Issues in Data Mining

6. Data Objects and Attribute types

7. Data Visualization

8. Measuring Data Similarity and Dissimilarity

9. Data Preprocessing

10. Data Cleaning

11. Data Integration

12. Data Reduction

13. Data Transformation and Data Discretization


1. What is Data Mining?

The process of extracting information from huge sets of data to identify patterns and trends that allow a business to take data-driven decisions is called Data Mining.

Data Mining is the process of examining data from various perspectives to uncover hidden patterns and organize them into useful information. This information is collected and assembled in repositories such as data warehouses, analysed efficiently with data mining algorithms, and used to support decision making, cut costs, and generate revenue.

Data mining is the act of automatically searching large stores of data for trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery from Data (KDD).

The knowledge discovery process is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)

5. Data mining (an essential process where intelligent methods are applied to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
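
The sequence above can be read as a simple pipeline. Below is a minimal, self-contained sketch of these steps on a toy sales table; it assumes pandas is installed, and the table and column names are made up purely for illustration.

# A toy walk-through of the knowledge discovery steps, assuming pandas.
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "B"],
    "item":   ["pc", "printer", "pc", "pc", "printer"],
    "amount": [1200, np.nan, 1100, 1150, -50],
})

# 1. Data cleaning: drop inconsistent rows (negative amounts) and fill missing values.
clean = raw[raw["amount"].fillna(0) >= 0].copy()
clean["amount"] = clean["amount"].fillna(clean["amount"].mean())

# 2-3. Data integration and selection would combine further sources and keep
#      only the task-relevant data; here the single table already is the selection.

# 4. Data transformation: consolidate by aggregation (sales per branch per item).
summary = clean.groupby(["branch", "item"])["amount"].sum().reset_index()

# 5-6. Data mining and pattern evaluation would search this summary for
#      interesting patterns; 7. knowledge presentation displays the result.
print(summary)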
2. Kinds of data

What Kinds of Data Can Be Mined?

DM can be applied to any kind of data as long as the data are meaningful for a target application. The most basic forms of data for mining applications are database data, data warehouse data, and transactional data.

Database Data

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
Data Warehouses

Suppose a successful international company has branches all around the world. Each branch has its own set of databases. The president of the company has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. To answer such a query, the data from the branches are typically consolidated into a data warehouse, a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

To facilitate decision making, the data in a data warehouse are organized around major subjects.

Transactional Data

In general, each record in a transactional database captures a transaction, such as a customer’s

purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a

unique transaction identity number (trans ID) and a list of the items making up the transaction,

such as the items purchased in the transaction.


3. What Kinds of Patterns Can Be Mined?

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.

Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.

It can be useful to describe individual classes and concepts in summarized, concise, and yet
precise terms. Such descriptions of a class or a concept are called class/concept descriptions.

These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or (3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
Data discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. The target and contrasting
classes can be specified by the user, and the corresponding data objects retrieved through
database queries.

“How are discrimination descriptions output?”

Discrimination descriptions expressed in rule form are referred to as discriminant rules.

Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.

A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as computer and software. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.

Example: Association analysis. Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions. An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together.

This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules. Dropping the predicate notation, the above rule can be written simply as “computer ⇒ software [1%, 50%]”.
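
As a rough illustration of how support and confidence are computed, here is a minimal sketch over a made-up toy set of transactions (the resulting numbers differ from the 1% and 50% above, which come from a much larger database):

# Computing support and confidence for the rule computer => software on toy data.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "memory card"},
    {"software"},
    {"computer", "software"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
with_computer = sum(1 for t in transactions if "computer" in t)

support = both / n                  # fraction of all transactions containing both items
confidence = both / with_computer   # of computer buyers, fraction who also bought software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")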

Classification and Prediction

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

“How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.
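
As an illustration of the decision tree form described above, here is a minimal sketch that trains a small tree and prints its IF-THEN-like structure; it assumes scikit-learn is installed and uses the bundled Iris data purely as example training data:

# Training a decision tree classifier and exporting its rule-like structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target          # training data with known class labels

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)                        # learn the flow-chart-like tree

# Each internal node tests an attribute; each leaf gives a predicted class.
print(export_text(model, feature_names=list(iris.feature_names)))

# The derived model can now predict the class of an object whose label is unknown.
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))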

A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest neighbor classification.

Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, it is most often used to refer to numeric prediction.
Cluster Analysis

Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity.

Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.

Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

4. Which Technologies Are Used?

As a highly application-driven domain, data mining has incorporated many techniques from other
domains such as statistics, machine learning, pattern recognition, database and data warehouse
systems, information retrieval, visualization, algorithms, high-performance computing, and
many application domains. The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive applications. In this
section, we give examples of several disciplines that strongly influence the development of data
mining methods.
Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions. Statistical models are widely used to model data and data classes.

Machine learning

It investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples. Machine learning is a fast-growing discipline.

Supervised learning

It is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.

Unsupervised learning

It is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data. For example, an unsupervised learning method can take, as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the learned model cannot tell us the semantic meaning of the clusters found.

Semi-supervised learning

It is a class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.

Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.

5. MAJOR ISSUES IN DATA MINING

Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues fall into the following groups:

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues

Mining Methodology and User Interaction Issues

This refers to the following kinds of issues:
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little value.
6. Data Objects and Attribute Types
Data sets are made up of data objects. A data object represents an entity. In a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses.
Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes. In this section, we define attributes and look at the various attribute types.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or numeric—that the attribute can take. In the following subsections, we introduce each type.
Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. The values do not have any meaningful order. In computer science, the values are also known as enumerations.
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.

Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or
real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
For example, a temperature of 20◦C is five degrees higher than a temperature of 15◦C. Calendar dates are another example. For instance, the years 2002 and 2010 are eight years apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
7. Data visualization
Data visualization aims to communicate data clearly and effectively through graphical representation. Data visualization has been used extensively in many applications—for example, at work for reporting, managing business operations, and tracking progress of tasks. More popularly, we can take advantage of visualization techniques to discover data relationships that are otherwise not easily observable by looking at the raw data. Nowadays, people also use data visualization to create fun and interesting graphics.
8. Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing.
A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes, where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.
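
As a rough illustration, here is a minimal sketch of two common dissimilarity measures for numeric data objects, using only the Python standard library; the customer values are made up for illustration:

import math

def euclidean(p, q):
    # Euclidean distance between two objects described by numeric attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Manhattan (city-block) distance.
    return sum(abs(a - b) for a, b in zip(p, q))

# Two customers described by (income in thousands, age).
c1 = (45.0, 31)
c2 = (52.0, 29)
print(euclidean(c1, c2))   # a small value means the customers are similar
print(manhattan(c1, c2))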
9. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. It is an important step in data mining because we cannot work directly with raw data; the quality of the data should be checked before applying machine learning or data mining algorithms.
Why is Data preprocessing important?
Preprocessing of data is mainly done to check the data quality. The quality can be checked with respect to the following:
 Accuracy: To check whether the data entered is correct or not.

 Completeness: To check whether all required data is recorded and nothing is missing.

 Consistency: To check whether the same data stored in different places matches.

 Timeliness: The data should be updated correctly.

 Believability: The data should be trustworthy.

 Interpretability: The understandability of the data.

Major Tasks in Data Preprocessing:

1. Data cleaning

2. Data integration
3. Data reduction

4. Data transformation

5. Data Discretization

10. Data cleaning:

Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets; it also replaces missing values. Some techniques used in data cleaning are described below.

Handling missing values:


 Standard values like “Not Available” or “NA” can be used to replace the missing values.
 Missing values can also be filled manually, but this is not recommended when the dataset is big.
 The attribute’s mean value can be used to replace the missing value when the data is normally distributed; in the case of a non-normal distribution, the median value of the attribute can be used (see the sketch after this list).
 When regression or decision tree algorithms are used, the missing value can be replaced by the most probable value.
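
A minimal sketch of mean and median imputation, assuming pandas is installed; the column names and values are made up for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [45.0, np.nan, 52.0, 61.0, np.nan],   # roughly symmetric, so use the mean
    "age":    [31, 29, np.nan, 40, 85],             # skewed by an outlier, so use the median
})

df["income"] = df["income"].fillna(df["income"].mean())
df["age"] = df["age"].fillna(df["age"].median())
print(df)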

Noisy:

Noisy data generally means data containing random errors or unnecessary data points. Here are some of the methods used to handle noisy data.

 Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are distributed into bins. There are three methods for smoothing the data in a bin. Smoothing by bin means: the values in the bin are replaced by the mean value of the bin. Smoothing by bin medians: the values in the bin are replaced by the median value. Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value. (A binning sketch follows this list.)
 Regression: This is used to smooth the data and helps to handle data when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for our analysis.
 Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
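
A minimal sketch of equal-depth binning with smoothing by bin means, using only the Python standard library; the price values are made up for illustration:

sorted_prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])

depth = 3  # number of values per bin (equal-depth partitioning)
bins = [sorted_prices[i:i + depth] for i in range(0, len(sorted_prices), depth)]

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed = [round(sum(b) / len(b), 2) for b in bins for _ in b]
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]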

11. Data integration:

Data integration is the process of combining data from multiple sources into a single dataset. The data integration process is one of the main components of data management. Some problems to be considered during data integration are listed below.

 Schema integration: Integrating metadata (data that describes other data) from different sources.
 Entity identification problem: Identifying entities across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
 Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; attribute values in one database may differ from those in another. For example, the date format may differ, such as “MM/DD/YYYY” versus “DD/MM/YYYY”. (A small integration sketch follows this list.)
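
A minimal sketch of these ideas, assuming pandas is installed; the table names, column names, and values are made up for illustration:

import pandas as pd

students_a = pd.DataFrame({"student_id": [1, 2], "dob": ["03/14/2001", "07/02/2000"]})   # MM/DD/YYYY
students_b = pd.DataFrame({"student_id": [1, 2], "student_name": ["Asha", "Ravi"]})

# Resolve a data value conflict: convert the dates to a single common representation.
students_a["dob"] = pd.to_datetime(students_a["dob"], format="%m/%d/%Y")

# Entity identification: rows sharing the same student_id refer to the same entity.
merged = students_a.merge(students_b, on="student_id")
print(merged)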

12. Data reduction:

This process helps to reduce the volume of the data, which makes analysis easier while producing the same or almost the same result. The reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.

 Dimensionality reduction: This process is necessary for real-world applications because the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases, combining and merging attributes without losing their original characteristics. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the “curse of dimensionality” occurs. (See the sketch after this list.)
 Numerosity reduction: In this method, the representation of the data is made smaller by reducing its volume, without any loss of data.
 Data compression: The data is stored in a compressed form. The compression can be lossless or lossy. When there is no loss of information during compression it is called lossless compression, whereas lossy compression removes only unnecessary or less important information.
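
A minimal sketch of dimensionality reduction using principal component analysis, one common technique, assuming scikit-learn is installed:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                # 150 objects described by 4 numeric attributes
pca = PCA(n_components=2)           # keep 2 derived attributes instead of 4
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of the variance retained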

13. Data Transformation and Data Discretization

The change made in the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. Some methods of data transformation are listed below.

 Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can detect even small changes that help in prediction.
 Aggregation: In this method, the data is stored and presented in the form of a summary. Data coming from multiple sources is integrated and described for data analysis. This is an important step, since the accuracy of the analysis depends on the quantity and quality of the data; when the quality and the quantity of the data are good, the results are more relevant.
 Discretization: The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm–5 pm, 6 pm–8 pm).
 Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0. (A normalization and discretization sketch follows this list.)
 Data discretization: It transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem.
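
A minimal sketch of min-max normalization to the range [-1.0, 1.0] and a simple interval discretization, using only the Python standard library; the age values and interval labels are made up for illustration:

ages = [18, 22, 25, 31, 40, 52, 67]

lo, hi = min(ages), max(ages)
new_min, new_max = -1.0, 1.0

# Min-max normalization: rescale each value into the range [-1.0, 1.0].
normalized = [((a - lo) / (hi - lo)) * (new_max - new_min) + new_min for a in ages]
print([round(v, 2) for v in normalized])

# Discretization: map each numeric age to an interval (concept) label.
def age_group(age):
    if age < 25:
        return "youth"
    elif age < 50:
        return "adult"
    return "senior"

print([age_group(a) for a in ages])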
