Unit 1
SEMESTER : V
UNIT-I
2. Kinds of Data
3. Kinds of Patterns
7. Data Visualization
8. Measuring Data Similarity and Dissimilarity
9. Data Preprocessing
Data Mining is the process of extracting information from huge data sets to identify patterns, trends, and useful knowledge that allow a business to take data-driven decisions. It is the process of investigating hidden patterns of information from various perspectives and categorizing them into useful data, which is collected and assembled in particular areas such as data warehouses. This supports efficient analysis with data mining algorithms, helps decision making and other data requirements, and eventually leads to cost cutting and revenue generation.
Data mining is the act of automatically searching large stores of data to find trends and patterns that go beyond simple analysis procedures. Data mining uses complex mathematical algorithms to segment the data and to evaluate the probability of future events. Data mining is also called Knowledge Discovery from Data (KDD).
The knowledge discovery process is shown in the figure as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining, for example by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to users)
Data mining can be applied to any kind of data as long as the data are meaningful for a target application, such as database data, data warehouse data, and transactional data.
Database Data
Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
Data Warehouses
Suppose a successful international company has branches all around the world. Each branch has its own set of databases. The president of the company has asked you to provide an analysis of the company’s sales per item type per branch for the third quarter.
To facilitate decision making, the data in a data warehouse are organized around major subjects.
Transactional Data
In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
It can be useful to describe individual classes and concepts in summarized, concise, and yet
precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms; (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes); or (3) both data characterization and discrimination.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns, including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as Computer and Software. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
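As a small illustration of support counting (a sketch not taken from the text; the transactions and the minimum-support threshold below are invented), the following Python snippet finds which pairs of items appear together frequently in a toy transactional data set:

from itertools import combinations

# Hypothetical transactional data set: each transaction is a set of purchased items.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "memory card"},
    {"software", "printer"},
]
min_support = 2  # treat an itemset as frequent if it occurs in at least 2 transactions

items = sorted({item for t in transactions for item in t})
for candidate in combinations(items, 2):
    support = sum(1 for t in transactions if set(candidate) <= t)
    if support >= min_support:
        print(candidate, "support =", support)   # e.g. ('computer', 'software') support = 2

A full frequent-pattern miner such as Apriori prunes candidates level by level, but the underlying idea of counting support against the transactions is the same.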
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules.
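As a minimal sketch (the attribute names, thresholds, and class labels below are invented for illustration, not taken from the text), a small decision tree for a buys-computer style decision can be written directly as IF-THEN classification rules in Python:

# Each root-to-leaf path of the tree becomes one IF-THEN classification rule.
def classify(age, student, credit_rating):
    if age <= 30:                          # test on the attribute "age"
        return "buys" if student else "does not buy"
    elif age <= 40:                        # middle-aged customers: leaf node
        return "buys"
    else:                                  # age > 40: further test on "credit_rating"
        return "buys" if credit_rating == "fair" else "does not buy"

print(classify(age=25, student=True, credit_rating="fair"))        # buys
print(classify(age=45, student=False, credit_rating="excellent"))  # does not buy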
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest-neighbor classification. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, it is most often used for numeric prediction, where regression analysis is the statistical methodology commonly applied.
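As a simple example of numeric prediction (a sketch with made-up data points, not a method described in the text), a least-squares line can be fitted and then used to predict a continuous value:

# Fit y = a*x + b to made-up (x, y) pairs, then predict a missing numeric value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print("predicted value at x = 5:", round(a * 5 + b, 2))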
Cluster Analysis
Clustering analyzes data objects without consulting known class labels. The objects are grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity; that is, objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.
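A simple way to see this in code (one possible approach with invented values; the two-standard-deviation cutoff is an arbitrary choice, not a rule from the text) is to flag values that lie far from the mean of the data set:

# Flag values that deviate from the mean by more than two standard deviations.
values = [12, 14, 13, 15, 14, 13, 98, 12, 15]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

outliers = [v for v in values if abs(v - mean) > 2 * std]
print("outliers:", outliers)   # [98]

In a fraud-detection setting, such rare, extreme values are exactly the cases an analyst would want to inspect rather than discard.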
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Although this may include characterization, discrimination, association and
correlation analysis, classification, prediction, or clustering of time related data, distinct features
of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
As a highly application-driven domain, data mining has incorporated many techniques from other
domains such as statistics, machine learning, pattern recognition, database and data warehouse
systems, information retrieval, visualization, algorithms, high-performance computing, and
many application domains. The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its extensive applications. In this
section, we give examples of several disciplines that strongly influence the development of data
mining methods.
Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes.
Machine learning
It investigates how computers can learn (or improve their performance) based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data. For example, a typical machine learning problem is to program a computer so that it can automatically recognize handwritten postal codes on mail after learning from a set of examples. Machine learning is a fast-growing discipline.
Supervised learning
It is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.
Unsupervised learning
It is essentially a synonym for clustering. The learning process is unsupervised since the input
examples are not class labeled. Typically, we may use clustering to discover classes within the
data. For example, an unsupervised learning method can take, as input, a set of images of
handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to
the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the
learned model cannot tell us the semantic meaning of the clusters found.
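The following sketch (invented one-dimensional values, with k fixed at 2) shows the idea: a k-means style procedure groups similar values without ever seeing a class label.

# 1-D k-means with k = 2 on made-up values; no class labels are used.
values = [1.0, 1.2, 0.8, 8.9, 9.1, 9.4]
centers = [values[0], values[-1]]               # naive initialisation

for _ in range(10):                             # a few refinement iterations
    clusters = {0: [], 1: []}
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    centers = [sum(pts) / len(pts) if pts else centers[i] for i, pts in clusters.items()]

print("cluster centers:", [round(c, 2) for c in centers])   # roughly [1.0, 9.13]

The two clusters found correspond to the two groups of values, but, as with the handwritten digits, the algorithm itself cannot say what the groups mean.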
Semi-supervised learning
It is a class of machine learning techniques that make use of both labeled and unlabeled
examples when learning a model. In one approach, labeled examples are used to learn class models and unlabeled examples are used to refine the boundaries between classes.
Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
Data mining is not an easy task, as the algorithms used can become very complex, and data are not always available in one place; they need to be integrated from various heterogeneous data sources. These factors also give rise to a number of issues in practice.
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or
real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
Calendar dates are another example. For instance, the years 2002 and 2010 are eight years
apart.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
7. Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical representation. Data visualization has been used extensively in many applications, for example, at work for reporting, managing business operations, and tracking progress of tasks.
More popularly, we can take advantage of visualization techniques to discover data relationships that are otherwise not easily observable by looking at the raw data.
Nowadays, people also use data visualization to create fun and interesting graphics.
8. Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing.
A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity toward other objects in the model.
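For two such objects, dissimilarity is often computed as a distance. A minimal sketch (the attribute values below are invented; income is in thousands):

import math

# Two hypothetical customer objects described by (income, age).
x = (55.0, 32.0)
y = (61.0, 35.0)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))   # straight-line distance
manhattan = sum(abs(a - b) for a, b in zip(x, y))                # city-block distance
print("Euclidean:", round(euclidean, 2), "Manhattan:", manhattan)

Smaller distances mean the objects are more alike; in practice the attributes are usually normalized first so that one attribute (such as income) does not dominate the distance.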
9. Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining, as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
Why is Data preprocessing important?
Preprocessing of data is mainly to check the data quality. The quality can be checked by the following:
Accuracy: To check whether the data entered is correct or not.
Consistency: To check whether the same data is kept consistently in all the places where it appears.
The major steps involved in data preprocessing are:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
5. Data Discretization
10. Data cleaning:
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from the datasets, and it also replaces missing values. There are some techniques in data cleaning:
Handling noisy data:
Binning: This method is used to smooth or handle noisy data. First, the data are sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing data in a bin.
Smoothing by bin mean: In this method, the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: In this method, the values in the bin are replaced by the median value of the bin.
Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
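A small worked sketch of these three smoothing methods (the price values below are a common textbook-style toy example, partitioned into equal-depth bins of size 3):

# Sorted values partitioned into equal-depth bins, then smoothed three ways.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

by_mean = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundary = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print("bins:       ", bins)          # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print("by mean:    ", by_mean)       # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]
print("by median:  ", by_median)     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print("by boundary:", by_boundary)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]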
Regression: This is used to smooth the data and helps to handle data when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for our analysis.
Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
11. Data integration:
Data integration is the process of combining data from multiple sources into a single, unified dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may differ when merged; for example, the attribute values from one database may differ from another database, such as the date format differing between “MM/DD/YYYY” and “DD/MM/YYYY”.
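A minimal sketch of resolving such a conflict during integration (the dates are invented, and the formats are assumed to be known in advance):

from datetime import datetime

db1_date = "03/25/2023"   # stored as MM/DD/YYYY in one source
db2_date = "25/03/2023"   # stored as DD/MM/YYYY in another source

# Convert both to one canonical representation (ISO 8601) before merging.
canonical1 = datetime.strptime(db1_date, "%m/%d/%Y").date().isoformat()
canonical2 = datetime.strptime(db2_date, "%d/%m/%Y").date().isoformat()
print(canonical1, canonical2)   # 2023-03-25 2023-03-25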
12. Data reduction:
This process helps in reducing the volume of the data, which makes analysis easier yet produces the same or almost the same result. This reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction: This process is necessary for real-world applications as the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set can be reduced. It combines and merges the attributes of the data without losing their original characteristics, which also helps to reduce storage space and computation time. When the data are highly dimensional, a problem called the “curse of dimensionality” occurs.
Numerosity reduction: In this method, the representation of the data is made smaller by reducing the volume. There will not be any loss of data in this reduction.
Data compression: Transforming the data into a compressed form is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression reduces the data size by removing only unnecessary information.
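Lossless compression can be demonstrated with Python's standard zlib module (a minimal sketch; the repeated string is just an illustration of highly redundant data):

import zlib

original = b"data mining " * 100            # highly redundant data compresses well
compressed = zlib.compress(original)

print("original size:", len(original), "compressed size:", len(compressed))
assert zlib.decompress(compressed) == original   # lossless: nothing is lost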
Data transformation:
The change made in the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. There are some methods of data transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can find even a simple change that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated for the data analysis description. This is an important step, since the accuracy of the results depends on the quantity and quality of the data; when the quality and the quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example ranging from -1.0 to 1.0.
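A short sketch combining both ideas (the ages, the target range, and the interval labels are invented for illustration):

# Min-max normalization of made-up ages into the range [-1.0, 1.0],
# followed by discretization of the same values into concept labels.
ages = [18, 25, 33, 47, 52, 61]

lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) * 2.0 - 1.0 for a in ages]
print([round(v, 2) for v in normalized])         # values now lie in [-1.0, 1.0]

def to_label(age):
    if age < 30:
        return "young"
    elif age < 50:
        return "middle-aged"
    return "senior"

print([to_label(a) for a in ages])               # interval/concept labels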
Data Discretization:
Data discretization transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows for mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem.