Data Mining Presentation

The document presents the syllabus for a data mining course. It gives the grading breakdown: 40% for midterms, 10% for exercises, 20% for the project, and 30% for the final exam. It outlines the six chapters to be covered: introduction, getting to know your data, data preprocessing, frequent pattern mining, classification, and clustering. It also provides an overview of key concepts in data mining, including knowledge discovery from data (KDD), the main data mining tasks and techniques, and challenges in data mining.

Dr. Mohammadi Zanjireh
Imam Khomeini International University (IKIU)
 Midterms: 4 × 10 = 40
 Exercises: 10
 Project: 20
 Final exam: 30
 Chapter 1 – Introduction
 Chapter 2 – Getting to Know Your Data
 Chapter 3 – Data Preprocessing
 Chapter 4 – Frequent Pattern Mining
 Chapter 5 – Classification
 Chapter 6 – Clustering
 “We are living in the information age” is a popular saying; however, we are actually living in the data age.

 Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.
 This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful tools are badly needed to automatically uncover valuable information from the tremendous amounts of data and to transform such data into organised knowledge.

 This necessity has led to the birth of data mining. The field is young, dynamic, and promising. Data mining has made and will continue to make great strides in our journey from the data age toward the coming information age.

 Example 1-1: Search Engine.
 A data-rich but information-poor situation.
 Data tombs.

What is Data Mining?

Data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.
 Knowledge discovery from data (KDD):
 Data Cleaning (preprocessing step).
 Data Integration (preprocessing step).
 Data Selection (preprocessing step).
 Data Mining.
 Pattern Evaluation.
 Knowledge Presentation.
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful
for a target application such as database data, data
warehouse data, transactional data, data streams,
multimedia data, and the WWW.
Exercise 1: What is the difference between a database and a data warehouse?

Exercise 2: Describe a number of data mining applications.
 Data mining tasks can be classified into two categories:

◦ Descriptive: characterises properties of the data in a target data set.

◦ Predictive: performs induction on the current data in order to make predictions.
 Data Cube:
 Fig 1.7:
◦ Drill-down
◦ Roll-up

◦ Slice
◦ Dice

◦ OLAP
◦ OLTP
 Which Technologies Are Used?
 Statistics

 Machine Learning (ML)


 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Efficiency and Scalability:
◦ The running time of a data mining algorithm must be predictable, short, and acceptable to applications.

 Parallel and distributed mining algorithms:
◦ Such algorithms first partition the data into “pieces”.
◦ Each piece is processed in parallel.
◦ The patterns from each partition are eventually merged.
◦ Parallel: within one machine.
◦ Distributed: across multiple machines.
 Handling noise, error, exceptions, and outliers:
◦ Data often contain noise, errors, exceptions, or outliers.
◦ These may confuse the data mining process, leading to the
derivation of erroneous patterns.
◦ Data cleaning, data preprocessing, and outlier detection and
removal, are examples of techniques that need to be integrated with
the data mining process.

 Privacy-preserving data mining:


◦ Data mining is useful. However, it poses the risk of disclosing an
individual’s personal information.
◦ We have to observe data sensitivity while performing data mining.
 Attributes: the columns of a data table; each row is an observation (object).

Student_ID   Name     Average
1001         Ali      17.12
1002         Reza     13.89
1003         Maryam   16.02
1004         Hasan    15.45
 Attribute types:
◦ Nominal: Subject, Occupation.

◦ Binary (0,1-T,F): Gender, medical test.


 Symmetric Binary.
 Asymmetric Binary.

◦ Ordinal: Drink_size (small, medium, large).

◦ Numeric
 Interval_scaled.
 Ratio_scaled.
 Discrete vs. Continuous attributes.
 Mean: x̄ = (1/n) Σ xᵢ
 Median: the middle value of the ordered data (the mean of the two middle values when n is even).
 Mode: the most frequent value.
 Variance and Standard Deviation: σ² = (1/n) Σ (xᵢ − x̄)², σ = √σ²
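As a quick illustration (not part of the slides), the following Python sketch computes these measures for the Average column of the student table above, using only the standard library:

import statistics

averages = [17.12, 13.89, 16.02, 15.45]  # the Average column from the table above

mean = statistics.mean(averages)            # arithmetic mean
median = statistics.median(averages)        # middle value of the sorted data
mode = statistics.multimode(averages)       # most frequent value(s); all values tie here
variance = statistics.pvariance(averages)   # population variance
std_dev = statistics.pstdev(averages)       # population standard deviation

print(mean, median, mode, variance, std_dev)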
 Data Matrix: an n × p table (n objects such as Student_1, …, Student_i, …, Student_n, each described by p attributes).

 Dissimilarity Matrix: an n × n (lower-triangular) matrix of the pairwise dissimilarities d(i, j).

 Similarity Matrix:
Sim(i, j) = 1 − d(i, j)
 Proximity measures for Nominal attributes: d(i, j) = (p − m) / p, where p is the number of attributes and m is the number of attributes on which objects i and j match.

 Example:
Id   Subject       Birth_City   Living_City   Eye_Colour
1    Computer      Teh          Teh           Black
2    Electricity   Teh          Kar           Brown
3    Mechanic      Qaz          Qaz           Brown
4    Computer      Kar          Qaz           Green

Dissimilarity Matrix =
0.00
0.75   0.00
1.00   0.75   0.00
0.75   1.00   0.75   0.00
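The matrix above can be reproduced with a short Python sketch; the helper name nominal_dissimilarity is illustrative, not from the slides:

def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two vectors of nominal attribute values."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p

# Rows: (Subject, Birth_City, Living_City, Eye_Colour) from the example table.
objects = [
    ("Computer",    "Teh", "Teh", "Black"),
    ("Electricity", "Teh", "Kar", "Brown"),
    ("Mechanic",    "Qaz", "Qaz", "Brown"),
    ("Computer",    "Kar", "Qaz", "Green"),
]

for i, oi in enumerate(objects):
    row = [nominal_dissimilarity(oi, objects[j]) for j in range(i + 1)]
    print(["%.2f" % d for d in row])   # lower-triangular dissimilarity matrix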

 Proximity measures for Binary attributes (based on the 2 × 2 contingency table of matches and mismatches).

 Euclidean Distance: d(i, j) = √( Σₖ (xᵢₖ − xⱼₖ)² )

 Manhattan Distance: d(i, j) = Σₖ |xᵢₖ − xⱼₖ|

 Supremum Distance: d(i, j) = maxₖ |xᵢₖ − xⱼₖ|
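A brief sketch of the three distances; the two example points below are made up for illustration:

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1.0, 3.5), (3.0, 2.0)   # hypothetical 2-D points
print(euclidean(p, q), manhattan(p, q), supremum(p, q))  # 2.5, 3.5, 2.0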
 Proximity measures for Ordinal attributes: replace each value by its rank r ∈ {1, …, M} and normalise it to z = (r − 1) / (M − 1).
Small   Medium   Large
1       2        3
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 Undergraduate Medium
2 PhD Small
3 Diploma Medium
4 Undergraduate Large
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 0.33 0.50
2 1.00 0.00
3 0.00 0.50
4 0.33 1.00
 Normalising:

Id Grade
1 30
2 52
3 84
4 45
5 25
6 20
7 91
8 65
9 42
10 32
 Normalising:

Id Grade Grade-min
1 30 10
2 52 32
3 84 64
4 45 25
5 25 5
6 20 0
7 91 71
8 65 45
9 42 22
10 32 12
 Normalising:

Id   Grade   Grade−min   (Grade−min)/(max−min)
1    30      10          0.14
2    52      32          0.45
3    84      64          0.90
4    45      25          0.35
5    25      5           0.07
6    20      0           0.00
7    91      71          1.00
8    65      45          0.63
9    42      22          0.31
10   32      12          0.17
 Proximity measures for mixed types: compute a dissimilarity in [0, 1] for each attribute (according to its type) and take their (weighted) average.
 Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A Excellent 45
2 Code B Fair 22
3 Code C Good 64
4 Code A Excellent 28
 Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A 1 0.55
2 Code B 0 0.00
3 Code C 0.5 1.00
4 Code A 1 0.14

Dissimilarity Matrix =
0.00
0.85   0.00
0.65   0.83   0.00
0.13   0.71   0.79   0.00
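A sketch of one way to reproduce this matrix, assuming equal weights for the three attributes (exact-match test for the nominal attribute, rank-normalised difference for the ordinal one, range-normalised difference for the numeric one); the helper names are illustrative:

records = [("Code A", "Excellent", 45),
           ("Code B", "Fair",      22),
           ("Code C", "Good",      64),
           ("Code A", "Excellent", 28)]

ordinal_rank = {"Fair": 0.0, "Good": 0.5, "Excellent": 1.0}   # (r - 1) / (M - 1)
numeric = [r[2] for r in records]
lo, hi = min(numeric), max(numeric)

def mixed_dissimilarity(a, b):
    d_nominal = 0.0 if a[0] == b[0] else 1.0
    d_ordinal = abs(ordinal_rank[a[1]] - ordinal_rank[b[1]])
    d_numeric = abs(a[2] - b[2]) / (hi - lo)
    return (d_nominal + d_ordinal + d_numeric) / 3   # equal-weight average

for i in range(len(records)):
    print(["%.2f" % mixed_dissimilarity(records[i], records[j]) for j in range(i + 1)])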
 Exercise:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A Excellent ----


2 ---- Fair 22
3 Code C ---- 64
4 Code A Excellent 28
 Cosine similarity:
D# team coach hockey baseball soccer penalty score win loss season
D1 5 0 3 0 2 0 0 2 0 0
D2 3 0 2 0 1 1 0 1 0 1
D3 0 7 0 2 1 0 0 3 0 0
D4 0 1 0 0 1 2 2 0 3 0
 Cosine similarity: sim(dᵢ, dⱼ) = (dᵢ · dⱼ) / (‖dᵢ‖ · ‖dⱼ‖)

Similarity Matrix =
1.00
0.94   1.00
0.17   0.12   1.00
0.07   0.17   0.23   1.00
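The similarity matrix can be checked with a few lines of Python (an illustrative sketch, not from the slides):

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

docs = [  # term-frequency vectors D1-D4 from the table above
    [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

for i in range(len(docs)):
    print(["%.2f" % cosine_similarity(docs[i], docs[j]) for j in range(i + 1)])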
 3.1-Data Cleaning.

 3.2-Data Integration.

 3.3-Data Reduction.

 3.4-Data Transformation.
 3.1- Data Cleaning
 3.1.1-Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.

 3.1.2-Noise
o Binning
o Smoothing by bin means.
o Smoothing by bin boundaries.
o Outlier analysis.
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34

Smoothing by bin means:


bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34

Smoothing by bin means:


bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.

Smoothing by bin boundaries:


bin1: 4, 4, 15. bin2: 21, 21, 24. bin3: 25, 25, 34
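A minimal sketch of equal-frequency binning with the two smoothing methods, assuming (as in this example) that the number of values divides evenly into the bins:

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into equal-frequency (equal-depth) bins."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])  # snap to nearer boundary
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(data, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]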
 Outlier Analysis:
Fig. 3.3

 3.2- Data Integration:
Metadata.

 3.3- Data Reduction


 3.3.1-Dimensionality reduction
o Discrete Wavelet Transform (DWT).
o Principal Component Analysis (PCA).

 3.3.2-Numerosity reduction
o Sampling.
o Clustering.
 3.4- Data Transformation
 Normalisation:
 Min-max normalisation: v′ = (v − min_A) / (max_A − min_A)

 Z-score normalisation: v′ = (v − Ā) / σ_A
 Example:
3, 4, 4, 5, 9
Ā = 5.00, σ = 2.10 (population standard deviation)

Min-max: 0, 0.17, 0.17, 0.33, 1.00

Z-score: -0.95, -0.48, -0.48, 0, 1.91
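A small sketch reproducing both normalisations; note that the z-scores on the slide correspond to the population standard deviation (σ ≈ 2.10):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5   # population std
    return [(v - mean) / std for v in values]

data = [3, 4, 4, 5, 9]
print(["%.2f" % v for v in min_max(data)])   # 0.00, 0.17, 0.17, 0.33, 1.00
print(["%.2f" % v for v in z_score(data)])   # -0.95, -0.48, -0.48, 0.00, 1.91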


 Descriptions
 Frequent Patterns
 Sequential Patterns
 Market Basket Analysis
 Association Rules:
 I = {I1, I2, …, In}: the set of all items.
 T: a transaction, T ⊆ I.
 D: the set of all transactions (the task-relevant database).
 TID: the transaction identifier.
 A => B [Support, Confidence]
A ⊂ I, B ⊂ I, A ∩ B = ∅
A ≠ ∅, B ≠ ∅
 Example:

Computer => Antivirus[Support=2%, Confidence=60%]

• Support
Support(A => B) = P(A ∪ B)

• Confidence
Confidence(A => B) = P(B | A) = Support(A ∪ B) / Support(A) = Support_Count(A ∪ B) / Support_Count(A)
 Mining Association Rules
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
• Example:
D = {T1, T2, T3, T4, T5, T6, T7, T8}
T1 = {Parsley, Onion, Olive, Cucumber, Tomato}
T2 = {Parsley, Cucumber, Tomato}
T3 = {Parsley, Onion, Bread, Salt, Cucumber, Tomato}
T4 = {Onion, Bread, Cucumber, Tomato}
T5 = {Onion, Salt, Tomato}
T6 = {Cheese, Bread}
T7 = {Cucumber, Onion, Tomato}
T8 = {Butter, Bread}

A => B [Support, Confidence]

{Cucumber, Tomato} => {Onion, Parsley}

• Exercise:
{Cucumber} => {Onion, Parsley}
{Cucumber, Tomato} => {Onion}
{Tomato} => {Onion}
 Apriori Algorithm
Proposed by Agrawal and Srikant in 1994.
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
 1. 𝑪𝟏 : min-sup = 2, Scan D.
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 2. 𝑳𝟏 :
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 3. 𝑪𝟐 = 𝑳𝟏 ∗ 𝑳𝟏 , Scan D
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼4 } 1
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
{𝐼3 , 𝐼4 } 0
{𝐼3 , 𝐼5 } 1
{𝐼4 , 𝐼5 } 0
 4. 𝑳𝟐 :
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
 Prune:
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐
Itemset Prune
{𝐼1 , 𝐼2 , 𝐼3 }
{𝐼1 , 𝐼2 , 𝐼5 }
{𝐼1 , 𝐼2 , 𝐼4 }
{𝐼1 , 𝐼3 , 𝐼5 }
{𝐼2 , 𝐼3 , 𝐼4 }
{𝐼2 , 𝐼3 , 𝐼5 }
{𝐼2 , 𝐼4 , 𝐼5 }
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐 , Scan D

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 6. 𝑳𝟑

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 7. 𝑪𝟒 = 𝑳𝟑 ∗ 𝑳𝟑 , Prune
 𝐿4 = ∅.

End of Step 1.
 Code of Apriori Algorithm:
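The slide leaves the code itself out; the following is a minimal Python sketch of Apriori (support counting, candidate join, and subset-based pruning), not the author's original implementation:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]

    def support_count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:              # candidate is contained in the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = support_count({frozenset([i]) for i in items})
    all_frequent, k = dict(frequent), 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep only candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = support_count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
for itemset, count in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)

Running it on the nine-transaction example with min-sup = 2 reproduces L1, L2, and L3 from the slides; strong rules can then be generated from each frequent itemset by checking the confidence threshold, as in the next slides.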
 Confidence: X = {I1, I2, I5}.

I1 => {I2, I5}: Confidence = 2/6 = 33%
I2 => {I1, I5}: Confidence = 2/7 = 29%
I5 => {I1, I2}: Confidence = 2/2 = 100%
{I1, I2} => I5: Confidence = 2/4 = 50%
{I1, I5} => I2: Confidence = 2/2 = 100%
{I2, I5} => I1: Confidence = 2/2 = 100%
 Confidence: X = {I1, I2, I5}.

I1 => {I2, I5}: Confidence = 2/6 = 33% ×
I2 => {I1, I5}: Confidence = 2/7 = 29% ×
I5 => {I1, I2}: Confidence = 2/2 = 100%
{I1, I2} => I5: Confidence = 2/4 = 50% ×
{I1, I5} => I2: Confidence = 2/2 = 100%
{I2, I5} => I1: Confidence = 2/2 = 100%

For Confidence ≥ 70%: the strong rules are I5 => {I1, I2}, {I1, I5} => I2, and {I2, I5} => I1.


 Exercise_2: Calculate confidence for X = {I1, I2, I3}.

 Exercise_3: Re-run the previous example using min-sup = 3 and min-conf = 60%.

 Exercise_4: Using min-sup = 30% and min-conf = 60%:

TID   Items Purchased
1     {Orange juice, Lemonade}
2     {Milk, Orange juice, Glass cleaner}
3     {Orange juice, Detergent, Lemonade}
4     {Glass cleaner, Lemonade}
5     {Lemonade, Chips}
 FP-growth (finding frequent itemsets without candidate generation).
Min-sup=2
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
1- Frequent items are sorted in descending order of support count:
L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}

2- T100 = {I1, I2, I5}: insert the path null → I2:1 → I1:1 → I5:1.
3- T200 = {I2, I4}: I2 becomes 2; new child I4:1 under I2.
4- T300 = {I2, I3}: I2 becomes 3; new child I3:1 under I2.
5- T400 = {I1, I2, I4}: I2 becomes 4, I1 becomes 2; new child I4:1 under I1.
6- T500 = {I1, I3}: new branch under null: I1:1 → I3:1.
7- T600 = {I2, I3}: I2 becomes 5; I3 under I2 becomes 2.
8- T700 = {I1, I3}: the side branch becomes I1:2 → I3:2.
9- T800 = {I1, I2, I3, I5}: I2 becomes 6, I1 under I2 becomes 3; new path I3:1 → I5:1 under that I1.
10- T900 = {I1, I2, I3}: I2 becomes 7, I1 under I2 becomes 4, I3 under that I1 becomes 2.

Final FP-tree:
null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2
11- The FP-tree is mined, starting from the last item of L:

Item   Conditional Pattern Base          Conditional FP-tree       Frequent Patterns Generated
I5     {I2, I1: 1}, {I2, I1, I3: 1}      {I2: 2, I1: 2}            {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {I2, I1: 1}, {I2: 1}              {I2: 2}                   {I2, I4: 2}
I3     {I2: 2}, {I2, I1: 2}, {I1: 2}     {I2: 4, I1: 2}, {I1: 2}   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {I2: 4}                           {I2: 4}                   {I2, I1: 4}

 Exercise_5: Calculate support and confidence for above rules.

 Exercise_6: Do exercise_4 using FP-growth algorithm.


 Code of FP-growth Algorithm:
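Again the slide omits the code; below is one compact Python sketch of FP-growth (FP-tree construction plus recursive mining of conditional pattern bases), not the author's original implementation. Run on the nine-transaction example with min-sup = 2 it reproduces the patterns listed above:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (items, count) pairs; return root, header table, item counts."""
    counts = defaultdict(int)
    for items, cnt in weighted_transactions:
        for i in items:
            counts[i] += cnt
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root, header = FPNode(None, None), defaultdict(list)
    for items, cnt in weighted_transactions:
        # Keep only frequent items, ordered by descending global support count.
        path = sorted((i for i in items if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += cnt
    return root, header, frequent

def fp_growth(weighted_transactions, min_sup, suffix=frozenset(), patterns=None):
    if patterns is None:
        patterns = {}
    _, header, frequent = build_tree(weighted_transactions, min_sup)
    for item in sorted(frequent, key=lambda i: frequent[i]):   # least frequent first
        new_suffix = suffix | {item}
        patterns[new_suffix] = frequent[item]
        # Conditional pattern base: the prefix path of every node holding `item`.
        cond_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                cond_db.append((path, node.count))
        if cond_db:
            fp_growth(cond_db, min_sup, new_suffix, patterns)
    return patterns

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
result = fp_growth([(t, 1) for t in db], min_sup=2)
for itemset, count in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)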
 Eclat (Equivalence CLAss Transformation) Algorithm
Vertical data format.
Min-sup=2

TID List of items


T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
Itemset        TID set
𝑰𝟏 {T100, T400, T500, T700, T800, T900}
𝑰𝟐 {T100, T200, T300, T400, T600, T800 , T900}
𝑰𝟑 {T300, T500, T600, T700, T800, T900}
𝑰𝟒 {T200, T400}
𝑰𝟓 {T100, T800}
Itemset        TID set
{𝑰𝟏 , 𝑰𝟐 } {T100, T400, T800, T900}
{𝑰𝟏 , 𝑰𝟑 } {T500, T700, T800 , T900}
{𝑰𝟏 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟓 } {T100, T800}
{𝑰𝟐 , 𝑰𝟑 } {T300, T600, T800, T900}

{𝑰𝟐 , 𝑰𝟒 } {T200, T400}


{𝑰𝟐 , 𝑰𝟓 } {T100, T800}

{𝑰𝟑 , 𝑰𝟒 } {}
{𝑰𝟑 , 𝑰𝟓 } {T800}
{𝑰𝟒 , 𝑰𝟓 } {}
Itemset        TID set
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 } {T800, T900}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓 } {T100, T800}
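A short Python sketch of Eclat in the spirit of the tables above: items are mapped to TID sets and frequent itemsets are grown by intersecting those sets (illustrative, not the author's code):

from collections import defaultdict

def eclat(transactions, min_sup):
    """Frequent itemsets via vertical TID-set intersection (Eclat)."""
    # Vertical format: item -> set of transaction ids.
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_sup:
                continue
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Extend with the remaining items by intersecting TID sets.
            new_candidates = []
            for other, other_tids in candidates[i + 1:]:
                inter = tids & other_tids
                if len(inter) >= min_sup:
                    new_candidates.append((other, inter))
            if new_candidates:
                recurse(itemset, new_candidates)

    recurse(set(), sorted(tidsets.items()))
    return frequent

db = {"T100": ["I1","I2","I5"], "T200": ["I2","I4"], "T300": ["I2","I3"],
      "T400": ["I1","I2","I4"], "T500": ["I1","I3"], "T600": ["I2","I3"],
      "T700": ["I1","I3"], "T800": ["I1","I2","I3","I5"], "T900": ["I1","I2","I3"]}
for itemset, sup in sorted(eclat(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)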

 Exercise_7: Explore association rules from above example.

 Exercise_8: Do exercise_4 again using Eclat algorithm.


• Basic Concepts

• Data classification is a two-step process:


• Learning step (training phase).
• Classification step (testing phase).
• Supervised learning vs. unsupervised learning.

• The accuracy of a classifier.


• 5.1 - Decision Tree Induction.

 Why are decision tree classifiers so popular?


Example:

Age   Salary   Class
30    65       G
23    15       B
40    75       G
55    40       B
55    100      G
45    60       G

Decision tree: the root tests (Age ≤ 35); its "yes" branch tests (Salary ≤ 40) and its "no" branch tests (Salary ≤ 50); in each salary test, the lower side leads to class B and the higher side to class G.

Classification Rules:
Class B: {(Age<=35) and (Salary<=40)} or {(Age>35) and (Salary<=50)}
Class G: {(Age<=35) and (Salary>40)} or {(Age>35) and (Salary>50)}
Test data: (Age=25 and Salary=50) => Class G
• Generating a decision tree (recursive partitioning):

1. Apply the attribute-selection method SS to D to find the splitting criterion.
2. If the criterion splits D into n partitions:
     use the best split to partition D into D1, D2, …, Dn;
     for (i = 1; i <= n; ++i)
         Build_tree(node_i, D_i, SS)
3. End if
• Splitting criteria:

1. Information Gain.
2. Gain Ratio.
3. Gini Index.
• Information Gain (Entropy)

Gain(A) = Info(D) − Info_A(D)

Info(D) = − Σ_{i=1..m} p_i · log2(p_i)

p_i = |C_{i,D}| / |D|

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j)
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
• Example:
Gain(age) = 0.25
Gain(income) = 0.03
Gain(credit_rating) = 0.05
Gain(student) = 0.15
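These gains can be verified with a small Python sketch over the table above (function names are illustrative):

from math import log2
from collections import Counter, defaultdict

# (age, income, student, credit_rating, class) rows from the table above.
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]
attributes = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [row[-1] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[-1])
    info_a = sum(len(part) / len(rows) * info(part) for part in partitions.values())
    return info(labels) - info_a

for name, idx in attributes.items():
    print(name, round(gain(data, idx), 3))  # age 0.246, income 0.029, student 0.151, credit_rating 0.048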
• Split point for a continuous-valued attribute A:
• Given v values of A, (v − 1) possible split points are evaluated, each the midpoint of two adjacent values:

split_point = (a_i + a_{i+1}) / 2

D1: A ≤ split_point
D2: A > split_point

• Exercise: Complete the decision tree in the Fig. 8-5.


• Gain Ratio:
• Normalised information gain.

• Reduces the bias of information gain toward multi-valued attributes.

GainRatio(A) = Gain(A) / SplitInfo_A(D)

SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) · log2(|D_j| / |D|)
• Example:
GainRatio(age) = 0.25 / 1.57 = 0.16

• Exercise:
GainRatio(income) = ?
GainRatio(credit_rating) = ?
GainRatio(student) = ?

• Exercise: Redraw the previous tree using the GainRatio


• Gini Index: measures the impurity of a data partition.

Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

Gini(D) = 1 − Σ_{i=1..m} p_i²

p_i = |C_{i,D}| / |D|
• Gini Index: considers binary splits.

• Example:
Gini_{age ∈ {youth, senior}}(D) = Gini_{age ∈ {middle}}(D) = 0.36   (the two complementary binary splits)
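A brief sketch that reproduces the 0.36 value for the binary split on age (only the age and class columns of the table are needed; the helper names are illustrative):

from collections import Counter

# (age, buy_computer) pairs taken from the table above.
data = [("youth","no"), ("youth","no"), ("middle","yes"), ("senior","yes"),
        ("senior","yes"), ("senior","no"), ("middle","yes"), ("youth","no"),
        ("youth","yes"), ("senior","yes"), ("youth","yes"), ("middle","yes"),
        ("middle","yes"), ("senior","no")]

def gini(labels):
    """Gini(D) = 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, left_values):
    """Weighted Gini index of the binary split: age in left_values vs. the rest."""
    left  = [cls for age, cls in rows if age in left_values]
    right = [cls for age, cls in rows if age not in left_values]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(round(gini_split(data, {"youth", "senior"}), 2))  # 0.36
print(round(gini_split(data, {"middle"}), 2))           # 0.36 (the complementary split)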

• Exercise: Calculate the Gini Index for all attributes and all combinations.

• Exercise: Redraw the previous tree using the Gini Index.


• Evaluating a decision tree:

• Tree pruning

• Advantages of using decision trees.


• Bayesian Classification – naïve Bayesian

• Bayes’ Theorem
P(A|B) = P(B|A) · P(A) / P(B)

• X = (x1, x2, …, xn), m classes: (C1, C2, …, Cm)

• X is assigned to Ci if P(Ci|X) > P(Cj|X) for all j, 1 ≤ j ≤ m, j ≠ i

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

P(Ci) = |C_{i,D}| / |D|

P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)

• For continuous attributes:

P(xk|Ci) = (1 / (√(2π) · σ_Ci)) · exp( −(xk − μ_Ci)² / (2σ_Ci²) )
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
• Example: classify X = (age = youth, income = medium, student = yes, credit_rating = fair):

P(C1) · P(X|C1) = 0.028   (C1: buy_computer = yes)
P(C2) · P(X|C2) = 0.007   (C2: buy_computer = no)

• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low : 0 records


income = medium : 990 records
income = high: 10 records
• Example (continued):

P(C1) · P(X|C1) = 0.028
P(C2) · P(X|C2) = 0.007

• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low: 0 records => P(income=low|C1) = 0/1000 = 0
income = medium: 990 records => P(income=medium|C1) = 990/1000 = 0.990
income = high: 10 records => P(income=high|C1) = 10/1000 = 0.010
• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low: 0 records => P(income=low|C1) = (0+1)/1003 = 0.001
income = medium: 990 records => P(income=medium|C1) = (990+1)/1003 = 0.988
income = high: 10 records => P(income=high|C1) = (10+1)/1003 = 0.011
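A compact naive Bayes sketch over the buy-computer table, with an optional Laplacian (add-one) correction; the income example above uses a different, larger data set, so this sketch only illustrates the mechanism (helper names are assumptions, not the author's code):

from collections import Counter

# (age, income, student, credit_rating, buy_computer) rows from the table above.
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]

def naive_bayes_scores(rows, x, laplace=0):
    """Return P(Ci) * P(X|Ci) for each class Ci, optionally with Laplacian correction."""
    classes = Counter(row[-1] for row in rows)
    scores = {}
    for ci, n_ci in classes.items():
        class_rows = [row for row in rows if row[-1] == ci]
        score = n_ci / len(rows)                        # P(Ci)
        for k, value in enumerate(x):                   # P(xk|Ci), attributes assumed independent
            n_values = len({row[k] for row in rows})    # distinct values of attribute k
            matches = sum(1 for row in class_rows if row[k] == value)
            score *= (matches + laplace) / (n_ci + laplace * n_values)
        scores[ci] = score
    return scores

x = ("youth", "medium", "yes", "fair")
print(naive_bayes_scores(data, x))             # roughly {'no': 0.007, 'yes': 0.028}
print(naive_bayes_scores(data, x, laplace=1))  # smoothed scores; avoids zero probabilities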
• Using IF-THEN Rules for Classification:

IF condition THEN conclusion.

The IF part is the rule antecedent (precondition); the THEN part is the rule consequent.

Example:
R1: IF age = youth AND student = yes THEN buy_computer = yes

R1: (age = youth) ^ (student = yes) => (buy_computer = yes)


• Coverage and Accuracy: a rule R can be assessed by its coverage and accuracy.

• Coverage(R) = n_covers / |D|
• Accuracy(R) = n_correct / n_covers
• Extracting classification rules from a
decision tree:

R1: IF age = youth AND student = no THEN buy_computer = no
R2: IF age = youth AND student = yes THEN buy_computer = yes
R3: IF age = middle_aged THEN buy_computer = yes
R4: IF age = senior AND credit_rating = excellent THEN buy_computer = no
R5: IF age = senior AND credit_rating = fair THEN buy_computer = yes
• Metrics for Evaluating Classifier Performance:

The total number of tuples is P + N = P′ + N′, where P′ = TP + FP (predicted positives) and N′ = FN + TN (predicted negatives).

• Accuracy = (TP + TN) / (P + N)

• Precision = TP / (TP + FP) = TP / P′

• Recall = TP / (TP + FN) = TP / P

• Example:

                Predicted: yes   Predicted: no   Total
Actual: yes     90               210             300
Actual: no      140              9560            9700
Total           230              9770            10000

• Precision = 90 / 230 = 39.13%

• Recall = 90 / 300 = 30.00%

• F = (2 · precision · recall) / (precision + recall)
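A few lines of Python reproduce these numbers from the confusion matrix (an illustrative helper, not from the slides):

def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from a 2x2 confusion matrix."""
    p, n = tp + fn, fp + tn            # actual positives and negatives
    accuracy  = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall    = tp / p
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Confusion matrix from the example above: 90 TP, 210 FN, 140 FP, 9560 TN.
acc, prec, rec, f1 = classification_metrics(tp=90, fn=210, fp=140, tn=9560)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F={f1:.4f}")
# accuracy=0.9650 precision=0.3913 recall=0.3000 F=0.3396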
• In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

• Speed
• Robustness
• Scalability
• Interpretability
• Introduction: Cohesion and Separation.

• Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, …, Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

• k-means: A Centroid-Based Technique.


• Example 6-1: c1 = 3, c2 = 6

i    Xi
1    5
2    9
3    3
4    6
5    4
6    2
7    8
8    5
9    3
10   7

Step 1: c1 = 3, c2 = 6
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10}

Step 2: c1 = (3 + 4 + 2 + 3) / 4 = 3.0, c2 = (5 + 9 + 6 + 8 + 5 + 7) / 6 = 6.67
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10} (unchanged, so the algorithm stops)
• Exercise 1: Solve the previous example with 𝒄𝟏 = 2.0 and 𝒄𝟐 = 7.0

• Exercise 2: Solve the previous example with 𝒄𝟏 = 2.5, 𝒄𝟐 = 5.0 , and 𝒄𝟑 = 7.5
• Example 6-2: c1 = (1.0, 1.1), c2 = (5.0, 7.0)

i    X1    X2
1    1.0   1.0
2    1.5   2.0
3    3.0   4.0
4    5.0   7.0
5    3.5   5.0
6    4.5   5.0
7    3.5   4.5

Step 1: c1 = (1.0, 1.1), c2 = (5.0, 7.0)
C1 = {X1, X2, X3}, C2 = {X4, X5, X6, X7}

Step 2: c1 = ((1 + 1.5 + 3)/3, (1 + 2 + 4)/3) = (1.83, 2.33)
        c2 = ((5 + 3.5 + 4.5 + 3.5)/4, (7 + 5 + 5 + 4.5)/4) = (4.125, 5.375)
C1 = {X1, X2}, C2 = {X3, X4, X5, X6, X7}

Step 3: c1 = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)
        c2 = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)
C1 = {X1, X2}, C2 = {X3, X4, X5, X6, X7}

• Exercise 3: Solve the previous example with c1 = (3.0, 3.0) and c2 = (4.0, 4.0)
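A plain k-means sketch (assignment step, update step, stop when the centroids no longer change); run with the initial centroids of Example 6-2 it converges to the clusters and centroids shown above (the function is an illustrative sketch, not the author's code):

def kmeans(points, centroids, max_iters=100):
    """Plain k-means with given initial centroids; points and centroids are tuples."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
                         for i, cluster in enumerate(clusters)]
        if new_centroids == centroids:     # converged: centroids no longer change
            return clusters, centroids
        centroids = new_centroids
    return clusters, centroids

# Example 6-2: seven 2-D points, initial centroids c1 = (1.0, 1.1), c2 = (5.0, 7.0).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
clusters, centroids = kmeans(points, [(1.0, 1.1), (5.0, 7.0)])
print(centroids)   # final centroids, approximately (1.25, 1.5) and (3.9, 5.1)
print(clusters)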
• Complexity of k-means: O(nkt)
• n: number of objects.

• k: number of clusters.

• t: number of iterations.

• k << n

• t << n

• Drawbacks of k-means:

• Evaluation of clusters (within-cluster variation):

E = Σ_{i=1..k} Σ_{p ∈ Ci} dist(p, ci)²
• Hierarchical Clustering:
• Bottom-Up: AGglomerative NESting (AGNES).

• Top-Down: DIvisive ANAlysis (DIANA).


• Distance Measures in Algorithmic Methods (between clusters r and s):

• Single link

• Complete link

• Average link

• Centroid

• Ward’s method
• Single Linkage – Nearest Neighbour: the distance between clusters r = {a, b, c} and s = {d, e, f, g} is the smallest distance between any pair of members; in the figure, dist(r, s) = dist(b, e).

• Complete Linkage – Farthest Neighbour: the distance between clusters is the largest distance between any pair of members; in the figure, dist(r, s) = dist(a, g).

• Average Linkage: the distance between clusters is the average over all 12 cross-cluster pairs:
dist(r, s) = (dist(a, d) + dist(a, e) + dist(a, f) + dist(a, g) + … + dist(c, g)) / 12

• Centroid: the distance between clusters is the distance between their centroids m and n; in the figure, dist(r, s) = dist(m, n).
• Ward’s method: minimises the total within-cluster variance.

• Error Sum of Squares (ESS): ESS = e1² + e2² + e3², where e1, e2, e3 are the distances of points a, b, c from their cluster centroid (see figure).
• Example 6-3: Bottom-up approach, single linkage; merge clusters while the minimum distance is at most 1.8.

i 𝐗𝟏 𝐗𝟐
A 1.0 1.0
B 1.0 3.5
C 3.0 1.0
D 2.0 4.0
E 1.0 3.0
F 3.0 2.0
• Example 6-3: Euclidean distance matrix of the six points:

      A      B      C      D      E      F
A     0      2.5    2.0    3.16   2.0    2.24
B     -      0      3.20   1.12   0.5    2.5
C     -      -      0      3.16   2.83   1.0
D     -      -      -      0      1.41   2.24
E     -      -      -      -      0      2.24
F     -      -      -      -      -      0

The smallest distance is dist(B, E) = 0.5, so B and E are merged into cluster BE.

      A      BE     C      D      F
A     0      2.0    2.0    3.16   2.24
BE    -      0      2.83   1.12   2.24
C     -      -      0      3.16   1.0
D     -      -      -      0      2.24
F     -      -      -      -      0

The smallest distance is dist(C, F) = 1.0, so C and F are merged into cluster CF.

      A      BE     CF     D
A     0      2.0    2.0    3.16
BE    -      0      2.24   1.12
CF    -      -      0      2.24
D     -      -      -      0

The smallest distance is dist(BE, D) = 1.12, so D joins BE, giving cluster BDE.

      A      BDE    CF
A     0      2.0    2.0
BDE   -      0      2.24
CF    -      -      0

The smallest remaining distance (2.0) exceeds the 1.8 threshold, so merging stops.
Final clusters: {A}, {B, D, E}, {C, F}.

• Example 6-3: Dendrogram – AGglomerative NESting (AGNES): leaves A, B, E, D, C, F; merge heights 0.5 (B, E), 1.0 (C, F), and 1.12 (BE, D); cutting the dendrogram at 1.8 gives the three clusters above.
• Exercise: Solve the previous example with complete link, average link, centroid, and Ward’s method.
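A naive agglomerative sketch that also covers the single-, complete-, and average-link variants asked for in the exercise (centroid and Ward's method are not included); run with single linkage and a 1.8 cut-off it reproduces the merges of Example 6-3. The function is an illustrative sketch, not the author's code:

def agglomerative(points, labels, linkage="single", max_dist=1.8):
    """Naive agglomerative clustering; merge the closest clusters until max_dist is exceeded."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    def cluster_dist(c1, c2):
        d = [dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":   return min(d)
        if linkage == "complete": return max(d)
        return sum(d) / len(d)    # average linkage
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        pairs = [(cluster_dist(a, b), i, j)
                 for i, a in enumerate(clusters) for j, b in enumerate(clusters) if i < j]
        best, i, j = min(pairs)
        if best > max_dist:
            break                          # stop merging above the distance threshold
        print(f"merge {[labels[k] for k in clusters[i]]} + {[labels[k] for k in clusters[j]]} at {best:.2f}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [[labels[k] for k in c] for c in clusters]

# Example 6-3 data.
labels = ["A", "B", "C", "D", "E", "F"]
points = [(1.0, 1.0), (1.0, 3.5), (3.0, 1.0), (2.0, 4.0), (1.0, 3.0), (3.0, 2.0)]
print(agglomerative(points, labels, linkage="single", max_dist=1.8))
# merges (B,E) at 0.50, (C,F) at 1.00, (BE,D) at 1.12 -> [['A'], ['B','E','D'], ['C','F']]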
