Data Mining Presentation

The document presents the syllabus for a data mining course. It gives the grading breakdown: 40% for midterms, 10% for exercises, 20% for the project, and 30% for the final exam. It outlines the six chapters to be covered: introduction, getting to know your data, data preprocessing, frequent pattern mining, classification, and clustering. It also provides an overview of key concepts in data mining, including knowledge discovery from data (KDD), the main data mining tasks and techniques, and challenges in data mining.

Dr. Mohammadi Zanjireh
Imam Khomeini International University (IKIU)
 Midterms: 4 × 10 = 40
 Exercises: 10
 Project: 20
 Final exam: 30
 Chapter 1 – Introduction
 Chapter 2 – Getting to Know Your Data
 Chapter 3 – Data Preprocessing
 Chapter 4 – Frequent Pattern Mining
 Chapter 5 – Classification
 Chapter 6 – Clustering
 “We are living in the information age” is a popular saying; however, we are actually living in the data age.

 Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of daily life.
 This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful tools are badly needed to automatically uncover valuable information from the tremendous amounts of data and to transform such data into organised knowledge.

 This necessity has led to the birth of data mining. The field is young, dynamic, and promising. Data mining has made and will continue to make great strides in our journey from the data age toward the coming information age.

 Example 1-1: Search Engine.
 A data-rich but information-poor situation.
 Data tombs.

What is Data Mining?

Data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long.
 Knowledge discovery from data (KDD):
 Data Cleaning (preprocessing step).
 Data Integration (preprocessing step).
 Data Selection (preprocessing step).
 Data Mining.
 Pattern Evaluation.
 Knowledge Presentation.
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful
for a target application such as database data, data
warehouse data, transactional data, data streams,
multimedia data, and the WWW.
Exercise 1: What is the difference between a database and a data warehouse?

Exercise 2: Describe a number of data mining applications.
 Data mining tasks can be classified into two categories:

◦ Descriptive: characterises properties of the data in a target data set.

◦ Predictive: performs induction on the current data in order to make predictions.
 Data Cube:
 Fig 1.7:
◦ Drill-down
◦ Roll-up

◦ Slice
◦ Dice

◦ OLAP
◦ OLTP
 Which Technologies Are Used?
 Statistics

 Machine Learning (ML)


 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Efficiency and Scalability:
◦ The running time of a data mining algorithm must be predictable, short, and acceptable to applications.

 Parallel and distributed mining algorithms:
◦ Such algorithms first partition the data into “pieces”.
◦ Each piece is processed in parallel.
◦ The patterns from each partition are eventually merged.
◦ Parallel: within one machine.
◦ Distributed: across multiple machines.
 Handling noise, error, exceptions, and outliers:
◦ Data often contain noise, errors, exceptions, or outliers.
◦ These may confuse the data mining process, leading to the
derivation of erroneous patterns.
◦ Data cleaning, data preprocessing, and outlier detection and
removal, are examples of techniques that need to be integrated with
the data mining process.

 Privacy-preserving data mining:


◦ Data mining is useful. However, it poses the risk of disclosing an
individual’s personal information.
◦ We have to observe data sensitivity while performing data mining.
 Attributes: the columns of a data table; each row is an observation (object).

Student_ID   Name     Average
1001         Ali      17.12
1002         Reza     13.89
1003         Maryam   16.02
1004         Hasan    15.45
 Attribute types:
◦ Nominal: Subject, Occupation.

◦ Binary (0,1-T,F): Gender, medical test.


 Symmetric Binary.
 Asymmetric Binary.

◦ Ordinal: Drink_size (small, medium, large).

◦ Numeric
 Interval_scaled.
 Ratio_scaled.
 Discrete vs. Continuous attributes.
 Mean: x̄ = (1/n) Σ xᵢ
 Median: the middle value of the ordered data (the mean of the two middle values when n is even).
 Mode: the most frequent value.
 Variance and Standard Deviation: σ² = (1/n) Σ (xᵢ − x̄)², σ = √σ²
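As a quick illustration (not part of the slides), the following Python sketch computes these measures for the Average column of the student table above, using only the standard library:

import statistics

averages = [17.12, 13.89, 16.02, 15.45]  # the Average column from the table above

mean = statistics.mean(averages)            # arithmetic mean
median = statistics.median(averages)        # middle value of the sorted data
mode = statistics.multimode(averages)       # most frequent value(s); all values tie here
variance = statistics.pvariance(averages)   # population variance
std_dev = statistics.pstdev(averages)       # population standard deviation

print(mean, median, mode, variance, std_dev)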
 Data Matrix: an n × p table (n objects such as Student_1, …, Student_i, …, Student_n, each described by p attributes).

 Dissimilarity Matrix: an n × n (lower-triangular) matrix of the pairwise dissimilarities d(i, j).

 Similarity Matrix:
Sim(i, j) = 1 − d(i, j)
 Proximity measures for Nominal attributes: d(i, j) = (p − m) / p, where p is the number of attributes and m is the number of attributes on which objects i and j match.

 Example:
Id   Subject       Birth_City   Living_City   Eye_Colour
1    Computer      Teh          Teh           Black
2    Electricity   Teh          Kar           Brown
3    Mechanic      Qaz          Qaz           Brown
4    Computer      Kar          Qaz           Green

Dissimilarity Matrix =
0.00
0.75   0.00
1.00   0.75   0.00
0.75   1.00   0.75   0.00
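The matrix above can be reproduced with a short Python sketch; the helper name nominal_dissimilarity is illustrative, not from the slides:

def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two vectors of nominal attribute values."""
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p

# Rows: (Subject, Birth_City, Living_City, Eye_Colour) from the example table.
objects = [
    ("Computer",    "Teh", "Teh", "Black"),
    ("Electricity", "Teh", "Kar", "Brown"),
    ("Mechanic",    "Qaz", "Qaz", "Brown"),
    ("Computer",    "Kar", "Qaz", "Green"),
]

for i, oi in enumerate(objects):
    row = [nominal_dissimilarity(oi, objects[j]) for j in range(i + 1)]
    print(["%.2f" % d for d in row])   # lower-triangular dissimilarity matrix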

 Proximity measures for Binary attributes (based on the 2 × 2 contingency table of matches and mismatches).

 Euclidean Distance: d(i, j) = √( Σₖ (xᵢₖ − xⱼₖ)² )

 Manhattan Distance: d(i, j) = Σₖ |xᵢₖ − xⱼₖ|

 Supremum Distance: d(i, j) = maxₖ |xᵢₖ − xⱼₖ|
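A brief sketch of the three distances; the two example points below are made up for illustration:

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def supremum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1.0, 3.5), (3.0, 2.0)   # hypothetical 2-D points
print(euclidean(p, q), manhattan(p, q), supremum(p, q))  # 2.5, 3.5, 2.0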
 Proximity measures for Ordinal attributes: replace each value by its rank r ∈ {1, …, M} and normalise it to z = (r − 1) / (M − 1).
Small   Medium   Large
1       2        3
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 Undergraduate Medium
2 PhD Small
3 Diploma Medium
4 Undergraduate Large
 Example:
 Degree={Diploma, Undergraduate, Master, PhD}
 Drink_size={Small, Medium, Large}

Id Degree Drink_size
1 0.33 0.50
2 1.00 0.00
3 0.00 0.50
4 0.33 1.00
 Normalising:

Id Grade
1 30
2 52
3 84
4 45
5 25
6 20
7 91
8 65
9 42
10 32
 Normalising:

Id Grade Grade-min
1 30 10
2 52 32
3 84 64
4 45 25
5 25 5
6 20 0
7 91 71
8 65 45
9 42 22
10 32 12
 Normalising:

Id   Grade   Grade−min   (Grade−min)/(max−min)
1    30      10          0.14
2    52      32          0.45
3    84      64          0.90
4    45      25          0.35
5    25      5           0.07
6    20      0           0.00
7    91      71          1.00
8    65      45          0.63
9    42      22          0.31
10   32      12          0.17
 Proximity measures for mixed types: compute a dissimilarity in [0, 1] for each attribute (according to its type) and take their (weighted) average.
 Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A Excellent 45
2 Code B Fair 22
3 Code C Good 64
4 Code A Excellent 28
 Examples:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A 1 0.55
2 Code B 0 0.00
3 Code C 0.5 1.00
4 Code A 1 0.14

Dissimilarity Matrix =
0.00
0.85   0.00
0.65   0.83   0.00
0.13   0.71   0.79   0.00
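A sketch of one way to reproduce this matrix, assuming equal weights for the three attributes (exact-match test for the nominal attribute, rank-normalised difference for the ordinal one, range-normalised difference for the numeric one); the helper names are illustrative:

records = [("Code A", "Excellent", 45),
           ("Code B", "Fair",      22),
           ("Code C", "Good",      64),
           ("Code A", "Excellent", 28)]

ordinal_rank = {"Fair": 0.0, "Good": 0.5, "Excellent": 1.0}   # (r - 1) / (M - 1)
numeric = [r[2] for r in records]
lo, hi = min(numeric), max(numeric)

def mixed_dissimilarity(a, b):
    d_nominal = 0.0 if a[0] == b[0] else 1.0
    d_ordinal = abs(ordinal_rank[a[1]] - ordinal_rank[b[1]])
    d_numeric = abs(a[2] - b[2]) / (hi - lo)
    return (d_nominal + d_ordinal + d_numeric) / 3   # equal-weight average

for i in range(len(records)):
    print(["%.2f" % mixed_dissimilarity(records[i], records[j]) for j in range(i + 1)])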
 Exercise:
Id Test_1(nominal) Test_2(ordinal) Test_3(numeric)

1 Code A Excellent ----


2 ---- Fair 22
3 Code C ---- 64
4 Code A Excellent 28
 Cosine similarity:
D# team coach hockey baseball soccer penalty score win loss season
D1 5 0 3 0 2 0 0 2 0 0
D2 3 0 2 0 1 1 0 1 0 1
D3 0 7 0 2 1 0 0 3 0 0
D4 0 1 0 0 1 2 2 0 3 0
 Cosine similarity: sim(dᵢ, dⱼ) = (dᵢ · dⱼ) / (‖dᵢ‖ · ‖dⱼ‖)

Similarity Matrix =
1.00
0.94   1.00
0.17   0.12   1.00
0.07   0.17   0.23   1.00
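The similarity matrix can be checked with a few lines of Python (an illustrative sketch, not from the slides):

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

docs = [  # term-frequency vectors D1-D4 from the table above
    [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

for i in range(len(docs)):
    print(["%.2f" % cosine_similarity(docs[i], docs[j]) for j in range(i + 1)])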
 3.1-Data Cleaning.

 3.2-Data Integration.

 3.3-Data Reduction.

 3.4-Data Transformation.
 3.1- Data Cleaning
 3.1.1-Missing values
o Ignore the tuple.
o Fill in the missing value manually.
o Use a global constant.
o Use the attribute mean or median.
o Use the class mean or median.

 3.1.2-Noise
o Binning
o Smoothing by bin means.
o Smoothing by bin boundaries.
o Outlier analysis.
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34

Smoothing by bin means:


bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.
 Example:
4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins: bin1: 4, 8, 15. bin2: 21, 21, 24. bin3: 25, 28, 34

Smoothing by bin means:


bin1: 9, 9, 9. bin2: 22, 22, 22. bin3: 29, 29, 29.

Smoothing by bin boundaries:


bin1: 4, 4, 15. bin2: 21, 21, 24. bin3: 25, 25, 34
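A minimal sketch of equal-frequency binning with the two smoothing methods, assuming (as in this example) that the number of values divides evenly into the bins:

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into equal-frequency (equal-depth) bins."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])  # snap to nearer boundary
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(data, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]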
 Outlier Analysis:
Fig. 3.3

 3.2- Data Integration:
Metadata.

 3.3- Data Reduction


 3.3.1-Dimensionality reduction
o Discrete Wavelet Transform (DWT).
o Principal Component Analysis (PCA).

 3.3.2-Numerosity reduction
o Sampling.
o Clustering.
 3.4- Data Transformation
 Normalisation:
 Min-max normalisation: v′ = (v − min_A) / (max_A − min_A)

 Z-score normalisation: v′ = (v − Ā) / σ_A
 Example:
3, 4, 4, 5, 9
Ā = 5.00, σ = 2.10 (population standard deviation)

Min-max: 0, 0.17, 0.17, 0.33, 1.00

Z-score: -0.95, -0.48, -0.48, 0, 1.91
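A small sketch reproducing both normalisations; note that the z-scores on the slide correspond to the population standard deviation (σ ≈ 2.10):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5   # population std
    return [(v - mean) / std for v in values]

data = [3, 4, 4, 5, 9]
print(["%.2f" % v for v in min_max(data)])   # 0.00, 0.17, 0.17, 0.33, 1.00
print(["%.2f" % v for v in z_score(data)])   # -0.95, -0.48, -0.48, 0.00, 1.91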


 Descriptions
 Frequent Patterns
 Sequential Patterns
 Market Basket Analysis
 Association Rules:
 I = {I1, I2, …, In}: the set of all items.
 T: a transaction, T ⊆ I.
 D: the set of all transactions (the task-relevant database).
 TID: the transaction identifier.
 A => B [Support, Confidence]
A ⊂ I, B ⊂ I, A ∩ B = ∅
A ≠ ∅, B ≠ ∅
 Example:

Computer => Antivirus[Support=2%, Confidence=60%]

• Support
Support(A => B) = P(A ∪ B)

• Confidence
Confidence(A => B) = P(B | A) = Support(A ∪ B) / Support(A) = Support_Count(A ∪ B) / Support_Count(A)
 Mining Association Rules
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
• Example:
D = {T1, T2, T3, T4, T5, T6, T7, T8}
T1 = {Parsley, Onion, Olive, Cucumber, Tomato}
T2 = {Parsley, Cucumber, Tomato}
T3 = {Parsley, Onion, Bread, Salt, Cucumber, Tomato}
T4 = {Onion, Bread, Cucumber, Tomato}
T5 = {Onion, Salt, Tomato}
T6 = {Cheese, Bread}
T7 = {Cucumber, Onion, Tomato}
T8 = {Butter, Bread}

A => B [Support, Confidence]

{Cucumber, Tomato} => {Onion, Parsley}

• Exercise:
{Cucumber} => {Onion, Parsley}
{Cucumber, Tomato} => {Onion}
{Tomato} => {Onion}
 Apriori Algorithm
Proposed by Agrawal and Srikant in 1994.
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
 1. 𝑪𝟏 : min-sup = 2, Scan D.
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 2. 𝑳𝟏 :
Itemset Sup_Count
{𝐼1 } 6
{𝐼2 } 7
{𝐼3 } 6
{𝐼4 } 2
{𝐼5 } 2
 3. 𝑪𝟐 = 𝑳𝟏 ∗ 𝑳𝟏 , Scan D
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼4 } 1
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
{𝐼3 , 𝐼4 } 0
{𝐼3 , 𝐼5 } 1
{𝐼4 , 𝐼5 } 0
 4. 𝑳𝟐 :
Itemset Sup_Count
{𝐼1 , 𝐼2 } 4
{𝐼1 , 𝐼3 } 4
{𝐼1 , 𝐼5 } 2
{𝐼2 , 𝐼3 } 4
{𝐼2 , 𝐼4 } 2
{𝐼2 , 𝐼5 } 2
 Prune:
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐
Itemset Prune
{𝐼1 , 𝐼2 , 𝐼3 }
{𝐼1 , 𝐼2 , 𝐼5 }
{𝐼1 , 𝐼2 , 𝐼4 }
{𝐼1 , 𝐼3 , 𝐼5 }
{𝐼2 , 𝐼3 , 𝐼4 }
{𝐼2 , 𝐼3 , 𝐼5 }
{𝐼2 , 𝐼4 , 𝐼5 }
 5. 𝑪𝟑 = 𝑳𝟐 ∗ 𝑳𝟐 , Scan D

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 6. 𝑳𝟑

Itemset Sup_Count
{𝐼1 , 𝐼2 , 𝐼3 } 2
{𝐼1 , 𝐼2 , 𝐼5 } 2
 7. 𝑪𝟒 = 𝑳𝟑 ∗ 𝑳𝟑 , Prune
 𝐿4 = ∅.

End of Step 1.
 Code of Apriori Algorithm:
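The slide leaves the code itself out; the following is a minimal Python sketch of Apriori (support counting, candidate join, and subset-based pruning), not the author's original implementation:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]

    def support_count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:              # candidate is contained in the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = support_count({frozenset([i]) for i in items})
    all_frequent, k = dict(frequent), 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep only candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = support_count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
for itemset, count in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)

Running it on the nine-transaction example with min-sup = 2 reproduces L1, L2, and L3 from the slides; strong rules can then be generated from each frequent itemset by checking the confidence threshold, as in the next slides.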
 Confidence: X = {I1, I2, I5}.

I1 => {I2, I5}: Confidence = 2/6 = 33%
I2 => {I1, I5}: Confidence = 2/7 = 29%
I5 => {I1, I2}: Confidence = 2/2 = 100%
{I1, I2} => I5: Confidence = 2/4 = 50%
{I1, I5} => I2: Confidence = 2/2 = 100%
{I2, I5} => I1: Confidence = 2/2 = 100%
 Confidence: X = {I1, I2, I5}.

I1 => {I2, I5}: Confidence = 2/6 = 33% ×
I2 => {I1, I5}: Confidence = 2/7 = 29% ×
I5 => {I1, I2}: Confidence = 2/2 = 100%
{I1, I2} => I5: Confidence = 2/4 = 50% ×
{I1, I5} => I2: Confidence = 2/2 = 100%
{I2, I5} => I1: Confidence = 2/2 = 100%

For Confidence ≥ 70%: the strong rules are I5 => {I1, I2}, {I1, I5} => I2, and {I2, I5} => I1.


 Exercise_2: Calculate confidence for X = {I1, I2, I3}.

 Exercise_3: Re-run the previous example using min-sup = 3 and min-conf = 60%.

 Exercise_4: Using min-sup = 30% and min-conf = 60%:

TID   Items Purchased
1     {Orange juice, Lemonade}
2     {Milk, Orange juice, Glass cleaner}
3     {Orange juice, Detergent, Lemonade}
4     {Glass cleaner, Lemonade}
5     {Lemonade, Chips}
 FP-growth (finding frequent itemsets without candidate generation).
Min-sup=2
TID List of items
T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
1- Frequent items are sorted in descending order of support count:
L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}

2- T100 = {I1, I2, I5}: insert the path null → I2:1 → I1:1 → I5:1.
3- T200 = {I2, I4}: I2 becomes 2; new child I4:1 under I2.
4- T300 = {I2, I3}: I2 becomes 3; new child I3:1 under I2.
5- T400 = {I1, I2, I4}: I2 becomes 4, I1 becomes 2; new child I4:1 under I1.
6- T500 = {I1, I3}: new branch under null: I1:1 → I3:1.
7- T600 = {I2, I3}: I2 becomes 5; I3 under I2 becomes 2.
8- T700 = {I1, I3}: the side branch becomes I1:2 → I3:2.
9- T800 = {I1, I2, I3, I5}: I2 becomes 6, I1 under I2 becomes 3; new path I3:1 → I5:1 under that I1.
10- T900 = {I1, I2, I3}: I2 becomes 7, I1 under I2 becomes 4, I3 under that I1 becomes 2.

Final FP-tree:
null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2
11- The FP-tree is mined, starting from the last item of L:

Item   Conditional Pattern Base          Conditional FP-tree       Frequent Patterns Generated
I5     {I2, I1: 1}, {I2, I1, I3: 1}      {I2: 2, I1: 2}            {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {I2, I1: 1}, {I2: 1}              {I2: 2}                   {I2, I4: 2}
I3     {I2: 2}, {I2, I1: 2}, {I1: 2}     {I2: 4, I1: 2}, {I1: 2}   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {I2: 4}                           {I2: 4}                   {I2, I1: 4}

 Exercise_5: Calculate support and confidence for above rules.

 Exercise_6: Do exercise_4 using FP-growth algorithm.


 Code of FP-growth Algorithm:
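Again the slide omits the code; below is one compact Python sketch of FP-growth (FP-tree construction plus recursive mining of conditional pattern bases), not the author's original implementation. Run on the nine-transaction example with min-sup = 2 it reproduces the patterns listed above:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (items, count) pairs; return root, header table, item counts."""
    counts = defaultdict(int)
    for items, cnt in weighted_transactions:
        for i in items:
            counts[i] += cnt
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root, header = FPNode(None, None), defaultdict(list)
    for items, cnt in weighted_transactions:
        # Keep only frequent items, ordered by descending global support count.
        path = sorted((i for i in items if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for i in path:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += cnt
    return root, header, frequent

def fp_growth(weighted_transactions, min_sup, suffix=frozenset(), patterns=None):
    if patterns is None:
        patterns = {}
    _, header, frequent = build_tree(weighted_transactions, min_sup)
    for item in sorted(frequent, key=lambda i: frequent[i]):   # least frequent first
        new_suffix = suffix | {item}
        patterns[new_suffix] = frequent[item]
        # Conditional pattern base: the prefix path of every node holding `item`.
        cond_db = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                cond_db.append((path, node.count))
        if cond_db:
            fp_growth(cond_db, min_sup, new_suffix, patterns)
    return patterns

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
result = fp_growth([(t, 1) for t in db], min_sup=2)
for itemset, count in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)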
 Eclat (Equivalence CLAss Transformation) Algorithm
Vertical data format.
Min-sup=2

TID List of items


T100 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓
T200 𝑰𝟐 , 𝑰𝟒
T300 𝑰𝟐 , 𝑰𝟑
T400 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒
T500 𝑰𝟏 , 𝑰𝟑
T600 𝑰𝟐 , 𝑰𝟑
T700 𝑰𝟏 , 𝑰𝟑
T800 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 , 𝑰𝟓
T900 𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑
Itemset        TID set
𝑰𝟏 {T100, T400, T500, T700, T800, T900}
𝑰𝟐 {T100, T200, T300, T400, T600, T800 , T900}
𝑰𝟑 {T300, T500, T600, T700, T800, T900}
𝑰𝟒 {T200, T400}
𝑰𝟓 {T100, T800}
Itemset        TID set
{𝑰𝟏 , 𝑰𝟐 } {T100, T400, T800, T900}
{𝑰𝟏 , 𝑰𝟑 } {T500, T700, T800 , T900}
{𝑰𝟏 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟓 } {T100, T800}
{𝑰𝟐 , 𝑰𝟑 } {T300, T600, T800, T900}

{𝑰𝟐 , 𝑰𝟒 } {T200, T400}


{𝑰𝟐 , 𝑰𝟓 } {T100, T800}

{𝑰𝟑 , 𝑰𝟒 } {}
{𝑰𝟑 , 𝑰𝟓 } {T800}
{𝑰𝟒 , 𝑰𝟓 } {}
Itemset        TID set
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟑 } {T800, T900}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟒 } {T400}
{𝑰𝟏 , 𝑰𝟐 , 𝑰𝟓 } {T100, T800}
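A short Python sketch of Eclat in the spirit of the tables above: items are mapped to TID sets and frequent itemsets are grown by intersecting those sets (illustrative, not the author's code):

from collections import defaultdict

def eclat(transactions, min_sup):
    """Frequent itemsets via vertical TID-set intersection (Eclat)."""
    # Vertical format: item -> set of transaction ids.
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_sup:
                continue
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Extend with the remaining items by intersecting TID sets.
            new_candidates = []
            for other, other_tids in candidates[i + 1:]:
                inter = tids & other_tids
                if len(inter) >= min_sup:
                    new_candidates.append((other, inter))
            if new_candidates:
                recurse(itemset, new_candidates)

    recurse(set(), sorted(tidsets.items()))
    return frequent

db = {"T100": ["I1","I2","I5"], "T200": ["I2","I4"], "T300": ["I2","I3"],
      "T400": ["I1","I2","I4"], "T500": ["I1","I3"], "T600": ["I2","I3"],
      "T700": ["I1","I3"], "T800": ["I1","I2","I3","I5"], "T900": ["I1","I2","I3"]}
for itemset, sup in sorted(eclat(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)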

 Exercise_7: Explore association rules from above example.

 Exercise_8: Do exercise_4 again using Eclat algorithm.


• Basic Concepts

• Data classification is a two-step process:


• Learning step (training phase).
• Classification step (testing phase).
• Supervised learning vs. unsupervised learning.

• The accuracy of a classifier.


• 5.1 - Decision Tree Induction.

 Why are decision tree classifiers so popular?


Example:

Age   Salary   Class
30    65       G
23    15       B
40    75       G
55    40       B
55    100      G
45    60       G

Decision tree: the root tests (Age ≤ 35); its "yes" branch tests (Salary ≤ 40) and its "no" branch tests (Salary ≤ 50); in each salary test, the lower side leads to class B and the higher side to class G.

Classification Rules:
Class B: {(Age<=35) and (Salary<=40)} or {(Age>35) and (Salary<=50)}
Class G: {(Age<=35) and (Salary>40)} or {(Age>35) and (Salary>50)}
Test data: (Age=25 and Salary=50) => Class G
• Generating a decision tree (recursive partitioning):

1. Apply the attribute-selection method SS to D to find the splitting criterion.
2. If the criterion splits D into n partitions:
     use the best split to partition D into D1, D2, …, Dn;
     for (i = 1; i <= n; ++i)
         Build_tree(node_i, D_i, SS)
3. End if
• Splitting criteria:

1. Information Gain.
2. Gain Ratio.
3. Gini Index.
• Information Gain (Entropy)

Gain(A) = Info(D) − Info_A(D)

Info(D) = − Σ_{i=1..m} p_i · log2(p_i)

p_i = |C_{i,D}| / |D|

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j)
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
• Example:
Gain(age) = 0.25
Gain(income) = 0.03
Gain(credit_rating) = 0.05
Gain(student) = 0.15
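These gains can be verified with a small Python sketch over the table above (function names are illustrative):

from math import log2
from collections import Counter, defaultdict

# (age, income, student, credit_rating, class) rows from the table above.
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]
attributes = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    labels = [row[-1] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[-1])
    info_a = sum(len(part) / len(rows) * info(part) for part in partitions.values())
    return info(labels) - info_a

for name, idx in attributes.items():
    print(name, round(gain(data, idx), 3))  # age 0.246, income 0.029, student 0.151, credit_rating 0.048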
• Split point for a continuous-valued attribute A:
• Given v values of A, (v − 1) possible split points are evaluated, each the midpoint of two adjacent values:

split_point = (a_i + a_{i+1}) / 2

D1: A ≤ split_point
D2: A > split_point

• Exercise: Complete the decision tree in the Fig. 8-5.


• Gain Ratio:
• Normalised information gain.

• Reduces the bias of information gain toward multi-valued attributes.

GainRatio(A) = Gain(A) / SplitInfo_A(D)

SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) · log2(|D_j| / |D|)
• Example:
GainRatio(age) = 0.25 / 1.57 = 0.16

• Exercise:
GainRatio(income) = ?
GainRatio(credit_rating) = ?
GainRatio(student) = ?

• Exercise: Redraw the previous tree using the GainRatio


• Gini Index: measures the impurity of a data partition.

Gini_A(D) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

Gini(D) = 1 − Σ_{i=1..m} p_i²

p_i = |C_{i,D}| / |D|
• Gini Index: considers binary splits.

• Example:
Gini_{age ∈ {youth, senior}}(D) = Gini_{age ∈ {middle}}(D) = 0.36   (the two complementary binary splits)
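A brief sketch that reproduces the 0.36 value for the binary split on age (only the age and class columns of the table are needed; the helper names are illustrative):

from collections import Counter

# (age, buy_computer) pairs taken from the table above.
data = [("youth","no"), ("youth","no"), ("middle","yes"), ("senior","yes"),
        ("senior","yes"), ("senior","no"), ("middle","yes"), ("youth","no"),
        ("youth","yes"), ("senior","yes"), ("youth","yes"), ("middle","yes"),
        ("middle","yes"), ("senior","no")]

def gini(labels):
    """Gini(D) = 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, left_values):
    """Weighted Gini index of the binary split: age in left_values vs. the rest."""
    left  = [cls for age, cls in rows if age in left_values]
    right = [cls for age, cls in rows if age not in left_values]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(round(gini_split(data, {"youth", "senior"}), 2))  # 0.36
print(round(gini_split(data, {"middle"}), 2))           # 0.36 (the complementary split)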

• Exercise: Calculate the Gini Index for all attributes and all combinations.

• Exercise: Redraw the previous tree using the Gini Index.


• Evaluating a decision tree:

• Tree pruning

• Advantages of using decision trees.


• Bayesian Classification – naïve Bayesian

• Bayes’ Theorem
P(A|B) = P(B|A) · P(A) / P(B)

• X = (x1, x2, …, xn), m classes: (C1, C2, …, Cm)

• X is assigned to Ci if P(Ci|X) > P(Cj|X) for all j, 1 ≤ j ≤ m, j ≠ i

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

P(Ci) = |C_{i,D}| / |D|

P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci)

• For continuous attributes:

P(xk|Ci) = (1 / (√(2π) · σ_Ci)) · exp( −(xk − μ_Ci)² / (2σ_Ci²) )
RID age income student credit-rating Class: buy-computer

1 youth high no fair no


2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
• Example: classify X = (age = youth, income = medium, student = yes, credit_rating = fair):

P(C1) · P(X|C1) = 0.028   (C1: buy_computer = yes)
P(C2) · P(X|C2) = 0.007   (C2: buy_computer = no)

• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low : 0 records


income = medium : 990 records
income = high: 10 records
• Example (continued):

P(C1) · P(X|C1) = 0.028
P(C2) · P(X|C2) = 0.007

• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low: 0 records => P(income=low|C1) = 0/1000 = 0
income = medium: 990 records => P(income=medium|C1) = 990/1000 = 0.990
income = high: 10 records => P(income=high|C1) = 10/1000 = 0.010
• Laplacian correction:
𝑪𝟏 : buy_computer = yes : 1000 records

income = low: 0 records => P(income=low|C1) = (0+1)/1003 = 0.001
income = medium: 990 records => P(income=medium|C1) = (990+1)/1003 = 0.988
income = high: 10 records => P(income=high|C1) = (10+1)/1003 = 0.011
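A compact naive Bayes sketch over the buy-computer table, with an optional Laplacian (add-one) correction; the income example above uses a different, larger data set, so this sketch only illustrates the mechanism (helper names are assumptions, not the author's code):

from collections import Counter

# (age, income, student, credit_rating, buy_computer) rows from the table above.
data = [
    ("youth","high","no","fair","no"), ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"), ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"), ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"), ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"), ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"), ("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"), ("senior","medium","no","excellent","no"),
]

def naive_bayes_scores(rows, x, laplace=0):
    """Return P(Ci) * P(X|Ci) for each class Ci, optionally with Laplacian correction."""
    classes = Counter(row[-1] for row in rows)
    scores = {}
    for ci, n_ci in classes.items():
        class_rows = [row for row in rows if row[-1] == ci]
        score = n_ci / len(rows)                        # P(Ci)
        for k, value in enumerate(x):                   # P(xk|Ci), attributes assumed independent
            n_values = len({row[k] for row in rows})    # distinct values of attribute k
            matches = sum(1 for row in class_rows if row[k] == value)
            score *= (matches + laplace) / (n_ci + laplace * n_values)
        scores[ci] = score
    return scores

x = ("youth", "medium", "yes", "fair")
print(naive_bayes_scores(data, x))             # roughly {'no': 0.007, 'yes': 0.028}
print(naive_bayes_scores(data, x, laplace=1))  # smoothed scores; avoids zero probabilities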
• Using IF-THEN Rules for Classification:

IF condition THEN conclusion.

The IF part is the rule antecedent (precondition); the THEN part is the rule consequent.

Example:
R1: IF age = youth AND student = yes THEN buy_computer = yes

R1: (age = youth) ^ (student = yes) => (buy_computer = yes)


• Coverage and Accuracy: a rule R can be assessed by its coverage and accuracy.

• Coverage(R) = n_covers / |D|
• Accuracy(R) = n_correct / n_covers
• Extracting classification rules from a
decision tree:

R1: IF age = youth AND student = no THEN buy_computer = no
R2: IF age = youth AND student = yes THEN buy_computer = yes
R3: IF age = middle_aged THEN buy_computer = yes
R4: IF age = senior AND credit_rating = excellent THEN buy_computer = no
R5: IF age = senior AND credit_rating = fair THEN buy_computer = yes
• Metrics for Evaluating Classifier Performance:

The total number of tuples is P + N = P′ + N′, where P′ = TP + FP (predicted positives) and N′ = FN + TN (predicted negatives).

• Accuracy = (TP + TN) / (P + N)

• Precision = TP / (TP + FP) = TP / P′

• Recall = TP / (TP + FN) = TP / P

• Example:

                Predicted: yes   Predicted: no   Total
Actual: yes     90               210             300
Actual: no      140              9560            9700
Total           230              9770            10000

• Precision = 90 / 230 = 39.13%

• Recall = 90 / 300 = 30.00%

• F = (2 · precision · recall) / (precision + recall)
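A few lines of Python reproduce these numbers from the confusion matrix (an illustrative helper, not from the slides):

def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from a 2x2 confusion matrix."""
    p, n = tp + fn, fp + tn            # actual positives and negatives
    accuracy  = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall    = tp / p
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Confusion matrix from the example above: 90 TP, 210 FN, 140 FP, 9560 TN.
acc, prec, rec, f1 = classification_metrics(tp=90, fn=210, fp=140, tn=9560)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F={f1:.4f}")
# accuracy=0.9650 precision=0.3913 recall=0.3000 F=0.3396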
• In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

• Speed
• Robustness
• Scalability
• Interpretability
• Introduction: Cohesion and Separation.

• Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, …, Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.

• k-means: A Centroid-Based Technique.


• Example 6-1: c1 = 3, c2 = 6

i    Xi
1    5
2    9
3    3
4    6
5    4
6    2
7    8
8    5
9    3
10   7

Step 1: c1 = 3, c2 = 6
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10}

Step 2: c1 = (3 + 4 + 2 + 3) / 4 = 3.0, c2 = (5 + 9 + 6 + 8 + 5 + 7) / 6 = 6.67
C1 = {X3, X5, X6, X9}, C2 = {X1, X2, X4, X7, X8, X10} (unchanged, so the algorithm stops)
• Exercise 1: Solve the previous example with 𝒄𝟏 = 2.0 and 𝒄𝟐 = 7.0

• Exercise 2: Solve the previous example with 𝒄𝟏 = 2.5, 𝒄𝟐 = 5.0 , and 𝒄𝟑 = 7.5
• Example 6-2: c1 = (1.0, 1.1), c2 = (5.0, 7.0)

i    X1    X2
1    1.0   1.0
2    1.5   2.0
3    3.0   4.0
4    5.0   7.0
5    3.5   5.0
6    4.5   5.0
7    3.5   4.5

Step 1: c1 = (1.0, 1.1), c2 = (5.0, 7.0)
C1 = {X1, X2, X3}, C2 = {X4, X5, X6, X7}

Step 2: c1 = ((1 + 1.5 + 3)/3, (1 + 2 + 4)/3) = (1.83, 2.33)
        c2 = ((5 + 3.5 + 4.5 + 3.5)/4, (7 + 5 + 5 + 4.5)/4) = (4.125, 5.375)
C1 = {X1, X2}, C2 = {X3, X4, X5, X6, X7}

Step 3: c1 = ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)
        c2 = ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)
C1 = {X1, X2}, C2 = {X3, X4, X5, X6, X7}

• Exercise 3: Solve the previous example with c1 = (3.0, 3.0) and c2 = (4.0, 4.0)
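A plain k-means sketch (assignment step, update step, stop when the centroids no longer change); run with the initial centroids of Example 6-2 it converges to the clusters and centroids shown above (the function is an illustrative sketch, not the author's code):

def kmeans(points, centroids, max_iters=100):
    """Plain k-means with given initial centroids; points and centroids are tuples."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
                         for i, cluster in enumerate(clusters)]
        if new_centroids == centroids:     # converged: centroids no longer change
            return clusters, centroids
        centroids = new_centroids
    return clusters, centroids

# Example 6-2: seven 2-D points, initial centroids c1 = (1.0, 1.1), c2 = (5.0, 7.0).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
clusters, centroids = kmeans(points, [(1.0, 1.1), (5.0, 7.0)])
print(centroids)   # final centroids, approximately (1.25, 1.5) and (3.9, 5.1)
print(clusters)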
• Complexity of k-means: O(nkt)
• n: number of objects.

• k: number of clusters.

• t: number of iterations.

• k << n

• t << n

• Drawbacks of k-means:

• Evaluation of clusters (within-cluster variation):

E = Σ_{i=1..k} Σ_{p ∈ Ci} dist(p, ci)²
• Hierarchical Clustering:
• Bottom-Up: AGglomerative NESting (AGNES).

• Top-Down: DIvisive ANAlysis (DIANA).


• Distance Measures in Algorithmic Methods (between clusters r and s):

• Single link

• Complete link

• Average link

• Centroid

• Ward’s method
• Single Linkage – Nearest Neighbour: the distance between clusters r = {a, b, c} and s = {d, e, f, g} is the smallest distance between any pair of members; in the figure, dist(r, s) = dist(b, e).

• Complete Linkage – Farthest Neighbour: the distance between clusters is the largest distance between any pair of members; in the figure, dist(r, s) = dist(a, g).

• Average Linkage: the distance between clusters is the average over all 12 cross-cluster pairs:
dist(r, s) = (dist(a, d) + dist(a, e) + dist(a, f) + dist(a, g) + … + dist(c, g)) / 12

• Centroid: the distance between clusters is the distance between their centroids m and n; in the figure, dist(r, s) = dist(m, n).
• Ward’s method: minimises the total within-cluster variance.

• Error Sum of Squares (ESS): ESS = e1² + e2² + e3², where e1, e2, e3 are the distances of points a, b, c from their cluster centroid (see figure).
• Example 6-3: Bottom-up approach, single linkage; merge clusters while the minimum distance is at most 1.8.

i 𝐗𝟏 𝐗𝟐
A 1.0 1.0
B 1.0 3.5
C 3.0 1.0
D 2.0 4.0
E 1.0 3.0
F 3.0 2.0
• Example 6-3: Euclidean distance matrix of the six points:

      A      B      C      D      E      F
A     0      2.5    2.0    3.16   2.0    2.24
B     -      0      3.20   1.12   0.5    2.5
C     -      -      0      3.16   2.83   1.0
D     -      -      -      0      1.41   2.24
E     -      -      -      -      0      2.24
F     -      -      -      -      -      0

The smallest distance is dist(B, E) = 0.5, so B and E are merged into cluster BE.

      A      BE     C      D      F
A     0      2.0    2.0    3.16   2.24
BE    -      0      2.83   1.12   2.24
C     -      -      0      3.16   1.0
D     -      -      -      0      2.24
F     -      -      -      -      0

The smallest distance is dist(C, F) = 1.0, so C and F are merged into cluster CF.

      A      BE     CF     D
A     0      2.0    2.0    3.16
BE    -      0      2.24   1.12
CF    -      -      0      2.24
D     -      -      -      0

The smallest distance is dist(BE, D) = 1.12, so D joins BE, giving cluster BDE.

      A      BDE    CF
A     0      2.0    2.0
BDE   -      0      2.24
CF    -      -      0

The smallest remaining distance (2.0) exceeds the 1.8 threshold, so merging stops.
Final clusters: {A}, {B, D, E}, {C, F}.

• Example 6-3: Dendrogram – AGglomerative NESting (AGNES): leaves A, B, E, D, C, F; merge heights 0.5 (B, E), 1.0 (C, F), and 1.12 (BE, D); cutting the dendrogram at 1.8 gives the three clusters above.
• Exercise: Solve the previous example with complete link, average link, centroid, and Ward’s method.
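A naive agglomerative sketch that also covers the single-, complete-, and average-link variants asked for in the exercise (centroid and Ward's method are not included); run with single linkage and a 1.8 cut-off it reproduces the merges of Example 6-3. The function is an illustrative sketch, not the author's code:

def agglomerative(points, labels, linkage="single", max_dist=1.8):
    """Naive agglomerative clustering; merge the closest clusters until max_dist is exceeded."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    def cluster_dist(c1, c2):
        d = [dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":   return min(d)
        if linkage == "complete": return max(d)
        return sum(d) / len(d)    # average linkage
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        pairs = [(cluster_dist(a, b), i, j)
                 for i, a in enumerate(clusters) for j, b in enumerate(clusters) if i < j]
        best, i, j = min(pairs)
        if best > max_dist:
            break                          # stop merging above the distance threshold
        print(f"merge {[labels[k] for k in clusters[i]]} + {[labels[k] for k in clusters[j]]} at {best:.2f}")
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [[labels[k] for k in c] for c in clusters]

# Example 6-3 data.
labels = ["A", "B", "C", "D", "E", "F"]
points = [(1.0, 1.0), (1.0, 3.5), (3.0, 1.0), (2.0, 4.0), (1.0, 3.0), (3.0, 2.0)]
print(agglomerative(points, labels, linkage="single", max_dist=1.8))
# merges (B,E) at 0.50, (C,F) at 1.00, (BE,D) at 1.12 -> [['A'], ['B','E','D'], ['C','F']]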
