OneR Algorithm: Simple Data Mining

The OneR algorithm learns a single rule for each attribute in a dataset and chooses the attribute with the rule that best predicts the target class. It works by discretizing numeric attributes into intervals and creating a rule for each possible value or interval of an attribute that assigns instances to the majority class for that value in the training data. OneR is a simple but surprisingly effective algorithm that is useful for getting an initial sense of what attributes may be predictive without overfitting the data.


Data Mining – Algorithms: OneR

Chapter 4, Section 4.1

Simplicity First
• Simple algorithms sometimes work surprisingly well
• It is worth trying simple approaches first
• Different approaches may work better for different data
• There is more than one simple approach
• First to be examined: OneR (or 1R), which learns one rule for the dataset. The name is a bit of a misnomer: the result is really a one-level decision tree
OneR – Holte (1993)
• Simple, cheap method
• Often performs surprisingly well
• Many real datasets may not have complicated things going on
• Idea:
– Make rules that test a single attribute and branch accordingly (each branch corresponds to a different value for that attribute)
– Classification for a given branch is the “majority” class for that branch in the training data
– Evaluate use of each attribute via “error rate” on training data
– Choose the best attribute
Figure 4.1 Pseudo-code for 1R.

For each attribute,
    For each value of that attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value.
    Calculate the error rate of the rules.
Choose the rules with the smallest error rate.

At least in the simplest version, “missing” is treated as a separate value
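The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration for nominal attributes, not WEKA's implementation. The function name `one_r`, the toy dataset, and the tie-breaking choices (first class seen, first attribute seen) are assumptions made here.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (best_attribute, rules, errors) for nominal data.

    rows is a list of dicts. A missing value (None) is simply treated
    as one more attribute value, as in the simplest version of 1R.
    """
    best = None
    for attr in attributes:
        # Count how often each class appears for each value of attr.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its most frequent class (ties: first class seen).
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Training errors = instances that are not in their value's majority class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Keep the attribute with the fewest errors (ties: first attribute seen).
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical toy data, invented for this sketch (not from the slides).
toy = [
    {"colour": "red", "size": "big", "buy": "yes"},
    {"colour": "red", "size": "small", "buy": "yes"},
    {"colour": "blue", "size": "big", "buy": "no"},
    {"colour": "blue", "size": "small", "buy": "no"},
    {"colour": None, "size": "big", "buy": "yes"},
]
best_attr, best_rules, best_errors = one_r(toy, ["colour", "size"], "buy")
print(best_attr, best_rules, best_errors)
```

Ties are broken deterministically here only to keep the sketch reproducible; a real implementation may break them differently.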
Example: My Weather (Nominal)
Outlook   Temp  Humid   Windy  Play?
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   yes
overcast  hot   high    FALSE  no
rainy     mild  high    FALSE  no
rainy     cool  normal  FALSE  no
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  yes
sunny     cool  normal  FALSE  yes
rainy     mild  normal  FALSE  no
sunny     mild  normal  TRUE   yes
overcast  mild  high    TRUE   yes
overcast  hot   normal  FALSE  no
rainy     mild  high    TRUE   no
Let’s make this a little more realistic than the book does
• Divide into training and test data
• Let’s save the last record as a test
For each attribute – start with Outlook
• Make a rule for each value
– Sunny -> yes 1/5 errors
– Overcast -> yes* 2/4 errors
– Rainy -> no 0/4 errors
– Total errors = 3/13
• Move on to next attribute – temperature
– Hot -> no 1/4 errors
– Mild -> yes 2/5 errors
– Cool -> no* 2/4 errors
– Total errors = 5/13

* - means tie – arbitrarily broken (maybe random)

Continue with Humidity
• Make a rule for each value
– High -> yes* 3/6 errors
– Normal -> no 3/7 errors
– Total errors = 6/13
• Move on to next attribute – windy
– False -> no 2/8 errors
– True -> yes 1/5 errors
– Total errors = 3/13
* - means tie – arbitrarily broken (maybe random)
• First and last attributes tie – one would have to be arbitrarily chosen
• On the test record, the first would end up being correct, the last wouldn’t
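The error counts for this first split can be checked with a short script. This is a sketch that assumes the 14 records are typed in exactly as in the My Weather table; the helper name `rule_for` is made up for illustration. The error totals do not depend on how ties are broken, only the predicted class for a tied value does.

```python
from collections import Counter, defaultdict

ATTRS = ["outlook", "temp", "humid", "windy"]
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
rainy mild high TRUE no
"""
DATA = [dict(zip(ATTRS + ["play"], line.split()))
        for line in RAW.strip().splitlines()]

def rule_for(rows, attr):
    """Build the majority-class rule for one attribute and count its errors."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[attr]][r["play"]] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return rules, errors

# First 13 records are training data; the last record is the test instance.
train, test = DATA[:13], DATA[13]
errors = {a: rule_for(train, a)[1] for a in ATTRS}
print(errors)

# Compare the two tied attributes on the held-out record.
for a in ("outlook", "windy"):
    rules, _ = rule_for(train, a)
    print(a, "predicts", rules[test[a]], "| actual:", test["play"])
```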
Again being more realistic than the book,
• this will be cross-validated
• Normally 10-fold is used, but with 14 instances that is a little awkward:
– 6 of the tests would be on 1 instance
– 4 of the tests would be on 2 instances
• I’m going to do 14-fold instead, with one test instance for each test
• Next test: hold out the 13th instance as test data
For each attribute – start with Outlook
• Make a rule for each value
– Sunny -> yes 1/5 errors
– Overcast -> yes 1/3 errors
– Rainy -> no 0/5 errors
– Total errors = 2/13
• Move on to next attribute – temperature
– Hot -> no 1/3 errors
– Mild -> yes* 3/6 errors
– Cool -> no* 2/4 errors
– Total errors = 6/13

* - means tie – arbitrarily broken (maybe random)

Continue with Humidity
• Make a rule for each value
– High -> no 3/7 errors
– Normal -> yes* 3/6 errors
– Total errors = 6/13
• Move on to next attribute – windy
– False -> no 2/7 errors
– True -> yes 2/6 errors
– Total errors = 4/13
* - means tie – arbitrarily broken (maybe random)
• First attribute wins
• On the test record, this makes an incorrect prediction
In a 14-fold cross validation, this would continue 12 more times
• Let’s run WEKA on this …
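The full 14-fold (leave-one-out) procedure could be sketched as below. The tie-breaking here (first class seen, first attribute seen) and the fallback for unseen attribute values are assumptions of this sketch, so the overall count of correct predictions need not match WEKA's.

```python
from collections import Counter, defaultdict

ATTRS = ["outlook", "temp", "humid", "windy"]
RAW = """
sunny hot high FALSE no
sunny hot high TRUE yes
overcast hot high FALSE no
rainy mild high FALSE no
rainy cool normal FALSE no
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE yes
sunny cool normal FALSE yes
rainy mild normal FALSE no
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE no
rainy mild high TRUE no
"""
DATA = [dict(zip(ATTRS + ["play"], line.split()))
        for line in RAW.strip().splitlines()]

def one_r(rows):
    """Return (attribute, rules, errors) with the fewest training errors."""
    best = None
    for attr in ATTRS:
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[attr]][r["play"]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:   # ties: keep the earlier attribute
            best = (attr, rules, errors)
    return best

def leave_one_out(data):
    """Hold out each instance once, train on the rest, predict the held-out one."""
    results = []
    for i, held_out in enumerate(data):
        train = data[:i] + data[i + 1:]
        attr, rules, _ = one_r(train)
        # Assumption: an attribute value never seen in training falls back
        # to the overall majority class of the training fold.
        fallback = Counter(r["play"] for r in train).most_common(1)[0][0]
        predicted = rules.get(held_out[attr], fallback)
        results.append((attr, predicted, held_out["play"]))
    return results

results = leave_one_out(DATA)
correct = sum(pred == actual for _, pred, actual in results)
print(f"{correct}/{len(results)} held-out instances classified correctly")
```

With 14 instances each fold tests exactly one record, which avoids the uneven fold sizes that 10-fold would give on this dataset.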
WEKA results – first look near the bottom
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 9 64.2857 %

Incorrectly Classified Instances 5 35.7143 %
============================================
• On the cross validation, it got 9 out of 14 tests correct. (I don’t know which way it went on arbitrary decisions, so we may not re-create this exactly if we walk all of the way through.)
More Detailed Results
=== Confusion Matrix ===
a b <-- classified as
4 2 | a = yes
3 5 | b = no
====================================
• Here we see that the program predicted play=yes 7 times; on 4 of those it was correct
• The program predicted play=no 7 times; on 5 of those it was correct
• There were 6 instances whose actual value was play=yes; the program correctly predicted that on 4 of them
• There were 8 instances whose actual value was play=no; the program correctly predicted that on 5 of them
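The four statements above are just row and column totals of the confusion matrix. A small sketch of the arithmetic, with the matrix typed in from the WEKA output (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix copied from the WEKA output.
# Rows = actual class, columns = predicted class.
labels = ["yes", "no"]
matrix = [[4, 2],
          [3, 5]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

for i, label in enumerate(labels):
    predicted = sum(row[i] for row in matrix)  # column total
    actual = sum(matrix[i])                    # row total
    hits = matrix[i][i]
    print(f"play={label}: predicted {predicted} times, {hits} correct; "
          f"{actual} actual instances, {hits} found")

print(f"accuracy = {correct}/{total} = {accuracy:.4%}")
```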
Part of our purpose is to have a take-home message for humans
• Not 14 take-home messages!
• So instead of reporting each of the things learned on each of the 14 training sets …
• … the program runs again on all of the data and builds a pattern for that: a take-home message
For each attribute – start with Outlook
• Make a rule for each value
– Sunny -> yes 1/5 errors
– Overcast -> yes* 2/4 errors
– Rainy -> no 0/5 errors
– Total errors = 3/14
• Move on to next attribute – temperature
– Hot -> no 1/4 errors
– Mild -> yes* 3/6 errors
– Cool -> no* 2/4 errors
– Total errors = 6/14

* - means tie – arbitrarily broken (maybe random)

Continue with Humidity
• Make a rule for each value
– High -> no 3/7 errors
– Normal -> no 3/7 errors
– Total errors = 6/14
• Move on to next attribute – windy
– False -> no 2/8 errors
– True -> yes 2/6 errors
– Total errors = 4/14
* - means tie – arbitrarily broken (maybe random)
• First attribute wins - see WEKA results on next slide
WEKA - Take-Home
=== Classifier model (full training set) ===
outlook:
sunny -> yes
overcast -> yes
rainy -> no
(11/14 instances correct)
• This very simple classifier rule-set could be the take-home message from running this algorithm on this data, if you are satisfied with the results!
• This 11/14 correct is NOT a good indicator of quality: it is the % correct on TRAINING DATA
• The cross validation result shown previously (9/14) is a much fairer judgment because it is measured on TEST DATA
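To see where 11/14 comes from, the outlook rule can be applied back to the same 14 records it was learned from. This sketch uses only the outlook and play columns of the My Weather table; the 9/14 figure is taken from the WEKA cross-validation output, not recomputed here.

```python
# The rule reported by WEKA for the full training set.
RULE = {"sunny": "yes", "overcast": "yes", "rainy": "no"}

# (outlook, play) pairs, in the order of the My Weather table.
ROWS = [
    ("sunny", "no"), ("sunny", "yes"), ("overcast", "no"), ("rainy", "no"),
    ("rainy", "no"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "yes"),
    ("sunny", "yes"), ("rainy", "no"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "no"), ("rainy", "no"),
]

# Training accuracy: the rule is scored on the data it was built from.
train_correct = sum(RULE[outlook] == play for outlook, play in ROWS)
cv_correct = 9  # from the WEKA cross-validation summary

print(f"training data:    {train_correct}/{len(ROWS)} correct")
print(f"cross-validation: {cv_correct}/{len(ROWS)} correct")
```

The gap between the two numbers is the optimism that comes from scoring a rule on its own training data.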
Let’s Try WEKA OneR on njcrimenominal
• Try 10-fold
unemploy:
hi -> bad
med -> ok
low -> ok
(27/32 instances correct)
=== Confusion Matrix ===
a b <-- classified as
1 6 | a = bad
3 22 | b = ok
Numeric Attributes
• For OneR, numeric attributes are “discretized”: the range of values is divided into a set of intervals
• (Too) simple method:
– Sort
– Put a breakpoint wherever the class changes (this is “supervised” discretization)
– See my weather data …

Temperature  64 65 68 69 70 71 72 72 75 75 80 81 83 85
Play?         Y  N  N  Y  N  N  Y  Y  N  Y  Y  N  N  N

• With OneR, there would only be one error on the training data … but …
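The (too) simple method can be sketched as follows: sort, then start a new interval whenever the class changes. The assumptions here are that instances with the same value can never be separated (which is why the two 75s cost one error) and that a tied value takes the first class seen; the function name `naive_intervals` is illustrative.

```python
from collections import Counter
from itertools import groupby

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = list("YNNYNNYYNYYNNN")

def naive_intervals(values, labels):
    """Put a breakpoint wherever the class changes between distinct values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    intervals = []  # each entry: [low, high, majority_class, errors]
    for value, group in groupby(pairs, key=lambda p: p[0]):
        # Instances sharing a value must stay together; take their majority.
        counts = Counter(label for _, label in group)
        majority, top = counts.most_common(1)[0]
        errors = sum(counts.values()) - top
        if intervals and intervals[-1][2] == majority:
            # Same class as the previous interval: extend it.
            intervals[-1][1] = value
            intervals[-1][3] += errors
        else:
            # Class changed: start a new interval.
            intervals.append([value, value, majority, errors])
    return intervals

intervals = naive_intervals(temps, play)
total_errors = sum(iv[3] for iv in intervals)
for low, high, majority, errors in intervals:
    print(f"{low}-{high}: {majority} ({errors} errors)")
print("total training errors:", total_errors)
```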
This is “overfitting”
• What makes 64 a different group than 65?
• Using this technique, the ideal division would be on a numeric primary key: every attribute value could get its own group and the error on training data would be 0 (but that is unlikely to be valuable for future prediction)
• Improvement via a heuristic: each group must have at least N members of its majority class (and the group keeps growing while the following instances have that majority class)
• In the book’s example, N = 3.
• In WEKA, the default is N = 6.
With N = 3 on My Weather temperature
• Hit the 3rd No at 70, then continue and include 71
• Hit the 3rd Yes at 75, then continue and include 80
• We’re actually just lucky here that the last group reaches 3 in its majority class. If one of them had been Yes, that still would have been the last group: there is no choice
Temperature  64 65 68 69 70 71 | 72 72 75 75 80 | 81 83 85
Play?         Y  N  N  Y  N  N |  Y  Y  N  Y  Y |  N  N  N

• 3 errors on this training data with this discretized attribute, but it is more likely to be useful for future predictions
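The heuristic can be sketched as a greedy pass over the sorted data. This is one reading of the slides, not WEKA's code: a group is closed once its majority class has N members and the next instance has a different class, equal values are never split across groups, and whatever is left at the end becomes the last group. The function name `min_bucket_groups` is made up for this sketch.

```python
from collections import Counter

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = list("YNNYNNYYNYYNNN")

def min_bucket_groups(values, labels, n_min=3):
    """Greedy grouping: each group needs n_min members of its majority class."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    groups, current, i = [], [], 0
    while i < len(pairs):
        current.append(pairs[i])
        i += 1
        counts = Counter(label for _, label in current)
        majority, top = counts.most_common(1)[0]
        if top < n_min:
            continue  # majority class not big enough yet, keep adding
        # Keep going while the next instance has the majority class
        # (or shares the last value, since equal values cannot be split).
        while i < len(pairs) and (pairs[i][1] == majority
                                  or pairs[i][0] == current[-1][0]):
            current.append(pairs[i])
            i += 1
        groups.append(current)
        current = []
    if current:
        groups.append(current)  # leftover instances form the last group
    return groups

groups = min_bucket_groups(temps, play, n_min=3)
bounds = [(g[0][0], g[-1][0]) for g in groups]
errors = sum(len(g) - max(Counter(label for _, label in g).values())
             for g in groups)
print(bounds, "errors:", errors)
```

The same function can be pointed at another numeric attribute by passing different value and label lists.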
With N = 3 on My Weather humidity
• In Class Exercise – What groups will we have?

Humidity  65 70 70 70 75 80 80 85 86 90 90 91 95 96
Play?      Y  N  Y  Y  N  N  N  N  N  Y  Y  N  Y  N
Let’s run WEKA
• My Weather Data
• First with default options
• Next with N = 3 (double-click the option area; this is WEKA option B)
Another Thing or Two
• Using this method, if two adjacent groups have the same majority class, they can be collapsed into one group
• (this doesn’t happen for temperature or humidity)
• We can’t do anything about missing values; they have to be in their own group
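Collapsing adjacent groups could look like the sketch below. Groups are represented as (low, high, majority class) tuples, which is a representation chosen here for illustration. The temperature groups are the ones from the N = 3 example; the second list is hypothetical, invented only to show a merge actually happening.

```python
def merge_adjacent(groups):
    """Collapse neighbouring groups that predict the same class.

    groups is a list of (low, high, majority_class) tuples sorted by low.
    """
    merged = []
    for low, high, majority in groups:
        if merged and merged[-1][2] == majority:
            # Same prediction as the previous group: widen that group.
            prev_low = merged[-1][0]
            merged[-1] = (prev_low, high, majority)
        else:
            merged.append((low, high, majority))
    return merged

# Temperature groups from the N = 3 example: classes alternate, nothing merges.
temperature = [(64, 71, "N"), (72, 80, "Y"), (81, 85, "N")]
# Hypothetical groups, invented here to show a merge.
hypothetical = [(1, 4, "Y"), (5, 9, "Y"), (10, 12, "N")]

print(merge_adjacent(temperature))
print(merge_adjacent(hypothetical))
```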
OneR in context
• The machine learning community had been using a set of available datasets to compare algorithms for a number of years
– http://www.ics.uci.edu/~mlearn/MLSummary.html
• Algorithms were getting more and more complicated, with only small gains in accuracy
• Holte (1993) said “the emperor has no clothes”: state-of-the-art methods were often only a few percentage points better, with much more complicated structural patterns (concept descriptions)
• OneR can provide a “baseline” against which other, more complicated methods can be compared
– If they improve significantly, use them, otherwise …
Class Exercise
Let’s run WEKA OneR on japanbank
• B option = 3
We can actually discretize and save data for future use using WEKA
• Preprocess tab
• Select Choose > Unsupervised > Attribute > Discretize
• Choose options
– Attribute indices (the numbers of the attributes to be binned, e.g. attributes 3-4)
– FindNumBins: have WEKA find a good number of groups for this data
– NumBins: the maximum number of groups to consider
• Click the Apply button
• Click the Save button to save to a permanent file
• Undo if necessary
End Section 4.1
