Introduction to Data Mining
Lab 1: Introduction to Weka
1.1. Introduction
Weka is open-source software, available at www.cs.waikato.ac.nz/ml/weka, and stands for the
Waikato Environment for Knowledge Analysis. It offers clean, spare implementations of the simplest
techniques, designed to aid understanding of data mining methods. It also provides a workbench
that includes full, working, state-of-the-art implementations of many popular learning schemes that can
be used for practical data mining or for research.
In the first class, we are going to get started with Weka: exploring the “Explorer” interface, exploring
some datasets, building a classifier, using filters, and visualizing a dataset. (See the Class 1 lecture
by Ian H. Witten, [1].)
Task: Take notes on how you find the Explorer, and answer the questions in the following sections.
1.2. Exploring the Explorer
Follow the instructions in [1]
1.3. Exploring datasets
Follow the instructions in [1]
In the dataset weather.nominal.arff, how many attributes are there in the relation? What are their values?
What is the class, and what are its values? Count the instances for each attribute value. (A code sketch for
checking these counts outside the Explorer follows the table below.)
Relation: weather.symbolic    #Instances: 14    #Attributes: 5

Attribute       Values (#instances)
outlook         sunny (5), overcast (4), rainy (5); Distinct: 3
temperature     hot (4), mild (6), cool (4); Distinct: 3
humidity        high (7), normal (7); Distinct: 2
windy           TRUE (6), FALSE (8); Distinct: 2
play (class)    yes (9), no (5); Distinct: 2
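These counts can also be checked outside the Explorer. The sketch below is a minimal example using the Weka Java API; it assumes weather.nominal.arff is in the working directory and that the last attribute, play, is the class.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExploreDataset {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and use the last attribute (play) as the class
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Relation name, number of instances and number of attributes
        System.out.println("Relation:    " + data.relationName());
        System.out.println("#Instances:  " + data.numInstances());
        System.out.println("#Attributes: " + data.numAttributes());

        // Per-attribute overview (type, distinct values, missing values, ...)
        System.out.println(data.toSummaryString());

        // Instance counts for each value of the class attribute
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        for (int i = 0; i < counts.length; i++) {
            System.out.println(data.classAttribute().value(i) + ": " + counts[i]);
        }
    }
}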
Similarly, examine the datasets weather.numeric.arff and glass.arff.
weather.numeric.arff

Relation: weather    #Instances: 14    #Attributes: 5

Attribute       Values / statistics
outlook         sunny (5), overcast (4), rainy (5); Distinct: 3
temperature     Minimum 64, Maximum 85, Mean 73.571, StdDev 6.572; Distinct: 12
humidity        Minimum 65, Maximum 96, Mean 81.643, StdDev 10.285; Distinct: 10
windy           TRUE (6), FALSE (8); Distinct: 2
play (class)    yes (9), no (5); Distinct: 2
glass.arff

Relation: Glass    #Instances: 214    #Attributes: 10

Attribute       Values / statistics
RI              Minimum 1.511, Maximum 1.534, Mean 1.518, StdDev 0.003; Distinct: 178
Na              Minimum 10.73, Maximum 17.38, Mean 13.408, StdDev 0.817; Distinct: 142
Mg              Minimum 0, Maximum 4.49, Mean 2.685, StdDev 1.442; Distinct: 94
Al              Minimum 0.29, Maximum 3.5, Mean 1.445, StdDev 0.499; Distinct: 118
Si              Minimum 69.81, Maximum 75.41, Mean 72.651, StdDev 0.775; Distinct: 133
K               Minimum 0, Maximum 6.21, Mean 0.497, StdDev 0.652; Distinct: 65
Ca              Minimum 5.43, Maximum 16.19, Mean 8.957, StdDev 1.423; Distinct: 143
Ba              Minimum 0, Maximum 3.15, Mean 0.175, StdDev 0.497; Distinct: 34
Fe              Minimum 0, Maximum 0.51, Mean 0.057, StdDev 0.097; Distinct: 32
Type (class)    build wind float (70), build wind non-float (76), vehic wind float (17),
                vehic wind non-float (0), containers (13), tableware (9), headlamps (29);
                Distinct: 6
Create a file in ARFF format and examine it. (A sketch of such a file is given after the summary table below.)
Relation: air_quality    #Instances: 10    #Attributes: 5

Attribute          Values / statistics
temperature        Minimum 20, Maximum 35, Mean 27.8, StdDev 4.803; Distinct: 10
humidity           Minimum 50, Maximum 90, Mean 70.8, StdDev 13.155; Distinct: 10
CO2_level          Minimum 300, Maximum 800, Mean 535, StdDev 171.675; Distinct: 9
wind_speed         Minimum 2, Maximum 7, Mean 4.1, StdDev 1.663; Distinct: 6
pollution (class)  low (4), moderate (3), high (3); Distinct: 3
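For reference, an ARFF file that produces a summary like the one above has the structure sketched below. The @relation and @attribute declarations follow the table; the @data rows shown are only illustrative placeholders, not the actual ten instances of the file described above.

@relation air_quality

@attribute temperature numeric
@attribute humidity numeric
@attribute CO2_level numeric
@attribute wind_speed numeric
@attribute pollution {low,moderate,high}

@data
20,90,300,7,low
28,72,520,4,moderate
35,50,800,2,high
% ... remaining instances omitted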
1.4. Building a classifier
Follow the instructions in [1]
Examine the output of J48 vs. RandomTree applied to dataset glass.arff
Algorithm     Pruned/unpruned    minNumObj    No. of leaves    Correctly classified instances
J48           unpruned           15           8                131
RandomTree    N/A                N/A          N/A              150
Evaluate the confusion matrix every time you run an algorithm.
J48 - unpruned - minNumObj = 15:
The classifier is skewed towards predicting a = build wind float and b = build wind non-float.
RandomTree:
The classifier is likewise skewed towards predicting a = build wind float and b = build wind non-float.
However, RandomTree gives better results than J48 on this run.
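These runs can also be reproduced outside the Explorer through the Weka Java API. The sketch below is one possible way to do it, assuming glass.arff is in the working directory and using 10-fold cross-validation with seed 1; the exact counts depend on the test option and random seed, so they may not match the table above exactly.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareTrees {
    public static void main(String[] args) throws Exception {
        // Load glass.arff and use the last attribute (Type) as the class
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // J48 with the settings used in the table above
        J48 j48 = new J48();
        j48.setUnpruned(true);    // "unpruned" in the Explorer
        j48.setMinNumObj(15);     // "minNumObj" in the Explorer

        // RandomTree with its default settings
        RandomTree randomTree = new RandomTree();

        for (Classifier c : new Classifier[] {j48, randomTree}) {
            // 10-fold cross-validation with a fixed random seed
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println("=== " + c.getClass().getSimpleName() + " ===");
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());   // confusion matrix
        }
    }
}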
1.5. Using a filter
Follow the instructions in [1], and note the following:
- Use a filter to remove an attribute.
  What are attributeIndices?
- Remove instances where humidity is high.
  What are nominalIndices?
  (A code sketch using both filters is given at the end of this section.)
- Fewer attributes, better classification:
Follow the instructions in [1], review the outputs of J48 applied to glass.arff:
Filter                                    Leaf size    Correctly classified instances    Remark
Original
Remove Fe
Remove all attributes except RI and Mg
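The following is one possible sketch of the two filters asked about above, using the Weka Java API instead of the Explorer. Remove drops the attributes selected by attributeIndices (1-based indices or ranges such as "1,3-5" or "last"); RemoveWithValues drops instances whose chosen nominal attribute (attributeIndex) takes one of the values listed in nominalIndices. The humidity example assumes weather.nominal.arff, where humidity is attribute 3 and "high" is its first value; adjust the indices for other datasets.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class FilterExamples {
    public static void main(String[] args) throws Exception {
        // 1) Remove an attribute: drop Fe (attribute 9, 1-based) from glass.arff
        Instances glass = DataSource.read("glass.arff");
        Remove remove = new Remove();
        remove.setAttributeIndices("9");    // attributeIndices: which attributes to remove
        remove.setInputFormat(glass);
        Instances glassNoFe = Filter.useFilter(glass, remove);
        System.out.println("Attributes after Remove: " + glassNoFe.numAttributes());

        // 2) Remove instances where humidity is high (weather.nominal.arff)
        Instances weather = DataSource.read("weather.nominal.arff");
        RemoveWithValues rwv = new RemoveWithValues();
        rwv.setAttributeIndex("3");         // humidity is the 3rd attribute
        rwv.setNominalIndices("1");         // nominalIndices: "high" is its 1st value
        // Check the number of remaining instances; toggle invertSelection
        // if the matching sense is the opposite of what you want.
        rwv.setInputFormat(weather);
        Instances filtered = Filter.useFilter(weather, rwv);
        System.out.println("Instances after RemoveWithValues: " + filtered.numInstances());
    }
}

In the Explorer, the same operations are available as the unsupervised attribute filter Remove and the unsupervised instance filter RemoveWithValues.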
1.6. Visualizing your data
Follow the instructions in [1]. How do you find “Visualize classifier errors”?
After running J48 on iris.arff, determine:
- How many instances are predicted wrongly?
- What are they? (A code sketch for listing them follows the table below.)
Instance Predicted class Actual class
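One way to fill in this table programmatically is sketched below: it trains J48 on iris.arff and lists the training instances the tree misclassifies. This is only an assumed setup; the Explorer’s “Visualize classifier errors” window shows errors under whichever test option you selected (e.g. cross-validation), so its list may differ from the training-set errors printed here.

import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ListErrors {
    public static void main(String[] args) throws Exception {
        // Load iris.arff and use the last attribute as the class
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train J48 with default settings on the full dataset
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Print every instance whose predicted class differs from its actual class
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            double predicted = tree.classifyInstance(inst);
            double actual = inst.classValue();
            if (predicted != actual) {
                System.out.printf("Instance %d: predicted=%s, actual=%s%n",
                        i + 1,
                        data.classAttribute().value((int) predicted),
                        data.classAttribute().value((int) actual));
            }
        }
    }
}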