
Najlaa Ramli (17219402)

WQD7005 Data Mining – Alternative Assessment

14th January 2021

1. Define data mining in terms of BI

Data mining is a branch of data science that searches through vast datasets to discover
patterns in the data. It is a part of Business Intelligence (BI), which encompasses the
generation, aggregation, analysis and visualization of data. Business Intelligence involves
both OLTP and OLAP systems: in general, OLTP systems provide the source data, whereas
OLAP systems help to analyse it.

Data mining typically takes place within OLAP systems, where the complex queries it
requires can be executed. Data mining also deals with historical, summarized, integrated,
multi-dimensional and consolidated data.

2. Statistics of the data

a. Mean (average value in the data) = 29.96


b. Median (value separating the higher half from the lower half of a data sample) = 25
c. Mode (value that appears most frequently in the data sample) = 25 and 35
d. Smoothing by bin means

Step 1: Sort the values and arrange them into bins, with 3 values per bin.
Step 2: Calculate the smoothed values by replacing each value in a bin with that bin's mean.
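
To make the computation concrete, here is a minimal Python sketch of the same statistics and of smoothing by bin means. The list of values is only a placeholder, since the original sample from the question is not reproduced here.

```python
import statistics

# Placeholder sample; substitute the actual values from the question.
values = [5, 10, 15, 25, 25, 25, 35, 35, 50, 75]

print("Mean:", statistics.mean(values))        # average value
print("Median:", statistics.median(values))    # middle value of the sorted sample
print("Modes:", statistics.multimode(values))  # most frequent value(s)

# Smoothing by bin means: sort the values, split into bins of 3,
# then replace every value in a bin with that bin's mean.
sorted_vals = sorted(values)
bins = [sorted_vals[i:i + 3] for i in range(0, len(sorted_vals), 3)]
smoothed = [[round(statistics.mean(b), 2)] * len(b) for b in bins]
print("Bins:", bins)
print("Smoothed:", smoothed)
```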

3. Snowflake schema
4. Frequent Pattern (FP) tree
Step 1: Calculate the support (frequency) for each item.

Step 2: Rearrange the items in each itemset in descending order of support.

Step 3: Draw the FP tree by inserting the reordered transactions one by one, sharing common prefixes (a code sketch of steps 1 and 2 is given below):
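
A minimal Python sketch of steps 1 and 2 (support counting and reordering) follows. The transactions are placeholders, as the actual itemsets from the question are not listed here; step 3 is described in the final comment.

```python
from collections import Counter

# Placeholder transactions; substitute the itemsets from the question.
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "eggs"],
    ["bread", "milk", "eggs", "butter"],
]
min_support = 2

# Step 1: count the support (frequency) of each item.
support = Counter(item for t in transactions for item in t)

# Step 2: drop infrequent items and reorder each transaction by descending support.
frequent = {item: s for item, s in support.items() if s >= min_support}
ordered = [
    sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
    for t in transactions
]

print("Support counts:", dict(support))
print("Reordered transactions:", ordered)
# Step 3 would insert each reordered transaction into a prefix tree (the FP tree),
# sharing nodes for common prefixes and incrementing their counts.
```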


5. Information Gain on Condition
Entropy (Profit) = -[p(+) log2 p(+) + p(-) log2 p(-)]

= -[(5/10) log2(5/10) + (5/10) log2(5/10)]

= 1

Entropy for each value of Condition (taking 0 log2(0) = 0 by convention):

Entropy (Old) = -[(3/3) log2(3/3) + (0/3) log2(0/3)] = 0

Entropy (Mid) = -[(2/4) log2(2/4) + (2/4) log2(2/4)] = 1

Entropy (New) = -[(0/3) log2(0/3) + (3/3) log2(3/3)] = 0

Entropy (Profit | Condition) = weighted sum of the entropies above

= (3/10)(0) + (4/10)(1) + (3/10)(0)

= 0.4

Information Gain (Condition) = Entropy (Profit) - Entropy (Profit | Condition)

= 1 - 0.4

= 0.6
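
The same calculation can be verified with a short Python script. The class counts below are taken directly from the worked example above (5 positive and 5 negative instances overall, split by Condition into Old 3/0, Mid 2/2 and New 0/3).

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Counts from the worked example above.
overall = [5, 5]
splits = {"Old": [3, 0], "Mid": [2, 2], "New": [0, 3]}

e_profit = entropy(overall)                                               # 1.0
total = sum(overall)
e_condition = sum(sum(c) / total * entropy(c) for c in splits.values())  # 0.4
gain = e_profit - e_condition                                             # 0.6

print("Entropy(Profit) =", e_profit)
print("Entropy(Profit | Condition) =", e_condition)
print("Information Gain(Condition) =", gain)
```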

6. Steps of DBScan algorithm


The main concept of the DBSCAN algorithm is to locate regions of high density that are
separated from one another by regions of low density.

The steps in DBSCAN are listed below (a scikit-learn sketch follows the list):

 Arbitrarily select a point p
 Retrieve all points that are density-reachable from p with respect to Eps and MinPts (where Eps is the maximum radius of the neighbourhood and MinPts is the minimum number of points required in an Eps-neighbourhood of that point)
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been processed
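
As a rough illustration of these steps, scikit-learn's DBSCAN implementation can be run on synthetic data. The two-blob data below is made up purely for demonstration; the parameters eps and min_samples correspond to Eps and MinPts above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs plus a few scattered outliers.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(30, 2)),
    rng.uniform(low=-2, high=7, size=(5, 2)),
])

# eps         -> maximum radius of the neighbourhood (Eps)
# min_samples -> minimum number of points in an Eps-neighbourhood (MinPts)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points that belong to no cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))
```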

PART B – Using dataset Data(exam).csv


1. Solution to select the best non-target features

Feature selection is the process whereby we automatically or manually select the features that
contribute most to our target variable or output. We can use the Chi-Square statistical test to
select the best non-target features.

Chi-Square is a very simple tool for univariate feature selection for classification. It does not
take feature interactions into consideration, and it is best suited for the categorical variables
that we have in our dataset.

We will have to perform this test on each non-target feature separately.

Formula for Chi-Square:

Chi-Square = Σ (O - E)² / E

where "O" stands for the observed (actual) value and "E" stands for the expected value if the
two categories are independent. If they are independent, the O and E values will be close; if
they have some association, the Chi-Square value will be high.

In our case, features with higher Chi-Square values are the more important ones and will
therefore be selected.
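
One possible way to script this test is with scikit-learn's SelectKBest and chi2. This is only a sketch: the target column name "Class" and the choice of k = 3 are assumptions and should be adjusted to match Data(exam).csv.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Assumed target column name "Class"; adjust to the actual target in Data(exam).csv.
df = pd.read_csv("Data(exam).csv")
X_raw = df.drop(columns=["Class"])
y = LabelEncoder().fit_transform(df["Class"])

# chi2 needs non-negative numeric inputs, so encode the categorical features first.
X = OrdinalEncoder().fit_transform(X_raw)

# Score every non-target feature and keep the k with the highest Chi-Square values.
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)  # k=3 is only an example
scores = pd.Series(selector.scores_, index=X_raw.columns).sort_values(ascending=False)
print(scores)                       # higher score = stronger association with the target
print("Selected:", list(scores.index[:3]))
```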

2. Simulate the classification algorithms (I am using WEKA to perform all simulations).

i. Naïve Bayes
ii. Random Forest
iii. Support Vector Machine
In terms of classification accuracy, the algorithms correctly classified:
 Naïve Bayes: 1 instance
 Random Forest: 2 instances
 SVM: 2 instances

In terms of False Positive rate, the ranges for each algorithm are as follows:

 Naïve Bayes: 0.0 to 0.110


 Random Forest: 0.0 to 0.03
 SVM: 0.0 to 0.152

Based on these criteria, Random Forest appears to be the better-performing algorithm, as it
correctly classifies more instances and also has a narrower range of False Positive rates.
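
Although the simulations above were run in WEKA, a comparable cross-validated comparison could be scripted with scikit-learn. The sketch below is illustrative rather than a reproduction of the WEKA runs: the target column name "Class" is an assumption, GaussianNB stands in for WEKA's NaiveBayes, and the fold count may need adjusting for a small dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.svm import SVC

# Assumed target column "Class"; adjust to match Data(exam).csv.
df = pd.read_csv("Data(exam).csv")
X = OrdinalEncoder().fit_transform(df.drop(columns=["Class"]))
y = LabelEncoder().fit_transform(df["Class"])

models = {
    "Naive Bayes": GaussianNB(),  # simplification; a categorical NB is closer to WEKA's
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

for name, model in models.items():
    # Cross-validated predictions roughly mirror WEKA's cross-validation output.
    pred = cross_val_predict(model, X, y, cv=3)
    print(name, "accuracy:", round(accuracy_score(y, pred), 3))
    print(confusion_matrix(y, pred))  # off-diagonal counts drive the False Positive rates
```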

3. Performance metric of all algorithms in terms of the Receiver Operating Characteristic (ROC) curve

The ROC curve summarizes all of the confusion matrices that each classification threshold
produces. It provides a visualization of whether our classifiers are appropriate. The Y-axis of
the ROC is the True Positive Rate (Sensitivity), and the X-axis is the False Positive Rate
(1 - Specificity). A threshold that produces a higher value on the Y-axis and a lower value on
the X-axis (i.e. the top-left section of the ROC) is preferable. Depending on how much False
Positive Rate we are willing to accept, we can select the optimal algorithm.

The Area Under the ROC curve (AUC) is another indicator of whether the classifier is
appropriate. Higher values of AUC are preferable.

To illustrate, the ROC curves for identifying ‘Lion’ from the dataset are provided below:

Naïve Bayes:

Random Forest:
Support Vector Machine (SVM):

Based on the ROC screenshots for all three algorithms, none of them can be considered a very
good classifier. This is because none of the algorithms produces points in the top-left section of
the ROC. When the False Positive Rates are low or near zero, the True Positive Rates are also
low. Hence, the accuracy of the classifiers produced by the algorithms is compromised.

However, looking at the AUC, we can observe that SVM has a slightly higher value compared to
the other two algorithms. Even so, it remains inconclusive whether SVM is the better-performing algorithm.
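
For completeness, an ROC/AUC check similar to the WEKA screenshots could also be scripted, as sketched below for the SVM, treating ‘Lion’ as the positive class in a one-vs-rest setup. The target column name "Class" is an assumption, and the cross-validation fold count is arbitrary and may need to change for a small dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import OrdinalEncoder
from sklearn.svm import SVC

# Assumed target column "Class" containing the label 'Lion'.
df = pd.read_csv("Data(exam).csv")
X = OrdinalEncoder().fit_transform(df.drop(columns=["Class"]))
y = (df["Class"] == "Lion").astype(int)   # one-vs-rest: 'Lion' is the positive class

# ROC needs probability scores; SVC requires probability=True for predict_proba.
scores = cross_val_predict(SVC(probability=True), X, y, cv=3,
                           method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(y, scores)
print("AUC:", round(roc_auc_score(y, scores), 3))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.title("ROC curve for identifying 'Lion' (SVM)")
plt.show()
```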
