
Name: 王玉    Score: 90

Student ID: 202011080123    Grader: 黄杰

Zhongnan University of Economics and Law
Graduate Course Examination Paper
(Course Paper)

Paper Title: Machine Learning Algorithm Practice

——Based on the Iris Data Set

Course Name: Business Intelligence and Data Mining

Completion Date: 2021.12.30

Major and Year: E-commerce, Class of 2020


Contents

1. Introduction
2. Data Preparation and Pre-processing
3. Algorithm Description and Implementation
   3.1 Decision Tree
      3.1.1 Algorithm Description
      3.1.2 Training Process
      3.1.3 Implementation and Performance Analysis
   3.2 Support Vector Machines
      3.2.1 Algorithm Description
      3.2.2 Training Process
      3.2.3 Implementation and Performance Analysis
4. Result Analysis
   4.1 Comparison of Results
   4.2 Comparison of Algorithms and Improvement Directions
      4.2.1 Decision Tree
      4.2.2 Support Vector Machines

1. Introduction
The data set selected in this report is Iris. Iris is a genus of perennial herbaceous monocots in the family Iridaceae, with large, beautiful flowers of high ornamental value. There are about 300 species in the genus, and the Iris dataset contains three of them: Setosa, Versicolor, and Virginica, with 50 samples per species and 150 samples in total. Each sample records four attributes: sepal length, sepal width, petal length, and petal width, which can be used to predict the category an iris flower belongs to. The problem we are trying to solve is: given an iris plant, how can we infer which of the three species it belongs to from its measured characteristics? This is a classification problem, so we choose two classification algorithms to solve it and compare their accuracy.

2. Data Preparation and Pre-processing


The dataset we used in this report is the classic Iris dataset, which contains four feature variables: sepal length, sepal width, petal length, and petal width. We observed no null values in this dataset, so it can be used directly for data analysis.

Figure 1 Iris data feature scatter plot matrix


Figure 1 shows the scatter plot matrix of the Iris data distribution: both the horizontal and vertical coordinates run over the four attributes (petal width, petal length, sepal width, and sepal length), so the pairwise combinations of these attributes form 16 small panels in total. Panels that pair an attribute with itself are not very meaningful as scatter plots, so on the diagonal I instead plotted histograms of the dataset's distribution under that single attribute. The three Iris species are drawn in three different colors; by comparing the colors we can see how the three flowers are distributed over the different attributes and observe which attributes separate the species well. The histograms of sepal length and sepal width for the three species overlap heavily, which means that classifying flowers by these two attributes alone cannot achieve good results. In contrast, the distributions of the three species are clearly separated along petal length and petal width, so these two attributes distinguish the species much better. The scatter plot of petal length against petal width shows that the three species form three compact clusters under these two attributes, so classification based on them can be expected to work well.
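
A plot like Figure 1 can be reproduced in a few lines of MATLAB. The sketch below is one possible realization, assuming the built-in fisheriris data and the Statistics and Machine Learning Toolbox; the colors and marker choices are illustrative, not taken from the original figure.

% Sketch: scatter plot matrix of the four Iris features grouped by
% species, with histograms on the diagonal (cf. Figure 1).
load fisheriris                       % built-in: meas (150x4), species (150x1 cell)
names = {'sepal length','sepal width','petal length','petal width'};
gplotmatrix(meas, [], species, 'rgb', '.', [], 'on', 'hist', names, names);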

3. Algorithm Description and Implementation


3.1 Decision Tree
3.1.1 Algorithm Description
First, we selected the decision tree method to analyze the Iris data set. A decision tree is a graphical method based on probabilistic analysis; because the decision branches are drawn as a graph resembling the branches of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. It is a very common classification method and a form of supervised learning: given a set of samples, each with a set of attributes and a predetermined category label, a classifier is learned that can assign the correct category to new objects. Each internal node in the tree tests some attribute, each branch represents a possible value of that attribute, and each leaf node corresponds to the category reached by following the path from the root to that leaf. Decision trees are a frequently used technique in data mining, both for analyzing data and for making predictions; the machine learning technique for generating decision trees from data is called decision tree learning.
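
The report does not state the split-selection criterion explicitly; for concreteness, one standard criterion (the one referred to again in Section 4.2.1) is information gain, written here in standard notation rather than taken from the original text:

\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v),

where p_k is the proportion of class k among the samples in node D, and D^v is the subset of D taking the v-th value of attribute a; at each node the attribute with the largest gain is chosen. (MATLAB's fitctree, used below, defaults to the closely related Gini diversity index.)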

3.1.2 Training Process
Figure 2 shows the workflow of the decision tree algorithm:

Figure 2 Flow Chart of Decision Tree


3.1.3 Implementation and Performance Analysis
We used a decision tree model to analyze the Iris dataset; two attributes, petal length and petal width, were selected for the analysis. 120 data items were used as the training set and 30 as the test set, to train the model and evaluate its effect respectively. According to the analysis in the data preparation phase, the two attributes petal width and petal length have a strong ability to classify iris. A scatter plot was therefore drawn to observe the distribution of the data over these two attributes. The horizontal coordinate is the petal length and the vertical coordinate is the petal width; the red circles represent setosa, the purple diamonds represent versicolor, and the green rectangles represent virginica. It can be seen from the graph that these two attributes are very effective in distinguishing the three types of Iris.

Figure 3 Scatter Chart
After fitting a decision tree model to the data in MATLAB, we obtained the decision tree shown in Figure 4. It reads as follows: when the petal length is less than 2.45, the species is setosa; when the petal length is greater than or equal to 2.45 and the petal width is greater than or equal to 1.75, the species is virginica; when the petal length is greater than or equal to 2.45, the petal width is less than 1.75, and the petal length is less than 5.05, the species is versicolor; otherwise the species is virginica.

Figure 4 Decision Tree
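
The model above can be reproduced roughly as follows. This is only a sketch: the original report does not specify the random split, so the seed is an assumption and the fitted tree may differ slightly from Figure 4.

% Sketch: 120/30 stratified hold-out split and a decision tree on the
% two petal features (cf. Figure 4).
load fisheriris
X = meas(:, 3:4);                       % columns: petal length, petal width
rng(1);                                 % assumed seed, for reproducibility only
cv     = cvpartition(species, 'HoldOut', 30/150);
Xtrain = X(training(cv), :);  ytrain = species(training(cv));
Xtest  = X(test(cv), :);      ytest  = species(test(cv));
tree = fitctree(Xtrain, ytrain, 'PredictorNames', {'PetalLength','PetalWidth'});
view(tree, 'Mode', 'graph');            % renders the fitted tree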


Next, we analyzed the number of terminal nodes of the decision tree against the corresponding misclassification error cost. From Figure 5 we can see that the misclassification error cost is smallest when the number of terminal nodes is 3; that is, the best result is achieved at this size. The main purpose of pruning the decision tree is to combat overfitting: by actively removing some branches, the risk of overfitting is reduced.

Figure 5 Misclassification Error Cost of the Decision Tree
Next, we pruned the decision tree back to this optimal size; the pruned tree is shown in Figure 6. It shows that when the petal length is less than 2.45, the species is setosa; when the petal length is greater than or equal to 2.45, the species is versicolor if the petal width is less than 1.75, and virginica if the petal width is greater than or equal to 1.75. Compared with the previously trained tree, the pruned tree has fewer terminal nodes and is more efficient.

Figure 6 Decision Tree
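
Assuming the tree object from the previous sketch, the analysis in Figures 5 and 6 corresponds roughly to MATLAB's cvloss/prune pair:

% Sketch: cross-validated misclassification cost for every pruning level,
% then pruning back to the best level (three leaves, as in Figure 6).
[cost, se, nLeaf, bestLevel] = cvloss(tree, 'Subtrees', 'all');
plot(nLeaf, cost, '-o');                % cost vs. terminal nodes (cf. Figure 5)
xlabel('Number of terminal nodes');
ylabel('Misclassification error cost');
prunedTree = prune(tree, 'Level', bestLevel);
view(prunedTree, 'Mode', 'graph');      % pruned tree (cf. Figure 6)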


We tested this decision tree on the 30 previously selected test samples, obtaining the confusion matrix in Figure 7, and calculated the accuracy and sensitivity of the decision tree classification in Table 1. The prediction accuracy of the decision tree for the three flowers is 100%, 93.3%, and 93.3% respectively; the recall for setosa, versicolor, and virginica is 100%, 100%, and 86.7%; the precision is 100%, 71.4%, and 100%; and the F-score is 100%, 83.3%, and 92.9%. Overall, a relatively accurate predictive classification of this data could be made using the decision tree model.

Figure 7 Confusion Matrix


Table 1 shows the specific values of the various test indexes:

Table 1 Test results

Group        Accuracy   Recall   Specificity   Precision   F-score
Setosa       1          1        1             1           1
Versicolor   0.9333     1        0.92          0.7143      0.8333
Virginica    0.9333     0.8667   1             1           0.9286
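
The indexes in Table 1 all follow mechanically from the confusion matrix. One way to compute them, as a sketch reusing prunedTree and the hold-out split from the snippets above:

% Sketch: per-class accuracy, recall, specificity, precision and F-score
% from the confusion matrix (cf. Figure 7 and Table 1).
ypred      = predict(prunedTree, Xtest);
[C, order] = confusionmat(ytest, ypred);   % rows: true class, columns: predicted
N  = sum(C(:));
TP = diag(C);
FN = sum(C, 2) - TP;
FP = sum(C, 1)' - TP;
TN = N - TP - FN - FP;
accuracy    = (TP + TN) ./ N;
recall      = TP ./ (TP + FN);             % a.k.a. sensitivity
specificity = TN ./ (TN + FP);
precision   = TP ./ (TP + FP);
f_score     = 2 * precision .* recall ./ (precision + recall);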

3.2 Support Vector Machines


3.2.1 Algorithm Description
We also analyzed the Iris dataset using support vector machines (SVM). A support vector machine is fundamentally a binary classification model that maps the feature vectors of instances to points in space. The purpose of SVM is to find a hyperplane such that the two classes of data lie as far from the hyperplane as possible, so that new data can be classified more accurately and the classifier is more robust. SVM is suitable for small and medium-sized samples and for nonlinear, high-dimensional classification problems. The Iris dataset contains three flower classes with 150 samples, which meets the applicability conditions of the SVM method, so we selected SVM for the analysis of this dataset.
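
Formally, the "as far from the hyperplane as possible" idea is the hard-margin optimization problem, given here in standard notation rather than taken from the original text: for training pairs (x_i, y_i) with y_i \in \{-1, +1\},

\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \dots, n,

whose solution maximizes the margin 2 / \lVert w \rVert. Soft-margin and kernelized variants relax the constraints to handle non-separable and nonlinear data.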
3.2.2 Training Process
Figure 8 shows the flow chart of the SVM algorithm:

Figure 8 Flow Chart of SVM
3.2.3 Implementation and Performance Analysis
In this analysis, 120 data items are selected as the training set and 30 as the test set, to train the model and test its effect respectively. We first select two variables, petal width and petal length, and draw a scatter plot of the data to observe the general trend of its distribution. The horizontal coordinate is the petal length and the vertical coordinate is the petal width; the red dots represent setosa, the green dots represent versicolor, and the blue dots represent virginica. It can be seen that these two attributes discriminate well among the three flowers.

Figure 9 Scatter Plot
By training the SVM model, the partition shown in Figure 10 was obtained. This result is based on the distribution in the scatter plot: the whole region is divided into three blocks by the two attributes petal length and petal width. The region of blue points represents setosa, the region of red points represents versicolor, and the region of yellow points represents virginica. The crosses in the figure mark the test set, used to evaluate the classification effect of the SVM model.

Figure 10 Iris Classification Regions
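
Since MATLAB's fitcsvm is strictly binary, a three-class partition like Figure 10 is usually obtained by combining binary SVMs. The sketch below assumes fitcecoc with its default one-versus-one coding and default SVM learners, and approximates the regions by coloring a dense grid of predictions; the exact colors of Figure 10 are not reproduced.

% Sketch: multiclass SVM on the petal features, plus a colored grid of
% predictions approximating the regions in Figure 10.
svmModel = fitcecoc(Xtrain, ytrain);    % one-vs-one binary SVMs by default
[xg, yg] = meshgrid(linspace(min(X(:,1)), max(X(:,1)), 200), ...
                    linspace(min(X(:,2)), max(X(:,2)), 200));
gridPred = predict(svmModel, [xg(:), yg(:)]);
gscatter(xg(:), yg(:), gridPred);       % three colored regions
hold on
plot(Xtest(:,1), Xtest(:,2), 'kx');     % test set drawn as crosses
hold off
xlabel('petal length'); ylabel('petal width');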


Testing the prediction effect yields the confusion matrix in Figure 11. The SVM model has a prediction accuracy of 100% for setosa, 96.7% for versicolor, and 96.7% for virginica; the recall is 100%, 88.9%, and 100% respectively; the precision is 100%, 100%, and 91.7%; and the F-score is 100%, 94.1%, and 95.7%. Overall, the Iris data could be predicted well using the SVM model.

Figure 11 Confusion Matrix
Table 2 shows the specific values of the various test indexes:

Table 2 Test results

Group        Accuracy   Recall   Specificity   Precision   F-score
Setosa       1          1        1             1           1
Versicolor   0.9667     0.8889   1             1           0.9412
Virginica    0.9667     1        0.9474        0.9167      0.9565

4. Result Analysis
4.1 Comparison of Results
By comparing the confusion matrices of the test results of the two models, we can see that the test indexes under the SVM model are, on the whole, higher than those under the decision tree model; in particular, the per-class accuracy and F-score are at least as high for every species. This means that, in terms of the current model effect, the SVM model achieves a better prediction of the iris data.
4.2 Comparison of Algorithms and Improvement Directions
4.2.1 Decision Tree
Through the analysis, we can identify the following advantages of the decision tree algorithm. A decision tree is easy to understand and implement; after a brief explanation, people can grasp the meaning a decision tree expresses. Data preparation is often simple, whereas other techniques often require the data to be generalized first, for example by removing redundant or blank attributes; decision trees can handle both numerical and categorical attributes at the same time. Given an observation model, it is easy to derive the corresponding logical expressions from the generated tree, and it is easy to evaluate the model through static testing and so measure its credibility. Decision trees can produce feasible and effective results on large data sources in a relatively short time. The computational complexity is not high, the output is easy to interpret, the model can still be run when data are missing, and irrelevant features can be tolerated.
Decision trees also have some disadvantages: they are prone to overfitting, and for data whose categories have unequal sample sizes, the information gain criterion is biased towards features with more distinct values.
The decision tree model can be improved along several directions, as sketched below. We can sample the data set many times to form different training sets and train a model on each (bagging), so as to achieve higher accuracy and a better prediction effect. To improve accuracy further, each sample can be given a weight during training, and the weights of misclassified samples can be continually increased to speed up learning (boosting). A random forest can also be used to improve the prediction effect of a single decision tree.
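
The bagging and random forest suggestions above map directly onto MATLAB's TreeBagger. A minimal sketch, reusing the petal-feature hold-out split from Section 3.1.3 and an assumed forest size of 100 trees:

% Sketch: bagged trees / random forest via TreeBagger, one possible
% realization of the improvements suggested above.
rf     = TreeBagger(100, Xtrain, ytrain, 'Method', 'classification');
rfPred = predict(rf, Xtest);            % cell array of predicted labels
rfAcc  = mean(strcmp(rfPred, ytest));
fprintf('Random forest hold-out accuracy: %.3f\n', rfAcc);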
4.2.2 Support Vector Machines
Through comparison, we can observe several advantages of SVM; for example, SVM can obtain much better results than a decision tree on small training sets. The reason support vector machines have become one of the most commonly used and best-performing classifiers is their excellent generalization ability: the optimization goal is to minimize structural risk rather than empirical risk, so through the concept of the margin a structured description of the data distribution is obtained, which reduces the requirements on data scale and data distribution. SVM has strict mathematical theory behind it and strong interpretability, and it does not depend on statistical methods, thus simplifying the usual classification and regression problems. It is able to find the key samples (the support vectors) that are critical to the task; the final decision function is determined by only a small number of support vectors, and the complexity of the calculation depends on the number of support vectors rather than the dimensionality of the sample space, which avoids the "curse of dimensionality" in a sense. The SVM algorithm can model both linear and nonlinear decision boundaries through kernels, and it is fairly resistant to overfitting, especially in high-dimensional spaces. Finally, the classification idea of SVM is very simple: maximize the margin between the samples and the decision surface.
There are also some disadvantages of SVM. It needs a long training time; in prediction, the time is proportional to the number of support vectors, so when that number is large, the computational complexity of prediction is high. Support vector machines are therefore currently suitable only for tasks with small batches of samples, and cannot cope with tasks involving millions or even hundreds of millions of samples; they also require a lot of memory. Because selecting the correct kernel is important and difficult to tune, good results can be hard to obtain on fairly large data sets, and the SVM algorithm is difficult to implement on large-scale training samples. There are some difficulties in solving multi-classification problems with SVM, and the SVM model is sensitive to missing data, the choice of parameters, and the kernel function.
The classical support vector machine algorithm only performs binary classification, but practical data mining applications usually need to solve multi-class problems. These can be handled by combining multiple binary support vector machines; the main schemes are the one-versus-rest combination, the one-versus-one combination, and SVM decision trees, as sketched below. A multi-class problem can thus be solved by constructing a combination of multiple classifiers. The main principle is to overcome the inherent shortcomings of SVM and combine the advantages of other algorithms so as to improve classification accuracy on multi-class problems.
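
In MATLAB, the one-versus-one and one-versus-rest combinations are both exposed through fitcecoc's 'Coding' option. A minimal sketch, assuming the earlier hold-out split and, purely for illustration, RBF-kernel learners:

% Sketch: one-vs-one vs. one-vs-rest combinations of binary SVMs.
t   = templateSVM('KernelFunction', 'rbf', 'Standardize', true);
ovo = fitcecoc(Xtrain, ytrain, 'Learners', t, 'Coding', 'onevsone');
ova = fitcecoc(Xtrain, ytrain, 'Learners', t, 'Coding', 'onevsall');
fprintf('one-vs-one  accuracy: %.3f\n', mean(strcmp(predict(ovo, Xtest), ytest)));
fprintf('one-vs-rest accuracy: %.3f\n', mean(strcmp(predict(ova, Xtest), ytest)));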
