王玉 20201108012390
Student ID: 202011080123    Grader: 黄杰
Zhongnan University of Economics and Law
Graduate Course Examination Paper
(Course Paper)
Course: Business Intelligence and Data Mining
Date of Completion: 2021.12.30
3.1.2 Training Process
Figure 2 shows the operation flow of the decision tree:
Figure 3 Scatter Chart
After analyzing the data with the decision tree model in MATLAB, we obtained the decision tree shown in Figure 4, which classifies the samples as follows: when the petal length is less than 2.45, the species is setosa; when the petal length is greater than or equal to 2.45 and the petal width is greater than or equal to 1.75, the species is virginica; when the petal length is greater than or equal to 2.45, the petal width is less than 1.75, and the petal length is less than 5.05, the species is versicolor; otherwise, the species is virginica.
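As a minimal sketch of this training step, assuming MATLAB's Statistics and Machine Learning Toolbox and its built-in Fisher iris data (the variable names are illustrative, and the learned split thresholds should only be close to, not necessarily identical to, the values quoted above):

% Load the Fisher iris data shipped with MATLAB: meas is a 150x4 matrix of
% sepal/petal measurements and species is a 150x1 cell array of labels.
load fisheriris

% Fit a classification tree on petal length and petal width (columns 3 and 4).
tree = fitctree(meas(:,3:4), species, ...
    'PredictorNames', {'petal_length', 'petal_width'});

% Display the tree graphically, as in Figure 4.
view(tree, 'Mode', 'graph')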
Figure 5 Misclassification Error Cost of the Decision Tree
Next, we pruned the decision tree based on this optimal solution; the pruned decision tree is shown in Figure 6. It classifies the samples as follows: when the petal length is less than 2.45, the species is setosa; when the petal length is greater than 2.45, the species is versicolor if the petal width is less than 1.75 and virginica if the petal width is greater than or equal to 1.75. Compared with the previously trained decision tree, the pruned tree has fewer terminal nodes and is more efficient.
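A hedged sketch of this pruning step, continuing from the tree fitted above (cvloss and prune are standard ClassificationTree methods; whether the best level matches Figure 5 exactly depends on the cross-validation split):

% Cross-validated misclassification cost at every pruning level,
% the quantity plotted in Figure 5; bestLevel minimizes that cost.
[cost, se, nLeaf, bestLevel] = cvloss(tree, 'SubTrees', 'all');

% Prune back to the best level and display the smaller tree (Figure 6).
prunedTree = prune(tree, 'Level', bestLevel);
view(prunedTree, 'Mode', 'graph')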
Figure 8 Flow Chart of SVM
3.2.3 Implementation and Performance Analysis
In this analysis, 120 samples are selected as the training set and 30 samples as the test set, used to train the model and to evaluate its performance, respectively. We first select two variables, petal width and petal length, to draw a scatter plot of the data and observe the general trend of the data distribution. The horizontal axis is the petal length and the vertical axis is the petal width; the red dots represent setosa, the green dots represent versicolor, and the blue dots represent virginica. It can be seen that these two attributes separate the three species well.
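A minimal sketch of this split and plot, assuming a stratified random holdout (the paper does not state how the 120/30 split was drawn, so cvpartition and the fixed seed are assumptions):

% Hold out 30 of the 150 samples for testing; the remaining 120 train the model.
rng(1)                                      % fixed seed, for reproducibility
cv = cvpartition(species, 'HoldOut', 30);
Xtrain = meas(training(cv), 3:4);  Ytrain = species(training(cv));
Xtest  = meas(test(cv), 3:4);      Ytest  = species(test(cv));

% Scatter plot of petal length vs. petal width, grouped by species (Figure 9).
gscatter(meas(:,3), meas(:,4), species)
xlabel('petal length'); ylabel('petal width')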
Figure 9 Scatter Plot
By training the SVM model, the partition shown in Figure 10 was obtained. This result is based on the distribution in the scatter plot: the whole region is divided into three blocks by the two attributes of petal length and petal width. The region of blue points represents setosa, the region of red points represents versicolor, and the region of yellow points represents virginica. The crosses in the figure represent the test set, which is used to evaluate the classification performance of the SVM model.
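One hedged way to reproduce this kind of decision-region plot is to train a multiclass SVM with fitcecoc (MATLAB's error-correcting output codes wrapper around binary SVM learners) and color a dense grid of points by the predicted class; the grid range, markers, and colors here are illustrative, not necessarily those of Figure 10:

% Train a multiclass SVM on the two petal features of the training set.
svmModel = fitcecoc(Xtrain, Ytrain);

% Classify every point of a dense grid to paint the three decision regions.
[x1, x2] = meshgrid(1:0.02:7, 0:0.02:2.6);
gridLabels = predict(svmModel, [x1(:), x2(:)]);
gscatter(x1(:), x2(:), gridLabels)
hold on
plot(Xtest(:,1), Xtest(:,2), 'kx')          % test samples drawn as crosses
hold off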
Figure 11 Confusion Matrix
Table 2 shows the specific values of each evaluation metric:
Table 2 Test Results
Group        Accuracy   Recall   Specificity   Precision   F-score
Setosa       1          1        1             1           1
Versicolor   0.9667     0.8889   1             1           0.9412
Virginica    0.9667     1        0.9474        0.9167      0.9565
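These per-class values follow from the confusion matrix in Figure 11 under the usual one-vs-rest definitions; a hedged MATLAB sketch of the computation (confusionmat is the standard function, and the variable names continue the sketches above):

% Confusion matrix on the test set: rows are true classes, columns predictions.
Ypred = predict(svmModel, Xtest);
C = confusionmat(Ytest, Ypred);

% One-vs-rest metrics for each class, matching the columns of Table 2.
for k = 1:size(C,1)
    TP = C(k,k);
    FN = sum(C(k,:)) - TP;
    FP = sum(C(:,k)) - TP;
    TN = sum(C(:)) - TP - FN - FP;
    accuracy    = (TP + TN) / sum(C(:));
    recall      = TP / (TP + FN);
    specificity = TN / (TN + FP);
    precision   = TP / (TP + FP);
    f1          = 2 * precision * recall / (precision + recall);
    fprintf('class %d: acc %.4f rec %.4f spec %.4f prec %.4f F1 %.4f\n', ...
            k, accuracy, recall, specificity, precision, f1);
end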
4. Result Analysis
4.1 Comparison of Results
By comparing the confusion matrices of the test results of the two models, we can see that the accuracy, recall, specificity, precision, and F-score under the SVM model are higher than the corresponding metrics under the decision tree model, which means that, in terms of the current model effect, the SVM model achieves a better prediction on the iris data.
4.2 Comparison of Algorithms and Improvement Directions
4.2.1 Decision Tree
Through this analysis, we can find that the decision tree algorithm has the following advantages. Decision trees are easy to understand and implement: after a brief explanation, people can grasp the meaning a decision tree expresses. Data preparation for decision trees is often simple, whereas other techniques frequently require the data to be preprocessed first, for example by removing redundant or blank attributes; decision trees can also handle both numerical and categorical attributes at the same time. Given an observation model, it is easy to derive the corresponding logical expression from the generated tree, and the model is easy to evaluate through static testing, which makes it possible to measure its credibility. Decision trees can produce feasible and effective results on large data sources in a relatively short time; the computational complexity is not high, the output is easy to interpret, the model can still run when data are missing, and irrelevant features can be handled.
Decision trees also have some disadvantages: they are prone to overfitting, and for data whose classes have inconsistent sample sizes, the information gain results are biased toward features with more distinct values.
The decision tree model can be improved in several ways. We can train the model on repeated samples drawn from the data set to form different training sets, so as to achieve higher accuracy and a better prediction effect. To improve the accuracy of the decision tree algorithm, each sample can be given a weight during training, and the weights of misclassified samples can be continually adjusted to speed up learning. The random forest algorithm can also be used to improve the prediction effect of the decision tree, as sketched below.
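A hedged sketch of the random forest suggestion, using MATLAB's built-in TreeBagger class (the ensemble size of 50 trees and the variable names are illustrative choices, not taken from the paper):

% Bagged ensemble of 50 decision trees, i.e. a random forest.
forest = TreeBagger(50, Xtrain, Ytrain, 'Method', 'classification');

% TreeBagger's predict returns labels as a cell array of character vectors.
Ypred = predict(forest, Xtest);
accuracy = mean(strcmp(Ypred, Ytest));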
4.2.2 Support Vector Machines
Through comparison, we can observe some advantages of SVM; for example, SVM can obtain much better results than a decision tree on a small training set. The reason support vector machines have become one of the most commonly used and best-performing classifiers is their excellent generalization ability, which comes from the fact that their optimization goal is to minimize structural risk rather than empirical risk. Through the concept of the margin, a structured description of the data distribution is obtained, which reduces the requirements on data scale and data distribution. SVM has strict mathematical theory behind it, strong interpretability, and does not rely on statistical methods, thus simplifying the usual classification and regression problems; it can identify the key samples (the support vectors) that are critical to the task. The final decision function is determined by only a small number of support vectors, and the computational complexity depends on the number of support vectors rather than the dimensionality of the sample space, which avoids the "curse of dimensionality" in a sense. Based on the kernel, the SVM algorithm can model decision boundaries for both linear and nonlinear problems, and it is also fairly resistant to overfitting, especially in high-dimensional spaces. The classification idea of SVM is very simple: maximize the margin between the samples and the decision surface.
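This margin-maximization idea can be stated as the textbook hard-margin primal problem (a standard formulation, added here for reference rather than taken from the paper): for training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$,

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n,$$

where the margin between the two supporting hyperplanes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.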
There are also some disadvantages of SVM. For example, it needs a long training time, and in prediction, the prediction time is proportional to the number of support vectors, so when the number of support vectors is large, the prediction is computationally expensive. Support vector machines are therefore currently suitable only for tasks with small batches of samples and cannot scale to tasks with millions or even hundreds of millions of samples. Support vector machines also require a lot of memory. Because selecting the correct kernel is important, the model is difficult to tune, and good results are hard to obtain on fairly large data sets; the SVM algorithm is difficult to apply to large-scale training samples. There are also some difficulties in solving multi-class classification problems with SVM, and the SVM model is sensitive to missing data, the choice of parameters, and the kernel function.
The classical support vector machine algorithm only provides a binary classifier, but practical data mining applications usually need to solve multi-class classification problems. This can be handled by combining multiple binary support vector machines, mainly in the one-versus-rest mode, the one-versus-one mode, or an SVM decision tree; the problem is then solved by constructing a combination of multiple classifiers. The main principle is to overcome the inherent shortcomings of SVM and combine the advantages of other algorithms so as to improve the classification accuracy on multi-class problems, as in the sketch below.
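In MATLAB, this combination strategy is what fitcecoc implements; a hedged sketch of the two standard coding schemes mentioned above (the option values are the documented ones, and the variable names continue the earlier sketches):

% One-versus-one: one binary SVM per pair of classes (fitcecoc's default).
ovoModel = fitcecoc(Xtrain, Ytrain, 'Coding', 'onevsone');

% One-versus-rest: one binary SVM per class against all the others.
ovrModel = fitcecoc(Xtrain, Ytrain, 'Coding', 'onevsall');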