Solomatine 2004
Abstract: The applicability and performance of the so-called M5 model tree machine learning technique is investigated in a flood
forecasting problem for the upper reach of the Huai River in China. In one of the configurations this technique is compared to a
multilayer perceptron artificial neural network (ANN). It is shown that model trees, being analogous to piecewise linear functions,
have certain advantages compared to ANNs: they are more transparent and hence acceptable to decision makers, are very fast in
training, and always converge. The accuracy of M5 trees is similar to that of ANNs. Improved accuracy in predicting high floods was
achieved by building a modular model (a mixture of models); in it, the flood samples with special hydrological characteristics are
split into groups for which separate M5 and ANN models are built. The hybrid model combining a model tree and an ANN gives the best
prediction result.
DOI: 10.1061/(ASCE)1084-0699(2004)9:6(491)
CE Database subject headings: Hydrologic models; Hydrologic data; Flood forecasting; Artificial intelligence; China; Neural
networks; Multiple regression models; Data analysis.
1 Associate Professor, UNESCO-IHE Institute for Water Education (IHE Delft), P.O. Box 3015, 2601 DA Delft, The Netherlands (corresponding author). E-mail: [email protected]
2 Yellow River Conservancy Commission, 11 Jinshui Rd., 450003 Zhengzhou, China. E-mail: [email protected]
Note. Discussion open until April 1, 2005. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on October 29, 2002; approved on February 20, 2004. This paper is part of the Journal of Hydrologic Engineering, Vol. 9, No. 6, November 1, 2004. ©ASCE, ISSN 1084-0699/2004/6-491–501/$18.00.

Introduction

Artificial neural network (ANN) models have become a popular choice among the nonlinear flood forecasting methods (Hsu et al. 1995; Minns and Hall 1996; Solomatine and Torres 1996; Dawson and Wilby 1998; See and Openshaw 1998; Govindaraju and Rao 2000; Dibike and Solomatine 2001; Bhattacharya and Solomatine 2002a; Birikundavyi et al. 2002). Being an accurate predictive tool, the ANN technique has, however, a disadvantage that often limits its acceptance in practice: ANN models are not transparent ("black box") and do not help us to understand the nature of the solution. The arbitrary nature of the internal representation means that there may be dramatic variations between networks of identical architecture trained on the same data (Witten and Frank 2000). Recently some attempts were made to produce understandable insights from the structure of neural networks, such as saliency analysis (Abrahart et al. 2001) and the methods of recovering rules reported by Setiono et al. (2002). The latter method starts from building an ANN as the "right" tool that then needs better interpretability.

There are, however, approaches that, instead of constructing a single complex model, use a number of simpler "local" models specialized in particular areas of the input space (called mixtures of experts). Such models were developed already in the 1980s; see, for example, the paper on multilinear models by Becker and Kundzewicz (1987). Another method of this type comes from the "statistics" world: the approach of Friedman (1991) in his multiple adaptive regression splines algorithm. Yet another one, the subject of this paper, is the M5 model tree (Quinlan 1992; Witten and Frank 2000), a method attributed to the area of machine learning. The earlier classification and regression tree method of Breiman et al. (1984) should also be mentioned; it generates, however, zero-order models (constant output values for subsets of the input data) rather than first-order (linear) models.

The M5 algorithm combines the features of classification and regression: tree-structured regression is built on the assumption that the functional dependency is not constant in the whole domain, but can be approximated as such on smaller subdomains (Fig. 1). For continuous variables, these subdomains are searched for and characterized by the average value (regression trees) or by a linear regression function (model trees) of the dependent variable. In Fig. 1, for example, for the domain [x2 > 2, x1 > 2.5] Model 3 is used and its form is y = a0 + a1x1 + a2x2. The most attractive advantage is that, by dividing the function being induced into linear patches, M5 model trees provide a representation that is reproducible and comprehensible by practitioners.

Still, the M5 model tree is not a very popular method: to our knowledge, after the paper of Kompare et al. (1997) in the Slovene language, applications of M5 model trees in water-related problems are reported only by Solomatine (2002), by Solomatine and Dulal (2003) (for rainfall-runoff modeling), and by Bhattacharya and Solomatine (2002b) (for modeling the stage-discharge relationship).

In this study, which actually took place in 2000-2001, a rather complex catchment area, the upper reach of the Huai River, was considered as the study area, and the performance of various M5 model trees was investigated. In two of the five cases the M5 model tree is also compared to an ANN.
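The piecewise-linear structure just described (separate linear models on subdomains, as in Fig. 1) can be illustrated with a short sketch. The tree shape follows the Fig. 1 example, where the domain [x2 > 2, x1 > 2.5] uses Model 3 of the form y = a0 + a1x1 + a2x2; all coefficient values below are invented for illustration and are not a model from this study.

```python
# Illustrative sketch (not the authors' code): a model tree as a set of
# piecewise linear models. Routing follows the Fig. 1 example; the
# coefficients of Models 1-3 are hypothetical.

def model_tree_predict(x1, x2):
    """Route an input to the linear model of its subdomain."""
    if x2 <= 2.0:                          # split at the root node
        return 1.0 + 0.5 * x1              # Model 1 (hypothetical coefficients)
    elif x1 <= 2.5:
        return 2.0 + 0.3 * x2              # Model 2 (hypothetical coefficients)
    else:                                  # domain [x2 > 2, x1 > 2.5]
        return 0.5 + 0.4 * x1 + 0.2 * x2   # Model 3: y = a0 + a1*x1 + a2*x2

print(model_tree_predict(3.0, 4.0))        # falls in the Model 3 subdomain
```

Each input thus activates exactly one local linear model, which is what makes the overall function transparent to inspect.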
Introduction to M5 Model Trees and Artificial Neural Network

M5 Model Trees

The M5 model tree algorithm was originally developed by Quinlan (1992); we used the software implementing its variation M5′ provided by Witten and Frank (2000). Model trees combine a conventional decision tree with the possibility of generating linear regression functions at the leaves. This representation is relatively perspicuous because the decision structure is clear and the regression functions do not normally involve many variables. The M5 tree is a piecewise linear model, so it takes an intermediate position between linear models such as ARIMA and truly nonlinear models such as ANNs.

The construction of a model tree is similar to that of decision trees. Fig. 1(a) illustrates how the splitting of space is done. First, the initial tree is built; then it is pruned (reduced) to overcome the overfitting problem, that is, the problem of a model that is very accurate on the training data set but fails on the test set. Finally, a smoothing process is employed to compensate for the sharp discontinuities between adjacent linear models at the leaves of the pruned tree (this operation is not needed in building a decision tree).

Building Model Trees

Different decision tree induction algorithms used to solve classification problems employ the divide-and-conquer approach. First, an attribute is selected to be placed at the root node and one branch is made for each possible value; then the example set is split up into subsets, one for every value of the attribute. The process can be repeated recursively for each branch using only those samples that actually reach the branch. If at any time all samples at a node have the same classification, the development of that part of the tree is stopped. The attribute chosen for a split for a given set of samples is determined by a certain statistical property called a splitting criterion. For decision trees the splitting is based on trying to minimize the entropy in the resulting subsets; in other words, trying to filter as many samples from the same class into one subset as possible.

The M5 model tree is a numerical prediction algorithm, and its splitting criterion is based on the standard deviation of the values in the subset T of the training data that reaches a particular node (an analogue of entropy). It is used as a measure of the error at that node, and the attribute that maximizes the expected error reduction is chosen for splitting at the node. Accordingly, in Fig. 1 the attribute X2 is selected for the root node with the split value 2.0.

The splitting process terminates when the output values of the samples that reach a node vary only slightly, that is, when their standard deviation is just a small fraction (say, less than 5%) of the standard deviation of the original sample set. Splitting also terminates when just a few samples remain in a subset. Linear regression models are then built for each subset of samples associated with the terminating (leaf) nodes.

Pruning and Smoothing Model Trees

Pruning. If a generated tree has too many leaves, it may be "too accurate" and hence overfit and be a poor generalizer. It is possible to make a tree more robust by simplifying it, i.e., by pruning: merging some of the lower subtrees into one node.

Smoothing. This process is used to compensate for the sharp discontinuities that will inevitably occur between adjacent linear models at the leaves of the pruned trees. This is a particular problem for models constructed from a small number of training samples. Smoothing can be accomplished by producing linear models for each internal node, as well as for the leaves, at the time the tree is built. Experiments show that smoothing substantially increases the accuracy of prediction.

Fig. 4(c) presents a tree combining seven linear regression models at the leaves. In parentheses, the first number is the number of samples in the subset sorted to this leaf and the second
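The splitting criterion described above, choosing the split that maximizes the expected reduction in standard deviation, can be sketched as follows. This is a simplified one-attribute scan on toy data, not the M5′ implementation of Witten and Frank; the data values are invented.

```python
# A minimal sketch of the M5 splitting idea: pick the threshold that
# maximizes the standard deviation reduction (SDR), the analogue of
# entropy reduction used for classification trees.

from statistics import pstdev

def sdr(y, y_left, y_right):
    """Standard deviation reduction of a candidate binary split."""
    n = len(y)
    return pstdev(y) - (len(y_left) / n) * pstdev(y_left) \
                     - (len(y_right) / n) * pstdev(y_right)

def best_split(x, y):
    """Scan one attribute for the threshold with maximal SDR."""
    best = (None, -1.0)
    for t in sorted(set(x))[:-1]:          # candidate thresholds
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = sdr(y, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data: the output jumps when x crosses 2.0, so SDR peaks there.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5]
print(best_split(x, y))                    # the chosen threshold is 2.0
```

In the full algorithm this scan runs over all attributes, and splitting stops when a subset's standard deviation falls below a small fraction (say 5%) of that of the original sample set, as described above.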
splitting attribute is QCt, the upstream discharge on the current day; it has the maximum correlation with the predicted discharge QXt+1. The attributes at lower levels are QXt, PaMov4t, and PaMov4t−1; they also appear frequently in the subbranches. The attributes Pat, Pat−1, Pat−2, and QCt−1 are less important, appear only at or near leaf nodes in the trees, and are thus indicative of some special situations.

As shown in Figs. 3(a and b) and Table 1, the M5 model tree can predict the low flow correctly, but has higher errors in predicting some of the flood peaks: RMSE was 69 m3/s in training and 84 m3/s in testing. Nevertheless, the M5 model tree error was 54% smaller than that of the naïve "no-change" model and 47% smaller than that of the three-point linear regression model.

The high error in flood forecasting was attributed to the fact that the number of samples corresponding to high flow was much smaller than that of the low flow in the full-year data set. As a result, out of the 35 rules that the M5 model generated, there was only one linear model for the samples with QXt > 721 m3/s corresponding to the flood situation.

Zooming-In: Better Models for Extreme Flows

In order to reproduce the extreme-flow situations better, two other models were built: one for the selected high flows only (filtered by the value of QX), and the other for the data collected during the flood season (filtered by the time constraints).

High-Flows Global Model

A separate model for the flows QXt+1 > 500 m3/s was set up, with 234 samples used for training and 80 for testing. The same 11 inputs were used and a model tree with 11 equations was generated. Most of the equations, however, have rather high errors, with only one rule having an error smaller than 10%. RMSE was 281 m3/s in training and 411 m3/s in testing.

Interestingly, in the nodes corresponding to higher discharge values, instead of QXt the rainfall on the previous day, Pat−1, begins to appear at the top layers of the generated M5 model tree; this means that this attribute became the most important one for predicting the discharge QXt+1. The physical explanation of this fact is that in the flood season the rapid increase in discharge occurs after intensive rainfall (Pat−1); this is different from low-flow conditions, when there is not much influence of rainfall. So, in spite of the errors, the M5 model has correctly suggested that the flood discharge has characteristics different from those of the low flow: this is consistent with the physics of the hydrological processes.

Flood Season Global Model

This model dealt only with the flood season (FS) data from May to October across the 21-year time series, and the 2-day moving average of area rainfall was used instead of the 4-day average (since it has a higher correlation with QXt+1). Correlation analysis led to the selection of 16 input attributes (Pat, Pat−1, Pat−2, Pat−3, PaMov2t, PaMov2t−1, PaMov2t−2, PaMov2t−3, PaMov2t−4, QCt, QCt−1, QCt−2, QXt, QXt−1, QXt−2, and QXt−3). Two versions of model trees were built: one with all 16 attributes (this model had 11 regression equations) and a simpler version with seven attributes (Pat, Pat−1, PaMov2t, PaMov2t−1, QCt, QCt−1, and QXt) and seven equations. The accuracy of prediction was very similar.

An ANN model with the same input and output variables was built as well. The popular three-layer feed-forward ANN topology was employed, and linear activation functions were used in the output layer since they delivered better performance compared to sigmoid or tangent ones. The classical backpropagation training method was adopted. The Neural Machine (Neural Machine 2003) and NeuroSolutions (NeuroSolutions 2003) software packages were used.

The performance of the models is shown in Figs. 4(a and b) and Table 1, and the induced tree in Fig. 4(c). The overall ANN prediction result is similar to that of the M5 model trees: its RMSE is 10% higher than that of M5, the mean absolute error (MAE) is the same, and the maximum absolute error (MaxAE) is 12% lower. Fig. 4(b) shows that the prediction of high flows by the M5 model has improved. However, both M5 and ANN still have a high error in predicting some flood events, and the maximum error occurs during the same flood events. This means that the input data have to be processed more efficiently and some new attributes should be added to improve the prediction accuracy.

Modular Models (Mixtures of Models): Combining Expert Rules with M5 Trees

A more accurate analysis of the hydrological processes in the catchment and the error analysis of the models reported above led us to the conclusion that the conditions used so far were too superficial and did not allow the data-driven models to classify various flood conditions into physically interpretable classes.

In order to improve the effectiveness of the predictive model, expert-generated rules were used to build modular models. The whole flood season data set was split into subsets using domain knowledge.
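The modular ("mixture of models") idea, an expert rule routing each sample to a specialized local model, can be sketched as follows. The routing threshold QXt−1 > 1,000 m3/s for Module 1 is taken from the text; the component models below are trivial placeholders, not the fitted M5 or ANN modules.

```python
# Sketch of a modular model: an expert-generated rule decides which local
# model handles a sample. Component models are hypothetical placeholders.

def module1(sample):
    """Placeholder for the Module 1 local model (high antecedent discharge)."""
    return 1.05 * sample["QXt"]

def module2(sample):
    """Placeholder for the Module 2 local model (remaining flood-season data)."""
    return sample["QXt"] + 2.0 * sample["PaMov2t"]

def modular_predict(sample):
    """Expert-rule router: QXt-1 > 1,000 m3/s goes to Module 1."""
    if sample["QXt_1"] > 1000.0:       # QXt-1, discharge on the previous day
        return module1(sample)
    return module2(sample)

high = {"QXt_1": 1500.0, "QXt": 1400.0, "PaMov2t": 30.0}
low  = {"QXt_1": 400.0,  "QXt": 350.0,  "PaMov2t": 10.0}
print(modular_predict(high), modular_predict(low))
```

The point of this design is that each local model is trained only on samples of one physically interpretable class, instead of a single global model averaging over all regimes.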
Module 1 (FS-m1-M5 Model). This model was built for the discharges on the previous day QXt−1 higher than 1,000 m3/s. Figs. 6(a and b) show that the model is very accurate; Fig. 6(c) presents the M5 model, which is very concise and easy to understand. Table 2 shows that the prediction error of this model is much lower than that of the flood season global model FS-M5 calculated only for the samples with QXt−1 > 1,000 m3/s.

It was found that many samples processed by Module 1 are still associated with low-flow predictions, which are not so interesting for flood forecasting and which are already predicted well by the global model anyway. This prompted further filtering of the data and consideration of the next local model, Module 2.

[Fig. 6. FS-m1-M5 model performance (Module 1, samples with QXt−1 > 1,000 m3/s) in (a) training and (b) testing; (c) M5 model tree for the FS-m1-M5 model]
predictions, and its prediction performance is close to that of the global model. Fig. 7(c) presents the resulting model tree.

Analysis of Errors for Modules 1 and 2. Figs. 8(a and b) present the individual flood event hydrographs with the measured and calculated discharges. It can be seen that the data points of Module 1 either lie in the peak and recession part of the flood events, or lie in the rising limb of a flood of long duration. In these cases the soil moisture is saturated and the prediction is not affected by a flash flood at a tributary, so Module 1 gives a good prediction. The situation corresponding to the samples of Module 2 is more complex. If the point lies in the relatively low-flow part, the prediction is still good. However, the points that are close to the peaks of flood events of short duration are not predicted well. This can be explained by the flood effect of the antecedent rainfall, and by the heterogeneous distribution of rainfall that is not accounted for due to averaging.

From the generated M5 model trees of Module 2 [Fig. 7(c)], it can be seen that the intensive floods have been classified reasonably well (PaMov2t > 40.5), and even the distribution of the rainfall duration is modeled correctly. The data filtered into Module 2 do not exhaust the possibilities of building more accurate local models. Consider, for example, equation LM8, which is responsible for modeling short-duration heavy rainfall in the middle reach or downstream part (QXt < 870, QCt−2 ≤ 8). The boundary of 870 for QXt, which corresponds to the antecedent rainfall (soil moisture), is somewhat misleading: the value 870 is too high for that. If, for example, there is a small rainfall of long duration, the discharge does not rise much, but the soil becomes saturated. Based on these considerations a new class, called Module 3, was constructed.

[Fig. 9. Performance of FS-m3-M5 model in (a) training and (b) testing]
Table 3. M5 Model Performance for Module 2 (Flood Season Data with QXt−1 < 1,000 and QXt > 200)
(columns: FS-m2-M5 model | FS-M5 model, extracted samples)
Table 4. M5 Model Performance for Module 3 (Pat−1 > 50 and PaMov2t−2 < 5 and PaMov2t−4 < 5, Using Flood Season Data)
(columns: FS-m3-M5 model | FS-M5 model, extracted samples)
Table 5. M5 Model Trees and Artificial Neural Network (ANN) for Module 2

                           Training 1976-1989                  Testing 1995-1996
Performance                FS-ANN(a)  FS-m2-ANN  FS-m2-M5     FS-ANN(a)  FS-m2-ANN  FS-m2-M5
Mean absolute error          125.2       91.7       97.3        121.8       24.5       20.8
Maximum absolute error       1,519      1,135      1,173        1,519      1,258      1,460
Root mean squared error      229.7      154.7      176.6        266.6      253.6      254.8
Correlation coefficient      0.929      0.968      0.96         0.893      0.925      0.91
(a) FS-ANN, extracted samples.
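The four performance measures reported in the tables (MAE, maximum absolute error, RMSE, and the correlation coefficient) can be computed as follows. This is a generic sketch with invented toy data, not the authors' evaluation code.

```python
# Sketch of the performance measures used in Tables 1-5, computed for a
# toy pair of observed/predicted discharge series (values are made up).

from math import sqrt

def metrics(obs, pred):
    n = len(obs)
    errs = [p - o for o, p in zip(obs, pred)]
    mae   = sum(abs(e) for e in errs) / n            # mean absolute error
    maxae = max(abs(e) for e in errs)                # maximum absolute error
    rmse  = sqrt(sum(e * e for e in errs) / n)       # root mean squared error
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    corr = cov / sqrt(sum((o - mo) ** 2 for o in obs)
                      * sum((p - mp) ** 2 for p in pred))
    return mae, maxae, rmse, corr

obs  = [100.0, 250.0, 900.0, 1500.0, 400.0]
pred = [110.0, 230.0, 950.0, 1400.0, 420.0]
mae, maxae, rmse, corr = metrics(obs, pred)
print(round(mae, 1), round(maxae, 1), round(rmse, 1), round(corr, 3))
# MAE = 40.0, MaxAE = 100.0
```

Note that MaxAE is dominated by a single flood event, which is why the text tracks it separately from RMSE when comparing the global and modular models.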
Another way to improve the modeling performance would be to use the distributed rainfall as input. Several experiments were conducted, but the lack of detailed data did not allow for drawing reliable conclusions.

Conclusions and Recommendations

1. Data-driven (machine learning) models are capable of performing rainfall-runoff forecasting, even for a rather complex catchment system. The performance of M5 model trees is comparable to that of the widely used MLP ANNs.
2. The advantageous features of M5 model trees compared to ANNs are:
• the generated tree-like structure of linear models is reproducible and easy to understand for decision makers, and it makes it possible for a hydrologist to have a good overview of the relationships between the hydrological characteristics;
• the M5 algorithm allows one to easily generate a family of interpretable models with different numbers of component models/leaves and hence different robustness and accuracy;
• training of M5 model trees is much faster than that of ANNs and always converges; and
• the knowledge encapsulated in a model tree may also help in parameter selection and in assessing their relationships