1 Introduction

Prediction problems are popular in real applications such as power scheduling [1], bioinformatics [2], and environmental science [3]. The systems in real applications are generally complex, so the prediction task is often challenging, particularly for multi-model data. Moreover, for multi-model data, effectively identifying the model that generates each data point is both difficult and crucial for achieving satisfactory predictions. Therefore, in this paper we concentrate on multi-model data prediction by accurately recognising each underlying model and learning the data belonging to it.

In predictive method research, scholars often assume that the data come from a single model, and many different types of models have been developed, such as statistical models [4,5,6], Kalman filter methods [7, 8], artificial neural networks [9,10,11], deep learning models [12,13,14], and fuzzy logic methods [15,16,17]. These models have been applied in various fields such as epidemic forecasting [18, 19], portfolio selection [20], weather forecasting [21, 22], financial forecasting [23, 24], and chloride diffusion forecasting in building materials [25]. Although these single-model methods can provide good predictions [20, 26,27,28,29], their effectiveness is often constrained in complex scenarios with multi-model data due to their limited model assumptions. It is noted that their prediction performance improves when the underlying models of the data are diagnosed and modelled separately. Therefore, some advanced clustering-based predictive models have been proposed to address the problem of multi-model data modelling. Initially, algorithms such as K-means clustering [30,31,32], the fuzzy C-means algorithm [33], and multi-objective semi-supervised clustering models [34, 35] are used to identify the underlying data models. Then, models such as binary logistic regression [36], the Lion-Wolf based deep belief network [37], kernel ridge regression [31], grey prediction [38], and support vector machines [39, 40] are applied for data prediction. To improve the prediction accuracy, feature extraction, such as principal component analysis, is also carried out. One of the most representative works is [34], in which the authors established a multi-objective semi-supervised clustering model for a predictive task on clinical data, where two objective functions were designed to minimise the cost function of K-medians clustering and the mean square error of cross-validation; the K-medians clustering was employed to diagnose the underlying models of the data. Later, [35] further improved upon the work of [34] by modifying the objective functions to minimise the deviation of the data points within the clusters and the prediction error of the response. It should be noted that the non-dominated sorting genetic algorithm II (NSGA-II) was used to solve the multi-objective optimisation problem, and local regression was used to learn each model. The works in [30, 31, 33, 34, 36, 37], and [35] all cluster the data first and then learn each cluster separately. Because the data in each cluster come from one model and share similar characteristics, this approach significantly improves prediction accuracy and stability. The multi-objective optimisation models in [34] and [35] also help to overcome the sensitivity of clustering to its initial setting.

However, the optimal number of clusters cannot be determined directly, and the computational cost of finding it by trial and error is too high. In addition, clustering the data by a simple Manhattan distance does not distinguish data from different models well, because such data are intermingled. Furthermore, real-world data are usually non-linear, so linear regression is ineffective; a non-linear prediction model can be adopted to address this problem. Non-linear support vector regression (SVR) is one of the most popular methods for non-linear regression [41], but its performance is sensitive to parameter settings [42]. Thus, a dynamic parameter adjustment mechanism can further improve computational efficiency [43].

Building on these ideas, we propose an adaptive regression algorithm to forecast multi-model data based on a clustering procedure, using a cluster-feedback mechanism (CFM) and a modified support vector regression (MSVR). Specifically, we use the CFM to find the data models via a clustering process [44]. By adaptively selecting the number of clusters based on prediction residuals, CFM not only reduces the computational cost but also enhances prediction accuracy. Moreover, CFM helps the proposed algorithm pay more attention to the original, unknown data-generating models. Following the clustering process, we use non-linear SVR with a radial basis function to forecast the samples in the different clusters. To improve the generalisation ability of each forecasting model, we propose the MSVR, which uses NSGA-II to find better parameters for each SVR model. In summary, the contributions of this paper are as follows:

  1. An adaptive regression algorithm named CFM-MSVR, based on a cluster-feedback mechanism, is proposed to forecast multi-model data. In particular, CFM is used to cluster the samples according to the residual of each sample in each forecasting model. The CFM helps to find more reasonable clusters that are closer to the real models, which improves the forecasting accuracy.

  2. Based on the clusters adjusted by CFM, the number of clusters can be estimated intelligently from the number of samples in each cluster, which decreases the computational cost, the number of parameters, and the dependence on empirical settings.

  3. Focusing on non-linear forecasting problems, we propose the MSVR, which combines non-linear SVR with the non-dominated sorting genetic algorithm II (NSGA-II). To avoid overfitting, NSGA-II is used to optimise the number of support vectors and the prediction error, which improves the generalisation and forecasting ability.

  4. Finally, we demonstrate the effectiveness of the proposed algorithm in terms of computational cost, generalisation ability, and forecasting accuracy on one simulated dataset, four real datasets, and a case study.

The organisation of this paper is as follows: Sect. 2 outlines the non-linear SVR algorithm and NSGA-II. Section 3 describes the proposed adaptive regression algorithm (CFM-MSVR) in detail. Section 4 tests the proposed method on one simulated dataset and four real datasets. Section 5 illustrates the performance of CFM-MSVR on data from the Global Energy Forecasting Competition 2012 (GEFCom2012). Section 6 concludes the paper.

2 Background

In this section, the non-linear SVR algorithm and NSGA-II are introduced, as they form the basis of the proposed MSVR.

2.1 The non-linear SVR algorithm

The non-linear SVR algorithm is an effective forecasting method proposed by [45]. Compared with linear regression, non-linear SVR performs better when forecasting high-dimensional data [46]. Given a set of training data \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\) of size N, where \(x_i \in R^{D},\ i=1, 2,\cdots , N\) are input feature vectors and \(y_i \in R^1,\ i=1, 2,\cdots , N\) are label values, we use a non-linear function f(x) to regress the input data in a high-dimensional space as follows:

$$\begin{aligned} f(x) = w^T\varphi (x) + b, \end{aligned}$$
(1)

where \(x \in R^D\) is the input vector, w is the weight vector, and b is the bias. \(\varphi (x)\) is a non-linear function used to map x into a high-dimensional Hilbert space [47]. The weight vector w and the bias b can be calculated by solving the following optimisation problem:

$$\begin{aligned}&\text {Minimize: } \frac{1}{2}\left\| w\right\| ^2 + C\sum _{i=1}^{N}(\zeta _i + \zeta _i^*).\\&\text {Subject to:}\\&\left\{ \begin{aligned}&y_i - w^T\varphi (x_i) - b - \epsilon - \zeta _i \le 0,\\&w^T\varphi (x_i) + b - y_i - \epsilon - \zeta _i^* \le 0, \quad i = 1,2,\cdots ,N,\\&\zeta _i \ge 0, \zeta _i^* \ge 0, C> 0,\\ \end{aligned} \right. \\ \end{aligned}$$

where C is the penalty factor used to avoid overfitting, \(\zeta _i\) and \(\zeta _i^*\) are slack variables, and \(\epsilon\) is a constant defining the insensitive margin [48]. Finally, the regression function can be expressed as follows:

$$\begin{aligned} f = \sum _{i=1}^{N}(\beta _i - \beta _i^*)k(x, x_i) + b, \end{aligned}$$
(2)

where \(\beta _i\) and \(\beta _i^*\) are Lagrange multipliers and \(k(x, x_i) = <\varphi (x), \varphi (x_i)>\) is the kernel function; common kernel functions include the linear, polynomial, and Gaussian kernels.

Remark 1

The selection of the penalty factor C, the kernel function \(k(x, x_i)\), and \(\epsilon\) matters a lot to the forecasting accuracy. However, the selection of these parameters is difficult and is most of the time based on experience [49].
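For concreteness, the following minimal sketch fits an \(\epsilon\)-SVR with an RBF kernel for two hand-picked parameter settings and reports the number of support vectors and the training error. Scikit-learn's SVR is used here only as a stand-in for Eqs. (1)–(2), and the toy data and parameter values are illustrative assumptions rather than settings from this paper.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a noisy non-linear relationship (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=200)

# Two hand-picked (C, epsilon) settings; the forecasting accuracy depends on them.
for C, eps in [(1.0, 0.1), (100.0, 0.01)]:
    model = SVR(kernel="rbf", C=C, epsilon=eps)  # Gaussian (RBF) kernel k(x, x_i)
    model.fit(X, y)
    mae = np.mean(np.abs(model.predict(X) - y))
    print(f"C={C}, eps={eps}: {len(model.support_)} support vectors, train MAE={mae:.4f}")
```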

2.2 The non-dominated sorting genetic algorithm II

The non-dominated sorting genetic algorithm II (NSGA-II), proposed by [50], is one of the most popular algorithms for solving multi-objective optimisation problems. Compared with NSGA, NSGA-II adds elitism and decreases the computational complexity and the number of parameters [51]. NSGA-II contains three main procedures: fast non-dominated sorting, crowding distance, and the selection operator [51]. A brief description of these procedures is given below.

2.2.1 Fast non-dominated sorting

Fast non-dominated sorting is proposed based on the concept of Pareto dominance. Suppose that a multi-objective minimisation problem includes n decision variables \(u_i,\ i = 1, 2, \cdots , n\) and k objective functions \(g_j,\ j = 1, 2, \cdots , k\). For any \(p,q\in \{1, 2, \cdots , n\}\), if \(g_j(u_p) \le g_j(u_q)\) for all \(j=1,\cdots ,k\) and \(g_j(u_p) < g_j(u_q)\) for at least one j, we say that \(u_p\) dominates \(u_q\), denoted by \(u_p \succ u_q\).

Moreover, a decision variable u is named as Pareto optimal if u is not dominated by any other variable.

The procedures of fast non-dominated sorting are shown as follows:

Step 1: Initialise \(\Psi (u_i) = 0\) and \(S(u_i) = \emptyset\), where \(\Psi (u_i)\) is the number of solutions which dominate \(u_i\) and \(S(u_i)\) is the set of solutions which are dominated by \(u_i\). For all \(i, j = 1,2,\cdots , n\), if \(u_i \succ u_j\), let \(\Psi (u_j) = \Psi (u_j) + 1\) and put \(u_j\) in \(S(u_i)\).

Step 2: Let \(k = 1\) and \(P_{k}\) be the set of kth front. Set \((r_1, r_2,\cdots , r_n)^T = \textbf{0}\).

Step 3: Find \(Q \subset \{1,2,\cdots ,n\}\) such that \(\Psi (u_i) = 0\) and \(r_i = 0\) for all \(i \in Q\). Let \(r_i = k\) for all \(i \in Q\), and put \(u_i\) into \(P_k\).

Step 4: Find all solutions \(u_q\) in the \(S(u_i)\) for all \(i \in Q\), and let \(\Psi (u_q) = \Psi (u_q) - 1\).

Step 5: If \(r_i \ne 0\) for all \(i = 1, 2, \cdots , n\), stop. Otherwise, \(k = k + 1\), go to Step 3.
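As a concrete illustration, a minimal Python sketch of Steps 1–5 for a minimisation problem is given below; it is written from the description above rather than taken from the authors' implementation.

```python
import numpy as np

def fast_non_dominated_sort(G):
    """G: (n, k) array with G[i, j] = g_j(u_i). Returns the rank r_i of each solution (1 = first front)."""
    n = len(G)
    dominates = lambda p, q: np.all(G[p] <= G[q]) and np.any(G[p] < G[q])
    S = [[] for _ in range(n)]         # S[i]: solutions dominated by u_i
    psi = np.zeros(n, dtype=int)       # psi[i]: number of solutions dominating u_i
    for p in range(n):
        for q in range(n):
            if p != q and dominates(p, q):
                S[p].append(q)
                psi[q] += 1
    rank = np.zeros(n, dtype=int)
    k, front = 1, [i for i in range(n) if psi[i] == 0]
    while front:                       # peel off one front per pass
        for i in front:
            rank[i] = k
        nxt = []
        for i in front:
            for q in S[i]:
                psi[q] -= 1
                if psi[q] == 0:
                    nxt.append(q)
        front, k = nxt, k + 1
    return rank
```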

2.2.2 Crowding distance

Although the solutions have been ranked by fast non-dominated sorting, it is difficult to judge which of the solutions within the same rank is better. Therefore, the crowding distance is calculated to rank the solutions within the same rank. The crowding distance is defined as follows:

$$\begin{aligned} cd(u_i) = \sum _{j=1}^{k}\frac{g_j(u_{i+1}) - g_j(u_{i-1})}{g_j^{max} - g_j^{min}}, \end{aligned}$$
(3)

where \(u_{i+1}\) and \(u_{i-1}\) are the neighbours of \(u_i\) when the solutions are sorted by the jth objective, and \(g_j^{max}\) is the largest value among \(g_j(u_s),\ s = 1, 2,\cdots , n\). The definition of \(g_j^{min}\) is analogous. The solution with the larger crowding distance is considered the better solution.
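A corresponding sketch of Eq. (3) is given below; the convention of assigning an infinite distance to the two boundary solutions of each objective is standard in NSGA-II but is an addition to the formula stated above.

```python
import numpy as np

def crowding_distance(G):
    """G: (n, k) objective values of the solutions in one front; returns cd(u_i) as in Eq. (3)."""
    n, k = G.shape
    cd = np.zeros(n)
    for j in range(k):
        order = np.argsort(G[:, j])                       # neighbours are taken in sorted order of g_j
        span = (G[order[-1], j] - G[order[0], j]) or 1.0  # g_j^max - g_j^min, guarded against zero
        cd[order[0]] = cd[order[-1]] = np.inf             # boundary solutions (common convention)
        for s in range(1, n - 1):
            cd[order[s]] += (G[order[s + 1], j] - G[order[s - 1], j]) / span
    return cd
```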

2.2.3 Selection operator

When generating the next population, n solutions are selected based on the fast non-dominated sorting and the crowding distance. For \(p,q \in \{1, 2, \cdots , 2n\}\), the rule for selecting between \(u_p\) and \(u_q\) is as follows: (1) If \(r_p < r_q\), retain the solution \(u_p\) in the next generation. (2) If \(r_p = r_q\), calculate the crowding distances \(cd(u_p)\) and \(cd(u_q)\); if \(cd(u_p)> cd(u_q)\), retain the solution \(u_p\) in the next generation, and otherwise retain the solution \(u_q\).

With the components outlined above, the complete NSGA-II process is shown as follows.

First, generate an initial population \(u = \{u_1, u_2, \cdots , u_n\}\). Then, a new generation \(u' = \{u'_1, u'_2, \cdots , u'_n\}\) is generated by crossover and mutation. Let \(u^* = u \cup u'\) and rank all members of \(u^*\) by fast non-dominated sorting and crowding distance. After that, select n members using the selection operator as u for the next generation. The workflow of NSGA-II is shown in Fig. 1.
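For completeness, a compact sketch of one NSGA-II generation built from routines like those sketched above is shown below; the crossover and mutation operators are reduced to a simple Gaussian perturbation purely for illustration, since the paper does not specify its variation operators.

```python
import numpy as np

def nsga2_survive(U, G, n):
    """Keep the n best members of the combined population U (decision vectors)
    with objective values G, ranking by front first and crowding distance second."""
    rank = fast_non_dominated_sort(G)
    cd = np.zeros(len(U))
    for r in np.unique(rank):
        idx = np.where(rank == r)[0]
        cd[idx] = crowding_distance(G[idx])
    order = sorted(range(len(U)), key=lambda i: (rank[i], -cd[i]))
    return U[order[:n]]

def nsga2(U, objectives, generations=50, sigma=0.1, seed=0):
    """U: (n, d) initial population; objectives(u) returns the k objective values of u."""
    rng = np.random.default_rng(seed)
    n = len(U)
    for _ in range(generations):
        children = U + rng.normal(scale=sigma, size=U.shape)  # placeholder for crossover/mutation
        pool = np.vstack([U, children])
        G = np.array([objectives(u) for u in pool])
        U = nsga2_survive(pool, G, n)
    return U
```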

Fig. 1 The flow chart of NSGA-II. Rectangles of the same size represent populations with the same number of agents; the grey rectangle represents the agents that are discarded

3 Proposed Method

In this section, the adaptive regression algorithm (CFM-MSVR) for forecasting multi-model data based on a clustering process is described in detail. Its two main components are the cluster-feedback mechanism (CFM) and the modified SVR (MSVR). First, the CFM is used to find the data models and adjust the number of clusters. Then, MSVR is used to calculate the regression models and tune the parameters for each model. Finally, the complete CFM-MSVR built on CFM and MSVR is presented.

3.1 The cluster-feedback mechanism

First, we initialise m clusters. The m centres \(\{c_1, c_2, \cdots , c_m\}\) are randomly chosen from the samples \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\). Let \(Cl = \{Cl_i,\ i = 1,2,\cdots , m\}\), where \(Cl_i\) is the set of the ith cluster. Then, for each sample \((x_t,y_t),\ t=1,\cdots ,N\), let \(d((x_t,y_t),c_i)\) be the Manhattan distance from \((x_t,y_t)\) to \(c_i\). Find \(i^* \in \{1,2,\cdots ,m\}\) such that \(d((x_t, y_t), c_{i^*})\) is the minimum value in \(\{d((x_t, y_t), c_1),\cdots , d((x_t, y_t), c_m)\}\), and add \((x_t,y_t)\) to \(Cl_{i^*}\).

Supposing we have obtained the forecasting models \(f_i,\ i = 1,2,\cdots ,m\), the predictive residual of each sample \(r_i(x_t, y_t),\ i = 1,2,\cdots ,m, \ t = 1,2,\cdots ,N\) in \(f_i\) can be calculated as follows:

$$\begin{aligned} r_i (x_t, y_t) = \Vert y_t - f_i (x_t)\Vert . \end{aligned}$$
(4)

Let \(p = \arg \min \limits _{i}(r_i(x_t, y_t))\) and put the sample \((x_t, y_t)\) into the pth cluster \(Cl_p\).

In addition, let \(size(Cl_i)\) be the number of samples in \(Cl_i\). After regrouping all of the samples, \(size(Cl_i),\ i=1,2,\cdots ,m\) changes accordingly, and some clusters may contain only a few samples. Using Eq. (5), we can adjust the number of clusters:

$$\begin{aligned} {\begin{matrix} & Cl^* = \{Cl_i \mid size(Cl_i)\ge \hat{\epsilon }\},\\ & m = len (Cl^*),\\ \end{matrix}} \end{aligned}$$
(5)

where \(Cl^*\) is the adjusted set of clusters and \(\hat{\epsilon }\) is a threshold value. \(len(Cl^*)\) is the number of clusters in \(Cl^*\). Then, if \(len(Cl^*) \ne len(Cl)\), choose centres randomly from the samples again in the next iteration. Otherwise, the clusters in Cl are all retained and used in the next iteration. The steps of the CFM are illustrated in Algorithm 1.

Algorithm 1 Pseudocode of the cluster-feedback mechanism (CFM)
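Since Algorithm 1 is only available as a figure, the following hedged Python sketch re-expresses the mechanism described above; the routine fit_model is a placeholder for the per-cluster MSVR of Sect. 3.2, and the default parameter values are taken from the settings of Sect. 4.1.

```python
import numpy as np

def cfm(X, y, fit_model, m=6, eps_hat=80, max_iter=10, seed=0):
    """Cluster-feedback mechanism: returns the final cluster labels and per-cluster models.
    fit_model(X_c, y_c) must return a fitted predictor with a .predict method."""
    rng = np.random.default_rng(seed)
    N = len(X)
    Z = np.hstack([X, y[:, None]])
    # Initial clustering: random centres and Manhattan-distance assignment
    centres = rng.choice(N, size=m, replace=False)
    d = np.abs(Z[:, None, :] - Z[centres][None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    for _ in range(max_iter):
        present = np.unique(labels)
        models = {i: fit_model(X[labels == i], y[labels == i]) for i in present}
        # Regroup every sample by its residual in each model, Eq. (4)
        R = np.column_stack([np.abs(y - models[i].predict(X)) for i in present])
        labels = present[R.argmin(axis=1)]
        # Drop clusters with fewer than eps_hat samples, Eq. (5)
        kept = [i for i in present if (labels == i).sum() >= eps_hat]
        if len(kept) != m:                                  # the number of clusters changed
            m = len(kept)
            centres = rng.choice(N, size=m, replace=False)  # re-initialise the centres
            d = np.abs(Z[:, None, :] - Z[centres][None, :, :]).sum(axis=2)
            labels = d.argmin(axis=1)
    return labels, models
```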

3.2 The modified SVR

Because the selection of the penalty factor C and \(\epsilon\) influences the performance of SVR models, we use NSGA-II to adjust C and \(\epsilon\). Supposing that the samples \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\) have been grouped into m clusters (\(Cl_i = \{(x_t, y_t)\mid (x_t, y_t)\ \text {is grouped into the }i \text {th cluster}\},\ i = 1, 2, \cdots , m\)), we first initialise n agents \(u_j = (\{C_1, \epsilon _1\}, \{C_2, \epsilon _2\}, \cdots , \{C_m, \epsilon _m\}),\ j = 1, 2, \cdots , n\). Then, for each agent \(u_j\), the regression functions \(f^{(j)} = \{f_i(Cl_i\mid C_i, \epsilon _i),\ i = 1,2,\cdots ,m\}\) can be trained using the data in \(Cl_i\) with the parameters \(\{C_i, \epsilon _i\}\) by Eq. (1). Now we have m SVR models \(f^{(j)}\) for the jth agent. Based on these models, the number of support vectors and the sum of the validation errors, which serve as the two objective functions, can be calculated. The optimisation problem is shown as follows:

$$\begin{aligned}&\text {Minimize: } g = \{g_1(C,\epsilon ),\ g_2(C,\epsilon )\}.\\&\text {Subject to:}\\&\left\{ \begin{aligned}&g_1(C,\epsilon ) = \frac{1}{n}\sum _{j=1}^{n} \sum _{i=1}^{m}\psi (f_i^{(j)}(Cl_i \mid C_i, \epsilon _i)),\\&g_2(C,\epsilon ) = \frac{1}{n}\sum _{j=1}^{n}\sum _{i=1}^{m}\sum _{(x, y) \in Cl_i}\mid y - f_i^{(j)}(x)\mid ,\\ \end{aligned} \right. \\ \end{aligned}$$

where \(C = (C_1,C_2,\cdots ,C_m)\), \(\epsilon = (\epsilon _1, \epsilon _2,\cdots , \epsilon _m)\), and \(\psi (f_i^{(j)}(Cl_i\mid C_i, \epsilon _i))\) is the number of support vectors in \(f_i^{(j)}(Cl_i\mid C_i, \epsilon _i)\). Then \(u_j\) is updated by NSGA-II until the last iteration is finished. After the last iteration, we need to find the best agent in \(P_1\) [34]. First, we normalise the objective function values of the agents in \(P_1\):

$$\begin{aligned} NormG_k^{(p)} = \frac{g_k^{(p)} - mean(g_k)}{std(g_k)},\ k = 1,2,\ p = 1,\cdots ,n_p, \end{aligned}$$
(6)

where \(n_p\) is the number of agents in the first front \(P_1\), \(g_k^{(p)}\) is the kth objective function value of the pth agent, and \(mean(g_k)\) and \(std(g_k)\) stand for the mean and standard deviation of the kth objective function over all agents in \(P_1\).

Then a reference vector P, composed of the minimum normalised objective values, can be determined by the following equation:

$$\begin{aligned} {\begin{matrix} P & = [\underset{p = 1,\cdots ,n_p}{min}\ NormG_1^{(p)},\underset{p = 1,\cdots ,n_p}{min}\ NormG_2^{(p)}],\\ & =[NormG_1^{(p_1)},\ NormG_2^{(p_2)}], \end{matrix}} \end{aligned}$$
(7)

where \(\underset{p = 1,\cdots ,n_p}{min}\ NormG_k^{(p)},\ k=1,2\) is the minimum value of \(NormG_k^{(p)}\), attained at the subscript \(p_k\). Finally, the similarity between each agent in the first front and P can be calculated by the following equation:

$$\begin{aligned} Similarity_p(NormG^{(p)},P) = \frac{{NormG^{(p)}} \cdot P^T}{||NormG^{(p)}||_2||P||_2}, \end{aligned}$$
(8)

where \(NormG^{(p)} = [NormG_1^{(p)},\ NormG_2^{(p)}]\). The agent with the highest similarity is chosen. The overall steps of the MSVR are illustrated in Algorithm 2.

Algorithm 2 Pseudocode of the modified support vector regression (MSVR)
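Since Algorithm 2 is likewise only given as a figure, the sketch below illustrates the two building blocks described above: evaluating one agent (one (C_i, ε_i) pair per cluster, yielding the two objectives g_1 and g_2) and selecting the best first-front agent via Eqs. (6)–(8). The error term is computed in-sample here purely for brevity, whereas the paper uses a validation error; scikit-learn's SVR again stands in for the ε-SVR of Sect. 2.1.

```python
import numpy as np
from sklearn.svm import SVR

def evaluate_agent(agent, clusters):
    """agent: list of (C_i, eps_i), one pair per cluster; clusters: list of (X_i, y_i) arrays.
    Returns the two objective values [number of support vectors, absolute error]."""
    n_sv, err = 0, 0.0
    for (C, eps), (Xc, yc) in zip(agent, clusters):
        f = SVR(kernel="rbf", C=C, epsilon=eps).fit(Xc, yc)
        n_sv += len(f.support_)                        # psi(f_i) term of g_1
        err += np.abs(yc - f.predict(Xc)).sum()        # error term of g_2 (in-sample for brevity)
    return np.array([n_sv, err])

def best_front_agent(front_G):
    """front_G: (n_p, 2) objective values of the agents in the first front P_1.
    Returns the index of the agent chosen by Eqs. (6)-(8)."""
    std = front_G.std(axis=0)
    std[std == 0] = 1.0                                # guard against constant objectives
    norm = (front_G - front_G.mean(axis=0)) / std      # Eq. (6)
    P = norm.min(axis=0)                               # Eq. (7): minimum normalised values
    sims = norm @ P / (np.linalg.norm(norm, axis=1) * np.linalg.norm(P))  # Eq. (8)
    return int(np.argmax(sims))
```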

3.3 The proposed CFM-MSVR

The adaptive regression algorithm (CFM-MSVR) for forecasting multi-model data based on the clustering process is proposed in this section. CFM-MSVR not only uses non-linear regression to forecast the multi-model data, but also uses CFM to find the different models quickly, which improves the forecasting accuracy: CFM selects the data belonging to the different models, and MSVR calculates the forecasting models. The combination of the two allows multi-model data to be forecast quickly and accurately. The main steps of CFM-MSVR are as follows.

Step 1: Initialise the agents and parameters, and randomly choose the centres from the input data.

Step 2: Cluster the data based on the Manhattan distance.

Step 3: Use MSVR to calculate the forecasting models for each cluster.

Step 4: Regroup the data and adjust the number of clusters based on each model by CFM. If the number of clusters changes, randomly choose the centres again. Go to Step 3 and repeat until the maximum number of iterations is reached.

Fig. 2 The flow chart of CFM-MSVR

The overall structure of the proposed CFM-MSVR is shown in Fig. 2, where Iter and iter are the iteration counters of the CFM and the MSVR, respectively.
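Putting the two components together, a high-level sketch of Steps 1–4 might look as follows, where cfm refers to the cluster-feedback sketch given after Algorithm 1; fit_msvr is a placeholder that, in the full algorithm, would tune each cluster's (C, ε) pair with NSGA-II as in Sect. 3.2, while here a fixed SVR is used for brevity.

```python
from sklearn.svm import SVR

def cfm_msvr(X, y, m=12, eps_hat=40, Max_Iter=10):
    """High-level CFM-MSVR loop: cluster, fit one model per cluster, regroup, repeat."""
    def fit_msvr(Xc, yc):
        # Placeholder: the paper optimises (C, eps) per cluster with NSGA-II (MSVR).
        return SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(Xc, yc)
    labels, models = cfm(X, y, fit_model=fit_msvr, m=m, eps_hat=eps_hat, max_iter=Max_Iter)
    return labels, models
```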

Remark 2

The setting of the initial number of clusters m and the threshold value \(\hat{\epsilon }\) is usually based on experience. The initial number of clusters tends to be set to a large number. When forecasting data with N samples, the threshold value \(\hat{\epsilon }\) is about \(\frac{N}{2m}\). In particular, if the threshold value is too small, the number of clusters will not change; on the other hand, if it is too large, some clusters will be deleted erroneously. What's more, when the data is significantly impacted by noise, using more models to forecast one dataset may yield better forecasting results.

4 Numerical simulation

In this section, the proposed CFM-MSVR is tested on one constructive dataset and four real datasets to demonstrate its effectiveness.

When experimenting on the constructive data, the validation error of CFM-MSVR is recorded and discussed. To verify the superiority of the proposed CFM-MSVR, its results are compared with those of the method in [35], denoted by MethodG. In addition, to verify the effectiveness of CFM, we replace the linear regression in MethodG with MSVR (denoted MethodG-MSVR), and the results of MethodG-MSVR are also compared.

When experimenting on the four real datasets, the computational cost and forecasting accuracy of CFM-MSVR are discussed, and the results are compared with MethodG. In addition, to demonstrate the stability of the proposed CFM-MSVR, the differential entropy in each iteration of CFM is recorded.

4.1 Constructive data

We construct samples produced by two models. Our goal is to judge which model each sample belongs to and to forecast the constructed data with the proposed CFM-MSVR. Here we uniformly sample 450 data points from \([-8\pi , -3\pi ]\cup [-3\pi ,-\pi ]\cup [\pi ,3\pi ]\cup [3\pi ,8\pi ]\) as x. The label value y(x) is defined as follows:

$$\begin{aligned} y(x) = {\left\{ \begin{array}{ll} -x^2 + 20,& x \in [-8\pi , -3\pi ]\cup [3\pi , 8\pi ], \\ 600\sin (x)/x,& \text {else. }\\ \end{array}\right. } \end{aligned}$$
(9)

In addition, noise obeying a normal distribution is added to the original data (x, y). Finally, the data is standardised.
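A sketch of this construction is given below; the union of intervals simplifies to \([-8\pi ,-\pi ]\cup [\pi ,8\pi ]\), and the noise standard deviation is an assumption, since the paper does not report it.

```python
import numpy as np

rng = np.random.default_rng(0)
# 450 points sampled uniformly from [-8*pi, -pi] U [pi, 8*pi]
sign = rng.choice([-1, 1], size=450)
x = sign * rng.uniform(np.pi, 8 * np.pi, size=450)

# Eq. (9): quadratic model on the outer region, damped sine on the inner region
outer = np.abs(x) >= 3 * np.pi
y = np.where(outer, -x**2 + 20, 600 * np.sin(x) / x)

# Normally distributed noise (scale assumed) and standardisation
y = y + rng.normal(scale=5.0, size=y.shape)
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()
```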

In this experiment, the initial number of clusters is \(m = 6\), and \(Max\_Iter = 10\) is used in the CFM. 25 iterations and 50 search agents are used in MSVR. Moreover, the radial basis function (RBF) is chosen as the kernel function, and the threshold \(\hat{\epsilon }\) is 80. In NSGA-II, the crossover rate is set to 0.8 and the mutation rate to 0.01. For MethodG, the initial number of clusters is set as 2 and the number of iterations is 50.

Fig. 3 The results of CFM-MSVR and MethodG based on the constructive data

Figure 3 shows the results of the numerical simulation using CFM-MSVR and MethodG. With CFM-MSVR, all data are assigned to their original models correctly, which verifies the effectiveness of the proposed method. When the two groups of data are densely distributed, CFM-MSVR, using CFM, can successfully separate the data from the two models, whereas MethodG, which relies on the shortest distance, has difficulty separating them. Moreover, the true number of models is unknown when setting the number of clusters; CFM-MSVR, with the help of CFM, can find the true number of clusters and recover the two models.

The validation error of CFM-MSVR is recorded and compared with the results of MethodG and MethodG-MSVR. Table 1 shows the minimum and average validation error (\(g_2\)) in each iteration of the three methods; CFM-MSVR performs best in forecasting accuracy overall.

Table 1 Validation error (\(g_2\)) of CFM-MSVR and the other methods

The average validation error of MethodG is 1.43e+04, whereas that of CFM-MSVR is 95.5, the smallest of all, which indicates that the use of the CFM improves the average performance. Meanwhile, the average validation error of MethodG-MSVR is 184. Based on the results of MethodG and MethodG-MSVR, MSVR significantly improves forecasting accuracy on data from complex systems.

4.2 Real data

In this section, the proposed CFM-MSVR is tested on four real datasets: Boston Housing, Concrete Compressive Strength, QSAR Aquatic Toxicity, and Concrete Slump Test, all of which are publicly available from https://archive.ics.uci.edu/datasets. The details of the four datasets are shown in Table 2. We discuss the validation error and computational cost of CFM-MSVR on the four real datasets, and the results are compared with those of MethodG.

Table 2 Description of the four datasets

In this experiment, the initial number of clusters in CFM-MSVR is set as 12, \(Max\_Iter\) and \(max\_iter\) are set as 10 and 50, respectively, and the number of agents is 25. Moreover, the radial basis function is selected as the kernel function, and the threshold \(\hat{\epsilon }\) is set as 40 for Data 1 and Data 3, 80 for Data 2, and 5 for Data 4. In NSGA-II, the crossover rate and mutation rate are set to 0.8 and 0.01, respectively. For MethodG, the number of iterations is set as 50, the number of agents is 25, and the number of clusters is set from 12 to 2. The experiments are conducted in a Python 3.7 environment with a 1.4 GHz quad-core Intel Core i5 and 8 GB of RAM.

To assess the stability of the proposed algorithm, the differential entropy of the distances between the samples and their cluster centres is calculated after each iteration of CFM as:

$$\begin{aligned} h(X) = -\sum _{i=1}^{N}||X_i - C(X_i)||\log (||X_i - C(X_i)||), \end{aligned}$$
(10)

where N is the number of samples and \(X = (X_1,\cdots ,X_N)\) is the sample data. \(C(X_i)\) is the centre of the cluster which contains the sample \(X_i\).
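A direct implementation of Eq. (10) might read as follows; the Manhattan norm is used here to match the clustering of Sect. 3.1, although Eq. (10) does not specify the norm, and the small floor on the distances guards against log(0) and is an implementation choice rather than part of the paper.

```python
import numpy as np

def differential_entropy(X, centres, labels):
    """Eq. (10): h(X) = -sum_i ||X_i - C(X_i)|| * log(||X_i - C(X_i)||).
    centres: (m, D) cluster centres; labels: (N,) cluster index of each sample."""
    d = np.linalg.norm(X - centres[labels], ord=1, axis=1)  # distance of each sample to its centre
    d = np.clip(d, 1e-12, None)                             # avoid log(0)
    return -np.sum(d * np.log(d))
```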

The results of CFM-MSVR and MethodG on the four real datasets, including the computational cost and forecasting accuracy, are recorded in Table 3. The differential entropy in each iteration is also recorded and displayed in Fig. 4.

Table 3 The comparison of the two methods

When forecasting complex real data, the proposed CFM-MSVR performs better than MethodG. For example, on the high-dimensional dataset Data 1, the minimum validation error of CFM-MSVR is 32.12, whereas that of MethodG is 164.91, which is much higher. Meanwhile, the proposed CFM-MSVR significantly decreases the computational cost: on Data 1, the time costs of CFM-MSVR and MethodG are 8360 s and 19081 s, respectively. Therefore, using CFM-MSVR to forecast complex multi-model data is more effective.

To verify the stability of the proposed method, the differential entropy computed by Eq. (10) in each iteration of CFM for the four real datasets is shown in Fig. 4. The differential entropy decreases gradually with each iteration and tends to converge, which indicates the stability of the proposed CFM-MSVR.

Fig. 4 The differential entropy of the four case data, where the x axis is the iteration of the CFM

5 The global energy forecasting competition 2012

In this section, the CFM-MSVR is tested using data from GEFCom2012. The data of Zone 21 is chosen: the hourly load from 2004 to 2006 (26280 samples) is used as the training set, and the hourly data from 2007 (8760 samples) is used as the test set. Taking the total hourly load of 20 power stations into consideration, \(L_t\), \(M_t\), \(W_t\), \(H_t\), and \(T_t\) are the inputs and \(y_t\) is the output, where \(y_t\) is the real load at time t, \(L_t = t\) is a linear trend term, \(M_t\), \(W_t\), and \(H_t\) stand for the month, the day of the week, and the hour of the day at time t, and \(T_t\) is the temperature at time t.
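As a hedged illustration of this setup, the sketch below builds the five input features and the target from an hourly series; the column names and the DataFrame layout are assumptions about how the GEFCom2012 data might be loaded, not part of the competition's format.

```python
import numpy as np
import pandas as pd

def make_features(df):
    """Build (L_t, M_t, W_t, H_t, T_t) and y_t from an hourly series.
    Assumes df has a DatetimeIndex and columns 'load' and 'temperature' (assumed names)."""
    out = pd.DataFrame(index=df.index)
    out["L"] = np.arange(len(df))     # linear trend term L_t = t
    out["M"] = df.index.month         # month M_t
    out["W"] = df.index.dayofweek     # day of the week W_t
    out["H"] = df.index.hour          # hour of the day H_t
    out["T"] = df["temperature"]      # temperature T_t
    out["y"] = df["load"]             # real load y_t
    return out

# train = make_features(df.loc["2004":"2006"])   # 26280 hourly samples
# test  = make_features(df.loc["2007"])          #  8760 hourly samples
```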

To measure the performance of each model, the mean absolute percentage error (MAPE) is calculated as follows:

$$\begin{aligned} \text {MAPE} = \frac{1}{N_{test}} \sum _{t=1}^{N_{test}} \frac{\mid y_t - \hat{y}_t \mid }{y_t}, \end{aligned}$$
(11)

where \(N_{test}\) is the size of the test set, \(y_t\) is the real load at time t, and \(\hat{y}_t\) is the forecast load at time t. MAPE treats errors proportionally and reduces the impact of outliers, making it more robust than the root mean square error (RMSE) in measuring prediction performance. To demonstrate the superiority of the proposed CFM-MSVR, its MAPE values are compared with those of MethodG and the other methods mentioned in [52].
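For reference, Eq. (11) can be computed as follows (multiply by 100 to obtain the MAPE% values reported below):

```python
import numpy as np

def mape(y_true, y_pred):
    """Eq. (11): mean absolute percentage error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true)
```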

Here, in CFM-MSVR, the number of agents is set as 15, the initial number of clusters is 80, the maximum numbers of iterations of CFM and MSVR are set as 10 and 5, respectively, and the threshold value \(\hat{\epsilon }\) is 100. In NSGA-II, the crossover rate and the mutation rate are set to 0.8 and 0.01, respectively. For MethodG, the number of agents is 15 and the number of iterations is set as 50. For all datasets, the results of the algorithm converge after 10 iterations; for the convenience of comparison, the results shown in Fig. 4 are those of the first 10 iterations.

With CFM-MSVR, after deleting the clusters with few samples, the number of clusters stays around 40, which represents the number of models in the GEFCom2012 data. Figure 5(a) shows the curve of the number of clusters. With the help of CFM and MSVR, different models are found, which correspond to different electricity load modes. In addition, we test the stability of the CFM-MSVR based on the differential entropy: it decreases gradually and, by the last iteration, stays around 366.39, as shown in Fig. 5(b).

Fig. 5 The number of clusters and the differential entropy in each iteration, where the x axis is the iteration of the CFM. The graphical results are shown up to the 10th iteration because the results are stable after that

Compared with MethodG, CFM-MSVR generally predicts the data with lower residuals. The loads forecast by the proposed CFM-MSVR and by MethodG are shown in Fig. 6(a), and Fig. 6(b) illustrates the residuals of the hourly load using CFM-MSVR and MethodG. As the figures show, on the test data, CFM-MSVR manages to improve the forecasting accuracy thanks to the CFM and MSVR. Although some of the maximum and minimum points are not predicted well, the overall forecasting result of CFM-MSVR is excellent: its residuals are much lower than those of MethodG.

Fig. 6 The forecasting load and residuals based on the hourly load from 2007/1/1 to 2007/1/7 using CFM-MSVR and MethodG

Figure 7 shows the results of CFM-MSVR, MethodG, and the other methods mentioned in [52], including IRLS_bis, IRLS_log, and other popular forecasting methods; the results reported in [52] are used.

Fig. 7 MAPE% based on the hourly load in 2007 of GEFCom2012

The MAPE% of CFM-MSVR is 1.52, and the MAPE% of MethodG is 24.18. Compared with MethodG, the non-linear forecasting method is quite effective, and CFM-MSVR obtains more accurate forecasting results. Besides, several well-known methods, including IRLS_bis, MLR, ANN, SVR, GRR, and RFR, are tested and compared; for example, the MAPE% values of IRLS_bis and MLR are 5.3 and 5.22, respectively. CFM-MSVR performs best among all of the methods. In particular, compared with SVR, whose MAPE% is 5.23, CFM-MSVR manages to lower the MAPE% by using CFM to find the samples belonging to different models and NSGA-II to find better parameters for each forecasting model.

To sum up, the performance of CFM-MSVR is stable and significant. Combining the CFM and MSVR, the MAPE% on the test data is 1.52, and the forecasting accuracy of CFM-MSVR is better than that of the other forecasting methods.

6 Conclusions

This paper proposes an adaptive regression algorithm (CFM-MSVR) via a clustering process to forecast multi-model data. First, the cluster-feedback mechanism helps to recognise each model and adjust the samples in each cluster. Then we apply the modified SVR to forecast the data in each cluster. To demonstrate the effectiveness of the proposed CFM-MSVR, the method is tested on one constructive dataset and four real datasets, where the results illustrate the superiority of CFM-MSVR in computational cost and forecasting accuracy: for instance, the validation error is 32.12 and the time cost is 8360 s on Data 1. Moreover, CFM-MSVR is tested on the GEFCom2012 dataset, and the results are compared with MethodG and other popular forecasting methods. The MAPE% of CFM-MSVR is 1.52, which shows that CFM-MSVR performs better on complex multi-model data.

Despite extensive experiments, this study may not fully capture the diversity and dynamics of real-world applications, potentially limiting the algorithm's generalisability. Moreover, the cluster-feedback mechanism needs further improvement, especially in setting the threshold value \(\hat{\epsilon }\): the setting of \(\hat{\epsilon }\) should be made more intelligent, which would improve the effectiveness and accuracy of finding each model. We will further explore these aspects in future research.