1 Introduction

Prediction problems are popular in real applications such as power scheduling [1], bioinformatics [2], and environmental science [3]. The systems in real applications are generally complex, so the prediction task is often challenging, particularly for multi-model data. Moreover, for multi-model data, effectively identifying the model that generates each data point is both difficult and crucial for achieving satisfactory predictions. Therefore, in this paper we concentrate on multi-model data prediction by accurately recognising each underlying model and learning the data belonging to it.

In predictive method research, scholars often assume that the data come from a single model, and many different types of models have been developed, such as statistical models [4,5,6], Kalman filter methods [7, 8], artificial neural networks [9,10,11], deep learning models [12,13,14], and fuzzy logic methods [15,16,17]. These models have been applied in various fields such as epidemic forecasting [18, 19], portfolio selection [20], weather forecasting [21, 22], financial forecasting [23, 24], and chloride diffusion forecasting in building materials [25]. Although these single-model methods can provide good predictions [20, 26,27,28,29], their effectiveness is often constrained in complex scenarios with multi-model data due to their limited model assumptions. It is noted that their prediction performance improves when the underlying models of the data are diagnosed and modelled separately. Therefore, some advanced clustering-based predictive models have been proposed to address the problem of multi-model data modelling. Initially, algorithms such as K-means clustering [30,31,32], the fuzzy C-means algorithm [33], and multi-objective semi-supervised clustering models [34, 35] are used to identify the underlying data models. Then, models such as binary logistic regression [36], the Lion-Wolf based deep belief network [37], kernel ridge regression [31], grey prediction [38], and support vector machines [39, 40] are applied for data prediction. To improve the prediction accuracy, feature extraction, such as principal component analysis, is also carried out. One of the most representative works is [34], in which the authors established a multi-objective semi-supervised clustering model for a predictive task on clinical data, where two objective functions were designed to minimise the cost function of K-medians clustering and the mean square error of cross-validation; the K-medians clustering was employed to diagnose the underlying models of the data. Later, [35] further improved upon the work of [34] by modifying the objective functions to minimise the deviation of the data points within the clusters and the prediction error of the response. It should be noted that the non-dominated sorting genetic algorithm II (NSGA-II) was used to solve the multi-objective optimisation problem, and local regression was used to learn each model. The works in [30, 31, 33, 34, 36, 37], and [35] all cluster the data first and then learn each cluster separately. Because the data in each cluster come from one model and share similar characteristics, this approach significantly improves prediction accuracy and stability. The multi-objective optimisation models in [34] and [35] also help to overcome the sensitivity of clustering to its initial setting.

However, the optimal number of clusters cannot be determined directly, and the computational cost of finding it by trial and error is too high. In addition, clustering the data by a simple Manhattan distance does not distinguish data from different models well, because such data are intermingled. Furthermore, real-world data are usually non-linear, so linear regression is ineffective; a non-linear prediction model can be adopted to address this problem. Non-linear support vector regression (SVR) is one of the most popular methods for non-linear regression [41], but its performance is sensitive to parameter settings [42]. Thus, a dynamic parameter adjustment mechanism can further improve computational efficiency [43].

Building on these ideas, we propose an adaptive regression algorithm to forecast multi-model data based on a clustering procedure, using a cluster-feedback mechanism (CFM) and a modified support vector regression (MSVR). Specifically, we use the CFM to find the data models via a clustering process [44]. By adaptively selecting the number of clusters based on prediction residuals, CFM not only reduces the computational cost but also enhances prediction accuracy. Moreover, CFM helps the proposed algorithm pay more attention to the original, unknown data-generating models. Following the clustering process, we use non-linear SVR with a radial basis function to forecast the samples in the different clusters. To improve the generalisation ability of each forecasting model, we propose the MSVR, which uses NSGA-II to find better parameters for each SVR model. In summary, the contributions of this paper are as follows:

  1. An adaptive regression algorithm named CFM-MSVR, based on a cluster-feedback mechanism, is proposed to forecast multi-model data. In particular, CFM is used to cluster the samples according to the residual of each sample in each forecasting model. The CFM helps to find more reasonable clusters that are closer to the real models, which improves the forecasting accuracy.

  2. Based on the clusters adjusted by CFM, the number of clusters can be estimated intelligently from the number of samples in each cluster, which decreases the computational cost, the number of parameters, and the dependence on empirical settings.

  3. Focusing on non-linear forecasting problems, we propose the MSVR, which combines non-linear SVR with the non-dominated sorting genetic algorithm II (NSGA-II). To avoid overfitting, NSGA-II is used to optimise the number of support vectors and the prediction error, which improves the generalisation and forecasting ability.

  4. Finally, we demonstrate the effectiveness of the proposed algorithm in terms of computational cost, generalisation ability, and forecasting accuracy on one simulated dataset, four real datasets, and a case study.

The organisation of this paper is as follows: Sect. 2 outlines the non-linear SVR algorithm and NSGA-II. Section 3 describes the proposed adaptive regression algorithm (CFM-MSVR) in detail. Section 4 tests the proposed method on one simulated dataset and four real datasets. Section 5 illustrates the performance of CFM-MSVR on data from the Global Energy Forecasting Competition 2012 (GEFCom2012). Section 6 concludes the paper.

2 Background

In this section, the non-linear SVR algorithm and NSGA-II are introduced, as they form the basis of the proposed MSVR.

2.1 The non-linear SVR algorithm

The non-linear SVR algorithm is an effective forecasting method proposed by [45]. Compared with linear regression, non-linear SVR performs better when forecasting high-dimensional data [46]. Given a set of training data \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\) of size N, where \(x_i \in R^{D},\ i=1, 2,\cdots , N\) are input feature vectors and \(y_i \in R^1,\ i=1, 2,\cdots , N\) are label values, we use a non-linear function f(x) to regress the input data in a high-dimensional space as follows:

$$\begin{aligned} f(x) = w^T\varphi (x) + b, \end{aligned}$$
(1)

where \(x \in R^D\) is the input vector, w is the weight vector, and b is the bias. \(\varphi (x)\) is a non-linear function used to map x into a high-dimensional Hilbert space [47]. The weight vector w and the bias b can be calculated by solving the following optimisation problem:

$$\begin{aligned}&\text {Minimize: } \frac{1}{2}\left\| w\right\| ^2 + C\sum _{i=1}^{N}(\zeta _i + \zeta _i^*).\\&\text {Subject to:}\\&\left\{ \begin{aligned}&y_i - w^T\varphi (x_i) - b - \epsilon - \zeta _i \le 0,\\&w^T\varphi (x_i) + b - y_i - \epsilon - \zeta _i^* \le 0, \quad i = 1,2,\cdots ,N,\\&\zeta _i \ge 0, \zeta _i^* \ge 0, C> 0,\\ \end{aligned} \right. \\ \end{aligned}$$

where C is the penalty factor used to avoid overfitting, \(\zeta _i\) and \(\zeta _i^*\) are slack variables, and \(\epsilon\) is a constant defining the insensitive margin [48]. Finally, the regression function can be expressed as follows:

$$\begin{aligned} f = \sum _{i=1}^{N}(\beta _i - \beta _i^*)k(x, x_i) + b, \end{aligned}$$
(2)

where \(\beta _i\) and \(\beta _i^*\) are Lagrange multipliers and \(k(x, x_i) = <\varphi (x), \varphi (x_i)>\) is the kernel function; common kernel functions include the linear, polynomial, and Gaussian kernels.

Remark 1

The selection of the penalty factor C, the kernel function \(k(x, x_i)\), and \(\epsilon\) matters a lot to the forecasting accuracy. However, the selection of these parameters is difficult and is most of the time based on experience [49].
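For concreteness, the following minimal sketch fits an \(\epsilon\)-SVR with an RBF kernel for two hand-picked parameter settings and reports the number of support vectors and the training error. Scikit-learn's SVR is used here only as a stand-in for Eqs. (1)–(2), and the toy data and parameter values are illustrative assumptions rather than settings from this paper.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a noisy non-linear relationship (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.05, size=200)

# Two hand-picked (C, epsilon) settings; the forecasting accuracy depends on them.
for C, eps in [(1.0, 0.1), (100.0, 0.01)]:
    model = SVR(kernel="rbf", C=C, epsilon=eps)  # Gaussian (RBF) kernel k(x, x_i)
    model.fit(X, y)
    mae = np.mean(np.abs(model.predict(X) - y))
    print(f"C={C}, eps={eps}: {len(model.support_)} support vectors, train MAE={mae:.4f}")
```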

2.2 The non-dominated sorting genetic algorithm II

The non-dominated sorting genetic algorithm II (NSGA-II), proposed by [50], is one of the most popular algorithms for solving multi-objective optimisation problems. Compared with NSGA, NSGA-II adds elitism and decreases the computational complexity and the number of parameters [51]. NSGA-II contains three main procedures: fast non-dominated sorting, crowding distance, and the selection operator [51]. A brief description of these procedures is given below.

2.2.1 Fast non-dominated sorting

Fast non-dominated sorting is proposed based on the concept of Pareto dominance. Suppose that a multi-objective minimisation problem includes n decision variables \(u_i,\ i = 1, 2, \cdots , n\) and k objective functions \(g_j,\ j = 1, 2, \cdots , k\). For any \(p,q\in \{1, 2, \cdots , n\}\), if \(g_j(u_p) \le g_j(u_q)\) for all \(j=1,\cdots ,k\) and \(g_j(u_p) < g_j(u_q)\) for at least one j, we say that \(u_p\) dominates \(u_q\), denoted by \(u_p \succ u_q\).

Moreover, a decision variable u is named as Pareto optimal if u is not dominated by any other variable.

The procedures of fast non-dominated sorting are shown as follows:

Step 1: Initialise \(\Psi (u_i) = 0\) and \(S(u_i) = \emptyset\), where \(\Psi (u_i)\) is the number of solutions which dominate \(u_i\) and \(S(u_i)\) is the set of solutions which are dominated by \(u_i\). For all \(i, j = 1,2,\cdots , n\), if \(u_i \succ u_j\), let \(\Psi (u_j) = \Psi (u_j) + 1\) and put \(u_j\) in \(S(u_i)\).

Step 2: Let \(k = 1\) and \(P_{k}\) be the set of kth front. Set \((r_1, r_2,\cdots , r_n)^T = \textbf{0}\).

Step 3: Find \(Q \subset \{1,2,\cdots ,n\}\) such that \(\Psi (u_i) = 0\) and \(r_i = 0\) for all \(i \in Q\). Let \(r_i = k\) for all \(i \in Q\), and put \(u_i\) into \(P_k\).

Step 4: Find all solutions \(u_q\) in the \(S(u_i)\) for all \(i \in Q\), and let \(\Psi (u_q) = \Psi (u_q) - 1\).

Step 5: If \(r_i \ne 0\) for all \(i = 1, 2, \cdots , n\), stop. Otherwise, \(k = k + 1\), go to Step 3.
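As a concrete illustration, a minimal Python sketch of Steps 1–5 for a minimisation problem is given below; it is written from the description above rather than taken from the authors' implementation.

```python
import numpy as np

def fast_non_dominated_sort(G):
    """G: (n, k) array with G[i, j] = g_j(u_i). Returns the rank r_i of each solution (1 = first front)."""
    n = len(G)
    dominates = lambda p, q: np.all(G[p] <= G[q]) and np.any(G[p] < G[q])
    S = [[] for _ in range(n)]         # S[i]: solutions dominated by u_i
    psi = np.zeros(n, dtype=int)       # psi[i]: number of solutions dominating u_i
    for p in range(n):
        for q in range(n):
            if p != q and dominates(p, q):
                S[p].append(q)
                psi[q] += 1
    rank = np.zeros(n, dtype=int)
    k, front = 1, [i for i in range(n) if psi[i] == 0]
    while front:                       # peel off one front per pass
        for i in front:
            rank[i] = k
        nxt = []
        for i in front:
            for q in S[i]:
                psi[q] -= 1
                if psi[q] == 0:
                    nxt.append(q)
        front, k = nxt, k + 1
    return rank
```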

2.2.2 Crowding distance

Although the solutions have been ranked by fast non-dominated sorting, it is difficult to judge which of the solutions within the same rank is better. Therefore, the crowding distance is calculated to rank the solutions within the same rank. The crowding distance is defined as follows:

$$\begin{aligned} cd(u_i) = \sum _{j=1}^{k}\frac{g_j(u_{i+1}) - g_j(u_{i-1})}{g_j^{max} - g_j^{min}}, \end{aligned}$$
(3)

where \(u_{i+1}\) and \(u_{i-1}\) are the neighbours of \(u_i\) when the solutions are sorted by the jth objective, and \(g_j^{max}\) is the largest value among \(g_j(u_s),\ s = 1, 2,\cdots , n\). The definition of \(g_j^{min}\) is analogous. The solution with the larger crowding distance is considered the better solution.
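A corresponding sketch of Eq. (3) is given below; the convention of assigning an infinite distance to the two boundary solutions of each objective is standard in NSGA-II but is an addition to the formula stated above.

```python
import numpy as np

def crowding_distance(G):
    """G: (n, k) objective values of the solutions in one front; returns cd(u_i) as in Eq. (3)."""
    n, k = G.shape
    cd = np.zeros(n)
    for j in range(k):
        order = np.argsort(G[:, j])                       # neighbours are taken in sorted order of g_j
        span = (G[order[-1], j] - G[order[0], j]) or 1.0  # g_j^max - g_j^min, guarded against zero
        cd[order[0]] = cd[order[-1]] = np.inf             # boundary solutions (common convention)
        for s in range(1, n - 1):
            cd[order[s]] += (G[order[s + 1], j] - G[order[s - 1], j]) / span
    return cd
```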

2.2.3 Selection operator

When generating the next population, n solutions are selected based on the fast non-dominated sorting and the crowding distance. For \(p,q \in \{1, 2, \cdots , 2n\}\), the rule for selecting between \(u_p\) and \(u_q\) is as follows: (1) If \(r_p < r_q\), retain the solution \(u_p\) in the next generation. (2) If \(r_p = r_q\), calculate the crowding distances \(cd(u_p)\) and \(cd(u_q)\); if \(cd(u_p)> cd(u_q)\), retain the solution \(u_p\) in the next generation, and otherwise retain the solution \(u_q\).

With the components outlined above, the complete NSGA-II process is shown as follows.

First, generate an initial population \(u = \{u_1, u_2, \cdots , u_n\}\). Then, a new generation \(u' = \{u'_1, u'_2, \cdots , u'_n\}\) is generated by crossover and mutation. Let \(u^* = u \cup u'\) and rank all members of \(u^*\) by fast non-dominated sorting and crowding distance. After that, select n members using the selection operator as u for the next generation. The workflow of NSGA-II is shown in Fig. 1.
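For completeness, a compact sketch of one NSGA-II generation built from routines like those sketched above is shown below; the crossover and mutation operators are reduced to a simple Gaussian perturbation purely for illustration, since the paper does not specify its variation operators.

```python
import numpy as np

def nsga2_survive(U, G, n):
    """Keep the n best members of the combined population U (decision vectors)
    with objective values G, ranking by front first and crowding distance second."""
    rank = fast_non_dominated_sort(G)
    cd = np.zeros(len(U))
    for r in np.unique(rank):
        idx = np.where(rank == r)[0]
        cd[idx] = crowding_distance(G[idx])
    order = sorted(range(len(U)), key=lambda i: (rank[i], -cd[i]))
    return U[order[:n]]

def nsga2(U, objectives, generations=50, sigma=0.1, seed=0):
    """U: (n, d) initial population; objectives(u) returns the k objective values of u."""
    rng = np.random.default_rng(seed)
    n = len(U)
    for _ in range(generations):
        children = U + rng.normal(scale=sigma, size=U.shape)  # placeholder for crossover/mutation
        pool = np.vstack([U, children])
        G = np.array([objectives(u) for u in pool])
        U = nsga2_survive(pool, G, n)
    return U
```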

Fig. 1 The flow chart of NSGA-II. Rectangles of the same size represent populations with the same number of agents; the grey rectangle represents the agents that are discarded

3 Proposed Method

In this section, the adaptive regression algorithm (CFM-MSVR) for forecasting multi-model data based on a clustering process is described in detail. Its two main components are the cluster-feedback mechanism (CFM) and the modified SVR (MSVR). First, the CFM is used to find the data models and adjust the number of clusters. Then, MSVR is used to calculate the regression models and tune the parameters for each model. Finally, the complete CFM-MSVR built on CFM and MSVR is presented.

3.1 The cluster-feedback mechanism

First, we initialise m clusters. The m centres \(\{c_1, c_2, \cdots , c_m\}\) are randomly chosen from the samples \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\). Let \(Cl = \{Cl_i,\ i = 1,2,\cdots , m\}\), where \(Cl_i\) is the set of the ith cluster. Then, for each sample \((x_t,y_t),\ t=1,\cdots ,N\), let \(d((x_t,y_t),c_i)\) be the Manhattan distance from \((x_t,y_t)\) to \(c_i\). Find \(i^* \in \{1,2,\cdots ,m\}\) such that \(d((x_t, y_t), c_{i^*})\) is the minimum value in \(\{d((x_t, y_t), c_1),\cdots , d((x_t, y_t), c_m)\}\), and add \((x_t,y_t)\) to \(Cl_{i^*}\).

Supposing we have obtained the forecasting models \(f_i,\ i = 1,2,\cdots ,m\), the predictive residual of each sample \(r_i(x_t, y_t),\ i = 1,2,\cdots ,m, \ t = 1,2,\cdots ,N\) in \(f_i\) can be calculated as follows:

$$\begin{aligned} r_i (x_t, y_t) = \Vert y_t - f_i (x_t)\Vert . \end{aligned}$$
(4)

Let \(p = \arg \min \limits _{i}(r_i(x_t, y_t))\) and put the sample \((x_t, y_t)\) into the pth cluster \(Cl_p\).

In addition, let \(size(Cl_i)\) be the number of samples in \(Cl_i\). After regrouping all of the samples, \(size(Cl_i),\ i=1,2,\cdots ,m\) changes accordingly, and some clusters may contain only a few samples. Using Eq. (5), we can adjust the number of clusters:

$$\begin{aligned} {\begin{matrix} & Cl^* = \{Cl_i \mid size(Cl_i)\ge \hat{\epsilon }\},\\ & m = len (Cl^*),\\ \end{matrix}} \end{aligned}$$
(5)

where \(Cl^*\) is the adjusted set of clusters and \(\hat{\epsilon }\) is a threshold value. \(len(Cl^*)\) is the number of clusters in \(Cl^*\). Then, if \(len(Cl^*) \ne len(Cl)\), choose centres randomly from the samples again in the next iteration. Otherwise, the clusters in Cl are all retained and used in the next iteration. The steps of the CFM are illustrated in Algorithm 1.

Algorithm 1 Pseudocode of the cluster-feedback mechanism (CFM)
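Since Algorithm 1 is only available as a figure, the following hedged Python sketch re-expresses the mechanism described above; the routine fit_model is a placeholder for the per-cluster MSVR of Sect. 3.2, and the default parameter values are taken from the settings of Sect. 4.1.

```python
import numpy as np

def cfm(X, y, fit_model, m=6, eps_hat=80, max_iter=10, seed=0):
    """Cluster-feedback mechanism: returns the final cluster labels and per-cluster models.
    fit_model(X_c, y_c) must return a fitted predictor with a .predict method."""
    rng = np.random.default_rng(seed)
    N = len(X)
    Z = np.hstack([X, y[:, None]])
    # Initial clustering: random centres and Manhattan-distance assignment
    centres = rng.choice(N, size=m, replace=False)
    d = np.abs(Z[:, None, :] - Z[centres][None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)
    for _ in range(max_iter):
        present = np.unique(labels)
        models = {i: fit_model(X[labels == i], y[labels == i]) for i in present}
        # Regroup every sample by its residual in each model, Eq. (4)
        R = np.column_stack([np.abs(y - models[i].predict(X)) for i in present])
        labels = present[R.argmin(axis=1)]
        # Drop clusters with fewer than eps_hat samples, Eq. (5)
        kept = [i for i in present if (labels == i).sum() >= eps_hat]
        if len(kept) != m:                                  # the number of clusters changed
            m = len(kept)
            centres = rng.choice(N, size=m, replace=False)  # re-initialise the centres
            d = np.abs(Z[:, None, :] - Z[centres][None, :, :]).sum(axis=2)
            labels = d.argmin(axis=1)
    return labels, models
```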

3.2 The modified SVR

Because the selection of the penalty factor C and \(\epsilon\) influences the performance of SVR models, we use NSGA-II to adjust C and \(\epsilon\). Supposing that the samples \(\{(x_1, y_1),(x_2, y_2),\cdots ,(x_N, y_N)\}\) have been grouped into m clusters (\(Cl_i = \{(x_t, y_t)\mid (x_t, y_t)\ \text {is grouped into the }i \text {th cluster}\},\ i = 1, 2, \cdots , m\)), we first initialise n agents \(u_j = (\{C_1, \epsilon _1\}, \{C_2, \epsilon _2\}, \cdots , \{C_m, \epsilon _m\}),\ j = 1, 2, \cdots , n\). Then, for each agent \(u_j\), the regression functions \(f^{(j)} = \{f_i(Cl_i\mid C_i, \epsilon _i),\ i = 1,2,\cdots ,m\}\) can be trained using the data in \(Cl_i\) with the parameters \(\{C_i, \epsilon _i\}\) by Eq. (1). Now we have m SVR models \(f^{(j)}\) for the jth agent. Based on these models, the number of support vectors and the sum of the validation errors, which serve as the two objective functions, can be calculated. The optimisation problem is shown as follows:

$$\begin{aligned}&\text {Minimize: } g = \{g_1(C,\epsilon ),\ g_2(C,\epsilon )\}.\\&\text {Subject to:}\\&\left\{ \begin{aligned}&g_1(C,\epsilon ) = \frac{1}{n}\sum _{j=1}^{n} \sum _{i=1}^{m}\psi (f_i^{(j)}(Cl_i \mid C_i, \epsilon _i)),\\&g_2(C,\epsilon ) = \frac{1}{n}\sum _{j=1}^{n}\sum _{i=1}^{m}\sum _{(x, y) \in Cl_i}\mid y - f_i^{(j)}(x)\mid ,\\ \end{aligned} \right. \\ \end{aligned}$$

where \(C = (C_1,C_2,\cdots ,C_m)\), \(\epsilon = (\epsilon _1, \epsilon _2,\cdots , \epsilon _m)\), and \(\psi (f_i^{(j)}(Cl_i\mid C_i, \epsilon _i))\) is the number of support vectors in \(f_i^{(j)}(Cl_i\mid C_i, \epsilon _i)\). Then \(u_j\) is updated by NSGA-II until the last iteration is finished. After the last iteration, we need to find the best agent in \(P_1\) [34]. First, we normalise the objective function values of the agents in \(P_1\):

$$\begin{aligned} NormG_k^{(p)} = \frac{g_k^{(p)} - mean(g_k)}{std(g_k)},\ k = 1,2,\ p = 1,\cdots ,n_p, \end{aligned}$$
(6)

where \(n_p\) is the number of agents in the first front \(P_1\), \(g_k^{(p)}\) is the kth objective function value of the pth agent, and \(mean(g_k)\) and \(std(g_k)\) stand for the mean and standard deviation of the kth objective function over all agents in \(P_1\).

Then a reference vector P, composed of the minimum normalised objective values, can be determined by the following equation:

$$\begin{aligned} {\begin{matrix} P & = [\underset{p = 1,\cdots ,n_p}{min}\ NormG_1^{(p)},\underset{p = 1,\cdots ,n_p}{min}\ NormG_2^{(p)}],\\ & =[NormG_1^{(p_1)},\ NormG_2^{(p_2)}], \end{matrix}} \end{aligned}$$
(7)

where \(\underset{p = 1,\cdots ,n_p}{min}\ NormG_k^{(p)},\ k=1,2\) is the minimum value of \(NormG_k^{(p)}\), attained at the subscript \(p_k\). Finally, the similarity between each agent in the first front and P can be calculated by the following equation:

$$\begin{aligned} Similarity_p(NormG^{(p)},P) = \frac{{NormG^{(p)}} \cdot P^T}{||NormG^{(p)}||_2||P||_2}, \end{aligned}$$
(8)

where \(NormG^{(p)} = [NormG_1^{(p)},\ NormG_2^{(p)}]\). The agent with the highest similarity is chosen. The overall steps of the MSVR are illustrated in Algorithm 2.

Algorithm 2 Pseudocode of the modified support vector regression (MSVR)
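Since Algorithm 2 is likewise only given as a figure, the sketch below illustrates the two building blocks described above: evaluating one agent (one (C_i, ε_i) pair per cluster, yielding the two objectives g_1 and g_2) and selecting the best first-front agent via Eqs. (6)–(8). The error term is computed in-sample here purely for brevity, whereas the paper uses a validation error; scikit-learn's SVR again stands in for the ε-SVR of Sect. 2.1.

```python
import numpy as np
from sklearn.svm import SVR

def evaluate_agent(agent, clusters):
    """agent: list of (C_i, eps_i), one pair per cluster; clusters: list of (X_i, y_i) arrays.
    Returns the two objective values [number of support vectors, absolute error]."""
    n_sv, err = 0, 0.0
    for (C, eps), (Xc, yc) in zip(agent, clusters):
        f = SVR(kernel="rbf", C=C, epsilon=eps).fit(Xc, yc)
        n_sv += len(f.support_)                        # psi(f_i) term of g_1
        err += np.abs(yc - f.predict(Xc)).sum()        # error term of g_2 (in-sample for brevity)
    return np.array([n_sv, err])

def best_front_agent(front_G):
    """front_G: (n_p, 2) objective values of the agents in the first front P_1.
    Returns the index of the agent chosen by Eqs. (6)-(8)."""
    std = front_G.std(axis=0)
    std[std == 0] = 1.0                                # guard against constant objectives
    norm = (front_G - front_G.mean(axis=0)) / std      # Eq. (6)
    P = norm.min(axis=0)                               # Eq. (7): minimum normalised values
    sims = norm @ P / (np.linalg.norm(norm, axis=1) * np.linalg.norm(P))  # Eq. (8)
    return int(np.argmax(sims))
```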

3.3 The proposed CFM-MSVR

The adaptive regression algorithm (CFM-MSVR) for forecasting multi-model data based on the clustering process is proposed in this section. CFM-MSVR not only uses non-linear regression to forecast the multi-model data, but also uses CFM to find the different models quickly, which improves the forecasting accuracy: CFM selects the data belonging to the different models, and MSVR calculates the forecasting models. The combination of the two allows multi-model data to be forecast quickly and accurately. The main steps of CFM-MSVR are as follows.

Step 1: Initialise the agents and parameters, and randomly choose the centres from the input data.

Step 2: Cluster the data based on the Manhattan distance.

Step 3: Use MSVR to calculate the forecasting models for each cluster.

Step 4: Regroup the data and adjust the number of clusters based on each model by CFM. If the number of clusters changes, randomly choose the centres again. Go to Step 3 and repeat until the maximum number of iterations is reached.

Fig. 2 The flow chart of CFM-MSVR

The overall structure of the proposed CFM-MSVR is shown in Fig. 2, where Iter and iter are the iteration counters of the CFM and the MSVR, respectively.
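Putting the two components together, a high-level sketch of Steps 1–4 might look as follows, where cfm refers to the cluster-feedback sketch given after Algorithm 1; fit_msvr is a placeholder that, in the full algorithm, would tune each cluster's (C, ε) pair with NSGA-II as in Sect. 3.2, while here a fixed SVR is used for brevity.

```python
from sklearn.svm import SVR

def cfm_msvr(X, y, m=12, eps_hat=40, Max_Iter=10):
    """High-level CFM-MSVR loop: cluster, fit one model per cluster, regroup, repeat."""
    def fit_msvr(Xc, yc):
        # Placeholder: the paper optimises (C, eps) per cluster with NSGA-II (MSVR).
        return SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(Xc, yc)
    labels, models = cfm(X, y, fit_model=fit_msvr, m=m, eps_hat=eps_hat, max_iter=Max_Iter)
    return labels, models
```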

Remark 2

The setting of the initial number of clusters m and the threshold value \(\hat{\epsilon }\) is usually based on experience. The initial number of clusters tends to be set to a large number. When forecasting data with N samples, the threshold value \(\hat{\epsilon }\) is about \(\frac{N}{2m}\). In particular, if the threshold value is too small, the number of clusters will not change; on the other hand, if it is too large, some clusters will be deleted erroneously. What's more, when the data is significantly impacted by noise, using more models to forecast one dataset may yield better forecasting results.

4 Numerical simulation

In this section, the proposed CFM-MSVR is tested on one constructive dataset and four real datasets to demonstrate its effectiveness.

When experimenting on the constructive data, the validation error of CFM-MSVR is recorded and discussed. To verify the superiority of the proposed CFM-MSVR, its results are compared with those of the method in [35], denoted by MethodG. In addition, to verify the effectiveness of CFM, we replace the linear regression in MethodG with MSVR (denoted MethodG-MSVR), and the results of MethodG-MSVR are also compared.

When experimenting on the four real datasets, the computational cost and forecasting accuracy of CFM-MSVR are discussed, and the results are compared with MethodG. In addition, to demonstrate the stability of the proposed CFM-MSVR, the differential entropy in each iteration of CFM is recorded.

4.1 Constructive data

We construct samples produced by two models. Our goal is to judge which model each sample belongs to and to forecast the constructed data with the proposed CFM-MSVR. Here we uniformly sample 450 data points from \([-8\pi , -3\pi ]\cup [-3\pi ,-\pi ]\cup [\pi ,3\pi ]\cup [3\pi ,8\pi ]\) as x. The label value y(x) is defined as follows:

$$\begin{aligned} y(x) = {\left\{ \begin{array}{ll} -x^2 + 20,& x \in [-8\pi , -3\pi ]\cup [3\pi , 8\pi ], \\ 600\sin (x)/x,& \text {else. }\\ \end{array}\right. } \end{aligned}$$
(9)

In addition, noise obeying a normal distribution is added to the original data (x, y). Finally, the data is standardised.
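A sketch of this construction is given below; the union of intervals simplifies to \([-8\pi ,-\pi ]\cup [\pi ,8\pi ]\), and the noise standard deviation is an assumption, since the paper does not report it.

```python
import numpy as np

rng = np.random.default_rng(0)
# 450 points sampled uniformly from [-8*pi, -pi] U [pi, 8*pi]
sign = rng.choice([-1, 1], size=450)
x = sign * rng.uniform(np.pi, 8 * np.pi, size=450)

# Eq. (9): quadratic model on the outer region, damped sine on the inner region
outer = np.abs(x) >= 3 * np.pi
y = np.where(outer, -x**2 + 20, 600 * np.sin(x) / x)

# Normally distributed noise (scale assumed) and standardisation
y = y + rng.normal(scale=5.0, size=y.shape)
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()
```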

In this experiment, the initial number of clusters is \(m = 6\), and \(Max\_Iter = 10\) is used in the CFM. 25 iterations and 50 search agents are used in MSVR. Moreover, the radial basis function (RBF) is chosen as the kernel function, and the threshold \(\hat{\epsilon }\) is 80. In NSGA-II, the crossover rate is set to 0.8 and the mutation rate to 0.01. For MethodG, the initial number of clusters is set as 2 and the number of iterations is 50.

Fig. 3 The results of CFM-MSVR and MethodG based on the constructive data

Figure 3 shows the results of the numerical simulation using CFM-MSVR and MethodG. With CFM-MSVR, all data are assigned to their original models correctly, which verifies the effectiveness of the proposed method. When the two groups of data are densely distributed, CFM-MSVR, using CFM, can successfully separate the data from the two models, whereas MethodG, which relies on the shortest distance, has difficulty separating them. Moreover, the true number of models is unknown when setting the number of clusters; CFM-MSVR, with the help of CFM, can find the true number of clusters and recover the two models.

The validation error of CFM-MSVR is recorded and compared with the results of MethodG and MethodG-MSVR. Table 1 shows the minimum and average validation error (\(g_2\)) in each iteration of the three methods; CFM-MSVR performs best in forecasting accuracy overall.

Table 1 Validation error (\(g_2\)) of CFM-MSVR and the other methods

The average validation error of MethodG is 1.43e+04, whereas that of CFM-MSVR is 95.5, the smallest of all, which indicates that the use of the CFM improves the average performance. Meanwhile, the average validation error of MethodG-MSVR is 184. Based on the results of MethodG and MethodG-MSVR, MSVR significantly improves forecasting accuracy on data from complex systems.

4.2 Real data

In this section, the proposed CFM-MSVR is tested on four real datasets: Boston Housing, Concrete Compressive Strength, QSAR Aquatic Toxicity, and Concrete Slump Test, all of which are publicly available from https://archive.ics.uci.edu/datasets. The details of the four datasets are shown in Table 2. We discuss the validation error and computational cost of CFM-MSVR on the four real datasets, and the results are compared with those of MethodG.

Table 2 Description of the four datasets

In this experiment, the initial number of clusters in CFM-MSVR is set as 12, \(Max\_Iter\) and \(max\_iter\) are set as 10 and 50, respectively, and the number of agents is 25. Moreover, the radial basis function is selected as the kernel function, and the threshold \(\hat{\epsilon }\) is set as 40 for Data 1 and Data 3, 80 for Data 2, and 5 for Data 4. In NSGA-II, the crossover rate and mutation rate are set to 0.8 and 0.01, respectively. For MethodG, the number of iterations is set as 50, the number of agents is 25, and the number of clusters is set from 12 to 2. The experiments are conducted in a Python 3.7 environment with a 1.4 GHz quad-core Intel Core i5 and 8 GB of RAM.

To assess the stability of the proposed algorithm, the differential entropy of the distances between the samples and their cluster centres is calculated after each iteration of CFM as:

$$\begin{aligned} h(X) = -\sum _{i=1}^{N}||X_i - C(X_i)||\log (||X_i - C(X_i)||), \end{aligned}$$
(10)

where N is the number of samples and \(X = (X_1,\cdots ,X_N)\) is the sample data. \(C(X_i)\) is the centre of the cluster which contains the sample \(X_i\).
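A direct implementation of Eq. (10) might read as follows; the Manhattan norm is used here to match the clustering of Sect. 3.1, although Eq. (10) does not specify the norm, and the small floor on the distances guards against log(0) and is an implementation choice rather than part of the paper.

```python
import numpy as np

def differential_entropy(X, centres, labels):
    """Eq. (10): h(X) = -sum_i ||X_i - C(X_i)|| * log(||X_i - C(X_i)||).
    centres: (m, D) cluster centres; labels: (N,) cluster index of each sample."""
    d = np.linalg.norm(X - centres[labels], ord=1, axis=1)  # distance of each sample to its centre
    d = np.clip(d, 1e-12, None)                             # avoid log(0)
    return -np.sum(d * np.log(d))
```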

The results of CFM-MSVR and MethodG on the four real datasets, including the computational cost and forecasting accuracy, are recorded in Table 3. The differential entropy in each iteration is also recorded and displayed in Fig. 4.

Table 3 The comparison of the two methods

When forecasting complex real data, the proposed CFM-MSVR performs better than MethodG. For example, on the high-dimensional dataset Data 1, the minimum validation error of CFM-MSVR is 32.12, whereas that of MethodG is 164.91, which is much higher. Meanwhile, the proposed CFM-MSVR significantly decreases the computational cost: on Data 1, the time costs of CFM-MSVR and MethodG are 8360 s and 19081 s, respectively. Therefore, using CFM-MSVR to forecast complex multi-model data is more effective.

To verify the stability of the proposed method, the differential entropy computed by Eq. (10) in each iteration of CFM for the four real datasets is shown in Fig. 4. The differential entropy decreases gradually with each iteration and tends to converge, which indicates the stability of the proposed CFM-MSVR.

Fig. 4 The differential entropy of the four case data, where the x axis is the iteration of the CFM

5 The global energy forecasting competition 2012

In this section, the CFM-MSVR is tested using data from GEFCom2012. The data of Zone 21 is chosen: the hourly load from 2004 to 2006 (26280 samples) is used as the training set, and the hourly data from 2007 (8760 samples) is used as the test set. Taking the total hourly load of 20 power stations into consideration, \(L_t\), \(M_t\), \(W_t\), \(H_t\), and \(T_t\) are the inputs and \(y_t\) is the output, where \(y_t\) is the real load at time t, \(L_t = t\) is a linear trend term, \(M_t\), \(W_t\), and \(H_t\) stand for the month, the day of the week, and the hour of the day at time t, and \(T_t\) is the temperature at time t.
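As a hedged illustration of this setup, the sketch below builds the five input features and the target from an hourly series; the column names and the DataFrame layout are assumptions about how the GEFCom2012 data might be loaded, not part of the competition's format.

```python
import numpy as np
import pandas as pd

def make_features(df):
    """Build (L_t, M_t, W_t, H_t, T_t) and y_t from an hourly series.
    Assumes df has a DatetimeIndex and columns 'load' and 'temperature' (assumed names)."""
    out = pd.DataFrame(index=df.index)
    out["L"] = np.arange(len(df))     # linear trend term L_t = t
    out["M"] = df.index.month         # month M_t
    out["W"] = df.index.dayofweek     # day of the week W_t
    out["H"] = df.index.hour          # hour of the day H_t
    out["T"] = df["temperature"]      # temperature T_t
    out["y"] = df["load"]             # real load y_t
    return out

# train = make_features(df.loc["2004":"2006"])   # 26280 hourly samples
# test  = make_features(df.loc["2007"])          #  8760 hourly samples
```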

To measure the performance of each model, the mean absolute percentage error (MAPE) is calculated as follows:

$$\begin{aligned} \text {MAPE} = \frac{1}{N_{test}} \sum _{t=1}^{N_{test}} \frac{\mid y_t - \hat{y}_t \mid }{y_t}, \end{aligned}$$
(11)

where \(N_{test}\) is the size of the test set, \(y_t\) is the real load at time t, and \(\hat{y}_t\) is the forecast load at time t. MAPE treats errors proportionally and reduces the impact of outliers, making it more robust than the root mean square error (RMSE) in measuring prediction performance. To demonstrate the superiority of the proposed CFM-MSVR, its MAPE values are compared with those of MethodG and the other methods mentioned in [52].
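For reference, Eq. (11) can be computed as follows (multiply by 100 to obtain the MAPE% values reported below):

```python
import numpy as np

def mape(y_true, y_pred):
    """Eq. (11): mean absolute percentage error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true)
```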

Here, in CFM-MSVR, the number of agents is set as 15, the initial number of clusters is 80, the maximum numbers of iterations of CFM and MSVR are set as 10 and 5, respectively, and the threshold value \(\hat{\epsilon }\) is 100. In NSGA-II, the crossover rate and the mutation rate are set to 0.8 and 0.01, respectively. For MethodG, the number of agents is 15 and the number of iterations is set as 50. For all datasets, the results of the algorithm converge after 10 iterations; for the convenience of comparison, the results shown in Fig. 4 are those of the first 10 iterations.

With CFM-MSVR, after deleting the clusters with few samples, the number of clusters stays around 40, which represents the number of models in the GEFCom2012 data. Figure 5(a) shows the curve of the number of clusters. With the help of CFM and MSVR, different models are found, which correspond to different electricity load modes. In addition, we test the stability of the CFM-MSVR based on the differential entropy: it decreases gradually and, by the last iteration, stays around 366.39, as shown in Fig. 5(b).

Fig. 5 The number of clusters and the differential entropy in each iteration, where the x axis is the iteration of the CFM. The graphical results are shown up to the 10th iteration because the results are stable after that

Compared with MethodG, CFM-MSVR generally predicts the data with lower residuals. The loads forecast by the proposed CFM-MSVR and by MethodG are shown in Fig. 6(a), and Fig. 6(b) illustrates the residuals of the hourly load using CFM-MSVR and MethodG. As the figures show, on the test data, CFM-MSVR manages to improve the forecasting accuracy thanks to the CFM and MSVR. Although some of the maximum and minimum points are not predicted well, the overall forecasting result of CFM-MSVR is excellent: its residuals are much lower than those of MethodG.

Fig. 6 The forecasting load and residuals based on the hourly load from 2007/1/1 to 2007/1/7 using CFM-MSVR and MethodG

Figure 7 shows the results of CFM-MSVR, MethodG, and the other methods mentioned in [52], including IRLS_bis, IRLS_log, and other popular forecasting methods; the results reported in [52] are used.

Fig. 7 MAPE% based on the hourly load in 2007 of GEFCom2012

The MAPE% of CFM-MSVR is 1.52, and the MAPE% of MethodG is 24.18. Compared with MethodG, the non-linear forecasting method is quite effective, and CFM-MSVR obtains more accurate forecasting results. Besides, several well-known methods, including IRLS_bis, MLR, ANN, SVR, GRR, and RFR, are tested and compared; for example, the MAPE% values of IRLS_bis and MLR are 5.3 and 5.22, respectively. CFM-MSVR performs best among all of the methods. In particular, compared with SVR, whose MAPE% is 5.23, CFM-MSVR manages to lower the MAPE% by using CFM to find the samples belonging to different models and NSGA-II to find better parameters for each forecasting model.

To sum up, the performance of CFM-MSVR is stable and significant. Combining the CFM and MSVR, the MAPE% on the test data is 1.52, and the forecasting accuracy of CFM-MSVR is better than that of the other forecasting methods.

6 Conclusions

This paper proposes an adaptive regression algorithm (CFM-MSVR) via a clustering process to forecast multi-model data. First, the cluster-feedback mechanism helps to recognise each model and adjust the samples in each cluster. Then we apply the modified SVR to forecast the data in each cluster. To demonstrate the effectiveness of the proposed CFM-MSVR, the method is tested on one constructive dataset and four real datasets, where the results illustrate the superiority of CFM-MSVR in computational cost and forecasting accuracy: for instance, the validation error is 32.12 and the time cost is 8360 s on Data 1. Moreover, CFM-MSVR is tested on the GEFCom2012 dataset, and the results are compared with MethodG and other popular forecasting methods. The MAPE% of CFM-MSVR is 1.52, which shows that CFM-MSVR performs better on complex multi-model data.

Despite extensive experiments, this study may not fully capture the diversity and dynamics of real-world applications, potentially limiting the algorithm's generalisability. Moreover, the cluster-feedback mechanism needs further improvement, especially in setting the threshold value \(\hat{\epsilon }\): the setting of \(\hat{\epsilon }\) should be made more intelligent, which would improve the effectiveness and accuracy of finding each model. We will further explore these aspects in future research.