Introduction

Digitization and the availability of real-time data in manufacturing systems make it possible to continuously predict the state of machines and components. Machine learning algorithms for anomaly detection have proven to be very effective in this field (Tao et al., 2018; Wang et al., 2022). The goal of applying these algorithms is to detect patterns and relationships in process data that do not match expected behavior and thus indicate advanced wear or other critical processes (Chandola et al., 2009; Lopez et al., 2017). In this way, unplanned downtime can be avoided and maintenance can be scheduled more efficiently (Colledani and Tolio, 2012). To this end, appropriate algorithms must be selected, adapted and deployed. When faced with a large number of potentially suitable algorithms, identifying the optimal one and its hyperparameters is a challenge. Generally, these tasks require not only computational power, but also data science skills and domain-specific knowledge (Li et al., 2022). Automating these tasks could therefore greatly benefit the application of machine learning techniques in manufacturing systems.

Machine learning bottlenecks

To better understand why an automated selection of algorithms is desirable, we will first have a look at where human activities and expertise are required in the modeling process. Consider a machine that constitutes a bottleneck process in a manufacturing system, and suppose the monitoring system required to set up Predictive Maintenance (PdM) is based on prognostic methods that use current sensor data to assess the condition of a machine or piece of equipment. According to (Karmaker et al., 2022), two human actors are particularly relevant in this process: the domain expert and the data expert. The domain expert has extensive technical and organizational knowledge. The data expert has experience with predictive modeling but has limited knowledge of the specific application and process. Together, these specialists understand the environmental conditions under which the data is collected and formulate an overall target. The data expert translates the high-level target into a diagnostic or predictive task, extracts relevant features from the data, and selects an appropriate algorithm to solve the task. Essential steps such as understanding domain-specific attributes, formulating the prediction task, and partitioning the training and test data are performed. Once these steps have been completed, different machine learning algorithms can be applied (Karmaker et al., 2022).

Fig. 1

Activities and communication of human experts for the implementation of a machine learning algorithm as proposed by Karmaker et al. (2022)

It appears that an intensive exchange between data and domain experts is required not only in the formulation of the task, but also for data preparation and synthesis of the results. In these phases there is a bottleneck in terms of expertise and communication. In addition, the development of key features, the evaluation of alternative algorithms, and the validation of models require significant computational power and effort on the part of data experts (Karmaker et al., 2022). To facilitate the application of machine learning algorithms, automating these tasks would be beneficial. There is also an opportunity to make the exchange between experts more efficient and to optimize the processes of data annotation and result compilation. At the same time, automated approaches to key feature development and algorithm evaluation can help reduce the workload on human experts. In general, by automating these critical processes, resources could be used more efficiently.

Approaches for automation

Automated Machine Learning (autoML) aims to automate feature and algorithm selection, and hyperparameter optimization (Hutter et al., 2019). In contrast to autoML systems, which start without prior knowledge to find a suitable algorithm-hyperparameter combination, meta-learning aims at using machine learning techniques to learn from past tasks for new tasks (Lemke and Gabrys, 2010; Gabbay et al., 2009; Lemke et al., 2015; Smith-Miles, 2008). According to Gabbay et al. (2009) ‘Meta-learning is the study of principled methods that exploit meta-knowledge to obtain efficient models and solutions by adapting the machine learning and data mining process.’

The problem of selecting appropriate algorithms and hyperparameters arises from the No-Free-Lunch Theorem (NFL), which states that no single algorithm achieves the best results for a wide range of problems (Wolpert and Macready, 1997). In the context of PdM, it has been shown that this assumption also holds for anomaly detection. For example, Schmidl et al. (2022) evaluated over 70 state-of-the-art supervised, unsupervised, and semi-supervised algorithms on time series datasets from different domains. The results showed that no single algorithm was superior. They also found that deep learning methods are not yet competitive, despite their high training effort, and that simpler methods can quickly produce results almost as good as complex methods.

Therefore, the central hypothesis of this work is that an individual automatic selection of the best algorithm from a set of robust candidates can contribute to better anomaly detection, which, if the appropriate measures are derived from these insights, should also lead to more effective manufacturing systems. In this way, the manual effort for algorithm selection could be reduced. It would also allow for easy adaptation to changing conditions without the need for experts. Finally, we assume that the resulting selection could be superior to conventional approaches.

Meta-learning has been successfully applied to algorithm selection in several fields (Sect. 2). However, the design of this technique depends on the data structure and the task at hand (Ali et al., 2018). This work therefore takes the step of transferring the meta-learning approach to a new domain. For the first time, novel meta-features are introduced that describe the relationship between multivariate time series and domain-specific information, and a framework for meta-learning using multi-output regression is presented. Furthermore, in contrast to other methods, it is suitable for both single algorithm selection and ensemble formation. In this way, the meta-model can predict the predictive performance of all algorithms in the set of candidate anomaly detectors for multivariate manufacturing data.

The rest of the paper is organized as follows. Section 2 reviews the state of the art in meta-learning for algorithm selection with a focus on anomaly detection. After identifying the research gap, the proposed method is presented in Sect. 3. Section 4 is dedicated to the evaluation of the method in terms of prediction performance and comparison with benchmark methods. In addition, the importance of the individual meta-features and the computational times of the candidate algorithms are examined. The paper concludes with a discussion of the results and a conclusion in Sect. 5.

Related works

One application of meta-learning is to select an appropriate algorithm for a particular use case. However, a further distinction must be made with respect to the task of the algorithm, such as whether it should be used for predictive tasks, and if so, for what types of predictive tasks (Ali et al., 2018; Vanschoren, 2018). In general, a task \(t_j \in T\) is described by a vector \(m(t_j) = (m_{j,1} ,\ldots , m_{j,n})\) of \(n\) meta-features \(m_{j,n} \in M\). The distance between \(m(t_i)\) and \(m(t_j)\) can be determined to transfer information from an old task to a new one. Further, a meta-learner \(L\) can be trained on earlier evaluation results to predict the performance \(P_{i,\text {new}}\) of configurations \(\lambda _i\) of a new task \(t_{\text {new}}\). Here \(P\) represents the set of all previous scalar evaluations \(P_{i,j} = P(\lambda _i, t_j)\) of the configuration \(\lambda _i\) for the task \(t_j\) according to a suitable validation criterion and a model validation method (Bahri et al., 2022; Vanschoren, 2018). Figure 2 illustrates the concept of meta-learning as proposed by (Bahri et al., 2022).
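
To make the two mechanisms concrete, the following sketch illustrates how meta-knowledge can be reused: either by transferring evaluations from the most similar past task, or by training a meta-learner \(L\) on meta-features and past evaluations to predict \(P_{i,\text {new}}\) directly. The data, dimensions, and the choice of a random forest as meta-learner are placeholders for illustration, not the configuration used later in this paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestRegressor

# M_feat: meta-feature matrix of past tasks (one row m(t_j) per task t_j)
# P_eval: evaluation matrix, P_eval[j, i] = P(lambda_i, t_j)
rng = np.random.default_rng(0)
M_feat = rng.random((50, 12))   # placeholder meta-knowledge base (50 tasks, 12 meta-features)
P_eval = rng.random((50, 5))    # placeholder evaluations of 5 configurations
m_new = rng.random((1, 12))     # meta-features m(t_new) of a new task

# (a) transfer by task similarity: reuse the evaluations of the most similar past task
nn = NearestNeighbors(n_neighbors=1).fit(M_feat)
_, idx = nn.kneighbors(m_new)
p_transferred = P_eval[idx[0, 0]]

# (b) a meta-learner trained on (meta-features -> performances) predicts P_{i,new} directly
meta_learner = RandomForestRegressor(random_state=0).fit(M_feat, P_eval)
p_predicted = meta_learner.predict(m_new)[0]
```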

Fig. 2

The concept of meta-learning as proposed by Bahri et al. (2022)

Feature-based algorithm selection

According to Vanschoren (2018), meta-learning approaches can be distinguished based on the types of meta-features they use and how they use them. The concept depicted in Fig. 2 learns from existing model evaluations and meta-features to identify meaningful configurations of models and narrow down the search space. This can accelerate the search for optimal models and reduce computational effort, as only particularly relevant regions of the search space are explored (Zöller and Gabrys, 2020). Furthermore, there is the possibility of transferring suitable configurations to new tasks. Performance differences between various model configurations can be used as meta-features (Fürnkranz and Petrak, 2001), or placeholder models can be trained for existing tasks to weigh them depending on the similarity of a new task (Vanschoren, 2018). Additionally, the learning curves of candidate algorithms can be used as meta-features to evaluate the suitability of a configuration for new tasks (Leite and Brazdil, 2005). Another approach is meta-learning from the properties of different tasks, using a variety of meta-features to characterize the task and then select a suitable configuration (Mantovani, 2018; Reif et al., 2014; Alcobaça et al., 2020b; Vanschoren, 2010). Instead of manually defining meta-features, they can also be learned themselves to represent task groups (Sun et al., 2013). Furthermore, meta-features can be used to initialize an optimization process well or to predict the best configuration directly (Gomes et al., 2012). Approaches to constructing entire pipelines to select the best one for Bayesian optimization also fall into this category (Feurer et al., 2015; Fusi et al., 2017). At this point, it may also be useful to predict whether model optimization is promising (Ridd and Giraud-Carrier, 2014), or what improvement in prediction accuracy can be expected from optimization (Sanders and Giraud-Carrier, 2017).

Another approach is meta-learning from previous models, such as their structure or learned model parameters. Here, an attempt is made to train a meta-learner to learn how a candidate algorithm should be trained for a new task. This approach can be further differentiated into transfer learning and meta-learning with Artificial Neural Networks (ANN). In transfer learning, models pre-trained on multiple tasks are used as a starting point for a new task. In meta-learning with ANN, neural networks are enabled to use changes in their model parameters during training as meta-features for further training (Baxter, 2019; Bengio, 2012; Caruana, 1994; Thrun and Mitchell, 1995). Learning with few data describes a variant of meta-learning that aims to adapt deep neural networks with only a few training examples so that they are suitable for a new, similar task. In addition to all these mainly supervised learning approaches, meta-learning is by definition not limited to them (Bart and Ullman, 2005; Fei-Fei et al., 2006; Fink, 2004).

Meta-learners can also be distinguished according to the goal of prediction. For example, the goal may be to create a ranking of suitable algorithms or entire configurations. For instance, an ANN can be used to group similar tasks and then determine a ranking of the best configurations for these groups (Brazdil et al., 2003; Maforte Dos Santos et al., 2005). Alternatively, the meta-model can directly estimate the performance of a configuration (Köpf et al., 2000). A more detailed overview of these approaches can be found in the work of (Vanschoren, 2018).

Algorithm selection for anomaly detection

Anomalies are data points that significantly differ from other data points in a dataset (Hawkins, 1980). Anomalies can be caused by current disturbances and serve as early indicators of impending issues, such as declining production quality, decreasing machine functionality, or shortened component lifespan (Lughofer and Sayed-Mouchaweh, 2019; Lopez et al., 2017). The state of a machine can be defined by the degree of deviation from the expected operating behavior, with behavior being evaluated based on specific operating conditions (Vichare et al., 2004). For PdM, it is therefore essential to identify deviations from normal behavior to detect potential failures early and take necessary actions in time. Anomalies can be classified further into point, contextual, and collective anomalies (Schneider and Xhafa, 2022; Chandola et al., 2009). Point anomalies are individual data points that exhibit anomalous characteristics compared to the rest of the data. Contextual anomalies refer to data points that are considered anomalous only in a specific context or under certain conditions. In collective anomalies, a group of connected data points may be considered anomalous compared to the other data points, even if the individual instances do not represent anomalies themselves; in this case, the collective occurrence of these data points constitutes the anomaly (Cook et al., 2020). Figure 3 shows examples of these three types of anomalies.

Fig. 3

Examples of point, contextual and collective anomalies

The results of a model for anomaly detection can be in the form of anomaly scores or binary labels. Anomaly scores quantify the degree of deviation of each individual data point and allow ranking the data points according to their probability of being considered anomalous. Although this output contains all relevant information, unlike a label, it does not provide a concise summary of potential outliers. Binary labels indicate whether a data point is considered an outlier or not. Some algorithms output these labels directly, while others allow the conversion of anomaly scores into binary labels (Aggarwal, 2017).

In the field of meta-learning for algorithm selection, the following contributions to anomaly detection have been identified in the literature. Fu et al. (2022) propose a meta-learning method to detect anomalies in power dispatch data. They use a hybrid ensemble selection approach that combines different models based on meta-learning to improve the overall accuracy and generalizability of the detection model. Yu et al. (2022) developed a meta-learning method based on deep unsupervised learning for engine vibration anomaly detection. Vibration signals are transformed into 52 physical and statistical features to build models for different engines. Papastefanopoulos et al. (2021) present a meta-learning algorithm for unsupervised outlier detection that combines the best techniques of existing methods through ensemble voting and unsupervised feature selection. Zhao et al. (2021) propose to select a model based on the performance of many models on historical outlier detection benchmark datasets. The method uses specialized meta-features to capture task similarity within the meta-learning framework. Meta-learning approaches based on statistical and information-theoretic meta-features require large amounts of data and computational resources. To address this issue, Kotlar et al. (2021) propose a novel set of meta-features based on domain-specific properties of data that can be efficiently extracted or estimated using a small amount of data. This demonstrates that domain-specific features can provide high added value to algorithm selection. Tavares and Junior (2021) focused on the detection of anomalous traces in event logs, which can negatively affect the quality of process execution. The authors propose a meta-learning strategy that combines coding techniques with meta-feature extraction to improve anomaly detection performance. Poulakis et al. (2020) present a framework for automated clustering. The framework consists of two modules: algorithm selection and hyperparameter tuning. Algorithm selection relies on meta-learning with novel meta-features that capture similarities in clustering structure, and hyper-parameter tuning uses Bayesian optimization with an optimization objective that combines different cluster validity indices. Cacoveanu et al. (2009) present a system that combines dataset characterizations with landmarking to increase prediction accuracy to select the best classifier for a dataset while minimizing user effort and providing flexibility.

Table 1 Related works in the field of meta-learning for anomaly detection

Overall, the existing literature supports the idea that meta-learning is a viable method for selecting suitable algorithms for new tasks. However, there is a notable gap in understanding how to design a meta-learner effectively, as it heavily relies on the choice of meta-features, whose potential has yet to be fully exploited. It is evident that the effectiveness of meta-features is contingent on the specific use case, indicating the potential benefits of incorporating domain-specific features. The range of meta-learning algorithms varies, from simpler voting mechanisms to deep-learning architectures. Various approaches, including ensemble selection, single model selection, tuning mechanisms, and coding techniques, have been explored. Despite the usefulness of landmarking features, their application is limited in PdM due to the absence of labels in most cases. Furthermore, existing research primarily involves one-time selection decisions, leaving a research gap in the development of automated procedures for dynamically selecting appropriate algorithms for anomaly detection. Meta-learning approaches tailored for multivariate time series are also lacking. This raises challenges in effectively describing individual time series within a dataset, aggregating this information, and recognizing relationships between the time series. Additionally, there is a lack of documented real use cases in the literature that demonstrate the application of meta-learning in industrial settings, particularly in manufacturing processes. This paper attempts to develop a solution that addresses the problem of meta-learning for multivariate anomaly detection in manufacturing systems. For this purpose, an adequate framework for meta-learning and domain-specific meta-features are introduced and validated in the following sections.

Automated model selection approach

Since the suitability of anomaly detection algorithms depends on the available data and the specific application (Schmidl et al., 2022; Wolpert and Macready, 1997; Zhao et al., 2020), the challenge is to select the most appropriate algorithm for the different machines or components of a manufacturing system. Figure 4 illustrates this challenge for different types of machines in a manufacturing system. The set of possible anomaly detection algorithms is represented by a collection of differently colored magnifying glasses. These colors represent their different characteristics and capabilities. The meta-model’s task is to predict the performance of each of these candidates in order to make a reliable selection. It does this by using the incoming machine and operational data from each monitored object. At the bottom of the figure, the different colored magnifying glasses indicate that a different algorithm is selected for each machine. The selected algorithm is then used for anomaly detection. The green checkmark indicates that the machine is in normal condition. The orange triangle indicates that the machine is currently exhibiting abnormal behavior.

Formally, this challenge can be expressed as follows: Manufacturing systems comprise various types of machines \(M = \{m^1,\ldots , m^m\}\), each with distinct configurations and workloads subject to fluctuating operating conditions. The state of each machine is denoted by a binary variable \(y^m_{t}\), where:

$$\begin{aligned} y^m_{t} = {\left\{ \begin{array}{ll} 0 & \text {for the state "normal"} \\ 1 & \text {for the state "anomalous"} \end{array}\right. } \end{aligned}$$
(1)

Assuming the dataset \(X^m\) implicitly encodes information about the machine’s state \(y^m_{t}\) at time t, \(A = \{a^1,\ldots , a^a\}\) denotes a set of candidate algorithms, where each algorithm \(a^i\) is linked with a set of hyperparameters within a domain \(\Lambda ^i\). If algorithm \(a^i\) contains p hyperparameters, the entire hyperparameter space is denoted as \(\Lambda ^i = \{\lambda ^i_1,\ldots , \lambda ^i_p\}\). For \(X^m\), the aim is to discover an algorithm and a hyperparameter configuration that minimizes the following loss:

$$\begin{aligned} (a^*,\lambda ^*) \in \arg \min _{a^i \in A, \lambda \in \Lambda ^i} \frac{1}{K} \sum _{j=1}^{K} L(a^i_\lambda , X^j_{train}, X^j_{test}), \end{aligned}$$
(2)

where \(L(a^i_\lambda , X^j_{\text {train}}, X^j_{\text {test}})\) represents the loss experienced by algorithm \(a^i\) with hyperparameter \(\lambda \) on \(X^j_{\text {test}}\) when trained on \(X^j_{\text {train}}\) using K cross-validations.
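
The following sketch illustrates Eq. (2) as an exhaustive evaluation over candidate algorithms and hyperparameter configurations with K train/test splits. The structure of `candidates` and the `loss` function (e.g., 1 minus the F1-Score) are assumptions made for illustration; they are not prescribed by the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_algorithm(X, candidates, loss, k=5):
    """Exhaustive evaluation of Eq. (2): return the (algorithm, lambda) pair with the
    smallest mean loss over K train/test splits.  `candidates` maps an algorithm
    constructor to a list of hyperparameter dicts; `loss` is a user-supplied function
    loss(model, X_train, X_test) (both hypothetical interfaces)."""
    best, best_loss = None, np.inf
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for make_model, grid in candidates.items():
        for lam in grid:
            fold_losses = [
                loss(make_model(**lam), X[train_idx], X[test_idx])
                for train_idx, test_idx in kf.split(X)
            ]
            mean_loss = float(np.mean(fold_losses))
            if mean_loss < best_loss:
                best, best_loss = (make_model, lam), mean_loss
    return best, best_loss
```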

Procedure overview

Figure 5 depicts the meta-learning process. The initial phase, conducted offline, entails the training and testing of the meta-model. To establish the training database, various candidate algorithms undergo evaluation on diverse dataset collections (Sect. 3.3). Their performances and hyperparameter configurations are stored alongside 155 meta-features (Sect. 3.4).

Fig. 4

Conceptual framework for prognostic algorithm selection in manufacturing systems

Fig. 5

Comprehensive flow of the meta-learning process for algorithm selection

For assessing the suitability of candidate algorithms and serving as a target for meta-learning, the F1-Score is selected as the primary evaluation criterion. The F1-Score offers a balanced assessment of both precision and recall, making it a sensible choice considering that the costs associated with false positives and false negatives depend on the application scenario. Moreover, the F1-Score proves advantageous in scenarios where there is an uneven distribution of classes within the data.

$$\begin{aligned} \text {F1-Score}= \frac{2 \cdot \text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(3)

In a preprocessing step, relevant meta-features are selected and transformed to attain the desired structure for subsequent processing. It is important to differentiate between actual meta-features, which describe the properties of the dataset and individual attributes, and performance measures utilized as meta-targets.

Fig. 6

Processing steps within the meta-model

Before passing them to the meta-model for prediction, the meta-features may be encoded. For this work, the effectiveness of an autoencoder, a PCA (Pedregosa et al., 2011), as well as a t-Distributed Stochastic Neighbor Embedding (t-SNE) (Pedregosa et al., 2011) were investigated for dimensionality reduction. The autoencoder is used to learn a compact representation of the meta-features. In this case, we designed a dense autoencoder with eight fully connected layers. After dividing the datasets into training and test data, the autoencoder is fitted on the training data; the test data are transformed accordingly. To construct the meta-model, a multi-output regression model is used to model the target variables, i.e., the performance measures of the candidate algorithms. In this case, the validation criterion is the Mean Squared Error (MSE) between the true and predicted target variables.
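
A minimal sketch of such an encoding step is given below, assuming a symmetric dense autoencoder with eight fully connected layers built with Keras. The layer widths, code dimension, and training settings are illustrative assumptions; only the overall structure (fit the encoder on the training meta-features, then transform the test meta-features) follows the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, code_dim=16):
    # Eight fully connected layers in total; the widths are assumptions.
    inp = keras.Input(shape=(n_features,))
    x = layers.Dense(128, activation="relu")(inp)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32, activation="relu")(x)
    code = layers.Dense(code_dim, activation="relu", name="code")(x)
    x = layers.Dense(32, activation="relu")(code)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(n_features, activation="linear")(x)
    autoencoder = keras.Model(inp, out)
    encoder = keras.Model(inp, code)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# meta_train / meta_test: matrices of meta-features (one row per dataset)
# autoencoder, encoder = build_autoencoder(meta_train.shape[1])
# autoencoder.fit(meta_train, meta_train, epochs=200, batch_size=16, verbose=0)
# meta_train_enc = encoder.predict(meta_train)
# meta_test_enc = encoder.predict(meta_test)   # test data transformed with the fitted encoder
```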

In the second phase, performed online, the meta-model is applied. The goal is to identify a suitable algorithm for a previously unknown application dataset. To achieve this, meta-features are extracted from the application dataset and submitted to the meta-model, which returns an estimate of each candidate’s prognostic performance. Based on a selection mechanism that takes the performance predictions into account, a single algorithm or a combination of algorithms is selected. It is assumed that if there is a dominant algorithm in the candidate set according to the prediction of the meta-model, this algorithm should be selected. Forming an ensemble of a very good algorithm and other less good algorithms would tend to result in lower prediction performance than using only the dominant algorithm. Forming an ensemble would, however, be advantageous when a dominant algorithm exists but cannot be reliably identified by the meta-model. We further assume that in the absence of a particularly well-suited algorithm, an ensemble should be formed to compensate for the uncertainties of the individual algorithms for this dataset. The weighting of the algorithms within the ensemble is determined as a function of the predicted performance. Therefore, two approaches are considered: First, the selection of the best algorithm (Meta-Single-Selection (Meta-S-Sel)). Second, the formation of an ensemble of several good algorithms or the selection of a single algorithm (Meta-Multi-Selection (Meta-M-Sel)). For the second case, we define an upper bound \(\sigma _u\) and a lower bound \(\sigma _l\). Algorithms whose predicted performance is below \(\sigma _l\) are excluded. Algorithms are added to the ensemble until the sum of the predicted performances exceeds the upper bound \(\sigma _u\).

Algorithm 1

Procedure to construct the Meta-M-Sel
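
The following Python sketch mirrors the Meta-M-Sel procedure described above (Algorithm 1). The greedy order of addition and the performance-proportional weighting are assumptions about details not spelled out here; the default thresholds correspond to the values reported later in the meta-algorithm configuration (\(\sigma _l = 0.2\), \(\sigma _u = 1\)).

```python
def meta_multi_selection(pred_perf, sigma_l=0.2, sigma_u=1.0):
    """Sketch of Meta-M-Sel: pred_perf maps each candidate algorithm name to the
    F1-Score predicted by the meta-model; returns the ensemble members and weights."""
    # discard candidates whose predicted performance falls below the lower bound
    admissible = {a: p for a, p in pred_perf.items() if p >= sigma_l}
    # add candidates in descending order of predicted performance until the
    # cumulative predicted performance exceeds the upper bound
    ensemble, cum = [], 0.0
    for algo, perf in sorted(admissible.items(), key=lambda kv: kv[1], reverse=True):
        ensemble.append(algo)
        cum += perf
        if cum > sigma_u:
            break
    # weight each member as a function of its predicted performance
    total = sum(admissible[a] for a in ensemble)
    weights = {a: admissible[a] / total for a in ensemble}
    return weights  # a clearly dominant algorithm yields a one-element "ensemble"
```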

Candidate algorithms for anomaly detection

To define the set of algorithms from which the meta-model is supposed to predict the most appropriate one, some boundary conditions need to be established. It is crucial that the algorithms excel in handling multivariate data. Moreover, the datasets may contain time series, such as sensor and control signals, in addition to categorical attributes describing production history or other relevant context. Therefore, the chosen algorithms should demonstrate proficiency in effectively processing these types of data. Another consideration arises from the necessity to pre-evaluate all candidate algorithms on the training datasets. Thus, it is imperative to minimize the number of hyperparameters to reduce the potential solutions and the computational complexity of preparing the meta-learning system. Additionally, a diverse set of anomaly detection algorithms should be available as candidates to ensure a comprehensive suitability profile. In light of these considerations, the following algorithms were selected as alternatives for the selection process and implemented using the PyOD library (Zhao et al., 2019): Histogram-based Outlier Score (HBOS), Principal Component Analysis (PCA), Clustering-Based Local Outlier Factor (CBLOF), Lightweight Online Detector of Anomalies (LODA), Copula-Based Outlier Detector (COPOD), Local Outlier Factor (LOF), One-Class Support Vector Machine (OCSVM), Isolation Forest (IForest), k-Nearest Neighbors Detector (KNN), Feature Bagging (FB) and Connectivity-Based Outlier Factor (COF).
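
A sketch of how this candidate set can be instantiated and evaluated with the PyOD library is shown below. Hyperparameters are left at their defaults for illustration; the exact settings used to build the meta-knowledge base are not reproduced here.

```python
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.cblof import CBLOF
from pyod.models.loda import LODA
from pyod.models.copod import COPOD
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.cof import COF

# candidate set with default hyperparameters (illustrative only)
candidates = {
    "HBOS": HBOS(), "PCA": PCA(), "CBLOF": CBLOF(), "LODA": LODA(),
    "COPOD": COPOD(), "LOF": LOF(), "OCSVM": OCSVM(), "IForest": IForest(),
    "KNN": KNN(), "FB": FeatureBagging(), "COF": COF(),
}

def evaluate_candidates(X_train, X_test):
    """Fit every candidate on X_train and return its binary labels for X_test."""
    labels = {}
    for name, model in candidates.items():
        model.fit(X_train)                    # unsupervised fit
        labels[name] = model.predict(X_test)  # 0 = normal, 1 = anomalous
    return labels
```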

Dataset collections for meta-model construction

The composition of the training datasets plays a pivotal role in shaping the performance and generalizability of the meta-model. The datasets, selected for the problem at hand, encompass multivariate data. The number of attributes per dataset varies from 2 to 21, while the length ranges from 148 to 5473 data points per attribute. Each data point must also be associated with a binary label, a prerequisite for evaluating the performance of the candidate algorithms. The construction of the meta-model relies on publicly available data collections, ensuring the reproducibility of the study (see Table 2). Care was taken to ensure that only about 80% of the datasets were used to train the meta-model and the remaining 20% were used for evaluation (Sect. 4).

Table 2 Open source dataset collections used for meta-feature generation and model construction, along with references and download sources

The fundamental premise of this study is grounded in the concept of the No Free Lunch (NFL) theorem, which posits that no single algorithm consistently outperforms others across different tasks. Previous research by (Schmidl et al., 2022) has demonstrated that the NFL also applies to anomaly detection algorithms. However, it remains uncertain whether this assumption holds true for the specific algorithms and datasets considered in this study. Therefore, it is imperative to investigate how the performance of the various algorithms is distributed across the data, and whether there exists a dominant algorithm with consistently superior predictive capabilities for all datasets, thereby obviating the need for individual algorithm selection.

Figure 7 shows the predictive performance of various anomaly detection algorithms on the training datasets. The top plot (Fig. 7a) illustrates that there is no clear dominant algorithm; instead, many of the algorithms occupy valid positions within the candidate set. The bottom plot (Fig. 7b) further illustrates how algorithm performance varies from one dataset to another. Some algorithms yield impressive results for a particular dataset, while the average performance across all datasets is relatively poor. A direct comparison between the plots reveals that PCA achieves the highest median result (as shown in Fig. 7b), but it is not the absolute best algorithm overall (as shown in Fig. 7a). It is worth noting that the effectiveness of an algorithm on the training dataset does not necessarily guarantee similar performance on new incoming data or on a fresh test dataset. In summary, these results support the hypothesis that the NFL theorem holds for the given data and algorithms. Consequently, it should be advantageous to individually select algorithms for anomaly detection.

Fig. 7

Verification of the suitability of the data set collections for the evaluation of the meta-learning procedure

Meta-features to describe task similarity

The performance of the meta-model is significantly influenced by a thoughtful selection and effective incorporation of meta-features. It is crucial to describe datasets with meta-features that the meta-model can distinguish well and that also reflect properties defining the suitability of algorithms for specific tasks (Kotlar et al., 2021; Tavares and Junior, 2021).

The choice of meta-features is contingent on the structure and dimensionality of the input data. In this paper, we introduce a set of meta-features tailored for anomaly detection and multivariate time series. This set is complemented with previously established meta-features from the literature. Statistical summaries are employed for most meta-features to aggregate individual time series features for each dataset.

Multivariate time series are a common data structure for characterizing the state of technical systems. Often there is a causal relationship between several time series within the same data set. These relationships may contain crucial information for the selection of a suitable prognostic algorithm. Therefore, we introduce the following meta-features, which consider the properties of each time series in isolation, as well as its relationships to other time series within the same dataset. It is important to note that these features are not new for describing time series; their novelty lies in their application in this context to support the meta-learning process. Detailed information on the features used can be found in (Bhattacharya and Burman, 2016), from which the following definitions are also adopted.

  • Autoregressive coefficients quantify the relationship between a time series and its lagged values, providing insight into the temporal dependencies within the data.

    $$\begin{aligned} X_t = c + \phi _1 X_{t-1} + \phi _2 X_{t-2} + \ldots + \phi _p X_{t-p} + \varepsilon _t \end{aligned}$$
    (4)

    \(X_t\) is the value of the time series at time \(t\), \(c\) is a constant term, \(\phi _1, \phi _2, \ldots , \phi _p\) are the autoregressive coefficients. \(X_{t-1}, X_{t-2}, \ldots , X_{t-p}\) are the lagged values of the time series and \(\varepsilon _t\) is the error term at time \(t\).

  • Cross-correlation measures the similarity between two time series as a function of the lag of one relative to the other, indicating potential relationships or patterns shared between them.

    $$\begin{aligned} R_{xy}(\tau ) = \frac{\sum _{t=1}^{N-\tau } (x_t - \mu _x)(y_{t+\tau } - \mu _y)}{\sqrt{\sum _{t=1}^{N} (x_t - \mu _x)^2 \sum _{t=1}^{N} (y_t - \mu _y)^2}} \end{aligned}$$
    (5)

    where \(R_{xy}(\tau )\) is the cross-correlation coefficient at lag \(\tau \), \(x_t\) and \(y_t\) are the values of the two time series at time \(t\), \(\mu _x\) and \(\mu _y\) are the means of the two time series, respectively, \(\tau \) is the lag parameter, and \(N\) is the total number of data points in the time series. The cross-correlation coefficient \(R_{xy}(\tau )\) ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no correlation.

  • Covariance quantifies the extent to which two time series \(X\) and \(Y\) change together, providing information about their joint variability.

    $$\begin{aligned} \text {Cov}(X, Y) = \frac{\sum _{i=1}^{n} (x_i - \mu _X)(y_i - \mu _Y)}{n-1} \end{aligned}$$
    (6)

    where \(\text {Cov}(X, Y)\) is the covariance between \(X\) and \(Y\), \(x_i\) and \(y_i\) are individual observations of \(X\) and \(Y\), \(\mu _X\) and \(\mu _Y\) are the means of \(X\) and \(Y\) respectively, and \(n\) is the number of observations. A positive covariance indicates a positive linear relationship, while a negative covariance indicates a negative linear relationship. A covariance close to zero indicates a weak or no linear relationship.

  • Transfer entropy quantifies the directional flow of information from one time series to another, revealing the extent of information transfer between them. Given two discrete-time processes \(X\) and \(Y\), the transfer entropy from \(X\) to \(Y\) at lag \(k\) is defined as:

    $$\begin{aligned} TE_{X\rightarrow Y}(k) = H(Y_{t+1} | Y_{t}^{(k)}, X_t^{(k)}) - H(Y_{t+1} | Y_{t}^{(k)}) \end{aligned}$$
    (7)

    where \(TE_{X\rightarrow Y}(k)\) is the transfer entropy from \(X\) to \(Y\) at lag \(k\), \(H(\cdot )\) denotes the conditional entropy, \(Y_{t+1}\) is the future value of process \(Y\), and \(Y_{t}^{(k)}\) and \(X_t^{(k)}\) are the histories of \(Y\) and \(X\) up to lag \(k\) respectively. The first term on the right-hand side measures the uncertainty in the future of \(Y\) given both the past of \(Y\) and the past of \(X\) up to lag \(k\). The second term measures the uncertainty in the future of \(Y\) given only the past of \(Y\) up to lag \(k\). The difference between these two terms quantifies the additional information provided by the past of \(X\) up to lag \(k\).

  • Maximum power frequencies represent the dominant oscillatory patterns within the time series, providing insight into periodic behavior or trends. Given a continuous signal \(x(t)\), the power spectral density (PSD) \(S_x(f)\) represents the distribution of power as a function of frequency. The Maximum Power Frequency (MPF) is defined as:

    $$\begin{aligned} f_{\text {MPF}} = \arg \max _{f} S_x(f) \end{aligned}$$
    (8)

    where \(f_{\text {MPF}}\) is the Maximum Power Frequency, and \(S_x(f)\) is the power spectral density of the signal.

Besides the metrics described above for characterizing time series, it seems reasonable to expect that the specific nature of anomalies present in a dataset can significantly affect the effectiveness of a chosen algorithm. Therefore, we introduce meta-features that provide additional details about the nature of anomalies. It is important to note, however, that these meta-features can only be generated if the training dataset contains labels that categorize the state of the data at different points in time. In the following, \(x_i\) represents the value of the time series at time \(i\), where \(x_i = 1\) indicates an anomalous data point and \(x_i = 0\) indicates a normal data point. \(N\) is the total number of data points in the time series.

  • Total anomaly ratio indicates the proportion of anomalous data points relative to the total length of the time series, providing a measure of the overall prevalence of anomalies.

    $$\begin{aligned} TAR = \frac{\sum _{i=1}^{N} x_i}{N} \end{aligned}$$
    (9)
  • Point anomaly ratio represents the ratio of individual data points identified as anomalies, providing insight into the frequency of isolated abnormal observations. Here, only isolated (point) anomalies are counted, i.e., \(x^{\text {point}}_i = 1\) if data point \(i\) is a point anomaly and 0 otherwise.

    $$\begin{aligned} PAR = \frac{\sum _{i=1}^{N} x^{\text {point}}_i}{N} \end{aligned}$$
    (10)
  • Collective anomaly ratio quantifies the ratio of anomalies that occur collectively, indicating instances where anomalies occur in groups or clusters within the time series. Here, only data points belonging to collective anomalies are counted, i.e., \(x^{\text {coll}}_i = 1\) if data point \(i\) is part of a collective anomaly and 0 otherwise.

    $$\begin{aligned} CAR = \frac{\sum _{i=1}^{N} x^{\text {coll}}_i}{N} \end{aligned}$$
    (11)
  • Anomaly durations refer to the lengths of time anomalies persist in the time series, providing information about the temporal extent of abnormal patterns. To calculate Anomaly Durations, we identify consecutive sequences of anomalous data points and measure the duration of each such sequence. For example, if \(x_i = 1\) indicates an anomalous data point and \(x_i = 0\) indicates a normal data point, the duration of an anomaly sequence starting at time \(i\) and ending at time \(j\) would be \(j - i + 1\) time units.

  • Non-anomaly durations represent the length of time intervals between anomalous occurrences, providing insight into the periods of normalcy or absence of anomalies in the time series. To calculate Non-Anomaly Durations, we identify the consecutive sequences of normal data points (0s) and measure the duration of each such sequence. For example, if \(x_i = 1\) indicates an anomalous data point and \(x_i = 0\) indicates a normal data point, the duration of a non-anomaly sequence starting at time \(i\) and ending at time \(j\) would be \(j - i + 1\) time units.

Additionally, further sets of meta-features (Table 3) were extracted using the PyMFE library (Alcobaça et al., 2020a).
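
The sketch below illustrates how a few of the proposed meta-features and the PyMFE groups can be extracted for a single multivariate dataset. The lag-0 cross-correlation, the aggregation by mean, and the selected PyMFE groups are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.signal import periodogram
from pymfe.mfe import MFE

def ts_meta_features(X):
    """Illustrative extraction of a few of the proposed meta-features for a
    multivariate time series X (shape: timesteps x attributes)."""
    feats = {}
    # pairwise cross-correlation (shown at lag 0 for brevity) and covariance
    corr = np.corrcoef(X, rowvar=False)
    cov = np.cov(X, rowvar=False)
    iu = np.triu_indices_from(corr, k=1)
    feats["cross_corr_mean"] = float(np.mean(np.abs(corr[iu])))
    feats["cov_mean"] = float(np.mean(cov[iu]))
    # maximum power frequency per attribute (Eq. 8), aggregated by mean
    mpf = [f[np.argmax(Pxx)] for f, Pxx in (periodogram(col) for col in X.T)]
    feats["mpf_mean"] = float(np.mean(mpf))
    return feats

# general, statistical and information-theoretic meta-features via PyMFE (Table 3)
# mfe = MFE(groups=["general", "statistical", "info-theory"])
# mfe.fit(X)                 # unsupervised extraction
# names, values = mfe.extract()
```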

Table 3 Overview of meta feature groups extracted from PyMFE

Experimental evaluation

The automotive manufacturing process encompasses a spectrum of highly automated welding procedures. For instance, clusters of robots are employed to weld various types of studs onto the car body to facilitate subsequent assembly processes. In this context, substantial volumes of data are generated and gathered to monitor and control robots and equipment. The studs’ material, length, diameter, and welding position vary depending on the car model. Consequently, different welding programs are executed. Depending on the production schedule, model changes may occur more or less frequently, resulting in a broad array of process parameter curves. Figure 8 illustrates the same two signals for different combinations of studs and bodies, each identified by a unique ProcessID.

Fig. 8

Curves of process parameters in stud welding. a Process parameters for identical stud and car body combination (ProcessID). b Variation in process parameters across different combinations of studs and car bodies

When it comes to implementing PdM using machine learning techniques, the critical question is: Which algorithm is optimal for each process? What if there’s a need to establish fault prediction for numerous types of robots? What if additional processes and technologies are introduced, each with different variants and components? How does the monitoring adapt when the workload on a machine changes? And how does replacing a machine component affect the accuracy of the prognostic model? What happens to sensor readings as environmental conditions change? The crux of the matter is that any change in these conditions may require an adjustment or modification to the algorithm used to monitor the process. In such scenarios, an automated approach to selecting appropriate algorithms would save time and resources.

In the following experiments, we investigate whether a reliable selection of algorithms is possible with the presented procedure. The performance is compared with benchmark methods, the importance of the individual meta-features is examined, and the computing times of the candidate algorithms are considered.

Meta-algorithm configuration

In a preliminary study, several meta-learning algorithms were evaluated in conjunction with different feature encoding methods using fivefold cross-validation based on the MSE. In particular, the Random Forest Regressor (RFR) algorithm showed robust performance and was therefore chosen as the meta-learning algorithm. It works by generating numerous decision trees during the training phase, each constructed using a random subset of the training data and features to ensure diversity among the trees. During the prediction phase, RFR combines the outputs of individual trees to produce the final prediction. We use the sklearn.multioutput module, which extends the capabilities of this originally single-output regression model to handle multiple target variables simultaneously. The problem involves learning a mapping function \(f\) that predicts F1-Scores (\(Y'\)) based on a matrix \(X\) of meta-features. The objective is to minimize the Mean Squared Error loss function, \(L(Y, Y')\), where \(Y\) represents the actual F1-Scores. The RFR is trained to optimize its parameters, resulting in a model \(f^*\) used for predicting F1-Scores for new meta-features. Considering a dataset collection with \(N\) rows, each dataset is characterized by \(M\) meta-features. Let \(X\) represent the matrix of meta-features (\(N \times M\)), where \(X_{ij}\) is the \(j\)-th meta-feature for the \(i\)-th dataset. The target \(Y\) is the matrix of actual F1-Scores (\(N \times D\)), where \(Y_{ij}\) is the F1-Score of the \(j\)-th candidate algorithm on the \(i\)-th dataset and \(D\) is the number of candidate algorithms.

$$\begin{aligned} L(Y, Y') = \frac{1}{ND} \sum _{i=1}^{N} \sum _{j=1}^{D} (Y_{ij} - Y'_{ij})^2 \end{aligned}$$
(12)

The optimal regression model \(f^*\) minimizes this loss:

$$\begin{aligned} f^* = \arg \min _f L(Y, f(X)) \end{aligned}$$
(13)

The resulting model \(f^*\) is used to predict the F1-Scores of each candidate algorithm for the test datasets, which are not known to the meta-model. To evaluate the Meta-M-Sel method, the parameters were set to \(\sigma _l\) = 0.2 and \(\sigma _u\) = 1 using a grid search.
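
A compact sketch of this meta-model, assuming the scikit-learn interfaces named above, is given below. The Random Forest hyperparameters are left at their defaults; the exact settings of the study are not reproduced.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

# X_meta:  (N x M) matrix of meta-features, one row per training dataset
# Y_f1:    (N x D) matrix of measured F1-Scores, one column per candidate algorithm
def fit_meta_model(X_meta, Y_f1):
    """Multi-output Random Forest Regressor used as the meta-model f*."""
    meta_model = MultiOutputRegressor(RandomForestRegressor(random_state=0))
    meta_model.fit(X_meta, Y_f1)
    return meta_model

# online phase: predict the F1-Score of every candidate for unseen datasets
# Y_pred = meta_model.predict(X_meta_test)
# mse = mean_squared_error(Y_f1_test, Y_pred)   # validation criterion, Eq. (12)
```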

Benchmarks

The literature lacks a standardized evaluation procedure for meta-learning approaches, leading to a diversity of evaluation methodologies without established guidelines. To fill this gap, this study employs a comparative analysis by introducing five benchmark methods for evaluating the selection strategies. This approach aims to provide an informative assessment of the meta-model’s performance relative to the following baselines:

  • Optimum: Selection of the candidate algorithm with the best performance on the test data, representing the ideal selection strategy. It is assumed that all results are known and the best one for a data set is used to calculate the F1-Score.

  • Te-Sel: Selection of the algorithm with the best overall prediction result on the test data. Again, it is assumed that all outcomes are known. Instead of using an individual algorithm for each new data set, the algorithm with the best overall result for all data sets is used to determine the F1-Score.

  • Tr1-Sel: Selects the algorithm with the best performance on the training data. Here, the results on the training data are used as the selection function. The algorithm with the best result on a subset of the data is also used to predict the evaluation data.

  • Tr2-Sel: Forms an ensemble of the two most effective algorithms on the training data. In contrast to Tr1-Sel, not a single, but a combination of the results of the two best algorithms on a part of the data is used for the evaluation part of the data.

  • Tr3-Sel: Forms an ensemble of the three best performing algorithms on the training data. Like Tr2-Sel, but combining the three best algorithms instead of two.

The result of the formed ensembles is determined by simple mean value calculation. The results of predicting whether or not an anomaly is present are added and divided by the number of candidates in the ensemble. The F1-Score is then calculated for each data set.
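
The following sketch illustrates this mean-vote aggregation and the subsequent F1-Score computation; the 0.5 decision threshold used to round the averaged votes is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def ensemble_f1(member_labels, y_true):
    """Mean-vote ensemble: average the binary predictions of the members,
    round to a binary label, and compute the F1-Score for the dataset."""
    votes = np.mean(np.vstack(member_labels), axis=0)  # fraction of members flagging an anomaly
    y_pred = (votes >= 0.5).astype(int)                # assumed decision threshold
    return f1_score(y_true, y_pred)
```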

Prognostic performance

Figure 9 provides a comparative overview of the performances of the proposed meta-models and benchmark methods. It displays F1-Score predictions for previously unknown test datasets, simulating the productive use of the offline-constructed meta-models. The meta-model must make selection decisions for a single algorithm (Meta-S-Sel) or form an ensemble (Meta-M-Sel) based on incoming dataset meta-features. Additional approaches include Te-Sel (application of the best overall algorithm on the test data), Tr1-Sel (selection of the overall best algorithm on the training data, then applied to the test data), Tr2-Sel (ensemble of the best two algorithms on the training data), and Tr3-Sel (ensemble of the best three algorithms on the training data). The box plots are ordered by medians, with the optimal outcome, achievable by individually selecting the single best algorithm from the candidate set for each dataset, positioned at the far right (Optimum). Corresponding numerical values are available in Table 4, sorted by median in ascending order.

The findings suggest that when it comes to selecting a single algorithm for an unknown dataset, Meta-S-Sel demonstrates superior performance. Following closely, Meta-M-Sel performs well, while Te-Sel, Tr2-Sel, Tr3-Sel, and Tr1-Sel exhibit significantly lower performances. The median difference between Meta-S-Sel and the optimal value is minimal (0.023), and the maximum value matches the optimum, highlighting its efficiency. In contrast, the discrepancies with the optimum are more substantial for Meta-M-Sel (0.082), Te-Sel (0.144), Tr2-Sel (0.148), Tr3-Sel (0.148), and Tr1-Sel (0.166).

In summary, the results strongly confirm the effectiveness of the meta-learning approach in selecting the most appropriate algorithm or ensemble based on meta-data. The method even outperforms all strategies that rely on evaluations of parts of the considered datasets, demonstrating its superior performance. Of particular note is its ability to outperform the selection of the overall best algorithm, the most common approach in industrial settings. The minimal difference from the optimum indicates that the chosen meta-features encapsulate indispensable information that is critical for making an informed selection decision. Regarding the superiority of Meta-S-Sel over Meta-M-Sel, this is presumably because the meta-model can clearly identify the most suitable algorithm for a dataset. Thus, single model selection yields better results than a combination of the best and the second-best candidate algorithms. However, there is always the possibility of overfitting to the (albeit large) selection of datasets. In this sense, Meta-M-Sel may be a more robust solution for model selection in practice if the data structures are very different from those used here.

Fig. 9

Box plots illustrating predictive performance for baseline and meta-selection strategies

Table 4 Statistical summary of predictive performances for baseline and meta-selection strategies

Meta-feature importance

Examining the importance of different meta-features for predicting the performance of individual candidate algorithms through the meta-model provides deep insights into the effectiveness of the proposed approach. In total, 155 meta-features were obtained for each dataset, 145 of which were computed with the PyMFE library, in addition to the ten meta-features introduced in this paper (see Sect. 3.4). Permutation Feature Importance was used as the method to determine the results. This analysis method measures the contribution of each feature to the statistical performance of a fitted model. To do this, the values of each feature are randomly shuffled and the change in model performance is observed (Breiman, 2001). In this way, it can be used to analyze how much a feature contributes to the prediction of a target variable.
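
A minimal sketch of this analysis with scikit-learn is given below; the number of repetitions, the scoring metric, and the name `meta_feature_names` are illustrative assumptions.

```python
from sklearn.inspection import permutation_importance

# meta_model:   fitted multi-output Random Forest meta-model
# X_meta, Y_f1: meta-features and F1-Score targets of the held-out datasets
result = permutation_importance(
    meta_model, X_meta, Y_f1,
    n_repeats=10, random_state=0, scoring="neg_mean_squared_error",
)
# rank meta-features by the mean drop in performance when their values are shuffled
ranking = sorted(
    zip(meta_feature_names, result.importances_mean),
    key=lambda kv: kv[1], reverse=True,
)
```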

Table 5 ranks the results in descending order of median, with Transfer Entropy Maximum and Mutual Information Mean emerging as the most important features by a significant margin. Interestingly, the Collective Anomaly Ratio is also among the top features, indicating that the collective anomaly ratio is a critical criterion for selecting anomaly detection algorithms. This confirms the hypothesis that the newly introduced meta-features help to describe the similarity of data sets and capture relevant criteria for the selection of anomaly detection algorithms. The results suggest that the relationships between time series provide valuable insights, justifying the use of meta-features, in particular Transfer Entropy Maximum and Mutual Information Mean. These features allow the meta-model to identify relevant patterns and structures in the data, facilitating informed decision making. However, it is important to note that the availability of labels, especially for the Collective Anomaly Ratio, may not be guaranteed in all applications. Therefore, there may be situations where certain features cannot be used due to a lack of label information. Furthermore, it should be emphasized that the relevance of features is highly dependent on the specific circumstances and requirements of the application domain. What is meaningful in one context may be less relevant in another. In summary, the results of the feature importance analysis confirm the effectiveness of the newly introduced meta-features and underline their role in describing data and capturing crucial criteria for the selection of anomaly detection algorithms. These findings are essential for adapting the meta-model to specific use cases and optimizing its performance.

Table 5 Importance of meta-features for algorithm selection

Computing times

The computational time of the various candidate anomaly detection algorithms is examined by measuring the time each algorithm takes to predict each data point in a dataset. The results show clear differences in the computation times of the models considered. HBOS, PCA, and CBLOF proved to be significantly faster than LODA, COPOD, LOF, and OCSVM, while IForest, KNN, FB, and COF proved to be relatively slow. The slowest group not only showed generally longer computation times, but also a remarkable variation in the measured values. This variation could indicate an increased sensitivity to data characteristics, especially for complex datasets with many attributes and instances. Observing these differences in computation time raises questions about the practical applicability of the algorithms. The faster models, such as HBOS and PCA, could be advantageous in applications that require real-time response to anomalies. The longer computational times of the slower algorithms could be problematic in certain contexts, particularly where quick decisions are required. The high variance within this group indicates sensitivity to different data structures, leading to significant performance variations in complex scenarios. It is important to note that the proposed method does not use computational time as a criterion for algorithm selection, but focuses solely on predictive performance. In practice, however, the choice of an anomaly detection algorithm should be carefully considered, taking into account the specific requirements of the domain.
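
A simple measurement of this kind could look like the following sketch; using wall-clock time per predicted data point is an assumption about the measurement protocol.

```python
import time

def prediction_time_per_point(model, X_train, X_test):
    """Average prediction time per data point for one candidate detector."""
    model.fit(X_train)
    t0 = time.perf_counter()
    model.predict(X_test)
    return (time.perf_counter() - t0) / len(X_test)

# times = {name: prediction_time_per_point(m, X_train, X_test)
#          for name, m in candidates.items()}
```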

Fig. 10

Boxplot of the distribution of prediction times for each candidate algorithm across all training data sets

Table 6 Statistical summary of prediction times for candidate algorithms across all datasets

Conclusion

In this paper, a meta-learning approach to algorithm selection was presented to predict the predictive performance of different candidate algorithms for anomaly detection in manufacturing systems. Subsequently, the best algorithm is selected or a dynamic ensemble is formed based on the estimated predictive performance. Experiments evaluating the method demonstrate its superiority over five benchmark approaches. The meta-learning approach improves predictive performance even compared with strategies that select algorithms based on validation errors observed on parts of the considered datasets. Moreover, the success of the method lies in its ability to make a better selection than choosing the overall best algorithm for the test data. The minimal differences between the meta-learning method and the optimal strategy suggest that the meta-features are well suited to describe multivariate datasets and capture relevant properties for anomaly detection.

However, these promising results should be treated with caution. Although efforts were made to diversify the datasets, the generalizability of the method cannot be assumed. Such generalizability would require further investigation in different application domains. In addition, it is unclear how improved anomaly detection affects the manufacturing system, especially with respect to the optimization of the algorithmic evaluation metrics. In this paper, different loss functions are used in different places: first, the individual loss functions of the candidate algorithms, and second, the loss function of the meta-regression model, where the Mean Squared Error (MSE) is used to evaluate the performance of the meta-model in estimating the candidate algorithms. In addition, the F1-Score is used as a criterion for comparing different candidates and selection methods to evaluate how well anomalies can be identified. With respect to the individual loss functions of the candidates, it can be assumed that better performance leads to better anomaly detection. Better anomaly detection can in turn lead to better failure prevention and thus to a higher overall effectiveness of the manufacturing system. This correlation exists in general, but its strength depends on the structure of the manufacturing system. More research is needed to make more accurate statements. Overall, under the conditions of this study, the initial hypothesis is confirmed: the individual selection of robust algorithms for anomaly detection leads to better predictive performance for multivariate datasets, such as those found in machine and operational data in manufacturing. The automated applicability of the method provides an opportunity to increase the efficiency of manufacturing systems by minimizing unplanned downtime through condition-based maintenance and enabling efficient maintenance planning.