Open Research Europe 2025, 5:84 Last updated: 18 APR 2025

RESEARCH ARTICLE

Development and comparative analysis of machine learning algorithms for predictive atmospheric corrosion modeling
[version 1; peer review: 2 approved with reservations]

Jose Manuel Perales Fernández, María López Abelairas, Arturo Sánchez-Ramos, Lila Otero-Gonzalez
Idener Research and Development A.I.E., La Rinconada, Sevilla, 41300, Spain

v1
First published: 27 Mar 2025, 5:84 https://doi.org/10.12688/openreseurope.19770.1
Latest published: 27 Mar 2025, 5:84 https://doi.org/10.12688/openreseurope.19770.1

Open Peer Review
Approval Status (version 1, 27 Mar 2025): reviewers 1 and 2
1. Atwakyire Moses, Kabale University, Kabale, Uganda
2. Leonardo Bertolucci Coelho, Université Libre de Bruxelles, Bruxelles, Belgium
Any reports and responses or comments on the article can be found at the end of the article.

Abstract

Background

Industrial assets and infrastructure are in constant danger from atmospheric corrosion, which affects economies globally. However, there is no consistent, comprehensive dataset that covers the full range of this problem across diverse climates and locations. The purpose of the research is to evaluate the factors that contribute to atmospheric corrosion and its diverse effects on materials in various environments.

Methods

A comprehensive dataset was created by collecting and standardizing corrosion data from diverse environments and geographic regions. An initial analysis of the data indicated the main parameters affecting corrosion, which guided the selection of features for further modeling. Several machine learning algorithms were tested for their ability to predict corrosion rates, including linear regression, decision trees, neural networks and, most notably, ensemble methods. The models were assessed on prediction accuracy and computational efficiency, with special attention to refining their performance through detailed feature engineering and hyperparameter adjustment.

Results

Compared against conventional predictive models, the machine learning approaches, especially random forest ensemble methods, predicted corrosion rates well and improved significantly on the capabilities of the conventional models. The comparison of the different machine learning approaches also made clear that their accuracy depends on selecting the best features and tuning each model.

Conclusions

This work represents a significant advancement in the predictive
modeling of atmospheric corrosion. It highlights the invaluable role of
machine learning in this field. By integrating varied data sets and
applying sophisticated machine learning techniques, it has
established a foundation for ongoing research and the practical
application of corrosion management strategies. The exceptional
performance of ensemble methods, like random forests, signals their
potential to improve prediction capabilities and guide more effective
corrosion prevention measures.

Plain Language Summary

This paper addresses the issue of atmospheric corrosion, a naturally occurring phenomenon that gradually deteriorates materials exposed to air, jeopardising industrial facilities, bridges, and buildings. One of the most significant obstacles in capturing and forecasting this type of corrosion is a lack of structured data. To help with this, we collected and standardised corrosion data from several locations and environmental scenarios to create a more comprehensive and meaningful dataset.

This dataset enabled us to study which factors contribute the most to corrosion and informed our selection of relevant data for constructing prediction models. We then experimented with a variety of machine learning algorithms (computer-based methods for detecting patterns in data), ranging from simple linear regression to more complicated decision trees and neural networks. After comparing their accuracy and efficiency, we found that some approaches, particularly "ensemble methods" that combine multiple models, such as random forests, performed best in predicting how quickly materials would corrode.

We also found that carefully refining the data inputs and adjusting the model settings allows these models to provide more accurate predictions. This study not only reveals which machine learning algorithms are best suited for corrosion prediction, but it also lays the framework for better monitoring and protection of materials exposed to the atmosphere. This work promotes the overall objective of minimising material degradation and ensuring the longevity of critical infrastructure by combining data from numerous sources and employing a variety of analytical techniques.


Keywords
Atmospheric corrosion prediction, grid search, hyperparameter
tuning, corrosion dataset, machine learning algorithms

This article is included in the Horizon 2020 gateway.

Corresponding author: Jose Manuel Perales Fernández ([email protected])


Author roles: Perales Fernández JM: Conceptualization, Data Curation, Formal Analysis, Methodology, Resources, Software, Validation,
Visualization, Writing – Original Draft Preparation, Writing – Review & Editing; López Abelairas M: Writing – Original Draft Preparation,
Writing – Review & Editing; Sánchez-Ramos A: Writing – Original Draft Preparation, Writing – Review & Editing; Otero-Gonzalez L:
Conceptualization, Formal Analysis, Funding Acquisition, Methodology, Project Administration, Supervision, Writing – Original Draft
Preparation, Writing – Review & Editing
Competing interests: No competing interests were disclosed.
Grant information: The project has received funding from the European Union's Horizon 2020 Research and Innovation Programme
under grant agreement No 952960 (MAterials solutions for cost Reduction and Extended service life on WIND off-shore facilities
[MAREWIND Project]).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2025 Perales Fernández JM et al. This is an open access article distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
How to cite this article: Perales Fernández JM, López Abelairas M, Sánchez-Ramos A and Otero-Gonzalez L. Development and
comparative analysis of machine learning algorithms for predictive atmospheric corrosion modeling [version 1; peer review: 2
approved with reservations] Open Research Europe 2025, 5:84 https://doi.org/10.12688/openreseurope.19770.1
First published: 27 Mar 2025, 5:84 https://doi.org/10.12688/openreseurope.19770.1

Nomenclature

A       Parameter obtained from least-squares method for Klinesmith’s model
B       Parameter obtained from least-squares method for Klinesmith’s model
bi      Regression coefficients
C       TOW mean environmental parameter in Klinesmith’s model
Cl⁻     Chloride concentration
CR      Corrosion depth
D       Parameter obtained from least-squares method for Klinesmith’s model
D       Corrosion depth in µm for ISO 9224 model
E       SO₂ environmental parameter in Klinesmith’s model
e       Euler’s number (value: 2.71828)
F       Parameter obtained from least-squares method for Klinesmith’s model
G       Cl⁻ environmental parameter in Klinesmith’s model
h       Metal-environment-specific time exponent in ISO 9224 model
H       Parameter obtained from least-squares method for Klinesmith’s model
J       Parameter obtained from least-squares method for Klinesmith’s model
n       Number of observations
R²      Coefficient of Determination
Rcorr   Corrosion rate in ISO 9224 model
RMS     Root Mean Square Error
SO₂     Sulphur dioxide concentration
t       Exposure time
T       Temperature
T0      Mean temperature in Klinesmith model
TOW     Time of wetness
y       Dependent variable (corrosion depth)
ŷi      Predicted value for the i-th observation
ȳ       Mean of the actual values
yi      Actual value for the i-th observation

1 Introduction

Steel and other metallic alloys provide excellent mechanical qualities while remaining reasonably affordable. However, they have a significant weakness: they are susceptible to corrosion. This phenomenon, defined as the progressive deterioration of materials produced by chemical, electrochemical, or other reactions with their environment, is a critical issue in a wide range of industries, including manufacturing, infrastructure, and transportation [1]. Its economic impact is enormous, requiring costly repairs, maintenance, and replacements. Furthermore, corrosion can compromise the structural integrity and safety of infrastructures, posing significant risks. Understanding and preventing corrosion is, therefore, critical to assuring the durability and reliability of materials in a wide variety of applications.

Depending on the environmental conditions and the material qualities, several mechanisms of corrosion may occur, namely electrochemical corrosion [2], galvanic corrosion [3], microbiologically influenced corrosion (MIC) [4] and passivation [5], demonstrating the complexity of the effect. To understand these phenomena in depth, modelling has become a pivotal approach to studying and predicting corrosion behavior under different circumstances. These techniques have experienced notable advancements over the decades, bringing unique insights and tools to the field.

Initially, analytical models such as Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) played a central role. In the case of FEA, a computational method solves equations governing ion movement and electrochemical reactions, providing detailed spatial and temporal insights into corrosion processes. Studies such as the one by Izquierdo et al. [6] have shown how FEA can mimic the intricate details of corrosion progression, offering a profound understanding of material degradation over time. Similarly, CFD models have been invaluable in examining how corrosive agents like oxygen and ions travel within fluids and influence corrosion rates and distribution. These models help to predict the impact of fluid dynamics on corrosion in environments like pipelines [7] and offshore structures, and to compute corrosion rates and distribution. Another mechanistic approach, the phase field method, is used to simulate phase transitions and microstructural evolution in materials, providing a microstructural perspective on corrosion processes [8]. Additionally, atomic-scale modelling, including molecular dynamics and density functional theory (DFT), offers insights into the fundamental atomic-scale processes that drive corrosion, such as adsorption, electron transfer, and surface reactions [9].

However, mechanistic and first-principles models have certain limitations that make them less practical for large-scale predictive applications. These models are highly computationally demanding, often requiring significant processing power and time to generate results. Additionally, they are not inherently predictive, meaning they are limited in their ability to forecast corrosion behavior in new scenarios without prior detailed knowledge of the specific corrosion process and material properties. Furthermore, their focus is often on microscopic scales, which may not translate well to insights applicable to large, real-world infrastructures. These models also struggle to handle large amounts of complex data, making them less suited for big data applications. Consequently, these drawbacks


highlight the potential for machine learning models, which can address these challenges by offering scalable, data-driven solutions that are capable of predicting corrosion outcomes without the same level of computational effort or domain-specific knowledge.

Probabilistic models are strong tools to address the uncertainties inherent to corrosion processes. These models consider variability in material properties, environmental conditions, and other factors, providing a probabilistic assessment of corrosion risk and remaining service life [10]. The Monte Carlo simulation is a prominent example of this approach. By incorporating a range of environmental variations and material diversities, Monte Carlo simulations offer a comprehensive view of corrosion behavior, as in the model proposed by Engelhardt et al. [11]. These probabilistic models have been crucial in industries where understanding and mitigating the risk of corrosion is essential for safety and economic efficiency. Their main disadvantage is the long computation time required by more complex models [12].

In recent years, the probabilistic approach has evolved into data-driven and machine learning (ML) models, introducing a promising new approach to corrosion modelling. Leveraging extensive datasets, ML models can recognize intricate patterns related to corrosion, make accurate predictions about corrosion rates, and fine-tune strategies for prevention and mitigation. Models such as the one from De Masi et al. [13] have demonstrated the significant potential of ML in this field. This approach not only enhances the accuracy of corrosion predictions compared to previous models but also allows for real-time monitoring and adaptive maintenance strategies, which makes it well suited to forecasting corrosion and enabling effective control measures.

Following this line, the present work studies a set of ML algorithms for understanding and predicting corrosion processes for offshore conditions based on several environmental parameters. This work focuses specifically on atmospheric corrosion as one of the main problems that must be understood in order to reduce the economic impact of replacing damaged materials and extend the lifespan of structures. To achieve this, the corrosion depth suffered by steel in atmospheric conditions will be modelled based on environmental parameters [14] such as temperature (T), relative humidity (RH), time of wetness (TOW), sulphur dioxide deposition rate (SO₂), chloride deposition rate (Cl⁻), and exposure time. The techniques used represent a broad range of ML approaches of various levels of complexity, intended to anticipate corrosion precisely under a wide variety of climate circumstances and variables. The findings of this study are offered to help design more effective corrosion prevention and mitigation strategies.

2 Methods

For the development of the ML algorithms, a work methodology divided into 7 steps was established (Figure S1 from Extended data [15]). The first step focused on the construction of the dataset (Data collection). This involved searching the literature to collect the largest amount of data related to environmental corrosion and constructing one unified dataset. The second step focused on data preprocessing (Data preprocessing). Third, the dataset was separated into different sub-datasets: one for training, another for validation and a final one for testing (Data splitting). The fourth step consisted of the development of the algorithm architecture by testing different ML algorithms (Model development). Additionally, conventional models were developed in parallel for later comparison (Conventional models). Fifth, the architecture of each ML model was calibrated by evaluating accuracy metrics and rebuilding the model with new model parameters (Calibration). Sixth, the results obtained were analysed, looking for the model with the best predictive capacity (Analysis of results). Finally, the model was implemented and tested, comparing its predictive capacity with conventional models (Model implementation).

2.1 Data collection

2.1.1 Literature search

Data were collected by scouring academic journals, research papers, industry reports, and other publications related to corrosion. These publications cover a broad spectrum of metals and environmental conditions, making the dataset diverse and valuable. In this study, carbon steel samples were selected.

The first step was to identify the parameters reported in each publication, to get an overview of the kind of data to expect: T in degrees, RH in %, TOW in hours per year, SO₂ in mg/m²·d, Cl⁻ in mg/m²·d, rain precipitation (P) in mm of water, and time in years. Corrosion measurements were usually given in µm or µm/year, depending on whether the data represented a time series or just a single measurement after 1 year.

After that, rows with formal defects or duplicates were located and removed. As the corrosion study is centred on offshore conditions, corrosion categories C5 and CX were inspected following ISO 12944-2 guidelines [16]. All the information collected for the dataset construction and the parameters involved are detailed in Table S1 from Extended data [15].

2.1.2 Construction of the dataset

To construct the final dataset, several candidate datasets were first built with different configurations of input parameters (T, RH, TOW, SO₂, Cl⁻, P and time). For each one, duplicates were removed and the total amount of data and the percentage of C5-CX corrosion category data were recorded; the resulting configurations are shown in Table 1. Most of the corrosion data collected refers only to the first year of corrosion, but some publications reported time series of up to 12 years, which were also included by treating the year as an additional input. After evaluating the constructed datasets, it was found that the percentage of C5-CX corrosion data was similar in all datasets, and therefore the largest dataset (Dataset 4) was selected for the subsequent model development. Based on this dataset, T, TOW, SO₂ deposition rate, Cl⁻ deposition rate and time of exposure (in years) were considered as input parameters,


Table 1. Overall description of the constructed datasets: the input parameters each dataset contains (among T, RH, TOW, SO₂, Cl⁻, P and time, one X per parameter included), the total dataset size (number of rows), and the number and percentage of rows corresponding to corrosivity categories C5-CX according to ISO 12944-2 guidelines.

Dataset      Inputs included    Total size    C5-CX    % C5-CX
Dataset 1    X X X X X X        198           27       13.64
Dataset 2    X X X X X          243           32       13.17
Dataset 3    X X X X            595           67       11.26
Dataset 4    X X X X X          816           99       12.13
Dataset 5    X X X              662           72       10.88
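
As a quick arithmetic check on Table 1, the C5-CX percentages can be recomputed from the row counts. The sketch below is illustrative only and is not the authors' code: the names `datasets` and `c5cx_percentage` are introduced here as assumptions, and the numbers are copied from the table.

```python
# Row counts copied from Table 1: (total rows, rows in categories C5-CX).
# The dictionary and helper names are hypothetical, chosen for this sketch.
datasets = {
    "Dataset 1": (198, 27),
    "Dataset 2": (243, 32),
    "Dataset 3": (595, 67),
    "Dataset 4": (816, 99),
    "Dataset 5": (662, 72),
}


def c5cx_percentage(total_rows, c5cx_rows):
    """Return the percentage of rows in categories C5-CX, to two decimals."""
    return round(100.0 * c5cx_rows / total_rows, 2)


# Recompute the last column of Table 1 for every dataset.
percentages = {
    name: c5cx_percentage(total, c5cx)
    for name, (total, c5cx) in datasets.items()
}

# Pick the dataset with the most rows (the selection criterion in the text).
largest = max(datasets, key=lambda name: datasets[name][0])

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% C5-CX")
print("Largest dataset:", largest)
```

The recomputed shares lie between roughly 10.9% and 13.6%, which agrees with the observation that the C5-CX proportion is similar across datasets, so dataset size becomes the deciding criterion.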

while the corrosion depth (CR, in µm) was considered as the output parameter.

This dataset of 816 records is composed of 180 records from the MICAT project (130 records obtained from Pintos et al. [14], 46 records from Chico et al. [17] and 4 records from Panchenko et al. [18]), 190 records from the ISOCORRAG project (Chico et al. [17]), 395 records obtained from Cai et al. [19], 6 records from Castaño et al. [20], 7 records from Hou et al. [21], 25 records from the E-Asia projects (Table IV-3 from To et al. [22]) and 13 records from EFC [23]. The data from references [24], [25] and [26] listed in Table S1 from Extended data [15] were ultimately not included in the final dataset.

2.2 Preprocessing

2.2.1 Data cleaning and filtering process

A comprehensive data preprocessing procedure involving data cleaning and filtering steps was put in place. The primary target was to ensure the integrity and reliability of the dataset for subsequent analysis. As a critical phase of this process, a statistical analysis aimed at detecting the presence of outliers within the dataset was conducted. To achieve this, the Isolation Forest [27] algorithm was used as the first step.

After the data preprocessing phase (cleaning), the statistical analysis consisted of an examination of various statistical parameters, including mean, standard deviation (std), and percentiles. To further explore the relationships among the model parameters, a Pearson’s correlation matrix was constructed (Figure S2 from Extended data [15]).

2.3 Partition of the dataset in training, validation and testing sub-datasets

2.3.1 Technique employed on the dataset partition

Dividing the dataset into training, validation, and testing sub-datasets is essential for developing robust, reliable, and generalizable ML algorithms. It supports model development and hyperparameter tuning, prevents overfitting, and allows for unbiased performance assessment. The train_test_split [28] function of the Python sklearn library was used to randomly divide the full dataset into two sub-datasets. To verify that the data distribution was adequate, the algorithm was executed 100 times. In each iteration, it was verified that both sub-datasets reflected the same distribution of exposure years. This verification showed that, each time and for each exposure time (1 year, 2 years, etc.), the same proportion of rows (80% and 20% for each exposure time) was distributed between the two sub-datasets, with different rows selected each time (random splitting). The train_test_split function was applied to split the overall dataset into one sub-dataset containing the test data (test dataset) and another sub-dataset containing the training and validation data together (training+validation dataset), because during model development the training+validation dataset was randomly divided several times into training and validation datasets. The test dataset was left untouched to test the developed models once optimized.

2.4 Model development and architectures

The development of the ML model focused on an analysis of 6 different types of algorithms to evaluate which one achieves the best fit and predictive capacity. The algorithms were applied in increasing order of complexity, starting from the simplest one and progressing to the most complex one. The objective of developing several algorithms rather than focusing on just one was to avoid any assumption that could result in an algorithm that was not the best one achievable with the means available.

2.4.1 Methods employed for definition of the architecture of each model

Multiple Linear Regression

The first algorithm analysed was multiple linear regression (MLR) [29]. MLR is a statistical technique used to analyse the relationship between a dependent variable, in this case CR, and two or more independent variables such as T, TOW, RH, SO₂, Cl⁻ and exposure time. The goal of this analysis was to understand the relationship between the dependent variable and the independent variables and to use this relationship for predicting the CR based on the environmental features.

For modelling the MLR, the data must first be normalized. We chose min-max normalization. Min-max normalization is


a data preprocessing technique used to scale the data values to a range of 0 to 1. To apply min-max normalization to the variables, the minimum and maximum values of each variable were found and then the following formula was applied:

$$x_{\mathrm{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where x is the original value, and x_min and x_max are the minimum and maximum values of the variable.

The next step was to build a regression model by fitting the selected independent variables to the dependent variable. The equation for the regression model is the following one:

$$y = b_0 + b_1 T + b_2\,\mathrm{TOW} + b_3\,\mathrm{SO_2} + b_4\,\mathrm{Cl^-} + b_5 t \qquad (2)$$

where y is the dependent variable (CR); b0 is the intercept; b1, b2, b3, b4 and b5 are the regression coefficients for each independent variable (environmental features); T, TOW, SO2, Cl- and t are the normalized independent variables (environmental features). To remove the variance introduced by the random partition of the data into a training set and a validation set, the model was analysed 100 times, and the accuracy parameters were averaged for later comparison with the other models.

Polynomial Regression
The subsequent algorithm under investigation was polynomial regression (PR) [30], a widely employed technique in ML designed to establish nonlinear relationships between dependent and independent variables. In this context, it focused on building a polynomial regression model tailored to predict CR. The same normalization method applied in the MLR was used. A critical decision was the degree of the polynomial. This investigation was initiated by fitting a polynomial model of degree 2 and systematically increasing the degree to 4 to achieve an optimal fit for our dataset. For each polynomial degree, the dataset was partitioned into distinct training and validation sets. The training set served for model training, while the validation set assessed the model's performance. Employing the same criteria as in the MLR context to mitigate dataset split variance, 100 distinct analyses were executed for each polynomial degree.

Decision Tree Regressor
The third algorithm under investigation was the decision tree regressor (DTR) [31]. DTR is a hierarchical model that employs a tree-like structure to make decisions regarding specific outcomes. Each node within the tree corresponds to a decision based on a particular feature or variable. As the tree expands and branches, each decision becomes increasingly specific, ultimately culminating in a prediction concerning the outcome.

To initiate the development of a DTR model, the first step entailed dividing the data into a training set and a validation set. In this case, data normalization was not applied, since a tree's prediction takes discrete (piecewise-constant) values even though the target is continuous. The training set serves as the foundation for model construction, while the validation set assesses the model's accuracy. To further enhance the model's robustness and generalizability to new data, a cross-validation [32] technique was implemented, dividing the dataset into five parts.

Subsequently, the determination of hyperparameters for the DTR model was carried out, with a primary focus on two key hyperparameters: i) the maximum depth of the tree and ii) the minimum number of samples required to split a node (Table 2). The maximum depth of the tree dictates how deeply the tree can extend before ceasing to branch further. The minimum number of samples required to split a node is the threshold below which a node is no longer split. The grid search algorithm was executed 100 times to identify optimal hyperparameter values within a defined value range (Table 2).

Table 2. DTR hyperparameters studied.

Hyperparameter               Range of values studied
Maximum depth                1–10
Minimum number of samples    10–60 (step = 10)

Random Forest Regressor
The fourth algorithm analysed was random forest regressor (RFR) [33]. In RFR, several decision trees are built, and their outputs are combined to obtain the final prediction. The number of trees used in the random forest is determined by the hyperparameter number of estimators. Three hyperparameters were tuned in the RFR model (Table 3): i) maximum tree depth, ii) number of estimators, and iii) minimum number of samples required to split an internal node. The ranges of values studied are listed in Table 3. The maximum tree depth controls the depth of the individual decision trees being built, which helps to reduce overfitting. The number of estimators controls the number of decision trees to be built, which can improve the model's performance. The minimum number of samples required to split an internal node helps to prevent further splitting in some of the nodes and aids in reducing the model's computational burden. A cross-validation technique was used for finding the best hyperparameter combination. The grid search algorithm was run 100 times with training and validation datasets that were split randomly from the main dataset each time, applying five-fold cross-validation in each iteration to minimize the influence of the training data on the model architecture.

Table 3. RFR hyperparameters studied.

Hyperparameter               Range of values studied
Maximum depth                1–20
Minimum number of samples    5–60
Number of estimators         20–120

Support Vector Regressor
The fifth algorithm analyzed was support vector regressor (SVR) [34]. The SVR is a type of supervised ML model used in regression tasks. It is used to predict continuous output variables using input features. The goal of SVR is to find a regression function that minimizes the error between predicted values and real ones with a margin of error (epsilon).

The development of an SVR involves selecting the kernel, the C factor, gamma, and epsilon (Table 4). The kernel function maps the input data into a higher-dimensional space in which a linear regression function can be fitted. The C factor is a regularization parameter that controls the trade-off between model complexity and training error, that is, how strongly errors are penalized during training. A smaller C value gives a softer margin that tolerates more training points lying outside the epsilon zone, while a larger C value gives a harder margin that penalizes such deviations more heavily. Gamma defines the shape of the fitted function by controlling the reach of the influence of a single training example. A smaller gamma value gives each example a wider influence and results in a smoother function with lower curvature, whereas a larger gamma value restricts the influence to nearby points and results in higher curvature. The epsilon parameter defines the insensitive zone around the regression function, within which errors are not penalized. The selection of the epsilon value depends on the tolerance level for the error.

The same normalization method as for MLR and PR algorithms was applied to the data. To optimize the model performance, the hyperparameters were tuned using cross-validation techniques. In this case, the computational analysis time was too high to repeat the grid search as many times as for the previous models, so it was applied only 5 times to find the best hyperparameter combination. The grid search algorithm parameters are listed in Table 4.

Table 4. SVR parameters studied.

Hyperparameter    Values studied
Kernel            Radial basis function, sigmoid
C                 1–10
Gamma             0.01–10
Epsilon           0.01–0.5

Multi-Layer Perceptron Regressor
The sixth and last algorithm analyzed was the multi-layer perceptron regressor (MLPR) [14]. The MLPR neural network is a type of feedforward artificial neural network (ANN) that consists of multiple layers of perceptrons (also known as nodes or artificial neurons) arranged in a series of interconnected layers. Each perceptron receives input signals, processes the information, and produces output signals that become inputs to the perceptrons of the next layer until the final output is produced.

The data was normalized employing the Min-Max normalization. Next, the design of the neural network architecture was carried out through a grid search algorithm applying cross-validation. The tuned hyperparameters were activation function, learning rate, alpha, solver and the number of hidden layers (Table 5). The activation function introduces non-linearity into the model. The learning rate setting determines how fast the model learns from the training data (the values in Table 5 are learning-rate schedules). The alpha value controls overfitting through L2 regularization [35]. The solver parameter specifies the optimization algorithm used in the model. Finally, the number of hidden layers determines the complexity and capacity of the neural network regressor. For this case, the computational analysis time was the highest, so the grid search algorithm was run just one time. The values studied by the search algorithm are listed in Table 5.

Table 5. MLPR parameters studied (Lbfgs: Limited-memory Broyden–Fletcher–Goldfarb–Shanno; Sgd: Stochastic gradient descent).

Hyperparameter         Values studied
Activation function    identity, logistic, tanh, rectified linear unit
Learning rate          constant, adaptive, invscaling
Alpha                  0.001–0.11
Solver                 adam, lbfgs, sgd
Hidden layers          (10,10,10) to (20,80,30), step per layer (2,10,2)

2.4.2 Description of the metrics employed for tuning the hyperparameters of each model
The different combinations of hyperparameters of each algorithm were evaluated using the statistical indicators of root mean square error (RMSE) and R². The hyperparameter configurations with the lowest RMSE and highest R² were chosen for each ML algorithm evaluated. R² (Equation 3) is a statistical measure that quantifies the proportion of the variance in the dependent variable (target) that can be explained by the independent variables used in the model. It offers a simple and interpretable metric to gauge how well the model captures the variability of the target.

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \qquad (3)$$

RMSE (Equation 4) represents the dispersion or spread of prediction errors, which are essentially the deviations of data points from the regression line. In essence, RMSE provides insights into the degree of data clustering around the optimal fit line, indicating how closely the data conforms to this line.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \qquad (4)$$

Here y_i is the real output value, ŷ_i is the predicted output value, ȳ is the mean output value and n is the total number of rows.
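Equations 1, 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: the function names and the sample values are hypothetical, and the handling of a constant feature in the scaler is an assumption that the article does not discuss.

```python
import math


def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] (Equation 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def r_squared(y_true, y_pred):
    """Coefficient of determination (Equation 3)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error (Equation 4)."""
    n = len(y_true)
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / n)


# Hypothetical corrosion values in µm, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(min_max_scale(observed))
print(r_squared(observed, predicted))
print(rmse(observed, predicted))
```

Note that R² as defined in Equation 3 becomes negative whenever a model predicts worse than the mean of the observations, so it is not bounded below by zero.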


3 Results and discussion
3.1 Data preprocessing
The results of the outlier detection process were compared with previous research findings (Table S1 from Extended data [15]) that had identified and examined outliers within each specific dataset. This comparative analysis was carried out to determine whether it was appropriate to retain or exclude these identified outliers. After carefully comparing each row with its corresponding publication context, it was decided to retain these outliers within the dataset since the previous publications had already identified and resolved any inconsistencies in the data. Consequently, although the Isolation Forest algorithm classified 121 data points as statistical outliers, these anomalies seemed to be artifacts stemming from the inherent complexities in the data derived from various sources and datasets.

The descriptive statistics (Table 6) revealed an interesting observation concerning the CR parameter, which is expressed in µm. At first glance, the maximum CR value (1,804.4 µm) seemed to deviate significantly from conventional statistical indicators (Q3 = 69.7 µm), suggesting it might be an outlier. However, the examination using the Isolation Forest algorithm determined that this value did not meet the criteria for an outlier. This finding held true for other dataset parameters, except for temperature, where all statistical indicators consistently aligned.

A correlation analysis was performed by building a Pearson's correlation matrix (Figure S2 from Extended data [15]). It was observed that there was a moderate correlation (0.4) between TOW and temperature, as well as a similar moderate correlation (0.45) between time and corrosion.

3.2 Evaluation of the developed models
3.2.1 Model architectures
The models were developed sequentially in increasing complexity. The metrics used to compare the different models were R² and RMSE. The first model that was built was the MLR. The regression coefficients were b0 = -0.0809175; b1 = 0.0945785; b2 = 0.153686; b3 = 0.287786; b4 = 0.56445; b5 = 0.276023. After building the model, an inverse normalization was applied to obtain absolute errors. The accuracy of the MLR based on the regression coefficients obtained is described in Table 7.

For the PR algorithm, the polynomial degree that best fits the data was evaluated first. Degrees from 2 to 4 were considered for development of the PR model architecture. Figure S3 from Extended data [15] shows how increasing the polynomial degree increases the RMSE in the validation dataset while it remains relatively constant in the training dataset. This is probably due to overfitting to the training dataset at higher polynomial degrees, implying a better model fit to the training dataset but less predictive power once applied to the validation dataset, resulting in such high RMSE values. Therefore, the best predictive configuration is obtained for a polynomial of degree 2. Table 7 describes the accuracy metrics obtained in the PR model with degree 2 as a fixed hyperparameter. Compared with the MLR validation (R² = 0.47 and RMSE of 58), the accuracy metrics of the PR are better (R² = 0.61 and RMSE of 46), which is expected as PR models can better capture non-linear behaviour.

The best fit values obtained from the grid search algorithm for developing the DTR model are listed in Table 8. The results indicate that the optimal maximum depth of the tree is 8 and the minimum number of samples evaluated in each node is 10. To decide which hyperparameter values to adopt, the grid search algorithm was launched 100 times and the mode of the hyperparameter values was chosen.

The DTR model was built with the obtained optimal hyperparameters. The accuracy metrics of the model are listed in Table 7. When compared to the PR model, the DTR model fitted the training data better with an R² of 0.85 (compared with an R² of 0.65 for the PR model). However, the predictive capability of the DTR model was worse than that of the PR

Table 6. Summary of descriptive statistics of the full dataset (dataset 4) used for model development.

Index    Temperature (ºC)    TOW (h/year)    SO2 (mg/m2·d)    Cl- (mg/m2·d)    Time (years)    CR (µm)
count    816.00              816.00          816.00           816.00           816.00          816.00
mean     13.93               3,862.11        21.54            32.52            1.95            64.74
std      7.19                1,414.76        28.10            82.70            2.03            98.26
min      -3.10               26.28           0.00             0.00             1.00            1.00
Q1       7.28                3,055.05        4.20             1.50             1.00            22.98
Q2       13.34               3,766.80        11.00            9.00             1.00            37.10
Q3       18.02               4,857.53        26.00            30.20            2.00            69.70
max      29.30               8,760.00        171.68           1,300.00         12.00           1,804.40


Table 7. Summary of model architectures development.

Model    R² training    R² validation    RMSE training    RMSE validation    Data normalized    Number of hyperparameters evaluated
MLR      0.50           0.47             51               58                 yes                0
PR       0.65           0.61             43               46                 yes                1
DTR      0.85           0.43             29               45                 no                 2
RFR      0.89           0.70             23               36                 no                 3
SVR      0.79           0.72             33               39                 yes                4
MLPR     0.78           0.76             31               41                 yes                6

Table 8. Best fit values of the hyperparameters adjusted for the decision tree regression (DTR) model.

Hyperparameter               Best fit value
Maximum depth                8
Minimum number of samples    10

Table 9. Best fit values of the hyperparameters adjusted for the random forest regressor (RFR) model.

Hyperparameter               Best fit value
Maximum depth                15
Minimum number of samples    6
Number of estimators         30

model, as evidenced by an R² of only 0.43 when applied to the validation data. This behaviour is expected for DTR models because these tend to overfit to the training data, with a consequent reduction in the predictive capability when applied to a different dataset (e.g., when applied to the validation dataset) [36].

In the case of the RFR algorithm, three hyperparameters were studied. The RFR algorithm is composed of several decision trees, therefore two of the adjusted hyperparameters were the same as for the DTR model (maximum tree depth and minimum number of samples evaluated in each node). The additional hyperparameter in comparison with a single decision tree was the number of estimators (i.e., the number of decision trees). The RFR hyperparameter configuration obtained following the methodology described above is listed in Table 9.

The accuracy of the model with the architecture described is shown in Table 7, where it can be observed that the predictive capabilities of the RFR model improve on all the previous models. This is reflected in the accuracy metrics (R² and RMSE): R² is the third highest value of all the models (0.70 vs 0.72 and 0.76), indicating a good fit between real and predicted values, while RMSE, the main parameter considered for comparison of model improvement, is the lowest among all the models (RMSE = 36), indicating that the average error of the predicted values is the smallest. Figure S4 from Extended data [15] shows how well the predicted values represent the real ones. As an additional application, the RFR provides the degree of importance of each input variable to the model. In this case, it was found that for the dataset created, chloride deposition was the most relevant feature in the prediction of corrosion (Figure S5 from Extended data [15]).

For the SVR model, the grid search algorithm was run only 5 times due to the high computational time required for each run. Once the best fit values of the hyperparameters were found (Table 10), the metrics for these values were obtained and summarized in Table 7. In this case, the SVR model fits the training data (R² training = 0.79) worse than the RFR model (R² training = 0.89), but the predictive capabilities are better as shown by the better fit of the validation data (R² validation = 0.72 for SVR and R² validation = 0.70 for RFR). Figure S6 shows the agreement between real and predicted values.

The last ML model tested was MLPR. In this case, as mentioned in the methodology section, the computational time was too high to evaluate the best configuration multiple times with the grid search algorithm. Therefore, the grid search algorithm was run once (without repeating it on new random splits of the dataset, as was done before). Once the model architecture was decided (Table 11), the MLPR was trained and validated with the chosen hyperparameters.


Table 10. Best fit values of the hyperparameters adjusted for the support vector regressor (SVR) model.

Hyperparameter    Best fit value
Kernel            rbf
C                 6.00
Gamma             3.40
Epsilon           0.03

Table 11. Best fit values of the hyperparameters adjusted for the multi-layer perceptron regressor (MLPR) model.

Hyperparameter         Best fit value
Activation function    relu
Learning rate          constant
Alpha                  0.061
Solver                 lbfgs
Hidden layers          (18,50,24)

The metrics obtained from the MLPR simulation are described in Table 7. The accuracy of the MLPR model for predicting corrosion values was the highest among the tested ML algorithms (R² validation = 0.76), but the error obtained is slightly higher than that obtained with the SVR model (RMSE validation = 41 for MLPR and RMSE validation = 39 for SVR), as can be observed graphically in Figure S7 from Extended data [15], where a comparison between predicted and real values is shown.

3.2.2 Analysis of models' accuracy metrics
Table 7 summarizes the main characteristics of each of the developed ML models, as well as the computational times required each time any of them was evaluated. The results show that the algorithm that captures the overall trend in the data most effectively is MLPR (R² validation = 0.76), although its error is slightly higher than that of SVR and RFR (RMSE validation = 41 for MLPR, RMSE validation = 36 for RFR and RMSE validation = 39 for SVR), indicating that it is not the most precise model in terms of minimizing the prediction error. The ability of the RFR model to aggregate the results of multiple decision trees helped it strike a balance between fitting the training data and generalizing to the validation set, resulting in the lowest prediction error when compared with the other two (SVR and MLPR). Additionally, it can be observed how the computational analysis time of each of the algorithms increased as they became more complex.

3.3 Comparison with existing models
The collected data was used for calibrating existing corrosion models to check whether the ML models improve on their behavior. Two models were tested with the full dataset built for the ML algorithms. The first one was the model of ISO 9224 [16], which follows this relationship:

$$D = r_{\mathrm{corr}}\, t^{h} \qquad (5)$$

where D is the CR in µm, r_corr is the corrosion rate experienced for each environmental case in the first year expressed in µm/year, t is the time of exposure expressed in years and h is the metal-environment-specific time exponent calculated according to ISO 9224.

The h value obtained for the full dataset was 0.523, while a specific r_corr value was calculated for each dataset row. The accuracy metrics of the model were R² = -0.13 and RMSE = 80. From these results, it can be concluded that this model has no predictive capability with the current dataset, and that it can only be applied in specific conditions where the range of the values is more limited than in the present dataset.

The second model that was tested was Klinesmith's model [37]. In this model, the following corrosion relationship is fitted through the least-squares method [38]:

$$y = A \cdot t^{B} \cdot \left(\frac{\mathrm{TOW}}{C}\right)^{D} \cdot \left(1 + \frac{\mathrm{SO_2}}{E}\right)^{F} \cdot \left(1 + \frac{\mathrm{Cl^-}}{G}\right)^{H} \cdot e^{J (T + T_0)} \qquad (6)$$

where y is the CR in µm; C, E, G and T0 are the mean values of the associated environmental parameters (TOW, SO2, Cl- and T, respectively); and A, B, D, F, H and J are obtained from the least-squares method.

C, E, G and T0 values were 3862.11, 21.54, 32.52 and 13.93, respectively. In the case of A, B, D, F, H and J, the values obtained employing the least-squares method were 9.81, 0.58, 0.57, 0.44, 0.51 and 0.033, respectively. With all the parameters found, the accuracy metrics of Klinesmith's model were R² = 0.6 and RMSE = 47, which are similar to the validation accuracy metrics obtained for the PR model (R² = 0.61, RMSE = 46). Klinesmith's model can predict corrosion behavior, but it is a less powerful tool in comparison to the four best ML models (DTR, RFR, SVR and MLPR), which have better predictive capabilities than conventional models (see Table 7).

4 Conclusions
In this work, we have explored various levels of ML algorithms to address the problem of atmospheric corrosion considering environmental parameters, in order to determine which of


the models offers the best predictive capabilities. Starting with simpler models and progressing to more complex ones, RFR, SVR, and MLPR exhibited the best performance, with very similar accuracy metrics. Any of these models can reliably predict corrosion scenarios for environments ranging from C1 to C4 (which comprise the majority of the data) as well as C5-CX, though the latter only represents 12% of the dataset. The limited availability of data for C5-CX scenarios made these more challenging to model. Additionally, the RFR model revealed the importance of the environmental parameters on the corrosion behavior, with Cl- deposition being the most influential parameter in severe corrosion. This explains why offshore environments, where Cl- is more prevalent, experience more intense corrosion compared to onshore environments, where Cl- levels are lower.

Moreover, through an extensive literature review, we have compiled a comprehensive and curated dataset containing over 800 records of atmospheric steel corrosion under various environmental conditions, including T, TOW, SO2 deposition rate, Cl- deposition rate and exposure duration. This dataset will be available to researchers for future studies on corrosion.

Data availability
Restricted data access statement
Data sets that are not publicly available are not present in any form in the manuscript. The publications from which the dataset's data was obtained are under subscription. Journal policies restrict the publication of their data, not its use in the study. Access to the underlying data requires a paid subscription.

The dataset built for the study is fully described in Subsection 2.1.2 Construction of the Dataset to allow readers to identify data sets as similar to the analysis data as possible.

Underlying data
Zenodo: Development and Comparative Analysis of Machine Learning Algorithms for Predictive Atmospheric Corrosion Modeling: Extended Data [15].

The project contains the following underlying data:

1. Extended data.docx

Data are available under Creative Commons Attribution 4.0 International

Author contributions
The authors confirm contribution to the paper as follows: study conception and design: Lila Otero-Gonzalez, Jose Manuel Perales; data collection: Jose Manuel Perales Fernández; analysis and interpretation of results: Jose Manuel Perales Fernández, Lila Otero-Gonzalez; draft manuscript preparation: Jose Manuel Perales Fernández, Arturo Sánchez-Ramos, Maria Lopez Abelairas, Lila Otero-Gonzalez. All authors reviewed the results and approved the final version of the manuscript.

Acknowledgements
The authors would like to thank the MAREWIND project and its members for the opportunity to develop the corrosion modelling work and the experimental work, as well as for their contribution of resources, which greatly enhanced the success of the innovation work.

References

1. What is Corrosion? ECS, Accessed September 6, 2023.
2. Fontana MG, Greene ND: Corrosion engineering. 1967; Accessed September 6, 2023.
3. Tada E, Kaneko H: Galvanic corrosion of a Zn/steel couple in aqueous NaCl. ISIJ Int. 2011; 51(11): 1882–1889.
4. Telegdi J, Shaban A, Trif L: 8 - Microbiologically Influenced Corrosion (MIC). In: Trends in oil and gas corrosion research and technologies. A.M. El-Sherik, Ed., Woodhead Publishing Series in Energy. Woodhead Publishing, 2017; 191–214.
5. Arya SB, Joseph FJ: Chapter 3 - Electrochemical methods in tribocorrosion. In: Tribocorrosion. A. Siddaiah, R. Ramachandran, and P.L. Menezes, Eds., Academic Press, 2021; 43–77.
6. Izquierdo J, González-Marrero MB, Bozorg M, et al.: Multiscale electrochemical analysis of the corrosion of titanium and nitinol for implant applications. Electrochim Acta. 2016; 203: 366–378.
7. Redondo C, Modena M, Manzanero J, et al.: CFD–based erosion and corrosion modeling in pipelines using a high–order discontinuous Galerkin multiphase solver. Wear. 2021; 478–479(1): 203882.
8. Qin RS, Bhadeshia HK: Phase field method. Mater Sci Technol. 2010; 26(7): 803–811.
9. Obot IB, Macdonald DD, Gasem ZM: Density Functional Theory (DFT) as a powerful tool for designing new organic corrosion inhibitors. Part 1: an overview. Corros Sci. 2015; 99: 1–30.
10. Lampe J, Hamann R: Probabilistic model for corrosion degradation of tanker and bulk carrier. Mar Struct. 2018; 61: 309–325.
11. Engelhardt GR, Macdonald DD: Monte-Carlo simulation of pitting corrosion with a deterministic model for repassivation. IOPscience, J Electrochem Soc. 2020; 167(1): 013540.
12. Fujii T, Ogasawara N, Tohgo K, et al.: Monte Carlo simulation of stress corrosion cracking in welded metals. Int J Mech Sci. 2023; 257(1): 108561.


13. De Masi G, Gentile M, Vichi R, et al.: Machine learning approach to corrosion assessment in subsea pipelines. In: OCEANS-Genova. 2015; 1–6.
14. Pintos S, Queipo NV, Troconis de Rincón O, et al.: Artificial Neural Network modeling of atmospheric corrosion in the MICAT project. Corros Sci. 2000; 42(1): 35–52.
15. Perales JM, López Abelairas M, Sánchez-Ramos A, et al.: Development and comparative analysis of machine learning algorithms for predictive atmospheric corrosion modeling: extended data. [Data set], Zenodo. https://doi.org/10.5281/zenodo.14975371
16. ISO 9224:2012. ISO, Accessed October 5, 2023.
17. Chico B, de la Fuente D, Díaz I, et al.: Annual atmospheric corrosion of carbon steel worldwide. An integration of ISOCORRAG, ICP/UNECE and MICAT databases. Materials (Basel). 2017; 10(6): 601.
18. Panchenko YM, Marshakov AI: Prediction of first-year corrosion losses of carbon steel and zinc in continental regions. Materials (Basel). 2017; 10(4): 422.
19. Cai J, Cottis RA, Lyon SB: Phenomenological modelling of atmospheric corrosion using an Artificial Neural Network. Corros Sci. 1999; 41(10): 2001–2030.
20. Castaño JG, Botero CA, Restrepo AH, et al.: Atmospheric corrosion of carbon steel in Colombia. Corros Sci. 2010; 52(1): 216–223.
21. Hou W, Liang C: Eight-year atmospheric corrosion exposure of steels in China. CORROSION. 1999; 55(1): 65–73.
22. To D: Evaluation of atmospheric corrosion in steels for corrosion mapping in Asia. PhD thesis, Yokohama National University, 2017; Accessed: May 12, 2022.
23. European Federation of Corrosion: Exposure site catalogue. published online, 2021.
24. Santana JJ, Ramos A, Rodriguez-Gonzalez A, et al.: Shortcomings of International Standard ISO 9223 for the classification, determination, and estimation of atmosphere corrosivities in subtropical archipelagic conditions—the case of the Canary Islands (Spain). Metals. 2019; 9(10): 1105.
25. Corvo F, Haces C, Betancourt N: Atmospheric corrosivity in the Caribbean area. Corros Sci. 1997; 39(5): 823–833.
26. Garita L, Yañez J, Robles J: Modelado de la velocidad de corrosion atmosférica de acero de baja aleación en Costa Rica. Ingeniería. 2014; 24(2): 79–90.
27. Mavuduru A: How to perform anomaly detection with the Isolation Forest algorithm. Medium. Apr. 8, 2022; Accessed: Sep. 4, 2023.
28. sklearn.model_selection.train_test_split. scikit-learn. Accessed: Sep. 4, 2023.
29. About linear regression. IBM, Accessed: Sep. 4, 2023.
30. Agarwal A: Polynomial Regression. Medium. Oct. 8, 2018; Accessed: Sep. 4, 2023.
31. Li Q, Wang J, Wang K, et al.: Determination of corrosion types from electrochemical noise by gradient boosting decision tree method. Int J Electrochem Sci. 2019; 14(2): 1516–1528.
32. Gupta P: Cross-validation in machine learning. Medium. Jun. 5, 2017; Accessed: Sep. 4, 2023.
33. Yan L, Diao Y, Gao K: Analysis of environmental factors affecting the atmospheric corrosion rate of low-alloy steel using random forest-based models. Materials (Basel). 2020; 13(15): 3266.
34. Lv Y, Wang J, Wang J, et al.: Steel corrosion prediction based on support vector machines. Chaos Solitons Fractals. 2020; 136: 109807.
35. L1 and L2 regularization methods, explained. Built In. Accessed: Sep. 4, 2023.
36. Decision Trees: complete guide to decision tree analysis. Explorium. Dec. 10, 2019; Accessed: Oct. 4, 2023.
37. Klinesmith DE, McCuen RH, Albrecht P: Effect of environmental conditions on corrosion rates. J Mater Civ Eng. 2007; 19(2).
38. Teunissen P: Nonlinear least squares. published online, 1990; Accessed: Oct. 5, 2023.


Open Peer Review


Current Peer Review Status:

Version 1

Reviewer Report 18 April 2025

https://doi.org/10.21956/openreseurope.21388.r52605

© 2025 Coelho L. This is an open access peer review report distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

Leonardo Bertolucci Coelho
Université Libre de Bruxelles, Bruxelles, Belgium

Strengths:
○ Comprehensive standardized dataset from diverse sources
○ Systematic evaluation of multiple ML algorithms
○ Clear methodology for model development and validation
○ Practical focus on corrosion prediction
○ Effective comparison with conventional models
Major Comments:
1. Literature Integration: Incorporate more recent peer-reviewed studies on ML for corrosion
prediction, especially those related to similar conditions or methods.
2. Statistical Analysis: Add formal statistical tests (e.g., p-values or confidence intervals) to
support model performance comparisons.
3. Performance at High Corrosion Rates: Directly address decreased accuracy at higher
corrosion rates, discuss possible reasons, and suggest mitigation strategies relevant for
offshore environments.
4. Dataset Limitations: Clearly state the limitation regarding low representation of C5-CX
categories (12% of the dataset) and propose strategies for addressing this in future work.
5. Data Availability: Provide clearer details or justifications on dataset accessibility for
reproducibility purposes.
Minor Comments:
○ Correct typographical error "attempting methods" in Methods section.
○ Expand discussion on RFR feature importance, particularly regarding chloride deposition.
○ Ensure consistency in formatting tables and figures.
○ Clarify dataset preprocessing steps, specifically handling of units and measurement
methods.
○ Include a brief discussion on computational efficiency for practical application.
Conclusion: The manuscript offers valuable insights into ML-based corrosion prediction.
Addressing the suggested revisions, particularly statistical analysis, high corrosion rate
performance, and literature integration, will significantly enhance its suitability for indexing.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and does the work have academic merit?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Partly

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: corrosion, AI, machine learning

I confirm that I have read this submission and believe that I have an appropriate level of
expertise to confirm that it is of an acceptable scientific standard, however I have
significant reservations, as outlined above.

Reviewer Report 09 April 2025

https://doi.org/10.21956/openreseurope.21388.r52603

© 2025 Moses A. This is an open access peer review report distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.

Atwakyire Moses
Kabale University, Kabale, Uganda

Comment: 1
The manuscript addresses relevant literature to some extent but would benefit from deeper
integration of recent advancements in the field. In particular, the authors are encouraged to
incorporate findings from the study by Moses A et al. (2024; Ref. 1),
https://doi.org/10.1007/s11837-024-06674-4, which offers valuable insights that could enhance
both the depth and contextual relevance of the current work. Additionally, the reference titled
"What is Corrosion? ECS, Accessed September 6, 2023" is a general web source that lacks peer review
and scholarly rigor. Replacing it with the aforementioned peer-reviewed publication would
improve the manuscript's credibility and better align it with academic standards.

Comment: 2
While the manuscript includes mean values and standard deviations for mechanical tests, the use
of statistical significance testing (e.g., ANOVA or t-tests) is not explicitly mentioned. However, the
trends are clear, and the interpretation is logical and backed by the data. If feasible, the inclusion
of p-values or confidence intervals can strengthen the statistical robustness of the current work.
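
As an illustration of the kind of interval this comment asks for, the sketch below computes a paired percentile bootstrap confidence interval for the difference in RMSE between two models evaluated on the same held-out samples. It is not taken from the manuscript or from the reviewer: the function names, the synthetic data, and the choice of a percentile bootstrap are all assumptions made for the example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_rmse_diff_ci(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for RMSE(model A) - RMSE(model B).

    Both models are scored on the same resampled indices (paired design),
    so the interval describes the difference on identical test samples.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yt = [y_true[i] for i in idx]
        diffs.append(
            rmse(yt, [pred_a[i] for i in idx]) - rmse(yt, [pred_b[i] for i in idx])
        )
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical held-out data: synthetic values, NOT the manuscript's dataset.
data_rng = random.Random(42)
y_true = [data_rng.uniform(5, 80) for _ in range(200)]
pred_a = [y + data_rng.gauss(0, 3) for y in y_true]   # stand-in for a more accurate model
pred_b = [y + data_rng.gauss(0, 10) for y in y_true]  # stand-in for a less accurate model

lo, hi = bootstrap_rmse_diff_ci(y_true, pred_a, pred_b)
print(f"95% CI for RMSE(A) - RMSE(B): [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the two models differ in RMSE on this test set at the chosen level; an interval that straddles zero means the ranking is not supported by the data.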

Comment: 3
In the extended data file provided, the plot shows a general positive correlation between
predicted and actual corrosion depths, indicating that the model captures the overall trend.
However, there is a noticeable dispersion of data points around the best-fit line, particularly in the
mid-to-high corrosion depth range. This suggests the model is less accurate when predicting
higher corrosion levels. The presence of both under- and over-estimations (scattered points far
from the line) reveals that prediction errors are not uniform across the range, and this needs to
be resolved.
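
One way to quantify the non-uniform error described in this comment is to report RMSE separately for bins of the actual value. The sketch below is an assumption-laden illustration, not the authors' analysis: the helper `rmse_by_bin`, the bin edges, and the synthetic data (with noise that grows with the actual value) are all hypothetical.

```python
import math
import random


def rmse_by_bin(y_true, y_pred, edges):
    """Return {(lower, upper): (count, rmse)} for half-open bins [lower, upper)."""
    result = {}
    for lower, upper in zip(edges[:-1], edges[1:]):
        sq_errors = [
            (t - p) ** 2 for t, p in zip(y_true, y_pred) if lower <= t < upper
        ]
        if sq_errors:  # skip empty bins
            result[(lower, upper)] = (
                len(sq_errors),
                math.sqrt(sum(sq_errors) / len(sq_errors)),
            )
    return result


# Synthetic data, NOT from the manuscript: the noise standard deviation
# increases with the actual value to mimic the pattern the comment describes.
rng = random.Random(0)
y_true = [rng.uniform(0, 100) for _ in range(600)]
y_pred = [y + rng.gauss(0, 1 + 0.1 * y) for y in y_true]

edges = [0, 25, 50, 75, 100]  # hypothetical bin edges
table = rmse_by_bin(y_true, y_pred, edges)
for (lower, upper), (count, err) in table.items():
    print(f"[{lower:>3}, {upper:>3}): n={count:3d}  RMSE={err:.2f}")
```

A per-bin table of this kind shows where the error concentrates and how many samples support each estimate, which is relevant when the high-corrosion range is sparsely represented.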

Comment: 4
The conclusion is coherent and highlights both the practical implications of the findings and the
contribution to the corrosion research community. However, the conclusion could be further
strengthened by more explicitly addressing the limitations of the study (e.g., data scarcity) and
suggesting how future work might overcome these. For example, strategies such as data
augmentation or transfer learning could be briefly mentioned. Additionally, although a dataset has
been made publicly available, it is not the actual dataset used in the manuscript. A more detailed
mention of its potential applications (e.g., benchmarking future models or enabling
domain-specific analysis) would reinforce its research value.

References
1. Moses A, Peng X, Wang S, Chen D: Unraveling Magnesium Alloy Corrosion Patterns Through
Unsupervised Machine Learning: Exploring Clustering Techniques for Enhanced Insight. JOM.
2024; 76(8): 4388–4403.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and does the work have academic merit?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Partly

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Materials science, machine learning, corrosion testing

I confirm that I have read this submission and believe that I have an appropriate level of
expertise to confirm that it is of an acceptable scientific standard, however I have
significant reservations, as outlined above.
