1 State Key Laboratory of Fire Science, University of Science and Technology of China, Hefei 230026, China;
[email protected]
2 School of Emergency Management, Nanjing University of Information Science and Technology,
Nanjing 210044, China; [email protected]
* Correspondence: [email protected]
Abstract
UAVs are essential for forest fire detection because forest areas are vast and high-risk zones are often inaccessible; they enable rapid long-range inspection and detailed close-range surveillance. However, aerial photography faces challenges such as multi-scale target recognition and adaptation to complex scenarios (e.g., deformation, occlusion, and lighting variations). RGB-Thermal fusion methods effectively integrate visible-light texture with thermal infrared temperature features, but current approaches are constrained by limited datasets and insufficient exploitation of cross-modal complementary information, ignoring cross-level feature interaction. To address data scarcity in wildfire scenarios, we constructed RGBT-3M, a time-synchronized, multi-scene, multi-angle aerial RGB-Thermal dataset with “Smoke–Fire–Person” annotations and modality alignment via the M-RIFT method. Finally, we propose CP-YOLOv11-MF, a fusion detection model based on the advanced YOLOv11 framework that progressively learns the complementary heterogeneous features of the two modalities. Experimental validation demonstrates the superiority of our method, with a precision of 92.5%, a recall of 93.5%, an mAP50 of 96.3%, and an mAP50-95 of 62.9%. The model's RGB-Thermal fusion capability enhances early fire detection, offering a benchmark dataset and a methodological advance for intelligent forest conservation, with implications for AI-driven ecological protection.
Keywords: forest fire; UAV multispectral imagery; YOLOv11; wildfire dataset; small object detection; attention mechanism; computer vision
Academic Editor: Ioannis Gitas
Received: 29 April 2025; Revised: 23 June 2025; Accepted: 21 July 2025; Published: 25 July 2025
effect. Traditional fixed-shape detection models suffer from feature-matching deviation and scale sensitivity. RGB-Thermal fusion detection can therefore exploit the complementary strengths of visible-light (RGB) and thermal infrared (T) imagery and compensate for the limitations of either single modality [2]. Visible-light images contain rich texture and color information, which allows the color characteristics of flames to be identified accurately and provides a basis for detailed analysis of fire targets. Thermal infrared images are not limited by illumination conditions or vegetation occlusion and can sensitively capture high-temperature heat sources; even under heavy smoke or at night, they can accurately pinpoint fire locations. In long-range imaging, where the initial fire point is small, combining the high-temperature signal in the thermal infrared image with the smoke texture in the visible image still allows a potential fire source to be detected in time. In close-range imaging, RGB-Thermal fusion detection can quickly adapt to scale changes and shape distortion of the target in complex vegetation environments and fire scenes with drastic viewpoint changes.
In existing forest fire detection research, most publicly available forest fire image datasets are limited to visible-light images and lack real fire data. Publicly available RGB-Thermal image datasets are scarce, and accurate image alignment is rarely addressed. Most of these datasets target fire classification and segmentation tasks, leaving a gap for fire detection work [3]. The FLAME1 [4] and FLAME2 [5] datasets use an overhead view, which cannot reflect the multi-angle characteristics of UAVs during routine forest fire inspections. The well-labelled Corsican Fire Dataset [6] and the RGB-T wildfire dataset [2] contain limited amounts of data, all captured in a single experimental scenario, and therefore generalize poorly. The FireMan-UAV-RGBT dataset [3] captures multi-scene forest fire images from multiple viewpoints, but supports only the classification task and does not provide finer-grained information for forest fire detection. The dataset presented in this paper is comprehensive, covering diverse aerial viewpoints and forest fire scenes with precisely annotated data. The RGBT-3M dataset has been constructed to provide reliable data support for RGB-Thermal forest fire detection.
At the methodological level, the multimodal fusion strategies can be mainly catego-
rized into three types: data-level fusion, feature-level fusion and decision-level fusion.
They correspond to different stages of the algorithmic model inference process, as shown
in Figure 2.
In the field of multimodal detection, deep learning networks mostly adopt intermediate fusion strategies [7,8], and researchers have developed a range of multimodal interaction and fusion strategies that have proved effective for designing modal interactions in the feature extraction stage [9,10]. CACFNet [11] mines complementary information from the two modalities through cross-modal attention fusion modules and uses cascaded fusion modules to decode multilevel features in an up–down manner; SICFNet [12] constructs a shared information interaction and complementary feature fusion network consisting of three phases: feature extraction, information interaction, and feature calibration refinement; and the Thermal-induced Modality-interaction Multi-stage Attention Network (TMMANet [13]) leverages thermal-induced attention mechanisms in both the encoder and decoder stages to effectively integrate the RGB and thermal modalities.
At present, preliminary progress has been made in forest fire identification based on RGB-Thermal fusion [2,3,5,6]. Although existing work applies deep learning frameworks such as LeNet [14], MobileViT [15], ResNet [16], and YOLO [17] to forest fire detection and improves recognition efficiency [5], algorithm designs that exploit RGB-Thermal correlation remain very limited. Chen et al. [5] explored RGB-Thermal early-fusion and late-fusion methods for the classification and detection of forest fire images. Rui et al. [2] proposed an adaptive-learning RGB-T bimodal image recognition framework for forest fires. Guo et al. [18] designed the SkipInception feature extraction module and the SFSeg sandwich structure to fuse visible and thermal infrared images for flame segmentation. Overall, these algorithms do not deeply consider the interaction and propagation of cross-modal features, and they cannot simultaneously calibrate shallow texture features and localize high-level semantic information across all scales; they therefore remain deficient when analyzing diverse and challenging forest fire scenarios.
We propose a new forest fire detection framework. It employs a parallel backbone
network to extract RGB and TIR features. A feature cross-fertilization structure is estab-
lished in multi-scale feature extraction to enhance information interaction and propagation
between modalities. The channel and spatial attention mechanisms, along with a feature
branching selection strategy, are introduced to suppress noise from heterogeneous inter-
modal features. Finally, it achieves effective combination of complementary relationships
between modalities.
In summary, this paper will explore how to effectively fuse the features of visible and
thermal infrared images on existing deep learning models to improve the efficiency and
effectiveness of forest fire target detection in complex environments. Based on the above
background, the main contributions of this paper are as follows:
(1) A novel forest fire dataset is introduced, containing time-synchronized RGB-
thermal video data from real fires and outdoor experiments in multiple Chinese forest areas.
It provides high-quality, reliable data for classification and detection tasks via manual
frame-splitting, image alignment, and annotation, supporting subsequent deep learning
model training and testing. To the best of our knowledge, this is the first RGB-Thermal
image detection dataset for forest fires.
(2) A fire detection method is developed by combining multimodal fusion techniques with computer vision. We choose the well-known YOLOv11 architecture and add a cross-modal feature fusion structure and attention mechanisms under a dual RGB/TIR backbone to guide the progressive fusion of heterogeneous modal information and improve the method's adaptability to forest fire target detection.
(3) Our constructed model is evaluated in several challenging forest fire scenarios,
effectively demonstrating the usability and robustness of our proposed dataset and deep
learning approach in forest fire detection scenarios.
2. Dataset
2.1. Data Collection
The equipment used for RGB-T image data collection consists of a DJI Matrice 300 RTK UAV (DJI Technology Co., Ltd., Shenzhen, China) equipped with the H20T camera and a DJI Mavic 2 Enterprise (DJI Technology Co., Ltd., Shenzhen, China) with an integrated camera, as shown in Figure 3.
Figure 3. Data collection equipment (a) DJI Matrice 300 RTK with H20T; (b) DJI MAVIC 2 Enterprise.
In the process of data collection, the specific shooting specifications are shown in
Table 1.
In order to collect forest fire images covering a wide range of scenarios, the study
carried out large-scale field environmental data collection in Anhui, Yunnan, and Inner
Mongolia, including real fires or outdoor experimental data. In the data collection process,
multiple UAV devices were used, and all devices were time-synchronized to simultaneously
acquire visible and thermal infrared videos to ensure the consistency of the acquired data.
Finally, from the large number of videos collected, we filtered out the videos with high
representativeness of forest fire scenes, and the relevant information is shown in Table 2.
Video Item | UAV | Camera | Location | Scene | Time | Duration | File Size
Video pair 1 | DJI Matrice 300 RTK | H20T | Anhui, China | Outdoor Experiment | Night | 744 s | 2.68 GB, 135 MB
Video pair 2 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 348 s | 1.46 GB, 389 MB
Video pair 3 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 703 s | 2.96 GB, 716 MB
Video pair 4 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 831 s | 3.5 GB, 742 MB
Video pair 5 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | Real Fire | Daytime | 363 s | 1.53 GB, 361 MB
Video pair 6 | DJI Mavic 2 Enterprise | Integrated Dual Camera | Yunnan, China | No fire | Daytime | 554 s | 2.33 GB, 546 MB
Video pair 7 | DJI Mavic 2 Enterprise | Integrated Camera | Yunnan, China | No fire | Daytime | 91 s | 392 MB, 91.2 MB
Video pair 8 | DJI Mavic 2 Enterprise | Integrated Camera | Inner Mongolia, China | Outdoor Experiment | Daytime | 112 s | 486 MB, 74.2 MB
Video pair 9 | DJI Mavic 2 Enterprise | Integrated Camera | Inner Mongolia, China | Outdoor Experiment | Daytime | 218 s | 940 MB, 147 MB
Video pair 10 | DJI Matrice 300 RTK | H20T | Anhui, China | Outdoor Experiment | Daytime | 698 s | 2.52 GB, 140 MB
H is the homography matrix induced by the plane, and R and T denote the rotation and translation matrices between the two coordinate systems. The coordinates of a point on the plane p in the world coordinate system, expressed in the two camera coordinate systems, are X1 and X2; the normal vector of the plane p is n, and its distance to the origin of the camera coordinate system is d.
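For reference, the relation implied by these definitions is the textbook plane-induced homography; the form below follows the symbols defined above and is a reconstruction rather than a quotation of the paper's own display.

```latex
% Plane-induced homography relating the two views of plane p:
% X_2 = H X_1, with H built from the relative rotation R, translation T,
% plane normal n, and plane-to-camera distance d.
X_2 = H X_1, \qquad H = R + \frac{T\, n^{\top}}{d}
```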
During the construction of most forest fire RGB-T data sets, the image registration pro-
cess is usually carried out by manually selecting feature points, or by using general feature
point matching methods such as ORB [20] or SIFT [21]. We propose a two-stage bimodal
image alignment framework, termed M-RIFT, to improve the accuracy and robustness of
matching heterogeneous image data, as shown in Figure 5. In the rough alignment stage,
manually selected feature points are used to coarsely resize and align the images, quickly overcoming the initial geometric distortion. In the fine alignment stage, we adopt
the RIFT multimodal image matching method [22]. First, feature points in the image are
detected via the maximum moment map. Then, the maximum value index in each direction
is searched to construct the maximum index map. Next, the FREAK descriptor is used
to generate the feature vector, and homonymous point pairs are obtained based on the
nearest-neighbor strategy. After removing outliers, the affine transform model between
images is derived. This approach enables the rapid and accurate establishment of feature
correspondences and optimization of matching results.
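As an illustration of this coarse-to-fine idea, the sketch below uses OpenCV. It is not the authors' M-RIFT code: OpenCV does not ship RIFT or its FREAK-based matching, so SIFT is used purely as a stand-in for the fine-stage matcher, a four-point homography stands in for the manual coarse resizing, and all file names and coordinates are illustrative.

```python
# Sketch of a two-stage RGB-thermal alignment in the spirit of M-RIFT.
# SIFT is a placeholder for the RIFT fine-matching step (RIFT is not in OpenCV).
import cv2
import numpy as np

rgb = cv2.imread("rgb.jpg")          # visible frame (path is illustrative)
tir = cv2.imread("tir.jpg")          # thermal frame

# --- Stage 1: coarse alignment from a few manually picked point pairs ---
pts_rgb = np.float32([[100, 80], [520, 90], [510, 400], [120, 390]])
pts_tir = np.float32([[60, 50], [420, 60], [410, 330], [70, 320]])
H_coarse, _ = cv2.findHomography(pts_tir, pts_rgb)
tir_coarse = cv2.warpPerspective(tir, H_coarse, (rgb.shape[1], rgb.shape[0]))

# --- Stage 2: fine alignment via feature matching + RANSAC homography ---
sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY), None)
k2, d2 = sift.detectAndCompute(cv2.cvtColor(tir_coarse, cv2.COLOR_BGR2GRAY), None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
H_fine, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # outlier removal
tir_aligned = cv2.warpPerspective(tir_coarse, H_fine, (rgb.shape[1], rgb.shape[0]))
```

In this sketch, RANSAC inside `cv2.findHomography` plays the role of the outlier-removal step described above.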
Through the above method, the homography matrix can be computed, and for each
match point ( xi , yi ) and ( xi′ , yi′ ), a system of equations is constructed based on the mathe-
matical model of the perspective transformation. The mathematical model of perspective
transformation is as follows:
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (2)$$
H is the homography matrix. Two equations can be obtained after expansion:
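In the standard direct linear transform (DLT) formulation, expanding Equation (2) and eliminating the projective scale factor yields, for each correspondence, the following two linear equations in the entries of H (written here in reconstructed form):

```latex
x'\,(h_{31}x + h_{32}y + h_{33}) = h_{11}x + h_{12}y + h_{13}, \\
y'\,(h_{31}x + h_{32}y + h_{33}) = h_{21}x + h_{22}y + h_{23}
```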
h denotes the elements of the homography matrix H stacked into a vector. The vector h is obtained by solving the homogeneous system of equations via singular value decomposition (SVD), from which the homography matrix is then recovered.
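A minimal NumPy sketch of this SVD-based solve is given below; the point correspondences are illustrative only.

```python
# Direct Linear Transform: solve the homogeneous system A h = 0 for the
# homography from matched pairs (x_i, y_i) <-> (x'_i, y'_i) via SVD.
import numpy as np

def homography_dlt(pts_src, pts_dst):
    """Estimate H from matched points by solving A h = 0 with SVD (DLT)."""
    rows = []
    for (x, y), (xp, yp) in zip(pts_src, pts_dst):
        rows.append([-x, -y, -1, 0, 0, 0, x * xp, y * xp, xp])
        rows.append([0, 0, 0, -x, -y, -1, x * yp, y * yp, yp])
    A = np.asarray(rows, dtype=float)
    # h is the right singular vector associated with the smallest singular
    # value, i.e. the last row of Vt returned by NumPy's SVD.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]          # normalize so that h_33 = 1

# Illustrative correspondences (at least four non-degenerate pairs are needed).
src = [(10, 10), (200, 15), (190, 150), (12, 140)]
dst = [(14, 20), (210, 22), (205, 160), (18, 152)]
H = homography_dlt(src, dst)
```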
Traditional feature point matching methods cannot effectively handle the modal differ-
ences between cross-modal images, making it difficult to match heterogeneous information.
Figure 6 demonstrates a comparative analysis of the feature point matching results be-
tween our approach and other methodologies. The green line connects the matching points
corresponding to the visible light image and the thermal infrared image.
The traditional methods are only capable of identifying a limited number of matching
points, frequently accompanied by issues of incorrect alignment. By contrast, the approach
proposed in this paper successfully detects a substantial number of accurate matching
points, thereby demonstrating the superiority of the proposed method.
Figure 7. Example of partial images of the dataset (a) visible images; (b) thermal infrared images.
3. Method
3.1. Overall Architecture Design
Convolutional Neural Networks (CNNs) are powerful feature extractors and have achieved outstanding performance in early forest fire recognition tasks. Forest fire images are natural images with strong local correlations, and CNNs, with their translation invariance, are well suited to extracting fire features and thus learning high-level (semantic) representations. Many well-known object detection frameworks have been used for fire detection, such as the YOLO series [17,23] and the R-CNN series [24]. The YOLO family of algorithms is architecturally centered on fast detection, which matches the efficiency requirements of forest fire detection tasks.
Therefore, YOLOv11 is adopted as the baseline scheme in this study, and the classic
three-stage architecture design balances feature abstraction capability and computational
efficiency, providing a robust baseline framework for subsequent model improvement. The
input resolution of the visible and infrared images is denoted as W × H. The backbone network extracts features at different scales via multiple convolutional down-sampling stages. When the feature map resolution reaches $\left\{ \frac{W}{8} \times \frac{H}{8},\ \frac{W}{16} \times \frac{H}{16},\ \frac{W}{32} \times \frac{H}{32} \right\}$, the feature interaction design performs cross-modal information interaction between the features extracted by the two modal networks. After passing through the C3k2 feature fusion module, the feature splicing design is applied before the features are fed into the neck network.
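To make this layout concrete, the following PyTorch sketch shows a dual-stream backbone with interaction and splicing at the 1/8, 1/16, and 1/32 scales. It is a schematic stand-in under assumed stage widths, not the authors' YOLOv11-based implementation: the plain convolutional stages and the 1×1-convolution interaction are placeholders for the C3k2 modules and the CPCA/PPAS designs described below.

```python
# Schematic dual-backbone layout: two parallel streams extract RGB and TIR
# features; cross-modal interaction and concatenation happen at 1/8, 1/16,
# and 1/32 resolution before the neck. All module choices are placeholders.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Stride-2 convolution: each stage halves the spatial resolution.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class DualBackbone(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512)):
        super().__init__()
        def stream():
            return nn.ModuleList([
                conv_block(3, widths[0]),          # 1/2
                conv_block(widths[0], widths[1]),  # 1/4
                conv_block(widths[1], widths[2]),  # 1/8   <- interaction + splice
                conv_block(widths[2], widths[3]),  # 1/16  <- interaction + splice
                conv_block(widths[3], widths[3]),  # 1/32  <- interaction + splice
            ])
        self.rgb_stream, self.tir_stream = stream(), stream()
        # Placeholder interaction: a 1x1 conv over the concatenated features
        # stands in for the CPCA-based feature interaction design.
        self.interact = nn.ModuleList(
            [nn.Conv2d(2 * c, 2 * c, 1) for c in (widths[2], widths[3], widths[3])])

    def forward(self, rgb, tir):
        fused, k = [], 0
        for i, (layer_r, layer_t) in enumerate(zip(self.rgb_stream, self.tir_stream)):
            rgb, tir = layer_r(rgb), layer_t(tir)
            if i >= 2:                               # scales 1/8, 1/16, 1/32
                mixed = self.interact[k](torch.cat([rgb, tir], dim=1))
                upd_r, upd_t = mixed.chunk(2, dim=1)
                rgb, tir = rgb + upd_r, tir + upd_t  # feed interaction back
                fused.append(torch.cat([rgb, tir], dim=1))  # splice for the neck
                k += 1
        return fused  # three fused maps for the neck / detection head

feats = DualBackbone()(torch.randn(1, 3, 640, 640), torch.randn(1, 3, 640, 640))
```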
To boost the model’s cross-modal fusion efficiency and detection performance, we
devise a novel cross-modal feature fusion algorithm within the RGB-Thermal fusion frame-
work. This algorithm comprises two key components: the feature interaction design and
the feature splicing design.
In the feature interaction design, we integrate the channel prior convolutional attention
(CPCA) mechanism. Given the significant differences in feature representations between
infrared and visible images, CPCA dynamically adjusts the importance of different channels.
By emphasizing the complementary information and suppressing redundant or conflicting
features, it enables effective cross-modal synergy. This process allows the model to fully
leverage the unique advantages of each modality.
For the feature splicing design, we adopt the parallel patch-aware splicing (PPAS)
method. PPAS divides the images into patches and processes them in parallel, guiding the
model to focus on critical regions. This approach enhances the model’s perception of the
target by capturing local details and global context simultaneously. Moreover, it suppresses
irrelevant background information, significantly improving the detection accuracy and
efficiency. The construction process is shown in Figure 8, where different font colors distinguish the components that give the model its name: the fusion strategies are labeled in blue, orange, and green, and the first letters of the improved modules are labeled in red.
Subsequently, the generated fused feature maps are input to the neck network layer. At
the neck network layer, the features are further integrated, and the feature information
processed by the neck network layer is finally passed to the detection layer to output the
detection results. The network structure of YOLOv11-MF is shown in Figure 9.
The method dynamically assigns attention weights in the channel and spatial dimensions to adaptively emphasize important features in the different modalities, as shown in Figure 11. In the channel attention computation, following the hybrid attention mechanism CBAM (Convolutional Block Attention Module), spatial information is collected from the feature maps by applying average pooling and maximum pooling operations. The collected information is then fed into a shared MLP, as shown in Equation (5):

$$CA(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) \qquad (5)$$
The spatial relations between features are computed with depthwise separable convolution, which reduces computational complexity while preserving inter-channel relations, as shown in Equation (6):

$$SA(F) = Conv_{1\times1}\Big(\sum_{i=0}^{3} Branch_i\big(DwConv(F)\big)\Big) \qquad (6)$$

DwConv denotes the depthwise separable convolution, and $Branch_i$, $i \in \{0, 1, 2, 3\}$, denotes the i-th branch.
$$F_c = M_c(\tilde{F}) \otimes \tilde{F}, \quad F_s = M_s(F_c) \otimes F_c, \quad F'' = \delta\big(B(\mathrm{dropout}(F_s))\big) \qquad (7)$$
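A minimal PyTorch sketch of this channel-then-spatial attention pipeline (Equations (5)–(7)) is shown below; the reduction ratio, branch kernel sizes, and dropout rate are assumptions, and the module is a simplified stand-in rather than the published CPCA implementation [25].

```python
# Sketch of the channel-then-spatial attention of Eqs. (5)-(7): a shared MLP
# over avg/max pooled statistics, multi-branch depthwise convolutions summed
# and mixed by a 1x1 conv, then dropout, normalization, and activation.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8, p_drop=0.1):
        super().__init__()
        # Shared MLP of Eq. (5), implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))
        # Four depthwise branches (Branch_0..Branch_3 of Eq. (6)); the kernel
        # sizes are an assumption for illustration.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (1, 3, 5, 7)])
        self.mix = nn.Conv2d(channels, channels, 1)   # Conv_1x1 of Eq. (6)
        self.drop = nn.Dropout2d(p_drop)
        self.bn = nn.BatchNorm2d(channels)            # B(.) of Eq. (7)
        self.act = nn.ReLU()                          # delta(.) of Eq. (7)

    def forward(self, f):
        # Eq. (5): CA(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = torch.mean(f, dim=(2, 3), keepdim=True)
        mx = torch.amax(f, dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        f_c = ca * f                                  # F_c = M_c(F) (x) F
        # Eq. (6): SA(F) = Conv_1x1(sum_i Branch_i(DwConv(F)))
        sa = torch.sigmoid(self.mix(sum(b(f_c) for b in self.branches)))
        f_s = sa * f_c                                # F_s = M_s(F_c) (x) F_c
        # Eq. (7): F'' = delta(B(dropout(F_s)))
        return self.act(self.bn(self.drop(f_s)))

out = ChannelSpatialAttention(256)(torch.randn(1, 256, 80, 80))
```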
non-overlapping patches is computed to facilitate local and global feature extraction and
interaction, as depicted in Figure 13.
F′ is partitioned into spatially contiguous patches and channel-averaged. The channel-averaged patches are linearly transformed by a feed-forward network (FFN). On this basis, an activation function is applied to obtain a probability distribution of the transformed features over the spatial dimension and, for each token, its weight is adjusted so that task-relevant features are retained.
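The following sketch illustrates this patch-weighting idea under stated assumptions (patch size, FFN width, and nearest-neighbor weight broadcasting are illustrative choices); it is a simplified stand-in for PPAS, not the implementation used in this paper.

```python
# Simplified patch-aware weighting: split the feature map into non-overlapping
# patches, channel-average each patch, score patches with a small FFN + softmax,
# and rescale the features accordingly.
import torch
import torch.nn as nn

class PatchAwareWeighting(nn.Module):
    def __init__(self, patch=4, hidden=16):
        super().__init__()
        self.patch = patch
        self.ffn = nn.Sequential(nn.Linear(patch * patch, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, f):                       # f: (B, C, H, W)
        b, c, h, w = f.shape
        p = self.patch
        # Channel-average, then reshape into (B, num_patches, p*p) tokens.
        tokens = f.mean(dim=1)                                   # (B, H, W)
        tokens = tokens.unfold(1, p, p).unfold(2, p, p)          # (B, H/p, W/p, p, p)
        tokens = tokens.reshape(b, -1, p * p)                    # (B, N, p*p)
        scores = self.ffn(tokens).squeeze(-1)                    # (B, N)
        weights = torch.softmax(scores, dim=-1) * scores.shape[-1]  # keep scale ~1
        # Broadcast each patch weight back onto its spatial region.
        wmap = weights.view(b, 1, h // p, w // p)
        wmap = torch.nn.functional.interpolate(wmap, size=(h, w), mode="nearest")
        return f * wmap

out = PatchAwareWeighting()(torch.randn(1, 256, 80, 80))
```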
4. Experiment
4.1. Experimental Settings
The experiments were run on Ubuntu 18.04 with an NVIDIA GeForce RTX 3090 GPU, CUDA 11.1, and Python 3.9.19. For consistency, all networks were trained with the same core optimizer settings, optimization algorithm, and training hyperparameters. The detailed settings of the training parameters are shown in Table 4.
$$R = \frac{TP}{TP + FN} \qquad (9)$$

$$AP = \int_{0}^{1} P(R)\, dR \qquad (10)$$

An IoU detection threshold of 0.5 is adopted for mAP50, while mAP50-95 is defined as the average precision computed over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
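A small numerical sketch of Equation (10) and of the mAP50-95 averaging is given below; the precision-recall values and the per-threshold decay are illustrative only.

```python
# Eq. (10): AP integrates the precision-recall curve; mAP50-95 averages AP
# over IoU thresholds 0.50, 0.55, ..., 0.95. The numbers are illustrative.
import numpy as np

def average_precision(recall, precision):
    """AP = integral over [0, 1] of P(R) dR (trapezoidal rule)."""
    r = np.asarray(recall, dtype=float)
    p = np.asarray(precision, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(r)))

# Illustrative precision-recall samples for one class at IoU = 0.5.
recall = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
precision = [1.00, 0.98, 0.95, 0.90, 0.80, 0.55]
ap_50 = average_precision(recall, precision)

# mAP50-95: mean of the per-threshold APs; the linear decay here merely
# stands in for real per-threshold results.
ious = np.arange(0.50, 1.00, 0.05)
ap_per_iou = [ap_50 * (1.0 - 0.6 * (t - 0.50) / 0.45) for t in ious]
map_50_95 = float(np.mean(ap_per_iou))
```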
Table 5. Comparison of the effect of different numbers of target categories for visible-light detection.
Compared with single-stage detectors of similar complexity, YOLOv11 shows a clear performance advantage, significantly outperforming RTMDet in all metrics. Compared with the more complex two-stage Faster R-CNN, YOLOv11 differs little in most indicators but achieves slightly higher recall, indicating that it performs well at reducing missed detections, which is essential for the early detection of forest fires.
As shown in Table 7, early, mid-term, and late RGB-Thermal bimodal fusion frameworks are constructed according to the different multimodal fusion strategies. Using a simple Concat function for modal splicing, the early fusion framework (YOLOv11-EF), the mid-term fusion framework (YOLOv11-MF), and the late fusion framework (YOLOv11-LF) are built on top of the YOLOv11 model. Simple bimodal feature splicing only slightly improves performance, whereas designing a cross-modal feature interaction module and optimizing the modal splicing module enhances inter-modal interaction and enables deep feature-level complementarity across modalities. After this series of targeted improvements, the final model (CP-YOLOv11-MF) reaches 92.5% precision, 93.5% recall, 96.3% mAP50, and 62.9% mAP50-95, reflecting the effectiveness of the individual improvements.
In the detection framework based on the early fusion strategy (YOLOv11-EF), the input
layer is improved by adding two new input channels and introducing the Concat function
for early bimodal feature splicing. First, the infrared image and the visible image are
used as input data, and the feature splicing operation is performed on the images of
two different modalities to generate the bimodal fusion feature map. Subsequently, the
generated bimodal fusion feature map is input to the backbone network layer and the neck
network layer. The detection layer then performs target detection on the features from the previous layers and outputs the detection results. The network structure of YOLOv11-EF is shown
in Figure 14.
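A minimal sketch of this early-fusion input handling is shown below, assuming a three-channel thermal image (so six input channels in total); the layer widths are placeholders and the module is schematic rather than the YOLOv11-EF implementation.

```python
# Early-fusion input handling sketch: concatenate the two modalities along the
# channel axis and widen the stem convolution to accept the fused input.
import torch
import torch.nn as nn

class EarlyFusionStem(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        # 3 (RGB) + 3 (thermal, assumed three-channel) = 6 input channels.
        self.stem = nn.Sequential(nn.Conv2d(6, out_channels, 3, 2, 1),
                                  nn.BatchNorm2d(out_channels), nn.SiLU())

    def forward(self, rgb, tir):
        fused_input = torch.cat([rgb, tir], dim=1)   # (B, 6, H, W)
        return self.stem(fused_input)

feat = EarlyFusionStem()(torch.randn(1, 3, 640, 640), torch.randn(1, 3, 640, 640))
```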
The detection framework based on late fusion strategies (YOLOv11-LF) consists of
a dual-input layer, a dual-channel backbone network layer, a dual-channel neck network
layer, and a detection layer. The visible channel backbone network and the thermal infrared
channel backbone network perform feature extraction for the visible and thermal infrared
images, and output the extracted features to the neck network layer. The features are
enhanced in the neck network layer. A feature splicing module is embedded at the output
position of the neck network layer to input the generated bimodal image fusion features to
the detection layer. The network structure is shown in Figure 15.
shown in Table 9. After adding the cross-modal feature interaction structure, the improved
mid-term fusion framework performs optimally.
Table 9. Comparison of the effects of cross-modal feature interaction structures on different detec-
tion frameworks.
The experimental results show that after adding the cross-modal structure, all indicators under the late fusion framework decrease to some extent, while for the mid-term fusion framework all indicators improve significantly. Because the cross-modal structure in the late fusion framework is introduced at a late stage of data processing, it cannot interact effectively with the late feature integration, making it difficult for the model to adapt to and exploit the cross-modal information. In the mid-term fusion stage, the features have been only partially processed and their patterns are not yet solidified, so introducing cross-modal structures at this point captures the rich complementary information between the two modalities in time. From the network architecture perspective, modal splicing and interaction are interleaved, enabling inter-modal feature mapping, enhancing information flow, and improving the characterization of complex scenes and diverse targets. This significantly improves model performance, with a particularly clear advantage in mAP50-95.
In the feature splicing module, we test the effectiveness of different attention mechanisms for optimizing modal splicing. Several attention mechanisms (SimAM [29], GAM [30], NAM [31], LCA [32], and our method) are applied in the feature splicing module after the C3k2 feature extraction module of the backbone network; the results are shown in Table 10.
Table 10. Comparison of the Effectiveness of Different Attention Mechanisms in Optimizing Feature
Splicing Modules. Convention: best, 2nd-best.
As shown in Table 10, our designed PPAS, enabled by its multi-branch structure, effectively filters noisy information, complements cross-modal information, and outperforms the other mechanisms in all metrics. The GAM attention mechanism ranks second in several metrics. Simi-
lar to our approach, it leverages channel-spatial attention interaction to enhance feature
extraction accuracy.
Table 11. Performance and complexity comparison of different modal splicing schemes. Convention:
best, 2nd-best.
Modal Splicing Scheme | Precision | Recall | mAP50 | mAP50-95 | Parameters (×10^6) | Model Size
Scheme 1 | 92.1% | 93.9% | 96.4% | 62.7% | 20.51 | 39.7 MB
Scheme 2 | 92.5% | 93.5% | 96.3% | 62.9% | 11.83 | 23 MB
Scheme 3 | 92.4% | 93.4% | 96.3% | 62.4% | 9.66 | 18.9 MB
As shown in Table 11, the simplified modal splicing Scheme 2 (replacing only the first two modal splicing modules with PPAS) reduces the parameter count and model size by nearly 50% compared with full replacement. The indicators show no significant decline, with slight improvements in precision and mAP50-95, verifying the effectiveness of the simplified design.
In order to further verify the balance between the detection effect and model complex-
ity of the algorithmic models, the comparison of detection performance and complexity of
the single-modal and RGB-T bimodal algorithmic models is shown in Table 12.
Table 12. Comparison of detection performance and complexity of the single-modal and RGB-T
bi-modal algorithmic models.
As shown in Table 12, the proposed model handles RGB-T bimodal data with a dual backbone for cross-modal feature extraction and incorporates lightweight designs in the data input and algorithmic improvements. Although its parameter count and model size are slightly larger than those of the original single-modal detector, it achieves improved performance with only a modest increase in complexity.
The blue box labeled “fire 0.8” indicates that the model predicts that the target is “fire”
with 80% confidence. From the above figure, it can be seen that the CP-YOLOv11-MF
algorithm model can fulfill the forest fire target detection task well.
At the same time, in order to further analyze the performance differences between
different algorithm models, representative forest fire image detection samples (night envi-
ronment, tree cover, smoke cover) are selected for visual analysis in this section, as shown
in Figures 17–19, to visualize the improvement effect of different algorithm models.
Figure 17. Visualization of the detection performance of each model for nighttime conditions.
Figure 18. Visualization of the detection performance of each model for tree occlusion conditions.
Figure 19. Visualization of the detection performance of each model for smoke occlusion conditions.
Figure 17 illustrates the detection performance of each model under nighttime con-
ditions. While each model demonstrates proficiency in detecting fires with distinct char-
acteristics, person detection may incur pixel-level displacement. This is because humans
lack rich texture in thermal infrared images, leading to blurred detection box borders that
hinder accurate localization. In the mid-term fusion framework (YOLOv11-MF), multiple
detection boxes initially appear, but the final model—incorporating cross-modal feature
fusion and splicing—achieves precise person detection with the highest confidence among
all models.
The performance of visible images in flame detection under tree occlusion conditions
is limited, as shown in Figure 18. For some fire objects, the detection confidence is only 30%,
and the detection boxes have localization bias. Thermal infrared images can effectively
recognize high-temperature target regions that stand out from the surrounding environ-
ment by virtue of their ability to perceive high-temperature areas in the scene. In the
early fusion framework (YOLOv11-EF), the poor fusion of bimodal information initially
generates multiple detection boxes. After model improvement, the confidence level of
all target detections increases to 80%, demonstrating that the adopted algorithm model
effectively enhances the accuracy and stability of forest fire target detection under tree
occlusion conditions. This provides a more reliable solution for forest fire target detection
in complex environments.
There is a false alarm problem with thermal infrared images in smoke occlusion
environments, as shown in Figure 19. Due to the existence of areas around the fire point with
temperatures close to the human body temperature, thermal infrared images incorrectly
identify these areas as personnel targets. Although visible-light images contain rich texture information and can still detect flames and smoke under low visibility, their detection boxes are less accurately localized. In addition, in the early fusion framework and the mid-term fusion framework, the visible-light image exhibits missed detections and fails to detect some of the actual targets. When the proposed method is used, the detection confidence for both critical targets, namely flames and people, increases to 80%, improving the accuracy and reliability of detection.
In summary, the constructed CP-YOLOv11-MF model performs the target detection task more accurately in complex forest fire scenarios. Compared with single-modal detection methods, it significantly reduces both the false alarm rate and the missed detection rate, effectively overcoming the limitations of single-modal detection. Meanwhile, by designing the modal interaction structure and optimizing the modal splicing module, the model's ability to detect targets in complex environments is enhanced significantly.
5. Conclusions
In this paper, a multi-target, multi-scene forest fire aerial dataset is constructed by collecting data at multiple sites with UAVs carrying dual-sensor (visible and thermal) gimbal cameras, providing a more comprehensive visual dataset for subsequent forest fire prevention and management research. Meanwhile, the early fusion detection framework (YOLOv11-EF), the mid-term fusion detection framework (YOLOv11-MF), and the late fusion detection framework (YOLOv11-LF) are constructed on top of the YOLOv11 target detection model according to the multimodal fusion strategies, demonstrating the advantage of RGB-T bimodal detection over single-modal detection. On this basis, a modal interaction structure is designed and the modal splicing module is optimized to enhance deep cross-modal interaction and fusion for RGB-Thermal bimodal target detection, with lightweight design also incorporated during model improvement. Finally, the RGB-T dual-modal detection model CP-YOLOv11-MF achieves 92.5%, 93.5%, 96.3%, and 62.9% in precision, recall, mAP50, and mAP50-95, respectively. Compared with single-modal visible-light detection, these metrics improve by 1.8%, 3.2%, 2.7%, and 7.9%; compared with single-modal thermal infrared detection, they improve by 1.3%, 4.9%, 2.7%, and 4.7%.
This paper presents an optimized AI-driven framework for RGB-thermal fusion in
wildfire detection, which significantly improves the accuracy and response efficiency of
monitoring systems. In the context of the growing trend of multi-source data fusion for
forest fire detection, this study provides novel insights into the integration of diverse data
modalities. Future work will focus on further enhancing the scale and diversity of the
multi-scenario fire dataset by continuing to collect data in more forested areas with different
geographic environments and climatic conditions, covering a wide range of terrains such
as mountains, hills, and plains, as well as forested scenarios with different seasons and
day/night time slots, in order to increase the dataset’s level of coverage of complex real-
world scenarios. At the algorithmic research level, we continue to study the cross-modal
fusion mechanism in depth, explore more potential modal interaction features, and improve
the efficiency of the model in utilizing the bimodal data, so as to achieve more stable and
accurate detection in the complex and changing forest fire scenarios.
Author Contributions: Conceptualization, Y.Z. and X.R.; methodology, Y.Z.; data curation, Y.Z.;
writing—original draft preparation, Y.Z.; writing—review and editing, X.R.; supervision, W.S. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (program
NO. 52321003) and the Startup Foundation for Introducing Talent of NUIST (1523142501164).
References
1. Cunningham, C.X.; Williamson, G.J.; Bowman, D.M.J.S. Increasing frequency and intensity of the most extreme wildfires on Earth.
Nat. Ecol. Evol. 2024, 8, 1420–1425. [CrossRef] [PubMed]
2. Rui, X.; Li, Z.; Zhang, X.; Li, Z.; Song, W. A RGB-Thermal based adaptive modality learning network for day–night wildfire
identification. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103554. [CrossRef]
3. Kularatne, S.D.M.W.; Casado, C.Á.; Rajala, J.; Hänninen, T.; López, M.B.; Nguyen, L. FireMan-UAV-RGBT: A Novel UAV-
Based RGB-Thermal Video Dataset for the Detection of Wildfires in the Finnish Forests. In Proceedings of the 2024 IEEE 29th
International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024;
pp. 1–8.
4. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning:
The FLAME dataset. Comput. Netw. 2021, 193, 108001. [CrossRef]
5. Chen, X.; Hopkins, B.; Wang, H.; O’Neill, L.; Afghah, F.; Razi, A.; Fulé, P.; Coen, J.; Rowell, E.; Watts, A. Wildland Fire Detection
and Monitoring Using a Drone-Collected RGB/IR Image Dataset. IEEE Access 2022, 10, 121301–121317. [CrossRef]
6. Toulouse, T.; Rossi, L.; Campana, A.; Celik, T.; Akhloufi, M.A. Computer vision for wildfire research: An evolving image dataset
for processing and analysis. Fire Saf. J. 2017, 92, 188–194. [CrossRef]
7. Li, X.Y.; Chen, S.G.; Tian, C.N.; Zhou, H.; Zhang, Z.X. M2FNet: Mask-Guided Multi-Level Fusion for RGB-T Pedestrian Detection.
IEEE Trans. Multimed. 2024, 26, 8678–8690. [CrossRef]
8. Song, K.C.; Wen, H.W.; Xue, X.T.; Huang, L.M.; Ji, Y.Y.; Yan, Y.H. Modality Registration and Object Search Framework for
UAV-Based Unregistered RGB-T Image Salient Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531015. [CrossRef]
9. Jin, D.Z.; Shao, F.; Xie, Z.X.; Mu, B.Y.; Chen, H.W.; Jiang, Q.P. CAFCNet: Cross-modality asymmetric feature complement network
for RGB-T salient object detection. Expert Syst. Appl. 2024, 247, 123222. [CrossRef]
10. Lv, Y.; Liu, Z.; Li, G.Y. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26,
6348–6360. [CrossRef]
11. Zhou, W.J.; Dong, S.H.; Fang, M.X.; Yu, L. CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene
Parsing. IEEE Trans. Intell. Veh. 2024, 9, 1919–1929. [CrossRef]
12. Zhang, B.; Li, Z.L.; Sun, F.M.; Li, Z.H.; Dong, X.B.; Zhao, X.L.; Zhang, Y.R. SICFNet: Shared Information Interaction and
Complementary Feature Fusion Network for RGB-T traffic scene parsing. Expert Syst. Appl. 2025, 276, 14. [CrossRef]
13. Pang, Y.; Huang, Y.; Weng, C.Y.; Lyu, J.L.; Bai, C.Y.; Yu, X.S. Enhanced RGB-T saliency detection via thermal-guided multi-stage
attention network. Vis. Comput. 2025, 41, 8055–8073. [CrossRef]
14. Bin Azami, M.H.; Orger, N.C.; Schulz, V.H.; Oshiro, T.; Cho, M. Earth Observation Mission of a 6U CubeSat with a 5-Meter
Resolution for Wildfire Image Classification Using Convolution Neural Network Approach. Remote Sens. 2022, 14, 1874.
[CrossRef]
15. Kumar, A.; Perrusquía, A.; Al-Rubaye, S.; Guo, W. Wildfire and smoke early detection for drone applications: A light-weight
deep learning approach. Eng. Appl. Artif. Intell. 2024, 136, 108977. [CrossRef]
16. Qurratulain, S.; Zheng, Z.Z.; Xia, J.; Ma, Y.; Zhou, F.R. Deep learning instance segmentation framework for burnt area instances
characterization. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103146. [CrossRef]
17. Li, J.; Tang, H.; Li, X.; Dou, H.; Li, R. LEF-YOLO: A lightweight method for intelligent detection of four extreme wildfires based
on the YOLO framework. Int. J. Wildland Fire 2024, 33, WF23044. [CrossRef]
18. Guo, S.H.; Hu, B.; Huang, R. Real-Time Flame Segmentation based on RGB-Thermal Fusion. In Proceedings of the IEEE
International Conference on Robotics and Biomimetics (IEEE ROBIO), Sanya, China, 27–31 December 2021; IEEE: Piscataway, NJ,
USA; pp. 1435–1440.
19. Cui, S.; Ma, A.L.; Wan, Y.T.; Zhong, Y.F.; Luo, B.; Xu, M.Z. Cross-Modality Image Matching Network with Modality-Invariant
Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets. IEEE Trans. Geosci. Remote Sens. 2022,
60, 3099506. [CrossRef]
20. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
21. Burger, W.; Burge, M.J. Scale-Invariant Feature Transform (SIFT). In Digital Image Processing: An Algorithmic Introduction Using
Java; Burger, W., Burge, M.J., Eds.; Springer: London, UK, 2016; pp. 609–664.
22. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans.
Image Process. 2020, 29, 3296–3310. [CrossRef] [PubMed]
23. Gonçalves, L.A.O.; Ghali, R.; Akhloufi, M.A. YOLO-Based Models for Smoke and Wildfire Detection in Ground and Aerial
Images. Fire 2024, 7, 140. [CrossRef]
24. Ding, Y.H.; Wang, M.Y.; Fu, Y.J.; Wang, Q. Forest Smoke-Fire Net (FSF Net): A Wildfire Smoke Detection Model That Combines
MODIS Remote Sensing Images with Regional Dynamic Brightness Temperature Thresholds. Forests 2024, 15, 839. [CrossRef]
25. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical
image segmentation. Comput. Biol. Med. 2024, 178, 108784. [CrossRef] [PubMed]
26. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for
Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME),
Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
27. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing
Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [CrossRef]
28. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
29. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks.
In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Virtual,
18–24 July 2021; pp. 11863–11874.
30. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv
2021, arXiv:2112.05561. [CrossRef]
31. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based Attention Module. arXiv 2021, arXiv:2111.12419. [CrossRef]
32. He, A.; Li, X.; Wu, X.; Su, C.; Chen, J.; Xu, S.; Guo, X. ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network
for TIR Wildlife Detection in UAV Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17308–17326. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.