
Article

A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection

Yalin Zhang 1, Xue Rui 2 and Weiguo Song 1,*

1 State Key Laboratory of Fire Science, University of Science and Technology of China, Hefei 230026, China;
[email protected]
2 School of Emergency Management, Nanjing University of Information Science and Technology,
Nanjing 210044, China; [email protected]
* Correspondence: [email protected]

Abstract
UAVs are essential for forest fire detection because forest areas are vast and high-risk zones are often inaccessible, enabling rapid long-range inspection and detailed close-range surveillance. However, aerial photography faces challenges such as multi-scale target recognition and complex scenario adaptation (e.g., deformation, occlusion, lighting variations). RGB-Thermal fusion methods effectively integrate visible-light texture and thermal infrared temperature features, but current approaches are constrained by limited datasets and insufficient exploitation of cross-modal complementary information, ignoring cross-level feature interaction. To address data scarcity in wildfire scenarios, we constructed a time-synchronized, multi-scene, multi-angle aerial RGB-Thermal dataset (RGBT-3M) with "Smoke–Fire–Person" annotations and modal alignment via the M-RIFT method. We further propose a CP-YOLOv11-MF fusion detection model based on the advanced YOLOv11 framework, which learns the complementary heterogeneous features of each modality in a progressive manner. Experimental validation demonstrates the superiority of our method, with a precision of 92.5%, a recall of 93.5%, an mAP50 of 96.3%, and an mAP50-95 of 62.9%. The model's RGB-Thermal fusion capability enhances early fire detection, offering a benchmark dataset and methodological advancement for intelligent forest conservation, with implications for AI-driven ecological protection.

Keywords: forest fire; UAV multispectral imagery; YOLOv11; wildfire dataset; small object detection; attention mechanism; computer vision

Academic Editor: Ioannis Gitas
Received: 29 April 2025; Revised: 23 June 2025; Accepted: 21 July 2025; Published: 25 July 2025
Citation: Zhang, Y.; Rui, X.; Song, W. A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection. Remote Sens. 2025, 17, 2593. https://doi.org/10.3390/rs17152593
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The frequency of forest fires has increased significantly in recent years, and extreme forest fire events have had a major impact on societies and ecosystems globally [1]. With the advantages of flexible mobility and multi-scale observation, UAVs have become important equipment for forest fire detection. In daily forest fire inspection, the composite strategy of long-range rapid inspection and close-range fine detection requires the detection algorithm to be both lightweight and generalizable.
As shown in Figure 1, UAVs face two different application scenarios in forest fire detection: remote shooting and close shooting. This creates the difficulty of multi-scale target detection and matching for forest fires. At the same time, aerial images are affected by changes in flight attitude and vegetation occlusion. Targets therefore present significant multi-scale features, deformation characteristics, and edge-blurring



effects. Traditional fixed-shape detection models therefore suffer from feature-matching deviation and scale sensitivity. RGB-Thermal fusion detection can therefore exploit the complementary advantages of visible-light (RGB) and thermal infrared (T) images and compensate for the deficiencies of a single modality [2]. Visible-light images contain rich texture and color information, which allows accurate identification of flame color characteristics and provides a basis for detailed analysis of fire targets. Thermal infrared images are not limited by lighting conditions or vegetation occlusion and can reliably capture high-temperature heat sources; even in heavy smoke or at night, they can accurately pinpoint fire locations. In remote shooting scenes, where the initial fire point is small, combining the high-temperature signal of the thermal infrared image with the smoke texture of the visible image enables timely detection of potential fire sources. In close shooting scenes, RGB-Thermal fusion detection can quickly adapt to the scale changes and shape distortion of targets in complex vegetation environments and fire scenes with dramatic perspective changes.

Figure 1. Schematic of forest fire detection imagery based on UAV.

In existing forest fire detection research, most publicly available forest fire image datasets are limited to visible-light images and lack real fire data. Furthermore, publicly available RGB-Thermal image datasets are scarce, and accurate image alignment is rarely performed. Most of these datasets focus on fire classification and segmentation tasks, leaving a gap in fire detection work [3].
The FLAME1 [4] and FLAME2 [5] datasets utilise the overhead view, which is incapable
of reflecting the multi-angle characteristics of UAVs during daily forest fire inspections.
The well-labelled Corsican Fire Dataset [6] and the RGB-T wildfire dataset [2] contain a
limited amount of data and are all images of a single experimental scenario, which lacks
data generalization. The FireMan-UAV-RGBT dataset [3] has captured multi-scene forest
fire images from multiple viewpoints, but only supports the classification task and does
not provide more refined information for forest fire detection. In contrast, the dataset in this paper is comprehensive, considering the diversity of aerial viewpoints, the diversity of forest fire scenes, and the precision of the annotations. The RGBT-3M dataset has been constructed to provide reliable data support for the detection of forest fires using RGB-Thermal imaging.
At the methodological level, the multimodal fusion strategies can be mainly catego-
rized into three types: data-level fusion, feature-level fusion and decision-level fusion.
They correspond to different stages of the algorithmic model inference process, as shown
in Figure 2.

Figure 2. Multimodal fusion strategies.

In the field of multimodal detection methods, deep learning networks mostly use
intermediate fusion strategies [7,8], and researchers have developed a number of multi-
modal interaction and fusion strategies, which have proved to be effective in enhancing the
design of modal interactions in the feature extraction phase [9,10]. CACFNet [11] mines
complementary information from two modalities by designing cross-modal attention fusion
modules, and uses cascaded fusion modules to decode multilevel features in an up–down
manner; SICFNet [12] constructs a shared information interaction and complementary
feature fusion network, which consists of three phases: feature extraction, information inter-
action, and feature calibration refinement; and the Thermal-induced Modality-interaction
Multi-stage Attention Network (TMMANet [13]) leverages thermal-induced attention
mechanisms in both the encoder and decoder stages to effectively integrate RGB and
thermal modalities.
At present, preliminary progress has been made in forest fire identification based on
the RGB-Thermal fusion method [2,3,5,6]. Although existing work applies deep learning
frameworks such as LeNet [14], MobileViT [15], ResNet [16], YOLO [17], etc., to forest
fire detection, which improves the efficiency of forest fire recognition [5], the design of
algorithms based on RGB-Thermal correlation is still very limited. Chen et al. [5] explored
RGB-Thermal based early fusion and late fusion methods for classification and detection

of forest fire images. Rui et al. [2] proposed an adaptive learning RGB-T bimodal image
recognition framework for forest fires. Guo et al. [18] designed the SkipInception feature
extraction module and SFSeg sandwich structure to fuse visible and thermal infrared
images for the flame segmentation task. Overall, the above algorithms do not deeply consider the interaction and propagation of cross-modal features. They lack the ability to simultaneously calibrate shallow texture features and localize high-level semantic information across scales, and thus remain deficient in analyzing diverse and challenging forest fire scenarios.
We propose a new forest fire detection framework. It employs a parallel backbone
network to extract RGB and TIR features. A feature cross-fertilization structure is estab-
lished in multi-scale feature extraction to enhance information interaction and propagation
between modalities. The channel and spatial attention mechanisms, along with a feature
branching selection strategy, are introduced to suppress noise from heterogeneous inter-
modal features. Finally, it achieves effective combination of complementary relationships
between modalities.
In summary, this paper will explore how to effectively fuse the features of visible and
thermal infrared images on existing deep learning models to improve the efficiency and
effectiveness of forest fire target detection in complex environments. Based on the above
background, the main contributions of this paper are as follows:
(1) A novel forest fire dataset is introduced, containing time-synchronized RGB-
thermal video data from real fires and outdoor experiments in multiple Chinese forest areas.
It provides high-quality, reliable data for classification and detection tasks via manual
frame-splitting, image alignment, and annotation, supporting subsequent deep learning
model training and testing. To the best of our knowledge, this is the first RGB-Thermal
image detection dataset for forest fires.
(2) A fire detection method is developed by combining multimodal fusion techniques with computer vision. We adopt the well-known YOLOv11 architecture and add a cross-modal feature fusion structure and attention mechanisms under a dual RGB/TIR backbone to guide the gradual fusion of heterogeneous modal information and improve the adaptability of the method to forest fire target detection.
(3) Our constructed model is evaluated in several challenging forest fire scenarios,
effectively demonstrating the usability and robustness of our proposed dataset and deep
learning approach in forest fire detection scenarios.

2. Dataset
2.1. Data Collection
The experimental equipment used for RGB-T image data collection comprises the DJI Matrice 300 RTK UAV (DJI Technology Co., Ltd., Shenzhen, China) equipped with the H20T camera, and the DJI Mavic 2 Enterprise (DJI Technology Co., Ltd., Shenzhen, China) with an integrated camera, as shown in Figure 3.

Figure 3. Data collection equipment (a) DJI Matrice 300 RTK with H20T; (b) DJI MAVIC 2 Enterprise.

In the process of data collection, the specific shooting specifications are shown in
Table 1.

Table 1. Video Shooting Specifications.

UAV                      Camera              FPS   Resolution
DJI Matrice 300 RTK      H20T                30    visible: 1920 × 1080; infrared: 640 × 512
DJI Mavic 2 Enterprise   All-in-one camera   30    visible: 1920 × 1080; infrared: 900 × 720

In order to collect forest fire images covering a wide range of scenarios, the study
carried out large-scale field environmental data collection in Anhui, Yunnan, and Inner
Mongolia, including real fires or outdoor experimental data. In the data collection process,
multiple UAV devices were used, and all devices were time-synchronized to simultaneously
acquire visible and thermal infrared videos to ensure the consistency of the acquired data.
Finally, from the large number of videos collected, we filtered out the videos with high
representativeness of forest fire scenes, and the relevant information is shown in Table 2.

Table 2. Raw video information in the RGBT-3M dataset.

Video Item      UAV                           Camera              Location                Scene                Time      Duration   File Size
Video pair 1    DJI Matrice 300 RTK           H20T                Anhui, China            Outdoor Experiment   Night     744 s      2.68 GB, 135 MB
Video pair 2    DJI Mavic 2 Enterprise        Integrated Camera   Yunnan, China           Real Fire            Daytime   348 s      1.46 GB, 389 MB
Video pair 3    DJI Mavic 2 Enterprise        Integrated Camera   Yunnan, China           Real Fire            Daytime   703 s      2.96 GB, 716 MB
Video pair 4    DJI Mavic 2 Enterprise        Integrated Camera   Yunnan, China           Real Fire            Daytime   831 s      3.5 GB, 742 MB
Video pair 5    DJI Mavic 2 Enterprise        Integrated Camera   Yunnan, China           Real Fire            Daytime   363 s      1.53 GB, 361 MB
Video pair 6    DJI Mavic 2 Enterprise Dual   Integrated Camera   Yunnan, China           No fire              Daytime   554 s      2.33 GB, 546 MB
Video pair 7    DJI Mavic 2 Enterprise        Integrated Camera   Yunnan, China           No fire              Daytime   91 s       392 MB, 91.2 MB
Video pair 8    DJI Mavic 2 Enterprise        Integrated Camera   Inner Mongolia, China   Outdoor Experiment   Daytime   112 s      486 MB, 74.2 MB
Video pair 9    DJI Mavic 2 Enterprise        Integrated Camera   Inner Mongolia, China   Outdoor Experiment   Daytime   218 s      940 MB, 147 MB
Video pair 10   DJI Matrice 300 RTK           H20T                Anhui, China            Outdoor Experiment   Daytime   698 s      2.52 GB, 140 MB

2.2. Data Pre-Processing


In the pre-processing stage, frames are extracted from the videos at 5 frames per second to reduce the similarity between consecutive image frame pairs. Images are divided into fire and non-fire frame pairs to facilitate image classification tasks. Considering the similarity of scenes, additional frame-skipping strategies are adopted to optimize the processing workflow. In total, 17,862 frame pairs are obtained, of which 6642 pairs are non-fire frames and 11,220 pairs are fire frames.
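As a concrete illustration of this step, the following minimal sketch extracts frames from a time-synchronized RGB/thermal video pair at roughly 5 frames per second with OpenCV; the file paths and output layout are hypothetical and do not reproduce the authors' exact pipeline.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, target_fps: float = 5.0) -> int:
    """Save frames from a video at roughly `target_fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # shooting specification: 30 FPS
    step = max(int(round(native_fps / target_fps)), 1)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                         # keep every `step`-th frame
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Hypothetical paths for one time-synchronized video pair.
extract_frames("pair01_rgb.mp4", "frames/pair01/rgb")
extract_frames("pair01_tir.mp4", "frames/pair01/tir")
```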
The visible and thermal infrared images are usually captured by different sensors, and
the modality gaps caused by different imaging systems or styles pose a great challenge

to the matching task [19]. Although the different imaging modalities provide complementary information, multimodal images obtained directly from the cameras are not aligned for direct fusion, as shown in Figure 4. The figure depicts a stereo vision framework in which the two cameras are associated with independent left and right coordinate systems.

Figure 4. RGB-thermal dual-optical camera imaging schematic.

Image alignment refers to the establishment of pixel-level correspondences between images from two viewpoints. Through a series of pre-processing and alignment steps, the images are transformed into a common reference frame or coordinate system via spatial mapping relationships, i.e., converted into a common representation that makes them spatially aligned and allows them to be compared and analyzed at the same spatial scale. Image alignment merges the strengths of the different modalities, resulting in a more comprehensive, accurate, and robust characterization.
At the device hardware level, the temporal acquisition frame rates of visible and
thermal infrared images are kept synchronized, i.e., they are already aligned in the temporal
dimension. In the spatial dimension, the alignment between the visible and thermal infrared
images can be realized by solving the homography matrix of the visible images and the
thermal infrared images and performing affine transformations, i.e.,
$$
\begin{cases}
H = R + \dfrac{1}{d}\, T N^{T} \\
X_2 = H X_1
\end{cases} \tag{1}
$$

H is the homography matrix, and R and T denote the rotation and translation matrices between the two coordinate systems. X_1 and X_2 are the coordinates of a point on the plane p expressed in the two camera coordinate systems, N is the normal vector of plane p, and d is the distance from the plane to the origin of the camera coordinate system.
During the construction of most forest fire RGB-T data sets, the image registration pro-
cess is usually carried out by manually selecting feature points, or by using general feature
point matching methods such as ORB [20] or SIFT [21]. We propose a two-stage bimodal

image alignment framework, termed M-RIFT, to improve the accuracy and robustness of
matching heterogeneous image data, as shown in Figure 5. In the rough alignment stage,
manually selected feature points are used as the coarse alignment step for image resizing
to quickly overcome the initial geometric distortion. In the fine alignment stage, we adopt
the RIFT multimodal image matching method [22]. First, feature points in the image are
detected via the maximum moment map. Then, the maximum value index in each direction
is searched to construct the maximum index map. Next, the FREAK descriptor is used
to generate the feature vector, and homonymous point pairs are obtained based on the
nearest-neighbor strategy. After removing outliers, the affine transform model between
images is derived. This approach enables the rapid and accurate establishment of feature
correspondences and optimization of matching results.
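As an illustration of the coarse-alignment stage, the sketch below estimates a similarity transform (scale, rotation, translation) from a few manually selected point pairs and resamples the thermal image onto the RGB grid with OpenCV; the point coordinates and file paths are hypothetical, and the fine RIFT stage is omitted here because RIFT/FREAK matching is not part of standard OpenCV builds.

```python
import cv2
import numpy as np

def coarse_align(tir_img, tir_pts, rgb_pts, rgb_size):
    """Coarse stage: fit a similarity transform from manually selected point pairs
    and warp the thermal image onto the RGB image grid."""
    M, _ = cv2.estimateAffinePartial2D(tir_pts.astype(np.float32),
                                       rgb_pts.astype(np.float32))
    w, h = rgb_size
    return cv2.warpAffine(tir_img, M, (w, h))

# Hypothetical manually clicked correspondences, (x, y) in each image.
tir_pts = np.array([[100, 80], [520, 90], [110, 400], [530, 410]])
rgb_pts = np.array([[310, 250], [1580, 270], [330, 820], [1600, 840]])

tir = cv2.imread("frames/pair01/tir/frame_000000.jpg")
aligned_coarse = coarse_align(tir, tir_pts, rgb_pts, rgb_size=(1920, 1080))
```

Restricting the coarse fit to a similarity transform keeps this step robust with only a handful of hand-picked points; the RIFT-based fine stage then refines the result with a full homography.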

Figure 5. Flowchart of M-RIFT image alignment method.

Through the above method, the homography matrix can be computed: for each pair of matched points $(x_i, y_i)$ and $(x_i', y_i')$, a system of equations is constructed based on the mathematical model of the perspective transformation:

$$
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = H \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{2}
$$
H is the homography matrix. Two equations are obtained after expansion:

$$
x' = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + h_{33}}, \qquad
y' = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + h_{33}} \tag{3}
$$

h denotes the vector formed by the elements of the homography matrix H. This vector is obtained by solving the resulting homogeneous linear system via singular value decomposition (SVD), from which the homography matrix is then recovered.
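For reference, the sketch below stacks the two equations per correspondence implied by Equations (2) and (3) and solves the resulting homogeneous system with NumPy's SVD. It is a minimal direct linear transform (DLT) illustration rather than the authors' code; in practice a robust estimator such as cv2.findHomography with RANSAC would be applied to the RIFT correspondences after outlier removal.

```python
import numpy as np

def homography_dlt(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Estimate H (3x3) from >= 4 point correspondences via the DLT method.
    src_pts, dst_pts: arrays of shape (N, 2) with matched (x, y) coordinates."""
    rows = []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        # Each correspondence contributes two rows of the homogeneous system A h = 0,
        # obtained by clearing denominators in Equation (3).
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                      # right singular vector of the smallest singular value
    return (h / h[-1]).reshape(3, 3)

# Toy check: points related by a known homography are recovered up to scale.
H_true = np.array([[1.2, 0.1, 30.0], [0.05, 1.1, -20.0], [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [640, 0], [640, 512], [0, 512], [320, 256]], dtype=float)
src_h = np.hstack([src, np.ones((5, 1))])
dst_h = (H_true @ src_h.T).T
dst = dst_h[:, :2] / dst_h[:, 2:]
H_est = homography_dlt(src, dst)
assert np.allclose(H_est, H_true / H_true[2, 2], atol=1e-6)
```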
Traditional feature point matching methods cannot effectively handle the modal differ-
ences between cross-modal images, making it difficult to match heterogeneous information.
Figure 6 demonstrates a comparative analysis of the feature point matching results be-
tween our approach and other methodologies. The green line connects the matching points
corresponding to the visible light image and the thermal infrared image.
The traditional methods are only capable of identifying a limited number of matching
points, frequently accompanied by issues of incorrect alignment. By contrast, the approach
proposed in this paper successfully detects a substantial number of accurate matching
points, thereby demonstrating the superiority of the proposed method.

Figure 6. Feature point matching results of image matching methods.

2.3. Statistical Analysis of the Dataset


The multi-scene, multi-target, multimodal forest fire aerial photography dataset (RGBT-3M) contains 22,440 images (i.e., 11,220 image pairs) of fire frames, which were annotated using LabelImg (version 1.8.6). The labeled targets are smoke, fire, and person, with 13,574, 11,315, and 5888 instances, respectively. Because infrared images lack obvious smoke features, a separate label set excluding smoke targets is also provided. The dataset is divided at a ratio of 7:3 into a training set and a validation set; representative scenes are shown in Figure 7, and the detailed statistics are given in Table 3. The dataset will be published at https://complex.ustc.edu.cn/.
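The 7:3 split itself can be reproduced in a few lines; the sketch below shuffles the fire-frame pair identifiers with a fixed seed and assigns 70%/30% of them to training and validation (the identifier naming and the seed are assumptions for illustration).

```python
import random

# Hypothetical list of the 11,220 synchronized RGB/TIR pair identifiers.
pair_ids = [f"pair_{i:05d}" for i in range(11220)]

random.seed(0)                      # fixed seed for a reproducible split
random.shuffle(pair_ids)

n_train = int(0.7 * len(pair_ids))  # 7:3 train/validation ratio
train_ids = pair_ids[:n_train]
val_ids = pair_ids[n_train:]

print(len(train_ids), len(val_ids))  # 7854 3366, matching Table 3
```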


Figure 7. Example of partial images of the dataset (a) visible images; (b) thermal infrared images.

Table 3. RGBT-3M dataset content.

Dataset          Number of Images   Number of Labels   Smoke    Fire     Person
Training Set     7854               21,550             9488     7914     4148
Validation Set   3366               9227               4086     3401     1740
Total            11,220             30,777             13,574   11,315   5888

3. Method
3.1. Overall Architecture Design
Convolutional Neural Networks (CNNs) are powerful in feature extraction and mod-
eling and have achieved outstanding performance in early forest fire recognition tasks.
Forest fire images are natural images with rich local correlations, and CNNs, with their translation invariance, are good at extracting fire features and thus learning high-level (semantic) features. Many well-known object detection frameworks are used for fire detection tasks,
such as the YOLO series [17,23] and the RCNN series [24]. The YOLO series of algorithms
for architectural design always centers on the core goal of fast detection, which meets the
efficient detection requirements of forest fire detection tasks.
Therefore, YOLOv11 is adopted as the baseline scheme in this study, and the classic
three-stage architecture design balances feature abstraction capability and computational
efficiency, providing a robust baseline framework for subsequent model improvement. The
input resolution of the visible and infrared images is denoted as $W \times H$. The backbone network extracts features at different scales via successive convolutional down-sampling. When the feature map resolution lies in $\left\{ \frac{W}{8} \times \frac{H}{8}, \frac{W}{16} \times \frac{H}{16}, \frac{W}{32} \times \frac{H}{32} \right\}$, the feature interaction design performs cross-modal information interaction between the features extracted by the two modal networks. After passing through the C3k2 feature fusion module, the feature splicing design is applied before feeding into the neck network.
To boost the model’s cross-modal fusion efficiency and detection performance, we
devise a novel cross-modal feature fusion algorithm within the RGB-Thermal fusion frame-
work. This algorithm comprises two key components: the feature interaction design and
the feature splicing design.
In the feature interaction design, we integrate the channel prior convolutional attention
(CPCA) mechanism. Given the significant differences in feature representations between
infrared and visible images, CPCA dynamically adjusts the importance of different channels.
By emphasizing the complementary information and suppressing redundant or conflicting
features, it enables effective cross-modal synergy. This process allows the model to fully
leverage the unique advantages of each modality.
For the feature splicing design, we adopt the parallel patch-aware splicing (PPAS)
method. PPAS divides the images into patches and processes them in parallel, guiding the
model to focus on critical regions. This approach enhances the model’s perception of the
target by capturing local details and global context simultaneously. Moreover, it suppresses
irrelevant background information, significantly improving the detection accuracy and
efficiency. The construction process is shown in Figure 8. We use different font colors to
distinguish different methods in order to better understand the naming of the model we are
building. We labeled different fusion strategies in blue, orange, and green, and we labeled
the first letter of the improved method in red.

3.2. Forest Fire Detection Frameworks Based on Mid-Term Fusion Strategies


The YOLOv11-MF network architecture consists of a dual-input layer, a dual-channel
backbone network layer, a neck network layer, and a detection layer. The same modal
feature splicing module is set up after the C3k2 module of each backbone network branch.

Subsequently, the generated fused feature maps are input to the neck network layer. At
the neck network layer, the features are further integrated, and the feature information
processed by the neck network layer is finally passed to the detection layer to output the
detection results. The network structure of YOLOv11-MF is shown in Figure 9.
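To make the mid-term fusion layout concrete, here is a heavily simplified PyTorch sketch of a dual-backbone, mid-fusion detector: two generic convolutional backbones produce features at the 1/8, 1/16 and 1/32 scales, the same-scale RGB/TIR features are concatenated, and the fused maps are handed to a placeholder neck/head. It only illustrates the data flow of YOLOv11-MF; the real C3k2 blocks, neck and detection head of YOLOv11 are not reproduced here.

```python
import torch
import torch.nn as nn

def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """Stride-2 conv block used to halve spatial resolution."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class TinyBackbone(nn.Module):
    """Stand-in backbone returning features at 1/8, 1/16 and 1/32 of the input size."""
    def __init__(self, c_in: int = 3):
        super().__init__()
        self.stem = nn.Sequential(conv_block(c_in, 32), conv_block(32, 64), conv_block(64, 128))
        self.s16 = conv_block(128, 256)
        self.s32 = conv_block(256, 512)

    def forward(self, x):
        p3 = self.stem(x)          # 1/8
        p4 = self.s16(p3)          # 1/16
        p5 = self.s32(p4)          # 1/32
        return p3, p4, p5

class MidFusionDetector(nn.Module):
    """Mid-term fusion: per-scale concatenation of RGB and TIR features before the neck."""
    def __init__(self, num_outputs: int = 6):   # output channels per location (placeholder)
        super().__init__()
        self.rgb_backbone = TinyBackbone()
        self.tir_backbone = TinyBackbone()
        # Placeholder "neck + head": 1x1 convs applied to the fused maps.
        self.heads = nn.ModuleList([nn.Conv2d(c * 2, num_outputs, 1) for c in (128, 256, 512)])

    def forward(self, rgb, tir):
        rgb_feats = self.rgb_backbone(rgb)
        tir_feats = self.tir_backbone(tir)
        fused = [torch.cat([r, t], dim=1) for r, t in zip(rgb_feats, tir_feats)]
        return [head(f) for head, f in zip(self.heads, fused)]

model = MidFusionDetector()
outs = model(torch.randn(1, 3, 640, 640), torch.randn(1, 3, 640, 640))
print([o.shape for o in outs])   # feature maps at 80x80, 40x40, 20x20
```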

Figure 8. Building process of CP-YOLOv11-MF.

Figure 9. Network structure of YOLOv11-MF.

3.3. Cross-Modal Feature Fusion Algorithm Design


In order to better fuse the bimodal information, the design of the feature interaction module and the optimization of the modal splicing module are carried out on the mid-term fusion framework (YOLOv11-MF). In this way, the model pays more attention to features that contribute to detection while suppressing irrelevant or noisy features, which enhances its ability to fuse different sources of information and improves forest fire target detection performance.
Finally, we design an RGB-Thermal fusion detection model, named CP-YOLOv11-MF
as shown in Figure 10.

Figure 10. The network structure of the proposed CP-YOLOv11-MF.

3.3.1. Feature Interaction Module Design


To address the visible and infrared feature interaction problem, we designed a feature interaction module. As shown in Equation (4), visible and infrared features are processed via channel prior convolutional attention [25] to compute channel and spatial attention, after which they are combined element-wise with the original visible and thermal infrared features:

$$
F = F_{RGB} \oplus F_{T} \oplus F_{CPCA} \tag{4}
$$

The method dynamically assigns attention weights in channel and spatial dimensions
to adaptively emphasize important features in different modalities, as shown in Figure 11.

Figure 11. Schematic of Channel Prior Convolutional Attention.

In the process of channel attention calculation, the hybrid attention mechanism (CBAM,
Convolutional Block Attention Module) method is borrowed to collect spatial information

from the feature maps by applying average pooling and maximum pooling operations. The pooled information is then fed into a shared MLP, as shown in Equation (5):

$$
CA(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) \tag{5}
$$

The spatial relations between features are computed with the help of depth-separable
convolution, which reduces the complexity of computation while inter-channel relations
are preserved, as shown in Equation (6):

$$
SA(F) = Conv_{1\times1}\left(\sum_{i=0}^{3} Branch_i\big(DwConv(F)\big)\right) \tag{6}
$$

DwConv denotes the depthwise separable convolution, and $Branch_i$, $i \in \{0, 1, 2, 3\}$, denotes the $i$-th branch.
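The following PyTorch sketch shows one plausible reading of Equations (5) and (6): CBAM-style channel attention with a shared MLP, followed by a multi-branch depthwise spatial attention. The branch kernel sizes, the channel reduction ratio, and the choice of CPCA input (the element-wise sum of the two modalities) are assumptions, so this is an illustration rather than the authors' exact module.

```python
import torch
import torch.nn as nn

class CPCALikeAttention(nn.Module):
    """Channel-prior attention sketch: Eq. (5) channel attention, then Eq. (6)
    multi-branch depthwise spatial attention (kernel sizes are illustrative)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP of Eq. (5)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Branch 0 is the identity; branches 1-3 are depthwise convs of growing size.
        self.branches = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
            nn.Conv2d(channels, channels, 7, padding=3, groups=channels),
            nn.Conv2d(channels, channels, 11, padding=5, groups=channels),
        ])
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.proj = nn.Conv2d(channels, channels, 1)  # Conv_1x1 of Eq. (6)

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)            # AvgPool(F)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)             # MaxPool(F)
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))          # Eq. (5)
        x = x * ca                                                # channel prior
        d = self.dw(x)
        sa = self.proj(sum(branch(d) for branch in self.branches))  # Eq. (6)
        return x * sa

# Combine RGB and TIR features per Eq. (4): F = F_RGB (+) F_T (+) F_CPCA
# (the CPCA input being the sum of the two modal features is an assumption).
cpca = CPCALikeAttention(channels=256)
f_rgb, f_t = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
fused = f_rgb + f_t + cpca(f_rgb + f_t)
print(fused.shape)
```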

3.3.2. Feature Splicing Module Design


Owing to the characteristic disparities across different modalities, simple splicing introduces substantial noise and renders inter-modal information transfer ineffective. To address this challenge, this study draws inspiration from the multi-branch
feature extraction strategy in HCF-Net [26], and designs a parallel patch-aware splicing
module (PPAS). The module employs a parallel multi-branching framework comprising
local, global, and serial convolutional branches, which effectively suppresses extraneous in-
formation. Additionally, it leverages spatial and channel attention mechanisms for adaptive
feature enhancement, as illustrated in Figure 12.

Figure 12. Flowchart of PPAS.


$F_{local}, F_{global}, F_{serial} \in \mathbb{R}^{H' \times W' \times C'}$ are computed through the three branches, and their weighted sum is derived as $\tilde{F} \in \mathbb{R}^{H' \times W' \times C'}$. The attention module consists of channel attention followed by spatial attention: $\tilde{F}$ is processed sequentially by a one-dimensional channel attention map $M_c \in \mathbb{R}^{1 \times 1 \times C'}$ and a spatial attention map $M_s \in \mathbb{R}^{H' \times W' \times 1}$, as shown in Equation (7):

$$
F_c = M_c(\tilde{F}) \otimes \tilde{F}, \qquad
F_s = M_s(F_c) \otimes F_c, \qquad
F'' = \delta\big(B(\mathrm{dropout}(F_s))\big) \tag{7}
$$

$\otimes$ represents the element-wise product, $F_c$ and $F_s$ are the intermediate selected features, and $\delta(\cdot)$ and $B(\cdot)$ denote the rectified linear unit and batch normalization operations.
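A minimal sketch of the attention tail in Equation (7) follows: the weighted branch output is passed through channel attention, spatial attention, then dropout, batch normalization and ReLU. The channel and spatial attention blocks are simplified stand-ins and the branch weighting and patch-aware parts are omitted, so this only illustrates the structure of Equation (7).

```python
import torch
import torch.nn as nn

class PPASTail(nn.Module):
    """Attention tail of Eq. (7): F_c = M_c(F~) * F~, F_s = M_s(F_c) * F_c,
    F'' = ReLU(BN(dropout(F_s)))."""
    def __init__(self, channels: int, p_drop: float = 0.1):
        super().__init__()
        # M_c: 1x1xC' channel attention from global average pooling (simplified).
        self.mc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # M_s: H'xW'x1 spatial attention from a channel-pooled map (simplified).
        self.ms = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.drop = nn.Dropout2d(p_drop)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f_tilde):
        f_c = self.mc(f_tilde) * f_tilde                     # channel attention
        spatial_map = f_c.mean(dim=1, keepdim=True)          # pool over channels
        f_s = self.ms(spatial_map) * f_c                     # spatial attention
        return self.act(self.bn(self.drop(f_s)))             # F'' of Eq. (7)

# F~ is assumed to be the weighted sum of the local/global/serial branch outputs.
tail = PPASTail(channels=256)
print(tail(torch.randn(2, 256, 40, 40)).shape)
```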
The distinction between the local and global branches is accomplished by controlling
the patch size parameter, which is implemented via the aggregation and displacement of
non-overlapping patches in the spatial dimension. Thereby, the attention matrix between

non-overlapping patches is computed to facilitate local and global feature extraction and
interaction, as depicted in Figure 13.

Figure 13. Patch-Aware Flowchart.

F′ is partitioned into spatially contiguous patches and channel-averaged. The channel-averaged patches are then linearly processed by a feed-forward neural network (FFN). On this basis, an activation function is applied to obtain the probability distribution of the linearly transformed features in the spatial dimension and, for each token, its weight is adjusted by filtering the features relevant to the task.

4. Experiment
4.1. Experimental Settings
The experiments were conducted on an Ubuntu system with an NVIDIA GeForce RTX 3090 graphics card, CUDA 11.1, and Python (see Table 4 for the exact software versions). For consistency, all networks were trained with the same core optimizer settings, optimization algorithm, and training hyperparameters. The detailed settings of each training parameter are shown in Table 4.

Table 4. Training parameters.

Training Environment          Parameter Settings
CPU                           Intel® Xeon(R) Gold 6226R CPU @ 2.90 GHz × 64
GPU                           NVIDIA GeForce RTX 3090
Operating System              Ubuntu 20.04.6 LTS
Deep Learning Environment     Python 3.8.19, torch 1.8.0, CUDA 11.1
Optimizer                     Stochastic Gradient Descent (SGD)
Momentum                      0.937
Weight Decay                  0.0005
Training Epochs               200
Batch Size                    4
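For reference, with the standard single-modal YOLOv11 baseline these hyperparameters would map onto an Ultralytics training call roughly as follows; the dataset YAML path is hypothetical, the model-size variant is assumed, and the dual-backbone CP-YOLOv11-MF requires a customized model definition rather than this off-the-shelf call.

```python
from ultralytics import YOLO

# Baseline single-modal YOLOv11 trained with the settings of Table 4.
model = YOLO("yolo11s.pt")            # pretrained checkpoint (size variant assumed)
model.train(
    data="rgbt3m_visible.yaml",       # hypothetical dataset config (images + labels)
    epochs=200,
    batch=4,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
)
```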

4.2. Evaluation Criteria


Metrics such as precision (P), detection recall (R), and average precision (AP) are used
to evaluate the model performance.

$$
P = \frac{TP}{TP + FP} \tag{8}
$$

$$
R = \frac{TP}{TP + FN} \tag{9}
$$

$$
AP = \int_{0}^{1} P(R)\, dR \tag{10}
$$

mAP50 denotes the mean AP computed at an IoU threshold of 0.5, while mAP50-95 is the mean AP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
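As an illustration of Equation (10), the short sketch below numerically integrates precision over recall for a toy precision-recall curve; real evaluations use the interpolated, IoU-thresholded procedure inside the detection toolkit, so this is only a conceptual example.

```python
import numpy as np

# Toy precision-recall curve sampled at increasing recall values.
recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.92, 0.88, 0.80, 0.65])

# AP = integral of P(R) dR over [0, 1], Eq. (10), via the trapezoidal rule.
ap = np.trapz(precision, recall)
print(f"AP = {ap:.3f}")
```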

4.3. Comparative Experiment


In this section, a comparative study is carried out for the above single-modal detection
model as well as the RGB-Thermal detection framework. It is worth noting that smoke is
visible on visible images and difficult to recognize on thermal infrared images. This is due
to the low sensitivity of the thermal infrared camera carried by the UAV and, for the remote
observation, the UAV is far away from the forest fire target and cannot effectively capture
the smoke information. Meanwhile, in order to focus more on the RGB-T bimodal fusion
detection in small target detection, the subsequent comparison experiments are carried out
with the smoke label removed, and only the flame and person are targeted for analysis.
Table 5 compares the effect of different numbers of target categories for visible-light detection.

Table 5. Comparison of the effect of different numbers of target categories for visible-light detection.

Model                             Class    P       R       mAP50   mAP50-95
YOLOv11 (all objects)             smoke    93.9%   93.2%   97%     74.6%
                                  fire     92.5%   85.8%   91.5%   51.2%
                                  person   90.5%   88.4%   93.8%   57.2%
YOLOv11 (smoke object removed)    fire     91.6%   89.6%   92.9%   52.1%
                                  person   89.8%   91%     94.3%   58%

For single-modal comparison, we selected RTMdet [27], a single-stage target detection


algorithm with similar model complexity to YOLOv11, and FasterRCNN [28], a well-
known two-stage target detection algorithm, for comparison experiments. Table 6 presents
a comparison of the effect of single-modal detection methods.

Table 6. Model performance comparison of single-modal detection methods.

Model                              Class    Precision   Recall   mAP50   mAP50-95
YOLOv11 (visible image)            all      90.7%       90.3%    93.6%   55%
                                   fire     91.6%       89.6%    92.9%   52.1%
                                   person   89.8%       91%      94.3%   58%
RTMdet [27] (visible image)        all      79%         73.8%    76%     38.6%
                                   fire     82.2%       71.9%    73.5%   33.1%
                                   person   75.7%       75.7%    78.5%   44.1%
FasterRCNN [28] (visible image)    all      90.6%       89.3%    93.6%   55.5%
                                   fire     92.6%       88.4%    93.3%   52.2%
                                   person   88.5%       90.2%    94%     58.8%
YOLOv11 (infrared image)           all      91.2%       88.6%    93.6%   57.6%
                                   fire     92.2%       86.5%    92.4%   57.5%
                                   person   90.2%       90.8%    94.9%   57.6%
RTMdet [27] (infrared image)       all      84.5%       76.8%    82.2%   43.6%
                                   fire     83.7%       72.9%    78.7%   41.4%
                                   person   85.4%       80.6%    85.7%   45.9%
FasterRCNN [28] (infrared image)   all      91.2%       87.6%    93.6%   57.7%
                                   fire     92.5%       86.2%    93.3%   58%
                                   person   89.9%       89%      94%     57.5%

Compared to the single-stage target detection model of similar complexity, the YOLOv11 model shows a more prominent performance advantage and is significantly higher than the RTMdet model in all metrics. Compared to the more complex two-stage target detection model, YOLOv11 differs little from the FasterRCNN model in all indicators, but YOLOv11 is slightly higher in recall, which indicates that it performs well in reducing missed detections, a key requirement for early forest fire detection.

4.4. Ablation Experiment


In order to verify the effect of each improvement module on the model detection
capability, we compare the effects of different improvements on the model detection
performance. Ablation test results of the algorithm model are shown in Table 7. The
experiments adopt YOLOv11 as the baseline model for unimodal detection in visible
and infrared images. Based on this, the mid-term fusion framework “YOLOv11-MF” is
designed. Adding a cross-modal feature interaction design based on CPCA to YOLOv11-
MF yields “YOLOv11-MF+ feature interaction structure”. Finally, integrating the PPAS
feature splicing module results in “CP-YOLOv11-MF”.

Table 7. Ablation experiments.

Model                                                               Precision   Recall   mAP50   mAP50-95
YOLOv11 (visible image)                                             90.7%       90.3%    93.6%   55%
YOLOv11 (infrared image)                                            91.2%       88.6%    93.6%   57.6%
YOLOv11-MF                                                          90.4%       91.3%    95.3%   58.7%
YOLOv11-MF + feature interaction structure                          91.7%       92.6%    96%     61.6%
YOLOv11-MF + feature interaction structure + PPAS (CP-YOLOv11-MF)   92.5%       93.5%    96.3%   62.9%

As shown in Table 7, early, mid, and late RGB-Thermal bimodal fusion frameworks
are constructed via multimodal strategies. According to different multimodal fusion strate-
gies, a simple Concat function is used for modal splicing, and the early fusion framework
(YOLOv11-EF), the mid-term fusion framework (YOLOv11-MF), and the late fusion frame-
work (YOLOv11-LF) are constructed on the basis of the YOLOv11 model. Simple bimodal
feature splicing slightly improves algorithm performance, while designing a cross-modal
feature interaction module and optimizing modal splicing modules enhances intermodal
interactions, enabling deep feature and information complementarity across modalities.
After a series of targeted improvements, the final model (CP-YOLOv11-MF) reaches 92.5% precision, 93.5% recall, 96.3% mAP50, and 62.9% mAP50-95, reflecting the effectiveness of the individual improvement methods.
In the detection framework based on early fusion Strategies (YOLOv11-EF), the input
layer is improved by adding two new input channels and introducing the Concat function
for early bimodal feature splicing. First, the infrared image and the visible image are
used as input data, and the feature splicing operation is performed on the images of
two different modalities to generate the bimodal fusion feature map. Subsequently, the
generated bimodal fusion feature map is input to the backbone network layer and the neck
network layer. The detection layer performs target detection operations from the previous
layers and outputs the detection results. The network structure of YOLOv11-EF is shown
in Figure 14.
The detection framework based on late fusion strategies (YOLOv11-LF) consists of
a dual-input layer, a dual-channel backbone network layer, a dual-channel neck network
layer, and a detection layer. The visible channel backbone network and the thermal infrared
channel backbone network perform feature extraction for the visible and thermal infrared

images, and output the extracted features to the neck network layer. The features are
enhanced in the neck network layer. A feature splicing module is embedded at the output
position of the neck network layer to input the generated bimodal image fusion features to
the detection layer. The network structure is shown in Figure 15.

Figure 14. Network structure of YOLOv11-EF.

Figure 15. Network structure of YOLOv11-LF.

In order to verify the effectiveness of the RGB-Thermal bimodal target detection


algorithm model, in this section, the YOLOv11 model is utilized to train the infrared
and visible images separately to obtain the detection results in a single modality, i.e., the
YOLOv11 network model processes the two types of images, visible and thermal infrared,
and directly outputs the detection results without any fusion. As shown in Table 8, the
performance of each model under single-modal detection and different detection fusion
frameworks is compared.

Table 8. Model performance comparison of single-modal and dual-modal algorithms.

Model                       Class    Precision   Recall   mAP50   mAP50-95
YOLOv11 (visible image)     all      90.7%       90.3%    93.6%   55%
                            fire     91.6%       89.6%    92.9%   52.1%
                            person   89.8%       91%      94.3%   58%
YOLOv11 (infrared image)    all      91.2%       88.6%    93.6%   57.6%
                            fire     92.2%       86.5%    92.4%   57.5%
                            person   90.2%       90.8%    94.9%   57.6%
YOLOv11-EF                  all      91.1%       89.8%    94.9%   58.2%
                            fire     91.6%       88.8%    94.1%   57.5%
                            person   90.7%       90.7%    95.6%   59%
YOLOv11-MF                  all      90.6%       91.2%    95.3%   58.6%
                            fire     91%         90.9%    94.8%   58.1%
                            person   90.2%       91.4%    95.9%   59.1%
YOLOv11-LF                  all      91.3%       91.5%    95.3%   59.7%
                            fire     92.3%       91.5%    95.3%   59.7%
                            person   90.2%       91.4%    95.4%   59.7%

A comparison of visible and infrared image detection results in a single modality


shows that visible images, despite richer information, contain more interference. Evaluation
of detection performance reveals similar mAP50 values for both modalities. However,
infrared images exhibit a significant advantage in flame detection under the stricter mAP50-
95 metric, outperforming visible images by 5.4%. In contrast, detection accuracy of person
with infrared is slightly lower than that with visible light.
In terms of the comparative analysis of the detection effect of single-modal and dual-
modal, the RGB-T dual-modal image detection method is significantly better than the
single-modal image detection results in the three types of evaluation indexes (Precision,
Recall, mAP), which proves the effectiveness of the image fusion technology in improving
the performance of target detection. In-depth analysis of the characteristics of different
stages of fusion methods within each RGB-Thermal bimodal fusion framework reveals
that later fusion shows certain advantages. In the early fusion stage, because the original
data has not been processed in depth, a large amount of redundant information and
potential noise are not effectively eliminated, and these interfering factors are likely to
have a negative impact on the subsequent model analysis process. Mid-term fusion also
suffers from a similar problematic potential, in which a certain degree of noise interference
inevitably exists in the data processing process. In this case, relying only on simple splicing
operations to integrate multi-source data, it is not possible to fully explore the intrinsic
correlation between the data, and it is difficult to realize the efficient fusion and utilization
of information. In contrast, the fusion in the later stages of the information processing
process, the data underwent multiple rounds of rigorous screening, effectively reducing
noise impact. Subsequently, the information is integrated, which enables the model to more
accurately refine the key features, thus presenting a better performance than the mid-term
fusion framework on this dataset.
Under the multiple RGB-Thermal bimodal fusion frameworks mentioned above, only
the Concat function is used for modal splicing operations and, in order to better perform
modal fusion interactions, a cross-modal feature interaction structure is designed. Since
only a single backbone network exists in the early fusion framework (YOLOv11-EF), the
cross-modal feature interaction structure is applied in this section to the mid-term fusion
framework (YOLOv11-MF) and late fusion framework (YOLOv11-LF) for experiments, as

shown in Table 9. After adding the cross-modal feature interaction structure, the improved
mid-term fusion framework performs optimally.

Table 9. Comparison of the effects of cross-modal feature interaction structures on different detection frameworks.

Model                                         Class    Precision   Recall   mAP50   mAP50-95
YOLOv11-LF                                    all      91.3%       91.5%    95.3%   59.7%
                                              fire     92.3%       91.5%    95.3%   59.7%
                                              person   90.2%       91.4%    95.4%   59.7%
YOLOv11-LF + feature interaction structure    all      90.3%       90%      94.9%   57.8%
                                              fire     90%         89.4%    94.4%   57.4%
                                              person   90.6%       90.7%    95.5%   58.1%
YOLOv11-MF                                    all      90.6%       91.2%    95.3%   58.6%
                                              fire     91%         90.9%    94.8%   58.1%
                                              person   90.2%       91.4%    95.9%   59.1%
YOLOv11-MF + feature interaction structure    all      91.9%       92.3%    96%     61.5%
                                              fire     92.4%       92.4%    96%     61.7%
                                              person   91.4%       92.2%    96%     61.3%

The experimental results show that, after the addition of cross-modal structures, all
indicators under the late integration framework decreased to a certain extent while, for
the mid-term fusion framework, all indicators were significantly improved. Because the cross-modal structure in the late fusion framework is introduced at a late stage of data processing, where the features have already been integrated, the model cannot easily adapt to and exploit the cross-modal information. In the mid-term fusion stage, the data has only been partially processed and the feature patterns are not yet solidified, so introducing cross-modal structures at this point captures the rich complementary information between the modalities in a timely manner. From the network architecture perspective, modal splicing and interaction operate across levels, enabling inter-modal feature mapping, enhancing information flow, and improving the characterization of complex scenarios and diversified targets. This significantly improves model performance, especially on mAP50-95, where obvious advantages are observed.
We also test the effectiveness of different attention mechanisms in optimizing the feature splicing module. Various attention mechanisms (SimAM [29], GAM [30], NAM [31], LCA [32], and our method) are adopted to improve the modal splicing approach; they are applied in the feature splicing module after the C3k2 feature extraction module of the backbone network, and the related results are shown in Table 10.

Table 10. Comparison of the effectiveness of different attention mechanisms in optimizing the feature splicing module. Convention: best, 2nd-best.

Attention Mechanism   Precision   Recall   mAP50   mAP50-95
-                     91.9%       92.3%    96%     61.5%
SimAM [29]            91.6%       92.9%    96.1%   61.2%
GAM [30]              91.3%       93.5%    96.2%   61.8%
NAM [31]              91.9%       93%      96.2%   61.7%
LCA [32]              90.9%       92.4%    95.8%   61.3%
Ours                  92.1%       93.9%    96.4%   62.7%

As shown in Table 10, our designed PPAS, enabled by its multi-branch structure, effectively filters noisy information, complements cross-modal information, and outperforms the other mechanisms in all metrics. The GAM attention mechanism ranks second in several metrics; similar to our approach, it leverages channel-spatial attention interaction to enhance feature extraction accuracy.

4.5. Lightweight Design


During the design of the modal splicing function, replacing all modal splicing modules with PPAS greatly increases the overall complexity of the model. Considering that the model is mainly used by UAVs for forest fire detection tasks (especially small targets), a lightweight design is adopted in this section: the original modal splicing function (Concat) is retained in the third modal splicing module, which corresponds to the third detection layer and is mainly used for detecting large targets.
This section presents comparative experiments to assess the detection performance of
different modal splicing methods and model complexity. All schemes are evaluated within
the mid-term fusion framework with the enhanced modal fusion structure (YOLOv11-MF + feature interaction structure). As detailed in Table 11, Scheme 1 replaces all modal
splicing modules with PPAS; Scheme 2 substitutes only the first two modules with PPAS;
and Scheme 3 replaces only the first module with PPAS.

Table 11. Performance and complexity comparison of different modal splicing schemes. Convention: best, 2nd-best.

Modal Splicing Scheme   Precision   Recall   mAP50   mAP50-95   Parameters (×10^6)   Model Size
Scheme 1                92.1%       93.9%    96.4%   62.7%      20.51                39.7 MB
Scheme 2                92.5%       93.5%    96.3%   62.9%      11.83                23 MB
Scheme 3                92.4%       93.4%    96.3%   62.4%      9.66                 18.9 MB

As shown in Table 11, simplified modal splicing Scheme 2 (replacing only the first
two modal splicing modules with PPAS) reduces model parameters and size by nearly 50%
compared to full replacement. Notably, indicators show no significant decline, with slight
improvements in accuracy and mAP50-95, verifying the effectiveness of the simplified design.
In order to further verify the balance between the detection effect and model complex-
ity of the algorithmic models, the comparison of detection performance and complexity of
the single-modal and RGB-T bimodal algorithmic models is shown in Table 12.

Table 12. Comparison of detection performance and complexity of the single-modal and RGB-T bi-modal algorithmic models.

Model                      Precision   Recall   mAP50   mAP50-95   Parameters (×10^6)   Model Size
YOLOv11 (visible image)    90.7%       90.3%    93.6%   55%        9.41                 18.3 MB
YOLOv11 (infrared image)   91.2%       88.6%    93.6%   57.6%      9.41                 18.3 MB
CP-YOLOv11-MF              92.5%       93.5%    96.3%   62.9%      11.83                23 MB

As shown in Table 12, the proposed model handles RGB-T bi-modal data with a dual backbone for cross-modal feature extraction and incorporates lightweight designs in both the data input and the algorithmic improvements. Although its parameter count and model size are slightly larger than those of the original single-modal detector, it achieves clearly improved performance at only a modest increase in complexity.

4.6. Visual Analysis


In order to visualize the performance of the algorithmic model established in the forest
fire target detection task, Figure 16 shows the partial detection results of the algorithmic
model CP-YOLOv11-MF, which demonstrates the detection effect of the UAV from different
viewpoints and in different scenes.

Figure 16. Visualization of CP-YOLOv11-MF detection results.

The blue box labeled “fire 0.8” indicates that the model predicts that the target is “fire”
with 80% confidence. From the above figure, it can be seen that the CP-YOLOv11-MF
algorithm model can fulfill the forest fire target detection task well.
At the same time, in order to further analyze the performance differences between
different algorithm models, representative forest fire image detection samples (night envi-
ronment, tree cover, smoke cover) are selected for visual analysis in this section, as shown
in Figures 17–19, to visualize the improvement effect of different algorithm models.

Figure 17. Visualization of the detection performance of each model for nighttime conditions.

Figure 18. Visualization of the detection performance of each model for tree occlusion conditions.

Figure 19. Visualization of the detection performance of each model for smoke occlusion conditions.

Figure 17 illustrates the detection performance of each model under nighttime con-
ditions. While each model demonstrates proficiency in detecting fires with distinct char-
acteristics, person detection may incur pixel-level displacement. This is because humans
lack rich texture in thermal infrared images, leading to blurred detection box borders that
hinder accurate localization. In the mid-term fusion framework (YOLOv11-MF), multiple
detection boxes initially appear, but the final model—incorporating cross-modal feature
fusion and splicing—achieves precise person detection with the highest confidence among
all models.
The performance of visible images in flame detection under tree occlusion conditions
is limited, as shown in Figure 18. For some fire objects, the detection confidence is only 30%,
and the detection boxes have localization bias. Thermal infrared images can effectively
recognize high-temperature target regions that stand out from the surrounding environ-
ment by virtue of their ability to perceive high-temperature areas in the scene. In the
early fusion framework (YOLOv11-EF), the poor fusion of bimodal information initially
generates multiple detection boxes. After model improvement, the confidence level of
all target detections increases to 80%, demonstrating that the adopted algorithm model
effectively enhances the accuracy and stability of forest fire target detection under tree
occlusion conditions. This provides a more reliable solution for forest fire target detection
in complex environments.
There is a false alarm problem with thermal infrared images in smoke occlusion
environments, as shown in Figure 19. Due to the existence of areas around the fire point with
temperatures close to the human body temperature, thermal infrared images incorrectly

identify these areas as person targets. Although visible-light images have rich texture information and can still detect flames and smoke under low visibility, their detection boxes are less accurately localized. In addition, under the early fusion and mid-term fusion frameworks, the visible-light branch suffers from missed detections and fails to detect some actual targets. When the proposed method is used, the detection
confidence is increased to 80% for both critical targets, namely, flames and people, which
improves the accuracy and reliability of detection.
In summary, the constructed CP-YOLOv11-MF model performs the target detection task more accurately in complex forest fire scenarios. Compared with single-
modal detection methods, the model significantly reduces the false alarm rate and missed
alarm rate, effectively overcoming the limitations of single-modal detection. Meanwhile,
by designing the modal interaction structure and optimizing the modal splicing module,
the model’s ability to detect targets in complex environments is enhanced significantly.

5. Conclusions
In this paper, a multi-target and multi-scene forest fire aerial photography dataset
is constructed by collecting data in multiple locations with a UAV equipped with a dual-
optical camera head, which provides a more comprehensive visual dataset for subsequent
forest fire prevention and management research. Meanwhile, the early fusion detection
framework (YOLOv11-EF), the middle fusion detection framework (YOLOv11-MF) and the
late fusion detection framework (YOLOv11-LF) are constructed for the multimodal fusion
strategy on the basis of the YOLOv11 target detection model, which proves the advance-
ment of the RGB-T bimodal target detection network model compared with the single-
modal one. Based on this, a modal interaction structure is designed and a modal splicing
module is optimized to enhance deep cross-modal interaction and fusion for RGB-Thermal
bimodal target detection. Lightweight design is also incorporated during algorithm model
improvement. Finally, the constructed RGB-T dual-modal detection model CP-YOLOv11-MF achieves 92.5%, 93.5%, 96.3%, and 62.9% in terms of precision, recall, mAP50, and mAP50-95. Compared with single-modal visible-light detection, this represents improvements of 1.8%, 3.2%, 2.7%, and 7.9%; compared with single-modal thermal infrared detection, the improvements are 1.3%, 4.9%, 2.7%, and 5.3%.
This paper presents an optimized AI-driven framework for RGB-thermal fusion in
wildfire detection, which significantly improves the accuracy and response efficiency of
monitoring systems. In the context of the growing trend of multi-source data fusion for
forest fire detection, this study provides novel insights into the integration of diverse data
modalities. Future work will focus on further enhancing the scale and diversity of the
multi-scenario fire dataset by continuing to collect data in more forested areas with different
geographic environments and climatic conditions, covering a wide range of terrains such
as mountains, hills, and plains, as well as forested scenarios with different seasons and
day/night time slots, in order to increase the dataset’s level of coverage of complex real-
world scenarios. At the algorithmic research level, we continue to study the cross-modal
fusion mechanism in depth, explore more potential modal interaction features, and improve
the efficiency of the model in utilizing the bimodal data, so as to achieve more stable and
accurate detection in the complex and changing forest fire scenarios.

Author Contributions: Conceptualization, Y.Z. and X.R.; methodology, Y.Z.; data curation, Y.Z.;
writing—original draft preparation, Y.Z.; writing—review and editing, X.R.; supervision, W.S. All
authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the National Natural Science Foundation of China (program
NO. 52321003) and the Startup Foundation for Introducing Talent of NUIST (1523142501164).

Data Availability Statement: The RGBT-3M dataset can be found at https://complex.ustc.edu.cn.

Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Cunningham, C.X.; Williamson, G.J.; Bowman, D.M.J.S. Increasing frequency and intensity of the most extreme wildfires on Earth.
Nat. Ecol. Evol. 2024, 8, 1420–1425. [CrossRef] [PubMed]
2. Rui, X.; Li, Z.; Zhang, X.; Li, Z.; Song, W. A RGB-Thermal based adaptive modality learning network for day–night wildfire
identification. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103554. [CrossRef]
3. Kularatne, S.D.M.W.; Casado, C.Á.; Rajala, J.; Hänninen, T.; López, M.B.; Nguyen, L. FireMan-UAV-RGBT: A Novel UAV-
Based RGB-Thermal Video Dataset for the Detection of Wildfires in the Finnish Forests. In Proceedings of the 2024 IEEE 29th
International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024;
pp. 1–8.
4. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning:
The FLAME dataset. Comput. Netw. 2021, 193, 108001. [CrossRef]
5. Chen, X.; Hopkins, B.; Wang, H.; O’Neill, L.; Afghah, F.; Razi, A.; Fulé, P.; Coen, J.; Rowell, E.; Watts, A. Wildland Fire Detection
and Monitoring Using a Drone-Collected RGB/IR Image Dataset. IEEE Access 2022, 10, 121301–121317. [CrossRef]
6. Toulouse, T.; Rossi, L.; Campana, A.; Celik, T.; Akhloufi, M.A. Computer vision for wildfire research: An evolving image dataset
for processing and analysis. Fire Saf. J. 2017, 92, 188–194. [CrossRef]
7. Li, X.Y.; Chen, S.G.; Tian, C.N.; Zhou, H.; Zhang, Z.X. M2FNet: Mask-Guided Multi-Level Fusion for RGB-T Pedestrian Detection.
IEEE Trans. Multimed. 2024, 26, 8678–8690. [CrossRef]
8. Song, K.C.; Wen, H.W.; Xue, X.T.; Huang, L.M.; Ji, Y.Y.; Yan, Y.H. Modality Registration and Object Search Framework for
UAV-Based Unregistered RGB-T Image Salient Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5531015. [CrossRef]
9. Jin, D.Z.; Shao, F.; Xie, Z.X.; Mu, B.Y.; Chen, H.W.; Jiang, Q.P. CAFCNet: Cross-modality asymmetric feature complement network
for RGB-T salient object detection. Expert Syst. Appl. 2024, 247, 123222. [CrossRef]
10. Lv, Y.; Liu, Z.; Li, G.Y. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26,
6348–6360. [CrossRef]
11. Zhou, W.J.; Dong, S.H.; Fang, M.X.; Yu, L. CACFNet: Cross-Modal Attention Cascaded Fusion Network for RGB-T Urban Scene
Parsing. IEEE Trans. Intell. Veh. 2024, 9, 1919–1929. [CrossRef]
12. Zhang, B.; Li, Z.L.; Sun, F.M.; Li, Z.H.; Dong, X.B.; Zhao, X.L.; Zhang, Y.R. SICFNet: Shared Information Interaction and
Complementary Feature Fusion Network for RGB-T traffic scene parsing. Expert Syst. Appl. 2025, 276, 14. [CrossRef]
13. Pang, Y.; Huang, Y.; Weng, C.Y.; Lyu, J.L.; Bai, C.Y.; Yu, X.S. Enhanced RGB-T saliency detection via thermal-guided multi-stage
attention network. Vis. Comput. 2025, 41, 8055–8073. [CrossRef]
14. Bin Azami, M.H.; Orger, N.C.; Schulz, V.H.; Oshiro, T.; Cho, M. Earth Observation Mission of a 6U CubeSat with a 5-Meter
Resolution for Wildfire Image Classification Using Convolution Neural Network Approach. Remote Sens. 2022, 14, 1874.
[CrossRef]
15. Kumar, A.; Perrusquía, A.; Al-Rubaye, S.; Guo, W. Wildfire and smoke early detection for drone applications: A light-weight
deep learning approach. Eng. Appl. Artif. Intell. 2024, 136, 108977. [CrossRef]
16. Qurratulain, S.; Zheng, Z.Z.; Xia, J.; Ma, Y.; Zhou, F.R. Deep learning instance segmentation framework for burnt area instances
characterization. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103146. [CrossRef]
17. Li, J.; Tang, H.; Li, X.; Dou, H.; Li, R. LEF-YOLO: A lightweight method for intelligent detection of four extreme wildfires based
on the YOLO framework. Int. J. Wildland Fire 2024, 33, WF23044. [CrossRef]
18. Guo, S.H.; Hu, B.; Huang, R. Real-Time Flame Segmentation based on RGB-Thermal Fusion. In Proceedings of the IEEE
International Conference on Robotics and Biomimetics (IEEE ROBIO), Sanya, China, 27–31 December 2021; IEEE: Piscataway, NJ,
USA; pp. 1435–1440.
19. Cui, S.; Ma, A.L.; Wan, Y.T.; Zhong, Y.F.; Luo, B.; Xu, M.Z. Cross-Modality Image Matching Network with Modality-Invariant
Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets. IEEE Trans. Geosci. Remote Sens. 2022,
60, 3099506. [CrossRef]
20. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011
International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
21. Burger, W.; Burge, M.J. Scale-Invariant Feature Transform (SIFT). In Digital Image Processing: An Algorithmic Introduction Using
Java; Burger, W., Burge, M.J., Eds.; Springer: London, UK, 2016; pp. 609–664.

22. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-Modal Image Matching Based on Radiation-Variation Insensitive Feature Transform. IEEE Trans.
Image Process. 2020, 29, 3296–3310. [CrossRef] [PubMed]
23. Gonçalves, L.A.O.; Ghali, R.; Akhloufi, M.A. YOLO-Based Models for Smoke and Wildfire Detection in Ground and Aerial
Images. Fire 2024, 7, 140. [CrossRef]
24. Ding, Y.H.; Wang, M.Y.; Fu, Y.J.; Wang, Q. Forest Smoke-Fire Net (FSF Net): A Wildfire Smoke Detection Model That Combines
MODIS Remote Sensing Images with Regional Dynamic Brightness Temperature Thresholds. Forests 2024, 15, 839. [CrossRef]
25. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical
image segmentation. Comput. Biol. Med. 2024, 178, 108784. [CrossRef] [PubMed]
26. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for
Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME),
Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
27. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing
Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [CrossRef]
28. Ren, S.Q.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
29. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks.
In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Virtual,
18–24 July 2021; pp. 11863–11874.
30. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv
2021, arXiv:2112.05561. [CrossRef]
31. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based Attention Module. arXiv 2021, arXiv:2111.12419. [CrossRef]
32. He, A.; Li, X.; Wu, X.; Su, C.; Chen, J.; Xu, S.; Guo, X. ALSS-YOLO: An Adaptive Lightweight Channel Split and Shuffling Network
for TIR Wildlife Detection in UAV Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17308–17326. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
