DFS: A Dataset for Fire and Smoke Object Detection
https://doi.org/10.1007/s11042-022-13580-x
Abstract
Fire and smoke object detection is of great significance due to the extreme destructive
power of fire disasters. Most existing methods, whether traditional computer-vision models
combined with sensors or deep learning-based models, are restricted to narrow application
scenes and offer relatively poor detection speed and accuracy; they also seldom take smoke
into consideration and usually focus on classification rather than detection. To advance object
detection research in fire and smoke detection, we introduce DFS (Dataset for Fire and
Smoke detection), a high-quality dataset collected from real scenes and annotated under
strict and reasonable rules. To reduce erroneous judgments caused by objects that resemble
fire in color and brightness, we annotate such objects as a new class ‘other’ in addition to
‘fire’ and ‘smoke’. The dataset contains 9462 images in total, named by fire size, which can
benefit different detection tasks. Furthermore, by carrying out extensive experiments on
various object detection models, we provide a comprehensive benchmark on our dataset.
Experimental results show that DFS well represents real fire and smoke detection
applications and is quite challenging. We also test models with different training and
validation proportions on our dataset to find the optimal split ratio for real situations. The
dataset is released at https://github.com/siyuanwu/DFS-FIRE-SMOKE-Dataset.
1 Introduction
Fire can seriously damage our lives, environment, and property. Recent wildfires in
Australia destroyed thousands of buildings, claimed hundreds of human lives and countless
plants and animals, damaged hectares of vegetation, and cost over a hundred billion dollars.
To avoid the disasters and losses caused by fires, researchers have begun to focus on fire
detection methods.
Current fire detection models have various drawbacks. Traditional fire detection systems
use smoke, chromatic, and thermal sensors, which take a long time to respond and only work
within a small scope and a single scene. Later, deep learning-based models were introduced
to increase detection speed and accuracy and to provide visual interpretation for applications.
At present, fire detection based on deep learning has become a research hot spot. Dunnings et al.
[11] use a reduced-complexity deep convolutional neural network for real-time fire region
detection, which is then improved by Samarth et al. [3] by combining different optimization
and normalization techniques. Jadon et al. [28] design a lightweight fire detection neural
network from scratch. Existing deep learning-based fire detection models require a large
number of well-annotated fire images as a driving force to achieve good training results. However,
current methods face the following problems. First, existing fire and smoke datasets do not
include the location information of fire. Many algorithms only pay attention to identifying
fire while ignoring its location, which is not conducive to automatic fire extinguishing by
firefighting equipment. Second, existing fire and smoke object detection datasets contain too
few images, which limits the generalization ability of trained algorithms and the reliability
of their test results. Third, interference factors are not annotated in existing fire and smoke
datasets, resulting in a large number of false alarms in real scenes. As far as we know, there is
only one fire and smoke object detection dataset, posted by Lei [31] at https://github.com/
gengyanlei/fire-detect-yolov4, containing about 2000 photos with fire annotated. Nonetheless,
2000 annotated photos are too few to yield high detection accuracy, and single-class
annotation leads to a high false detection rate for luminous objects.
Since there is no applicable large-scale fire-and-smoke dataset with bounding boxes
for object detection problems, we introduce a new, relatively large fire-and-smoke dataset,
which contains 9462 high-quality images in total. Each image is collected from real-life
scenes and is annotated and reviewed under a clear and reasonable criterion to reduce
annotation errors. Because the ultimate goal is to detect fire as soon as possible, we annotate,
along with fire, the smoke released by fire (rather than smoke alone) and luminous objects
that can be easily mistaken for fire, such as lights, bright colors, and reflections. These three
classes are called ‘fire’, ‘smoke’, and ‘other’ in the dataset. Images containing fire are
classified and named by flame size: large, middle, and small, with 3357, 4722, and 349
images respectively; the remaining 1034 photos contain only luminous objects. With such
clearly named files, training can benefit from augmenting fire images of a specific size for
various kinds of fire detection projects.
To evaluate the effectiveness of our dataset, we compare the detection results of two yolov4
[4] models, both initialized from the same pretrained model provided by darknet [1] and
trained respectively on our dataset and on Lei’s [31]. The results are demonstrated in Fig. 1,
where the first row shows results on our dataset and the second row shows results on Lei’s [31].
Part a shows that the model trained on our dataset reduces the error rate and bounds luminous
objects that can be mistaken for fire as ‘other’. Part b shows that, regardless of the size of the
flame and the density of the smoke, our model is more accurate in both the classification and
the bounding regions. Part c indicates that our model has a lower missing detection rate.

Fig. 1 Comparison of the fire detection results using the yolov4 model trained on our DFS dataset and on
the dataset posted by Lei. (a: error detection; b: detection accuracy; c: missing detection)

Thus, we can conclude that our fire and smoke dataset effectively improves the detection
rate and reduces the false detection rate. In summary, the contributions of our work are as follows:
1) We construct a large-scale fire and smoke dataset with bounding boxes annotated
specifically for object detection problems.
2) To test the performance of classical and novel models, we conduct extensive and
abundant experiments, providing a benchmark with mAP50 and mAP75 as metrics.
3) We carry out exploratory experiments to find a recommended train-val set ratio for
our dataset and discuss the data imbalance problem caused by the introduction of the class
‘other’.
4) To verify the feasibility of our dataset for fire detection applications, we test
and compare existing fire detection algorithms, including both traditional
image-processing-based methods and deep learning-based methods.
The rest of this paper is organized as follows. In Section 2, we introduce works
related to fire datasets and object detection. Section 3 describes the details of dataset
construction and the dataset’s properties. We report and analyze the experimental results of
classical and novel object detection models and fire detection models in Section 4, discuss
future work in Section 5, and conclude the paper in Section 6.
2 Related works
This section details some related works for fire detection and our fire-and-smoke dataset.
We introduce general object detection methods in Section 2.1. Additionally, the existing fire
datasets are shown in Section 2.2 and briefly compared with our proposed dataset.
2.1 Object detection

Object detection is a key task in computer vision: it detects instances of objects of
different classes in an image, with bounding boxes marking their precise locations. Traditional
detection models are built on sophisticated handcrafted features. For example, the VJ
detector [47, 48] incorporates “integral image”, “feature selection” and “detection cascades”
into the process of training detectors with sliding windows. The DPM (Deformable Part-based
Model) [18, 19] treats training as learning how to decompose an object, and inference as
the ensemble of detections on different object parts. The HOG detector [12] improves on the
scale-invariant feature transform and shape contexts in order to detect various object classes. Early
object detection datasets focus mostly on specific problems, such as face detection [25, 26]
and pedestrian detection [12, 13]. With the development of convolutional neural networks,
deep learning-based detection methods were born. They can be classified into two genres:
“two-stage detection”, which trains the detector from coarse to fine, and “one-stage
detection”, which aims to detect objects in one step. For example, RCNN [20, 21], SPPNet
[27], Fast RCNN [22], Faster RCNN [41] and Feature Pyramid Networks [32] are two-stage
detectors. YOLO [4, 42–44, 49], SSD [33] and Retina-Net [34] are one-stage detectors. In
the past 10 years, a great number of well-known datasets have been released such as PAS-
CAL VOC Challenge [16, 17], ImageNet Large Scale Visual Recognition Challenge [45],
MS-COCO Detection Challenge [35], etc. The improvements on these datasets have shown
the power of deep learning-based object detection models. It is easier to train an effective
detector by adopting networks pre-trained on these datasets and fine-tuning the weights
on a specific dataset, because the features that the pre-trained networks learned for detecting
edges, texture, shape, and so on can be transferred to the new network. Thus, in this paper, we
fine-tune various novel and classical deep learning-based object detection models on our fire
dataset, starting from weights pre-trained on the COCO dataset.
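As an illustration of this transfer-learning setup, the following minimal sketch shows an mmdetection-style configuration that fine-tunes a COCO-pretrained Faster R-CNN on the three DFS classes; the base config path, checkpoint path, batch size, and schedule values are placeholders for illustration, not the exact settings used in our experiments.

```python
# Minimal mmdetection-style config sketch for fine-tuning a COCO-pretrained
# Faster R-CNN on the three DFS classes; all paths and schedule values are illustrative.
_base_ = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'  # hypothetical base config

classes = ('fire', 'smoke', 'other')

model = dict(
    roi_head=dict(
        bbox_head=dict(num_classes=3)))  # replace the 80-class COCO head with 3 classes

data = dict(
    samples_per_gpu=8,  # batch size per GPU (illustrative)
    train=dict(classes=classes, ann_file='annotations/train.json', img_prefix='images/'),
    val=dict(classes=classes, ann_file='annotations/val.json', img_prefix='images/'))

# start from COCO-pretrained detector weights and fine-tune on the fire dataset
load_from = 'checkpoints/faster_rcnn_r50_fpn_1x_coco.pth'

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
```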
2.2 Fire and smoke detection datasets

In fire and smoke detection, traditional and deep learning-based classification methods are
applied in detection systems with thermal sensors and photosensors, real-time detection systems,
video-sequence detection, etc. These tasks are trained on datasets labeled only with class
information. For example, Dunnings et al. [11] proposed a fire image dataset in 2018 [14];
Grammalidis et al. [5] created a database of videos for flame and smoke detection; and
Sharma made a dataset containing both fire and normal images public on GitHub [29].
There are also datasets for smoke detection, collected at http://
smoke.ustc.edu.cn/datasets.html. In addition, Dunnings et al. [11] applied semantic segmentation
to fire detection, providing a super-pixel image dataset [15].
To detect fire and give precise location information, classification methods are not applicable.
Although semantic segmentation can give super-pixel details of the fire, it is too slow
to be applied in real systems. Since datasets play an important role in data-driven
research, especially for deep learning-based techniques, a large-scale fire and smoke
dataset is urgently needed to train an effective detector. As far as we know, there are only a
few annotated images and videos for the fire and smoke detection task. Steffensbola posted 23
videos and OpenCV 2.4.9 XML-format annotation files at https://github.com/steffensbola/
furg-fire-dataset. Leilei posted more than 10000 fire and smoke photos, about 2000 of which
have fire annotated, at https://github.com/gengyanlei/fire-detect-yolov4. In our dataset,
we provide over 8000 annotated fire and smoke photos from different real scenarios
and over 1000 images of luminous objects, which help reduce the erroneous judgment of fire.
Compared with the existing datasets, our dataset has many more annotated photos with
high quality. Since we keep detecting fire as the key objective, we only use smoke photos
generated by flames and do not include a large number of smoke-only photos.
3 DFS dataset
In this section, we describe our dataset in detail, from collection to annotation, and report
the statistics of our fire and smoke dataset for users’ reference.
3.1 Image collection

We collect images from Dunnings [14], Sharma [29], Leilei, and also from Google Images
and Baidu Images. To ensure the effectiveness of our dataset in practical applications, we
require that the collected images show fires occurring in real scenes as much as possible. The
fire situations include only flames or only thick smoke in the early stage of a fire, or both
flame and smoke when the fire becomes serious. When choosing images, we also take the
degree of fire into account and divide them into large, middle, and small fire scenes to reflect
the diversity of our dataset more comprehensively. Natural images from different scenes
where fire occurs are selected. Common scenes can be classified into two classes: outdoor
and indoor fires. Forest fires, building fires, car fires, factory fires, etc. are outdoor fires; fires
caused by cooking oil, electrical equipment, and ordinary combustibles burning indoors are
indoor fires. We collect both
iconic and non-iconic images, meaning that we have images containing fire or smoke as the
sole object as well as photos with both fire and smoke. The non-iconic filtering process was
first adopted in MS-COCO [35]. Apart from selecting photos of single scenes, we also
add successive frames from on-fire videos, which capture how flames change over time.
The angles of view of these fire-and-smoke photos are also diverse, meaning that photos of
the same size and type of fire are taken from different distances. Moreover, to facilitate
training for tasks that focus on different sizes of fire, we classify and rename the fire photos
by the size of the real flames. Large fires are flames that are large enough to bring about
disasters. Middle fires are burning objects which may cause great losses if no control
measures are taken. Small fires are natural flames used in daily life, such as fires produced
by candles or gas stoves, or flames on a substance that has just started to burn. Although the
boundary between the small and middle classes is coarse, it does not affect accuracy,
because a borderline fire image can reasonably be classified as either small or middle. In
conclusion, we select images containing flames of different scales so that training can focus
on tasks with different fire sizes; thus, we mainly selected three levels of fire images: large,
middle, and small. Finally, considering all possible situations, we also chose scenes in daily
life that are not fire scenes but contain fog, smoke, or items that resemble flames, as well as
flame-like objects within fire scenes, to enrich our dataset. Although the dataset of Dunnings
contains a large number of fire images,
most of the fire images show repeated scenes or are continuous frames intercepted from
videos. Therefore, we only selected 234 large fire images, 436 middle fire images, and 18
small fire images from its 27,1010 images and 48 videos. From the dataset provided by
Leilei, we took 722 large fire images, 567 middle fire images, and 16 small fire images, all
containing flames and smoke. From Sharma’s dataset, we selected 168 large fire images, 157
middle fire images, and 5 small fire images covering real fire scenes. The remaining 6105
fire images were collected through Google Images and Baidu Images. Among the 1034
non-fire images in our dataset containing objects easily mistaken for fire, 85 images were
selected from the 1301 non-fire images in Leilei’s dataset, 46 images from the 551 non-fire
images in Sharma’s dataset, and 400 non-fire images from Dunnings’ dataset; the rest are
from Google Images or Baidu Images.
We select three categories for fire and smoke detection: ‘fire’, ‘smoke’, and ‘other’, where
‘other’ refers to luminous objects that can be easily mistaken for fire, such as lights, bright
colors, and reflections. The ‘other’ class therefore contains objects that differ from flames but
have been mistaken for fire. In practical applications, such objects are often misdetected as
fire and trigger false alarms. In our dataset, the ‘other’ class is introduced to assist future
state-of-the-art fire detection algorithms: a deep learning-based detector can learn the
characteristics of both flames and flame-like objects from our dataset, grasp the differences
between the two, and ultimately improve detection accuracy. If an algorithm can accurately
detect objects belonging to the ‘other’ class, the probability of misdetecting flame-like
objects as flames is also reduced.
3.2 Annotation
A good annotation criterion is essential for an object detection task. In our fire and smoke
annotation, we exploit the correlation between fire and smoke and also specify the substances
that can be mistaken for fire in real scenarios. We use the software ‘labelme’ to annotate the
images and convert the annotation files to PASCAL VOC, COCO, and YOLO formats.
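As a reference for the YOLO conversion step, the sketch below converts a single labelme JSON file with rectangle shapes into a YOLO-format txt file; the class order and the file paths are assumptions for illustration, not the exact conversion script we used.

```python
import json
from pathlib import Path

CLASSES = ['fire', 'smoke', 'other']  # assumed class index order

def labelme_to_yolo(json_path: str, out_dir: str) -> None:
    """Convert one labelme JSON file with rectangle shapes to a YOLO .txt file."""
    ann = json.loads(Path(json_path).read_text())
    w, h = ann['imageWidth'], ann['imageHeight']
    lines = []
    for shape in ann['shapes']:
        (x1, y1), (x2, y2) = shape['points'][:2]   # two corner points of the rectangle
        xc = (x1 + x2) / 2 / w                      # normalized box center x
        yc = (y1 + y2) / 2 / h                      # normalized box center y
        bw = abs(x2 - x1) / w                       # normalized box width
        bh = abs(y2 - y1) / h                       # normalized box height
        cls = CLASSES.index(shape['label'])
        lines.append(f'{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}')
    out = Path(out_dir) / (Path(json_path).stem + '.txt')
    out.write_text('\n'.join(lines))
```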
All the fires are annotated. Smoke that is thick enough, with little transparency, is considered
in our task; smoke that is too thin to recognize is neglected. Objects that are easily mistaken
for fire and are common in natural fire photos are labeled as ‘other’. For example, car lights,
street lamps, sunlight, reflections from metal and glass, and bright colors such as bright red,
bright orange, bright yellow, and white are labeled as ‘other’ in our dataset. Examples of
‘other’ are shown in Fig. 2.
Fires are annotated in one box if they are generated by the same source and their ignition
points are connected; otherwise, the fire produced by each kindling point is labeled in its own
bounding box. Examples are shown in Fig. 3. Smoke is produced by burning objects, so we
allow its bounding box to overlap slightly with the fire box. Also, noticing that the smoke in
one photo may be composed of several clouds, we bound each cloud as a separate smoke
instance. Examples of these bounding rules can be found in Fig. 4. All substances classified
as ‘other’ are bounded.
There are 9462 images in our dataset: 3357 large fires, 4722 middle fires, 349 small fires,
and 1034 photos containing only objects that are easily mistaken for fire. About half of the
fire photos contain smoke, and several fire photos also contain ‘other’ substances. Our fire
and smoke dataset has many more annotated images than existing datasets for object
detection tasks. The images are of high quality and are chosen from various real scenes and
angles of view.
We classify the size of fire into large, middle, and small so that specific types can be
augmented if needed. We define a large fire when the flame range occupies 60% or more of
the size of the entire image. When the range of the flame occupies 30% to 60% of the size of
the entire image, it is a middle fire. When the range of the flame occupies 30% or less of the
size of the image, it is defined as a small fire.
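A minimal sketch of this size rule is given below, assuming the flame range is summarized as a single area ratio of flame region to image area; the thresholds follow the definition above, and the handling of the exact 30% boundary is a hypothetical choice.

```python
def fire_size_label(flame_area: float, image_area: float) -> str:
    """Classify an image as large/middle/small fire by the fraction of the image
    occupied by the flame range (thresholds follow the DFS definition above)."""
    ratio = flame_area / image_area
    if ratio >= 0.6:
        return 'large'
    if ratio > 0.3:
        return 'middle'
    return 'small'
```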
The smoke we consider in this article is produced by the burning of substances in a flame,
has obvious boundaries, and has low enough transparency to be observed by the human eye.
Fog generated by humid weather is not within our definition of smoke.
We also add a new class, ‘other’, to represent objects that are easily mistaken for fire in
real cases. With this strict and reasonable classification and annotation criterion, our dataset
performs well in detecting fire and smoke. However, since images of class ‘other’ are hard to
collect and we only consider the smoke produced by burning substances, the numbers of
these two classes are not in balance with ‘fire’. More images need to be added or specific
data augmentation techniques need to be applied.
4 Experiments
In this section, we describe the experimental settings, measurement metrics, contrasting
experiments on state-of-the-art and classic models, a comparative experiment on existing
fire detection algorithms, and an analysis of the training set ratio, presented in Sections 4.1,
4.2, 4.3, 4.4 and 4.5, respectively. The baseline results are provided for the following
methods: detectoRS [40], nas-fpn [23], ssd [33], faster-rcnn [41], yolov3 [44], yolov4 [4],
and yolov5 [30].
The methods detectoRS [40], nas-fpn [23], ssd [33], faster-rcnn [41], and yolov3 [44] are
trained under mmdetection [6] on a single Quadro RTX 6000 GPU with 24GB of memory.
Yolov4 is trained with darknet [1] on a single TITAN X (Pascal) GPU with 12GB of memory.
We use 70% of the images in our fire-and-smoke dataset as the training set and the rest as the
validation set. We fine-tune all the models from networks pre-trained on the MS-COCO
dataset. To compare the contrasting methods at their best possible results, we follow the
linear scaling rule [24] to set the learning rate proportional to the batch size and follow the
tricks provided by the authors of these methods. Apart from the special augmentation
strategies used in yolov4, the other methods use an ordinary augmentation strategy, random
flip with probability 0.5. After recalculating anchors with the K-means algorithm [38], we
found the originally chosen anchors performed better, so we do not change the anchor
settings of the models. All the evaluations are tested without TTA (Test Time Augmentation).
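For concreteness, the linear scaling rule [24] we follow can be expressed as in the sketch below; the reference batch size and learning rate are illustrative placeholders rather than the exact values of any particular configuration.

```python
def scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling rule: keep lr / batch_size constant when the batch size changes."""
    return base_lr * batch_size / base_batch_size

# Example: a config tuned for 8 GPUs x 2 images (batch 16) with lr 0.02,
# run on a single GPU with batch size 4 (values illustrative).
lr = scaled_lr(base_lr=0.02, base_batch_size=16, batch_size=4)  # -> 0.005
```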
Average precision (AP) at specific Intersection over Union (IoU) thresholds is used in our
experiments to evaluate the performance of the detectors.
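For reference, the IoU between a predicted box and a ground-truth box (both in (x1, y1, x2, y2) format) can be computed as in the sketch below; under mAP50 and mAP75, a detection of the correct class counts as a match when this IoU is at least 0.5 or 0.75, respectively.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0
```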
Besides the Microsoft COCO criteria, we note that flame and smoke are special objects,
diverse in shape, texture, and color. Bounding boxes generated by object detectors may differ
slightly from the ground truth, which lowers the computed average precision even though the
detectors do identify the fire and smoke areas successfully (Figs. 5 and 6).
Therefore, to evaluate our dataset more comprehensively, we introduce four additional
evaluation metrics, namely accuracy (Acc), precision (P), recall (R), and false-positive rate
(FPR), which are defined as follows. For one image, if the detector misses any fire object, we
call it a false negative (FN), otherwise a true positive (TP). If the detector treats any fire-like
object as fire, we call it a false positive (FP), otherwise a true negative (TN).
Note that Acc, P, R, and FPR are calculated on the test set. Accuracy (Acc) is the proportion
of correctly identified test images (true positives and true negatives) among all test images,
calculated as (7). The higher the accuracy, the better the classification performance.

Acc = (TP + TN) / (TP + TN + FP + FN)    (7)
Precision (P) is the proportion of correctly identified fires (flame or smoke) among all
identified fires (flame or smoke), calculated as (8).

P = TP / (TP + FP)    (8)
Recall (R) is the proportion of correctly identified fires (flame or smoke) among all fires
(flame or smoke) in the test set, calculated as (9).

R = TP / (TP + FN)    (9)
Finally, the false-positive rate (FPR) is the proportion of actual non-fire samples that are
incorrectly judged as fire samples; the smaller this value, the better the detection performance
of the model. It is calculated as (10).

FPR = FP / (FP + TN)    (10)
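Under the image-level TP/FN/FP/TN convention defined above, (7)-(10) can be computed as in the following sketch, assuming the four counts have already been accumulated over the test set.

```python
def image_level_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall, and false-positive rate from image-level counts,
    following (7)-(10)."""
    total = tp + tn + fp + fn
    return {
        'Acc': (tp + tn) / total,
        'P':   tp / (tp + fp) if tp + fp else 0.0,
        'R':   tp / (tp + fn) if tp + fn else 0.0,
        'FPR': fp / (fp + tn) if fp + tn else 0.0,
    }
```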
Fig. 6 Flowchart of the framework of the single-stage detector (above) and the two-stage detector (below)
To further verify the feasibility of our dataset in practical applications, on the one hand we
carry out experiments on detecting the three classes ‘fire’, ‘smoke’ and ‘other’ with all the
methods, and experiments on detecting only ‘fire’ with the methods that achieve high AP in
the three-class detection; these are described in the following subsections. On the other hand,
to show that our dataset can be used for fire detection in real scenes, and to help researchers
propose more advanced fire detection algorithms, we compare the existing fire detection
algorithms and evaluate them with the four indicators Acc, P, R, and FPR.
The numerical results of the three-class detection are shown in Table 2. If we use mAP50 as
the key metric, retinanet-nasfpn performs the best among all the methods, hitting 0.463
mAP50 over all classes, 0.653 for ‘fire’ and 0.477 for ‘smoke’. Although its mAP50 for class
‘other’ is 0.258, slightly lower than faster-rcnn (0.269) and detectoRS (0.282), this lower
precision is acceptable considering that the ultimate goal is detecting fire and smoke. This
retinanet-nasfpn method uses retinanet [34] as its backbone and adopts neural architec-
ture search to optimize the process of designing scalable feature pyramid networks. From
our experiments, we can know that NAS-FPN can extract and fuse features effectively by
meticulously designing the feature pyramid network and combining top-down and bottom-
up connections. If we use mAP75 as the key metric, which has higher demands for the
accuracy of bounding regions, detectoRS performs better than all of the others, with 0.185
for all the classes, 0.306 for ‘fire’, 0.068 for ‘other’ and 0.180 for ‘smoke’. Based on the
ResNet-50 backbone, the authors apply the Recursive Feature Pyramid (RFP), which feeds
the FPN outputs back to each stage of the bottom-up backbone through feedback connections,
and Switchable Atrous Convolution (SAC), which convolves the inputs with two different
atrous rates, to the multistage cascade detector HTC. Our experiments show that more
accurate location information can be extracted and predicted by this “thinking twice” design
at both the macro and micro level; RFP and SAC also work well when used alone. Also,
faster-rcnn, as a two-stage method, performs well, with mAP50 0.454 over all classes.
Although multi-stage and two-stage detectors have higher precision than one-stage detectors,
their more complex networks often lead to longer inference times, which makes them less
applicable for real-time detection.
Table 2 Numerical results of the three-class detection: overall and per-class mAP50 and mAP75 (f = fire, o = other, s = smoke)

Method            Detector (backbone)       mAP50   mAP75   f mAP50   o mAP50   s mAP50   f mAP75   o mAP75   s mAP75
detectoRS         Cascade+ResNet-50         0.461   0.185   0.639     0.282     0.462     0.306     0.068     0.180
rfp               Cascade+ResNet-50         0.452   0.174   0.619     0.276     0.461     0.286     0.065     0.172
sac               Cascade+ResNet-50         0.451   0.177   0.627     0.269     0.457     0.294     0.064     0.172
retinanet-fpn     ResNet-50+FPN             0.420   0.149   0.604     0.240     0.416     0.253     0.052     0.141
retinanet-nasfpn  ResNet-50+NASFPN          0.463   0.168   0.653     0.258     0.477     0.286     0.054     0.164
ssd300            VGG16 Size300             0.435   0.143   0.608     0.241     0.455     0.242     0.049     0.139
ssd512            VGG16 Size512             0.435   0.141   0.608     0.246     0.451     0.237     0.049     0.137
faster-rcnn       ResNet-50+FPN             0.454   0.156   0.635     0.269     0.459     0.261     0.064     0.142
yolov3-416        DarkNet-53 Scale:416      0.361   0.110   0.531     0.203     0.348     0.196     0.035     0.099
yolov3-608        DarkNet-53 Scale:608      0.398   0.107   0.573     0.218     0.402     0.191     0.025     0.106
yolov4            CSPDarkNet-53 Scale:416   0.414   0.108   0.575     0.254     0.412     0.196     0.028     0.098
yolov4-tiny       CSPDarkNet-53 Scale:416   0.385   0.068   0.561     0.210     0.385     0.127     0.018     0.058
Since we trained and tested all the methods on two different kinds of GPUs, we do not
provide comparisons of inference time or FLOPs. However, referring to the results of these
methods on COCO, the inference time of multi-stage and two-stage detectors is typically
about ten times that of one-stage detectors when measured in milliseconds, and their FLOPs
are roughly a hundred times larger. Among the one-stage detectors, ssd performs well, with
only a slight difference between sizes 300 and 512. Also, yolov4 performs much better than
yolov3 thanks to newly added universal features such as WRC, CSP, Mish activation, Mosaic
data augmentation, etc. The reason why yolov4 performs worse than ssd may be that the new
data augmentation strategies used in yolov4 are not suitable for fire and smoke data, which
needs to be verified by future experiments.
For the different classes, we can see from the experiments that class ‘fire’ achieves higher
scores than ‘smoke’, which in turn scores higher than ‘other’. This phenomenon is directly
caused by the number of images of each class and indirectly by the properties of each class.
Firstly, 8428 images contain ‘fire’, and fire is conspicuous for its special color and shape.
Secondly, about one-third of the fire images contain ‘smoke’, in which only thick smoke is
annotated; due to its appearance correlation with fire and its cloud-like shape, ‘smoke’ is
easier to detect than ‘other’. Thirdly, compared with fire, class ‘other’ has few images, with
only 1034 images containing solely class ‘other’ and hundreds of fire images containing
‘other’ objects such as street lamps and car lights. Also, since we annotate all objects that can
be easily mistaken for fire as class ‘other’, the shapes, colors, and other properties of this
class vary widely and cannot be regarded as the properties of a single object in the traditional
sense. This is why class ‘other’ has such a low mAP in our experiments. Although, because
of the data imbalance among the three classes, the mAP of fire is pulled down by the
existence of ‘other’ (compared in Section 4.3.2 below), the introduction of ‘other’ can
effectively reduce the possibility of erroneous judgments of fire, as discussed in Section 4.6.
We choose typical methods from the three kinds of stage methods for single-class fire
detection. The numerical results of single-class detection are shown in Table 3. The
comparison between methods is similar to that of the three-class detection: retinanet-nasfpn
hits the highest mAP50 and detectoRS achieves the best mAP75, and the precision of
multi-stage or two-stage detectors is normally higher than that of one-stage detectors.
We can also learn from the table that the mAP50 of fire-only detection is on average about
0.2 higher than in the three-class detection and the mAP75 is approximately 0.1 higher. This
reflects the data imbalance problem, detailed in Section 4.6, and also shows that existing
methods are not strong enough to perform object detection on an imbalanced dataset.
The split ratio of the training set, validation set, and testing set directly affects how well
the networks are trained and how they perform. Since the deep learning-based object
detection models we use are data-driven, insufficient training data will not fit the models
effectively, and too few validation images will not fine-tune the parameters well. Thus, we
carry out experiments to find the optimal split ratio for our dataset on deep learning-based
object detection models. We choose detectoRS, retinanet-nasfpn, and yolov4 as the test
methods. In the experiments, we set the training set to 10%, 20%, 30%, 40%, 50%, 60%,
70%, 80% and 90% of the whole dataset. The results of the training set ratio experiments are
shown in Table 4. We use k to denote the training set ratio, where k:10-k equals train:val.
Also, we introduce a variable ‘changeR’ to show the change ratio of the current mAP relative
to the previous one, defined in (11). When k is close to 10, changeR becomes small, meaning
that the increase in precision is low and there are few images left for validation; in these
cases, the models’ precision in a real application will not be as high as in the test table.
Therefore, we use changeR to choose a reasonable k.
changeR = (curmAP − premAP) / premAP    (11)
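Given a sequence of mAP values obtained for increasing k, changeR can be computed as in the sketch below; the numbers in the example are illustrative, not taken from Table 4.

```python
def change_ratios(map_values):
    """changeR of each mAP relative to the previous one, following (11)."""
    return [(cur - prev) / prev for prev, cur in zip(map_values, map_values[1:])]

# Illustrative mAP50 values for four consecutive k settings (hypothetical numbers)
print(change_ratios([0.40, 0.43, 0.45, 0.455]))  # -> [0.075, 0.0465..., 0.0111...]
```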
According to the numerical results, the mAP of the detection models grows as the amount of
training data increases, which reflects the models’ data-driven property. For detectoRS, when
the training set accounts for 70% of the whole dataset, mAP50 increases the most relative to
the previous setting. When k equals 3, the changeR of mAP75 is the highest, meaning that
about 3000 images are necessary for training; however, the mAP is not yet high enough for
use. Comparing the changeR values at high mAP75, changeR is the largest when k is 7. For
retinanet-nasfpn, in the first run the mAP for k equal to 5, 8, and 9 is much lower than for the
previous k, which is unreasonable; we test these k values again and obtain results that match
our expectations, so this model is not robust enough during training. When k is 3, changeR is
the highest but mAP50 is still low; when k is 6, changeR is large and the precision is high
enough. For yolov4, when k is 3, changeR of mAP50 is the largest; when k equals 2, changeR
of mAP75 is the highest; when k is 6, both mAP50 and mAP75 reach a high level with a
large changeR. All in all, when k is 6 or 7, object detection models can be trained well and
perform well on our dataset.
In the following experiment, we selected 851 images from our dataset as the test set, of
which 740 are fire images with different scales of flame and 111 are non-fire images with
fire-like objects. Not all flame images contain smoke: 363 of them do, and the rest do not.
Note that Acc, P, R, and FPR are calculated on this test set.
Table 5 shows the experimental comparison results of existing methods on our dataset.
Using our dataset as the test set gives realistic detection results, whether a fire detection
method is based on image processing or on deep learning. Among the flame detection
algorithms, there are traditional models based on image processing, such as the RGB model
and the YCbCr model, and models based on deep learning.
Table 4 Numerical results (mAP) for different training and validation set ratios (train:val = k:10-k)

                          detectoRS                           retinanet-nasfpn                    yolov4
k                 mAP50   changeR   mAP75   changeR    mAP50   changeR   mAP75   changeR    mAP50   changeR   mAP75   changeR
5                 0.443   0.025     0.179   0.065      0.369   -0.107    0.109   -0.243     0.388   0.060     0.093   0.094
5 (retest)                                             0.443   0.073     0.162   0.125
6                 0.454   0.025     0.174   -0.028     0.443   0.201     0.157   0.440      0.407   0.049     0.104   0.118
7                 0.460   0.132     0.185   0.063      0.453   0.023     0.160   0.019      0.420   0.032     0.105   0.010
8                 0.461   0.002     0.192   0.038      0.413   -0.088    0.143   -0.106     0.427   0.017     0.107   0.019
8 (retest)                                             0.446   -0.015    0.167   0.044
9                 0.463   0.004     0.184   -0.042     0.359   -0.131    0.115   -0.196     0.434   0.016     0.109   0.019
9 (retest)                                             0.397   -0.109    0.124   -0.257

The retest rows report the repeated runs of retinanet-nasfpn described in the text. Darkened (bold) changeR values in the original table indicate a locally optimal value.
The deep learning-based flame detectors include InceptionV1-OnFire, FireNet, finetuned
GoogleNet, and so on. The accuracy of the traditional fire detection method based on the
RGB model is higher than that of the method based on the YCbCr model and reaches 82.0%.
The InceptionV3-OnFire model has the highest accuracy, 89.1%, among the deep
learning-based fire detection algorithms. From the experimental results we find that, for
flame detection, both traditional image-processing-based methods and deep learning-based
methods achieve good detection accuracy. However, traditional fire detection methods rely
on handcrafted features such as color, motion, and texture, which cannot fully characterize
the flame and are easily disturbed by objects with similar color or motion, so the false
detection rate of traditional methods is very high and can reach 90.9%.
Among the smoke detection algorithms, there are traditional models based on image
processing, such as the HOG+SVM model and the LBP+SVM model, and models based on
deep learning, such as a 9-layer CNN. Among all smoke detection methods, the HOG+SVM
method achieves the highest accuracy, 80.5%, and the lowest false detection rate, 19.5%. In
this article, the models that can detect both flame and smoke are FireNet (a 7-layer CNN)
and EfficientNet. Both achieve good results in detecting flames, but their ability to detect
smoke is slightly insufficient. Among all the fire detection methods, the EfficientNet-based
method achieves the highest accuracy, 93.6%, and the lowest false detection rate, 15.3%,
when detecting flame; when detecting smoke, although it is not as accurate as the
HOG+SVM model, it has the lowest false detection rate, 18.8%.
In the fire and smoke detection task, fire and fire-like objects can easily be confused, which
can lead to great costs for fire management. Thus, we introduce the class ‘other’ to decrease
the possibility of erroneous judgments. In our tests, we observe that lights and bright colors
can be mistaken for fires. However, since it is hard to collect all the common objects with
bright colors or lights in different scenes, the number of ‘other’ images is much lower than
that of ‘fire’, leading to the problem of data imbalance.
We carry out an experiment with relatively balanced ‘fire’ and ‘other’ classes, with no smoke,
to see whether the detector performs well with balanced data as input. With 2059 fire photos
and 1034 ‘other’ images, the mAP50 over all classes is 48.35%; specifically, ‘fire’ reaches
62.29% and ‘other’ 34.41%. We then train a fire-only detector and obtain an mAP50 of
45.73%. So, with the introduction of ‘other’, the mAP50 over all classes increases by 2.62%
and the mAP50 of ‘fire’ increases greatly. Thus, we can conclude that when objects that can
be easily mistaken for fire are provided as inputs and these photos are in relative balance with
fire, the fire detector can be greatly improved. In the future, more strategies for dealing with
the imbalance problem need to be applied.
5 Future work
For the fire and smoke object detection problem, the difficulty can be concluded to speed
and accuracy of detecting. In terms of detection speed, real-time object detection has enor-
mous demands for models’ reaction time and memories. Since current networks are too
large to be applied on devices with low computation, to speed up the models, network prun-
ing, quantification, and distillation techniques can be used. When it comes to accuracy, fire
detection is low mostly due to the small size and erroneous judgments. Small fires features
6724 Multimedia Tools and Applications (2023) 82:6707–6726
are hard to be extracted by the networks due to the low resolution. Methods like feature
fusion, extra high-resolution handcrafted features can be put into use. For erroneous judg-
ments, as discussed in the previous Section 4.6, the introduction of class ‘other’ is beneficial
to detecting fires but with a little benefit when the inputs are not balanced. Specific data aug-
mentation strategies for fire and smoke need to be explored in the future. Also, cross-modal
learning with fire images as inputs can also be applied to increase the accuracy.
6 Conclusion
We build a dataset for fire and smoke object detection, which is much larger than existing
datasets in fire detection tasks. Apart from annotating fires and smokes, we bound objects
that can be easily mistaken for fire as class ‘other’ in order to decrease the possibility of
erroneous judgments. Furthermore, we establish a benchmark for object detection in fire
and smoke images using mAP50 and mAP75 metrics. When the training set is 60% or 70%
of the whole dataset, it’s proved that object detection models can be trained well on our
dataset. The dataset we proposed will be helpful for the further development of fire detection
model learning, fire images segmentation detection, and data enhancement related to fire
detection. Currently, the introduction of class ‘other’ with fewer images than fires cause the
problem of data imbalance, thus more data augmentation methods need to be explored in
the future. However, it is believed that the introduction of class ‘other’ can help researchers
to study the impact of fire-like interference objects in fire detection, thereby improving the
accuracy of fire detection in a targeted manner. Meanwhile, it is also hoped that this work
can help researchers further improve the accuracy and speed of fire detection.
Acknowledgements Siyuan Wu and Xinrong Zhang are co-first authors of the article.
Funding Information No funding was received to assist with the preparation of this manuscript.
Declarations
Conflict of Interests All authors certify that they have no affiliations with or involvement in any organiza-
tion or entity with any financial interest or non-financial interest in the subject matter or materials discussed
in this manuscript.
References
1. AlexeyAB (2017) Yolo v4, v3 and v2 for windows and linux, https://github.com/AlexeyAB/darknet.
Accessed 12 Jan 2021
2. Avgerinakis K, Briassouli A, Kompatsiaris I (2012) Smoke detection using temporal hoghof descriptors
and energy colour statistics from video. In: International workshop on multi-sensor systems and networks
for fire detection and management
3. Bhowmik N, Breckon TP et al (2019) Experimental exploration of compact convolutional neural network
architectures for non-temporal real-time fire detection, arXiv:1911.09010
4. Bochkovskiy A, Wang C-Y, Liao H-YM (2020) Yolov4: optimal speed and accuracy of object detection,
arXiv:2004.10934
5. Cetin NGKDE (2017) Firesense database of videos for flame and smoke detection, https://zenodo.org/
record/836749#.X AFQi17FN1. Accessed 25 Nov 2020
6. Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, Sun S, Feng W, Liu Z, Xu J, Zhang Z, Cheng D, Zhu
C, Cheng T, Zhao Q, Li B, Lu X, Zhu R, Wu Y, Dai J, Wang J, Shi J, Ouyang W, Loy CC, Lin D (2019)
Mmdetection: open mmlab detection toolbox and benchmark
7. Chen K, Pang J, Wang J, Xiong Y, Li X, Sun S, Feng W, Liu Z, Shi J, Ouyang W, Loy CC, Lin D (2019)
Hybrid task cascade for instance segmentation
8. Chen T-H, Wu P-H, Chiou Y-C (2004) An early fire-detection method based on image processing. In:
2004 International Conference on Image Processing. ICIP’04. IEEE, vol 3, pp 1707–1710
9. Celik T, Demirel H, Ozkaramanli H, Uyguroglu M (2007) Fire detection using statistical color model in
video sequences. J Vis Commun Image Represent 18(2):176–185
10. CA GS, Bhowmik N, Breckon TP (2019) Experimental exploration of compact convolutional neural net-
work architectures for non-temporal real-time fire detection. In: 2019 18th IEEE international conference
on machine learning and applications (ICMLA). IEEE, pp 653–658
11. Dunnings AJ, Breckon TP (2018) Experimentally defined convolutional neural network architecture
variants for non-temporal real-time fire detection. In: 2018 25th IEEE international conference on image
processing (ICIP). IEEE, pp 1558–1562
12. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer
society conference on computer vision and pattern recognition (CVPR’05), vol 1. IEEE, pp 886–893
13. Dollar P, Wojek C, Schiele B, Perona P (2011) Pedestrian detection: an evaluation of the state of the art.
IEEE Trans Pattern Anal Mach Intell 34(4):743–761
14. Dunnings A (2018) Fire image data set for dunnings 2018 study - png still image set, https://collections.
durham.ac.uk/files/r2d217qp536#.X AIqC17FN0. Accessed 18 Nov 2020
15. Dunnings A (2019) Fire superpixel image data set for samarth 2019 study - png still image set, https://
collections.durham.ac.uk/files/r10r967374q#.XzzP9fgzZQI. Accessed 20 Nov 2020
16. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes
(voc) challenge. Int J Comput Vis 88(2):303–338
17. Everingham M, Eslami SA, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual
object classes challenge: a retrospective. Int J Comput Vis 111(1):98–136
18. Felzenszwalb P, McAllester D, Ramanan D (2008) A discriminatively trained, multiscale, deformable
part model. In: 2008 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–8
19. Felzenszwalb PF, Girshick RB, McAllester D (2010) Cascade object detection with deformable part
models. In: 2010 IEEE Computer society conference on computer vision and pattern recognition. IEEE,
pp 2241–2248
20. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection
and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 580–587
21. Girshick R, Donahue J, Darrell T, Malik J (2015) Region-based convolutional networks for accurate
object detection and segmentation. IEEE Trans Pattern Anal Mach Intell 38(1):142–158
22. Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision,
pp 1440–1448
23. Ghiasi G, Lin T-Y, Le QV (2019) Nas-fpn: learning scalable feature pyramid architecture for object
detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7036–
7045
24. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2018)
Accurate large minibatch sgd: training imagenet in 1 hour
25. Hjelmås E, Low BK (2001) Face detection: a survey. Comput Vis Image Understand 83(3):236–274
26. Huang GB, Mattar M, Berg T, Learned-Miller E (2008) Labeled faces in the wild: a database for studying
face recognition in unconstrained environments
27. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual
recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916
28. Jadon A, Omama M, Varshney A, Ansari MS, Sharma R (2019) Firenet: a specialized lightweight fire &
smoke detection model for real-time iot applications, arXiv:1905.11922
29. Jivitesh Sharma MG (2017) Fire-detection-image-dataset, https://github.com/cair/Fire-Detection-Image-
Dataset. Accessed 13 Nov 2020
30. Jocher G et al (2020) Yolov5, Code repository https://github.com/ultralytics/yolov5. Accessed 5 Jan 2021
31. Lei G (2020) Fire-detect-yolov4, https://github.com/gengyanlei/fire-detect-yolov4. Accessed 5 Jan 2021
32. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for
object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 2117–2125
33. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC (2016) Ssd: single shot multibox
detector. In: European conference on computer vision. Springer, pp 21–37
34. Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In:
Proceedings of the IEEE international conference on computer vision, pp 2980–2988
35. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft
coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
36. Luo Y, Zhao L, Liu P, Huang D (2018) Fire smoke detection algorithm based on motion characteristic
and convolutional neural networks. Multimed Tools Appl 77(12):15075–15092
37. Muhammad K, Ahmad J, Mehmood I, Rho S, Baik SW (2018) Convolutional neural networks based fire
detection in surveillance videos. IEEE Access 6:18174–18183
38. mnslarcher (2020) K-means anchors ratios calculator, https://github.com/mnslarcher/kmeans-anchors-
ratios. Accessed 7 Feb 2021
39. midasklr. Firesmokedetectionbyefficientnet, https://github.com/midasklr/FireSmokeDetectionByEfficientNet/,
Accessed 18 May 2020
40. Qiao S, Chen L-C, Yuille A (2020) Detectors: detecting objects with recursive feature pyramid and
switchable atrous convolution, arXiv:2006.02334
41. Ren S, He K, Girshick R, Sun J (2016) Faster r-cnn: towards real-time object detection with region
proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
42. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detec-
tion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–
788
43. Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 7263–7271
44. Redmon J, Farhadi A (2018) Yolov3: an incremental improvement, arXiv:1804.02767
45. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein
M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
46. Russo AU, Deb K, Tista SC, Islam A (2018) Smoke detection method based on lbp and svm from
surveillance camera. In: 2018 International conference on computer, communication, chemical, material
and electronic engineering (IC4ME2). IEEE, pp 1–4
47. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceed-
ings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR
2001, vol 1. IEEE, pp I–I
48. Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
49. Wang C-Y, Bochkovskiy A, Liao H-YM (2020) Scaled-yolov4: scaling cross stage partial network,
arXiv:2011.08036
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with
the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article
is solely governed by the terms of such publishing agreement and applicable law.