INDUSTRIAL APPLICATION

A Deep-Learning-Based Computer Vision Solution for Construction Vehicle Detection
detection (Xue & Li, 2018), and pavement crack detection (Zhang et al., 2017) have been investigated by researchers. Image-based monitoring and evaluation have also received considerable attention in the construction engineering domain due to the dynamic nature and vastness of typical construction sites. For instance, Memarzadeh, Golparvar-Fard, and Niebles (2013) used handcrafted feature extraction, namely the histogram of oriented gradients (HOG) and colors, in order to detect construction vehicles and workers. Also, Chi and Caldas (2011) proposed a methodology for detecting construction vehicles and workers using background subtraction, morphological processing, and neural networks for classifying objects. Kim, Kim, and Kim (2016) proposed a framework for monitoring struck-by accidents using computer vision techniques and fuzzy inference. In the computer vision step, they used background subtraction, morphological operations, object classification, and tracking. Afterward, they employed proximity and crowdedness as contextual information about the construction site in fuzzy inference (Kim et al., 2016).

Unfortunately, the abovementioned traditional computer vision solutions suffer from an inherent lack of generalization and require extensive development effort and domain knowledge (LeCun et al., 2015). In contrast, a deep learning approach to computer vision problems introduces an alternative end-to-end solution capable of automatic feature extraction without explicit use of domain-knowledge-based feature selection.

Among safety-specific studies, W. Fang, Ding, Luo, and Love (2018) used the Faster R-CNN (Ren, He, Girshick, & Sun, 2015) structure to detect workers and their safety harnesses. Faster R-CNN was also used by Q. Fang et al. (2018) to detect workers with no hardhat. A good comparison between traditional computer vision techniques and deep learning solutions can be seen by comparing and contrasting Q. Fang et al. (2018) and Park, Elsafty, and Zhu (2015). Park et al. (2015) employed handcrafted feature extraction, that is, HOG, to detect human bodies and hardhats. They then used spatial information about hats and bodies to match detected hats and bodies (Park et al., 2015). Q. Fang et al. (2018), however, utilized Faster R-CNN to detect workers with no hardhats directly. The solution proposed by Q. Fang et al. (2018) is end to end, requiring no handcrafted feature extraction.

In the activity-understanding area, Ding et al. (2018) used a hybrid deep learning model to detect dangerous activities. They employed the Inception V3 (Szegedy et al., 2015) structure to extract features and then used long short-term memory (Greff, Srivastava, Koutník, Steunebrink, & Schmidhuber, 2016) in order to incorporate temporal effects and identify unsafe activities (Ding et al., 2018). Moreover, Luo et al. proposed a convolutional network-based solution for recognizing workers' activities. Their study consisted of four main steps. First, they tracked the workers using a single object-tracking algorithm. Second, they used optical flow estimation to extract temporal and spatial information about objects. Then, they classified the activities from both information streams. Finally, they merged the results of the temporal and spatial stream classifications to estimate the final activity (Luo, Li, Cao, Yu et al., 2018). Luo, Li, Cao, Dai et al.'s framework used still-image data to identify worker activities. They employed Faster R-CNN to detect image objects and spatial information about these objects in order to define activity patterns (Luo, Li, Cao, Dai et al., 2018).

Construction vehicle detection using deep learning has also been investigated by some researchers. Along with using transfer learning (Shao, Zhu, & Li, 2015), Kim, Kim, Hong, and Byun (2018) employed R-FCN (Dai, Li, He, & Sun, 2016) to detect construction vehicles. W. Fang, Ding, Zhong, Love, and Luo (2018) used Faster R-CNN to detect workers and excavators on construction sites. Son, Choi, Seong, and Kim (2019) used Faster R-CNN to detect construction workers in various poses against a variety of backgrounds. However, almost all of the abovementioned endeavors have focused on providing solutions that disregard the proposed solution's real-time on-the-construction-site performance, efficiency, and cost of deployment.

Extensive review of the literature finds that most studies focus on the development of improved techniques for image analytics, but very few look at the economics of final deployment or the inevitable trade-off between accuracy and cost of deployment. In the infrastructure management domain, we found only one study investigating inference at the edge of networks, applied to road damage detection (Maeda, Sekimoto, Seto, Kashiyama, & Omata, 2018). However, to the best of our knowledge, there is no study in construction engineering that focuses on inference at the edge using embedded devices. This paper aims at filling this gap and providing researchers and engineers a practical and comprehensive deep-learning-based solution that can detect construction vehicles, from the very first step, solution development, to the last step, solution deployment. Our solution includes not only the related software but also hardware.

This paper covers both phases necessary for a comprehensive deep learning solution for industrial applications. In the first phase, the development phase, data gathering and preparation, model selection, model training, and model validation are covered. The second phase describes model optimization, application-specific hardware selection, and solution evaluation. The main contributions of this paper can be summarized as follows:

• Improving the advanced infrastructure management (AIM) construction vehicle dataset by adding a new vehicle class and annotating the additional images associated with this class.
• Proposing an improved version of the SSD_MobileNet object detector that is suitable for embedded devices.
• Proposing several embedded devices for various scenarios in the construction engineering domain. These scenarios include but are not limited to (a) applications requiring real-time performance, such as safety and object tracking, and (b) applications needing semi-real-time performance, such as productivity analysis as well as management- and security-related analyses.
• Proposing two hardware setups to meet the needs of varying use cases for practical field implementation.

TABLE 1 Data-splitting details for training, validation, and testing image datasets

Image types    Total    Training    Validation    Testing
Loader         787      504         126           157
Excavator      361      231         58            72
Dump truck     760      486         122           152
Mixer truck    659      422         105           132
Roller         353      226         56            71
Grader         351      225         56            70
2 DEVELOPMENT PHASE

Our development phase involved (a) data gathering and preprocessing that resulted in prepared data, leading to (b) our proposed detection model and (c) training and validation of this model.

2.1 Data gathering and preprocessing

Our first step of solution development involved data gathering and preprocessing. Data can be gathered using three major processes. First, it can be gathered from available large-scale datasets such as ImageNet (Russakovsky et al., 2015), the Common Objects in Context (COCO) database (T.Y. Lin et al., 2014), and the Open Images dataset (Kuznetsova et al., 2018). Second, data can be gathered using web-crawling techniques (Olston & Najork, 2010). Finally, image data can be captured at the location of application by researchers/engineers. The first and second approaches were used in this study.

We used the AIM dataset (Kim et al., 2018), originally drawn from the ImageNet dataset (Russakovsky et al., 2015). This dataset is a subset of ImageNet and contains construction vehicle images, that is, excavators, loaders, rollers, concrete mixer trucks, and dump trucks. Moreover, to complete the data-gathering process, we employed the web-crawling technique to enhance the dataset. Web crawling refers to an automated process in which a crawler (bot) systematically browses the web to retrieve information (Olston & Najork, 2010). Web crawling was used to gather images related to the "Grader" object class. The resulting dataset can be shared upon request.
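A minimal sketch of such a crawling step is shown below. The target URL, parsing choices, and file naming are illustrative placeholders rather than the exact pipeline used in this study.

```python
# Illustrative image crawler for gathering "Grader" images; the page URL and
# selectors are hypothetical, and a production crawler should also respect
# robots.txt and rate limits (Olston & Najork, 2010).
import os
import requests
from bs4 import BeautifulSoup

def crawl_images(page_url, out_dir, limit=50):
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    saved = 0
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src or not src.startswith("http"):
            continue  # skip inline/relative sources in this simple sketch
        data = requests.get(src, timeout=10).content
        with open(os.path.join(out_dir, f"grader_{saved:04d}.jpg"), "wb") as f:
            f.write(data)
        saved += 1
        if saved >= limit:
            break

crawl_images("https://example.com/grader-gallery", "data/grader")  # hypothetical URL
```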
After both automated and human visual inspection of the "Grader" images and confirmation of their correctness and quality, these additional construction vehicle images were annotated by the authors. The dataset was then split into (a) a training/validation dataset and (b) a testing dataset. Only 80% of the original dataset was dedicated to training/validation; the remaining 20% was dedicated to testing. The training/validation dataset was similarly divided: 80% of it was used for training, and the other 20% was devoted to validation. Table 1 summarizes the data-splitting details.
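The two-stage 80/20 split described above can be expressed as a short sketch; the scikit-learn call is illustrative of the procedure rather than the exact tooling used in this study.

```python
# Two-stage 80/20 split: 20% held out for testing, then 20% of the
# remainder held out for validation (approximately the per-class counts
# reported in Table 1, up to rounding).
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, seed=42):
    trainval, test = train_test_split(image_paths, test_size=0.2,
                                      random_state=seed)
    train, val = train_test_split(trainval, test_size=0.2,
                                  random_state=seed)
    return train, val, test
```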
Visual inspection of the resulting datasets reveals multiple challenges associated with the computer vision task, such as viewpoint variation, scale variation, occlusion, background clutter, and intraclass variation. Figure 1 shows examples. Contrary to large-scale image datasets, relatively fewer data are available for specialized applications such as construction vehicle detection. Consequently, training a model capable of generalization while not underfitted or overfitted may be unattainable. Moreover, in some applications, limitations in hardware resources further increase the complexity of developing a workable solution. This is not necessarily the case for other scenarios, such as computer vision object detection competitions, which differ greatly from real-life use cases.

FIGURE 1 Examples of challenges associated with the visual detection of construction equipment: (a) viewpoint variation, (b) scale variation, (c) background clutter, (d) occlusion, and (e) intraclass variation

Although the authors have tried their best to provide comprehensive information about each phase of the study, it was not possible to cover every technical detail within this paper. However, we have referenced sufficient studies covering these details to enable readers to fully understand and follow our process. The subsequent sections detail our proposed detection model, model training, and model validation.

2.2 Proposed detection model

There are two general types of object detectors, that is, one-stage and two-stage object detectors. Although two-stage object detectors are known for their accuracy, they are too computationally intensive to be used for embedded devices or for applications requiring real-time performance (Girshick, 2015; Girshick, Donahue, Darrell, & Malik, 2015; Ren et al., 2015). One-stage detectors, however, combine these two steps and perform classification and localization using one single network.

The detector model used in this study is the single shot detector (SSD; Liu et al., 2016). This model uses an auxiliary network for feature extraction, also known as a base network. We used MobileNet as the base network in this study. MobileNet and its variants are optimized primarily for speed (Howard et al., 2017).

The main building block of MobileNets is the depthwise separable convolution, which factorizes the standard convolution into two distinct operations, that is, depthwise and pointwise convolutions (Sifre, 2014). It can be shown that depthwise separable convolutions have fewer parameters and can reduce the computation eight to nine times (Howard et al., 2017). In this study, each convolution in the network was followed by batch normalization (Ioffe & Szegedy, 2015) and ReLU6 activation (Krizhevsky & Hinton, 2010). SSD uses different feature maps, some of them from the MobileNet base network, to perform classification and localization regression. These feature maps have different sizes in order to leverage high-level as well as low-level information.
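The parameter saving behind the "eight to nine times" figure can be verified with a simple count: a k × k standard convolution has k·k·Cin·Cout weights, whereas its depthwise separable factorization has k·k·Cin (depthwise) plus Cin·Cout (pointwise). The channel widths below are representative values, not a specific layer of the proposed model.

```python
# Parameter count of a 3x3 standard convolution vs. its depthwise
# separable factorization (depthwise followed by 1x1 pointwise).
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise + pointwise

k, c_in, c_out = 3, 256, 256
ratio = standard_conv_params(k, c_in, c_out) / separable_conv_params(k, c_in, c_out)
print(f"reduction: {ratio:.1f}x")  # ~8.7x, consistent with the 8-9x reported
```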
TABLE 2 Average precision (AP) for each object category derived using the trained model and test dataset

       Dump truck   Excavator   Grader    Loader    Mixer truck   Roller
AP     92.31%       83.70%      93.86%    93.77%    96.94%        86.65%    mAP = 91.20%
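For reference, the mAP in Table 2 is simply the unweighted mean of the six per-class APs:

```python
# mAP as the mean of the per-class average precisions from Table 2.
aps = {"Dump truck": 92.31, "Excavator": 83.70, "Grader": 93.86,
       "Loader": 93.77, "Mixer truck": 96.94, "Roller": 86.65}
print(f"mAP = {sum(aps.values()) / len(aps):.2f}%")  # ~91.20%
```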
FIGURE 4 Examples of failure in detection: (a) misclassified, (b) merged, (c and d) missed, and (e) wrong classification
proposed to address the needs of two distinct scenarios. The NVIDIA Jetson TX2 (NVIDIA Developer, 2019a), along with TensorRT optimizations (NVIDIA Developer, 2019b), has been introduced as a GPU-accelerated solution for applications that need real-time, yet accurate, performance. Moreover, the Jetson Nano (NVIDIA Developer, 2019a) with TensorRT optimizations is presented in this paper as a GPU-accelerated platform for low-demand applications. Additionally, the Raspberry Pi (R. Pi) 3B+ (Raspberrypi.org, 2019) along with the Intel Neural Compute Stick (NCS) (Software.intel.com, 2019) was investigated as an alternative for low-demand applications.

3.1 Inference at the edge

The availability of labeled data generated by various types of sensors and devices, together with recent progress in the artificial intelligence (AI) area, has introduced innovative applications such as connected autonomous vehicles, smart cities, and intelligent infrastructures.

There are two approaches to intelligent decision making, namely, cloud-computing-based decision making and edge-computing-based decision making. Cloud computing refers to a set of computing services such as servers, storage, analytics, databases, etc., which are delivered over the Internet. In this model, data acquisition is conducted at the edge of a network (via sensors). The data are then sent to the cloud for processing and decision making. Although this solution is relatively fast and easy to set up, it is associated with some inherent limitations such as latency and jitter, limited bandwidth, and personal data privacy and security (Ericsson.com, 2016).

On the other hand, with the edge-computing paradigm, the gathering, storing, processing, and decision making can all be done at the edge of a network. Although the main disadvantage of edge computing is deployment and maintenance costs, several benefits result (International Electrotechnical Commission, 2016):

• Efficient and fast intelligent decision making through deploying the machine learning algorithms at the edge of the network, thereby eliminating the roundtrip delay introduced by the cloud-computing paradigm.
• Securing data close to its origin and being able to follow local management and control policies.
• Fast recovery from network failure or maintenance.
• Decreasing data transfer costs by lowering communication over public networks. In this case, only alarms or decisions are sent to cloud servers.

Three edge computing platforms are introduced in this study, that is, the NVIDIA Jetson TX2, the NVIDIA Jetson Nano, and the R. Pi 3 with an Intel NCS.

3.2 Jetson TX2

The NVIDIA Jetson TX2 uses the Tegra System on Module (SoM) and is the size of a credit card, with input, output, and processing hardware similar to a typical computer. It takes advantage of NVIDIA GPUs, which enable it to accelerate deep-learning-related computations. The width and length of the TX2 module are 50 and 87 mm, respectively. Table 3 summarizes the technical specifications of the TX2. The TX2 module is called the Jetson TX2 development kit when it is mounted on its development carrier board, which is a 17.78 cm × 17.78 cm printed circuit board with typical input and output ports. A heatsink and a fan are mounted on the module to improve heat transfer.
TABLE 3 Detailed specifications of Jetson TX2, Jetson Nano, Raspberry Pi 3B+, and Intel NCS

              Jetson TX2                    Jetson Nano                 Raspberry Pi 3B+            Intel NCS
GPU           NVIDIA Pascal,                NVIDIA Maxwell,             Broadcom VideoCore IV       Intel Movidius Myriad 2
              256 CUDA cores                128 CUDA cores                                          Vision Processing Unit (VPU)
CPU           Dual-core Denver 2 64-bit     Quad-core ARM               4x ARM Cortex-A53,          N.A.
              + quad-core ARM               Cortex-A57 MPCore           1.2 GHz
              Cortex-A57 MPCore
Memory        8 GB 128-bit LPDDR4           4 GB 64-bit LPDDR4          1 GB LPDDR2                 N.A.
Display       2x DSI, 2x DP 1.2,            HDMI 2.0, DP 1.2,           HDMI, DSI                   N.A.
              HDMI 2.0, eDP 1.4             eDP 1.4, 2x DSI
Data storage  32 GB eMMC, SDIO, SATA        MicroSD card                MicroSD                     N.A.
USB           USB 3, USB 2                  USB 3, USB 2                USB 2                       N.A.
Connectivity  1 Gigabit Ethernet,           Gigabit Ethernet            100 Base Ethernet, 2.4 GHz  USB 3
              802.11ac WLAN, Bluetooth                                  802.11n wireless
Mechanical    50 mm × 87 mm                 45 mm × 70 mm               56.5 mm × 85.60 mm          72.5 mm × 27 mm
A neural network can be trained using a GPU-accelerated host machine or using a GPU-enabled cloud computing instance. The neural network can then be optimized and deployed on the TX2 module. NVIDIA JetPack should be used as the Software Development Kit (SDK). JetPack should be installed on the host machine to ensure the necessary toolkits and packages such as CUDA, cuDNN, and TensorRT will also be installed on the TX2. In this study, JetPack 3.2 was used to flash the TX2.

CUDA is a parallel computing platform that increases computing performance by utilizing GPUs. The CUDA deep neural network library (cuDNN) is a GPU-accelerated library that includes highly tuned implementations of operations such as convolution, pooling, and activation. TensorRT is a platform for high-performance deep learning inference that includes an optimizer and a runtime and enables building applications with low latency and high throughput.

TensorRT is a C++ library that improves the inference performance of NVIDIA GPUs. The input of the TensorRT optimizer is a trained neural network, and its output is an optimized inference engine. It is only the inference engine that needs to be deployed in the production environment. TensorRT enhances the latency, power efficiency, memory consumption, and throughput of a network by combining layers and optimizing kernel selection. It can further improve network performance by running it with lower precision. For instance, it eliminates layers whose outputs are not used, horizontally and vertically fuses convolution and activation operations, and adjusts the precision of weights from FP32 to FP16 or INT8.
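A sketch of this optimization step is given below, using the TF-TRT interface bundled with TensorFlow 1.x releases; the exact API differs across TensorFlow/JetPack versions, and the model path is a placeholder.

```python
# Offline TF-TRT conversion: layers are fused, kernels selected, and weight
# precision lowered to FP16; only the resulting engine ships to the TX2.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir="ssd_mobilenet_saved_model",  # placeholder path
    max_batch_size=1,
    precision_mode="FP16")  # half precision, as evaluated in this study
converter.convert()
converter.save("ssd_mobilenet_trt")  # optimized model for deployment
```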
3.3 Jetson Nano

Similar to the Jetson TX2, the Jetson Nano is an SoM designed and optimized for AI applications. As an affordable alternative to the Jetson TX2, the Jetson Nano can be ideal for low-demand applications. The Jetson Nano module is mounted on a development carrier board of 100 mm × 80 mm. A heatsink is mounted on the module to improve heat transfer. The Jetson Nano module supports all of the required libraries for GPU-accelerated deep learning inference, such as TensorRT, CUDA, and cuDNN. Similar to the Jetson TX2, the Jetson Nano has all standard input and output ports, and it can be used as a Linux desktop. Table 3 summarizes the technical specifications of the Jetson Nano. JetPack 4.2 was used in this study to flash the Jetson Nano.

3.4 Raspberry Pi and Intel NCS

The R. Pi 3 Model B+ is the latest version of the R. Pi; it is the size of a credit card, uses a system on chip, and can function similarly to a standard high-performance computer for basic tasks. Its low cost and tiny size make it ideal for embedded devices in particular. Table 3 summarizes the main technical specifications of the R. Pi 3 Model B+ (R. Pi). Although the R. Pi may be suitable for basic computer tasks, it cannot deliver high performance for computationally intensive tasks like object detection. We therefore added an Intel Movidius NCS as a deep learning accelerator to the proposed system.

The NCS is a USB-drive-sized fanless deep learning device that can accelerate computationally intensive inference at the edge. It is powered by an Intel Movidius Vision Processing Unit that optimizes neural network operations. It is an ideal compact deep learning inference accelerator for resource-restricted platforms such as the R. Pi. It supports the TensorFlow (TensorFlow, 2019) and Caffe (Jia et al., 2014) deep learning frameworks. Detailed technical specifications for the NCS can be found in Table 3. The NCS consumes only 1 W of power and has proven that it can greatly accelerate inference over use of the R. Pi CPU alone (Software.intel.com, 2019). In order to make this setup functional, a trained neural network must first be converted into an intermediate representation (IR) using the OpenVINO toolkit provided by Intel. The optimized IR can then be used for inference.
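The deployment path described above can be sketched as follows; the IR file names are placeholders, and the Python API names vary somewhat across OpenVINO releases, so this should be read as an outline of the workflow rather than version-exact code.

```python
# Inference on the NCS: the frozen graph is first converted offline to an
# IR (.xml/.bin pair) with OpenVINO's model optimizer, and the IR is then
# loaded on the MYRIAD (Movidius VPU) device for inference.
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ssd.xml", weights="ssd.bin")  # IR placeholders
exec_net = ie.load_network(network=net, device_name="MYRIAD")  # the NCS

input_name = next(iter(net.input_info))
frame = cv2.resize(cv2.imread("site.jpg"), (300, 300))  # model input size
blob = frame.transpose(2, 0, 1)[np.newaxis].astype(np.float32)  # HWC -> NCHW
detections = exec_net.infer(inputs={input_name: blob})
```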
4 RESULTS AND DISCUSSION

4.1 Comparison of embedded devices

A trained neural network can be deployed on either the cloud or embedded devices. As construction is typically a long-term process, utilizing cloud services can be expensive. For example, Amazon Machine Learning will cost more than $90 for 20 hr of compute time and 890,000 batch predictions. Consequently, in the context of the current study, we mainly focus on embedded devices.

The inference speed, power efficiency, and normalized benefit of six options were investigated in this study: (a) a Jetson TX2, (b) a Jetson TX2 with TensorRT optimizations, (c) a Jetson Nano, (d) a Jetson Nano with TensorRT optimizations, (e) an R. Pi 3B+ with an Intel NCS, and (f) a desktop computer with a GTX 1080 Ti GPU and an Intel Core i7 CPU. It should be noted that all benchmarking was done by operating the Jetson TX2 and Nano at maximum performance.

In regard to the first embedded device, the Jetson TX2, review of the results shows that its inference speed is 25 frames per second (FPS). In order to compare the inference accuracy of this setup with the others, the AP for each category was calculated using the test dataset, and a mAP of 93.41% was achieved.
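A sketch of how the per-setup inference speed can be measured is shown below: the detector runs repeatedly on a frame and the wall-clock average is reported. The tensor names follow the TensorFlow Object Detection API convention and are assumptions, as are the graph path and run count.

```python
# Average-FPS measurement over repeated runs of a frozen detection graph.
import time
import numpy as np
import tensorflow as tf

def measure_fps(frozen_graph_path, n_runs=200):
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(frozen_graph_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
        image = graph.get_tensor_by_name("image_tensor:0")    # assumed names
        boxes = graph.get_tensor_by_name("detection_boxes:0")
        frame = np.zeros((1, 300, 300, 3), dtype=np.uint8)    # dummy input
        with tf.compat.v1.Session(graph=graph) as sess:
            sess.run(boxes, {image: frame})  # warm-up run, excluded from timing
            start = time.time()
            for _ in range(n_runs):
                sess.run(boxes, {image: frame})
            return n_runs / (time.time() - start)
```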
TABLE 4 AP for each object category derived using the optimized models on embedded systems

                                        Dump truck  Excavator  Grader   Loader   Mixer truck  Roller
Jetson TX2 [$600]                       93.73%      87.74%     94.28%   96.43%   98.49%       89.78%   mAP = 93.41%
Jetson TX2 with TensorRT [$600]         92.29%      83.67%     93.86%   93.77%   96.95%       87.63%   mAP = 91.36%
Jetson Nano [$99]                       92.72%      84.54%     93.87%   93.79%   96.95%       89.32%   mAP = 91.86%
Jetson Nano with TensorRT [$99]         92.29%      83.36%     93.86%   93.75%   96.95%       86.69%   mAP = 91.15%
Raspberry Pi 3B+ with NCS [$150]        92.30%      83.73%     93.87%   93.79%   96.95%       86.70%   mAP = 91.22%
GTX 1080 Ti GPU with Intel Core i7
CPU [$1700]                             92.31%      83.70%     93.86%   93.77%   96.94%       86.65%   mAP = 91.20%
For the second embedded device option, the model was optimized using TensorRT and half-precision floating point (FP16) accuracy. It was then deployed on the Jetson TX2. Examination of the optimized model shows that it is able to obtain an inference speed of 47 FPS, which is well above the inference speed needed for real-time applications. Utilizing TensorRT greatly increased the inference speed (from 25 to 47 FPS) at the cost of only minimally reduced inference accuracy (from 93.41% to 91.36% mAP). This combination is especially ideal for safety as well as object-tracking applications, which require real-time processing.

Regarding the third and fourth embedded device options, for the Jetson Nano without TensorRT optimization, inference speed and accuracy were 13.9 FPS and 91.86% mAP, respectively. Once the model was optimized on the Jetson Nano using TensorRT, inference speed increased to 22 FPS with a mAP of 91.15%. This combination is likely to be especially beneficial for applications requiring semi-real-time performance. For example, this setup could be used as a video-recording trigger in certain situations to improve the management and security of a construction site. This would save substantial storage space as well as facilitate inspection processes.

For the fifth embedded device option investigated in this study, it should be pointed out that, as elaborated in Section 3.4, an intermediate representation (IR) is needed to conduct inference with an R. Pi and NCS setup. The NCS is not inherently necessary in this setup, because inference can be conducted on the R. Pi using OpenCV with an IR backend. However, the resulting inference speed (about 0.25 FPS in this study) is very low due to the R. Pi's limited computational performance. Adding the NCS to the R. Pi setup increased the inference speed 32 times, to 8 FPS. The inference accuracy of this setup is 91.22% mAP. (It should be noted that multiple NCSs can be used together to further enhance inference speed.) Similar to the standalone Jetson Nano, the R. Pi with NCS setup would also be suitable for applications needing semi-real-time performance. For example, semi-real-time tracking of construction vehicles might be of interest for productivity analytics applications.

The last option evaluated in this study is the use of a desktop PC. Though desktop PCs can be used for inference at the edge, this setup requires expensive fiber optic cables to transfer video feeds in real time. For construction sites already using fiber optic cables for other purposes, this method is workable. With the setup used in the current study, this allowed for one video stream to be processed with an inference speed of 166 FPS and a mAP of 91.20%. (As different TensorFlow versions were used to generate .pb frozen graphs for this setup compared to that of the Jetson TX2 and Jetson Nano, this slight inconsistency between mAPs was anticipated.)

A detailed inference accuracy comparison across the proposed embedded devices can be found in Table 4. A comparison of inference speed for each of the six options discussed is presented in Figure 5a.

Inference efficiency was also investigated. Inference efficiency is measured by dividing inference speed by power consumption, namely, FPS/watt. The Jetson TX2 at maximum performance consumed 15 W of power, the Jetson Nano 10 W, the R. Pi 3B+ and Intel NCS 6 W, and the desktop PC (with a GTX 1080 Ti GPU and Intel Core i7 CPU) is estimated to consume almost 850 W. Figure 5b summarizes the inference efficiency of the different setups investigated in this study. The Jetson TX2 with TensorRT optimization was the most efficient option investigated in this study.

Normalized inference benefit analysis was also conducted for the proposed embedded devices, considering the price of each development kit. At the time of writing of this paper, the prices of the Jetson TX2, Jetson Nano, R. Pi, and Intel NCS were $600, $99, $75, and $75, respectively. The desktop PC used in this study cost roughly $1,700. Figure 5c depicts the studied systems' normalized inference benefit. Based on the results of this analysis, the Jetson Nano with TensorRT optimization offers the highest inference benefit.
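The efficiency and normalized-benefit rankings reported above follow directly from the published numbers; a small sketch of the arithmetic:

```python
# FPS/W (efficiency) and FPS/$ (a simple normalized-benefit proxy) for the
# optimized setups, using the speeds, wattages, and prices reported above.
setups = {
    # name: (FPS, watts, development-kit price in USD)
    "Jetson TX2 + TensorRT":  (47.0,  15,  600),
    "Jetson Nano + TensorRT": (22.0,  10,   99),
    "R. Pi 3B+ + NCS":        ( 8.0,   6,  150),
    "GTX 1080 Ti desktop":    (166.0, 850, 1700),
}
for name, (fps, watts, price) in setups.items():
    print(f"{name:22s} {fps / watts:5.2f} FPS/W   {fps / price:6.3f} FPS/$")
# The TX2 + TensorRT leads in FPS/W (~3.1), while the Nano + TensorRT leads
# in FPS/$ (~0.22), matching the rankings shown in Figures 5b and 5c.
```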
4.2 Model selection and comparative study

In order to choose a suitable object detector, two main components of the model should be considered: the meta-architecture and the feature extractor. Usually, Faster-RCNN (Ren et al., 2015), SSD (Liu et al., 2016), or R-FCN (Dai et al., 2016) is used as the meta-architecture, and Inception (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016), VGG (Simonyan & Zisserman, 2014), ResNet (He, Zhang, Ren, & Sun, 2016), MobileNet (Howard et al., 2017), etc., or their variants are used as the feature extractor (Huang et al., 2017). Depending on the use case, different combinations of meta-architecture and feature extractor can be used.

Any proposed model can be optimized over two main criteria, that is, inference speed or inference accuracy. Usually, using an accuracy-optimized feature extractor leads to an accurate object detector (Huang et al., 2017).
Inception-ResNet-V2 (Szegedy, Ioffe, Vanhoucke, & Alemi, 2017) has demonstrated an exceptional top-1 accuracy of 80.4% on the ImageNet dataset (Huang et al., 2017). A comparison of different combinations of this feature extractor and various meta-architectures has shown that Faster-RCNN_Inception-ResNet-V2 is the most accurate combination (Huang et al., 2017).

In contrast, speed-optimized feature extractors, having fewer parameters, increase inference speed and lower memory usage (Huang et al., 2017). MobileNet is a good example (Howard et al., 2017). It has been shown that MobileNet with SSD has the least inference time and memory usage and the highest overall mAP (Huang et al., 2017). (It should be noted these results were achieved using the large-scale COCO dataset; T.Y. Lin et al., 2014.)

FIGURE 5 Performance comparison of proposed embedded systems: (a) inference speed, (b) inference efficiency, and (c) inference normalized benefit

Therefore, we wanted to compare the model proposed in this study to feature extractor/meta-architecture combinations reported in the literature as particularly optimized for accuracy and speed, respectively. To compare the inference accuracy of the model presented in this study, we trained a Faster-RCNN_Inception-ResNet-V2 combination with our dataset. Faster-RCNN_Inception-ResNet-V2 was used in particular because it has been reported as the most accurate combination by Huang et al. (2017). To compare the inference speed of our SSD_MobileNet model, we trained an SSD_Inception-V2, because this has been reported as the fastest model after SSD_MobileNet.

To ensure a realistic comparison with our proposed model, we set the image input size for these models to 300 × 300 pixels. The training and validation loss as well as the mAP results can be seen in Figure A1. Review of the results shows that both the model reported in the literature to be particularly accuracy-optimized and that reported as particularly speed-optimized perform worse than our proposed model. The highest mAP achieved by the Faster-RCNN_Inception-ResNet-V2 combination was 75.71%, which is far less than the above-91% mAP of our proposed SSD_MobileNet model. This was expected due to the limited size of our dataset and the huge number of parameters needed to train Faster-RCNN_Inception-ResNet-V2 models. Other studies have reported similar outcomes. For example, Xue and Li (2018) could not accomplish the training of VGG16 due to the very high number of parameters in this model. The slow inference speed of the Faster-RCNN_Inception-ResNet-V2 combination was also anticipated and, indeed, even with the GTX 1080 Ti GPU, use of this model achieved an inference speed of only 3 FPS.

For the SSD_Inception-V2 combination on the GTX 1080 Ti GPU, an inference speed of 85 FPS was achieved. This lower inference speed than with our SSD_MobileNet model (i.e., 85 FPS vs. 166 FPS) was expected, as similar results have also been reported in other studies (Huang et al., 2017). It should be noted that 67.65% was the highest mAP achieved by this model.

4.3 Practical field implementation

At the production level, development carrier boards are not usually used due to the intended application's size and weight constraints. Alternatives include several third-party carrier boards. In this study, Connect Tech's Orbitty carrier board, designed for the Jetson TX2 module (Connect Tech Inc., 2019), was used. This board's size is identical to that of the TX2 module, that is, 87 mm × 50 mm, and it offers USB2, USB3, and Ethernet ports. The process required was detaching the
Jetson TX2 module from the development board, attaching it to the Orbitty carrier board, and reflashing it with the Orbitty carrier board in order to activate all the USB ports.

The input power of the computing unit is 9–14 V DC, which can be supplied via wall power using an AC power adapter or by a DC battery.

To provide examples of how our proposed solution might be implemented in the field, two sets of hardware were proposed and investigated in this study for two distinct use cases. The first is powered by wall power, using an AC-to-DC adapter. This is the preferred setup where wall power is available. A female power adapter was used to connect the AC-to-DC adapter to the computing unit. The second setup is powered by a DC battery and can be used where wall power is not available on the construction site. A 3S LiPo battery was used, which is capable of providing 11.1 V of power. This 5,000 mAh battery is capable of powering the system for almost 4 hr. Because battery voltage will drop as it is used, a voltage-regulator power supply board was used to ensure a consistent 12-V output to the carrier board (O'Kelly et al., 2019).

An actual application was considered to test the proposed solution. Figure A2a illustrates an arbitrary frame from a 2-min video of excavator operation using a fixed camera. The black circles are the centers of the detection bounding boxes in the entire video. As the data were limited, we extrapolated the detection results as if a 4-hr video were available (Figure A2b). Finally, the heatmap of vehicle operation was produced to determine its activity level. Figure A2c is the heatmap generated using the data presented in Figure A2b.

Using the hardware and software proposed in this section, the detection results can be saved and uploaded to the cloud. Then, the heatmaps can be generated using the detection results in order to identify the active or inactive vehicles as well as their level of activity.
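A minimal sketch of the heatmap step follows: the detection-box centers accumulated over the video are rendered as a 2D histogram. Binning and rendering choices here are illustrative.

```python
# Activity heatmap from detection-box centers collected over a video,
# e.g., centers = [((x1 + x2) / 2, (y1 + y2) / 2), ...] per detection.
import numpy as np
import matplotlib.pyplot as plt

def activity_heatmap(centers, frame_w, frame_h, bins=50):
    xs, ys = zip(*centers)
    heat, _, _ = np.histogram2d(ys, xs, bins=bins,
                                range=[[0, frame_h], [0, frame_w]])
    plt.imshow(heat, cmap="hot", extent=[0, frame_w, frame_h, 0])
    plt.colorbar(label="detections per cell")
    plt.savefig("excavator_heatmap.png")
```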
4.4 Hardware/software limitations of proposed solution

The proposed models in this study resize images to 300 × 300 pixels at the beginning of the network. For some applications, this may be a limitation of the current solution. Although resizing contributes to the model's high inference speed, it does make objects smaller. Generally, resizing images of construction vehicles to a smaller resolution is not problematic, because construction vehicles are large and their visual appearance diverse. However, the model should be used with caution for applications involving vehicles extremely far from the monitoring cameras. To ensure inference accuracy similar to that in our study, such images should be checked after resizing to verify that all construction vehicles in the images remain recognizable to the human eye. Moreover, to leverage TensorRT optimizations, the deep learning model must be carefully chosen, because not all neural network layers are supported by TensorRT optimizations by default.

Hardware limitations should also be considered when designing a solution for use in the field. Although in this study the Jetson Nano and the R. Pi with NCS demonstrated substantial normalized benefit, these devices should be used with caution because they have limited memory, in addition to being prone to overheating and, consequently, to system freezing. This should especially be considered in regard to high-temperature construction seasons. Moreover, as the hardware may not be easily accessible after installation on the construction site, its Internet connectivity should also be considered when designing a solution. Both the Jetson TX2 and the R. Pi have a built-in Wi-Fi adapter, whereas the Jetson Nano does not. A PCI Express network adapter or a wireless Internet dongle can be used to connect the Jetson Nano to the Internet.

5 CONCLUSION

A comprehensive deep-learning-based solution for real-time construction vehicle detection has been proposed in this study. Many deep-learning-based computer vision studies neglect solution deployment and, to the best of our knowledge, this is the first study dealing with construction vehicle detection to address this gap. Although this study has focused on solution deployment, it has also covered solution development.

This study's solution development phase involved initial data gathering using available, labeled datasets, followed by use of the web-crawling technique. This study proposes as its object detection model an improved SSD_MobileNet. The model was carefully selected to take into account the hardware-restricted nature of embedded devices. The trained model was then evaluated and its generalizability verified.

This study has proposed three embedded devices to address the needs associated with several scenarios. First, an NVIDIA Jetson TX2 with TensorRT optimizations was introduced as a GPU-accelerated solution for scenarios needing real-time, yet accurate, performance, such as safety- and construction-vehicle-tracking applications. For low-demand applications, both a Jetson Nano and an R. Pi 3B+/Intel NCS setup were proposed. These latter solutions are particularly suitable for scenarios requiring semi-real-time performance, such as productivity and other managerial applications. Among the proposed embedded systems, the Jetson TX2 with TensorRT optimizations had the highest inference speed and efficiency, whereas the Jetson Nano was associated with the highest normalized inference benefit.

Smartphones are sometimes used for the deployment of deep learning models. However, although smartphones can be useful for inspection purposes, they are not suitable to be employed as embedded devices because they cannot be kept in the field for long-term monitoring, especially during harsh
weather conditions. Moreover, smartphones are not optimized for AI applications, and their poor inference speed performance, due to lack of GPUs, has been reported in the literature (Maeda et al., 2018).

To compare the performance of our proposed model, we also conducted a comparative study. This study supported previous findings that the performance of models trained on a large-scale dataset cannot be generalized to models trained on a more limited dataset (Xue & Li, 2018). For example, the Faster R-CNN model that had the highest mAP with the COCO dataset performed much more poorly than our proposed model when applied to this study's smaller construction vehicle dataset. Therefore, it is of vital importance that dataset size be considered when designing deep-learning-based computer vision solutions.

A wall-power setup is preferable for deploying an on-site construction vehicle detection solution, because DC battery setups require adding a voltage-regulator circuit in order to ensure voltage consistency. Moreover, DC batteries offer only limited runtime, and the maintenance of DC batteries, especially LiPo batteries, requires extra caution.

In regard to future studies, construction vehicle detection applications such as safety monitoring, productivity assessment, and construction site management should be investigated. Potential safety monitoring applications to investigate include the use of multiple cameras along with homographic transformation in order to obtain distance information that can then be used to send an alarm signal to vehicles at risk of collision. Productivity assessment applications could include the utilization of tracking information in order to identify active and inactive vehicles. Management applications could include the activation of video-recording triggers in response to predetermined situations in order to improve the management and the security of a construction site. Clearly, there are many potential benefits to applying the real-time monitoring technologies demonstrated feasible in this study.

REFERENCES

Adeli, H. (2001). Neural networks in civil engineering: 1989–2000. Computer-Aided Civil and Infrastructure Engineering, 16(2), 126–142.
Amezquita-Sanchez, J. P., & Adeli, H. (2016). Signal processing techniques for vibration-based health monitoring of smart structures. Archives of Computational Methods in Engineering, 23(1), 1–15.
Amezquita-Sanchez, J. P., Valtierra-Rodriguez, M., & Adeli, H. (2018). Wireless smart sensors for monitoring the health condition of civil infrastructure. Scientia Iranica A, 25(6), 2913–2925. https://doi.org/10.24200/SCI.2018.21136
Arabi, S., & Shafei, B. (2019). Multi-stressor fatigue assessment of steel sign-support structures: A case study in Iowa. Engineering Structures, 200. https://doi.org/10.1016/j.engstruct.2019.109721
Arabi, S., Shafei, B., & Phares, B. M. (2018). Fatigue analysis of sign-support structures during transportation under road-induced excitations. Engineering Structures, 164(2), 305–315.
Arabi, S., Shafei, B., & Phares, B. M. (2019). Investigation of fatigue in steel sign-support structures under diurnal temperature changes. Journal of Constructional Steel Research, 153, 286–297.
Cha, Y.-J., Choi, W., & Büyüköztürk, O. (2017). Deep learning-based crack damage detection using convolutional neural networks. Computer-Aided Civil and Infrastructure Engineering, 32(5), 361–378.
Cha, Y.-J., Choi, W., Suh, G., Mahmoudkhani, S., & Büyüköztürk, O. (2018). Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Computer-Aided Civil and Infrastructure Engineering, 33(9), 731–747.
Chakraborty, P., Adu-Gyamfi, Y. O., Poddar, S., Ahsani, V., Sharma, A., & Sarkar, S. (2018). Traffic congestion detection from camera images using deep convolution neural networks. Transportation Research Record: Journal of the Transportation Research Board, 2672(45), 222–231.
Chakraborty, P., Sharma, A., & Hegde, C. (2018). Freeway traffic incident detection from cameras: A semi-supervised learning approach. In 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 1840–1845). IEEE. https://doi.org/10.1109/ITSC.2018.8569426
Chi, S., & Caldas, C. H. (2011). Automated object identification using optical video cameras on construction sites. Computer-Aided Civil and Infrastructure Engineering, 26(5), 368–380.
Connect Tech Inc. (2019). Orbitty carrier for NVIDIA Jetson TX2/TX2i/TX1. Retrieved from http://connecttech.com/product/orbitty-carrier-for-NVIDIA-Jetson-tx2-tx1/
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (pp. 379–387). Curran Associates Inc.
Ding, L., Fang, W., Luo, H., Love, P. E. D., Zhong, B., & Ouyang, X. (2018). A deep hybrid learning model to detect unsafe behavior: Integrating convolution neural networks and long short-term memory. Automation in Construction, 86, 118–124.
Dung, C. V., & Le, A. D. (2019). Autonomous concrete crack detection using deep fully convolutional neural network. Automation in Construction, 99, 52–58.
Ericsson.com. (2016). Ericsson—Hyperscale cloud—Reimagining data centers from hardware to applications. Retrieved from https://www.ericsson.com/en/white-papers/hyperscale-cloud-reimagining-data-centers-from-hardware-to-applications
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Fang, Q., Li, H., Luo, X., Ding, L., Luo, H., Rose, T. M., & An, W. (2018). Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Automation in Construction, 85, 1–9.
Fang, W., Ding, L., Luo, H., & Love, P. E. D. (2018). Falls from heights: A computer vision-based approach for safety harness detection. Automation in Construction, 91, 53–61.
Fang, W., Ding, L., Zhong, B., Love, P. E. D., & Luo, H. (2018). Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach. Advanced Engineering Informatics, 37, 139–149.
Gao, Y., & Mosalam, K. M. (2018). Deep transfer learning for image-based structural damage recognition. Computer-Aided Civil and Infrastructure Engineering, 33(9), 748–768.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 142–158.
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., … Murphy, K. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7310–7311).
International Electrotechnical Commission. (2016). Edge intelligence. Retrieved from https://www.iec.ch/whitepaper/pdf/IEC_WP_Edge_Intelligence.pdf
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448–456).
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678). ACM.
Kim, H., Kim, H., Hong, Y. W., & Byun, H. (2018). Detecting construction equipment using a region-based fully convolutional network and transfer learning. Journal of Computing in Civil Engineering, 32(2). https://doi.org/10.1061/(ASCE)CP.1943-5487.0000731
Kim, H., Kim, K., & Kim, H. (2016). Vision-based object-centric safety assessment using fuzzy inference: Monitoring struck-by accidents with moving objects. Journal of Computing in Civil Engineering, 30(4). https://doi.org/10.1061/(ASCE)CP.1943-5487.0000562
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Krizhevsky, A., & Hinton, G. (2010). Convolutional deep belief networks on CIFAR-10 (Technical report). University of Toronto.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., … Ferrari, V. (2018). The Open Images dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Li, R., Yuan, Y., Zhang, W., & Yuan, Y. (2018). Unified vision-based methodology for simultaneous concrete defect detection and geolocalization. Computer-Aided Civil and Infrastructure Engineering, 33(7), 527–544.
Liang, X. (2019). Image-based post-disaster inspection of reinforced concrete bridge systems using deep learning with Bayesian optimization. Computer-Aided Civil and Infrastructure Engineering, 34(5), 415–430.
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740–755). Cham: Springer.
Lin, Y., Nie, Z., & Ma, H. (2017). Structural damage detection with automatic feature-extraction through deep learning. Computer-Aided Civil and Infrastructure Engineering, 32(12), 1025–1046.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21–37). Cham: Springer.
Luo, X., Li, H., Cao, D., Dai, F., Seo, J., & Lee, S. (2018). Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks. Journal of Computing in Civil Engineering, 32(3), 4018012.
Luo, X., Li, H., Cao, D., Yu, Y., Yang, X., & Huang, T. (2018). Towards efficient and objective work sampling: Recognizing workers' activities in site surveillance videos with two-stream convolutional networks. Automation in Construction, 94, 360–370.
Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T., & Omata, H. (2018). Road damage detection and classification using deep neural networks with smartphone images. Computer-Aided Civil and Infrastructure Engineering, 33(12), 1127–1141.
Memarzadeh, M., Golparvar-Fard, M., & Niebles, J. C. (2013). Automated 2D detection of construction equipment and workers from site video streams using histograms of oriented gradients and colors. Automation in Construction, 32, 24–37.
Nabian, M. A., & Meidani, H. (2018). Deep learning for accelerated seismic reliability analysis of transportation networks. Computer-Aided Civil and Infrastructure Engineering, 33(6), 443–458.
NVIDIA Developer. (2019a). Autonomous machines. Retrieved from https://developer.NVIDIA.com/embedded-computing
NVIDIA Developer. (2019b). NVIDIA TensorRT. Retrieved from https://developer.NVIDIA.com/tensorrt
O'Kelly, M., Sukhil, V., Abbas, H., Harkins, J., Kao, C., Pant, Y. V., … Bertogna, M. (2019). F1/10: An open-source autonomous cyber-physical platform. arXiv:1901.08567.
Olston, C., & Najork, M. (2010). Web crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.
Park, M.-W., Elsafty, N., & Zhu, Z. (2015). Hardhat-wearing detection for enhancing on-site safety of construction workers. Journal of Construction Engineering and Management, 141(9), 4015024.
Raspberrypi.org. (2019). Teach, learn, and make with Raspberry Pi. Retrieved from https://www.raspberrypi.org/
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91–99). NIPS.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Shao, L., Zhu, F., & Li, X. (2015). Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26(5), 1019–1034.
Sifre, L. (2014). Rigid-motion scattering for image classification (Ph.D. dissertation). Ecole Polytechnique, CMAP.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Software.intel.com. (2019). Intel Movidius Neural Compute Stick. Retrieved from https://software.intel.com/en-us/movidius-ncs
Son, H., Choi, H., Seong, H., & Kim, C. (2019). Detection of construction workers under varying poses and changing background in image sequences via very deep residual networks. Automation in Construction, 99, 27–38.
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV (pp. 2818–2826).
TensorFlow. (2019). TensorFlow. Retrieved from https://www.tensorflow.org/
Xue, Y., & Li, Y. (2018). A fast detection method via region-based fully convolutional neural networks for shield tunnel lining defects. Computer-Aided Civil and Infrastructure Engineering, 33(8), 638–654.
Yeum, C. M., & Dyke, S. J. (2015). Vision-based automated crack detection for bridge inspection. Computer-Aided Civil and Infrastructure Engineering, 30(10), 759–770.
Zhang, A., Wang, K. C. P., Li, B., Yang, E., Dai, X., Peng, Y., … Chen, C. (2017). Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Computer-Aided Civil and Infrastructure Engineering, 32(10), 805–819.

How to cite this article: Arabi S, Haghighat A, Sharma A. A deep-learning-based computer vision solution for construction vehicle detection. Comput Aided Civ Inf. 2020;35:753–767. https://doi.org/10.1111/mice.12530
APPENDIX
FIGURE A1 Behavior of the detection models over training iterations: (a) training loss of Faster-RCNN_Inception-ResNet-V2, (b) validation loss and mAP of Faster-RCNN_Inception-ResNet-V2, (c) training loss of SSD_Inception-V2, and (d) validation loss and mAP of SSD_Inception-V2
FIGURE A2 Construction vehicle activity level identification: (a) an arbitrary frame from a 2-min video of excavator operation overlaid with centers of detection bounding boxes, (b) extrapolation of the results presented in (a) for a 4-hr video, and (c) excavator activity heatmap generated using data presented in (b)