Abstract—Depth estimation from a single RGB image has been one of the most important research topics in recent years, as it has several important applications in autonomous driving, image reconstruction, and scene segmentation. Depth estimation from a single monocular image is challenging compared to stereo images due to the lack of the spatio-temporal features per frame that make 3D depth perception easier. Existing models and solutions in monocular depth estimation often result in low-resolution and blurry depth maps and often fail to identify small object boundaries. In this paper, we propose a simple encoder-decoder based network that can predict high-quality depth images from single RGB images using transfer learning. We have utilized important features extracted from pre-trained networks, and after initializing the encoder with fine-tuning and important augmentation strategies, the network decoder computes the high-end depth maps. The network has fewer trainable parameters and requires fewer iterations, yet it outperforms the existing state-of-the-art methods and captures accurate boundaries when evaluated on two standard datasets, KITTI and NYU Depth V2.

Keywords—Monocular image, Depth estimation, Encoder-decoder, Transfer learning.

I. INTRODUCTION

Retrieving depth information from RGB images has been of utmost importance and application in recent years, and it has already been applied in semantic segmentation, augmented reality, Simultaneous Localization and Mapping (SLAM) [1], scene understanding, image refocusing, and real-time navigation in self-driving cars. Recent studies have focused mostly on using CNNs and other deep learning methods for the reconstruction of images from 2D to 3D. Though there have been steady improvements in dense depth estimation strategies in the last few years, there remains much scope for improvement in the image resolution and quality of predictions in the resulting depth maps. The recent applications of computer vision like virtual reality, 3D reconstruction, autonomous driving, and medical imaging demand accurate object boundaries and fine-grained depth predictions, as well as faster computation, for these methods to be effective. Therefore, the problem demands that depth discontinuities be composed accurately and faithfully to avoid the large distortions that are very common in deep learning-based depth predictions.

However, there are other geometric and hardware-based methods for depth estimation too. Recently, structure from motion (SFM) methods have successfully evolved for Simultaneous Localization and Mapping (SLAM) implementations, where 3D scenes are reconstructed from a series of 2D monocular images through feature correspondence and geometrical constraints from the image sequences. Human eyes can efficiently estimate the depth of a scene using occlusion, knowledge of known object shapes, shadow and lighting information, known perspective, and relative scaling. This has been mimicked by stereo image pairs, where the depths are calculated from disparity maps of the images captured by a slightly displaced pair of cameras. The disparity maps are calculated from a cost function, and the scale information is available because the position transformation of both cameras is calibrated before the experiment. RGB-D cameras and LIDAR methods have also been instrumental in depth estimation recently, where the camera sensors capture the depth maps directly from the scene. Though the LIDAR method has been implemented in self-driving cars recently, it can generate only sparse 3D maps, suffers from relatively low information in low-lighting conditions at night, and has a high cost that makes it unsuitable for use in small robotics applications.
Therefore, the problem demands an accurate and simple depth-estimation method, which we propose in this paper.

Fig. 1. Comparison of ground truth and estimated depth maps. Row 1 shows that our model performs significantly well in low-lighting conditions too.

II. RELATED WORKS

In this section, we illustrate the existing research works involving monocular depth approximation, where the input is only a single image. In all the works mentioned below, suppositions about the environment are not made beforehand. Predefined resemblance calculations, matching similar to supervised learning and then developing a function based on that matching, produced excellent results as in [7]. It has been demonstrated that such a multi-class categorization outperforms others both in terms of accuracy and rapidity. Mayer et al. [9] proposed a fully convolutional [10] deep network, termed "DispNet", in which the disparity for every pixel is forecasted by minimizing a regression loss function. However, it trains on a large amount of ground truth data with the original stereo images, which is quite difficult to organize from real-world scenarios. Saxena et al. [12] proposed a model termed "Make3D". This method initially breaks up the input image and then predicts the disparity based on the three-dimensional positioning of planes. A major drawback of this method becomes apparent when predictions made on thin structures fail to generate the required output, as a result of the dearth of a wide range of universal circumstances. The approach of Liu et al. [13] involved convolutional neural networks (CNNs) for learning and prediction. Ladicky et al. [14] proposed the usage of semantics to improve disparity measurement. Karsch et al. [18] developed a model that predicted disparity with great accuracy; however, it required the entire training set to be available at the time of testing. Eigen et al. [19] proposed a model which predicted the depth by using a trained two-scale deep network. The model was trained on images and their corresponding depth values. It learned to represent disparity directly from the raw image data and did not rely on supervised features or over-segmentation. But similar to previous models, it too required ground truth data at the time of model training. Flynn et al. [13] proposed "DeepStereo", which involved the relative pose of multiple cameras during training to predict the depth values of an input image. This model functioned by selecting the most relevant depth values based on plane sweep volumes. A major disadvantage of this method again lies in the fact that several similar but slightly varying posed images are required for training the model. Any insufficiency in the training set causes the model to predict with much less accuracy. This was rectified to a large extent by Xie et al. [15], where from an input left-sided image, the corresponding right-sided image was predicted and vice-versa. However, the disadvantage of this method is that, with a flux of disparity values, the model became less memory efficient, thus limiting its usage for images with much larger resolutions.

A. Contributions of current paper

Based on the pre-existing architectures and training strategies [2], [3], [4], [5], we have designed a simple encoder-decoder architecture with fewer parameters and less complexity, which makes the training and evaluation process easier. Our contribution is two-fold:
• We have designed a simple transfer learning-based network architecture that can detect object boundaries more faithfully and predict depth maps more explicitly than the existing methods. The performance of our model can be inferred from figure 1, where the predicted depth maps have been compared with the ground truth depth maps.
• We define a data augmentation and training strategy with much fewer parameters, and finally, we define a corresponding loss function for faster and smoother convergence.

III. MATERIALS AND METHODS

In this section, we present the detailed workflow of our method, which includes the dataset description, network architecture, complexity assessment, loss function, and augmentation policy. The detailed descriptions are given below.

A. Datasets

1) NYU Depth V2: The dataset contains 120k training images and their corresponding depth maps of indoor scenes, though we used a 50k subset for training, and 654 images were used for testing. The images in the dataset have a dimension of 640×480 and the depth maps have half the resolution of the raw images (i.e. 320×240) [6]. We did not crop or resize the images during training, although distortion-correction preprocessing, which may result in missing pixels, was a possibility. During testing, we calculated the depth maps and up-sampled them by a factor of two to match the input image dimension. The output was taken by averaging the predictions of the mirror image pairs.
2) KITTI Dataset: The KITTI dataset consists of stereo images and their corresponding laser scans captured by a moving vehicle. The images have dimensions of around 1241×376, and there are many missing values and low-resolution regions in the depth maps, though these are compensated using an inpainting method. We used around 26k images for training and 697 images for testing. The images were up-sampled using bilinear interpolation to a specified dimension of 1280×384 to meet the encoder's requirement that the image dimensions be divisible by 32 [8]. During testing, the depth map was cropped to 624×192 resolution to match the input resolution, and finally, the output was taken by averaging the predictions of the mirror image pairs.
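For both datasets, the test-time procedure therefore amounts to resizing the input so that its sides are divisible by 32 and averaging the prediction over the image and its horizontal mirror. The following is a minimal TensorFlow sketch of that procedure; it is our reconstruction rather than the authors' released code, and the function names and target size are illustrative.

```python
import tensorflow as tf

def resize_for_encoder(image, target_h=384, target_w=1280):
    """Bilinearly resize a KITTI image so that both sides are divisible by 32."""
    return tf.image.resize(image, (target_h, target_w), method="bilinear")

def predict_with_mirror_averaging(model, image):
    """Average the predictions for an image and its horizontal mirror."""
    batch = tf.expand_dims(image, axis=0)                    # (1, H, W, 3)
    pred = model(batch, training=False)
    pred_mirror = model(tf.image.flip_left_right(batch), training=False)
    pred_mirror = tf.image.flip_left_right(pred_mirror)      # undo the flip
    return 0.5 * (pred + pred_mirror)
```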
B. Network architecture

The network is based on an encoder-decoder structure (figure 2) in which we use the DenseNet-169 [8] architecture, pretrained on ImageNet [11], as the backbone of the encoder, with the fully connected layer removed to discard the ImageNet classification output. The feature maps are then fed into an up-sampling network which, along with the skip connections, forms the decoder. We start with a 1×1 convolution with the same number of channels as the encoder output, followed by successive blocks consisting of 2×2 bilinear up-sampling and 3×3 convolution layers. All the up-sampling layers are associated with a Leaky Rectified Linear Unit (Leaky ReLU) activation with α set to 0.2. The decoder does not comprise any batch normalization or other advanced layers suggested by contemporary complex network architectures [16], [17]. The detailed network architecture is described in figure 3. We experimented with several other encoders (e.g. ResNet-50, DenseNet-121) and also with some other decoder architectures [20], [21], and after thorough evaluation we concluded that complexity in the network architecture and a greater number of convolutional layers do not necessarily contribute to better performance. Based on several experiments with different encoder architectures and the various up-sampling methods mentioned above, we conclude that a simple encoder-decoder architecture with plain 2×2 bilinear up-sampling can already produce very good results.

Fig. 3. The basic architecture of the proposed network. The encoder part (up to CONV2) is the same as the DenseNet-169 architecture, and there is a Leaky ReLU after each CONVB layer. The output dimensions are given for the NYU Depth V2 dataset in the form height*width*channel.
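A minimal Keras sketch of this design is given below, assuming a DenseNet-169 backbone with the classifier removed and a decoder of a 1×1 convolution followed by repeated 2× bilinear up-sampling, skip concatenation, 3×3 convolution, and Leaky ReLU (α = 0.2) blocks. It is our reconstruction, not the authors' code; the skip-layer names and decoder channel counts are assumptions and may need adjusting for a given Keras version.

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample_block(x, skip, filters, name):
    """2x bilinear up-sampling, concatenation with a skip feature map,
    then a 3x3 convolution followed by Leaky ReLU (alpha = 0.2)."""
    x = layers.UpSampling2D(size=2, interpolation="bilinear", name=name + "_up")(x)
    if skip is not None:
        x = layers.Concatenate(name=name + "_concat")([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", name=name + "_conv")(x)
    return layers.LeakyReLU(alpha=0.2, name=name + "_lrelu")(x)

def build_depth_net(input_shape=(480, 640, 3)):
    encoder = tf.keras.applications.DenseNet169(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Hypothetical skip points; actual layer names depend on the Keras version.
    skip_names = ["pool3_pool", "pool2_pool", "pool1", "conv1/relu"]
    skips = [encoder.get_layer(n).output for n in skip_names]
    x = encoder.output                                        # bottleneck features
    # 1x1 convolution keeping the encoder channel count.
    x = layers.Conv2D(x.shape[-1], 1, padding="same", name="bridge")(x)
    for i, (skip, filters) in enumerate(zip(skips, [832, 416, 208, 104])):
        x = upsample_block(x, skip, filters, name=f"up{i + 1}")
    depth = layers.Conv2D(1, 3, padding="same", name="depth")(x)  # half-resolution depth
    return tf.keras.Model(encoder.input, depth)
```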
C. Loss function

The loss function in the depth estimation task is defined as a function that calculates the difference between the ground truth depth map $d$ and the predicted depth map $d'$. Several loss functions have been used successfully in existing depth estimation tasks and have performed quite well. The Structural Similarity (SSIM) index has proven very useful in reconstruction tasks using CNNs. This commonly used metric is defined by the following equation:

$$\mathrm{SSIM}(d,d') = \frac{(2\,\mathrm{Avg}_d\,\mathrm{Avg}_{d'} + a)\,(2\,\mathrm{Cov}_{dd'} + b)}{(\mathrm{Avg}_d^2 + \mathrm{Avg}_{d'}^2 + a)\,(\mathrm{Var}_d^2 + \mathrm{Var}_{d'}^2 + b)} \quad (1)$$

where $\mathrm{Avg}$ and $\mathrm{Var}^2$ signify the average and variance respectively, and $\mathrm{Cov}_{dd'}$ is the covariance between $d$ and $d'$; $a = (c_1 L)^2$, $b = (c_2 L)^2$, $L = 2^{\text{no. of bits per pixel}} - 1$, $c_1 = 0.01$ and $c_2 = 0.03$.

$$\mathrm{SSIM\ Loss}(d,d') = \frac{1 - \mathrm{SSIM}(d,d')}{2} \quad (2)$$

Secondly, the gradient loss is defined as the average of the absolute values of the depth-gradient differences between the prediction and the ground truth along the x and y directions:

$$\mathrm{Gradient\ Loss}(d,d') = \frac{1}{N}\sum_{i=1}^{N}\left(\left|\mathrm{Grad}_x(d_i, d'_i)\right| + \left|\mathrm{Grad}_y(d_i, d'_i)\right|\right) \quad (3)$$
Fig. 4. Qualitative comparison of our proposed method with the state-of-the-art method on NYU Depth V2 dataset. The first column is the input RGB image,
the second column is the corresponding depth map, the third column is the prediction map by Fu et al. [17] and the last column is our predicted depth map.
The depth loss is defined as the point-wise difference in depth values between the predicted and ground truth maps:

$$\mathrm{Depth\ Loss}(d,d') = \frac{1}{N}\sum_{i=1}^{N}\left|d_i - d'_i\right| \quad (4)$$

We incorporate a new mixed loss function keeping two important tasks in mind: first, giving weightage to penalizing the high-frequency components in the predicted depth maps, and secondly, minimizing the difference between the predicted and original depth maps and thus reconstructing the depth images. Our proposed loss function, defined below, balances these two operations:

$$\mathrm{Loss}(d,d') = \mathrm{SSIM\ Loss}(d,d') + \mathrm{Gradient\ Loss}(d,d') + 0.1\times\mathrm{Depth\ Loss}(d,d') \quad (5)$$
D. Data augmentation

Data augmentation is an important tool to improve the learning of a model and to avoid overfitting. Geometrical and photometric transformations have been used widely in almost all supervised, unsupervised, or semi-supervised learning-based depth estimation methods; however, we have found that not all of them are necessarily important and useful for our purpose. As vertical flipping may result in ambiguity between floor and ceiling positions in an image, we discarded it, while we kept horizontal mirroring with a probability of 0.25. Another important photometric transformation, color channel swapping, has been found useful in boosting performance, hence we kept this augmentation with a probability of 0.5. However, there are numerous other options for data augmentation, and how these methods can bridge the gap between limited datasets and improved performance remains an important field of research.
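A short eager-mode sketch of this policy (our own illustration, not the authors' code) is shown below; inside a tf.data pipeline the random branches would be wrapped in tf.cond.

```python
import tensorflow as tf

def augment(image, depth):
    # Horizontal mirroring (p = 0.25); flip the image and depth map together.
    if tf.random.uniform([]) < 0.25:
        image = tf.image.flip_left_right(image)
        depth = tf.image.flip_left_right(depth)
    # Color channel swapping (p = 0.5); the depth map is left unchanged.
    if tf.random.uniform([]) < 0.5:
        perm = tf.random.shuffle(tf.range(3))
        image = tf.gather(image, perm, axis=-1)
    return image, depth
```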
IV. RESULTS AND DISCUSSION

A. Implementation details

Our proposed method was implemented on the TensorFlow backend using a Tesla K80 GPU having 2496 CUDA cores and 32 GB of VRAM. The encoder was a DenseNet-169 pretrained on ImageNet, and the weights of the decoder were initialized randomly. The training parameters for the depth-estimation network were set experimentally, and the best parameters were selected for optimized results. The ADAM optimizer was used in the training process with a base learning rate of 1e-3 and a batch size of 4, keeping the β1 and β2 values as 0.9 and 0.999. The total number of trainable parameters in our network was approximately 41.5M, and it took approximately 36 hours to complete training on the NYU Depth V2 dataset and 15 hours on the KITTI dataset.
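These optimizer settings can be reproduced with a few lines of Keras configuration; the snippet below is only a sketch that reuses the hypothetical build_depth_net and combined_loss functions from the earlier sketches, and the dataset objects and epoch count are placeholders.

```python
import tensorflow as tf

model = build_depth_net()                     # from the architecture sketch above
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer, loss=combined_loss)

# train_ds / val_ds are assumed to be tf.data.Dataset objects yielding
# (rgb, depth) pairs, batched with batch_size = 4.
# model.fit(train_ds, validation_data=val_ds, epochs=20)  # epoch count is a placeholder
```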
Fig. 5. Qualitative comparison of our proposed method with the state-of-the-art method on KITTI benchmark dataset. The first column is the input RGB
image, the second column is the depth map predicted by our method and the last column is the result from Fu et al. [17].
TABLE I
Comparison of different evaluation metrics of our result with the existing methods on the NYU Depth V2 dataset. The results show that our method performs better than all other existing methods.

TABLE II
Comparison of different evaluation metrics of our result with the existing methods on the KITTI benchmark dataset. The results show that our method performs the second best among all other existing methods.
B. Qualitative measurement

The results from our experiments were compared with other existing methods, and it is observed that our method outperformed all of them, including the state-of-the-art DORN (2018) method [17]. Besides, our method requires almost 50% of the parameters of the DORN method, uses fewer iterations, and was trained on a small part of the complete dataset of 120K images. The median value of the ground truth depth maps was multiplied (scalar multiplication) with the predicted depth maps to remove the problem of absolute scaling of the scene, which is a major source of error in many SOTA methods. The performance of our proposed method on NYU Depth V2 is compared with the ground truth and [17] in figure 4, where it is clear that our method predicts closer to the actual ground truth image. There are large distortions in the output images of DORN, with inconsistency in boundary detection and in the smoothness of the depth maps. Rows 2 and 3 also suggest that DORN cannot predict depth maps at all in low-lighting conditions, whereas our method still manages to predict quite well in these conditions.

Similarly, we also compared our results with the SOTA methods on the KITTI benchmark dataset. The dataset has inconsistency and several missing cases in the ground truths; hence, we could not compare the results with ground truths in each case. It is observed that both our model and the SOTA model of Fu et al. [17] fail to identify objects that are at a long distance from the camera. However, our method still identifies humans, vehicles, and other objects in the scenes better, with smoother depth maps, as compared to DORN [17].

C. Quantitative evaluation

We have quantitatively compared our depth maps with some of the existing previous works by using the following evaluation metrics.
Tables I and II present the comparative study of the different evaluation metrics on the KITTI and NYU Depth V2 datasets. The bold entries indicate the best results among all the methods considered for comparison. The results suggest that our method performs the best on the NYU dataset but second best in terms of evaluation scores on the KITTI dataset.

$$\mathrm{rel\ (relative\ error)} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|d_i - d'_i\right|}{d_i} \quad (6)$$

$$\mathrm{rms\ error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - d'_i\right)^2} \quad (7)$$

$$\mathrm{Average}\ \log_{10}\ \mathrm{error} = \frac{1}{N}\sum_{i=1}^{N}\left|\log_{10}(d_i) - \log_{10}(d'_i)\right| \quad (8)$$

$$\delta_p\ \mathrm{(threshold\ accuracy)} = \%\ \mathrm{of}\ d_i\ \mathrm{such\ that}\ \max\!\left(\frac{d_i}{d'_i}, \frac{d'_i}{d_i}\right) = \delta < \mathrm{threshold},\quad \mathrm{threshold} = 1.25^p,\ p = 1, 2, 3 \quad (9)$$
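A NumPy sketch of these metrics is shown below, together with one common form of the median scaling mentioned in Section IV-B; the variable names are ours and valid-pixel masking is omitted.

```python
import numpy as np

def evaluate(gt, pred):
    # One common form of median scaling to remove absolute-scale ambiguity.
    pred = pred * (np.median(gt) / np.median(pred))

    rel = np.mean(np.abs(gt - pred) / gt)                    # Eq. (6)
    rms = np.sqrt(np.mean((gt - pred) ** 2))                 # Eq. (7)
    log10 = np.mean(np.abs(np.log10(gt) - np.log10(pred)))   # Eq. (8)

    ratio = np.maximum(gt / pred, pred / gt)                 # Eq. (9)
    deltas = [np.mean(ratio < 1.25 ** p) for p in (1, 2, 3)]
    return rel, rms, log10, deltas
```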
V. CONCLUSION AND FUTURE WORK

In this paper, we proposed a simple encoder-decoder based network architecture, built on transfer learning, that performs better than existing methods while using fewer trainable parameters and smaller computational resources. However, there remain large opportunities for experimentation with encoder depths, decoder layers, color channel augmentation, and other photometric and geometric transformations. The main target of this experiment was to push forward the depth estimation task, and we have performed the object detection and depth estimation tasks more faithfully than previous works. Further studies can be made on available public and private datasets, which was not possible for us due to various constraints. There are numerous opportunities to use more compact encoder structures and pretrained weights to outperform our method, and we would like to investigate further different learning strategies, augmentation methods, network architectures, and the reasons why our method performs better.

REFERENCES

[1] G. Hu, S. Huang, L. Zhao, A. Alempijevic and G. Dissanayake, "A robust RGB-D SLAM algorithm," 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, 2012, pp. 1714-1719, doi: 10.1109/IROS.2012.6386103.
[2] Bo Li, Chunhua Shen, Yuchao Dai, A. van den Hengel and Mingyi He, "Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1119-1127, doi: 10.1109/CVPR.2015.7298715.
[3] Epic Games, Inc. Marketplace - UE4 Marketplace, 2018.
[4] M. Song and W. Kim, "Depth Estimation From a Single Image Using Guided Deep Network," IEEE Access, vol. 7, pp. 142595-142606, 2019, doi: 10.1109/ACCESS.2019.2944937.
[5] O. Araar, N. Aouf and J. L. Dietz, "Power pylon detection and monocular depth estimation from inspection UAVs," Industrial Robot: An International Journal, 2015.
[6] N. Silberman, D. Hoiem, P. Kohli and R. Fergus, "Indoor segmentation and support inference from RGBD images," European Conference on Computer Vision (ECCV), 2012, pp. 746-760, Springer, Berlin, Heidelberg.
[7] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 4353-4361, doi: 10.1109/CVPR.2015.7299064.
[8] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, "Densely Connected Convolutional Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 2261-2269, doi: 10.1109/CVPR.2017.243.
[9] N. Mayer et al., "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 4040-4048, doi: 10.1109/CVPR.2016.438.
[10] J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.
[11] J. Deng, W. Dong, R. Socher, L. Li, Kai Li and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
[12] A. Saxena, M. Sun and A. Y. Ng, "Make3D: Learning 3D Scene Structure from a Single Still Image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, May 2009, doi: 10.1109/TPAMI.2008.132.
[13] F. Liu, C. Shen, G. Lin and I. Reid, "Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024-2039, Oct. 2016, doi: 10.1109/TPAMI.2015.2505283.
[14] L. Ladický, J. Shi and M. Pollefeys, "Pulling Things out of Perspective," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 89-96, doi: 10.1109/CVPR.2014.19.
[15] J. Xie, R. Girshick and A. Farhadi, "Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks," Computer Vision - ECCV 2016, pp. 842-857, 2016.
[16] Z. Hao, Y. Li, S. You and F. Lu, "Detail Preserving Depth Estimation from a Single Image Using Attention Guided Networks," 2018 International Conference on 3D Vision (3DV), Verona, 2018, pp. 304-313, doi: 10.1109/3DV.2018.00043.
[17] H. Fu, M. Gong, C. Wang, K. Batmanghelich and D. Tao, "Deep Ordinal Regression Network for Monocular Depth Estimation," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 2002-2011, doi: 10.1109/CVPR.2018.00214.
[18] K. Karsch, C. Liu and S. Kang, "Depth Extraction from Video Using Non-parametric Sampling," Computer Vision - ECCV 2012, pp. 775-788, 2012, doi: 10.1007/978-3-642-33715-4_56.
[19] D. Eigen and R. Fergus, "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 2650-2658, doi: 10.1109/ICCV.2015.304.
[20] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari and N. Navab, "Deeper Depth Prediction with Fully Convolutional Residual Networks," 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, 2016, pp. 239-248, doi: 10.1109/3DV.2016.32.
[21] A. Levin, D. Lischinski and Y. Weiss, "Colorization using optimization," ACM SIGGRAPH 2004 Papers (SIGGRAPH '04), 2004, doi: 10.1145/1186562.1015780.
[22] D. Xu, E. Ricci, W. Ouyang, X. Wang and N. Sebe, "Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5354-5362.
[23] C. Godard, O. M. Aodha and G. J. Brostow, "Unsupervised Monocular Depth Estimation with Left-Right Consistency," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 6602-6611, doi: 10.1109/CVPR.2017.699.
[24] Y. Kuznietsov, J. Stückler and B. Leibe, "Semi-Supervised Deep Learning for Monocular Depth Map Prediction," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 2215-2223, doi: 10.1109/CVPR.2017.238.