Basak 2020

This paper presents a novel encoder-decoder architecture for monocular depth estimation from a single RGB image, leveraging transfer learning to improve depth map quality. The proposed model outperforms existing methods by accurately predicting object boundaries and generating high-resolution depth maps, validated on standard datasets like KITTI and NYU Depth V2. The architecture is designed to be efficient with fewer parameters, making it suitable for applications in autonomous driving and computer vision.

Uploaded by

bhatsajid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

30 views6 pages

Basak 2020

Uploaded by

bhatsajid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON) | 978-1-6654-0373-3/20/$31.

00 ©2020 IEEE | DOI: 10.1109/UPCON50219.2020.9376365

Monocular Depth Estimation Using

Encoder-Decoder Architecture and Transfer
Learning from Single RGB Image
Hritam Basak Sagnik Ghosal Mainak Sarkar
Department of Electrical Engineering Department of Electrical Engineering
Department of Electrical Engineering Jadavpur University Jadavpur University
Jadavpur University Kolkata, India Kolkata, India
Kolkata, Inida [email protected] [email protected]
[email protected]

Mayukhmali Das Soham Chattopadhyay

Department of Electronics and Telecommunication Engineering Department of Electrical Engineering
Jadavpur University Jadavpur University
Kolkata, India Kolkata, India
[email protected] [email protected]

Abstract—Depth estimation from a single RGB image has in image resolution and quality of predictions in the resulting
been one of the most important research topics in recent depth maps. The recent applications of computer vision like
days as it has several important applications in self-supervised virtual reality, 3D reconstruction, autonomous driving, and
driving in autonomous cars, image reconstruction, and scene
segmentation. Depth estimation from a single monocular image medical imaging demands accurate object boundaries and fine-
has been challenging as compared to stereo images due to the grained depth predictions as well as a faster computation for
lack of spatio-temporal features per frame that makes 3D depth these methods to be effective. Therefore, the problem demands
perception easier. Existing models and solutions in monocular discontinuity composition most accurately and faithfully to
depth estimation often resulted in low resolution and blurry depth avoid huge trepidations that are very common in deep learning-
maps and often fail to identify small object boundaries. In this
paper, we propose a simple encoder-decoder based network that based depth predictions.
can predict high-quality depth images from single RGB images However, there are other geometric and hardware-based
using transfer learning. We have utilized important features methods too for depth estimations. Recently, structure from
extracted from pre-trained networks, and after initializing the motion (SFM) methods have successfully evolved for Simul-
encoder with fine-tuning and important augmentation strategies, taneous Localization and Mapping (SLAM) implementation
the network decoder part computes the high-end depth maps.
The network has fewer trainable parameters and small iterations, where 3D scenes were reconstructed from a series of 2D
though it outperforms the existing state-of-the-art methods and monocular images through feature correspondence and geo-
captures accurate boundaries when evaluated on two standard metrical constraints from the image sequences. Human eyes
datasets, KITTI, and NYU Depth V2. can efficiently estimate the depth maps from scenes using
Keywords—Monocular image, Depth estimation, Encoder- occlusion, knowledge of known object shapes, shadow and
decoder, Transfer learning.
lighting information, known perspective, and relative scaling.
I. I NTRODUCTION This method has been mimicked by stereo image pairs where
the depths are calculated from disparity maps of the images
Retrieving depth information from RGB images has been
captured by a slightly displaced pair of cameras. The disparity
of utmost importance and application in recent days and it
maps are calculated from the cost function where the scale
has already implemented in semantic segmentation, augmented
information is included as the position transformation of both
reality, Simultaneous Localization and Mapping (SLAM) [1],
the cameras are calibrated before the experiment. The RGB-
scene understanding, image refocusing, and real-time naviga-
D cameras and LIDAR methods have also been instrumental
tion in self-driving cars. Recent studies have focused mostly
in the depth estimations recently where the camera sensors
on using CNNs and other deep learning methods for the recon-
can capture the depth maps directly from the scene images.
struction of images from 2D to 3D. Though there have been
Though the LIDAR method can generate spare 3D maps only
steady improvements in the dense depth estimation strategies
and have been implemented in self-driving cars recently, they
in the last few years, there are several scopes of improvements
suffer from relatively low information in low lighting condi-
978-0-7381-1151-3/20/$31.00 ©2020 IEEE tions at night and have high cost that makes it incompatible

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.
for use in small robotics applications. Therefore, the problem But similar to previous models, it too involved the requirement
demands an accurate and simple depth-estimation method ground truth data at the time of model training. Flynn et al.
which we tried to propose in this paper. [13] proposed “DeepStereo” which involved the relative pose
of multiple cameras while training, to predict the depth value
of an input image. This model functioned by selecting the
most relevant depth values based on plane sweep volumes. A
major disadvantage of this method again lies in the fact that
several similar but slightly varying posed images are required
for training the model. Any insufficiency in the training set
causes the model to predict with much less accuracy. This
was rectified to a large extent by Xie et al. [15], where from
an input left-sided image, the corresponding right-sided image
was predicted and vice-versa. However, the disadvantage of
this method is, with a flux of disparity values, the model
became less memory efficient, thus limiting its usage for
images with much larger resolutions.
A. Contributions of current paper
Based on the pre-existing architectures and training strate-
gies [2], [3], [4], [5], we have designed a simple encoder-
Fig. 1. Comparison of ground truth and estimated depth maps. Row 1 shows decoder architecture with fewer parameters and less complex
that our model performed significantly well in low lighting condition too.
that makes the training and evaluation process easier. Our
contribution can be defined as a two-fold contribution:
II. R ELATED WORKS • We have designed a simple transfer learning-based net-
In this section, we illustrate the existing research works work architecture that can detect the object boundaries
involving monocular depth approximation, where the input is more faithfully and predict the depth maps more explic-
only a single input image. In all the works mentioned below, itly than the existing methods. The performance of our
suppositions about the environment are not made before- model can be inferred from figure 1 where the predicted
hand. Predefined resemblance calculations by considering the depth maps have been compared with the ground truth
matching similar to supervised learning and then developing depth maps.
a function based on that, produced excellent results as in • We define data augmentation and training strategy with
[7]. It has been demonstrated that such a multi-class cate- much fewer parameters and finally, we have defined
gorization outperforms both in terms of accuracy and rapidity. a corresponding loss function for faster and smoother
Mayer et al. [9] proposed a fully convolutional [10] deep convergence.
network, termed as “DispNet”. In his proposed methodology,
the disparity for every pixel is forecasted, by a minimized III. M ATERIALS AND METHODS
regression loss function. However, it trains on a large amount In this section, we present the detailed workflow of our
of ground truth data with their original stereo images, which method that includes the dataset description, network architec-
is quite difficult to organize from real-world scenarios. Saxena ture, complexity assessment, loss function, and augmentation
et al. [12] proposed a model termed “Make3D”. This method policy. The detailed descriptions of each section are as given
initially breaks the input image and then predicts the disparity, below.
based on the three-dimensional positioning of planes. A major
back draw of this method is patented when predictions made A. Datasets
on thin structures failed to generate the required output, as a 1) NYU Depth V2: The dataset contains 120k training
result of the dearth of a wide range of universal circumstances. images and their corresponding depth maps of indoor scenes,
The approach made by Liu et al. [13] involved convolutional though we used a 50k subset for training purposes and 654
neural networks (CNN) for learning and prediction. Ladicky et images were used for testing purposes. The images in the
al. [14] proposed the usage of semantics to improve disparity dataset have a dimension of 640×480 and the depth maps
measurement. Karsch et al. [18] developed a model that have a resolution as half of the raw images (i.e. 320×240) [6].
predicted disparity with great accuracy. However, it required We did not crop and resize the images during training though
the entire training set to be available at the time of testing. there was a possibility of distortion correction preprocessing
Eigen et al. In [19] proposed a model, which predicted the which may result in missing pixels. Though, during testing,
depth by using a two-scale deep network trained. The model we calculated the depth maps and up-sampled in twice to
was trained on images and their corresponding depth values. It match the input image dimension. The output was taken
learned to represent disparity directly from the raw image data by calculating the average of the mirror image pairs of the
and did not rely on supervised features or over-segmentation. predictions.

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.
2) KITTI Dataset: The KITTI dataset consists of stereo
images and their corresponding laser scans captured by a mov-
ing vehicle. The images have dimensions around 1241×376
whereas there are lots of missing data and low-resolution
images in the depth map though they are compensated using
the inpainting method. We used around 26k images for training
and 697 images for testing purposes. The images were up-
sampled using bilinear interpolation to a speciﬁed dimension
of 1280×384 to meet the encoder’s requirement of divisibility
of image dimension by 32 [8]. During testing, the depth
map was cropped to 624×192 resolution to match the input
resolution and ﬁnally, the output was taken by calculating the
average of the mirror image pairs of the predictions.

Fig. 2. The simpliﬁed encoder-decoder architecture with skip connections

proposed in this paper. The encoder part is made of the pretrained DenseNet-
169 architecture while the decoder part consists of several up-sampling layers.

B. Network architecture
The network is based on the encoder-decoder structure
(figure 2) where we have used the DenseNet-169 [8] ar-
chitecture pretrained on ImageNet [11] as the backbone in Fig. 3. The basic architecture of the proposed network. The encoder part (up
to CONV2) are same as the DenseNet-169 architecture whereas there is a
the encoder part and we removed the FC layer to remove Leaky ReLU after each CONVB layer. The output dimensions are mentioned
the ImageNet classification result. The feature maps are then for the NYU Depth V2 dataset in the form of height*width*channel.
fed into an up-sampling network which, along with the skip
connections, form the decoder part. We started with 1×1
convolution with same number of channels of the encoder truth depth map (d) and the predicted depth map (d’). There
followed by successive convolutions consisting of 2×2 bi- are several loss functions that have been used successfully in
linear up-sampling and 3×3 convolution layers. All the up- the existing depth estimation tasks and performed quite well.
sampling layers were associated with an activation function The Structural Similarity (SSIM) has been proven to be very
which was set as Leaky Rectified Linear Unit with α value useful in reconstruction tasks by using CNNs. This commonly
kept as 0.2. The decoder part did not comprise of any batch used metric is defined by the following equation :
normalization or other advanced layers as suggested by the
contemporary existing complex network architectures [16], (2.Avgd .Avgd’ + a)(Convdd’ + b)
[17]. The detailed network architecture is described in figure SSIM(d,d’) =
(Avgd 2 + Avgd’ 2 + a)(Vard 2 + Vard’ 2 + b)
3. We have experimented with several other encoders (e.g. (1)
ResNet-50, DenseNet-121) and also with some other decoder Where, Avg and Var2 signify the average and variance
architectures [20], [21] and after thorough evaluation, we respectively. Conv is the convergence metric. a= (c1 L)2 and
concluded that the complexity in the network architecture and b=(c2 L)2 ; L= (2no. of bits per pixel - 1);c1 =0.01 and c2 =0.03
a greater number of convolutional layers must not necessarily
contribute to better performance.Based on several experiments 1 − SSIM(d,d )
with different encoder architectures with various upsampling SSIM Loss(d,d ) = (2)
2
methods as mentioned above, it is concluded that our proposed
Secondly, the gradient loss is defined as the cumulative
method shows that a simple encoder-decoder architecture with
average of the absolute values of gradient in predicted x and
a simple 2×2 bilinear up-sampling can even contribute to very
y components
good results.
1
N
C. Loss function
Gradient Loss(d,d ) = (|Gradx (di , di )| + |Grady (di , di )|)
The loss function in the depth estimation task is defined as N
i=1
a function that calculates the difference between the ground (3)

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.
Fig. 4. Qualitative comparison of our proposed method with the state-of-the-art method on NYU Depth V2 dataset. The ﬁrst column is the input RGB image,
the second column is the corresponding depth map, the third column is the prediction map by Fu et al. [17] and the last column is our predicted depth map.

The depth loss is defined as the difference in depth values numerous other options of data augmentation and it remains an
between the predicted and ground truth calculated point-wise. important field of research that how these methods can bridge
the gap between limited datasets and improved performance.
1
N
Depth Loss(d,d ) = |di − di | (4)
N IV. R ESULTS AND D ISCUSSION
i=1
However, we tried to incorporate a new mixing loss function A. Implementation details
keeping in mind two important tasks: first, giving weightage Our proposed method was implemented on the TensorFlow
to penalize the high-frequency components in the predicted backend using a Tesla K80 GPU having 2496 CUDA core and
depth maps and secondly, minimizing the difference between having a VRAM of 32 GB. The encoder was a DenseNet-
the predicted and original depth maps and thus reconstruction 169 pretrained on ImageNet and the weights of the decoder
of depth images. Our proposed loss function, which is defined were initialized randomly. The training parameters for the
below, balances between these two operations. depth-estimation network were set experimentally and the
best parameters were selected for optimized results. ADAM
Loss(d,d ) = SSIM Loss(d,d ) + Gradient Loss(d,d )+ optimizer was used in the training process with the base
(5) learning rate 1e-3 and batch size=4, keeping β 1 and β 2 values
0.1 × Depth Loss(d,d )
as 0.9 and 0.999. The total trainable parameters in our network
D. Data augmentation were approximately 41.5M, and it took approximately 36
Data augmentation is an important tool to improve the hours to complete training on the NYU Depth V2 dataset and
learning of a model and to avoid overfitting of a network. 15 hours for the KITTI dataset.
The geometrical and photometric transformations have been
used widely in almost all of the supervised, unsupervised B. Qualitative measurement
or semi-supervised learning-based depth estimations, however, The results from our experiments were compared with
we have found that all of them are not necessarily important other existing methods and it is observed that our method
and useful for our purpose. As the vertical flipping may result outperformed all of them including the state-of-the-art DORN,
to ambiguity in floor and ceiling position in an image, we 2018 method [17]. Besides, our method requires almost 50%
discarded it, therefore, we have kept the horizontal mirroring parameters, as compared to the DORN method with fewer
with a probability of 0.25. Another important photometric iterations, and was trained on a small part of the complete
transformation, color channel swapping has been found to be dataset of 120K images. The median value of ground truth
useful in boosting the performance, hence we have kept this depth maps was multiplied (scalar multiplication) with the
augmentation with a probability of 0.5. However, there are predicted depth maps to remove the problem of absolute

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.
Fig. 5. Qualitative comparison of our proposed method with the state-of-the-art method on KITTI benchmark dataset. The ﬁrst column is the input RGB
image, the second column is the depth map predicted by our method and the last column is the result from Fu et al. [17].

TABLE I
C OMPARISON OF DIFFERENT EVALUATION METRICS OF OUR RESULT WITH THE EXISTING METHODS IN NYU D EPTH V2 DATASET. T HE RESULT SHOWS
THAT OUR METHOD PERFORMS BETTER THAN ALL OTHER EXISTING METHODS .

Relative Root mean Average Threshold Threshold Threshold

Method
error square error log loss accuracy (p=1) accuracy (p=2) accuracy (p=3)
DORN [17] 0.120 0.510 0.049 0.832 0.970 0.991
Eigen et al. [19] 0.161 0.639 - 0.791 0.949 0.979
Hao et al. [16] 0.128 0.548 0.050 0.839 0.963 0.990
Xu et al. [22] 0.123 0.585 0.053 0.806 0.959 0.984
Ours (scaled) 0.103 0.388 0.047 0.892 0.978 0.995

TABLE II
C OMPARISON OF DIFFERENT EVALUATION METRICS OF OUR RESULT WITH THE EXISTING METHODS IN KITTI BENCHMARK DATASET. T HE RESULT
SHOWS THAT OUR METHOD PERFORMS THE SECOND BEST AMONG ALL OTHER EXISTING METHODS .

Relative Root mean Average Threshold Threshold Threshold

Method
error square error log loss accuracy (p=1) accuracy (p=2) accuracy (p=3)
DORN [17] 0.070 3.745 0.118 0.930 0.979 0.991
Eigen et al. [19] 0.189 7.148 0.267 0.687 0.900 0.958
Godard et al. [23] 0.124 4.942 0.201 0.858 0.950 0.977
Kuznietsov et al. [24] 0.117 4.459 0.186 0.859 0.959 0.975
Ours (scaled) 0.093 3.498 0.197 0.879 0.964 0.982

scaling of the scene which is a major source of error in many inconsistency and several missing cases in ground truths.
SOTA methods. The performance of our proposed method Hence, we could not compare the results with ground truths in
on NYU Depth v2 is compared with the ground truth and each case. It is observed that in both our case and the SOTA
[17] in ﬁgure 4 where it is clear that our method predicted Fu et al. [17], the models fail to identify the objects that are
closer to the actual ground truth image. There are large at a long distance from the camera. However, our method still
distortions in the output images of DORN with inconsistency predicts better results to identify humans, vehicles, and other
in boundary detection and smoothness of depth maps. Row 2 objects in the scenes with smoother depth maps as compared
and 3 also suggest that DORN cannot predict depth maps at all to the DORN [17].
in low lighting conditions whereas our method still manages
C. Quantitative evaluation
to predict quite well in these conditions.
We have quantitatively compared our depth maps with
Similarly, we also compared our results with the SOTA some of the existing previous works by using the following
methods on the KITTI benchmark dataset. The dataset has evaluation metrics. Table 1 and 2 signiﬁes the comparitive

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.
study of different evaluation metrics on KITTI and NYU [6] Silberman, N., Hoiem, D., Kohli, P. and Fergus, R., 2012, October.
Depth V2 dataset. The bold results suggest the best results Indoor segmentation and support inference from rgbd images. In Eu-
ropean conference on computer vision (pp. 746-760). Springer, Berlin,
among all the results which were considered for comparison. Heidelberg.
The results suggests that our method performes the best for the [7] S. Zagoruyko and N. Komodakis, ”Learning to compare image patches
NYU dataset but performs second best in terms of evaluation via convolutional neural networks,” 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 4353-
scores in KITTI dataset. 4361, doi: 10.1109/CVPR.2015.7299064.
1 |di − di |
N [8] G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, ”Densely
rel(Relative error) = (6) Connected Convolutional Networks,” 2017 IEEE Conference on Com-
N di puter Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp.
i=1 2261-2269, doi: 10.1109/CVPR.2017.243.

[9] N. Mayer et al., ”A Large Dataset to Train Convolutional Networks
1 N for Disparity, Optical Flow, and Scene Flow Estimation,” 2016 IEEE
rms error = (di − di )2 (7) Conference on Computer Vision and Pattern Recognition (CVPR), Las
N Vegas, NV, 2016, pp. 4040-4048, doi: 10.1109/CVPR.2016.438.
i=1
[10] J. Long, E. Shelhamer and T. Darrell, ”Fully convolutional networks for
semantic segmentation,” 2015 IEEE Conference on Computer Vision
1
N and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3431-3440,
Average log10 error = |log10 (di ) − log10 (di )| (8) doi: 10.1109/CVPR.2015.7298965.
N [11] J. Deng, W. Dong, R. Socher, L. Li, Kai Li and Li Fei-Fei, ”ImageNet:
i=1
A large-scale hierarchical image database,” 2009 IEEE Conference on
Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 248-
1
δ p (threshold accuracy) = × (di ); for all values of 255, doi: 10.1109/CVPR.2009.5206848.
100
[12] A. Saxena, M. Sun and A. Y. Ng, ”Make3D: Learning 3D Scene
Structure from a Single Still Image,” in IEEE Transactions on Pattern
di di
di , di : max , = δ < threshold value for (9) Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, May
di di 2009, doi: 10.1109/TPAMI.2008.132.
threshold = 1.25p , where p = 1, 2, 3 [13] F. Liu, C. Shen, G. Lin and I. Reid, ”Learning Depth from Single
Monocular Images Using Deep Convolutional Neural Fields,” in IEEE
V. C ONCLUSION AND FUTURE WORK Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no.
10, pp. 2024-2039, 1 Oct. 2016, doi: 10.1109/TPAMI.2015.2505283.
In this paper, we proposed a simple encoder-decoder based [14] L. Ladický, J. Shi and M. Pollefeys, ”Pulling Things out of Perspective,”
network architecture that can perform better than the existing 2014 IEEE Conference on Computer Vision and Pattern Recognition,
Columbus, OH, 2014, pp. 89-96, doi: 10.1109/CVPR.2014.19.
methods based on transfer learning and with fewer trainable [15] Xie, J., Girshick, R. and Farhadi, A., 2016. Deep3D: Fully Automatic
parameters as well as smaller computational resources. How- 2D-to-3D Video Conversion with Deep Convolutional Neural Networks.
ever, there remain huge opportunities for experimentation in Computer Vision – ECCV 2016, pp.842-857.
[16] Z. Hao, Y. Li, S. You and F. Lu, ”Detail Preserving Depth Estimation
encoder depths, decoder layers, and color channel augmen- from a Single Image Using Attention Guided Networks,” 2018 Interna-
tation, and other photometric and geometric transformations. tional Conference on 3D Vision (3DV), Verona, 2018, pp. 304-313, doi:
The main target of this experiment was to push forward 10.1109/3DV.2018.00043.
[17] H. Fu, M. Gong, C. Wang, K. Batmanghelich and D. Tao, ”Deep Ordinal
the depth estimation task and we have performed the object Regression Network for Monocular Depth Estimation,” 2018 IEEE/CVF
detection and depth estimation task more faithfully than the Conference on Computer Vision and Pattern Recognition, Salt Lake City,
previous works. Further studies can be made on available UT, 2018, pp. 2002-2011, doi: 10.1109/CVPR.2018.00214.
[18] K. Karsch, C. Liu and S. Kang, ”Depth Extraction from Video Using
public and private datasets, which was not possible for us Non-parametric Sampling”, Computer Vision – ECCV 2012, pp. 775-
to perform due to various constraints. There are numerous 788, 2012. Available: 10.1007/978-3-642-33715-4 56 [Accessed 14 July
opportunities to use more compact encoder structures and 2020].
[19] D. Eigen and R. Fergus, ”Predicting Depth, Surface Normals and
pretrained weights to outperform our methods and we would Semantic Labels with a Common Multi-scale Convolutional Architec-
like to pursue further with different learning strategies, aug- ture,” 2015 IEEE International Conference on Computer Vision (ICCV),
mentation methods, network architectures, and the reasons for Santiago, 2015, pp. 2650-2658, doi: 10.1109/ICCV.2015.304.
[20] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari and N. Navab,
our method to perform better. ”Deeper Depth Prediction with Fully Convolutional Residual Networks,”
2016 Fourth International Conference on 3D Vision (3DV), Stanford,
R EFERENCES CA, 2016, pp. 239-248, doi: 10.1109/3DV.2016.32.
[1] G. Hu, S. Huang, L. Zhao, A. Alempijevic and G. Dissanayake, ”A ro- [21] A. Levin, D. Lischinski and Y. Weiss, ”Colorization using optimization”,
bust RGB-D SLAM algorithm”2012 IEEE/RSJ International Conference ACM SIGGRAPH 2004 Papers on - SIGGRAPH ’04, 2004. Available:
on Intelligent Robots and Systems, Vilamoura, 2012, pp. 1714-1719, doi: 10.1145/1186562.1015780 [Accessed 14 July 2020].
10.1109/IROS.2012.6386103. [22] Xu D, Ricci E, Ouyang W, Wang X, Sebe N. Multi-scale continuous
[2] Bo Li, Chunhua Shen, Yuchao Dai, A. van den Hengel and Mingyi crfs as sequential deep networks for monocular depth estimation. In-
He, ”Depth and surface normal estimation from monocular images Proceedings of the IEEE Conference on Computer Vision and Pattern
using regression on deep features and hierarchical CRFs,”2015 IEEE Recognition 2017 (pp. 5354-5362).
Conference on Computer Vision and Pattern Recognition (CVPR), [23] C. Godard, O. M. Aodha and G. J. Brostow, ”Unsupervised Monocular
Boston, MA, 2015, pp. 1119-1127, doi: 10.1109/CVPR.2015.7298715. Depth Estimation with Left-Right Consistency,” 2017 IEEE Conference
[3] Epic Games, Inc. Marketplace - UE4 Marketplace, 2018 on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI,
[4] M. Song and W. Kim, ”Depth Estimation From a Single Image Using 2017, pp. 6602-6611, doi: 10.1109/CVPR.2017.699.
Guided Deep Network”, IEEE Access, vol. 7, pp. 142595-142606, 2019. [24] Y. Kuznietsov, J. Stückler and B. Leibe, ”Semi-Supervised Deep Learn-
Available: 10.1109/access.2019.2944937. ing for Monocular Depth Map Prediction,” 2017 IEEE Conference on
[5] Araar O, Aouf N, Dietz JL. Power pylon detection and monocular depth Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017,
estimation from inspection UAVs. Industrial Robot: An International pp. 2215-2223, doi: 10.1109/CVPR.2017.238.
Journal. 2015 May 18.

Authorized licensed use limited to: Carleton University. Downloaded on May 27,2021 at 02:09:59 UTC from IEEE Xplore. Restrictions apply.

AI Depth Estimation Overview
No ratings yet
AI Depth Estimation Overview
16 pages
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
No ratings yet
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
10 pages
Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals
No ratings yet
Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals
13 pages
Monocular Depth Estimation Based On Deep Learning: An Overview
No ratings yet
Monocular Depth Estimation Based On Deep Learning: An Overview
14 pages
CV Sce
No ratings yet
CV Sce
12 pages
Atapour-Abarghouei Real-Time Monocular Depth CVPR 2018 Paper PDF
No ratings yet
Atapour-Abarghouei Real-Time Monocular Depth CVPR 2018 Paper PDF
11 pages
Tousi 2020
No ratings yet
Tousi 2020
6 pages
JournalPaper ASC Updated
No ratings yet
JournalPaper ASC Updated
16 pages
Depth Estimation Based On Monocular Camera Sensors in Autonomous Vehicles: A Self Supervised Learning Approach
No ratings yet
Depth Estimation Based On Monocular Camera Sensors in Autonomous Vehicles: A Self Supervised Learning Approach
13 pages
Depth Estimation for Vision Experts
No ratings yet
Depth Estimation for Vision Experts
18 pages
Depthanything
No ratings yet
Depthanything
18 pages
Unsupervised Domain Adaptation For Depth Prediction From Images
No ratings yet
Unsupervised Domain Adaptation For Depth Prediction From Images
14 pages
Fdsafdsfsafasdfbrwa
No ratings yet
Fdsafdsfsafasdfbrwa
14 pages
Depth Anything - Unleashing The Power of Large-Scale Unlabeled Data
No ratings yet
Depth Anything - Unleashing The Power of Large-Scale Unlabeled Data
11 pages
AdaBins: Adaptive Depth Estimation Model
No ratings yet
AdaBins: Adaptive Depth Estimation Model
13 pages
Neural RGB D Sensing: Depth and Uncertainty From A Video Camera
No ratings yet
Neural RGB D Sensing: Depth and Uncertainty From A Video Camera
13 pages
Depth Estimation From A Single Image Using Deep Learned Phase Coded Mask
No ratings yet
Depth Estimation From A Single Image Using Deep Learned Phase Coded Mask
12 pages
Computer Vision
No ratings yet
Computer Vision
8 pages
Domain Randomization-Enhanced Depth Simulation and Restoration For Perceiving and Grasping Specular and Transparent Objects
No ratings yet
Domain Randomization-Enhanced Depth Simulation and Restoration For Perceiving and Grasping Specular and Transparent Objects
26 pages
Improving Monocular Visual Odometry Using Learned Depth
No ratings yet
Improving Monocular Visual Odometry Using Learned Depth
14 pages
Neural RGBRD Sensing Depth and Uncertainty From A Video Camera
No ratings yet
Neural RGBRD Sensing Depth and Uncertainty From A Video Camera
10 pages
D P: S M M D L T S: Epth RO Harp Onocular Etric Epthin ESS Hana Econd
No ratings yet
D P: S M M D L T S: Epth RO Harp Onocular Etric Epthin ESS Hana Econd
33 pages
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
No ratings yet
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
11 pages
Project Synopsis Template
No ratings yet
Project Synopsis Template
5 pages
Zusc S 24 00845
No ratings yet
Zusc S 24 00845
15 pages
Access-2025-23163 Proof Hi
No ratings yet
Access-2025-23163 Proof Hi
14 pages
Event-Based Monocular Depth Estimation With Recurrent Transformers
No ratings yet
Event-Based Monocular Depth Estimation With Recurrent Transformers
13 pages
Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer
No ratings yet
Monovit: Self-Supervised Monocular Depth Estimation With A Vision Transformer
11 pages
Unsupervised Monocular Depth Learning With Integrated Intrinsics and Spatio-Temporal Constraints
No ratings yet
Unsupervised Monocular Depth Learning With Integrated Intrinsics and Spatio-Temporal Constraints
8 pages
Depth Estimation
No ratings yet
Depth Estimation
3 pages
Depth Estimation with CNN Transfer
No ratings yet
Depth Estimation with CNN Transfer
15 pages
The Fourth Monocular Depth Estimation Challenge
No ratings yet
The Fourth Monocular Depth Estimation Challenge
14 pages
Deep Learning Based Monocular Depth Estimation For Object Distance Inference in 2D Images
No ratings yet
Deep Learning Based Monocular Depth Estimation For Object Distance Inference in 2D Images
5 pages
Demon: Depth and Motion Network For Learning Monocular Stereo
No ratings yet
Demon: Depth and Motion Network For Learning Monocular Stereo
22 pages
Group 09
No ratings yet
Group 09
9 pages
Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume
No ratings yet
Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume
14 pages
Multidepth: Single-Image Depth Estimation Via Multi-Task Regression and Classification
No ratings yet
Multidepth: Single-Image Depth Estimation Via Multi-Task Regression and Classification
9 pages
Img Paper
No ratings yet
Img Paper
10 pages
开题报告 Akash 6
No ratings yet
开题报告 Akash 6
1 page
Main Paper With Supp
No ratings yet
Main Paper With Supp
16 pages
2024 - Learning Temporally Consistent Video Depth From Video Diffusion Priors - Shao Et Al
No ratings yet
2024 - Learning Temporally Consistent Video Depth From Video Diffusion Priors - Shao Et Al
13 pages
Full Thesis
No ratings yet
Full Thesis
57 pages
Depth Perception in Single RGB Camera System Using Lens Aperture and Object Size: A Geometrical Approach For Depth Estimation
No ratings yet
Depth Perception in Single RGB Camera System Using Lens Aperture and Object Size: A Geometrical Approach For Depth Estimation
16 pages
The Application of Deep Learning in Stereo Matching and Disparity
No ratings yet
The Application of Deep Learning in Stereo Matching and Disparity
24 pages
Stag2024 DDD
No ratings yet
Stag2024 DDD
10 pages
Dge CNN
No ratings yet
Dge CNN
10 pages
1 s2.0 S0893608021002707 Main
No ratings yet
1 s2.0 S0893608021002707 Main
13 pages
Yin Learning To Recover 3D Scene Shape From A Single Image CVPR 2021 Paper
No ratings yet
Yin Learning To Recover 3D Scene Shape From A Single Image CVPR 2021 Paper
10 pages
Qin MonoGround Detecting Monocular 3D Objects From The Ground CVPR 2022 Paper
No ratings yet
Qin MonoGround Detecting Monocular 3D Objects From The Ground CVPR 2022 Paper
10 pages
2021-Depth From Defocus With Learned Optics For Imaging and Occlusion-Aware Depth Estimation
No ratings yet
2021-Depth From Defocus With Learned Optics For Imaging and Occlusion-Aware Depth Estimation
12 pages
Bilateral Propagation Network For Depth Completion
No ratings yet
Bilateral Propagation Network For Depth Completion
15 pages
Depth Reconstruction With Deep Neural Networks (Part 2)
No ratings yet
Depth Reconstruction With Deep Neural Networks (Part 2)
54 pages
Decoder Modulation For Indoor Depth Completion
No ratings yet
Decoder Modulation For Indoor Depth Completion
11 pages
Neural Networks for Depth Sensors
No ratings yet
Neural Networks for Depth Sensors
11 pages
CroMo Cross-Modal Learning For Monocular Depth Estimation
No ratings yet
CroMo Cross-Modal Learning For Monocular Depth Estimation
11 pages
Depth Reconstruction With Deep Neural Networks (Part 1)
No ratings yet
Depth Reconstruction With Deep Neural Networks (Part 1)
66 pages
Revisiting Depth Completion From A Stereo Matching Perspective For Cross-Domain Generalization
No ratings yet
Revisiting Depth Completion From A Stereo Matching Perspective For Cross-Domain Generalization
22 pages
Ali Real-Time Vehicle Distance Estimation Using Single View Geometry
No ratings yet
Ali Real-Time Vehicle Distance Estimation Using Single View Geometry
10 pages
AWS Certified AI Practitioner (AIF-C01)
No ratings yet
AWS Certified AI Practitioner (AIF-C01)
24 pages
Angirekula Prudhvi Ai ML Engineer
No ratings yet
Angirekula Prudhvi Ai ML Engineer
2 pages
A Comprehensive Overview of Large Language Models
No ratings yet
A Comprehensive Overview of Large Language Models
69 pages
Unit-2 (Ques-Ans of Session-1 Revisiting AI, ML, DL Common Data Terminologies)
No ratings yet
Unit-2 (Ques-Ans of Session-1 Revisiting AI, ML, DL Common Data Terminologies)
4 pages
PaySafe Al Intelligent Fraud Detection For UPI Transactions Using Machine Learning
No ratings yet
PaySafe Al Intelligent Fraud Detection For UPI Transactions Using Machine Learning
7 pages
基于深度强化学习的多智能体协同算法关键技术研究王思颖
No ratings yet
基于深度强化学习的多智能体协同算法关键技术研究王思颖
155 pages
Employee Salary Prediction
No ratings yet
Employee Salary Prediction
10 pages
Recurrent Neural Network (RNN)
No ratings yet
Recurrent Neural Network (RNN)
8 pages
Course Details
No ratings yet
Course Details
3 pages
Mini Project Report Group13
No ratings yet
Mini Project Report Group13
35 pages
Artificial Intelligence For UPSSSC
No ratings yet
Artificial Intelligence For UPSSSC
10 pages
Application of Linear Algebra in Image Recognition
No ratings yet
Application of Linear Algebra in Image Recognition
3 pages
SHAP Values Algorithm Intro
No ratings yet
SHAP Values Algorithm Intro
22 pages
Learning Deep Learning Theory and Practice of Neural Networks Computer Vision NLP and Transformers Using TensorFlow 1st Edition Ekman Magnus Instant Download
100% (2)
Learning Deep Learning Theory and Practice of Neural Networks Computer Vision NLP and Transformers Using TensorFlow 1st Edition Ekman Magnus Instant Download
82 pages
Digital Image Processing Remote Sensing
No ratings yet
Digital Image Processing Remote Sensing
140 pages
Anirudh Cherukuri: Profile Summary
No ratings yet
Anirudh Cherukuri: Profile Summary
1 page
Solution Manual For Neural Networks and Learning Machines 3rd Edition
No ratings yet
Solution Manual For Neural Networks and Learning Machines 3rd Edition
6 pages
Evading Watermark Based Detection of AI-Generated Content
No ratings yet
Evading Watermark Based Detection of AI-Generated Content
20 pages
UNIT-5 Question Bank
No ratings yet
UNIT-5 Question Bank
4 pages
Paper 1
No ratings yet
Paper 1
10 pages
Weakly Supervised Deep Learning For Whole Slide Lung Cancer Image Analysis
No ratings yet
Weakly Supervised Deep Learning For Whole Slide Lung Cancer Image Analysis
13 pages
Handout - LLM Training and Inference
No ratings yet
Handout - LLM Training and Inference
5 pages
Artificial Intelligence (Ai) Engineer: Program Brochure
No ratings yet
Artificial Intelligence (Ai) Engineer: Program Brochure
18 pages
Customer Support Dataset With Multi-Intent Annotations For Conversational Ai
No ratings yet
Customer Support Dataset With Multi-Intent Annotations For Conversational Ai
27 pages
Towards Universal Fake Image Detectors That Generalize Across Generative Models
No ratings yet
Towards Universal Fake Image Detectors That Generalize Across Generative Models
17 pages
Liao DiffusionDrive Truncated Diffusion Model For End-To-End Autonomous Driving CVPR 2025 Paper
No ratings yet
Liao DiffusionDrive Truncated Diffusion Model For End-To-End Autonomous Driving CVPR 2025 Paper
11 pages
CHAPTER-2 Advance Concepts of Modeling in AI Class 10 Questions and Answers
No ratings yet
CHAPTER-2 Advance Concepts of Modeling in AI Class 10 Questions and Answers
13 pages
Video Object Tracking Using SIFT and Mean Shift: Zhu Chaoyang
No ratings yet
Video Object Tracking Using SIFT and Mean Shift: Zhu Chaoyang
49 pages
Inside Deep Learning: Math, Algorithms, Models 1st Edition Edward Raff Online Version
No ratings yet
Inside Deep Learning: Math, Algorithms, Models 1st Edition Edward Raff Online Version
164 pages
MLP XOR Theory and Diagram Corrected
No ratings yet
MLP XOR Theory and Diagram Corrected
2 pages

Basak 2020

Uploaded by

Basak 2020

Uploaded by

2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON) | 978-1-6654-0373-3/20/$31.

00 ©2020 IEEE | DOI: 10.1109/UPCON50219.2020.9376365

Monocular Depth Estimation Using

Mayukhmali Das Soham Chattopadhyay

Fig. 2. The simpliﬁed encoder-decoder architecture with skip connections

Relative Root mean Average Threshold Threshold Threshold

Relative Root mean Average Threshold Threshold Threshold

You might also like