Problem Formulation
- MDE with a CNN: $\Psi(\theta;x) \rightarrow d$.
- $\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; X^r, X^s, Y^s)$.
Contributions
- A CNN architecture trained with virtual-world supervision and real-world SfM self-supervision.
- Reduce the domain discrepancy between the supervised (virtual-world) and self-supervised (real-world) data in the space of the extracted features (backbone bottleneck) via a gradient reversal layer (GRL).
Methods
- Assume two sources of data: 1. real-world traffic sequences $X^r = \{x^r_t\}_{t=1}^{N^r}$, where $N^r$ is the number of real-world frames; 2. analogous virtual-world sequences $X^s = \{x^s_t\}_{t=1}^{N^s}$, where $N^s$ is the number of virtual-world frames.
MonoDEVSNet architecture: $\Psi(\theta;x)$
- $\Psi(\theta;x)$ consists of three blocks: an encoding block with weights $\theta^{enc}$, a multi-scale pyramidal block with weights $\theta^{pyr}$, and a decoding block with weights $\theta^{dec}$ (a minimal sketch of the blocks and the training losses follows this list).
- The role of the multi-scale pyramid block is to adapt the bottleneck of the chosen encoder to the decoder.
- $\mathcal{L}$ relies on three different losses: $\mathcal{L}^{sf}(\theta,\mathcal{V}^{sf};X^r)$, $\mathcal{L}^{sp}(\theta,X^{s};Y^s)$, and $\mathcal{L}^{DA}(\theta^{enc},\mathcal{V}^{DA};X^r,X^s)$.
- $\mathcal{L}^{sf}(\theta,\mathcal{V}^{sf};X^r)$ is the SfM self-supervised loss, essentially the same as in Monodepth2.
- $\mathcal{L}^{sp}(\theta,X^{s};Y^s)$ is the supervised loss on the virtual-world depth labels; it discards pixels with $d^s_t(p) \geq d^{max}$.
- Domain adaptation loss $\mathcal{L}^{DA}(\theta^{enc},\mathcal{V}^{DA};X^r,X^s)$:
- Aims at learning domain-invariant depth features, so that the encoder features cannot be distinguished as coming from the real world (target domain) or the virtual world (source domain).
- The domain invariance of $\theta^{enc}$ is measured by a binary target/source domain-classifier CNN, $D$, with weights $\mathcal{V}^{DA}$, attached to the encoder through a gradient reversal layer (GRL).
- $D(\theta^{enc},\mathcal{V}^{DA};x)$ outputs 1 if $x \in X^r$ and 0 if $x \in X^s$. The GRL itself acts as an identity function during forward passes of training, while during back-propagation it reverses the gradient vector passing through it. Both the GRL and $\mathcal{V}^{DA}$ are required at training time, but not at testing time.
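A minimal PyTorch-style sketch of how the three blocks and the three loss terms could be wired together during training. The module and function names (`MonoDEVSNetSketch`, `supervised_depth_loss`, `training_step`, `domain_clf`), the loss weights, and `d_max = 80` are illustrative assumptions rather than the paper's implementation; the Monodepth2-style SfM self-supervised term is only stubbed via `selfsup_loss_fn`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MonoDEVSNetSketch(nn.Module):
    """Three-block depth network: encoder (theta^enc) -> multi-scale
    pyramid adapter (theta^pyr) -> decoder (theta^dec)."""

    def __init__(self, encoder: nn.Module, pyramid: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # backbone; its bottleneck feeds the domain classifier
        self.pyramid = pyramid   # adapts the encoder bottleneck to the decoder
        self.decoder = decoder   # predicts the depth map

    def forward(self, x):
        feats = self.encoder(x)                    # bottleneck features
        depth = self.decoder(self.pyramid(feats))  # multi-scale adaptation, then decoding
        return depth, feats


def supervised_depth_loss(pred_depth, gt_depth, d_max=80.0):
    """L^sp: L1 loss on virtual-world depth labels, discarding pixels with
    gt depth >= d_max (the d_max value here is an assumption)."""
    valid = (gt_depth > 0) & (gt_depth < d_max)
    if not valid.any():
        return pred_depth.new_zeros(())
    return F.l1_loss(pred_depth[valid], gt_depth[valid])


def training_step(model, domain_clf, selfsup_loss_fn,
                  x_real, x_syn, y_syn, w_sf=1.0, w_sp=1.0, w_da=0.1):
    """One combined step: L = w_sf * L^sf + w_sp * L^sp + w_da * L^DA."""
    depth_r, feats_r = model(x_real)   # real-world batch (no depth labels)
    depth_s, feats_s = model(x_syn)    # virtual-world batch with labels y_syn

    loss_sf = selfsup_loss_fn(depth_r, x_real)        # SfM self-supervised term (stub)
    loss_sp = supervised_depth_loss(depth_s, y_syn)   # masked supervised term

    # L^DA: binary real(1) / virtual(0) classification on encoder features.
    # domain_clf is assumed to start with a gradient reversal layer
    # (a GRL sketch is given after the notes on [1] below).
    logits_r, logits_s = domain_clf(feats_r), domain_clf(feats_s)
    logits = torch.cat([logits_r, logits_s], dim=0)
    labels = torch.cat([torch.ones_like(logits_r), torch.zeros_like(logits_s)], dim=0)
    loss_da = F.binary_cross_entropy_with_logits(logits, labels)

    return w_sf * loss_sf + w_sp * loss_sp + w_da * loss_da
```

At test time only `model` would be used; `domain_clf` (and its GRL) is dropped, matching the note above that the GRL and $\mathcal{V}^{DA}$ are only needed during training.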
Unsupervised Domain Adaptation by Backpropagation [1]
- At training time, in order to obtain domain-invariant features, we seek the parameters $\theta_f$ of the feature mapping that maximize the loss of the domain classifier (by making the two feature distributions as similar as possible), while simultaneously seeking the parameters $\theta_d$ of the domain classifier that minimize the loss of the domain classifier. In addition, we seek to minimize the loss of the label predictor.
- Such reduction can be accomplished by introducing a special gradient reversal layer (GRL) defined as follows. The gradient reversal layer has no parameters associated with it (apart from the meta-parameter λ, which is not updated by backpropagation). During the forward propagation, GRL acts as an identity transform. During the backpropagation though, GRL takes the gradient from the subsequent level, multiplies it by −λ and passes it to the preceding layer.
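A minimal PyTorch sketch of the GRL just described, written as a custom autograd function (PyTorch has no built-in GRL, so the class names here are assumptions): identity in the forward pass, gradient multiplied by $-\lambda$ in the backward pass.

```python
import torch
import torch.nn as nn


class _GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)          # acts as an identity transform

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient passed to the preceding layer;
        # lambd is a meta-parameter, so it receives no gradient (None).
        return -ctx.lambd * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd           # not learnable; often scheduled during training

    def forward(self, x):
        return _GradReverse.apply(x, self.lambd)
```

The domain classifier can then be built as, e.g., `nn.Sequential(GradientReversalLayer(lambd), <conv/linear layers>)` (the layer list is a placeholder), so that minimizing the domain-classification loss updates the classifier weights normally while the reversed gradient pushes the feature extractor toward domain-confusable features.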
Important Reference
[1] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Int. Conf. on Machine Learning (ICML), 2015.