Problem Formulation
- MDE with a CNN: $\Psi(\theta;x) \rightarrow d$.
- $\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; X^r, X^s, Y^s)$.
Contributions
- A CNN architecture trained with virtual-world supervision and real-world SfM self-supervision.
- Reduce the domain discrepancy between the supervised (virtual-world) and self-supervised (real-world) data in the space of the extracted features (backbone bottleneck) via a gradient reversal layer (GRL).
Methods
- Assume two sources of data: 1. real-world traffic sequences $X^r = \{x^r_t\}_{t=1}^{N^r}$, where $N^r$ is the number of real-world frames; 2. analogous virtual-world sequences $X^s = \{x^s_t\}_{t=1}^{N^s}$, where $N^s$ is the number of virtual-world frames.
MonoDEVSNet architecture: $\Psi(\theta;x)$
- $\Psi(\theta;x)$ consists of three blocks: an encoding block with weights $\theta^{enc}$, a multi-scale pyramidal block with weights $\theta^{pyr}$, and a decoding block with weights $\theta^{dec}$ (a minimal sketch of the blocks and the training losses follows this list).
- The role of the multi-scale pyramid block is to adapt the bottleneck of the chosen encoder to the decoder.
- $\mathcal{L}$ relies on three different losses: $\mathcal{L}^{sf}(\theta,\mathcal{V}^{sf};X^r)$, $\mathcal{L}^{sp}(\theta,X^{s};Y^s)$, and $\mathcal{L}^{DA}(\theta^{enc},\mathcal{V}^{DA};X^r,X^s)$.
- $\mathcal{L}^{sf}(\theta,\mathcal{V}^{sf};X^r)$ is the SfM self-supervised loss, essentially the same as in Monodepth2.
- $\mathcal{L}^{sp}(\theta,X^{s};Y^s)$ is the supervised loss on the virtual-world depth labels; it discards pixels with $d^s_t(p) \geq d^{max}$.
- Domain adaptation loss $\mathcal{L}^{DA}(\theta^{enc},\mathcal{V}^{DA};X^r,X^s)$:
- Aims at learning domain-invariant depth features, so that the encoder features cannot be distinguished as coming from the real world (target domain) or the virtual world (source domain).
- The domain invariance of $\theta^{enc}$ is measured by a binary target/source domain-classifier CNN, $D$, with weights $\mathcal{V}^{DA}$, attached to the encoder through a gradient reversal layer (GRL).
- $D(\theta^{enc},\mathcal{V}^{DA};x)$ outputs 1 if $x \in X^r$ and 0 if $x \in X^s$. The GRL itself acts as an identity function during forward passes of training, while during back-propagation it reverses the gradient vector passing through it. Both the GRL and $\mathcal{V}^{DA}$ are required at training time, but not at testing time.
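A minimal PyTorch-style sketch of how the three blocks and the three loss terms could be wired together during training. The module and function names (`MonoDEVSNetSketch`, `supervised_depth_loss`, `training_step`, `domain_clf`), the loss weights, and `d_max = 80` are illustrative assumptions rather than the paper's implementation; the Monodepth2-style SfM self-supervised term is only stubbed via `selfsup_loss_fn`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MonoDEVSNetSketch(nn.Module):
    """Three-block depth network: encoder (theta^enc) -> multi-scale
    pyramid adapter (theta^pyr) -> decoder (theta^dec)."""

    def __init__(self, encoder: nn.Module, pyramid: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # backbone; its bottleneck feeds the domain classifier
        self.pyramid = pyramid   # adapts the encoder bottleneck to the decoder
        self.decoder = decoder   # predicts the depth map

    def forward(self, x):
        feats = self.encoder(x)                    # bottleneck features
        depth = self.decoder(self.pyramid(feats))  # multi-scale adaptation, then decoding
        return depth, feats


def supervised_depth_loss(pred_depth, gt_depth, d_max=80.0):
    """L^sp: L1 loss on virtual-world depth labels, discarding pixels with
    gt depth >= d_max (the d_max value here is an assumption)."""
    valid = (gt_depth > 0) & (gt_depth < d_max)
    if not valid.any():
        return pred_depth.new_zeros(())
    return F.l1_loss(pred_depth[valid], gt_depth[valid])


def training_step(model, domain_clf, selfsup_loss_fn,
                  x_real, x_syn, y_syn, w_sf=1.0, w_sp=1.0, w_da=0.1):
    """One combined step: L = w_sf * L^sf + w_sp * L^sp + w_da * L^DA."""
    depth_r, feats_r = model(x_real)   # real-world batch (no depth labels)
    depth_s, feats_s = model(x_syn)    # virtual-world batch with labels y_syn

    loss_sf = selfsup_loss_fn(depth_r, x_real)        # SfM self-supervised term (stub)
    loss_sp = supervised_depth_loss(depth_s, y_syn)   # masked supervised term

    # L^DA: binary real(1) / virtual(0) classification on encoder features.
    # domain_clf is assumed to start with a gradient reversal layer
    # (a GRL sketch is given after the notes on [1] below).
    logits_r, logits_s = domain_clf(feats_r), domain_clf(feats_s)
    logits = torch.cat([logits_r, logits_s], dim=0)
    labels = torch.cat([torch.ones_like(logits_r), torch.zeros_like(logits_s)], dim=0)
    loss_da = F.binary_cross_entropy_with_logits(logits, labels)

    return w_sf * loss_sf + w_sp * loss_sp + w_da * loss_da
```

At test time only `model` would be used; `domain_clf` (and its GRL) is dropped, matching the note above that the GRL and $\mathcal{V}^{DA}$ are only needed during training.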
Unsupervised Domain Adaptation by Backpropagation [1]
- At training time, in order to obtain domain-invariant features, we seek the parameters $\theta_f$ of the feature mapping that maximize the loss of the domain classifier (by making the two feature distributions as similar as possible), while simultaneously seeking the parameters $\theta_d$ of the domain classifier that minimize the loss of the domain classifier. In addition, we seek to minimize the loss of the label predictor.
- Such reduction can be accomplished by introducing a special gradient reversal layer (GRL) defined as follows. The gradient reversal layer has no parameters associated with it (apart from the meta-parameter λ, which is not updated by backpropagation). During the forward propagation, GRL acts as an identity transform. During the backpropagation though, GRL takes the gradient from the subsequent level, multiplies it by −λ and passes it to the preceding layer.
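A minimal PyTorch sketch of the GRL just described, written as a custom autograd function (PyTorch has no built-in GRL, so the class names here are assumptions): identity in the forward pass, gradient multiplied by $-\lambda$ in the backward pass.

```python
import torch
import torch.nn as nn


class _GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)          # acts as an identity transform

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient passed to the preceding layer;
        # lambd is a meta-parameter, so it receives no gradient (None).
        return -ctx.lambd * grad_output, None


class GradientReversalLayer(nn.Module):
    def __init__(self, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd           # not learnable; often scheduled during training

    def forward(self, x):
        return _GradReverse.apply(x, self.lambd)
```

The domain classifier can then be built as, e.g., `nn.Sequential(GradientReversalLayer(lambd), <conv/linear layers>)` (the layer list is a placeholder), so that minimizing the domain-classification loss updates the classifier weights normally while the reversed gradient pushes the feature extractor toward domain-confusable features.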
Important Reference
[1] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in Int. Conf. on Machine Learning (ICML), 2015.