
Cross-Domain Self-supervised Multi-task Feature Learning using Synthetic Imagery

Zhongzheng Ren and Yong Jae Lee


University of California, Davis
{zzren, yongjaelee}@[Link]
arXiv:1711.09082v1 [[Link]] 24 Nov 2017

Abstract

In human learning, it is common to use multiple sources of information jointly. However, most existing feature learning approaches learn from only a single task. In this paper, we propose a novel multi-task deep network to learn generalizable high-level visual representations. Since multi-task learning requires annotations for multiple properties of the same training instance, we look to synthetic images to train our network. To overcome the domain difference between real and synthetic data, we employ an unsupervised feature space domain adaptation method based on adversarial learning. Given an input synthetic RGB image, our network simultaneously predicts its surface normal, depth, and instance contour, while also minimizing the feature space domain differences between real and synthetic data. Through extensive experiments, we demonstrate that our network learns more transferable representations compared to single-task baselines. Our learned representation produces state-of-the-art transfer learning results on PASCAL VOC 2007 classification and 2012 detection.

Figure 1. Main idea. A graphics engine can be used to easily render realistic synthetic images together with their various physical property maps (panels: Synthetic Image, Depth, Surface Normal, Instance Contour). Using these images, we train a self-supervised visual representation learning algorithm in a multi-task setting that also adapts its features to real-world images.

1. Introduction

In recent years, deep learning has brought tremendous success across various visual recognition tasks [44, 23, 73]. A key reason for this phenomenon is that deep networks trained on ImageNet [13] learn transferable representations that are useful for other related tasks. However, building large-scale, annotated datasets like ImageNet [13] is extremely costly both in time and money. Furthermore, while benchmark datasets (e.g., MNIST [40], Caltech-101 [37], Pascal VOC [19], ImageNet [13], MS COCO [42]) enable breakthrough progress, it is only a matter of time before models begin to overfit and the next bigger and more complex dataset needs to be constructed. The field of computer vision is in need of a more scalable solution for learning general-purpose visual representations.

Self-supervised learning is a promising direction, of which there are currently three main types. The first uses visual cues within an image as supervision, such as recovering the input from itself [69, 26], color from grayscale [75, 76], equivariance of local patches [52], or predicting the relative position of spatially-neighboring patches [51, 14]. The second uses external sensory information such as motor signals [3, 30] or sound [53, 4] to learn image transformations or categories. The third uses motion cues from videos [70, 31, 48, 54]. Although existing methods have demonstrated exciting results, these approaches often require delicate and cleverly-designed tasks in order to force the model to learn semantic features. Moreover, most existing methods learn only a single task. While the model could learn to perform really well at that task, it may in the process lose its focus on the actual intended goal, i.e., to learn high-level semantic features. Recent self-supervised methods that do learn from multiple tasks either require a complex model to account for the potentially large differences in input data type (e.g., grayscale vs. color) and tasks (e.g., relative position vs. motion prediction) [15], or are designed specifically for tabletop robotic tasks and thus have difficulty generalizing to more complex real-world imagery [57].

In human learning, it is common to use multiple sources of information jointly. Babies explore a new object by looking, touching, and even tasting it; humans learn a new language by listening, speaking, and writing in it. We aim to use a similar strategy for visual representation learning. Specifically, by training a model to jointly learn several complementary tasks, we can force it to learn general features that are not overfit to a single task and are instead useful for a variety of tasks. However, multi-task learning using natural images would require access to different types of annotations (e.g., depth [17], surface normal [17, 47], segmentations [47]) for each image, which would be both expensive and time-consuming to collect.

Our main idea is to instead use synthetic images and their various free annotations for visual representation learning. Why synthetic data? First, computer graphics (CG) imagery is more realistic than ever and is only getting better over time. Second, rendering synthetic data at scale is easier and cheaper compared to collecting and annotating photos from the real world. Third, a user has full control of a virtual world, including its objects, scenes, lighting, physics, etc. For example, the global illumination or weather condition of a scene can be changed trivially. This property would be very useful for learning a robust, invariant visual representation since the same scene can be altered in various ways without changing the semantics. Finally, the CG industry is huge and continuously growing, and its created content can often be useful for computer vision researchers. For example, [59] demonstrated how the GTA-V [1] game can be used to quickly generate semantic segmentation labels for training a supervised segmentation model.

Although synthetic data provides many advantages, it can still be challenging to learn general-purpose features applicable to real images. First, while synthetic images have become realistic, it is still not hard to differentiate them from real-world photos; i.e., there is a domain difference that must be overcome. To tackle this, we propose an unsupervised feature-level domain adaptation technique using adversarial training, which leads to better performance when the learned features are transferred to real-world tasks. Second, any semantic category label must still be provided by a human annotator, which would defeat the purpose of using synthetic data for self-supervised learning. Thus, we instead leverage other free physical cues to learn the visual representations. Specifically, we train a network that takes an image as input and predicts its depth, surface normal, and instance contour maps. We empirically show that learning to predict these mid-level cues forces the network to also learn transferable high-level semantics.

Contributions. Our main contribution is a novel self-supervised multi-task feature learning network that learns from synthetic imagery while adapting its representation to real images via adversarial learning. We demonstrate through extensive experiments on ImageNet and PASCAL VOC that our multi-task approach produces visual representations that are better than alternative single-task baselines, and highly competitive with the state-of-the-art.

2. Related work

Synthetic data for vision. CAD models have been used for various vision tasks such as 2D-3D alignment [7, 5], object detection [56], and joint pose estimation and image-shape alignment [66, 27]. Popular datasets include the Princeton Shape Benchmark [62], ShapeNet [12], and SUNCG [65]. Synthetic data has also begun to show promising usage for vision tasks including learning optical flow [45], semantic segmentation [59, 60, 2], video analysis [20], stereo [77], navigation [82], and intuitive physics [41, 72, 49]. In contrast to these approaches, our work uses synthetic data to learn general-purpose visual representations in a self-supervised way.

Representation learning. Representation learning has been a fundamental problem for years; see Bengio et al. [8] for a great survey. Classical methods such as the autoencoder [26, 69] learn compressed features while trying to recover the input image. Recent self-supervised approaches have shown promising results, and include recovering color from a grayscale image (and vice versa) [75, 76, 39], image inpainting [55], predicting the relative spatial location or equivariance relation of image patches [51, 14, 52], using motion cues in video [70, 31, 48, 54], and using GANs [16]. Other works leverage non-visual sensory data to predict egomotion between image pairs [3, 30] and sound from video [53, 4]. In contrast to the above works, we explore the advantage of using multiple tasks.

While a similar multi-task learning idea has been studied in [15, 57, 71], each has its drawbacks. In [15], four very different tasks are combined into one learning framework. However, because the tasks are very different in the required input data type and learning objectives, each task is learned one after the other rather than simultaneously, and special care must be taken to handle the different data types. In [57], a self-supervised robot learns to perform different tasks and in the process acquires useful visual features. However, it has limited transferability because the learning is specific to the tabletop robotic setting. Finally, [71] combines the tasks of spatial location prediction [14] and motion coherence [70], by first initializing with the weights learned on spatial location prediction and then continuing to learn via motion coherence (along with transitive relations acquired in the process). Compared to these methods, our model is relatively simple yet generalizes well, and learns all tasks simultaneously.

Domain adaptation. To overcome dataset bias, visual domain adaptation was first introduced in [61]. Recent methods using deep networks align features by minimizing some distance function across the domains [67, 21].
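As a toy illustration of this kind of distance-based feature alignment (our own minimal sketch, not the method of [67, 21]), one can shift one domain's features to minimize the mean discrepancy between the two domains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "features" from two domains: the synthetic features are shifted
# relative to the real ones, mimicking a domain gap.
real_feats = rng.normal(loc=0.0, scale=1.0, size=(256, 8))
synth_feats = rng.normal(loc=2.0, scale=1.0, size=(256, 8))
shift = np.zeros(8)  # learnable offset applied to synthetic features

def domain_distance(a, b):
    """Squared distance between the domains' mean feature vectors."""
    return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)

# Gradient descent on the shift to minimize the mean discrepancy.
for _ in range(200):
    diff = (synth_feats + shift).mean(axis=0) - real_feats.mean(axis=0)
    shift -= 0.5 * 2.0 * diff  # gradient of the squared distance w.r.t. shift

print(domain_distance(synth_feats + shift, real_feats))  # close to 0 after alignment
```

Real methods align much richer statistics (and this paper instead aligns features adversarially, as described in Sec. 3.2), but the objective has the same shape: reduce a measured discrepancy between domain feature distributions.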
GAN [25] based pixel-level domain adaptation methods have also gained a lot of attention and include those that require paired data [29] as well as unpaired data [81, 33, 43]. Domain adaptation techniques have also been used to adapt models trained on synthetic data to real-world tasks [63, 10]. Our model also minimizes the domain gap between real and synthetic images, but we perform domain adaptation in feature space similar to [68, 22], whereby a domain discriminator learns to distinguish the domains while the learned representation (through a generator) tries to fool the discriminator. To our knowledge, our model is the first to adapt the features learned on synthetic data to real images for self-supervised feature learning.

Figure 2. Network architecture. The upper net takes a synthetic image and predicts its depth, surface normal, and instance contour map. The bottom net extracts features from a real-world image; the two share weights. The domain discriminator D tries to differentiate real and synthetic features. The learned blue modules are used for transfer learning on real-world tasks.

Multi-task learning. Multi-task learning [11] has been used for a variety of vision problems including surface normal and depth prediction [17, 18], semantic segmentation [47], pose estimation [24], robot manipulation [58, 57], and face detection [79]. Kokkinos [34] introduces a method to jointly learn low-, mid-, and high-level vision tasks in a unified architecture. Inspired by these works, we use multi-task learning for self-supervised feature learning. We demonstrate that our multi-task learning approach learns better representations compared to single-task learning.

3. Approach

We introduce our self-supervised deep network, which jointly learns multiple tasks for visual representation learning, and the domain adaptor, which minimizes the feature space domain gap between real and synthetic images. Our final learned features will be transferred to real-world tasks.

3.1. Multi-task feature learning

To learn general-purpose features that are useful for a variety of tasks, we train our network to simultaneously solve three different tasks. Specifically, our network takes as input a single synthetic image and computes its corresponding instance contour map, depth map, and surface normal map, as shown in Fig. 2.

Instance contour detection. We can easily extract instance-level segmentation masks from synthetic imagery. The masks are generated from pre-built 3D models, and are clean and accurate. However, the tags associated with an instance are typically noisy or inconsistent (e.g., two identical chairs from different synthetic scenes could be named 'chair1' and 'furniture2'). Fixing these errors (e.g., for semantic segmentation) would require a human annotator, which would defeat the purpose of self-supervised learning.

We therefore instead opt to extract edges from the instance-level segmentation masks, which alleviates the issues with noisy instance labels. For this, we simply run the Canny edge detector on the segmentation masks. Since the edges are extracted from instance-level segmentations, they correspond to semantic edges (i.e., contours of objects) as opposed to low-level edges. Fig. 1 shows an example; notice how the edges within an object, texture, and shadows are ignored. Using these semantic contour maps, we can train a model to ignore the low-level edges within an object and focus instead on the high-level edges that separate one object from another, which is exactly what we want in a high-level feature learning algorithm.

More specifically, we formulate the task as a binary semantic edge/non-edge prediction task, and use the class-balanced sigmoid cross entropy loss proposed in [73]:

L_e(E) = −β Σ_i log P(y_i = 1 | θ) − (1 − β) Σ_j log P(y_j = 0 | θ)

where E is our predicted edge map, E' is the ground-truth edge map, β = |E'_-| / (|E'_-| + |E'_+|), |E'_-| and |E'_+| denote the number of ground-truth edges and non-edges, respectively, i indexes the ground-truth edge pixels, j indexes the ground-truth background pixels, θ denotes the network parameters, and P(y_i = 1 | θ) and P(y_j = 0 | θ) are the predicted probabilities for a pixel corresponding to an edge and background, respectively.

Depth prediction. Existing feature learning methods mainly focus on designing 'pre-text' tasks such as predicting the relative position of spatial patches [14, 51] or image in-painting [55]. The underlying physical properties of a scene like its depth or surface normal have been largely unexplored for learning representations. The only exception is the work of [6], which learns using surface normals corresponding to real-world images. (In Sec. 4, we demonstrate that our multi-task approach using synthetic data leads to better transferable representations.)

Predicting the depth for each pixel in an image requires understanding high-level semantics about objects and their relative placements in a scene; it requires the model to figure out the objects that are closer/farther from the camera, and their shape and pose.
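The class-balanced edge loss L_e above can be sketched as follows (a minimal NumPy illustration with made-up inputs, not the authors' implementation; in practice the network supplies the per-pixel edge probabilities):

```python
import numpy as np

def class_balanced_edge_loss(pred_prob, gt_edges):
    """Class-balanced sigmoid cross-entropy for edge/non-edge prediction.

    pred_prob: predicted probability of 'edge' for each pixel, shape (H, W)
    gt_edges:  binary ground-truth edge map, shape (H, W)
    The rare edge class is up-weighted by beta (the fraction of non-edge
    pixels), the usual class-balancing scheme for this style of loss.
    """
    gt = gt_edges.astype(bool)
    eps = 1e-7  # avoid log(0)
    beta = (~gt).sum() / gt.size  # fraction of non-edge pixels
    edge_term = -np.log(pred_prob[gt] + eps).sum()       # i: edge pixels
    bg_term = -np.log(1.0 - pred_prob[~gt] + eps).sum()  # j: background pixels
    return beta * edge_term + (1.0 - beta) * bg_term

# A confident, correct prediction yields a lower loss than an uncertain one.
gt = np.zeros((4, 4))
gt[1, :] = 1.0  # one row of edge pixels
good = np.where(gt == 1, 0.9, 0.1)  # mostly correct probabilities
bad = np.full((4, 4), 0.5)          # uninformative probabilities
print(class_balanced_edge_loss(good, gt) < class_balanced_edge_loss(bad, gt))  # prints True
```

Without the beta weighting, the abundant background pixels would dominate the loss, since edge pixels are a small fraction of any contour map.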
While real-world depth imagery computed using a depth camera (e.g., the Kinect) can often be noisy, the depth map extracted from a synthetic scene is clean and accurate. To train the network to predict depth, we follow the approach of [18], which compares the predicted and ground-truth log depth maps of an image, Q = log Y and Q' = log Y', where Y and Y' are the predicted and ground-truth depth maps, respectively. Their scale-invariant depth prediction loss is:

L_d(Q) = (1/n) Σ_i d_i² − (1/(2n²)) Σ_{i,j} d_i d_j

where i indexes the pixels in an image, n is the total number of pixels, and d = Q − Q' is the element-wise difference between the predicted and ground-truth log depth maps. The first term is the L2 difference, and the second term tries to enforce errors to be consistent with one another in their sign.

Surface normal estimation. Surface normal is highly related to depth, and previous work [17, 18] shows that combining the two tasks can help both. We use the inverse of the dot product between the ground-truth and the prediction as the loss [17]:

L_s(S) = −(1/n) Σ_i S_i · S'_i

where i indexes the pixels in an image, n is the total number of pixels, S is the predicted surface normal map, and S' is the ground-truth surface normal map.

3.2. Unsupervised feature space domain adaptation

While the features learned above on multiple tasks will be more general-purpose than those learned on a single task, they will not be directly useful for real-world tasks due to the domain gap between synthetic and real images. Thus, we next describe how to adapt the features learned on synthetic images to real images.

Since our goal is to learn features in a self-supervised way, we cannot assume that we have access to any task labels for real images. We therefore formulate the problem as unsupervised domain adaptation, where the goal is to minimize the domain gap between synthetic x_i ∈ X and real y_j ∈ Y images. We follow a generative adversarial learning (GAN) [25] approach, which pits a generator and a discriminator against each other. In our case, the two networks learn from each other to minimize the domain difference between synthetic and real-world images so that the features learned on synthetic images can generalize to real-world images, similar to [22, 63, 10, 68]. Since the domain gap between our synthetic data and real images can be potentially huge (especially in terms of high-level semantics), we opt to perform the adaptation at the feature-level [22, 68] rather than at the pixel-level [63, 10].

Specifically, we update the discriminator and generator networks by alternating the following two stages. In the first stage, given a batch of synthetic images x = {x_i} and a batch of real images y = {y_j}, the generator B (base network in Fig. 2) computes features z_{x_i} = B(x_i) and z_{y_j} = B(y_j) for each synthetic image x_i and real image y_j, respectively. The domain discriminator D then updates its parameters φ_D by minimizing the following binary cross-entropy loss:

L_D(φ_D | z_x, z_y) = −Σ_i log(D(z_{x_i})) − Σ_j log(1 − D(z_{y_j}))

where we assign 1, 0 labels to synthetic and real images x_i, y_j, respectively.

In the second stage, we fix D and update the generator B as well as the task heads H for the three tasks. Specifically, the parameters φ_B, φ_H are updated jointly using:

L_BH(φ_B, φ_H | z_x) = Σ_i [ −log(1 − D(z_{x_i})) + λ_e L_e(E_{x_i}) + λ_d L_d(Q_{x_i}) + λ_s L_s(S_{x_i}) ]

where L_e(E_{x_i}), L_d(Q_{x_i}), L_s(S_{x_i}) are the losses for instance contour, depth, and surface normal prediction for synthetic image x_i, respectively, and λ_e, λ_d, λ_s are weights that scale their gradients to have similar magnitude. L_BH updates B so that D is fooled into thinking that the features extracted from a synthetic image are from a real image, while also updating H so that the features are good for instance contour, depth, and surface normal prediction.

Algorithm 1 Multi-task Adversarial Domain Adaptation
Input: Synthetic images X, real images Y, max iteration T
Output: Domain-adapted base network B
1: for t = 1 to T do
2:   Sample a batch of synthetic images x = {x_i}
3:   Sample a batch of real images y = {y_j}
4:   Extract a feature for each image: z_{x_i} = B(x_i), z_{y_j} = B(y_j)
5:   Keep D frozen; update B, H through L_BH(φ_B, φ_H | z_x)
6:   Keep B frozen; update D through L_D(φ_D | z_x, z_y)

Our training process is summarized in Alg. 1. Note that we do not directly update the generator B using any real images; instead the real images only directly update D, which in turn forces B to produce more domain-agnostic features for synthetic images. We also tried updating B with real images (by adding −Σ_j log(D(z_{y_j})) to L_BH), but this did not result in any improvement. Once training converges, we transfer B and finetune it on real-world tasks like ImageNet classification and PASCAL VOC detection.

3.3. Network architecture

Our network architecture is shown in Fig. 2. The blue base network consists of convolutional layers, followed by ReLU nonlinearity and BatchNorm [28]. The ensuing bottleneck layers (middle blue block) consist of dilated convolution layers [74] to enlarge the receptive field. In our experiments, the number of layers and filters in the base and bottleneck blocks follow the standard AlexNet [36] model to
ensure a fair comparison with existing self-supervised feature learning methods (e.g., [14, 75, 76, 52]). The task heads (red, green, and orange blocks) consist of deconvolution layers, followed by ReLU and BatchNorm [28]. Finally, the domain discriminator is a 13 × 13 patch discriminator [29], which takes 'conv5' features from the base network. Exact architecture details are provided in the Appendix.

Empirically, we find that minimizing the domain shift in a mid-level feature space like 'conv5', rather than at a lower or higher feature space, produces the best transfer learning results. In Sec. 4, we validate the effect of adaptation across different layers.

Figure 3. Nearest neighbor retrieval results. The first column contains the query images. We show the four nearest neighbors of a randomly initialized AlexNet, our model without domain adaptation, our model with domain adaptation, and an ImageNet pre-trained AlexNet.

4. Results

In this section, we evaluate the quality and transferability of the features that our model learns from synthetic data. We first produce qualitative visualizations of our learned conv1 filters, nearest neighbors obtained using our learned features, and learned task predictions on synthetic data. We then evaluate on transfer learning benchmarks: fine-tuning the features on PASCAL VOC classification and detection, and freezing the features learned from synthetic data and then training a classifier on top of them for ImageNet classification. We then conduct ablation studies to analyze the different components of our algorithm. Finally, we evaluate our features on NYUD surface normal prediction.

4.1. Experimental setup

Architecture As described in Sec. 3.3, we set our base network to use the same convolutional and pooling layers as AlexNet [36] (the blue blocks in Fig. 2) to ensure a fair comparison with existing self-supervised approaches [75, 16, 14, 70, 31, 3, 53, 71]. We set our input to be grayscale by randomly duplicating one of the RGB channels three times, since this can lead to more robust features [52, 14, 70].

Dataset We use Places365 [80], which contains 1.8 million images, as the source of real images for domain adaptation. For synthetic images, we combine SUNCG [65] and SceneNet RGB-D [46] to train our network. Both datasets come with depth maps for each synthetic image, and we compute instance contour maps from the provided instance masks. For surface normal, we use the ground-truth maps provided by [70] for SceneNet [46] and those provided by SUNCG [65].

Figure 4. (left) The conv1 filters learned using our model on SUNCG and SceneNet. (right) The conv1 filters learned on ImageNet. While not as sharp as those learned on ImageNet, our model learns gabor-like conv1 filters.

4.2. Qualitative analysis without finetuning

Nearest neighbor retrieval We first perform nearest neighbor retrieval experiments on the PASCAL VOC 2012 trainval dataset. For this experiment, we compare a randomly initialized AlexNet, an ImageNet pretrained AlexNet, our model without domain adaptation, and our full model with domain adaptation. For each model, we extract conv5 features for each VOC image and retrieve the nearest neighbors for each query image.

Fig. 3 shows example results. We make several observations: (1) Both our full model and our model without domain adaptation produce better features than randomly initialized features. (2) Since many of the ImageNet objects are not present in our synthetic dataset, our model is unable to distinguish between very similar categories and instead retrieves them together (e.g., cars, buses, and airplanes as the neighbors of a query car). (3) Our full model performs better than our model without domain adaptation when there are humans or animals in the query images. This is likely because although these categories are never seen in our synthetic training set, they are common in Places [80], which we use for adaptation. (4) Compared to a pre-trained ImageNet [13] model, our full model is less discriminative and prefers to capture images with more objects in the image (e.g., third row with humans). This may again be due to Places [80], since it is a scene dataset rather than an object-centric dataset like ImageNet. Overall, this result can be seen as initial evidence that our pre-trained model can capture high-level semantics on real-world data.

Conv1 filter visualization In Fig. 4, we visualize the conv1 features learned on synthetic data. While not as sharp as those learned on ImageNet [13], our model learns conv1 features that resemble gabor-like filters. Since we always convert our input image to grayscale, our network does not learn any color blob filters.

Learned task prediction visualization We next show how well our model performs on the tasks that it is trained on. Fig. 5 shows our model's depth, surface normal, and instance contour predictions on unseen SUNCG [65] images. Overall, our predictions are sharp and clean, and look quite close to the ground-truth. Note that these are representative predictions, and we sampled these because they contain interesting failure cases. For example, in the first row there is a transparent glass door. Our network fails to capture the semantic meaning of the glass door and instead tries to predict the bathtub's surface normal and contours behind it. In the third row, our network fails to correctly predict the pan and pot's depth and surface normals due to ambiguity in 3D shape. This indicates that our network can struggle when predicting very detailed 3D properties. Similar results can be seen in the fourth row with the telescope body and legs. Finally, in the last row, there is a door whose inside is too dark to see. Therefore, our network predicts it as a wall, but the ground-truth indicates there is actually something inside it.

Figure 5. Representative examples of our model's depth, surface normal, and instance contour predictions on unseen SUNCG [65] images (columns: synthetic RGB input, then predicted and ground-truth depth, surface normal, and instance contour). Our network produces predictions that are sharp and detailed, and close to the ground-truth.

These visualizations illustrate how well our network performs on each 'pre-text' task for feature learning. The better our model performs on these tasks, the more transferable the features it learns are likely to be. In the remainder of the experiments, we demonstrate that this is indeed the case, and also provide quantitative evaluations on the surface normal 'pre-text' task in Sec. 4.5, where we fine-tune our network for surface normal estimation on NYUD [50].

4.3. Transfer learning

Pascal VOC classification and detection We first evaluate on VOC classification following the protocol in [35]. We transfer the learned weights from our network (blue blocks in Fig. 2) to a standard AlexNet [36] and then re-scale the weights using [35]. We then fine-tune our model's weights on VOC 2007 trainval and test on VOC 2007 test. Table 1, second column, shows the results. Our model outperforms all previous methods despite never having directly used any real images for pre-training (recall that the real images are only used for domain adaptation). In contrast, the existing methods are all trained on real images or videos. While previous research has mainly shown that synthetic data can be a good supplement to real-world imagery [59, 60], this result indicates the promise of directly using synthetic data and its free annotations for self-supervised representation learning.

We next test VOC detection accuracy using the Fast-RCNN [23] detector. We test two models: (1) finetuning on VOC 2007 trainval and testing on VOC 2007 test data; (2) finetuning on VOC 2012 train and testing on VOC 2012 val data. Table 1, right two columns, shows the results. Our models obtain the second best result on VOC 2007 and the
Dataset 07 07 12 method conv1 conv2 conv3 conv4 conv5
Tasks CLS. DET. DET. ImageNet [36] 19.3 36.3 44.2 48.3 50.5
ImageNet [36] 79.9 56.8 56.5 Gaussian 11.6 17.1 16.9 16.3 14.1
Gaussian 53.4 41.3 - Krähenbühl et al. [35] 17.5 23.0 24.5 23.2 20.6
Autoencoder [35] 53.8 41.9 - context [14] 16.2 23.3 30.2 31.7 29.6
Krahenbuel et al. [35] 56.6 45.6 42.8 BiGAN [16] 17.7 24.5 31.0 29.9 28.0
Ego-equivariance [30] - 41.7 - context-encoder [55] 14.1 20.7 21.0 19.8 15.5
Egomotion [3] 54.2 43.9 - colorization [75] 12.5 24.5 30.4 31.5 30.3
context-encoder [55] 56.5 44.5 jigsaw [51] 18.2 28.8 34.0 33.9 27.1
BiGAN [16] 58.6 46.2 44.9
splitbrain [76] 17.7 29.3 35.4 35.2 32.8
sound [53] 61.3 - 42.9
counting [52] 18.0 30.6 34.3 32.5 25.7
flow [54] 61 52.2 48.6
Ours 16.5 27.0 30.5 30.1 26.5
motion [70] 63.1 47.2 43.5
clustering [9] 65.3 49.4 - Table 2. Transfer learning results on ImageNet [13]. We freeze
context [35] 65.3 51.1 49.9 the weights of our model and train a linear classifier for ImageNet
colorization [75] 65.9 46.9 44.5 classification [13]. Our model is trained purely on synthetic data
jigsaw [51] 67.6 53.2 - while all other methods are trained on ImageNet [13] (without la-
splitbrain [76] 67.1 46.7 43.8 bels). Despite the domain gap, our model still learns useful fea-
counting [52] 67.7 51.4 -
tures for image classification.
Ours 68.0 52.6 50.0
Table 1. Transfer learning results on PASCAL VOC 2007 classi-
fication and VOC 2007 and 2012 detection. We report the best Does multi-task learning help in learning semantics?
numbers for each method reported in [35, 76, 52]. We first analyze whether multi-task learning produces more
transferable features compared to single-task learning. Ta-
best result on 2012. These results on detection verify that ble 3, first four rows show the transfer learning results of
our learned features are robust and are able to generalize our final multi-task model (‘3 tasks’) versus each single-
across different high-level tasks. More importantly, it again task model (‘Edge’, ‘Depth’, ‘Surf.’). Our multi-task model
shows that despite only using synthetic data, we can still outperforms all single-task models on both VOC classifi-
learn transferable visual semantics. cation and detection, which demonstrates that the tasks are
complementary and that multi-task learning is beneficial for
ImageNet classification We next evaluate our learned feature learning.
features on ImageNet classification [13]. We freeze our net- Does domain adaptation help? If so, on which layer
work’s pre-trained weights and train a multinomial logis- should it be performed? Table 3, rows 5-8 show the
tic regression classifier on top of each layer from conv1 to transfer learning results after applying domain adaptation in
conv5 using the ImageNet classification training data. Fol- different layers (i.e., in Fig. 2, which layer’s features will go
lowing [76], we bilinearly interpolate the feature maps of into the domain discriminator). We see that domain adap-
each layer so that the resulting flattened features across lay- tation helps when performed on conv5 and conv61 , which
ers produce roughly equal number of dimensions. verifies that there is indeed a domain difference between our
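The per-layer resizing can be sketched as follows. This is a minimal numpy illustration of the probing protocol, not the paper's code; the helper names and the ~9216-dimensional target (matching conv5's 256×6×6) are our assumptions about the setup in [76].

```python
import numpy as np

def bilinear_resize(fmap, out_h, out_w):
    """Bilinearly resample a (C, H, W) feature map to (C, out_h, out_w)."""
    C, H, W = fmap.shape
    ys = np.linspace(0, H - 1, out_h)          # sampling grid in input coords
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[None, :, None]              # interpolation weights
    wx = (xs - x0)[None, None, :]
    top = fmap[:, y0][:, :, x0] * (1 - wx) + fmap[:, y0][:, :, x1] * wx
    bot = fmap[:, y1][:, :, x0] * (1 - wx) + fmap[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def probe_features(fmap, target_dim=9216):
    """Resize a (C, H, W) map so the flattened vector has roughly
    target_dim entries; a multinomial logistic regression (not shown)
    is then trained on this frozen feature."""
    C = fmap.shape[0]
    s = max(1, int(round(np.sqrt(target_dim / C))))
    return bilinear_resize(fmap, s, s).reshape(-1)
```

For instance, a 256-channel conv5 map becomes 6×6 (9216 dims) while a 96-channel conv1 map becomes 10×10 (9600 dims), so each layer's classifier sees a comparable input size.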
Method                   conv1   conv2   conv3   conv4   conv5
ImageNet [36]            19.3    36.3    44.2    48.3    50.5
Gaussian                 11.6    17.1    16.9    16.3    14.1
Krähenbühl et al. [35]   17.5    23.0    24.5    23.2    20.6
context [14]             16.2    23.3    30.2    31.7    29.6
BiGAN [16]               17.7    24.5    31.0    29.9    28.0
context-encoder [55]     14.1    20.7    21.0    19.8    15.5
colorization [75]        12.5    24.5    30.4    31.5    30.3
jigsaw [51]              18.2    28.8    34.0    33.9    27.1
splitbrain [76]          17.7    29.3    35.4    35.2    32.8
counting [52]            18.0    30.6    34.3    32.5    25.7
Ours                     16.5    27.0    30.5    30.1    26.5

Table 2. Transfer learning results on ImageNet [13]. We freeze the weights of our model and train a linear classifier for ImageNet classification [13]. Our model is trained purely on synthetic data while all other methods are trained on ImageNet [13] (without labels). Despite the domain gap, our model still learns useful features for image classification.

Table 2 shows the results. Our model improves over the different data initialization methods (Gaussian and Krähenbühl et al. [35]), but underperforms compared to the state-of-the-art. This is understandable, since existing self-supervised approaches [14, 16, 55, 75] are trained on ImageNet, which here is also the test dataset. Our model is instead trained on synthetic indoor images, which can have quite different high-level semantics, and thus has never seen most of the ImageNet categories during training (e.g., there are no dogs in SUNCG). Still, it outperforms [55] and performs similarly to [75] up through conv4, which shows that the semantics learned on synthetic data can still be useful for real-world image classification.

4.4. Ablation studies

We next perform ablation studies to dissect the contribution of the different components of our model. For this, we again use the PASCAL VOC classification and detection tasks for transfer learning.

Does multi-task learning help in learning semantics? We first analyze whether multi-task learning produces more transferable features than single-task learning. Table 3, first four rows, shows the transfer learning results of our final multi-task model ('3 tasks') versus each single-task model ('Edge', 'Depth', 'Surf.'). Our multi-task model outperforms all single-task models on both VOC classification and detection, which demonstrates that the tasks are complementary and that multi-task learning is beneficial for feature learning.

Does domain adaptation help? If so, on which layer should it be performed? Table 3, rows 5-8, shows the transfer learning results after applying domain adaptation at different layers (i.e., in Fig. 2, which layer's features go into the domain discriminator). We see that domain adaptation helps when performed on conv5 and conv6¹, which verifies that there is indeed a domain difference between our synthetic and real images that needs to be addressed. For example, on VOC classification, performing domain adaptation on conv5 results in 67.4% accuracy vs. 65.6% without domain adaptation. Interestingly, we see a slight decrease in performance from conv5 to conv6 across all tasks (rows 7 & 8). We hypothesize that this drop in performance is due to the biases in the synthetic and real-world image datasets we use; SUNCG and SceneNet are both comprised of indoor scenes with mostly man-made objects, whereas Places is much more diverse and consists of indoor and outdoor scenes with man-made, natural, and living objects. Thus, the very high-level semantic differences are hard to overcome, and domain adaptation becomes difficult at the very high layers.

¹ Since our pre-text tasks are pixel prediction tasks, we convert fc6-7 of AlexNet into equivalent conv6-7 layers.
Task     Adaptation       #data   07-C   07-D   12-D
Edge     -                0.5M    63.9   46.9   44.8
Depth    -                0.5M    61.9   48.9   45.8
Surf.    -                0.5M    65.3   48.2   45.4
3 tasks  -                0.5M    65.6   51.3   47.2
3 tasks  conv1            0.5M    61.9   48.7   46.0
3 tasks  conv4            0.5M    63.4   49.5   46.3
3 tasks  conv5            0.5M    67.4   52.0   49.2
3 tasks  conv6            0.5M    66.9   51.5   48.2
3 tasks  conv5 (Bi-fool)  0.5M    66.2   51.3   48.5
3 tasks  conv5            1.5M    68.0   52.6   50.0

Table 3. Ablation study results. We evaluate the impact of multi-task learning, feature space domain adaptation, and amount of data on transfer learning. All of these factors contribute together to make our model learn transferable visual features from large-scale synthetic data.

                              Lower is better    Higher is better
GT    Method             Mean    Median    11.25°   22.5°   30°
[17]  Zhang et al. [78]  22.1    14.8      39.6     65.6    75.3
[17]  Ours               21.9    14.6      39.5     66.7    76.5
[38]  Wang et al. [71]   26.0    18.0      33.9     57.6    67.5
[38]  Ours               23.8    16.2      36.6     62.0    72.9

Table 4. Surface normal estimation on the NYUD [50] test set.
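The feature-space adversarial objective behind the 'Adaptation' column can be sketched as below. This is our numpy illustration (raw-logit discriminator scores, hypothetical helper names), not the authors' released code; the optional second term in `adaptation_loss` corresponds to the 'Bi-fool' row.

```python
import numpy as np

def bce_with_logits(logits, targets):
    # numerically stable binary cross-entropy on raw discriminator scores
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def discriminator_loss(d_real, d_syn):
    # D learns to output "real" (1) on real-image features and
    # "synthetic" (0) on synthetic-image features
    return (bce_with_logits(d_real, np.ones_like(d_real))
            + bce_with_logits(d_syn, np.zeros_like(d_syn)))

def adaptation_loss(d_syn, d_real=None):
    # base-network update: make synthetic features look real; optionally
    # also make real features look synthetic (the "Bi-fool" variant)
    loss = bce_with_logits(d_syn, np.ones_like(d_syn))
    if d_real is not None:
        loss += bce_with_logits(d_real, np.zeros_like(d_real))
    return loss
```

In training, the discriminator and the shared base network would be updated alternately with these two losses on the chosen layer's features.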
We also see that it actually hurts to perform domain adaptation at a very low layer like conv1. The low performance on conv1 is likely due to the imperfect rendering quality of the synthetic data that we use. Many of the rendered images from SUNCG [65] are a bit noisy. Hence, if we take the first layer's conv1 features for domain adaptation, it is easy for the discriminator to overfit to these artifacts, which cause low-level differences from real images. Indeed, we find that the conv1 filters learned in this setting are quite noisy, and this leads to lower transfer learning performance. By performing the domain adaptation at a higher level, we find that the competition between the discriminator and generator better levels out, leading to improved transfer learning performance. Overall, performing domain adaptation in between the very low and very high layers, such as conv5, results in the best performance.

Does more data help? The main benefit of self-supervised or unsupervised learning methods is their scalability, since they do not need any manually-labeled data. Thus, we next evaluate the impact that increasing the data size has on feature learning. Specifically, we increase the size of our synthetic dataset from 0.5 million to 1.5 million images. From Table 3, we can clearly see that having more data helps ('3 tasks conv5' model, rows 7 vs. 10). Specifically, both classification and detection performance improve by 0.5-0.6 points.

Does fooling the discriminator both ways help? Since both our real and synthetic images go through one base network, in contrast to standard GAN architectures, during the generator update we can fool the discriminator in both ways (i.e., generate synthetic features that look real and real-image features that look synthetic). As seen in Table 3, row 9, fooling the discriminator in this way hurts the performance slightly compared to only generating synthetic features that look real (row 7), but is still better than no domain adaptation (row 4). One likely reason is that updating the generator to fool the discriminator into thinking that a real-image feature is synthetic does not directly help the generator produce good features for the synthetic depth, surface normal, and instance contour tasks (which are ultimately what is needed to learn semantics). Thus, by fooling the discriminator in both ways, the optimization process becomes unnecessarily tougher. This issue could potentially be solved using stabilizing methods such as a history buffer [63], which we leave for future study.

4.5. Surface normal on NYUD

Finally, we evaluate our model's transfer learning performance on the NYUD [50] dataset for surface normal estimation. Since one of our pre-training tasks is surface normal estimation, this experiment also allows us to measure how well our model does in learning that task. We use the standard split of 795 images for training and 654 images for testing. The evaluation metrics we use are the Mean, Median, and RMSE error, and the percentage of pixels whose angle error between the model prediction and the ground truth is less than 11.25°, 22.5°, and 30°. We use both the ground-truths provided by [38] and [17].
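The angular metrics above can be computed as in the following sketch (our hypothetical helper, assuming unit-norm normals; not the paper's evaluation code):

```python
import numpy as np

def surface_normal_metrics(pred, gt):
    """pred, gt: (N, 3) arrays of unit surface normals.
    Returns the per-pixel angular-error statistics used in Table 4."""
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))            # angle error in degrees
    return {
        "mean": float(ang.mean()),
        "median": float(np.median(ang)),
        "rmse": float(np.sqrt(np.mean(ang ** 2))),
        "<11.25": float((ang < 11.25).mean() * 100),  # % of pixels
        "<22.5": float((ang < 22.5).mean() * 100),
        "<30": float((ang < 30).mean() * 100),
    }
```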
We compare our model with the self-supervised model of [71], which pre-trains on the combined tasks of spatial location prediction [14] and motion coherence [70], and the supervised model trained with synthetic data [78], which pre-trains on ImageNet classification and SUNCG surface normal estimation. For this experiment, we use an FCN [44] architecture with skip connections similar to [78] and pre-train on 0.5 million SUNCG synthetic images on joint surface normal, depth, and instance contour prediction.

Table 4 shows the results. Our model clearly outperforms [71], which is somewhat expected since we directly pre-train on surface normal estimation as one of the tasks, and performs slightly better than [78] on average. Our model still needs to adapt from synthetic to real images, so our good performance likely indicates that (1) our model performs well on the pre-training tasks (surface normal estimation being one of them) and (2) our domain adaptation reduces the domain gap between synthetic and real images to ease fine-tuning.

5. Conclusion

While synthetic data has become more realistic than ever before, prior work has not explored learning general-purpose visual representations from it. Our novel cross-domain multi-task feature learning network takes a promising first step in this direction.

Acknowledgements. This work was supported in part by the National Science Foundation under Grant No. 1748387, the AWS Cloud Credits for Research Program, and GPUs donated by NVIDIA.
References

[1] Grand Theft Auto (V). [Link]/V/.
[2] A. Shafaei, J. J. Little, and M. Schmidt. Play and learn: Using video games to train computer vision models. In BMVC, 2016.
[3] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015.
[4] R. Arandjelovic and A. Zisserman. Look, listen and learn. In ICCV, 2017.
[5] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[6] A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
[7] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2D-3D model alignment via surface normal prediction. In CVPR, 2016.
[8] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
[9] P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In ICML, 2017.
[10] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
[11] R. Caruana. Multitask learning. Machine Learning, 1997.
[12] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[14] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[15] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
[16] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[17] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[18] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[19] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[20] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[21] Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[22] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
[23] R. B. Girshick. Fast R-CNN. In ICCV, 2015.
[24] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. R-CNNs for pose estimation and action detection. arXiv, 2014.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[26] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[27] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. SIGGRAPH, 2015.
[28] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[29] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[30] D. Jayaraman and K. Grauman. Learning image representations tied to egomotion. In ICCV, 2015.
[31] D. Jayaraman and K. Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. In CVPR, 2016.
[32] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[33] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017.
[34] I. Kokkinos. UberNet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
[35] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. In ICLR, 2016.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[37] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, 2004.
[38] L. Ladicky, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
[39] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.
[40] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[41] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. In ICML, 2016.
[42] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, 2014.
[43] M. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
[44] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[45] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[46] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation? In ICCV, 2017.
[47] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
[48] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[49] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi. Newtonian image understanding: Unfolding the dynamics of objects in static images. In CVPR, 2016.
[50] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[51] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[52] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. In ICCV, 2017.
[53] A. Owens, J. Wu, J. McDermott, W. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.
[54] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.
[55] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[56] X. Peng, B. Sun, K. Ali, and K. Saenko. Learning deep object detectors from 3D models. In ICCV, 2015.
[57] L. Pinto, D. Gandhi, Y. Han, Y. Park, and A. Gupta. The curious robot: Learning visual representations via physical interactions. In ECCV, 2016.
[58] L. Pinto and A. Gupta. Learning to push by grasping: Using multiple tasks for effective learning. In ICRA, 2017.
[59] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
[60] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
[61] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[62] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The Princeton shape benchmark. In Shape Modeling International, 2004.
[63] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[64] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[65] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[66] H. Su, Q. Huang, N. J. Mitra, Y. Li, and L. Guibas. Estimating image depth using shape collections. SIGGRAPH, 2014.
[67] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[68] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Adversarial discriminative domain adaptation. In CVPR, 2017.
[69] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.
[70] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[71] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, 2017.
[72] J. Wu, I. Yildirim, J. J. Lim, W. T. Freeman, and J. B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.
[73] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[74] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2015.
[75] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
[76] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
[77] Y. Zhang, W. Qiu, Q. Chen, X. Hu, and A. L. Yuille. UnrealStereo: A synthetic dataset for analyzing stereo vision. arXiv, 2016.
[78] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In CVPR, 2017.
[79] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
[80] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv, 2016.
[81] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[82] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
6. Appendix

The details of our network architectures are provided here. We first introduce the AlexNet-based network used for the experiments in Sections 4.2-4.4. We then describe the VGG16-based network used for surface normal prediction on NYUD in Section 4.5.

6.1. AlexNet

Our network details are given in Table 5. There are mainly four components in our network (recall Figure 2 in the main paper): base network (big blue blobs), bottleneck network (small blue block), task heads (orange, red, and green blocks), and domain discriminator (gray block).

Our base network takes a 227×227×3 image as input. The conv1 to conv5 layers are identical to those in AlexNet [36]. We change the stride of pool5 from 2 to 1 to avoid losing too much spatial information, following [75]. For the bottleneck block, we use a dilated convolutional layer [74] in fc6 to increase its receptive field. The base and bottleneck networks can be combined and converted into a standard AlexNet [36], which we use for the transfer learning experiments. During conversion, we absorb the batch normalization layers, convert the convolutional fc6-7 into fully-connected layers, and rescale the cross-layer weights. All of the above operations are identical to [75].

Three deconvolutional layers (Deconv8 - Deconv10) are used to recover the full-sized image outputs for each task. The output layer is also deconvolutional and has three channels for surface normal prediction, and one for depth prediction and instance contour detection.

We use a patch discriminator as in [29] whose final output is a 6×6 feature map. There are three layers in our domain discriminator (D1 - D3), which takes as input the conv5 output. Leaky ReLU [32] with slope 0.2 and batch normalization come after the convolutional layers to stabilize the adversarial training process.

Layer     S    C       KS   St   P   D
Input     227  3       -    -    -   -
conv1     55   96      11   4    0   1
pool1     55   96      3    2    0   1
conv2     27   256     5    1    2   1
pool2     27   256     3    2    0   1
conv3     13   384     3    1    0   1
conv4     13   384     3    1    0   1
conv5     13   256     3    1    0   1
pool5     13   256     3    1    1   1
fc6       13   4096    6    1    5   2
fc7       13   4096    1    1    0   1
D1        13   256     3    2    0   1
D2        6    512     1    1    0   1
D3        6    1       1    1    0   1
Deconv8   27   64      3    2    0   1
Deconv9   55   64      3    2    0   1
Deconv10  112  64      5    2    0   1
Output    227  1 or 3  3    2    0   1

Table 5. AlexNet based architecture. S: spatial size of output; C: number of channels; KS: kernel size; St: stride; P: padding; D: dilation. Note: fc6, fc7 are fully convolutional layers as in FCN [44].
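The batch-norm absorption used during this conversion is a single affine fold. The sketch below shows it for a fully-connected layer in our own notation (not the authors' conversion script); the same per-output-channel scaling applies to a convolution.

```python
import numpy as np

def absorb_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((W @ x + b) - mean) / sqrt(var + eps) + beta
    into a single affine layer y = W' @ x + b'.  W has shape (out, in)."""
    scale = gamma / np.sqrt(var + eps)
    return W * scale[:, None], (b - mean) * scale + beta
```

After folding, the layer produces identical outputs to the original layer followed by batch normalization (in inference mode, using the running statistics).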
6.2. VGG16

Our VGG16-based network has three basic components: base network, task heads, and domain discriminator, as shown in Table 6. To save memory, unlike our AlexNet-based architecture, we do not have a bottleneck network.

Our base network takes a 224×224×3 image as input. The conv1_1 to conv5_3 layers are identical to VGG16 [64]. To obtain accurate pixel-level predictions for the three tasks, we use skip connections between the base network and the task heads (we do not do this for our AlexNet architecture, for fair comparison with prior feature learning work). We use (a → b) to denote a skip connection from the output of a to the input of b. The skip connections in our network are (conv2_2 → Deconv4), (conv3_3 → Deconv3), and (conv4_3 → Deconv2). Similar to our AlexNet architecture, we use a patch discriminator, leaky ReLU, and batch normalization in the three layers of the discriminator, which takes as input the conv5_3 output features.

Layer    S    C       KS   St   P   D
Input    224  3       -    -    -   -
conv1_1  224  64      3    1    1   1
conv1_2  224  64      3    1    1   1
pool1    112  64      2    2    0   1
conv2_1  112  128     3    1    1   1
conv2_2  112  128     3    1    1   1
pool2    56   128     2    2    0   1
conv3_1  56   256     3    1    1   1
conv3_2  56   256     3    1    1   1
conv3_3  56   256     3    1    1   1
pool3    28   256     2    2    0   1
conv4_1  28   512     3    1    1   1
conv4_2  28   512     3    1    1   1
conv4_3  28   512     3    1    1   1
pool4    14   512     2    2    0   1
conv5_1  14   512     3    1    1   1
conv5_2  14   512     3    1    1   1
conv5_3  14   512     3    1    1   1
D1       6    1024    4    2    0   1
D2       6    1024    1    1    0   0
D3       6    1024    1    1    0   0
Deconv1  14   512     4    2    0   1
Deconv2  28   256     4    2    0   1
Deconv3  56   128     4    2    0   1
Deconv4  112  64      4    2    0   1
output   224  1 or 3  4    2    0   1

Table 6. VGG16 based architecture. S: spatial size of output; C: number of channels; KS: kernel size; St: stride; P: padding; D: dilation.
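The S (output size) column in Tables 5 and 6 follows standard convolution arithmetic; the small helper below (ours, for sanity-checking, not part of the paper) makes the rule explicit.

```python
def conv_out_size(s_in, ks, stride, pad, dilation=1):
    # output spatial size with effective kernel dilation*(ks-1)+1
    return (s_in + 2 * pad - dilation * (ks - 1) - 1) // stride + 1
```

For example, conv1 in Table 5 gives conv_out_size(227, 11, 4, 0) = 55, the dilated fc6 gives conv_out_size(13, 6, 1, 5, 2) = 13, and D1 in Table 6 gives conv_out_size(14, 4, 2, 0) = 6.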
