Minimax Entropy for SSDA
Kuniaki Saito¹, Donghyun Kim¹, Stan Sclaroff¹, Trevor Darrell², and Kate Saenko¹
¹Boston University, ²University of California, Berkeley
{keisaito, donhk, sclaroff, saenko}@bu.edu, [email protected]
Abstract

Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target domain. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA. Our code is available at http://cs-people.bu.edu/keisaito/research/MME.html.

Figure 1: We address the task of semi-supervised domain adaptation. Top: Existing domain-classifier based methods align source and target distributions but can fail by generating ambiguous features near the task decision boundary. Bottom: Our method estimates a representative point of each class (prototype) and extracts discriminative features using a novel minimax entropy technique.

1. Introduction

Deep convolutional neural networks [16] have significantly improved image classification accuracy with the help of large quantities of labeled training data, but often generalize poorly to new domains. Recent unsupervised domain adaptation (UDA) methods [11, 19, 20, 28, 37] improve generalization on unlabeled target data by aligning distributions, but can fail to learn discriminative class boundaries on target domains (see Fig. 1). We show that in the Semi-Supervised Domain Adaptation (SSDA) setting, where a few target labels are available, such methods often do not improve performance relative to just training on labeled source and target examples, and can even make it worse.

We propose a novel approach for SSDA that overcomes the limitations of previous methods and significantly improves the accuracy of deep classifiers on novel domains with only a few labels per class. Our approach, which we call Minimax Entropy (MME), is based on optimizing a minimax loss on the conditional entropy of unlabeled data, as well as the task loss; this reduces the distribution gap while learning discriminative features for the task.

We exploit a cosine similarity-based classifier architecture recently proposed for few-shot learning [12, 5]. The classifier (top layer) predicts a K-way class probability vector by computing cosine similarity between K class-specific weight vectors and the output of a feature extractor (lower layers), followed by a softmax. Each class weight vector is an estimated "prototype" that can be regarded as a representative point of that class. While this approach outperformed more advanced methods in few-shot learning and we confirmed its effectiveness in our setting, as we show below it is still quite limited. In particular, it does not leverage unlabeled data in the target domain.

Our key idea is to minimize the distance between the class prototypes and neighboring unlabeled target samples, thereby extracting discriminative features. The problem is how to estimate domain-invariant prototypes without many labeled target examples.
Figure 2: Top: baseline few-shot learning method, which estimates class prototypes by weight vectors, yet does not consider unlabeled data. Bottom: our model extracts discriminative and domain-invariant features using unlabeled data through domain-invariant prototype estimation. Step 1: we update the estimated prototypes in the classifier to maximize the entropy on the unlabeled target domain. Step 2: we minimize the entropy with respect to the feature extractor to cluster features around the estimated prototypes.
Figure 3: An overview of the model architecture and MME. The inputs to the network are labeled source examples (y = label), a few labeled target examples, and unlabeled target examples. Our model consists of the feature extractor F and the classifier C, which has weight vectors W and temperature T. W is trained to maximize entropy on unlabeled target examples (Step 1 in Fig. 2), whereas F is trained to minimize it (Step 2 in Fig. 2). To achieve adversarial learning, the sign of the gradient of the entropy loss on unlabeled target examples is flipped by a gradient reversal layer [11, 37].
Model-ensemble [17] and adversarial approaches [22] have boosted performance in semi-supervised learning, but do not address domain shift. Conditional entropy minimization (CEM) is a widely used method in SSL [13, 10]. However, we found that CEM fails to improve performance when there is a large domain gap between the source and target domains (see the experimental section). MME can be regarded as a variant of entropy minimization which overcomes the limitation of CEM in domain adaptation.

Few-shot learning (FSL). Few-shot learning [35, 39, 26] aims to learn novel classes given a few labeled examples and labeled "base" classes. SSDA and FSL make different assumptions: FSL does not use unlabeled examples and aims to acquire knowledge of novel classes, while SSDA aims to adapt to the same classes in a new domain. However, both tasks aim to extract discriminative features given a few labeled examples from a novel domain or novel classes. We employ a network with ℓ2 normalization on features before the last linear layer and a temperature parameter T, which was proposed for face verification [25] and applied to few-shot learning [12, 5]. Generally, classification of a feature vector with a large norm results in confident output, so to make the output more confident, networks can try to increase the norm of features. However, this does not necessarily increase the between-class variance, because increasing the norm does not change the direction of the vectors. ℓ2-normalized feature vectors solve this issue: to make the output more confident, the network instead focuses on making the directions of features from the same class closer to each other while separating different classes. This simple architecture was shown to be very effective for few-shot learning [5], and we build our method on it in this work.

3. Minimax Entropy Domain Adaptation

In semi-supervised domain adaptation, we are given source images and the corresponding labels in the source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{m_s}. In the target domain, we are also given a limited number of labeled target images D_t = {(x_i^t, y_i^t)}_{i=1}^{m_t}, as well as unlabeled target images D_u = {x_i^u}_{i=1}^{m_u}. Our goal is to train the model on D_s, D_t, and D_u and evaluate on D_u.

3.1. Similarity-based Network Architecture

Inspired by [5], our base model consists of a feature extractor F and a classifier C. For the feature extractor F, we employ a deep convolutional neural network and perform ℓ2 normalization on the output of the network. The normalized feature vector is then used as input to C, which consists of weight vectors W = [w_1, w_2, ..., w_K], where K is the number of classes, and a temperature parameter T. C takes F(x)/||F(x)|| as input and outputs (1/T) · W^T F(x)/||F(x)||. The output of C is fed into a softmax layer to obtain the probabilistic output p ∈ R^K. We denote p(x) = σ((1/T) · W^T F(x)/||F(x)||), where σ indicates the softmax function. In order to classify examples correctly, the direction of a weight vector has to be representative of the normalized features of the corresponding class. In this respect, the weight vectors can be regarded as estimated prototypes for each class. The architecture of our method is shown in Fig. 3.
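To make this concrete, the following is a minimal PyTorch sketch of such a similarity-based classifier (the paper's experiments are implemented in PyTorch). The module name, weight initialization, and default temperature below are our own illustrative assumptions rather than the authors' released code; the forward pass follows p(x) = σ((1/T) · W^T F(x)/||F(x)||).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityClassifier(nn.Module):
    """Classifier C: temperature-scaled similarity between the l2-normalized
    feature F(x)/||F(x)|| and K class weight vectors W = [w_1, ..., w_K]."""

    def __init__(self, feat_dim, num_classes, temperature=0.05):
        super().__init__()
        # Each row of `weight` is one class weight vector w_i (estimated prototype).
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, feat):
        feat = F.normalize(feat, dim=1)                       # F(x) / ||F(x)||
        logits = feat @ self.weight.t() / self.temperature    # (1/T) * W^T f
        return logits                                         # softmax applied by the loss

A softmax over these logits gives the probabilistic output p(x); during training the softmax is typically folded into the cross-entropy loss.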
3.2. Training Objectives

We estimate domain-invariant prototypes by performing entropy maximization with respect to the estimated prototypes. Then, we extract discriminative features by performing entropy minimization with respect to the feature extractor. Entropy maximization prevents overfitting that can reduce the expressive power of the representations; it can therefore be considered as the step of selecting prototypes that will not cause overfitting to the source examples. In our method, the prototypes are parameterized by the weight vectors of the last linear layer. First, we train F and C to classify labeled source and target examples correctly and utilize an entropy minimization objective to extract discriminative features for the target domain. We use a standard cross-entropy loss to train F and C for classification:

    L = E_{(x,y) ∈ D_s, D_t} L_ce(p(x), y).    (1)

With this classification loss, we ensure that the feature extractor generates discriminative features with respect to the source and the few labeled target examples. However, the model is trained on the source domain and a small fraction of target examples for classification, so it does not learn discriminative features for the entire target domain. Therefore, we propose minimax entropy training using unlabeled target examples.

A conceptual overview of our proposed adversarial learning is illustrated in Fig. 2. We assume that there exists a single domain-invariant prototype for each class, which can be a representative point for both domains. The estimated prototype will be near the source distribution because source labels are dominant. We then propose to estimate the position of the prototype by moving each w_i toward target features using unlabeled data in the target domain. To achieve this, we increase the entropy measured by the similarity between W and unlabeled target features. The entropy is calculated as follows,

    H = −E_{(x,y) ∈ D_u} Σ_{i=1}^{K} p(y = i | x) log p(y = i | x),    (2)

where K is the number of classes and p(y = i | x) represents the probability of prediction for class i, namely the i-th dimension of p(x) = σ((1/T) · W^T F(x)/||F(x)||). To have higher entropy, that is, a more uniform output probability, each w_i should be similar to all target features. Thus, increasing the entropy encourages the model to estimate domain-invariant prototypes, as shown in Fig. 2.

To obtain discriminative features on unlabeled target examples, we need to cluster unlabeled target features around the estimated prototypes. We propose to decrease the entropy on unlabeled target examples through the feature extractor F. The features have to be assigned to one of the prototypes to decrease the entropy, resulting in the desired discriminative features. Repeating this prototype estimation (entropy maximization) and entropy minimization process yields discriminative features.

To summarize, our method can be formulated as adversarial learning between C and F. The task classifier C is trained to maximize the entropy, whereas the feature extractor F is trained to minimize it. Both C and F are also trained to classify labeled examples correctly. The overall adversarial learning objectives are:

    θ̂_F = argmin_{θ_F} L + λH,
    θ̂_C = argmin_{θ_C} L − λH,    (3)

where λ is a hyper-parameter that controls the trade-off between minimax entropy training and classification on labeled examples. Our method can be formulated as iterative minimax training. To simplify the training process, we use a gradient reversal layer [11] to flip the gradient between C and F with respect to H. With this layer, we can perform the minimax training with a single forward and backward pass, as illustrated in Fig. 3.

3.3. Theoretical Insights

As shown in [2], we can measure domain divergence using a domain classifier. Let h ∈ H be a hypothesis, and let ε_s(h) and ε_t(h) be the expected risk on the source and target domains respectively; then ε_t(h) ≤ ε_s(h) + d_H(p, q) + C_0, where C_0 is a constant accounting for the complexity of the hypothesis space and the risk of an ideal hypothesis for both domains, and d_H(p, q) is the H-divergence between p and q,

    d_H(p, q) ≜ 2 sup_{h ∈ H} | Pr_{x^s ∼ p}[h(f^s) = 1] − Pr_{x^t ∼ q}[h(f^t) = 1] |,    (4)

where f^s and f^t denote the features in the source and target domain respectively. In our case the features are outputs of the feature extractor. The H-divergence relies on the capacity of the hypothesis space H to distinguish the distributions p and q. This theory states that the divergence between domains can be measured by training a domain classifier, and that features with low divergence are the key to obtaining a well-performing task-specific classifier. Inspired by this, many methods [11, 3, 37, 36] train a domain classifier to discriminate between domains while also optimizing the feature extractor to minimize the divergence.

Our proposed method is also connected to Eq. 4. Although we do not have a domain classifier or a domain classification loss, our method can be considered as minimizing domain divergence through minimax training on unlabeled target examples. We choose h to be a classifier that decides the binary domain label of a feature by the value of the entropy, namely,

    h(f) = 1 if H(C(f)) ≥ γ, and 0 otherwise,    (5)

where C denotes our classifier, H denotes entropy, and γ is a threshold to determine the domain label. Here,
we assume C outputs the probability of the class prediction for simplicity. Eq. 4 can be rewritten as follows,

    d_H(p, q) ≜ 2 sup_{h ∈ H} | Pr_{f^s ∼ p}[h(f^s) = 1] − Pr_{f^t ∼ q}[h(f^t) = 1] |
              = 2 sup_{C ∈ C} | Pr_{f^s ∼ p}[H(C(f^s)) ≥ γ] − Pr_{f^t ∼ q}[H(C(f^t)) ≥ γ] |
              ≤ 2 sup_{C ∈ C} Pr_{f^t ∼ q}[H(C(f^t)) ≥ γ].

In the last inequality, we assume that Pr_{f^s ∼ p}[H(C(f^s)) ≥ γ] ≤ Pr_{f^t ∼ q}[H(C(f^t)) ≥ γ]. This assumption should be realistic because we have access to many labeled source examples and train the entire network to minimize the classification loss; minimizing the cross-entropy loss (Eq. 1) on source examples ensures that the entropy on a source example is very small. Intuitively, this inequality states that the divergence can be bounded by the ratio of target examples having entropy greater than γ. Therefore, we can obtain the upper bound by finding the C that achieves maximum entropy for all target features. Our objective is to find features that achieve the lowest divergence. Supposing there exists a C that achieves the maximum in the inequality above, the objective can be rewritten as

    min_{f^t} max_{C ∈ C} Pr_{f^t ∼ q}[H(C(f^t)) ≥ γ].    (6)

4. Experiments

Datasets. Our main experiments use the DomainNet benchmark. Since some domains and classes are very noisy, we pick 4 domains (Real, Clipart, Painting, Sketch) and 126 classes. We focus on the adaptation scenarios where the target domain is not real images, and construct 7 scenarios from the four domains. See our supplemental material for more details. Office-Home [38] contains 4 domains (Real, Clipart, Art, Product) with 65 classes. This dataset is one of the benchmark datasets for unsupervised domain adaptation. We evaluate our method on 12 scenarios in total. Office [27] contains 3 domains (Amazon, Webcam, DSLR) with 31 classes. Webcam and DSLR are small domains and some classes do not have many examples, while Amazon has many examples. To evaluate on a domain with enough examples, we use 2 scenarios where Amazon is the target domain and DSLR and Webcam are the source domains.

Implementation Details. All experiments are implemented in PyTorch [23]. We employ AlexNet [16] and VGG16 [34] pre-trained on ImageNet. To investigate the effect of deeper architectures, we use ResNet34 [14] in the experiments on DomainNet. We remove the last linear layer of these networks to build F, and add a K-way linear classification layer C with a randomly initialized weight matrix W. The value of the temperature T is set to 0.05 in all settings, following the results of [25]. In every iteration, we prepare two mini-batches, one consisting of labeled examples and the other of unlabeled target examples.
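Putting Eqs. 1-3 together, the following is a minimal PyTorch sketch of one training iteration using the two mini-batches described above and a gradient reversal layer between F and C. The names (GradReverse, mme_step, lambda_weight) and the optimizer handling are our own illustrative assumptions, not the released implementation; the classifier can be the similarity-based module sketched in Sec. 3.1.

import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the
    backward pass, so layers before it receive a sign-flipped gradient."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

def mme_step(feat_net, clf, labeled_x, labeled_y, unlabeled_x, optimizer, lambda_weight=0.1):
    """One MME iteration: cross-entropy on the labeled mini-batch (Eq. 1) plus
    adversarial entropy on the unlabeled target mini-batch (Eqs. 2-3)."""
    optimizer.zero_grad()

    # Classification loss L on labeled source and target examples.
    logits = clf(feat_net(labeled_x))
    cls_loss = F.cross_entropy(logits, labeled_y)
    cls_loss.backward()

    # Entropy H on unlabeled target examples. The reversal layer sits between
    # feat_net and clf, so minimizing -lambda*H drives clf to maximize H
    # (moving prototypes toward target features), while feat_net, which receives
    # the flipped gradient, minimizes H (clustering features around prototypes).
    feat_u = grad_reverse(feat_net(unlabeled_x))
    prob_u = F.softmax(clf(feat_u), dim=1)
    neg_entropy = (prob_u * torch.log(prob_u + 1e-8)).sum(dim=1).mean()  # = -H
    (lambda_weight * neg_entropy).backward()

    optimizer.step()
    return cls_loss.item()

A single forward and backward pass over the unlabeled batch thus realizes both sides of the minimax game in Eq. 3, which is the role the gradient reversal layer plays in Fig. 3.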
… reveals how much gain will be obtained compared to the existing domain alignment-based methods. ENT [13] is a model trained with labeled source and target data and unlabeled target data using standard entropy minimization: entropy is calculated on unlabeled target examples and the entire network is trained to minimize it. The difference from MME is that ENT does not have a maximization process, so comparison with this baseline clarifies its importance. Note that all methods except for CDAN are trained with exactly the same architecture used in our method; in the case of CDAN, we could not find any advantage of using our architecture. The details of the baseline implementations are in our supplemental material.

Table 1: Accuracy on the DomainNet dataset (%) for one-shot and three-shot settings on 4 domains, R: Real, C: Clipart, P: Painting, S: Sketch. Our MME method outperformed the other baselines for all adaptation scenarios and for all three networks, except for a single case where it performs similarly to ENT.

Table 2: Results on the Office-Home and Office datasets (%). Each value is the accuracy averaged over all adaptation scenarios. Performance on each setting is summarized in the supplementary material.

Net      Method   Office-Home         Office
                  1-shot   3-shot     1-shot   3-shot
AlexNet  S+T      44.1     50.0       50.2     61.8
         DANN     45.1     50.3       55.8     64.8
         ADR      44.5     49.5       50.6     61.3
         CDAN     41.2     46.2       49.4     60.8
         ENT      38.8     50.9       48.1     65.1
         MME      49.2     55.2       56.5     67.6
VGG      S+T      57.4     62.9       68.7     73.3
         DANN     60.0     63.9       69.8     75.0
         ADR      57.4     63.0       69.4     73.7
         CDAN     55.8     61.8       65.9     72.9
         ENT      51.6     64.8       70.6     75.3
         MME      62.7     67.6       73.4     77.0

4.2. Results

Overview. The main results on the DomainNet dataset are shown in Table 1. First, our method outperformed the other baselines for all adaptation scenarios and all three networks except for one case. On average, our method outperformed S+T by 9.5% and 8.9% with ResNet in the one-shot and three-shot settings respectively. The results on Office-Home and Office are summarized in Table 2, where MME also outperforms all baselines. Due to limited space, we show the results averaged over all adaptation scenarios.

Comparison with UDA Methods. Generally, the baseline UDA methods need strong base networks such as VGG or ResNet to perform better than S+T. Interestingly, these methods cannot improve the performance in some cases. The superiority of MME over existing UDA methods is supported by Tables 1 and 2. Since CDAN uses entropy minimization and ENT significantly hurts the performance for AlexNet and VGG, CDAN does not consistently improve the performance for AlexNet and VGG.

Comparison with Entropy Minimization. ENT does not improve performance in some cases because it does not account for the domain gap. Comparing results on one-shot and three-shot, entropy minimization gains performance
with the help of labeled examples: as we have more labeled target examples, the estimation of prototypes becomes more accurate even without any adaptation. In the case of ResNet, entropy minimization often improves accuracy. There are two potential reasons. First, ResNet pre-trained on ImageNet has a more discriminative representation than the other networks; therefore, given a few labeled target examples, the model can extract more discriminative features, which contributes to the performance gain of entropy minimization. Second, ResNet has batch-normalization (BN) layers [15], and it is reported that BN has the effect of aligning feature distributions [4, 18]. Hence, entropy minimization was performed on aligned feature representations, which improved the performance. When there is a large domain gap, such as C to S, S to P, and R to S in Table 1, BN is not enough to handle the domain gap, and our proposed method performs much better than entropy minimization in such cases. We show an analysis of BN in our supplemental material, revealing its effectiveness for entropy minimization.

Table 3: Results on the DomainNet dataset in the unsupervised domain adaptation setting (%).

Method   R-C    R-P    P-C    C-S    S-P    R-S    P-R    Avg
Source   41.1   42.6   37.4   30.6   30.0   26.3   52.3   37.2
DANN     44.7   36.1   35.8   33.8   35.9   27.6   49.3   37.6
ADR      40.2   40.1   36.7   29.9   30.6   25.9   51.5   36.4
CDAN     44.2   39.1   37.8   26.2   24.8   24.3   54.6   35.9
ENT      33.8   43.0   23.0   22.9   13.9   12.0   51.2   28.5
MME      47.6   44.7   39.9   34.0   33.0   29.0   53.5   40.2

Figure 4: Accuracy vs. the number of labeled target examples, for (a) AlexNet and (b) VGG. The ENT method needs more labeled examples to obtain performance similar to our method.

Table 4: Comparison of classifier architectures on the DomainNet dataset using AlexNet, showing the effectiveness of the architecture proposed in [5, 25].

Method                    R to C              R to S
                          1-shot   3-shot     1-shot   3-shot
S+T (Standard Linear)     41.4     44.3       26.5     28.7
S+T (Few-shot [5, 25])    43.3     47.1       29.1     33.3
MME (Standard Linear)     44.9     47.7       30.0     32.2
MME (Few-shot [5, 25])    48.9     55.6       33.3     37.9

4.3. Analysis

Varying Number of Labeled Examples. First, we show the results in the unsupervised domain adaptation setting in Table 3. Our method performed better than the other methods on average. In addition, only our method improved performance compared to the source-only model in all settings. Furthermore, we observe the behavior of our method when the number of labeled examples in the target domain varies from 0 to 20 per class, which corresponds to 2520 labeled examples in total. The results are shown in Fig. 4. Our method works much better than S+T given a few labeled examples. On the other hand, ENT needs 5 labeled examples per class to improve performance. As we add more labeled examples, the performance gap between ENT and ours is reduced. This result is quite reasonable, because prototype estimation becomes more accurate without any adaptation as we obtain more labeled target examples.

Effect of Classifier Architecture. We present an ablation study of the classifier network architecture proposed in [5, 25], with AlexNet on DomainNet. As shown in Fig. 3, we employ ℓ2 normalization and temperature scaling. In this experiment, we compare this architecture with a model having a standard linear layer without ℓ2 normalization and temperature. The result is shown in Table 4. By using the network architecture proposed in [5, 25], we can improve the performance of both our method and the baseline S+T model (the model trained only on source examples and a few labeled target examples). Therefore, we can argue that this network architecture is an effective technique for improving performance when we are given only a few labeled examples from the target domain.

Feature Visualization. In addition, we plot the learned features with t-SNE [21] in Fig. 5. We employ the Real to Clipart scenario of DomainNet using AlexNet as the pre-trained backbone. Fig. 5 (a-d) visualizes the target features and estimated prototypes. The color of each cross represents its class; black points are the prototypes. With our method, the target features are clustered around their prototypes and do not have a large variance within a class. We visualize features of the source domain (red crosses) and target domain (blue crosses) in Fig. 5 (e-h). As we discussed in the method section, our method aims to minimize domain divergence. Indeed, target features are well aligned with source features with our method. Judging from Fig. 5f, entropy minimization (ENT) also tries to extract discriminative features, but it fails to find domain-invariant prototypes.

Quantitative Feature Analysis. We also quantitatively investigate the characteristics of the features we obtain, using the same adaptation scenario. First, we analyze the eigenvalues of the covariance matrix of the target features, following the analysis done in [9]. Eigenvectors represent the components of the features and eigenvalues represent their contributions. If the features are highly discriminative, the eigenvalues should decrease quickly, since only a small number of components is needed to represent the features (see Fig. 6a).
Figure 5: Feature visualization with t-SNE. (a-d) We plot the class prototypes (black circles) and target-domain features (crosses) for (a) our method, (b) ENT, (c) DANN, and (d) S+T; the color of a cross represents its class. Features learned by our method are more discriminative than those of the other methods. (e-h) Red: features of the source domain. Blue: features of the target domain. Our method's features are well aligned between domains compared to the other methods.

Figure 6: (a) Eigenvalues of the covariance matrix of the features on the target domain. The eigenvalues decrease quickly for our method, which shows that the features are more discriminative than those of other methods. (b) Our method achieves lower entropy than all baselines except ENT. (c) Our method clearly reduces domain divergence compared to S+T.
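As a rough illustration of this quantitative analysis (and of Fig. 6), the sketch below computes the eigenvalue spectrum of the target-feature covariance matrix and the fraction of target examples whose prediction entropy exceeds a threshold γ, i.e. the divergence proxy suggested by Eq. 5. It assumes the target features and softmax outputs have already been collected into arrays; the function names are ours.

import numpy as np

def covariance_spectrum(features):
    """features: (num_examples, feat_dim) array of target-domain features.
    Returns the eigenvalues of their covariance matrix, largest first; a fast
    decay means a few components explain most of the variance."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(features) - 1, 1)
    return np.linalg.eigvalsh(cov)[::-1]   # eigvalsh: symmetric input, ascending order

def high_entropy_ratio(probs, gamma):
    """probs: (num_examples, num_classes) softmax outputs on target examples.
    Fraction of examples with entropy >= gamma, i.e. h(f) = 1 in Eq. 5."""
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=1)
    return float((entropy >= gamma).mean())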
Table 5: Results on Office-Home. Our method performs better than baselines in most settings.
Table 6: Results on Office. Our method outperformed the other baselines in all settings.

Network   Method   W to A              D to A
                   1-shot   3-shot     1-shot   3-shot
AlexNet   S+T      50.4     61.2       50.0     62.4
          DANN     57.0     64.4       54.5     65.2
          ADR      50.2     61.2       50.9     61.4
          CDAN     50.4     60.3       48.5     61.4
          ENT      50.7     64.0       50.0     66.2
          MME      57.2     67.3       55.8     67.8
VGG       S+T      69.2     73.2       68.2     73.3
          DANN     69.3     75.4       70.4     74.6
          ADR      69.7     73.3       69.2     74.1
          CDAN     65.9     74.4       64.4     71.4
          ENT      69.1     75.4       72.1     75.1
          MME      73.1     76.3       73.6     77.6

Table 8: Ablation study of batch normalization. The performance of the ENT method highly depends on the choice of BN, while our method shows consistent behavior.

Method   Joint BN   Separate BN
ENT      63.6       68.9
MME      69.5       69.6

In our training procedure, BN statistics are computed separately for the unlabeled target examples and the labeled ones. Previous work [4, 18] has demonstrated that this operation can reduce the domain gap. We call this batch strategy "Separate BN". To analyze the effect of Separate BN, we compare it with "Joint BN", where we forward the unlabeled and labeled examples at once; BN statistics are then calculated jointly, and Joint BN will not help to reduce the domain gap. We compare our method with entropy minimization under both Separate BN and Joint BN. Entropy minimization with Joint BN performs much worse than with Separate BN, as shown in Table 8. This result shows that entropy minimization does not reduce the domain gap by itself. On the other hand, our method works well even in the case of Joint BN, because our training method is designed to reduce the domain gap.
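For clarity, the two batching strategies can be sketched as follows, assuming a PyTorch backbone `model` containing BatchNorm layers; the function names are ours, not the paper's code.

import torch

def forward_separate_bn(model, labeled_x, unlabeled_x):
    """Separate BN: two forward passes, so BatchNorm statistics are computed
    independently for the labeled and the unlabeled target mini-batch."""
    return model(labeled_x), model(unlabeled_x)

def forward_joint_bn(model, labeled_x, unlabeled_x):
    """Joint BN: one forward pass over the concatenated batch, so BatchNorm
    statistics are shared across labeled and unlabeled examples."""
    out = model(torch.cat([labeled_x, unlabeled_x], dim=0))
    return out[:labeled_x.size(0)], out[labeled_x.size(0):]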
Comparison with SSDA methods [33, 1]. Since there are no recently proposed SSDA methods using deep learning, we compared with state-of-the-art unsupervised DA methods modified for the SSDA task. We also compared our method with [33] and [1]. We implemented [33] and modified it for the SSDA task. To compare with [1], we follow their evaluation protocol and report our and their best accuracy (see Fig. 3 (c)(f) in [1]). As shown in Table 10, we outperform these methods by a significant margin.

Table 10: Comparison with [33, 1] (AlexNet).

Method        R to C (1-shot)   R to C (3-shot)      Method      D to A (1-shot)   W to A (1-shot)
DIRT-T [33]   45.2              48.0                 GDSDA [1]   51.5              48.3
MME           48.9              55.6                 MME         58.5              60.4

Results on Multiple Runs. We investigate the stability of our method and several baselines. Table 11 shows the accuracy averaged over three runs together with the standard deviation. The deviation is not large, so we can say that our method is stable.

Table 11: Results over three runs on DomainNet, Sketch to Painting adaptation scenario using ResNet.

Method   1-shot     3-shot
CDAN     62.9±1.5   65.3±0.1
ENT      59.5±1.5   63.6±1.3
MME      64.3±0.8   66.8±0.4

Results on Different Splits. We investigate the stability of our method with respect to the choice of labeled target examples. Table 9 shows results on different splits; sp0 corresponds to the split used in the experiments of our paper. For each split, we randomly picked the labeled training examples and validation examples. Our method consistently performs better than the other methods.

Table 9: Results on different training splits on DomainNet, Real to Clipart adaptation scenario using AlexNet.

Method   1-shot               3-shot
         sp0    sp1    sp2    sp0    sp1    sp2
S+T      43.3   43.8   43.8   47.1   45.9   48.8
DANN     43.3   44.0   45.4   46.1   43.1   45.3
ENT      37.0   32.9   38.2   45.5   45.4   47.8
MME      48.9   51.2   51.4   55.6   55.0   55.8

Comparison with VAT on DomainNet (%):

Method   R to C   R to P   P to C   C to P   C to S   S to P   R to S   P to R
S+T      47.1     45.0     44.9     35.9     36.4     38.4     33.3     58.7
VAT      46.1     43.8     44.3     35.8     35.6     38.2     31.8     57.7
MME      55.6     49.0     51.7     40.2     39.4     43.0     37.9     60.7