
Semi-supervised Domain Adaptation via Minimax Entropy

Kuniaki Saito1, Donghyun Kim1, Stan Sclaroff1, Trevor Darrell2 and Kate Saenko1
1 Boston University, 2 University of California, Berkeley
{keisaito, donhk, sclaroff, saenko}@bu.edu, [email protected]

arXiv:1904.06487v5 [cs.CV] 14 Sep 2019

Abstract

Contemporary domain adaptation methods are very effective at aligning feature distributions of source and target domains without any target supervision. However, we show that these techniques perform poorly when even a few labeled examples are available in the target domain. To address this semi-supervised domain adaptation (SSDA) setting, we propose a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model. Our base model consists of a feature encoding network, followed by a classification layer that computes the features' similarity to estimated prototypes (representatives of each class). Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrate the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA. Our code is available at http://cs-people.bu.edu/keisaito/research/MME.html.

Figure 1: We address the task of semi-supervised domain adaptation. Top: Existing domain-classifier based methods align source and target distributions but can fail by generating ambiguous features near the task decision boundary. Bottom: Our method estimates a representative point of each class (prototype) and extracts discriminative features using a novel minimax entropy technique.

1. Introduction

Deep convolutional neural networks [16] have significantly improved image classification accuracy with the help of large quantities of labeled training data, but often generalize poorly to new domains. Recent unsupervised domain adaptation (UDA) methods [11, 19, 20, 28, 37] improve generalization on unlabeled target data by aligning distributions, but can fail to learn discriminative class boundaries on target domains (see Fig. 1). We show that in the Semi-Supervised Domain Adaptation (SSDA) setting, where a few target labels are available, such methods often do not improve performance relative to just training on labeled source and target examples, and can even make it worse.

We propose a novel approach for SSDA that overcomes the limitations of previous methods and significantly improves the accuracy of deep classifiers on novel domains with only a few labels per class. Our approach, which we call Minimax Entropy (MME), is based on optimizing a minimax loss on the conditional entropy of unlabeled data, as well as the task loss; this reduces the distribution gap while learning discriminative features for the task.

We exploit a cosine similarity-based classifier architecture recently proposed for few-shot learning [12, 5]. The classifier (top layer) predicts a K-way class probability vector by computing cosine similarity between K class-specific weight vectors and the output of a feature extractor (lower layers), followed by a softmax. Each class weight vector is an estimated "prototype" that can be regarded as a representative point of that class. While this approach outperformed more advanced methods in few-shot learning and we confirmed its effectiveness in our setting, as we show below it is still quite limited. In particular, it does not leverage unlabeled data in the target domain.

Our key idea is to minimize the distance between the class prototypes and neighboring unlabeled target samples, thereby extracting discriminative features.
Figure 2: Top: baseline few-shot learning method, which estimates class prototypes by weight vectors, yet does not consider unlabeled data. Bottom: our model extracts discriminative and domain-invariant features using unlabeled data through domain-invariant prototype estimation. Step 1: we update the estimated prototypes in the classifier to maximize the entropy on the unlabeled target domain. Step 2: we minimize the entropy with respect to the feature extractor to cluster features around the estimated prototypes.

The problem is how to estimate domain-invariant prototypes without many labeled target examples. The prototypes are dominated by the source domain, as shown in the leftmost side of Fig. 2 (bottom), as the vast majority of labeled examples come from the source. To estimate domain-invariant prototypes, we move the weight vectors toward the target feature distribution. Entropy on target examples represents the similarity between the estimated prototypes and target features. A uniform output distribution with high entropy indicates that the examples are similar to all prototype weight vectors. Therefore, we move the weight vectors towards the target by maximizing the entropy of unlabeled target examples in the first adversarial step. Second, we update the feature extractor to minimize the entropy of the unlabeled examples, to make them better clustered around the prototypes. This process is formulated as a minimax game between the weight vectors and the feature extractor, applied over the unlabeled target examples.

Our method offers a new state of the art in performance on SSDA; as reported below, we reduce the error relative to baseline few-shot methods which ignore unlabeled data by 8.5%, relative to current best-performing alignment methods by 8.8%, and relative to a simple model jointly trained on source and target by 11.3% in one adaptation scenario.

Our contributions are summarized as follows:

• We highlight the limitations of state-of-the-art domain adaptation methods in the SSDA setting;
• We propose a novel adversarial method, Minimax Entropy (MME), designed for the SSDA task;
• We show our method's superiority to existing methods on benchmark datasets for domain adaptation.

2. Related Work

Domain Adaptation. Semi-supervised domain adaptation (SSDA) is a very important task [8, 40, 1]; however, it has not been fully explored, especially with regard to deep learning based methods. We revisit this task and compare our approach to recent semi-supervised learning and unsupervised domain adaptation methods. The main challenge in domain adaptation (DA) is the gap in feature distributions between domains, which degrades the source classifier's performance. Most recent work has focused on unsupervised domain adaptation (UDA) and, in particular, feature distribution alignment. The basic approach measures the distance between feature distributions in source and target, then trains a model to minimize this distance. Many UDA methods utilize a domain classifier to measure the distance [11, 37, 19, 20, 33]. The domain classifier is trained to discriminate whether input features come from the source or the target, whereas the feature extractor is trained to deceive the domain classifier in order to match feature distributions. UDA has been applied to various applications such as image classification [27], semantic segmentation [32], and object detection [6, 29]. Some methods minimize task-specific decision boundaries' disagreement on target examples [30, 28] to push target features far from decision boundaries. In this respect, they increase the between-class variance of target features; on the other hand, we propose to make target features well-clustered around estimated prototypes. Our MME approach can reduce within-class variance as well as increase between-class variance, which results in more discriminative features. Interestingly, we empirically observe that UDA methods [11, 20, 28] often fail to improve accuracy in SSDA.

Semi-supervised learning (SSL). Generative [7, 31], model-ensemble [17], and adversarial approaches [22] have boosted performance in semi-supervised learning, but do not address domain shift. Conditional entropy minimization (CEM) is a widely used method in SSL [13, 10]. However, we found that CEM fails to improve performance when there is a large domain gap between the source and target domains (see the experimental section). MME can be regarded as a variant of entropy minimization which overcomes the limitation of CEM in domain adaptation.
Figure 3: An overview of the model architecture and MME. The inputs to the network are labeled source examples (y = label), a few labeled target examples, and unlabeled target examples. Our model consists of the feature extractor F and the classifier C, which has weight vectors W and temperature T (in the diagram, L_ce denotes the cross-entropy loss and H the entropy on unlabeled target examples). W is trained to maximize entropy on the unlabeled target (Step 1 in Fig. 2) whereas F is trained to minimize it (Step 2 in Fig. 2). To achieve the adversarial learning, the sign of the gradients for the entropy loss on unlabeled target examples is flipped by a gradient reversal layer [11, 37].

Few-shot learning (FSL). Few-shot learning [35, 39, 26] aims to learn novel classes given a few labeled examples and labeled "base" classes. SSDA and FSL make different assumptions: FSL does not use unlabeled examples and aims to acquire knowledge of novel classes, while SSDA aims to adapt to the same classes in a new domain. However, both tasks aim to extract discriminative features given a few labeled examples from a novel domain or novel classes. We employ a network with $\ell_2$ normalization on features before the last linear layer and a temperature parameter T, which was proposed for face verification [25] and applied to few-shot learning [12, 5]. Generally, classification of a feature vector with a large norm results in confident output. To make the output more confident, networks can try to increase the norm of features. However, this does not necessarily increase the between-class variance, because increasing the norm does not change the direction of the vectors. $\ell_2$-normalized feature vectors can solve this issue: to make the output more confident, the network must focus on making the directions of features from the same class closer to each other while separating different classes. This simple architecture was shown to be very effective for few-shot learning [5] and we build our method on it in our work.

3. Minimax Entropy Domain Adaptation

In semi-supervised domain adaptation, we are given source images and the corresponding labels in the source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{m_s}$. In the target domain, we are also given a limited number of labeled target images $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{m_t}$, as well as unlabeled target images $\mathcal{D}_u = \{x_i^u\}_{i=1}^{m_u}$. Our goal is to train the model on $\mathcal{D}_s$, $\mathcal{D}_t$, and $\mathcal{D}_u$ and evaluate on $\mathcal{D}_u$.

3.1. Similarity based Network Architecture

Inspired by [5], our base model consists of a feature extractor F and a classifier C. For the feature extractor F, we employ a deep convolutional neural network and perform $\ell_2$ normalization on the output of the network. Then, the normalized feature vector is used as an input to C, which consists of weight vectors $W = [w_1, w_2, \dots, w_K]$, where K represents the number of classes, and a temperature parameter T. C takes $\frac{F(x)}{\|F(x)\|}$ as an input and outputs $\frac{1}{T} \frac{W^T F(x)}{\|F(x)\|}$. The output of C is fed into a softmax layer to obtain the probabilistic output $p \in \mathbb{R}^K$. We denote $p(x) = \sigma\big(\frac{1}{T} \frac{W^T F(x)}{\|F(x)\|}\big)$, where $\sigma$ indicates a softmax function. In order to classify examples correctly, the direction of a weight vector has to be representative of the normalized features of the corresponding class. In this respect, the weight vectors can be regarded as estimated prototypes for each class. The architecture of our method is shown in Fig. 3.
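To make the computation concrete, a minimal PyTorch sketch of such a similarity-based classifier head is given below. This is our own illustration of the head just described, not the released implementation; the module name CosineClassifier and the feature dimension inc are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class CosineClassifier(nn.Module):
        """Classifier C: temperature-scaled similarity to K prototype vectors.

        Illustrative sketch; the feature dimension `inc` is a placeholder
        (e.g., 4096 for an AlexNet/VGG fc7 feature), not a value from the paper.
        """

        def __init__(self, num_classes: int, inc: int = 4096, temp: float = 0.05):
            super().__init__()
            # W = [w_1, ..., w_K]; each row acts as an estimated class prototype.
            self.weight = nn.Parameter(torch.randn(num_classes, inc) * 0.01)
            self.temp = temp

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            # l2-normalize features so that only their direction matters.
            feat = F.normalize(feat, dim=1)
            # (1/T) * W^T f / ||f||; softmax is applied later by the loss
            # (cross-entropy) or by the entropy term.
            return feat @ self.weight.t() / self.temp

The logits are fed to a standard cross-entropy loss for labeled batches, or converted with a softmax when computing the entropy H on unlabeled target batches.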
3.2. Training Objectives

We estimate domain-invariant prototypes by performing entropy maximization with respect to the estimated prototypes. Then, we extract discriminative features by performing entropy minimization with respect to the feature extractor. Entropy maximization prevents the overfitting that can reduce the expressive power of the representations. Therefore, entropy maximization can be considered as the step of selecting prototypes that will not cause overfitting to the source examples. In our method, the prototypes are parameterized by the weight vectors of the last linear layer. First, we train F and C to classify labeled source and target examples correctly and utilize an entropy minimization objective to extract discriminative features for the target domain. We use a standard cross-entropy loss to train F and C for classification:

$$\mathcal{L} = \mathbb{E}_{(x,y) \in \mathcal{D}_s, \mathcal{D}_t}\, \mathcal{L}_{ce}(p(x), y). \qquad (1)$$

With this classification loss, we ensure that the feature extractor generates discriminative features with respect to the source and the few labeled target examples. However, the model is trained on the source domain and only a small fraction of target examples for classification. This does not learn discriminative features for the entire target domain. Therefore, we propose minimax entropy training using unlabeled target examples.

A conceptual overview of our proposed adversarial learning is illustrated in Fig. 2. We assume that there exists a single domain-invariant prototype for each class, which can be a representative point for both domains. The estimated prototype will be near the source distribution because source labels are dominant. Then, we propose to estimate the position of the prototype by moving each $w_i$ toward target features using unlabeled data in the target domain. To achieve this, we increase the entropy measured by the similarity between W and unlabeled target features. Entropy is calculated as follows:

$$H = -\mathbb{E}_{x \in \mathcal{D}_u} \sum_{i=1}^{K} p(y=i|x) \log p(y=i|x) \qquad (2)$$

where K is the number of classes and $p(y=i|x)$ represents the probability of predicting class i, namely the i-th dimension of $p(x) = \sigma\big(\frac{1}{T} \frac{W^T F(x)}{\|F(x)\|}\big)$. To have higher entropy, that is, a more uniform output probability, each $w_i$ should be similar to all target features. Thus, increasing the entropy encourages the model to estimate domain-invariant prototypes, as shown in Fig. 2.

To obtain discriminative features on unlabeled target examples, we need to cluster the unlabeled target features around the estimated prototypes. We propose to decrease the entropy on unlabeled target examples with the feature extractor F. The features should be assigned to one of the prototypes to decrease the entropy, resulting in the desired discriminative features. Repeating this prototype estimation (entropy maximization) and entropy minimization process yields discriminative features.

To summarize, our method can be formulated as adversarial learning between C and F. The task classifier C is trained to maximize the entropy, whereas the feature extractor F is trained to minimize it. Both C and F are also trained to classify labeled examples correctly. The overall adversarial learning objective functions are:

$$\hat{\theta}_F = \operatorname*{argmin}_{\theta_F}\; \mathcal{L} + \lambda H, \qquad \hat{\theta}_C = \operatorname*{argmin}_{\theta_C}\; \mathcal{L} - \lambda H \qquad (3)$$

where $\lambda$ is a hyper-parameter that controls the trade-off between minimax entropy training and classification on labeled examples. Our method can thus be formulated as iterative minimax training. To simplify the training process, we use a gradient reversal layer [11] to flip the gradient between C and F with respect to H. With this layer, we can perform the minimax training with one forward and one backward pass, which is illustrated in Fig. 3.
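The two ingredients of this objective, the entropy H of Eq. 2 and the gradient reversal layer, can be sketched in a few lines of PyTorch. This is an illustrative reimplementation (the reversal layer follows the standard construction of [11]; the helper names are ours, not the released code):

    import torch
    import torch.nn.functional as F


    class GradReverse(torch.autograd.Function):
        """Identity on the forward pass; multiplies the gradient by -lambd
        on the backward pass (gradient reversal layer of [11])."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None


    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)


    def entropy(logits: torch.Tensor) -> torch.Tensor:
        """H of Eq. 2: mean entropy of the softmax predictions."""
        p = F.softmax(logits, dim=1)
        return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

Because the reversal layer sits between F and C, a single backward pass through the reversed path gives C the true gradient of the (negated) entropy term while F receives the sign-flipped gradient, so C is pushed to maximize H and F to minimize it, exactly as prescribed by Eq. 3.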
3.3. Theoretical Insights

As shown in [2], we can measure domain divergence by using a domain classifier. Let $h \in \mathcal{H}$ be a hypothesis, and let $\epsilon_s(h)$ and $\epsilon_t(h)$ be the expected risk on source and target respectively; then $\epsilon_t(h) \le \epsilon_s(h) + d_{\mathcal{H}}(p, q) + C_0$, where $C_0$ is a constant for the complexity of the hypothesis space and the risk of an ideal hypothesis for both domains, and $d_{\mathcal{H}}(p, q)$ is the $\mathcal{H}$-divergence between p and q:

$$d_{\mathcal{H}}(p, q) \triangleq 2 \sup_{h \in \mathcal{H}} \Big|\, \Pr_{f^s \sim p}\!\left[h(f^s) = 1\right] - \Pr_{f^t \sim q}\!\left[h(f^t) = 1\right] \Big| \qquad (4)$$

where $f^s$ and $f^t$ denote the features in the source and target domain respectively. In our case, the features are outputs of the feature extractor. The $\mathcal{H}$-divergence relies on the capacity of the hypothesis space $\mathcal{H}$ to distinguish the distributions p and q. This theory states that the divergence between domains can be measured by training a domain classifier, and that features with low divergence are the key to having a well-performing task-specific classifier. Inspired by this, many methods [11, 3, 37, 36] train a domain classifier to discriminate different domains while also optimizing the feature extractor to minimize the divergence.

Our proposed method is also connected to Eq. 4. Although we do not have a domain classifier or a domain classification loss, our method can be considered as minimizing domain divergence through minimax training on unlabeled target examples. We choose h to be a classifier that decides a binary domain label of a feature by the value of the entropy, namely,

$$h(f) = \begin{cases} 1, & \text{if } H(C(f)) \ge \gamma, \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where C denotes our classifier, H denotes entropy, and $\gamma$ is a threshold that determines the domain label. Here, we assume C outputs the probability of the class prediction for simplicity. Eq. 4 can be rewritten as follows:

$$d_{\mathcal{H}}(p, q) \triangleq 2 \sup_{h \in \mathcal{H}} \Big|\, \Pr_{f^s \sim p}\!\left[h(f^s) = 1\right] - \Pr_{f^t \sim q}\!\left[h(f^t) = 1\right] \Big|$$
$$= 2 \sup_{C \in \mathcal{C}} \Big|\, \Pr_{f^s \sim p}\!\left[H(C(f^s)) \ge \gamma\right] - \Pr_{f^t \sim q}\!\left[H(C(f^t)) \ge \gamma\right] \Big|$$
$$\le 2 \sup_{C \in \mathcal{C}} \Pr_{f^t \sim q}\!\left[H(C(f^t)) \ge \gamma\right].$$

In the last inequality, we assume that $\Pr_{f^s \sim p}[H(C(f^s)) \ge \gamma] \le \Pr_{f^t \sim q}[H(C(f^t)) \ge \gamma]$. This assumption should be realistic because we have access to many labeled source examples and train the entire network to minimize the classification loss. Minimizing the cross-entropy loss (Eq. 1) on source examples ensures that the entropy on a source example is very small. Intuitively, this inequality states that the divergence can be bounded by the ratio of target examples having entropy greater than $\gamma$. Therefore, we can obtain the upper bound by finding the C that achieves maximum entropy for all target features. Our objective is to find features that achieve the lowest divergence. Supposing there exists a C that achieves the maximum in the inequality above, the objective can be rewritten as

$$\min_{f^t} \max_{C \in \mathcal{C}} \Pr_{f^t \sim q}\!\left[H(C(f^t)) \ge \gamma\right]. \qquad (6)$$

Finding the minimum with respect to $f^t$ is equivalent to finding a feature extractor F that achieves that minimum. Thus, we derive the minimax objective of our proposed learning method in Eq. 3. To sum up, our maximum entropy process can be regarded as measuring the divergence between domains, whereas our entropy minimization process can be regarded as minimizing the divergence. In our experimental section, we observe that our method actually reduces domain divergence (Fig. 6c). In addition, target features produced by our method look aligned with source features and are just as discriminative. These results come from the effect of the domain-divergence minimization.
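The quantity being bounded in Eq. 6 is simply the fraction of unlabeled target features whose prediction entropy exceeds γ, which makes it easy to monitor empirically. A minimal sketch of this estimate follows (our own illustration; γ, net_F, net_C, and the data loader are hypothetical names, not part of the method's released code):

    import torch

    @torch.no_grad()
    def entropy_above_gamma(net_F, net_C, target_loader, gamma: float) -> float:
        """Empirical Pr_{f^t ~ q}[H(C(f^t)) >= gamma], the quantity that
        upper-bounds the domain divergence in Eq. 6 (illustrative sketch)."""
        flags = []
        for x, _ in target_loader:
            p = torch.softmax(net_C(net_F(x)), dim=1)
            h = -(p * torch.log(p + 1e-8)).sum(dim=1)  # per-example entropy
            flags.append(h >= gamma)
        return torch.cat(flags).float().mean().item()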
scheduling.
4. Experiments
4.1. Setup

We randomly selected one or three labeled examples per class as the labeled training target examples (the one-shot and three-shot settings, respectively). We selected three other labeled examples as the validation set for the target domain. The validation examples are used for early stopping, choosing the hyper-parameter $\lambda$, and training scheduling. The other target examples are used for training without labels; their labels are only used to evaluate classification accuracy (%). All examples of the source are used for training.

Datasets. Most of our experiments are done on a subset of DomainNet [24], a recent benchmark dataset for large-scale domain adaptation that has many classes (345) and six domains. As the labels of some domains and classes are very noisy, we pick 4 domains (Real, Clipart, Painting, Sketch) and 126 classes. We focus on the adaptation scenarios where the target domain is not real images, and construct 7 scenarios from the four domains. See our supplemental material for more details. Office-Home [38] contains 4 domains (Real, Clipart, Art, Product) with 65 classes. This dataset is one of the benchmark datasets for unsupervised domain adaptation. We evaluated our method on 12 scenarios in total. Office [27] contains 3 domains (Amazon, Webcam, DSLR) with 31 classes. Webcam and DSLR are small domains and some classes do not have many examples, while Amazon has many examples. To evaluate on a domain with enough examples, we use 2 scenarios where we set Amazon as the target domain and DSLR or Webcam as the source domain.

Implementation Details. All experiments are implemented in PyTorch [23]. We employ AlexNet [16] and VGG16 [34] pre-trained on ImageNet. To investigate the effect of deeper architectures, we use ResNet34 [14] in the experiments on DomainNet. We remove the last linear layer of these networks to build F, and add a K-way linear classification layer C with a randomly initialized weight matrix W. The value of the temperature T is set to 0.05, following the results of [25], in all settings. Every iteration, we prepare two mini-batches, one consisting of labeled examples and the other of unlabeled target examples. Half of the labeled examples come from the source and half from the labeled target. Using the two mini-batches, we calculate the objective in Eq. 3. To implement the adversarial learning in Eq. 3, we use a gradient reversal layer [11, 37] to flip the gradient with respect to the entropy loss. The sign of the gradient is flipped between C and F during backpropagation. We adopt SGD with momentum of 0.9. In all experiments, we set the trade-off parameter $\lambda$ in Eq. 3 to 0.1. This is decided by the validation performance on the Real to Clipart experiments. We show the performance sensitivity to this parameter in our supplemental material, as well as more details including learning rate scheduling.
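For concreteness, one training iteration following the description above might look like the sketch below (two mini-batches, $\lambda = 0.1$, reusing the CosineClassifier, grad_reverse, and entropy helpers sketched earlier). The loop is our own illustration assembled from the paper's description; net_F, net_C, the optimizer, and the loaders are placeholder names, not the released training script.

    import torch.nn.functional as F

    lam = 0.1  # trade-off parameter lambda in Eq. 3

    # labeled_loader yields (images, labels) with half source, half labeled
    # target; unlabeled_loader yields unlabeled target images (placeholders).
    for (x_lab, y_lab), (x_unl, _) in zip(labeled_loader, unlabeled_loader):
        optimizer.zero_grad()

        # Eq. 1: cross-entropy on labeled source + labeled target examples.
        logits_lab = net_C(net_F(x_lab))
        loss_cls = F.cross_entropy(logits_lab, y_lab)
        loss_cls.backward()

        # Minimax entropy on unlabeled target (Eq. 3). The reversal layer
        # between F and C flips the gradient sign, so one backward pass
        # updates C to maximize H and F to minimize it.
        feat_unl = net_F(x_unl)
        logits_unl = net_C(grad_reverse(feat_unl))
        loss_ent = -lam * entropy(logits_unl)
        loss_ent.backward()

        optimizer.step()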
Net      Method   R to C      R to P      P to C      C to S      S to P      R to S      P to R      MEAN
AlexNet  S+T      43.3/47.1   42.4/45.0   40.1/44.9   33.6/36.4   35.7/38.4   29.1/33.3   55.8/58.7   40.0/43.4
         DANN     43.3/46.1   41.6/43.8   39.1/41.0   35.9/36.5   36.9/38.9   32.5/33.4   53.6/57.3   40.4/42.4
         ADR      43.1/46.2   41.4/44.4   39.3/43.6   32.8/36.4   33.1/38.9   29.1/32.4   55.9/57.3   39.2/42.7
         CDAN     46.3/46.8   45.7/45.0   38.3/42.3   27.5/29.5   30.2/33.7   28.8/31.3   56.7/58.7   39.1/41.0
         ENT      37.0/45.5   35.6/42.6   26.8/40.4   18.9/31.1   15.1/29.6   18.0/29.6   52.2/60.0   29.1/39.8
         MME      48.9/55.6   48.0/49.0   46.7/51.7   36.3/39.4   39.4/43.0   33.3/37.9   56.8/60.7   44.2/48.2
VGG      S+T      49.0/52.3   55.4/56.7   47.7/51.0   43.9/48.5   50.8/55.1   37.9/45.0   69.0/71.7   50.5/54.3
         DANN     43.9/56.8   42.0/57.5   37.3/49.2   46.7/48.2   51.9/55.6   30.2/45.6   65.8/70.1   45.4/54.7
         ADR      48.3/50.2   54.6/56.1   47.3/51.5   44.0/49.0   50.7/53.5   38.6/44.7   67.6/70.9   50.2/53.7
         CDAN     57.8/58.1   57.8/59.1   51.0/57.4   42.5/47.2   51.2/54.5   42.6/49.3   71.7/74.6   53.5/57.2
         ENT      39.6/50.3   43.9/54.6   26.4/47.4   27.0/41.9   29.1/51.0   19.3/39.7   68.2/72.5   36.2/51.1
         MME      60.6/64.1   63.3/63.5   57.0/60.7   50.9/55.4   60.5/60.9   50.2/54.8   72.2/75.3   59.2/62.1
ResNet   S+T      55.6/60.0   60.6/62.2   56.8/59.4   50.8/55.0   56.0/59.5   46.3/50.1   71.8/73.9   56.9/60.0
         DANN     58.2/59.8   61.4/62.8   56.3/59.6   52.8/55.4   57.4/59.9   52.2/54.9   70.3/72.2   58.4/60.7
         ADR      57.1/60.7   61.3/61.9   57.0/60.7   51.0/54.4   56.0/59.9   49.0/51.1   72.0/74.2   57.6/60.4
         CDAN     65.0/69.0   64.9/67.3   63.7/68.4   53.1/57.8   63.4/65.3   54.5/59.0   73.2/78.5   62.5/66.5
         ENT      65.2/71.0   65.9/69.2   65.4/71.1   54.6/60.0   59.7/62.1   52.1/61.1   75.0/78.6   62.6/67.6
         MME      70.0/72.2   67.7/69.7   69.0/71.7   56.3/61.8   64.8/66.8   61.0/61.9   76.1/78.5   66.4/68.9

Table 1: Accuracy on the DomainNet dataset (%) for the one-shot and three-shot settings (each cell reports 1-shot/3-shot) on 4 domains (R: Real, C: Clipart, P: Painting, S: Sketch). Our MME method outperformed the other baselines for all adaptation scenarios and all three networks, except for one case where it performs similarly to ENT.

Baselines. S+T [5, 25] is a model trained with the labeled source and labeled target examples, without using unlabeled target examples. DANN [11] employs a domain classifier to match feature distributions; this is one of the most popular methods in UDA. For fair comparison, we modify this method so that it is trained with the labeled source, labeled target, and unlabeled target examples. ADR [28] utilizes a task-specific decision boundary to align features and ensure that they are discriminative on the target. CDAN [20] is one of the state-of-the-art methods on UDA and performs domain alignment on features that are conditioned on the output of classifiers. In addition, it utilizes entropy minimization on target examples; CDAN thus integrates domain-classifier based alignment and entropy minimization. Comparison with these UDA methods (DANN, ADR, CDAN) reveals how much gain is obtained compared to existing domain alignment-based methods. ENT [13] is a model trained with labeled source and target and unlabeled target examples using standard entropy minimization: entropy is calculated on unlabeled target examples and the entire network is trained to minimize it. The difference from MME is that ENT does not have a maximization process, thus comparison with this baseline clarifies its importance. Note that all methods except CDAN are trained with exactly the same architecture used in our method; in the case of CDAN, we could not find any advantage of using our architecture. The details of the baseline implementations are in our supplemental material.

4.2. Results

Overview. The main results on the DomainNet dataset are shown in Table 1. First, our method outperformed the other baselines for all adaptation scenarios and all three networks except for one case. On average, our method outperformed S+T by 9.5% and 8.9% in the ResNet one-shot and three-shot settings respectively. The results on Office-Home and Office are summarized in Table 2, where MME also outperforms all baselines. Due to limited space, we show the results averaged over all adaptation scenarios.

Net      Method   Office-Home   Office
                  1-shot/3-shot 1-shot/3-shot
AlexNet  S+T      44.1/50.0     50.2/61.8
         DANN     45.1/50.3     55.8/64.8
         ADR      44.5/49.5     50.6/61.3
         CDAN     41.2/46.2     49.4/60.8
         ENT      38.8/50.9     48.1/65.1
         MME      49.2/55.2     56.5/67.6
VGG      S+T      57.4/62.9     68.7/73.3
         DANN     60.0/63.9     69.8/75.0
         ADR      57.4/63.0     69.4/73.7
         CDAN     55.8/61.8     65.9/72.9
         ENT      51.6/64.8     70.6/75.3
         MME      62.7/67.6     73.4/77.0

Table 2: Results on the Office-Home and Office datasets (%). The value is the accuracy averaged over all adaptation scenarios. Performance on each setting is summarized in the supplementary material.

Comparison with UDA Methods. Generally, the baseline UDA methods need strong base networks such as VGG or ResNet to perform better than S+T. Interestingly, these methods cannot improve the performance in some cases. The superiority of MME over existing UDA methods is supported by Tables 1 and 2. Since CDAN uses entropy minimization and ENT significantly hurts the performance for AlexNet and VGG, CDAN does not consistently improve the performance for AlexNet and VGG.

Comparison with Entropy Minimization. ENT does not improve performance in some cases because it does not account for the domain gap. Comparing the one-shot and three-shot results, entropy minimization gains performance with the help of labeled examples: as we have more labeled target examples, the estimation of prototypes becomes more accurate without any adaptation. In the case of ResNet, entropy minimization often improves accuracy. There are two potential reasons. First, ResNet pre-trained on ImageNet has a more discriminative representation than the other networks. Therefore, given a few labeled target examples, the model can extract more discriminative features, which contributes to the performance gain of entropy minimization. Second, ResNet has batch-normalization (BN) layers [15]. It is reported that BN has the effect of aligning feature distributions [4, 18]. Hence, entropy minimization is performed on aligned feature representations, which improves performance. When there is a large domain gap, such as C to S, S to P, and R to S in Table 1, BN is not enough to handle the domain gap, and our proposed method performs much better than entropy minimization in such cases. We show an analysis of BN in our supplemental material, revealing its effectiveness for entropy minimization.
Method   R-C    R-P    P-C    C-S    S-P    R-S    P-R    Avg
Source   41.1   42.6   37.4   30.6   30.0   26.3   52.3   37.2
DANN     44.7   36.1   35.8   33.8   35.9   27.6   49.3   37.6
ADR      40.2   40.1   36.7   29.9   30.6   25.9   51.5   36.4
CDAN     44.2   39.1   37.8   26.2   24.8   24.3   54.6   35.9
ENT      33.8   43.0   23.0   22.9   13.9   12.0   51.2   28.5
MME      47.6   44.7   39.9   34.0   33.0   29.0   53.5   40.2

Table 3: Results on the DomainNet dataset in the unsupervised domain adaptation setting (%).

Figure 4: Accuracy vs. the number of labeled target examples ((a) AlexNet, (b) VGG). The ENT method needs more labeled examples to obtain performance similar to our method.

Method                    R to C          R to S
                          1-shot/3-shot   1-shot/3-shot
S+T (Standard Linear)     41.4/44.3       26.5/28.7
S+T (Few-shot [5, 25])    43.3/47.1       29.1/33.3
MME (Standard Linear)     44.9/47.7       30.0/32.2
MME (Few-shot [5, 25])    48.9/55.6       33.3/37.9

Table 4: Comparison of classifier architectures on the DomainNet dataset using AlexNet, showing the effectiveness of the architecture proposed in [5, 25].

4.3. Analysis

Varying Number of Labeled Examples. First, we show the results in the unsupervised domain adaptation setting in Table 3. Our method performed better than the other methods on average. In addition, only our method improved performance compared to the source-only model in all settings. Furthermore, we observe the behavior of our method when the number of labeled examples in the target domain varies from 0 to 20 per class, which corresponds to 2520 labeled examples in total. The results are shown in Fig. 4. Our method works much better than S+T given a few labeled examples. On the other hand, ENT needs 5 labeled examples per class to improve performance. As we add more labeled examples, the performance gap between ENT and ours is reduced. This result is quite reasonable, because prototype estimation becomes more accurate without any adaptation as we have more labeled target examples.

Effect of Classifier Architecture. We present an ablation study on the classifier network architecture proposed in [5, 25], with AlexNet on DomainNet. As shown in Fig. 3, we employ $\ell_2$ normalization and temperature scaling. In this experiment, we compared it with a model having a standard linear layer without $\ell_2$ normalization and temperature. The result is shown in Table 4. By using the network architecture proposed in [5, 25], we can improve the performance of both our method and the baseline S+T model (the model trained only on source examples and a few labeled target examples). Therefore, we can argue that this network architecture is an effective technique to improve performance when we are given a few labeled examples from the target domain.

Feature Visualization. In addition, we plot the learned features with t-SNE [21] in Fig. 5. We employ the Real to Clipart scenario of DomainNet, using AlexNet as the pre-trained backbone. Fig. 5 (a-d) visualizes the target features and estimated prototypes. The color of a cross represents its class; black points are the prototypes. With our method, the target features are clustered around their prototypes and do not have a large variance within a class. We visualize features on the source domain (red crosses) and target domain (blue crosses) in Fig. 5 (e-h). As we discussed in the method section, our method aims to minimize domain divergence. Indeed, target features are well-aligned with source features with our method. Judging from Fig. 5f, entropy minimization (ENT) also tries to extract discriminative features, but it fails to find domain-invariant prototypes.
Figure 5: Feature visualization with t-SNE (panels: (a) Ours, (b) ENT, (c) DANN, (d) S+T; (e) Ours, (f) ENT, (g) DANN, (h) S+T). (a-d) We plot the class prototypes (black circles) and features on the target domain (crosses); the color of a cross represents its class. Features learned by our method are more discriminative than those of other methods. (e-h) Red: features of the source domain. Blue: features of the target domain. Our method's features are well-aligned between domains compared to other methods.

Figure 6: (a) Eigenvalues of the covariance matrix of the features on the target domain. The eigenvalues decay quickly in our method, which shows that the features are more discriminative than those of other methods. (b) Our method achieves lower entropy than all baselines except ENT. (c) Our method clearly reduces domain divergence compared to S+T.
Quantitative Feature Analysis. We quantitatively investigate the characteristics of the features we obtain using the same adaptation scenario. First, we perform an analysis of the eigenvalues of the covariance matrix of target features, following the analysis done in [9]. Eigenvectors represent the components of the features and eigenvalues represent their contributions. If the features are highly discriminative, only a few components are needed to summarize them; in such a case, the first few eigenvalues are expected to be large and the rest small. The features are clearly summarized by fewer components in our method, as shown in Fig. 6a. Second, we show the change of the entropy value on the target in Fig. 6b. ENT diminishes the entropy quickly but results in poor performance, which indicates that the method incorrectly increases the confidence of its predictions, whereas our method achieves higher accuracy at the same time. Finally, in Fig. 6c, we calculated the A-distance by training an SVM as a domain classifier, as proposed in [2]. Our method greatly reduces the distance compared to S+T. The claim that our method reduces domain divergence is empirically supported by this result.
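Both diagnostics are straightforward to reproduce. The sketch below is our own illustration (not the paper's analysis code): the eigenvalue analysis follows the description above, and the A-distance uses the common proxy d_A = 2(1 − 2ε) with ε the held-out error of a linear SVM domain classifier; the exact protocol of [2] may differ in details.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split


    def covariance_spectrum(feats: np.ndarray) -> np.ndarray:
        """Eigenvalues of the covariance matrix of target features, sorted in
        decreasing order; discriminative features concentrate mass in the
        first few eigenvalues."""
        cov = np.cov(feats, rowvar=False)
        return np.sort(np.linalg.eigvalsh(cov))[::-1]


    def a_distance(src_feats: np.ndarray, tgt_feats: np.ndarray) -> float:
        """Proxy A-distance from a linear SVM domain classifier:
        d_A = 2 * (1 - 2 * err), where err is the held-out error."""
        X = np.concatenate([src_feats, tgt_feats])
        y = np.concatenate([np.zeros(len(src_feats)), np.ones(len(tgt_feats))])
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
        err = 1.0 - LinearSVC(C=1.0, max_iter=10000).fit(Xtr, ytr).score(Xte, yte)
        return 2.0 * (1.0 - 2.0 * err)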
5. Conclusion

We proposed a novel Minimax Entropy (MME) approach that adversarially optimizes an adaptive few-shot model for semi-supervised domain adaptation (SSDA). Our model consists of a feature encoding network, followed by a classification layer that computes the features' similarity to a set of estimated prototypes (representatives of each class). Adaptation is achieved by alternately maximizing the conditional entropy of unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder. We empirically demonstrated the superiority of our method over many baselines, including conventional feature alignment and few-shot methods, setting a new state of the art for SSDA.
6. Acknowledgements

This work was supported by Honda, DARPA, BAIR, BDD, and NSF Award No. 1535797.

References

[1] Shuang Ao, Xiang Li, and Charles X Ling. Fast generalized distillation for semi-supervised domain adaptation. In AAAI, 2017.
[2] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS, 2007.
[3] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In NIPS, 2016.
[4] Fabio Maria Cariucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Autodial: Automatic domain alignment layers. In ICCV, 2017.
[5] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv, 2018.
[6] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In CVPR, 2018.
[7] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In NIPS, 2017.
[8] Jeff Donahue, Judy Hoffman, Erik Rodner, Kate Saenko, and Trevor Darrell. Semi-supervised domain adaptation with instance constraints. In CVPR, 2013.
[9] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine grained classification. In NIPS, 2018.
[10] Ayse Erkan and Yasemin Altun. Semi-supervised learning via generalized maximum entropy. In AISTATS, 2010.
[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2014.
[12] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In CVPR, 2018.
[13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2005.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[17] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv, 2016.
[18] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. arXiv, 2016.
[19] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[20] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In NIPS, 2018.
[21] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(11):2579–2605, 2008.
[22] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv, 2015.
[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[24] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
[25] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv, 2017.
[26] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. arXiv, 2016.
[27] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[28] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Adversarial dropout regularization. In ICLR, 2018.
[29] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Strong-weak distribution alignment for adaptive object detection. In CVPR, 2018.
[30] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
[31] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
[32] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR, 2018.
[33] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. In ICLR, 2018.
[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
[35] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[36] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
[37] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv, 2014.
[38] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
[39] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[40] Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
Supplemental Material

1. Datasets

First, we show examples of the datasets we employ in the experiments in Fig. 7. We also attach a list of the classes used in our experiments on DomainNet with this material.

2. Implementation Detail

We provide details of our implementation. We will publish our implementation upon acceptance. The reported performance in the main paper is obtained by one-time training. In this material, we also report the average and variance over multiple runs, as well as results on different dataset splits (i.e., different train/val splits).

Implementation of MME. For VGG and AlexNet, we replace the last linear layer with a randomly initialized linear layer. With regard to ResNet34, we remove the last linear layer and add two fully-connected layers, following [20]. We use the momentum optimizer where the initial learning rate is set to 0.01 for all fully-connected layers, whereas it is set to 0.001 for the other layers, including convolution and batch-normalization layers. We employ the learning rate annealing strategy proposed in [11]. Each mini-batch consists of labeled source, labeled target, and unlabeled target images. Labeled examples and unlabeled examples are forwarded separately. We sample s labeled source and labeled target images and 2s unlabeled target images; s is set to 32 for AlexNet, but 24 for VGG and ResNet due to GPU memory constraints. We use horizontal-flipping and random-cropping based data augmentation for all training images.
2.1. Baseline Implementation

Except for CDAN, we implemented all baselines ourselves.

S+T [5]. This approach only uses labeled source and target examples with the cross-entropy loss for training.

DANN [11]. We train a domain classifier on the output of the feature extractor. It has three fully-connected layers with ReLU activation; the dimension of the hidden layers is set to 512, and a sigmoid activation is used only for the final layer. The domain classifier is trained to distinguish source examples from unlabeled target examples.

ADR [28]. We put a dropout layer with a 0.1 dropout rate after the $\ell_2$-normalization layer. For unlabeled target examples, we calculate the sensitivity loss and train C to maximize it, whereas F is trained to minimize it. We also implemented C with deeper layers, but could not find an improvement.

ENT. The difference from MME is that the entire network is trained to minimize the entropy loss for unlabeled examples, in addition to the classification loss.

CDAN [20]. We used the official implementation of CDAN provided at https://github.com/thuml/CDAN. For brevity, CDAN in our paper denotes CDAN+E in their paper. We changed their implementation so that the model is trained with labeled target examples. Similar to DANN, the domain classifier of CDAN is trained to distinguish source examples from unlabeled target examples.
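For concreteness, the domain classifier described for the DANN baseline could be written as the following sketch (our own illustration; the module name and the input dimension inc are placeholders):

    import torch.nn as nn


    class DomainClassifier(nn.Module):
        """Three fully-connected layers with ReLU, hidden width 512, and a
        sigmoid on the final layer, as described for the DANN baseline
        (sketch; input dimension `inc` is a placeholder)."""

        def __init__(self, inc: int = 4096):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(inc, 512), nn.ReLU(inplace=True),
                nn.Linear(512, 512), nn.ReLU(inplace=True),
                nn.Linear(512, 1), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)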
Figure 7: Example images in DomainNet, Office-Home, and Office.

Network  Method  R-C   R-P   R-A   P-R   P-C   P-A   A-P   A-C   A-R   C-R   C-A   C-P   Mean

One-shot
AlexNet  S+T     37.5  63.1  44.8  54.3  31.7  31.5  48.8  31.1  53.3  48.5  33.9  50.8  44.1
         DANN    42.5  64.2  45.1  56.4  36.6  32.7  43.5  34.4  51.9  51.0  33.8  49.4  45.1
         ADR     37.8  63.5  45.4  53.5  32.5  32.2  49.5  31.8  53.4  49.7  34.2  50.4  44.5
         CDAN    36.1  62.3  42.2  52.7  28.0  27.8  48.7  28.0  51.3  41.0  26.8  49.9  41.2
         ENT     26.8  65.8  45.8  56.3  23.5  21.9  47.4  22.1  53.4  30.8  18.1  53.6  38.8
         MME     42.0  69.6  48.3  58.7  37.8  34.9  52.5  36.4  57.0  54.1  39.5  59.1  49.2
VGG      S+T     39.5  75.3  61.2  71.6  37.0  52.0  63.6  37.5  69.5  64.5  51.4  65.9  57.4
         DANN    52.0  75.7  62.7  72.7  45.9  51.3  64.3  44.4  68.9  64.2  52.3  65.3  60.0
         ADR     39.7  76.2  60.2  71.8  37.2  51.4  63.9  39.0  68.7  64.8  50.0  65.2  57.4
         CDAN    43.3  75.7  60.9  69.6  37.4  44.5  67.7  39.8  64.8  58.7  41.6  66.2  55.8
         ENT     23.7  77.5  64.0  74.6  21.3  44.6  66.0  22.4  70.6  62.1  25.1  67.7  51.6
         MME     49.1  78.7  65.1  74.4  46.2  56.0  68.6  45.8  72.2  68.0  57.5  71.3  62.7

Three-shot
AlexNet  S+T     44.6  66.7  47.7  57.8  44.4  36.1  57.6  38.8  57.0  54.3  37.5  57.9  50.0
         DANN    47.2  66.7  46.6  58.1  44.4  36.1  57.2  39.8  56.6  54.3  38.6  57.9  50.3
         ADR     45.0  66.2  46.9  57.3  38.9  36.3  57.5  40.0  57.8  53.4  37.3  57.7  49.5
         CDAN    41.8  69.9  43.2  53.6  35.8  32.0  56.3  34.5  53.5  49.3  27.9  56.2  46.2
         ENT     44.9  70.4  47.1  60.3  41.2  34.6  60.7  37.8  60.5  58.0  31.8  63.4  50.9
         MME     51.2  73.0  50.3  61.6  47.2  40.7  63.9  43.8  61.4  59.9  44.7  64.7  55.2
VGG      S+T     49.6  78.6  63.6  72.7  47.2  55.9  69.4  47.5  73.4  69.7  56.2  70.4  62.9
         DANN    56.1  77.9  63.7  73.6  52.4  56.3  69.5  50.0  72.3  68.7  56.4  69.8  63.9
         ADR     49.0  78.1  62.8  73.6  47.8  55.8  69.9  49.3  73.3  69.3  56.3  71.4  63.0
         CDAN    50.2  80.9  62.1  70.8  45.1  50.3  74.7  46.0  71.4  65.9  52.9  71.2  61.8
         ENT     48.3  81.6  65.5  76.6  46.8  56.9  73.0  44.8  75.3  72.9  59.1  77.0  64.8
         MME     56.9  82.9  65.7  76.7  53.6  59.2  75.7  54.9  75.3  72.9  61.1  76.3  67.6

Table 5: Results on Office-Home (%). Columns are adaptation scenarios (e.g., R-C denotes Real to Clipart). Our method performs better than the baselines in most settings.

3. Additional Results Analysis

Results on Office-Home and Office. In Table 5 and Table 6, we report all results on Office-Home and Office. In almost all settings, our method outperformed the baseline methods.

Network  Method  W to A (1-shot/3-shot)  D to A (1-shot/3-shot)
AlexNet  S+T     50.4/61.2               50.0/62.4
         DANN    57.0/64.4               54.5/65.2
         ADR     50.2/61.2               50.9/61.4
         CDAN    50.4/60.3               48.5/61.4
         ENT     50.7/64.0               50.0/66.2
         MME     57.2/67.3               55.8/67.8
VGG      S+T     69.2/73.2               68.2/73.3
         DANN    69.3/75.4               70.4/74.6
         ADR     69.7/73.3               69.2/74.1
         CDAN    65.9/74.4               64.4/71.4
         ENT     69.1/75.4               72.1/75.1
         MME     73.1/76.3               73.6/77.6

Table 6: Results on Office. Our method outperformed the other baselines in all settings.

Sensitivity to hyper-parameter λ. In Fig. 8, we show our method's performance when varying the hyper-parameter λ, which is the trade-off parameter between the classification loss on labeled examples and the entropy on unlabeled target examples. The best validation result is obtained when λ is 0.1. From this result on validation, we set λ to 0.1 in all experiments.

Changes in accuracy during training. We show the learning curve during training in Fig. 9. Our method gradually increases the performance, whereas the others quickly converge.

Comparison with virtual adversarial training. Here, we present a comparison with a general semi-supervised learning algorithm. We select virtual adversarial training (VAT) [22] as the baseline because the method is one of the state-of-the-art algorithms for semi-supervised learning and works well in various settings. The work proposes a loss called the virtual adversarial loss, defined as the robustness of the conditional label distribution around each input data point against local perturbation. We add the virtual adversarial loss for unlabeled target examples in addition to the classification loss. We employ the hyper-parameters used in the original implementation because we could not see improvement when changing the parameters. We show the results in Table 7. We do not observe the effectiveness of VAT in SSDA. This could be due to the fact that the method does not consider the domain gap between labeled and unlabeled examples; in order to boost performance, it should be better to account for this gap.

Analysis of Batch Normalization. We investigate the effect of BN and analyze the behavior of entropy minimization and our method with ResNet. When training all models, unlabeled target examples and labeled examples are forwarded separately; thus, the BN statistics are calculated separately for unlabeled target and labeled examples. Some previous work [4, 18] has demonstrated that this operation can reduce the domain gap. We call this batch strategy "Separate BN". To analyze the effect of Separate BN, we compared it with a "Joint BN" strategy, where we forward unlabeled and labeled examples at once. BN statistics are then calculated jointly, so Joint BN will not help to reduce the domain gap. We compare ours with entropy minimization on both Separate BN and Joint BN. Entropy minimization with Joint BN performs much worse than with Separate BN, as shown in Table 8. These results show that entropy minimization does not reduce the domain gap by itself. On the other hand, our method works well even in the case of Joint BN. This is because our training method is designed to reduce the domain gap.

Method  Joint BN  Separate BN
ENT     63.6      68.9
MME     69.5      69.6

Table 8: Ablation study of batch normalization. The performance of the ENT method highly depends on the choice of BN, while our method shows consistent behavior.

Comparison with SSDA methods [33, 1]. Since there are no recently proposed SSDA methods using deep learning, we compared with state-of-the-art unsupervised DA methods modified for the SSDA task. We also compared our method with [33] and [1]. We implemented [33] and modified it for the SSDA task as well. To compare with [1], we follow their evaluation protocol and report our and their best accuracy (see Fig. 3 (c)(f) in [1]). As shown in Table 10, we outperform these methods by a significant margin.

AlexNet       R to C (1-shot/3-shot)    AlexNet      D to A (1-shot)  W to A (1-shot)
DIRT-T [33]   45.2/48.0                 GDSDA [1]    51.5             48.3
MME           48.9/55.6                 MME          58.5             60.4

Table 10: Comparison with [33, 1].

Results on Multiple Runs. We investigate the stability of our method and several baselines. Table 11 shows the averaged accuracy and standard deviation of three runs. The deviation is not large, and we can say that our method is stable.

Method  1-shot     3-shot
CDAN    62.9±1.5   65.3±0.1
ENT     59.5±1.5   63.6±1.3
MME     64.3±0.8   66.8±0.4

Table 11: Results over three runs on DomainNet, Sketch to Painting adaptation scenario, using ResNet.

Results on Different Splits. We investigate the stability of our method with respect to the choice of labeled target examples. Table 9 shows results on different splits; sp0 corresponds to the split we use in the experiments of our paper. For each split, we randomly picked the labeled training examples and validation examples. Our method consistently performs better than the other methods.

Method  1-shot (sp0/sp1/sp2)  3-shot (sp0/sp1/sp2)
S+T     43.3/43.8/43.8        47.1/45.9/48.8
DANN    43.3/44.0/45.4        46.1/43.1/45.3
ENT     37.0/32.9/38.2        45.5/45.4/47.8
MME     48.9/51.2/51.4        55.6/55.0/55.8

Table 9: Results on different training splits on DomainNet, Real to Clipart adaptation scenario, using AlexNet.

Method  R to C  R to P  P to C  C to P  C to S  S to P  R to S  P to R
S+T     47.1    45.0    44.9    35.9    36.4    38.4    33.3    58.7
VAT     46.1    43.8    44.3    35.8    35.6    38.2    31.8    57.7
MME     55.6    49.0    51.7    40.2    39.4    43.0    37.9    60.7

Table 7: Comparison with VAT [22] using AlexNet on DomainNet. VAT does not perform better than S+T.

Figure 8: Sensitivity to hyper-parameter λ. The result is obtained when we use AlexNet on DomainNet, Real to Clipart.

Figure 9: Test and validation accuracy over iterations ((a) test accuracy, (b) validation accuracy). Our method increases performance over iterations while the others quickly converge. The result is obtained on Real to Clipart adaptation of DomainNet using AlexNet.
