
such as clustering or retrieval, prohibitively expensive, as
each probe has to go through the network paired up with
every gallery image.
In this paper we show that, contrary to current opin-
ion, a plain CNN with a triplet loss can outperform current
state-of-the-art approaches on both the Market-1501 [39]
and MARS [38] datasets. The triplet loss allows us to per-
form end-to-end learning between the input image and the
desired embedding space. This means we directly optimize
the network for the final task, which renders an additional
metric learning step obsolete. Instead, we can simply com-
pare persons by computing the Euclidean distance of their
embeddings.
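To make this comparison step concrete, the following minimal NumPy sketch (illustrative only; the embedding dimension, random inputs, and function name are placeholders, not part of the proposed pipeline) ranks gallery images for a probe purely by the Euclidean distance between their embeddings:

```python
import numpy as np

def rank_gallery(probe_emb, gallery_embs):
    """Rank gallery entries by Euclidean distance to a probe embedding.

    probe_emb:    (D,)   embedding of the query person
    gallery_embs: (N, D) embeddings of the N gallery images
    Returns gallery indices sorted from most to least similar.
    """
    dists = np.linalg.norm(gallery_embs - probe_emb[None, :], axis=1)
    return np.argsort(dists)

# Toy usage with random 128-D embeddings (dimensions are illustrative only).
rng = np.random.default_rng(0)
probe = rng.normal(size=128)
gallery = rng.normal(size=(100, 128))
print(rank_gallery(probe, gallery)[:5])  # five closest gallery indices
```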
A possible reason for the unpopularity of the triplet loss
is that, when applied naïvely, it will indeed often produce
disappointing results. An essential part of learning using
the triplet loss is the mining of hard triplets, as otherwise
training will quickly stagnate. However, mining such hard
triplets is time consuming and it is unclear what defines
“good” hard triplets [23, 24]. Even worse, selecting overly
hard triplets too often makes the training unstable. We show
how this problem can be alleviated, resulting in both faster
training and better performance. We systematically analyze
the design space of triplet losses, and evaluate which one
works best for person ReID. While doing so, we place two
previously proposed variants [7, 25] into this design space
and discuss them in more detail in Section 2. Specifically,
we find that the best performing version has not been used
before. Furthermore, we show that a margin-less formu-
lation performs slightly better, while removing one hyper-
parameter.
Another clear trend seems to be the use of pretrained
models such as GoogleNet [28] or ResNet-50 [12]. In-
deed, pretrained models often obtain great scores for person
ReID [8, 41], while ever fewer top-performing approaches
use networks trained from scratch [18, 1, 6, 34, 24, 31, 3].
Some authors even argue that training from scratch is
bad [8]. However, using pretrained networks also leads to
a design lock-in, and does not allow for the exploration
of new deep learning advances or different architectures.
We show that, when following best practices in deep learn-
ing, networks trained from scratch can perform competi-
tively for person ReID. Furthermore, we do not rely on
network components specifically tailored towards person
ReID, but train a plain feed-forward CNN, unlike many other
approaches that train from scratch [1, 31, 18, 34, 27]. In-
deed, our networks using pretrained weights obtain the best
results, but our far smaller architecture obtains respectable
scores, providing a viable alternative for applications where
person ReID needs to be performed on resource-constrained
hardware, such as embedded devices.
In summary, our contribution is twofold: First, we introduce
variants of the classic triplet loss that render mining of hard
triplets unnecessary, and we systematically evaluate these
variants. Second, we show that, contrary to the prevailing
opinion, using a triplet loss and no special layers, we achieve
state-of-the-art results both with a pretrained CNN and with
a model trained from scratch.
2. Learning Metric Embeddings, the Triplet
Loss, and the Importance of Mining
The goal of metric embedding learning is to learn a function
$f_\theta(x) : \mathbb{R}^F \rightarrow \mathbb{R}^D$ which maps semantically similar points
from the data manifold in $\mathbb{R}^F$ onto metrically close points
in $\mathbb{R}^D$. Analogously, $f_\theta$ should map semantically different
points in $\mathbb{R}^F$ onto metrically distant points in $\mathbb{R}^D$. The
function $f_\theta$ is parametrized by $\theta$ and can be anything
ranging from a linear transform [33, 19, 36, 22] to complex
non-linear mappings usually represented by deep neural
networks [7, 6, 8]. Let $D(x, y) : \mathbb{R}^D \times \mathbb{R}^D \rightarrow \mathbb{R}$ be a metric
function measuring distances in the embedding space. For
clarity we use the shortcut notation $D_{i,j} = D(f_\theta(x_i), f_\theta(x_j))$,
where we omit the indirect dependence of $D_{i,j}$ on the
parameters $\theta$. As is common practice, all loss terms are
divided by the number of summands in a batch; we omit this
term in the following equations for conciseness.
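For concreteness, the following sketch (assuming the Euclidean metric for $D$ and operating on a batch of already-computed embeddings $f_\theta(x_i)$; this is illustrative code, not an implementation from the paper) evaluates all pairwise distances $D_{i,j}$ within a batch:

```python
import numpy as np

def pairwise_distances(embeddings, squared=False):
    """Compute D_{i,j} = D(f_theta(x_i), f_theta(x_j)) for a batch.

    embeddings: (B, D) array, the embeddings f_theta(x_i) of a batch of size B.
    Returns a (B, B) matrix of (squared) Euclidean distances.
    """
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, clipped at 0 for numerical safety.
    dot = embeddings @ embeddings.T
    sq_norms = np.diag(dot)
    d2 = np.maximum(sq_norms[:, None] - 2.0 * dot + sq_norms[None, :], 0.0)
    return d2 if squared else np.sqrt(d2)
```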
Weinberger and Saul [33] explore this topic with the explicit
goal of performing k-nearest neighbor classification in the
learned embedding space and propose the “Large Margin
Nearest Neighbor loss” for optimizing $f_\theta$:
$$\mathcal{L}_{\mathrm{LMNN}}(\theta) = (1 - \mu)\,\mathcal{L}_{\mathrm{pull}}(\theta) + \mu\,\mathcal{L}_{\mathrm{push}}(\theta), \qquad (1)$$
which consists of a pull-term, pulling data points $i$ towards
their target neighbor $T(i)$ from the same class, and a
push-term, pushing data points $n$ from a different class
further away:
$$\mathcal{L}_{\mathrm{pull}}(\theta) = \sum_{i,\, j \in T(i)} D_{i,j}, \qquad (2)$$
$$\mathcal{L}_{\mathrm{push}}(\theta) = \sum_{\substack{a,n \\ y_a \neq y_n}} \left[ m + D_{a,T(a)} - D_{a,n} \right]_+ . \qquad (3)$$
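To make Eqs. (2) and (3) concrete, the following sketch (our own illustrative NumPy code, not an implementation from [33], and assuming for simplicity a single fixed target neighbor $T(i)$ per point) evaluates both terms from precomputed distances and labels:

```python
import numpy as np

def lmnn_losses(dists, labels, targets, margin):
    """Pull and push terms of the LMNN loss, Eqs. (2) and (3).

    dists:   (B, B) matrix of distances D_{i,j} in the embedding space.
    labels:  (B,) class labels y_i.
    targets: (B,) index T(i) of each point's fixed same-class target neighbor.
    margin:  the margin m.
    """
    idx = np.arange(len(labels))
    d_target = dists[idx, targets]                   # D_{i, T(i)}
    pull = d_target.sum()                            # Eq. (2)

    diff_class = labels[:, None] != labels[None, :]  # pairs (a, n) with y_a != y_n
    hinge = np.maximum(margin + d_target[:, None] - dists, 0.0)
    push = hinge[diff_class].sum()                   # Eq. (3)
    return pull, push
```

The full loss of Eq. (1) is then `(1 - mu) * pull + mu * push`, up to the normalization by the number of summands mentioned above.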
Because the motivation was nearest-neighbor classification,
allowing disparate clusters of the same class was an explicit
goal, achieved by choosing fixed target neighbors at the on-
set of training. Since this property is harmful for retrieval
tasks such as face and person ReID, FaceNet [23] proposed
a modification of $\mathcal{L}_{\mathrm{LMNN}}(\theta)$ called the “Triplet loss”:
$$\mathcal{L}_{\mathrm{tri}}(\theta) = \sum_{\substack{a,p,n \\ y_a = y_p \neq y_n}} \left[ m + D_{a,p} - D_{a,n} \right]_+ . \qquad (4)$$
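A direct translation of Eq. (4) into code could look as follows; this is a minimal NumPy sketch that simply sums over all valid triplets in a batch, leaving out batch construction and any mining strategy:

```python
import numpy as np

def triplet_loss_all(dists, labels, margin):
    """Triplet loss of Eq. (4), summed over all valid (a, p, n) triplets.

    dists:  (B, B) matrix of distances D_{i,j} in the embedding space.
    labels: (B,) class labels y_i.
    margin: the margin m.
    """
    same = labels[:, None] == labels[None, :]
    B = len(labels)
    loss = 0.0
    for a in range(B):
        for p in range(B):
            if p == a or not same[a, p]:      # require y_a = y_p, p != a
                continue
            for n in range(B):
                if same[a, n]:                # require y_a != y_n
                    continue
                loss += max(margin + dists[a, p] - dists[a, n], 0.0)
    return loss
```

A sum over every valid triplet like this is quickly dominated by terms whose hinge is already zero, which is why the choice of triplets, i.e. mining, matters so much in practice.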
This loss makes sure that, given an anchor point $x_a$, the
projection of a positive point $x_p$ belonging to the same class