
Hindawi Publishing Corporation

EURASIP Journal on Image and Video Processing


Volume 2008, Article ID 246309, 10 pages
doi:10.1155/2008/246309

Research Article
Evaluating Multiple Object Tracking Performance:
The CLEAR MOT Metrics

Keni Bernardin and Rainer Stiefelhagen

Interactive Systems Lab, Institut für Theoretische Informatik, Universität Karlsruhe, 76131 Karlsruhe, Germany

Correspondence should be addressed to Keni Bernardin, keni@[Link]

Received 2 November 2007; Accepted 23 April 2008

Recommended by Carlo Regazzoni

Simultaneous tracking of multiple persons in real-world environments is an active research field and several approaches have
been proposed, based on a variety of features and algorithms. Recently, there has been a growing interest in organizing systematic
evaluations to compare the various techniques. Unfortunately, the lack of common metrics for measuring the performance of
multiple object trackers still makes it hard to compare their results. In this work, we introduce two intuitive and general metrics to
allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy
in recognizing object configurations and their ability to consistently label objects over time. These metrics have been extensively
used in two large-scale international evaluations, the 2006 and 2007 CLEAR evaluations, to measure and compare the performance
of multiple object trackers for a wide variety of tracking tasks. Selected performance results are presented and the advantages and
drawbacks of the presented metrics are discussed based on the experience gained during the evaluations.

Copyright © 2008 K. Bernardin and R. Stiefelhagen. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

1. INTRODUCTION

The audio-visual tracking of multiple persons is a very active research field with applications in many domains. These range from video surveillance, over automatic indexing, to intelligent interactive environments. Especially in the last case, a robust person tracking module can serve as a powerful building block to support other techniques, such as gesture recognizers, face or speaker identifiers, head pose estimators [1], and scene analysis tools. In the last few years, more and more approaches have been presented to tackle the problems posed by unconstrained, natural environments and bring person trackers out of the laboratory environment and into real-world scenarios.

In recent years, there has also been a growing interest in performing systematic evaluations of such tracking approaches with common databases and metrics. Examples are the CHIL [2] and AMI [3] projects, funded by the EU, the U.S. VACE project [4], the French ETISEO [5] project, the U.K. Home Office iLIDS project [6], the CAVIAR [7] and CREDS [8] projects, and a growing number of workshops (e.g., PETS [9], EEMCV [10], and more recently CLEAR [11]). However, although benchmarking is rather straightforward for single object trackers, there is still no general agreement on a principled evaluation procedure using a common set of objective and intuitive metrics for measuring the performance of multiple object trackers.

Li et al. in [12] investigate the problem of evaluating systems for the tracking of football players from multiple camera images. Annotated ground truth for a set of visible players is compared to the tracker output, and 3 measures are introduced to evaluate the spatial and temporal accuracy of the result. Two of the measures, however, are rather specific to the football tracking problem, and the more general measure, the "identity tracking performance," does not consider some of the basic types of errors made by multiple target trackers, such as false positive tracks or localization errors in terms of distance or overlap. This limits the application of the presented metric to specific types of trackers or scenarios.

Nghiem et al. in [13] present a more general framework for evaluation, which covers the requirements of a broad range of visual tracking tasks. The presented metrics aim at allowing systematic performance analysis using large amounts of benchmark data.

However, a high number of different metrics (8 in total) are presented to evaluate object detection, localization, and tracking performance, with many dependencies between separate metrics, such that one metric can often only be interpreted in combination with one or more others. This is, for example, the case for the "tracking time" and "object ID persistence/confusion" metrics. Further, many of the proposed metrics are still designed with purely visual tracking tasks in mind.

Because of the lack of commonly agreed-on and generally applicable metrics, it is not uncommon to find tracking approaches presented without quantitative evaluation, while many others are evaluated using varying sets of more or less custom measures (e.g., [14-18]). To remedy this, this paper proposes a thorough procedure to detect the basic types of errors produced by multiple object trackers and introduces two novel metrics, the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA), that intuitively express a tracker's overall strengths and are suitable for use in general performance evaluations.

Perhaps the work that most closely relates to ours is that of Smith et al. in [19], which also attempts to define an objective procedure to measure multiple object tracker performance. However, key differences to our contribution exist: again, a large number of metrics are introduced: 5 for measuring object configuration errors, and 4 for measuring inconsistencies in object labeling over time. Some of the measures are defined in a dual way for trackers and for objects (e.g., MT/MO, FIT/FIO, TP/OP). This makes it difficult to gain a clear and direct understanding of the tracker's overall performance. Moreover, under certain conditions, some of these measures can behave in a nonintuitive fashion (such as the CD, as the authors state, or the FP and FN, as we will demonstrate later). In comparison, we introduce just 2 overall performance measures that allow a clear and intuitive insight into the main tracker characteristics: its precision in estimating object positions, its ability to determine the number of objects and their configuration, and its skill at keeping consistent tracks over time.

In addition to the theoretical framework, we present actual results obtained in two international evaluation workshops, which can be seen as field tests of the proposed metrics. These evaluation workshops, the Classification of Events, Activities, and Relationships (CLEAR) workshops, were held in spring 2006 and 2007 and featured a variety of tracking tasks, including visual 3D person tracking using multiple camera views, 2D face tracking, 2D person and vehicle tracking, acoustic speaker tracking using microphone arrays, and even audio-visual person tracking. For all these tracking tasks, each with its own specificities and requirements, the here-introduced MOTP and MOTA metrics, or slight variants thereof, were employed. The experiences made during the course of the CLEAR evaluations are presented and discussed as a means to better understand the expressiveness and usefulness, but also the weaknesses, of the MOT metrics.

The remainder of the paper is organized as follows. Section 2 presents the new metrics, the MOTP and the MOTA, and a detailed procedure for their computation. Section 3 briefly introduces the CLEAR tracking tasks and their various requirements. In Section 4, sample results are shown and the usefulness of the metrics is discussed. Finally, Section 5 gives a summary and a conclusion.

2. PERFORMANCE METRICS FOR MULTIPLE OBJECT TRACKING

To allow a better understanding of the proposed metrics, we first explain what qualities we expect from an ideal multiple object tracker. It should at all points in time find the correct number of objects present and estimate the position of each object as precisely as possible (note that properties such as the contour, orientation, or speed of objects are not explicitly considered here). It should also keep consistent track of each object over time: each object should be assigned a unique track ID which stays constant throughout the sequence (even after temporary occlusion, etc.). This leads to the following design criteria for performance metrics.

(i) They should allow one to judge a tracker's precision in determining exact object locations.
(ii) They should reflect its ability to consistently track object configurations through time, that is, to correctly trace object trajectories, producing exactly one trajectory per object.

Additionally, we expect useful metrics

(i) to have as few free parameters, adjustable thresholds, and so forth, as possible, to help make evaluations straightforward and keep results comparable;
(ii) to be clear, easily understandable, and behave according to human intuition, especially in the occurrence of multiple errors of different types or of uneven repartition of errors throughout the sequence;
(iii) to be general enough to allow comparison of most types of trackers (2D, 3D trackers, object centroid trackers, or object area trackers, etc.);
(iv) to be few in number and yet expressive, so they may be used, for example, in large evaluations where many systems are being compared.

Based on the above criteria, we propose a procedure for the systematic and objective evaluation of a tracker's characteristics. Assuming that for every time frame t, a multiple object tracker outputs a set of hypotheses {h_1, ..., h_m} for a set of visible objects {o_1, ..., o_n}, the evaluation procedure comprises the following steps.

For each time frame t,

(i) establish the best possible correspondence between hypotheses h_j and objects o_i,
(ii) for each found correspondence, compute the error in the object's position estimation,
(iii) accumulate all correspondence errors:

(a) count all objects for which no hypothesis was output as misses,

(b) count all tracker hypotheses for which no real object exists as false positives,
(c) count all occurrences where the tracking hypothesis for an object changed compared to previous frames as mismatch errors. This could happen, for example, when two or more objects are swapped as they pass close to each other, or when an object track is reinitialized with a different track ID after it was previously lost because of occlusion.

Then, the tracking performance can be intuitively expressed in two numbers: the "tracking precision", which expresses how well exact positions of persons are estimated, and the "tracking accuracy", which shows how many mistakes the tracker made in terms of misses, false positives, mismatches, failures to recover tracks, and so forth. These measures will be explained in detail in the latter part of this section.

2.1. Establishing correspondences between objects and tracker hypotheses

As explained above, the first step in evaluating the performance of a multiple object tracker is finding a continuous mapping between the sequence of object hypotheses {h_1, ..., h_m} output by the tracker in each frame and the real objects {o_1, ..., o_n}. This is illustrated in Figure 1. Naively, one would match the closest object-hypothesis pairs and treat all remaining objects as misses and all remaining hypotheses as false positives. A few important points need to be considered, though, which make the procedure less straightforward.

Figure 1: Mapping tracker hypotheses to objects. In the easiest case, matching the closest object-hypothesis pairs for each time frame t is sufficient.

2.1.1. Valid correspondences

First of all, the correspondence between an object o_i and a hypothesis h_j should not be made if their distance dist_ij exceeds a certain threshold T. There is a certain conceptual boundary beyond which we can no longer speak of an error in position estimation, but should rather argue that the tracker has missed the object and is tracking something else. This is illustrated in Figure 2(a). For object area trackers (i.e., trackers that also estimate the size of objects or the area occupied by them), distance could be expressed in terms of the overlap between object and hypothesis, for example, as in [14], and the threshold T could be set to zero overlap. For object centroid trackers, one could simply use the Euclidean distance, in 2D image coordinates or in real 3D world coordinates, between object centers and hypotheses, and the threshold could be, for example, the average width of a person in pixels or cm. The optimal setting for T therefore depends on the application task, the size of objects involved, and the distance measure used, and cannot be defined for the general case (while a task-specific, data-driven computation of T may be possible in some cases, this was not further investigated here; for the evaluations presented in Sections 3 and 4, empirical determination based on task knowledge proved sufficient). In the following, we refer to correspondences as valid if dist_ij < T.

2.1.2. Consistent tracking over time

Second, to measure the tracker's ability to label objects consistently, one has to detect when conflicting correspondences have been made for an object over time. Figure 2(b) illustrates the problem. Here, one track was mistakenly assigned to 3 different objects over the course of time. A mismatch can occur when objects come close to each other and the tracker wrongfully swaps their identities. It can also occur when a track was lost and reinitialized with a different identity. One way to measure such errors could be to decide on a "best" mapping (o_i, h_j) for every object o_i and hypothesis h_j, for example, based on the initial correspondence made for o_i, or the correspondence (o_i, h_j) most frequently made in the whole sequence. One would then count all correspondences where this mapping is violated as errors. In some cases, this kind of measure can, however, become nonintuitive. As shown in Figure 2(c), if, for example, the identity of object o_i is swapped just once in the course of the tracking sequence, the time frame at which the swap occurs drastically influences the value output by such an error measure.

This is why we follow a different approach: only count mismatch errors once, at the time frames where a change in object-hypothesis mappings is made, and consider the correspondences in intermediate segments as correct. Especially in cases where many objects are being tracked and mismatches are frequent, this gives us a more intuitive and expressive error measure. To detect when a mismatch error occurs, a list of object-hypothesis mappings is constructed. Let M_t = {(o_i, h_j)} be the set of mappings made up to time t, and let M_0 = {·}. Then, if a new correspondence is made at time t+1 between o_i and h_k which contradicts a mapping (o_i, h_j) in M_t, a mismatch error is counted and (o_i, h_j) is replaced by (o_i, h_k) in M_{t+1}.

The so-constructed mapping list M_t can now help to establish optimal correspondences between objects and hypotheses at time t+1, when multiple valid choices exist. Figure 2(d) shows such a case.
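The mismatch bookkeeping just described can be sketched in a few lines of Python. The helper below is illustrative (its names are invented, not from the paper): a mismatch is charged only at the frame where a stored mapping is contradicted, so a single swap costs a single error no matter when in the sequence it occurs.

```python
def update_mappings(M, correspondences):
    """Count mismatch errors for one frame.

    M: dict mapping object id -> hypothesis id (the mapping list M_t).
    correspondences: list of (object id, hypothesis id) pairs made this frame.
    An error is counted only when an existing mapping is contradicted;
    the mapping is then replaced, so later frames with the new pairing
    count as correct.
    """
    mismatch_errors = 0
    for obj, hyp in correspondences:
        if obj in M and M[obj] != hyp:
            mismatch_errors += 1      # contradicts a mapping (o_i, h_j) in M_t
        M[obj] = hyp                  # replace with (o_i, h_k) in M_{t+1}
    return mismatch_errors
```

Initial correspondences (an object not yet in M) are free, matching the empty M_0, and a track swapped once is charged exactly once, regardless of where in the sequence the swap happens.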
When it is not clear which hypothesis to match to an object o_i, priority is given to h_o with (o_i, h_o) ∈ M_t, as this is most likely the correct track. Other hypotheses are considered false positives; they could have occurred because the tracker outputs several hypotheses for o_i, or because a hypothesis that previously tracked another object accidentally crossed over to o_i.

Figure 2: Optimal correspondences and error measures. (a) When the distance between o_1 and h_1 exceeds a certain threshold T, one can no longer make a correspondence. Instead, o_1 is considered missed and h_1 becomes a false positive. (b) Mismatched tracks. Here, h_2 is first mapped to o_2. After a few frames, though, o_1 and o_2 cross paths and h_2 follows the wrong object. Later, it wrongfully swaps again to o_3. (c) Problems when using a sequence-level "best" object-hypothesis mapping based on most frequently made correspondences. In the first case, o_1 is tracked for just 2 frames by h_1, before the track is taken over by h_2. In the second case, h_1 tracks o_1 for almost half the sequence. In both cases, a "best" mapping would pair h_2 and o_1. This, however, leads to counting 2 mismatch errors for case 1 and 4 errors for case 2, although in both cases only one error of the same kind was made. (d) Correct reinitialization of a track. At time t, o_1 is tracked by h_1. At t+1, the track is lost. At t+2, two valid hypotheses exist. The correspondence is made with h_1, although h_2 is closer to o_1, based on the knowledge of previous mappings up to time t+1.

2.1.3. Mapping procedure

Having clarified all the design choices behind our strategy for constructing object-hypothesis correspondences, we summarize the procedure as follows.

Let M_0 = {·}. For every time frame t, consider the following.

(1) For every mapping (o_i, h_j) in M_{t-1}, verify if it is still valid. If object o_i is still visible and tracker hypothesis h_j still exists at time t, and if their distance does not exceed the threshold T, make the correspondence between o_i and h_j for frame t.

(2) For all objects for which no correspondence was made yet, try to find a matching hypothesis. Allow only one-to-one matches, and pairs for which the distance does not exceed T. The matching should be made in a way that minimizes the total object-hypothesis distance error for the concerned objects. This is a minimum weight assignment problem, and is solved using Munkres' algorithm [20] with polynomial runtime complexity. If a correspondence (o_i, h_k) is made that contradicts a mapping (o_i, h_j) in M_{t-1}, replace (o_i, h_j) with (o_i, h_k) in M_t. Count this as a mismatch error and let mme_t be the number of mismatch errors for frame t.

(3) After the first two steps, a complete set of matching pairs for the current time frame is known. Let c_t be the number of matches found for time t. For each of these matches, calculate the distance d_t^i between the object o_i and its corresponding hypothesis.
(4) All remaining hypotheses are considered false positives. Similarly, all remaining objects are considered misses. Let fp_t and m_t be the number of false positives and misses, respectively, for frame t. Let also g_t be the number of objects present at time t.

(5) Repeat the procedure from step 1 for the next time frame. Note that since for the initial frame the set of mappings M_0 is empty, all correspondences made are initial and no mismatch errors occur.

In this way, a continuous mapping between objects and tracker hypotheses is defined and all tracking errors are accounted for.

2.2. Performance metrics

Based on the matching strategy described above, two very intuitive metrics can be defined.

(1) The multiple object tracking precision (MOTP):

    MOTP = Σ_{i,t} d_t^i / Σ_t c_t.   (1)

It is the total error in estimated position for matched object-hypothesis pairs over all frames, averaged by the total number of matches made. It shows the ability of the tracker to estimate precise object positions, independent of its skill at recognizing object configurations, keeping consistent trajectories, and so forth.

(2) The multiple object tracking accuracy (MOTA):

    MOTA = 1 - Σ_t (m_t + fp_t + mme_t) / Σ_t g_t,   (2)

where m_t, fp_t, and mme_t are the number of misses, of false positives, and of mismatches, respectively, for time t. The MOTA can be seen as derived from 3 error ratios:

    m̄ = Σ_t m_t / Σ_t g_t,   (3)

the ratio of misses in the sequence, computed over the total number of objects present in all frames,

    f̄p = Σ_t fp_t / Σ_t g_t,   (4)

the ratio of false positives, and

    m̄me = Σ_t mme_t / Σ_t g_t,   (5)

the ratio of mismatches. Summing up over the different error ratios gives us the total error rate E_tot, and 1 - E_tot is the resulting tracking accuracy. The MOTA accounts for all object configuration errors made by the tracker (false positives, misses, mismatches) over all frames. It is similar to metrics widely used in other domains (such as the word error rate (WER), commonly used in speech recognition) and gives a very intuitive measure of the tracker's performance at detecting objects and keeping their trajectories, independent of the precision with which the object locations are estimated.

Remark on computing averages: note that for both MOTP and MOTA, it is important to first sum up all errors across frames before a final average or ratio is computed. The reason is that computing ratios r_t for each frame independently before calculating a global average (1/n) Σ_t r_t over all n frames (such as, e.g., for the FP and FN measures in [19]) can lead to nonintuitive results. This is illustrated in Figure 3. Although the tracker consistently missed most objects in the sequence, computing ratios independently per frame and then averaging would still yield only a 50% miss rate. Summing up all misses first and computing a single global ratio, on the other hand, produces a more intuitive result of an 80% miss rate.

Figure 3: Computing error ratios. Assume a sequence length of 8 frames. For frames t_1 to t_4, 4 objects o_1, ..., o_4 are visible, but none is being tracked. For frames t_5 to t_8, only o_4 remains visible, and is being consistently tracked by h_1. In each frame t_1, ..., t_4, 4 objects are missed, resulting in a 100% miss rate. In each frame t_5, ..., t_8, the miss rate is 0%. Averaging these frame-level error rates yields a global result of (1/8)(4 · 100% + 4 · 0%) = 50% miss rate. On the other hand, summing up all errors first and computing a global ratio yields a far more intuitive result of 16 misses / 20 objects = 80%.

3. TRACKING EVALUATIONS IN CLEAR

The theoretical framework presented here for the evaluation of multiple object trackers was applied in two large evaluation workshops. The Classification of Events, Activities, and Relationships (CLEAR) workshops [11] were organized in a collaboration between the European CHIL project, the U.S. VACE project, and the National Institute of Standards and Technology (NIST) [21] (as well as the AMI project, in 2007), and were held in the springs of 2006 and 2007.
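Before turning to the individual tasks, the evaluation procedure of Section 2 can be made concrete in code. The sketch below is illustrative only, not the scoring tool used in CLEAR: all names are invented, a brute-force search stands in for Munkres' algorithm [20], and the priority rule of Figure 2(d) (keeping a still-valid previous mapping over a closer new hypothesis) is omitted for brevity.

```python
from itertools import permutations


def _min_cost_matching(objects, hypotheses, dist, T):
    """One-to-one matching: maximize the number of valid pairs (dist < T),
    then minimize total distance. Brute force; a real implementation would
    use Munkres' (Hungarian) algorithm."""
    obj_ids = list(objects)
    hyp_ids = list(hypotheses) + [None] * len(objects)  # None = unmatched
    best, best_key = [], (-1, 0.0)
    for perm in permutations(hyp_ids, len(obj_ids)):
        pairs = [(o, h) for o, h in zip(obj_ids, perm)
                 if h is not None and dist(objects[o], hypotheses[h]) < T]
        cost = sum(dist(objects[o], hypotheses[h]) for o, h in pairs)
        key = (len(pairs), -cost)
        if key > best_key:
            best, best_key = pairs, key
    return best


def evaluate(frames, dist, T):
    """Accumulate misses, false positives, and mismatches over all frames,
    then compute MOTP (Eq. 1) and MOTA (Eq. 2).

    frames: list of (objects, hypotheses), each a dict id -> position.
    """
    M = {}                                    # mapping list M_t
    misses = fps = mmes = matches = gt = 0
    total_dist = 0.0
    for objects, hypotheses in frames:
        pairs = _min_cost_matching(objects, hypotheses, dist, T)
        for o, h in pairs:
            if o in M and M[o] != h:          # contradicts an earlier mapping
                mmes += 1
            M[o] = h
            total_dist += dist(objects[o], hypotheses[h])
        matches += len(pairs)
        misses += len(objects) - len(pairs)   # unmatched objects
        fps += len(hypotheses) - len(pairs)   # unmatched hypotheses
        gt += len(objects)                    # g_t accumulated
    motp = total_dist / matches if matches else 0.0
    mota = 1.0 - (misses + fps + mmes) / gt if gt else 1.0
    return motp, mota
```

On the Figure 3 sequence (16 misses over 20 objects, no false positives or mismatches), this yields MOTA = 1 - 16/20 = 0.2, matching the 80% global miss ratio discussed above rather than the misleading 50% per-frame average.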
They represent the first international evaluations of their kind, using large databases of annotated multimodal data, and aimed to provide a platform for researchers to benchmark systems for acoustic and visual tracking, identification, activity analysis, event recognition, and so forth, using common task definitions, datasets, tools, and metrics. They featured a variety of tasks related to the tracking of humans or other objects in natural, unconstrained indoor and outdoor scenarios, and presented new challenges to systems for the fusion of multimodal and multisensory data. A complete description of the CLEAR evaluation workshops, the participating systems, and the achieved results can be found in [22, 23].

The authors wish to stress here that these evaluations represent a systematic, large-scale effort using hours of annotated data and a substantial number of participating systems, and can therefore be seen as a true practical test of the usefulness of the MOT metrics. The experience from these workshops was that the MOT metrics were indeed applicable to a wide range of tracking tasks, made it easy to gain insights into tracker strengths and weaknesses and to compare overall system performances, and helped researchers publish and convey performance results that are objective, intuitive, and easy to interpret.

In the following, the various CLEAR tracking tasks are briefly presented, highlighting the differences and specificities that make them interesting from the point of view of the requirements posed to evaluation metrics. While in 2006 there were still some exceptions, in 2007 all tasks related to tracking, for single or multiple objects and for all modalities, were evaluated using the MOT metrics.

3.1. 3D visual person tracking

The CLEAR 2006 and 2007 evaluations featured a 3D person tracking task, in which the objective was to determine the location on the ground plane of persons in a scene. The scenario was that of small meetings or seminars, and several camera views were available to help determine 3D locations. Both the tracking of single persons (the lecturer in front of an audience) and of multiple persons (all seminar participants) were attempted. The specifications of this task posed quite a challenge for the design of appropriate performance metrics: measures such as track merges and splits, usually found in the field of 2D image-based tracking, had little meaning in the 3D multicamera tracking scenario. On the other hand, errors in location estimation had to be carefully distinguished from false positives and false track associations. Tracker performances were to be intuitively comparable for sequences with large differences in the number of ground truth objects, and thus varying levels of difficulty. In the end, the requirements of the 3D person tracking task drove much of the design choices behind the MOT metrics. For this task, error calculations were made using the Euclidean distance between hypothesized and labeled person positions on the ground plane, and the correspondence threshold was set to 50 cm. Figure 4 shows examples of the scenes from the seminar database used for 3D person tracking.

3.2. 2D face tracking

The face tracking task was to be evaluated on two different databases: one featuring single views of the scene and one featuring multiple views to help better resolve problems of detection and track verification. In both cases, the objective was to detect and track faces in each separate view, estimating not only their position in the image, but also their extension, that is, the exact area covered by them. Although in the 2006 evaluation a variety of separate measures were used, in the 2007 evaluation the same MOT metrics as in the 3D person tracking task, with only slight variations, were successfully applied. In this case, the overlap between hypothesized and labeled face bounding boxes in the image was used as the distance measure, and the distance error threshold was set to zero overlap. Figure 5 shows examples of face tracker outputs on the CLEAR seminar database.

3.3. 2D person and vehicle tracking

Just as in the face tracking task, the 2D view-based tracking of persons and vehicles was also evaluated on different sets of databases representing outdoor traffic scenes, using only slight variants of the MOT metrics. Here also, bounding box overlap was used as the distance measure. Figure 6 shows a scene from the CLEAR vehicle tracking database.

3.4. 3D acoustic and multimodal person tracking

The task of 3D person tracking in seminar or meeting scenarios also featured an acoustic subtask, where tracking was to be achieved using the information from distributed microphone networks, and a multimodal subtask, where the combination of multiple camera and multiple microphone inputs was available. It is noteworthy here that the MOT measures could be applied with success to the domain of acoustic source localization, where overall performance is traditionally measured using rather different error metrics and is decomposed into speech segmentation performance and localization performance. Here, the miss and false positive errors in the MOTA measure accounted for segmentation errors, whereas the MOTP expressed localization precision. In contrast to visual tracking, mismatches were not considered in the MOTA calculation, as acoustic trackers were not expected to distinguish the identities of speakers, and the resulting variant, the A-MOTA, was used for system comparisons. In both the acoustic and multimodal subtasks, systems were expected to pinpoint the 3D location of active speakers, and the distance measure used was the Euclidean distance on the ground plane, with the threshold set to 50 cm.
Figure 4: Scenes from the CLEAR seminar database used in 3D person tracking (sites: ITC-irst, UKA, AIT, IBM, UPC).

Figure 5: Scenes from the CLEAR seminar database used for face detection and tracking (sites: UKA, AIT).

Figure 6: Sample from the CLEAR vehicle tracking database (i-LIDS dataset [6]).

Figure 7: Results for the CLEAR'07 3D multiple person tracking visual subtask.

Site/system   MOTP     Miss rate   False pos. rate   Mismatches   MOTA
System A      92 mm    30.86%      6.99%             1139         59.66%
System B      91 mm    32.78%      5.25%             1103         59.56%
System C      141 mm   20.66%      18.58%            518          59.62%
System D      155 mm   15.09%      14.5%             378          69.58%
System E      222 mm   23.74%      20.24%            490          54.94%
System F      168 mm   27.74%      40.19%            720          30.49%
System G      147 mm   13.07%      7.78%             361          78.36%

4. EVALUATION RESULTS

This section gives a brief overview of the evaluation results from select CLEAR tracking tasks. The results serve to demonstrate the effectiveness of the proposed MOT metrics and act as a basis for discussion of inherent advantages, drawbacks, and lessons learned during the workshops. For a more detailed presentation, the reader is referred to [22, 23].

Figure 7 shows the results for the CLEAR 2007 visual 3D person tracking task. A total of seven tracking systems with varying characteristics participated. Looking at the first column, the MOTP scores, one finds that all systems performed fairly well, with average localization errors under 20 cm. This can be seen as quite low, considering the area occupied on average by a person and the fact that the ground truth itself, representing the projections to the ground plane of head centroids, was only labeled to 5–8 cm accuracy. However, one must keep in mind that the fixed threshold of 50 cm, beyond which an object is considered as missed completely by the tracker, prevents the MOTP from rising too high. Even for a uniform distribution of localization errors, the MOTP value would be 25 cm. This shows us that, considering the predefined threshold, System E is actually not very precise at estimating person coordinates, and that System B, on the other hand, is extremely precise when compared to ground truth uncertainties. More importantly still, it shows us that the correspondence threshold T strongly influences the behavior of the MOTP and MOTA measures. Theoretically, a threshold of T = ∞ means that all correspondences stay valid once made, no matter how large the distance between object and track hypothesis becomes. This reduces the MOTA to measuring the correct detection of the number of objects, disregards all track swaps, stray track errors, and so forth, and results in a likewise unusable MOTP measure. On the other hand, if T approaches 0, all tracked objects will eventually be considered as missed, and the MOTP and MOTA measures lose their meaning. As a consequence, the single correspondence threshold T must be carefully chosen based on the application and evaluation goals at hand. For the CLEAR 3D person tracking task, the margin was intuitively set to 50 cm, which produced reasonable results, but the question of determining the optimal threshold, perhaps automatically in a data-driven way, is still left unanswered.

The rightmost column in Figure 7, the MOTA measure, proved somewhat more interesting for overall performance comparisons, at least in the case of 3D person tracking, as it was not bounded to a narrow range, as the MOTP was. There was far more room for errors in accuracy in the complex multitarget scenarios under evaluation.
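The sensitivity to the threshold T discussed above can be illustrated with a deliberately simplified single-object sketch (the helper below is hypothetical and ignores false positives and mismatches): each frame's localization error either contributes to the MOTP or, if it exceeds T, turns into a miss.

```python
def simplified_scores(errors_mm, threshold_mm):
    """Toy single-object illustration of the threshold's effect.

    errors_mm: per-frame distances between hypothesis and ground truth.
    Frames whose error exceeds the threshold count as misses; the rest
    contribute to the average localization error (MOTP). False positives
    and mismatches are deliberately ignored here.
    """
    matched = [e for e in errors_mm if e <= threshold_mm]
    misses = len(errors_mm) - len(matched)
    motp = sum(matched) / len(matched) if matched else float("nan")
    mota = 1.0 - misses / len(errors_mm)
    return motp, mota

errors = [100, 200, 300, 400, 2000]  # one gross localization error (in mm)

# T = 500 mm: the gross error becomes a miss, and the MOTP stays bounded.
print(simplified_scores(errors, 500))    # (250.0, 0.8)

# Huge T: every frame "matches", and the outlier inflates the MOTP.
print(simplified_scores(errors, 10**9))  # (600.0, 1.0)

# Near-zero T: everything is missed; both measures lose their meaning.
print(simplified_scores(errors, 1))      # (nan, 0.0)
```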
The best and worst overall systems, G and F, reached 78% and 30% accuracy, respectively. Systems A, B, C, and E, on the other hand, produced very similar numbers, although they used quite different features and algorithms. While the MOTA measure is useful for making such broad high-level comparisons, it was felt that the intermediate miss, false positive, and mismatch error measures, which contribute to the overall score, helped to gain a better understanding of tracker failure modes, and it was decided to publish them alongside the MOTP and MOTA measures. This was useful, for example, for comparing the strengths of systems B and C, which had a similar overall score.

Notice that, in contrast to misses and false positives, mismatches for the 2007 CLEAR 3D person tracking task were presented as absolute numbers, that is, as the total number of such errors made over all test sequences. This is due to an imbalance, already noted during the 2006 evaluations, for which no definite solution has been found as of yet: for a fairly reasonable tracking system and the scenarios under consideration, the number of mismatch errors made in a sequence of several minutes labeled at 1-second intervals is in no proportion to the number of ground truth objects or, for example, to the number of miss errors incurred if only one of many objects is missed for a portion of the sequence. This typically resulted in mismatch error ratios of often less than 2%, in contrast to 20–40% for misses or false positives, which considerably reduced the impact of faulty track labeling on the overall MOTA score. Of course, one could argue that this is an intuitive result, because track labeling is a lesser problem compared to the correct detection and tracking of multiple objects, but in the end the relative importance of the separate error measures is purely dependent on the application. To keep the presentation of results as objective as possible, absolute mismatch errors were presented here, but the consensus from the evaluation workshops was that according more weight to track labeling errors was desirable, for example, in the form of trajectory-based error measures, which could help move away from frame-based miss and false positive errors and thus reduce the imbalance.

Site/system   MOTP     Miss rate (dist > T)   Miss rate (no hypo)   MOTA
System A      246 mm   88.75%                 2.28%                 −79.78%
System B      88 mm    5.73%                  2.57%                 85.96%
System C      168 mm   15.29%                 3.65%                 65.44%
System D      132 mm   4.34%                  0.09%                 91.23%
System E      127 mm   14.32%                 0%                    71.36%
System F      161 mm   9.64%                  0.04%                 80.67%
System G      207 mm   12.21%                 0.06%                 75.52%

Figure 8: Results for the CLEAR’06 3D single person tracking visual subtask.

Figure 8 shows the results for the visual 3D single person tracking task, evaluated in 2006. As the tracking of a single object can be seen as a special case of multiple object tracking, the MOT metrics could be applied in the same way. Again, one can find at a glance the best performing system in terms of tracking accuracy, System D with 91% accuracy, by looking at the MOTA values. One can also quickly discern that, overall, systems performed better on this less challenging single person scenario. The MOTP column tells us that System B was remarkable among all others, in that it estimated target locations down to 8.8 cm precision. Just as in the previous case, more detailed components of the tracking error were presented in addition to the MOTA. In contrast to multiple person tracking, mismatch errors play no role (or should not play any) in the single person case. Also, as a lecturer was always present and visible in the considered scenarios, false positives could only come from gross localization errors, which is why only a detailed analysis of the miss errors was given. For better understanding, they were broken down into misses resulting from failures to detect the person of interest (miss rate (no hypo)) and misses resulting from localization errors exceeding the 50 cm threshold (miss rate (dist > T)). In the latter case, as a consequence of the metric definition, every miss error was automatically accompanied by a false positive error, although these were not presented separately for conciseness.

This effect, which is much more clearly observable in the single object case, can be perceived as penalizing a tracker twice for gross localization errors (one miss penalty and one false positive penalty). The effect is, however, intentional and desirable for the following reason: intelligent trackers that use mechanisms such as track confidence measures to avoid outputting a track hypothesis when their location estimate is poor are rewarded compared to trackers which continuously output erroneous hypotheses. It can be argued that a tracker which fails to detect a lecturer for half of a sequence performs better than a tracker which consistently tracks the empty blackboard for the same duration of time.

This brings us to a noteworthy point: just as much as the types of tracker errors (misses, false positives, distance errors, etc.) that are used to derive performance measures, precisely “how” these errors are counted, the procedure for their computation when it comes to temporal sequences, plays a major role in the behavior and expressiveness of the resulting metric.

Figure 9 shows the results for the 2007 3D person tracking acoustic subtask. According to the task definition, mismatch errors played no role, and just as in the visual single person case, the components of the MOTA score were broken down into miss and false positive errors resulting from faulty segmentation (true miss rate, true false pos. rate) and those resulting from gross localization errors (loc. error rate). One can easily make out System G as the overall best performing system, both in terms of MOTP and MOTA, with performance varying greatly from system to system. Figure 9 demonstrates the usefulness of having just one or two overall performance measures when large numbers of systems are involved, in order to gain a high-level overview before going into a deeper analysis of their strengths and weaknesses.

Figure 10, finally, shows the results for the 2007 face tracking task on the CLEAR seminar database. The main difference to the previously presented tasks lies in the fact that 2D image tracking of the face area is performed and the distance error between ground truth objects and tracker hypotheses is expressed in terms of the overlap of the respective bounding boxes.
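A common choice for such an overlap measure, and an assumption here since the exact formula is not restated in this section, is the intersection area of the two boxes divided by their union area:

```python
def box_overlap(box_a, box_b):
    """Overlap between two axis-aligned boxes given as (x1, y1, x2, y2),
    measured as intersection area over union area. A value of 1 means
    perfect agreement; disjoint boxes score 0."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Identical boxes overlap perfectly; disjoint boxes not at all.
print(box_overlap((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0
print(box_overlap((0, 0, 10, 10), (5, 0, 15, 10)))    # 0.3333...
print(box_overlap((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```

Whatever the precise definition, such a measure lies between 0 and 1, which is consistent with the range of the MOTP (overlap) scores reported for this task.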
This is reflected in the MOTP column of Figure 10, which reports overlap scores rather than distances in millimeters. As the task required the simultaneous tracking of multiple faces, all types of errors (misses, false positives, and mismatches) were of relevance and were presented along with the overall MOTA score. From the numbers, one can derive that, although systems A and B were fairly equal at estimating face extents once the faces were found, System B clearly outperformed System A when it came to detecting and keeping track of these faces in the first place. This case again serves to demonstrate how the MOT measures can be applied, with slight modifications but using the same general framework, to the evaluation of various types of trackers with different domain-specific requirements, operating in a wide range of scenarios.

Site/system   MOTP     True miss rate   True false pos. rate   Loc. error rate   A-MOTA
System A      257 mm   35.3%            11.06%                 26.09%            1.45%
System B      256 mm   0%               22.01%                 41.6%             −5.22%
System C      208 mm   11.2%            7.08%                  18.27%            45.18%
System D      223 mm   11.17%           7.11%                  29.17%            23.39%
System E      210 mm   0.7%             21.04%                 23.94%            30.37%
System F      152 mm   0%               22.04%                 14.96%            48.04%
System G      140 mm   8.08%            12.26%                 12.52%            54.63%
System H      168 mm   25.35%           8.46%                  12.51%            41.17%

Figure 9: Results for the CLEAR’07 3D person tracking acoustic subtask.

Site/system   MOTP (overlap)   Miss rate   False pos. rate   Mismatch rate   MOTA
System A      0.66             42.54%      22.1%             2.29%           33.07%
System B      0.68             19.85%      10.31%            1.03%           68.81%

Figure 10: Results for the CLEAR’07 face tracking task.

5. SUMMARY AND CONCLUSION

In order to systematically assess and compare the performance of different systems for multiple object tracking, metrics which reflect the quality and main characteristics of such systems are needed. Unfortunately, no agreement on a set of commonly applicable metrics has yet been reached. In this paper, we have proposed two novel metrics for the evaluation of multiple object tracking systems. The proposed metrics—the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA)—are applicable to a wide range of tracking tasks and allow for objective comparison of the main characteristics of tracking systems, such as their precision in localizing objects, their accuracy in recognizing object configurations, and their ability to consistently track objects over time.

We have tested the usefulness and expressiveness of the proposed metrics experimentally, in a series of international evaluation workshops. The 2006 and 2007 CLEAR workshops hosted a variety of tracking tasks for which a large number of systems were benchmarked and compared. The results of the evaluation show that the proposed metrics indeed reflect the strengths and weaknesses of the various systems in an intuitive and meaningful way, allow for easy comparison of overall performance, and are applicable to a variety of scenarios.

ACKNOWLEDGMENT

The work presented here was partly funded by the European Union (EU) under the integrated project CHIL, Computers in the Human Interaction Loop (Grant no. IST-506909).

REFERENCES

[1] M. Voit, K. Nickel, and R. Stiefelhagen, “Multi-view head pose estimation using neural networks,” in Proceedings of the 2nd Workshop on Face Processing in Video (FPiV ’05), in association with the 2nd IEEE Canadian Conference on Computer and Robot Vision (CRV ’05), pp. 347–352, Victoria, Canada, May 2005.
[2] CHIL—Computers In the Human Interaction Loop, [Link].
[3] AMI—Augmented Multiparty Interaction, [Link].
[4] VACE—Video Analysis and Content Extraction, [Link].
[5] ETISEO—Video Understanding Evaluation, [Link]/etiseo/.
[6] The i-LIDS dataset, [Link]/hosdb/cctv-imaging-technology/video-based-detection-systems/i-lids/.
[7] CAVIAR—Context Aware Vision using Image-based Active Recognition, [Link].
[8] F. Ziliani, S. Velastin, F. Porikli, et al., “Performance evaluation of event detection solutions: the CREDS experience,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS ’05), pp. 201–206, Como, Italy, September 2005.
[9] PETS—Performance Evaluation of Tracking and Surveillance, [Link].
[10] EEMCV—Empirical Evaluation Methods in Computer Vision, [Link].
[11] CLEAR—Classification of Events, Activities and Relationships, [Link].
[12] Y. Li, A. Dore, and J. Orwell, “Evaluating the performance of systems for tracking football players and ball,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS ’05), pp. 632–637, Como, Italy, September 2005.
[13] A. T. Nghiem, F. Bremond, M. Thonnat, and V. Valentin, “ETISEO, performance evaluation for video surveillance systems,” in Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS ’07), pp. 476–481, London, UK, September 2007.
[14] R. Y. Khalaf and S. S. Intille, “Improving multiple people tracking using temporal consistency,” MIT Department of Architecture House n Project Technical Report, Massachusetts Institute of Technology, Cambridge, Mass, USA, 2001.
[15] A. Mittal and L. S. Davis, “M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo,” in Proceedings of the 7th European Conference on Computer Vision (ECCV ’02), vol. 2350 of Lecture Notes in Computer Science, pp. 18–33, Copenhagen, Denmark, May 2002.
[16] N. Checka, K. Wilson, V. Rangarajan, and T. Darrell, “A probabilistic framework for multi-modal multi-person tracking,”
in Proceedings of the IEEE Workshop on Multi-Object Tracking (WOMOT ’03), Madison, Wis, USA, June 2003.
[17] K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, “A joint particle filter for audio-visual speaker tracking,” in Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI ’05), pp. 61–68, Trento, Italy, October 2005.
[18] H. Tao, H. Sawhney, and R. Kumar, “A sampling algorithm for tracking multiple objects,” in Proceedings of the International Workshop on Vision Algorithms (ICCV ’99), pp. 53–68, Corfu, Greece, September 1999.
[19] K. Smith, D. Gatica-Perez, J. Odobez, and S. Ba, “Evaluating multi-object tracking,” in Proceedings of the IEEE Workshop on Empirical Evaluation Methods in Computer Vision (EEMCV ’05), vol. 3, p. 36, San Diego, Calif, USA, June 2005.
[20] J. Munkres, “Algorithms for the assignment and transportation problems,” Journal of the Society for Industrial and Applied Mathematics, vol. 5, no. 1, pp. 32–38, 1957.
[21] NIST—National Institute of Standards and Technology, [Link].
[22] R. Stiefelhagen and J. Garofolo, Eds., Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, vol. 4122 of Lecture Notes in Computer Science, Springer, Berlin, Germany.
[23] R. Stiefelhagen, J. Fiscus, and R. Bowers, Eds., Multimodal Technologies for Perception of Humans, Joint Proceedings of the CLEAR 2007 and RT 2007 Evaluation Workshops, vol. 4625 of Lecture Notes in Computer Science, Springer, Berlin, Germany.
