Urban Vehicle Classification Systems
Doctor of Philosophy
June, 2010
Collaborating partner:
Traffic Directorate at Transport for London
Supervision:
Dr. Sergio A. Velastin (Director of Studies)
Dr. James Orwell
video in urban traffic scenes is presented. The final aim is to produce systems to
guide surveillance operators and reduce human resources for observing hundreds of
cameras in urban traffic surveillance. Cameras are a well-established means for
traffic managers to observe traffic states and improve journey experiences. Firstly,
and compared to a projected model silhouette to identify the ground plane position
and class of vehicles and pedestrians. The system has been evaluated with the
reference i-LIDS data sets from the UK Home Office. Performance has been
compared for varying numbers of classes, for three different weather conditions and
for different video input filters. The full system including detection and
texture saliency classifier has been proposed to detect people in a video frame by
identifying salient texture regions. The image is classified into foreground and
classification. The system is used for the task of detecting people entering a sterile
zone, a common scenario for visual surveillance. Testing has been performed on the
i-LIDS sterile zone benchmark data set of the UK Home Office. The basic detector
is extended by fusing its output with simple motion information, which significantly
Based on the good results for local features, a novel classifier has been
appearance of vehicles varies substantially with the viewing angle and local features
may often be occluded. In this thesis, full 3D models are used for the object
categories to be detected and the feature patches are defined over these models. A
of oriented gradients) are defined. A variable set of interest points is used in the
are visible. The 3DHOG feature is compared with features based on FFT and simple
histograms and also to the motion silhouette baseline on the same data. The results
shaped motion silhouettes which can be caused by variable lighting, camera quality
and occlusions from other objects.
The proposed algorithms are evaluated further on a new data set from a
different camera with higher resolution, which demonstrates the portability of the
training data to novel camera views. Kalman filter tracking is introduced to gain
tracks of 94% outperform a baseline motion tracker (OpenCV) tested under the
same conditions. A demonstrator for bus lane monitoring is introduced using the
output of the detection and classification system. The thesis concludes with a
critical analysis of the work and the outlook for future research opportunities.
to my parents
Acknowledgments
Firstly, I would like to thank Sergio Velastin and James Orwell, my supervisors, for
their effort, help, motivation and invaluable guidance. They both went out of their
parts of the world. The specific balance between independence and useful feedback
provided a prolific work environment.
particular, many thanks to the Technology Delivery Group for making me feel very
welcome during my time in the office in London and for many conversations on the
practical aspects of traffic management. I would like to thank the people involved in
the project: Mark Cracknell, Jeremy Evans, William Lowder, John McCarthy and
Derek Renaud for their guidance and feedback.
our ‘research life’ easier. I want to thank Justin Cobb, Alberto Colombo, Jesús
Martínez del Rincón, Hussein Ragheb and Damien Simonnet. Furthermore, Fei Yin
kindly provided his tracking performance evaluation framework to test parts of the
system in addition to many valuable conversations across desks. The daily lunches
of the research group provided a strong and international community for my time in
Kingston. Thank you guys! There are many more people, who are not named in
person, to whom I am grateful.
Love and gratitude to my fiancée Marina and all my family for their
encouragement and support along my educational path. These thanks extend both to
my family in Austria and my future family-in-law in England.
PRELIMINARIES Glossary of Terms
Glossary of Terms
3DHOG 3D extended histogram of oriented gradients
DC Direct current
PF Particle filter
TV Television
- Performance Evaluation
PRELIMINARIES List of Figures
List of Figures
Figure 1 Illustrations of camera installations in London and the resulting
views........................................................................................................ 3
Figure 2 Example frames from the i-LIDS parked car data set (iLIDS, nd).
a,b) sunny conditions with shadows and reflections on cars c)
image saturation in the upper part of the image d) detail of a light
car in the saturated area where only dark elements remain visible
e) interlacing artefacts are commonly dealt with by removing
every second video line and therefore halving the resolution f)
raining condition with reflections g) rain during dusk h) headlight
reflections during night. ........................................................................ 11
Figure 3 Block diagram for a top-down surveillance system. The grouping of
pixels in the foreground mask into silhouettes that represent
objects is done early with a simple algorithm without knowledge
of object classes. .................................................................................... 14
Figure 4 Block diagram for a bottom-up surveillance system. Local image
patches are first extracted from the input image and classified as
being a specific part of a trained object class. Those identified
parts are combined into objects based on the class through a
grouping or voting process. Advanced tracking concepts (Leibe
et al., 2008b) allow this grouping to be performed in the spatial-
temporal domain, which directly produces an object trajectory
rather than frame per frame object detections. ...................................... 15
Figure 5 Example views from the i-LIDS data set with detected vehicles and
pedestrians. The left image also shows an ambiguous foreground
region (thin blue outline) on the top left, which was classified as
class ‘other’ and that, consequently, has no wire frame. The
outlines of regions of interest R are shown as dark red rectangles
on the road. ............................................................................................ 53
Figure 6 Block diagram of the detection and classification system ......................... 55
Figure 7 Example pictogram structure of the detector corresponding to the
block diagram in Figure 6. The mean background images of the
GMM modes are shown along the bottom, followed on top by the
foreground mask and connected components $S$. .................................. 58
Figure 8 Wire frame models $F_i$ used for classification. Refer to Table 1 for
model and class correspondences. ......................................................... 59
Figure 9 Illustration of the projection process of models. The wire frame of
models is projected to the camera view and flood filled. ...................... 61
Figure 10 Illustration of model matching process. The normalised overlap
between silhouettes and model mask is calculated. .............................. 62
Figure 11 Match measure for one silhouette $S$. The upper left image shows
the silhouette and best fitting model $i_S$ at ground plane position
$(x, y)$. Top right: the winning match surface $\max_i M_{p,i,S}$ with
data points. Bottom: cross-section through every model's match
surface $M_{p,i,S}$ along the minimum and maximum decay
direction at $(x, y)$. The legend label ID corresponds to the model
index $i$ in Table 1. ................................................................................ 63
Figure 12 Illustration of data flow for the classification framework. This
corresponds to the classifier block in Figure 6. One example
silhouette $S$ with centroid $c$ in green is shown. The map on the
bottom left illustrates the ground plane hypotheses $h_p$ as green
crosses. The model projected on the red position results in the red
flood filled model mask $M$, shown as example for a single
hypothesis. The normalised overlap operation $M_{p,i,S}$ is
illustrated in the middle of the classifier for the example silhouette
$S$ and model mask $M$ .......................................................................... 65
Figure 13 Examples of correct detections and classification of vehicles using
the silhouette classifier with shadow removal filter .............................. 72
Figure 14 Top: Two examples for false positives due to pedestrians being
detected as bike and as car due to occlusion in a group. The
bottom left image shows a car being misclassified as bike as it
turns into the car park. The last image shows a missed car due to
its similar colour compared to the saturated road area .......................... 73
Figure 15 Performance comparison for the classification framework without
pedestrian models using 4 different filter algorithms: shadow
removal (Sr), shadow removal with de-interlacing (Sr+Di), de-
interlacing (Di) and no filter (-). The left diagram shows system
recall $R$, precision $P$ and classifier precision $P_C$. The right
diagram indicates the detector recall $R_D$ and precision $P_D$ .................. 75
Figure 16 Two example views of the classification framework including
pedestrian models for two different filter configurations from top:
shadow removal (Sr) and bottom: shadow removal with de-
interlacing (Sr+Di). The car at the bottom left is missed, because
of a tighter silhouette of the car in addition to the saturation
artefact. .................................................................................................. 76
Figure 17 Two example views of the classification framework including
pedestrian models for the remaining filter configurations top: de-
interlacing (Di) and bottom: no filter (-). Too large silhouettes
(their perimeters shown in blue) can be observed when shadow
removal is not carried out, causing missed vehicles and wrong
classifications (last two columns). ........................................................ 77
Figure 18 Performance comparison for the motion silhouette classifier under
three different weather conditions ......................................................... 81
Figure 19 Sunny examples top: true positive, bottom: false positive car and
missed car .............................................................................................. 82
Figure 20 Overcast examples top: two correct frames and bottom: one
misclassified frame and one wrong detection due to saturation in
the image. .............................................................................................. 83
Figure 21 Changing weather examples: Two correct frames at the top and two
misclassified frames at the bottom. ....................................................... 85
Figure 22 Examples of the i-LIDS data set showing the two camera views,
different environmental conditions (falling snow in the middle
left) and ways the fence is approached.................................................. 91
Figure 23 Block diagram of the intrusion detector .................................................. 96
Figure 24 Example region mask $R_1$ for the approach (green) ................................ 98
Figure 25 Input frame with overlapping patches $P_{i,p}$ .............................................. 99
Figure 26 Filtered Fourier spectrum patches $\hat{P}_{i,p}$. The spectral value range is
normalised across regions to span the grey level range (the two
regions are normalised independently for display purposes). The
patches on the right illustrate the filtering by blanking the inner
and outer area of the spectral patches $\hat{P}_{i,p}$. ......................................... 100
Figure 27 Scalar features $f_{i,p}$ (right) of image patches (left). The feature
value range is normalised to span the full grey level range. The
fence and grass region are normalised independently. The grass
area shows an area distinctly different from the average, which
corresponds to an intruder. The second example along the bottom
illustrates that the method is applicable for inhomogeneous
illumination at night where intruders can be darker than the
background. ......................................................................................... 101
Figure 28 Clusters $C_{i,k}$. The left image shows the clusters for the fence and
the right image the clusters for the grass. Every image patch is
represented by one dot, where the colour indicates the cluster
label. .................................................................................................... 103
Figure 29 Final foreground patches $F_i$ detected in the example frame ................. 105
Figure 30 Intruder with trajectory $T$ for the example frame................................. 107
Figure 31 Block diagram for the intrusion detection with motion extension in
orange. ................................................................................................. 108
Figure 32 Pictogram of data flow of intrusion detection with motion
extension. This corresponds to the block diagram in Figure 31 and
uses the same colour code. The blue path gives an overview of the
basic intrusion detection described in 4.3.1. The individual images
are described in section 4.3.1. The orange path shows the
extension with a motion mask $M$ included on the bottom left. ......... 110
Figure 33 Block diagram for intrusion detector with Kalman Filter extension ..... 111
Figure 34 True positive examples of Kalman extension showing smooth
trajectories. The second extension with a Kalman filter overcomes
the problem of late detection by allowing trajectories to be
initialised purely by motion. Note the person rolling sideways in
the image on the left, which indicates the various ways the fence
is approached in the i-LIDS data set. .................................................. 112
Figure 35 Comparison of alarm triggering time. The left column shows the
frame when the system with Kalman filter triggered an alarm. The
right column shows later alarms of the system without the filter,
especially when intruders are partly occluded by the edge of the
camera for a long time. ........................................................................ 114
Figure 36 Block diagram of system implementation with frame grabber,
capture application and Matlab computer vision module. .................. 116
Figure 37 Runtime analysis of the whole system implementation with average
runtime of every module. .................................................................... 117
Figure 38 Performance for 10 seconds alarm window. Results are shown for
alarming sequences, total per view including the non alarm
sequences and total of the whole data set. ........................................... 119
Figure 39 The top left image shows a wrongly detected bird flying towards
the fence. The top right image shows a false detection due to fast
moving clouds present at the same time as fence shadows; both
errors are caused by texture. The bottom images show missed
intruders due to low lighting conditions at night................................. 120
Figure 40 Performance for 20 seconds alarm window. An improvement
compared to 10 seconds is noticeable for both intrusion detectors
due to later correct detections of slow moving people. ....................... 121
Figure 41 Example views from the i-LIDS data set with detected and
classified pedestrians and vehicles using 3DHOG .............................. 125
Figure 42 Overview of the algorithm. This block diagram outlines the
relationship between the different stages of the algorithm. ................ 127
Figure 43 3D spatial models taken from chapter 3 extended with interest
points. The interest points are illustrated as cones, which signify
the position and also normal direction of interest points. The
diameter of the cone will later be used to visualise the interest
point’s weight. ..................................................................................... 129
Figure 44 Illustration of patch image extraction .................................................... 133
Figure 45 a) Input image and b) hatchback model. The radii of cones indicate
the weights $q$ (described later) of interest points $p$. c) shows the
set of extracted image patches I I ................................................... 134
Figure 46 Feature vectors $\hat{f}_k$ generated from the set of image patches $I$ in
Figure 45. a) 3DHOG features, b) spectral features (FFT) and c)
image histogram. ................................................................................. 136
Figure 47 Illustration of histogram in the FFT feature extraction process. The
continuous red lines indicate the frequency borders, whereas the
dashed blue lines indicate the angle borders. ...................................... 138
Figure 48 Block diagram for the training of appearance models. Features are
extracted from training videos given object location annotation
and the 3D models with interest points. A Gaussian model and
subsequently normalisation coefficients and weights are
calculated from the features. ............................................................... 139
Figure 49 Example average feature distance surface $D_M$. The centre $d_C$ at
position $(0, 0)$ corresponds to the training position $x$ and
usually has the lowest value. The feature distance increases for
coordinates further away from the training position. .......................... 143
Figure 50 Estimated sigmoid function shown as a dashed line. The continuous
line is the gradient of the sigmoid function defined by the centre
value $d_C$ of the distance surface $D_M$ and the mean distance $d$ of
all grid points. ...................................................................................... 145
Figure 51 Final match measure surface $M_M$ after application of the sigmoid
function. A distinct peak at the training position can be observed.
This peak is set to the same value for all interest points of all
models. ................................................................................................ 145
Figure 52 Block diagram for the 3DHOG classifier. The general structure is
identical to the motion silhouette classifier in chapter 3. 3DHOG
features are extracted directly from the input frame based on the
ground plane hypothesis. The match measure operates in
appearance feature space in contrast to the image space for the
silhouette classifier (please compare to Figure 6 on page 55). ........... 148
Figure 53 Example of car detection with occlusion of pedestrians showing a
match measure surface with a good peak. ........................................... 149
Figure 54 True positive examples for vehicles and pedestrians using 3DHOG. ... 151
Figure 55 Two examples of errors generated with 3DHOG. Left: Missed car
due to low contrast of the vehicle bonnet and roof. Right:
Misclassified SUV as van due to similar size and appearance. .......... 152
Figure 56 Left: Correctly classified lorry with 3DHOG despite shadows and
oversized motion silhouette. Right: wrongly detected pedestrian at
the front edge of a lorry due to vertical edges. .................................... 156
Figure 57 Comparison of the 3DHOG classifier (left) with the motion
silhouette classifier (right). The first image shows a position offset
and wrong classification of the pedestrian of the silhouette
classifier due to the tree shadow. In comparison, the 3DHOG
classifier correctly identifies the pedestrian within the silhouette
and aligns the car better. The bottom images show a missed
vehicle of the silhouette classifier, because the silhouette is too
small due to the overexposed camera view. The pedestrian is
detected as bicycle due to similar size. Both problems are resolved
with the 3DHOG classifier. ................................................................. 157
Figure 58 3DHOG performance figures for patch sizes of 0.8m at
resolution 20 pixel/m and 0.5m at 16 pixel/m ............. 158
Figure 59 Example images from the new data set containing a transmission
artefact on the left. The right image shows the classified car,
which is typically fully visible. Partly occluded cars (at the
camera’s edge) are mostly ignored. ..................................................... 161
Figure 60 3DHOG classification results of 4 separate frames. The blue outline
shows the estimated motion foreground. The 3DHOG classifier
produces the wire frame and the respective 3D location of the
vehicle on the road map. Good localisation performance is
demonstrated even for the third example containing a transmission
artefact. ................................................................................................ 164
Figure 61 Two examples comparing the 3DHOG results with the industrial
baseline in Figure 59. The left image shows correct detection
despite the artefact. This example operates without region of
interest, which is why the occluded vehicles are detected. The
right image shows a later frame with active region of interest to
remove occluded vehicles to avoid excessive false positives. ............ 165
Figure 62 Example for partially occluded vehicles. This image illustrates that
very limited visibility of vehicles is sufficient for detection and
potentially classification. The algorithm for dealing with
incomplete representations of objects is a core part of the 3DHOG
classifier framework. Occlusion is resolved seamlessly in the
same way as variable visibility of the 3D models depending on the
camera view and vehicle orientation. .................................................. 166
Figure 63 The motion silhouette classifier produces the wire frame and the
respective 3D location of the vehicle on the road map. The wire
frames are slightly offset and the last stopped van was missed due
to ambiguous silhouette shape merging into the background, but
correctly classified by 3DHOG (Figure 60). ....................................... 168
Figure 64 Two examples comparing the motion silhouette results with the
industrial baseline in Figure 59. The left image shows correct
detection of the central car despite the artefact. This example
operates without region of interest, which is why the occluded
vehicles are detected. The silhouette based classifier does not
classify occluded vehicles as well as 3DHOG, especially the
location performance is worse. The right image shows a later
frame with active region of interest to remove occluded vehicles
to avoid false positives. ....................................................................... 169
Figure 65 Block diagram of detector with 3D classifier and subsequent
tracker. ................................................................................................. 171
Figure 66 Example of detection and classification with ground plane tracking.
The wire frame projection in red is used to estimate the bounding
box for tracked vehicles. The information can be used for an
anonymised animation to overcome privacy limitations of video
data. ..................................................................................................... 172
Figure 67 Correctly detected tracks inside the active regions of interest (dark
red boxes). Left: the proposed system with corresponding ground
plane tracks. Right: OpenCV tracker result. Note the spatially
fragmented tracks for the baseline in the first row and the correct
number of tracks for the proposed tracker. ........................................ 177
Figure 68 In this frame, the second car is missed due to occlusion between the
vehicles. The proposed tracker on the left correctly locates the
first car. The OpenCV tracker merged both cars with a large
bounding box at a central position. ..................................................... 178
Figure 69 Pedestrians are correctly rejected as “other” class by the proposed
tracker and detected by the OpenCV tracker. ..................................... 178
Figure 70 Bus lane intrusion detection example. The restricted zone is marked
as green region permitting only buses. Vehicles of a restricted
class with tracks inside this region trigger alarms shown in red in
the lower image. .................................................................................. 180
Figure 71 Estimated sigmoid function shown as a dashed line. The continuous
line is the gradient of the sigmoid function defined by the centre
value $d_C$ of the distance surface $D_M$ and the mean distance $d$ of
all values. ............................................................................................. 191
Figure 72 Snapshot of graphical user interface for parameter selection.
Default values for parameters are displayed, selected modules can
be configured and help is available for every element. ....................... 201
PRELIMINARIES List of Tables
List of Tables
Table 1 Class configuration table $T$ showing correspondences between model
ID $i$ and class ID $j$ (note that a class may correspond to more
than one model) ..................................................................................... 61
Table 2 Confusion matrix and overall silhouette classifier performance for
vehicle only operation using shadow removal ...................................... 71
Table 3 Confusion matrix for the silhouette classifier using shadow removal
and per class evaluation......................................................................... 71
Table 4 Confusion matrix of the motion silhouette classifier when using the
shadow removal filter including overall performance figures for
pedestrians ............................................................................................. 78
Table 5 Classifier and class wise performance figures for the motion
silhouette classifier when using the shadow removal filter .................. 78
Table 6 Confusion matrix and system performance of the motion silhouette
classifier for shadow removal and de-interlacing filter ........................ 79
Table 7 Confusion matrix for the motion silhouette classifier with de-
interlacing filtering and with no filter. More tables for those cases
are provided in appendix C.1. ............................................................... 80
Table 8 Confusion matrix for sunny conditions ....................................................... 82
Table 9 Confusion matrix for overcast conditions ................................................... 84
Table 10 Confusion matrix for changing conditions ............................................... 85
Table 11 Detailed numbers of TP, FP and FN with F1 measures for all four
systems with alarm window setting of 20 seconds. ............................ 122
Table 12 Parameters used during evaluation of the 3DHOG classifier ................. 150
Table 13 Extended confusion matrix for 3DHOG with total system
performance ......................................................................................... 153
Table 14 Classifier confusion matrix for 3DHOG and class wise results.............. 153
Table 15 Extended confusion matrix for FFT features with total system
performance ......................................................................................... 154
Table 16 Extended confusion matrix for histogram features with total system
performance ......................................................................................... 154
Table 17 Extended confusion matrix for 3DHOG with total system
performance including pedestrians...................................................... 155
Table 18 Classifier confusion matrix for 3DHOG and class wise results
including pedestrians ........................................................................... 155
Table 19 Industrial classifier confusion matrix and full system (detection and
classification) confusion matrix. Both matrices are identical, as
the system output for detection is used as ground truth. The
classifier shows a strong misclassification of vans as lorries.............. 162
Table 20 Industrial classifier total system performance and class wise
evaluation ............................................................................................ 162
Table 21 The 3DHOG classifier exhibits good performance. The high number
of false negatives is due to stationary objects at the traffic lights,
which are not picked up by the motion detection. Some detections
reported as false positives were actually cars, but were not picked
up by the industrial classifier used as baseline. Examples of both
of those issues are illustrated in Figure 59. ......................................... 165
Table 22 Total system performance for the 3DHOG classifier. The
classification outperforms the industrial baseline. .............................. 166
Table 23 Confusion matrix for the silhouette classifier. The detection rate is
lower than 3DHOG. This is due to ambiguous motion silhouettes
for stopped vehicles merging with the background. Those
silhouettes do not match a model well, but are sufficient for the
3DHOG classifier to start a search. ..................................................... 169
Table 24 Classification and full system performance for silhouette classifier. ..... 170
Table 25 Tracking results ....................................................................................... 179
Table 26 Classifier confusion matrix and overall performance figures for the
motion silhouette classifier with de-interlacing filter ......................... 202
Table 27 Class wise performance figures for the motion silhouette classifier
with de-interlacing filter ...................................................................... 202
Table 28 Classifier confusion matrix and overall performance figures for the
motion silhouette classifier without additional filter........................... 203
Table 29 Class wise performance figures for the motion silhouette classifier
without additional filter ....................................................................... 203
Table 30 Classifier confusion matrix and overall performance figures for the
motion silhouette classifier under sunny conditions ........................... 203
Table 31 Class wise performance figures for the motion silhouette classifier
under sunny conditions........................................................................ 204
Table 32 Classifier confusion matrix and overall performance figures for the
motion silhouette classifier under overcast conditions ....................... 204
Table 33 Class wise performance figures for the motion silhouette classifier
under overcast conditions .................................................................... 204
Table 34 Classifier confusion matrix and overall performance figures for the
motion silhouette classifier under changing weather conditions ........ 205
Table 35 Class wise performance figures for the motion silhouette classifier
under changing weather conditions ..................................................... 205
Table 36 Confusion matrix and system performance for the 3DHOG classifier
with patch size of 0.8m at resolution 20 pixel/m ............... 206
Table 37 Classifier confusion matrix and class wise performance for the
3DHOG classifier with patch size of 0.8m at resolution
20 pixel/m ................................................................................... 206
Table 38 Confusion matrix and system performance for the 3DHOG classifier
with patch size of 0.5m at resolution 16 pixel/m ................ 207
Table 39 Classifier confusion matrix and class wise performance for the
3DHOG classifier with patch size of 0.5m at resolution
16 P ixel m .................................................................................... 207
PRELIMINARIES Content
Content
1. Introduction
1.1. Video Analytics for Traffic Management
1.2. Scope and Outline
1.3. Contribution
2. Review
2.1. Introduction
2.2. Video Analytics deployed in the Traffic Domain
2.2.1. Vehicle Counting
2.2.2. Automatic Number Plate Recognition
2.2.3. Incident Detection
2.3. Elements of Traffic Analysis Systems
2.3.1. Foreground Segmentation
2.3.1.1. Frame Differencing
2.3.1.2. Background Subtraction
2.3.1.3. Gaussian Mixture Model
2.3.1.4. Graph Cuts
2.3.1.5. Shadow Removal
2.3.1.6. Object based Segmentation
2.3.2. Top-Down Vehicle Classification
2.3.2.1. Features
2.3.2.2. Machine Learning
2.3.3. Bottom-up Classification
2.3.3.1. Interest Point Descriptors
2.3.3.2. Boosting
2.3.3.3. Explicit Shape
2.3.3.4. Object Classification without Explicit Shape Structure
2.3.4. Tracking
2.4. Complete Traffic Analysis Systems
2.4.1. Urban
2.4.1.1. Analysis in the Camera Domain
2.4.1.2. 3D Modelling
2.4.2. Highways
2.4.2.1. Detection
2.4.2.2. Classification
2.5. Discussion
2.5.1. Challenges
2.5.2. Data Sets
2.6. Future Research and Thesis Outline
3. Motion Silhouette Classifier
3.1. Introduction
3.1.1. Outline of the Proposed Approach
3.2. Detection
3.3. Classification
3.4. Evaluation
3.4.1. Metrics
3.4.2. Data set
3.4.3. Detection and Classification without Pedestrian Models
3.4.3.1. State of the art literature
3.4.3.2. Input filter comparison
3.4.4. Detection and Classification with all Road User Models
3.4.4.1. Shadow removal filter
3.4.4.2. Shadow removal and de-interlacing filter
3.4.4.3. De-interlacing filter and no filter
3.4.5. Influence of Weather Conditions
3.4.5.1. Sunny conditions
3.4.5.2. Overcast condition
3.4.5.3. Overcast changing to sunny
3.5. Summary
4. Local Features for Human Detection
4.1. Introduction
4.2. Related Work
4.2.1. Overall Approach
4.3. Intrusion Detector
4.3.1. Foreground Estimation
4.3.1.1. Local patch generation from region masks
4.3.1.2. Fourier transform of individual patches
4.3.1.3. Noise reduction and feature generation from frequency spect...
4.3.1.4. Clustering features
4.3.1.5. Classification of patches into foreground and background
4.3.2. Intrusion Rule
4.4. Detector Extensions
4.4.1. Motion Extension
4.4.2. Kalman Filter Extension
4.5. i-LIDS Testing
4.5.1. The Data
4.5.2. Framework
4.5.3. Runtime Analysis
4.6. Results
4.6.1. Metrics
4.6.2. Baseline
4.6.3. Analysis
4.7. Summary
5. 3DHOG Classifier
5.1. Introduction
5.2. Related Work
5.2.1. 3DHOG Detector and Classifier
5.3. Defining Spatial 3D Models
5.4. Extracting Local Features
5.4.1. Extracting Normalised Image Patches
5.4.2. Generating Patch Features
5.4.2.1. 3D Histogram of Oriented Gradients (3DHOG)
5.4.2.2. FFT feature
5.4.2.3. Histogram feature
5.5. Training Appearance Models
5.5.1. Training Data and Annotation
5.5.2. Gaussian Appearance Model for Interest Points
5.5.3. Sigmoid Parameters for Model Normalisation
5.5.3.1. Distance surface at training position
5.5.3.2. Sigmoid function
5.5.4. Interest Point Weights
5.6. Classification Framework
5.6.1. Match Measure between Model and Image
5.7. Evaluation
5.7.1. Feature Comparison for Vehicle Detection and Classification
5.7.2. Simultaneous Operation for All Road Users
5.7.3. Influence of Patch Size
5.8. Summary
6. Applications
6.1. Introduction
6.2. Occlusion and Portability
6.2.1. Industrial Classifier Results
6.2.2. 3DHOG Results
6.2.3. Motion Silhouette Results
6.2.4. Comparison
6.3. Tracking
6.3.1. Kalman Filter
6.3.2. Evaluation Framework
6.3.3. Results
6.4. Behaviour Analysis
6.5. Summary
7. Conclusions
7.1. Summary
7.2. Discussion
7.3. Future Work
7.4. Publications
7.4.1. Accepted conditional to changes
7.4.2. In preparation for journal
7.4.3. Presentations
7.5. Personal Statement
CHAPTER 1 INTRODUCTION 1.1 Video Analytics for Traffic Management
1. Introduction
“Concern for man and his fate must always form the chief interest of all
technical endeavours. Never forget this in the midst of your diagrams
and equations.”
- Albert Einstein
efficient monitoring of traffic. In turn, this has increased the scope for automatic
analysis of urban traffic activity from CCTV in recent years. This increase can be
of analytical techniques to process the video (and other) data together with
contextual information from video. The main concept is to aid human operators in
observing video data. This can allow online and post-event detection of events of
interest, which is useful for traffic management due to additional data available. The
guidance for the operators to pick cameras to view and accumulate statistics, with
the aim to improve traffic flow. Video cameras have been deployed for a long time
for traffic and other monitoring purposes, because they provide a rich information
source for human understanding. Video analytics may now provide added value to
With 1200 cameras and over 100 monitors, it is not feasible to continuously
monitor every CCTV camera installed within Transport for London's (TfL) road
network, which demonstrates the bottleneck described above. In fact, it has been
shown that for manual monitoring the accuracy of detection significantly decreases
over time. Therefore, the development of a technology that provides automatic and
relevant real-time alerts to Traffic Co-ordinators can have an immediate and long
These include the detection of traffic violations (illegal turns, one way streets, etc.)
and the identification of road users. For the latter task, the most reliable approach to
date is either through recognition of the number plates (ANPR) or radio frequency
of the interactions between road users, etc. that may be possible with computer
vision using standard cameras. Thus, for the monitoring objectives outlined above,
the detection and classification of road users is a key task. However, using general
surveillance data is generally poor and the range of operational conditions (night-time,
low angle and changeable weather that affects the auto-iris) requires robust
object recognition is important for the understanding of the methods used. Object
recognition tasks typically focus on high resolution images (megapixel range) with
few constraints on the viewing angle. The Visual Object Classes (VOC) challenge
(Everingham et al., 2009) gives precise definitions for classification, detection and
In contrast to the above, traffic surveillance systems deal with low camera
resolution. Many current installations include analogue PAL and NTSC cameras,
which only provide a limited amount of visual detail of road users. The monitoring
proposed approaches. The scenes are usually more constrained than in object
recognition, with cameras mounted on poles above roads (illustrated in Figure 1).
operators take control over a camera. Many algorithms use this assumption to
extract information when no operator is observing the camera and information may
CHAPTER 1 INTRODUCTION 1.2 Scope and Outline
general, this surveillance task itself is not as well defined as, for example,
image retrieval, and no scientific benchmarking challenge has yet taken place.
Prospective users of this technology have to evaluate it on a case-by-case
basis.
In early 2006, TfL launched the Image Recognition and Incident Detection
(IRID) project. This project was tasked to review the current image processing
market and see how it met TfL's detection requirements. Testing was carried out on
the following criteria: congestion, stopped vehicles, banned turns, vehicle counting,
subway monitoring and bus detection (Cracknell, 2007, Cracknell, 2008). Results
from this testing show good performance in congestion detection (80% precision),
but poor performance in tracking-based detection (~20% precision), clearly showing
limitations in capability. This PhD project was sponsored by TfL to investigate low
developed for the detection and classification of vehicles from currently installed
monitoring, which will be described in more detail in the literature review. The
Based on requirements of TfL, five generic classes are identified for the
classifier:
Bus/Lorry
Van
Car/Taxi
Motorbike/Bicycle
Pedestrian.
for every single camera is not feasible. Camera ground plane calibration can be
manually. A demonstrator for tracking and some behaviour analysis is in the scope
of the thesis, whereas more complex analysis of the generated metadata will be
future work.
general framework used throughout the thesis will be introduced in section 2.6
based on the literature. Vehicle detection and classification is first solved by motion
estimation and 3D models for vehicles in chapter 3. The motion estimation is state
of the art for stationary surveillance cameras. The 3D model approach allows
vehicles. Camera calibration is sufficient to use the models for any view. However,
the classification relies on motion silhouettes (binary masks), which are noisy and can
be affected by camera shake, shadows, occlusion and so on.
information of the input image is available in the binary mask. The appearance
CHAPTER 1 INTRODUCTION 1.3 Contribution
features show good performance for discriminating people from background based
on appearance in single frames.
framework in chapter 5. This on the one hand uses the complete image information
allowing it to work on still images and on the other hand provides portability due to
the novel integration with 3D models. For features, FFT as before, histogram and
histogram of oriented gradients (HOG) are used. In this way, the approach moves
the object recognition domain into video surveillance. The algorithm implicitly
Portability and occlusion performances are evaluated on a new data set. Tracking is
incorporated into the framework and a demonstrator for bus lane monitoring is
introduced. The thesis concludes with chapter 7 with a critical discussion of the
work presented and the outlook for future work. Additional implementation details
of the proposed framework and tables not included in the main text are available in
the appendix.
1.3. Contribution
A comprehensive review of visual traffic analysis systems and related methods of
computer vision is presented in chapter 2. The main gaps in literature identified are
firstly the classification of vehicles based on richer information than the motion
silhouette size. Secondly, coverage for urban environments is historically less than
for highways. This section will introduce the contributions of the thesis with respect to
the literature and the problem definition in section 1.2.
models for road user classification. In particular, matching those 3D models with
closed contours extracted from motion foreground is novel. Several methods for
background refinement are available. All main road users are detected and classified
with a single framework. The second contribution is a unified framework for the use
of 3D models, which will allow seamless integration of appearance later on. The
third contribution is the system evaluation on a public data set. Evaluation results
are presented on the i-LIDS data set from the UK Home Office which can be
licensed by research institutions and manufacturers (iLIDS, nd). This will provide
the baseline for the following chapters.
saliency classifier is proposed for human intrusion detection in still images. People
are not assumed to be upright, in contrast to most pedestrian detectors in the literature,
e.g. (Dalal and Triggs, 2005, Jones and Snow, 2008). Salient objects are detected in
real-time, based on spectral texture features of local image patches. The basic
classifier is extended with a novel fusion of the saliency and a simple inter-frame
difference motion mask. A second extension uses Kalman filtering and allows
contribution is the testing of the algorithms on the i-LIDS sterile zone data set (full
results with the state of the art OpenCV blob tracker are provided. Finally, detailed
runtime and complexity analysis for the framework is presented.
chapter 5: Firstly, the 3D spatial models are extended to incorporate the location of
interest points from which local features are extracted. The local features are
image histograms or FFT features. The combination of 3D interest points and HOG
is hence introduced as the novel 3DHOG feature. Image patches are arranged on a
3D surface rather than a 2D grid, which preserves the advantages of both 3D model
and local features. Performance is evaluated, comparing 3DHOG with FFT and
silhouettes and can be applied to stationary objects, still images or moving cameras
first contribution is the evaluation on a new data set and comparison with an
the 3D vehicle detector and classifier by tracking on the ground plane. A variable
classifying before tracking. The evaluation framework of (Yin et al., 2007) is used
bounding boxes. The performance of the 3D model based ground plane tracker is
compared to a state of the art blob tracker. The final contribution of the chapter is
the introduction of a bus lane monitor to generate alarms for prohibited vehicles
entering a restricted zone.
with 3D models for object detection and classification (3DHOG). 3D models and
local features are evaluated independently in the surveillance domain first. The
3DHOG algorithm is then tested for urban traffic analysis scenarios and its
properties are investigated. The next chapter will discuss related work as
background for the remainder of the thesis.
CHAPTER 2 REVIEW 2.1 Introduction
2. Review
2.1. Introduction
This chapter will focus on recent approaches for road side cameras in urban
autonomous driving can be found in (Sun et al., 2006) and a conference paper (Sun
2008) and (Valera and Velastin, 2005) with a particular focus on distributed
surveillance systems. Figure 2 shows some example camera views from the i-LIDS
data set (iLIDS, nd) provided by the Home Office of the United Kingdom.
systems for video analytics is considered in section 2.2. The generic elements of
traffic analysis systems are introduced with examples in section 2.3. Considering
complete systems in section 2.4, the full surveillance task from reading a video
discussions and the outlook for future research are provided in section 2.5. The
chapter concludes with an overview of the thesis with respect to the review in section
2.6.
Figure 2 Example frames from the i-LIDS parked car data set (iLIDS, nd): a, b) sunny
conditions with shadows and reflections on cars; c) image saturation in the upper
part of the image; d) detail of a light car in the saturated area, where only dark
elements remain visible; e) interlacing artefacts, which are commonly dealt with by removing
every second video line and therefore halving the resolution; f) rainy conditions with
reflections; g) rain during dusk; h) headlight reflections during night.
CHAPTER 2 REVIEW 2.2 Video Analytics deployed in the Traffic Domain
first part in section 2.2.1 will focus on vehicle counting, which is mainly applied to
application typically used for tolling and discussed in section 2.2.2. The most
challenging and least solved problem holding the highest research potential is
incident detection in section 2.2.3.
loops. Those loops provide high precision, but are very intrusive to the road
pavement and therefore come with a high maintenance cost. Most video analytics
more detailed statistics (Traficon, nd, Citilog, nd, Ipsotek, nd, Autoscope, nd, CRS,
nd). Some systems have also been adapted for urban environments, with cameras
mounted on high poles. This provides a higher viewing angle, which limits the
highways. However, those highly mounted cameras are specifically for video
analytics, because standard CCTV cameras for human operators are mounted lower.
ANPR is a very specialised and well researched application for video analytics.
There is a vast range of companies e.g. (CRS, nd, Virage, nd) providing solutions
Cameras are highly zoomed to provide a high resolution image of the number plate,
but thereby lose the context of the scene. Active infrared lighting is often used
to exploit the reflective nature of the number plate. The task is simplified by the fact
Toll stations of freeways have dedicated lanes with cameras, where registered users
can pass slowly without stopping. In contrast, inner-city congestion charge systems
(e.g. Stockholm, London, Singapore) have to be less intrusive and operate on the
normal flow of passing traffic. Point-to-point travel time statistics are obtained from
re-identification of vehicles with time stamps across the road network.
the two above approaches. Examples for highways are the detection of accidents
(Traficon, nd, Citilog, nd, Ipsotek, nd, Autoscope, nd, CRS, nd) and stopped
vehicles. Tunnel surveillance also focuses on smoke detection for warning of tunnel
fires. Hard shoulder running has been rolled out as a pilot in the UK including video
analytics from (Ipsotek, nd). The hard shoulder of a motorway is turned into a
running lane during peak time, which requires reliable inspection for obstacles and
monitoring for incidents during operation.
detection is being rolled out in London (Cracknell, 2008) based on existing CCTV
cameras including systems from (Ipsotek, nd). Existing systems could not
demonstrate acceptable results for practical deployment for other scenarios at the
time of the study. Detection of illegal parking is the objective for one data set from
i-LIDS (iLIDS, nd). A high level of position accuracy is required for illegal turning,
bus lane monitoring and box junctions. Those target applications also require
from a zoomed out camera. A system for detecting 'car park surfing' is available
from (Ipsotek, nd), which monitors if pedestrians move from car to car. This is
CHAPTER 2 REVIEW 2.3 Elements of Traffic Analysis Systems
Figure 3 Block diagram for a top-down surveillance system. The grouping of pixels in
the foreground mask into silhouettes that represent objects is done early with a
regarded as usual behaviour before a theft to identify target vehicles. The next
section will focus on algorithmic aspects of systems, whereas section 2.4 will revisit
applications with respect to the literature.
introduced. To structure the presentation, the literature has been grouped into top-
tracking (section 2.3.4). See Figure 3 for a block diagram. A statistical model
typically estimates foreground pixels, which are then grouped with a basic model
(e.g. connected regions) and propagated through the system until the classification
stage e.g. (Gupte et al., 2002, Morris and Trivedi, 2006a, Hsieh et al., 2006, Bloisi
and Iocchi, 2009, Creusen et al., 2009, Gao et al., 2009b). Classification then uses
to assign a class label. For the remainder of the review, this class of algorithms will
be referred to as 'top-down' or 'object-based', because pixels are grouped into
Figure 4 Block diagram for a bottom-up surveillance system. Local image patches
are first extracted from the input image and classified as being a specific part of a
trained object class. Those identified parts are combined into objects based on the
et al., 2008b) allow this grouping to be performed in the spatio-temporal domain,
which directly produces an object trajectory rather than frame-by-frame object
detections.
one which detects and classifies parts of an object first (block diagram in Figure 4).
This initial classification of the parts uses learned prior information about the final
object classes, e.g. an image area is classified to be a car wheel or a pedestrian head
those parts into valid objects and trajectories is the final step of the algorithm e.g.
(Leibe et al., 2004, Leibe et al., 2008b, Opelt et al., 2006b). This type of approach
is typically used in generic object recognition. In the next section, the top-down
surveillance systems. The foreground regions are marked for processing in the
subsequent steps. The foreground is defined as every object that is not fixed
furniture of a scene, where 'fixed' would normally mean months or years. This
foreground, which both use strong assumptions to comply with the above definition.
about the scene background of a video sequence. The model is then compared to the
current frame to identify differences (or 'motion'), provided that the camera is
stationary. This concept lends itself well to computer implementation, but leads to
problems with slow moving traffic. Any car should be considered foreground, but
stationary objects are missed due to the lack of motion. The next sections discuss
different solutions for using motion as the main cue for foreground segmentation.
(see section [Link]). This approach can be used for moving as well as for stationary
cameras, but requires prior information for foreground object appearances.
pixel by pixel difference map is computed between two consecutive frames. This
difference is thresholded and used as the foreground mask. This algorithm is very fast;
however, it cannot cope with noise, abrupt illumination changes or periodic
movements in the background, such as trees. In (Park et al., 2007), frame differencing is
used to detect street-parking vehicles. Special care is taken in the algorithm to
suppress the influence of noise. Motorcycles are detected in (Nguyen and Le, 2008)
based on frame differencing. However, using more information than just the last
subtracted from the current video frame. A threshold is applied to the resulting
difference image to give the foreground mask. The threshold can be constant or
dynamic as used in (Gupte et al., 2002). The methods described below differ in the
way the background picture is obtained.
[Link].1. Averaging
In the background averaging method, all video frames are summed up. The learning
rate specifies the weight between a new frame and the background. This algorithm
has little computational cost; however, it is likely to produce tails behind moving
objects due to contamination of the background with the appearance of the moving
objects. (Gupte et al., 2002) and (Huang and Liao, 2004) use the instantaneous
background, which is the current frame with detected objects removed. The regions
of detected objects are filled with the old background pattern. By averaging the
moving objects are reduced. The feedback of the motion mask could however lead
foreground. Other papers report the use of averaging, usually for computational
reasons: (Kanhere et al., 2005, Chen and Zhang, 2007, Kanhere, 2008, Kanhere and
Birchfield, 2008)
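
The averaging scheme and the thresholded difference described above can be illustrated with a minimal sketch. It assumes grey-level frames stored as nested lists; the function names, the learning rate of 0.05 and the threshold of 30 grey levels are illustrative choices and are not taken from the cited papers.

```python
def update_background(background, frame, learning_rate=0.05):
    """Blend a new frame into the background estimate (running average)."""
    return [
        [(1.0 - learning_rate) * b + learning_rate * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels whose absolute difference to the background exceeds a threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Toy 2x3 scene: a bright "vehicle" covers one pixel of the new frame.
background = [[100.0, 100.0, 100.0], [100.0, 100.0, 100.0]]
frame = [[100.0, 200.0, 100.0], [100.0, 100.0, 100.0]]

mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

After the update, the background value under the vehicle has moved from 100 towards the vehicle appearance. This is the contamination that produces tails behind moving objects.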
To improve robustness, a single Gaussian model can be used for the background.
Instead of only the mean value as for averaging, the variance of the background
pixels is calculated additionally. This results in a mean image and variance image
for the background model. A new pixel is classified depending on the position in
(Kumar et al., 2003, Morris and Trivedi, 2006a, Morris and Trivedi, 2006b, Su
et al., 2007) use a single Gaussian background model.
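
A single Gaussian background model can be sketched as follows. This is a simplified illustration under stated assumptions: the model is fitted offline from a few background frames, and a pixel is flagged as foreground if it lies more than k standard deviations from the mean. The value k = 2.5 and the lower bound on the standard deviation are assumptions of this sketch, not parameters of the cited systems.

```python
import math


def fit_gaussian_background(frames):
    """Return per-pixel mean and variance images from background frames."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]
    var = [[sum((f[y][x] - mean[y][x]) ** 2 for f in frames) / n
            for x in range(width)]
           for y in range(height)]
    return mean, var


def is_foreground(pixel, mean, var, k=2.5, min_std=1.0):
    """Flag a pixel lying more than k standard deviations from the mean."""
    std = max(math.sqrt(var), min_std)  # floor avoids a zero-width model
    return abs(pixel - mean) > k * std


# Three frames of a 1x2 image: the first pixel flickers, the second is constant.
frames = [[[100.0, 50.0]], [[104.0, 50.0]], [[96.0, 50.0]]]
mean_image, variance_image = fit_gaussian_background(frames)
```

The variance image makes the decision adaptive: the flickering pixel tolerates larger deviations than the constant one.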
(Zheng et al., 2005) use the mode of the temporal histogram of every pixel to
estimate the background image, which is a non-parametric method. The mode
changes and long term operation are not demonstrated in the paper. The described
algorithm took 230 seconds for processing 600 frames on a Pentium 4 at 3 GHz and
1 GB RAM. For the mode in the histogram to correctly represent the background,
Gaussian Mixture Model (GMM) and holds for typical traffic surveillance
applications, but fails for parked vehicles or heavy congestion. However, the
algorithm is sensitive to the bin size. If the size is too small and the input pixel
values vary over several bins, no distinct peak would appear. The GMM (see
section [Link]) in comparison is a parametric method and models the width of the
A Kalman filter can be used to estimate the background image, where the colour of
each pixel is modelled by one filter. The foreground can be interpreted as noise for
the filter state. However, illumination changes are non-Gaussian noise and violate
basic assumptions for the use of Kalman filters. (Messelodi et al., 2005a) propose a
Kalman filter approach which can deal with illumination changes. The illumination
distribution over the image is estimated and used to adjust the individual Kalman
filter states. The foreground estimation was tested in (Messelodi et al., 2005b)
[Link].5. Wavelets
than the GMM (Stauffer and Grimson, 1999); however, the test data is very limited
in size.
The GMM was introduced in the seminal paper of (Stauffer and Grimson, 1999)
and (Stauffer and Grimson, 2000). Each pixel is modelled as a mixture of two or
more Gaussians and updated online. The stability of the Gaussian distributions is
evaluated to estimate if they are the result of a more stable background process or a
distribution representing it is stable above a threshold. The model can deal with
lighting changes and repetitive clutter. The computational complexity is higher than
an intersection. (Martel-Brisson and Zaccarin, 2007) extend the GMM to deal with
shadows (see section [Link]). For an introduction to Gaussian mixture models see
Bowden, 2001) is available in the OpenCV library (OpenCV, nd) and is commonly
used in research. Many researchers have adapted this model for traffic analysis
(Veeraraghavan et al., 2002, Zhang et al., 2008b, Bloisi and Iocchi, 2009, Wang
et al., 2009b, Johansson et al., 2009). The limitation of the approach remains the
background image is kept. A pixel process is estimated for every pixel. Based on
the estimated PDF, the probability for an observed pixel to occur is calculated. If
the probability is high, nothing unexpected happened and the pixel is assumed to be
algorithm is very cost effective, as only an estimation for a Gaussian Mixture Model
is calculated. The computation for every frame involves only an update and not a
recalculation of the model. At a resolution of 320x240, the algorithm takes less
than 80 ms on a Pentium 4 at 3.3 GHz with 2.5 GB RAM to segment a new video
frame.
Random Field (MRF). Every pixel of the image is represented by a node in the
graph. The edges between nodes and sources are assigned a weight related to the data
(data constraint). Sources represent the labels for a pixel, in this case foreground
nodes connected to either to indicate that this pixel corresponds to the respective
label. The advantage of graph cuts is that the optimal solution can be found in
polynomial time. (Boykov and Veksler, 2005) give a general introduction to graph
cuts. Applications for image restoration, stereo imaging and video blending are
applications use graph cuts for scene understanding from moving vehicles (Sturgess
et al., 2009). A new more general Marginal Probability Field (MPF) has been
introduced in (Woodford et al., 2009). MRF is a special linear case of this new
MPF.
authors grouped the literature into four different categories. The first category is
statistical non-parametric (SNP) which considers the colour consistency of the
human eye to detect shadows. An example of this is (Cucchiara et al., 2001) which
is used in several traffic systems (Zhang et al., 2008b, Johansson et al., 2009). The
Two different deterministic non-model based approaches are described which use a
combination of statistical and knowledge based assumptions. No single approach
performs best; rather, the type of application determines the best suited
algorithm. Deep cast shadow positions are predicted in (Johansson et al., 2009)
based on GPS location, time information and 3D vehicle models. With this
consistency, the authors use the stability of states in the GMM to determine
shadows. In contrast to the two groups of states in (Stauffer and Grimson, 1999),
one background state, several shadow states and several foreground states are used.
The concept assumes that shadow states are less stable than background states but
more stable than foreground states. Converged shadow states are copied into a
Gaussian mixture shadow model (GMSM) to prevent them from being overridden
by foreground states. This model calculates the shadow volume in the RGB space
rather than assuming it to be a cylinder, as in the colour consistency assumption in
(Cucchiara et al., 2001).
this section, methods are considered which detect objects in a holistic way by
searching for full objects. (Sullivan et al., 1996) convert the wire frame of a 3D
vehicle into a gradient image by assigning a triangular grey level profile to every
edge. The projected image is compared to the gradient image of the camera to find a
match. This work has been followed up in (Hu et al., 2004, Lou et al., 2005,
Remagnino et al., 1998, Zhang et al., 2008c). Optical flow is used in addition to
wire frames in (Ottlik and Nagel, 2008) to segment vehicles in the image.
model projections and new images. (Messelodi et al., 2005b) generates the convex
hull for 3D vehicle models in the image. The ratio between convex hull overlap of
model and image normalized by the union of both areas generates a matching score.
Similar 3D vehicle models are matched with a motion segmented input video in
et al., 2009), which also adapts the size of vehicles. A method for rendering 3D
vehicle models for matching at new viewing angles is proposed in (Guo et al.,
2008).
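
The overlap based matching score mentioned above (overlap of model and image regions normalised by the union of both areas) can be sketched as follows. The sketch assumes that both convex hulls have already been rasterised into binary masks of equal size; the function name is hypothetical.

```python
def overlap_score(model_mask, image_mask):
    """Intersection area divided by union area of two binary masks."""
    intersection = 0
    union = 0
    for model_row, image_row in zip(model_mask, image_mask):
        for m, i in zip(model_row, image_row):
            if m and i:
                intersection += 1
            if m or i:
                union += 1
    return intersection / union if union else 0.0


# The projected model covers the two left columns, the observation the two right ones.
model = [[1, 1, 0], [1, 1, 0]]
observed = [[0, 1, 1], [0, 1, 1]]
score = overlap_score(model, observed)
```

The score is 1 for a perfect match and 0 for disjoint regions, so it can be compared directly across vehicle models of different size.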
An approach with edges is used in (Kim and Malik, 2003). Horizontal and
vertical edges are grouped into vehicles using a probabilistic framework. The
grouped vehicles are used for tracking in a highway surveillance application. All
highway scenes in (Yoneyama et al., 2005). This allows vehicle detection by simply
taking the difference between the mean colour and a pixel. The approach would not
work in urban environments with street clutter.
instances called the class. The classifier needs information about a new instance
which is usually referred to as features. Features are extracted from the whole object
extract discriminative information from the features (see section [Link]). The
classifier then uses this learned information to assign a class label to a new instance.
[Link]. Features
produces similar values for the instances of a given class throughout the video
stream. This section gives an overview of different kinds of features, grouped by the
support in the image as either a binary foreground region, the contour of this region
or larger image patches.
Region based features are usually extracted from the whole image region of an
object. In video sequences, this is mainly the area of the foreground silhouette
extracted by the foreground segmentation algorithm. Image moments are often used
to generate a feature vector for the silhouette. Without any feature generation, the
convex hull of the silhouette (binary mask) can be used for comparison. (Messelodi
et al., 2005b, Song and Nevatia, 2007, Johansson et al., 2009) use such an approach
for region matching. (Gupte et al., 2002, Zhang et al., 2007a) use length and height
to classify vehicles on a highway. Rule based approaches are common, e.g. (Hsieh
et al., 2006) use size and a linearity feature for vehicle classification. The linearity
feature is a measure for the roughness of the vehicle silhouette. (Huang and Liao,
2004) use size, area and length with a set of rules to classify vehicles in a highway
scene. Occlusions between vehicles can produce similar effects on the silhouette,
which is demonstrated in (Zhang et al., 2008b) where a similar measure is used for
occlusion reasoning. For vehicle classification, (Morris and Trivedi, 2006a) and
(Morris and Trivedi, 2006b) use 17 different region features including 7 moments
for 7 classes. A comparison between image based features (IB) like pixels and
image measurements features (IM) like region size is given. Both feature types are
used with Principal Component Analysis (PCA) and Linear Discriminant Analysis
(LDA) as dimensionality reduction techniques. IM with LDA was used for the final
algorithm as it gave the best performance. The features are classified using a
(section [Link].1) is used to track the foreground regions based on the centroids.
incorporates HOG features for in-vehicle systems (Gandhi and Trivedi, 2007). In
(Alonso et al., 2007), initial bounding boxes for vehicles are generated based on
edges, which assumes that street clutter does not exhibit similar edge patterns. The
bounding boxes are verified by symmetry and corner detection inside this region.
Contour based features only take the edge of a silhouette into account. The distance
closed contours as extracted from video sequences. The contour including edges is
used in (Hu et al., 2004, Lou et al., 2005, Remagnino et al., 1998).
the shape. The convex outline of a contour of two vehicles will have dents, which
can be identified to separate two vehicles, if the occlusion is not severe. The contour
signature is used in (Zhang et al., 2008a) for vehicle classification from side views.
training data and to assign class labels to unseen data. An important property of the
learning technique is the supervision during learning. This describes the amount of
labelling information required of the training data. Labelling can range from simply
tagging an image with a class to completely segmenting the image manually and
wrongly labelled images. Ground truth is similar information and required for
evaluation. The classifier output for test data is then compared to this manually
generated ground truth. Large amounts of ground truth are required to provide
evaluation with high statistical confidence. Section 2.5.2 will look into common
data sets, which is important to share the effort in generating this ground truth. A
the next sections, first distance measures and clustering for training are introduced
before discussing different classifier architectures.
This representation allows the definition of a distance (i.e. difference) between two
measure similarity between features. Many distance measures are available with
various properties. Firstly, the Manhattan distance calculates the sum of the
absolute differences along every coordinate axis between the vectors. This results in
the least computational effort but complex mathematics. Secondly, the Euclidean
distance (Gao et al., 2009a) returns the geometric distance between two vectors.
possible to normalise the Euclidean distance along every axis to reduce the effect of
Euclidean. The variance of the data along a coordinate axis is used for
normalisation. The covariance matrix of the data needs to be calculated for this
reason. This normalisation transforms the data cloud into a spherical shape. This
distance is used for the training in chapter 5. For histogram comparison, the
Mahalanobis distance, however, it does not require the calculation of the covariance
matrix. The paper describes a system for distinguishing between two classes of
vehicles. The vehicles are presented centred in the image and at the same scale at
calculated on edge points to give a rich representation for the image. Generated
feature vectors are labelled according to the training vector with the smallest
χ² distance. A constellation model is used to find the most probable vehicle class based
on the positions of the observed feature vectors. This was evaluated for two separate
cases of binary classification. In the first case, 50 cars and 50 minivans were
randomly chosen from the sample pool for training. The testing was performed on
200 samples from each class taken from their own data. About 98% accuracy is
reported for that case. The test between sedans and taxis resulted in a slightly lower
accuracy. More detailed results are given for different shape models of the
probabilistic framework (refer to section [Link] for shape models).
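
The distance measures discussed earlier in this section can be written down compactly. The following sketch is illustrative only: the variance-normalised distance uses per-axis variances (a diagonal covariance), which is a simplification of the full covariance based normalisation, and the chi-squared form for histogram comparison is one common definition and is assumed here.

```python
import math


def manhattan(a, b):
    """Sum of absolute differences along every coordinate axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Geometric distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalised_euclidean(a, b, variances):
    """Euclidean distance with every axis scaled by the data variance on that axis."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


def chi_squared(h1, h2):
    """Chi-squared distance between two histograms (empty bin pairs are skipped)."""
    return 0.5 * sum((p - q) ** 2 / (p + q) for p, q in zip(h1, h2) if p + q > 0)
```

For example, the vectors (0, 0) and (3, 4) have a Manhattan distance of 7 and a Euclidean distance of 5. With axis variances of 9 and 16, both axes contribute equally to the normalised distance.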
For feature vectors, not all dimensions are necessarily statistically independent.
covariance matrix of the training data with the highest eigenvalues are used as new
coordinate axes. This transformation ensures that the largest data variance is
the feature space. (Zhang et al., 2005) uses this concept with SIFT feature vectors to
generate PCA-SIFT features. PCA has been applied directly on candidate images
for vehicle detection at night-time in (Thi et al., 2008, Robert, 2009a, Robert,
2009b). (Morris and Trivedi, 2006a) use Linear Discriminant Analysis (LDA),
which is a similar concept, for vehicle classification. (Chen and Zhang, 2007) use
Independent Component Analysis (ICA) which separates the data into independent
sources in addition to the orthogonal coordinate base of PCA. The paper introduces
to get bounding boxes of vehicles. The pixel values inside the bounding boxes form
training images to reduce the dimension of the feature space. This is similar to the
image based feature (IB) described in (Morris and Trivedi, 2006b) in section
[Link].1. To assign one of the three class labels to new feature vectors during
operation, three one class Support Vector Machines (SVM) are used (see section
[Link].4). The SVMs are trained with 50 vehicles each. Three tests are conducted
with 150 sample vehicles randomly chosen from the authors' own sample pool. The
reported performance is 65% recall at 75% precision. The ICA based algorithm is
algorithm, however, (Thi et al., 2008, Robert, 2009a, Robert, 2009b) report much
higher performance with their PCA based approach.
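
The idea behind PCA, using the eigenvectors of the covariance matrix with the highest eigenvalues as new coordinate axes, can be sketched without a linear algebra library. The sketch below only extracts the first principal axis by power iteration, which is a stand-in for a full eigendecomposition; all function names are hypothetical.

```python
import math


def covariance_matrix(data):
    """Covariance matrix and mean of row-wise samples."""
    n, dim = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(dim)]
    centred = [[row[j] - mean[j] for j in range(dim)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    return cov, mean


def first_principal_axis(cov, iterations=100):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    dim = len(cov)
    # Uneven start vector; a start orthogonal to the dominant axis would fail.
    vec = [1.0 / (j + 1) for j in range(dim)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec


def project(sample, mean, axis):
    """Coordinate of a sample along the given axis (one-dimensional embedding)."""
    return sum((s - m) * a for s, m, a in zip(sample, mean, axis))


# Samples lying on the line y = 2x: all variance is along the direction (1, 2).
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov, mean = covariance_matrix(data)
axis = first_principal_axis(cov)
```

Projecting the two-dimensional samples onto this single axis keeps all of the variance of the toy data, which is the sense in which PCA reduces the dimension of the feature space.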
Several non-linear embedding methods are compared in a review (van der
LLE, Local Tangent Space Analysis, Locally Linear Coordination LLC, and
manifold charting.
[Link].3. Clustering
Clustering is performed on the training data. If the training data only contains object
clusters in the data and the correspondence of the training samples to those clusters.
As this general clustering problem has not been solved satisfactorily, k-means
samples into a specified number of groups based on the distance between features.
(Morris and Trivedi, 2006a) uses this clustering technique for vehicle classification.
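
A minimal sketch of k-means clustering is given below. It assumes the squared Euclidean distance between features and, to keep the example deterministic, uses the first k samples as initial cluster centres; practical implementations usually choose random or more careful initialisations.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(samples, k, iterations=20):
    """Lloyd's algorithm; the first k samples serve as initial centres."""
    centres = [list(s) for s in samples[:k]]
    assignment = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: attach every sample to its closest centre.
        assignment = [
            min(range(k), key=lambda c: squared_distance(s, centres[c]))
            for s in samples
        ]
        # Update step: move every centre to the mean of its members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment


samples = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [2.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
centres, assignment = k_means(samples, k=2)
```

The number of groups k has to be specified in advance, which is the main practical limitation of this technique.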
builds a cluster tree, which allows cutting off branches at different levels and sizes.
Metrics other than the final cluster number can be used for this cutting, which
allows more flexibility.
classification. This is usually applied if several local feature vectors are used to
specify an object. The class label for every feature vector is known from
feature vectors is used to group them together. Every group of feature vectors is
replaced by one codebook entry holding all class labels of the individual vectors.
This approach can increase the speed of the final classifier and reduce the amount of
training and data storage, as shown in (Leibe et al., 2004, Opelt, 2006, Leibe et al.,
2008b) for bottom-up object detection. The same concept is termed 'visual
dictionary' in (Serre et al., 2007) and used for vehicle classification in (Wijnhoven
et al., 2008, Wijnhoven and de With, 2009, Creusen et al., 2009).
[Link].4. Classifiers
Classifiers map a new unknown object instance with extracted feature vector to a
known class or perhaps no class. This mapping process depends on what was
previously learned from training data. Different ways for generating and performing
this mapping are outlined in the next sections.
The nearest neighbour classifier is the simplest non-parametric classifier for a
feature vector. The distance between a new feature vector and every vector of the
training set is calculated. Any distance measure can be used for that purpose. The
class label of the closest training vector is assigned to the new vector. To improve
robustness, the k-nearest neighbour algorithm can be used. The class label for the
new vector is determined by the k nearest training vectors. Both methods require
many distance calculations and do not scale very well for large training sets in terms
requirement for training, however, the classification time increases with the training
size. (Morris and Trivedi, 2006a, Hsieh et al., 2006) use this method to classify
vehicles based on binary foreground features. In the seminal paper for SIFT (Lowe,
1999), corresponding interest points are found using the nearest neighbour
neighbour algorithm. For this case, the class membership is defined by weights
which results in a softer decision boundary. (Morris and Trivedi, 2006b) use this
algorithm to improve robustness against outliers.
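
The nearest neighbour and k-nearest neighbour classifiers can be sketched in a few lines. The example assumes the squared Euclidean distance and a simple majority vote; the feature values (vehicle length and height in metres) are invented for illustration and include one deliberately mislabelled training sample.

```python
from collections import Counter


def k_nearest_neighbour(query, training_vectors, training_labels, k=1):
    """Return the majority label among the k training vectors closest to query."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ranked = sorted(
        zip(training_vectors, training_labels),
        key=lambda pair: squared_distance(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


training_vectors = [[4.0, 1.5], [4.5, 1.4], [4.2, 1.6], [11.0, 3.0], [12.0, 3.2],
                    [4.3, 1.5]]  # last entry: a car-sized sample mislabelled as bus
training_labels = ["car", "car", "car", "bus", "bus", "bus"]

query = [4.3, 1.52]
label_1nn = k_nearest_neighbour(query, training_vectors, training_labels, k=1)
label_3nn = k_nearest_neighbour(query, training_vectors, training_labels, k=3)
```

With k = 1 the query inherits the label of the mislabelled sample, whereas with k = 3 the two correctly labelled neighbours outvote it. Every query requires a distance calculation to all training vectors, which is why the classification time grows with the training set.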
An introduction and review of kernel based learning used for Support Vector
Machines (SVM) can be found in (Muller et al., 2001). SVMs perform classification
using linear decision hyper-planes in the feature space. During training, the hyper-
planes are calculated to separate the training data with different labels. (Dalal and
Triggs, 2005, Chen and Zhang, 2007, Serre et al., 2007, Thi et al., 2008, Wijnhoven
and de With, 2009, Creusen et al., 2009) use an SVM for vehicle classification. If the
training data is not linearly separable, a kernel function can be used to transform the
data into a new vector space. The data has to be linearly separable in the new space.
Support vector machines scale well for large training sets. The complexity for
training increases with the number of training samples, however, the classification
is independent of it. The generic approach does not provide confidence measures
for the classification. There are extensions which derive a confidence based on the
distance of a feature vector to the hyper-planes, which is not always reliable.
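
The role of the transformation into a new vector space can be illustrated with a toy example. The sketch below does not train an SVM: the hyperplane is picked by hand, and an explicit feature map stands in for a kernel function. It only shows that data which is not linearly separable in the input space (an XOR-like pattern) becomes separable by a linear hyper-plane after the mapping, and how a distance to the hyper-plane can be computed. All names and values are assumptions of this sketch.

```python
import math


def feature_map(x):
    """Hypothetical explicit map into a space where XOR-like data is separable."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def decision_value(x, weights, bias):
    """Signed score w . phi(x) + b; its sign gives the class."""
    return sum(w * f for w, f in zip(weights, feature_map(x))) + bias


def distance_to_hyperplane(x, weights, bias):
    """Geometric distance to the hyper-plane, sometimes used as a rough confidence."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(x, weights, bias)) / norm


# Hand-picked hyper-plane for illustration, not the result of SVM training.
weights, bias = (0.0, 0.0, -1.0), 0.0
samples = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]
predictions = [1 if decision_value(x, weights, bias) > 0 else -1 for x, _ in samples]
```

Classifying a new vector only needs the hyper-plane parameters, not the training set, which is why the classification time does not grow with the number of training samples in this linear form.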
Probabilistic Frameworks
estimate the (posterior) probability based on observed data and prior knowledge.
from the image data and the prior knowledge of how frequent vehicles of class A
are observed. The vehicle detection system presented in (Song and Nevatia, 2007)
uses a Bayesian framework with Markov chain Monte Carlo sampling. First, a
computed from the foreground map, indicating likely vehicle centroids. The
distance of points from the boundary of the foreground map indicates the likelihood
in the proposal map. A Bayesian problem is formulated for the vehicle positions.
Markov chain Monte Carlo (MCMC) algorithm is used to search for several good
solutions. The MCMC generates new states by changing the number of vehicles, the
optimisation algorithm which finds the optimal track through the set of solutions for
every frame. Other works (Kim and Malik, 2003, Hsieh et al., 2006) use
probabilistic frameworks for vehicle detection and tracking.
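
The combination of observed data and prior knowledge in a Bayesian framework can be illustrated with a small numerical sketch. The class names, priors and likelihoods below are invented for illustration and are not taken from any of the cited systems.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(class | data) is proportional to P(data | class) * P(class)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers: cars are observed far more frequently than buses.
priors = {"car": 0.9, "bus": 0.1}
likelihoods = {"car": 0.2, "bus": 0.6}  # P(observed image data | class)
result = posterior(likelihoods, priors)
```

Although the image data alone favours the bus class by a factor of three, the posterior probability is 0.75 for car and 0.25 for bus, because the prior knowledge about class frequencies outweighs the likelihood.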
concept, which is traditionally used for generic object recognition is given in (Pinz,
2005). As discussed at the beginning of section 2.3, this involves detecting parts of
objects and classifying them, before they are grouped to objects. The next section
features from image patches. Section [Link] covers the learning technique of
boosting, which has proved to be very powerful when used with interest points.
Sections [Link] and [Link] introduce spatial models for interest points.
Interest points (also referred to as key points) are image positions, from which
features are extracted. Those points may be uniformly sampled in the image space
(Dalal and Triggs, 2005, Dalal et al., 2006), in a 3D surface space (chapter 5) or
1999), Hessian (Bay et al., 2006), etc. A comprehensive comparison of local patch
features can be found in (Mikolajczyk and Schmid, 2005, Zhang et al., 2007b),
including a temporal extension in (Wang et al., 2009a) where it is shown that the
The simplest patch based feature vector is the collection of values of the image
patches for object classification. The distance measure between patches is defined
by their cross-correlation. The correlation function is very sensitive to size
and illumination changes of the image. This fact encourages other feature
transformations which can deal with changing conditions. The following paragraphs
introduce several solutions.
Using a histogram rather than pixel values allows for more spatial
invariance. The seminal paper for those concepts is (Lowe, 1999) followed up by
many other algorithms (Dalal and Triggs, 2005, Mikolajczyk and Schmid, 2005,
Bay et al., 2006).
2006a, Opelt, 2006, Ma and Grimson, 2005, Kim and Malik, 2003) use the Canny
edge detector to generate features.
The Scale Invariant Feature Transform (SIFT) was introduced in the seminal
paper of (Lowe, 1999). The local features generated are invariant to image scaling,
translation and rotation and partially invariant to illumination changes and affine
projection changes. The feature vectors are generated at maxima of the scale space
of the gradient input image. In addition to the 160-dimensional feature vector, the
the image, which will remain salient even if the image is resized, rotated or the
illumination is changed. The SIFT features can be used to find point to point
correspondences in two different images of the same object. (Opelt et al., 2006a)
combines SIFT features and other local features for generic object recognition.
(Zhang et al., 2005) uses a derivation of SIFT, the PCA-SIFT (Principal Component
Analysis-SIFT) for generic object recognition. The local features are used in
Modified SIFT descriptors are used in (Ma and Grimson, 2005) to generate a rich
representation of vehicle images. (Gao et al., 2009a) uses re-identified SIFT interest
points between frames for tracking vehicles in urban scenes.
The SURF descriptors are introduced by (Bay et al., 2006). The descriptor aims for
loss of performance. The use of box filters instead of Gaussian filters in the case of
regions around an interest point are used to generate the feature vector, which can
be calculated with integral images in this way.
by (Dalal and Triggs, 2005). To calculate the feature vector, the gradient input
image window is divided into a grid of cells. For every cell, a histogram of the
dimensional local feature vector. The vectors of all cells are concatenated to give
one global feature vector for the image window. In the original paper, this vector is
by introducing 3DHOG which uses 3D model surfaces rather than 2D grids of cells.
This allows for the algorithm to resolve scale and use a single model for variable
viewpoints of road users.
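
The cell based construction of a 2D HOG feature vector can be sketched as follows. This is a simplified illustration: it uses unsigned gradient orientations, nine orientation bins and 4x4 pixel cells, all of which are assumptions of this sketch, and it omits the block normalisation that practical HOG implementations apply. It does not cover the 3DHOG extension.

```python
import math


def cell_histogram(image, top, left, size, bins=9):
    """Gradient-magnitude-weighted histogram of unsigned orientations in one cell."""
    hist = [0.0] * bins
    height, width = len(image), len(image[0])
    for y in range(top, top + size):
        for x in range(left, left + size):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
    return hist


def hog_descriptor(image, cell_size=4, bins=9):
    """Concatenate the cell histograms of a window into one global feature vector."""
    descriptor = []
    for top in range(0, len(image) - cell_size + 1, cell_size):
        for left in range(0, len(image[0]) - cell_size + 1, cell_size):
            descriptor.extend(cell_histogram(image, top, left, cell_size, bins))
    return descriptor


# 8x8 window with a vertical edge between a dark left and a bright right half.
window = [[0 if x < 4 else 10 for x in range(8)] for _ in range(8)]
descriptor = hog_descriptor(window)
```

For this window, all gradient energy falls into the first orientation bin of each of the four cells, and the concatenated descriptor has 4 x 9 = 36 entries.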
A wide range of other descriptors has been introduced in the literature. The
et al., 2006b). The model uses only segments of contours for generic object
recognition. The idea of local interest point features as used in (Lowe, 1999, Leibe
et al., 2004, Crandall et al., 2005, Opelt et al., 2006a) is extended to boundary
entries. The use of a Canny edge detector to generate the boundary fragments
allows the model to be used with still images.
vector with finer quantisation than SIFT is extracted and the dimensionality is
reduced using PCA based on a large training set.
[Link]. Boosting
Schapire, 1999). AdaBoost uses weak classifiers, which only need to perform better
than random. Weights for those weak classifiers are learned during training. Every
round of training changes the weights of training images to force the classifier to be
trained on difficult examples. The weighted weak classifiers result in a final strong
robust against overfitting. The original paper (Viola and Jones, 2004) uses a
cascade of AdaBoost classifiers with underlying Haar filters for face detection. The
success of this face detector increased the popularity of AdaBoost for computer
vision. The same authors used a temporal extension of their algorithm for pedestrian
detection in road surveillance (Jones and Snow, 2008). (Zhang et al., 2005) perform
generic object recognition with a binary multi layer AdaBoost network. In (Opelt
et al., 2006a, Opelt et al., 2006b), binary AdaBoost is used for generic object
2006c, Opelt, 2006). (Khammari et al., 2005) uses boosting of gradient features to
detect vehicles in road scenes. (Acunzo et al., 2007) uses a boosted classifier for
illumination condition detection (day, night, etc).
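
The reweighting principle of AdaBoost can be made concrete with a small sketch of discrete AdaBoost. The weak classifiers are threshold tests (decision stumps) on a single scalar feature, which is an assumption made to keep the example short; the cited systems use image based weak classifiers such as Haar filters.

```python
import math


def best_stump(xs, ys, weights):
    """Weak classifier: threshold on a scalar feature with minimal weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost on labels in {-1, +1}; returns weighted stumps."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weights of misclassified samples, lower the others.
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = polarity if x >= threshold else -polarity
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels
ensemble = adaboost(xs, ys, rounds=3)
```

No single stump classifies this toy data correctly, but the weighted combination of three stumps does, because each round concentrates on the samples that the previous rounds misclassified.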
Explicit shape implies directly modelling the spatial relationship between the
detected parts of objects. Various shape models with relevance for traffic
surveillance are introduced here.
[Link].1. k-fans
The k-fan model was first introduced in (Crandall et al., 2005) to schematise part
based object recognition. The parts of an object are divided into reference nodes and
regular nodes of a graph. The parameter k represents the number of reference nodes.
Every reference node has a spatial relation to every other node in the graph. By
changing k from 0 to the total number of nodes, the spatial prior can be changed
from no shape modelled to a full rigid structure. Most shape models are related to k-
fans. (Kim and Malik, 2003) use a 1-fan model to group edges of a highway scene
vehicles. (Ma and Grimson, 2005) use a constellation model similar to 1-fan for
The HOG (Dalal and Triggs, 2005) and 3DHOG (chapter 5) algorithms use a fully
connected graph.
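
The graph structure of a k-fan can be sketched by enumerating its spatial relations. In this illustration the first k parts are taken as reference nodes, which is an arbitrary convention of the sketch.

```python
def k_fan_edges(num_parts, k):
    """Edges of a k-fan: each of the first k (reference) parts links to every other part."""
    edges = set()
    for ref in range(k):
        for other in range(num_parts):
            if other != ref:
                edges.add((min(ref, other), max(ref, other)))
    return sorted(edges)


no_shape = k_fan_edges(5, 0)   # 0-fan: no spatial relations
star = k_fan_edges(5, 1)       # 1-fan: star around one reference part
rigid = k_fan_edges(5, 5)      # all parts are reference nodes: fully connected
```

For five parts, the number of spatial relations grows from 0 (no shape modelled) through 4 (a star shaped 1-fan) to 10 (a fully connected graph).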
The implicit shape model (1-fan) is introduced in (Leibe et al., 2004) and explained
in more detail in (Leibe et al., 2008a). Image patches at key points of objects are
learned during training. In addition to the object label, a probability density function
for the relative position in the object is provided. The evidence for object positions
is accumulated based on those positions through generalised Hough voting. In the case
of Hough transform for line detection, every pixel of the image contributes to
possible lines in the angle and position space. If many pixels vote for one angle and
one position, this line is detected. A similar concept is used for object voting in
(Lowe, 1999). Every detected SIFT interest point votes for its corresponding object
centroid in x-y voting space. The maximum in this space defines the detected and
classified object at a position. This method is extended using different features and
distance measures in (Leibe et al., 2004, Leibe et al., 2005, Leibe et al., 2007,
Cornelis et al., 2008, Leibe et al., 2008b, Opelt et al., 2006a, Opelt et al., 2006b,
Opelt et al., 2006c).
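
The voting for object centroids can be sketched with a coarse accumulator in x-y space. The sketch assumes that every detected interest point has already been matched to a training feature and therefore carries a learned offset to the object centre; the bin size of 10 pixels and the data layout are assumptions of this illustration.

```python
from collections import Counter


def vote_for_centroid(detections, bin_size=10):
    """Accumulate centroid votes in a coarse x-y grid and return the strongest bin.

    Each detection is (x, y, dx, dy): an interest point position and the
    offset to the object centre learned for its matched training feature.
    """
    accumulator = Counter()
    for x, y, dx, dy in detections:
        cx, cy = x + dx, y + dy
        accumulator[(int(cx // bin_size), int(cy // bin_size))] += 1
    (bin_x, bin_y), votes = accumulator.most_common(1)[0]
    centre = ((bin_x + 0.5) * bin_size, (bin_y + 0.5) * bin_size)
    return centre, votes


detections = [
    (40, 50, 12, 3),    # votes for (52, 53)
    (70, 60, -16, -4),  # votes for (54, 56)
    (55, 40, 1, 14),    # votes for (56, 54)
    (10, 10, 5, 5),     # spurious match voting for (15, 15)
]
centre, votes = vote_for_centroid(detections)
```

Three consistent votes fall into the same accumulator bin and define the detected object position, while the spurious match remains isolated.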
between detected parts are used to generate a feature vector. Both methods use pixel
values of the image patches. A good example for bottom-up surveillance based on
ISM is (Leibe et al., 2008b), where road users are tracked from a static urban
surveillance camera. The framework was first introduced in (Leibe et al., 2007)
based on a generic object detector (Leibe et al., 2004) with implicit shape model for
vehicle detection from a moving camera. This work shows how bottom-up object
detection approaches can be used for traffic analysis. The algorithm is demonstrated
to perform, in urban environments, similar to the state of the art on moving stereo,
while most foreground segmentation methods discussed in section 2.3.1 would not
work for such a scenario. The limitations of this approach are lower detection ratios
compared to typical top-down approaches and higher computational complexity.
[Link].3. Alphabets
Instead of using every single feature vector from training, similar vectors are
combined. The resulting entry holds a list of class labels and could take several
positions in a shape model. This concept is used in (Leibe et al., 2005, Ma and
Grimson, 2005, Opelt, 2006, Wijnhoven et al., 2008, Wijnhoven and de With, 2009,
Creusen et al., 2009).
A solution for generic object recognition without shape structure is given in (Opelt
et al., 2006a) and commonly referred to as 'bag of words' (0-fan). A large set of
trained with those features. This training procedure automatically selects the most
introduced in (Zhang et al., 2005). This second layer uses global features to
improve the classification.
research. For example, (Ullman, 2007) discusses the structure of the human visual
cortex and derives a tree style object hierarchy. The features of objects are based on
image patches. (Serre et al., 2007) present something that is more relevant for
segmentation system is implemented using a visual cortex structure. Four layers are
used, which perform simple filtering, complex searching and a repetition of those
(Wijnhoven and de With, 2007, Wijnhoven et al., 2008, Creusen et al., 2009,
motion tracker (no details are provided) generate vehicle images, which are passed
through a sequence of simple and complex layers represented by Gabor filters and a
trained for every 90 degrees of viewing angle. In contrast, the algorithm in chapter 5
can operate on arbitrary viewing angles. On the same data set, (Wijnhoven and
de With, 2007) outperform (Ma and Grimson, 2005). The authors have moved this
de With, 2009, Creusen et al., 2009). Feature vectors are now extracted from
interest point locations rather than a uniform density over the whole image patch.
2.3.4. Tracking
two steps: Firstly, features for the object or foreground regions are generated in
every video frame (see section [Link]). Secondly, a data association step has to
avoid confusion of tracks and to smooth noisy position outputs of detectors. The
data association step can use the same distance measure as the machine learning
algorithm, see section [Link].1. The classification result and location in the image
are typically included in the feature for this association. The next sections discuss
motion models for tracking in traffic applications and possible data association
based on prediction.
The Kalman filter was originally introduced in (Kalman, 1960) and has been
successfully used in many applications including missile tracking. The optimal state
of a linear time invariant motion model is estimated assuming Gaussian process and
measurement noise. The prediction stage of the Kalman filter is used to extrapolate
the position of objects in a new frame based on a constant velocity constraint. The
detectors. A correction step uses the detection as measurement and updates the filter
state. This concept is used in (Morris and Trivedi, 2006b, Messelodi et al., 2005b,
Rad and Jamzad, 2005, Song and Nevatia, 2007, Johansson et al., 2009, Bloisi and
Iocchi, 2009) for tracking. Kalman filters propagate a single object state between
frames compared to multiple hypotheses for particle filters in the next section. The
extended Kalman filter (EKF) can accommodate non-linear models.
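The prediction and correction stages can be illustrated for a single image coordinate. The following is a minimal sketch under stated assumptions: a one-dimensional constant velocity model, a position-only measurement, and hypothetical noise values `q` and `r`. It is not the implementation of any of the cited systems.

```python
def kalman_predict(x, P, dt, q):
    """Constant velocity prediction. x = [position, velocity], P = 2x2 covariance."""
    pos, vel = x
    x_pred = [pos + dt * vel, vel]
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
    # (a simplified process noise, assumed for this sketch).
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p10 + p01) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_pred, P_pred

def kalman_correct(x, P, z, r):
    """Correction with a position measurement z of variance r (H = [1, 0])."""
    s = P[0][0] + r                      # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
    y = z - x[0]                         # innovation (measurement residual)
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    # P_new = (I - K H) P
    P_new = [
        [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new

# A detection moving 2 pixels per frame, observed for 10 frames.
x, P = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for k in range(1, 11):
    x, P = kalman_predict(x, P, dt=1.0, q=0.01)
    x, P = kalman_correct(x, P, z=2.0 * k, r=1.0)
```

After a few frames the velocity estimate approaches the true 2 pixels per frame, so the prediction stage extrapolates the position well even if a detection is missed.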
et al., 1993). A recent tutorial (Doucet and Johansen, 2009) reviews the filter and
(Isard and Blake, 1998) introduced the particle filter into the computer vision
domain. The filter is used for traffic videos in (Nummiaro et al., 2003, Bardet and
Chateau, 2008, Mauthner et al., 2008, Nguyen and Le, 2008, Wang et al., 2009b,
Gao et al., 2009a).
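In contrast to the single state of the Kalman filter, a particle filter carries many weighted hypotheses. The sketch below shows one common variant (a bootstrap filter with resampling in every step) for a one-dimensional position and velocity state. All noise parameters, the Gaussian likelihood and the initial particle distribution are assumptions of this illustration and are not taken from the cited works.

```python
import math
import random

def particle_filter_step(particles, z, rng, dt=1.0,
                         sigma_pos=0.2, sigma_vel=0.1, sigma_z=1.0):
    """One predict / weight / resample cycle for (position, velocity) particles."""
    # Predict: propagate every hypothesis with a noisy constant velocity model.
    moved = []
    for pos, vel in particles:
        vel = vel + rng.gauss(0.0, sigma_vel)
        pos = pos + dt * vel + rng.gauss(0.0, sigma_pos)
        moved.append((pos, vel))
    # Weight: Gaussian likelihood of the position measurement z.
    weights = [math.exp(-0.5 * ((z - pos) / sigma_z) ** 2) for pos, _ in moved]
    total = sum(weights)
    if total == 0.0:
        # All hypotheses are far from the measurement: fall back to uniform weights.
        weights = [1.0] * len(moved)
        total = float(len(moved))
    weights = [w / total for w in weights]
    # Estimate: weighted mean of the hypotheses.
    estimate = (sum(w * p[0] for w, p in zip(weights, moved)),
                sum(w * p[1] for w, p in zip(weights, moved)))
    # Resample: draw a new particle set in proportion to the weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, estimate

rng = random.Random(0)
particles = [(rng.uniform(-5.0, 5.0), rng.uniform(0.0, 4.0)) for _ in range(500)]
estimate = (0.0, 0.0)
for k in range(1, 21):
    # The object moves 2 units per frame; the measurement is noise free here.
    particles, estimate = particle_filter_step(particles, z=2.0 * k, rng=rng)
```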
The Spatial-Temporal Markov Random Field (S-T MRF) is introduced by (Kamijo
et al., 2000, Kamijo et al., 2001a, Kamijo et al., 2001b) for vehicle tracking in
urban traffic scenes. The input image of resolution 640 x 480 is divided into blocks
modelled as a graph like in section [Link]. The S-T MRF is used to generate
vehicle labels for the blocks. Adjacent blocks as well as blocks in consecutive
frames are considered neighbours for the model. A solution for the object map
(nodes of the S-T MRF) of the current frame is found based on the current image,
the previous image and the previous object map. The result is used in a Hidden
Markov Model (HMM) to detect events like vehicle passes or collisions. (Kamijo
and Sakauchi, 2002, Kamijo et al., 2004) extend the earlier work by
introducing incident detection in tunnels.
(Gupte et al., 2002) for vehicle tracking. Every region in a frame is represented by a
node in the graph similar to the MRF. One edge leaving every node is generated for
two consecutive frames. The destination node of the edge is determined by the
best overlap score of the image regions. Due to this bidirectional structure of the
graph, splitting and merging of regions during tracking can be handled. To avoid
conflicts in the graph, adding conflicting edges is suppressed. (Huang and Liao,
2004, Veeraraghavan et al., 2002) use the graph correspondence for vehicle
tracking and classification. In (Taj et al., 2008), vehicle and pedestrian tracking is
evaluated on the CLEAR data set (CLEAR, 2007) and uses greedy graph
several frames. (Song and Nevatia, 2007) use the Viterbi algorithm to find the
optimal vehicle constellations over several frames.
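The overlap based graph correspondence between two consecutive frames can be sketched as follows. The sketch assumes that regions are given as axis aligned bounding boxes `(x1, y1, x2, y2)`, whereas the cited work operates on image regions. The function names are hypothetical and the suppression of conflicting links is omitted.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def best_overlap_links(src, dst):
    """For every source region, the index of the best overlapping destination."""
    links = {}
    for i, a in enumerate(src):
        scores = [overlap_area(a, b) for b in dst]
        if scores and max(scores) > 0:
            links[i] = scores.index(max(scores))
    return links

def associate(prev_regions, curr_regions):
    """Bidirectional correspondence as a sorted list of (prev, curr) index pairs."""
    forward = best_overlap_links(prev_regions, curr_regions)   # prev -> curr
    backward = best_overlap_links(curr_regions, prev_regions)  # curr -> prev
    links = set(forward.items()) | {(p, c) for c, p in backward.items()}
    return sorted(links)

# Two vehicles in the previous frame merge into a single region.
prev_regions = [(0, 0, 10, 10), (12, 0, 22, 10)]
curr_regions = [(1, 0, 21, 10)]
links = associate(prev_regions, curr_regions)
```

Both previous regions link to the same current region, which is how a merge shows up in the graph. A split produces the mirrored pattern through the backward links.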
The concept of event cones to find space time trajectories is introduced in (Leibe
et al., 2007). Every object observation in a frame is assigned an event cone, which
in turn represents a volume of possible object positions in the future and the past.
The shape of the cone is determined by the dynamic model of the object, similar to
trajectories to explain the full history of observations. This allows tracks to be split
CHAPTER 2 REVIEW 2.4 Complete Traffic Analysis Systems
retrospectively, which is traded off against optimisation of a growing data set for
long video sequences. In addition, a real time scene understanding system might be
shown similar to the deployment discussed in section 2.2. This is partly due to the
easier conditions on a highway with usually more homogeneous and constant flow
than in urban areas. In addition, the distance between vehicles is larger and reduces
2.4.1. Urban
The challenge for monitoring urban traffic is the high density of vehicles and the
low camera angle. The combination of both leads to a high degree of occlusion. In
addition, the clutter on the streets increases the complexity of scenes. The literature
is divided up into 2D approaches, which operate in the domain of the camera view
This section deals with systems that work directly in the camera coordinate domain.
correspondence. The tracked objects are classified into pedestrians and vehicles
based on the main orientation of the bounding box. Example images for different
used in (Wang et al., 2006). Shape based data association in tracking feeding back
to the detection is shown to significantly improve the results. A multi agent framework
a frame differencing motion map, parking in and out conditions are calculated. A
state machine is used to track the speed changes of vehicles until stopping to
generate those conditions. The system is evaluated with 24 hours of video data from
two different sites. A good detection rate of 94.7% is reported on their own data.
and Sayed, 2006). This provides robustness against errors in the background
estimation and can deal with changing viewing angle, as no prior assumption to the
and 94% depending on the data set. Whole vehicle parts rather than individual
points are tracked with particle filters in (Mauthner et al., 2008) operating on very
low resolution images.
applications. (Nguyen and Le, 2008) focuses on motorcycle tracking with multi-
modal particle filters. A recall rate for counting of 99% is demonstrated for videos
from Vietnam. In an urban setting in Venice, boats are tracked in (Bloisi and Iocchi,
2009). GMM is combined with optical flow and a Kalman filter to track and count
boats along the Grand Canal. Counting accuracy is 94% for a 2 hour sequence,
which is particularly challenging due to waves on the water.
[Link]. 3D Modelling
Systems in this section use explicit 3D modelling. A real time system is introduced
models are used to initialise an object list for every fifth frame based on the convex
hull overlap of model projection and motion map. Camera calibration is required for
this operation. A feature tracker follows the detected objects along some frames
before a new initialisation takes place. The tracker is used to speed up operation, as
the 3D operation would not be fast enough to operate on every frame in real time.
The objects are classified into 8 classes based on a two-stage classifier. The first
stage evaluates the convex hull, the second stage uses pixel appearance (colour) for
video data from two different sites. The total classification rate is given as 91.5%
for the test data of the authors.
The use of 3D wire frame models for vehicle detection and classification
was proposed in (Sullivan et al., 1996, Tan et al., 1998). First, a hypothesis for a
three axes of cars (horizontal forward, sideways and vertical) are correlated with
trained templates. The hypothesis is verified by correlating the gradient input image
with the wire frame image. The wire frame image is generated using the camera
calibration to project the wire frame and replacing every line with a three pixel wide
et al., 1998, Hu et al., 2004, Lou et al., 2005). A similar work using optical flow to
find detection regions is presented in (Ottlik and Nagel, 2008) with previous work
in (Dahlkamp et al., 2006, Dahlkamp et al., 2004). The 3D wire frames of vehicles
are used in a Hough transform to provide additional cues for vehicle detection. Only
4 vehicle models are provided, which leads to a low detection rate of 65% on video
data from (Nagel, nd).
(Song and Nevatia, 2007) use a Bayesian framework with Markov chain
Monte Carlo sampling. A proposal map is computed from the foreground map,
indicating likely vehicle centroids based on a constant size vehicle model. The
evaluation on two video sequences shows detection rates of 96.8% and 88%. The
only shadows from known lighting conditions can be dealt with, e.g. sunlight. This
models but requiring manual setup of vehicle orientation. The performance of this
algorithm is not evaluated quantitatively.
2003). An edge detector is applied in an entry window to the side view highway
image to retrieve horizontal and vertical lines of vehicles. Those line features are
model. Once vehicles are detected in the entry window of the scene, they are
tracked using cross correlation between frames. The detection rate compared to
hand counting is reported to be 85%.
(Atev et al., 2005; Atev and Papanikolopoulos, 2008). Multiple cameras, calibrated
according to (Masoud and Papanikolopoulos, 2007) with road primitives, are used
masks to the road plane. 85% of 273 vehicles are detected successfully on the
authors' data.
2.4.2. Highways
Observing highway scenes usually gives the advantage of high camera angle and
focuses on this topic. Newer references are discussed here and divided up into
detection (section [Link]) and classification (section [Link]).
[Link]. Detection
et al., 2002). The main focus of the paper is on detection with effort put in a fast
silhouette overlap. The proposed classifier uses two classes (cars, non cars) with
size based features. The camera calibration is required to normalise those features.
Tracking using particle filters is introduced in (Wang et al., 2009b). The system is
motivated by generic surveillance, but results are shown for their own highway
video sequence. A Markov chain Monte Carlo particle filter (MCMC PF) is used in
(Bardet and Chateau, 2008) to track vehicles detected with simple frame
The traffic system proposed in (Hsieh et al., 2006) allows vehicle detection
and classification into four classes. The camera is assumed to be aligned with the
highway. This assumption allows the estimation of the lane centres by using the
tracks (by Kalman filter) of vehicle centroids. The lane centres are used to calculate
the lane division lines. Those lines are used to separate vehicle blobs merged due to
shadows. The detected vehicles are classified based on size and the linearity feature.
This ad hoc feature is a measure of the roughness of the blob. A Bayes classifier
based on the Mahalanobis distance between feature vectors with constant prior is
from three different sites. The reported detection accuracy is 82%. Out of the
detected vehicles, 93% are classified correctly using cues from multiple frames.
Higher detection accuracy is reported in (Wang et al., 2004), however, the test
video exhibits less occlusion. A rule based framework to deal with shadows and
Kanhere and Birchfield, 2008). With camera calibration, the height of interest
points is estimated throughout the video based on a foot point constraint of the
bottom of a motion silhouette. This allows effective grouping of points into cars and
trucks. The segmentation and tracking performance exceeds 90%.
[Link]. Classification
(Rad and Jamzad, 2005) propose a system to track and classify vehicles on
highways. Vehicles are first classified into three classes based on the width of the
bounding box and the travelling speed. The classified bounding boxes are tracked
using a Kalman filter. The reported tracking error rate is 5.4%.
algorithm. Seven vehicle types are classified from side view motorway images.
Blob features like length and compactness are used with a rule based classifier. The
are separated using dense optical flow fields. However, this method only works if
evaluated with a test sequence lasting for 463 seconds which results in 91% overall
classification rate.
CHAPTER 2 REVIEW 2.5 Discussion
2.5. Discussion
This section will discuss challenges in the field of traffic surveillance, especially in
the urban domain. One major aspect is common data sets, which are analysed in
section 2.5.2. Future research directions are given in section 2.6 in relation to this
thesis.
tracking have been successfully applied for highway surveillance (Bardet and
Chateau, 2008, Morris and Trivedi, 2006b, Kanhere and Birchfield, 2008). There
are attempts to overcome the problem of occlusion and shadows for that type of
scene. Urban environments are more challenging due to denser traffic, variable
approaches have been suggested including 3D models (Lou et al., 2005, Messelodi
et al., 2005b), shadow prediction (Song and Nevatia, 2007, Dahlkamp et al., 2006),
appearance models (Kim and Malik, 2003, Ma and Grimson, 2005), etc. Algorithms
developed for the generic object recognition domain have been applied and show
promising results in the urban traffic domain (Leibe et al., 2008b, Wijnhoven and
de With, 2007, Wijnhoven and de With, 2009).
2.5.1. Challenges
detection or traffic rule enforcement can be useful. This has generated a large and
clear tasks, as has been done in object recognition with (project PASCAL, nd).
The main contribution of a challenge like this is a public data set. The next section
introduces a few available data sets. One possible reason for the lack of a common
framework is the diversity of traffic rules, car classes, etc. around the world.
adopting vehicle classes according to local traffic regulations. There is very limited
literature dealing with night time (Robert, 2009a) and difficult light (Johansson
et al., 2009). To cover all possible situations, there might be the requirement for a
bank of detectors, which are switched based on illumination (Acunzo et al., 2007,
Thi et al., 2008).
dense traffic. There are many solutions for occlusion handling in highway scenes
(Hsieh et al., 2006, Su et al., 2007, Kanhere and Birchfield, 2008) for relatively
sparse traffic, which cannot necessarily be transferred to urban environments. The
Public data sets and evaluation would allow the field to objectively compare
algorithms. In addition, labelled training data is essential for the training of machine
their proprietary data, which is rarely made available on the web. Even with videos
available, ground truth is scarcer and very often application dependent. The i-LIDS
data set (iLIDS, nd) is an attempt by the UK Home Office to benchmark visual
surveillance systems based on requirements of end users. One scenario deals with
illegally parked cars on urban roads and consists of 24 hours of video. There is only
event based ground truth, which is of limited use for evaluation of low level
algorithms. Tracking ground truth is available for parts of those videos through
CHAPTER 2 REVIEW 2.6 Future Research and Thesis Outline
(CLEAR, 2007) with a vehicle and pedestrian tracker evaluated in (Taj et al., 2008).
Greyscale images of urban intersections from a long distance high vantage point
view are provided at (Nagel, nd) and are used in (Dahlkamp et al., 2004, Dahlkamp
et al., 2006). Image patches used in (Ma and Grimson, 2005) are available 1 as
Matlab data files. Similar image patches are used repeatedly in (Creusen et al.,
2009, Wijnhoven et al., 2008, Wijnhoven and de With, 2009, Wijnhoven and
de With, 2007) but no direct download is provided. Data for more general visual
surveillance with some traffic related scenes is available from (VISOR, nd).
with a specific focus on urban environments was presented in this chapter. Research
is expanding from the highway environment to the more challenging urban domain.
This opens many more application possibilities with traffic management and
down classification, which can raise issues under urban conditions. Methods from
the object recognition domain (bottom-up) have shown promising results, but not
required to generate a more consistent body of work, which uses common data for
There is a larger body of work dealing with vehicle detection than with
classification. For many applications, knowing the class of road users is essential.
Some combined detectors and classifiers have been proposed (Leibe et al., 2008b,
Wijnhoven and de With, 2009, Lou et al., 2005). Future classifiers should be able to
take tracking prediction into account. According to several studies (Wang et al.,
1
[Link]
2006, Morris and Trivedi, 2006a), the combination of both improves the results.
based classifier, which can incorporate tracking predictions as initial hypotheses for
After the low level detection and tracking is tackled, there is significant
potential for traffic rule enforcement. Current systems mainly focus on basic
counting in highway and urban scenes. More sophisticated analysis of road user
pedestrians. Intelligent traffic light timing could benefit from a measurement of the
state (position, velocity, class, etc.) of all road users at an intersection. The currently
introduced in chapter 3 and used throughout the remainder of the thesis. The use of
estimation. The classification task is the main objective for the algorithms
is integrated with the unified classification framework. This structure allows the
CHAPTER 3 MOTION SILHOUETTE CLASSIFIER 3.1 Introduction
3.1. Introduction
This chapter presents work done by the author to detect and classify vehicles and
pedestrians (collectively called 'road users') in urban traffic scenes. The problem
tackled is road user classification on a per frame basis of a video stream. Every
extracted by foreground analysis are the input to the classifier. The classification
process is based on 3D models for road users. Related work has been introduced in
page 44. Because of the overall context of CCTV monitoring, the detection of the
system can be restricted to specific region(s) of the camera view (also referred to as
evaluation is also used for the work described in a subsequent chapter where a more
sophisticated detection mechanism is proposed and evaluated.
road user being fully visible. This implies no occlusion in the scene and between
road users. The orientation of the road users on the ground plane throughout the
scene remains approximately constant, which implies that road users follow a
straight road. The viewing direction of road users towards the camera can change,
however, particularly if vehicles move from the back to the front of the camera
view. The assumption of constant orientation clearly does not hold for pedestrians,
who could be walking in any direction on the road. However, because of their
posture (walking) and size (on typical road monitoring CCTV), the appearance of
Figure 5 Example views from the i-LIDS data set with detected vehicles and
pedestrians. The left image also shows an ambiguous foreground region (thin blue
outline) on the top left, which was classified as class 'other' and consequently
has no wire frame. The outlines of regions of interest R are shown as dark red
their silhouettes does not change significantly with direction and so it turns out that
what seems to be an unrealistic assumption does not have a major effect on the
detection of pedestrians, as will be shown later with the results. Every region of
There are five classes used for the classifier as indicated in chapter 1 plus
an additional class for objects not belonging to any defined class.
Bus / Lorry
Van
Car / Taxi
Motorbike / Bicycle
Pedestrian
Other (class for objects not belonging to any of the above classes).
The rest of the chapter is organised as follows. Section 3.1.1 gives an overview of
the method. The detector is introduced in section 3.2. Section 3.3 covers the
classifier and models used. The evaluation of the proposed system is given in
section 3.4. Finally, a summary can be found in section 3.5.
scene by matching silhouettes with road user models. This is done by placing
candidate 3D models on the scene's ground plane and projecting them to the camera
view. A match measure is calculated for every hypothesis by comparing the model
with the foreground silhouette. Every model is placed on a grid of positions on the
ground plane to produce the match measure for every silhouette. This represents an
algorithm based on image measurement features, which are better than image
features according to (Morris and Trivedi, 2006b) (see discussion in section [Link]
on page 23). The highest match measure indicates the most likely position of the
road user given the silhouette. The highest match measures of different classes are
compared to make a decision about the class of a silhouette. Silhouettes with low
match measures for all classes are classified as being of the class 'other' (see
example in Figure 5). To use the 3D models, cameras are calibrated by means of a
map and a minimum of five corresponding points with the image. A system block
less fixed, as pointed out earlier. One ground plane orientation is defined for every
region of interest. A single object is assumed for every silhouette (i.e. no overlap).
CHAPTER 3 MOTION SILHOUETTE CLASSIFIER 3.2 Detection
[Figure 6: system block diagram. Detector part: filtered frame, GMM (Stauffer and Grimson, 1999), shadow removal.]
3.2. Detection
The detector uses background estimation to extract motion silhouettes from a video
frame. Those silhouettes will later be used by the classifier. See the detector part of
Figure 6 for a block diagram. Every block of the detector is described in more detail
in this section, followed by the classifier in section 3.3, which takes the silhouettes
as input. This structure will be expanded by a tracker in chapter 6, but will keep the
same generic structure.
interlaced video signals. This process captures different parts of a video frame at
slightly different times, which causes blurry boundaries for moving objects. To
rectify the zigzag boundary artefacts generated for moving objects, a pre-processing
step linearly interpolates odd video lines between even lines. In this way, the
original size ratio of the images is preserved for camera calibration and human
viewing. Alternative methods remove odd video lines completely, which causes the
image to appear squashed. Performance results will be presented with and without
de-interlacing filtering.
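The de-interlacing step can be sketched as follows. This is a minimal illustration that assumes a frame stored as a list of pixel rows with even rows at indices 0, 2, 4 and so on. The handling of the last odd row, which has no even row below it, is an assumption of this sketch.

```python
def deinterlace(frame):
    """Replace every odd row by the mean of its neighbouring even rows.

    frame: list of rows, each row a list of pixel values.
    The frame height is preserved, so the camera calibration remains valid.
    """
    out = [list(row) for row in frame]
    last = len(frame) - 1
    for y in range(1, len(frame), 2):
        above = frame[y - 1]
        # Assumption: the last odd row copies the even row above it.
        below = frame[y + 1] if y + 1 <= last else frame[y - 1]
        out[y] = [(a + b) / 2 for a, b in zip(above, below)]
    return out

frame = [[0, 0], [100, 100], [10, 20], [55, 55]]
filtered = deinterlace(frame)
```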
from the OpenCV library (OpenCV, nd) was used. The GMM, first introduced in
the seminal paper of (Stauffer and Grimson, 1999), is used to generate an initial
foreground mask. The software is set to estimate five Gaussians using a background
threshold of 0.7, which is the default value. The temporal window size, which is
incorporated into the background within 15 seconds. The outdoor scene recorded
with an auto iris function of the camera requires fast learning to accommodate
illumination changes. Large objects in the scene can change the overall illumination
conditions due to this gain control.
Shadow removal
The foreground pixels are post processed with the constant chromaticity shadow
removed from the foreground mask, if its shadow condition is true: The pixel colour
in the background image (most stable Gaussian from GMM) is compared to the
pixel colour in the current image. For the comparison, both colour values are
transformed into the HSV (Hue, Saturation, and Value) colour space. Value
reductions down to 55% of the current pixel with respect to the background pixel
are considered shadows, and the pixel is removed from the foreground mask. This
algorithm assumes that for shadowed surfaces hue and saturation stay constant and
only the value changes (all compared to the background image). This assumption
holds for light shadows as seen in overcast conditions if the camera is not saturated.
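The shadow condition for a single pixel can be sketched as follows. The 55% value limit is taken from the text. The tolerances `hue_tol` and `sat_tol`, which decide when hue and saturation count as constant, are hypothetical values chosen for this illustration.

```python
import colorsys

def is_shadow(current_rgb, background_rgb, min_ratio=0.55,
              hue_tol=0.05, sat_tol=0.1):
    """Constant chromaticity shadow test for one pixel (8 bit RGB tuples)."""
    h_c, s_c, v_c = colorsys.rgb_to_hsv(*[c / 255.0 for c in current_rgb])
    h_b, s_b, v_b = colorsys.rgb_to_hsv(*[c / 255.0 for c in background_rgb])
    if v_b == 0.0:
        return False  # a black background pixel cannot get darker
    ratio = v_c / v_b
    # Hue is circular, so take the shorter way around the colour wheel.
    hue_diff = abs(h_c - h_b)
    hue_diff = min(hue_diff, 1.0 - hue_diff)
    return (min_ratio <= ratio < 1.0
            and hue_diff <= hue_tol
            and abs(s_c - s_b) <= sat_tol)

background = (200, 100, 50)
shadowed = is_shadow((140, 70, 35), background)     # same colour, 70% value
dark_vehicle = is_shadow((40, 20, 10), background)  # value reduction too strong
blue_vehicle = is_shadow((50, 100, 140), background)  # hue changes
```

Pixels for which the test returns true are removed from the foreground mask. The dark and the blue vehicle pixels stay in the mask.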
Connected components
Connected components are extracted from the final binary foreground mask.
The purpose is to generate silhouettes, which are connected components and will be
is used to produce a set of final silhouettes S considering size and location with
respect to the region of interest R as explained below. This set of final silhouettes
gives the overlap of a silhouette with the region of interest R (e.g. red outlines in
Figure 5):
$$ \omega(S, R) = \frac{A_{S \cap R}}{A_S} \tag{1} $$
greater than or equal to a threshold L and that the overlap is greater than or equal to a
threshold O. Values of L = 200 pixels and O = 0.25 are used for the experiments.
Then:
CHAPTER 3 MOTION SILHOUETTE CLASSIFIER 3.3 Classification
Figure 7 Pictogram of the detector data flow (GMM, shadow removal, connected
components), corresponding to the detector block diagram in Figure 6. The mean
background images of the GMM modes are shown along the bottom, followed on top by
the foreground mask and connected components S.
$$ \mathcal{S} = \{\, S \mid L_S \ge L \;\wedge\; \omega(S, R) \ge O \,\} \tag{2} $$
The length threshold L should be set equal to the length of the smallest road user in
the scene. This means that smaller silhouettes corresponding to noise are filtered out.
If the value is chosen too large, smaller road users might be wrongly filtered out. The
choice of overlap threshold O only affects the silhouettes entering at the edge of
the region of interest R. Practically, road users are fully contained in the region of
interest R for most of the time, where the choice of overlap threshold O has no
effect. The data flow of the detector is illustrated in Figure 7 as a pictogram.
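The filtering of silhouettes by length and by overlap with the region of interest can be sketched as follows. Silhouettes and the region of interest are represented as sets of pixel coordinates. The length measure is passed in as a function, and the pixel count used by default is only a stand-in assumed for this sketch.

```python
def overlap_with_roi(silhouette, roi):
    """Fraction of the silhouette area that lies inside the region of interest."""
    return len(silhouette & roi) / len(silhouette)

def filter_silhouettes(silhouettes, roi, min_length, min_overlap, length_of=len):
    """Keep silhouettes that are long enough and overlap the region sufficiently.

    length_of is a hypothetical length measure; the default (pixel count)
    is a stand-in and not necessarily the measure used in the experiments.
    """
    return [s for s in silhouettes
            if length_of(s) >= min_length
            and overlap_with_roi(s, roi) >= min_overlap]

roi = {(x, y) for x in range(10) for y in range(10)}
vehicle = {(x, y) for x in range(3, 7) for y in range(3, 7)}  # fully inside
noise = {(1, 1)}                                              # too small
entering = {(x, 0) for x in range(8, 20)}                     # mostly outside
kept = filter_silhouettes([vehicle, noise, entering], roi,
                          min_length=4, min_overlap=0.25)
```

Only the silhouette that is large enough and sufficiently inside the region of interest remains. The small thresholds are chosen for the toy data, whereas the experiments use L = 200 pixels and O = 0.25.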
3.3. Classification
This step classifies each silhouette from the detection as one of the set of road
user types shown in Figure 8. This will be achieved by finding the match between
the projected model and a silhouette. The classifier is divided into four steps shown
in the classifier block diagram in Figure 6: ground plane hypothesis generation, 2D
Figure 8 Wire frame models Fi used for classification. Refer to Table 1 for model and
class correspondences.
model projection, overlap of model with silhouettes and maximum search. This
section follows this structure.
The camera requires calibration to be able to operate in the ground plane space and
to use 3D models. The algorithm of (Tsai, 1986) is used to obtain the ground plane
calibration for the camera using a map of the road and defining at least five
corresponding points between the map image and the camera image. Based on the
coordinates. Back projection from the image to the ground plane implies that points
are located on the ground plane in 3D world space.
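The back projection of an image point to the ground plane can be sketched as follows. This is a simplification: the calibration is assumed to be reduced to a 3x3 homography `H` from image pixels to ground plane coordinates, which ignores the lens distortion handled by the full camera model of (Tsai, 1986). The matrix values are hypothetical.

```python
def image_to_ground(H, u, v):
    """Map the image point (u, v) to ground plane coordinates (x, y) in metres."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("point lies on the horizon and has no ground position")
    return x / w, y / w

# Hypothetical homography, for illustration only.
H = [[0.1, 0.0, -32.0],
     [0.0, 0.2, -10.0],
     [0.0, 0.001, 1.0]]
ground = image_to_ground(H, 320, 100)
```

The result is only valid for image points that show objects touching the ground, since the mapping implies zero height.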
First, ground plane hypotheses are generated for the silhouettes. The 2D
ground plane (i.e. implying zero height) giving ground plane centre r . This
may not lie in the ground plane. This is dealt with by generating several hypotheses
around the ground plane centre. Those additional hypotheses also compensate for
points around the ground plane centre r where p is the index of the grid positions.
The grid parameters were optimised for the experiments: the total grid width was 7
metres containing 7 rows and 7 columns. The grid width has to be sufficiently large
to compensate for ground plane centre estimation noise of large models (e.g. bus). If
this grid size is chosen too small, large models will not be matched at the correct
location. To limit computational time, the number of rows and columns was chosen
as low as possible. If the number of rows and columns is chosen low, the
localisation of road users will be coarse.
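The grid of ground plane hypotheses can be sketched as follows. The sketch assumes that the outermost grid points span the total grid width, which gives a spacing of width / (rows - 1). Whether the experiments use this spacing or one metre cells is not stated here, so this is an assumption.

```python
def grid_hypotheses(centre, width=7.0, rows=7, cols=7):
    """Ground plane positions (metres) on a regular grid around the centre.

    Assumes rows >= 2 and cols >= 2 and that the outer points span `width`.
    """
    cx, cy = centre
    xs = [cx - width / 2 + width * c / (cols - 1) for c in range(cols)]
    ys = [cy - width / 2 + width * r / (rows - 1) for r in range(rows)]
    return [(x, y) for y in ys for x in xs]

hypotheses = grid_hypotheses((10.0, 20.0))
```

With the default parameters 49 hypotheses are generated per silhouette and per model, which shows why the number of rows and columns directly drives the computational time.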
2D model projection
The 2D projection generates model masks $M_{p,i}$ for every ground plane hypothesis
$h_p$. Figure 8 shows the full set of wire frame models $F = \{F_i\}$ used for
classification, where the model index $i$ is in the range of 0 to 9 (see Table 1). The
according to the following two steps: Every model point of wire frame Fi is
projected to the camera view (mask image) and the projected wire frame is drawn in
and class ID j (note that a class may correspond to more than one model)
Figure 9 Illustration of the projection process of models. The wire frame of models is
the mask image between the projected points. The binary mask $M_{p,i}$ is generated by
flood filling the projected wire frame. This process is illustrated in Figure 9. The
above algorithm implicitly changes the size and shape of the model masks according to
the ground plane location.
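The implicit change of the projected model size with the ground plane location can be illustrated with a pinhole projection. The 3x4 camera matrix `CAMERA` and the box model `BOX` are hypothetical. Drawing the wire frame lines and flood filling the mask are omitted in this sketch.

```python
def place_model(model_points, position):
    """Translate model points (metres) to a ground plane position (x, y)."""
    px, py = position
    return [(x + px, y + py, z) for x, y, z in model_points]

def project(P, point):
    """Project a 3D world point to the image with a 3x4 camera matrix P."""
    X, Y, Z = point
    u, v, w = (row[0] * X + row[1] * Y + row[2] * Z + row[3] for row in P)
    return u / w, v / w

def projected_width(P, model_points, position):
    """Horizontal extent in pixels of the model projected at a position."""
    us = [project(P, pt)[0] for pt in place_model(model_points, position)]
    return max(us) - min(us)

# Hypothetical camera at the origin looking along the y axis
# (focal length 500 pixels, principal point at (320, 240)).
CAMERA = [[500.0, 320.0, 0.0, 0.0],
          [0.0, 240.0, -500.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
# Hypothetical box model: 2 m wide, 4 m long, 1.5 m high.
BOX = [(x, y, z) for x in (-1.0, 1.0) for y in (0.0, 4.0) for z in (0.0, 1.5)]

near = projected_width(CAMERA, BOX, (0.0, 10.0))
far = projected_width(CAMERA, BOX, (0.0, 20.0))
```

The same model covers half the image width when placed twice as far from the camera, without any explicit scaling step.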
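[Figure 10: illustration of the overlap calculation. Silhouettes from the detector, flood filled model mask $M_{p,i}$ from the projection, and the intersection area, i.e. the overlap (in green).]
$$ \mu(M_{p,i}, S) = \frac{A_{M_{p,i} \cap S}}{A_{M_{p,i} \cup S}} \tag{4} $$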
which is similar to the approach presented in (Messelodi et al., 2005b). The area of
the intersection of both masks is divided by the area of the union of both masks,
which results in a match measure in the range [0, 1]. Figure 10 gives an illustration
of the overlap calculation.
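The match measure, intersection area divided by union area, can be sketched for masks represented as sets of pixel coordinates. The function name and the set representation are choices of this illustration.

```python
def match_measure(model_mask, silhouette):
    """Area of the intersection divided by the area of the union, in [0, 1]."""
    union = model_mask | silhouette
    if not union:
        return 0.0  # two empty masks: define the measure as zero
    return len(model_mask & silhouette) / len(union)

model_mask = {(x, y) for x in range(0, 4) for y in range(0, 4)}
silhouette = {(x, y) for x in range(2, 6) for y in range(0, 4)}
score = match_measure(model_mask, silhouette)
```

Here the masks share 8 of 24 pixels in the union, so the measure is 1/3. Identical masks give 1 and disjoint masks give 0.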
where g S and iS are the arguments that generated the maximum S i.e.
S
Figure 11 Match measure for one silhouette S . The upper left image shows the
silhouette and best fitting model iS at ground plane position ( x, y ) . Top right: the
through every model's match surface $\mu(M_{p,i}, S)$ along the minimum and maximum
Table 1.
The overlap response is illustrated in Figure 11. Note the well shaped peak of the
overlap function with respect to the ground plane positions $h_p$. The peak is elliptic
rather than circular, which can be observed by the different gradients in the bottom
graphs in Figure 11. This can be expected due to the perspective angle of the
camera. A shift along the x-axis (sideways on the road) produces a large horizontal
shift in the image, which generates a sharp drop in overlap (bottom right graph). A
shift along the y-axis (along the road) produces a less distinct vertical shift in the
image, which causes a slower drop and therefore lower accuracy (bottom left
graph). The lower the camera angle, the less accurate the y-axis measurement
becomes.
index jS for the model iS as there can be many models for one class to allow for
intra class variability
$$ j_S = T(i_S). \tag{7} $$
of a known class, i.e. has sufficient fit S to model iS . The set of detected road
D S S P , S S . (8)
missed road users and wrongly detected road users. For completeness, the
intermediate results and internal steps of the whole classification algorithm are
illustrated as a pictogram in Figure 12 showing mask and silhouette images.
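The maximum search and the class decision for one silhouette can be sketched as follows. The match measures are assumed to be available per grid position `p` and model index `i`. The model to class table and the threshold `min_fit` are hypothetical values for illustration and do not reproduce Table 1.

```python
def classify(scores, model_to_class, min_fit):
    """Pick the best (position, model) hypothesis and map the model to a class.

    scores:         dict mapping (p, i) to the match measure of that hypothesis.
    model_to_class: lookup table from model index i to class label.
    min_fit:        hypothetical minimum match measure for a known class.
    Returns (class label, best grid position or None, best match measure).
    """
    (p, i), best = max(scores.items(), key=lambda item: item[1])
    if best < min_fit:
        return "other", None, best
    return model_to_class[i], p, best

# Hypothetical table: two car models and one van model.
model_to_class = {0: "car", 1: "car", 2: "van"}
scores = {(0, 0): 0.41, (1, 0): 0.62, (0, 1): 0.55, (1, 2): 0.30}
label, position, fit = classify(scores, model_to_class, min_fit=0.5)
weak = classify({(0, 0): 0.2, (0, 2): 0.1}, model_to_class, min_fit=0.5)
```

Several models can map to the same class, which allows for intra class variability. A silhouette whose best match measure stays below the threshold is assigned to the class 'other'.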
CHAPTER 3 MOTION SILHOUETTE CLASSIFIER 3.4 Evaluation
[Figure 12 block labels: silhouettes, ground plane hypothesis, 2D projection, model mask, overlap area, scores, maximum, labels.]
Figure 12 Illustration of data flow for the classification framework. This corresponds
h p as green crosses. The model projected on the red position results in the red
flood filled model mask M , shown as example for a single hypothesis. The
3.4. Evaluation
The proposed system has been evaluated on video from the i-LIDS data sets (iLIDS,
nd). The set of ground truth $GT$ was partly provided by i-LIDS (ground
truthed by NIST) in Viper format (Viper,) consisting of bounding boxes and class
labels for road users. It had to be converted from NTSC to PAL indexing and was
extended for pedestrians. The classifier produces bounding boxes and class labels
for classified road users $D$, also in Viper format. The next section introduces the
metrics used, followed by the data set in section 3.4.2. Section 3.4.3 gives results
for vehicle only detection and classification. Joint operation of all road user classes
3.4.1. Metrics
Every classified road user $D$ is matched with the best overlapping bounding box
classified road user D is entered in column FP (false positive). All non matched
road users in the ground truth GT within the region of interest are entered in row
FN (false negatives). All the metrics used for evaluation can be derived from an
classifier, row FN (false negative) and column FP (false positive) are added to the
confusion matrix.
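$$
\begin{array}{ll|ccccc}
 & & \multicolumn{5}{c}{\text{Ground truth}} \\
 & & C_1 & C_2 & \cdots & C_N & FP \\
\hline
 & C_1 & c_{1,1} & c_{1,2} & \cdots & c_{1,N} & c_{1,N+1} \\
 & C_2 & c_{2,1} & c_{2,2} & \cdots & c_{2,N} & c_{2,N+1} \\
\text{Detected} & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
 & C_N & c_{N,1} & c_{N,2} & \cdots & c_{N,N} & c_{N,N+1} \\
 & FN & c_{N+1,1} & c_{N+1,2} & \cdots & c_{N+1,N} & 0
\end{array}
\qquad (9)
$$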
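How such an extended confusion matrix might be filled from bounding boxes can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for this thesis: the overlap is taken to be the intersection over union, the minimum overlap `min_overlap`, the greedy matching in detection order and the one-to-one matching constraint are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def extended_confusion(detections, ground_truth, n_classes, min_overlap=0.5):
    """Build the (N+1) x (N+1) matrix: rows are detected classes plus a final
    FN row, columns are ground truth classes plus a final FP column.

    detections, ground_truth: lists of (class_index, box) pairs.
    """
    c = np.zeros((n_classes + 1, n_classes + 1), dtype=int)
    used = set()
    for det_class, det_box in detections:
        best_k, best_overlap = None, min_overlap
        for k, (_, gt_box) in enumerate(ground_truth):
            if k in used:
                continue
            o = iou(det_box, gt_box)
            if o >= best_overlap:
                best_k, best_overlap = k, o
        if best_k is None:
            c[det_class, n_classes] += 1          # unmatched detection: column FP
        else:
            used.add(best_k)
            c[det_class, ground_truth[best_k][0]] += 1
    for k, (gt_class, _) in enumerate(ground_truth):
        if k not in used:
            c[n_classes, gt_class] += 1           # unmatched ground truth: row FN
    return c

# Two classes: one correct detection, one detection with the wrong class,
# one detection without ground truth and one missed ground truth object.
gt = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30)), (0, (50, 50, 60, 60))]
det = [(0, (1, 1, 10, 10)), (0, (21, 20, 30, 30)), (1, (80, 80, 90, 90))]
matrix = extended_confusion(det, gt, n_classes=2)
```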
The metrics used for the evaluation of the whole system (detector and classifier)
will be precision, recall and the F1 measure. The definitions are taken from the i-
LIDS trial (iLIDS, nd) specifications. Precision P and recall R are calculated
independently for every class Ci and jointly for all classes. The following
definitions are used for i-LIDS:
$$R = \frac{TP}{TP + FN} \qquad (10)$$

$$P = \frac{TP}{TP + FP} \qquad (11)$$

$$F_1 = \frac{(\alpha + 1)\, R\, P}{R + \alpha P} \qquad (12)$$
The recall bias $\alpha$ can be set according to the application. The values for the above
equations can be read from the confusion matrix (9). The true positive (TP) for any
class $C_i$ is the corresponding diagonal element $c_{i,i}$. The ground truth for recall is
the column sum of all classes. The number of detections used for precision is the
row sum of all classes $C_i$. Equations (10) to (12) can be expressed in terms of the
confusion matrix from equation (9) with a matrix element defined as $c_{i,j}$. The total
number of classes is $N$. The recall $R_{S,i}$ of the whole system (index $S$) per class $C_i$
and the precision $P_{S,i}$ of the whole system per class $C_i$ are defined as follows:
$$R_{S,i} = \frac{c_{i,i}}{\sum_{j=1}^{N+1} c_{j,i}} \qquad (13)$$

$$P_{S,i} = \frac{c_{i,i}}{\sum_{j=1}^{N+1} c_{i,j}}. \qquad (14)$$
Joint values for recall RS and precision PS for all classes can be calculated by
summing up all diagonal elements and the corresponding rows or columns. Every
class has an implicit weight according to the number of occurrences.
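$$R_S = \frac{\sum_{i=1}^{N} c_{i,i}}{\sum_{i=1}^{N} \sum_{j=1}^{N+1} c_{j,i}} \qquad (15)$$

$$P_S = \frac{\sum_{i=1}^{N} c_{i,i}}{\sum_{i=1}^{N} \sum_{j=1}^{N+1} c_{i,j}} \qquad (16)$$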
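The system metrics can be read from an extended confusion matrix as in the following sketch. It is an illustration only: the function name and the example matrix are hypothetical, and the recall bias is assumed to enter the F1 measure as a factor `alpha` in the form $(\alpha + 1)RP / (R + \alpha P)$. Classes without any ground truth or any detection would cause a division by zero and are not handled here.

```python
import numpy as np

def system_metrics(c, alpha=1.0):
    """Per-class and joint system recall/precision and F1 from an
    (N+1) x (N+1) extended confusion matrix (last row FN, last column FP)."""
    c = np.asarray(c, dtype=float)
    n = c.shape[0] - 1
    tp = np.diag(c)[:n]                     # diagonal elements c[i, i]
    gt_count = c[:, :n].sum(axis=0)         # column sums, including the FN row
    det_count = c[:n, :].sum(axis=1)        # row sums, including the FP column
    recall = tp.sum() / gt_count.sum()
    precision = tp.sum() / det_count.sum()
    return {
        "recall_per_class": tp / gt_count,
        "precision_per_class": tp / det_count,
        "recall": recall,
        "precision": precision,
        "f1": (alpha + 1.0) * recall * precision / (recall + alpha * precision),
    }

# Hypothetical two-class example (rows: detected C1, C2, FN; columns: C1, C2, FP).
example = [[8, 1, 1],
           [2, 6, 0],
           [0, 3, 0]]
metrics = system_metrics(example)
```

With `alpha = 1` the F1 measure reduces to the usual harmonic mean of recall and precision.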
Precision $P_{C,i}$ for the classifier only (index $C$) per class $C_i$ can be calculated by
ignoring the column for FP. This equation deals with the classification result of
correctly detected objects only. The classifier recall $R_C$ equals $P_C$ when considering all
classes jointly. The recall $R_{C,i}$ and precision $P_{C,i}$ for the classifier per class and the
joint precision $P_C$ are defined as:
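$$R_{C,i} = \frac{c_{i,i}}{\sum_{j=1}^{N} c_{j,i}} \qquad (17)$$

$$P_{C,i} = \frac{c_{i,i}}{\sum_{j=1}^{N} c_{i,j}} \qquad (18)$$

$$P_C = \frac{\sum_{i=1}^{N} c_{i,i}}{\sum_{i=1}^{N} \sum_{j=1}^{N} c_{i,j}}. \qquad (19)$$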
Finally, precision $P_D$ and recall $R_D$ can be calculated for the detector only (index
$D$). The classification performance is ignored by summing over all classes to
calculate the true positives. The column sum is used for recall and the row sum is
used for precision. Once again, the values can be calculated for each class $C_i$ ($P_{D,i}$,
$R_{D,i}$) or jointly for all classes ($P_D$, $R_D$).
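$$R_{D,i} = \frac{\sum_{j=1}^{N} c_{j,i}}{\sum_{j=1}^{N+1} c_{j,i}} \qquad (20)$$

$$R_D = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} c_{j,i}}{\sum_{i=1}^{N} \sum_{j=1}^{N+1} c_{j,i}} \qquad (21)$$

$$P_{D,i} = \frac{\sum_{j=1}^{N} c_{i,j}}{\sum_{j=1}^{N+1} c_{i,j}} \qquad (22)$$

$$P_D = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} c_{i,j}}{\sum_{i=1}^{N} \sum_{j=1}^{N+1} c_{i,j}} \qquad (23)$$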
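The classifier-only and detector-only metrics differ from the system metrics only in which parts of the extended confusion matrix are summed. The following sketch illustrates this; the function name and the example matrix are hypothetical and empty classes (zero denominators) are not handled.

```python
import numpy as np

def classifier_detector_metrics(c):
    """Classifier-only and detector-only metrics from an (N+1) x (N+1)
    extended confusion matrix (last row FN, last column FP)."""
    c = np.asarray(c, dtype=float)
    n = c.shape[0] - 1
    core = c[:n, :n]                  # matched detections only, no FP or FN
    tp = np.diag(core)
    return {
        # Classifier only: FN row and FP column are ignored.
        "classifier_recall_per_class": tp / core.sum(axis=0),
        "classifier_precision_per_class": tp / core.sum(axis=1),
        "classifier_precision": tp.sum() / core.sum(),
        # Detector only: any matched detection counts, whatever its class.
        "detector_recall_per_class": core.sum(axis=0) / c[:, :n].sum(axis=0),
        "detector_precision_per_class": core.sum(axis=1) / c[:n, :].sum(axis=1),
        "detector_recall": core.sum() / c[:, :n].sum(),
        "detector_precision": core.sum() / c[:n, :].sum(),
    }

# Hypothetical two-class example (rows: detected C1, C2, FN; columns: C1, C2, FP).
example = [[8, 1, 1],
           [2, 6, 0],
           [0, 3, 0]]
result = classifier_detector_metrics(example)
```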
This full set of metrics allows comparison with many published results. Often in the
literature only a subset of those metrics is provided in a single paper. These metrics
The i-LIDS data sets (iLIDS, nd) are licensed by the UK Home Office for image
research institutions and manufacturers. Each data set comprises 24 hours of video
sequences under a range of realistic operational conditions. They are used by the
evaluating and comparing algorithms by the computer vision community and there
is a gradual increase in take-up. Out of the Parked Car data set, what i-LIDS calls
"scenario 1" was chosen, because it complies with the assumptions in section 3.1
and provides road users with large scale variations. Refer to Figure 5, Figure 13 and
Figure 14 for example views. There is no public data set commonly used for urban
traffic analysis. This makes direct comparison of reported results difficult. One
contribution in this chapter is the use of this public data set to allow quick future
for sunny, overcast and changing conditions has been selected for the evaluation:
(PVTRA10xxxx) 1a03, 1a07, 1a13, 1a19, 1a20, 1a21, 2a04, 2a05, 2a06, 2a08, 2a09,
2a10, 2a11 and 2a15. The recordings use a camera with an auto iris function that
keeps the average illumination of the view constant. Large vehicles with a
predominant colour can cause adjustments in the iris and noticeable changes in the
background. In addition, the overcast videos contain saturated areas in the middle
and far end of the view. These are useful challenges to test the limit of the proposed
approach(es).
Some ground truth usable for the tests (the data is normally used for event
detection tests) was provided with the data set, however it had to be converted and
extended. This limited the total length of video used for the evaluation. The total
for each class are as follows: 47% car/taxi, 31% pedestrian and 8% each for van and
bus/lorry and 6% for motorbike/bicycle.
This section provides results for the proposed system without pedestrian models.
The pedestrian model has been removed for this section just to make the results
comparable to state of the art solutions in the literature, which usually do not
consider pedestrians. The author's results compare to the state of the art, but for
practical reasons are not evaluated on the same data for vehicle detection and
classification. Using the shadow removal filter without the de-interlacing filter
gives the best performance. Comparison of the filters is provided at the end of this
confusion matrix including FP (false positives) and FN (false negatives) for the
evaluation of detector and classifier and Table 3 shows results for the classifier and
details per class. All values are normalised to the ground truth count per class
displayed in the bottom row. The overlap indicates the overlap between ground truth
bounding box and detection bounding box, which is obtained as the bounding box
of the detected wire frame model. The whole system evaluates to a recall R of 87%
Table 2 Confusion matrix and overall silhouette classifier performance for vehicle
Table 3 Confusion matrix for the silhouette classifier using shadow removal and per
class evaluation
refer to Figure 13 for true positive examples and Figure 14 for wrong classification.
The higher number of false positives for the class bike is due to pedestrians being
classified as bikes. At this stage, no pedestrian model was used and all the motion
silhouettes resulting from pedestrians in the scene should have been classified as
belonging to class 'other'.
Figure 14 Top: Two examples for false positives due to pedestrians being detected
as bike and as car due to occlusion in a group. The bottom left image shows a car
being misclassified as bike as it turns into the car park. The last image shows a
missed car due to its similar colour compared to the saturated road area
Direct comparison of quantitative results with the literature is difficult due to the
lack of a common data set for vehicle classification. A detailed introduction to state
of the art algorithms has been given in section 2.4 on page 42. The total recall R
systems in terms of their reported results on their own data sets. The first reference
is (Messelodi et al., 2005b), which reports a system performance of 82.8% for detection and
classification of urban road users into 8 classes. All the following
systems use highway imagery, which highlights the lack of work on vehicle
classification using urban data. A total system recall $R$ of 65% at a precision $P$ of 75% for classifying 150
car samples into 3 classes after detection and tracking is achieved in (Chen and
Zhang, 2007). On a 20 minute test video, 70% of vehicles are classified (cars/non
cars) after detection and tracking in (Gupte et al., 2002). A classifier accuracy of
74.4% is reported for a 24 hour test sequence in (Morris and Trivedi, 2006b) using
3 classes. The same authors extended the system to 7 classes with a classification
accuracy of 88.4% in (Morris and Trivedi, 2006a).
Results of the algorithm proposed in this thesis are compared for four different
scenarios using input filters for shadow removal and de-interlacing. The effect of
using different combinations of those filters is shown in Figure 15. Shadow removal
is essential for good performance, whereas de-interlacing has a negative effect. This
is partly due to smoother outline of silhouettes and the additional noise introduced
filter will be more important in chapter 5 where the appearance of road users is used
for classification. The next section discusses all four cases in more detail when
pedestrian detection is also considered.
pedestrian models using 4 different filter algorithms: shadow removal (Sr), shadow
removal with de-interlacing (Sr+Di), de-interlacing (Di) and no filter (-). The left
diagram shows system recall $R$, precision $P$ and classifier precision $P_C$. The right
diagram indicates the detector recall $R_D$ and precision $P_D$
models for two different filter configurations from top: shadow removal (Sr) and
bottom: shadow removal with de-interlacing (Sr+Di). The car at the bottom left is
artefact.
This section shows results for the proposed algorithm, when all road users are
classified with the same framework. Results are given for the same four filter
de-interlacing and c) no filters) introduced in the last section with a qualitative
comparison in Figure 16 and Figure 17. Best performance can be seen for shadow
removal.
models for the remaining filter configurations top: de-interlacing (Di) and bottom: no
filter (-). Too large silhouettes (their perimeters shown in blue) can be observed when
shadow removal is not carried out, causing missed vehicles and wrong
Table 4 Confusion matrix of the motion silhouette classifier when using the shadow
Table 5 Classifier and class wise performance figures for the motion silhouette
The framework used with the shadow removal filter gives the best performance for
road user classification. Refer to Table 4 for an extended confusion matrix with
overall performance figures and to Table 5 for class wise results. Very good
occurs between bikes and pedestrians. This is due to very similar motion silhouettes
of both road users, especially in the far region of the camera view when bicycles are
Table 6 Confusion matrix and system performance of the motion silhouette classifier
seen front on (see Figure 13). The higher false positive rate for the bike class
observed earlier for the classifier without pedestrian models (Table 2) does not
appear here, as a pedestrian model was used. The low detection performance of
pedestrians is due to their non-rigid nature. The basic cube-like models do not
match motion silhouettes of pedestrians as well as they do cars, which required the
cameras does affect smaller objects more, which explains the performance increase
of pedestrians when using a de-interlacing filter in the next section. However, using
a single algorithm for all road users is beneficial in terms of system complexity.
The framework with both input filters indicates best performance for pedestrians.
The confusion matrix in Table 6 shows system recall 82% for pedestrians, which is
de-interlacing filter allows a better match of motion silhouettes compared to the last
Table 7 Confusion matrix for the motion silhouette classifier with de-interlacing
filtering and with no filter. More tables for those cases are provided in appendix C.1.
For these experiments, only the de-interlacing filter or no filters were used. In both
cases, the performance is significantly worse than the experiments that include the
shadow removal filter, which can be observed in Table 7. Compared to the best
performance in section [Link], recall drops by 11.7% to 67.8% and precision drops
by 10.5% to 73.4%. This is due to oversized motion silhouettes, which can be seen
in Figure 17. Therefore, this demonstrates that shadow removal is essential for this
framework to perform well.
section compares the performance of the classifier without pedestrian model for
indicating that the approach performs best for sunny conditions. This may be due to
the high contrast in the videos and therefore good foreground estimation. The
following sections give more details about each condition. Some performance tables
Figure 18 Performance comparison for the motion silhouette classifier under three
have been omitted here for space reasons and are provided in appendix C.1.
The best individual performance is achieved for sunny conditions, with the
confusion matrix shown in Table 8. Many researchers have reported that sunny
explained by the dynamic range of the images. The high contrast and the deep
shadow can be seen in examples of Figure 19. The sun allows a precise detection of
the outline of road users; however it includes a deep shadow. The classifier can deal
with that shadow as the silhouette is only extended in a single direction which
reduces the overlap match measure for all models but keeps the ordering. In
contrast, the lower dynamic range and the tendency of image saturation for overcast
conditions introduce more noise to the road user's silhouette. This noise has a
greater variability on the size of the silhouettes which can then lead to matching of a
wrong model. However, due to the shadow, the mean overlap measure of the
winning class in sunny conditions is 0.65, lower than the corresponding figure in
overcast conditions (0.69). This means that the accuracy of the detected location for
road users under sunny conditions is lower compared to overcast conditions.
Figure 19 Sunny examples top: true positive, bottom: false positive car and missed
car
Figure 20 Overcast examples top: two correct frames and bottom: one misclassified
The performance for overcast conditions is second best after sunny. The confusion
matrix in Table 9 shows many false positives for bikes. The false positives are
observations of pedestrians, which should have been classified as 'other'. The
misclassifications are mainly due to missed foreground areas due to saturation and low
dynamic range of the scene. Refer to Figure 20 for examples.
The worst performance can be observed for changing conditions. During those
sequences, the sun appears several times which causes the auto iris of the camera to
adjust. This produces ambiguous foreground silhouettes for short periods of time
resulting in lower performance. Refer to Table 10 for the extended confusion matrix
for this case with example views in Figure 21. The low performance of vans is due
to their predominant white colour, which causes reduced foreground areas during
times of saturation. This problem can be dealt with by exploiting the constraint that
the same road users are present in the scene for many frames. Temporal filters and
tracking are discussed in chapter 6.
Figure 21 Changing weather examples: Two correct frames at the top and two
3.5. Summary
This chapter presented a new algorithm for road user detection and classification
using 3D models. The target application is urban traffic analysis which has different
manufacturers' dimensions are projected onto the image plane to search for a
peak at the right ground plane position and distinguishes different classes. This
method has the potential of being useful for different applications (e.g. assembly,
intelligent spaces) to generate camera specific object templates for visual matching
and searching.
Evaluation was performed on the public i-LIDS data sets. Results have
been provided for several input filters and weather conditions. Good overall
89.8% which is higher than reported results in the literature, but evaluated on
different data sets. The lack of a common data set makes a direct comparison really
shape and therefore the knowledge of expected motion silhouettes. The simple
camera calibration used here allows the same models to be used
across cameras. The full model incorporates vehicles and pedestrians into the same
the classifier without pedestrian models (precision 92.9%). Some confusion can be
observed between bicycles and pedestrians, where 49% of bicycles are classified as
pedestrians. This is due to their similar size and motion silhouette. The evaluation
of input filters indicates that shadow removal filtering gives the best overall
is due to the higher contrast and therefore less noise of the silhouettes in sunshine.
As the author's classifier can deal with deep shadows, this condition gives the best
results.
give arguably similar results but no ground plane road location of road users. The
setup of the above would, however, require more specialist knowledge and require a
new setup for every camera. In contrast, the proposed algorithm uses a single set of
models (3D volumes for road users) for all cameras. The dimensions are taken from
real car dimensions in metres, which does not require any domain knowledge of
computer vision. To set up a new camera, the camera calibration is simply obtained
by clicking corresponding points on a road map and image.
the noise or imperfections of the silhouettes exceed the size variations between
models, the classification will be erroneous. New vehicle shapes with similar size
will not degrade performance. This is because the overall match between models
and silhouettes might be lowered, but the rank order which determines the class
would not be affected.
faced when processing the size and variety made me shift my focus from expert like
systems (rule based) as proposed in this chapter toward learning based methods like
the next chapters. In this way, the variety and challenges of the data can be
automatically tackled, provided that representative training data can be gathered.
The remainder of the thesis is dedicated towards more robust detection and
classification of road users. 3D models show good performance, but the use of
motion silhouettes alone is a distinct limitation for robustness and occlusions. This
information into the classifier to have additional information (such as texture and
appearance) apart from the motion foreground, which has shown limitations
performance in (Ma and Grimson, 2005) and are commonly used in object
recognition style approaches e.g. (Leibe et al., 2007, Leibe et al., 2008b). The next
chapter will focus on evaluating local features for surveillance tasks by considering
the seemingly simple problem of detecting human intrusion in sterile zones. Those
chapter to generate what the author will call 3DHOG features in chapter 5 in an
effort to overcome the limitations of the classifier presented here. In this way, the
appearance of road users will be incorporated into the 3D models.
CHAPTER 4 LOCAL FEATURES FOR HUMAN DETECTION 4.1 Introduction
4.1. Introduction
The previous chapter illustrated how motion (foreground) cues can effectively be
used for road user classification. The summary then identified some shortcomings
of estimating motion foreground. The use of local features (such as edges, textures,
etc.) might be a way of overcoming these limitations. So, this chapter describes
work done by the author to identify ways in which such local texture features
(extracted in patches) can be used to detect intrusion in sterile zones under a range
of environmental conditions. The work incorporates appearance into the models and
moves away from the motion approach. This means that segmentation and object
detection could take place using individual images (a task normally carried out
without any apparent difficulty by human beings). The integration of both concepts
(motion and local features) is then demonstrated in chapter 5. In this chapter the
author takes a seemingly simple scenario (something one would assume would have
been fully solved by now): the detection of people entering a sterile zone. This is a
common task for surveillance e.g. a fence along a railway line, warehouse
perimeters or similar. Such scenes contain a protected area typically with a physical
barrier (e.g. fence) and a restricted (sterile) zone bordering the barrier. The author
also uses the training and stringent testing framework given by the i-LIDS sterile
zone test data set of the United Kingdom Home Office (iLIDS, nd). This data set is
crawling, running, rolling, etc.) in two camera views, referred to here as View1
and View2 (Figure 22). The i-LIDS programme was inspired by a government need
(informed by CCTV users) to rank systems so that those with an appropriate level
forces. At the same time, the i-LIDS data set provides a common set of data that
researchers can use to compare results. Although some of its definitions of what
constitute true and false detections might seem arbitrary and even idiosyncratic, in
this work we fully adopt the i-LIDS definitions so that researchers and even end-
users could consider our results in that common context. The main challenge for the
camera shake, illumination changes, auto iris (adaptive gain), 24 hour operation,
rain, snow, wild animals, etc. Commonly used methods like motion estimation have
problems dealing with those conditions. Those problems arise from the need of such
method able to detect regions of interest (in this case intruders) on single images. In
this context, it is observed that in many cases sterile zones contain greenery, gravel
(railway) or other mostly homogeneous surfaces within which intrusion takes place.
The author therefore formulates the intrusion detection problem as one of detecting
features (in this case corresponding to the intruder). The next chapter
explores how the use of local features might improve the detection of road users
in urban conditions.
Thus, this chapter presents a new texture saliency classifier for intrusion
detection in still images. Salient objects are detected in real-time, based on spectral
texture features of image regions. This means in practice, that people are detected
Figure 22 Examples of the i-LIDS data set showing the two camera views, different
environmental conditions (falling snow in the middle left) and ways the fence is
approached.
due to their texture difference compared to their surrounding texture. The basic
detector is then extended with a combination of the texture saliency and an inter-
CHAPTER 4 LOCAL FEATURES FOR HUMAN DETECTION 4.2 Related Work
detection time (this is particularly relevant to the i-LIDS benchmark that allows
moving). The algorithms are tested on the i-LIDS sterile zone data set and
comparative results with the state of the art (at the time the work was done)
OpenCV blob tracker are presented.
Extensions to the detector are introduced in section 4.4. Section 4.5 describes the
data set and provides details on the implementation including timing analysis. Full
results are provided in section 4.6. Section 4.7 concludes the chapter.
background and methods operating on single frames. The first methods are usually
solution belongs to the second group, which gains robustness by solving the harder
problem of foreground reasoning when considering single frames only. The recent
body of pedestrian detectors like HOG (Dalal and Triggs, 2005), AdaBoost (Jones
and Snow, 2008) or edgelets (Wu and Nevatia, 2005) are not applicable, because
pedestrians are assumed to be upright. In the data set used here, people are also
crawling, rolling sideways, etc. which breaks this assumption.
wise background model with which to estimate motion foreground and perform
OpenCV1.0 blob tracker (OpenCV, nd). A background model based on the mode in
the temporal histogram is given in (Zheng et al., 2005). The disadvantage of using a
histogram is the slow adaptation for changed background when a high mode is
established. The seminal papers of (Stauffer and Grimson, 1999, Stauffer and
computational speed and memory size. This approach generally provides good
results for outdoor scenes. (Sheikh and Shah, 2005) consider a probabilistic
approach to model regions of pixels jointly. This allows the local spatial structure to
dynamic scene. This work is extended by (Culibrk et al., 2009) who estimate stable
texture regions. Periodically changing backgrounds are modelled in (Colombo
et al., 2007) to incorporate periodic distractions like escalators into the background
2006) and used for tracking in (Takala and Pietikainen, 2007). Pixel and block
based approaches are combined in (Chen et al., 2007) for a hierarchical method.
scene changes which are typical for realistic conditions. Detection based on single
frames may overcome those problems, however it increases the difficulty of
(Davies and Lienhart, 2006) to classify pixels into road and non road for vehicle
mounted cameras, assuming known road and non road seed areas. This does not
require an offline training phase but has additional input from a laser range scanner.
Based on training and structure from motion (Sturgess et al., 2009) propose a
segmentation system using graph cuts for road scene understanding. Texton, colour,
Tan, 2002); these are commonly used for classification of images and content based
retrieval. Recent work by (Shotton et al., 2009) uses Textons to segment a single
image and perform multi object recognition based on initial training.
A new texture saliency classifier is proposed here for intrusion detection in still
images. Intruders are detected because of their differing texture compared to the
surrounding texture in the image. This is achieved through the analysis of the
texture of local image patches in a video frame (the use of local patches is then
taken forward, as discussed in the next chapter, to the more general problem of road
user detection and classification). Thus, to analyse local texture the input image is
divided up into patches, for which individual spectral texture features are generated.
To identify image areas with similar texture, the patches are clustered in spatial and
feature space. Comparing the texture features of those clusters gives an indication
significantly differ from the rest. Those differing clusters (i.e. the corresponding
"foreground" (or salient) and grouped into objects. In this way, image patches are
evaluated for saliency within a single frame based on the overall homogeneity of
the image. Object detections (intruders) per frame are remembered over time to
build trajectories of object centres in the image space, which are then evaluated to
object approaches the barrier. How this is defined and used will become clearer
later.
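The idea of single-frame texture saliency can be illustrated with the following simplified sketch. It is not the classifier proposed in this chapter: the clustering in spatial and feature space is replaced here by a plain outlier test against the median patch feature, the log-magnitude spectrum is only one possible spectral feature, and the function names, the threshold factor `k` and the floor `eps` are assumptions made for illustration.

```python
import numpy as np

def spectral_feature(patch):
    """Log-magnitude 2D FFT of a patch with its mean removed, flattened."""
    patch = patch.astype(float)
    spectrum = np.abs(np.fft.fft2(patch - patch.mean()))
    return np.log1p(spectrum).ravel()

def salient_patches(image, size=16, k=3.0, eps=1e-9):
    """Mark patches whose spectral feature deviates from the bulk of the image.

    Returns a boolean grid with one entry per non-overlapping patch.
    """
    rows, cols = image.shape[0] // size, image.shape[1] // size
    feats = np.array([
        spectral_feature(image[r * size:(r + 1) * size, c * size:(c + 1) * size])
        for r in range(rows) for c in range(cols)
    ])
    # Distance of every patch feature to the median feature of the frame.
    dist = np.linalg.norm(feats - np.median(feats, axis=0), axis=1)
    # Robust spread of the distances (median absolute deviation).
    spread = np.median(np.abs(dist - np.median(dist)))
    threshold = np.median(dist) + k * max(spread, eps)
    return (dist > threshold).reshape(rows, cols)

# Synthetic frame: homogeneous vertical stripes with one patch of
# horizontal stripes standing in for an object of differing texture.
frame = np.tile(100.0 + 20.0 * ((np.arange(64) // 2) % 2), (64, 1))
frame[16:32, 32:48] = (100.0 + 20.0 * ((np.arange(16) // 4) % 2))[:, None]
saliency = salient_patches(frame)
```

Because the test uses one frame only, it does not depend on temporal consistency, which is the property exploited by the detection approach.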
The detection approach does not rely on temporal consistency of the frames.
In this way, the algorithm is robust to camera shake, illumination changes and
similar practical issues discussed earlier. In addition, complexity and runtime is still
CHAPTER 4 LOCAL FEATURES FOR HUMAN DETECTION 4.3 Intrusion Detector
basic detector and an interframe mask (that indicates possible motion by identifying
robustness against distracting appearances (e.g. fence shadows) that mainly only
affect texture and also against camera shake/illumination changes that mainly only
affect the differencing mask. To decrease the alarm response time of the system, a
Kalman filter is then introduced. The i-LIDS trial specification defines a hard ad-
hoc time limit for alarms of 10 seconds after the first appearance of an intruder.
This is probably driven by end-user demands and it means that a detected event is
only a true positive if detected within that time, otherwise it becomes a false
positive. By tracking partly visible people at the edge of the camera with a Kalman
filter based on motion, this early evidence allows faster alarm triggers within the
specified time. Having outlined the approach, in what follows, more detailed
descriptions of the algorithm are given.
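The time limit rule can be stated compactly as in the sketch below. The function name and the handling of alarms raised before the first appearance are assumptions; only the 10 second limit and the true/false positive outcome are taken from the description above.

```python
def score_alarm(first_appearance_s, alarm_time_s, limit_s=10.0):
    """Return 'TP' if the alarm is raised within limit_s seconds after the
    first appearance of the intruder, otherwise 'FP'.

    Alarms raised before the first appearance are counted as 'FP' here,
    which is an assumption of this sketch.
    """
    delay = alarm_time_s - first_appearance_s
    return "TP" if 0.0 <= delay <= limit_s else "FP"

on_time = score_alarm(100.0, 108.0)   # alarm 8 s after first appearance
too_late = score_alarm(100.0, 112.0)  # alarm 12 s after first appearance
```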
This section describes the author's intrusion detector based on texture saliency.
Section 4.3.1 discusses the five steps of foreground estimation and shows how the
spectral features of image patches are used to detect salient texture foreground
regions. Those regions are then combined into objects, for which trajectories are
built as described in section 4.3.2. The last section also describes the rules used to
trigger intrusion alarms based on the trajectories.
distribution of features for local image patches in a single frame is analysed for
the input image is first divided into patches, for which texture features are
calculated. The foreground estimation using this local texture is performed in five
steps shown as blue blocks in Figure 23:
Clustering
Those steps are each described in detail in the next sections after discussing the
acquisition of input images first.
A demonstrator system was then built consisting of a video player (the original
Quick Time MJPEG i-LIDS sequences were converted to MPEG-2 mpg files and
(CVBS) video output). This output was then fed to a frame grabber based on a
Philips TM-1300 Trimedia DSP (Digital Signal Processor) set to digitize the video
night-time footage, a decision was made to use only the luminance channel (a
similar argument can be made for weather conditions such as snow and fog that
have very little chrominance). Therefore the CIF monochrome (256 levels) output
of the frame grabber is passed (via the PCI bus) to the main algorithm that runs as a
normal PC application, thus this digitised video feed is the main input to the
The i-LIDS data includes the characterisation of two distinct regions: the approach
(in this case the grass) and the boundary (in this case the fence). Those two regions
are defined by two binary pixel masks $R_i$, which will drive the generation of two
populations of local image patches from the input image. Those two populations
will be analysed for saliency independently, which is why the region index $i$ is
introduced to distinguish them. The region index $i$ has value 0 for the boundary
(i.e. fence) and value 1 for the approach (i.e. grass). The masks $R_i$ are taken
directly from the sterile zone benchmark definition of i-LIDS (i.e. from the data set)
and were not chosen by the author. Figure 24 shows an example of such a mask.
Image patches $P_{i,p}$ are efficiently fitted to the region mask $R_i$. The index
$p$ enumerates patches for each region $R_i$ independently. The fitting process starts
by placing patches at the top left inside the mask and continues to the bottom right
by populating the mask with patches. The number of patches fitted depends on their
size and overlap, which will be discussed in the next paragraph. If the boundary of
the region is not vertical, the fitting will produce an unaligned grid of patches, as a
new row of patches always starts at the edge of the region mask $R_i$, which can be
seen in Figure 25.
The patches $P_{i,p}$ are 16x16 pixels and have 20 percent overlap between
them. The patch size is chosen as a power of 2 to enable the use of the Fast Fourier
Transform (FFT). The size should be chosen to be as small as possible to allow for
a fine foreground resolution. On the other hand, the patches have to be sufficiently
large to capture texture information. For the specific camera views, it was found
through evaluation that 16x16 pixel patches are sufficient to detect intrusion
foreground resolution and practically the sensitivity to small objects. The upper
limit of the overlap is ultimately determined by the required frame rate. The
computation time of the algorithm increases with the square of the overlap, because
the number of patches increases with the square of the overlap. Figure 25 shows a
frame with patches $P_{i,p}$ for both masks $R_i$ in different colours. The overlap can be
best observed for patches in the bottom row highlighted by blue arrows.
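
The patch fitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_patches`, the representation of the mask as nested lists of booleans and the rounding of the step (16 pixels with 20 percent overlap giving a step of 13 pixels) are not taken from the original implementation.

```python
def fit_patches(mask, size=16, overlap=0.2):
    """Return top-left (x, y) corners of square patches lying fully inside mask.

    mask is a list of rows of booleans (True = pixel belongs to the region).
    All names and the rounding of the step are illustrative assumptions.
    """
    step = max(1, round(size * (1.0 - overlap)))  # 16 px, 20 % overlap -> 13 px
    height, width = len(mask), len(mask[0])

    def inside(x, y):
        # A patch is accepted only if every one of its pixels is in the mask.
        return all(mask[yy][xx]
                   for yy in range(y, y + size)
                   for xx in range(x, x + size))

    patches = []
    for y in range(0, height - size + 1, step):
        # Each row starts at the first position where a patch fits, so rows
        # are not aligned with each other when the region boundary is slanted.
        x = next((x0 for x0 in range(width - size + 1) if inside(x0, y)), None)
        while x is not None and x + size <= width and inside(x, y):
            patches.append((x, y))
            x += step
    return patches


full = [[True] * 40 for _ in range(40)]
print(fit_patches(full))
```

The sketch stops a row at the first patch that no longer fits, which is sufficient for simple regions such as the fence and grass masks but would need extending for regions with holes.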
To capture the texture and generate features of image patches $P_{i,p}$, a Fast Fourier
Transform (FFT) is performed on each patch:
$$\mathcal{P}_{i,p} = \mathrm{FFT}(P_{i,p}). \tag{24}$$
The centre of the spectral patch $\mathcal{P}_{i,p}$ corresponds to the highest frequency, whereas
the border corresponds to the lowest frequency. How noise can be reduced from the
spectrum is explained in the next section.
The spectral patches $\mathcal{P}_{i,p}$ may contain noise, which may affect the foreground
detection. This step first reduces noise and then calculates texture features for every
image patch. Low frequencies contain the illumination conditions of the patch,
which can differ significantly during night e.g. Figure 22 on page 91 on the far
contrary, high frequencies contain noise from the original (analogue) video feed.
The random analogue noise (thermal noise) causes spatially small distortions also
called snow (Ciciora et al., 2004). This noise is introduced by the video player, cables
and capture card. Both low and high frequency components have to be removed
from the spectrum $\mathcal{P}_{i,p}$, resulting in a filtered spectral patch $\hat{\mathcal{P}}_{i,p}$. To reduce all the
above noise, band pass filtering is applied as follows: pixels along the border of
spectral patches $\mathcal{P}_{i,p}$ are blanked (pixel value 0) to fully remove low frequencies
and pixels in the centre of the patches are also blanked to fully remove high
frequencies. The width of the outer border area to be blanked is 2 pixels; in addition
the central square of 8 pixels width is removed. In this way, the noise is removed
Changing the width of blanked pixels by one pixel does not impact on performance
noticeably. An illustration of the filtered spectral patches $\hat{\mathcal{P}}_{i,p}$ is given in Figure 26.
normalised independently for display purposes). The patches on the right illustrate
the filtering by blanking the inner and outer area of the spectral patches $\hat{\mathcal{P}}_{i,p}$.
To generate a scalar feature $f_{i,p}$ for each patch $P_{i,p}$, the sum of all the
elements of the filtered spectral patch $\hat{\mathcal{P}}_{i,p}$ is calculated by
$$f_{i,p} = \sum_{u,v} \hat{\mathcal{P}}_{i,p}(u,v). \tag{25}$$
The feature $f_{i,p}$ can be used to discriminate people from background due to their
different texture, while at the same time it gives a similar response over the whole
background in typical sterile zone scenarios as defined by (iLIDS, nd). The example
in Figure 27 shows the difference of feature value of the intruder compared to the
grass background in the right area of the image.
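
The spectral feature of equations (24) and (25) can be illustrated with NumPy as follows. This is a minimal sketch, not the author's Matlab code: the function name `texture_feature` is hypothetical, the spectrum is assumed not to be shifted (so that the lowest frequencies lie along the border, as described in the text), and summing the magnitude of the complex spectrum is an assumption.

```python
import numpy as np


def texture_feature(patch, border=2, centre=8):
    """Scalar texture feature of one square grey-level patch (e.g. 16x16).

    Assumptions for illustration: unshifted FFT layout and use of the
    spectral magnitude; neither is confirmed by the text.
    """
    spectrum = np.abs(np.fft.fft2(np.asarray(patch, dtype=float)))  # eq. (24)
    n = spectrum.shape[0]
    keep = np.zeros(spectrum.shape, dtype=bool)
    # Keep everything except the outer border (lowest frequencies) ...
    keep[border:n - border, border:n - border] = True
    # ... and except the central square (highest frequencies).
    lo = (n - centre) // 2
    keep[lo:lo + centre, lo:lo + centre] = False
    return float(spectrum[keep].sum())  # eq. (25)


y, x = np.mgrid[0:16, 0:16]
flat = np.full((16, 16), 100.0)                      # uniform illumination
textured = np.cos(2 * np.pi * (3 * x + 3 * y) / 16)  # mid-frequency texture
print(texture_feature(flat), texture_feature(textured))
```

A uniformly lit patch has all of its energy in the blanked border and therefore yields a feature close to zero, whereas a mid-frequency texture survives the band pass filter.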
Figure 27 Scalar features $f_{i,p}$ (right) of image patches (left). The feature value range
is normalised to span the full grey level range. The fence and grass region are
normalised independently. The grass area shows an area distinctly different from the
average, which corresponds to an intruder. The second example along the bottom
After calculating the feature for each image patch, salient patches i.e. patches
large salient regions in the image, neighbouring patches $P_{i,p}$ are clustered with
respect to their location and the feature scalar value $f_{i,p}$. The mean feature value of
patches, larger support for saliency is accumulated. This reduces the detection of
clusters. For the clustering itself, a hierarchical cluster tree is generated to find $N$
clusters $C_{i,k}$ with cluster index $k \in [1, N]$ and the region index $i$ (as we continue to
maintain two separate populations of data). The choice of value of parameter $N$
will be discussed in the next section. Ward's linkage algorithm (Ward, 1963) is used
to combine clusters in the tree, which effectively minimises the square of the
Euclidean distance between elements in the clusters. The clusters of the example
frame are illustrated in Figure 28 as dots with different colours, where the elevated
clusters (red and green) in the right graph correspond to the intruder and will be
classified as foreground in the next step. For every cluster $C_{i,k}$, the mean feature
$\bar{f}_{i,k}$ is calculated by dividing the sum of features by the number of elements in the
cluster:
$$\bar{f}_{i,k} = \frac{\sum_{P_{i,p} \in C_{i,k}} f_{i,p}}{\left|\{P_{i,p} \in C_{i,k}\}\right|}. \tag{26}$$
The clusters for both regions (i.e. fence and grass) are illustrated in Figure 28.
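
As an illustration of the clustering step, the following sketch uses SciPy's hierarchical clustering with Ward linkage and then computes the cluster means of equation (26). It is not the original Matlab code: the function name `cluster_patches` is hypothetical, and combining patch location and feature value into one observation vector without rescaling is an assumption, since the relative weighting is not stated in the text.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_patches(xy, features, n_clusters):
    """Cluster patches of one region on (x, y, f) and return labels and means."""
    xy = np.asarray(xy, dtype=float)
    features = np.asarray(features, dtype=float)
    # One observation per patch: location and scalar feature (unscaled, assumed).
    observations = np.column_stack([xy, features])
    tree = linkage(observations, method="ward")
    # Cut the tree so that at most n_clusters clusters remain (labels start at 1).
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # Mean feature per cluster, equation (26).
    means = {int(k): float(features[labels == k].mean()) for k in np.unique(labels)}
    return labels, means


xy = [(0, 0), (13, 0), (26, 0), (200, 100), (213, 100), (226, 100)]
f = [100, 110, 90, 1000, 1010, 990]
labels, means = cluster_patches(xy, f, n_clusters=2)
print(labels, means)
```

The function would be called once per region, in keeping with the two separate populations of patches.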
[Figure 28 plots: two 3D scatter plots titled "Score of patches" with axes X, Y and Score; the left plot shows $f_{0,p}$ and the right plot shows $f_{1,p}$.]
Figure 28 Clusters $C_{i,k}$. The left image shows the clusters for the fence and the right
image the clusters for the grass. Every image patch is represented by one dot, where
The resulting clusters $C_{i,k}$ with mean features $\bar{f}_{i,k}$ are now classified into
assumed that most of the image contains background and only a maximum of
$M$ patches are foreground $F_i$. This is a valid assumption for typical sterile zone
scenarios where a camera covers a large area with a limited number of people
entering the scene. The i-LIDS trial allows for algorithm training, which could be
used to find parameters. The values for patches in the foreground $M$ and the
number of clusters $N$ empirically represent the scale and perspective of the camera
view and can be obtained from analysing the scene by the following procedure: The
smallest foreground object should occupy approximately one cluster. This results in
$N = 15$ by estimating the ratio between the number of image patches $P_{i,p}$ for the
smallest object and the total number of patches $P_{i,p}$ in region $R_i$. The number of
evaluated against this background statistic to confirm them as final foreground. The
leaving all other clusters as background $B_i = \{C_{i,k} \mid C_{i,k} \notin \tilde{F}_i\}$, where $\tilde{F}_i$ denotes the preliminary foreground. As background
statistic, the mean feature value $\bar{f}_i$ of the background clusters and their variance $\sigma_i^2$
are calculated, separately for both populations of data $i$, by:
$$\bar{f}_i = \operatorname{mean}\{\bar{f}_{i,k} \mid C_{i,k} \in B_i\} \tag{27}$$
$$\sigma_i^2 = \operatorname{var}\{\bar{f}_{i,k} \mid C_{i,k} \in B_i\}. \tag{28}$$
$$F_i = \{C_{i,k} \mid C_{i,k} \in \tilde{F}_i \wedge \bar{f}_{i,k} > T \sigma_i^2 + \bar{f}_i\} \tag{29}$$
with saliency threshold $T = 5$. This implies that the foreground patches have to
have higher feature values than the background. The threshold was optimised for
sample videos from the dataset's testing set. The choice of $T$ is not very sensitive,
here. The fact that the background is not modelled temporally here requires the
threshold to be applied to the feature value rather than the distribution proportion.
The use of a threshold as in (Stauffer and Grimson, 1999) would result in the
foreground being always the same fraction of the whole image. The graphs in
Figure 28 show similar absolute values for both fence and grass clusters, but the
clusters of the person in the grass are significantly elevated above the background
clusters. This shows that the feature value with respect to background statistics may
be an indication for foreground.
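
The classification of clusters against the background statistic, equations (27) to (29), can be sketched as below. Several points are assumptions made only for illustration: the function name `classify_clusters`, the way preliminary foreground clusters are picked (highest mean first, until the budget of M patches would be exceeded) and the use of the sample variance, which matches the default of Matlab's `var`.

```python
from statistics import mean, variance


def classify_clusters(cluster_means, cluster_sizes, max_fg_patches, threshold=5.0):
    """Return the keys of clusters confirmed as foreground for one region.

    cluster_means / cluster_sizes: dicts keyed by cluster index k.
    The preliminary selection rule below is an assumption, not the thesis rule.
    """
    # Preliminary foreground: most salient clusters within the patch budget M.
    order = sorted(cluster_means, key=cluster_means.get, reverse=True)
    candidates, used = [], 0
    for k in order:
        if used + cluster_sizes[k] > max_fg_patches:
            break
        candidates.append(k)
        used += cluster_sizes[k]

    # All remaining clusters form the background population B_i.
    background = [cluster_means[k] for k in cluster_means if k not in candidates]
    if len(background) < 2:
        return []  # too few background clusters to estimate a variance

    bg_mean = mean(background)     # eq. (27)
    bg_var = variance(background)  # eq. (28), sample variance assumed
    # Eq. (29): confirm candidates that are clearly above the background.
    return [k for k in candidates if cluster_means[k] > threshold * bg_var + bg_mean]


means = {1: 100.0, 2: 102.0, 3: 98.0, 4: 101.0, 5: 99.0, 6: 900.0}
sizes = {k: 8 for k in means}
print(classify_clusters(means, sizes, max_fg_patches=10))
```

Because the threshold is applied to the feature value rather than to a fixed proportion of the image, a frame without an intruder yields an empty foreground in this sketch.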
The intrusion rule part of the algorithm first generates objects from the foreground,
computed as described in the previous section, and then evaluates their temporal
blocks in Figure 23 on page 96. Firstly, spatially close foreground clusters are
merged into single objects $O$. This means that foreground clusters with patches
overlapping each other are merged. Large objects close to the camera are usually
associated with the closest trajectory based on the Euclidean distance in image
coordinates. This is sufficient due to the low false detection rate of the intrusion
detector and typical low number of trajectories. If there is more than one object in a
section 4.4.2. All trajectories T are considered for an alarm condition. The alarm
rule requires a trajectory to have accumulated support from the texture saliency
detection for 2 seconds and the horizontal motion component has to be consistently
towards the barrier (i.e. left or right, depending on the side of the fence). For
different camera setups, a different motion direction could be used (e.g. vertical for
a barrier along the top). A longer time window would increase the performance due
to the increased evidence of an intruder; however, the stringent time window
defined by the i-LIDS specification requires alarms to be raised quickly. The fence location
(left or right) is obtained from the i-LIDS scenario definition together with the
sterile zone masks. An example frame with intruder and trajectory is shown in
Figure 30.
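
The alarm rule can be sketched as a small function over a trajectory. The data layout (a list of time, horizontal position and saliency flag), the name `should_alarm` and the strict reading of "consistently towards the barrier" (no step away from the fence is tolerated) are assumptions for illustration only.

```python
def should_alarm(trajectory, fence_side, min_support_s=2.0):
    """Decide whether a trajectory triggers an intrusion alarm.

    trajectory: list of (time_s, x, salient) samples in time order, where
    salient is True if the texture saliency detector supported the sample.
    fence_side: "left" or "right" (taken from the scenario definition).
    """
    if len(trajectory) < 2:
        return False
    direction = -1 if fence_side == "left" else 1
    support = 0.0
    for (t0, x0, _), (t1, x1, salient) in zip(trajectory, trajectory[1:]):
        if (x1 - x0) * direction < 0:
            return False  # a step away from the barrier breaks consistency
        if salient:
            support += t1 - t0  # accumulate time backed by saliency detections
    return support >= min_support_s


walk_left = [(i * 0.1, 300 - 5 * i, True) for i in range(25)]
print(should_alarm(walk_left, "left"), should_alarm(walk_left, "right"))
```

For a barrier along the top of the image, the same structure could be used with the vertical coordinate instead of the horizontal one.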
CHAPTER 4 LOCAL FEATURES FOR HUMAN DETECTION 4.4 Detector Extensions
quality and shorten the alarm triggering time. The information fusion resulted in a
dependencies. The extension with Kalman improved the time, but degraded
The algorithm described in section 4.3.1 does not use any temporal information for
detecting objects. The main reason for false detections is the existence of semi
differencing for pixel wise motion foreground estimation; please see Figure 31 for a
block diagram (orange colour).
Figure 31 Block diagram for the intrusion detection with motion extension in orange.
The output of the motion estimation step is a final motion mask M and is
pixels are selected as interframe motion. This enforces that only part of a frame
(e.g. a person if present) can be foreground at any given time, considering the
changes in image conditions (e.g. illumination change due to sun) can be dealt with
by focusing on the most significant moving objects. If there are no moving objects,
the interframe motion corresponds to uniform noise pixels which will then not be
considered further due to their small size. Finally, morphological opening with a
3x3 kernel is applied to eliminate such small noise and join up larger regions to
result in the final motion mask M . This motion mask tends to contain only edges of
moving objects, losing the middle section due to the crude interframe difference.
The fusion will be able to use this mask M , because the fusion does not require
complete coverage of the object, which is discussed in the next step. For
Information fusion
On the one hand, motion information is affected by camera shake, fast changing
illumination conditions, etc. which is typical for this application as pointed out in
the introduction of this chapter. On the other hand, it is robust against the existence
of stationary objects, which could affect the texture saliency detection only. The
information fusion requires valid objects to contain at least 5 motion pixels inside
the bounding box(es) of the object(s) detected by saliency. In this way, an object
bounding box tends to always fully enclose the moving intruder due to the coarse
comfortably. The number of motion pixels required was chosen as low as possible
to avoid rejection of slowly moving intruders, but larger than the typical number of
noise pixels in texture bounding boxes for the testing dataset. The fusion approach
reduces false detections, as will be shown in the results section, as noise for
appearance and motion is independent and therefore less likely to occur jointly.
This allows lower detection thresholds for both detectors, which significantly
[Figure 32 pictogram: image patches $P_{i,p}$, filtered spectral patches $\hat{\mathcal{P}}_{i,p}$ and features $f_{i,p}$, followed by two "Score of patches" plots (axes X, Y, Score) of the clusters $C_{i,k}$, the foreground $F_i$, and the fusion with the motion mask $M$ resulting in objects $O$.]
Figure 32 Pictogram of data flow of intrusion detection with motion extension. This
corresponds to the block diagram in Figure 31 and uses the same colour code. The
blue path gives an overview of the basic intrusion detection described in 4.3.1. The
individual images are described in section 4.3.1. The orange path shows the
[Figure 33 block diagram: Interframe Difference, Threshold 10%, Morph. Opening, Motion mask $M$, Fusion, objects $O$, Kalman Filter, trajectories $T$, Intrusion Rule, Alarm.]
Figure 33 Block diagram for intrusion detector with Kalman Filter extension
The algorithms described here so far suffer from a delay until the first detection of a
person (i.e. latency). Very slow moving people stay partly occluded by the edge of
the camera for a significant time which potentially delays detection. The second
extension with a Kalman filter (Kalman, 1960) addresses this problem by allowing
is very noisy. In contrast to the basic intrusion detection system, some filtering is
required to provide consistency for trajectories. Please refer to Figure 33 for a block
diagram and to Figure 34 for visual results. Examples in Figure 35 show people
who may stay partly occluded until the latest possible alarm triggering time. The
model. First, silhouettes S are extracted from the motion mask M as connected
The second extension with a Kalman filter overcomes the problem of late detection
sideways in the image on the left, which indicates the various ways the fence is
New trajectories are initialised for both of those inputs (motion silhouettes
x, y over time, where the centroid of a motion silhouette S becomes the first
object location in the trajectory. The saliency detector in comparison has a much
higher precision and in practice does not require a minimum size filter. All
and objects, the distance between a Kalman filter prediction $(\hat{x}, \hat{y})$ and object
locations is evaluated. The closest object is used for the update according to
equation (30). Positions of salient texture objects $O$ are denoted $(x_o, y_o)$ and those of
motion silhouettes $(x_s, y_s)$, which defines the measurement selection as
$$z = \begin{cases} (x_o, y_o) & \text{if } \min_o \left[(\hat{x} - x_o)^2 + (\hat{y} - y_o)^2\right] \le \min_s \left[(\hat{x} - x_s)^2 + (\hat{y} - y_s)^2\right] \\ (x_s, y_s) & \text{else.} \end{cases} \tag{30}$$
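
The measurement selection of equation (30) amounts to a nearest-neighbour choice between the two kinds of detections. A minimal sketch follows; the function name `select_measurement`, the handling of empty inputs and the tie-break in favour of the saliency object are assumptions and are not taken from the original implementation.

```python
def select_measurement(prediction, objects, silhouettes):
    """Pick the measurement z used to update one Kalman filter trajectory.

    prediction: predicted position (x, y).
    objects: positions of salient texture objects.
    silhouettes: centroids of motion silhouettes.
    Returns the closest position, or None if there is nothing to update with.
    """
    px, py = prediction

    def sq_dist(point):
        return (px - point[0]) ** 2 + (py - point[1]) ** 2

    best_o = min(objects, key=sq_dist, default=None)
    best_s = min(silhouettes, key=sq_dist, default=None)
    if best_o is None:
        return best_s
    # Ties go to the saliency object (assumed, as it has the higher precision).
    if best_s is None or sq_dist(best_o) <= sq_dist(best_s):
        return best_o
    return best_s


print(select_measurement((10, 10), [(12, 10), (50, 50)], [(10, 15)]))
```

In a full tracker, a gating distance would normally limit how far a measurement may lie from the prediction; the small catch area mentioned in the analysis plays this role.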
The update with the motion silhouettes S allows trajectories T to start at the first
appearance of a person at the edge of the camera and to fill temporal gaps in the
saliency detection. The alarm delay time is thus reduced by this early detection of
partly occluded people at the edge of the camera before the saliency detector
triggers for the first time (see Figure 35). Saliency detection is mandatory for an
Figure 35 Comparison of alarm triggering time. The left column shows the frame
when the system with Kalman filter triggered an alarm. The right column shows later
alarms of the system without the filter, especially when intruders are partly occluded
CHAPTER 4 LOCAL FEATURES FOR HUMAN DETECTION 4.5 i-LIDS Testing
algorithm. The system has been tested on the i-LIDS data set, which is described
with the particular requirements for system design. A runtime analysis for the real-
time performance of the system is provided.
The i-LIDS challenge aims at providing a benchmark for systems which is defined
by end users of the technology. The fact that the problem definition and data is
generated by users ensures relevance and applicability of tested systems. Each data
realistic operational conditions. The data set is limited in terms of number of views;
however, producing a new view with the same variation of conditions carries a
significant cost. The sterile zone test data set is used, which consists of two views
(one colour, one black & white) during day and night with various weather
conditions (rain, snow, fast moving shadows, etc.). The test requires one
alarm to be raised for every intrusion event and the response to be compared with the provided
later than 10 seconds after the first appearance of an intruder. Each of the two
camera views (View1 and View2) is split into a sequence with alarms (208 total)
and a sequence without alarms but with various distractions (birds, rabbits, etc.)
recorded over the duration of a whole year. Refer to Figure 22, Figure 34 and
Figure 35 for detection examples.
4.5.2. Framework
video input with 25 frames per second at PAL resolution and providing a relay
alarm output (see Figure 36). For the tests, the video was played back to the
computer with a hard drive video player as composite video signal. A Trimedia
frame grabber (NXP, nd) was used to sample the video and provide it to a capture
compiled and dynamically linked to the capture application. The capture application
provides access to the hardware and performs conditioning of the input frames e.g.
by allowing full control of brightness and contrast. This application also contains
the user interface for ground truth handling and setting up of experiments. The
Matlab module contains the algorithm described in this chapter by taking frames as
input and providing alarms and trajectories as outputs.
The system was tested on a 2.4 GHz Pentium 4 with 1 GB RAM. Real-time
processing time of 81 ms. Figure 37 shows the capture application's execution time
over 200 processed frames. Overhead for the frame grabber is not shown. There is a
little overhead of 1.1 ms for performing the Matlab call from the capture application.
The majority of the time is spent on the patch analysis (FFT) and the subsequent
clustering, classification and information fusion. The Kalman filter takes little time
(0.5 ms), but the additional connected component analysis of the motion mask decreases
the frame rate of the intrusion detector from 10 fps to 9 fps.
4.6. Results
This section describes the evaluation metrics, the baseline algorithm and gives
qualitative results with analysis.
4.6.1. Metrics
The i-LIDS challenge defines event based evaluation, where only alarms reported
within a window of 10 seconds of ground truth events are considered true positives
(TP). Any alarms reported outside this window are false positives (FP). This is a
consider the speed (e.g. slow) or location of an intruder. Later in this chapter, results
are reported for the 10 seconds window but also for a 20 seconds window to
illustrate the effect of this metric. A person who might cause a second alarm due to
a lost track would also count as false positive. Any missed person causes a false
equations (10) to (12) on page 66. The recall bias $\alpha$ can have two values depending
on the system's role, where higher values of $\alpha$ increase the weight on the recall.
false alarms, which could disturb operators. Systems for event recording use
$\alpha = 0.75$ to focus on detecting intrusions more reliably with less penalty on false
alarms.
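
To make the effect of the recall bias concrete, the sketch below computes a recall-biased F1 measure from event counts. Equations (10) to (12) are not reproduced in this section, so the formula used here, F1 = (α + 1)·R·P / (R + α·P), is an assumption based on the commonly quoted i-LIDS definition; the function name `f1_score` and the example counts are likewise illustrative.

```python
def f1_score(tp, fp, fn, alpha):
    """Recall-biased F1 measure from true/false positives and false negatives.

    Assumed form: F1 = (alpha + 1) * R * P / (R + alpha * P), where a larger
    alpha puts more weight on the recall R.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (alpha + 1) * recall * precision / (recall + alpha * precision)


# Illustrative counts only: 208 events with 17 missed and 16 false alarms.
for alpha in (0.65, 0.75):
    print(alpha, round(f1_score(tp=191, fp=16, fn=17, alpha=alpha), 3))
```

With nearly equal numbers of false positives and false negatives, precision and recall almost coincide and the measure barely changes between the two settings of the recall bias.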
4.6.2. Baseline
The baseline used is a standard Kalman filter blob tracker with Gaussian
background modelling based on the OpenCV library (OpenCV, nd) blob tracker
(parameters FG_1, BD_CC, CCMSPF, Kalman). This algorithm belongs to the first
extracted from the foreground mask. Blobs are tracked by mean shift and resulting
trajectories are post processed with a Kalman filter. The intrusion rule framework
from section 4.3.2 is then applied to the trajectories. The main reasons for false
detections are camera shake, fast illumination changes due to clouds, birds and
changes from black & white to colour of the camera. This tracker is not without
limitation, but it has been exposed to many applications and the behaviour is well
understood so that the performance figures can be interpreted more easily.
Figure 38 Performance for 10 seconds alarm window. Results are shown for alarming
sequences, total per view including the non alarm sequences and total of the whole
data set.
4.6.3. Analysis
Four algorithms are compared in this section (see Figure 38). The performance data
is split into two camera views (View1, View2) and into sequences containing
alarms and the total performance for the whole camera view. First is the baseline
followed by the intrusion detector based on texture saliency. The final two
algorithms incorporate the motion extension and the Kalman filter into the intrusion
detector. All performance values are for operational alert ($\alpha = 0.65$) unless stated
the intrusion detector. This outperforms the motion tracker, however, there are
errors related to texture when shadows of the fences are detected. Low image
contrast is the most common error cause and the reason for lower performance on
View2, see Figure 39 for false positives from texture and false negatives from low
contrast. A high detection threshold is required to eliminate false positives.
Figure 39 The top left image shows a wrongly detected bird flying towards the fence.
The top right image shows a false detection due to fast moving clouds present at
the same time as fence shadows, both errors are caused by texture. The bottom
$F_1 = 0.89$ by exploiting the independence of the noise sources. A low threshold for
saliency and motion detection allows reduction of false negatives from 35 to 17. To
achieve this result, the saliency threshold was optimised resulting in $T = 2$, because
lower thresholds produced arbitrary detection when no intruders were present in the
image. With the fusion, the false positives are also reduced from 44 to 16. One
10 seconds is noticeable for both intrusion detectors due to later correct detections
This increased alarm time inspired the second extension by using Kalman filtering
and initialising tracks from motion silhouettes $S$ in the interframe difference mask
M . It is the last system shown in the figures. The minimum silhouette size 5 ,
which is larger than the typical noise observed in the data (e.g. Figure 32). The
extension. This is due to a larger number of false positives particularly during the
snow sequence. To keep those false positives down, the catch area for trajectories is
kept small, which trades off some fragmented trajectories for people. Those
trajectories are too short to alarm on which causes false negatives. The average
alarm time in the 10 seconds window is lowered from 3.4 seconds for motion
extension to 3 seconds for Kalman filtering, which was the aim of the extension.
Table 11 Detailed numbers of TP, FP and FN with F1 measures for all four systems
(Figure 40 and Table 11) which is caused by late correct detections. A late detection
negative and false positive at the same time. The best overall performance is
$F_1 = 0.92$ for the intrusion detector with motion extension with the best
performance for View1 of $F_1 = 0.95$. View2 suffers from very low contrast, which
is a particular problem for the detector; however, the motion extension improves the
performance significantly by reducing the false positives from 30 to 7.
Finally, the $F_1$ measure is compared for the two values of recall bias $\alpha$.
When using the event recording setting $\alpha = 0.75$ the motion tracker performance is
0.1%. The other two systems are not affected by $\alpha$ due to an even balance
between false positives and false negatives in the results.
4.7. Summary
This chapter proposed a new texture saliency classifier to detect objects in still
images of the i-LIDS sterile zone data set. The intrusion detector has been
implemented in C++ and Matlab to operate in real time from analogue video input.
detail. The detector outperforms the OpenCV blob tracker as baseline. A first
performance significantly to $F_1 = 0.92$ on the 24 hour test data set. The second
extension using a Kalman filter is used to improve alarm response times; however, it
degrades the overall performance due to more false positives. The false positives
are caused by noisy motion based foreground estimation. The results demonstrate
good performance for local features in surveillance tasks with minimal reliance on
motion information.
This part of the work was important because it gave rise to the concept of
local feature patches which is carried forward to the work described in the next
chapter, where it is combined with 3D spatial models. This allows the combination
modelling. This additional concept extends the application range from intrusion
detection to vehicle classification.
5. 3DHOG Classifier
5.1. Introduction
This chapter describes the 3D extended Histogram of Oriented Gradients (3DHOG)
classifier for vehicle and pedestrian detection. This new concept, developed by the
author, extends 3D models from chapter 3 with the idea of local features evaluated
in chapter 4. The overall concept including camera calibration remains the same as
in chapter 3. The model matching process is changed and solved by also
considering the appearance of objects. The spatial 3D models are combined with
patch based appearance models to make the model matching independent of motion
silhouettes (example results in Figure 41). Given a hypothesised (or known during
training) object position and orientation on the ground plane, the model matching
models. Those 3D appearance representations are incomplete and only contain data
from the visible part of objects in the 2D frame, which can vary depending on the
view point. The classifier uses this incomplete representation of a new object to
match it against a trained and complete 3D appearance model of the known classes.
In this way, the classifier requires only a single but complete 3D appearance model
(containing data from many viewing angles) to deal with any object view point.
Figure 41 Example views from the i-LIDS data set with detected and classified
points (see footnote 2) on the surface of the models. Appearance information (local features) at the
location of those interest points will be extracted to construct the appearance models
and to classify newly seen objects. The local features themselves are constructed
interest points and HOG is hence introduced as the novel 3DHOG feature.
features. Given a hypothesised object location and orientation, the new feature
visibility and self occlusion). Trained models can be matched against objects in any
given viewing direction. The framework can deal with part occlusions e.g. by the
edge of the camera, which is shown in section 6.2.
discusses related work. Section 5.3 describes the spatial models and how interest
points are used in those models. Section 5.4 describes the appearance feature
Footnote 2: In this work, the term interest point is used to refer to a point or location around which local
features will be extracted (i.e. they define the location of an image patch for computing features).
The location of the point is not directly determined from the image data, but defined by the models.
This differs slightly from the common usage in generic object recognition, where interest points
are extracted from images by interest point detectors (e.g. corners, SIFT, Hessian).
CHAPTER 5 3DHOG CLASSIFIER 5.2 Related Work
extraction process using the spatial models. The training of the appearance models
3.3) is given in section 5.6 with performance evaluation in section 5.7. The chapter
is summarised in section 5.8.
2006a, Song and Nevatia, 2007, Messelodi et al., 2005b). This approach is
environments due to low camera angles, occlusions, etc. The above 2D approaches
Nevatia, 2007, Messelodi et al., 2005b, Park et al., 2007, Ottlik and Nagel, 2008
and chapter 3.
model with SIFT features is used in (Ma and Grimson, 2005) for vehicle
classification. The implicit shape model is used in (Leibe et al., 2005), (Leibe et al.,
2007) and (Leibe et al., 2008b) for pedestrian detection and shows the object
2008, Pingkun et al., 2007). 'Top-down' and 'bottom-up' approaches are combined
by (Dalal and Triggs, 2005), using local features with 2D fixed spatial constraints.
Figure 42 Overview of the algorithm. This block diagram outlines the relationship
This is used for pedestrian detection and for action recognition with temporal
extension in (Kläser et al., 2008, Wang et al., 2009a).
The new approach takes the good results from 3D models into account (Song and
Nevatia, 2007, Messelodi et al., 2005b) and incorporates local appearance features
into the models. The whole framework is outlined in Figure 42 and follows four
steps, which are described in the subsequent sections:
Extracting local features in section 5.4 deals with the feature extraction
process based on the spatial models above. The feature extraction will
then be used for both training and classification.
the feature extraction uses the spatial models to generate local features
for a new image. Those features are matched against the appearance
spatial and appearance models to qualify the above four steps. The new method
defines the local features and the spatial relationship between them in 3D world
where scale is defined. It also allows a single complete appearance model to be used
for any viewing angle. In general, the method of histogram of oriented gradients
(HOG) using a planar 2D search window (Dalal and Triggs, 2005) is generalised to
3D by conceptually 'wrapping' the camera image around the models (Figure 43)
as in (Starck and Hilton, 2005). Using calibrated cameras, obtained in a relatively
straightforward way given a plan map of the scene, the scale of objects is
determined directly, in contrast to the multiple scale search in (Dalal and Triggs,
2005). The search space is now the ground plane as it was in chapter 3. By
introducing a model match framework that deals with variable numbers of visible
interest points, a single appearance model can be used to match objects from any
angle. The trained classifier is portable between different cameras, only requiring
the calibration of a new camera or a new camera position. This will be shown in
section 6.2.
The algorithm detects rigid vehicles and pedestrians in the same way and
does not use special cases. Texture is used to generate local features which do not
rely on potentially noisy motion information. This implies that the method could be
stationary objects, single frames and moving cameras. The match measure is
CHAPTER 5 3DHOG CLASSIFIER 5.3 Defining Spatial 3D Models
Figure 43 3D spatial models taken from chapter 3 extended with interest points. The
interest points are illustrated as cones, which signify the position and also normal
direction of interest points. The diameter of the cone will later be used to visualise
locations in model space. Those locations will then be used, as explained in the next
section, to extract features. The positions of a set of interest points are defined to be
similar to chapter 3). The method, described in more detail next, is applied to all
models and to keep the expressions succinct there is no model index subscript. An
origin p 0 (centre of the face). The direction of interest points p is normal to the
face. The function grid( ) produces the set of interest points P as a two dimensional
array of points filling the whole face (polygon):
$$P = \mathrm{grid}(p_0, d_f). \qquad (31)$$
To ensure good coverage for small faces of e.g. pedestrians, while also limiting the
total number of interest points for large faces of e.g. buses, the face density d f is
adjusted according to face size s f following equation (32) below. The face size s f
d0
df (32)
sf
1
s0
For the experiments, $s_0 = 4$ m, $d_0 = 4$ interest points per metre and a value of
0.35 for the remaining parameter of equation (32) were used. This trades off
oversampling against very sparse interest points. In the case of
low reference density d 0 and therefore sparse interest points, image patches would
need to be made large to cover the spatial model without gaps between patches.
This would then in turn lead to global rather than local features and hence to loss of
discriminating power.
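As an illustration of the grid( ) function in equation (31), the following minimal sketch fills a rectangular face with interest points at a given density. The function name `grid_points`, the rectangle parameterisation (two in-plane unit vectors with a width and a height) and the rounding rule are assumptions made for illustration only. The models in this chapter use general polygon faces, and the density $d_f$ is simply taken as an input here.

```python
def grid_points(p0, u, v, width, height, density):
    """Return a 2D grid of 3D points covering a rectangular face.

    p0      -- centre of the face (x, y, z) in metres
    u, v    -- orthogonal unit vectors spanning the face plane
    width   -- face extent along u in metres
    height  -- face extent along v in metres
    density -- interest points per metre (d_f in the text)
    """
    # Assumption: at least one point per direction, so small faces stay covered.
    nu = max(1, round(width * density))
    nv = max(1, round(height * density))
    points = []
    for i in range(nu):
        # Offsets are centred on p0, so the grid is symmetric about the face centre.
        a = (i - (nu - 1) / 2) / density
        for j in range(nv):
            b = (j - (nv - 1) / 2) / density
            points.append(tuple(p0[k] + a * u[k] + b * v[k] for k in range(3)))
    return points
```

With a density of 4 points per metre, a face of 1 m by 0.5 m would receive a grid of 4 by 2 interest points in this sketch.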
full set of interest points P , which contains all the interest points of a given model.
A typical car contains 300 interest points in this set P . The full set P contains
interest points for any viewing direction of the model. This can be observed in
CHAPTER 5 3DHOG CLASSIFIER 5.4 Extracting Local Features
Figure 43, where interest points are displayed as cones. The orientation of the cone
corresponds to the orientation vector e of interest points p .
the previous section, given an object location in the image. The object location
appearance model training (section 5.5) and classification (section 5.6). First, for a
candidate object location, image patches are obtained for interest points that are
sufficiently visible as explained in the next section. Feature vectors are then
calculated from those patches, as explained in section 5.4.2.
The visibility of interest points is first confirmed, before image patches are
the scale (i.e. depth) and perspective distortion (orientation) of the observation and
presents a constant size image patch for feature extraction.
The locations of interest points in real world coordinates are used to extract visible
plane orientation r is required as input. Road users are assumed to be on the ground
plane, which results in only one degree of freedom for the orientation angle r .
Using this location x , points in the model coordinate system can be transformed to
the real world coordinate system by adding the location x to the model coordinates
(Dunn and Parberry, 2002). All further descriptions will be in reference to the real
world coordinate system. Let v be the unit vector of the viewing direction in real
world coordinates of interest point p . The visible set of interest points P P is a
P p e, v v , p P . (33)
The above equation calculates the dot product between the direction e of the
interest point and the viewing direction v . The dot product will be 1, if both vectors
are pointing in the same direction and the interest point is viewed head on. The
product decreases, if the interest point is less aligned and is viewed further from the
side. If the interest point direction is perpendicular to the viewing direction, the dot
product will be 0 and the point will be invisible at the same time.
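The visibility test described above can be sketched as follows. This is only an illustration under stated assumptions: the viewing direction is taken as the unit vector from the interest point towards the camera (so that a head-on view gives a dot product of 1), and the threshold value of 0.2 and the names `unit` and `visible_points` are hypothetical, not taken from the experiments.

```python
import math

def unit(vec):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return tuple(c / norm for c in vec)

def visible_points(points, camera, threshold=0.2):
    """Select interest points whose normal is sufficiently aligned with the view.

    points    -- list of (position, normal) tuples in world coordinates
    camera    -- camera centre in world coordinates
    threshold -- minimum dot product (hypothetical value)
    """
    visible = []
    for position, normal in points:
        # Assumed convention: v points from the interest point to the camera.
        view = unit(tuple(c - p for c, p in zip(camera, position)))
        e = unit(normal)
        dot = sum(a * b for a, b in zip(e, view))
        # Points facing away from or perpendicular to the view are discarded.
        if dot > threshold:
            visible.append((position, normal))
    return visible
```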
A set of 2D square image patches I I in real world space is extracted for every
visible interest point. Due to the perspective, this patch may not correspond to a
square area in the original input image. An affine transformation will be estimated
to map this distorted part of the input image to a square normalised patch in ground
plane space. One square image patch I is defined for every visible interest point
p P with constant pixel width l p using constant 3D world resolution
in pixels per metre and width in metres, allowing some overlap of patches.
Values for all parameters will be provided in the evaluation section 5.7. An affine
patch images I with coordinates x, y . This produces the set of visible image
$$u = c_0 x + c_1 y + c_2, \qquad v = c_3 x + c_4 y + c_5. \qquad (34)$$
systems, the resulting set of 6 equations can be directly solved for the 6 parameters
[Figure 44 labels: interest point p, patch corner points, and the affine transformation between patch axes (x, y) and image axes (u, v).]
Figure 44 Illustration of patch image extraction
space to image space. Visible interest points p P specify the centres of
corresponding image patches I I . Four corner points 1..4 of image patch I are
calculated by generating a square shape with width in a plane (2D) perpendicular
to the orientation e of interest point p . Corner points 1..4 can be projected to the
depending on the viewing direction of the model. The overall process can be viewed
as one of wrapping the camera image around the model resulting in invariant
a set of image patches I with normalised images and we are now ready to extract
features from those images, as explained in the next section. Please refer to Figure
45 for an example of extracted image patches I .
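To make the estimation of the six affine parameters in equation (34) concrete, the sketch below solves for them from three point correspondences, which yield six linear equations. It is a minimal illustration, assuming that three of the four projected corner points are used. The function names and the use of Cramer's rule are choices made here and are not claimed to be the original implementation.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("corner points are collinear")
    result = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = rhs[r]
        result.append(det3(m) / d)
    return result

def estimate_affine(patch_pts, image_pts):
    """Return (c0..c5) mapping patch coordinates (x, y) to image (u, v)."""
    a = [[x, y, 1.0] for x, y in patch_pts]
    c0, c1, c2 = solve3(a, [u for u, _ in image_pts])
    c3, c4, c5 = solve3(a, [v for _, v in image_pts])
    return c0, c1, c2, c3, c4, c5

def apply_affine(c, x, y):
    """Evaluate equation (34) for one patch coordinate."""
    return c[0] * x + c[1] * y + c[2], c[3] * x + c[4] * y + c[5]
```

Once the parameters are known, every pixel of the normalised square patch can be looked up in the input image with `apply_affine`.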
Figure 45 a) Input image and b) hatchback model. The radii of cones indicate the
weights q (described later) of interest points p . c) shows the set of extracted image
patches I I .
The image patches I extracted as explained in the previous section are used to
generate normalised feature vectors fˆk , k 1..Ic with examples shown in Figure 46.
Those features are then used to train appearance models and also to classify new
section 5.7. The length of feature vectors fˆk depends on the individual algorithm
used, but the training and classification framework is independent of that length.
The index k of vectors fˆk enumerates the image patches I from which features are
FFT or Histogram) are normalised by the Euclidean norm for better performance,
following (Dalal and Triggs, 2005):
$$\hat{f}_k = \frac{f_k}{\lVert f_k \rVert} \qquad (35)$$
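The normalisation in equation (35), applied to a gradient orientation histogram of the kind described in this section, can be sketched as follows. This is a simplified illustration under several assumptions: a single grey level channel is used instead of the maximum over three colour channels, the number of bins (8) is a placeholder, and the function name `hog_feature` is hypothetical.

```python
import math

def hog_feature(patch, bins=8):
    """Return an L2-normalised orientation histogram for one image patch.

    patch -- 2D list of grey values (assumption: single channel)
    bins  -- number of orientation bins over [0, 2*pi) (placeholder value)
    """
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Centred derivative kernel [-1, 0, 1] in x and y.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            # Signed orientation, so light-to-dark differs from dark-to-light.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            idx = min(int(angle / (2 * math.pi) * bins), bins - 1)
            # Each pixel votes with its gradient magnitude.
            hist[idx] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    # Equation (35): divide by the Euclidean norm (skipped for flat patches).
    return [v / norm for v in hist] if norm > 0 else hist
```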
The generation of the feature vectors fk for image patches I is performed in the
same way that (Dalal and Triggs, 2005) generate the vectors for single cells. First, a
Sobel kernel [-1, 0, 1] is used to compute the gradient image for all three colour
recommended for rigid objects like vehicles by Dalal and Triggs. The
alternative would be the interval [0, π], which would consider gradients from light
to dark identical to dark to light. A single histogram is generated for every image
patch with bins. The highest gradient magnitude of the three colour channels is
used for the histogram. In the process described in section 5.4.1 earlier, the visible
Figure 46 Feature vectors fˆk generated from the set of image patches I in Figure
seminal paper of (Dalal and Triggs, 2005). Due to changes in the viewpoint of
objects, the number of visible interest points changes according to equation (33).
This directly changes the number of feature vectors fk and therefore makes the
concatenation to a single constant size feature vector impossible. This fact adds
complexity for the training and classification to allow for this variable number of
Alternative features to HOG are used to compare performance between features and
Features based on FFT have previously provided good performance for intrusion
detection in chapter 4. Fast Fourier transform (FFT) features fk are calculated from
magnitude spectrum is used to fill a two dimensional histogram with angle bins
and frequency bins. Every angle bin corresponds to a sector in the spectral image,
whereas frequency bins correspond to annuli (see Figure 47). This approach is
similar to using banks of Gabor filters and accumulating the responses into a feature
vector.
The grey level histogram is one of the simplest image features that can be used in
the classification framework proposed here and thus it is used to compare with the
performance of the 3DHOG and FFT features. The number of bins is . Colour
performance. The next section will deal with how these appearance features can be
used to train a system to recognise given classes of road users.
CHAPTER 5 3DHOG CLASSIFIER 5.5 Training Appearance Models
continuous red lines indicate the frequency borders, whereas the dashed blue lines
models. The annotations consist of road user positions in the training images. These
appearance models are then used in the classification of new images. Training takes
place using the following five steps (Figure 48):
In this way, the match response of models for new images can be
improved.
Figure 48 Block diagram for the training of appearance models. Features are
extracted from training videos given object location annotation and the 3D models
The training set comprises frame images from the i-LIDS data set with labelled road
user locations. A set of model locations on the ground plane L x represents the
annotation for the training. Those positions x were generated with the algorithm in
chapter 3 for the training videos and manually refined, where necessary.
Alternatively, the location of road users in training videos could be hand labelled.
There are 20 to 30 labelled images for every type of road user model in the training
models will be visible in the training data, so that their appearance can be learned as
described in the next section.
new images with the model. The approach described in section 5.4 is used to extract
feature vector samples for every visible interest point in a training frame at location
x . Sample vectors per interest point p are accumulated into sample set S fˆ for
each interest point. For the estimation of the mean μ and covariance matrix Σ of
each interest point p , the training set S is used. The covariance matrices Σ are
$$D_M(f) = \sqrt{(f - \mu)^T \Sigma^{-1} (f - \mu)}. \qquad (36)$$
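A minimal sketch of the per interest point Gaussian model and the Mahalanobis distance is given below. For brevity it assumes a diagonal covariance matrix and adds a small constant to each variance to avoid division by zero. Both are assumptions of this sketch, as are the function names `fit_gaussian` and `mahalanobis`.

```python
import math

def fit_gaussian(samples, eps=1e-6):
    """Estimate mean and per-dimension variance from feature samples.

    samples -- list of equally long feature vectors for one interest point
    eps     -- small regulariser (assumption) that keeps variances positive
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    # Assumption: only the diagonal of the covariance matrix is modelled.
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n + eps
           for i in range(dim)]
    return mean, var

def mahalanobis(f, mean, var):
    """Mahalanobis distance of feature f for a diagonal covariance."""
    return math.sqrt(sum((x - m) ** 2 / v for x, m, v in zip(f, mean, var)))
```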
The above equation will be used for classification and also for model refinement as
discussed in the next sections. For a joint day/night classifier, the single Gaussian
After estimating the Gaussian model for every interest point, the appearance model
interest points have to be dealt with, because Support Vector Machine (SVM)
classification as in (Dalal and Triggs, 2005) is not possible due to the variable
number of visible interest points here. The variability stems from variable
points can vary, the average response for different appearance models can be
spatial models slightly away from training positions and checking the response as
explained in the next section. This surface is then used to parameterise the sigmoid
function as described later in section 5.5.
To generate a distance surface per interest point p , features are calculated with
spatial models moved away from the exact training position. In this way, the change
is evaluated. Good interest points should exhibit a strong drop in match response
when moved away from the training position. A regular grid of positions g M with
M M max ..M max is generated for every position x in training set L , similar to the
hypothesis grid in section 3.3 on page 58. The size of the grid is set to 4m with 9
image patch and a total displacement of twice the patch size in every direction.
models can be assessed for normalisation in the next step and weight estimation as
in section 5.5.4. The Mahalanobis distance DM fˆM in equation (36) between the
interest points' models $\mu, \Sigma$ and the extracted feature vectors $\hat{f}_M$ at model positions
DM DM fˆM (37)
A mean distance surface DM (Figure 49) over all training samples xL of interest
$$\bar{D}_M = \frac{\sum_{x \in L} D_M}{|L|}. \qquad (38)$$
The above distance surface represents the average match response between the
trained appearance model and the training data of an interest point. The next step
will be to find parameters of a sigmoid function to normalise this response across all
interest points.
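The construction of the displacement grid and the averaging in equation (38) can be sketched as follows. The helper names are hypothetical, and the example values (offsets of up to 2 m in steps of 0.5 m, giving 9 positions per axis over a 4 m grid) are one reading of the grid size stated above rather than confirmed settings.

```python
def displacement_grid(max_offset, step):
    """Symmetric offsets from -max_offset to +max_offset in metres."""
    count = int(round(max_offset / step))
    return [i * step for i in range(-count, count + 1)]

def mean_surface(surfaces):
    """Average equally sized 2D distance surfaces element-wise.

    surfaces -- one distance surface per training position x in L
    """
    n = len(surfaces)
    rows, cols = len(surfaces[0]), len(surfaces[0][0])
    return [[sum(s[r][c] for s in surfaces) / n for c in range(cols)]
            for r in range(rows)]
```

Each individual surface would be filled by evaluating the Mahalanobis distance of equation (36) with the spatial model shifted by every offset pair of the grid.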
(0, 0) corresponds to the training position $x$ and usually has the lowest value. The
feature distance increases for coordinates further away from the training position.
measure between model and observation to fit a fixed response interval. This
normalisation will be used during weight estimation in section 5.5.4 and then
$$m_k = s(d_k) \qquad (39)$$
$$s(d) = \frac{1}{1 + e^{a(b - d)}} \qquad (40)$$
and uses two parameters a and b . The parameters can be estimated from the
distance surface DM for every model. The proposed parameterisation places the
centre point of the distance surface $d_C$ at the middle of the sigmoid function (Figure
50) with a match measure of $s(d_C) = 0.5$. This results in
$$a = \frac{2}{d_C - \bar{d}} \qquad (41)$$
$$b = d_C, \qquad (42)$$
where $d_C$ is the centre point value of the distance surface $\bar{D}_M$ and $\bar{d}$ is the mean
value of the whole surface, $\bar{d} = \mathrm{mean}(\bar{D}_M)$. See Figure 49 for an example surface
of an interest point. A full proof for equations (41) and (42) is included in Appendix
section A.1 on page 190. By using the mean of all distance data points in equation
(41) and therefore considering all data, a uniform drop of match measure mk is
generated for different interest points when moved away from the training position
$x$. The impact of feature outliers is limited due to the nature of equation (40),
which is bounded to the interval [0, 1]. Any subset of interest points will provide the
same match measure for appearance models after this normalisation, which is
essential during self and part occlusion. The normalised match measure response
$M_M$ at training positions is given as
$$M_M = s(\bar{D}_M), \qquad (43)$$
where s d is the sigmoid function from earlier. An example output of the match
Figure 50 Estimated sigmoid function shown as a dashed line. The continuous line is
the gradient of the sigmoid function defined by the centre value dC of the distance
surface DM and the mean distance d of all grid points.
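The parameterisation of the sigmoid from a distance surface can be checked numerically with the short sketch below. It assumes a square surface with an odd number of grid positions so that a centre cell exists. The function names are hypothetical. With these parameters the match measure is 0.5 at the centre distance and $1/(1+e^2) \approx 0.12$ at the mean distance, and the tangent at the centre reaches zero at the mean distance.

```python
import math

def sigmoid_params(surface):
    """Return (a, b) following equations (41) and (42).

    surface -- 2D list of distances with an odd number of rows and columns
    """
    rows, cols = len(surface), len(surface[0])
    d_c = surface[rows // 2][cols // 2]      # centre value of the surface
    d_mean = sum(map(sum, surface)) / (rows * cols)
    if d_c == d_mean:
        raise ValueError("flat surface: parameter a is undefined")
    return 2.0 / (d_c - d_mean), d_c

def match_measure(d, a, b):
    """Sigmoid of equation (40), mapping a distance to a match measure."""
    z = a * (b - d)
    if z > 700.0:                            # guard against overflow in exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```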
CHAPTER 5 3DHOG CLASSIFIER 5.6 Classification Framework
Relative weights are given to different interest points in order to favour those with
good localisation performance and remove those with bad performance. The shape
of the match measure response in the previous section is analysed for this task. For
classification, the weight will be used to calculate a total weighted average match
measure over visible interest points. To analyse the peak shape, a histogram
$H_h = \mathrm{hist}(M_M)$ of the match surface $M_M$ is calculated where every bin $h$
corresponds to a ring of the surface. Low variance of the match measures $H_h$ inside
$$q_k = 1 - \sum_h \frac{\mathrm{var}(H_h)}{C_h}. \qquad (44)$$
The above equation penalises (decreases) the weight of an interest point, if the
match measure surface exhibits local maxima. To complete the training, the best
80% of interest points are used for the classifier with q as their weights. Refer to
Figure 45 on page 134 for a car example with marked up interest points as cones,
where the diameter of the cone corresponds to the weight qk .
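The weighting idea can be sketched as follows, with clearly stated assumptions: rings are taken as square rings of equal Chebyshev distance from the centre cell, the per ring constant is replaced by a single hypothetical value `ring_norm`, and the function names are illustrative. Only the selection of the best 80% of interest points is taken directly from the text.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def interest_point_weight(match_surface, ring_norm=1.0):
    """Weight that drops when match values vary inside a ring.

    match_surface -- square 2D list of match measures, odd side length
    ring_norm     -- hypothetical stand-in for the per ring constant
    """
    size = len(match_surface)
    centre = size // 2
    rings = {}
    for r in range(size):
        for c in range(size):
            # Assumption: square rings defined by Chebyshev distance.
            h = max(abs(r - centre), abs(c - centre))
            rings.setdefault(h, []).append(match_surface[r][c])
    penalty = sum(variance(vals) / ring_norm for vals in rings.values())
    return 1.0 - penalty

def keep_best(weights, fraction=0.8):
    """Return the sorted indices of the highest weighted interest points."""
    count = max(1, int(len(weights) * fraction))
    order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    return sorted(order[:count])
```

A surface with a single central peak and uniform rings keeps a weight of 1 in this sketch, whereas an off-centre local maximum raises the ring variance and lowers the weight.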
Once the training has been completed, a classifier can use the trained data to
classify previously unseen road users. How this is done is discussed in the next
section.
against the trained appearance models. The classification framework used here is
based on the framework described in chapter 3. The difference lies in the way
models are matched against observations for every road user hypothesis. Feature
vectors are extracted according to what was explained in section 5.4. Those feature
vectors are then matched against previously trained appearance models as described
in section 5.5. The matching will be described in section 5.6.1 after a short
overview of the overall classification framework.
and Bowden, 2001) and shadow removal is used to generate motion silhouettes. For
each silhouette, a grid of ground plane object hypotheses is generated from the
centroid and scored by the classifier using equation (45) from the next section.
Please refer to Figure 52 for a block diagram. The silhouettes are often noisy due to
the challenging video data in urban environments with changing lighting conditions
and low camera angle, but are a good indicator for the existence of a road user.
(Example is shown in results of Figure 54).
based on matching appearance models against new features to find the highest
match measure above the detection threshold M . In the process, the 3DHOG
framework is used to extract visible image patches and features for every hypothesis
Figure 53. To limit the search space, orientations of road users are assumed to align
with the road direction, which is realistic for many road videos. The classification is
performed on a per frame basis without tracking or temporal refinement.
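The search over ground plane hypotheses can be sketched as a sweep that keeps the highest match measure above the detection threshold. The callable `score` stands in for the full 3DHOG feature extraction and matching of one hypothesis. The function name, the grid extent and the step size are hypothetical values for illustration. Orientation is not searched here, in line with the assumption that road users align with the road direction.

```python
def classify_silhouette(centroid, models, score, threshold,
                        extent=2.0, step=0.5):
    """Return (label, position, match) of the best hypothesis, or None.

    centroid  -- ground plane position (x, y) derived from the silhouette
    models    -- iterable of class labels, one per 3D model
    score     -- callable(label, position) returning a match measure
    threshold -- detection threshold on the match measure
    """
    steps = int(round(extent / step))
    best = None
    for label in models:
        for i in range(-steps, steps + 1):
            for j in range(-steps, steps + 1):
                position = (centroid[0] + i * step, centroid[1] + j * step)
                m = score(label, position)
                # Keep only the strongest response above the threshold.
                if m > threshold and (best is None or m > best[2]):
                    best = (label, position, m)
    return best
```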
[Figure 52 block diagram. Detector: frame, GMM (Stauffer and Grimson, 1999), shadow removal, foreground mask, connected component, silhouettes. Classifier: ground plane hypothesis $x$, 3DHOG feature extraction from the frame using the 3D model and interest points $p$, feature vectors $\hat{f}_k$, match measure $m_k$ against the appearance model data, maximum, labels.]
Figure 52 Block diagram for the 3DHOG classifier. The general structure is identical
to the motion silhouette classifier in chapter 3. 3DHOG features are extracted directly
from the input frame based on the ground plane hypothesis. The match measure
operates in appearance feature space in contrast to the image space for the
The match measure is calculated from the comparison between new feature vectors
and the trained appearance models by summation of the match measure responses
of individual interest points. First, feature vectors fˆk are generated for visible
interest points p P , where the spatial model location for the extraction is the
ground plane hypothesis location. The index k enumerates the visible interest
points of the given hypothesis. Every feature vector fˆk is matched against its
equation (40). At this point, the match measure mk of all visible interest points has
to be combined to a single value m for the whole spatial model. This is achieved by
a weighted average
$$m = \frac{\sum_k m_k q_k}{\sum_k q_k}, \qquad (45)$$
where the weights qk are as defined in section 5.5.4. An example of this match
response for the different hypotheses on the ground plane is illustrated in Figure 53.
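The weighted average of equation (45) only runs over the interest points that are visible for the current hypothesis, which the following sketch makes explicit. The data layout (a dictionary of visible match measures indexed by interest point) and the function name are assumptions made for illustration.

```python
def total_match(measures, weights):
    """Weighted average match measure over visible interest points.

    measures -- dict mapping interest point index k to match measure m_k,
                containing visible points only
    weights  -- sequence of trained weights q_k for all interest points
    """
    total_weight = sum(weights[k] for k in measures)
    if total_weight == 0:
        return 0.0
    return sum(measures[k] * weights[k] for k in measures) / total_weight
```

Because the sum of the weights in the denominator also covers visible points only, an occluded interest point with a large weight does not lower the result.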
The use of interest point appearance models in this section provides a method of
CHAPTER 5 3DHOG CLASSIFIER 5.7 Evaluation
5.7. Evaluation
Evaluation was performed on realistic (operational quality) videos for traffic
surveillance. All three algorithms are compared to the motion silhouette baseline in
provides a parameter list for the tests. The same part of the i-LIDS data sets as
described in section 3.4 has been used here. Approximately one hour of
video for sunny, overcast and changing conditions was selected. Some illustrative
examples are shown in Figure 54 and for classification problems in Figure 55.
Figure 54 True positive examples for vehicles and pedestrians using 3DHOG.
Figure 55 Two examples of errors generated with 3DHOG. Left: Missed car due to low
contrast of the vehicle bonnet and roof. Right: Misclassified SUV as van due to
Out of the three features discussed in section 5.4.2 (HOG, FFT, Histogram), the best
(Table 13) and classification accuracy of 87.9%. This compares to a recall of 87% at
a precision of 85.5% for the motion silhouette baseline from chapter 3 run on the
same data set, but 3DHOG should be better at dealing with silhouette noise and
later in Figure 57 on page 157 and the occlusion analysis is performed as part of the
outperforms the algorithm from chapter 3 with 0.67, which indicates better
localisation performance for 3DHOG. The system using FFT features showed lower
performance (Recall 48.4% at precision 42.3% from Table 15) similar to the
histogram features (Recall 48.9% at precision 42% from Table 16). The localisation
performance for those two features is identical with 0.64. From those numbers, it is
clear (as the rest of the framework is the same) that the 3DHOG provides a more
discriminative and descriptive feature than the FFT and the Histograms. The
Table 13 Extended confusion matrix for 3DHOG with total system performance
Table 14 Classifier confusion matrix for 3DHOG and class wise results
gradient features are most descriptive for vehicle detection compared to FFT and
tendency for detecting smaller vehicles, which is highlighted by the low classifier
recall rate of 43.9% for the bus/lorry class. This can be attributed to two effects.
Firstly, the predominately frontal or rear view of vehicles, which allows a good fit
of models from the next class of smaller vehicles. Secondly, there are only limited
numbers of training samples for the larger vehicles in the data set. In contrast, the
classifier recall for cars is 98.7% for the dominant class in the data set.
Table 15 Extended confusion matrix for FFT features with total system performance
Table 16 Extended confusion matrix for histogram features with total system
performance
The confusion matrices for FFT and Histogram features in Table 15 and
Table 16 show a confusion of most classes with the class bike. This is due to the
fact that a smaller model can be mistakenly fit more easily to an arbitrary image
region resembling features of the model than a larger one. 3DHOG is sufficiently
discriminative also for the smaller models as not to show such a dominant
confusion.
Table 17 Extended confusion matrix for 3DHOG with total system performance
including pedestrians
Table 18 Classifier confusion matrix for 3DHOG and class wise results including
pedestrians
Full quantitative performance figures for road user detection and classification
(including pedestrians) are shown in Table 17 and Table 18. The overall recall
degraded to 60.7% due to the non rigid nature of pedestrians, which increases the
complexity for the detection task. The same effect can be observed for the motion
silhouette baseline from chapter 3. In contrast, precision is not affected by the
Figure 56 Left: Correctly classified lorry with 3DHOG despite shadows and oversized
motion silhouette. Right: wrongly detected pedestrian at the front edge of a lorry due
to vertical edges.
addition of the pedestrian model. The localisation performance expressed with the
overlap measure of 0.66 outperforms the motion silhouette baseline with 0.64. A
case. The motion silhouette classifier exhibited confusion between pedestrians and
bikes in Table 4 due to similar size. This issue does not arise with 3DHOG;
however, wing mirrors and front corners of lorries are misclassified as pedestrians
due to a similar appearance (see Figure 56). This leads to a low system recall of
25.4% for lorries. Combining information from both classifiers could resolve some
of the misclassifications due to the different failure modes. The classifier precision
for cars, the predominant class in the data set, is 96.9%, which is slightly higher
than the motion silhouette baseline with 96.6%.
This section analyses the sensitivity of the 3DHOG algorithm to the patch size. The
Figure 57 Comparison of the 3DHOG classifier (left) with the motion silhouette
classifier (right). The first image shows a position offset and wrong classification of
the pedestrian of the silhouette classifier due to the tree shadow. In comparison, the
3DHOG classifier correctly identifies the pedestrian within the silhouette and aligns
the car better. The bottom images show a missed vehicle of the silhouette classifier,
because the silhouette is too small due to the overexposed camera view. The
pedestrian is detected as bicycle due to similar size. Both problems are resolved
appendix C.2. The recall increases with larger patch size from 66.8% to 77.3% as
the models become more discriminative. At the same time, the classification
performance slightly degrades from 89.1% to 87.9% due to the larger number of
detections, which include harder cases that were previously rejected for the small
patch size. The overlap and therefore the location performance improves
slightly from 0.68 to 0.69. Increasing the patch size further could mean that patches
represent global rather than local appearance.
CHAPTER 5 3DHOG CLASSIFIER 5.8 Summary
5.8. Summary
A novel algorithm, 3DHOG, for detection and classification of road users in urban
scenes was presented in this chapter. This is an extension to the HOG feature
framework has been proposed generating weights for learned interest points for
classification. Three algorithms for features based on HOG, FFT and simple
baseline approach using motion silhouettes. The classifier sweeps the hypotheses
space to find the best match between images (observation) and 3D models based on
the average match measure between interest points and the training data.
The next chapter will demonstrate applications of this classifier and show
using the training data generated from the i-LIDS data set for classification of
videos recorded with a high definition camera.
CHAPTER 6 APPLICATIONS 6.1 Introduction
6. Applications
6.1. Introduction
This chapter focuses on applications and possible deployment of the proposed
system in the traffic management domain. For this purpose, robustness of the
proposed classifiers has been evaluated and extensions for tracking lead to a
training data has to be usable between cameras to avoid lengthy re-training for
data and directly compared to an industrial classifier. Following on, section 6.3
comparative evaluation with a state of the art blob tracker. The tracking increases
analysis e.g. illegal turns, bus lane intrusion or vehicle interactions. A discussion
and demonstrator for this behaviour analysis is given in section 6.4 with a summary
in section 6.5.
between different camera views and with different resolutions. Due to the nature of
both vehicle classifiers proposed, the only scene dependent information is the
camera calibration. Re-training of an appearance classifier for every camera view
does not scale well and is not feasible for large networks, like the network operated
by TfL for traffic operations. 3DHOG generates image patches with a normalised
CHAPTER 6 APPLICATIONS 6.2 Occlusion and Portability
scale using the camera calibration. Those patches can then be used across cameras,
camera with progressive scanning and a high resolution of 1360x1024 is used for
recording of the test video. This represents effectively four times as many lines
compared to the interlaced video used for training in chapter 5. Vehicles are also
closer to the camera in this scene which results in an order of magnitude size
difference between vehicles from training and classification. In addition to this new
size, the test data also contains occluded and partially visible vehicles to
operates with visible interest points, whereas the number of those points can vary to
allow for occlusions. The system performance is compared to an industrial state of
the art vehicle classifier. This classifier, however, is limited to a single view and
single size of vehicles. All systems are tested on the same video data recorded from
a two lane road with traffic lights. Each of the following videos comprises about
500 frames:
AVIFile_2009_05_16_08_59_32.avi
AVIFile_2009_05_16_09_01_29.avi
AVIFile_2009_05_16_09_02_40.avi
AVIFile_2009_05_16_09_03_57.avi
AVIFile_2009_05_16_09_05_49.avi
Some frames contain artefacts from the wireless video transmission during
recording. This can be noticed by a mixture of old and new images for one frame
(e.g. Figure 59) and can be dealt with by the 3DHOG classifier. The industrial
classifier does not detect vehicles in such damaged frames. Evaluation is provided
for an industrial classifier as baseline in section 6.2.1, for the 3DHOG classifier in
section 6.2.2 and for the motion silhouette classifier section 6.2.3.
Figure 59 Example images from the new data set containing a transmission artefact
on the left. The right image shows the classified car, which is typically fully visible.
Partly occluded cars (at the camera's edge) are mostly ignored.
The industrial classifier used as baseline combines a support vector machine (SVM)
with a windowing approach of the input image. A sliding window is moved across
an input frame to detect and classify vehicles. This global appearance based
approach is a good baseline for the local appearances used in 3DHOG. The
manually verified results from this classifier are used as ground truth to compare
with the two proposed classifiers (3DHOG, motion silhouette). The ground truth
data comprises the class label and the top left corner of the bounding box of
vehicles. The width and height of the bounding boxes are fixed at 300 pixels,
because no size information is available from the classifier. Refer to Figure 59 for
example views. Partly occluded cars are ignored by the baseline, which requires a
small region of interest for the classifiers described in this thesis (i.e. motion
silhouette and 3DHOG). By considering the full frame, many of those partly
occluded vehicles would be detected by the author's classifiers and reported as false
positives (see Figure 61). In addition, some frames with transmission artefacts are
not detected by the industrial baseline and are also reported as false positives of the
author's methods. Full quantitative results are provided in Table 19 and Table 20.
Table 19 Industrial classifier confusion matrix and full system (detection and
classification) confusion matrix. Both matrices are identical, as the system output for
vans as lorries.
Table 20 Industrial classifier total system performance and class wise evaluation
The detection performance for the industrial classifier is 100%, because its results
were used as ground truth. The classification performance, however, shows a strong
confusion between vans and lorries.
The results in this section are generated with the 3DHOG classifier from chapter 5.
The training data is also taken from that chapter and only the camera was newly
calibrated. There is a significant change in vehicle size and viewing angle between
the data sets. The part of the i-LIDS data set used for training does not contain
views in which the front and right corner of vehicles are visible, as is the case in this
new data set. Example classification results are provided in Figure 60 and Figure
61. The map for camera calibration was manually generated and is displayed with
the result figures. The good match between wire frames and vehicles indicates that
the calibration is sufficient for this application. Full performance results are given in
Table 21 and Table 22, using the result of the previous section as ground truth. The
some stationary vehicles were not considered (i.e. missed) by 3DHOG, leading to a
low detection recall of 45%. This is because 3DHOG takes vehicle
hypotheses from motion estimation, which does not detect objects that are
stationary for a longer time period. In contrast, some vehicles with transmission
artefacts are correctly classified but deemed false positives, because of their absence
in the ground truth. Qualitative evaluation of this is given in Figure 59 and Figure
62.
Figure 60 3DHOG classification results of 4 separate frames. The blue outline shows
the estimated motion foreground. The 3DHOG classifier produces the wire frame and
the respective 3D location of the vehicle on the road map. Good localisation
artefact.
Figure 61 Two examples comparing the 3DHOG results with the industrial baseline in
Figure 59. The left image shows correct detection despite the artefact. This example
operates without region of interest, which is why the occluded vehicles are detected.
The right image shows a later frame with active region of interest to remove
Table 21 The 3DHOG classifier exhibits good performance. The high number of false
negatives is due to stationary objects at the traffic lights, which are not picked up by
the motion detection. Some detections reported as false positives were actually cars,
but were not picked up by the industrial classifier used as baseline. Examples of
Table 22 Total system performance for the 3DHOG classifier. The classification
Figure 62 Example for partially occluded vehicles. This image illustrates that very
The algorithm for dealing with incomplete representations of objects is a core part of
the 3DHOG classifier framework. Occlusion is resolved seamlessly in the same way
as variable visibility of the 3D models depending on the camera view and vehicle
orientation.
This section provides results for the motion silhouette classifier to compare to the
two appearance based classifiers earlier. The parameters for the algorithms were
unchanged from the previous setup in chapter 3 and the same camera calibration as
for 3DHOG was used. A reduced region of interest was defined in the view, to
Quantitative results are provided in Table 23 and Table 24. The classifier precision
of 77.3% is slightly lower than that of the industrial classifier (80.3%) and significantly
lower than 3DHOG (91.1%). The appearance based systems seem to be able to
exploit the higher resolution image compared to i-LIDS better than the motion
based system. In addition, the location performance with overlap 0.22 is slightly
lower than 3DHOG (0.24), which can be seen by slightly offset vehicle wire frames
in the results (Figure 63 and Figure 64).
Figure 63 The motion silhouette classifier produces the wire frame and the
respective 3D location of the vehicle on the road map. The wire frames are slightly
offset and the last stopped van was missed due to ambiguous silhouette shape
merging into the background, but correctly classified by 3DHOG (Figure 60).
Figure 64 Two examples comparing the motion silhouette results with the industrial
baseline in Figure 59. The left image shows correct detection of the central car
despite the artefact. This example operates without region of interest, which is why
the occluded vehicles are detected. The silhouette based classifier does not classify
The right image shows a later frame with active region of interest to remove
Table 23 Confusion matrix for the silhouette classifier. The detection rate is lower
than 3DHOG. This is due to ambiguous motion silhouettes for stopped vehicles
merging with the background. Those silhouettes do not match a model well, but are
CHAPTER 6 APPLICATIONS 6.3 Tracking
6.2.4. Comparison
The best detection is provided by the industrial classifier, because a vehicle search
baseline (80.3%) and motion silhouettes (77.3%). For the 3DHOG classifier, good
operation on occluded vehicles and damaged frames is demonstrated, which are not
considered in the baseline. There is strong confusion between vans and lorries for
this baseline resulting in a classifier recall of 11% for this class. Considering the
better class of cars, the classifier recall of 91.9% is still inferior to the 3DHOG
classifier precision of 94.2%.
6.3. Tracking
Following on from portability between cameras without the need for retraining, this
section will show the integration of a variable sample rate Kalman filter with the
classification framework for tracking. Consistent location information over time can
then be used for behaviour analysis, which the next section will look into. Tracking
performance has been evaluated using the framework of (Yin et al., 2007) and
compared to a state of the art OpenCV blob tracker (OpenCV, nd) operating on the
same video data.
[Figure 65: block diagram. Detector: frame, GMM and shadow removal, foreground mask, connected component, silhouettes. Classifier: ground plane hypothesis, models, 2D projection, model silhouette, overlap area, scores, maximum, labels and ground plane positions (GP pos). Tracker: Kalman filter, tracks.]
Previously, vehicle tracking in urban environments has been
performed in (Song and Nevatia, 2007). However, only a single 3D model for cars
is used to estimate a vehicle constellation per frame with optimisation solved with a
Markov Chain Monte Carlo (MCMC) algorithm. The paper of (Morris and Trivedi,
2006a) presents a combined tracking and classification approach for side views of
Kalman filter is used to track the foreground regions based on the centroids in the
image plane only. The OpenCV blob tracker (OpenCV, nd) used as baseline here
works in a similar fashion.
is extended by a tracker illustrated in Figure 65. The ground plane positions and
labels of classified vehicles are the input to a Kalman filter to provide temporal
Transport for London is shown in Figure 66. The next sections will provide details
on the Kalman filter and the evaluation framework before presenting the results.
Figure 66 Example of detection and classification with ground plane tracking. The
wire frame projection in red is used to estimate the bounding box for tracked
of the previous section. The classifier is extended by a Kalman filter with variable
sample rate. The detector with joint classifier operating on single frames may reject
valid vehicles in some frames due to noise, which requires the Kalman filter to
Alternatively, the update step of the Kalman filter could be modified when no
observation is available by setting the measurement noise to infinity (or a very large
value), which in practice has a similar effect. The author chose to use variable time
intervals to avoid numerical problems when inverting large noise matrices.
behaviour analysis like bus lane monitoring. The standard formulation of the
Kalman filter for a constant velocity model of vehicles is used
transition matrix $F$ propagates the old state $x_{k-1}$ to the current state $x_k$. The input
vector is $u_k = 0$, which means that there is no input to the system. The input matrix $B$
specifies how the input influences the state update. The measurement matrix $H$
transforms the state $x_k$ to the output $z_k$ (i.e. measurement). The constant process
noise $w_k$ and the measurement noise $v_k$ are assumed Gaussian and have to be
specified. All time and speed related constants for the filter are based on seconds
rather than the sample rate or frame rate. The ground plane coordinates are in
metres, all noise and position estimates are in metres or metres per second. The
The only requirement to operate the Kalman filter at variable sample rate is to
update $T_0$ in the transition matrix $F$ constantly. For prediction steps, $T_0$ is the time
between the last update step of the filter and the current time. The state prediction
$\hat{x}_{k|k-1}$ and the error covariance prediction $P_{k|k-1}$ are therefore estimated for the correct
time. If a measurement is available, the update step is performed with the same
prediction steps will be performed with increasing time $T_0$ until an update takes
place. Tracks can be discarded if the predicted error covariance $P_{k|k-1}$ grows beyond
a threshold.
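The variable sample rate operation described above can be sketched for a single ground plane axis as follows. This is a minimal illustration in Python, not the implementation used in this thesis: the class name `AxisKalman`, the helper `should_discard`, the initial covariance, the diagonal process noise added once per prediction and the discard threshold are all assumptions made for this example. The default noise values are treated as standard deviations.

```python
class AxisKalman:
    """Constant velocity Kalman filter for one ground plane axis.

    Hypothetical sketch: the state is [position (m), velocity (m/s)] and
    only the position is measured (H = [1, 0]).
    """

    def __init__(self, pos, t, q_pos=0.7, q_vel=1.1, r_pos=1.0):
        self.x = [pos, 0.0]                # detection position, zero velocity
        self.P = [[r_pos ** 2, 0.0],       # assumed initial error covariance
                  [0.0, q_vel ** 2]]
        self.t = t                         # time of the last update step (s)
        self.q = (q_pos ** 2, q_vel ** 2)  # assumed diagonal process noise
        self.r = r_pos ** 2                # measurement noise variance

    def predict(self, t_now):
        """Predict state and covariance for t_now without changing the filter."""
        T0 = t_now - self.t                # time since the last update step
        p, v = self.x
        (a, b), (c, d) = self.P
        x_pred = [p + T0 * v, v]           # F x with F = [[1, T0], [0, 1]]
        P_pred = [                         # F P F^T + Q
            [a + T0 * (b + c) + T0 * T0 * d + self.q[0], b + T0 * d],
            [c + T0 * d, d + self.q[1]],
        ]
        return x_pred, P_pred

    def update(self, t_now, z):
        """Correct the prediction for t_now with a position measurement z."""
        x_pred, P_pred = self.predict(t_now)
        s = P_pred[0][0] + self.r          # innovation variance
        k0, k1 = P_pred[0][0] / s, P_pred[1][0] / s  # Kalman gain
        y = z - x_pred[0]                  # innovation
        self.x = [x_pred[0] + k0 * y, x_pred[1] + k1 * y]
        self.P = [                         # (I - K H) P_pred
            [(1 - k0) * P_pred[0][0], (1 - k0) * P_pred[0][1]],
            [P_pred[1][0] - k1 * P_pred[0][0], P_pred[1][1] - k1 * P_pred[0][1]],
        ]
        self.t = t_now


def should_discard(P_pred, max_pos_var=25.0):
    """Assumed discard rule: predicted position variance above a threshold."""
    return P_pred[0][0] > max_pos_var


kf = AxisKalman(pos=0.0, t=0.0)
_, P_short = kf.predict(1.0)   # prediction 1 s after the last update
_, P_long = kf.predict(2.0)    # prediction 2 s after the last update
kf.update(1.0, 2.0)            # a measurement of 2 m arrives at t = 1 s
```

Because `predict` always starts from the state of the last update, the predicted covariance grows with $T_0$ while no observation arrives, which is what allows a threshold on it to end stale tracks.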
The parameters for the filter are as follows: The process noise $w$ is set to
1.1 m/s for velocity and 0.7 m for position. Those values can be derived from the
1m for position. The initial position state corresponds to the detection position with
zero velocity. The velocity is updated during the second detection using the first
motion vector. Observations $m_{i,k}$ are associated with tracks based on the distance
observation $m_{i,k}$. The observation with the smallest distance is then associated with
track $j$. Changes in the model id (i.e. classification result) between the last
observation of a track $id_j$ and the current observation $id_i$ are penalised, as the
difference between ids increases the distance $d_{i,j}$ in equation (48). The total number
of model ids is 10. This approach is possible because the system performs
classification before the tracking.
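The association rule described above can be sketched as a nearest observation search with a class penalty. Equation (48) is not reproduced in this text, so the additive form below, the weight `id_weight` and the function names are assumptions for illustration only. The sketch shows the principle that a change of model id makes an observation less attractive for a track.

```python
import math


def association_distance(track_pos, track_model_id, obs_pos, obs_model_id,
                         id_weight=0.5):
    """Ground plane distance (m) plus an assumed penalty for model id changes."""
    spatial = math.hypot(obs_pos[0] - track_pos[0], obs_pos[1] - track_pos[1])
    return spatial + id_weight * abs(obs_model_id - track_model_id)


def associate(track_pos, track_model_id, observations, id_weight=0.5):
    """Return the index of the observation with the smallest distance, or None."""
    best_index, best_distance = None, float("inf")
    for index, (obs_pos, obs_model_id) in enumerate(observations):
        d = association_distance(track_pos, track_model_id,
                                 obs_pos, obs_model_id, id_weight)
        if d < best_distance:
            best_index, best_distance = index, d
    return best_index


# Track predicted at the origin with model id 3 and two candidate observations.
observations = [((1.0, 0.0), 9), ((1.5, 0.0), 3)]
nearest_with_penalty = associate((0.0, 0.0), 3, observations)
nearest_without_penalty = associate((0.0, 0.0), 3, observations, id_weight=0.0)
```

With the penalty, the slightly more distant observation that keeps the same model id is preferred. Without it, the spatially closest observation wins.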
baseline tracker (OpenCV blob tracker)). The OpenCV tracker uses an adaptive
data association and Kalman filtering for tracking blob position and size. The i-
LIDS benchmarking video data set is used for evaluation. Performance is evaluated
on a subset (due to limited ground truth for tracking) of the previous chapter's data
(PVTRA10xxxx): 1a03, 1a07, 1a13, 1a19, 1a20, 2a05, 2a10 and 2a11. Those videos
contain overcast, sunny, changing weather conditions and camera saturation. The
ground truth used for evaluation is provided with the i-LIDS data set. It is of limited
duration within the videos and does not include pedestrians on the road. The
evaluation was constrained to the two regions of interest on the road (dark red boxes
in Figure 67) for both trackers.
2007) evaluated performance of motion detection based on ROC-like curves and the
F-measure. The latter allows comparison using a single value domain, but is mainly
body of work dealing with evaluation of both motion detection and tracking.
(Needham and Boyle, 2003) proposed a set of metrics and statistics for comparing
trajectories to account for detection lag, or constant spatial shift. However, taking
only the trajectory (a set of points over time) as the input of evaluation may not give
sufficient information about how precise the tracks are, since the size of the object
is not considered. (Bashir and Porikli, 2006) use the spatial overlap of ground truth
and system bounding boxes, which is unbiased towards large objects. However, they
are counted per frame, which is justified when the objective is object detection. In
object tracking, counting true positive (TP), false positive (FP) and false negative
(FN) tracks is a more natural choice which is consistent with the expectations of
video analytics end-users. (Brown et al., 2005) suggest a framework for matching of
system track centroids and an enlarged ground truth bounding box which favours
tracks of large objects.
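The spatial overlap of a ground truth and a system bounding box mentioned above can be sketched as follows. Intersection over union is used here as one common normalisation. It is an assumption that this matches the overlap definition of the cited frameworks, which may normalise differently. The function name `bbox_overlap` and the box layout `(x, y, width, height)` are chosen for this example.

```python
def bbox_overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
system = (5, 0, 10, 10)        # same size, shifted by half a box width
overlap = bbox_overlap(ground_truth, system)
```

Because the measure is normalised by the union area, a fixed pixel offset costs a small object more overlap than a large one, which is why box based measures complement pure centroid distances.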
2007 is used here. A rich set of metrics is proposed, such as Correct Detected
Tracks, False Detected Tracks and Track Detection Failure to provide a general
the data association module of the system. Latency indicates how quickly the system
can respond to an object entering the camera view, and Track Completeness how
completely the object has been tracked. Metrics such as Track Distance Error and
Closeness of Tracks indicate the accuracy of estimating the position, the spatial and
the temporal extent of the objects respectively.
6.3.3. Results
First, illustrative examples (Figure 67 to Figure 69) compare the tracker with the
OpenCV baseline before providing the evaluation metrics. The full results in Table
25 indicate that the proposed system outperforms the OpenCV tracker on high level
metrics such as correct detected tracks, track detection failure, false detected tracks
and track fragmentation. This can mainly be attributed to the additional prior
information from using 3D models to classify the content of the input video.
For metrics that evaluate the motion segmentation such as track closeness
and distance error, both trackers have similar performance, which can be explained
by the similar background estimation method. The track closeness of the proposed
system is better than the baseline due to 3D models which are more robust against
shadows, which can be observed for the bus in Figure 67 and the occluded car in
Figure 68. The extent of the projected wire frame model is used as bounding box for
the proposed system. The false detected tracks of the OpenCV tracker are high due
in chapters 3 and 5 exhibited low precision for bicycles when no pedestrian model
was used due to the same reasons. Refer to Figure 69 for an example. The proposed
system detected 94% of the ground truth tracks compared to 88% of the baseline. It
also has half of the track detection failures compared to the baseline. The higher
which produces more complete and additional noisy detections. However, the
track of an object leaving is continued for a new object. This is worse for the
proposed system compared to the OpenCV tracker, because the tracker is more
Figure 67 Correct detected tracks inside the active regions of interest (dark red
boxes). Left: the proposed system with corresponding ground plane tracks. Right:
OpenCV tracker result. Note the spatially fragmented tracks for the baseline in the
first row and the correct number of tracks for the proposed tracker.
fewer track fragmentations. The path information in image and ground plane
generated by the tracker can be used for high level behaviour analysis. The next
section will introduce possible applications.
Figure 68 In this frame, the second car is missed due to occlusion between the
vehicles. The proposed tracker on the left correctly locates the first car. The OpenCV
tracker merged both cars with a large bounding box at a central position.
Figure 69 Pedestrians are correctly rejected as “other” class by the proposed tracker
CHAPTER 6 APPLICATIONS 6.4 Behaviour Analysis
Metrics                                  Proposed tracker   OpenCV blob tracker
Number of ground truth tracks            100                100
Number of system tracks                  144                203
Correct detected tracks                  94                 88
Track detection failure                  6                  12
False detected tracks                    27                 90
Latency (frames)                         5                  5
Track fragmentation                      8                  18
Average track completeness (time)        64%                55%
ID change                                10                 3
Average track closeness (bbox overlap)   54%                35%
Standard deviation of closeness          20%                13%
Average distance error (pixels)          22                 21
Standard deviation of distance error     19                 15
Table 25 Tracking results
London and has been rolled out to over 100 cameras in the last year. The next level
of road users is bus lane intrusion of non-authorised vehicles, stopping in box
junctions when the exit is blocked and banned turns at intersections. All those
Figure 70 Bus lane intrusion detection example. The restricted zone is marked as
green region permitting only buses. Vehicles of a restricted class with tracks inside
green). This region is defined in the same way as the region of interest for the
CHAPTER 6 APPLICATIONS 6.5 Summary
centre of vehicles rather than whole motion silhouettes. This is very beneficial in
low angle urban camera views, where motion centroids and ground plane centres
may differ significantly. If the region is defined along the edges of the lanes of a
road, the majority of a high vehicle's silhouette might be outside this region as
shown in the example. The ground plane centre however is correctly estimated
within the road region according to the prior knowledge of the 3D scene. Correct
operation for a bus and car is shown in Figure 70. Depending on further
applications, many more monitoring tasks can be integrated into the framework due
to a simple plug-in structure of the software (appendix A).
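The zone test on the ground plane centre described above can be sketched as a point in polygon check. This is an illustrative example only: the function names, the ray casting test, the class labels and the lane coordinates in metres are assumptions and are not taken from the plug-in software.

```python
def point_in_polygon(point, polygon):
    """Ray casting test: True if point lies inside the polygon (list of (x, y))."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge crosses the ray height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                           # crossing lies to the right
                inside = not inside
    return inside


def is_intrusion(vehicle_class, ground_plane_centre, restricted_region,
                 permitted=("bus",)):
    """Flag vehicles of a non-permitted class whose centre is inside the region."""
    if vehicle_class in permitted:
        return False
    return point_in_polygon(ground_plane_centre, restricted_region)


# Hypothetical bus lane, 3.5 m wide and 50 m long, in ground plane coordinates.
bus_lane = [(0.0, 0.0), (3.5, 0.0), (3.5, 50.0), (0.0, 50.0)]
car_in_lane = is_intrusion("car", (1.75, 10.0), bus_lane)
bus_in_lane = is_intrusion("bus", (1.75, 10.0), bus_lane)
car_next_lane = is_intrusion("car", (5.0, 10.0), bus_lane)
```

Testing a single ground plane point rather than the image silhouette keeps the decision independent of vehicle height, which matters in low angle views.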
6.5. Summary
This chapter introduced application related issues for traffic surveillance systems
with respect to the proposed system. Good portability between camera views has been
demonstrated for a data set which was not part of the training. Superior performance
classifier precision of 95.1% for the car class. The data set and classification results
were provided by the same company. Due to the zoomed view, significant
occlusions occurred for vehicles. The classifier implicitly deals with occlusions in
the same way as with visibility changes due to orientation. This enabled correct
operation under those conditions.
based on Kalman filters was proposed. The performance was evaluated with the
outperformed for most metrics demonstrating 94% correct detected tracks. The
Examples for bus lane intrusions are shown. In addition, the framework allows
integration of plug-ins for other monitoring objectives.
CHAPTER 7 CONCLUSIONS 7.1 Summary
7. Conclusions
7.1. Summary
This thesis addressed computer vision algorithms for road user detection and
classification. The aim was to provide more information to traffic managers, who in
turn can improve the car travel experience in real-time or with strategic planning.
operators in the urban environments. For ease of reading, a short summary of the
work is provided here. The next section will critically discuss the work and an
outlook for further work will be provided in section 7.3.
using 3D models has been introduced. Camera calibration makes this framework
portable between cameras without the requirement for retraining the classifier.
Good performance is demonstrated on the i-LIDS data set provided by the UK
Home Office. The system is limited by the quality of the extracted motion
silhouette. To mitigate the strong dependence on the motion silhouette, the use of
local image features was investigated. As a test case, a texture saliency classifier
has been implemented, which uses features derived from the fast Fourier transform
(FFT) from local image patches. The classifier operates on single frames to detect
salient objects, which in the evaluation are people approaching a fence. Good real-
time performance figures are shown for the 24 hour sterile zone test of i-LIDS. The
local features proposed are an effective way of exploiting appearance for visual
surveillance.
CHAPTER 7 CONCLUSIONS 7.2 Discussion
This local feature concept was integrated into the earlier 3D framework to
form the novel 3DHOG classifier. This algorithm integrates the advantages from
both ideas by modelling road user appearance with local patches and spatially
constraining them in real world space. The proposed framework allows for a
variable number of features to be used, so that a single model for any viewing angle
of a road user is sufficient. Depending on camera view and road user orientation,
visible feature points are extracted and normalised using the 3D model and used for
classification. In this way, training data is fully portable between camera views
evaluation on a new data set and comparison with an industrial single view
appearance classifier trained on the new data set. The proposed system trained on i-
LIDS outperforms the industrial classifier, both tested on the new data set. Finally, a
tracking extension with Kalman filters has been introduced. This enabled the
7.2. Discussion
The work started based on motion estimation, which is the state of the art in visual
on from this initial approach, the use of local features incorporates appearance into
the object models. This in turn removes the strong reliance on motion estimation to
weaken the static camera requirements. The novel 3DHOG classifier can be applied
tracker. In this way, the system can incorporate historical information during
detection. This structure was identified in section 2.6 to be an essential element for
traffic surveillance systems to increase robustness.
classifier. This is due to the image patch warping and feature extraction during the
model matching process. This task, however, is highly parallel in nature and
For a smaller number of image patches used for intrusion detection in chapter 4,
with the assumption that every silhouette contains one object. This is an essential
assumption for the motion silhouette based classifier. The appearance based
silhouette. This could either be implemented through a multiple object search for
frame. In this way, tracks will be continued and the hypotheses from motion
silhouettes can start new tracks.
typical for surveillance applications. Longer and more comprehensive video data
for testing is desirable, but comes with the cost of ground truth generation. Ground
truth, which is part of the i-LIDS data set, has been used and extended to match the
baseline industrial classifier was used to generate ground truth, which was then
manually checked for classification accuracy. Overall, a good attempt has been
made on evaluation, especially for the human intrusion with full 24 hours of video
data; however, more data would always be desirable.
computer vision systems. It is clear that even the latest technology as described in
CHAPTER 7 CONCLUSIONS 7.3 Future Work
this thesis is far inferior to human perception for object recognition, identification
and scene understanding. Nevertheless, there is space for automatic video analysis
for supporting human operators in specific tasks. As outlined in the introduction, the
humans very quickly. The work presented in this thesis is therefore targeted at
incidents further. This aspect makes the technology particularly valuable for the
human intrusion detection case presented in chapter 4.
thesis. For intrusion detection with local features, extensions for features and
clustering can be considered. The current classifier uses a scalar feature value
trade off computational speed at the same time. The current clustering of individual
patches to objects can also be improved. Incorporating time and clustering in the
spatio-temporal domain would reduce the effect of single frame noise. Motion
estimation could also be included directly into the clustering: it is currently fused
with the clustering result. The clustering itself can be improved by employing a
For road user detection and classification with local features, there may be
many ways of extending and improving the current systems. Firstly, the image
patches could be extracted at several scales and the feature extraction could
consider this whole image pyramid. The selection and weighting of interest points
represent the weak hypotheses. The final strong classifier would use the best subset
of those points. This training methodology could benefit from additional negative
training samples or could use the proposed concept of positive samples with
position offsets to generate sharp location performance.
including night time. This might result in the need for modelling the appearance
vectors with multiple Gaussians (GMM) rather than the current single Gaussian
model. For computational efficiency, the spatial 3D models can be re-used between
classes. In this way, every extracted feature patch would be compared with the
trained appearance of several classes (e.g. bus, lorry) to generate a match measure
for all those classes from only a single image feature extraction step.
The search strategy for new object instances may be improved. The current
grid approach can be replaced by mean shift or similar methods to provide faster
parameters into the search strategy. More advanced tracking techniques like particle
filters can be integrated with the search framework to allow for multiple hypotheses
tracking. Finally, the 3DHOG classifier could be applied to moving cameras for
the road surface remains approximately constant apart from short intervals for
example when ascending onto a speed bump. A fixed window in front of the car
could be searched in every frame to detect and classify road users.
could benefit from moving away from the use of background and motion estimation
CHAPTER 7 CONCLUSIONS 7.4 Publications
without strong constraints are undoubtedly harder tasks, but emerging technology
mastering this will be able to enable more applications, which so far can only be
solved by humans.
7.4. Publications
Norbert Buch, James Orwell and Sergio A. Velastin. Urban Road User Detection
and Classification using 3D Wire Frame Models.
IET Computer Vision Journal 2010, Vol. 4, Issue 2, pages 105-116.
Norbert Buch, Fei Yin, James Orwell, Dimitrios Makris and Sergio A. Velastin.
Urban Vehicle Tracking using a Combined 3D Model Detector and
Classifier. In 13th International Conference on Knowledge-Based and
Intelligent Information & Engineering Systems KES 2009, Part I, LNCS 5711,
pages 169–176, Santiago, Chile, September 2009
Norbert Buch, Mark Cracknell, James Orwell and Sergio A. Velastin. Vehicle
Localisation and Classification in Urban CCTV Streams. In Intelligent
Transport Systems World Congress 2009, Stockholm, Sweden, September 2009
Norbert Buch and Sergio A. Velastin. Human intrusion detection using texture
classification in real-time. In First International Workshop on Tracking
Humans for the Evaluation of their Motion in Image Sequences THEMIS 2008
co-hosted with BMVC2008, pages 1–6, Leeds, September 2008.
Norbert Buch, James Orwell, and Sergio A. Velastin. Detection and classification
of vehicles for urban traffic scenes. In International Conference on Visual
Information Engineering VIE08, pages 182–187. The Institution of Engineering
and Technology, Xi'an, China, July 2008.
7.4.3. Presentations
Derek Renaud and Norbert Buch, Data to Decision to Action in London's Traffic
Systems. British Computer Society Evening Lecture, BCS, Southampton Street,
London, March 2010
Norbert Buch, James Orwell, and Sergio A. Velastin. Classifying and Tracking
Vehicles in Urban CCTV Streams. BMVA Symposium on Vision for
Automotive Applications, London, May 2009.
Norbert Buch. Classification of Vehicles and their Behaviour for Urban Traffic
Scenes. Doctoral Consortium, British Computer Society, London, May 2009.
(best presentation award)
CHAPTER 7 CONCLUSIONS 7.5 Personal Statement
for London and finding novel solutions for computer vision and transport problems.
I would never want to do without my time here at Kingston University.
APPENDIX A MATHEMATICAL PROOFS A.1 Sigmoid Parameters
A. Mathematical Proofs
$$ m_k = s(d_k) \tag{49} $$
$$ s(d) = \frac{1}{1 + e^{a (b - d)}} \tag{50} $$
and uses two parameters $a$ and $b$. The parameters can be estimated from the
distance surface $D_M$ for every model. The proposed parameterisation places the
centre point of the distance surface $d_C$ at the middle of the sigmoid function (Figure
71) with a match measure of $s(d_C) = 0.5$. This results in
$$ a = \frac{2}{d_C - \bar{d}} \tag{51} $$
$$ b = d_C, \tag{52} $$
where $d_C$ is the centre point value of the distance surface $D_M$ and $\bar{d}$ is the mean
value of the surface, $\bar{d} = \operatorname{mean}(D_M)$. See Figure 49 on page 143 for an example
surface of an interest point and Figure 71 for the resulting sigmoid function.
Figure 71 Estimated sigmoid function shown as a dashed line. The continuous line is
the gradient of the sigmoid function defined by the centre value $d_C$ of the distance
surface $D_M$ and the mean distance $\bar{d}$ of all values.
Proof
This proof shows how equations (51) and (52) provide the properties of $s(d)$ as
described above. The first constraint is $s(d_C) = 0.5$ to centre the sigmoid function,
which enforces
$$ s(d_C) = \frac{1}{1 + e^{a (b - d_C)}} = \frac{1}{2} \tag{53} $$
The above function is smooth and strictly monotonic for $d_C$ and also for variations
Equation (52) is the unique solution to equation (53) which is shown by inserting
(52) in (53):
$$ s(d_C) = \frac{1}{1 + e^{a (d_C - d_C)}} = \frac{1}{1 + e^{0}} = \frac{1}{2} \quad \text{q.e.d.} \tag{54} $$
The gradient of the sigmoid function at point $s(d_C) = 0.5$ (like before) defines the
other parameter $a$. The gradient should be equal to the gradient of a line of the
training data between the centre distance value $d_C$ at match measure $m_k = 0.5$ and
$$ g = \frac{0.5}{d_C - \bar{d}}, \tag{55} $$
which will be made equal to the gradient of the sigmoid function $s(d)$. The
$$ \frac{\mathrm{d}\, s(d)}{\mathrm{d} d} = \frac{1}{\left(1 + e^{a (b - d)}\right)^{2}} \, e^{a (b - d)} \, a. \tag{56} $$
Considering the gradient at the point $d_C$ and using the existing parameter $b = d_C$,
the gradient can be expressed as
$$ \frac{\mathrm{d}\, s(d_C)}{\mathrm{d} d_C} = \frac{1}{\left(1 + e^{a (d_C - d_C)}\right)^{2}} \, e^{a (d_C - d_C)} \, a = \frac{1}{\left(1 + e^{0}\right)^{2}} \, e^{0} \, a = \frac{a}{4}. \tag{57} $$
The above gradient is then made equal to the line gradient $g$ from equation (55):
$$ \frac{a}{4} = g. \tag{58} $$
$$ a = \frac{2}{d_C - \bar{d}} \quad \text{q.e.d.} \tag{59} $$
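The two properties shown in the proof can also be checked numerically. The sketch below assumes the sigmoid has the form $s(d) = 1 / (1 + e^{a(b-d)})$ with $a = 2 / (d_C - \bar{d})$ and $b = d_C$. The function names and the example distances are hypothetical and chosen only for illustration.

```python
import math


def sigmoid_parameters(d_centre, d_mean):
    """Parameters a and b from the centre and mean of a distance surface."""
    a = 2.0 / (d_centre - d_mean)
    b = d_centre
    return a, b


def match_measure(d, a, b):
    """Sigmoid mapping from a feature distance d to a match measure in (0, 1)."""
    return 1.0 / (1.0 + math.exp(a * (b - d)))


# Hypothetical distance surface: small distance at the centre, larger mean.
d_centre, d_mean = 0.2, 1.0
a, b = sigmoid_parameters(d_centre, d_mean)

centre_value = match_measure(d_centre, a, b)       # first constraint: 0.5
h = 1e-6                                           # step for a central difference
numeric_gradient = (match_measure(d_centre + h, a, b)
                    - match_measure(d_centre - h, a, b)) / (2 * h)
line_gradient = 0.5 / (d_centre - d_mean)          # gradient g of the line
```

For these example values the sigmoid decreases with distance, so a smaller distance gives a higher match measure, and its slope at the centre agrees with the line gradient.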
APPENDIX B PLUG-IN HIERARCHY AND PARAMETERS B.1 Overview of Modules
APPENDIX B PLUG-IN HIERARCHY AND PARAMETERS B.2 Parameter List
-overlayWriter Auxiliary module for CV modules to save overlay images for objects
OverlayWriter Class to generate overlay output
Parameter      Type        Short description
-overlay-dir   directory   Directory for overlay images
-output-pic    string      Extension for saved images
-output-dir    directory   Path to save output results
APPENDIX B PLUG-IN HIERARCHY AND PARAMETERS B.3 Setup GUI
values for parameters are displayed, selected modules can be configured and help is
APPENDIX C ADDITIONAL PERFORMANCE TABLES C.1 Motion Silhouette Classifier
Table 26 Classifier confusion matrix and overall performance figures for the motion
Table 27 Class wise performance figures for the motion silhouette classifier with de-
interlacing filter
Table 28 Classifier confusion matrix and overall performance figures for the motion
Table 29 Class wise performance figures for the motion silhouette classifier without
additional filter
Table 30 Classifier confusion matrix and overall performance figures for the motion
Table 31 Class wise performance figures for the motion silhouette classifier under
sunny conditions
Table 32 Classifier confusion matrix and overall performance figures for the motion
Table 33 Class wise performance figures for the motion silhouette classifier under
overcast conditions
Table 34 Classifier confusion matrix and overall performance figures for the motion
Table 35 Class wise performance figures for the motion silhouette classifier under
APPENDIX C ADDITIONAL PERFORMANCE TABLES C.2 3DHOG Classifier
Table 36 Confusion matrix and system performance for the 3DHOG classifier with
Table 37 Classifier confusion matrix and class wise performance for the 3DHOG
Table 38 Confusion matrix and system performance for the 3DHOG classifier with
Table 39 Classifier confusion matrix and class wise performance for the 3DHOG
REFERENCES
References
(Acunzo et al., 2007) Acunzo, D., Zhu, Y., Xie, B., and Baratoff, G. (2007).
Context-adaptive approach for vehicle detection under varying lighting
conditions. In Intelligent Transportation Systems Conference, 2007. ITSC
2007. IEEE, pages 654–660.
(Agarwal et al., 2004) Agarwal, S., Awan, A., and Roth, D. (2004). Learning to
detect objects in images via a sparse, part-based representation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 26(11):1475–
1490.
(Alonso et al., 2007) Alonso, D., Salgado, L., and Nieto, M. (2007). Robust
vehicle detection through multidimensional classification for on board video
based systems. In Image Processing, 2007. ICIP 2007. IEEE International
Conference on, volume 4, pages 321–324.
(Atev et al., 2005) Atev, S., Arumugam, H., Masoud, O., Janardan, R., and
Papanikolopoulos, N. (2005). A vision-based approach to collision
prediction at traffic intersections. Intelligent Transportation Systems, IEEE
Transactions on, 6(4):416–423.
(Bardet and Chateau, 2008) Bardet, F. and Chateau, T. (2008). MCMC particle filter
for real-time visual tracking of vehicles. In Intelligent Transportation
Systems, 2008. ITSC 2008. 11th International IEEE Conference on, pages
539–544.
(Bashir and Porikli, 2006) Bashir, F. and Porikli, F. (2006). Performance evaluation
of object detection and tracking systems. In IEEE Int. W. on Performance
Evaluation of Tracking and Surveillance, PETS'06, pages 7–14.
(Bay et al., 2006) Bay, H., Tuytelaars, T., and Gool, L. V. (2006). SURF: Speeded up
robust features. In European Conference on Computer Vision (ECCV),
volume 3951 of LNCS, pages 404–417.
(Bloisi and Iocchi, 2009) Bloisi, D. and Iocchi, L. (2009). ARGOS - a video
surveillance system for boat traffic monitoring in Venice. To appear in
International Journal of Pattern Recognition and Artificial Intelligence
(IJPRAI). 2009.
(Brown et al., 2005) Brown, L. M., Senior, A. W., Tian, Y.-L., Connell, J.,
Hampapur, A., Shu, C.-F., Merkl, H., and Lu, M. (2005). Performance
evaluation of surveillance systems under varying conditions. In IEEE Int'l
Workshop on Performance Evaluation of Tracking and Surveillance, pages
1–8, Colorado.
(Chen and Zhang, 2007) Chen, X. and Zhang, C. C. (2007). Vehicle classification
from traffic surveillance videos at a finer granularity. In Advances In
Multimedia Modeling, volume 4351 of Lecture Notes in Computer Science,
pages 772–781. Springer.
(Chen et al., 2007) Chen, Y.-T., Chen, C.-S., Huang, C.-R., and Hung, Y.-P.
(2007). Efficient hierarchical method for background subtraction. Pattern
Recognition, 40(10):2706–2715.
(Ciciora et al., 2004) Ciciora, W., Farmer, J., and Adams, M. (2004). Modern
cable television technology: video, voice, and data communications. Morgan
Kaufmann.
(Colombo et al., 2007) Colombo, A., Leung, V., Orwell, J., and Velastin, S. (2007).
Markov models of periodically varying backgrounds for change detection.
In Visual Information Engineering 2007, London. IET.
(Cornelis et al., 2008) Cornelis, N., Leibe, B., Cornelis, K., and Gool, L. V.
(2008). 3D urban scene modeling integrating recognition and reconstruction.
International Journal of Computer Vision, 78(2-3):121–141.
(Crandall et al., 2005) Crandall, D., Felzenszwalb, P., and Huttenlocher, D. (2005).
Spatial priors for part-based recognition using statistical models. In
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, volume 1, pages 10–17.
(Creusen et al., 2009) Creusen, I., Wijnhoven, R., and de With, P. (2009).
Applying feature selection techniques for visual dictionary creation in object
classification. In Proc. Int. Conf. on Image Processing, Computer Vision
and Pattern Recognition (IPCV), pages 722–727.
(Cucchiara et al., 2001) Cucchiara, R., Grana, C., Piccardi, M., Prati, A., and
Sirotti, S. (2001). Improving shadow suppression in moving object detection
with HSV color information. In Intelligent Transportation Systems, 2001.
Proceedings. 2001 IEEE, pages 334–339.
(Culibrk et al., 2009) Culibrk, D., Antic, B., and Crnojevic, V. (2009). Real-time
stable texture regions extraction for motion-based object segmentation. In
British Machine Vision Conference (BMVC), London.
(Dahlkamp et al., 2006) Dahlkamp, H., Ottlik, A., and Nagel, H.-H. (2006).
Comparison of edge-driven algorithms for model-based motion estimation.
In First International Workshop on Spatial Coherence for Visual Motion
Analysis SCVMA, volume 3667 of Lecture Notes in Computer Science,
pages 38–50.
(Dahlkamp et al., 2004) Dahlkamp, H., Pece, A. E., Ottlik, A., and Nagel, H.-H.
(2004). Differential analysis of two model-based vehicle tracking
approaches. In Pattern Recognition, volume 3175 of Lecture Notes in
Computer Science, pages 71–78.
(Dalal and Triggs, 2005) Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In Computer Vision and Pattern Recognition,
2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages
886–893.
(Dalal et al., 2006) Dalal, N., Triggs, B., and Schmid, C. (2006). Human detection
using oriented histograms of flow and appearance. In ECCV 2006, pages
428–441.
(Davies and Lienhart, 2006) Davies, B. and Lienhart, R. (2006). Using CART to
segment road images. In Proceedings of SPIE Multimedia Content Analysis,
Management and Retrieval.
(Dunn and Parberry, 2002) Dunn, F. and Parberry, I. (2002). 3D math primer for
graphics and game development. Wordware Publishing.
(Everingham et al., 2009) Everingham, M., Gool, L. V., Williams, C. K. I., Winn,
J., and Zisserman, A. (2009). The PASCAL visual object classes (VOC)
challenge. International Journal of Computer Vision (in press).
(Gandhi and Trivedi, 2007) Gandhi, T. and Trivedi, M. M. (2007). Video based
surround vehicle detection, classification and logging from moving
platforms: Issues and approaches. In Intelligent Vehicles Symposium, 2007
IEEE, pages 1067–1071.
(Gao et al., 2009a) Gao, T., Liu, Z., Gao, W., and Zhang, J. (2009a). Moving
vehicle tracking based on SIFT active particle choosing. In Advances in
Neuro-Information Processing in LNCS, volume 5507 of Lecture Notes in
Computer Science, pages 695–702. Springer.
(Gao et al., 2009b) Gao, T., Liu, Z., Gao, W., and Zhang, J. (2009b). A robust
technique for background subtraction in traffic video. In Advances in Neuro-
Information Processing, volume 5507 of Lecture Notes in Computer
Science, pages 736–744. Springer.
(Gordon et al., 1993) Gordon, N., Salmond, D., and Smith, A. (1993). Novel
approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and
Signal Processing, IEE Proceedings F, 140(2):107–113.
(Guha et al., 2006) Guha, P., Mukerjee, A., and Venkatesh, K. (2006). Appearance
based multiple agent tracking under complex occlusions. In PRICAI 2006:
Trends in Artificial Intelligence, volume 4099 of Lecture Notes in Computer
Science, pages 593–602. Springer.
(Guo et al., 2008) Guo, Y., Rao, C., Samarasekera, S., Kim, J., Kumar, R., and
Sawhney, H. (2008). Matching vehicles under large pose transformations
using approximate 3D models and piecewise MRF model. In Computer Vision
and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–
8.
(Gupte et al., 2002) Gupte, S., Masoud, O., Martin, R., and Papanikolopoulos, N.
(2002). Detection and classification of vehicles. Intelligent Transportation
Systems, IEEE Transactions on, 3(1):37–47.
(Hsieh et al., 2006) Hsieh, J. W., Yu, S. H., Chen, Y. S., and Hu, W. F. (2006).
Automatic traffic surveillance system for vehicle tracking and classification.
IEEE Transactions On Intelligent Transportation Systems, 7(2):175–187.
(Hu et al., 2004) Hu, W., Xiao, X., Xie, D., Tan, T., and Maybank, S. (2004).
Traffic accident prediction using 3-D model-based vehicle tracking.
Vehicular Technology, IEEE Transactions on, 53(3):677–694.
(iLIDS, nd) Home Office Scientific Development Branch, (n.d.). Imagery library for
intelligent detection systems i-LIDS. [Link] (accessed 4
September 2009).
(Johansson et al., 2009) Johansson, B., Wiklund, J., Forssén, P.-E., and Granlund,
G. (2009). Combining shadow detection and simulation for estimation of
vehicle size and position. Pattern Recognition Letters, 30(8):751–759.
(Jones and Snow, 2008) Jones, M. and Snow, D. (2008). Pedestrian detection
using boosted features over many frames. In Pattern Recognition, 2008.
ICPR 2008. 19th International Conference on, pages 1–4.
(Kamijo et al., 2004) Kamijo, S., Harada, M., and Sakauchi, M. (2004). Incident
detection based on semantic hierarchy composed of the spatio-temporal MRF
model and statistical reasoning. In Systems, Man and Cybernetics, 2004
IEEE International Conference on, volume 1, pages 415–421.
(Kamijo et al., 2001a) Kamijo, S., Ikeuchi, K., and Sakauchi, M. (2001a). Event
recognitions from traffic images based on spatio-temporal Markov random
field model. In 8th World Congress on Intelligent Transportation Systems.
(Kamijo et al., 2001b) Kamijo, S., Ikeuchi, K., and Sakauchi, M. (2001b). Vehicle
tracking in low-angle and front-view images based on spatio-temporal
Markov random field model. In 8th World Congress on ITS.
(Kamijo et al., 2000) Kamijo, S., Matsushita, Y., Ikeuchi, K., and Sakauchi, M.
(2000). Traffic monitoring and accident detection at intersections. Intelligent
Transportation Systems, IEEE Transactions on, 1(2):108–118.
(Kanhere, 2008) Kanhere, N. (2008). Vision-based Detection, Tracking and
Classification of Vehicles using Stable Features with Automatic Camera
Calibration. PhD thesis, Clemson University, USA.
(Kanhere et al., 2005) Kanhere, N., Pundlik, S., and Birchfield, S. (2005). Vehicle
segmentation and tracking from a low-angle off-axis camera. In Computer
Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, volume 2, pages 1152–1157.
(Kastrinaki et al., 2003) Kastrinaki, V., Zervakis, M., and Kalaitzakis, K. (2003).
A survey of video processing techniques for traffic applications. Image and
Vision Computing, 21(4):359–381.
(Khammari et al., 2005) Khammari, A., Nashashibi, F., Abramson, Y., and
Laurgeau, C. (2005). Vehicle detection combining gradient analysis and
AdaBoost classification. In Intelligent Transportation Systems, 2005.
Proceedings. 2005 IEEE, pages 66–71.
(Kim and Malik, 2003) Kim, Z. and Malik, J. (2003). Fast vehicle detection with
probabilistic feature grouping and its application to vehicle tracking. In
Computer Vision, 2003. Proceedings. Ninth IEEE International Conference
on, volume 1, pages 524–531.
(Kläser et al., 2008) Kläser, A., Marszalek, M., and Schmid, C. (2008). A spatio-temporal
descriptor based on 3D-gradients. In British Machine Vision
Conference BMVC 2008, volume 2, pages 995–1004.
(Kumar et al., 2008) Kumar, P., Mittal, A., and Kumar, P. (2008). Study of robust
and intelligent surveillance in visible and multi-modal framework.
Informatica, 32:63–77.
(Kumar et al., 2003) Kumar, P., Ranganath, S., and Huang, W. M. (2003). Bayesian
network based computer vision algorithm for traffic monitoring using video.
IEEE Intelligent Transportation Systems Proceedings, Vols. 1 & 2, pages
897–902.
(Leibe et al., 2007) Leibe, B., Cornelis, N., Cornelis, K., and Van Gool, L. (2007).
Dynamic 3D scene analysis from a moving vehicle. In Computer Vision and
Pattern Recognition. CVPR '07. IEEE Conference on, pages 1–8.
(Leibe et al., 2004) Leibe, B., Leonardis, A., and Schiele, B. (2004). Combined
object categorization and segmentation with an implicit shape model. In
ECCV’04 Workshop on Statistical Learning in Computer Vision, pages 17–
32.
(Leibe et al., 2008a) Leibe, B., Leonardis, A., and Schiele, B. (2008a). Robust
object detection with interleaved categorization and segmentation.
International Journal of Computer Vision Special Issue on Learning for
Recognition and Recognition for Learning, 77(1-3):259–289.
(Leibe et al., 2008b) Leibe, B., Schindler, K., Cornelis, N., and Van Gool, L.
(2008b). Coupled object detection and tracking from static cameras and
moving vehicles. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 30(10):1683–1698.
(Leibe et al., 2005) Leibe, B., Seemann, E., and Schiele, B. (2005). Pedestrian
detection in crowded scenes. In Computer Vision and Pattern Recognition,
2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages
878–885.
(Liebelt et al., 2008) Liebelt, J., Schmid, C., and Schertler, K. (2008). Viewpoint-independent
object class detection using 3D feature maps. In Computer
Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on,
pages 1–8.
(Lou et al., 2005) Lou, J., Tan, T., Hu, W., Yang, H., and Maybank, S. J. (2005). 3-D
model-based vehicle tracking. Image Processing, IEEE Transactions on,
14(10):1561–1569.
(Ma and Grimson, 2005) Ma, X. and Grimson, W. (2005). Edge-based rich
representation for vehicle classification. In Computer Vision, 2005. ICCV
2005. Tenth IEEE International Conference on, volume 2, pages 1185–
1192.
(Mauthner et al., 2008) Mauthner, T., Donoser, M., and Bischof, H. (2008). Robust
tracking of spatial related components. In Pattern Recognition, 2008. ICPR
2008. 19th International Conference on, pages 1–4.
(Messelodi et al., 2005a) Messelodi, S., Modena, C. M., Segata, N., and Zanin, M.
(2005a). A Kalman filter based background updating algorithm robust to
sharp illumination changes. In Roli, F. and Vitulano, S., editors, Lecture
Notes in Computer Science, volume 3617, pages 163–170. Springer Berlin /
Heidelberg. Proceedings of the 13th International Conference on Image
Analysis and Processing.
(Messelodi et al., 2005b) Messelodi, S., Modena, C. M., and Zanin, M. (2005b). A
computer vision system for the detection and classification of vehicles at
urban road intersections. Pattern Analysis & Applications, 8(1-2):17–31.
(Monnet et al., 2003) Monnet, A., Mittal, A., Paragios, N., and Ramesh, V.
(2003). Background modeling and subtraction of dynamic scenes. In
Computer Vision, 2003. Proceedings. Ninth IEEE International Conference
on, pages 1305–1312.
(Morris and Trivedi, 2006a) Morris, B. and Trivedi, M. (2006a). Improved vehicle
classification in long traffic video by cooperating tracker and classifier
modules. In AVSS '06: Proceedings of the IEEE International Conference
on Video and Signal Based Surveillance, page 9, Washington, DC, USA.
IEEE Computer Society.
(Muller et al., 2001) Muller, K.-R., Mika, S., Ratsch, G., Tsuda, K., and Scholkopf,
B. (2001). An introduction to kernel-based learning algorithms. IEEE
Transactions on Neural Networks, 12(2):181–202.
(Needham and Boyle, 2003) Needham, C. and Boyle, R. (2003). Performance
evaluation metrics and statistics for positional tracker evaluation. In
International Conference on Computer Vision Systems, ICVS'03, pages 278–
289, Graz, Austria.
(Nguyen and Le, 2008) Nguyen, P. and Le, H. (2008). A multi-modal particle filter
based motorcycle tracking system. In PRICAI 2008: Trends in Artificial
Intelligence in LNCS, volume 5351, pages 819–828. Springer.
(Nummiaro et al., 2003) Nummiaro, K., Koller-Meier, E., and Gool, L. V. (2003).
An adaptive color-based particle filter. Image and Vision Computing,
21(1):99–110.
(Opelt, 2006) Opelt, A. (2006). Generic Object Recognition. PhD thesis, Graz
University of Technology, Austria.
(Opelt et al., 2006a) Opelt, A., Pinz, A., Fussenegger, M., and Auer, P. (2006a).
Generic object recognition with boosting. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 28(3):416–431.
(Opelt et al., 2006b) Opelt, A., Pinz, A., and Zisserman, A. (2006b). A boundary-
fragment-model for object detection. In Proceedings of the European
Conference on Computer Vision, pages 575–588. Springer-Verlag Berlin
Heidelberg.
(Opelt et al., 2006c) Opelt, A., Pinz, A., and Zisserman, A. (2006c). Incremental
learning of object detectors using a visual shape alphabet. In IEEE
Conference on Computer Vision and Pattern Recognition, volume 1, pages
3–10, Los Alamitos, CA, USA. IEEE Computer Society.
(Ottlik and Nagel, 2008) Ottlik, A. and Nagel, H. (2008). Initialization of model-
based vehicle tracking in video sequences of inner-city intersections.
International Journal of Computer Vision, 80(2):211–225.
(Park et al., 2007) Park, K., Lee, D., and Park, Y. (2007). Video-based detection
of street-parking violation. In Proceedings of the International Conference
on Image Processing, Computer Vision, and Pattern Recognition CVPR
2007.
(Pingkun et al., 2007) Pingkun, Y., Khan, S., and Shah, M. (2007). 3D model based
object class detection in an arbitrary view. In Computer Vision, 2007. ICCV
2007. IEEE 11th International Conference on, pages 1–6.
(Pinz, 2005) Pinz, A. (2005). Foundations and Trends in Computer Graphics and
Vision, volume 1, chapter Object Categorization, pages 255–353.
[Link].
(Prati et al., 2003) Prati, A., Mikić, I., Cucchiara, R., and Trivedi, M. M. (2003).
Comparative evaluation of moving shadow detection algorithms. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25:918–923.
(project PASCAL, nd) project PASCAL (n.d.). The PASCAL visual object classes
homepage. [Link] (accessed 3rd
November 2009).
(Rad and Jamzad, 2005) Rad, R. and Jamzad, M. (2005). Real time classification
and tracking of multiple vehicles in highways. Pattern Recognition Letters,
26(10):1597–1607.
(Remagnino et al., 1998) Remagnino, P., Maybank, S., Fraile, R., Baker, K., and
Morris, R. (1998). 'Advanced Video-based Surveillance Systems', chapter
Automatic Visual Surveillance of Vehicles and People, pages 97–107.
Hingham, MA., USA.
(Robert, 2009a) Robert, K. (2009a). Night-time traffic surveillance: A robust
framework for multi-vehicle detection, classification and tracking. In
Advanced Video and Signal Based Surveillance, IEEE Conference on, pages
1–6, Los Alamitos, CA, USA. IEEE Computer Society.
(Serre et al., 2007) Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio,
T. (2007). Robust object recognition with cortex-like mechanisms. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426.
(Sheikh and Shah, 2005) Sheikh, Y. and Shah, M. (2005). Bayesian modeling of
dynamic scenes for object detection. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 27(11):1778–1792.
(Shotton et al., 2009) Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2009).
TextonBoost for image understanding: Multi-class object recognition and
segmentation by jointly modeling texture, layout, and context. International
Journal of Computer Vision, 81(1):2–23.
(Song and Tai, 2008) Song, K.-T. and Tai, J.-C. (2008). Real-time background
estimation of traffic imagery using group-based histogram. Journal of
Information Science and Engineering, 24(2):411–423.
(Song and Nevatia, 2007) Song, X. and Nevatia, R. (2007). Detection and tracking
of moving vehicles in crowded scenes. In Motion and Video Computing.
WMVC '07. IEEE W. on, page 4.
(Starck and Hilton, 2005) Starck, J. and Hilton, A. (2005). Spherical matching for
temporal correspondence of non-rigid surfaces. In Computer Vision, 2005.
ICCV 2005. Tenth IEEE International Conference on, volume 2, pages
1387–1394.
(Sturgess et al., 2009) Sturgess, P., Alahari, K., Ladicky, L., and Torr, P. H.
(2009). Combining appearance and structure from motion features for road
scene understanding. In British Machine Vision Conference.
(Su et al., 2007) Su, X., Khoshgoftaar, T., Zhu, X., and Folleco, A. (2007). Rule-
based multiple object tracking for traffic surveillance using collaborative
background extraction. In Advances in Visual Computing, volume 4842 of
Lecture Notes in Computer Science, pages 469–478. Springer.
(Sullivan et al., 1996) Sullivan, G. D., Baker, K. D., Worrall, A. D., Attwood,
C. I., and Remagnino, P. R. (1996). Model-based vehicle detection and
classification using orthographic approximations. In Proceedings of 7th
British Machine Vision Conference, volume 2, pages 695–704.
(Sun et al., 2004) Sun, Z., Bebis, G., and Miller, R. (2004). On-road vehicle
detection using optical sensors: a review. In Intelligent Transportation
Systems, 2004. Proceedings. The 7th International IEEE Conference on,
pages 585–590.
(Sun et al., 2006) Sun, Z., Bebis, G., and Miller, R. (2006). On-road vehicle
detection: a review. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 28(5):694–711.
(Taj et al., 2008) Taj, M., Maggio, E., and Cavallaro, A. (2008). Objective
evaluation of pedestrian and vehicle tracking on the CLEAR surveillance
dataset. Lecture Notes In Computer Science, 4625:160–173.
(Tan et al., 1998) Tan, T., Sullivan, G., and Baker, K. (1998). Model-based
localisation and recognition of road vehicles. International Journal of
Computer Vision, 27(1):5–25.
(Tanaka et al., 2007) Tanaka, T., Shimada, A., Arita, D., and Taniguchi, R.-I.
(2007). A fast algorithm for adaptive background model construction using
Parzen density estimation. In Advanced Video and Signal Based
Surveillance, 2007. AVSS 2007. IEEE Conference on, pages 528–533.
(Thi et al., 2008) Thi, T. H., Robert, K., Lu, S., and Zhang, J. (2008). Vehicle
classification at nighttime using eigenspaces and support vector machine. In
Image and Signal Processing, 2008. CISP '08. Congress on, volume 2,
pages 422–426.
(Torr, 2007) Torr, P. H. S. (2007). Graph cuts and their use in computer vision. In
International Computer Vision Summer School 2007.
(van der Maaten et al., 2009) van der Maaten, L., Postma, E., and van den Herik, H.
(2009). Dimensionality reduction: A comparative review. Submitted to
Journal of Machine Learning Research.
(Viola and Jones, 2004) Viola, P. and Jones, M. J. (2004). Robust real-time face
detection. International Journal Of Computer Vision, 57(2):137–154.
(Wang et al., 2009a) Wang, H., Ullah, M. M., Kläser, A., Laptev, I., and Schmid, C.
(2009a). Evaluation of local spatio-temporal features for action recognition.
In British Machine Vision Conference, London.
(Wang et al., 2006) Wang, J., Bebis, G., and Miller, R. (2006). Robust video-based
surveillance by integrating target detection with tracking. In Computer
Vision and Pattern Recognition Workshop, 2006. CVPRW '06. Conference
on, pages 137–137.
(Wang et al., 2004) Wang, J., Chung, Y., Lin, S., Chang, S., Cherng, S., and Chen,
S. (2004). Vision-based traffic measurement system. In Pattern Recognition,
2004. ICPR 2004. Proceedings of the 17th International Conference on,
volume 4, pages 360–363.
(Wang et al., 2009b) Wang, J., Ma, Y., Li, C., Wang, H., and Liu, J. (2009b). An
efficient multi-object tracking method using multiple particle filters. In
Computer Science and Information Engineering, 2009 WRI World Congress
on, volume 6, pages 568–572.
(Wijnhoven et al., 2008) Wijnhoven, R., de With, P., and Creusen, I. (2008).
Efficient template generation for object classification in video surveillance.
In Proc. of 29th Symposium on Information Theory in the Benelux, pages
255–262.
(Woodford et al., 2009) Woodford, O. J., Rother, C., and Kolmogorov, V. (2009).
A global perspective on MAP inference for low-level vision. In IEEE
International Conference on Computer Vision (ICCV), pages 2319–2326.
(Wu and Nevatia, 2005) Wu, B. and Nevatia, R. (2005). Detection of multiple,
partially occluded humans in a single image by Bayesian combination of
edgelet part detectors. In Computer Vision, 2005. ICCV 2005. Tenth IEEE
International Conference on, volume 1, pages 90–97.
(Yin et al., 2007) Yin, F., Makris, D., and Velastin, S. A. (2007). Performance
evaluation of object tracking algorithms. In 10th IEEE International
Workshop on Performance Evaluation of Tracking and Surveillance,
PETS'07, Rio de Janeiro.
(Yoneyama et al., 2005) Yoneyama, A., Yeh, C.-H., and Kuo, C.-C. J. (2005).
Robust vehicle and traffic information extraction for highway surveillance.
EURASIP J. Appl. Signal Process., 2005:2305–2321.
(Zhang et al., 2008a) Zhang, D., Qu, S., and Liu, Z. (2008a). Robust classification
of vehicle based on fusion of tsrp and wavelet fractal signature. In
Networking, Sensing and Control, 2008. ICNSC 2008. IEEE International
Conference on, pages 1788–1793.
(Zhang et al., 2007a) Zhang, G., Avery, R. P., and Wang, Y. (2007a). Video-
based vehicle detection and classification system for real-time traffic data
collection using uncalibrated video cameras. Transportation Research
Record, 1993:138–147.
(Zhang et al., 2007b) Zhang, J., Marszalek, M., Lazebnik, S., and Schmid, C.
(2007b). Local features and kernels for classification of texture and object
categories: A comprehensive study. International Journal of Computer
Vision, 73(2):213–238.
(Zhang and Tan, 2002) Zhang, J. and Tan, T. (2002). Brief review of invariant
texture analysis methods. Pattern Recognition, 35(3):735–747.
(Zhang et al., 2008b) Zhang, W., Wu, Q., Yang, X., and Fang, X. (2008b).
Multilevel framework to detect and handle vehicle occlusion. Intelligent
Transportation Systems, IEEE Transactions on, 9(1):161–174.
(Zhang et al., 2005) Zhang, W., Yu, B., Zelinsky, G., and Samaras, D. (2005).
Object class recognition using multiple layer boosting with heterogeneous
features. In Computer Vision and Pattern Recognition, 2005. CVPR 2005.
IEEE Computer Society Conference on, volume 2, pages 323–330.
(Zhang et al., 2008c) Zhang, Z., Li, M., Huang, K., and Tan, T. (2008c). Boosting
local feature descriptors for automatic objects classification in traffic scene
surveillance. International Conference on Pattern Recognition (ICPR) 2008.
(Zheng et al., 2005) Zheng, J., Wang, Y., Nihan, N. L., and Hallenbeck, M. E.
(2005). Extracting roadway background image: Mode-based approach.
Transportation Research Record, (1944):82–88.