REAL-TIME AUTOMATIC DETECTION OF VIOLENT-ACTS BY LOW-LEVEL COLOUR
VISUAL CUES
Alessandro Mecocci, Francesco Micheli
University of Siena - Dept. of Information Engineering
Via Roma n° 56, 53100 Siena (Italy)
alemecoc@[Link], michelifrancesco@[Link]
ABSTRACT

Automatic recognition of human activities is important for the development of next-generation video-surveillance systems. In this paper we address the specific problem of automatically detecting violent interpersonal acts in monocular colour video streams. Unlike previous approaches, only little knowledge is assumed about the acquisition setup and about the content of the acquired scenes, so the proposed approach is suitable in a wide range of practical cases. Reliability and general-purpose applicability are achieved by analysing low-level features (like the spatial-temporal behaviour of coloured stains) and by measuring some warping and motion parameters. In this way it is not necessary to extract accurate target silhouettes, a critical task because of the occlusions and overcrowding that are typical during interpersonal contacts. A suitable index, called Maximum Warping Energy (MWE), has been defined to describe the localized spatial-temporal complexity of colour conformations. Our experiments show that aggressive activities give significantly higher MWE values compared with safe actions like walking, running, embracing or handshaking, so it is possible to distinguish violent acts from normal behaviours even in the presence of many people and crowded environments. Homography is used to improve robustness by verifying the real nearness of targets: false interactions due to perspective-induced occlusions are discarded.

Index Terms— Image Processing, Violent Acts Recognition

1. INTRODUCTION

Nowadays, it is important and urgent to develop automatic systems capable of monitoring those areas where quiet and pacific behaviours must be granted (airports, schools, rail stations, etc.), or of giving immediate alarm in case of danger in unsafe and isolated places. The current use of 24-hour digital video recording is not satisfactory, for at least two reasons: first, a huge amount of digital video (comprising normal people doing normal actions) must be recorded to grant the ability of storing an extremely short chunk of useful data. This fact raises serious privacy issues. Second, such data can be used only for "post factum" reconstructions; they are not useful to prevent crimes nor to enable just-in-time counteractions.

This is why in this paper we present a real-time automatic system for detecting aggressive and suspicious acts. The main goals are: to identify those video intervals comprising violent or aggressive activities; to store a digital record of such activities; and to raise timely alarms, both locally and remotely. The key advantages are: a huge reduction of recorded data, which focuses only on suspicious actions (this reduces the privacy-invasion issue), and a proactive effect that enables well-timed reactions by police or other responsible entities.

Even if the problem of automatic violent-act detection is an important one, only little literature covers this topic. In [1], person-on-person violent actions are recognized by reasoning at a single-target level. First, silhouettes of each person are extracted from the image, then they are segmented and the principal parts of the human body (head, neck, shoulders, limbs) are located. Finally, visual acts are classified by using the trajectories, accelerations and orientations of these principal parts. Unfortunately, this approach works well only in the presence of few people (ideally only two interacting people), so that their silhouettes can be accurately outlined. Besides, a low mutual occlusion rate is required, because otherwise the system cannot understand where the principal parts are located and to which silhouettes they belong.

In [2] and [3], video and audio cues are jointly used to detect aggressive actions in a scene. These approaches give more robust results, because occlusions are more easily tolerated. Nevertheless, silhouette segmentation is still needed, so the occlusion issues remain.

Since we want a system that works in real operative conditions (in such cases the environment is quite crowded), we propose an approach based on global chromatic features extracted from moving objects in the video stream. In other words, we do not try to detect violent actions by reasoning at a single-target level, because it is almost impossible to get precise silhouettes of people when the occlusion level is high and the camera resolution is medium to low (as is typical in the surveillance field). Moreover, during the interaction of three or more persons, the complexity and the number of possible spatial arrangements of people grow exponentially, rapidly reaching the intractability limit. Instead, we use multiple spatio-temporal visual cues to reach a reasonable certainty that an aggressive action is under way. In this sense, our philosophy is closer to that in [2] and [3] than to that in [1].
1-4244-1437-7/07/$20.00 ©2007 IEEE I - 345 ICIP 2007
The principal idea is that, during violent actions, the interplay between people elicits higher accelerations and disordered movements of localized subparts in the scene. Moreover, because of the closeness of the human bodies, various subparts tend to be hidden and unhidden very often, causing changes in their chromatic appearance. Therefore, our approach tries to solve the aggressive-act recognition problem by analysing the spatial and temporal behaviour of the colour stains into which each moving region in the scene can be segmented. Note that we speak about moving regions, meaning that such regions need not refer to a single target or silhouette. In this way the system easily copes with the occlusion problem (each region can be caused by the fusion of two or more targets, and can be imprecise because of low-level segmentation errors).

To classify the actions, we introduce a localized complexity index that has been called Maximum Warping Energy (MWE). The term localized means that the measurements do not refer to absolute movements (in that case a running person would have a high movement level even if the action is not aggressive), but to relative motion, that is, to movements referred to the centroid position of the moving region under analysis. In this sense the approach follows the old idea that the human visual system breaks up the perceived motion into two parts: the first is the common motion of the whole configuration, and the second is the relative motion of each element within the configuration [4].

Therefore, our method works as follows:

1) moving blobs in the scene are extracted by background estimation techniques;
2) nearby blobs (according to the Hausdorff distance) are grouped (they are delimited by minimum bounding boxes) to form Regions of Interest (RoIs) that are followed from frame to frame (by centroid tracking);
3) the colour values of each blob comprising each RoI are clustered to get a set of significant stains that is called the RoI Colour Framework (RoI-CF);
4) the movements of each colour stain within each RoI-CF are used to get an estimate of the MWE;
5) the nature of the ongoing interaction is decided based on the time behaviour of the MWE.

The paper is organized as follows: paragraph two describes the flat-field assumption used to discriminate visual occlusions from real physical proximity. Paragraphs three and four describe the Colour Stain Framework used to extract visual cues to detect violent actions. Paragraph five describes the method used to analyse the motion of colour stains, while paragraph six discusses the results and future trends.

2. FLAT FIELD ASSUMPTION

In our experiments we noted that, in some critical situations, false alarms were given even if no aggressive action was occurring. Such problems mainly arise when moving targets interact because of visual occlusions. A typical case is when three or more targets run in opposite directions and fuse together because of perspective projection onto the image plane. The difference in spatial scale of the converging targets (which are far away from one another), and the severe occlusion that takes place, tend to give high values of the MWE index even if it is impossible for the targets to interact (they lie in locations far apart). To fight this negative effect, we assume flat-ground acquisition scenarios, i.e. we assume that all the targets move over a plane. This is not a restrictive hypothesis, since most working cases show a planar floor. Also, the flat-ground assumption allows the use of homography to find the relative position of targets (compensation of perspective distortion).

Figure 1: Example of homography. The real plane Z = 0 is mapped onto the image plane (x, y) through the centre point C.

Using a pinhole camera model, the relation between the homogeneous coordinates of a 3D point P = [X, Y, Z, 1]^T in the world coordinate system and the homogeneous coordinates of its projected point p = [x, y, 1]^T in the image plane is given by the (3x4) matrix M:

    p = MP    (1)

M accounts for both the intrinsic and extrinsic parameters of the camera, and is obtained from the product of an upper triangular matrix (intrinsic part) and a rigid transformation matrix (extrinsic part). Its 12 coefficients {m_ij} are obtained by using suitable calibration patches or by taking some measurements of the acquired scene [5].

It is easy to see that, if the feet of moving people are assumed to stay in contact with the ground, the previous equation can be solved to recover the actual position of the target.

In practice, while two or more RoIs move, the contours of the blobs comprising each RoI are analysed and used to estimate the position of the lowest points of the RoI (which are assumed to correspond to points on the world ground plane). Starting from these coordinates, the actual position of each RoI is estimated. When two or more RoIs are near enough on the world ground plane (say, within a Euclidean neighbourhood whose dimensions are experimentally defined), they are considered as interacting. Otherwise they are considered as occluded because of perspective (no physical interaction).
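The ground-position recovery just described can be sketched in a few lines. Since all targets lie on the Z = 0 plane, the full 3x4 projection M reduces to a 3x3 plane-to-image homography H (M with its third column dropped), which can be inverted to map a RoI's lowest image point back to the ground plane. The following is a minimal illustration, not the authors' implementation; the function names and the toy homography are hypothetical:

```python
import numpy as np

def image_to_ground(H, p_img):
    """Map an image point to ground-plane coordinates by inverting the
    ground-to-image homography H (3x3, for the Z = 0 world plane)."""
    p = np.array([p_img[0], p_img[1], 1.0])
    w = np.linalg.solve(H, p)   # homogeneous ground coordinates
    return w[:2] / w[2]         # normalize

def rois_interacting(H, lowest_pts, dist_thresh):
    """Decide whether two RoIs physically interact: project the lowest
    image point of each RoI onto the ground plane and compare their
    Euclidean distance with an experimentally chosen threshold."""
    g0 = image_to_ground(H, lowest_pts[0])
    g1 = image_to_ground(H, lowest_pts[1])
    return np.linalg.norm(g0 - g1) < dist_thresh

# Toy example: identity homography, two feet points 0.5 units apart
# interact under a threshold of 1.0, but not under 0.2.
H = np.eye(3)
print(rois_interacting(H, [(0.0, 0.0), (0.5, 0.0)], dist_thresh=1.0))
```

Two far-apart targets that merely overlap in the image give a large ground-plane distance, so the spurious interaction is discarded.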
Figure 2: (a) Original frame; (b) Foreground Mask; (c) Segmented Foreground Object in CIELab colour coordinates.

3. BACKGROUND ESTIMATION AND ROI TRACKING

The blobs used to define each RoI are obtained through a background estimation and maintenance approach. An adaptive statistical model based on three Gaussian mixtures (one for each colour channel), comprising five Gaussians each, is used in a way similar to that in [6].

Once all blobs have been extracted, nearby blobs are grouped and a minimum bounding box is used to delimit each group (RoI). The separation between individual blobs is measured according to the classic Hausdorff distance [7]. The distance threshold used to define a RoI is not critical, because the important information relates to the low-level features given by colour-stain movements (see below).

Once the RoIs have been identified, they are followed by a simple centroid tracking technique. Note that the RoI centroid is the centroid of the whole set of blobs comprising that RoI. Such a simple tracking scheme can be used because we do not need a precise trajectory: our goal is to compute an overall estimate of the global motion of each group (RoI) in order to evaluate the relative movements of each blob within the group. Besides, the centroid tracking method is very fast.

Let F_t be the frame at time instant t, I_t the matching binary segmented image provided by the background estimation module, and Ī_t the image obtained by masking F_t with I_t. Let N_t be the number of blobs (a blob is part of a segmented person, a whole person, or a group of persons; see Figure 2) in Ī_t. Such blobs are identified by B_i^t with i = 1, ..., N_t and are characterized by their own centroids C_i^t = (C_i^{t,x}, C_i^{t,y}).

The set of blobs is partitioned into subsets S_k so that B_i ∈ S_k iff there exists B_z ∈ S_k such that d_A(B_i, B_z) < θ, where d_A(·,·) is the Hausdorff distance and θ is a suitable threshold. A RoI R_i is the rectangular region defined by the minimum bounding box comprising all the blobs in S_k.

The centroid of each RoI is defined as the centroid of the blobs comprising it. The RoI tracker is a function φ(i): {1, ..., NR_t} → {1, ..., NR_{t-1}} which matches each RoI of a frame with the corresponding RoI in the previous frame.

4. ROI COLOUR FRAMEWORK

To capture the variations due to possible aggressive actions in the scene, we introduce the idea of Colour Framework (CF). A Colour Framework is a set of m binary images J_{i,k}^t with k = 1, ..., m associated to RoI R_i. Such images are obtained by segmenting the blobs of the RoI R_i according to a colour clustering algorithm. In our current implementation, to speed up the calculation, we use the CIELab colour space and the clustering is performed by simply subsampling the colour components in a uniform way. Let L be the original number of quantization levels for each colour component (L = 256 in our case), and let Δ be the subsampling factor; then the new number of quantization intervals for each component is L/Δ (we assume that L is a multiple of Δ). The same quantization factor is applied to each component, so we easily get:

    m = (L/Δ)^3    (2)

In our experiments good results have been obtained with Δ = 64 or Δ = 128, which gives 64 or 8 possible colour classes. The label for each colour class is assigned starting from 1 and increasing the value by 1 while scanning the resampled CIELab cube in a progressive order (L value first, from the lowest to the largest value, then the a value, then the b value).

To build the J_{i,k}^t binary images we create a new image F_t^R by assigning to each pixel of F_t the label of its corresponding colour class. After that, each image J_{i,k}^t is obtained by applying the following formula:

    J_{i,k}^t(x, y) = 1 if F_t^R(x, y) = k, 0 if F_t^R(x, y) ≠ k    (3)

where (x, y) denotes the pixel coordinates, while k is an integer running from 1 to m. Evidently each J_{i,k}^t contains the pixels of the RoI whose colour belongs to colour class k at time t. Each image J_{i,k}^t comprises a certain number n_{i,k}^t of blobs. Such blobs represent the stains of colour k within the RoI R_i, and are indicated by R_{i,k,s}^t with s = 1, ..., n_{i,k}^t. Their centroids are denoted by C_{i,k,s}^t.

5. ANALYSIS OF COLOUR STAINS MOTION

The robustness of our approach is based on the possibility of judging the presence of violent acts by detecting the degree of variation and the temporal behaviour of some visual cues. In particular, the degree of motion of the colour stains with respect to the global motion turned out to be a suitable descriptor. To analyse the stain motion, first the global motion of each RoI is estimated from frame F_{t-1} to frame F_t (that is, a match is established between a certain C_i^t and a certain C_{φ(i)}^{t-1}).
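Going back to the Colour Framework of Section 4, the uniform CIELab subsampling and the construction of the binary images J_k can be sketched as follows. This is a minimal illustration under the paper's stated parameters (L = 256, Δ = 128, so m = 8); the function names and the toy image are hypothetical:

```python
import numpy as np

L = 256        # quantization levels per colour component
DELTA = 128    # subsampling factor: (256 // 128) ** 3 = 8 colour classes

def colour_class_labels(lab_image):
    """Assign each pixel a colour-class label in 1..m by uniformly
    subsampling the three CIELab components, scanning the resampled
    cube with the L value varying fastest, then a, then b."""
    q = lab_image // DELTA            # per-channel bin index in 0..L/DELTA-1
    n = L // DELTA
    labels = q[..., 0] + n * q[..., 1] + n * n * q[..., 2]
    return labels + 1                 # labels start from 1

def colour_framework(labels, m):
    """Build the m binary images J_k of the Colour Framework:
    J_k(x, y) = 1 where the pixel belongs to colour class k."""
    return [(labels == k).astype(np.uint8) for k in range(1, m + 1)]

# Toy 4x4 RoI: mostly dark pixels plus one bright 2x2 stain.
lab = np.zeros((4, 4, 3), dtype=np.uint8)
lab[2:, 2:] = 200
labels = colour_class_labels(lab)
J = colour_framework(labels, m=(L // DELTA) ** 3)
```

Each J_k can then be labelled into connected components to obtain the n_k stains of colour k and their centroids.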
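The relative-motion measure at the heart of Section 5 can be sketched directly: the warping energy of a matched stain pair is the squared norm of the stain's centroid displacement once the RoI's global (centroid) motion has been subtracted. A minimal sketch follows, with hypothetical names; it is not the authors' implementation:

```python
import numpy as np

def warping_energy(stain_c_t, roi_c_t, stain_c_prev, roi_c_prev):
    """Warping energy of one colour stain: the displacement of its
    centroid relative to the RoI centroid, between frames t-1 and t,
    squared (Euclidean norm). A stain that merely translates with its
    RoI contributes zero energy."""
    rel = (np.asarray(stain_c_t) - np.asarray(roi_c_t)) \
        - (np.asarray(stain_c_prev) - np.asarray(roi_c_prev))
    return float(rel @ rel)

# A stain moving exactly with its RoI has zero warping energy...
print(warping_energy((12, 8), (10, 10), (2, -2), (0, 0)))   # 0.0
# ...while a stain moving against the global motion contributes energy.
print(warping_energy((15, 8), (10, 10), (2, -2), (0, 0)))   # 9.0
```

Summing this quantity over the matched stain pairs of a colour class gives the Total Warping relative Energy, and taking the maximum over the colour classes gives the MWE, as formalized below.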
Once such a pairing has been found, the colour-stain motion within each conformation is estimated by matching each blob in J_{i,k}^t at time t with a blob in J_{φ(i),k}^{t-1} at time t-1. The matching is repeated for the whole set of m images comprising each RoI-CF. The stain tracking is given by the function ψ(p, q): {1, ..., n_{i,k}^t} × {1, ..., n_{φ(i),k}^{t-1}} → {0, 1}, defined as follows:

    ψ(p, q) = 1 if R_{i,k,p}^t matches R_{φ(i),k,q}^{t-1}, 0 otherwise    (4)

To describe the spatial-temporal complexity of the colour-stain conformation, we introduce a synthetic index that has been called the Total Warping relative Energy TWE_{i,k}^t of the stains of colour k belonging to RoI R_i at time t. We first define the warping energy of stains of colour k at time t as follows:

    WE_{i,k}^t(p, q) = ||(C_{i,k,p}^t - C_i^t) - (C_{φ(i),k,q}^{t-1} - C_{φ(i)}^{t-1})||^2    (5)

Note that this energy refers to the relative local motion of the stains within the RoI. At this point the Total Warping relative Energy, for stains of colour k, can be defined according to:

    TWE_{i,k}^t = Σ_{p=1..n_{i,k}^t} Σ_{q=1..n_{φ(i),k}^{t-1}} ψ(p, q) · WE_{i,k}^t(p, q)    (6)

Since there are m different colour classes, there will be m Total Warping relative Energies. Such energies are fused by a maximum operator to yield the Maximum Warping Energy MWE_i^t according to the following:

    MWE_i^t = max_{k ∈ [1, m]} { TWE_{i,k}^t }    (7)

Figure 3: (a) RoI tracking frame by frame; (b) Colour-stain warping after RoI motion compensation. Coloured double arrows mark the relative motion of each stain in the lapse [t-1, t].

Figure 3 shows the previous ideas applied to 4 colour stains.

If MWE_i^t exceeds a predetermined threshold during a certain interval of time (currently a finite state machine is used that performs hysteresis thresholding while filtering out short and small variations), we decide that the activity within RoI R_i is violent.

6. RESULTS AND CONCLUSIONS

In our experiments we have used CIELab partitioned into 8 different colour classes. Many video sequences have been analysed, related to both indoor and outdoor environments at different times during the day. Even if some events have been misclassified, the proposed approach is very promising and shows a very low false-negative error level.

For example, Figure 4 shows the values of MWE in a difficult video where five people run towards each other from different directions. The physical interaction starts at frame 85. Note that the running phase is not considered dangerous (low value of MWE before frame 85). The interaction becomes more and more violent, culminating around frames 100 to 130. During this interval only the red line is visible, because all the people in the scene are very near (actually fighting together), so a single RoI is detected. Besides, the MWE values are significantly high. After frame 130 people move alternately forward and backward (someone falls to the floor), so the number of blobs changes sharply in time (and other coloured lines appear, matching the different RoIs present in the scene). Violence disappears after frame 160 and the MWE values drop accordingly.

Figure 4: Maximum Warping Energy of some RoIs versus frame number. Each colour identifies a different RoI.

Our future work will focus on improving the analysis of the time behaviour of the MWE through some kind of learning strategy. We will also improve the colour segmentation phase and we will integrate other sensors, like sound sensors.

7. REFERENCES

[1] Datta A., Shah M., Da Vitoria Lobo N., "Person-on-Person Violence Detection in Video Data", Proc. ICPR'02, Volume 1, p. 10433, 2002.
[2] Vasconcelos N., Lippman A., "Towards semantically meaningful feature spaces for the characterization of video content", Proc. ICIP, Volume 1, 1997, pp. 25-28.
[3] Nam J., Alghoniemy M., "Audio-visual content-based violent scene characterization", Proc. ICIP 98, pp. 353-357.
[4] Ramachandran V. S., Anstis S. M., "The Perception of Apparent Motion", Scientific American, pp. 80-87, June 1986.
[5] Forsyth D., Ponce J., "Computer Vision - A Modern Approach", Prentice Hall, 2003.
[6] Javed O., Shafique K., Shah M., "A Hierarchical Approach to Robust Background Subtraction using Color and Gradient Information", Computer Vision Lab, School of Electrical Engineering and Computer Science, University of Central Florida.
[7] Preparata F. P., Shamos M. I., "Computational Geometry: An Introduction", Springer-Verlag, New York, 1985.