RASPBERRY PI CROWD DETECTION SYSTEM
B. Tech. EC
Sem. VII
Submitted by
Name: Sruthi Cheruvullil        Name: Snigdha Lokre
Roll No.: EC010                 Roll No.: EC031
ID No.: 18ECUOS096              ID No.: 18ECUOS036
Certificate
This is to certify that the project entitled “Raspberry Pi Crowd Detection System” is a
bonafide work of Miss. Cheruvullil Sruthi Gokulandhan, Roll No.: EC010, Identity
No.: 18ECUOS096, of B.Tech. Semester VII in the branch of “Electronics and
Communication” during the academic year 2021-2022.
Date: 27-11-2021
ACKNOWLEDGEMENT
The success and final outcome of any project requires the right guidance and assistance from
a lot of people, and we are extremely privileged to have received it throughout the completion
of our project. All that we have done is only due to such supervision and assistance, and we
would like to thank all of them. We respect and thank Prof. Nisarg Bhatt for providing us an
opportunity to carry out the project work and giving us all the support and guidance that
enabled us to complete the project duly.
We would like to express our sincere gratitude to our Head of Department, Dr. Purvang Dalal,
for providing us with an inventive atmosphere. Last but not the least, we are highly
thankful to the Faculty of Technology, DDU for providing us such a platform to give life to
the skills we learned in the classroom.
INDEX
Certificate
Acknowledgement
List of Figures
Abbreviations
Abstract
Chapter 1 - Introduction
1.1 Background
1.2 Objective
Chapter 2
2.1.3 Working
2.1.4 Troubleshooting
Chapter 3 - Accomplishments
3.1 Results
3.1.2 Real Time Implementation - RPi and Spyder
3.2 Observations
Chapter 4 - Conclusion and Scope of the Project
4.1 Limitations
4.2 Applications
4.3 Conclusion
References
LIST OF FIGURES
Fig 1.1 Flow diagram of crowd detection using image processing technique
Fig 1.2 Small sized object
Fig 1.3 Big sized object. What size do we choose for our sliding window detector?
Fig 1.4 Sliding Window Pyramid
Fig 1.8 Accuracy and Speed trade-off
Fig 2.2 Block diagram
ABBREVIATIONS
RPi = Raspberry Pi
PC = Personal Computer
ABSTRACT
With the expanding population and the several problems arising from crowded situations, the
necessity of crowd detection is also on the rise. It involves estimating the number of
individuals in a crowd as well as the distribution of crowd density across different regions
of the scene. Human monitoring can be quite tiresome and expensive. This is where automated
crowd surveillance comes into the picture. Such crowd density can be estimated from an image
or video of the crowded scene. Our project proposes a real-time approach to solve such
problems related to dense crowds. It uses live video capture with a Raspberry Pi camera and a
Raspberry Pi 4 Model B, and attempts to estimate the crowd density of an area by applying
image processing concepts, the COCO classes and YOLO models.
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
1.2 OBJECTIVE
Our proposed project uses the Raspberry Pi as the processing board and the Raspberry Pi
camera as the device to capture live video of a place where crowd detection is to
be carried out. The primary objective is to carry out crowd detection without human
interference in order to increase accuracy and precision.
The Raspberry Pi crowd detection system can extend its facets in the future by
including individual emotional analysis of detected people, detection of the velocity of
moving objects (helpful at traffic signals) and object classification into human beings,
two-wheelers, three-wheelers, etc.
Several approaches to crowd counting and analysis have been proposed in the literature:
a. A crowd count method for high-density images by Ankan et al. [1], which is inefficient
for images containing mutual occlusion.
b. A deep learning approach by Shao et al. [2] for understanding crowded scenes
from video sequences.
c. A crowd counting technique proposed by Fu et al. [3], which works only for
low-density images and not for high-density ones.
d. A promising crowd counting method proposed by Huiyuan Fu [4], where
probable head regions are found using a depth camera. The system is
infeasible due to the cost overhead of the depth camera, and it does
not work for large regions either.
e. Methods that estimate the count of moving objects [5, 6]. These methods use the
pattern of moving objects obtained from video streams, require a good frame
rate, which is difficult to achieve, and do not work on still images.
f. Zhang et al. [7] introduce a method utilizing a deep network trained using
perspective maps of images.
Fig 1.1 Flow diagram of crowd detection using image processing technique
However, image processing using the above techniques is most appropriate for images
with high levels of contrast, which is not always the case in real-time scenarios, where
images may be distorted depending on the camera quality, or the monitored places
may be overcrowded.
Therefore, our main focus is on state-of-the-art methods, all of which use neural
networks and deep learning.
A few of the important concepts in object detection are the sliding window, the image
pyramid and the aspect ratio.
Each window is fed to a classifier which predicts the class of the object in the window (or
background if none is present). Hence, we know both the class and the location of the objects
in the image. But how do we choose the size of the window so that it always contains the
object? Let us look at examples:
Fig 1.2 Small sized object
Fig 1.3 Big sized object. What size do we choose for our sliding window detector?
As we can see, the object can be of varying sizes. To solve this problem, an image
pyramid is created by scaling the image. The idea is that we resize the image at multiple
scales and count on the fact that our chosen window size will completely contain the object
in one of these resized images. Most commonly, the image is downsampled (reduced in size)
until a certain condition, typically a minimum size, is reached. On each of these images, a
fixed-size window detector is run. It is common to have as many as 64 levels in such
pyramids. Now, all these windows are fed to a classifier to detect the object of interest.
This solves the problem of size and location.
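To make the idea concrete, below is a minimal sketch of an image pyramid with a sliding
window, assuming OpenCV is installed; the scale factor, minimum size, window size and file
name are illustrative choices, not values from our final system.
import cv2
def pyramid(image, scale=1.5, min_size=(64, 64)):
    # yield the image at progressively smaller scales
    yield image
    while True:
        w = int(image.shape[1] / scale)
        h = int(image.shape[0] / scale)
        if w < min_size[0] or h < min_size[1]:
            break
        image = cv2.resize(image, (w, h))
        yield image
def sliding_window(image, step=32, win=(64, 64)):
    # yield (x, y, patch) tuples from a fixed-size window slid over the image
    for y in range(0, image.shape[0] - win[1] + 1, step):
        for x in range(0, image.shape[1] - win[0] + 1, step):
            yield (x, y, image[y:y + win[1], x:x + win[0]])
# every patch from every pyramid level would be fed to a classifier
image = cv2.imread("crowd.jpg")
for layer in pyramid(image):
    for (x, y, patch) in sliding_window(layer):
        pass  # classify(patch) here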
Fig 1.4 Sliding Window Pyramid
The other problem is the aspect ratio. Objects can be present in various shapes: a
sitting person will have a different aspect ratio than a standing person or a sleeping person.
HOG features are good for many real-world problems. On each window obtained from
running the sliding window on the pyramid, we calculate HOG features, which are fed to an
SVM (Support Vector Machine) to create classifiers.
CNN-based classifiers are far more accurate than HOG-based ones, but they are slow and
computationally very expensive: it is impractical to run a CNN on the huge number of patches
generated by a sliding window detector. R-CNN solves this problem by using an object
proposal algorithm called Selective Search, which reduces the number of bounding boxes
that are fed to the classifier.
Still, R-CNN is very slow. With SPP-net, we calculate the CNN representation for the entire
image only once and use it to derive the CNN representation of each patch
generated by Selective Search, by performing a pooling-type operation on just the section
of the feature maps of the last convolutional layer that corresponds to the region.
SPP-net also uses spatial pyramid pooling after the last convolutional layer, as opposed to
the traditionally used max-pooling, because we need to generate a fixed-size input for the
fully connected layers of the CNN.
Fast R-CNN
With SPP-net, it is not trivial to perform back-propagation through the spatial pooling
layer; hence, the network only fine-tunes its fully connected part. Fast R-CNN
uses the ideas from SPP-net and R-CNN and fixes this key problem in SPP-net, i.e. it makes
it possible to train the network end-to-end. Along with this, Fast R-CNN adds bounding box
regression to the neural network training itself, hence reducing the overall training time
and increasing the accuracy in comparison to SPP-net because of the end-to-end learning of
the CNN.
Faster R-CNN
Even though Fast R-CNN is fast and accurate, its slowest part is Selective Search or Edge
Boxes. Faster R-CNN replaces Selective Search with a very small convolutional network
called the Region Proposal Network (RPN) to generate regions of interest.
It introduces the idea of anchor boxes. At each location, the original paper uses three
scales of anchor boxes, 128×128, 256×256 and 512×512, and three aspect ratios, 1:1, 2:1 and
1:2. So, in total, at each location we have 9 boxes, for which the RPN predicts the
probability of being background or foreground. Hence, Faster R-CNN is
10 times faster than Fast R-CNN with similar accuracy.
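As a quick illustration of these anchor shapes, the following sketch enumerates the 9 boxes
(3 scales × 3 aspect ratios) used at each location; it only prints the box dimensions and is
not taken from any Faster R-CNN implementation.
# enumerate the 9 anchor shapes: 3 scales x 3 aspect ratios (height:width)
scales = [128, 256, 512]
ratios = [(1, 1), (2, 1), (1, 2)]
for s in scales:
    for (rh, rw) in ratios:
        area = s * s                      # keep the area fixed per scale
        h = int((area * rh / rw) ** 0.5)
        w = int((area * rw / rh) ** 0.5)
        print(f"anchor {w}x{h} (scale {s}, ratio {rh}:{rw})")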
For YOLO, detection is a simple regression problem which takes an input image and learns
the class probabilities and bounding box coordinates.
It divides each image into an S x S grid, and each grid cell predicts N bounding boxes and
confidence scores. The confidence reflects the accuracy of the bounding box and whether the
bounding box actually contains an object (regardless of class). YOLO also predicts the
classification score of each box for every class seen in training. Combining the two gives
the probability of each class being present in a predicted box.
So, in total, S x S x N boxes are predicted. However, most of these boxes have low
confidence scores, and if we set a threshold, say 30% confidence, we can remove most of
them, as shown in the sketch below.
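A small numeric sketch of this filtering step, assuming the S = 7, N = 2 configuration of
the original YOLO paper and random stand-in confidences for illustration:
import numpy as np
S, N = 7, 2                              # grid size and boxes per cell (YOLOv1)
confidences = np.random.rand(S * S * N)  # stand-in for predicted confidences
THRESHOLD = 0.3
kept = confidences[confidences > THRESHOLD]
print(f"{S * S * N} boxes predicted, {kept.size} kept above {THRESHOLD:.0%}")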
At runtime, we run our image through the CNN only once. Hence, YOLO is super fast and can
run in real time. Another key difference is that YOLO sees the complete image at once, as
opposed to looking at the generated region proposals as in the previous methods. This
contextual information helps in avoiding false positives. However, one limitation of YOLO
is that it only predicts one class per grid cell; hence, it struggles with very small objects.
The Single Shot Detector (SSD) achieves a good balance between speed and accuracy. SSD runs
a convolutional network on the input image only once and calculates a feature map. In order
to handle scale, it predicts bounding boxes after multiple convolutional layers. Since each
convolutional layer operates at a different scale, it is able to detect objects of various
scales.
Coming back to the main question: “What method should we adopt in order to monitor the
crowd?” Below is a comparison of all the above-mentioned methods, as given by [8].
Fig 1.8 Accuracy and Speed trade-off
Keeping in mind that crowd images may be mutually occluded or distorted (depending on
camera resolution) and that crowds are continuously moving, speed seems to be somewhat more
important than accuracy in this respect. Hence, our choice of methodology for
implementation is YOLO, based on the DarkNet framework (DarkNet is an open-source neural
network framework).
However, it is a challenging task to model and implement a YOLO CNN from scratch,
especially for beginners, as it requires the development of many customized model elements
for training and for prediction. For example, even using a pre-trained model directly
requires sophisticated code to distill and interpret the predicted bounding boxes output by
the model. Instead of developing this code from scratch, we use a third-party YOLOv3 model
trained on the COCO dataset, which consists of 80 labels, including, but not limited to:
people; bicycles; cars and trucks; airplanes; stop signs and fire hydrants; animals,
including cats, dogs, birds, horses, cows and sheep; and kitchen and dining objects such as
wine glasses, cups, forks, knives, spoons, etc.
Since we only need to detect people and exclude other classes, we will specify in our code:
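The relevant fragment, assembled from the full listing in Chapter 2, looks up the index of
the "person" label in the COCO names file and passes it to the detection routine so that
all other classes are ignored:
# keep only the "person" class out of the 80 COCO labels
LABELS = open(labelsPath).read().strip().split("\n")
personIdx = LABELS.index("person")
results = detect_people(frame, net, ln, personIdx=personIdx)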
The report is organized as follows. Chapter 1 gives an overview of the crowd detection
system; its need in today's world is briefly discussed.
Chapter 2 describes the internal architecture of the software part, and its working is
discussed thoroughly with the help of a block diagram.
Chapter 3 shows the outputs obtained and what can be observed from them.
Chapter 4, at the end, gives a proper conclusion and the future scope and aspects of
the proposed project.
CHAPTER 2
The implementation of the code was carried out on Google Colab initially and then on
Raspbian, which was installed on the RPi and accessed over SSH using VNC and PuTTY. The
code is divided into three parts.
a. A minimum confidence value is set, below which weak detections are filtered out.
b. A minimum NMS threshold value is set, and a box is drawn around each detected object.
c. Construct a blob from the input frame and then pass it through the YOLO object detector.
d. Set the box coordinates, centroids and confidences (probability of object detection).
e. Extract the class ID and confidence.
f. Filter detections by ensuring that the detected object is a human and that the minimum
confidence is met.
g. Apply non-maxima suppression to prevent overlapping of nearby boxes and avoid
confusion.
h. Update the results list if at least one person is detected (steps f-h are sketched in
the consolidated snippet below).
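A minimal consolidated sketch of steps f-h is given below, built around OpenCV's
cv2.dnn.NMSBoxes; the variable names follow the code fragments later in this chapter, and
the lists are assumed to have been filled while parsing the YOLO layer outputs.
import cv2
# step f: keep only confident "person" detections
keep = [i for i, c in enumerate(classIDs)
        if c == personIdx and confidences[i] > MIN_CONF]
boxes = [boxes[i] for i in keep]
confidences = [confidences[i] for i in keep]
centroids = [centroids[i] for i in keep]
# step g: non-maxima suppression drops overlapping boxes
idxs = cv2.dnn.NMSBoxes(boxes, confidences, MIN_CONF, NMS_THRESH)
# step h: update the results list if at least one person was detected
results = []
if len(idxs) > 0:
    for i in idxs.flatten():
        (x, y, w, h) = boxes[i]
        results.append((confidences[i], (x, y, x + w, y + h), centroids[i]))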
# detection parameters
MIN_CONF = 0.3      # minimum confidence to accept a detection
NMS_THRESH = 0.3    # non-maxima suppression threshold
import numpy as np
import cv2
# lists filled while parsing the YOLO layer outputs
boxes = []
centroids = []
confidences = []
# inside the loop over detections: class scores start at index 5
scores = detection[5:]
classID = np.argmax(scores)
confidence = scores[classID]
import argparse
import imutils
import cv2
import os
from scipy.spatial import distance as dist      # needed for dist.cdist below
from google.colab.patches import cv2_imshow     # Colab's replacement for cv2.imshow
# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join(["/content/drive/My Drive/social-distance-detector/yolo-coco/coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")
# determine only the output layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]  # OpenCV < 4.5.4 indexing
# if the frame was not grabbed, then we have reached the end
# of the stream
if not grabbed:
    break
# resize the frame and then detect people (and only people) in it
frame = imutils.resize(frame, width=700)
results = detect_people(frame, net, ln, personIdx=LABELS.index("person"))
if len(results) >= 2:
    # extract all centroids from the results and compute the
    # Euclidean distances between all pairs of the centroids
    centroids = np.array([r[2] for r in results])
    D = dist.cdist(centroids, centroids, metric="euclidean")
# draw (1) a bounding box around the person and (2) the
# centroid coordinates of the person
cv2.rectangle(frame, (startX, startY), (endX, endY), color, 2)
cv2.circle(frame, (cX, cY), 5, color, 1)
if args["display"] > 0:
    # show the output frame
    cv2_imshow(frame)
# if an output video file path has been supplied and the video
# writer has not been initialized
if args["output"] != "" and writer is None:
    # initialize our video writer
    fourcc = cv2.VideoWriter_fourcc(*"MJPG")
    writer = cv2.VideoWriter(args["output"], fourcc, 25,
        (frame.shape[1], frame.shape[0]), True)
# if the video writer is not None, write the frame to the output
# video file
if writer is not None:
    writer.write(frame)
MIN_CONF = 0.3
NMS_THRESH = 0.3
# inside detect_people(frame, net, ln, personIdx): grab the frame dimensions
(H, W) = frame.shape[:2]
results = []
boxes = []
centroids = []
confidences = []
# for each detection, the class scores start at index 5
scores = detection[5:]
classID = np.argmax(scores)
confidence = scores[classID]
# each kept detection r = (confidence, bounding box, centroid)
results.append(r)
import time
import picamera
# capture a single still image from the Pi camera
with picamera.PiCamera() as camera:
    camera.start_preview()
    try:
        for i, filename in enumerate(
                camera.capture_continuous('image{counter:02d}.jpg')):
            print(filename)
            time.sleep(1)
            if i == 0:
                break  # stop after the first captured image
    finally:
        camera.stop_preview()
# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join(["yolo-coco/coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")
# determine only the output layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]  # OpenCV < 4.5.4 indexing
# resize the frame and then detect people (and only people) in it
frame = imutils.resize(frame, width=700)
results = detect_people(frame, net, ln, personIdx=LABELS.index("person"))
if len(results) >= 2:
    # extract all centroids from the results and compute the
    # Euclidean distances between all pairs of the centroids
    centroids = np.array([r[2] for r in results])
    D = dist.cdist(centroids, centroids, metric="euclidean")
# draw (1) a bounding box around the person and (2) the
# centroid coordinates of the person
cv2.rectangle(frame, (startX, startY), (endX, endY), color, 2)
cv2.circle(frame, (cX, cY), 5, color, 1)
if args["display"] > 0:
    # show the output frame
    cv2.imshow("Frame", frame)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
# if an output image file path has been supplied and the writer has not been initialized
if args["output"] != "" and writer is None:
    outpath = "/home/pi/sem7/tp/cds/cds1/CDS_orig/my_output.jpg"
    cv2.imwrite(outpath, frame)
2.1.3 WORKING
The Raspberry Pi camera is used along with the Raspberry Pi to capture live video. The live
video is then processed frame by frame using image processing concepts.
Fig 2.2 Block diagram
The block diagram shown in Fig 2.2 briefs how the whole project works. It starts with
a Raspberry Pi camera which records real-time video that is passed on to the Raspberry
Pi board, which has an external power supply connected to it. Normally, a PC is not capable
of providing enough current to the RPi, so we connect an external power supply with an
output current greater than 3 A.
The Raspberry Pi is connected to the PC over WiFi through the SSH client PuTTY, and
carries out the necessary processing and computations to detect humans. Raspbian
is operated through VNC Viewer. When the code is run in Spyder or similar software, an
image is captured by the Pi camera, saved, and displayed on the screen in VNC. The next
step is to process this image with the help of the COCO labels and the YOLO framework. This
processing is done frame by frame. Upon completing the processing, the frame is annotated
with the bounding boxes and centroids and finally displayed on screen as the output. The
output is saved at the specified path.
2.1.4 TROUBLESHOOTING
1. Segmentation fault errors because of the bounding box size. We faced this error
initially when we were running the code in Spyder on Windows. We then shifted to Google
Colab, because it offers GPU simulation, and later to Spyder on Raspbian, where we did not
face the same issue.
2. swapRB type-conflict error. This error occurred because the 'mean' argument of the
blobFromImage function takes its value in RGB order, while OpenCV uses BGR order.
Setting swapRB=True made both formats consistent.
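For reference, a blob-construction call with the fix applied might look like the following;
the 1/255 scale factor and 416×416 input size are the standard YOLOv3 values, shown here as
an illustrative sketch:
# build the network input blob; swapRB=True converts OpenCV's BGR to RGB
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
    swapRB=True, crop=False)
net.setInput(blob)
layerOutputs = net.forward(ln)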
3. Input file error. This error occurred because of an incorrect drive file path where the
YOLO weights and config files were located.
4. Google Colab cannot be interfaced with microcontrollers. Although it now seems obvious
that Colab is a virtual/cloud platform meant to enable code simulation and catch possible
errors before actual implementation on hardware, we did not know this initially, so we had
to carry out the entire process of Raspbian installation and configuration.
5. The required cable for the RPi 4 is HDMI to micro-HDMI, and it must be connected to the
HDMI0 port rather than the more commonly used HDMI1.
6. Raspbian installation. During the Raspbian installation we faced various minor errors.
For example, we already had an 8 GB SD card, so when we tried writing the Raspbian OS
image, it could not be written because the disc image was larger than the card's capacity.
We also could not find software to compress a disc image. After numerous such problems, we
configured the RPi using PuTTY and VNC [16].
7. 'Cannot display desktop' error. We changed the resolution and reinstalled the session
manager LXSession [17].
8. Real-time crowd detection on a live video stream. Live video capture is not possible on
Colab, and since the main reason for using Colab was its GPU capability, running the
final code on Raspbian meant running on a CPU, which is extremely slow when it
comes to intensive processing and computations like this. Whenever we tried to capture the
live video stream, the RPi would get extremely hot before the output frames could start to
display. For this reason, real-time testing was done on a single image/frame instead of
video. Although the live-stream processing could easily be done using a CUDA GPU,
according to this article [19], it appears that none of the GPUs available in the market
can be initialized from the RPi 4 Model B.
9. OpenCV version. At one point we faced the following error on the RPi:
ImportError: libjasper.so.1: cannot open shared object file: No such file or directory
From the error message alone, it was not clear that the problem lay with the OpenCV
version. We tried running a very basic part of the code in Spyder on Windows, and when it
successfully gave the output, we updated the OpenCV version on Raspbian to the same one
that we were using in Spyder on Windows. We were also able to resolve the ImportError
itself with the help of this thread [18].
CHAPTER 3
ACCOMPLISHMENTS
3.1 RESULTS
3.2 OBSERVATIONS
Object detection metrics are often useful in making observations about detection quality.
A few related terms are:
Precision
Precision is the ability of a model to identify only the relevant objects. It is the
percentage of correct positive predictions and is given by:
Precision = TP / (TP + FP)
Recall
Recall is the ability of a model to find all the relevant cases (all ground-truth bounding
boxes). It is the percentage of true positives detected among all relevant ground truths
and is given by:
Recall = TP / (TP + FN)
Threshold
The IoU threshold decides whether a detection counts as a true positive; depending on the
metric, it is usually set to 50%, 75% or 95%.
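A minimal sketch of these formulas in code, assuming each detection has already been marked
TP or FP against the ground truth (the labels and counts below are illustrative, not our
measured data):
# compute precision and recall from per-detection TP/FP labels (sketch)
detections = ["TP", "TP", "FP", "TP", "FP"]  # illustrative labels
ground_truths = 4                            # total ground-truth boxes
tp = detections.count("TP")
fp = detections.count("FP")
precision = tp / (tp + fp)          # TP / (TP + FP)
recall = tp / ground_truths         # TP / (TP + FN), FN = ground_truths - TP
print(f"precision = {precision:.2f}, recall = {recall:.2f}")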
With the help of the procedure mentioned in [20], we were able to generate confidence,
precision and recall values for our code on 5 sample images: some captured with our own
camera and some picked at random from the internet.
Image No. | Ground Truth | Confidences | TP or FP | Precision | Recall
Therefore, the average accuracy over a sample of just 5 images processed using YOLO is
approximately 88%.
CHAPTER 4
CONCLUSION AND SCOPE OF THE PROJECT
4.1 LIMITATIONS
2. Object Classification
Any object appearing would be detected without classification of it being a
human being, car, bike, etc.
4. Network Interruption
If the WiFi is interrupted, PuTTY, the SSH client for Windows, becomes inactive
and has to be restarted.
5. Accuracy and Confidence
The confidence is good, but the accuracy of the system is comparatively slightly lower.
This also depends on the camera resolution.
4.2 APPLICATIONS
1. Maintaining public order in crowded places such as airports, carnivals, sports
events and railway stations is essential, and counting people is an essential
factor in a crowd management system. Particularly in smaller areas, an increase in the
number of people creates problems such as fatalities, physical injuries, etc. Detecting
such unnecessary social gatherings/public events and alerting the required authorities
can be done easily.
2. Obtaining real-world data for revenue opportunity analysis, which would help places
like a cafeteria: if the number of people present at the food counter is known, it helps
in making better decisions regarding service offerings and advertisement, and in
streamlining staffing levels.
3. The numbers of fighter jets, soldiers and moving drones, their motion, etc. can be
estimated through proper crowd management systems. Thus the strength of armed
forces can be estimated through this system.
4. Crowd monitoring systems are used to minimize terror attacks at public gatherings.
Traditional machine learning methods do not perform well in these situations, so methods
suited to proper monitoring of such detection activities can be explored.
4.3 CONCLUSION
Crowd image analysis is an essential task for several applications. Crowd analysis
provides sufficient information for several tasks, including counting, localization,
behaviour analysis, etc. Our proposed project is easy to use, affordable and efficient, and
resolves the tiresome job of human-performed detection.
From the perspective of the speed and accuracy required of crowd detection systems, we
conclude that YOLO can be considered a potential solution. However, with the right kind
of computing tools, like a GPU, and better microcontrollers and/or microprocessors, this
implementation could be even more efficient and smooth.
REFERENCES
[1] Antic B, Letic D, Culibrk D, Crnojevic V (2009) K- means based segmentation for real
time zenithal people counting. In: Proceedings of IEEE international conference on image
processing, pp 2565–2568
[2] Shao J (2017) Crowded scene understanding by deeply learned volumetric slices. IEEE
Trans Circuits Syst Video Technol 27(3):613–623
[3] Bansal A, Venkatesh KS (2009) People counting in high density crowds from still
images. In: Proceedings of. IEEE international conference on computer vision and pattern
recognition, pp 1093–1100
[4] Fu H, Ma H, Xiao H (2014a) Crowd counting via head detection and motion flow
estimation. In: Proceedings of 22nd ACM international conference on Multimedia, Florida,
pp 877–880
[6] Chauhan RV, Kumar S, Singh SK (2016) Human count estimation in high density crowd
images and videos. In: Proceedings of fourth international conference on parallel, distributed
and grid computing (PDGC), Waknaghat, pp 343–347
[7] Brostow GJ, Cipolla R (2006) Unsupervised bayesian detection of independent motion in
crowds. In: Proceedings IEEE computer society conference on computer vision and pattern
recognition (CVPR’06), pp 594–601
[8] Zhang C, Li H, Wang X, Yang X (2015) Cross-scene crowd counting via deep
convolutional neural networks. In: Proceedings of IEEE conference on computer vision and
pattern recognition (CVPR), Boston, MA, pp 833–841
[9] Zero to Hero: Guide to Object Detection using Deep Learning: Faster R-CNN, YOLO, SSD
- CV-Tricks.com, 2021
[10] Advances and Trends in Real Time Visual Crowd Analysis, MDPI
[17] Fix Raspberry Pi's 'Cannot Currently Show the Desktop' Error
[18] ImportError: libjasper.so.1: cannot open shared object file: No such file or directory on
RPi
[19] External GPUs and the Raspberry Pi Compute Module 4