Object Detection Using Machine Learning
BACHELOR OF TECHNOLOGY
IN
APPLIED ELECTRONICS & INSTRUMENTATION ENGINEERING
SUBMITTED BY
Deeplekha Gupta
Ankita Das
Aditi Choubey
DEPARTMENT OF AEIE
Acknowledgement
It is our great privilege to express our profound and sincere gratitude to our Project Supervisor,
Mrs. Lipika Mandal, for providing very cooperative and precious guidance at every stage of
the present project work carried out under her supervision. Her valuable advice and
instructions have made the present study a very rewarding and pleasurable experience that has
greatly benefited us throughout the course of the work.
We would like to convey our sincere gratitude to Mr. Intekhab Hussain, Head of the
Department of AEIE, Asansol Engineering College, for providing us the requisite support for timely
completion of our work. We would also like to pay our heartiest thanks and gratitude to all the
teachers of the Department of AEIE, Asansol Engineering College, for the various suggestions
they provided toward the success of our work.
We would like to express our earnest thanks to our colleagues and all the technical staff
of the Department of AEIE, Asansol Engineering College, for their valuable assistance
during our project work.
Finally, we would like to express our deep sense of gratitude to our parents for their constant
motivation and support throughout our work.
……………………………………………..
(Deeplekha Gupta)
……………………………………………..
(Ankita Das)
…………………………………………….
(Aditi Choubey)
DEPARTMENT OF APPLIED ELECTRONICS &
INSTRUMENTATION ENGINEERING
ASANSOL ENGINEERING COLLEGE
Certificate of Approval
This is to certify that the work presented in the project entitled “OBJECT DETECTION USING
MACHINE LEARNING”, submitted by Deeplekha Gupta, Ankita Das, Aditi Choubey in partial
fulfillment of the requirement for the award of the degree of B. Tech in Applied Electronics &
Instrumentation Engineering of Asansol Engineering College, Asansol, is an authentic work
carried out under my supervision and guidance.
To the best of my knowledge, the content of this project does not form a basis of the award of
any previous degree to anyone else.
…………………………….
Intekhab Hussain
(Head of the Department)
Dept. of AEIE
Asansol Engineering College

………………………………….
Lipika Mandal
(Assistant Professor)
Project Guide
Dept. of AEIE
Asansol Engineering College
…………………………………….
Bhaskar Roy
(Assistant Professor)
Final Year Project Coordinator
Dept. of AEIE
Asansol Engineering College
Abstract
Efficient and accurate object detection has been an important topic in the advancement of computer
vision systems. With the advent of deep learning techniques, the accuracy for object detection has
increased drastically. The aim of object detection is to detect all instances of objects from a known
class, such as people, cars or faces in an image. Generally, only a small number of instances of the
object are present in the image, but there are a very large number of possible locations and scales at
which they can occur, and these somehow need to be explored. In this project, we use a completely
deep learning-based approach to solve the problem of object detection in an end-to-end fashion. The
network is trained on the most challenging publicly available dataset, on which an object detection
challenge is conducted annually. Object recognition describes a collection of related computer
vision tasks that involve identifying objects in digital photographs. Image classification
involves predicting the class of one object in an image. Object localization refers to
identifying the location of one or more objects in an image and drawing a bounding box
around their extent. Object detection combines these two tasks, localizing and
classifying one or more objects in an image. Various applications of object detection
have been well researched, including face detection, character recognition and vehicle counting.
Object detection can be used for various purposes including retrieval and surveillance. The resulting
system is fast and accurate, thus aiding those applications which require object detection.
Contents
Acknowledgement
Certificate of Approval
Abstract
Chapter 1: Introduction
Chapter 2: Literature Survey
Chapter 3: Project Methodology
Chapter 4: Algorithms
4.1 SSD
4.2 MOBILENET
4.3 COCO DATASET
4.4 YOLO
4.5 VGG
4.6 R-CNN
Chapter 5: Image Processing
5.1 DESCRIPTION
5.2 DIGITAL IMAGE PROCESSING
5.3 GRAY SCALE IMAGE
5.4 COLOR IMAGE
5.5 RELATED TECHNOLOGY
Chapter 6: Software Requirement
PART 6.1: OBJECT DETECTION USING MACHINE LEARNING
6.1.1 JUPYTER NOTEBOOK
6.1.2 MODULES USED
6.1.3 PROGRAMMING LANGUAGE USED
PART 6.2: FACIAL EMOTION RECOGNITION USING MACHINE LEARNING
6.2.1 HARDWARE INTERFACES
6.2.2 PLANNING
6.2.3 THE LIBRARY & PACKAGES
6.2.4 HAAR CASCADE CLASSIFIER IN OPEN-CV
Chapter 7: Results
PART 7.1: OBJECT DETECTION USING MACHINE LEARNING
7.1.1 RESULTS
PART 7.2: FACIAL EMOTION RECOGNITION USING MACHINE LEARNING
Chapter 8: Conclusion
8.1 CONCLUSION
8.2 FUTURE SCOPE
Chapter 9: References
CHAPTER 1 - INTRODUCTION
PART 1: OBJECT DETECTION USING MACHINE LEARNING
1.1 INTRODUCTION:
Efficient and accurate object detection has been an important topic in the advancement of computer
vision systems. With the advent of deep learning techniques, the accuracy for object detection has
increased drastically. The project aims to incorporate state-of-the-art technique for object detection
with the goal of achieving high accuracy with a real-time performance. Object detection is a computer
vision technique that allows us to identify and locate objects in an image or video. With this kind of
identification and localization, object detection can be used to count objects in a scene and to determine
and track their precise locations, all while accurately labelling them. Object detection is breaking into
a wide range of industries, with use cases ranging from personal security to productivity in the
workplace. Object detection and recognition are applied in many areas of computer vision, including
image retrieval, security, surveillance, automated vehicle systems and machine inspection. Significant
challenges remain in the field of object recognition. The possibilities are endless when it comes to future
use cases for object detection.
The motive of object detection is to recognize and locate all known objects in a scene, preferably in 3D
space; recovering the pose of objects in 3D is very important for robotic control systems.
Imparting intelligence to machines and making robots more and more autonomous and independent
has been a sustained technological dream of mankind. It is our dream to let robots take on
tedious, boring or dangerous work so that we can commit our time to more creative tasks.
Unfortunately, the intelligent part still seems to be lagging behind. In real life, to achieve this goal,
besides hardware development, we need software that can give a robot the intelligence to do the
work and act independently. One of the crucial components in this regard is vision, apart from other
types of intelligence such as learning and cognitive thinking. A robot can hardly be intelligent if it
cannot see and adapt to a dynamic environment.
The searching or recognition process in real time scenario is very difficult. So far, no effective solution
has been found for this problem. Despite a lot of research in this area, the methods developed so far are
not efficient, require long training time, are not suitable for real time application, and are not scalable to
a large number of classes. Object detection is relatively simpler if the machine is looking for
one particular object. However, recognizing all objects inherently requires the ability to differentiate
one object from another, though they may be of the same type. Such a problem is very difficult for
machines if they do not know about the various possibilities of objects.
1.3 MOTIVATION:
Object detection is breaking into a wide range of industries, with use cases ranging from personal
security to productivity in the workplace. Object detection and recognition are applied in many areas of
computer vision, including image retrieval, security, surveillance, automated vehicle systems and
machine inspection. Significant challenges remain in the field of object recognition, and the possibilities are
endless when it comes to future use cases for object detection. Object detection is probably the most
profound aspect of computer vision due to the number of practical use cases. Object detection refers to the
capability of software systems to locate objects in an image or scene and identify each object. It has been
widely used for face detection, vehicle detection, pedestrian counting, web images, security systems
and driverless cars. There are many ways object detection can be used in many fields of
practice. Like every other computer technology, a wide range of creative and amazing uses of object
detection will surely come from the efforts of computer programmers and software developers.
Using modern object detection methods in applications and systems, as well as building new
applications based on these methods, is not a straightforward task. Early implementations of object
detection involved the use of classical algorithms, like the ones supported in OpenCV, the popular
computer vision library. However, these classical algorithms could not achieve sufficient performance to
work under different conditions.
Object detection from a complex background is a challenging application in image processing. The
goal of this project is to identify objects placed over a surface from a complex background image using
various techniques.
Many problems in computer vision were saturating in accuracy a decade ago. However, with
the rise of deep learning techniques, the accuracy of these problems improved drastically. One of the
major problems was image classification, which is defined as predicting the class of the image.
A slightly more complicated problem is image localization, where the image contains a single object
and the system should predict the class and the location of the object in the image (a bounding box
around the object). The more complicated problem (this project), object detection, involves both
classification and localization. In this case, the input to the system will be an image, and the output will
be bounding boxes corresponding to all the objects in the image, along with the class of the object in each
box. An overview of all these problems is depicted in Fig. 1.
The major challenge in this problem is the variable dimension of the output, which is caused by
the variable number of objects that can be present in any given input image. Any general machine
learning task requires a fixed dimension of input and output for the model to be trained. Another
important obstacle to the widespread adoption of object detection systems is the requirement of real-time
performance (>30 fps) while remaining accurate in detection. The more complex the model is, the more time
it requires for inference; and the less complex the model is, the lower its accuracy. This trade-off between
accuracy and performance needs to be chosen as per the application. The problem involves
classification as well as regression, requiring the model to learn both simultaneously. This adds to the
complexity of the problem.
PART 2: FACIAL EMOTION RECOGNITION USING MACHINE LEARNING
2.1 INTRODUCTION:
Human emotion detection is implemented in many areas requiring additional security or information
about the person. It can be seen as a second step to face detection, where we may be required to set up a
second layer of security in which, along with the face, the emotion is also detected. This can be useful
to verify that the person standing in front of the camera is not just a 2-dimensional representation
[1]. Another important domain where emotion detection matters is business promotion. Most
businesses thrive on customer responses to their products and offers. If an artificially
intelligent system can capture and identify real-time emotions from a user image or video,
it can decide whether the customer liked or disliked the product or offer. We have seen
that security is the main reason for identifying any person. Identification can be based on fingerprint
matching, voice recognition, passwords, retina detection, etc. Identifying the intent of a person can also be
important to avert threats. This can be helpful in vulnerable areas like airports, concerts and major
public gatherings, which have seen many breaches in recent years.
2.3 MOTIVATION:
In today’s networked world the need to maintain security of information or physical property is
becoming both increasingly important and increasingly difficult. In countries like Nepal the rate of
crime is increasing day by day, and there are no automatic systems that can track a person's activity. If
we are able to track people's facial expressions automatically, then we can find criminals more easily,
since facial expressions change while doing different activities. So, we decided to build a Facial Expression
Recognition System. We became interested in this project after going through a few papers in this area,
which describe how their systems were created to achieve accurate and reliable facial expression
recognition. As a result, we are highly motivated to develop a system that recognizes facial expressions
and tracks a person's activity.
Human facial expressions can be easily classified into 7 basic emotions: happy, sad, surprise, fear,
anger, disgust, and neutral. Our facial emotions are expressed through activation of specific sets of
facial muscles. These sometimes subtle, yet complex, signals in an expression often contain an
abundant amount of information about our state of mind. Through facial emotion recognition, we are
able to measure the effects that content and services have on the audience or users through an easy and
low-cost procedure. For example, retailers may use these metrics to evaluate customer interest. Health
care providers can provide better service by using additional information about a patient's emotional
state during treatment. Entertainment producers can monitor audience engagement in events to
consistently create desired content.
Humans are well trained in reading the emotions of others; in fact, at just 14 months old, babies can
already tell the difference between happy and sad. But can computers do a better job than us at
assessing emotional states? To answer this question, we designed a deep learning neural network.
CHAPTER 2 – LITERATURE SURVEY
In the year 2017, Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick proposed Mask R-CNN.
Mask R-CNN is not a typical object detection network; it was designed to solve the challenging
instance segmentation task, i.e., creating a mask for each object in the scene. Nonetheless,
Mask R-CNN proved an excellent extension to the Faster R-CNN framework, and in turn
motivated further object detection research. The fundamental idea is to add a binary mask prediction
branch after ROI pooling alongside the existing bounding box and classification branches. Moreover,
both the multi-task training (segmentation + detection) and the new ROI Align layer contribute some
improvement over the bounding box benchmark.
In the year 2017, Navaneeth Bodla, Bharat Singh, Rama Chellappa and Larry S. Davis proposed Soft-NMS:
Improving Object Detection with One Line of Code. Non-maximum suppression (NMS) is
widely used in anchor-based object detection networks to remove duplicate positive proposals that
are close by. More specifically, NMS iteratively eliminates candidate boxes if
they have a high IoU with a more confident candidate box. This can lead to unexpected behaviour when two
objects of the same class are indeed close to each other. Soft-NMS makes a small change:
it merely scales down the confidence score of the overlapping candidate boxes with a parameter. This scaling
parameter gives us more control when tuning localization performance, and also leads to
better precision when high recall is also required.
In the year 2017, Zhaowei Cai and Nuno Vasconcelos of UC San Diego proposed Cascade R-CNN:
Delving into High Quality Object Detection. While FPN investigated how to design a better R-CNN
neck to use backbone features, Cascade R-CNN examined an upgrade of the R-CNN classification and
regression head. The underlying assumption is simple yet insightful: the higher the IoU criterion we
use when assigning positive targets, the fewer false positive predictions the network will learn
to make. However, we cannot simply increase this IoU threshold from the commonly used 0.5 to a
more aggressive 0.7, because that could also lead to overwhelmingly more negative
examples during training.
In the year 2017, Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar proposed Focal
Loss for Dense Object Detection. To understand why one-stage detectors are usually not comparable
to two-stage detectors, RetinaNet explored the foreground-background class imbalance issue arising from a
one-stage detector's dense predictions. Take YOLO for instance: it tries to predict classes and
bounding boxes for all potential locations at once, so the majority of the outputs are matched to the
negative class during training. SSD addressed this issue by online hard example mining. YOLO used
an objectness score to implicitly train a foreground classifier in the early stage of training.
RetinaNet argues that neither of them got to the key of the issue, so it invented a new loss
function called Focal Loss to help the network learn what is important.
In the year 2018, Shu Liu, Lu Qi, Haifang Qin, Jianping Shi and Jiaya Jia proposed Path Aggregation
Network for Instance Segmentation. Instance segmentation has a close relationship with
object detection, so a new instance segmentation network can often also benefit object
detection research indirectly.
In the year 2018, Chengji Liu, Yufan Tao, Jiawei Liang, Kai Li and Yihang Chen proposed Object Detection
Based on YOLO Network. YOLOv3 is the latest of the YOLO versions discussed in the paper. Following
YOLOv2's convention, YOLOv3 drew more ideas from previous research and became a powerful
one-stage detector. YOLOv3 balances speed, accuracy and implementation
complexity really well, and it became truly popular in industry as a result of its fast speed
and simple components. Essentially, YOLOv3's success comes from its more powerful backbone
feature extractor and a RetinaNet-like detection head with an FPN neck.
CHAPTER 3 -PROJECT METHODOLOGY
PART 3.1: OBJECT DETECTION USING MACHINE LEARNING
We used Python as the programming language. For labelling detected objects with text we use OpenCV,
and we use a pretrained deep learning architecture based on TensorFlow; in this regard we again use
OpenCV to load the already pre-trained TensorFlow frozen model.
First, we load the image into our model; the objects in it are then recognized by the model,
classification and localization are done instantly by the algorithm, and the objects are detected.
Segmentation refers to categorizing each pixel value of an image into a particular class. To build this
project we took the help of the SSD-MobileNetV2 algorithm, as sketched below.
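Below is a minimal sketch of this pipeline using OpenCV's DNN module, assuming a frozen SSD-MobileNetV2 graph and its config file are available; the file names frozen_inference_graph.pb, ssd_mobilenet_v2_coco.pbtxt and input.jpg are placeholders, not files shipped with this report:

import cv2
import numpy as np

# Load the pre-trained TensorFlow frozen model through OpenCV.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v2_coco.pbtxt")

image = cv2.imread("input.jpg")
h, w = image.shape[:2]

# SSD-MobileNetV2 expects a 300x300 input blob; swapRB converts BGR to RGB.
blob = cv2.dnn.blobFromImage(image, size=(300, 300), swapRB=True)
net.setInput(blob)
detections = net.forward()  # shape (1, 1, N, 7): id, class, score, box

for det in detections[0, 0]:
    score = float(det[2])
    if score > 0.5:
        class_id = int(det[1])
        x1, y1, x2, y2 = (det[3:7] * np.array([w, h, w, h])).astype(int)
        cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)),
                      (0, 255, 0), 2)
        cv2.putText(image, f"class {class_id}: {score:.2f}",
                    (int(x1), max(int(y1) - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

cv2.imwrite("output.jpg", image)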
System design shows the overall design of the system. In this section we discuss the design
aspects of the system in detail:
3.2.2 FLOWCHART:
I. IMAGE PRE-PROCESSING:
Image pre-processing includes the removal of noise and normalization against the variation of pixel
position or brightness.
o Color Normalization
o Histogram Normalization
Face detection is useful in detecting facial images. Face detection is carried out on the training dataset
using the Haar classifier called the Viola-Jones face detector, implemented through OpenCV. Haar-like
features encode the difference in average intensity between different parts of the image and consist of
black and white connected rectangles, in which the value of the feature is the difference of the sums of
pixel values in the black and white regions [6].
Selection of the feature vector is the most important part of a pattern classification problem. The image
of the face after pre-processing is then used for extracting the important features. The inherent problems
related to image classification include scale, pose, translation and variations in illumination level
[6].
Emotion classification can be divided into two classes: primary emotions, such as joy, sadness, anger,
fear, disgust and surprise; and secondary emotions, which evoke a mental image that correlates to
memory or a primary emotion.
CHAPTER 4 - ALGORITHMS
4.1 SSD:
Given an image, the single shot multi-box detector divides the image into small patches, then, based on
the most salient features, joins those patches and asks the classifier to classify the image.
SSD uses VGG16 to extract feature maps. It then detects objects using the Conv4_3 layer. For
illustration, we draw the Conv4_3 layer as 8 × 8 spatially (it is actually 38 × 38). For each cell in the
feature map (also called a location), it makes 4 object predictions. Each prediction is composed of a
boundary box and 21 scores, one per class (with one extra class for "no object"), and we pick the
highest score as the class for the bounded object. Conv4_3 makes a total of 38 × 38 × 4 predictions:
four predictions per cell regardless of the depth of the feature maps. As expected, many predictions
contain no object, and SSD reserves class "0" to indicate this. SSD does not use a delegated region
proposal network. Instead, it resorts to a very simple method: it computes both the location and the
class scores using small convolution filters. After extracting the feature maps, SSD applies 3 × 3
convolution filters for each cell to make predictions. (These filters compute the results just like
regular CNN filters.) Each filter outputs 25 channels: 21 class scores plus 4 boundary box coordinates.
Above we described SSD as detecting objects from a single layer. Actually, it uses multiple layers
(multi-scale feature maps) to detect objects independently. As the CNN reduces the spatial dimension
gradually, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to
detect larger-scale objects. For example, the 4 × 4 feature maps are used for larger-scale objects.
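A toy NumPy sketch of the numbers quoted above, assuming 38 × 38 cells, 4 default boxes per cell and 21 classes; random scores stand in for real network outputs:

import numpy as np

cells, boxes_per_cell, num_classes = 38, 4, 21
print(cells * cells * boxes_per_cell)  # 5776 predictions from Conv4_3

# Each prediction carries 21 class scores plus 4 box offsets (25 channels).
scores = np.random.rand(cells, cells, boxes_per_cell, num_classes)
best_class = scores.argmax(axis=-1)  # highest-scoring class per default box
print(best_class.shape)              # (38, 38, 4)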
4.2 MOBILENET:
MobileNet uses depthwise separable convolutions that help in building deep neural networks. The
MobileNet model is most appropriate for mobile and embedded vision-based applications where there
is a lack of computing power. The main objective of MobileNet is to optimize latency while
building small neural nets at the same time; rather than focusing only on size, it also focuses on speed.
MobileNets are constructed from depthwise separable convolutions. The number of parameters is
reduced significantly by this model through the use of depthwise separable convolutions, when
compared to a network with normal convolutions of the same depth. MobileNet is an efficient and
portable CNN architecture that is used in real-world applications. MobileNets primarily use depthwise
separable convolutions in place of the standard convolutions used in earlier architectures to build
lighter models. MobileNets introduce two new global hyperparameters (a width multiplier and a
resolution multiplier) that allow model developers to trade off latency or accuracy for speed and model
size depending on their requirements. Each depthwise separable convolution layer consists of a
depthwise convolution and a pointwise convolution. Counting depthwise and pointwise convolutions
as separate layers, a MobileNet has 28 layers. A standard MobileNet has 4.2 million parameters, which
can be further reduced by tuning the width multiplier hyperparameter appropriately.
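A minimal Keras sketch of one depthwise separable block, assuming TensorFlow is installed; the layer sizes are illustrative, not MobileNet's exact configuration:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)  # depthwise
x = layers.Conv2D(64, kernel_size=1)(x)                            # pointwise
model = tf.keras.Model(inputs, x)
model.summary()  # compare the parameter count against a plain 3x3 Conv2D(64)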
4.3 COCO DATASET:
COCO stands for Common Objects in Context; the image dataset was created with the goal of
advancing image recognition. The COCO dataset contains challenging, high-quality visual data for
computer vision, and state-of-the-art neural networks are mostly trained on it. For example, COCO is
often used to benchmark algorithms and compare the performance of real-time object detection. The
format of the COCO dataset is automatically interpreted by advanced neural network libraries.
4.4 YOLO:
YOLO performs real-time object detection. It applies one neural network to the complete image,
dividing the image into regions and predicting bounding boxes and probabilities for every region.
These bounding boxes are weighted by the predicted probabilities. A single neural network predicts
bounding boxes and class probabilities directly from full pictures in one evaluation. Since the full
detection pipeline is a single network, it can be optimized end-to-end directly on detection
performance. You Only Look Once (YOLO) is one of the most popular model architectures and object
detection algorithms. It uses one of the best neural network architectures to produce high accuracy and
overall processing speed, which is the main reason for its popularity; a web search for object detection
algorithms will most likely surface the YOLO model first.
The YOLO algorithm aims to predict the class of an object and the bounding box that defines the
object's location in the input image. It describes each bounding box using four numbers (a small
conversion sketch follows this list):
o Center of the bounding box (b_x, b_y)
o Width of the box (b_w)
o Height of the box (b_h)
o In addition, YOLO predicts the number c for the predicted class, as well as the probability of the
prediction (P_c)
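A small sketch converting YOLO's center-format box into corner coordinates for drawing; the pixel values below are made up for illustration:

def yolo_to_corners(bx, by, bw, bh):
    # (b_x, b_y) is the box center; return top-left and bottom-right corners.
    x1 = bx - bw / 2
    y1 = by - bh / 2
    x2 = bx + bw / 2
    y2 = by + bh / 2
    return x1, y1, x2, y2

print(yolo_to_corners(160, 120, 80, 60))  # (120.0, 90.0, 200.0, 150.0)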
4.5 VGG:
The VGG network is another convolutional neural network architecture used for image classification.
VGG stands for Visual Geometry Group; it is a standard deep convolutional neural network (CNN)
architecture with multiple layers. The "deep" refers to the number of layers, with VGG-16 and VGG-19
consisting of 16 and 19 weight layers respectively. The VGG architecture is the basis of
ground-breaking object recognition models.
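A minimal sketch loading the stock VGG-16 classifier from Keras, assuming TensorFlow is installed (the ImageNet weights download on first use):

from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")  # 16 weight layers, 1000 ImageNet classes
model.summary()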
4.6 R-CNN:
R-CNN is a progressive visual object detection system that combines bottom-up region proposals with
rich features computed by a convolutional neural network. R-CNN uses region proposal methods to
first generate potential bounding boxes in an image and then runs a classifier on these proposed boxes.
CHAPTER 5 - IMAGE PROCESSING
5.1 DESCRIPTION:
With the advent of modern technology our desires have grown, and they know no bounds. In the present era
a huge amount of research is going on in the fields of digital images and image processing. The
progression has been exponential and is ever increasing. Image processing is a vast area of research
in the present-day world and its applications are very widespread.
Image processing is the field of signal processing where both the input and output signals are images.
One of the most important applications of image processing is facial expression recognition. Our
emotions are revealed by the expressions on our face. Facial expressions play an important role in
interpersonal communication. A facial expression is a non-verbal gesture that appears on our face
according to our emotions. Automatic recognition of facial expressions plays an important role in
artificial intelligence and robotics, and thus it is a need of the generation. Some applications related to
this include personal identification and access control, videophone and teleconferencing, forensic
applications, human-computer interaction, automated surveillance, cosmetology and so on.
The objective of this project is to develop an Automatic Facial Expression Recognition System which
can take human facial images containing some expression as input and recognize and classify them
into seven different expression classes:
Neutral
Angry
Happy
Disgust
Fear
Sad
Surprise
Fig. 05: Basic different emotions
What is DIP?
An image may be defined as a two-dimensional function f(x, y), where x and y are spatial coordinates,
and the amplitude of f at any pair of coordinates (x, y) is called the intensity or gray level of the
image at that point. When x, y and the amplitude values of f are all finite discrete quantities, we
call the image a digital image. The field of DIP refers to processing digital images by
means of a digital computer. A digital image is composed of a finite number of elements, each
of which has a particular location and value. These elements are called pixels.
Vision is the most advanced of our senses, so it is not surprising that images play the single most
important role in human perception. However, unlike humans, who are limited to
the visual band of the electromagnetic (EM) spectrum, imaging machines cover almost the entire EM
spectrum, ranging from gamma rays to radio waves. They can also operate on images generated by sources
that humans are not accustomed to associating with images.
There is no general agreement among authors regarding where image processing stops and other
related areas, such as image analysis and computer vision, start. Sometimes a distinction
is made by defining image processing as a discipline in which both the input and output of a process
are images. This is a limiting and somewhat artificial boundary. The area of image
analysis lies between image processing and computer vision.
There are no clear-cut boundaries in the continuum from image processing at one end to complete vision at
the other. However, one useful paradigm is to consider three types of computerized processes in this
continuum: low-, mid- and high-level processes. Low-level processes involve primitive operations, such
as image preprocessing to reduce noise, contrast enhancement and image sharpening.
A low-level process is characterized by the fact that both its inputs and outputs are images.
Mid-level processing of images involves tasks such as segmentation, description of objects to
reduce them to a form suitable for computer processing, and classification of individual objects.
A mid-level process is characterized by the fact that its inputs generally are images but
its outputs are attributes extracted from those images. Finally, higher-level processing
involves "making sense" of an ensemble of recognized objects, as in image analysis, and, at the far end
of the continuum, performing the cognitive functions normally associated with human vision.
Digital image processing, as already defined, is used successfully in a broad range of
areas of exceptional social and economic value.
What is an Image?
An image is represented as a two-dimensional function f(x, y), where x and y are spatial coordinates and
the amplitude of f at any pair of coordinates (x, y) is called the intensity of the image at that point.
Processing on image:
Processing on an image can be of three types: low-level, mid-level and high-level.
Low-level Processing:
o Noise removal
o Contrast enhancement
o Image sharpening
Medium-level Processing:
o Segmentations
o Edge Detection
o Object Extraction
High-level Processing:
o Image analysis
o Scene interpretation
Since the digital image is invisible, it must be prepared for viewing on one or more output devices (laser
printer, monitor, etc.). The digital image can be optimized for the application by enhancing the appearance
of the structures within it.
Pixel:
A pixel is the smallest element of an image, and each pixel corresponds to a single value. In an 8-bit
grayscale image, the value of each pixel lies between 0 and 255. Each pixel stores a value proportional
to the light intensity at that particular location. Pixel density is indicated in either pixels per inch
or dots per inch.
Resolution:
Resolution can be defined in many ways, such as pixel resolution, spatial resolution, temporal
resolution and spectral resolution. In pixel resolution, the term resolution refers to the total number
of pixels in a digital image. For example, if an image has M rows and N columns, then its
resolution can be defined as M × N. The higher the pixel resolution, the higher the quality of the
image.
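A small sketch illustrating pixels and pixel resolution with OpenCV, assuming an image file input.jpg exists:

import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
M, N = gray.shape                # M rows, N columns
print(f"resolution: {M} x {N}")
print(gray[0, 0])                # one pixel's gray level, in the range 0-255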
Since achieving high resolution is not a cost-effective process, it is not always possible to obtain
high-resolution images at low cost. Hence super-resolution imaging is desirable: with the help of
certain methods and algorithms we can produce high-resolution images from low-resolution images.
5.3 GRAY SCALE IMAGE:
A grayscale image is a function I(x, y) of the two spatial coordinates of the image plane, where I(x, y)
is the intensity of the image at the point (x, y) on the image plane. I(x, y) takes non-negative values,
and we assume the image is bounded by a rectangle.
5.4 COLOR IMAGE:
A color image can be represented by three functions: R(x, y) for red, G(x, y) for green and B(x, y) for
blue. An image may be continuous with respect to the x and y coordinates and also in amplitude.
Converting such an image to digital form requires that the coordinates and the amplitude be digitized.
Digitizing the coordinate values is called sampling. Digitizing the amplitude values is called
quantization.
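A small sketch of the three color functions R(x, y), G(x, y) and B(x, y), assuming input.jpg exists; note that OpenCV loads channels in BGR order:

import cv2

color = cv2.imread("input.jpg")
B, G, R = cv2.split(color)        # one 2-D function per channel
print(R[0, 0], G[0, 0], B[0, 0])  # the three amplitudes at pixel (0, 0)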
5.5 RELATED TECHNOLOGY:
I. TENSORFLOW:
TensorFlow is an open-source software library for high-performance numerical computation. Its
versatile design allows simple deployment of computation across a range of platforms (CPUs, GPUs,
TPUs), from desktops to clusters of servers to mobile and edge devices. TensorFlow was designed
and developed by researchers and engineers from the Google Brain team within Google's AI
organization. It comes with robust support for machine learning and deep learning, and its versatile
numerical computation core is used across several other scientific domains.
TensorFlow makes it easy to construct, train and deploy object detection models, and it provides a
collection of detection models pre-trained on the COCO dataset, the KITTI dataset and the
Open Images dataset. One of the numerous detection models is the combination of the Single
Shot Detector (SSD) and the MobileNet architecture, which is fast and efficient and does not need huge
computational capability to accomplish object detection.
“Deep Face” is a deep learning facial recognition system developed to identify human faces in a digital
image, designed and developed by a group of researchers at Facebook. Google also has its own facial
recognition system in Google Photos, which automatically separates all the photos according to the
person in the image. There are various components involved in facial recognition; it focuses on various
aspects like the eyes, nose, mouth and eyebrows for recognizing a face.
SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect
ratios and scales per feature map location. At the time of prediction, the network generates scores for
the presence of each object category in each default box and generates adjustments to the box to better
match the object shape. Additionally, the network combines predictions from multiple feature maps
with different resolutions to naturally handle objects of various sizes.
CHAPTER 6 – SOFTWARE REQUIREMENT
PART 6.1 OBJECT DETECTION USING MACHINE LEARNING
6.1.1 JUPYTER NOTEBOOK:
The Jupyter Notebook App is a server-client application that allows editing and running notebook
documents via a web browser. The Jupyter Notebook App can be executed on a local desktop requiring
no internet access (as described in this document) or can be installed on a remote server and accessed
through the internet. In addition to displaying, editing and running notebook documents, the Jupyter
Notebook App has a “Dashboard” (Notebook Dashboard), a “control panel” showing local files and
allowing one to open notebook documents or shut down their kernels.
6.1.2 MODULES USED:
I. OPEN CV:
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine
learning software library. OpenCV was created to provide a common infrastructure for computer
vision applications and to accelerate the use of machine perception in commercial products [6]. It is
very easy for businesses to utilize and modify the code, as OpenCV is a BSD-licensed product. It is a
rich, wholesome library containing more than 2500 optimized algorithms, including a comprehensive
set of both classic and state-of-the-art computer vision and machine learning algorithms. These
algorithms can be used to detect and recognize faces, identify objects, classify human actions in
videos, track camera movements, track moving objects, extract 3D models of objects, produce 3D
point clouds from stereo cameras, stitch images together to produce a high-resolution image of an
entire scene, find similar images from an image database, remove red eyes from images taken with
flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented
reality.
Officially launched in 1999, the OpenCV project was initially an Intel Research initiative to advance
CPU-intensive applications, part of a series of projects including real-time ray tracing and 3D display
walls. The main contributors to the project included a number of optimization experts at Intel Russia,
as well as Intel's Performance Library Team. In the early days of OpenCV, the goals of the project were
described as:
o Advance vision research by providing not only open but also optimized code for basic vision
infrastructure. No more reinventing the wheel.
o Disseminate vision knowledge by providing a common infrastructure that developers could
build on, so that code would be more readily readable and transferable.
II. MATPLOTLIB:
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It has a
module named pyplot which makes plotting easy by providing features to control line styles,
font properties, axis formatting, etc. It supports a very wide variety of graphs and plots, namely
histograms, bar charts, power spectra, error charts, etc. It is used along with NumPy to provide an
environment that is an effective open-source alternative to MATLAB. It can also be used with graphics
toolkits like PyQt and wxPython.
Matplotlib is a comprehensive library for creating static, animated and interactive visualizations in
Python. Matplotlib makes easy things easy and hard things possible.
Matplotlib allows us visual access to huge amounts of data in easily digestible visuals. It consists of
several plot types like line, bar, scatter, histogram, etc., as in the sketch below.
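A minimal pyplot sketch of the kind of plot described above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")  # a simple line plot
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.savefig("sine.png")                 # or plt.show() in a notebook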
6.1.3 PROGRAMMING LANGUAGE USED:
Python was used as the programming language; debugging Python programs is easy, and the debugger
is written in Python itself, testifying to Python's introspective power. On the other hand, often the
quickest way to debug a program is to add a few print statements to the source: the fast
edit-test-debug cycle makes this simple approach very effective.
PART 6.2: FACIAL EMOTION RECOGNITION USING MACHINE LEARNING
6.2.2 PLANNING:
6.2.3 THE LIBRARY & PACKAGES:
NumPy: NumPy is the fundamental package for scientific computing with Python. It contains, among
other things:
o A powerful N-dimensional array object
o Sophisticated (broadcasting) functions
o Tools for integrating C/C++ and Fortran code
o Useful linear algebra, Fourier transform and random number capabilities
NumPy Array: A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of
non-negative integers. The number of dimensions is the rank of the array; the shape of an array is a
tuple of integers giving the size of the array along each dimension.
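A small sketch of rank and shape, assuming NumPy is installed:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.ndim)   # rank: 2
print(a.shape)  # shape: (2, 3), the size along each dimension
print(a.dtype)  # all elements share one type, e.g. int64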
SciPy: SciPy (Scientific Python) is often mentioned in the same breath as NumPy. SciPy extends the
capabilities of NumPy with further useful functions for minimization, regression, Fourier
transformation and many others.
Keras: Keras is a high-level neural networks API, written in Python and capable of running on top of
TensorFlow, CNTK or Theano. It was developed with a focus on enabling fast experimentation. Keras
contains numerous implementations of commonly used neural network building blocks such as layers,
objectives, activation functions and optimizers, and a host of tools to make working with image and
text data easier. The code is hosted on GitHub, and community support forums include the GitHub
issues page and a Slack channel. Keras allows users to productize deep models on smartphones (iOS
and Android), on the web, or on the Java Virtual Machine. It also allows use of distributed training of
deep learning models on clusters of Graphics Processing Units (GPUs).
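A minimal sketch of assembling Keras building blocks (layers, activations, an optimizer and an objective); the layer sizes are illustrative only, not the model actually trained in this project:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(128, activation="relu", input_shape=(48 * 48,)),
    Dense(7, activation="softmax"),  # e.g. seven emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()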
6.2.4 HAAR CASCADE CLASSIFIER IN OPEN-CV:
The algorithm needs a lot of positive images (images of faces) and negative images (images without
faces) to train the classifier. Then we need to extract features from them. These features are just like
our convolutional kernels. Each feature is a single value obtained by subtracting the sum of pixels
under the white rectangle from the sum of pixels under the black rectangle.
Now all possible sizes and locations of each kernel are used to calculate plenty of features. (Just
imagine how much computation it needs: even a 24x24 window results in over 160,000 features.) For
each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To
solve this, they introduced integral images. An integral image simplifies the calculation of the sum
of pixels, however large the number of pixels may be, to an operation involving just four pixels. Nice,
isn't it? It makes things super-fast.
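A small NumPy sketch of the integral-image trick described above: the sum over any rectangle comes from just four lookups in the cumulative table.

import numpy as np

img = np.random.randint(0, 256, size=(24, 24))
ii = img.cumsum(axis=0).cumsum(axis=1)  # the integral image

def rect_sum(ii, r1, c1, r2, c2):
    # Sum of img[r1:r2+1, c1:c2+1] from four integral-image lookups.
    total = ii[r2, c2]
    if r1 > 0:
        total -= ii[r1 - 1, c2]
    if c1 > 0:
        total -= ii[r2, c1 - 1]
    if r1 > 0 and c1 > 0:
        total += ii[r1 - 1, c1 - 1]
    return total

print(rect_sum(ii, 2, 2, 10, 10) == img[2:11, 2:11].sum())  # True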
But among all these features we calculate, most are irrelevant. For example, consider the
image below. The top row shows two good features. The first feature selected seems to focus on the
property that the region of the eyes is often darker than the region of the nose and cheeks. The second
feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same
windows applied to the cheeks or any other region are irrelevant. So how do we select the best features
out of 160,000+ features? This is achieved by AdaBoost.
For this, we apply each and every feature on all the training images. For each feature, we find the best
threshold which will classify the faces as positive or negative. Obviously, there will be errors or
misclassifications. We select the features with the minimum error rate, which means they are the
features that best classify the face and non-face images. (The process is not as simple as this. Each
image is given an equal weight in the beginning. After each classification, the weights of misclassified
images are increased. Then the same process is repeated: new error rates are calculated, along with new
weights. The process continues until the required accuracy or error rate is achieved, or the required
number of features is found.)
The final classifier is a weighted sum of these weak classifiers. They are called weak because each
alone cannot classify the image, but together with the others they form a strong classifier. The paper
says even 200 features provide detection with 95% accuracy. Their final setup had around 6000
features. (Imagine a reduction from 160,000+ features to 6000 features. That is a big gain.)
In an image, most of the image region is non-face region. So it is a better idea to have a simple method
to check whether a window is a face region; if it is not, discard it in a single shot and don't process it
again. Instead, focus on regions where there can be a face. This way, we spend more time checking
possible face regions.
For this they introduced the concept of a Cascade of Classifiers. Instead of applying all 6000 features
on a window, the features are grouped into different stages of classifiers and applied one by one.
(Normally the first few stages contain very few features.) If a window fails the first stage, discard it;
we don't consider the remaining features on it. If it passes, apply the second stage of features and
continue the process. A window which passes all stages is a face region. Haar-like features are digital
image features used in object recognition. They owe their name to their intuitive similarity with Haar
wavelets and were used in the first real-time face detector. Historically, working with only image
intensities (i.e., the RGB pixel values at each and every pixel of the image) made the task of feature
calculation computationally expensive. A publication by Papageorgiou et al. discussed working with
an alternative feature set based on Haar wavelets instead of the usual image intensities. Paul Viola and
Michael Jones adapted the idea of using Haar wavelets and developed the so-called Haar-like features.
A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window,
sums up the pixel intensities in each region and calculates the difference between these sums. This
difference is then used to categorize subsections of an image. For example, with a human face, it is a
common observation that among all faces the region of the eyes is darker than the region of the cheeks.
Therefore, a common Haar feature for face detection is a set of two adjacent rectangles that lie above
the eye and the cheek region. The position of these rectangles is defined relative to a detection window
that acts like a bounding box to the target object (the face in this case).
In the detection phase of the Viola–Jones object detection framework, a window of the target size is
moved over the input image, and for each subsection of the image the Haar-like feature is calculated.
This difference is then compared to a learned threshold that separates non-objects from objects.
Because such a Haar-like feature is only a weak learner or classifier (its detection quality is slightly
better than random guessing) a large number of Haar-like features are necessary to describe an object
with sufficient accuracy.
In the Viola–Jones object detection framework, the Haar-like features are therefore organized in
something called a classifier cascade to form a strong learner or classifier.
The key advantage of a Haar-like feature over most other features is its calculation speed. Due to the
use of integral images, a Haar-like feature of any size can be calculated in constant time (approximately
60 microprocessor instructions for a 2-rectangle feature).
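A minimal sketch of cascade-based face detection with OpenCV, assuming an image faces.jpg exists; the cascade file itself ships with OpenCV:

import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("faces.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the detection window over the image at several scales.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
cv2.imwrite("faces_out.jpg", img)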
CHAPTER 7 – RESULTS
PART 7.1 OBJECT DETECTION USING MACHINE LEARNING
7.1.1 RESULTS
This is a snapshot of video data which we fed to the algorithm, expecting it to detect and identify
objects in the image and label them according to the class assigned to each.
PART 7.2 FACIAL EMOTION RECOGNITION USING MACHINE LEARNING
7.2.2 RESULTS
This is a snapshot taken before detection. After we feed that image into our model, we can see that our
algorithm predicts the emotion in the image and shows it by labelling a square box with text.
II. Second Input Result
As we can see, our project can detect the emotions of the subject in an image accurately.
CHAPTER 8 – CONCLUSION
8.1 CONCLUSION:
In this case, when the model predicts incorrectly, the correct label is often the second most likely
emotion. The facial expression recognition system presented in this research work contributes a
resilient face recognition model based on the mapping of behavioral characteristics onto
physiological biometric characteristics. The physiological characteristics of the human face with
relevance to various expressions such as happiness, sadness, fear, anger, surprise and disgust are
associated with geometrical structures, which are stored as the base matching template for the
recognition system. The behavioral aspect of this system relates the attitude behind different
expressions as a property base. The property bases are divided into exposed and hidden categories in
genetic algorithmic genes. The gene training set evaluates the expressional uniqueness of individual
faces and provides a resilient expression recognition model in the field of biometric security.
The design of a novel asymmetric cryptosystem based on biometrics, with features like hierarchical
group security, eliminates the use of passwords and smart cards, as opposed to earlier cryptosystems. It
requires special hardware support, like all other biometric systems. This research work promises a
new direction of research in the field of asymmetric biometric cryptosystems, which is highly desirable
in order to get rid of passwords and smart cards completely. Experimental analysis and study show that
the hierarchical security structures are effective in geometric shape identification for physiological
traits.
It is important to note that there is no specific formula for building a neural network that is guaranteed
to work well. Different problems require different network architectures and a lot of trial and
error to produce desirable validation accuracy. This is the reason why neural nets are often perceived
as "black box" algorithms.
In this project we achieved an accuracy of almost 70%, which is not bad at all compared with previous
models. However, we still need to improve in specific areas: due to the lack of a highly configured
system we could not go deeper into dense neural networks, as the system becomes very slow, and we
will try to improve in these areas in future.
We would also like to train more databases into the system to make the model more and more accurate,
but again resources become a hindrance on the path, and we also need to improve in several areas in
future to resolve the errors and improve the accuracy.
Having examined techniques to cope with expression variation, in future we may investigate the face
classification problem and the optimal fusion of color and depth information in more depth. Further
study can be laid down in the direction of matching alleles of genes to the geometric factors of facial
expressions. The genetic property evolution framework for the facial expression system can be studied
to suit the requirements of different security models such as criminal detection, governmental
confidential security breaches, etc.
CHAPTER 9 - REFERENCES
[1] A. Mollahosseini, D. Chan and M. H. Mahoor. Going deeper in facial expression recognition using
deep neural networks. IEEE Winter Conference on Applications of Computer Vision, 2016.
[2] B.-K. Kim, J. Roh, S.-Y. Dong and S.-Y. Lee. Hierarchical committee of deep convolutional neural
networks for robust facial expression recognition. Journal on Multimodal User Interfaces, pages 1–17,
2015.
[4] P. Ekman and W. V. Friesen. Emotional facial action coding system. Unpublished manuscript,
University of California at San Francisco, 1983.
[6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. JMLR Proceedings, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell.
Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[10] M. Sokolova, N. Japkowicz and S. Szpakowicz. Beyond accuracy, F-score and ROC: a family of
discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial
Intelligence, pages 1015–1021. Springer Berlin Heidelberg, December 2006.
[11] P. Michel and R. El Kaliouby. Facial expression recognition using support vector machines. In The
10th International Conference on Human-Computer Interaction, Crete, Greece, 2005.
[12] P. Michel and R. El Kaliouby. Real time facial expression recognition in video using support
vector machines. In Proceedings of the 5th International Conference on Multimodal Interfaces, pages
258–264. ACM, November 2003.
LIST OF PUBLICATION:
[1] Ankita Das, Aditi Choubey, Lipika Mandal, “Object Detection Using Machine Learning”, in National
Conference on Computational & Characterization Techniques in Engineering and Sciences (CCTES
2023), February 27-28, 2023.