Final Documentation
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING (AI&ML)
by
T.AKSHITHA 21K81A6655
K.GAYATHRI 21K81A6623
O.PRASHANTH 21K81A6635
Under the Guidance of
MR. N. KRANTHI KUMAR
ASSISTANT PROFESSOR
DEPARTMENT OF CSE(AI&ML)
NOVEMBER - 2024
St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
Certificate
Place:
Date:
St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE,
Accredited by NBA & NAAC A+, ISO 9001:2008 Certified
Dhulapally, Secunderabad - 500 100
DEPARTMENT OF CSE(AI&ML)
DECLARATION
T.AKSHITHA 21K81A6655
K.GAYATHRI 21K81A6623
O.PRASHANTH 21K81A6635
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompanies the successful completion of any task would
be incomplete without the mention of the people who made it possible and whose
encouragement and guidance have crowned our efforts with success.
First and foremost, we would like to express our deep sense of gratitude and indebtedness to
our College Management for their kind support and permission to use the facilities available
in the Institute.
We especially would like to express our deep sense of gratitude and indebtedness to
Dr. P. SANTOSH KUMAR PATRA, Professor and Group Director,
St. Martin’s Engineering College, Dhulapally, for permitting us to undertake this project.
We are also thankful to Dr. K. SRINIVAS, Head of the Department, Computer Science and Engineering (AI&ML), St. Martin's Engineering College, Dhulapally, Secunderabad, for his support and guidance throughout our project.
We are also thankful to our project coordinator MR. D. VENKATESHAN, Assistant Professor, Computer Science and Engineering (AI&ML) department, for his valuable support.
We would like to express our sincere gratitude and indebtedness to our project supervisor
MR. N. KRANTHI KUMAR, Assistant Professor, Computer Science and Engineering (AI&ML), St. Martin's Engineering College, Dhulapally, for his support and guidance
throughout our project.
Finally, we express our thanks to all those who have helped us in successfully completing this project. Furthermore, we would like to thank our family and friends for their moral support and encouragement.
T.AKSHITHA 21K81A6655
K.GAYATHRI 21K81A6623
O.PRASHANTH 21K81A6635
ABSTRACT
Facial expression recognition plays a pivotal role in emotion analysis and AI development,
providing insights into human emotional states and enhancing interaction with AI systems.
This study focuses on a comprehensive facial expression recognition dataset designed to advance emotion analysis and improve AI algorithms. The dataset encompasses 10,000 facial images from diverse individuals, annotated with seven primary emotions: happiness, sadness, anger, surprise, fear, disgust, and neutral. Each image is tagged with metadata including age, gender, and ethnicity to facilitate in-depth analysis and model training. We applied various machine learning and deep learning techniques to this dataset to develop robust emotion recognition models. Preliminary results demonstrate high accuracy in emotion classification, with convolutional neural networks (CNNs) showing superior performance in distinguishing subtle emotional expressions.
The dataset's richness in diversity and detail supports the development of models that are not only accurate but also generalizable across different populations. Our research highlights the importance of diverse and well-annotated datasets in advancing the field of emotion recognition. The dataset provides a valuable resource for researchers and developers, enabling the creation of more responsive and empathetic AI systems.
Future research will focus on expanding the dataset and refining models to enhance their applicability in real-world scenarios and various applications.
LIST OF FIGURES
LIST OF ACRONYMS AND DEFINITIONS
S.NO    ACRONYM    DEFINITION
CONTENTS
ACKNOWLEDGEMENT i
ABSTRACT ii
LIST OF FIGURES iii
1.4 Applications 03
3.3 Design 16
CHAPTER 1
INTRODUCTION
1.1 History
Facial expression recognition is a pivotal area of research within emotion analysis and artificial
intelligence (AI). As an essential component of human-computer interaction, accurate recognition of
facial expressions enhances the ability of AI systems to understand and respond to human emotions
effectively. The ability to decode facial expressions is critical for applications ranging from
emotion-aware virtual assistants to advanced security systems and therapeutic tools. With
advancements in machine learning and computer vision, the development and refinement of facial
expression recognition systems have become increasingly sophisticated.
The growing importance of facial expression recognition technology underscores the need for
improved datasets and algorithms. Advances in this field promise to enhance human-computer
interactions, improve automated emotional analysis, and offer new possibilities for personalized user
experiences. However, achieving these goals requires overcoming existing limitations and
developing systems that are both reliable and adaptable to real-world scenarios.
Automated and semi-automated methods for facial expression recognition must overcome these
challenges by improving data quality and ensuring comprehensive representation. Developing a
high-quality, diverse dataset requires addressing issues such as annotation accuracy, variability in expressions across individuals, and balanced demographic representation.
1.3 Research Motivation
The motivation behind developing a comprehensive facial expression recognition dataset stems from
the need for high-quality, diverse data to train and validate AI systems. Accurate facial expression
recognition is vital for numerous applications, including mental health monitoring, interactive
gaming, and user experience enhancement. Despite the progress in this field, current datasets often
lack the diversity required to train models that perform well across different populations and
conditions. This limitation hinders the development of universally applicable and reliable emotion
recognition systems.
1.4 Applications
- Enhanced emotion recognition:
Advanced facial expression recognition systems can provide accurate assessments of human
emotions, leading to improvements in areas such as customer service, mental health monitoring, and
interactive technologies.
- Real-time interaction:
Integration of these systems in real-time applications, such as virtual assistants and gaming, can
create more responsive and emotionally aware interactions, enhancing user experience.
- Personalized user experiences:
By understanding user emotions, AI systems can tailor responses and interactions to individual emotional states, providing a more personalized and engaging experience.
- Training and research tool: A comprehensive dataset can serve as a valuable resource for training new AI models and conducting research, advancing the field of emotion recognition.
CHAPTER 2
LITERATURE SURVEY
Nan et al. [1] proposed A-MobileNet, a novel approach for facial expression recognition, detailed in
their 2022 paper published in the Alexandria Engineering Journal. This study introduced an
optimized mobile network architecture aimed at improving the accuracy of recognizing facial
expressions. The authors leveraged advanced network design and training techniques, resulting in
enhanced performance over existing methods. The A-MobileNet approach demonstrated superior
recognition accuracy across various datasets of facial expressions. This advancement is particularly
valuable for applications in human-computer interaction and affective computing, where accurate
emotion detection is crucial. The research underscores A-MobileNet's potential in real-world
scenarios requiring precise emotion recognition.
Li et al. [2] conducted a study published in the Alexandria Engineering Journal in 2021, analyzing
the correlation between facial expressions and urban crime. Their research explored how facial
expression analysis can reveal emotional patterns associated with potential crime hotspots. By
examining large datasets of facial expressions, the study identified that specific emotional states
could be linked to increased crime risk in urban areas. This innovative approach suggests that facial
expression data may serve as a useful tool for predicting and preventing crime. The findings
highlight the potential of emotion analysis in enhancing urban safety and crime management
strategies.
Mannepalli et al. [3] introduced an adaptive fractional deep belief network for speaker emotion
recognition in their 2017 study published in the Alexandria Engineering Journal. The research aimed
to improve the accuracy of recognizing emotions from speech signals through a novel deep learning
model. The adaptive fractional deep belief network demonstrated significant enhancements in
emotion recognition performance compared to traditional methods. The study's results showed
improved accuracy in detecting various emotions in spoken language. This advancement offers
valuable implications for applications in voice-based emotion analysis and human-computer
interaction, enhancing the understanding of speaker emotions.
Tonguç and Ozkara [4] investigated automatic recognition of student emotions from facial
expressions during lectures, published in Computers & Education in 2020. Their study focused on
developing a system to monitor and analyze student emotions in real-time to improve educational
outcomes. By employing facial expression recognition technology, the research aimed to assess
students' emotional states and engagement levels during lectures. The findings revealed that
automatic emotion recognition can provide valuable insights into student experiences and learning environments. This approach has potential applications in enhancing classroom interactions and adapting teaching strategies.
Yun et al. [5] explored social skills training for children with autism spectrum disorder using a
robotic behavioral intervention system, published in Autism Research in 2017. The study focused on
employing robotic systems to facilitate social skills development in children with autism. The
robotic intervention aimed to provide engaging and interactive training to improve social behaviors
and communication skills. The results indicated that the robotic system effectively supported the
development of social skills in children with autism. This research highlights the potential of
technology-enhanced interventions in addressing social challenges faced by children with autism.
Li et al. [6] introduced MVT, a Mask Vision Transformer, for facial expression recognition in the
wild, as detailed in their 2021 preprint. The study proposed a new vision transformer model
designed to improve facial expression recognition accuracy in challenging real-world conditions.
MVT utilized masking techniques to enhance model performance on diverse facial expression
datasets. The research demonstrated that the Mask Vision Transformer achieved improved
recognition rates compared to traditional methods. This advancement is significant for applications
requiring robust emotion detection in varying environments and conditions.
Liang et al. [7] presented a convolution-transformer dual branch network for head-pose and
occlusion facial expression recognition, published in Visual Computer in 2022. Their study
introduced a dual branch network combining convolutional and transformer models to address
challenges in facial expression recognition caused by head-pose variations and occlusions. The
proposed network demonstrated improved accuracy in recognizing facial expressions despite these
difficulties. The research highlights the effectiveness of integrating convolutional and transformer
approaches to enhance emotion recognition performance. This advancement is valuable for
applications involving complex facial expression analysis.
Jeong and Ko [8] focused on driver’s facial expression recognition in real-time for safe driving, as
reported in Sensors in 2018. Their study aimed to develop a system for monitoring driver emotions
to enhance road safety. By analyzing drivers' facial expressions in real-time, the research sought to
detect signs of fatigue or distraction that could impact driving performance. The findings showed
that real-time emotion recognition could contribute to safer driving practices. This approach
underscores the potential of emotion detection technology in improving road safety and driver
assistance systems.
Kaulard et al. [9] provided a validated database of emotional and conversational facial expressions
known as the MPI Facial Expression Database, published in PLoS One in 2012. The study aimed to
create a comprehensive resource for facial expression research by offering a diverse set of emotional
and conversational expressions. The database was validated through rigorous testing to ensure its
reliability for various research applications. The MPI Facial Expression Database serves as a
valuable tool for researchers studying facial expressions and emotion recognition. This resource facilitates continued research into emotional and conversational expression analysis.
Ali et al. [10] explored the potential of using facial expressions to detect Parkinson’s disease in their
2021 study published in npj Digital Medicine. The research investigated how changes in facial
expressions, observable in online videos, could indicate the presence of Parkinson’s disease.
Preliminary evidence suggested that facial expression analysis might serve as a non-invasive method
for detecting early signs of Parkinson’s disease. The study highlights the potential of leveraging
facial expression data for early diagnosis and monitoring of Parkinson’s disease. This approach
offers a promising direction for improving diagnostic techniques in neurological disorders.
Du et al. [11] examined perceptual learning of facial expressions in their 2016 paper published in
Vision Research. The study investigated how individuals learn to recognize and interpret facial
expressions over time. The research explored the mechanisms of perceptual learning and its impact
on the ability to discern facial emotions. The findings revealed that perceptual learning significantly
enhances the recognition of facial expressions, contributing to a deeper understanding of emotional
communication. This research provides insights into the cognitive processes involved in emotion
perception and its implications for emotional learning and development.
Varghese et al. [12] provided an overview of emotion recognition systems in their 2015 conference
paper published by IEEE. The study reviewed various techniques and approaches used for emotion
recognition, including their applications and challenges. The overview covered methods ranging
from traditional statistical approaches to modern machine learning techniques. The research
highlighted advancements in emotion recognition technology and its potential applications in
various fields, such as human-computer interaction and psychological studies. This comprehensive
review offers valuable insights into the state-of-the-art in emotion recognition systems.
Egger et al. [13] reviewed emotion recognition from physiological signal analysis in their 2019
paper published in Electronic Notes in Theoretical Computer Science. The study focused on
analyzing physiological signals, such as heart rate and skin conductance, for emotion recognition.
The review summarized different methodologies used in physiological signal analysis and their
effectiveness in detecting emotions. The findings highlighted the strengths and limitations of various
approaches, offering a thorough understanding of the current advancements in physiological
emotion recognition. This research contributes to the development of more accurate and reliable
emotion detection systems.
Mattavelli et al. [14] investigated facial expression recognition and discrimination in Parkinson’s
disease in their 2021 study published in the Journal of Neuropsychology. The research examined
how Parkinson’s disease affects the ability to recognize and interpret facial expressions.
CHAPTER 3
Existing facial expression recognition systems often rely on datasets that include high-resolution
images or videos of facial expressions, captured under controlled conditions. These datasets are
annotated with labels corresponding to different emotional states, such as happiness, sadness, anger,
and surprise. The quality and diversity of these datasets are critical in training AI models to achieve
high accuracy and generalizability.
Imaging techniques used in collecting data for facial expression recognition include high-definition
cameras and specialized recording equipment to ensure the capture of fine details in facial
movements. Annotators typically classify the expressions using predefined emotional categories,
and advanced algorithms then process these annotations to train machine learning models.
The development of facial expression recognition systems also involves creating standardized
benchmarks and evaluation metrics to assess model performance. These benchmarks help compare
different algorithms and ensure that the systems meet the required accuracy and robustness levels.
The current facial expression recognition systems face several challenges that impact their
performance and applicability.
Variability in Facial Expressions: One significant challenge is the variability in facial expressions
across different individuals and contexts. Factors such as age, ethnicity, and cultural background can
influence how emotions are expressed and perceived. This variability can lead to inconsistencies in
recognition accuracy and limit the effectiveness of existing datasets.
Annotation Accuracy and Consistency: Accurate annotation of facial expressions is crucial for
training effective AI models. However, manual annotation is time-consuming and prone to
inconsistencies, particularly when dealing with subtle or complex expressions
Lighting and Environmental Conditions: Facial expression datasets often suffer from limitations
related to lighting and environmental conditions. Variations in lighting, background, and facial
occlusions can impact the clarity and quality of the images, leading to challenges in achieving
consistent recognition across different scenarios.
Dataset Diversity: Many existing datasets may not sufficiently represent diverse populations,
leading to biased models that perform well only for specific groups. The lack of diversity in datasets
can result in reduced generalization and accuracy when applied to broader or more varied
populations.
Ethical and Privacy Concerns: Collecting and using facial expression data raises ethical and
privacy concerns, especially when dealing with sensitive information. Ensuring that datasets are
collected and used in compliance with privacy regulations and ethical guidelines is essential to
address these concerns.
The limitations of current facial expression recognition approaches highlight the need for improved
datasets and methodologies.
Subjectivity in Annotation: The manual annotation of facial expressions often involves subjective
judgment, leading to variability in how different annotators label the same expressions. This
subjectivity can introduce inconsistencies and affect the quality of the dataset.
Limited Predictive Power: Current datasets and models may have limited predictive power,
particularly when used in isolation. A dataset that lacks diversity or comprehensiveness can lead to
models that are not fully representative of real-world scenarios, resulting in reduced accuracy and
reliability.
Scalability and Resource Intensity: Building and maintaining high-quality facial expression
datasets can be resource-intensive and challenging to scale. The need for large volumes of data,
high-resolution images, and extensive annotation efforts can be a barrier to developing robust
systems.
Lack of Standardization: The absence of standardized protocols for dataset creation, annotation,
and evaluation can lead to inconsistencies and difficulties in comparing different facial expression
recognition systems. Standardization is necessary to ensure that datasets and models meet
established performance benchmarks.
Ethical and Legal Considerations: The collection and use of facial expression data must navigate
ethical and legal considerations, including informed consent and data privacy. Addressing these
issues is crucial for the responsible development and deployment of facial expression recognition systems.
The foundation of any machine learning project lies in the quality and comprehensiveness of its
dataset. For this study, the Facial Expression Recognition Dataset was employed, comprising a
diverse collection of facial images categorized into seven distinct emotions: angry, disgust, fear,
happy, neutral, sad, and surprise. The dataset was organized into training and testing directories,
ensuring a balanced representation of each emotion category. This comprehensive dataset serves as
the cornerstone for training and evaluating the emotion recognition models, facilitating the system's
ability to generalize across various facial expressions.
Preprocessing is a critical step aimed at enhancing data quality and suitability for model training.
The initial phase involved handling missing or corrupted data to prevent inaccuracies during model
training. This was achieved by systematically removing null values and ensuring all image files
were intact and properly formatted. Following data cleansing, label encoding was performed to
transform categorical emotion labels into a numerical format, enabling seamless integration with
machine learning algorithms. Additionally, images were resized to a uniform dimension (64x64
pixels) to maintain consistency across the dataset, thereby optimizing computational efficiency and
model performance.
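A minimal sketch of this cleaning and resizing step is given below; it assumes the images are organized in one folder per emotion, and the directory and variable names are illustrative rather than the project's exact code.

import os
import cv2
import numpy as np

# Illustrative sketch: scan each per-emotion folder, skip unreadable or
# corrupted files, and resize every valid image to 64x64 pixels.
def load_clean_images(root_dir, categories, size=(64, 64)):
    images, labels = [], []
    for category in categories:
        folder = os.path.join(root_dir, category)
        for fname in os.listdir(folder):
            img = cv2.imread(os.path.join(folder, fname))
            if img is None:  # unreadable or corrupted file, drop it
                continue
            images.append(cv2.resize(img, size))
            labels.append(category)
    return np.array(images), np.array(labels)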
Accurate label encoding is essential for transforming categorical labels into a machine-readable
format. In this study, the LabelEncoder from the scikit-learn library was utilized to convert textual
emotion labels into numerical indices. This encoding facilitates the model's ability to interpret and
differentiate between various emotion classes during the training process. By assigning unique
numerical values to each emotion category, the model can effectively learn and predict the
underlying emotional states represented in the facial images.
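The snippet below is a small illustration of this encoding step with scikit-learn's LabelEncoder; the label list mirrors the seven emotion categories described above, and in practice the encoder would be fitted on the full label column of the dataset.

from sklearn.preprocessing import LabelEncoder

# Illustrative: convert textual emotion labels into numerical indices.
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
le = LabelEncoder()
encoded = le.fit_transform(emotions)
print(dict(zip(le.classes_, encoded)))  # e.g. {'angry': 0, 'disgust': 1, ...}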
To evaluate the model's performance objectively, the dataset was partitioned into training and
testing subsets using an 80-20 split. This stratification ensures that the model is trained on a
substantial portion of the data while reserving a representative sample for unbiased testing. The
training set is used to optimize the model's parameters, whereas the testing set serves as a
benchmark to assess the model's generalization capabilities on unseen data.
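A sketch of the 80-20 split with scikit-learn follows; the placeholder arrays stand in for the preprocessed images and encoded labels from the previous steps, and the stratified option is one reasonable choice rather than the project's confirmed setting.

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 64x64 face images and encoded labels.
X = np.random.rand(140, 64, 64, 3).astype('float32')
y = np.arange(140) % 7  # seven balanced emotion classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)  # (112, 64, 64, 3) (28, 64, 64, 3)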
As a baseline for performance comparison, the Decision Tree Classifier (DTC) was implemented.
DTC is a widely used machine learning algorithm known for its simplicity and interpretability. By
constructing a tree-like model of decisions, DTC classifies data by learning decision rules inferred
from the input features. This existing algorithm provides a foundational benchmark against which
the proposed CNN model's performance can be measured, highlighting improvements and
identifying areas for enhancement.
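The baseline can be sketched as below; X_train, X_test, y_train, and y_test are assumed to come from the split sketched earlier, and the hyperparameter values are illustrative only.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Baseline sketch: the DTC receives each image flattened into a 1-D vector.
dtc = DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=42)
dtc.fit(X_train.reshape(len(X_train), -1), y_train)
dtc_pred = dtc.predict(X_test.reshape(len(X_test), -1))
print('DTC accuracy:', accuracy_score(y_test, dtc_pred))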
To advance the system's emotion recognition capabilities, a Convolutional Neural Network (CNN)
was developed as the proposed algorithm. CNNs are a class of deep learning models particularly
adept at processing and interpreting visual data. By leveraging multiple layers of convolutional and
pooling operations, CNNs can automatically extract and learn intricate features from raw image
data, enabling more accurate and nuanced emotion classification. The architecture of the proposed
CNN includes convolutional layers for feature extraction, pooling layers for dimensionality
reduction, and dense layers for final classification, culminating in a softmax activation function to
output probability distributions over the emotion classes.
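A compact Keras sketch of such an architecture is shown below; the number of filters and dense units are illustrative choices, not the exact configuration used in the project.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Illustrative CNN: convolution + pooling for feature extraction,
# dense layers and softmax for the final 7-class emotion prediction.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(7, activation='softmax'),
])
model.summary()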
Step 7: Performance Comparison
A rigorous performance comparison was conducted between the existing DTC and the proposed
CNN models. Utilizing metrics such as accuracy, precision, recall, and F1-score, the models'
effectiveness in correctly identifying and classifying facial expressions was evaluated. Additionally,
confusion matrices were generated to visualize the models' performance across different emotion
categories, providing insights into specific strengths and weaknesses. This comparative analysis
underscores the advancements achieved through the proposed CNN approach.
The final step involves deploying the trained models to predict emotions from new, unseen test data.
Utilizing the trained CNN model, the system processes individual facial images, generating emotion
predictions based on the learned features. This step demonstrates the model's practical applicability
in real-world scenarios, showcasing its ability to accurately interpret and classify emotions from
facial expressions. The predicted outcomes are visualized alongside the input images, providing a
tangible representation of the system's performance and reliability.
Fig. 1: Block diagram of the proposed system.
Data splitting and preprocessing are pivotal in ensuring the robustness and reliability of machine
learning models. In this study, the dataset was first meticulously cleaned to eliminate any null or
corrupted entries, ensuring that only high-quality images were utilized for training and evaluation.
The images were uniformly resized to 64x64 pixels, standardizing the input dimensions and
facilitating efficient processing. Subsequently, label encoding was performed to convert categorical
emotion labels into numerical representations, a prerequisite for compatibility with machine learning
algorithms. The cleaned and encoded dataset was then randomly shuffled to prevent any inherent
biases and was split into training and testing subsets using an 80-20 ratio. This stratification ensures
that the training set sufficiently captures the diversity of facial expressions, while the testing set
provides an unbiased evaluation of the model's generalization capabilities. Normalization of pixel
values was also conducted by scaling the image data to a range of 0 to 1, enhancing the model's
convergence during training and mitigating issues related to varying illumination conditions in the
images.
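The normalization, shuffling, and splitting described here can be sketched as follows; `images` and `encoded_labels` are assumed outputs of the earlier loading and encoding steps.

import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Scale pixel values to [0, 1], shuffle to remove ordering bias, then split 80-20.
X = images.astype('float32') / 255.0
X, y = shuffle(X, encoded_labels, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)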
The process of building machine learning models involves several key steps, including model
selection, architecture design, compilation, training, and evaluation. Initially, the Decision Tree
Classifier (DTC) was implemented as the baseline model due to its simplicity and interpretability.
The DTC was trained on the flattened image data, where each image was transformed into a one-
dimensional array to facilitate input into the classifier. Hyperparameters such as the maximum depth
of the tree and the criterion for splitting were tuned to optimize performance. Following the DTC, a
Convolutional Neural Network (CNN) was developed as the proposed model. The CNN architecture
comprised multiple convolutional layers with ReLU activation functions, followed by max-pooling
layers to reduce spatial dimensions. These layers were succeeded by fully connected dense layers
culminating in a softmax activation layer to output probability distributions across the emotion
classes. The CNN was compiled using the Adam optimizer and categorical cross-entropy loss
function, and was trained over multiple epochs with a validation split to monitor performance. Both
models were evaluated using a suite of performance metrics, including accuracy, precision, recall,
and F1-score, to comprehensively assess their effectiveness in emotion classification.
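The compile/train/evaluate loop described above might look like the sketch below; the model and data splits are assumed from the earlier sketches, and the epoch and batch settings are arbitrary.

import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Compile with Adam and categorical cross-entropy, train with a validation split.
y_train_cat = to_categorical(y_train, num_classes=7)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train_cat, epochs=20, batch_size=32, validation_split=0.1)

# Evaluate with the same metrics reported for both models.
cnn_pred = np.argmax(model.predict(X_test), axis=1)
print('Accuracy :', accuracy_score(y_test, cnn_pred))
print('Precision:', precision_score(y_test, cnn_pred, average='macro'))
print('Recall   :', recall_score(y_test, cnn_pred, average='macro'))
print('F1-score :', f1_score(y_test, cnn_pred, average='macro'))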
What is DTC?
The Decision Tree Classifier (DTC) is a supervised machine learning algorithm used for both
classification and regression tasks. It operates by recursively partitioning the feature space into
distinct regions based on the values of input features, effectively creating a tree-like model of
decisions. Each internal node in the tree represents a feature test, each branch denotes the outcome
of the test, and each leaf node corresponds to a class label or regression value.
DTC works by selecting the feature that best splits the data at each node, based on criteria such as
Information Gain or Gini Impurity. The algorithm begins at the root node, evaluating all possible
splits across all features to determine the most informative partition. This process is recursively
applied to each subsequent node, creating a tree structure that captures the decision-making process.
The recursion continues until a stopping condition is met, such as reaching a maximum tree depth or
when further splits do not significantly improve the model's performance.
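As a small illustration of one splitting criterion, the function below computes the Gini impurity of a node from its class labels; it is a generic sketch rather than code from the project.

import numpy as np

def gini_impurity(class_labels):
    # Gini impurity = 1 - sum(p_k^2); lower values mean purer partitions.
    _, counts = np.unique(class_labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['happy', 'happy', 'sad', 'angry']))  # 0.625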
Architecture of DTC
1. Root Node: The topmost node representing the entire dataset, from which all splits emanate.
2. Internal Nodes: Nodes that represent feature tests, guiding the traversal based on feature
values.
3. Branches: Edges that connect nodes, indicating the outcome of feature tests.
4. Leaf Nodes: Terminal nodes that assign a class label or value based on the majority class or
average value in that partition.
Disadvantages of DTC
Overfitting: Decision trees can create overly complex models that capture noise in the
training data, reducing their ability to generalize to unseen data.
Bias Toward Features with More Levels: Features with a larger number of unique values
can dominate the splitting process, potentially neglecting more informative features.
Instability: Small variations in the data can lead to significantly different tree structures,
affecting the model's consistency.
Limited Expressiveness: Decision trees may struggle to model complex relationships and
interactions between features, limiting their performance on intricate datasets.
Despite these drawbacks, DTC serves as a valuable baseline for evaluating more sophisticated
models like CNNs, providing insights into their relative performance enhancements.
4.3.2 Proposed Algorithm: Convolutional Neural Network (CNN)
What is CNN?
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed to
process and analyze visual data. They are characterized by their ability to automatically and
adaptively learn spatial hierarchies of features through convolutional layers, making them highly
effective for tasks such as image classification, object detection, and facial recognition.
CNNs operate by passing input images through a series of layers that perform convolutions, pooling,
and non-linear transformations. The convolutional layers apply learnable filters to the input,
extracting local patterns such as edges, textures, and shapes. These filters capture spatial hierarchies
by detecting low-level features in early layers and progressively more complex patterns in deeper
layers. Pooling layers reduce the spatial dimensions of the data, enhancing computational efficiency
and providing spatial invariance. Finally, fully connected dense layers integrate the extracted
features to perform classification or regression tasks, outputting predictions based on the learned
representations.
Architecture of CNN
1. Input Layer: Accepts raw image data, typically in the form of multi-dimensional arrays representing pixel values.
2. Convolutional Layers: Apply learnable filters to the input to extract local patterns such as edges, textures, and shapes.
3. Activation Functions: Introduce non-linearity into the model, enabling it to learn complex representations. Common activation functions include ReLU (Rectified Linear Unit).
4. Pooling Layers: Downsample feature maps to reduce spatial dimensions, thereby decreasing computational load and mitigating overfitting. Max pooling and average pooling are common strategies.
5. Fully Connected Layers: Integrate the extracted features and, through a final softmax layer, output probability distributions over the emotion classes.
Advantages of CNN
CNNs offer several advantages that make them highly suitable for image-based tasks:
Automatic Feature Extraction: Unlike traditional machine learning models that rely on
handcrafted features, CNNs learn hierarchical feature representations directly from raw data.
Parameter Sharing: Convolutional layers utilize shared weights, reducing the number of
parameters and enhancing computational efficiency.
Scalability: CNN architectures can be scaled to accommodate larger and more complex
datasets, making them adaptable to a wide range of applications.
Robustness to Noise: The hierarchical feature learning and pooling operations contribute to
the model's resilience against noise and distortions in the input data.
3.3 DESIGN
UML stands for Unified Modeling Language. UML is a standardized, general-purpose modeling language in the field of object-oriented software engineering. The standard was created, and is managed, by the Object Management Group. The goal is for UML to serve as a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems, as well as for business modeling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. The UML is an important part of object-oriented software development, and it uses mostly graphical notations to express the design of software projects.
GOALS: The primary goals in the design of the UML are as follows:
Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
Support higher-level development concepts such as collaborations, frameworks, patterns, and components.
Integrate best practices.
Figure-3.3.1: Class Diagram
Figure-3.3.2: Sequence Diagram
3.3.6 Use Case diagram: A use case diagram in the Unified Modeling Language (UML) is a type of
behavioral diagram defined by and created from a Use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases. The main purpose of a
use case diagram is to show what system functions are performed for which actor. Roles of the
actors in the system can be depicted.
Software Requirements
1. Python Programming Language:
- Version: Recommended to use Python 3.7 or above due to improved library support and
compatibility.
- Why Python? Python's vast array of libraries makes it ideal for handling image data, machine learning, and data processing, which are crucial for facial expression recognition tasks.
2. Python Libraries and Tools:
- NumPy: Essential for array manipulation, this library provides high-performance operations on
multidimensional arrays and matrices, which are foundational for data preprocessing and
transformations.
- Pandas: A data manipulation library that allows easy handling of data structures like DataFrames, which is useful for organizing image labels and emotion categories.
- Matplotlib and Seaborn: For visualization of data distributions, model performance (e.g.,
confusion matrix), and category counts. These tools help in assessing the dataset and model results
graphically.
- Scikit-learn: A machine learning library offering algorithms such as the Decision Tree Classifier and utilities for model evaluation (e.g., precision, recall, F1-score). It is essential for training, testing, and fine-tuning the models.
- TensorFlow (with Keras): Provides the deep learning framework used to build and train convolutional neural network (CNN) models, which are central to the facial expression classification task.
- OpenCV (cv2): Used for reading, resizing, and preprocessing the facial images before they are fed to the models.
- IPython: Provides an interactive interface that helps in code debugging and iterative
development.
- Joblib: Allows saving and loading of models, which is useful when you need to save trained
models and use them later without retraining.
- Imbalanced-learn (SMOTE): For oversampling in case of imbalanced datasets (e.g., some emotion categories may have fewer samples than others), which helps in improving the model's performance.
- LightGBM: A gradient-boosting framework with high performance and efficiency, suitable for handling large datasets and high-dimensional data. It constructs ensembles of decision trees and can improve accuracy over a single decision tree.
3. Operating System:
- Compatibility: Python and the listed libraries are cross-platform compatible. This project can be
executed on Windows, macOS, or Linux systems.
- Preferred OS: Linux is often preferred for machine learning projects due to its resource
efficiency, but Windows and macOS are also viable.
4. Additional Tools:
- Jupyter Notebook or Google Colab: Ideal for experimenting and visualizing results step-by-step,
especially during data exploration, preprocessing, and model training.
- Integrated Development Environment (IDE): Options like PyCharm, Visual Studio Code, or
JupyterLab enhance productivity for code management, debugging, and testing.
Hardware Requirements
1. Processor (CPU):
- Minimum Requirement: Dual-core CPU.
- Recommended: A multi-core processor (quad-core or higher) to handle data-intensive tasks like image processing and machine learning efficiently. If available, a CPU with a higher clock speed (3.0 GHz or above) can further improve performance, especially during model training.
- Why Needed? Processing image data and training models can be CPU-intensive, particularly when dealing with large datasets.
2. Memory (RAM):
- Minimum Requirement: 8GB RAM.
- Recommended: 16GB or more for handling larger datasets smoothly. Higher memory allows better performance for loading image data, processing features, and running machine learning models without significant lag or memory errors.
- Why Needed? Data manipulation and machine learning algorithms consume memory, especially when working with thousands of images and deep learning models.
3. Storage:
- Minimum Requirement: At least 20GB of storage for code, libraries, and small datasets.
- Recommended: Solid-State Drive (SSD) with 100GB or more. An SSD significantly reduces loading times and improves read/write speed, which is beneficial when accessing and saving large image datasets.
- Why Needed? Facial expression datasets can be large, and SSDs speed up data access, enhancing overall project efficiency.
This setup will allow you to develop, test, and potentially deploy a robust facial expression recognition model, leveraging both machine learning and image processing techniques for a scalable and efficient system.
CHAPTER 5
IMPLEMENTATION
Python is a general-purpose language. It has a wide range of applications, from web development (like Django and Bottle) and scientific and mathematical computing (Orange, SymPy, NumPy) to desktop graphical user interfaces (Pygame, Panda3D). The syntax of the language is clean, and the length of the code is relatively short. It is fun to work in Python because it allows you to think about the problem rather than focusing on the syntax.
25
You can freely use and distribute Python, even for commercial use. Not only can you use and distribute software written in it, but you can even make changes to Python's source code. Python has a large community constantly improving it in each iteration.
Portability
You can move Python programs from one platform to another and run them without any changes. Python runs seamlessly on almost all platforms, including Windows, macOS, and Linux.
import os
import numpy as np

# Count the number of images in each emotion category folder
category_counts = {category: len(os.listdir(os.path.join(path, category)))
                   for category in categories}

# Lists that accumulate the evaluation metrics for each model
precision = []
recall = []
fscore = []
accuracy = []

# Emotion class names used throughout the experiments
labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

# Predict the emotion of a preprocessed test image
pred_probability = model.predict(test)
pred_number = np.argmax(pred_probability)
output_name = categories[pred_number]
CHAPTER 6
EXPERIMENTAL RESULTS
6.1 Implementation Description
Importing Libraries: The code begins by importing essential libraries for data handling,
visualization, model training, evaluation, and serialization. Libraries like pandas and numpy are
used for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for
machine learning tasks.
Dataset Loading and Exploration: The facial expression dataset is loaded from its training and testing image directories. Initial exploration checks the number of images available for each emotion category and identifies any missing or corrupted files, which are removed before further processing.
Data Visualization: A count plot of the emotion categories is generated to visualize the distribution of images across the seven expression classes. This helps in understanding the class balance in the dataset.
Label Encoding: Categorical variables in the dataset are encoded into numerical values using
`LabelEncoder`. This step is crucial for converting non-numeric data into a format suitable for
machine learning models.
Data Resampling: The dataset is resampled to handle class imbalance and to ensure that the
models have enough data to learn from. Resampling is done by generating a new dataset with
10,000 samples.
Train-Test Split: The dataset is split into training and testing sets using an 80-20 split. The
training set is used to train the machine learning models, while the test set is used to evaluate
their performance.
Model Building and Evaluation
Decision Tree Classifier: If a pre-trained Decision Tree Classifier model exists, it is
loaded; otherwise, a new model is trained with specific hyperparameters.
The trained model is saved using `joblib` for future use.
Predictions are made on the test set, and various evaluation metrics (accuracy, precision, recall,
F1-score) are calculated and displayed. A confusion matrix is also generated to visualize the
model's performance.
Convolutional Neural Network (CNN): If a pre-trained CNN model exists, it is loaded;
otherwise, a new CNN model is trained with specific layers (Convolution2D, MaxPooling2D,
Flatten, Dense).
The trained model is saved using `joblib` for future use.
Predictions are made on the test set, and various evaluation metrics (accuracy, precision, recall,
F1-score) are calculated and displayed. A confusion matrix is also generated to visualize the
model's performance.
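The load-or-train pattern described for both models can be sketched as below; the file name, hyperparameters, and variable names are placeholders rather than the project's exact values.

import os
import joblib
from sklearn.tree import DecisionTreeClassifier

MODEL_PATH = 'dtc_model.pkl'  # placeholder file name
if os.path.exists(MODEL_PATH):
    dtc = joblib.load(MODEL_PATH)  # reuse the previously trained model
else:
    dtc = DecisionTreeClassifier(max_depth=10, random_state=42)
    dtc.fit(X_train.reshape(len(X_train), -1), y_train)
    joblib.dump(dtc, MODEL_PATH)  # persist for future runs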
Comparison of Models: The performance metrics of both models (Decision Tree Classifier and CNN) are compared. This comparison helps determine which model performs better in facial expression recognition.
Prediction on New Data: A new test image is loaded to evaluate the trained models. The image is preprocessed (resized, normalized) and fed into the trained model to predict the facial expression. The predicted output is displayed on the image.
Figure 10.2: Confusion matrix of CNN
The code evaluates the performance of a classification algorithm by calculating precision, recall, F1-
score, and accuracy, then generates and displays a classification report and confusion matrix. Lists
for precision, recall, F1-score, and accuracy are initialized globally. The `performance_metrics`
function takes an algorithm name, predicted labels (`predict`), and true labels (`testY`) as inputs,
converting both to integer types. It computes precision, recall, F1-score using macro averaging, and
accuracy, appending these metrics to their respective lists. The function prints these metrics and
generates a classification report with specified target names. It also creates a confusion matrix,
visualized as a heatmap using seaborn, with labels on the x and y axes. The provided example calls
this function with a proposed CNN model, using the predicted and true class labels derived from the
model's predictions and the test set respectively.
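A hedged reconstruction of such a performance_metrics routine is sketched below; the project's actual implementation may differ in detail.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report, confusion_matrix)

precision, recall, fscore, accuracy = [], [], [], []
labels = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

def performance_metrics(algorithm, predict, testY):
    predict = predict.astype('int')
    testY = testY.astype('int')
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    a = accuracy_score(testY, predict) * 100
    precision.append(p); recall.append(r); fscore.append(f); accuracy.append(a)
    print(f'{algorithm} Accuracy: {a:.2f} Precision: {p:.2f} Recall: {r:.2f} F1: {f:.2f}')
    print(classification_report(testY, predict, target_names=labels))
    cm = confusion_matrix(testY, predict)
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
    plt.title(algorithm + ' Confusion Matrix')
    plt.xlabel('Predicted Class')
    plt.ylabel('True Class')
    plt.show()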
Figure 10.3: Prediction on test image
The code attempts to load an image from the specified path and processes it for prediction using a
trained model. First, it reads the image twice into `img` and `img1` variables using OpenCV's
`cv2.imread()` function. If either read operation fails, an error message is printed. If successful, the
image is resized to 64x64 pixels, converted to a numpy array, reshaped to match the model's input
dimensions, converted to `float32` type, and normalized by dividing by 255.0. The preprocessed
image is then passed to the model for prediction, obtaining the predicted probability and the
corresponding class label. The code then displays the image with the predicted output label
overlayed using matplotlib.
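A sketch of this prediction step is shown below; the image path is a placeholder, and `model` and `labels` are assumed from the training stage.

import cv2
import numpy as np
import matplotlib.pyplot as plt

img_path = 'test_image.jpg'  # placeholder path
img = cv2.imread(img_path)
if img is None:
    print('Error: could not read', img_path)
else:
    face = cv2.resize(img, (64, 64)).astype('float32') / 255.0
    face = face.reshape(1, 64, 64, 3)  # add the batch dimension
    pred_probability = model.predict(face)
    emotion = labels[int(np.argmax(pred_probability))]
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # BGR -> RGB for display
    plt.title('Predicted: ' + emotion)
    plt.axis('off')
    plt.show()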
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 CONCLUSION
This study presents a comprehensive facial expression recognition dataset, which serves as a critical
resource for advancing emotion analysis and AI development. With 10,000 annotated images
spanning seven primary emotions and diverse demographic attributes, the dataset significantly
enhances the ability to build robust and accurate emotion recognition models. By employing both
machine learning and deep learning techniques, especially convolutional neural networks (CNNs),
we achieved high classification accuracy. The inclusion of metadata such as age, gender, and
ethnicity provides a deeper understanding of how different demographic factors influence emotional
expression. The findings underline the importance of diverse datasets in creating generalizable AI
models capable of accurately identifying subtle emotions across various populations. This dataset
holds great potential for improving human-AI interaction, making AI systems more empathetic and
responsive.
7.2 FUTURE SCOPE
Future work will focus on the following directions:
Dataset Expansion: Increasing the size of the dataset with additional images and more
emotion categories (e.g., contempt, amusement) to capture a wider emotional spectrum.
Multimodal Emotion Analysis: Integrating other data types such as voice and body
gestures to enhance emotion recognition accuracy.
CHAPTER 8
REFERENCES
[1] Y. Nan, J. Ju, Q. Hua, H. Zhang, B. Wang, "A-MobileNet: An approach of facial expression recognition," Alexandria Engineering Journal, vol. 61, no. 6, pp. 4435-4444, 2022.
[2] Z. Li, T. Zhang, X. Jing, Y. Wang, "Facial expression-based analysis on emotion correlations, hotspots, and potential occurrence of urban crimes," Alexandria Engineering Journal, vol. 60, no. 1, pp. 1411-1420, 2021.
[3] K. Mannepalli, P.N. Sastry, M. Suman, "A novel adaptive fractional deep belief networks for
speaker emotion recognition," Alexandria Engineering Journal, vol. 56, no. 4, pp. 485-497, 2017.
[4] G. Tonguç, B.O. Ozkara, "Automatic recognition of student emotions from facial expressions during a lecture," Computers & Education, vol. 148, Article 103797, 2020.
[5] S.S. Yun, J. Choi, S.K. Park, G.Y. Bong, H. Yoo, "Social skills training for children with autism
spectrum disorder using a robotic behavioral intervention system," Autism Research, vol. 10, no. 7,
pp. 1306-1323, 2017.
[6] H. Li, M. Sui, F. Zhao, Z. Zha, F. Wu, "MVT: Mask Vision Transformer for facial expression recognition in the wild," arXiv preprint arXiv:2106.04520, 2021.
[7] X. Liang, L. Xu, W. Zhang, Y. Zhang, J. Liu, Z. Liu, "A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition," Visual Computer, pp. 1-14, 2022.
[8] M. Jeong, B.C. Ko, "Driver’s facial expression recognition in real-time for safe driving,"
Sensors, vol. 18, no. 12, p. 4270, 2018.
[9] K. Kaulard, D.W. Cunningham, H.H. Bülthoff, C. Wallraven, "The MPI facial expression database—a validated database of emotional and conversational facial expressions," PLoS One, vol. 7, no. 3, p. e32321, 2012.
[10] M.R. Ali, T. Myers, E. Wagner, H. Ratnu, E. Dorsey, E. Hoque, "Facial expressions can detect Parkinson's disease: preliminary evidence from videos collected online," npj Digital Medicine, vol. 4, no. 1, pp. 1-4, 2021.
[11] Y. Du, F. Zhang, Y. Wang, T. Bi, J. Qiu, "Perceptual learning of facial expressions," Vision Research, vol. 128, pp. 19-29, 2016.
[12] Varghese et al., "An overview of emotion recognition systems," IEEE conference publication, 2015.
[13] Egger et al., "Emotion recognition from physiological signal analysis: a review," Electronic Notes in Theoretical Computer Science, 2019.
[14] G. Mattavelli, et al., "Facial expressions recognition and discrimination in Parkinson's disease," Journal of Neuropsychology, vol. 15, no. 1, pp. 46-68, 2021.