
Two Stage Job Title Identification-1

Program

Uploaded by

Satish Reddy
Copyright
© All Rights Reserved

A Comparative Study of BERT and CNN2D Techniques for Job

Title Identification

ABSTRACT

Data science techniques are increasingly used to extract insights from large datasets, particularly
in analyzing job market trends by classifying online job advertisements. Traditional multi-label
classification methods, like self-supervised learning and clustering, have shown promise but
often require extensive labeled datasets and focus on specific databases such as O*NET, which is
tailored to the US job market. This paper introduces a two-stage job title identification
methodology designed for smaller datasets. It utilizes Bidirectional Encoder Representations
from Transformers (BERT) to classify job ads by sector and then applies unsupervised learning
and similarity measures to match job titles within the predicted sector. The proposed document
embedding strategy, incorporating weighting and noise removal, enhances accuracy by 23.5%
compared to Bag of Words models. Results indicate a 14% improvement in job title
identification accuracy, achieving over 85% in certain sectors. The study also explores a
two-dimensional convolutional neural network (CNN2D) as an extension, aiming to further improve
classification performance by filtering features through successive network layers.

CONTENTS

S.NO. CHAPTER

I ABSTRACT
II LIST OF FIGURES
III LIST OF SCREENS
1. INTRODUCTION
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
3.3 FUNCTIONAL REQUIREMENTS
3.4 NON-FUNCTIONAL REQUIREMENTS
3.5 FEASIBILITY STUDY
3.6 REQUIREMENT SPECIFICATION
4. SYSTEM DESIGN
4.1 MODULE DESCRIPTION
4.2 SYSTEM ARCHITECTURE
4.3 DATA FLOW DIAGRAM
4.4 UML DIAGRAMS
4.4.1 Use case Diagram
4.4.2 Class Diagram
4.4.3 Sequence Diagram
4.4.4 Collaboration Diagram
5. TECHNOLOGY DESCRIPTIONS
5.1 PYTHON Introduction
6. SAMPLE CODE
7. TESTING
7.1 Introduction
8. SCREENSHOTS
9. CONCLUSION
10. BIBLIOGRAPHY

II
LIST OF FIGURES

S.NO FIGURE

1 ARCHITECTURE
2 DATA FLOW DIAGRAM
3 4.4.1 USE CASE DIAGRAM
4 4.4.2 CLASS DIAGRAM
5 4.4.3 SEQUENCE DIAGRAM
6 4.4.4 COLLABORATION DIAGRAM

III
LIST OF SCREENS

S.NO SCREENS

1 Importing required Python classes and packages
2 Defining code to remove stop words, special symbols etc.
3 Reading and displaying dataset values
4 Finding and plotting graph of various JOBS found
5 Creating BERT and TFIDF object to convert all JOB description into numeric vector
6 Normalizing and applying CHI2 algorithm
7 Splitting data into train and test
8 Defining function to calculate accuracy and other metrics
9 Training SVM on TFIDF features
10 Training Naïve Bayes
11 Training Logistic Regression
12 Training LSTM algorithm
13 Defining extension CNN2D algorithm
14 Training proposed BERT model
15 Training extension CNN2D algorithm
16 Displaying all algorithm performance
17 Displaying all algorithms performance in tabular format
18 Reading JOB description from TEST data and then predicting JOB TITLE

OBJECTIVE:

Develop a Two-Stage Job Title Identification Framework using BERT and Unsupervised
Methods to Improve Accuracy with Small Datasets.

MOTIVATION:

This project aims to classify job advertisements efficiently and accurately by combining BERT with
unsupervised matching. Its impact extends to identifying emerging occupations, aiding the Moroccan
job market's growth and adaptability.

PROBLEM STATEMENT:

Current job market analysis techniques lack adaptability to small datasets and non-US contexts.
Existing methods demand vast labeled data, limiting applicability outside specific databases and
nations, hindering global job market analysis and emerging occupation identification.

SCOPE OF WORK:

This project focuses on enhancing job title identification using advanced techniques, addressing
limitations of small datasets and adapting to diverse sectors, aiming to significantly improve
accuracy in job market analysis.

1 INTRODUCTION
The rapid expansion of the Internet and the rise of social media have led to an enormous amount
of data, demanding swift and efficient processing to extract valuable insights for decision-
making. Data science techniques play a crucial role in this context by enabling the analysis and
classification of diverse data types, such as text, images, and video, and can significantly
improve upon traditional, resource-intensive methods.

The job market has similarly transitioned to online platforms, with employers and recruiters
posting job advertisements across various websites to reach a broader audience. This digital shift
presents a valuable opportunity to analyze job market trends and understand the specific needs in
terms of skills and occupations. Such insights are beneficial not only for labor market analysts
and policymakers aiming to enhance employment strategies but also for job seekers and students
seeking relevant career opportunities and necessary training.

Classifying online job advertisements is a complex task due to the unstructured or semi-
structured nature of the data. Job ads often use varied lexicons and include extraneous details that
can obscure relevant information. For example, job titles may mention location or salary, while
descriptions might contain irrelevant company details. Addressing these challenges requires
advanced techniques for text and document representation and novel feature extraction methods.
Previous approaches to occupation normalization typically rely on classification or clustering
algorithms.

Several methods have been explored for this task, from traditional machine learning models like
Support Vector Machines (SVM) and Naïve Bayes to more sophisticated deep learning models
such as Bidirectional Encoder Representations from Transformers (BERT). Studies have shown
mixed results, with some focusing only on job titles or descriptions, revealing limitations such as
insufficient information in titles or ambiguity in descriptions. Moreover, existing methodologies
often require extensive labeled datasets, which are labor-intensive to create and difficult to
update, especially for new or evolving occupations.

Most prior research has concentrated on English-language job ads and specific occupational
classifiers like O*NET, which presents challenges in applying these methods to job ads in other
languages. Unsupervised models, such as clustering and field similarity approaches, offer an
alternative by bypassing the need for labeled data, which is particularly useful given the vast
number of occupations.

Traditional text representation techniques, such as Bag of Words (BOW) and Term Frequency-
Inverse Document Frequency (TFIDF), have limitations in capturing the nuanced semantic
relationships between words. Thus, there is a need for more sophisticated embedding approaches
and feature extraction techniques to improve classification accuracy.

In this paper, we propose a novel job title identification methodology that combines self-
supervised and unsupervised machine learning algorithms with minimal labeling. Our approach
consists of two stages: first, classifying job ads by sector using text classifiers like SVM, Naïve
Bayes, Logistic Regression, and BERT; and second, matching job ads with occupations within
the predicted sector. We develop a customized document embedding strategy and experiment
with various feature selection methods to extract key information from job descriptions and
titles. By calculating the similarity between job ad representations and occupation
representations, our methodology aims to enhance job title identification accuracy and can be
applied to data from various countries.

HOW MACHINE LEARNING WORKS

Machine learning (ML), a cornerstone of artificial intelligence (AI), revolutionizes how systems learn and improve over time through data
and algorithms, mirroring human learning processes to enhance accuracy and efficiency. As an
integral part of the burgeoning field of data science, machine learning harnesses statistical
methods to train algorithms for classification and prediction, unlocking invaluable insights
within data mining endeavors. These insights not only inform decision-making processes across
various applications and businesses but also hold the potential to significantly impact key growth
metrics.

At the heart of machine learning lie algorithms that drive the decision-making process. These
algorithms analyze data to produce estimates or classifications regarding patterns within the
data. An essential component of this process is the utilization of an error function to evaluate the
model's predictions, comparing them to known examples to assess accuracy. Through an iterative
process of evaluation and optimization, wherein model weights are adjusted to minimize
discrepancies between predictions and actual outcomes, machine learning algorithms
autonomously refine their performance until a desired level of accuracy is achieved.
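
The evaluate-and-adjust loop described above can be sketched for the simplest possible case: a model with a single weight, trained by gradient descent on a mean squared error. The toy data, learning rate, and function names are illustrative assumptions and are not taken from the project code.

```python
def mean_squared_error(weight, xs, ys):
    """Error function: average squared gap between predictions and known targets."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_weight(xs, ys, learning_rate=0.01, steps=500):
    """Repeatedly nudge the weight against the gradient of the error."""
    weight = 0.0
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * x * (w*x - y))
        gradient = sum(2 * x * (weight * x - y) for x, y in zip(xs, ys)) / len(xs)
        weight -= learning_rate * gradient
    return weight


# Toy data generated from y = 3x (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
learned = fit_weight(xs, ys)
print(learned, mean_squared_error(learned, xs, ys))
```

Real models have many weights and usually rely on a library optimizer, but the cycle of predicting, measuring the error, and adjusting is the same.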

Machine learning methods can broadly be categorized into three main types:

Supervised Machine Learning:

Supervised learning relies on labeled datasets to train algorithms for accurate data classification
and outcome prediction. As the model ingests input data, it adjusts its parameters to fit the dataset,
and cross-validation is used to guard against overfitting or underfitting. Supervised learning is
instrumental in addressing real-world challenges at scale, such as email spam detection or sentiment
analysis. Common supervised learning techniques include neural networks, linear regression, logistic
regression (LR), random forests (RF), and support vector machines (SVM).

Unsupervised Machine Learning:

Unsupervised learning involves the analysis and clustering of unlabeled datasets to expose
hidden patterns or groupings without human intervention. These algorithms excel at exploratory
data analysis, customer segmentation, and feature reduction, making them invaluable for tasks
such as recommendation systems and image recognition. Examples of unsupervised learning
algorithms include k-means clustering, probabilistic clustering methods, and neural networks.

Semi-supervised Machine Learning:

Semi-supervised learning strikes a balance between supervised and unsupervised approaches by
using a small labeled dataset to guide classification and feature extraction from a larger, unlabeled
dataset. This method proves particularly useful in scenarios where labeled data is limited or costly to
obtain, enhancing the efficiency of supervised learning algorithms.

The practical applications of machine learning span a diverse array of fields and industries, each
harnessing its capabilities to drive innovation and efficiency:

Speech Recognition:

Automatic speech recognition (ASR) technologies, powered by natural language processing (NLP),
convert human speech into text format, enabling features like voice search and accessibility
enhancements on mobile devices.

Customer Service:

Online chatbots leverage machine learning algorithms to deliver personalized customer service
experiences, providing assistance, answering queries, and facilitating transactions across various
digital platforms.

Computer Vision:

Computer vision systems interpret and examine digital images and videos, enabling tasks such as
photo tagging on social media, medical imaging analysis, and object detection for autonomous
vehicles.

Recommendation Engines:

AI-powered recommendation engines analyze past consumer behavior to generate personalized
product recommendations, enhancing cross-selling strategies and customer engagement in
e-commerce and digital retail environments.

Automated Stock Trading:

High-frequency trading platforms utilize machine learning algorithms to optimize stock
portfolios and execute trades autonomously, leveraging data trends to drive investment decisions.

WHAT IS DEEP LEARNING

Deep learning, a cornerstone of artificial intelligence (AI), has garnered significant attention due
to the growing interest in AI. This technique enhances our ability to classify, recognize, detect,
and describe various forms of data, thereby aiding in a deeper understanding of complex
information. Deep learning is instrumental in tasks such as image classification, speech
recognition, object detection, and content description.

Several key developments are propelling deep learning forward. Firstly, algorithmic
improvements have significantly boosted the performance of deep learning methods, making
them more efficient and effective. Additionally, new machine learning approaches have been
developed, leading to increased model accuracy. The emergence of new classes of neural
networks tailored for specific applications, such as text translation and image classification, has
further expanded the scope of deep learning.

A crucial factor in the advancement of deep learning is the availability of vast amounts of data.
This data includes streaming data from the Internet of Things (IoT), textual data from social
media, physicians' notes, and investigative transcripts. The abundance of data enables the
construction of neural networks with numerous deep layers, enhancing their learning capabilities.

Moreover, computational advances have played a pivotal role in the evolution of deep learning.
Distributed cloud computing and graphics processing units (GPUs) have provided unprecedented
computing power, essential for training deep learning algorithms. This level of computing power
allows for the processing of large datasets and the execution of complex calculations required by
deep learning models.

In parallel, human-to-machine interfaces have seen significant advancements. Traditional input
devices like the mouse and keyboard are being supplemented or replaced by more intuitive
interfaces such as gesture, swipe, touch, and natural language. These innovations have not only
improved the user experience but also spurred renewed interest in AI and deep learning.

The combined impact of these advancements is profound. Algorithmic enhancements, new
neural network architectures, vast data availability, and powerful computational resources have
collectively driven the remarkable progress in deep learning. These developments enable deep
learning models to achieve higher accuracy and better performance across various applications,
from image and speech recognition to more sophisticated tasks like natural language processing
and predictive analytics.

As deep learning continues to evolve, it promises to unlock new possibilities and applications,
further integrating AI into our daily lives and various industries. The ongoing advancements in
deep learning will likely lead to even more significant breakthroughs, pushing the boundaries of
what AI can achieve and how it can be utilized to solve complex problems and improve our
understanding of the world.

HOW DEEP LEARNING WORKS

Deep learning revolutionizes problem-solving in analytics by shifting from instructing computers
on how to solve problems to training them to solve problems autonomously. Traditional analytics
involve using existing data to engineer features, derive new variables, select a model, and
estimate its parameters. This method often results in predictive systems that don't generalize well
because their effectiveness hinges on the quality of the model and its features. For instance,
developing a fraud detection model typically starts with a set of variables. Through data
transformations, you might derive a model dependent on thousands of variables, necessitating a
complex process of determining which variables are meaningful. This process must be repeated
whenever new data is introduced.

Deep learning transforms this approach by replacing model formulation and specification with
hierarchical characterizations, or layers, that learn to identify latent features within the data. This
shift moves from feature engineering to feature representation, allowing deep learning models to
generalize better, adapt more effectively, and continuously improve with new data. Instead of
fitting a static model, you train a task-specific system that evolves over time.

Deep learning is profoundly impacting various industries. In life sciences, it facilitates advanced
image analysis, aids in drug discovery, predicts health issues, identifies disease symptoms, and
accelerates insights from genomic sequencing. In transportation, deep learning helps autonomous
vehicles adapt to changing conditions, enhancing safety and efficiency. It also plays a crucial
role in protecting critical infrastructure and expediting response times in emergencies.

The hierarchical nature of deep learning models enables them to automatically learn complex
patterns from large datasets, which is a significant departure from traditional analytics that rely
heavily on manual feature engineering. This capability makes deep learning models more
dynamic and scalable, as they can handle vast amounts of data and improve as more data
becomes available.

By continuously learning and adapting, deep learning systems can provide more accurate and
reliable predictions, which is particularly valuable in fields requiring high precision and
adaptability. For example, in healthcare, deep learning models can analyze medical images to
detect diseases at early stages, potentially saving lives through early intervention. In the realm of
autonomous vehicles, these models can process real-time data to make split-second decisions,
improving safety and navigation.

In summary, deep learning changes the landscape of analytics by enabling systems to learn and
improve autonomously. This approach leads to more robust, adaptable, and dynamic predictive
models that can revolutionize industries ranging from healthcare to transportation. As deep
learning continues to evolve, its applications will likely expand, offering new solutions to
complex problems and enhancing our ability to make data-driven decisions.

HOW DEEP LEARNING IS BEING USED

Speech Recognition

Deep learning has been widely adopted in both business and academia for speech recognition.
Major technologies such as Xbox, Skype, Google Now, and Apple’s Siri utilize deep learning to
accurately recognize and interpret human speech and voice patterns, enhancing user interactions
and accessibility.

Natural Language Processing

Deep learning’s neural networks have long been instrumental in processing and analyzing
written text. This specialization of text mining uncovers patterns in diverse sources such as
customer complaints, physician notes, and news reports, enabling better understanding and
decision-making.

Image Recognition

Image recognition powered by deep learning has practical applications like automatic image
captioning and scene description. This technology is crucial in law enforcement, helping to
identify criminal activity from numerous photos submitted by bystanders. Additionally, self-
driving cars rely on image recognition, utilizing 360-degree camera technology to navigate and
ensure safety.

Recommendation Systems

Recommendation systems have become a staple in platforms like Amazon and Netflix,
predicting user interests based on past behavior. Deep learning enhances these systems,
providing more accurate recommendations in complex environments such as music, clothing,
and other preferences across multiple platforms.

Automated Driving

In the automotive industry, deep learning is crucial for the development of automated driving
technologies. It aids in the automatic detection of objects like stop signs and traffic lights and
enhances pedestrian detection, significantly reducing accident rates and improving road safety.

Aerospace and Defence

Deep learning is employed in aerospace and defense to identify objects from satellite images,
locating areas of interest and distinguishing between safe and unsafe zones for troops. This
application improves strategic planning and operational efficiency.

Medical Research

In the medical field, deep learning is revolutionizing cancer research by enabling the automatic
detection of cancer cells. Researchers at UCLA, for instance, have developed an advanced
microscope that produces high-dimensional data sets to train deep learning models, accurately
identifying cancer cells and potentially improving diagnostic processes.

Industrial Automation

Deep learning enhances industrial automation by improving worker safety around heavy
machinery. It automatically detects when people or objects are within unsafe distances of
machines, preventing accidents and ensuring a safer work environment.

Electronics

In electronics, deep learning powers automated hearing and speech translation. Home assistance
devices that respond to voice commands and learn user preferences are prime examples of deep
learning applications, making daily tasks more convenient and personalized.

2. LITERATURE SURVEY

2.1] M. Zhao

In the online job recruitment field, accurately categorizing job titles and resumes is crucial for
connecting job seekers with suitable positions. Machine learning offers effective solutions for
text and image classification. This paper introduces Carotene, a semi-supervised job title
classification system in use at CareerBuilder. Carotene employs a two-stage cascade classifier,
combining classification and clustering techniques to handle a large job taxonomy. We discuss
Carotene's architecture, compare it with earlier versions and third-party systems, and present
experimental results using machine learning metrics and user experience surveys.

2.2] N. Sbihi

Poor student orientation can lead to skill mismatches and high unemployment rates, as students
often choose education paths without understanding job market needs. Existing resources, such
as career centers and advice from family, often lack comprehensive market insights. This paper
addresses this gap by collecting data from job portals and university websites to align university
training with current job market demands. Focusing on the IT sector, we use machine learning
and text analysis to match job ads with relevant university programs. Our findings reveal in-
demand programming roles and emerging positions like data scientists, highlighting some
curriculum mismatches. A Dashboard will be developed to guide students towards better career
alignment.

2.3] F. Mercorio

The Web has become a valuable resource for labor market data, with an increasing number of
job postings appearing on online platforms. This paper evaluates and contrasts various
classification methods—explicit rules, machine learning, and LDA-based algorithms—using a
diverse dataset of job offers gathered from 12 different sources. The goal is to classify these
offers according to a standardized occupation framework, offering insights into the effectiveness
of these techniques for organizing job market data.

2.4] G. Mezzour

The offshore sector in Morocco offers numerous job opportunities, but analyzing related job ads is
challenging due to their unstructured nature. This study examines job ads from February to August 2017,
utilizing machine learning and text mining techniques. We analyze required skills, including natural and
programming languages, education level, experience, contract types, and salaries. Our findings highlight
that French is crucial for offshore roles, with English and Spanish also valued. Development and web
design are key IT roles, with Java, SQL, JavaScript, and PHP being the most sought-after programming
languages.

2.5] V. Guliashki

This paper presents a hybrid approach that merges k-NN and SVM machine learning techniques to
identify job titles with similar descriptions and industries. This innovative method enhances both the
accuracy and efficiency of the candidate selection process, streamlining the task of matching job titles to
suitable candidates. By integrating these two methods, the approach improves the overall effectiveness of
job title classification, making it faster and more precise.

3 SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

Data science algorithms are often used to extract useful knowledge from unstructured text data,
for example identifying a job title by analysing the text of a job description. Existing algorithms
depend heavily on large labeled datasets to classify well, and gathering that many labels requires
considerable expertise and time. They also rely on Occupational Information Network (O*NET)
data from the US job market and do not apply any additional technique to improve accuracy.

3.1.1 Disadvantages of Existing System

1. Time-consuming data gathering.
2. Limited to US job market.
3. Lack of accuracy improvement.
4. Algorithm complexity.
5. Dependency on large datasets.

3.2 PROPOSED SYSTEM

To address the challenges in job title identification, this paper proposes a two-stage approach. In
the first stage, Bidirectional Encoder Representations from Transformers (BERT) is employed to
classify job ads into relevant sectors, such as Information Technology or Agriculture. BERT
converts unstructured text into numeric vectors while capturing semantic similarities. In the
second stage, the Euclidean Distance algorithm is used to measure the similarity between the job
ads and potential job titles, finding the closest match even with a small number of labels.
Compared to traditional models like TFIDF and WORD2VEC, the BERT-based approach
combined with Euclidean Distance achieves higher accuracy in job title identification.
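
The second stage can be sketched as a nearest-neighbour lookup restricted to the predicted sector. In this minimal sketch the three-dimensional vectors are made-up stand-ins for BERT embeddings, the sector is assumed to have already been predicted by the first stage, and `match_job_title` and `titles_by_sector` are hypothetical names introduced only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_job_title(ad_vector, predicted_sector, titles_by_sector):
    """Return the title in the predicted sector whose vector is closest to the ad."""
    candidates = titles_by_sector[predicted_sector]
    return min(candidates, key=lambda title: euclidean_distance(ad_vector, candidates[title]))


# Hypothetical 3-dimensional vectors standing in for BERT embeddings.
titles_by_sector = {
    "Information Technology": {
        "Web Developer": [0.9, 0.1, 0.0],
        "Data Scientist": [0.1, 0.9, 0.2],
    },
    "Agriculture": {
        "Farm Manager": [0.0, 0.2, 0.9],
    },
}

ad_vector = [0.2, 0.8, 0.1]  # embedding of one job ad (made up)
best_title = match_job_title(ad_vector, "Information Technology", titles_by_sector)
print(best_title)
```

Restricting the search to one sector keeps the candidate set small, which is why a handful of labeled titles per sector can be enough for matching.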

3.2.1 Advantages of Proposed System

1. Improved accuracy.
2. Adaptability to small datasets.
3. Enhanced classification.
4. Benefits the analysis of evolving job markets.
5. High performance.

3.4 REQUIREMENT SPECIFICATION

Hardware Requirements

➢ PROCESSOR - Intel Core i3/i5

➢ RAM - 8 GB

➢ HARD DISK (SSD) - 500 GB

Software Requirements

➢ Operating system : Windows 11, Windows 10
➢ Coding Language : PYTHON
➢ Documentation : MS Office

4 SYSTEM DESIGN
4.1 METHODOLOGY
Importing Python Classes and Packages

The journey of any data-driven project begins with importing essential Python libraries and
classes. For tasks involving data manipulation, natural language processing (NLP), and machine
learning model implementation, several key packages are utilized:

• pandas: Provides data structures and data analysis tools, making it indispensable for
handling and manipulating tabular data.
• NumPy: Offers support for large multi-dimensional arrays and matrices, along with a
collection of mathematical functions.
• scikit-learn: A versatile machine learning library that includes tools for model training,
evaluation, and preprocessing.
• TensorFlow and Keras: These libraries facilitate building and training deep learning
models, such as neural networks.
• Transformers: From the Hugging Face library, this package is crucial for utilizing
advanced NLP models like BERT.
• matplotlib and seaborn: Essential for creating visualizations and plots to analyze data
distributions and model performance.

These libraries collectively enable comprehensive data analysis and model development,
providing a robust framework for tackling complex tasks.

Text Preprocessing

Text preprocessing is a critical step in preparing job descriptions for analysis and model training.
The goal is to clean the text data by removing elements that do not contribute to the prediction of
job titles. This involves:

• Removing Stop Words: Words such as "the," "and," or "is" are common but do not
provide meaningful information for the prediction task. Removing them helps in focusing
on more relevant terms.
• Eliminating Special Symbols: Symbols like punctuation marks or special characters that
do not contribute to the text's semantic meaning are discarded.
• Stripping Irrelevant Elements: Any other irrelevant elements, such as extra spaces or
HTML tags, are removed to ensure the text data is clean and formatted consistently.

This preprocessing ensures that the text data is in a suitable form for subsequent analysis and
feature extraction.
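
The cleaning steps above can be sketched with the standard library alone. The stop-word list here is a tiny illustrative assumption, and `clean_text` is a hypothetical helper name rather than the project's own function.

```python
import re

# A tiny illustrative stop-word list; real projects typically use a fuller one.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "for", "we", "are"}


def clean_text(text):
    """Drop HTML tags and special symbols, lowercase, then remove stop words."""
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)  # split/join also collapses extra spaces


cleaned = clean_text("<p>We are hiring a  Senior Python Developer (Remote)!</p>")
print(cleaned)
```

Keeping only letters also drops digits and symbols, so terms such as "C++" or "3D" would lose information; the pattern can be loosened if such terms matter for the job titles in question.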

Dataset Exploration

Exploring the dataset is crucial for understanding its structure and content. This involves:

• Reading the Dataset: The job descriptions dataset is loaded into a DataFrame, allowing
for examination of its structure, including columns, data types, and sample entries.
• Understanding Distribution: Initial analysis helps in understanding how job titles are
distributed across the dataset. This might include examining the frequency of different
job titles and identifying any imbalances or biases in the data.

Exploratory analysis provides insights into the data’s characteristics and helps in tailoring the
preprocessing and feature extraction steps to improve model performance.

Graph Plotting for Job Titles

Visualizing the distribution of job titles helps in understanding their frequency and prevalence
within the dataset. This is achieved through:

• Graph Plotting: A bar graph is plotted, where the x-axis represents various job titles and
the y-axis shows their respective counts. This visualization offers a clear view of the most
common and rare job titles, highlighting any trends or anomalies.

Such graphs are instrumental in identifying which job titles are most prevalent and whether any
job titles are underrepresented, which might impact the model's training.
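
The counts behind such a bar graph can be sketched as follows. The list of titles is a made-up stand-in for the dataset's job-title column, and the text bars are only a dependency-free substitute for a plotted chart.

```python
from collections import Counter

# Hypothetical job-title labels, standing in for the dataset's title column.
job_titles = ["Web Developer", "Data Scientist", "Web Developer",
              "Farm Manager", "Web Developer", "Data Scientist"]

title_counts = Counter(job_titles)

# Print a text bar chart, most frequent title first.
for title, count in title_counts.most_common():
    print(f"{title:<15} {'#' * count} ({count})")
```

The same titles and counts can be passed to a plotting function such as `matplotlib.pyplot.bar` to produce the graph described above.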

Feature Extraction using BERT and TF-IDF

Feature extraction transforms text data into a numeric format that machine learning models can
process:

• BERT (Bidirectional Encoder Representations from Transformers): BERT is a
sophisticated NLP model that captures context from both directions in a sentence,
providing rich, contextualized word embeddings. It generates high-dimensional vectors
for job descriptions, capturing nuanced meanings and relationships in the text.
• TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a statistical
measure that reflects the importance of a word in a document relative to a collection of
documents. It transforms text into numeric vectors based on word frequency and
document rarity, helping in identifying significant terms.

Both BERT and TF-IDF are applied to the job descriptions to convert them into feature vectors
suitable for machine learning model training.
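
To make the TF-IDF weighting concrete, the following sketch computes it by hand using the textbook definitions tf = count / document length and idf = log(N / document frequency). The three documents are made up, and `tfidf_vectors` is a hypothetical helper. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ from these.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf = count/len and idf = log(N/df)."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = ["python developer web", "python data scientist", "farm manager crops"]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
print(vectors[0])
```

In the first document, "python" receives a lower weight than "web" because it also appears in the second document, which is the rarity effect the text describes.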

Normalization and CHI2 Algorithm

Normalization: After feature extraction, the next step is to normalize the features to ensure that
all variables contribute equally to the model's performance. This step adjusts the scale of
features, making them comparable and improving the stability and convergence of machine
learning models.

CHI2 Algorithm: The CHI2 (Chi-squared) test is applied to the features to evaluate their
importance. This statistical test assesses the independence of features from the target variable,
helping to select the most relevant features and improve model accuracy by focusing on the most
significant ones.
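
A minimal sketch of both steps follows. It assumes min-max scaling as the normalization (the text does not name the method) and computes the chi-squared statistic the way it is commonly defined for non-negative features: per class, compare the observed feature total with the total expected if the feature were independent of the class. The data and helper names are illustrative. In scikit-learn the corresponding tools are `MinMaxScaler`, `chi2`, and `SelectKBest`.

```python
def min_max_normalize(rows):
    """Rescale each feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def chi2_scores(rows, labels):
    """Chi-squared statistic of each non-negative feature against the class labels."""
    n_samples, n_features = len(rows), len(rows[0])
    totals = [sum(row[j] for row in rows) for j in range(n_features)]
    scores = [0.0] * n_features
    for cls in set(labels):
        members = [row for row, label in zip(rows, labels) if label == cls]
        class_prob = len(members) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in members)
            expected = class_prob * totals[j]  # total expected under independence
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


# Toy features: column 0 tracks the sector label, column 1 does not.
features = [[5.0, 2.0], [4.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
sectors = ["IT", "IT", "Agriculture", "Agriculture"]

normalized = min_max_normalize(features)
scores = chi2_scores(normalized, sectors)
print(scores)
```

The chi-squared statistic is only defined for non-negative inputs, which is one practical reason to scale features into [0, 1] before applying it. Features with higher scores are kept; column 0 scores far higher than column 1 here.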

Data Splitting and Model Evaluation

Data Splitting: The dataset is divided into training and testing sets. Typically, an 80-20 split is
used, where 80% of the data is used for training the model, and 20% is reserved for testing. This
ensures that the model is evaluated on unseen data, providing a realistic measure of its
performance.
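
An 80-20 split can be sketched with a seeded shuffle so that the result is reproducible. `split_train_test` is a hypothetical helper written for illustration; scikit-learn's `train_test_split` provides the same functionality. The descriptions and titles are placeholders.

```python
import random


def split_train_test(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out test_fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [samples[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Placeholder data: ten descriptions with matching titles.
descriptions = [f"job description {i}" for i in range(10)]
titles = [f"title {i}" for i in range(10)]
X_train, X_test, y_train, y_test = split_train_test(descriptions, titles)
print(len(X_train), len(X_test))
```

Shuffling indices rather than the lists themselves keeps each description paired with its own title in both subsets.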

Model Evaluation: Various evaluation metrics are computed to assess the model’s
effectiveness:

• Accuracy: Measures the proportion of correctly predicted job titles.
• Precision: For each job title, the share of ads predicted as that title that truly belong to it.
• Recall: For each job title, the share of ads that truly have that title and were predicted as such.
• Confusion Matrices: Provide a visual representation of prediction results, showing true
positives, false positives, true negatives, and false negatives.

These metrics help in understanding the model’s performance and identifying areas for
improvement.
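The following sketch shows how these metrics relate to the confusion matrix, assuming integer class labels and macro averaging (the averaging mode used by the sample code in Section 6). The six true and predicted labels are invented for the example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_scores(matrix):
    """Return (accuracy, macro precision, macro recall) from a confusion matrix."""
    n = len(matrix)
    precisions, recalls = [], []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    accuracy = sum(matrix[c][c] for c in range(n)) / sum(map(sum, matrix))
    return accuracy, sum(precisions) / n, sum(recalls) / n


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, 3)
accuracy, precision, recall = macro_scores(matrix)
print(matrix, accuracy, precision, recall)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total, while precision reads down a column and recall reads along a row.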

Model Training and Evaluation

Different machine learning algorithms are trained and evaluated using the extracted features:

 SVM (Support Vector Machine): A powerful classifier that finds the optimal
hyperplane to separate different job titles.
 Naïve Bayes: A probabilistic classifier based on Bayes' theorem, effective for text
classification tasks.
 Logistic Regression: A statistical model used for binary classification, which can be
extended to handle multi-class problems.
 BERT: Sentence embeddings from a pre-trained BERT-family model are compared with a
similarity measure, and each job description receives the title of its most similar
labelled description.
 CNN2D (Convolutional Neural Network): Although typically used for image analysis,
CNNs can be adapted for text classification by reshaping each feature vector into a
small image-like grid.

The performance of each model is analyzed using metrics like accuracy, precision, recall, and
confusion matrices to determine the most effective approach for job title prediction.
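The maximum-similarity matching used with the BERT vectors can be sketched as a nearest-neighbour search under cosine similarity. The two-dimensional vectors and the two job titles below are toy values chosen for readability; real sentence embeddings have hundreds of dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_by_max_similarity(query, reference_vectors, reference_labels):
    """Return the label of the reference vector most similar to the query."""
    best_label, best_score = None, -1.0
    for vector, label in zip(reference_vectors, reference_labels):
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


reference_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_labels = ["Data Engineer", "Cloud Architect"]
label, score = predict_by_max_similarity([0.9, 0.1], reference_vectors, reference_labels)
print(label, round(score, 4))
```

Because no parameters are learned, the quality of this approach depends entirely on how well the embeddings place similar job descriptions close together.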

Performance Visualization

The performance of various algorithms is visualized to facilitate comparison:

 Graphs: Bar graphs or line plots display the accuracy and other metrics of different
models. The x-axis represents the names of the algorithms, while the y-axis shows their
performance metrics.
 Tabular Format: A table is used to present the performance metrics of each model,
allowing for easy comparison and evaluation.

These visualizations help in understanding which algorithms perform best and in making data-
driven decisions about model selection.
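As a minimal, plotting-free illustration of such a comparison, the sketch below renders one text bar per algorithm. The accuracy values are the rounded figures reported with the screenshots in Section 8; the function name text_bar_chart and the bar width are arbitrary choices for this example, and the project itself draws its graphs with pandas and matplotlib.

```python
def text_bar_chart(scores, width=40):
    """Return one line per algorithm with a bar proportional to its score (0-100)."""
    name_width = max(len(name) for name in scores)
    lines = []
    for name, value in scores.items():
        bar = "#" * round(value / 100 * width)
        lines.append(f"{name.ljust(name_width)} | {bar} {value:.0f}%")
    return lines


# Rounded accuracy values as reported in the screenshots section.
accuracy = {
    "SVM": 83,
    "Naive Bayes": 51,
    "Logistic Regression": 84,
    "Propose BERT": 88,
    "Extension CNN2D": 96,
}
chart = text_bar_chart(accuracy)
print("\n".join(chart))
```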

Prediction on Test Data

The final step involves using the trained models to predict job titles based on job descriptions
from the test dataset. The predicted titles are compared with the actual titles to evaluate the
model’s real-world performance and effectiveness in accurately classifying job descriptions.
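The last step of a prediction, turning the classifier's per-class scores into a job title and comparing it with the actual title, might look like the sketch below. The class labels, the score vector and the function name predict_title are hypothetical; in the project the scores come from the trained model.

```python
def predict_title(probabilities, class_labels):
    """Return the label with the highest score together with that score."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_labels[best], probabilities[best]


# Hypothetical class labels and model output for one job description.
class_labels = ["Big Data Engineer", "Cloud Architect", "Data Analyst"]
model_output = [0.08, 0.85, 0.07]
actual_title = "Cloud Architect"

predicted_title, confidence = predict_title(model_output, class_labels)
is_correct = predicted_title == actual_title
print(predicted_title, confidence, is_correct)
```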

Extension

In the base paper, traditional machine learning algorithms such as SVM, Naïve Bayes, and
Logistic Regression were employed, but advanced algorithms such as CNN2D and Bi-LSTM were
not explored. As an extension, CNN2D has been experimented with in this work. CNN2D passes
the features through successive convolution and pooling layers, so the model trains on the most
relevant filtered features, which helps in achieving higher accuracy. This exploration of
advanced algorithms demonstrates their potential in improving job title prediction and offers
valuable insights into their effectiveness compared to traditional methods.
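To make the layer arithmetic concrete, the sketch below traces the tensor shapes through the network defined in the sample code of Section 6: a 300-feature vector reshaped to 10 x 10 x 3, followed by two blocks of 3 x 3 convolution with 32 filters and 2 x 2 max pooling. It assumes unit stride and no padding for the convolutions, which are the Keras defaults; the function name trace_cnn_shapes is made up for this illustration.

```python
def trace_cnn_shapes(height, width, channels, filters=32, kernel=3, pool=2, blocks=2):
    """List the output shape after each convolution, pooling and flatten step."""
    shapes = [("input", (height, width, channels))]
    for block in range(1, blocks + 1):
        # Unpadded convolution with stride 1 shrinks each side by kernel - 1.
        height, width = height - kernel + 1, width - kernel + 1
        channels = filters
        shapes.append((f"conv{block}", (height, width, channels)))
        # Max pooling with stride equal to the pool size divides each side.
        height, width = height // pool, width // pool
        shapes.append((f"pool{block}", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    return shapes


shapes = trace_cnn_shapes(10, 10, 3)
for name, shape in shapes:
    print(name, shape)
```

Under these assumptions the 10 x 10 x 3 input is reduced to a 32-value vector before the dense layers, so the second pooling step leaves exactly one spatial position per filter.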

JOB TITLE DATASET:

To train the algorithms, the author of the base paper built a private dataset that has not been
published online, so we have used the Job Title Description dataset from Kaggle instead. The
dataset details are given below.

In the dataset, the first row contains the column names and the remaining rows contain the
values. The columns include Job Title, Name and Description, and this dataset is used to train
and test all the algorithms.
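A small sketch of reading such a file and counting the rows per job class is given below. The column names Query, Job Title and Description are the ones accessed by the sample code in Section 6; the three data rows are invented, and an in-memory string stands in for the CSV file.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the CSV file; the rows are made up for illustration.
sample_csv = io.StringIO(
    "Query,Job Title,Description\n"
    "Data Engineer,Senior Data Engineer,Build and maintain data pipelines\n"
    "Cloud Architect,Cloud Solutions Architect,Design cloud infrastructure\n"
    "Data Engineer,Data Engineer II,Develop ETL jobs\n"
)

rows = list(csv.DictReader(sample_csv))       # first row supplies the column names
class_counts = Counter(row["Query"] for row in rows)

for job_class, count in class_counts.most_common():
    print(job_class, count)
```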

4.2 SYSTEM ARCHITECTURE:

FIG1: Architecture

4.4 UML DIAGRAMS

 The Unified Modeling Language (UML) is a pivotal tool in object-oriented software
engineering. Its primary objective is to become a universal language for modeling software
systems. Presently, UML consists of two core components: a meta-model and a notation,
with potential future enhancements on the horizon.

 UML isn't confined to software engineering; it extends its reach to business modeling and
various non-software domains. It encapsulates a set of proven design practices essential
for managing complex systems. As an integral part of software development, UML
employs graphical notations to streamline project visualization and communication.

 UML facilitates the creation and documentation of software artifacts, playing a crucial
role in software development and engineering processes.

GOALS:

 Provide tools for extending core concepts and customization.

 Ensure independence from specific programming languages and methodologies.

 Offer a formal framework for understanding the modeling language.

4.4.1 USE CASE DIAGRAM
A use case diagram in the Unified Modeling Language (UML) illustrates system behavior through
actor-goal relationships. It visually represents how actors interact with system functions,
known as use cases, and any relationships between them. Its main purpose is to depict which
system functions are performed for each actor, defining their roles within the system.

Actors: User, System

Use cases:
Upload Dataset
Reading and displaying dataset values
Creating BERT and TFIDF object to convert all JOB description into numeric vector
Applying CHI2 algorithm on both BERT and TFIDF vector
Pre-Processing Dataset
Training SVM, Naïve Bayes, Logistic Regression on TFIDF features
Training BERT model with max similarity measure
Training BERT with CNN2D algorithm
Upload JOB description from TEST data
Predict JOB title

FIGURE 4.2.1 USECASE DIAGRAM

4.4.2 CLASS DIAGRAM


In software design, a UML class diagram illustrates a system's structure by depicting classes,
their properties, methods, and relationships, clarifying data ownership within the system.

Classes: User, System

Operations shown in the diagram:
Upload Dataset
Reading and displaying dataset values
Creating BERT and TFIDF object to convert all JOB description into numeric vector
Applying CHI2 algorithm on both BERT and TFIDF vector
Pre-Processing Dataset
Training SVM, Naïve Bayes, Logistic Regression on TFIDF features()
Training BERT model with max similarity measure()
Training BERT with CNN2D algorithm()
Upload JOB description from TEST data
Predict JOB title()
View Predicted JOB Title Results()

FIGURE 4.2.2 CLASS DIAGRAM

4.4.3 SEQUENCE DIAGRAM
A sequence diagram in UML is a kind of interaction diagram that shows how processes operate
with one another and in what order. Sequence diagrams are sometimes called event diagrams,
event scenarios, and timing diagrams.

Participants: User, System

Messages, from top to bottom:
Upload Dataset
Reading and displaying dataset values
Creating BERT and TFIDF object to convert all JOB description into numeric vector
Applying CHI2 algorithm on both BERT and TFIDF vector
Pre-Processing Dataset
Training SVM, Naïve Bayes, Logistic Regression on TFIDF features
Training BERT model with max similarity measure
Training BERT with CNN2D algorithm
Upload JOB description from TEST data
Predict JOB title

FIGURE 4.2.3 SEQUENCE DIAGRAM

4.4.4 COLLABORATION DIAGRAM

A collaboration diagram demonstrates object relationships in a system, focusing on architecture
rather than message flow. It embodies object-oriented programming principles, showcasing
features within objects and their connections across the system.

Objects: User, System

1: Upload Dataset
2: Reading and displaying dataset values
3: Creating BERT and TFIDF object to convert all JOB description into numeric vector
4: Applying CHI2 algorithm on both BERT and TFIDF vector
5: Pre-Processing Dataset
6: Training SVM, Naïve Bayes, Logistic Regression on TFIDF features
7: Training BERT model with max similarity measure
8: Training BERT with CNN2D algorithm
9: Upload JOB description from TEST data
10: Predict JOB title

FIGURE 4.2.4 COLLABORATION DIAGRAM

4.4.5 ACTIVITY DIAGRAM

Activity diagrams are graphical representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

FIGURE 4.2.5 ACTIVITY DIAGRAM

5 TECHNOLOGY DESCRIPTIONS

5.1 Python Introduction

Python, a dynamic and high-level programming language, offers versatility for application
development with its support for object-oriented programming.

Its simplicity makes it easy to learn, while its interpreted nature facilitates rapid development.
With Python's syntax and dynamic typing, it's ideal for scripting tasks and quick application
prototyping.

Supporting various programming paradigms, including object-oriented, imperative, and functional
styles, Python provides a flexible framework for developers to work within.

Python History

Python was conceived in the late 1980s by Guido van Rossum. Its implementation began in
December 1989, and the code was first published in February 1991.

Python 1.0 was released in 1994 and included lambda, map, filter, and reduce.

Python 2.0 followed in 2000 with list comprehensions and a garbage collection system.
Python 3.0, also known as "Py3K", was released on December 3, 2008 to address flaws in the
language.

Python Features

1. Easy Learning Curve: Its simplicity and developer-friendly nature make Python easy to
learn and use.
2. Interpreted Nature: Python is interpreted, so code runs without a separate compilation
step, which simplifies debugging and makes the language well suited to beginners.
3. Cross-Platform Compatibility: Python operates seamlessly across various platforms
like Windows, Linux, and Macintosh, enhancing its portability.
4. Open Source: Python is freely available on its official website, offering access to its
source code, promoting collaborative development.
5. Object-Oriented: Python supports object-oriented programming, including the creation
and manipulation of classes and objects.

Python Applications

Python's versatility extends across various domains of software development, making it a go-to
choice for many applications. Here's how Python can be applied:

1. Web Applications: Python offers frameworks like Django and Flask for web development
and handles data formats such as HTML, XML, and JSON.
2. Desktop GUI Applications: Python's Tk library and Kivy framework facilitate GUI
development for desktop applications.
3. Software Development: Python serves as a supportive language for tasks like build
control, management, and testing.
4. Business Applications: Python powers ERP systems, with platforms like Tryton
providing application support.
5. Console-Based Applications: IPython demonstrates Python's capability for developing
console-based applications.

Why Python

Python offers cross-platform compatibility and a user-friendly syntax resembling English. Its
concise syntax enables efficient coding with fewer lines compared to other languages. With an
interpreter system, Python allows instant code execution, facilitating rapid prototyping.

Installation on Windows

The steps below install Python 3.7 on the Windows operating system.

In the installer, select the option to add Python to the PATH and then click Install Now.

The check box to install for all users should also be selected.

The installer is now ready, and Python 3.7 can be installed.

After installation, open the command prompt and type the command python. The Python prompt
appears, which confirms that the interpreter runs.

6. SAMPLE CODE
#import required python classes and packages
from string import punctuation
from nltk.corpus import stopwords
import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np
import pandas as pd
import pickle
from nltk.stem import PorterStemmer
import os
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sentence_transformers import SentenceTransformer #loading bert sentence model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer #loading tfidf vector
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.tree import DecisionTreeClassifier
from scipy.spatial import distance
from numpy import dot
from numpy.linalg import norm
from keras.utils import to_categorical
from keras.layers import MaxPooling2D
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D
from keras.models import Sequential, Model, load_model
from keras.callbacks import ModelCheckpoint
import seaborn as sns
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

#define object to remove stop words and other text processing


stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
ps = PorterStemmer()

#define global variables to store labels, title and title + description


title_textdata = []
desc_textdata = []
labels = []

#define function to clean text by removing stop words and other special symbols
def cleanText(doc):
    tokens = doc.split()
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [w for w in tokens if w not in stop_words]
    tokens = [word for word in tokens if len(word) > 1]
    tokens = [ps.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    tokens = ' '.join(tokens)
    return tokens

#load & display dataset values
dataset = pd.read_csv("Dataset/JobsDataset.csv")
class_label, count = np.unique(dataset['Query'], return_counts=True)
dataset

#find and plot graph of different jobs found in dataset


count = count[0:5]
class_label = class_label[0:5]
height = count
bars = class_label
y_pos = np.arange(len(bars))
plt.figure(figsize =(6, 3))
plt.bar(y_pos, height)
plt.xticks(y_pos, bars)
plt.xlabel("Job Titles")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.show()

#function to get job label of given name
def getLabel(name):
    class_id = -1
    for i in range(len(class_label)):
        if class_label[i] == name:
            class_id = i
            break
    return class_id

#now read job details dataset and then convert to TFIDF vector and BERT vector
if os.path.exists("model/bert_title_desc.npy"):
    bert_desc_X = np.load("model/bert_title_desc.npy")#load bert title + description data
    Y = np.load("model/label.npy") #load training labels
    bert_title_X = np.load("model/bert_title.npy") #load bert title
    with open('model/desc_tfidf.txt', 'rb') as file:
        tfidf_desc_vector = pickle.load(file)
    with open('model/title_tfidf.txt', 'rb') as file:
        tfidf_title_vector = pickle.load(file)
    tfidf_title_X = np.load("model/tfidf_title_X.txt.npy")#load tfidf title vector
    tfidf_desc_X = np.load("model/tfidf_desc_X.txt.npy") #load tfidf title + description vector
else:
    for i in range(len(dataset)):#loop all job details from dataset
        label = getLabel(dataset.at[i, 'Query'])
        if 0 <= label < 5:#skip jobs outside the selected classes (getLabel returns -1 for them)
            title = dataset.at[i, 'Job Title']#get title
            desc = dataset.at[i, 'Description']#get description
            title = cleanText(title.strip().lower())#clean title
            desc = cleanText(desc.strip().lower())#clean description
            title_textdata.append(title)
            desc_textdata.append(title+" "+desc)#append both title and description
            labels.append(label)
            print(label)
    bert = SentenceTransformer('nli-distilroberta-base-v2')#create bert model
    embeddings = bert.encode(desc_textdata, convert_to_tensor=True)
    bert_desc_X = embeddings.cpu().numpy()
    np.save("model/bert_title_desc", bert_desc_X)
    Y = np.asarray(labels)
    np.save("model/label", Y)
    embeddings = bert.encode(title_textdata, convert_to_tensor=True)
    bert_title_X = embeddings.cpu().numpy()
    np.save("model/bert_title", bert_title_X)
    #create TFIDF vector for title + description
    tfidf_desc_vector = TfidfVectorizer(stop_words=list(stop_words), use_idf=True,
                                        ngram_range=(1, 1), smooth_idf=False, norm=None,
                                        decode_error='replace', max_features=768)
    tfidf_desc_X = tfidf_desc_vector.fit_transform(desc_textdata).toarray()
    np.save("model/tfidf_desc_X.txt", tfidf_desc_X)
    with open('model/desc_tfidf.txt', 'wb') as file:
        pickle.dump(tfidf_desc_vector, file)
    #create TFIDF vector for title only
    tfidf_title_vector = TfidfVectorizer(stop_words=list(stop_words), use_idf=True,
                                         ngram_range=(1, 1), smooth_idf=False, norm=None,
                                         decode_error='replace', max_features=768)
    tfidf_title_X = tfidf_title_vector.fit_transform(title_textdata).toarray()
    np.save("model/tfidf_title_X.txt", tfidf_title_X)
    with open('model/title_tfidf.txt', 'wb') as file:
        pickle.dump(tfidf_title_vector, file)
print("BERT and TFIDF vector generated")
print("BERT Vector : "+str(bert_desc_X))
print("TFIDF Vector : "+str(tfidf_desc_X))

#now preprocess dataset such as normalization and features selection using CHI2 weights
scaler1 = MinMaxScaler((0,1))
scaler2 = MinMaxScaler((0,1))
scaler3 = MinMaxScaler((0,1))
scaler4 = MinMaxScaler((0,1))
bert_desc_X = scaler1.fit_transform(bert_desc_X)#normalize TFIDF and BERT title and description
bert_title_X = scaler2.fit_transform(bert_title_X)
tfidf_desc_X = scaler3.fit_transform(tfidf_desc_X)
tfidf_title_X = scaler4.fit_transform(tfidf_title_X)
selected1 = SelectKBest(score_func = chi2, k = 300)#select the top 300 features using CHI2
bert_desc_X = selected1.fit_transform(bert_desc_X, Y)
selected2 = SelectKBest(score_func = chi2, k = 300)
bert_title_X = selected2.fit_transform(bert_title_X, Y)
selected3 = SelectKBest(score_func = chi2, k = 300)
tfidf_desc_X = selected3.fit_transform(tfidf_desc_X, Y)
selected4 = SelectKBest(score_func = chi2, k = 300)
tfidf_title_X = selected4.fit_transform(tfidf_title_X, Y)
print("Preprocessing completed")

#now split both BERT and TFIDF dataset into train & Test
bert_desc_X_train, bert_desc_X_test, bert_desc_y_train, bert_desc_y_test = train_test_split(
    bert_desc_X, Y, test_size=0.2)
bert_title_X_train, bert_title_X_test, bert_title_y_train, bert_title_y_test = train_test_split(
    bert_title_X, Y, test_size=0.2)
tfidf_desc_X_train, tfidf_desc_X_test, tfidf_desc_y_train, tfidf_desc_y_test = train_test_split(
    tfidf_desc_X, Y, test_size=0.2)
tfidf_title_X_train, tfidf_title_X_test, tfidf_title_y_train, tfidf_title_y_test = train_test_split(
    tfidf_title_X, Y, test_size=0.2)
print()
print("Dataset train & test split as 80% dataset for training and 20% for testing")
print("Training Size (80%): "+str(bert_desc_X_train.shape[0])) #print training and test size
print("Testing Size (20%): "+str(bert_desc_X_test.shape[0]))
print()

#define global variables to store accuracy and other metrics


precision = []
recall = []

fscore = []
accuracy = []

#function to calculate various metrics such as accuracy, precision etc


def calculateMetrics(algorithm, predict, testY):
    p = precision_score(testY, predict,average='macro') * 100
    r = recall_score(testY, predict,average='macro') * 100
    f = f1_score(testY, predict,average='macro') * 100
    a = accuracy_score(testY,predict)*100
    print(algorithm+' Accuracy : '+str(a))
    print(algorithm+' Precision : '+str(p))
    print(algorithm+' Recall : '+str(r))
    print(algorithm+' FMeasure : '+str(f))
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    conf_matrix = confusion_matrix(testY, predict)
    plt.figure(figsize =(5, 4))
    ax = sns.heatmap(conf_matrix, xticklabels = class_label, yticklabels = class_label,
                     annot = True, cmap="viridis", fmt ="g")
    ax.set_ylim([0,len(class_label)])
    plt.title(algorithm+" Confusion matrix")
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()

#now train SVM algorithm of TFIDF features


svm_cls = svm.SVC() #create SVM object
svm_cls.fit(tfidf_desc_X_train, tfidf_desc_y_train)#train SVM on training data
predict = svm_cls.predict(tfidf_desc_X_test) #predict on test data
calculateMetrics("SVM", predict, tfidf_desc_y_test)#calculate accuracy and other metrics

#now train naive bayes algorithm
nb_cls = GaussianNB() #create Naive Bayes object
nb_cls.fit(tfidf_desc_X_train, tfidf_desc_y_train)#train Naive Bayes on training data
predict = nb_cls.predict(tfidf_desc_X_test) #predict on test data
calculateMetrics("Naive Bayes", predict, tfidf_desc_y_test)#calculate accuracy and other metrics

#now train Logistic Regression algorithm
lr_cls = LogisticRegression() #create LR object
lr_cls.fit(tfidf_desc_X_train, tfidf_desc_y_train)#train LR on training data
predict = lr_cls.predict(tfidf_desc_X_test) #predict on test data
calculateMetrics("Logistic Regression", predict, tfidf_desc_y_test)#calculate accuracy and other metrics

#propose BERT model: match each test vector to the most similar BERT vector using cosine
#similarity and take the label of that vector as the prediction
predict = []
for i in range(len(bert_desc_X_test)):#loop all test data
    max_value = 0
    pred = 0
    for j in range(len(bert_desc_X)):#loop all bert vectors
        #calculate cosine similarity between the test vector and the reference vector
        dst = dot(bert_desc_X_test[i], bert_desc_X[j]) / (norm(bert_desc_X_test[i]) * norm(bert_desc_X[j]))
        #choose predicted label with max similarity, skipping an exact match (similarity of 1)
        if dst > max_value and dst != 1:
            max_value = dst
            pred = Y[j]
    #save max similarity label in predict array
    predict.append(pred)
calculateMetrics("Propose BERT Model", predict, bert_desc_y_test)#calculate accuracy and other metrics

#now train extension CNN model using convolution 2D neural network; its convolution and pooling
#layers filter the features so that the model trains on the most relevant ones
bert_desc_X_train = np.reshape(bert_desc_X_train, (bert_desc_X_train.shape[0], 10, 10, 3))
bert_desc_X_test = np.reshape(bert_desc_X_test, (bert_desc_X_test.shape[0], 10, 10, 3))
bert_desc_y_train = to_categorical(bert_desc_y_train)
bert_desc_y_test = to_categorical(bert_desc_y_test)
#define object
extension_model = Sequential()
#add CNN2D layer with 32 filters of size 3 x 3
extension_model.add(Convolution2D(32, (3, 3), input_shape = (bert_desc_X_train.shape[1],
                    bert_desc_X_train.shape[2], bert_desc_X_train.shape[3]), activation = 'relu'))
#max pooling layer collects the strongest filtered features from the CNN layer
extension_model.add(MaxPooling2D(pool_size = (2, 2)))
#defining another filter
extension_model.add(Convolution2D(32, (3, 3), activation = 'relu'))
extension_model.add(MaxPooling2D(pool_size = (2, 2)))
extension_model.add(Flatten())
#defining dense hidden layer and softmax output layer
extension_model.add(Dense(units = 256, activation = 'relu'))
extension_model.add(Dense(units = bert_desc_y_train.shape[1], activation = 'softmax'))
#compile and train the model
extension_model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
if not os.path.exists("model/cnn_weights.hdf5"):
    model_check_point = ModelCheckpoint(filepath='model/cnn_weights.hdf5', verbose = 1,
                                        save_best_only = True)
    hist = extension_model.fit(bert_desc_X_train, bert_desc_y_train, batch_size = 16, epochs = 50,
                               validation_data=(bert_desc_X_test, bert_desc_y_test),
                               callbacks=[model_check_point], verbose=1)
    with open('model/cnn_history.pckl', 'wb') as f:
        pickle.dump(hist.history, f)
else:
    extension_model = load_model("model/cnn_weights.hdf5")
#perform prediction on test data
predict = extension_model.predict(bert_desc_X_test)
predict = np.argmax(predict, axis=1)
y_test1 = np.argmax(bert_desc_y_test, axis=1)
calculateMetrics("Extension CNN2d Model", predict, y_test1)#calculate accuracy and other metrics

#all algorithms performance graph


plot_names = ['SVM', 'Naive Bayes', 'Logistic Regression', 'Propose Bert', 'Extension Bert with CNN2D']
metric_rows = []
for i, name in enumerate(plot_names):#one row per algorithm and metric
    metric_rows.append([name, 'Precision', precision[i]])
    metric_rows.append([name, 'Recall', recall[i]])
    metric_rows.append([name, 'F1 Score', fscore[i]])
    metric_rows.append([name, 'Accuracy', accuracy[i]])
df = pd.DataFrame(metric_rows, columns=['Algorithms','Parameters','Value'])
df.pivot(index="Algorithms", columns="Parameters", values="Value").plot(kind='bar',figsize =(6, 3))
plt.title("All Algorithms Performance Graph")
plt.show()

#showing all algorithms performance values in tabular format
columns = ["Algorithm Name","Precision","Recall","FScore","Accuracy"]
values = []
algorithm_names = ["SVM", "Naive Bayes", "Logistic Regression", "Propose BERT", 'Extension BERT CNN2D']
for i in range(len(algorithm_names)):
    values.append([algorithm_names[i],precision[i],recall[i],fscore[i],accuracy[i]])

temp = pd.DataFrame(values,columns=columns)
temp

#create bert object


bert = SentenceTransformer('nli-distilroberta-base-v2')
print("Bert model created")

#now predict job title from job description
dataset = pd.read_csv("Dataset/testData.csv",encoding = "ISO-8859-1")
dataset = dataset.values
for i in range(len(dataset)):
    data = dataset[i,0]
    data = cleanText(data)
    embeddings = bert.encode(data, convert_to_tensor=True)#convert job description to bert vector
    X = embeddings.cpu().numpy()#convert vector to numpy
    X = X.reshape(1, -1)
    X = scaler1.transform(X)#apply the same normalization as in training
    X = selected1.transform(X)#keep the same 300 CHI2 features as in training
    X = np.reshape(X, (X.shape[0], 10, 10, 3))
    predict = extension_model.predict(X)
    predict = np.argmax(predict)
    print("Job Description = "+dataset[i,0][0:150])
    print("PREDICTED JOB TITLE =====> "+class_label[predict]+"\n")

7. TESTING

7.1 Testing is the process in which defects are identified, isolated, and submitted for
rectification, so that the product is free of defects, delivers quality and, in turn, satisfies
the customer. For testing, we must know the following things.

 Recognition of defects: Defects must first be identified in the product.

 Isolating the defects: After identification, the defects must be listed. Isolation means
separation; the physical separation is done by the developer.

 Submitted for rectification: It is the responsibility of the test engineer (TE) to send the
list of defects for rectification.

TYPES OF TESTING

Unit Testing: Unit tests verify individual components, ensuring each business method performs
correctly with distinct inputs and predictable outcomes.

Integration Testing: Integration tests validate integrated software components, ensuring
their combined functionality meets expectations.

Functional Testing: Functional tests assess system functions, verifying alignment with business
and technical requirements.

System Testing: System tests validate the entire software system, ensuring it meets specified
requirements and produces expected outcomes.

White Box Testing: White box tests analyze internal structures and logic to ensure completeness
and accuracy.

Black Box Testing: Black box tests assess software functionality without knowledge of internal
operations, verifying inputs and outputs against specified requirements.
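As an illustration of unit testing in this project's context, the sketch below tests a simplified text-cleaning function with Python's built-in unittest module. The function clean_text and its four-word stop list are stand-ins written for this example; the project's own cleanText function additionally uses NLTK stop words, stemming and lemmatization.

```python
import unittest
from string import punctuation

STOP_WORDS = {"the", "and", "a", "of"}  # tiny illustrative stop-word list


def clean_text(doc):
    """Lowercase, strip punctuation, and drop stop words and non-alphabetic tokens."""
    table = str.maketrans("", "", punctuation)
    tokens = [w.translate(table) for w in doc.lower().split()]
    kept = [w for w in tokens if w.isalpha() and len(w) > 1 and w not in STOP_WORDS]
    return " ".join(kept)


class CleanTextTest(unittest.TestCase):
    def test_removes_punctuation_and_stop_words(self):
        self.assertEqual(clean_text("The design of a Cloud, and Data!"), "design cloud data")

    def test_drops_tokens_with_digits(self):
        self.assertEqual(clean_text("python3 developer"), "developer")


# Run the tests programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test feeds one distinct input to a single component and checks one predictable outcome, which matches the definition of unit testing given above.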

8 SCREENSHOTS

Importing the required Python classes and packages

Defining code to remove stop words, special symbols, etc.

Reading and displaying dataset values

Finding and plotting a graph of the various jobs found in the dataset, where the x-axis
represents the job title and the y-axis represents the count

Reading each job description, cleaning it and adding it to an array variable, and then creating
BERT and TFIDF objects to convert all job descriptions into numeric vectors. Executing this
block produces the vectors shown in the next screen

BERT and TFIDF vectors created

Normalizing and applying the CHI2 algorithm on both BERT and TFIDF vectors

Splitting data into train and test sets

Defining a function to calculate accuracy and other metrics

Training SVM on TFIDF features gave 83% accuracy, and the other metrics are also shown. In the
confusion matrix graph, the x-axis represents the predicted job title and the y-axis represents
the true job title. The boxes on the diagonal contain the correct prediction counts and the
remaining boxes contain the incorrect prediction counts

Training Naïve Bayes gave 51% accuracy

Training Logistic Regression gave 84% accuracy

Training the proposed BERT model with the max similarity measure gave 88% accuracy, which is
higher than the existing algorithms, and the other metrics are also shown

Training the extension CNN2D algorithm; executing this block gives the output shown in the next
screen

The extension CNN2D model gave 96% accuracy, and the other metrics are also shown

Displaying the performance of all algorithms, where the x-axis represents the algorithm names
and the y-axis represents accuracy and the other metrics as bars of different colours

Displaying the performance of all algorithms in tabular format

Reading job descriptions from the test data and then predicting the job title; the output is
shown in the next screen

In the above screen, the first line shows the job description and, after the =====> arrow
symbol, the predicted job title, such as Big Data Engineer or Cloud Architect

9. CONCLUSION
This project systematically employed Python libraries for text preprocessing, dataset exploration,
and model training. It began with importing necessary packages and defining code to clean text
data. Exploratory data analysis included displaying job dataset values and plotting graphs to
visualize job title distribution. BERT and TFIDF vectors were created from job descriptions,
normalized, and subjected to the CHI2 algorithm. Various models including SVM, Naïve Bayes,
Logistic Regression, and the proposed BERT model were trained and evaluated. The extension
CNN2D model achieved the highest accuracy at 96%, ahead of the proposed BERT model (88%),
Logistic Regression (84%), SVM (83%), and Naïve Bayes (51%). Performance metrics were displayed
graphically and in tabular format. Test data predictions demonstrated effective job title
prediction capabilities.

10. BIBLIOGRAPHY

[1] F. Javed, Q. Luo, M. McNair, F. Jacob, M. Zhao, and T. S. Kang, ‘‘Carotene: A job title
classification system for the online recruitment domain,’’ in Proc. IEEE 1st Int. Conf. Big Data
Comput. Service Appl., Mar. 2015, pp. 286–293.

[2] M. S. Pera, R. Qumsiyeh, and Y.-K. Ng, ‘‘Web-based closed-domain data extraction on
online advertisements,’’ Inf. Syst., vol. 38, no. 2, pp. 183–197, Apr. 2013.

[3] R. Kessler, N. Béchet, M. Roche, J.-M. Torres-Moreno, and M. El-Bèze, ‘‘A hybrid approach
to managing job offers and candidates,’’ Inf. Process. Manage., vol. 48, no. 6, pp. 1124–1135,
Nov. 2012.

[4] I. Rahhal, K. Carley, K. Ismail, and N. Sbihi, ‘‘Education path: Student orientation based on
the job market needs,’’ in Proc. IEEE Global Eng. Educ. Conf. (EDUCON), Mar. 2022, pp.
1365–1373.

[5] S. Mittal, S. Gupta, K. Sagar, A. Shamma, I. Sahni, and N. Thakur, ‘‘A performance
comparisons of machine learning classification techniques for job titles using job descriptions,’’
SSRN Electron. J., 2020. Accessed: Feb. 22, 2023. [Online]. Available:
https://www.ssrn.com/abstract=3589962, doi: 10.2139/ssrn.3589962.

[6] R. Boselli, M. Cesarini, F. Mercorio, and M. Mezzanzanica, ‘‘Using machine learning for
labour market intelligence,’’ in Machine Learning and Knowledge Discovery in Databases
(Lecture Notes in Computer Science), Y. Altun, K. Das, T. Mielikäinen, D. Malerba, J.
Stefanowski, J. Read, M. Zitnik, M. Ceci, and S. Dzeroski, Eds. Cham, Switzerland: Springer,
2017, pp. 330–342.

[7] T. Van Huynh, K. Van Nguyen, N. L.-T. Nguyen, and A. G.-T. Nguyen, ‘‘Job prediction:
From deep neural network models to applications,’’ in Proc. RIVF Int. Conf. Comput. Commun.
Technol. (RIVF), Oct. 2020, pp. 1–6.

[8] F. Amato, R. Boselli, M. Cesarini, F. Mercorio, M. Mezzanzanica, V. Moscato, F. Persia,
and A. Picariello, ‘‘Challenge: Processing web texts for classifying job offers,’’ in Proc. IEEE
9th Int. Conf. Semantic Comput. (IEEE ICSC), Feb. 2015, pp. 460–463.

[9] H. T. Tran, H. H. P. Vo, and S. T. Luu, ‘‘Predicting job titles from job descriptions with
multi-label text classification,’’ in Proc. 8th NAFOSTED Conf. Inf. Comput. Sci. (NICS), Dec.
2021, pp. 513–518.

[10] R. Boselli, M. Cesarini, F. Mercorio, and M. Mezzanzanica, ‘‘Classifying online job
advertisements through machine learning,’’ Future Gener. Comput. Syst., vol. 86, pp. 319–328,
Sep. 2018.

[11] M. Vinel, I. Ryazanov, D. Botov, and I. Nikolaev, ‘‘Experimental comparison of
unsupervised approaches in the task of separating specializations within professions in job
vacancies,’’ in Proc. Conf. Artif. Intell. Natural Lang., Cham, Switzerland: Springer, 2019, pp.
99–112.

[12] E. Malherbe, M. Cataldi, and A. Ballatore, ‘‘Bringing order to the job market: Efficient job
offer categorization in E-recruitment,’’ in Proc. 38th Int. ACM SIGIR Conf. Res. Develop. Inf.
Retr., Aug. 2015, pp. 1101–1104.

[13] F. Saberi-Movahed, M. Rostami, K. Berahmand, S. Karami, P. Tiwari, M. Oussalah, and S.
S. Band, ‘‘Dual regularized unsupervised feature selection based on matrix factorization and
minimum redundancy with application in gene selection,’’ Knowl.-Based Syst., vol. 256, Nov.
2022, Art. no. 109884.

[14] I. Khaouja, I. Rahhal, M. Elouali, G. Mezzour, I. Kassou, and K. M. Carley, ‘‘Analyzing the
needs of the offshore sector in Morocco by mining job ads,’’ in Proc. IEEE Global Eng. Educ.
Conf. (EDUCON), Apr. 2018, pp. 1380–1388.

[15] R. Bekkerman and M. Gavish, ‘‘High-precision phrase-based document classification on a
modern scale,’’ in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug.
2011, pp. 231–239.

[16] P. Neculoiu, M. Versteegh, and M. Rotaru, ‘‘Learning text similarity with siamese
recurrent networks,’’ in Proc. 1st Workshop Represent. Learn. (NLP). Berlin, Germany:
Association for Computational Linguistics, 2016, pp. 148–157. Accessed: Feb. 22, 2023.
[Online]. Available: http://aclweb.org/anthology/W16-1617, doi: 10.18653/v1/W16-1617.

[17] I. Karakatsanis, W. AlKhader, F. MacCrory, A. Alibasic, M. A. Omar, Z. Aung, and W. L.
Woon, ‘‘Data mining approach to monitoring the requirements of the job market: A case study,’’
Inf. Syst., vol. 65, pp. 1–6, Apr. 2017.

[18] Y. Zhu, F. Javed, and O. Ozturk, ‘‘Document embedding strategies for job title
classification,’’ in Proc. 30th Int. Flairs Conf., 2017, pp. 55–65. Accessed: Oct. 4, 2022. [Online].
Available: https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS17/paper/view/15470

[19] F. Colace, M. D. Santo, M. Lombardi, F. Mercorio, M. Mezzanzanica, and F. Pascale,
‘‘Towards labour market intelligence through topic modelling,’’ in Proc. Annu. Hawaii Int.
Conf. Syst. Sci., 2019, pp. 1–10.

[20] E. Mankolli and V. Guliashki, ‘‘A hybrid machine learning method for text analysis to
determine job titles similarity,’’ in Proc. 15th Int. Conf. Adv. Technol., Syst. Services
Telecommun. (TELSIKS), Oct. 2021, pp. 380–385.
