0% found this document useful (0 votes)
55 views52 pages

Project Report I

The document outlines a project titled 'Lung Cancer Prediction' submitted by Shruti Verma as part of her Bachelor of Technology degree requirements at Gurugram University. The project aims to develop an accurate classification and prediction system for lung cancer using machine learning, image processing, and deep learning techniques, emphasizing the importance of early detection for improving survival rates. It includes methodologies such as image acquisition, preprocessing, model selection, and the integration of multiview medical imaging to enhance diagnostic accuracy.

Uploaded by

Shruti Verma
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
55 views52 pages

Project Report I

The document outlines a project titled 'Lung Cancer Prediction' submitted by Shruti Verma as part of her Bachelor of Technology degree requirements at Gurugram University. The project aims to develop an accurate classification and prediction system for lung cancer using machine learning, image processing, and deep learning techniques, emphasizing the importance of early detection for improving survival rates. It includes methodologies such as image acquisition, preprocessing, model selection, and the integration of multiview medical imaging to enhance diagnostic accuracy.

Uploaded by

Shruti Verma
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

LUNG CANCER PREDICTION


PROJECT I [ P.22.6.631]

SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE AWARD OF

Gurugram University, Gurugram

​ ​
PROJECT I​
[ P.22.6.631]
LUNG CANCER PREDICTION

SUBMITTED IN PARTIAL FULFILLMENT OF THE


REQUIREMENTS FOR THE AWARD
OF

Submitted By

Shruti Verma
University Roll No: 12028413 / University Reg. No: 221001360102

2
This is to certify that the project report entitled “Lung Cancer Prediction”
submitted by “Shruti Verma (25325)” in partial fulfillment of the requirements for
the award of the degree of Bachelor of Technology in Computer Science &
Engineering (AI&ML) of Dronacharya College of Engineering, Gurugram is a
record of bonafide work carried out under my guidance and supervision.


Dr. Ritu Pahwa
(Head of Department)
(Signature and Seal)​ ​ ​ ​ ​ ​

3
I, SHRUTI VERMA , of 6TH semester Btech (CSE AI&ML), in the Department
of Bachelor of Technology in Computer Science & Engineering from
Dronacharya College of Engineering, Gurugram, hereby declare that the project
work entitled “ LUNG CANCER PREDICTION ” is carried out by us and
submitted in partial fulfillment of the requirements for the award of Bachelor of
Technology in Computer Science & Engineering (AI&ML) , under
Dronacharya College of Engineering during the academic year 2024 - 2025 and
has not been submitted to any other university for the award of any kind of degree.

4
The success and final outcome of this project requires a lot of guidance and
assistance from many people and I am extremely privileged to have got this all
along the completion of my project. All that I have done is only due to such
supervision and assistance and I would not forget to think them. I respect and thank
Dr. Ritu Pahwa, HOD CSE (AI&ML), Dronacharya College of Engineering,
Affiliated to Maharshi Dayanand University, Rohtak for providing me an
opportunity to do the project work. I am extremely thankful to her for providing
such nice support and motivation. I owe my deep gratitude to our project mentors
Prof. Parveen Kumari and Prof. Suman Chopra who took keen interest in our
project work and guided us all along, till the completion of our project work by
providing all the necessary information for developing a good system.

5
TABLE OF CONTENTS

1. Introduction of Project
01
1.1 Introduction
01
1.2 Project Overview 01
●​ Background 01
●​ Objective 01

1.3 The Significance of Lung Cancer Prediction 01

03
1.4 Methodology
04
●​ Data Collection
04
●​ Data Preprocessing
●​ Feature Engineering 04

●​ Model Selection 05
●​ Training and Evaluation 05
●​ Model Deployment 06
●​ Monitoring and Maintenance
06

2. Tools and Technology Used in Project


07
●​ Python
09
●​ Numpy
●​ Matplotlib Library 10

●​ Pandas Library 11
●​ Seaborn Library 12

3. My Project
13
●​ Code Snapshots

4. Result and Discussion 27

5. Conclusion and Future Scope 32

6
LIST OF FIGURES

Figure1.1 ……………. (1) ……………………………………… 03

Figure 1.2 Data Flow (2) ……………………………………... 03


Figure 1.3 Algorithm flow(3) ……………………………………... 06
Figure 2.1 Python(1) ……………………………………... 07
Figure 2.2 Numpy(2) ……………………………………... 09
Figure 2.3 Pandas(3) ……………………………………... 10
Figure 2.4 Matplotlib(4) ……………………………………… 11
Figure 2.5 Seaborn(5) ……………………………………..... 12
Figure 3.1 Source Code(1) ……………………………………... 13
Figure 3.2 Source Code(2) ……………………………………... 13
Figure 3.3 Source Code(3) ……………………………………... 14
Figure 3.4 Source Code(4) ……………………………………... 14
Figure 3.5 Source Code(5) ……………………………………... 15
Figure 3.6 Source Code(6) ……………………………………... 15
Figure 3.7 Source Code(7) ……………………………………... 15
Figure 3.8 Source Code(8) ……………………………………... 16
Figure 3.9 Source Code(9) …………………………………….. 16
Figure 3.10 Source Code(10) …………………………………….. 16
Figure 3.11 Source Code(11) …………………………………….. 17
Figure 3.12 Source Code(12) …………………………………….. 17
Figure 3.13 Source Code(13) …………………………………….. 17
Figure 3.14 Source Code(14) …………………………………….. 18
Figure 3.15 Source Code(15) ……………………………………... 18

7
Figure 3.16 Source Code(16) ……………………………………... 18
Figure 3.17 Source Code(17) ………………………………………. 19

Figure 3.18 Source Code(18) ………………………………………. 19

Figure 3.19 Source Code(19) ……………………………………… 19

Figure 3.20 Source Code(20) ………………………………………. 20

Figure 3.21 Source Code(21) ………………………………………. 20

Figure 3.22 Source Code(22) ……………………………………….. 20

Figure 3.23 Source Code(23) ……………………………………….. 21

Figure 3.24 Source Code(24) ……………………………………… 21


Figure 3.25 Source Code(25) ………………………………………. 21

Figure 3.26 Source Code(26) ……………………………………… 22

Figure 3.27 Source Code(27) ……………………………………… 22

Figure 3.28Source Code(28) ……………………………………… 22

Figure 3.29 Source Code(29) ……………………………………… 23

Figure 3.30 Source Code(30) ……………………………………… 23

Figure 3.31 Source Code(31) ……………………………………… 23

Figure 3.32 Source Code(32) ……………………………………… 24

Figure 3.33 Source Code(33) ……………………………………… 24

Figure 3.34 Source Code(34) ……………………………………… 24

Figure 3.35 Source Code(35) ……………………………………… 25

Figure 3.36 Source Code(36) ……………………………………… 25

Figure 3.37 Source Code(37) ……………………………………… 25

Figure 3.38 Source Code(38) ……………………………………… 26

Figure 3.39 Source Code(39) ……………………………………… 26

8
Figure 3.40 Source Code(40) ……………………………………… 26

Figure 4.1 Results (1) ……………………………………… 27

Figure 4.2 Results (2) ……………………………………… 27

Figure 4.3 Results (3) ……………………………………… 28

Figure 4.4 Results (4) ……………………………………… 28

Figure 4.5 Results (5) ……………………………………… 28

Figure 4.6 Results (6) ……………………………………… 29

Figure 4.7 Results (7) ……………………………………… 29

Figure 4.8 Results (8) ……………………………………… 29

Figure 4.9 Results & Discussion(9) ………………………………………. 31

9
CHAPTER 1
INTRODUCTION TO THE PROJECT

1.1 Introduction
Lung cancer remains one of the most lethal diseases worldwide, claiming the lives of
approximately one million people annually. It presents a significant challenge for medical
professionals to detect and diagnose effectively, particularly in its early stages. Despite advances
in medical technology, the complete understanding of cancer's root causes and definitive
treatments continues to elude researchers. However, early detection significantly improves
survival rates, with the American Cancer Society estimating a 47% chance of survival when lung
cancer is detected in its initial stages.

The current medical landscape necessitates efficient lung nodule identification from chest CT
scans, as these nodules often serve as critical indicators of potential malignancy. Traditional
diagnostic methods such as X-ray imaging frequently fail to reveal lung cancer in its early stages,
especially when dealing with round lesions less than 10mm in diameter. This limitation
underscores the urgent need for computer-aided detection (CAD) systems that can assist medical
professionals in early and accurate diagnosis.

Image processing techniques play a crucial role in identifying regions affected by cancer in lung
images. These techniques include noise reduction, feature extraction, identification of damaged
regions, and comparison with historical lung cancer data. The application of digital image
processing enables the integration of various aspects of an image into a coherent entity, allowing
for targeted analysis of specific lung regions. A key advantage of this approach is the ability to
differentiate between cancerous and non-cancerous sections by comparing image intensities.

Our project aims to develop an accurate classification and prediction system for lung cancer
using a comprehensive approach that integrates machine learning, image processing, and deep
learning techniques. The workflow begins with image acquisition, followed by preprocessing
using geometric mean filters to enhance image quality and reduce noise artifacts. K-means
clustering algorithms are then applied for image segmentation, helping to identify regions of
interest within the lung CT scans.

For the classification and prediction phase, we implement a robust multi-algorithm approach to
identify the most effective method for lung cancer detection. Our machine learning
implementation includes a diverse set of algorithms: Logistic Regression for probabilistic
classification, K-Nearest Neighbors Classifier for instance-based learning, Decision Tree for
interpretable rule-based prediction, Bagging Classifier for improved stability and accuracy,
Gaussian Naive Bayes for probabilistic modeling with feature independence assumptions, and

10
Random Forest for ensemble-based classification with reduced overfitting. In parallel, we
employ Convolutional Neural Networks (CNN) as our deep learning approach, which excels at
automatic feature extraction directly from image data, capturing subtle patterns and
characteristics that might be overlooked by traditional methods.

Expanding on these methodologies, our combined research initiatives introduce two distinct yet
complementary strategies. One notable approach involves a two-step verification architecture
for lung cancer detection. This model first assesses patient risk (low, medium, or high) through a
series of questions about symptoms and medical background, leveraging a Decision Tree
algorithm. If a medium or high risk is determined, the assessment is validated by analyzing the
patient's CT scan images using VGG16 Convolutional Neural Networks to predict the specific
type of lung cancer. This aims to streamline the diagnostic process and enhance accuracy by
eliminating the immediate need for a doctor's intervention in the initial screening.

Another significant thrust of our research focuses on machine learning-based lung cancer
detection using multiview image registration and fusion. This method addresses the limitation
of single imaging modalities by combining information from multiple image views, particularly
CT scans. Novel algorithms for multiview medical image registration and fusion are
introduced to create a single, highly detailed diagnostic image by aligning and merging different
perspectives without losing crucial clinical information. This integration of anatomical and
functional data, often processed by models like ResNet-18, provides a more reliable basis for
diagnosis.

By comparing and evaluating these various algorithms and integrated approaches, our project
aims to determine the most accurate and efficient method for lung cancer prediction, potentially
combining strengths from multiple models for optimal performance. The significance of this
research lies in its potential to improve early detection rates, which directly correlates with
higher survival rates. As lung cancer incidence continues to rise in developing countries due to
increased life expectancy, urbanization, and adoption of Western lifestyles, the implementation
of efficient and accurate diagnostic tools becomes increasingly important for global public
health. Our system aims to reduce the burden on radiologists, minimize human error in
diagnosis, and ultimately contribute to better patient outcomes through timely intervention and
treatment planning.

Fig.1.1 Data Flow

11
1.2 Project Overview
This research presents a consolidated exploration of the foundational context and key aims
driving two distinct but complementary research initiatives in the realm of lung cancer diagnosis.
Both studies are pivotal in their contributions to improving the early identification and precise
classification of lung cancer, thereby facilitating more timely and effective therapeutic
interventions.

●​ Background

The accurate identification of lung cancer represents a formidable challenge that has garnered
significant attention from researchers worldwide. As established, lung cancer is a leading cause
of cancer-related mortality globally, necessitating continuous advancements in diagnostic
methodologies. While various imaging techniques are employed, a single imaging modality often
proves insufficient to capture the entirety of critical morphological and functional data required
for a definitive diagnosis of both normal and diseased anatomical structures. This inherent
limitation underscores the necessity for more comprehensive imaging solutions.

In this context, the practice of multiview medical imaging—which involves combining


information from two or more distinct images—becomes paramount. This integrated approach is
crucial for achieving an expanded understanding of medical conditions and for enhancing the
identification of subtle lesions, cancerous cells, and tumors that might be missed by isolated
imaging techniques. Furthermore, the advent and rapid evolution of machine learning (ML) and
deep learning (DL) techniques have opened new frontiers in medical diagnostics. These
advanced computational methods are increasingly being leveraged to expedite cancer detection
and classification of its stages. By employing ML and DL, it becomes feasible to analyze much
larger patient datasets with greater efficiency, thereby reducing the time and cost associated with
manual analysis, and potentially enabling the screening of more individuals.

A key methodological innovation in this research lies in the sophisticated use of multiview image
registration and fusion. Image registration is the process of aligning different medical images
(e.g., from different modalities or taken at different times) to a common spatial framework.
Image fusion then combines these aligned images to create a single, composite image that retains
all the relevant information from the original sources without significant loss of clinical detail.
This meticulously executed process involves a precise comparison of geometrical dimensions
and intensity levels across the various input images. The resultant fused image provides a richer,
more reliable diagnostic basis, integrating both anatomical (structural) and functional
(physiological) information, which is critical for a more reliable disease diagnosis. Computed
Tomography (CT) imaging, particularly favored for its high-resolution capabilities in visualizing
the human skeleton and its ability to depict how different body parts absorb X-rays, serves as a
primary imaging modality in this comprehensive approach.

12
●​ Objectives

The overarching objective of this research is to propose and rigorously demonstrate the
effectiveness of a novel, machine learning-based approach for the accurate detection and
classification of lung cancer. This encompasses several specific, interlinked goals aimed at
advancing the state of the art in medical imaging and diagnostics:

1.​ To propose an efficient Machine Learning (ML) classification model: This objective
focuses on developing a robust and highly effective ML model capable of distinguishing
cancerous lung tissues from healthy ones with high precision. The model is designed to
handle complex patterns within medical imaging data to ensure reliable classification
outcomes.​

2.​ To extract crucial information from medical images while conserving image energy:
This involves the implementation of sophisticated feature extraction methods. The aim is
to intelligently extract the most diagnostically relevant features from the medical images,
ensuring that vital information is captured accurately, while also optimizing the
computational resources by efficiently processing the image data, minimizing redundancy
and maximizing signal content.​

3.​ To explore and utilize multiview medical image registration and fusion algorithms:
A key component of this research is the investigation, development, and application of
advanced algorithms for both the registration (alignment) and fusion (combination) of
multiview medical images. The goal is to produce composite images that offer a more
comprehensive and information-rich representation of the lung, integrating data from
different perspectives or modalities to enhance diagnostic clarity.​

4.​ To evaluate performance using various indicators: The research meticulously assesses
the effectiveness of the proposed model and methodologies. This involves using a range
of standard performance metrics to quantify the accuracy, precision, recall, F1-score, and
other relevant indicators, thereby validating the robustness and practical utility of the
developed system in a clinical context.​

Through these objectives, the research seeks to provide a cutting-edge, data-driven solution for
lung cancer detection, ultimately contributing to improved patient outcomes by enabling earlier,
more accurate, and less invasive diagnostic processes.

13
1.3 The Significance of Lung Cancer Prediction
Lung cancer prediction, especially through the application of machine learning (ML), holds
immense and growing significance in modern healthcare, fundamentally reshaping diagnostic
paradigms and patient management strategies. This importance stems from its capacity to
address critical challenges associated with this highly lethal disease, offering multifaceted
benefits across the healthcare spectrum.

1. Early Detection Benefits: Transforming Survival Rates The most profound significance of
ML-driven lung cancer prediction lies in its ability to facilitate early detection, which is the
single most critical factor in improving patient survival outcomes. Lung cancer is notoriously
aggressive, often presenting with subtle or non-existent symptoms in its initial stages, leading to
late diagnoses when the disease has progressed and treatment options are limited. Machine
learning models, however, excel at identifying minute, often imperceptible patterns within vast
datasets—be it patient symptoms, medical history, or complex imaging data—that may indicate
the presence of early-stage lung cancer, even before macroscopic changes or overt symptoms
become apparent. Early detection can dramatically increase the 5-year survival rate, with
estimates suggesting a potential leap from as low as 15% in advanced stages to over 85% when
detected in its nascent phases. This capability of early warning empowers clinicians to intervene
decisively and promptly, offering patients the best chance for successful treatment and long-term
survival.

2. Cost-Effective and Optimized Screening Traditional lung cancer screening methods, such as
regular CT scans and invasive biopsies, are inherently expensive, resource-intensive, and carry
certain risks, making universal screening impractical and financially unfeasible for entire
populations. Machine learning-based prediction systems offer a highly cost-effective alternative.
By analyzing various risk factors and initial patient data, these models can accurately identify
and stratify high-risk individuals. This allows healthcare systems to prioritize those patients who
are most likely to benefit from further, more expensive diagnostic testing, such as low-dose CT
scans or biopsies. This optimized resource allocation not only reduces healthcare expenditure but
also minimizes unnecessary exposure to radiation and invasive procedures for low-risk
individuals, making screening programs more sustainable and impactful.

3. Reduced Healthcare Burden and Streamlined Diagnostic Pathways The implementation of


accurate ML-based risk assessment systems significantly alleviates the burden on healthcare
infrastructure and personnel. By effectively differentiating between high-risk and low-risk
patients, the system can reduce the volume of unnecessary diagnostic procedures and follow-up
appointments for those unlikely to have the disease. Conversely, it ensures that high-risk patients
receive immediate and expedited attention, streamlining their journey through the diagnostic and
treatment pathways. This optimized workflow minimizes delays, prevents overcrowding in
diagnostic facilities, and allows healthcare professionals to allocate their valuable time and
resources more efficiently to critical cases, thereby improving overall healthcare delivery.

14
4. Personalized Medicine: Tailored Risk Assessment and Treatment Machine learning
models possess an unparalleled ability to analyze and synthesize complex individual patient
characteristics, including genetic predispositions, environmental exposures, lifestyle factors, and
specific medical histories. This capability allows them to provide highly personalized risk
assessments for lung cancer. Unlike generalized risk calculators, ML models can discern intricate
interactions between numerous variables, leading to a nuanced understanding of an individual's
susceptibility. This personalized approach facilitates more targeted and effective treatment
strategies, as interventions can be tailored precisely to the patient's unique risk profile and
predicted disease characteristics. For instance, if a model predicts a specific aggressive subtype,
treatment can be initiated more rapidly with therapies known to be effective against that subtype.

5. Enhanced Support for Healthcare Professionals: A Powerful Decision-Support Tool


Machine learning systems are not designed to replace healthcare professionals but rather to serve
as powerful decision-support tools. For radiologists interpreting complex CT scans, these
systems can act as an additional "pair of eyes," highlighting subtle nodules or suspicious patterns
that might otherwise be overlooked, thereby minimizing human error. In regions with limited
access to specialist oncologists or advanced diagnostic facilities, these systems can empower
general practitioners or regional hospitals to conduct initial, accurate risk assessments, guiding
them on which patients require urgent specialist referral. This democratizes access to
high-quality diagnostic insights, particularly benefiting underserved populations and bridging
gaps in healthcare access.

6. Population Health Management: Proactive Intervention and Resource Allocation The


large-scale implementation of machine learning-based prediction systems enables comprehensive
population-level health monitoring. By identifying prevalent risk factors and predicting potential
surges in lung cancer cases within specific communities, public health authorities can proactively
implement targeted intervention programs, health education campaigns, and early screening
initiatives. This proactive approach facilitates a more efficient allocation of public health
resources, allowing for early intervention programs that can significantly reduce the overall
incidence and mortality burden of lung cancer in communities, ultimately contributing to better
public health outcomes on a grand scale.

15
1.4 Methodology
The methodology for lung cancer detection and classification, drawn from reliable research
efforts, follows a structured approach encompassing data handling, model development, and
evaluation.

●​ Data Collection​

○​ Research extensively utilizes large datasets of medical images, primarily


Computed Tomography (CT) scans.
○​ One study references the use of 83 CT scans from 70 distinct patients for
experimental investigation.
○​ Another prominent dataset employed is LIDC-IDRI (Lung Image Database
Consortium-Image Database Resource Initiative), which comprises 4,682 CT
scan images from 61 patients, featuring nodules ranging from 3 to 30 mm in
diameter.
○​ Beyond imaging data, some methodologies also incorporate non-invasive patient
information, such as symptoms and medical history, to facilitate initial risk
assessment.

●​ Data Preprocessing​

○​ Noise Reduction: A critical initial step in image preprocessing is the reduction of


noise. Techniques like geometric mean filters are applied to enhance the quality
of CT scan images and mitigate artifacts that could interfere with analysis.
○​ Image Quality Enhancement: The application of these filters specifically aims
to improve the overall clarity and suitability of the input images for subsequent
processing stages.

●​ Feature Engineering​

○​ Image Segmentation: Post-preprocessing, K-means clustering algorithms are


widely applied for image segmentation. This process is crucial for accurately
identifying and isolating specific regions of interest within the lung CT scans,
such as potential nodules or affected areas.
○​ Multiview Image Registration and Fusion: A key innovation in one research
approach involves the utilization of multiview medical image registration and
fusion algorithms. This process combines information from multiple image
perspectives to create a more comprehensive view. Techniques like
Multiresolution Rigid Registration (MRR) are employed to ensure consistent

16
sizing and resolution across different image views, thereby extracting richer
anatomical and functional data.
○​ Automated Feature Extraction: To enhance efficiency and reduce manual
intervention, methods such as Discrete Wavelet Transform (DWT) and
principal component averaging are used for automated feature extraction
directly from CT images. Other methods involve identifying damaged regions and
comparing them with historical data.

●​ Model Selection​

○​ A diverse array of machine learning algorithms is considered for classification


and prediction. These include:
■​ Logistic Regression for probabilistic classification.
■​ K-Nearest Neighbors (KNN) Classifier for instance-based learning.
■​ Decision Tree, particularly for initial patient risk assessment based on
symptoms and medical history.
■​ Bagging Classifier for improved stability and accuracy through ensemble
learning.
■​ Gaussian Naive Bayes for probabilistic modeling with assumptions of
feature independence.
■​ Random Forest, another ensemble method known for robust
classification and reduced overfitting.
○​ Deep Learning Models: Convolutional Neural Networks (CNNs) are
extensively utilized for their superior capabilities in automated feature learning
from image data:
■​ VGG16 Convolutional Neural Networks are employed for validating
initial findings from CT scans and for predicting specific types of lung
cancer.
■​ The ResNet-18 model is specifically used for accurate tumor
identification and detailed stage classification (STG-1, STG-2, STG-3, and
STG-4).

●​ Training and Evaluation

○​ Two-Step Verification Architecture: One methodological framework


incorporates a two-step verification process. The first step involves a preliminary
risk assessment using a Decision Tree classifier, which has demonstrated a high
accuracy (e.g., 99.67%) for initial risk diagnosis. If a medium or high risk is

17
identified, the second step involves validating the finding and predicting the
specific cancer type using VGG16 CNNs on CT scan images.
○​ Cross-Validation: To ensure the models' robustness, accuracy, and
generalizability, k-fold cross-validation is a standard practice during the training
phase. One study explicitly mentions using a 10-fold cross-validation approach.
○​ Performance Metrics: Model evaluation focuses on key metrics such as
detection sensitivities, overall accuracy, and the reduction of false positives.
Performance is often assessed using techniques like the Free-Receiver Operating
Characteristic (FROC) curve. Reported accuracy rates for deep learning models
like ResNet-18 have reached as high as 98.2% in lung cancer detection and
staging.

●​ Model Deployment

While explicit sections detailing "Model Deployment" are not extensively elaborated in
the provided research abstracts, the ultimate aim of these projects is to develop
computer-aided detection (CAD) systems. These systems are designed to be integrated
into clinical environments to assist medical professionals in early and accurate diagnosis,
thereby reducing the workload on radiologists and minimizing the potential for human
error. The overarching goal is to provide a comprehensive "classification and prediction
system for lung cancer" that can be practically implemented.

●​ Monitoring and Maintenance

The provided documents do not contain explicit sections specifically addressing


"Monitoring and Maintenance" for deployed models. However, the continuous pursuit of
high accuracy, efficiency, and reliability in these diagnostic systems inherently implies
the need for ongoing monitoring and regular maintenance processes to ensure their
sustained effectiveness and integrity in real-world clinical applications.

18
CHAPTER 2
TOOLS AND TECHNOLOGY USED IN THE PROJECT
The development of the lung cancer prediction system leverages a comprehensive set of
programming tools and libraries, primarily within the Python ecosystem, to handle diverse
aspects of data processing, model construction, and performance evaluation.

2.1 Python
This serves as the foundational programming language for the entire project. Its versatility and
extensive libraries make it the preferred choice for implementing complex machine learning and
deep learning algorithms, managing data workflows, and developing the overall predictive
framework.

2.2 NumPy
As a cornerstone library for numerical computing in Python, Numpy is essential for efficient
handling of large datasets, particularly the numerical array structures inherent in medical image
processing (e.g., CT scan pixel data) and the mathematical computations central to machine
learning algorithms.

2.3 Matplotlib Library


This widely used library is critical for data visualization. It enables the creation of various plots,
charts, and graphical representations necessary for exploring data characteristics, visualizing
intermediate processing steps, and rigorously evaluating model performance through elements
like confusion matrices and Receiver Operating Characteristic (ROC) curves.

2.4 Pandas Library


A powerful tool for data manipulation and analysis in Python, Pandas provides robust data
structures like DataFrames. It is extensively utilized for tasks in the data preprocessing and
feature engineering phases, including cleaning raw data, handling missing values, standardizing
formats, and curating diverse patient datasets comprising demographic, symptom, and medical
history information.

19
2.5 Seaborn Library
Built upon Matplotlib, Seaborn enhances statistical data visualization capabilities. It is
particularly valuable for generating aesthetically pleasing and informative statistical graphics. Its
use facilitates deeper insights into data relationships (e.g., through correlation analysis) and
provides richer visualizations for interpreting model behavior and analytical results. \

2.6 OpenCV or CV2


This is a widely used library for computer vision tasks. Given the extensive image processing
steps described, such as image acquisition, preprocessing, noise reduction, and segmentation
(e.g., K-means clustering), OpenCV is a fundamental tool for manipulating and analyzing image
data.

2.7 Pillow (PIL)


Often used for image processing tasks, including opening, manipulating, and saving many
different image file formats. It complements OpenCV in handling image data.

2.8 Scikit-learn
This comprehensive machine learning library is implied by the use of various traditional machine
learning algorithms (Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest,
Bagging Classifier, Gaussian Naive Bayes) and is essential for standard procedures like splitting
datasets into training and testing sets for model evaluation.

2.9 Tensorflow
TensorFlow is an open-source, end-to-end platform for machine learning. It provides a
comprehensive ecosystem of tools, libraries, and community resources that allow researchers and
developers to build and deploy ML-powered applications. In the context of lung cancer
prediction, TensorFlow serves as the robust computational engine underlying the deep learning
models. It handles the low-level numerical operations, tensor manipulations, and efficient
execution on various hardware (e.g., GPUs), which are critical for processing large medical
image datasets (CT scans) and training complex neural networks. It provides the backbone for
defining and running the computational graphs of deep learning architectures.

20
2.10 Keras
Keras is a high-level neural networks API, written in Python and capable of running on top of
TensorFlow (among other backends like Theano or CNTK). It is designed for fast
experimentation with deep neural networks, making the process of building, training, and
evaluating deep learning models much simpler and more intuitive. In this project, Keras provides
the user-friendly interface for constructing the sophisticated CNN architectures utilized, such as
VGG16 and ResNet-18.

●​ Sequential Model API: Keras's Sequential model API allows for the linear stacking of
layers, making it straightforward to define the network structure.
●​ Layer Components: Specific Keras layers mentioned, like Conv2D (for learning spatial
features from images), AvgPool2D and MaxPooling2D (for downsampling and reducing
dimensionality), Flatten (to convert 2D feature maps into a 1D vector), Dense (for fully
connected classification layers), and Dropout (for regularization to prevent overfitting),
are directly used to build the CNNs that analyze CT scan images for tumor detection and
classification.
●​ Utilities: Keras also offers utilities like image_dataset_from_directory and img_to_array,
which streamline the process of loading, batching, and preparing image data for training
the deep learning models.

21
CHAPTER 3
SNAPSHOTS

Fig. 3.1 Source Code (1)

Fig. 3.2 Source Code (2)

Fig. 3.3 Source Code (3)

22
Fig. 3.4 Source Code (4)

Fig. 3.5 Source Code (5)

23
Fig. 3.6 Source Code (6)

Fig. 3.7 Source Code (7)

Fig. 3.8 Source Code (8)

24
Fig. 3.9 Source Code (9)

Fig. 3.10 Source Code (10)

25
Fig. 3.11 Source Code (11)

Fig. 3.12 Source Code (12)

26
Fig. 3.13 Source Code (13)

Fig. 3.14 Source Code (14)

Fig. 3.15 Source Code (15)

Fig. 3.16 Source Code (16)

27
Fig. 3.17 Source Code (17)

Fig. 3.18 Source Code (18)

Fig. 3.16 Source Code (19)

28
Fig. 3.20 Source Code (20)

Fig. 3.21 Source Code (21)

Fig. 3.22 Source Code (22)

Fig. 3.23 Source Code (23)

29
Fig. 3.24 Source Code (24)

Fig. 3.25 Source Code (25)

Fig. 3.22 Source Code (26)

30
Fig. 3.27 Source Code (27)

31
Fig. 3.28 Source Code (28)

Fig. 3.29 Source Code (29)

Fig. 3.30 Source Code (30)

32
Fig. 3.31 Source Code (31)

Fig. 3.32 Source Code (32)

Fig. 3.33 Source Code (33)

33
Fig. 3.34 Source Code (34)

Fig. 3.35 Source Code (35)

Fig. 3.36 Source Code (36)

Fig. 3.37 Source Code (37)

34
Fig. 3.38 Source Code (38)

Fig. 3.39 Source Code (39)

Fig. 3.40 Source Code (40)

35
Fig. 3.41 Source Code (41)

Fig. 3.42 Source Code (42)

36
Fig. 3.43 Source Code (43)

Fig. 3.44 Source Code (44)

Fig. 3.45 Source Code (45)

37
CHAPTER 4
RESULTS AND DISCUSSIONS

Results

Training Performance
The model training process demonstrated excellent convergence characteristics over 15 epochs.
The training and validation accuracy curves (Figure 4.1) show rapid improvement in the initial
epochs, with both metrics reaching approximately 98% accuracy by epoch 14. The training
accuracy achieved a final value of 98.3%, while validation accuracy reached 97.8%, indicating
minimal overfitting with only a 0.5% gap between training and validation performance.

Fig. 4.1 Result (1)

The loss curves (Figure 4.2) further confirm the model's effective learning progression. Both
training and validation loss decreased consistently from initial values of approximately 1.0 to
final values below 0.1. The parallel decline of both loss curves without significant divergence
suggests good generalization capability and appropriate model complexity for the given dataset.

38
Fig. 4.2 Result (2)

Model Performance Evaluation

Multi-class Classification Performance

The model's performance was evaluated across three different prediction approaches, yielding
varying results:

Prediction 1 (3-class classification):


●​ Overall Accuracy: 88.3%
●​ Precision: 89.5%
●​ Recall: 97.7%
●​ F1-score: 93.4%

Fig. 4.3 Result (3)

39
The binary classification showed strong performance with 85 true negatives and 6 true positives
correctly identified, with minimal false positives (2) and false negatives (10).

Fig. 4.4 Result (4)

Prediction 2 (Binary classification):

●​ Overall Accuracy: 87.4%


●​ Precision: 87.8%
●​ Recall: 98.9%
●​ F1-score: 92.9%

Fig. 4.5 Result (4)

40
The binary classification showed strong performance with 86 true negatives and 4 true positives
correctly identified, with minimal false positives (1) and false negatives (12).

Fig. 4.6 Result (6)

Prediction 3 (Binary classification):

●​ Overall Accuracy: 87.4%


●​ Precision: 87.8%
●​ Recall: 98.9%
●​ F1-score: 92.9%

Fig. 4.7 Result (7)

41
This prediction approach yielded identical results to Prediction 2, suggesting consistent model
behavior across similar classification tasks.

Fig. 4.8 Result (8)

Model Robustness Analysis

Prediction 4 (Binary classification):

●​ Overall Accuracy: 84.5%


●​ Precision: 84.5%
●​ Recall: 100%
●​ F1-score: 91.6%

Fig. 4.9 Result (9)

42
While showing slightly lower overall accuracy, this approach achieved perfect recall, correctly
identifying all positive cases without any false negatives.

Fig. 4.10 Result (10)

Prediction 5 (Binary classification):

●​ Overall Accuracy: 86.4%


●​ Precision: 91.0%
●​ Recall: 93.1%
●​ F1-score: 92.0%

Fig. 4.11 Result (11)

43
This configuration demonstrated the highest precision (91.0%) among all predictions, with 81
true negatives, 8 true positives, 6 false positives, and 8 false negatives.

Fig. 4.12 Result (12)

Prediction 6 (Binary classification):

●​ Overall Accuracy: 89.3%


●​ Precision: 90.4%
●​ Recall: 97.7%
●​ F1-score: 93.9%

Fig. 4.13 Result (13)

44
The final prediction approach achieved the highest overall accuracy (89.3%) and F1-score
(93.9%), with excellent balance between precision and recall.

Fig. 4.14 Result (14)

Discussions

Clinical Significance

The consistently high recall values across all predictions (ranging from 93.1% to 100%) are
particularly significant for medical diagnostic applications. High recall ensures that the model
successfully identifies the vast majority of positive cases, minimizing the risk of missing critical
diagnoses. The perfect recall achieved in Prediction 4 (100%) is especially noteworthy, as it
suggests the model can reliably detect all cases requiring medical attention.

Model Reliability and Generalization

The close alignment between training and validation accuracy (98.3% vs 97.8%) indicates that
the model generalizes well to unseen data without significant overfitting. The smooth
convergence of both accuracy and loss curves suggests optimal training duration and learning
rate selection.

Performance Consistency

The variation in performance across different prediction approaches (accuracy ranging from
84.5% to 89.3%) suggests that the model's performance may be sensitive to specific data
preprocessing or class balancing techniques. However, all configurations maintained strong
performance metrics, with F1-scores consistently above 91%, indicating robust overall
performance.

45
Comparative Analysis

The multi-class classification (Prediction 1) achieved competitive performance with 88.3%


accuracy, demonstrating the model's capability to distinguish between three distinct categories.
The perfect classification of Malignant cases in this scenario is particularly valuable for clinical
applications where false negatives could have severe consequences.

Limitations and Future Work

While the model demonstrates excellent performance, the slight variations across different
prediction approaches suggest potential areas for optimization. Future work could focus on:

1.​ Feature Engineering: Exploring additional feature extraction techniques to improve


consistency across different prediction approaches
2.​ Class Balancing: Investigating advanced sampling techniques to address any potential
class imbalance issues
3.​ Ensemble Methods: Combining multiple prediction approaches to leverage their
individual strengths
4.​ Cross-validation: Implementing k-fold cross-validation to better assess model stability
and generalization

Clinical Implementation Considerations

The high recall rates achieved by the model make it suitable for screening applications where
sensitivity is prioritized over specificity. However, the precision values (ranging from 84.5% to
91.0%) suggest that additional confirmatory testing would be advisable for positive predictions
to minimize false positive rates in clinical practice.

Conclusion

The developed model demonstrates strong performance across multiple evaluation metrics, with
particularly impressive recall rates that are crucial for medical diagnostic applications. The
consistent training performance and good generalization characteristics suggest that the model is
well-suited for practical implementation. The achievement of up to 89.3% accuracy with 93.9%
F1-score represents a robust solution for the classification task, with the flexibility to prioritize
either sensitivity or specificity based on specific clinical requirements.

The confusion matrix for the 3-class classification reveals excellent performance in
distinguishing between Benign (27 correct predictions, 3 misclassifications), Malignant (113
correct predictions, 0 misclassifications), and Normal (77 correct predictions, 0
misclassifications) cases. The model demonstrated perfect recall for Malignant cases, which is
crucial for medical diagnosis applications.

46
Fig. 4.15 Result (15)

47
CHAPTER 5
CONCLUSION AND FUTURE SCOPE

Conclusion
The comprehensive analysis of lung cancer prediction using multiple machine learning
algorithms demonstrates the significant potential for early detection and classification of this
life-threatening disease. This research has successfully implemented and evaluated a diverse
range of both traditional machine learning and deep learning approaches, providing valuable
insights into their comparative effectiveness for medical diagnosis.

Performance Achievements

The implementation of various machine learning approaches has yielded promising results across
different algorithmic frameworks. The ensemble methods, particularly Random Forest and
Bagging, demonstrated robust performance through their ability to combine multiple decision
trees and reduce overfitting. Support Vector Machine (SVM) showed excellent classification
capabilities with its ability to find optimal decision boundaries in high-dimensional feature
spaces. Decision Tree algorithms provided interpretable results with high accuracy, making them
valuable for clinical decision-making where transparency is crucial.

Logistic Regression offered reliable probabilistic predictions with good computational efficiency,
while Gaussian Naïve Bayes provided fast and effective classification despite its simplicity. The
Convolutional Neural Network (CNN) implementation for medical image analysis demonstrated
superior feature extraction capabilities from CT scan images, enabling accurate identification and
classification of lung cancer types.

Methodological Innovation

The integration of traditional machine learning algorithms with deep learning approaches
provides a comprehensive diagnostic framework. The ensemble methods (Random Forest and
Bagging) enhanced prediction reliability by reducing variance and improving generalization.
SVM's kernel-based approach effectively handled non-linear relationships in clinical data, while
Decision Trees offered clear decision pathways that clinicians can easily interpret and validate.

The CNN implementation for image analysis represents a significant advancement in automated
medical diagnosis, capable of identifying subtle patterns in medical images that might be
overlooked by traditional methods. The combination of these diverse approaches allows for
cross-validation of results and provides multiple perspectives on the same diagnostic challenge.

48
Comparative Analysis

Each algorithm demonstrated unique strengths: Random Forest and Bagging excelled in handling
noisy data and preventing overfitting through ensemble learning. SVM showed superior
performance with complex, high-dimensional datasets and provided robust classification
boundaries. Decision Trees offered exceptional interpretability, crucial for medical applications
where understanding the reasoning behind predictions is essential.

Logistic Regression provided probabilistic outputs that are valuable for risk assessment, while
Gaussian Naïve Bayes offered computational efficiency suitable for real-time applications. The
CNN model demonstrated state-of-the-art performance in image classification tasks,
automatically learning hierarchical features from raw medical images.

Clinical Impact

These diverse machine learning models address critical healthcare challenges by enabling early
detection when treatment outcomes are most favorable, providing multiple algorithmic
perspectives for diagnostic validation, offering both interpretable (Decision Trees) and
high-performance (CNN) solutions, and supporting different clinical scenarios from rapid
screening (Naïve Bayes) to detailed analysis (CNN).

The comprehensive approach ensures robustness and reliability in clinical decision-making, as


multiple algorithms can be used to cross-validate diagnoses and reduce the risk of
misclassification. This multi-algorithmic framework provides healthcare professionals with
flexible tools suited to different clinical contexts and resource constraints.

Future Scope

1. Advanced Ensemble Methods and Hybrid Models

Future research should focus on developing sophisticated ensemble techniques that optimally
combine the strengths of different algorithms. Advanced stacking methods could integrate
Random Forest's robustness, SVM's precision, and CNN's feature extraction capabilities into
unified prediction systems. Developing adaptive ensemble methods that dynamically weight
different models based on input characteristics would enhance overall performance.

The exploration of gradient boosting techniques like XGBoost and LightGBM could further
improve ensemble performance. Integration of deep ensemble methods that combine multiple
CNN architectures with traditional machine learning algorithms would create more robust and
accurate diagnostic systems.

2. Enhanced Deep Learning Architectures

Advancing beyond basic CNN implementations to explore ResNet, DenseNet, and Vision
Transformer architectures would improve medical image analysis capabilities. Implementing

49
attention mechanisms in CNN models would help identify specific regions of interest in medical
images, providing more interpretable results for clinicians.

Developing 3D CNN architectures for volumetric CT scan analysis would enable better spatial
understanding of lung structures. Integration of transfer learning techniques using pre-trained
medical imaging models would accelerate training and improve performance with limited
datasets.

3. Feature Engineering and Selection Optimization

Implementing advanced feature selection techniques that identify the most relevant clinical and
imaging features for each algorithm would optimize performance. Developing automated feature
engineering pipelines that can extract domain-specific features for traditional machine learning
algorithms while allowing CNNs to learn features automatically.

The integration of genetic algorithms for hyperparameter optimization across all models would
ensure optimal performance. Implementing feature fusion techniques that combine handcrafted
features with CNN-learned features could enhance traditional machine learning algorithm
performance.

4. Cross-Algorithm Validation and Uncertainty Quantification

Developing sophisticated validation frameworks that leverage disagreement between algorithms


to identify uncertain cases requiring additional clinical review. Implementing Bayesian
approaches for uncertainty quantification across all models would provide confidence intervals
for predictions.

Creating ensemble confidence metrics that combine uncertainty estimates from multiple
algorithms would help clinicians understand prediction reliability. Developing decision fusion
strategies that optimally combine predictions from different algorithms based on their individual
strengths and weaknesses.

5. Real-world Clinical Integration

Creating clinical decision support systems that can deploy multiple algorithms simultaneously
and provide comparative results to healthcare professionals. Developing user interfaces that
allow clinicians to choose appropriate algorithms based on specific clinical scenarios and
available computational resources.

Implementing real-time performance monitoring systems that track algorithm performance in


clinical settings and automatically retrain models when performance degrades. Creating pipelines
for continuous learning that can incorporate new clinical data to improve all models
simultaneously.

6. Personalized Algorithm Selection

50
Developing meta-learning approaches that can automatically select the most appropriate
algorithm for individual patients based on their clinical characteristics and data quality. Creating
personalized ensemble methods that weight different algorithms based on patient-specific
factors.

Implementing adaptive systems that can switch between fast algorithms (Naïve Bayes, Logistic
Regression) for screening and comprehensive approaches (CNN, ensemble methods) for detailed
diagnosis based on clinical urgency and resource availability.

7. Multi-modal Data Integration and Scalability

Extending the current framework to incorporate genetic data, blood biomarkers, and
environmental factors across all algorithms. Developing distributed computing frameworks that
can efficiently train and deploy multiple algorithms on large-scale medical datasets.

Creating federated learning approaches that can train all algorithms across multiple healthcare
institutions while maintaining patient privacy. Implementing edge computing solutions that can
run lightweight versions of multiple algorithms on mobile devices for point-of-care diagnostics.

The proposed model shows the overview of prediction of lung cancer at an early stage. After
prediction of the tumour normal, malignant or benign, we generate a confusion matrix for each
machine learning technique and based on the confusion matrix we calculate accuracy, Recall,
precision and F1 score. From the result we can say that our proposed model can distinguish
between benign and malignant, and it can be seen that artificial neural network is providing more
accuracy in both texture and region based, as well as from the recall value we can say that it has
correctly identified maximum number of malignant tumour In near future deep learning shall
outperform machine learning in the field of image classification, object recognition and feature
extraction. CNN networks are well known for its features in providing accuracy with higher
number of hidden layers in it.

51
REFERENCES

[1]
[Link]
ience-and-technology/computer-etworks/project-report-early-lung-cancer-detection-using-
machine-learning-and-image-processing/56169752

[2] [Link]

[3] S. Nageswaran et al., “[Retracted] Lung Cancer Classification and Prediction Using
Machine Learning and Image Processing,” BioMed Research International, vol. 2022, no.
1, p. 1755460, 2022.
[4] S. U. Krishna, A. Lakshman, T. Archana, K. Raja, and M. Ayyadurai, “Lung cancer
prediction and classification using decision tree and VGG16 convolutional neural
networks,” The Open Biomedical Engineering Journal, vol. 18, no. 1, 2024.
[5] I. Nazir, I. ul Haq, S. A. AlQahtani, M. M. Jadoon, and M. Dahshan, “Machine
Learning-Based Lung Cancer Detection Using Multiview Image Registration and Fusion,”
Journal of Sensors, vol. 2023, no. 1, p. 6683438, 2023.
[6] K. Tuncal, B. Sekeroglu, and C. Ozkan, “Lung cancer incidence prediction using
machine learning algorithms,” Journal of advances in information technology, vol. 11, no.
2, 2020.
[7] S. Raut, S. Patil, and G. Shelke, “Lung cancer detection using machine learning
approach,” International Journal of Advance Scientific Research and Engineering Trends
(IJASRET), 2021.
[8] [Link]
[9] S. S. Raoof, M. A. Jabbar, and S. A. Fathima, “Lung Cancer prediction using machine
learning: A comprehensive approach,” in 2020 2nd International conference on innovative
mechanisms for industry applications (ICIMIA), 2020, pp. 108–115.
[10] [Link]

52

You might also like