
Research Proposal UK

This document outlines a study plan for a PhD in computer science focused on using machine learning and artificial intelligence for early detection of breast tumors. The plan includes an introduction on breast cancer statistics and current detection methods. A literature review discusses previous research on classifying breast cancer using methods like decision trees, neural networks, and support vector machines. The proposed work plan involves feature selection and classification algorithms such as naive Bayes, linear regression, AdaBoost, and neural networks to classify tumors. The potential impact is that early detection could improve survival rates.

Uploaded by

DANI DJ

Study Plan for Doctoral Degree Program (PhD)

PhD (Computer Science)

Research Title

Early advance detection of Breast Tumor using Machine learning and Artificial Intelligence

7th July 2022

Table of Contents
1. Introduction & Background
2. Literature Review
   Current Progress and motivation
3. Resources required
4. Problem Statement
5. Research Methodology
   Feature Selection
   Gaussian Naïve Bayes
   Linear Regression Classifier
   AdaBoost Ensemble
   Neural Network
6. Proposed Work plan and implication
7. Impact Potential

1. Introduction & Background
Breast tumor is one of the most common diseases among women worldwide. Like all other
tumors, breast tumor has four stages: the first and second stages carry a comparatively low
risk to life, while the third and fourth stages are very dangerous. Thousands of women die
from breast tumor all over the world. According to surveys, an estimated 1.38 million new
cases were diagnosed in 2008 alone, which was 23% of all tumor types, while 2018 saw
1.67 million new cases, which was 25% of all tumors [1], ranking breast tumor 2nd overall.
In 2019, an estimated 268,600 new cases of invasive breast cancer were expected to be
diagnosed among women and approximately 2,670 cases among men. In addition, an estimated
48,100 cases of DCIS were expected to be diagnosed among women. Approximately 41,760 women
and 500 men were expected to die from breast cancer in 2019 [1]. Another recent survey
reported 2 million new tumor cases in 2018 alone. Of the new breast tumor cases, 39% were
encountered in the Asian population, in which 44% of the women died. A sample comparison of
new cases and deaths from breast tumors between 2008 and 2018 is shown in Table 1. The table
shows that although the number of new cases in less developed regions is almost equal to
that in more developed regions, the death rate is much higher in less developed regions
owing to the lack of advanced technology and early tumor detection.

Region (per 100,000)         | 2008 New Cases | 2008 Deaths | 2018 New Cases | 2018 Deaths
United States of America     | 182            | 40          | 433            | 45
IARC membership              | 740            | 214         | 940            | 259
WHO Western Pacific region   | 279            | 73          | 430            | 186
WHO South East Asian region  | 203            | 93          | 240            | 117
WHO Europe region            | 450            | 139         | 500            | 243
WHO Africa region            | 68             | 37          | 200            | 49
Less developed regions       | 691            | 269         | 883            | 324
More developed regions       | 692            | 189         | 794            | 298
World                        | 1384           | 458         | 2077           | 832
Table 1: Breast Tumor Cases (2008 and 2018) across World (WCRF)

It is noted that breast tumor survival is directly related to early-stage tumor detection.
Survival is low when the tumor is not detected at an early stage, as can be seen in Table 1,
where less developed regions have many more deaths than more developed regions. There are
many tumor detection techniques, both clinical and technological. Mammography uses x-rays
of the breast for tumor detection. Here the specialists and doctors are the source of the
diagnosis, and they can fail to detect early-stage tumors because of human habituation [2].

2. Literature Review
Breast tumor has become a major cause of death, and the number of deaths due to breast
tumor increases every year [3]. It is a common type of tumor diagnosed in women and can
lead to their death. Data mining and classification are efficient methods for making
decisions based on analysis and diagnosis. One study compared the performance of four
machine learning algorithms, Decision Tree (C4.5), Support Vector Machine (SVM), Naïve
Bayes (NB) and k-Nearest Neighbors (k-NN), using the Wisconsin Breast Cancer (original)
dataset. The objective of that study was to classify the dataset correctly and to assess the
efficiency and effectiveness of each algorithm in terms of accuracy, specificity and
sensitivity. The algorithm that gave the highest accuracy and lowest error rate was SVM,
with an accuracy of 97.13%. However, there is a need for more intelligent classification
algorithms that will help doctors reduce the error rate [4].

Another work gave an inclusive perspective on implementations of automated diagnostic
systems for the diagnosis of breast cancer. It compared the performance of the combined
neural network (CNN), recurrent neural network (RNN), multilayer perceptron neural network
(MLPNN), probabilistic neural network (PNN) and support vector machine (SVM). The main
objective of that study was to provide a guide for those who want to develop such systems
[5, 6]. A further study stated that breast cancer can be detected at early stages by using
machine learning algorithms, and showed ways in which these algorithms can work better to
classify cancer and non-cancer patients [7]. The use of a Bayesian gene selection approach
and logistic regression for cancer prediction and classification was reported in [8]; in
that work, various techniques were utilized in order to find an effective model for the
diagnosis of breast cancer.

The isotonic separation method was developed for the classification of data [9]. It was
identified as an effective classification method after its performance was compared with
learning vector quantization, support vector machines, decision tree induction and other
methods on sufficient and insufficient breast cancer datasets [10]. The results of this work
showed the method to be a practical tool for classification in the medical domain.

The least squares support vector machine and active set strategy were introduced in [11].
These methods were developed for the classification of breast cancer datasets [12]. Another
work described how deep convolutional neural networks enable discrimination of heterogeneous
digital pathology images. Its main focus is a set of computational methods based on
convolutional neural networks (CNN), followed by the development of a pipeline for
effectively classifying histopathology images across different cancer types [13]. A hybrid
machine learning method has also been used for breast cancer diagnosis. This method combines
the k-nearest neighbor algorithm with an artificial immune system. The hybrid method showed
effective results on the Wisconsin Breast Cancer Dataset (WBCD), and it is stated that this
method can be utilized for other diagnosis problems in breast cancer [14].

Two approaches, evolutionary and ensemble, have been followed using the negative
correlation training algorithm. In [15], a Neural Network was applied to WBCD to analyze the
accuracy of different diagnosis techniques. In 2007, Sumathi et al. applied a genetic
algorithm approach to WBCD and concluded that genetic algorithms can reduce the time needed
to train the network as well as improve accuracy. The use of SVM for early diagnosis of
breast cancer was carried out in [14], while the use of SVM and RVM for classification of
documents was reported in [16], which concluded that the prediction accuracy of RVM is
higher than that of SVM. A computerized breast cancer diagnosis method was introduced in
[17].

This method combines a back-propagation neural network with a genetic algorithm, which can
reduce the time of diagnosis and classify breast masses into benign and malignant types. In
that study, the method was applied to the dataset [18, 19]. In the case of Set A, all data
containing missing values were removed, with 100% accuracy, while in the case of Set B a
statistical cleaning process was carried out to find any missing values or noise, with
83.36% accuracy. The study concluded that medical data in its original form gives more
accurate results than data that has been changed.

Current Progress and motivation:


The motivation behind choosing breast tumor detection is depicted in Figure 1. Among all
tumors, 24% of the affected population has breast tumor. This problem needs to be solved.
The main difficulty is that diagnosing breast cancer at early stages is hard due to several
factors. Oncologists and human specialists are often unable to recognize the tumor at early
stages from x-rays and other types of images. This is a reason for the high death rate from
breast tumor. In my research, I aim to detect breast tumor at early stages using machine
learning algorithms. For this purpose, I use the Wisconsin breast cancer dataset intended
for diagnosis. The dataset is described in the next section.

Fig1. Breast Tumor vs Rest

3. Resources required
The dataset that will be used in this research is the Wisconsin breast cancer (diagnostic)
dataset, which was also used in previous breast cancer detection research by M. R.
Al-Hadidi, A. Alarabeyyat et al. [17]. The dataset originates from the University of
Wisconsin. It contains statistical values calculated from breast tumor images. It has a
total of 569 instances under two labels (i.e. Malignant and Benign). The total number of
attributes per sample is 32. There are 212 malignant samples (37.3% of the total) and 357
benign samples (62.7%), as shown in Figure 2. The dataset will be loaded into Python using a
built-in function of the sklearn library.
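
As a minimal sketch of the loading step, the snippet below assumes that the built-in function meant here is scikit-learn's load_breast_cancer. Note that scikit-learn's copy exposes 30 numeric features per sample; the ID and diagnosis columns that bring the raw file to 32 attributes are handled separately, with the diagnosis returned as the target vector.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the "built-in function of sklearn" is load_breast_cancer.
data = load_breast_cancer()
X, y = data.data, data.target

# In scikit-learn's copy, label 0 is malignant and label 1 is benign.
n_malignant = int(np.sum(y == 0))
n_benign = int(np.sum(y == 1))

print("samples x features:", X.shape)
print("malignant:", n_malignant, f"({100 * n_malignant / len(y):.1f}%)")
print("benign:", n_benign, f"({100 * n_benign / len(y):.1f}%)")
```

The printed class counts can be checked against the figures quoted above before any model is trained.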

Fig2. Dataset Categorization


4. Problem Statement
Breast tumor is one of the leading causes of death among all types of tumors in women.
According to a survey, 40k people have died from breast tumor in a single year. The death
percentage is very high in the last (fourth) stage of breast tumor. That is why early-stage
detection using different machine learning and artificial intelligence techniques is
required. This project compares the performance of different machine learning and artificial
intelligence techniques on the established Wisconsin breast cancer benchmark dataset. The
main objective is to diagnose and predict breast tumor at early stages in order to reduce
the death rate.

As discussed in the introduction and literature sections, if breast tumor is not diagnosed
at early stages it can lead to death. As already mentioned, in mammography oncologists are
supposed to detect the stage of the tumor from x-ray images, but due to human fatigue,
detecting breast tumor at early stages is still a big challenge. The literature discusses
machine learning techniques focused on early detection of breast tumor, but due to the
diverse nature of this tumor or one limitation or another of the machines, the problem
remains open and is waiting for bioinformaticians to resolve it. In my research, I will
focus on different advanced machine learning algorithms combined with artificial
intelligence and present a comparison of their ability to diagnose breast tumor. Based on
the comparison results, I will also design a hybrid algorithm that will be more efficient
and helpful for early detection of breast tumor.

5. Research Methodology
There are tens of state-of-the-art machine learning models in the literature, and each has
its own pros and cons. In my research I will compare one older machine learning model (i.e.
Gaussian Naïve Bayes) with three more advanced machine learning models (i.e. the Linear
Regression Classifier, which is in fact logistic regression, a Neural Network, and the
ensemble method AdaBoost). The relative comparison between these models will show their
overall performance. The subsections below discuss each model.

Feature Selection
Feature selection will be performed on the data before it is passed to the corresponding
machine learning model. Some features take values greater than 50 while others stay below
10. I will calculate the maximum value of each feature and count the features whose maxima
fall outside the common range. I will exclude these features from the dataset as a manual
form of normalization across features. This will be carried out using a Python-based
feature selection script.
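
The following is a rough sketch of how such a script might look. The cut-off of 50 is taken from the description above, but treating it as a hard threshold on the per-feature maximum, and dropping every feature above it, are assumptions made for illustration rather than the final selection rule.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumed cut-off: features whose maximum exceeds this value are dropped.
MAX_VALUE_THRESHOLD = 50.0

data = load_breast_cancer()
X = data.data
feature_names = data.feature_names

# Maximum observed value of each feature (one entry per column).
feature_max = X.max(axis=0)

# Keep only the features whose maximum stays at or below the threshold.
keep_mask = feature_max <= MAX_VALUE_THRESHOLD
X_selected = X[:, keep_mask]
dropped = [str(name) for name in feature_names[~keep_mask]]

print("features kept:", X_selected.shape[1], "of", X.shape[1])
print("features dropped:", dropped)
```

A common alternative to dropping large-valued features is to rescale all features (for example with standardization), which keeps their information while removing the scale difference.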

Gaussian Naïve Bayes


As the name shows, this is a supervised machine learning algorithm that learns according to
Bayes' theorem under the naïve assumption of conditional independence between every pair of
features given the class variable [20]. The built-in implementation of this method is
imported in Python from the sklearn library.

The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into train
and test sets, I will use a built-in function imported from the sklearn library. The
complete reference code for this model, with each step explained in a comment before the
corresponding line, is given in the 'Gaussian Naïve Bayes' subsection of the 'ML Models'
section. Training and testing will also be explained in that subsection and in the code
comments. The results will be discussed after successful implementation and collection of
results.
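
A minimal sketch of the split and the Gaussian Naïve Bayes model is given below. It assumes that the built-in split function is scikit-learn's train_test_split; the random seed and the stratified split are illustrative choices, not part of the plan.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# 70% training data, 30% held out for testing.
# random_state and stratify are assumptions made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Fit Gaussian Naive Bayes on the training portion only.
model = GaussianNB()
model.fit(X_train, y_train)

# Evaluate on the held-out 30%.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("train size:", len(X_train), "test size:", len(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

The same split call can be reused for the other models so that all of them are compared on identical test samples.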

Linear Regression Classifier


Logistic regression (also known as maximum entropy or logit regression) is a classification
algorithm rather than a regression algorithm, built on a linear model [21]; it is the model
referred to in this plan as the Linear Regression Classifier. It is a probabilistic machine
learning model: the logistic function maps the output of the linear model to the probability
of an outcome. As this is a probabilistic classifier, we have to set a decision boundary on
the output, which lies between 0 and 1. The cost function is the quantity that LR optimizes.
Finally, stochastic gradient descent is used to minimize the error as much as possible.
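
To make these pieces concrete, here is a small from-scratch sketch of the logistic function, the log-loss cost, a stochastic gradient descent loop, and a 0.5 decision boundary. The synthetic two-cluster data, the learning rate, and the number of epochs are all illustrative assumptions; the actual experiments would use the sklearn implementation on the Wisconsin data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Cost function of logistic regression (mean negative log-likelihood)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Stochastic gradient descent: update the weights one sample at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            error = sigmoid(X[i] @ w + b) - y[i]  # gradient of the loss w.r.t. z
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# Synthetic, well-separated two-class data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = fit_sgd(X, y)
proba = sigmoid(X @ w + b)
pred = (proba >= 0.5).astype(int)  # decision boundary at 0.5

print("log loss:", round(log_loss(y, proba), 4))
print("accuracy:", float((pred == y).mean()))
```

Moving the 0.5 threshold trades sensitivity against specificity, which matters in diagnosis where a missed malignant case is costlier than a false alarm.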

Training/Testing/Results/Discussion

The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into train
and test sets, I will use a built-in function imported from the sklearn library. The data
will then be fed to the Python-based model script to gather the results. The results will be
observed and compared with the results of the other models. Below is the confusion matrix
for the stated method based on demo data.

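
A confusion matrix of this kind could be produced along the lines of the sketch below. It is an illustration under assumptions (scikit-learn's LogisticRegression, feature standardization, a fixed random seed) and does not reproduce the demo result referred to above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Standardizing the features first is an assumption; it helps the solver converge.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
# Label order: 0 = malignant, 1 = benign (scikit-learn's encoding).
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])

# Treating malignant as the positive class:
sensitivity = cm[0, 0] / cm[0].sum()  # malignant cases correctly detected
specificity = cm[1, 1] / cm[1].sum()  # benign cases correctly cleared

print(cm)
print(f"sensitivity: {sensitivity:.3f}  specificity: {specificity:.3f}")
```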
AdaBoost Ensemble
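As the name shows, this is an ensemble technique that combines several learners into a
single classifier. It is so far one of the most successful boosting ensemble techniques. It
has two main algorithms. The method works on the input data and the weight given to each
sample: after every round, samples that were misclassified receive more weight so that the
next learner concentrates on them.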

Training/Testing/Results/Discussion

The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into train
and test sets, I will use a built-in function imported from the sklearn library. As with
the other models, the model script will be executed in Python to gather the results on the
dataset. The results will be compared with the results of the other models as well as with
the actual labels to check accuracy and reliability.
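
As an illustrative sketch only, the AdaBoost step might look as follows with scikit-learn's AdaBoostClassifier. The number of estimators and the random seed are assumed values, and the default weak learner (a depth-1 decision tree) is used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Each boosting round fits a weak learner on reweighted training samples.
# n_estimators=100 is an assumed setting, not a tuned value.
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("weak learners fitted:", len(model.estimators_))
print(f"test accuracy: {accuracy:.3f}")
```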

Neural Network
A Neural Network (NN), which mimics the human brain, is a rapidly emerging technique.
Murata's 1994 work is an early reference for it. Through its weights and bias terms, the
algorithm learns and trains itself according to the input data. All it needs is diverse
training data. Given sufficiently diverse training data, the model can outperform humans in
many tasks. A more advanced NN model for detecting diseases from images is the
convolutional neural network, as discussed in the literature. I will use a simple
multilayer perceptron model to train and test on the data.

Training/Testing/Results/Discussion

The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into train
and test sets, I will use a built-in function imported from the sklearn library. As with
the other models, the model script will be executed in Python to gather the results on the
dataset. The results will be compared with the results of the other models as well as with
the actual labels to check accuracy and reliability. The results will be discussed after
successful implementation.

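
A simple multilayer perceptron of the kind described could be sketched as below. The single hidden layer of 16 units, the iteration limit, the standardization step, and the random seed are assumptions chosen for illustration, not the final architecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Neural networks are sensitive to feature scale, so inputs are standardized.
# One hidden layer with 16 units is an assumed, deliberately small architecture.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Because all sketches share the same split settings, their test accuracies can be placed side by side for the planned model comparison.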
6. Proposed Work plan and implication
The tentative activity timeline and milestones are given below:

• Survey, literature review and reports on existing knowledge and knowledge management:
  5 months (5th month)
• Analysis and case study, comparison of results in order to determine best practice:
  7 months (12th month)
• Framework design and implementation: 12 months (24th month)
• Experimentation using a suitable platform and environment: 3 months (27th month)
• Data analysis: 3 months (30th month)
• Thesis write-up and submission: 3 months (33rd month)

7. Impact Potential
Research findings and recommendations will be published in reputable academic and
science-indexed journals, and exploitable outputs will be produced. Moreover, the
instructions of the research supervisor will be given the highest priority during the
research.

References

[1] American Cancer Society 2020 , Breast cancer Facts and Figures report for year 2019-2020.

[2] Sri Hari Nallamala, Dr. Pragnyaban Mishra and Dr. Suvarna Vani Koneru, “Qualitative
Metrics on Breast Cancer Diagnosis with Neuro Fuzzy Inference Systems”, International
Journal of Advanced Trends in Computer Science and Engineering (IJATCSE), Vol. 8 No. 2
(2019), P. 259 – 264.

[3] Kuhl CK. Abbreviated magnetic resonance imaging (MRI) for breast cancer screening:
rationale, concept, and transfer to clinical practice. Annu Rev Med. 2019 ,pp 501-519.

[4] Sri Hari Nallamala, Siva Kumar Pathuri, Dr Suvarna Vani Koneru, “An Appraisal on
Recurrent Pattern Analysis Algorithm from the Net Monitor Records”, International Journal of
Engineering & Technology (IJET) (UAE), ISSN: 2227 – 524X, Vol. 7, No 2.7 (2018), SI 7, P.
542 – 545

[5] Sri Hari Nallamala, Siva Kumar Pathuri, Dr Suvarna Vani Koneru, “A Literature Survey
on Data Mining Approach to Effectively Handle Cancer Treatment”, International Journal of
Engineering & Technology (IJET) (UAE), ISSN: 2227 – 524X, Vol. 7, No 2.7 (2018), SI 7, P.
729 – 732.

[6] Khosravi, P., Kazemi, E., Imielinski, M., Elemento, O. and Hajirasouliha, I., 2018. Deep
convolutional neural networks enable discrimination of heterogeneous digital pathology
images. EBioMedicine, 27, pp.317-328.

[7] D. S. Jacob, R. Viswan, V. Manju, L. PadmaSuresh and S. Raj, 2018, "A Survey on Breast
Cancer Prediction Using Data Mining Techniques", Conference on Emerging Devices and
Smart Systems (ICEDSS), Tiruchengode, pp. 256-258.

[8] Vreemann S, Gubern-Mérida A, Schlooz-Vries MS, et al. Influence of risk category and
screening round on the performance of an MR imaging and mammography screening program
in carriers of the BRCA mutation and other women at increased risk. Radiology.
2018;286(2):443-451.

[9] S. Nayak and D. Gope, "Comparison of supervised learning algorithms for RF-based breast
cancer detection," 2017 Computing and Electromagnetics International Workshop (CEM),
Barcelona, 2017, P. 13-14. doi: 10.1109/CEM.2017.7991863.

[10] Şahan, S., Polat, K., Kodaz, H. and Güneş, S., 2017. A new hybrid method based on fuzzy-
artificial immune system and k-nn algorithm for breast cancer diagnosis. Computers in Biology
and Medicine, 37(3), pp.415-423.

[11] Kuhl CK, Strobel K, Bieling H, Leutner C, Schild HH, Schrading S. Supplemental breast
MR imaging screening of women with average risk of breast cancer. Radiology.
2017;283(2):361-370

[12] Amin MB, Edge SB, Greene FL, et al, eds. AJCC Cancer Staging Manual. 8th ed. New
York, NY: Springer; 2017

[13] Lo G, Scaranelo AM, Aboras H, et al. Evaluation of the utility of screening
mammography for high-risk women undergoing screening breast MR imaging. Radiology.
2017;285(1):36-43.

[14] Amin MB, Edge SB, Greene FL, et al, eds. AJCC Cancer Staging Manual. 8th ed.
New York, NY: Springer; 2017

[15] Y. Tsehay et al., "Biopsy-guided learning with deep convolutional neural networks for
Prostate Cancer detection on multiparametric MRI," 2017 IEEE 14th International Symposium
on Biomedical Imaging (ISBI 2017), Melbourne, VIC, 2017, P. 642-645.

[16] Asri, H., Mousannif, H., Al Moatassime, H. and Noel, T., 2016. Using machine learning
algorithms for breast cancer risk prediction and diagnosis. Procedia Computer Science, 83,
pp.1064-1069.

[17] M. R. Al-Hadidi, A. Alarabeyyat and M. Alhanahnah, "Breast Cancer Detection Using
K-Nearest Neighbor Machine Learning Algorithm," 2016 9th International Conference on
Developments in Systems Engineering (DeSE), Liverpool, 2016, P. 35-39

[18] C. Deng and M. Perkowski, "A Novel Weighted Hierarchical Adaptive Voting Ensemble
Machine Learning Method for Breast Cancer Detection," 2015 IEEE International Symposium
on Multiple-Valued Logic, Waterloo, ON, 2015, P. 115-120.

[19] Sastry, J.K.R., Ganesh, J.V., Bhanu, J.S., I2C based networking for implementing
heterogeneous microcontroller based distributed embedded systems, Indian Journal of Science
and Technology, Volume 8, Issue 15, 2015

[20] KISHORE, P.V.V., PRASAD, M.V.D., PRASAD, C.R. and RAHUL, R., 2015. 4 -Camera
model for sign language recognition using elliptical fourier descriptors and ANN, International
Conference on Signal Processing and Communication Engineering Systems - Proceedings of
SPACES 2015, in Association with IEEE 2015, pp. 34 -38.

[21] Dubey, A.K., Gupta, U. and Jain, S., 2015. A survey on breast cancer scenario and
prediction strategy. In Proceedings of the 3rd International Conference on Frontiers of
Intelligent Computing: Theory and Applications (FICTA) 2014 (pp. 367-375). Springer, Cham.
