Research Proposal UK
Research Title
Table of Contents
1. Introduction & Background
2. Literature Review
Current Progress and motivation
3. Resources required
4. Problem Statement
5. Research Methodology
Feature Selection
Gaussian Naïve Bayes
Linear Regression Classifier
AdaBoost Ensemble
Neural Network
6. Proposed Work plan and implication
7. Impact Potential
1. Introduction & Background
Breast tumor is one of the most common diseases among women worldwide.
Like all other tumors, breast tumor has four stages: the first and second stages carry a
low risk of death, while the third and fourth stages are very dangerous. Thousands of
women die from breast tumor all over the world. According to surveys, an estimated 1.38
million new cases were diagnosed in 2008 alone, which was 23% of all types of tumors, while
1.67 million new cases were reported in 2018, which was 25% of all tumors [1] and ranked
breast tumor as the second most common tumor overall. In 2019, an estimated 268,600 new
cases of invasive breast cancer were expected to be diagnosed among women in the United
States and approximately 2,670 cases among men. In addition, an estimated 48,100 cases of
ductal carcinoma in situ (DCIS) were expected to be diagnosed among women. Approximately
41,760 women and 500 men were expected to die from breast cancer in 2019 [1]. Another
recent survey reported 2 million new tumor cases in 2018 alone. Of new breast tumor cases,
39% were encountered in the Asian population, in which 44% of the women have died. A sample
comparison of new cases and deaths from breast tumors between 2008 and 2018 is shown in
Table 1. The table shows that although the number of new cases in less developed regions is
almost equal to that in more developed regions, the death rate is much higher in less
developed regions owing to the lack of advanced technology and early tumor detection.
It is noted that breast tumor survival depends directly on early-stage tumor
detection. Survival is low when tumors are not detected at an early stage, as can be seen in
Table 1, where less developed regions have far more deaths than more developed regions.
There are many tumor detection techniques, both clinical and technological. The
mammography method uses x-rays of the breast for tumor detection. Here the specialists and
doctors are the diagnostic source, and they can fail to detect tumors at early stages because
of human habituation [2].
2. Literature Review
Breast tumor has become a major cause of death, and the number of deaths due to breast
tumor increases every year [3]. It is the most common type of tumor diagnosed in women and
a leading cause of their death. Data mining and classification are efficient methods for making
decisions based on analysis and diagnosis. In one study, the performance of four machine
learning algorithms, Decision Tree (C4.5), Support Vector Machine (SVM), Naïve Bayes (NB)
and k Nearest Neighbors (k-NN), was compared using the Wisconsin Breast Cancer (original)
dataset. The objective of that study was to classify the dataset correctly and to compare the
efficiency and effectiveness of each algorithm in terms of accuracy, specificity and sensitivity.
The algorithm that gave the highest accuracy and the lowest error rate was SVM, with an
accuracy of 97.13%. However, there is a need for more intelligent classification algorithms
that will help doctors to reduce the error rate [4].
The isotonic separation method was developed for the classification of data
[9]. It was judged an effective classification method after its performance was compared
with learning vector quantization, support vector machines, decision tree induction and
other methods on sufficient and insufficient breast cancer data sets [10]. The results of this
work showed the method to be a practical tool for classification in the medical domain.
The least squares support vector machine and the active set strategy were introduced by
[11]. These methods were developed for the classification of breast cancer datasets [12].
Another work described how deep convolutional neural networks enable discrimination of
heterogeneous digital pathology images. Its main focus is a set of computational methods
based on convolutional neural networks (CNN), followed by the development of a pipeline for
the effective classification of histopathology images in different cancer types [13]. A hybrid
machine learning method has also been used for breast cancer diagnosis. This method
combines the k-nearest neighbor algorithm with an artificial immune system. The hybrid
method showed effective results on the Wisconsin Breast Cancer Dataset, and it is stated that
the method can be utilized for other diagnosis problems in breast cancer [14].
Two approaches, evolutionary and ensemble, have been followed using a negative
correlation training algorithm. In [15], a neural network was applied to the WBCD to analyze
the accuracy of different diagnosis techniques. In 2007, Sumathi et al. applied a genetic
algorithm approach to the WBCD and concluded that genetic algorithms can reduce the time
needed to train the network as well as improve the accuracy. The use of SVM for early
diagnosis of breast cancer was carried out in [14], while the use of SVM and RVM for the
classification of documents was studied in [16], whose author concluded that the prediction
accuracy of RVM is higher than that of SVM. A computerized breast cancer diagnosis method
was introduced in [17].
This method combines a back-propagation neural network with a genetic algorithm
in order to reduce the time of diagnosis and to classify a breast mass as benign or
malignant. In that study, the method was applied to the dataset of [18, 19].
In the case of Set A, all records containing missing values were removed, giving 100%
accuracy, while in the case of Set B a statistical cleaning process was carried out to find
missing values or noise, giving 83.36% accuracy. The study concluded that medical data
in its original form gives more accurate results than data that has been changed.
3. Resources required
The dataset that will be used in this research is the Wisconsin dataset, collected from
previous research conducted by R. Al-Hadidi and A. Alarabeyyat for the purpose of breast
cancer detection [17]. These two were professors at the University of Wisconsin. The dataset
contains statistical values calculated from breast tumor images. It has a total of 569
instances under two labels (i.e. Malignant and Benign). The total number of attributes per
sample is 32. There are 212 malignant samples, which is 37.3% of the total, and 357 benign
samples, which is 62.7%, as shown in figure 2. The dataset will be loaded into Python with the
built-in function of the sklearn library.
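As a minimal sketch of this loading step, assuming the built-in function meant here is
scikit-learn's load_breast_cancer, the instance and class counts quoted above can be checked
as follows. Note that scikit-learn's copy exposes 30 numeric features per sample; the ID and
diagnosis columns of the original file are not part of the feature matrix.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Assumption: the proposal's "built-in function" is load_breast_cancer.
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target

# In scikit-learn's copy, target 0 is "malignant" and target 1 is "benign".
class_counts = {
    str(dataset.target_names[label]): count
    for label, count in Counter(y.tolist()).items()
}
# Share of each class as a percentage of all samples, rounded to one decimal.
class_share = {
    name: round(100 * count / len(y), 1) for name, count in class_counts.items()
}

print("Feature matrix shape:", X.shape)
print("Samples per class:", class_counts)
print("Share per class (%):", class_share)
```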
4. Problem Statement
As discussed in the introduction and literature review, breast tumor can lead to death if it
is not diagnosed at an early stage. As already mentioned, in the mammography method
oncologists are supposed to detect the stage of a tumor from x-ray images, but owing to
human fatigue, breast tumor detection at early stages is still a big challenge. The literature
discusses machine learning techniques that focus on early detection of breast tumor, but
because of the diverse nature of this tumor and the limitations of the machines, the problem
is still open and waiting to be resolved by bioinformaticians. In my research, I will focus on
different advanced machine learning algorithms combined with artificial intelligence and
present a comparison of their ability to diagnose breast tumor. Based on the comparison
results, I will also design a hybrid algorithm that will be more efficient and helpful for the
early detection of breast tumor.
5. Research Methodology
There are dozens of state-of-the-art machine learning models in the literature, and each
has its own pros and cons. In my research I will compare an older machine learning model
(i.e. Gaussian Naïve Bayes) with three more advanced machine learning models
(i.e. Linear Regression Classifier, Neural Network and the ensemble method AdaBoost).
The relative comparison between these models will be presented in the overall performance
section. The subsections below discuss each model.
Feature Selection
Feature selection will be performed on the data before the data is passed to the
corresponding machine learning model. Some features have values greater than 50, while
others have values below 10. I will calculate the maximum value of each feature and identify
the features whose scale does not match the rest. I will exclude these features from the
dataset as a manual form of normalization across features. This will be executed using a
Python-based feature selection script.
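One possible reading of this step is sketched below: compute the maximum of every feature
and drop the features whose maximum exceeds a cut-off. The cut-off of 50 is an assumption
taken from the figure mentioned above, and the variable names are illustrative; the final
script may use a different rule.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X, feature_names = dataset.data, dataset.feature_names

# Assumption: a feature is "out of scale" when its maximum exceeds 50.
MAX_VALUE_THRESHOLD = 50.0

# Maximum value of each feature (one entry per column).
feature_max = X.max(axis=0)

# Keep only the features whose maximum stays at or below the cut-off.
keep_mask = feature_max <= MAX_VALUE_THRESHOLD
X_selected = X[:, keep_mask]
kept_features = list(feature_names[keep_mask])
dropped_features = list(feature_names[~keep_mask])

print("Kept features:", len(kept_features))
print("Dropped features:", dropped_features)
```

Dropping large-valued features is only one way to bring features onto a comparable scale;
rescaling them (for example with sklearn's StandardScaler) would keep their information.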
The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into training
and test sets, I will use the built-in function imported from the sklearn library. The complete
reference code for this model, with the details of each step given in comments before each
line, is provided in the ‘Gaussian Naïve Bayes’ subsection of the ‘ML Models’ section.
Explanations of training and testing will also be provided in that subsection, along with
explanations in the comments. The results will be discussed after successful implementation
and collection of the results.
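The split and the Gaussian Naïve Bayes step could look like the following sketch. It assumes
the built-in function is sklearn's train_test_split; the stratified split and the fixed
random_state are choices made here for reproducibility, not requirements of the proposal.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# 70% training data, 30% held out for testing.
# Assumption: stratify on y so both parts keep the malignant/benign ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit Gaussian Naive Bayes on the training part only.
model = GaussianNB()
model.fit(X_train, y_train)

# Evaluate on the held-out 30%.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Train size: {len(X_train)}, test size: {len(X_test)}")
print(f"Gaussian Naive Bayes test accuracy: {test_accuracy:.3f}")
```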
Training/Testing/Results/Discussion
The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into training
and test sets, I will use the built-in function imported from the sklearn library. The data will
then be fed to the Python-based model script to gather the results. The results will be
observed and compared with the results of the other models. Below is the confusion matrix
for the stated method based on demo data.
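A confusion matrix of this kind, and the accuracy, sensitivity and specificity derived from
it, could be computed as in the following sketch. The labels are hand-made demo values
(1 standing for malignant, 0 for benign) chosen for illustration only; they are not results
of this research.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical demo labels: 1 = malignant (positive class), 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all samples classified correctly
sensitivity = tp / (tp + fn)      # share of malignant samples that were found
specificity = tn / (tn + fp)      # share of benign samples correctly cleared

print(cm)
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```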
AdaBoost Ensemble
As the name shows, this is an ensemble technique, which is a combination of different
algorithms. It is so far a successful boosting ensemble technique. It has two main algorithms.
The method works on the input data and on the weights assigned to that data.
Training/Testing/Results/Discussion
The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into training
and test sets, I will use the built-in function imported from the sklearn library. As with the
other models, the model script will be executed in Python to gather the results on the
dataset. The results will be compared with the results of the other models, as well as with
the actual labels, to check accuracy and reliability.
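A minimal sketch of this step, assuming scikit-learn's AdaBoostClassifier is used with its
default weak learner (a depth-1 decision tree); the number of estimators and the random
seeds are illustrative values, not settings fixed by the proposal.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Boosting fits weak learners one after another; after each round the
# misclassified training samples receive a larger weight for the next learner.
ada_model = AdaBoostClassifier(n_estimators=50, random_state=0)
ada_model.fit(X_train, y_train)

ada_pred = ada_model.predict(X_test)
ada_accuracy = accuracy_score(y_test, ada_pred)
print(f"AdaBoost test accuracy: {ada_accuracy:.3f}")
```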
Neural Network
Neural Networks (NN), which mimic the human brain, are emerging very rapidly. This
was first introduced by Murata in 1994. Owing to the bias factor in an NN, the algorithm
learns and trains itself according to the input data. All the algorithm needs is diverse
training data. Once diverse training data is provided, the model can outperform humans in
different aspects of life. A more advanced NN model for the detection of diseases from
images is the convolutional neural network, as discussed in the literature. I will use a simple
multilayer perceptron model to train and test the data.
Training/Testing/Results/Discussion
The dataset will be split into 70% training data, with the remaining 30% held out for
validation and testing to check the performance of the model. To split the data into training
and test sets, I will use the built-in function imported from the sklearn library. As with the
other models, the model script will be executed in Python to gather the results on the
dataset. The results will be compared with the results of the other models, as well as with
the actual labels, to check accuracy and reliability. The results will be discussed after
successful implementation.
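A simple multilayer perceptron could be set up as in the sketch below, assuming
scikit-learn's MLPClassifier. The single hidden layer of 32 units, the iteration limit and the
feature scaling step are assumptions made for this illustration and would be tuned in the
actual work.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Scale the features first (the scaler is fitted on the training part only),
# then train a perceptron with one hidden layer of 32 units.
mlp_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
mlp_model.fit(X_train, y_train)

mlp_pred = mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(y_test, mlp_pred)
print(f"MLP test accuracy: {mlp_accuracy:.3f}")
```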
6. Proposed Work plan and implication
The tentative activity timeline and milestones are given below:
7. Impact Potential
Research findings and recommendations will be published in reputable academic and
science-indexed journals, and exploitable outputs will be produced. Moreover, the
instructions of the research supervisor will be given the highest priority during the research.
References
[1] American Cancer Society, 2020. Breast Cancer Facts and Figures report for year 2019-2020.
[2] Sri Hari Nallamala, Dr. Pragnyaban Mishra and Dr. Suvarna Vani Koneru, “Qualitative
Metrics on Breast Cancer Diagnosis with Neuro Fuzzy Inference Systems”, International
Journal of Advanced Trends in Computer Science and Engineering (IJATCSE), Vol. 8 No. 2
(2019), P. 259 – 264.
[3] Kuhl CK. Abbreviated magnetic resonance imaging (MRI) for breast cancer screening:
rationale, concept, and transfer to clinical practice. Annu Rev Med. 2019, pp. 501-519.
[4] Sri Hari Nallamala, Siva Kumar Pathuri, Dr Suvarna Vani Koneru, “An Appraisal on
Recurrent Pattern Analysis Algorithm from the Net Monitor Records”, International Journal of
Engineering & Technology (IJET) (UAE), ISSN: 2227 – 524X, Vol. 7, No 2.7 (2018), SI 7, P.
542 – 545
[5] Sri Hari Nallamala, Siva Kumar Pathuri, Dr Suvarna Vani Koneru, “A Literature Survey
on Data Mining Approach to Effectively Handle Cancer Treatment”, International Journal of
Engineering & Technology (IJET) (UAE), ISSN: 2227 – 524X, Vol. 7, No 2.7 (2018), SI 7, P.
729 – 732.
[6] Khosravi, P., Kazemi, E., Imielinski, M., Elemento, O. and Hajirasouliha, I., 2018. Deep
convolutional neural networks enable discrimination of heterogeneous digital pathology
images. EBioMedicine, 27, pp.317-328.
[7] D. S. Jacob, R. Viswan, V. Manju, L. PadmaSuresh and S. Raj, 2018. "A Survey on Breast
Cancer Prediction Using Data Mining Techniques," Conference on Emerging Devices and
Smart Systems (ICEDSS), Tiruchengode, pp. 256-258.
[8] Vreemann S, Gubern-Mérida A, Schlooz-Vries MS, et al. Influence of risk category and
screening round on the performance of an MR imaging and mammography screening program
in carriers of the BRCA mutation and other women at increased risk. Radiology.
2018;286(2):443-451.
[9] S. Nayak and D. Gope, "Comparison of supervised learning algorithms for RF-based breast
cancer detection," 2017 Computing and Electromagnetics International Workshop (CEM),
Barcelona, 2017, P. 13-14. doi: 10.1109/CEM.2017.7991863.
[10] Şahan, S., Polat, K., Kodaz, H. and Güneş, S., 2017. A new hybrid method based on fuzzy-
artificial immune system and k-nn algorithm for breast cancer diagnosis. Computers in Biology
and Medicine, 37(3), pp.415-423.
[11] Kuhl CK, Strobel K, Bieling H, Leutner C, Schild HH, Schrading S. Supplemental breast
MR imaging screening of women with average risk of breast cancer. Radiology.
2017;283(2):361-370
[12] Amin MB, Edge SB, Greene FL, et al, eds. AJCC Cancer Staging Manual. 8th ed. New
York, NY: Springer; 2017
[14] Amin MB, Edge SB, Greene FL, et al, eds. AJCC Cancer Staging Manual. 8th ed.
New York, NY: Springer; 2017
[15] Y. Tsehay et al., "Biopsy-guided learning with deep convolutional neural networks for
Prostate Cancer detection on multiparametric MRI," 2017 IEEE 14th International Symposium
on Biomedical Imaging (ISBI 2017), Melbourne, VIC, 2017, P. 642-645.
[16] Asri, H., Mousannif, H., Al Moatassime, H. and Noel, T., 2016. Using machine learning
algorithms for breast cancer risk prediction and diagnosis. Procedia Computer Science, 83,
pp.1064-1069.
[18] C. Deng and M. Perkowski, "A Novel Weighted Hierarchical Adaptive Voting Ensemble
Machine Learning Method for Breast Cancer Detection," 2015 IEEE International Symposium
on Multiple-Valued Logic, Waterloo, ON, 2015, P. 115-120.
[19] Sastry, J.K.R., Ganesh, J.V., Bhanu, J.S., I2C based networking for implementing
heterogeneous microcontroller based distributed embedded systems, Indian Journal of Science
and Technology, Volume 8, Issue 15, 2015
[20] KISHORE, P.V.V., PRASAD, M.V.D., PRASAD, C.R. and RAHUL, R., 2015. 4 -Camera
model for sign language recognition using elliptical fourier descriptors and ANN, International
Conference on Signal Processing and Communication Engineering Systems - Proceedings of
SPACES 2015, in Association with IEEE 2015, pp. 34 -38.
[21] Dubey, A.K., Gupta, U. and Jain, S., 2015. A survey on breast cancer scenario and
prediction strategy. In Proceedings of the 3rd International Conference on Frontiers of
Intelligent Computing: Theory and Applications (FICTA) 2014 (pp. 367-375). Springer, Cham.