DIABETES PREDICTION USING MACHINE LEARNING
(BATCH NO:4)
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING
COMPREHENSIVE SKILL DEVELOPMENT PROJECT
DOCUMENTATION
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
GITAM UNIVERSITY
VISAKHAPATNAM
NOV-2023
TABLE OF CONTENTS
1. Declaration
2. Abstract
3. Chapter 1 - Introduction
4. Chapter 2 - Literature Survey
5. Chapter 3 - Methods
6. Chapter 4 - Results and Screenshots
7. Chapter 5 - Conclusion and Future Scope
8. Code
9. References
DECLARATION
We hereby declare that the project entitled “DIABETES PREDICTION USING
MACHINE LEARNING” has been carried out by us and has not been submitted either in
part or whole for the award of any degree, diploma, or any other similar title to this or any
other university.
Kalyan M.
(122010304057)
Yashwanth Bhaskar
(122010303030)
Dadi Sujith Jaswanth
(122010307028)
Likitha Maradana
(122010315050)
Date: 30-10-2023
Place: Visakhapatnam
ABSTRACT
Diabetes mellitus is increasingly common and has become a global public health concern.
Effective diabetes prevention and control depend on early detection and prediction of the disease.
Using data-driven models and predictive analytics, this study offers a machine learning-based
method for diabetes prediction.
The first step of the project is gathering pertinent medical data, such as clinical history, laboratory
test results, and patient demographics. These datasets are preprocessed to deal with outliers
and missing values. Feature selection approaches are used to determine which variables have the
greatest influence. The efficacy of several machine learning techniques in anticipating the onset
of diabetes is assessed, including logistic regression, decision trees, support vector machines, and
neural networks.
One of the project's main goals is to develop a prediction model that is dependable and accurate
while prioritizing interpretability for healthcare professionals. Model performance is
evaluated and compared using metrics including accuracy, precision,
recall, and the area under the receiver operating characteristic curve (AUC-ROC). The outcomes
show how machine learning can be used to identify those who are at risk of developing diabetes.
Early interventions and individualized healthcare plans can be made possible by the predictive
model that has been developed to help healthcare professionals make timely and informed
decisions. This project also emphasizes how important it is to use machine learning in healthcare to
predict diseases early on, as doing so can lead to better patient outcomes and lower medical
expenses.
CHAPTER 1
INTRODUCTION
1. Overview
The "Diabetes Prediction Using Machine Learning" project is a groundbreaking endeavor designed
to harness the power of modern technology and data science to address the growing global concern
of diabetes. Diabetes, characterized by elevated blood sugar levels, is a chronic illness affecting
millions of individuals across the world. Timely detection of diabetes and the identification of
individuals at risk of developing the condition are of paramount importance. This project seeks to
advance traditional diagnostic methods by utilizing cutting-edge machine learning techniques to
create a predictive model for early diagnosis, ultimately contributing to proactive diabetes
management and reducing the burden of this pervasive chronic disease.
2. Introduction
Diabetes is a global health challenge that shows no signs of abating. Its prevalence continues to
surge, placing immense pressure on healthcare systems and adversely impacting the quality of life
of affected individuals. Diabetes is associated with a multitude of complications, including heart
disease, stroke, kidney failure, vision impairment, and limb amputation. The pivotal role of early
detection cannot be overstated; it allows for timely intervention and appropriate management,
significantly reducing the risk of these complications.
The conventional diagnostic tools used to detect diabetes, while valuable, are not infallible. They
may lack the precision required for early diagnosis, leading to delayed intervention and potentially
adverse health outcomes. This project is driven by the profound need to address these limitations
and improve diabetes diagnosis through the application of machine learning. By creating a
predictive model, this project aims to enhance the accuracy and efficiency of diabetes detection,
ushering in a new era of early diagnosis and intervention.
3. About the Project
The "Diabetes Prediction Using Machine Learning" project represents a convergence of expertise
from the fields of data science, healthcare, and technology. It is a multidisciplinary initiative that
seeks to harness the potential of machine learning to create a predictive model capable of analyzing
a diverse range of data inputs. These inputs include comprehensive medical records, demographic
details, lifestyle factors, and genetic markers. By integrating and processing these multifaceted data
sources, the project aims to create a comprehensive and precise predictive tool that not only
diagnoses diabetes but also identifies individuals at risk of developing the condition. This holistic
approach is designed to enhance the accuracy and efficiency of diabetes detection.
4. Objectives
The core objectives of the "Diabetes Prediction Using Machine Learning" project are as follows:
- Development of a Predictive Model: Create a robust and highly accurate predictive model
using advanced machine learning algorithms. The model should be capable of diagnosing diabetes
and identifying individuals at risk with a high degree of precision.
- Identification of Relevant Data Inputs: Determine the most pertinent data features and inputs
that influence the predictive model's accuracy. This involves an in-depth analysis of medical
records, demographic information, lifestyle factors, and genetic markers to ascertain their
importance in diabetes prediction.
- Evaluation and Validation: Rigorously evaluate and validate the performance of the developed
predictive model. This validation process is integral to ensuring the model's reliability and accuracy
in real-world applications, particularly in clinical settings.
5. Problem Statement
The increasing global prevalence of diabetes represents a formidable challenge to healthcare
systems worldwide. Traditional diagnostic approaches, while valuable, may fall short of the level
of precision required for early and accurate detection. This shortfall can lead to delayed
intervention and, consequently, an increased risk of severe complications for affected individuals.
The "Diabetes Prediction Using Machine Learning" project aims to address these limitations by
employing cutting-edge machine learning techniques to enhance diagnostic accuracy, allowing for
the early identification of individuals at risk of diabetes.
6. Motivation
The motivation behind the "Diabetes Prediction Using Machine Learning" project is rooted in the
pressing need to revolutionize diabetes diagnosis and risk assessment. The project team is driven
by the desire to leverage the potential of machine learning to create a more accurate, scalable, and
accessible means of predicting diabetes. The goal is to enable early intervention, thereby reducing
the risk of complications and improving the overall health outcomes of individuals affected by
diabetes.
This project holds the promise of transforming diabetes diagnosis from a reactive approach to a
proactive one, greatly benefiting both individuals and healthcare systems. By harnessing the power
of data science and technology, the "Diabetes Prediction Using Machine Learning" project
endeavors to make a significant impact on public health and contribute to the global fight against
diabetes.
CHAPTER 2
LITERATURE SURVEY
Literature Survey: Diabetes Prediction Using Machine Learning
1. Introduction:
Diabetes is a prevalent chronic disease worldwide, imposing a significant burden on healthcare
systems. The early detection and proactive management of diabetes play a crucial role in mitigating
its complications and improving patient outcomes. In recent years, machine learning techniques
have emerged as promising tools for predicting and diagnosing diabetes based on various data
inputs.
2. Importance of Early Detection:
Highlight the importance of early diagnosis in diabetes management. Discuss how early detection
can aid in initiating timely interventions, lifestyle modifications, and appropriate medical
treatments, reducing the risk of complications associated with diabetes.
3. Machine Learning in Healthcare:
Explore the applications of machine learning in healthcare, specifically in disease prediction and
diagnosis. Discuss relevant studies where machine learning models have been successfully applied
in predicting chronic illnesses or medical conditions.
4. Previous Studies on Diabetes Prediction:
Review existing literature on the use of machine learning for diabetes prediction. This section
should cover various methodologies, algorithms, and features used in predictive models. Highlight
the strengths and limitations of previous studies and the predictive performance achieved.
5. Data Sources and Features:
Discuss the types of data sources used in previous studies. This may include medical records,
genetic markers, lifestyle factors, and demographic information. Highlight the significance of each
data type in predicting diabetes.
6. Model Selection and Evaluation:
Review the machine learning algorithms commonly used in diabetes prediction models. Discuss the
rationale behind the selection of specific algorithms and the evaluation metrics employed to assess
model performance.
7. Challenges and Future Directions:
Identify the challenges faced in previous studies, such as data quality, model interpretability, and
generalization to diverse populations. Discuss potential future directions for improving predictive
models and their integration into clinical practice.
8. Conclusion:
Summarize the key findings from the literature survey and emphasize the potential of machine
learning in diabetes prediction. Highlight the gaps in current research and propose areas for further
exploration.
CHAPTER 3
METHODS
For the "Diabetes Prediction Using Machine Learning" project, we use LightGBM to
build a predictive model for diabetes classification.
Overview of LightGBM:
1. Introduction:
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is
particularly well suited to large datasets and high-dimensional feature spaces.
It is known for its efficiency and speed, which come largely from its histogram-based approach to
finding the best splits during training.
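To make the histogram idea concrete, the following is a minimal pure-Python sketch, not LightGBM's actual implementation. It assumes equal-width bins and a simplified gain criterion in which sample counts stand in for hessian sums; the function name `best_histogram_split` and the sample values are hypothetical.

```python
from typing import List, Tuple


def best_histogram_split(
    values: List[float], gradients: List[float], n_bins: int = 4
) -> Tuple[float, float]:
    """Return (threshold, gain) for the best split found over equal-width bins.

    Simplified gain: G_left^2 / n_left + G_right^2 / n_right - G_total^2 / n_total.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    # Pass 1: accumulate gradient sums and counts per bin (the histogram).
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    # Pass 2: scan the bin boundaries only, not every distinct feature value.
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_threshold = 0.0, lo
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g**2 / left_n + right_g**2 / right_n - total_g**2 / total_n
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


# Hypothetical glucose readings whose gradients change sign between 110 and 150.
glucose = [85.0, 90.0, 100.0, 110.0, 150.0, 160.0, 170.0, 185.0]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
threshold, gain = best_histogram_split(glucose, grads)
print(threshold, gain)
```

The cost of the split search depends on the number of bins rather than the number of distinct values, which is where the speed and memory savings come from.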
2. Key Features:
LightGBM is a popular choice for machine learning problems because of the following features:
Gradient Boosting: LightGBM is built on gradient boosting, which combines weak learners
(typically decision trees) into a strong ensemble model.
Histogram-Based Splitting: LightGBM buckets continuous feature values into discrete bins and
searches for splits over those bins, which is faster and more memory-efficient than tree-based
algorithms that rely on pre-sorted data.
Leaf-Wise Growth: LightGBM grows trees leaf-wise rather than level-wise: at each step it splits
the leaf whose best split gives the largest reduction in the loss function.
Gradient-Based One-Side Sampling (GOSS): This technique speeds up training by keeping the data
points with large gradients and randomly sampling from those with small gradients.
Regularization: To reduce overfitting, LightGBM supports both L1 and L2 regularization.
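Of the features listed above, gradient-based one-side sampling is the least intuitive, so a small pure-Python sketch may help. It is an illustration under stated assumptions, not LightGBM's code: the function name `goss_sample` and the default rates are hypothetical, and the weight `(1 - top_rate) / other_rate` follows the usual description of the method, in which the sampled small-gradient points are up-weighted to compensate for the ones that were dropped.

```python
import random
from typing import List, Tuple


def goss_sample(
    gradients: List[float],
    top_rate: float = 0.2,
    other_rate: float = 0.1,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Return (index, weight) pairs chosen by one-side sampling."""
    n = len(gradients)
    # Rank instances by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]    # always kept: the poorly fitted instances
    rest = order[n_top:]   # well fitted instances: only a random subset is kept
    sampled = random.Random(seed).sample(rest, n_other)

    # Up-weight the sampled small-gradient instances to offset the dropped ones.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]


# Hypothetical gradients for 20 training instances, increasing in magnitude.
example_gradients = [0.05 * i for i in range(20)]
selected = goss_sample(example_gradients)
print(selected)
```

With these rates, each boosting iteration would use 6 of the 20 instances: the 4 with the largest gradients plus 2 sampled from the remainder.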
3. Model Evaluation:
After training, we evaluate the LightGBM model's performance using appropriate
classification metrics, such as accuracy, precision, recall, F1-score, the ROC curve, and AUC.
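As a sketch of how the threshold-based metrics relate to one another, the snippet below derives accuracy, precision, recall, and F1-score from the four confusion-matrix counts. The helper name `classification_metrics` and the labels are hypothetical; in the project itself the equivalent scikit-learn functions are used.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diabetic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diabetic, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for ten patients.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(true_labels, predicted)
print(metrics)
```

In a screening setting, recall is often watched most closely, because a missed diabetic patient (a false negative) is usually costlier than a false alarm.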
4. Integration with Python:
LightGBM is available as a Python package, lightgbm, which we can import directly into our
project.
5. Advantages:
LightGBM is known for its speed and efficiency, making it a good choice for large datasets.
It often performs well in terms of predictive accuracy.
Dataset Description: Pima Indian Diabetes Dataset
Introduction:
One of the best-known datasets in machine learning for healthcare is the Pima Indian
Diabetes Dataset. It is used to predict the onset of diabetes in Pima Indians, a
population known to have a high risk of the disease. The dataset is well suited to developing
predictive models that identify people at risk of diabetes.
Data Source:
The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases
(NIDDK) and is publicly available. It contains medical and demographic information on Pima Indian
women aged 21 and older, residing near Phoenix, Arizona, USA. The data was collected at the Gila
River Indian Community Diabetes Program.
Data Features:
The dataset comprises eight input features and one target variable (Outcome). Here is a brief
description of each input feature:
Pregnancies: Number of times pregnant.
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
Blood Pressure: Diastolic blood pressure (mm Hg).
Skin Thickness: Triceps skinfold thickness (mm).
Insulin: 2-hour serum insulin (µU/mL).
BMI (Body Mass Index): Weight in kg divided by the square of height in m, a measure of body fat.
Diabetes Pedigree Function: A function that represents the likelihood of diabetes based on family
history.
Age: Age of the individual.
Target Variable:
The target variable is binary, with two classes:
Outcome: Indicates the presence (1) or absence (0) of diabetes as diagnosed within five years of the
data collection.
Dataset Size:
The dataset contains a total of 768 observations. It is relatively small, as is common for medical
datasets, but large enough to train and evaluate machine learning models.
Data Characteristics:
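The dataset has no explicitly marked missing values, but several features (Glucose, Blood Pressure,
Skin Thickness, Insulin, and BMI) contain zeros that are physiologically implausible and are best
treated as missing values during preprocessing.
It exhibits class imbalance: non-diabetic cases (Outcome = 0) outnumber diabetic cases
(Outcome = 1), 500 to 268.
The features have varying scales and distributions.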
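One common way to handle the missing-value issue noted above is to treat implausible zeros as missing and replace them with the median of the observed values. The sketch below assumes the column names used in widely distributed copies of the dataset (`Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`); the helper `impute_zero_with_median` and the sample rows are hypothetical, and median imputation is only one of several reasonable choices.

```python
from statistics import median
from typing import Dict, List

# Columns in which a value of 0 is not physiologically plausible (assumption).
ZERO_MEANS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def impute_zero_with_median(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return a copy of rows with zeros in selected columns replaced by the column median."""
    cleaned = [dict(row) for row in rows]  # leave the input rows untouched
    for column in ZERO_MEANS_MISSING:
        observed = [row[column] for row in rows if row[column] != 0]
        if not observed:
            continue  # nothing to impute from
        fill_value = median(observed)
        for row in cleaned:
            if row[column] == 0:
                row[column] = fill_value
    return cleaned


# A few illustrative rows; zeros mark measurements that were not recorded.
sample_rows = [
    {"Glucose": 148, "BloodPressure": 72, "SkinThickness": 35, "Insulin": 0, "BMI": 33.6},
    {"Glucose": 85, "BloodPressure": 66, "SkinThickness": 29, "Insulin": 0, "BMI": 26.6},
    {"Glucose": 183, "BloodPressure": 64, "SkinThickness": 0, "Insulin": 0, "BMI": 23.3},
    {"Glucose": 89, "BloodPressure": 66, "SkinThickness": 23, "Insulin": 94, "BMI": 28.1},
    {"Glucose": 137, "BloodPressure": 40, "SkinThickness": 35, "Insulin": 168, "BMI": 43.1},
]
cleaned_rows = impute_zero_with_median(sample_rows)
print(cleaned_rows[2])
```

In a full pipeline the medians would be computed on the training split only and then reused for the test split, so that no test-set information leaks into preprocessing.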
Use Cases:
The primary use case for the Pima Indian Diabetes Dataset is building predictive models to identify
individuals at risk of developing diabetes.
It is often used for binary classification tasks, where the goal is to predict whether an individual has
diabetes or not.
CHAPTER 4
RESULTS
The results of our diabetes prediction with the LightGBM model show how well it can identify
people who are at risk of the disease. The model may find application in clinical settings due to its
high accuracy, precision, recall, and AUC scores. Furthermore, it is a strong candidate for
deployment in healthcare environments due to its generalization ability.
The interpretability of our model is supported by its feature importances, which provide
insights into the major variables influencing diabetes prediction. Healthcare practitioners can use
this information to help identify patients who need risk mitigation and focused interventions.
Despite the promising results, it is essential to consider the challenges of class imbalance and the
need for robust data preprocessing techniques. Additionally, further research can explore the
integration of domain-specific knowledge and additional clinical variables to improve model
accuracy and interpretability.
Model Performance Metrics:
We trained the LightGBM model on the Pima Indian Diabetes Dataset, and the model's
performance was assessed using a range of classification metrics:
1. Accuracy:
Accuracy, the proportion of all samples that are classified correctly, is a fundamental metric for
evaluating the model's overall performance. We achieved an accuracy of approximately
[accuracy score] on our test dataset, indicating that the model correctly predicted the diabetes
status of [accuracy percentage] of the samples.
2. Precision:
Precision measures the proportion of true positive predictions among all positive predictions. In our
case, it represents how many of the predicted cases of diabetes were correctly identified. The
precision score achieved was approximately [precision score].
3. Recall (Sensitivity):
Recall, also known as sensitivity or true positive rate, assesses the proportion of actual positive
cases that were correctly predicted by the model. Our model achieved a recall score of
approximately [recall score].
4. F1-Score:
The F1-score is the harmonic mean of precision and recall and is a valuable metric for binary
classification tasks with imbalanced classes. Our LightGBM model achieved an F1-score of
approximately [F1-score].
5. ROC Curve and AUC:
The Receiver Operating Characteristic (ROC) curve visually represents the trade-off between the
true positive rate and false positive rate. The Area Under the ROC Curve (AUC) quantifies the
model's ability to distinguish between positive and negative cases. Our model achieved an AUC
score of approximately [AUC score], indicating a strong discriminatory power.
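The AUC described in item 5 has a useful interpretation: it is the probability that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case. The sketch below computes it directly from that definition; the function name `pairwise_auc` and the scores are hypothetical, and for real data `sklearn.metrics.roc_auc_score` does the same job far more efficiently.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four patients.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(labels, probabilities)
print(auc_value)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it complements the threshold-dependent metrics in items 1 to 4.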
OUTPUT SCREENSHOTS
CHAPTER 5
CONCLUSION AND FUTURE SCOPE
Conclusion
Using the LightGBM algorithm, we have developed a diabetes prediction model
in this project. Our work highlights the significance of machine learning in the field of
healthcare, specifically in the early detection of diabetes risk in the Pima Indian community. This
final section summarizes the project's accomplishments, difficulties, and possible
implications.
Our effort has an impact that goes beyond this project as we look to the future. We are still working
on data enrichment, real-world implementation, and integrating predictive models with electronic
health records. The next frontier is patient empowerment and tailored healthcare recommendations,
which allow people to actively participate in their own health management. As ethical behavior,
patient-centered care, and trust continue to be the cornerstones of healthcare, the ongoing journey
underscores the significance of responsible technology in this domain.
The project's significance in this dynamic environment extends beyond its scientific merits and
serves as evidence of how technology might enhance patient outcomes. We welcome the
significant opportunities and responsibilities that lie ahead as we draw to a close. Our goal to
improve diabetes care and healthcare will be guided by the knowledge gained from this research,
which will ultimately result in a society that is healthier and more informed.
Future Scope
1. Real-World Use: The LightGBM model, being accurate and interpretable, may find
use in clinical settings. Integrated into healthcare systems, it can help identify people
who are at risk and prompt timely interventions.
2. Data Enrichment: To further improve the model's prediction ability and offer a more
thorough picture of diabetes risk, future research should take into account the addition of
more clinical characteristics, lifestyle factors, and genetic data.
3. Managing Class Imbalance: Although our approach has worked effectively, handling
class imbalance remains a challenge. Model performance may be further improved by
techniques such as oversampling, undersampling, or the use of specialized
algorithms.
4. Patient Engagement and Education: In order to empower people to take charge of their
own health care, future apps should think about including patient engagement and education
components in addition to prediction accuracy. Better health results may result from giving
patients tailored information and doable suggestions.
5. Clinical Trials and Validation: To test the model's performance in actual clinical
circumstances, cooperation with healthcare institutions is necessary for clinical trials and
validation studies. For greater adoption and regulatory approval, these initiatives are
essential.
6. Interdisciplinary Collaboration: By combining domain-specific knowledge and a
comprehensive approach to diabetes prevention, working with epidemiologists, public
health specialists, and healthcare professionals can increase the project's impact.
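As an illustration of the oversampling idea mentioned under item 3 (Managing Class Imbalance), the following is a minimal sketch of random oversampling in pure Python. It is not part of the current project code; the function name `random_oversample` and the toy data are hypothetical, and libraries such as imbalanced-learn offer more sophisticated variants.

```python
import random
from typing import Dict, List, Tuple


def random_oversample(
    X: List[List[float]], y: List[int], seed: int = 0
) -> Tuple[List[List[float]], List[int]]:
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class: Dict[int, List[List[float]]] = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)

    target = max(len(items) for items in by_class.values())
    X_out: List[List[float]] = []
    y_out: List[int] = []
    for label, items in by_class.items():
        # Draw with replacement to fill the gap up to the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for features in items + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Toy data: 7 non-diabetic samples and 3 diabetic samples.
X_toy = [[float(i)] for i in range(10)]
y_toy = [0] * 7 + [1] * 3
X_balanced, y_balanced = random_oversample(X_toy, y_toy)
print(y_balanced.count(0), y_balanced.count(1))
```

Oversampling should be applied to the training split only; duplicating samples before the train/test split would let copies of the same patient appear on both sides and inflate the test metrics.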
CODE
# Install LightGBM (notebook syntax; from a shell, run: pip install lightgbm)
!pip install lightgbm

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Model, preprocessing, and evaluation utilities
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, auc)
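# Load the Pima Indian Diabetes Dataset (the path assumes a Colab session)
diabetes_dataset = pd.read_csv('/content/diabetes.csv')

# Quick exploration: first rows, shape, summary statistics, class balance
diabetes_dataset.head()
print("No of rows and columns = ", diabetes_dataset.shape)
diabetes_dataset.describe()
diabetes_dataset['Outcome'].value_counts()
diabetes_dataset.groupby('Outcome').mean()

# Separate the input features from the target variable
X = diabetes_dataset.drop(columns='Outcome')
Y = diabetes_dataset['Outcome']
print(X)
print(Y)

# Split first, so that the test set stays unseen during preprocessing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.31, random_state=2)
print(X.shape, X_train.shape, X_test.shape)

# Fit the scaler on the training split only, then apply it to both splits.
# Fitting it on the full dataset would leak test-set statistics into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train.to_numpy())
X_test = scaler.transform(X_test.to_numpy())
print(X_train)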
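# Train a LightGBM classifier; max_depth=-1 places no limit on tree depth (any value <= 0 does)
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-1, random_state=2)
model.fit(X_train, Y_train, eval_set=[(X_test, Y_test), (X_train, Y_train)], eval_metric='logloss')

# Compare training and testing accuracy to check for overfitting
print('Training accuracy {:.4f}'.format(model.score(X_train, Y_train)))
print('Testing accuracy {:.4f}'.format(model.score(X_test, Y_test)))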
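# Predict for a single patient; values follow the dataset's column order
# (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age)
input_data = (7, 158, 69, 20, 177, 24.7, 0.529, 55)
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
# Apply the same scaling that was used for the training data
std_data = scaler.transform(input_data_reshaped)
print(std_data)
prediction = model.predict(std_data)
print('Diabetic' if prediction[0] == 1 else 'Not diabetic')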
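# Predicted probabilities of the positive class (for the ROC curve) and hard labels (for the other metrics)
y_pred_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# Precision, recall, and F1-score on the test set
precision = precision_score(Y_test, y_pred)
recall = recall_score(Y_test, y_pred)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
f1 = f1_score(Y_test, y_pred)
print(f"F1 Score: {f1:.4f}")

# ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(Y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)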
plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
REFERENCES
1. Yakkundimath R, Jadhav V, Anami B, Malvade N. Co-occurrence histogram based
ensemble of classifiers for classification of cervical cancer cells. J Electron Sci Technol.
2022;20(3):100170.
2. Nguyen TT, Nguyen TTT, Pham XC, Liew AW-C. A novel combining classifier method
based on variational inference. Pattern Recogn. 2016;49:198–212.
3. Sajida P, Muhammad S, Azi ZG, Karim K. Performance analysis of data mining
classification techniques to predict diabetes. Procedia Comput Sci. 2016;82:115–21.
4. Siva SG, Manikandan K. Diagnosis of diabetes diseases using optimized fuzzy rule set by
grey wolf optimization. Pattern Recogn Lett. 2019;125:432–8.
5. Raja JB, Pandian SC. Pso-fcm based data mining model to predict diabetic disease. Comput
Methods Prog Biomed. 2020;196.
6. Devi RDH, Bai A, Nagarajan N. A novel hybrid approach for diagnosing diabetes mellitus
using farthest first and support vector machine algorithms. Obes Med. 2020;17.