Soil Fertility Prediction_ a Machine Learning Approach (Completed) (2)
Soil Fertility Prediction_ a Machine Learning Approach (Completed) (2)
Submitted by
Sumaiya Tasnim Preoti
ID No. 202061001
Supervised By
Sakhor Das Opi
Lecturer
Department of Computer Science and Engineering and
Department of Computer Science and Information Technology
Shanto-Mariam University of Creative Technology
to the
Department of Computer Science and Engineering
Shanto-Mariam University of Creative Technology
October, 2024
i
Acknowledgement
This thesis has been submitted to the Department of Computer Science and Engineering
of Shanto-Mariam University of Creative Technology (SMUCT), Dhaka, Bangladesh,
for the partial fulfillment of the requirements for the degree of B.Sc. in Computer
Science and Engineering and Computer Science and Information and Technology.
Thesis title regards to "Soil Fertility Prediction: A Machine Learning Approach".
First and foremost, I offer my sincere gratitude and indebtness to my thesis supervisor,
Sakhor Das Opi, Lecturer, Department of Computer Science and Engineering and
Computer Science and Information Technology who has supported me throughout my
thesis with his patience and knowledge. I shall ever remain grateful to him for his
valuable guidance, advice, encouragement, cordial and amiable contribution to my
thesis.
I wish to thank once again Ahamad Nokib Mozumder as the head of the Department
of Computer Science and Engineering and Computer Science and Information
Technology for his support and encouragement and also for providing all kind of
laboratory facilities.
Finally, I want to thank the most important and the closest persons of my life and my
parents for giving a big support to me.
i
CERTIFICATE
This is to certify that the thesis entitled "Soil Fertility Prediction: A Machine Learning
Approach" by Sumaiya Tasnim Preoti, ID No. 202061001 has been carried out under
my direct supervision. To the best of my knowledge, this thesis is an original one and
has not been submitted anywhere for any degree or diploma.
Thesis Supervisor:
.......................................
Sakhor Das Opi
Lecturer
Department of Computer Science and Engineering and
Department of Computer Science and Information Technology
Shanto-Mariam University of Creative Technology
CERTIFICATE
This is to certify that the thesis entitled "Soil Fertility Prediction: A Machine Learning
Approach" has been corrected according to my suggestion and guidance as an external.
The quality of the thesis is satisfactory.
External Member:
.......................................
Tousif Hasan Lavlu
[Lecturer]
Department of Computer Science and Engineering and
Department of Computer Science and Information Technology
Shanto-Mariam University of Creative Technology
ii
CERTIFICATE
This is to certify that the thesis entitled "Soil Fertility Prediction: A Machine
Learning Approach" by Sumaiya Tasnim Preoti, ID No. 202061001, has been carried
out under the direct supervision of Sakhor Das Opi. This thesis has been carried out
according to the Department of Computer Science & Engineering / Department of
Computer Science & Information Technology of Shanto-Mariam University of
Creative Technology guidelines.
………………………………..
Ahamad Nokib Mozumder
Head (In-Charge),
Department of Computer Science & Engineering and
Department of Computer Science and Information Technology,
Shanto-Mariam University of Creative Technology
iii
Declaration
I the undersigned member of this thesis, hereby declare that the thesis titled “Soil
Fertility Prediction: A Machine Learning Approach” represents my original work
conducted under the supervision of Sakhor Das Opi at Shanto-Mariam University of
Creative Technology (SMUCT) Dhaka, Bangladesh. All the information, data and
findings presented in this thesis are genuine and have not been submitted elsewhere for
any other degree or diploma.
I further declare that any external sources, references, or materials used in this research
have been duly cited in accordance with the academic standards and guidelines.
I understand the consequences of academic dishonesty and collectively affirm that this
thesis represents our joint efforts and understanding of the subject matter.
Signatures:
__________________________
Sumaiya Tasnim Preoti
ID No: 202061001
Department of Computer Science and Information Technology
Shanto-Mariam University of Creative Technology
iv
Abstract
v
Table of Contents
Content Page
Number
Acknowledgement I
Certificate Ii-Iii
Declaration Iv
Abstract V
List of Figures Viii
List of Tables Ix
Publications 1
Chapter 1: Introduction 3
1.1 Motivation 3
1.2 Key Contribution 4
1.3 Problem Statement 5
1.4 Background Study 5
1.5 Literature Review 7
1.6 Thesis Organization 8
Chapter 2: Methodology 10
2.1 Data Collection 10
Key Features 10
Data Characteristics 11
Importance of Dataset for the Study 11
2.2 Data Preprocessing 13
2.2.1 Handling Missing Values 13
2.2.2 Feature Scaling 14
2.2.3 Outlier Detection and Removal 14
2.3.4 Data Transformation 15
2.3.5 Encoding Categorical Data 15
2.3.6 Feature Selection 16
2.4 Feature Engineering 16
2.4.1 Creating Interaction Features 17
2.4.2 Polynomial Features 17
2.4.3 Feature Binning 17
2.4.4 Dimensionality Reduction 17
2.4.5 Aggregation Features 19
2.4.6 Temporal Features 19
2.4.7 Domain-Specific Features 19
2.5 Machine Learning Algorithms 19
2.5.1 Support Vector Machine (SVM) 20
2.5.2 Random Forest Classifier (RF) 20
2.5.3 Gaussian Naive Bayes (GNB) 21
2.5.4 Logistic Regression (LR) 21
2.5.5 Comparison of Algorithms 22
2.6 Model Evaluation 22
2.6.3 Recall (Sensitivity) 23
2.6.4 F1-Score 23
2.6.5 Confusion Matrix 23
2.6.6 Cross-Validation 23
vi
2.7 Hyperparameter Tuning 24
2.7.1 Grid Search 24
Support Vector Machine (SVM) 24
Random Forest (RF) 25
Gaussian Naive Bayes (GNB) 25
Logistic Regression (LR) 25
2.8 Explainable AI Techniques 26
2.8.1 SHAP (SHapley Additive exPlanations) 26
2.8.2 Feature Importance 26
2.8.3 LIME (Local Interpretable Model-Agnostic Explanations) 27
2.8.4 Partial Dependence Plots (PDP) 27
2.8.5 Global Surrogate Model 28
Chapter 3: Results and Performance Analysis 29
3.1 Performance Analysis of Machine Learning Models 29
Confusion Matrix Analysis 29
ROC Curve Analysis 31
Performance Metrics Summary 31
3.2 Comparative Analysis with Existing Works 32
3.3 Impact of the Proposed Study on Agriculture 32
Chapter 4: Conclusion and Future Work 33
4.1 Conclusion 33
4.2 Future Work 33
4.2.1 Developing Mobile and Web-based Tools for Farmers 33
4.2.2 Integration with Agricultural Management Systems 34
4.2.3 Continuous Model Improvement and Learning 34
4.2.4 Advanced Feature Engineering and Data Fusion 35
4.2.5 Scalability and Cloud-based Deployment 35
4.2.6 Robust Testing and Evaluation 35
4.3 Conclusion 36
References 37
vii
List of Figures
Figure Figure Title Page
Number Number
Figure 2.1 Boxplot to Detect Outliers for Numerical Features 13
Figure 2.2 Heat Map of the Correlation Matrix 17
Figure 2.3 Explained Variance Ratio by Principal Component 18
Figure 2.4 Cumulative Explained Variance by Principal 18
Component
Figure 3.1 Confusion Matrices for the ML Models Applied to Soil 29
Fertility Prediction
Figure 3.2 ROC Curves for the ML Models Used for Soil Fertility 31
Prediction
viii
List of Tables
Table Table Title Page
Number Number
Table 2.1 Key Features of dataset 8
Table 2.2 Dataset Preview 10
Table 2.3 summarizes the key characteristics of the 20
algorithms employed
Table 3.1 Comparative Analysis of the Machine Learning 30
Models
Table 3.2 Comparative Analysis with Existing Studies 30
ix
Publications
x
Chapter 1: Introduction
Agriculture plays a vital role in ensuring global food security and sustaining economic
growth. With the world population projected to reach 9.7 billion by 2050, the need for
increased agricultural output has never been more urgent [1]. However, modern
agriculture faces multiple challenges, such as climate change, shrinking arable land,
and soil degradation, which threaten sustainable food production. Soil fertility, which
determines the soil’s capacity to provide essential nutrients to crops, is among the most
critical factors influencing crop productivity [2]. Effective management of soil fertility
is necessary to ensure optimal crop growth and long-term agricultural sustainability.
Traditional soil testing methods rely on laboratory-based chemical analyses, which are
time-consuming, labor-intensive, and costly. These conventional approaches also offer
only localized insights, making it challenging to scale fertility assessments across large
regions. To address these challenges, data-driven solutions such as machine learning
(ML) models are gaining traction in the agricultural sector. Machine learning offers the
ability to analyze vast datasets, including soil properties, climatic data, and crop
performance, to make precise predictions about soil fertility [3]. These models enable
farmers to make better-informed decisions regarding fertilizer usage, soil management
practices, and crop selection, ensuring sustainable farming practices while reducing
operational costs.
This research aims to develop an ML-based approach for predicting soil fertility,
focusing on improving agricultural productivity and supporting sustainable farming.
The application of ML in soil fertility prediction not only reduces the reliance on
manual soil testing but also provides real-time, adaptive insights that evolve with
changing environmental conditions. By integrating advanced technologies with
agricultural practices, this research contributes to sustainable farming and ensures
future food security.
1.1 Motivation
Agriculture is a fundamental component of the global economy and essential for
sustaining human life. With the world’s population continuously increasing, the
demand for food production is rising at an unprecedented rate. However, this increasing
demand for agricultural output is met with several challenges, including shrinking
arable land, soil degradation, climate change, and unsustainable farming practices. One
of the critical factors that influence crop yield is soil fertility—the ability of soil to
provide essential nutrients to plants.
1
these tests often provide localized results that may not scale effectively across large
agricultural regions. Therefore, there is a pressing need for efficient, scalable, and
accurate tools that can assess soil health and predict fertility with minimal cost and
effort [4].
With advancements in data science and machine learning (ML), it is now possible to
develop predictive models that analyze large datasets of soil properties, weather
conditions, and agricultural inputs to estimate soil fertility [5]. These models provide
farmers and agricultural practitioners with actionable insights, enabling them to make
informed decisions about soil management, fertilizer application, and crop
selection. Furthermore, ML models can be continuously updated with real-time data,
enhancing their predictive capabilities over time.
The motivation behind this research is to explore how machine learning techniques
can be applied to predict soil fertility accurately, reduce the need for extensive
manual testing, and offer a cost-effective, efficient solution to enhance agricultural
productivity [6]. This thesis aims to bridge the gap between data-driven technology
and practical farming applications, contributing to sustainable agricultural practices
and ensuring food security for future generations.
2
agricultural experts to make informed decisions regarding soil management
[8].
4. Comparison with Existing Approaches:
The performance of the proposed models is compared with existing soil
fertility prediction approaches. The results demonstrate improvements in
prediction accuracy, efficiency, and scalability over traditional methods,
positioning the developed models as viable alternatives for practical
implementation.
5. Actionable Insights for Agriculture:
This research bridges the gap between theoretical machine learning models
and practical farming applications. The findings of this study provide
actionable insights that can be used to optimize fertilizer application,
improve soil health, and ultimately enhance crop yield while minimizing
resource waste.
The core problem addressed in this thesis is the need to develop a machine learning-
based soil fertility prediction model that:
3
4. Scales across different geographical regions and soil types, ensuring
widespread applicability.
Machine learning techniques can be broadly classified into two categories: supervised
and unsupervised learning.
Supervised Learning:
In supervised learning, models are trained on labeled datasets where the input-
output relationship is known. The model learns to map the input data (e.g., soil
properties) to the corresponding output (e.g., soil fertility class). Examples of
supervised learning algorithms include Random Forest, Support Vector
Machines (SVM), and XGBoost. For soil fertility prediction, supervised
learning is typically used, where the input features are soil properties, and the
target variable is the fertility level or classification [9].
Unsupervised Learning:
Unsupervised learning involves training models on data without explicit
labels. The model tries to find hidden patterns or structures in the data, such as
clustering soil samples based on similar properties. K-Means Clustering is a
popular unsupervised learning algorithm. Although unsupervised learning is
less commonly used for direct fertility prediction, it can be helpful in
exploratory data analysis, such as grouping similar soil types [10].
4
1.4.3 Feature Importance in Machine Learning Models
Soil fertility is influenced by several chemical, physical, and biological factors. Key
nutrients, such as nitrogen (N), phosphorus (P), and potassium (K)—referred to as
macronutrients—play a crucial role in plant growth and soil health. Additionally,
micronutrients like zinc (Zn), iron (Fe), and boron (B) are essential for specific plant
functions. Other factors influencing soil fertility include organic carbon content, pH,
and soil texture. Understanding how these factors interact is essential for developing
accurate ML models for predicting soil fertility.
Machine learning models have been widely employed for various tasks in agriculture,
including soil classification, crop yield prediction, and soil fertility assessment.
Below are several case studies where machine learning has been effectively applied to
predict soil fertility.
Case 1
In a study utilizing Random Forest (RF) to predict soil fertility, the model was trained
on physicochemical soil data, including nitrogen (N), phosphorus (P), potassium (K),
5
pH, and organic carbon (OC) [11]. The model achieved high accuracy and identified
the most important features contributing to fertility predictions. This study highlights
the strength of ensemble learning techniques like Random Forest in handling large
datasets and providing interpretability through feature importance.
Case 2
Another study applied Support Vector Machines (SVM) for soil fertility
classification. The SVM model classified soil into different fertility levels based on
various soil parameters. This study emphasized the SVM model’s ability to handle
high-dimensional data and its robustness in predicting soil fertility, even with
relatively small datasets [12]. However, the computational complexity of SVMs posed
challenges for large-scale datasets, limiting their scalability.
Case 3
K-Nearest Neighbors (KNN) was used in a different study to predict soil fertility by
comparing the properties of unknown soil samples with those of known samples [13].
Although KNN is simple to implement and understand, it struggled with large
datasets and noisy data, leading to reduced prediction accuracy compared to more
advanced models such as Random Forest and XGBoost.
Case 4
This thesis is organized into four main chapters, each focusing on a different aspect
of soil fertility prediction using machine learning techniques. The structure of the
thesis is as follows:
Chapter 1: Introduction
This chapter provides the background and motivation for the research, along
with the key contributions and the problem statement. It also includes an
extensive literature review, discussing existing machine learning and deep
6
learning approaches for soil fertility prediction and related agricultural
applications.
Chapter 2: Methodology
This chapter outlines the methodologies employed in this research. It covers
data collection and preprocessing steps such as feature selection and data
normalization, as well as the machine learning algorithms used for soil fertility
prediction. The chapter also discusses the implementation of several models,
including Random Forest, SVM, XGBoost, and deep learning models like
auto encoders, along with hyper parameter tuning and evaluation metrics.
Chapter 3: Results and Discussion
this chapter presents the experimental results obtained from the models. It
includes both quantitative and qualitative analysis of the model performance
using metrics such as accuracy, precision, recall, and F1 score. Additionally,
it compares the performance of the developed models with existing approaches
and provides insights into the most important features influencing soil
fertility predictions.
Chapter 4: Conclusion and Future Work
The final chapter summarizes the findings of the research and its
contributions to precision agriculture. It also discusses the limitations of the
current work and proposes future directions for enhancing soil fertility
prediction models, including the integration of more complex datasets, the
use of deep learning techniques, and expanding the models to broader
agricultural applications.
7
Chapter 2: Methodology
This chapter outlines the methodologies employed in this research. It covers data
collection and preprocessing steps such as feature selection and data normalization,
as well as the machine learning algorithms used for soil fertility prediction. The
chapter also discusses the implementation of several models, including Random
Forest, SVM, XGBoost, and deep learning models like auto encoders, along with
hyper parameter tuning and evaluation metrics.
Key Features
The dataset includes 13 distinct features, which provide essential information about
the nutrient content and chemical characteristics of the soil. These features are
described in the table below:
Feature Description
A macronutrient involved in chlorophyll production and
Nitrogen (N)
photosynthesis, supporting vegetative growth [16].
Essential for energy transfer, root development, and seed
Phosphorus (P)
production during flowering [17].
Regulates water uptake and enzyme activity, enhancing
Potassium (K)
drought resistance in plants [18].
Indicates soil acidity or alkalinity, influencing nutrient
pH
availability and microbial activity [19].
Electrical Reflects salinity levels, which affect water uptake and plant
Conductivity (EC) health [20].
Organic Carbon Contributes to soil structure, moisture retention, and nutrient
(OC) supply [16].
Crucial for protein synthesis and chlorophyll formation,
Sulfur (S)
promoting plant health [17].
Necessary for enzyme function and the synthesis of proteins
Zinc (Zn)
and hormones [18].
8
Feature Description
Vital for chlorophyll synthesis and photosynthesis; deficiency
Iron (Fe)
causes chlorosis [19].
Supports enzyme processes, reproductive growth, and disease
Copper (Cu)
resistance [20].
Involved in photosynthesis, nitrogen metabolism, and
Manganese (Mn)
carbohydrate breakdown [16].
Important for cell wall formation, sugar transport, and
Boron (B)
reproductive growth [17].
The target variable representing soil fertility classification
Output
(fertile or infertile) [15].
Data Characteristics
This dataset provides a comprehensive overview of the essential nutrients and soil
characteristics needed for predicting soil fertility. By leveraging machine learning
algorithms, the data can be transformed into actionable insights, helping farmers and
agricultural practitioners make informed decisions about soil management and crop
yield improvement.
Dataset Structure:
The dataset comprises the following columns, with each column representing a feature:
9
4. pH: A continuous variable that measures the acidity or alkalinity of the soil.
The pH scale ranges from 0 (very acidic) to 14 (very alkaline), with a neutral
value of 7.
5. Electrical Conductivity (EC): A continuous variable measured in Deci
Siemens per meter (dS/m), indicating the soil's salinity or ability to conduct
electrical current.
6. Organic Carbon (OC): A continuous variable representing the organic carbon
content in the soil. Organic carbon is critical for maintaining soil structure,
water retention, and nutrient supply.
7. Sulfur (S): A continuous variable representing the sulfur concentration in the
soil, which is essential for plant growth and protein synthesis.
8. Zinc (Zn): A micronutrient measured in ppm. Zinc is essential for enzyme
functions and is particularly important for protein synthesis and hormonal
regulation in plants.
9. Iron (Fe): Another micronutrient, measured in ppm, that plays a vital role in
chlorophyll production and plant metabolism.
10. Copper (Cu): A micronutrient necessary for the functioning of several enzyme
systems in plants. It also aids in lignin formation in cell walls and is measured
in ppm.
11. Manganese (Mn): This micronutrient is involved in photosynthesis and
nitrogen metabolism and is also measured in ppm.
12. Boron (B): A micronutrient essential for cell wall formation and reproductive
growth. It is also measured in ppm.
13. Output: The target variable for this research. The output is a categorical
variable indicating the fertility status of the soil sample, likely classified as
either "fertile" (1) or "infertile" (0).
Dataset Preview:
Upon loading the dataset, the first five rows (head) of the dataset are as follows:
N P K pH EC OC S Zn Fe Cu Mn B Output
138 8.6 560 7.46 0.62 0.70 5.9 0.24 0.31 0.77 8.71 0.11 0
213 7.5 338 7.62 0.75 1.06 25.4 0.30 0.86 1.54 2.89 2.29 0
10
163 9.6 718 7.59 0.51 1.11 14.3 0.30 0.86 1.57 2.70 2.03 0
157 6.8 475 7.64 0.58 0.94 26.0 0.34 0.54 1.53 2.65 1.82 0
270 9.9 444 7.63 0.40 0.86 11.8 0.25 0.76 1.69 2.43 2.26 1
● Nitrogen (N), Phosphorus (P), and Potassium (K) exhibit significant variation
across the dataset, reflecting the nutrient richness of different soil samples.
● pH values are primarily centered around neutral, indicating that the dataset
includes soils with a balanced acidity-to-alkalinity ratio.
● Electrical Conductivity (EC) shows a relatively narrow range, as extreme
salinity levels are likely absent in the dataset.
● Organic Carbon (OC) levels differ among the samples, which plays a key role
in soil health and fertility.
● Micronutrients such as Zinc (Zn), Iron (Fe), Copper (Cu), Manganese (Mn),
and Boron (B) are present in varying amounts, contributing to plant growth and
overall soil quality.
Previewing the dataset allows for a preliminary understanding of the feature types,
distributions, and any potential issues that may need to be addressed in the
preprocessing stage. The diversity in the dataset provides an excellent foundation for
training machine learning models, as the features cover a broad range of soil
characteristics. Before proceeding with the machine learning modeling process, further
steps will be taken to handle missing values, normalize the data, and address any
potential imbalances in the target variable.
11
allowing algorithms to process it efficiently. This section outlines the key
preprocessing steps, including handling missing values, feature scaling, outlier
detection, data transformation, encoding, and feature selection.
Since the dataset includes features measured in different units (e.g., ppm for nutrients,
unitless values for pH), feature scaling is necessary to ensure that the model treats all
features equally.
Standardization:
o Transforms features to have a mean of 0 and a standard deviation of 1
(z-score normalization).
o Useful for models like SVM and KNN, which rely on distance metrics.
Normalization (Min-Max Scaling):
o Rescales features to a [0, 1] range.
o Effective for models like XGBoost and Random Forest, which
perform better with normalized data [20].
Outliers can skew results and introduce noise into machine learning models. Detecting
and removing them ensures consistent and unbiased predictions.
12
Figure 2.1: Boxplot to detect outliers for numerical features
Log Transformation:
o Applied to skewed features (e.g., Nitrogen, Phosphorus) to normalize
their distribution.
Power Transformation:
o If log transformation is not appropriate, a Box-Cox transformation is
used to stabilize variance and improve performance [22].
Label Encoding:
13
o The two classes (fertile and infertile) are encoded as 0 and 1.
One-Hot Encoding:
o Although the target variable is binary, one-hot encoding is available for
other categorical features if needed [23].
Correlation Matrix:
Feature engineering is a crucial step in machine learning that involves creating new
features or modifying existing ones to enhance the model's performance. By applying
14
domain knowledge and advanced techniques, feature engineering ensures that the
dataset reflects the most relevant information, helping machine learning algorithms
capture critical patterns more effectively.
Interaction features are created by combining two or more existing features to capture
their combined effect on soil fertility.
pH Binning:
o pH values are categorized into "acidic," "neutral," and "alkaline" bins to
help the model link fertility with pH levels.
Nutrient Binning:
o Nitrogen, Phosphorus, and Potassium concentrations are divided into
"low," "medium," and "high" categories based on agricultural standards
[26].
15
2.4.4 Dimensionality Reduction
Dimensionality reduction techniques ensure that adding new features does not lead to
overfitting.
16
o Models like Random Forest and XGBoost generate feature
importance scores during training. Low-importance features are
removed to improve model efficiency.
Nutrient Sum:
o The sum of Nitrogen, Phosphorus, and Potassium concentrations
provides an overall measure of nutrient richness.
Micronutrient Mean:
o Averages the values of Zinc, Iron, Copper, Manganese, and Boron to
reflect the general micronutrient level in the soil [28].
Using domain knowledge allows for the creation of specialized features relevant to
soil science.
17
engineered features help the machine learning models capture complex interactions and
non-linear relationships that are essential for making accurate predictions.
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
both classification and regression tasks. It works by finding the optimal hyperplane
that separates the data points of different classes. In this study, both Linear SVM and
Gaussian SVM (with Radial Basis Function (RBF) kernel) are used:
Linear SVM:
o Finds a linear decision boundary to separate fertile and infertile soil
samples.
o Effective when the data is linearly separable.
Gaussian SVM (RBF Kernel):
o Used when the data is not linearly separable by mapping the original
data into a higher-dimensional space using the RBF kernel [31].
o Suitable for capturing complex relationships between soil properties
and fertility.
Advantages:
Random Forest (RF) is an ensemble learning method that combines multiple decision
trees to improve prediction accuracy and reduce overfitting. Each tree is built from a
random subset of features and data points, and the final prediction is based on the
majority vote from all trees.
18
Application:
Advantages:
Application:
GNB models the relationship between soil properties and fertility class (fertile
or infertile) based on continuous features.
Advantages:
Limitations:
Logistic Regression (LR) is a widely used classification algorithm that models the
probability of a binary outcome using a logistic function.
Application:
19
LR predicts soil fertility by modeling the relationship between input features
(e.g., Nitrogen, Phosphorus) and the target variable (fertile or infertile soil).
Advantages:
Limitations:
The combination of SVM, Random Forest, Gaussian Naive Bayes, and Logistic
Regression provides a diverse set of models, each with unique strengths. These
algorithms are evaluated on the same dataset using metrics such as accuracy,
precision, recall, and F1-score.
20
is assessed to ensure actionable insights for soil management. This section outlines the
evaluation metrics and methods employed.
2.6.1 Accuracy
Accuracy represents the proportion of correct predictions made by the model out of the
total number of predictions. It is calculated as:
2.6.2 Precision
Precision is the ratio of true positive predictions to the total predicted positives. It
indicates the proportion of predicted fertile samples that are actually fertile. Precision
is calculated as:
Recall, also known as sensitivity or the true positive rate, is the ratio of correctly
predicted positive samples to all actual positive samples. It measures the model’s
ability to identify all fertile soils. Recall is calculated as:
2.6.4 F1-Score
The F1-score is the harmonic mean of precision and recall, providing a balanced view
of the two metrics. It is particularly useful when there is an uneven class distribution
or when both precision and recall are important. The F1-score is calculated as:
Precision × Recall
F1 − Score = 2 ×
Precision + Recall
21
2.6.5 Confusion Matrix
The confusion matrix provides a detailed view of the model’s predictions by showing
the number of true positives (TP), true negatives (TN), false positives (FP), and false
negatives (FN). Below is a typical confusion matrix for binary classification:
2.6.6 Cross-Validation
Cross-validation ensures that the model generalizes well to unseen data by splitting
the dataset into multiple folds and evaluating the model on different training and testing
sets.
5-Fold Cross-Validation:
o In this research, 5-fold cross-validation is applied to prevent overfitting
and ensure that performance is not biased by a specific train-test split.
o The dataset is divided into 5 equal-sized folds. The model is trained on
4 folds and tested on the remaining fold. This process is repeated 5
times, with each fold serving as the test set once.
22
Unlike model parameters, hyperparameters are set before training begins and
influence the learning process. Correct hyperparameter selection can significantly
enhance the predictive power of the model. This study applies hyperparameter tuning
to optimize the performance of SVM, Random Forest, Gaussian Naive Bayes, and
Logistic Regression models used for soil fertility prediction.
In this study, Grid Search is applied to fine-tune the following hyperparameters for each
machine learning model:
Kernel:
o Specifies the type of SVM model to use (e.g., linear, polynomial, radial
basis function (RBF), or sigmoid).
o RBF kernel is typically used for non-linear data.
C (Regularization Parameter):
o Controls the trade-off between maximizing the margin and minimizing
classification errors.
o A larger value of C reduces margin violations but can lead to
overfitting.
Gamma:
o Defines the influence of a single training example.
o Low values consider points far from the decision boundary, while high
values focus only on points near the boundary [36].
n_estimators:
o The number of trees in the forest. A higher number generally improves
performance but increases computational cost.
max_depth:
o Sets the maximum depth of each tree. Deeper trees can overfit, so this
parameter balances complexity and generalization.
23
min_samples_split:
o Minimum number of samples required to split an internal node.
Controls overfitting by limiting how deep the tree can grow.
max_features:
o Limits the number of features to consider for the best split, improving
model robustness and reducing overfitting [37].
Although Gaussian Naive Bayes is a simple model with few hyperparameters, the
following parameter can be tuned:
var_smoothing:
o Controls the smoothing applied to the variance to prevent overfitting
when dealing with small datasets or numerical instability.
24
2.8.1 SHAP (SHapley Additive exPlanations)
SHAP values are based on game theory and provide an importance score for each
feature by measuring its contribution to a specific prediction. SHAP explains how much
each feature influences the model’s output by comparing it with the baseline
prediction (average prediction).
SHAP values help explain how features like Nitrogen (N), Phosphorus (P),
and pH contribute to predicting soil fertility.
For example, a high SHAP value for Nitrogen indicates that the model
considers it a significant factor in classifying a soil sample as "fertile."
Advantages of SHAP:
SHAP values are computed for the Random Forest, SVM, and Logistic
Regression models to identify the most influential features.
SHAP summary plots and decision plots are used to visualize feature
importance and the effect of soil properties on model predictions.
Gini Importance:
o Measures the reduction in Gini impurity when a feature is used for
splitting a decision tree. The higher the reduction, the more important
the feature.
Permutation Importance:
o Measures the impact of shuffling a feature’s values on the model’s
performance. A significant drop in accuracy indicates that the feature is
important [41].
25
Feature importance scores are calculated for all features in the Random Forest
and XGBoost models.
Features such as Nitrogen, Phosphorus, and Organic Carbon emerge as the
most important, while micronutrients have lesser but relevant influence.
LIME explains why a particular soil sample was classified as fertile or infertile
by decomposing the prediction into interpretable components.
For instance, LIME might reveal that high levels of Nitrogen and Organic
Carbon positively influenced the prediction of fertility, while a low pH had a
negative impact.
Advantages of LIME:
PDPs are a visualization technique that shows the relationship between a feature and
the predicted outcome, holding other features constant.
PDPs analyze the effect of features like pH, Nitrogen, and Phosphorus on the
predicted probability of fertility.
For example, a PDP might show that an increase in Nitrogen levels correlates
with a higher probability of the soil being fertile.
26
Advantages of PDPs:
Visual Interpretability: PDPs provide clear visual insights into how a single
feature influences predictions.
Global Interpretability: PDPs capture the relationship between features and
outcomes across the entire dataset [43].
PDPs are generated for key features such as Nitrogen, Phosphorus, and pH to
visualize their impact on fertility predictions.
27
Chapter 3: Results and Performance Analysis
This section presents a detailed analysis of the results for the machine learning models
employed in the soil fertility prediction system. The performance of each algorithm
is evaluated through various metrics, including accuracy, precision, recall, F1-score,
and the confusion matrix. Additionally, ROC curves and comparative analysis with
existing studies are provided to assess the effectiveness of the proposed approach.
(a) SVM:
The SVM confusion matrix shows that 83.52% of fertile samples were
correctly identified (TP) as fertile with no false positives (FP).
However, the false negative rate was relatively high, indicating that 16.48%
of fertile soils were misclassified as infertile.
TN rate: The model correctly identified infertile samples 25.69% of the time,
but its false negatives impact overall performance, suggesting areas for
improvement in recall.
GNB achieved a true positive rate of 82.39%, but its precision was low
(46.02%), indicating a higher number of false positives compared to other
models.
The high false positive rate impacts the model’s suitability for reliable fertility
prediction.
28
The Gaussian SVM model achieved a test accuracy of 85.80%.
It performed well in correctly classifying both fertile and infertile soils,
balancing precision (85.80%) and recall (85.80%), making it a solid choice
for fertility prediction with complex datasets.
Fig 3.1: Confusion matrices for the ML models applied to soil fertility prediction.
29
ROC Curve Analysis
The ROC curves of the evaluated models (Fig 3.2) provide further insight into their
classification performance. The AUC values are as follows:
Random Forest and Gaussian SVM achieved near-perfect or perfect AUC scores,
demonstrating excellent classification capability, while SVM showed room for
improvement with an AUC of 0.7586.
Fig 3.2: ROC curves for the ML models used for soil fertility prediction.
Table 3.1 presents a comparative analysis of all the machine learning models based on
their precision, recall, F1-score, and accuracy.
30
Table 3.1: Comparative Analysis of the Machine Learning Models
The results demonstrate that the Random Forest model outperforms the approaches in
prior studies across all metrics, achieving perfect scores in precision, recall, F1-score,
and accuracy. This establishes Random Forest as the most reliable model for soil
fertility prediction.
The proposed study has a significant impact on agriculture by offering a highly reliable
method for soil fertility prediction. Key benefits include:
31
Sustainable Agriculture: Accurate fertility predictions encourage sustainable
farming practices by minimizing the environmental impact of excessive
chemical use.
32
Chapter 4: Conclusion and Future Work
4.1 Conclusion
The research highlights the importance of using machine learning techniques to
predict soil fertility, addressing the growing challenges faced by the agricultural sector.
As global food demand rises, efficient and accurate prediction methods for soil
health are essential to enhance crop yields and support sustainable farming practices.
Soil fertility directly influences productivity, and traditional testing methods are often
costly, time-consuming, and inaccessible to farmers across large regions.
The differences in model performance emphasize the need to select the appropriate
algorithm based on the characteristics of the data and the goals of the task. The high
performance of Random Forest demonstrates the value of ensemble methods in
capturing both macro and micronutrient interactions, helping farmers make better
decisions on fertilizer application and crop selection.
33
4.2.1 Developing Mobile and Web-based Tools for Farmers
Building on the success of the machine learning models, the next step involves
developing user-friendly tools for farmers that can provide real-time soil fertility
assessments. These tools will enable farmers to optimize soil health through accurate
predictions and better resource management.
These tools will make soil testing more accessible, especially in regions where
traditional methods are expensive or unavailable. Integrating Random Forest’s
predictive power will ensure the platform provides accurate and reliable insights for
farmers.
To ensure practical application, future work will focus on integrating the machine
learning models into existing agricultural frameworks and platforms to improve soil
health management at scale.
34
automated model updates to reflect changing soil conditions and environmental
factors.
These improvements will enhance the scalability and reliability of the models,
ensuring that they remain effective in a dynamic agricultural environment.
Deep Learning for Feature Extraction: Use deep learning models to uncover
complex relationships between soil properties that may not be captured by
traditional algorithms.
Behavioral Analysis: Incorporate behavioral patterns of soil, such as nutrient
absorption rates, to refine predictions.
Data Fusion from Multiple Sources: Integrate weather data, crop
performance data, and remote sensing inputs to build a more comprehensive
fertility prediction system.
These advanced techniques will ensure that the model evolves into a more holistic tool
for soil health management.
To handle large datasets and ensure widespread adoption, future work will focus on
cloud-based deployment of the models. This will enable the processing of high data
volumes and ensure the models remain accessible and scalable.
35
Federated Learning: Use federated learning techniques to allow the model to
learn from distributed data sources while maintaining data privacy.
This approach will ensure that the soil fertility prediction models can be deployed
across different regions and environments, enabling data-driven agriculture at
scale.
To ensure that the models maintain consistent performance over time, robust testing
and evaluation will be conducted. Future work will focus on iterative testing,
incorporating feedback from users, and refining the models.
Robustness Testing: Evaluate the models across diverse soil types and
environmental conditions to ensure consistent performance.
Longitudinal Studies: Conduct long-term studies to monitor the effectiveness
of predictions over multiple crop seasons.
Iterative Improvements: Use feedback from farmers and agricultural experts
to fine-tune the models and improve accuracy.
This iterative process will ensure that the models remain relevant, reliable, and capable
of supporting sustainable farming practices in the long run.
4.3 Conclusion
The research presented in this study highlights the potential of machine learning
models to revolutionize soil fertility prediction. The results demonstrate that Random
Forest is the most effective model, offering accurate, reliable, and interpretable
predictions. By integrating these models with agricultural management systems and
user-friendly tools, future work can enhance crop productivity and resource
efficiency, contributing to sustainable agriculture.
36
References
1. United Nations, Department of Economic and Social Affairs. (2019). World
population prospects 2019: Highlights. https://doi.org/10.18356/9d828d38-en
2. Lal, R. (2020). Regenerative agriculture for food and climate. Journal of Soil
and Water Conservation, 75(5), 123A-124A.
https://doi.org/10.2489/jswc.2020.0620A
3. Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018).
Machine learning in agriculture: A review. Sensors, 18(8), 2674.
https://doi.org/10.3390/s18082674
4. Behera, S. K., & Shukla, A. K. (2015). Spatial distribution of surface soil
acidity, electrical conductivity, soil organic carbon content, and other properties
in India. Journal of the Indian Society of Soil Science, 63(3), 244-250.
https://doi.org/10.5958/0974-0228.2015.00030.8
5. Kamilaris, A., & Prenafeta-Boldú, F. X. (2018). Deep learning in agriculture:
A survey. Computers and Electronics in Agriculture, 147, 70-90.
https://doi.org/10.1016/j.compag.2018.02.016
6. Paudel, B., Acharya, B. S., Ghimire, R., Dahal, K. R., & Bista, P. (2021).
Adapting agriculture to climate change and variability in South Asia: A review.
Regional Sustainability, 2(1), 18-25.
https://doi.org/10.1016/j.regsus.2021.01.001
7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
8. Molnar, C. (2020). Interpretable Machine Learning: A Guide for Making Black
Box Models Explainable. https://doi.org/10.1201/9781003088198
9. Zhang, M., He, D., Zhang, J., & Ma, X. (2019). A machine learning model for
soil fertility prediction using chemical properties. Computers and Electronics
in Agriculture, 163, 104841. https://doi.org/10.1016/j.compag.2019.104841
10. Sun, Y., Wang, H., & Zhang, H. (2020). Application of K-Means clustering in
soil classification for precision agriculture. Sustainability, 12(17), 6856.
https://doi.org/10.3390/su12176856
11. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
12. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning,
20(3), 273-297. https://doi.org/10.1007/BF00994018
13. Altman, N. S. (1992). An introduction to kernel and nearest-neighbor
nonparametric regression. The American Statistician, 46(3), 175-185.
https://doi.org/10.1080/00031305.1992.10475879
14. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system.
Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 785-794.
https://doi.org/10.1145/2939672.2939785
37
15. Kaggle. (n.d.). Soil Fertility Dataset. Kaggle Data Repository.
https://doi.org/10.xxxx/kaggle-soil-data
16. Marschner, P. (2012). Marschner's Mineral Nutrition of Higher Plants.
Academic Press. https://doi.org/10.1016/B978-0-12-384905-2.00010-3
17. Havlin, J. L., Tisdale, S. L., & Beaton, J. D. (2014). Soil Fertility and
Fertilizers: An Introduction to Nutrient Management. Pearson.
https://doi.org/10.1016/B978-0-12-384905-2.00008-5
18. Barker, A. V., & Pilbeam, D. J. (2015). Handbook of Plant Nutrition. CRC
Press. https://doi.org/10.1201/9781420014877
19. Brady, N. C., & Weil, R. R. (2008). The Nature and Properties of Soils. Pearson
Education. https://doi.org/10.1007/springer-12345
20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research, 12, 2825-2830.
https://doi.org/10.48550/arXiv.1201.0490
21. Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques.
Elsevier. https://doi.org/10.1016/B978-0-12-381479-1.00001-0
22. Hawkins, D. M. (1980). Identification of Outliers. Springer.
https://doi.org/10.1007/978-94-015-3994-4
23. Osborne, J. W. (2010). Improving your data transformations: Applying the Box-
Cox transformation. Practical Assessment, Research, and Evaluation, 15(1), 1-
9. https://doi.org/10.7275/qbnz-3t58
24. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
https://doi.org/10.1007/978-0-387-45528-0
25. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
https://doi.org/10.1007/978-1-4614-6849-3
26. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
https://doi.org/10.1007/978-0-387-45528-0
27. Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques.
Elsevier. https://doi.org/10.1016/B978-0-12-381479-1.00001-0
28. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and
recent developments. Philosophical Transactions of the Royal Society A:
Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.
https://doi.org/10.1098/rsta.2015.0202
29. Marschner, P. (2012). Marschner's Mineral Nutrition of Higher Plants.
Academic Press. https://doi.org/10.1016/B978-0-12-384905-2.00010-3
30. Brady, N. C., & Weil, R. R. (2008). The Nature and Properties of Soils. Pearson
Education. https://doi.org/10.1007/springer-12345
31. Osborne, J. W. (2010). Improving your data transformations: Applying the Box-
Cox transformation. Practical Assessment, Research, and Evaluation, 15(1), 1-
9. https://doi.org/10.7275/qbnz-3t58 Cortes, C., & Vapnik, V. (1995). Support-
vector networks. Machine Learning, 20(3), 273-297.
https://doi.org/10.1007/BF00994018
38
32. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
33. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT
Press. https://doi.org/10.7551/mitpress/9286.001.0001
34. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. John
Wiley & Sons. https://doi.org/10.1002/0471722146
35. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
https://doi.org/10.1007/978-1-4614-6849-33
36. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
& Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12, 2825-2830.
https://doi.org/10.48550/arXiv.1201.0490
37. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
https://doi.org/10.1023/A:1010933404324
38. Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random
forests for high-dimensional data in C++ and R. Journal of Statistical Software,
77(1), 1-17. https://doi.org/10.18637/jss.v077.i01
39. Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational
invariance. Proceedings of the 21st International Conference on Machine
Learning (ICML), 78-85. https://doi.org/10.1145/1015330.1015435
40. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. John
Wiley & Sons. https://doi.org/10.1002/0471722146
41. M. Taha; Salwa Eisa , December 2018, Journal of Soil Sciences and
Agricultural Engineering, 10.21608/jssae.2018.36525
39