Building Your First Machine Learning Model
Last Updated: 01 Jul, 2024
Today, we're exploring a comprehensive guide to building a wine quality prediction model using some of the most powerful tools and libraries available in Python. Whether you're a beginner looking to understand the basics or an experienced data scientist aiming to refine your skills, this guide has something for everyone.
In this tutorial, we'll walk you through the entire process, from importing essential libraries to evaluating our machine learning models. We will be using popular libraries such as Pandas for data handling, NumPy for working with arrays, and Seaborn and Matplotlib for data visualization. Additionally, we will leverage the capabilities of Scikit-Learn and XGBoost to preprocess our data, develop our models, and evaluate their performance.
Our dataset, the well-known Wine Quality dataset, contains various chemical properties of wine and their corresponding quality scores. We'll start by examining and cleaning the dataset, ensuring that it's ready for analysis. Next, we'll perform exploratory data analysis (EDA) to uncover hidden patterns and insights within the data. Finally, we'll develop several machine learning models, compare their performance, and select the best one for our prediction task.
By the end of this guide, you'll have a solid understanding of the steps involved in building a predictive model, from data preprocessing and visualization to model training and evaluation. So, let's dive in and start our journey towards mastering wine quality prediction with Python!
Machine Learning (ML) is revolutionizing industries with its ability to learn from data and make predictions. If you're new to ML, building your first model might seem daunting. This step-by-step guide walks you through the process, from data preparation to making predictions.
Building your first machine learning model involves understanding the problem, preparing the data, choosing and training a model, and evaluating its performance. This guide covers those essential steps using the Wine Quality dataset and classifiers such as Logistic Regression, XGBoost, and SVC.
Importing Libraries and Dataset
Python
import numpy as np                # array operations
import pandas as pd               # data handling
import matplotlib.pyplot as plt   # plotting
import seaborn as sb              # statistical visualisation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')  # hide library warnings for cleaner output
Loading and Exploring the Dataset
Python
df = pd.read_csv('winequality.csv')
print(df.head())
Output:
First five rows of the dataset
Let's explore the data type of each column in the dataset.
Python
df.info()
Output:
Information about columns of the data
Statistical Summary of the Dataset
Python
df.describe()
Output:
Some descriptive statistical measures of the dataset
Exploratory Data Analysis
EDA is an approach to analysing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Now let's check the number of null values in the dataset column-wise.
Python
df.isnull().sum()
Output:
Sum of null values column-wise
Let's impute the missing values with each column's mean, since the data in these columns are continuous values.
Python
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())

df.isnull().sum().sum()
Output:
0
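As an alternative sketch, scikit-learn's SimpleImputer performs the same mean imputation; it is applied here only to the numeric columns, since a mean is undefined for the object-typed 'type' column.
Python
from sklearn.impute import SimpleImputer

# Fill NaNs in every numeric column with that column's mean
num_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])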
Let's draw histograms to visualise the distribution of the continuous-valued columns of the dataset.
Python
df.hist(bins=20, figsize=(10, 10))
plt.show()
Output:
Histograms for the columns containing continuous data
Now let's draw a bar plot to visualise the alcohol content for each quality of wine.
Python
plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
Output:
Bar plot of alcohol content for each wine quality
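If you also want to see how many wines fall into each quality score, seaborn's countplot gives that directly; a minimal sketch using the sb alias imported above.
Python
sb.countplot(x='quality', data=df)
plt.show()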
Sometimes the data provided to us contains redundant features. They do not help improve the model's performance, which is why we remove them before training the model.
Python
plt.figure(figsize=(12, 12))
# numeric_only=True skips the object-typed 'type' column (required in pandas 2.x)
sb.heatmap(df.corr(numeric_only=True) > 0.7, annot=True, cbar=False)
plt.show()
Output:
Heat map of highly correlated features
From the above heat map we can see that 'total sulfur dioxide' and 'free sulfur dioxide' are highly correlated, so we will drop one of them.
Python
df = df.drop('total sulfur dioxide', axis=1)
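Rather than reading the pairs off the heat map, highly correlated columns can also be found programmatically. A minimal sketch with the same 0.7 threshold; run on the data before the drop above, it would flag 'total sulfur dioxide'.
Python
# Keep only the upper triangle of the absolute correlation matrix,
# so each feature pair is considered once
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns correlated above 0.7 with an earlier column are candidates to drop
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)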
Model Development
Let's prepare our data for training and split it into training and validation sets, so that we can select the model whose performance is best for the use case. We will train some state-of-the-art machine learning classification models and then pick the best of them using the validation data.
Python
df['best quality'] = [1 if x > 5 else 0 for x in df.quality]
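Equivalently, the same label can be built with a vectorized pandas expression; a stylistic alternative with an identical result.
Python
df['best quality'] = (df['quality'] > 5).astype(int)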
We also have a column with the object data type; let's replace its values with 0 and 1, as there are only two categories.
Python
df.replace({'white': 1, 'red': 0}, inplace=True)
After segregating the features and the target variable from the dataset, we will split it in an 80:20 ratio for model selection.
Python
features = df.drop(['quality', 'best quality'], axis=1)
target = df['best quality']

xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40)
xtrain.shape, xtest.shape
Output:
((5197, 11), (1300, 11))
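If the two classes turn out to be imbalanced, passing stratify=target to train_test_split preserves the class ratio in both splits; a minimal variation on the split above.
Python
xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40, stratify=target)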
Normalising the data before training helps us achieve stable and fast training of the model.
Python
norm = MinMaxScaler()
xtrain = norm.fit_transform(xtrain)
xtest = norm.transform(xtest)
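As a side note, a scikit-learn Pipeline can bundle the scaler and a model so that scaling happens inside fit(); a sketch, assuming it is fit on the raw, unscaled split rather than the arrays transformed above.
Python
from sklearn.pipeline import make_pipeline

# The pipeline fits the scaler on training data only, avoiding leakage,
# and applies the same transform automatically at predict time
pipe = make_pipeline(MinMaxScaler(), LogisticRegression())
This removes the risk of accidentally calling fit_transform on the test set.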
Now that the data has been fully prepared, let's train some state-of-the-art machine learning models on it.
Python
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]

for i in range(3):
    models[i].fit(xtrain, ytrain)
    print(f'{models[i]} : ')
    # roc_auc_score, not plain accuracy, is what gets printed here
    print('Training ROC-AUC : ', metrics.roc_auc_score(
        ytrain, models[i].predict(xtrain)))
    print('Validation ROC-AUC : ', metrics.roc_auc_score(
        ytest, models[i].predict(xtest)))
    print()
Output:
ROC-AUC of each model on the training and validation data
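A single 80:20 split can be noisy, so before trusting the comparison it can help to cross-validate; a minimal sketch computing 5-fold ROC-AUC for the logistic regression model on the training data.
Python
from sklearn.model_selection import cross_val_score

# Mean and spread of ROC-AUC across 5 folds of the training data
scores = cross_val_score(LogisticRegression(), xtrain, ytrain,
                         cv=5, scoring='roc_auc')
print(scores.mean(), scores.std())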
Model Evaluation
From the above scores we can say that Logistic Regression and the SVC classifier perform better on the validation data, with a smaller gap between training and validation scores. Let's plot the confusion matrix for the validation data using the Logistic Regression model.
Python
# plot_confusion_matrix was removed in scikit-learn 1.2; use
# ConfusionMatrixDisplay.from_estimator instead. models[0] is Logistic Regression.
metrics.ConfusionMatrixDisplay.from_estimator(models[0], xtest, ytest)
plt.show()
Output:
Confusion matrix drawn on the validation data
Let's also print the classification report for the best performing model.
Python
print(metrics.classification_report(ytest,
                                    models[0].predict(xtest)))
Output:
Classification report for the validation data