
Building Your First Machine Learning Model


Today, we're exploring a comprehensive guide to building a wine quality prediction model using some of the most powerful tools and libraries available in Python. Whether you're a beginner looking to understand the basics or an experienced data scientist aiming to refine your skills, this guide has something for everyone.

In this tutorial, we'll walk you through the entire process, from importing essential libraries to evaluating our machine learning models. We will be using popular libraries such as Pandas for data handling, NumPy for working with arrays, and Seaborn and Matplotlib for data visualization. Additionally, we will leverage the capabilities of scikit-learn and XGBoost to preprocess our data, develop our models, and evaluate their performance.


Our dataset, the well-known Wine Quality dataset, contains various chemical properties of wine and their corresponding quality scores. We'll start by examining and cleaning the dataset, ensuring that it's ready for analysis. Next, we'll perform exploratory data analysis (EDA) to uncover hidden patterns and insights within the data. Finally, we'll develop several machine learning models, compare their performance, and select the best one for our prediction task.

By the end of this guide, you'll have a solid understanding of the steps involved in building a predictive model, from data preprocessing and visualization to model training and evaluation. So, let's dive in and start our journey towards mastering wine quality prediction with Python!

Machine Learning (ML) is revolutionizing industries with its ability to learn from data and make predictions. If you're new to ML, building your first model might seem daunting. This step-by-step guide will walk you through the process, from data preparation to making predictions.

Building your first machine learning model involves understanding the problem, preparing data, choosing and training a model, and evaluating its performance. This guide covers those essential steps using several classifiers (Logistic Regression, XGBoost, and SVC) on the Wine Quality dataset.

Importing Libraries and Dataset

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

Loading and Exploring the Dataset

Python
df = pd.read_csv('winequality.csv')
print(df.head())

Output:

First five rows of the dataset

Let's explore the type of data present in each column of the dataset.

Python
df.info()

Output:

Information about the columns of the data

Statistical Summary of the Dataset

Python
df.describe().T

Output:

Descriptive statistical measures of the dataset

Exploratory Data Analysis

EDA is an approach to analysing the data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Now let's check the number of null values in the dataset, column-wise.

Python
df.isnull().sum()

Output:

Sum of null values, column-wise

Let's impute the missing values with the column means, since the data present in the different columns are continuous values.

Python
# Replace missing values in each column with that column's mean
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())

df.isnull().sum().sum()

Output:

0
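Equivalently, scikit-learn ships a SimpleImputer transformer that performs the same mean imputation; a minimal sketch, assuming df is the dataframe loaded above:

Python
from sklearn.impute import SimpleImputer

# Mean-impute only the numeric columns; equivalent to the loop above
num_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])

Using a transformer like this becomes handy later if you want to fit the imputation on training data only and reuse it on unseen data.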

Let's draw histograms to visualise the distribution of the continuous-valued columns of the dataset.

Python
df.hist(bins=20, figsize=(10, 10))
plt.show()

Output:

Histograms for the columns containing continuous data

Now let's draw a bar plot to visualise the alcohol content for each quality of wine.

Python
plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()

Output:

Bar plot of alcohol content for each wine quality
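If you want a true count plot, i.e. the number of samples at each quality score, Seaborn provides countplot; a minimal sketch using the sb alias from our imports:

Python
# Number of wine samples for each quality score
sb.countplot(x='quality', data=df)
plt.show()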

At times, the data provided to us contains redundant features that do not help increase the model's performance; that is why we remove them before using them to train our model. A correlation heat map helps us spot such features.

Python
plt.figure(figsize=(12, 12))
# numeric_only=True skips the non-numeric 'type' column
sb.heatmap(df.corr(numeric_only=True) > 0.7, annot=True, cbar=False)
plt.show()

Output:

Heat map highlighting highly correlated features
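To back up the visual check with numbers, we can list the feature pairs whose absolute correlation exceeds 0.7; a small sketch, assuming df is the imputed dataframe from above:

Python
# Print feature pairs with |correlation| > 0.7
corr = df.corr(numeric_only=True).abs()
cols = corr.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.7:
            print(a, '<->', b, round(corr.loc[a, b], 2))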

From the above heat map we can conclude that 'total sulfur dioxide' and 'free sulfur dioxide' are highly correlated features, so we will drop one of them.

Python
df = df.drop('total sulfur dioxide', axis=1)

Model Development

Let's prepare our data for training and split it into training and validation sets, so that we can select the model whose performance is best for the use case. We will train some state-of-the-art machine learning classification models and then select the best of them using the validation data.

Python
# Binary target: 1 if the quality score is above 5, else 0
df['best quality'] = [1 if x > 5 else 0 for x in df.quality]

We also have a column with the object data type, the wine 'type'; since it has only two categories, let's replace them with 1 and 0.

Python
# Encode the wine type: white -> 1, red -> 0
df.replace({'white': 1, 'red': 0}, inplace=True)

After segregating the features and the target variable from the dataset, we will split it in an 80:20 ratio for model selection.

Python
features = df.drop(['quality', 'best quality'], axis=1)
target = df['best quality']

xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40)

xtrain.shape, xtest.shape

Output:

((5197, 11), (1300, 11))
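Since the two classes are not guaranteed to be balanced, you may also want to pass stratify=target so that both splits keep the same class proportions; a minimal variant of the split above:

Python
# Stratified 80:20 split keeps the class ratio equal in both splits
xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40, stratify=target)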

Normalising the data before training helps us achieve stable and fast training of the model. Note that the scaler is fitted on the training data only and then reused on the validation data, which avoids leaking information from the validation set.

Python
norm = MinMaxScaler()
xtrain = norm.fit_transform(xtrain)
xtest = norm.transform(xtest)

As the data has been prepared completely, let's train some state-of-the-art machine learning models on it.

Python
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]

for i in range(3):
    models[i].fit(xtrain, ytrain)

    print(f'{models[i]} : ')
    # Note: roc_auc_score computes ROC AUC, not plain accuracy
    print('Training ROC AUC : ', metrics.roc_auc_score(
        ytrain, models[i].predict(xtrain)))
    print('Validation ROC AUC : ', metrics.roc_auc_score(
        ytest, models[i].predict(xtest)))
    print()

Output:

ROC AUC of the models on the training and validation data
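One caveat: ROC AUC is normally computed from continuous scores rather than hard 0/1 predictions. A minimal sketch of the more standard approach for the Logistic Regression model (models[0]), which exposes predict_proba:

Python
# ROC AUC from the predicted probability of the positive class
probs = models[0].predict_proba(xtest)[:, 1]
print('Validation ROC AUC (probabilities) : ',
      metrics.roc_auc_score(ytest, probs))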

Model Evaluation

From the above scores we can say that Logistic Regression and the SVC classifier perform better on the validation data, with a smaller gap between the training and validation scores. Let's plot the confusion matrix for the validation data using the Logistic Regression model (models[0]).

Python
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
metrics.ConfusionMatrixDisplay.from_estimator(models[0], xtest, ytest)
plt.show()

Output:

Confusion matrix drawn on the validation data

Let's also print the classification report for the best performing model.

Python
print(metrics.classification_report(ytest,
                                    models[0].predict(xtest)))

Output:

Classification report for the validation data
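Finally, to use the trained model on new data, remember to push the sample through the same MinMaxScaler before calling predict. A minimal sketch with hypothetical feature values, assuming they follow the same column order as features (type first, alcohol last):

Python
# Hypothetical new sample: 11 values in the same order as the
# columns of `features` (type, fixed acidity, ..., alcohol)
new_wine = [[1, 7.0, 0.27, 0.36, 20.7, 0.045,
             45.0, 1.001, 3.0, 0.45, 8.8]]

# Apply the scaler fitted on the training data, then predict
new_wine_scaled = norm.transform(new_wine)
print('Predicted class (1 = good quality) :',
      models[0].predict(new_wine_scaled)[0])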
