
Building Your First Machine Learning Model


Today, we're exploring a comprehensive guide to building a wine quality prediction model using some of the most powerful tools and libraries available in Python. Whether you're a beginner looking to understand the basics or an experienced data scientist aiming to refine your skills, this guide has something for everyone.

In this tutorial, we'll walk you through the entire process, from importing essential libraries to evaluating our machine learning models. We will be using popular libraries such as Pandas for data handling, NumPy for working with arrays, and Seaborn and Matplotlib for data visualization. Additionally, we will leverage the capabilities of scikit-learn and XGBoost to preprocess our data, develop our models, and evaluate their performance.


Our dataset, the well-known Wine Quality dataset, contains various chemical properties of wine and their corresponding quality scores. We'll start by examining and cleaning the dataset, ensuring that it's ready for analysis. Next, we'll perform exploratory data analysis (EDA) to uncover hidden patterns and insights within the data. Finally, we'll develop several machine learning models, compare their performance, and select the best one for our prediction task.

By the end of this guide, you'll have a solid understanding of the steps involved in building a predictive model, from data preprocessing and visualization to model training and evaluation. So, let's dive in and start our journey towards mastering wine quality prediction with Python!

Machine Learning (ML) is revolutionizing industries with its ability to learn from data and make predictions. If you're new to ML, building your first model might seem daunting. This step-by-step guide will walk you through the process, from data preparation to making predictions.

Building your first machine learning model involves understanding the problem, preparing data, choosing and training a model, and evaluating its performance. This guide covers those essential steps using several classifiers (Logistic Regression, XGBoost, and SVC) on the Wine Quality dataset.

Importing Libraries and Dataset

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

Loading and Exploring the Dataset

Python
df = pd.read_csv('winequality.csv')
print(df.head())

Output:

First five rows of the dataset

Let's explore the type of data present in each column of the dataset.

Python
df.info()

Output:

Information about the columns of the data

Statistical Summary of the Dataset

Python
df.describe().T

Output:

Descriptive statistical measures of the dataset

Exploratory Data Analysis

EDA is an approach to analysing the data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Now let's check the number of null values in the dataset, column-wise.

Python
df.isnull().sum()

Output:

Sum of null values, column-wise

Let's impute the missing values with the column means, since the data present in the different columns are continuous values.

Python
# Replace missing values in each column with that column's mean
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())

df.isnull().sum().sum()

Output:

0
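Equivalently, scikit-learn ships a SimpleImputer transformer that performs the same mean imputation; a minimal sketch, assuming df is the dataframe loaded above:

Python
from sklearn.impute import SimpleImputer

# Mean-impute only the numeric columns; equivalent to the loop above
num_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])

Using a transformer like this becomes handy later if you want to fit the imputation on training data only and reuse it on unseen data.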

Let's draw histograms to visualise the distribution of the continuous-valued columns of the dataset.

Python
df.hist(bins=20, figsize=(10, 10))
plt.show()

Output:

Histograms for the columns containing continuous data

Now let's draw a bar plot to visualise the alcohol content for each quality of wine.

Python
plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()

Output:

Bar plot of alcohol content for each wine quality
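If you want a true count plot, i.e. the number of samples at each quality score, Seaborn provides countplot; a minimal sketch using the sb alias from our imports:

Python
# Number of wine samples for each quality score
sb.countplot(x='quality', data=df)
plt.show()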

At times, the data provided to us contains redundant features that do not help increase the model's performance; that is why we remove them before using them to train our model. A correlation heat map helps us spot such features.

Python
plt.figure(figsize=(12, 12))
# numeric_only=True skips the non-numeric 'type' column
sb.heatmap(df.corr(numeric_only=True) > 0.7, annot=True, cbar=False)
plt.show()

Output:

Heat map highlighting highly correlated features
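To back up the visual check with numbers, we can list the feature pairs whose absolute correlation exceeds 0.7; a small sketch, assuming df is the imputed dataframe from above:

Python
# Print feature pairs with |correlation| > 0.7
corr = df.corr(numeric_only=True).abs()
cols = corr.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.7:
            print(a, '<->', b, round(corr.loc[a, b], 2))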

From the above heat map we can conclude that 'total sulfur dioxide' and 'free sulfur dioxide' are highly correlated features, so we will drop one of them.

Python
df = df.drop('total sulfur dioxide', axis=1)

Model Development

Let's prepare our data for training and split it into training and validation sets, so that we can select the model whose performance is best for the use case. We will train some state-of-the-art machine learning classification models and then select the best of them using the validation data.

Python
# Binary target: 1 if the quality score is above 5, else 0
df['best quality'] = [1 if x > 5 else 0 for x in df.quality]

We also have a column with the object data type, the wine 'type'; since it has only two categories, let's replace them with 1 and 0.

Python
# Encode the wine type: white -> 1, red -> 0
df.replace({'white': 1, 'red': 0}, inplace=True)

After segregating the features and the target variable from the dataset, we will split it in an 80:20 ratio for model selection.

Python
features = df.drop(['quality', 'best quality'], axis=1)
target = df['best quality']

xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40)

xtrain.shape, xtest.shape

Output:

((5197, 11), (1300, 11))
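Since the two classes are not guaranteed to be balanced, you may also want to pass stratify=target so that both splits keep the same class proportions; a minimal variant of the split above:

Python
# Stratified 80:20 split keeps the class ratio equal in both splits
xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40, stratify=target)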

Normalising the data before training helps us achieve stable and fast training of the model. Note that the scaler is fitted on the training data only and then reused on the validation data, which avoids leaking information from the validation set.

Python
norm = MinMaxScaler()
xtrain = norm.fit_transform(xtrain)
xtest = norm.transform(xtest)

As the data has been prepared completely, let's train some state-of-the-art machine learning models on it.

Python
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]

for i in range(3):
    models[i].fit(xtrain, ytrain)

    print(f'{models[i]} : ')
    # Note: roc_auc_score computes ROC AUC, not plain accuracy
    print('Training ROC AUC : ', metrics.roc_auc_score(
        ytrain, models[i].predict(xtrain)))
    print('Validation ROC AUC : ', metrics.roc_auc_score(
        ytest, models[i].predict(xtest)))
    print()

Output:

ROC AUC of the models on the training and validation data
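One caveat: ROC AUC is normally computed from continuous scores rather than hard 0/1 predictions. A minimal sketch of the more standard approach for the Logistic Regression model (models[0]), which exposes predict_proba:

Python
# ROC AUC from the predicted probability of the positive class
probs = models[0].predict_proba(xtest)[:, 1]
print('Validation ROC AUC (probabilities) : ',
      metrics.roc_auc_score(ytest, probs))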

Model Evaluation

From the above scores we can say that Logistic Regression and the SVC classifier perform better on the validation data, with a smaller gap between the training and validation scores. Let's plot the confusion matrix for the validation data using the Logistic Regression model (models[0]).

Python
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
metrics.ConfusionMatrixDisplay.from_estimator(models[0], xtest, ytest)
plt.show()

Output:

Confusion matrix drawn on the validation data

Let's also print the classification report for the best performing model.

Python
print(metrics.classification_report(ytest,
                                    models[0].predict(xtest)))

Output:

Classification report for the validation data
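Finally, to use the trained model on new data, remember to push the sample through the same MinMaxScaler before calling predict. A minimal sketch with hypothetical feature values, assuming they follow the same column order as features (type first, alcohol last):

Python
# Hypothetical new sample: 11 values in the same order as the
# columns of `features` (type, fixed acidity, ..., alcohol)
new_wine = [[1, 7.0, 0.27, 0.36, 20.7, 0.045,
             45.0, 1.001, 3.0, 0.45, 8.8]]

# Apply the scaler fitted on the training data, then predict
new_wine_scaled = norm.transform(new_wine)
print('Predicted class (1 = good quality) :',
      models[0].predict(new_wine_scaled)[0])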
