Red Wine Quality Prediction Using Machine Learning Techniques

This document summarizes a research paper that predicts red wine quality using machine learning techniques. The paper uses a dataset from UCI with physicochemical properties and alcohol content of wines to predict quality ratings between 3-8. It applies classification algorithms like Naive Bayes, Support Vector Machine, and Random Forest. Performance is evaluated using metrics like accuracy, F1 score, and misclassification error. The paper finds that machine learning can effectively predict red wine quality based on objective measurements.


Available at: https://www.researchgate.net/publication/341812162


Conference Paper · January 2020


DOI: 10.1109/ICCCI48352.2020.9104095


Red Wine Quality Prediction Using Machine
Learning Techniques
Sunny Kumar, Kanika Agrawal, Nelshan Mandan
Department of Computer Science, Roorkee Institute of Technology
Roorkee, Uttarakhand, India

Abstract—Nowadays people try to lead a luxurious life and tend to use things either for show or as part of their daily routine. The consumption of red wine is now very common. This research therefore deals with predicting the quality of red wine from its various attributes. The dataset is taken from the UCI Machine Learning Repository, and the Random Forest, Support Vector Machine and Naïve Bayes techniques are applied. Various performance measures are calculated, the results on the training set and the testing set are compared, and the best of the three techniques is identified on the basis of the training-set results.

Keywords— processes; data extraction; Naïve Bayes; SVM; Random Forest; quality.

I. INTRODUCTION
The process of discovering new patterns in order to extract quality information from huge repositories is known as data mining. It draws on statistics, machine learning and database systems. The fundamental aim of data mining is to extract significant information from huge databases and then transform it into a meaningful form for future research. Knowledge Discovery in Databases (KDD) generally includes data mining as its key analysis step. Apart from the analysis itself, it also involves complexity considerations, analysis of large data stores, pre-analysis and post-analysis of the data, and finally finding the interesting data and updating it. Data analysis typically tests hypotheses and models on the data, regardless of the content of the data. Data mining is a blend of statistical models and machine learning. The term refers to the extraction of knowledge and patterns from huge datasets. A data mining task is the automatic process of extracting patterns from a large amount of data, finding the inconsistencies and finally producing the required result. Other terms such as data fishing, data dredging and data snooping refer to the generation of new hypotheses from a larger data collection.

II. LITERATURE REVIEW
Today, customers appreciate wine to an ever-increasing extent. The wine industry is investing in new technologies for both wine making and selling processes in order to support this growth [1]. Physicochemical and sensory tests are used for evaluating wine certification [2]. The classification of wines is not a simple process owing to the complexity and heterogeneity of their headspace. The classification of wines is important for several reasons: the economic value of wine products, protecting and assuring the quality of wines, preventing adulteration of wines, and controlling beverage processing [3]. Data mining technologies have been applied to classify wine quality. As in other applications, the aim of machine learning techniques is to build models from data in order to predict wine quality.

In 1991, a "Wine" dataset containing 178 instances with measurements of 13 different chemical constituents, such as alcohol and magnesium, was donated to the UCI repository to classify three cultivars from Italy [4]. This dataset has been widely used as a benchmark for new data mining classifiers because it is very easy to discriminate. Principal Component Analysis (PCA) was performed and reported for wine classification according to geographical origin [5]. The data used in that study comprise 33 Greek wines with physicochemical variables. Another work on wine classification relied on physicochemical data associated with wine aroma chromatograms measured with a Fast GC Analyser [6]. In the latter study, three classification methods, namely Naïve Bayes, Random Forest and Support Vector Machines (SVM), are compared according to their performance in a two-stage architecture.

Some authors have proposed applications of data mining techniques to wine quality assessment. Cortez et al. [1] proposed a taste prediction approach in which a Support Vector Machine, Naïve Bayes and a Random Forest were applied to the analysis of wines. Shanmuganathan's approach concerned predicting the effects of season and climate on wine yields and wine quality [7]. The Wineinformatics approach of Chen et al. [8] described the flavour and attributes of wine from natural-language reviews, using association rules and hierarchical clustering. In research article [9], the authors compared different machine learning algorithms, namely Naïve Bayes, Decision Tree and Support Vector Machines, on Cardiotocography data to find the best algorithm among them. In research article [10], the authors presented the different techniques, applications and challenges of text analysis.
III. RESEARCH METHODOLOGY AND EXPERIMENT DESIGN

The data is extracted from the UCI Machine Learning Repository [11]. The dataset contains 1599 instances with 12 variables for red wine. The evaluation is based on the input variables and concludes with the prediction of red wine quality. For this dataset, quality values range from 3 to 8, where ‘3’ denotes poor quality and ‘8’ denotes excellent quality of red wine.

The features include fixed acidity, citric acid, volatile acidity, residual sugar, chlorides, density, free sulphur dioxide, total sulphur dioxide, pH, alcohol and sulphates. The pH value indicates how acidic or basic the wine is; consumable wines have a pH between 3 and 4. The chloride content reflects the amount of salt in the wine. The goal of the dataset is to predict the rating that an expert would give to a wine sample, using a range of physicochemical properties such as acidity and alcohol content. Owing to privacy and logistic issues, only physicochemical (input) and output variables are available.

In the field of machine learning, a confusion matrix is a table that is frequently used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized. This research uses the red wine dataset, calculates the confusion matrix and the relevant performance measures, and finally compares the different machine learning algorithms on the basis of the accuracy obtained on this dataset.

A. Performance Measures Used in Research
Performance measures are used to calculate and evaluate the effectiveness and efficiency of the techniques. Some of them are listed below:

• Accuracy: It is the value obtained when the sum of True Positive and True Negative is divided by the sum of the True Positive, False Positive, False Negative and True Negative values of a confusion matrix.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative in a confusion matrix.

• Precision: It is the value obtained when True Positive is divided by the sum of the True Positive and False Positive values of a confusion matrix.

Precision = TP / (TP + FP)

• Recall: Recall is also known as Sensitivity. It is the value obtained when True Positive is divided by the sum of the True Positive and False Negative values of a confusion matrix.

Recall = TP / (TP + FN)

• Specificity: Specificity is the counterpart of Recall for the negative class. It is the value obtained when True Negative is divided by the sum of the True Negative and False Positive values of a confusion matrix.

Specificity = TN / (TN + FP)

• F-Measure: The F1 Score is obtained by multiplying Recall and Precision, dividing by the sum of Recall and Precision, and multiplying the result by two.

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)

• Misclassification Error: It is obtained by subtracting accuracy from one and gives the fraction of incorrect predictions.

Error = 1 - Accuracy

B. Techniques Involved in Research

The techniques used in the research are:

• Naive Bayes Algorithm: The Naive Bayes algorithm is based on Bayes' theorem. It uses probability to determine whether a particular instance belongs to a particular class. Naive Bayes classifiers are highly scalable, requiring a number of parameters that is linear in the number of variables in a learning problem.
• Support Vector Machine: This technique derives from the statistical learning theory of Vapnik and Chervonenkis. It was first presented in 1992 by Boser, Guyon and Vapnik. It is used for the classification of both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension, and it searches for the linear optimal separating hyperplane in this new dimension. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can be separated by a hyperplane. The SVM finds this hyperplane using support vectors and margins [12]. An SVM model is a representation of the examples as points in space, mapped so that examples of the different classes are separated by a gap that is as wide as possible. SVM can thus perform nonlinear classification.
• Random Forest: This technique uses a combination of tree predictors; each individual tree depends on a random vector, and this random vector has the same distribution for all trees in the forest. It was described by Breiman in 2001 [13]. Random Forest also helps to identify the important variables in classification and regression problems in a simple way.

IV. IMPLEMENTATION

An analysis is done on the redwine.csv dataset extracted from the UCI Machine Learning Repository [11], which contains the details of red wine. The dataset contains 1599 observations and has 12 attributes such as fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates and alcohol. These attributes are used to predict the quality of red wine. The dataset is divided into a training set and a testing set with probabilities 0.7 and 0.3 respectively. Libraries such as naivebayes, psych, dplyr, knitr, ggplot2, randomForest and e1071 are imported. After importing the libraries, a model summary is computed for the Naïve Bayes, Random Forest and Support Vector Machine algorithms. A 6×6 confusion matrix, determined by the dataset observations and the six quality levels, is then calculated. The variable ‘matrix’ is used to denote the confusion matrix. Further performance measures such as precision, recall, specificity, F-measure, accuracy and misclassification error are calculated for each algorithm, and the results are reported on the basis of these measures. This research finally shows that, on the red wine dataset extracted from UCI, the best accuracy is achieved by the Support Vector Machine algorithm, followed by the Random Forest algorithm and then the Naïve Bayes algorithm.

Fig. 1 shows the steps used in the research to predict the quality of red wine using data mining techniques. Fig. 2 shows the 1599 observations and 12 variables of the red wine dataset. Fig. 3 and Fig. 4 show the mean and standard deviation values of the different attributes for the training set and the testing set, respectively, using the Naïve Bayes algorithm. Fig. 5 and Fig. 6 show the confusion matrices for the training set and the testing set using the Naïve Bayes algorithm. Fig. 7 and Fig. 8 show the confusion matrices for the training set and the testing set using the Support Vector Machine algorithm. Fig. 9 and Fig. 10 show the confusion matrices for the training set and the testing set using the Random Forest algorithm.

Fig. 2. Snapshot showing 1599 observations and 12 variables of red wine dataset.
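The performance measures of Section III-A are defined for a two-class confusion matrix, whereas the matrix described here is 6×6. One common way to bridge the two, sketched below in Python rather than the R environment the paper reports, is to treat each quality level one-vs-rest. This is an assumption about the procedure, not something the paper states. The row/column convention (rows are actual classes, columns are predicted classes), the function names and the small 3×3 matrix are all hypothetical and chosen only for illustration.

```python
def per_class_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class index k, treated one-vs-rest.

    Assumption: matrix[i][j] counts samples of actual class i predicted as j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                       # actual k, predicted as another class
    fp = sum(matrix[i][k] for i in range(n)) - tp  # predicted k, actually another class
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def class_measures(matrix, k):
    """Precision, recall, specificity and F1 for class k, as in Section III-A."""
    tp, fp, fn, tn = per_class_counts(matrix, k)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * recall * precision, recall + precision)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


def overall_accuracy(matrix):
    """Share of correctly classified samples (diagonal sum over total)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return safe_div(correct, total)


# Made-up 3x3 matrix for illustration only; these are not the paper's values.
example = [
    [5, 2, 0],
    [1, 6, 1],
    [0, 2, 3],
]
accuracy = overall_accuracy(example)
error = 1 - accuracy  # misclassification error
measures_class0 = class_measures(example, 0)
print(accuracy, error, measures_class0)
```

The same functions apply unchanged to a 6×6 matrix; only the range of the class index grows.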

Fig. 1. Flow chart showing steps used in research to predict red wine quality.

Fig. 3. Snapshot showing the mean and standard deviation values for different attributes of training set using Naïve Bayes algorithm.

Fig. 4. Snapshot showing the mean and standard deviation values for different attributes of testing set using Naïve Bayes algorithm.

Fig. 7. Confusion matrix of red wine dataset for training set using Support Vector Machine algorithm.

Fig. 8. Confusion matrix of red wine dataset for testing set using Support Vector Machine algorithm.

Fig. 9. Confusion matrix of red wine dataset for training set using Random Forest algorithm.

Fig. 10. Confusion matrix of red wine dataset for testing set using Random Forest algorithm.
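Figs. 3 and 4 report a mean and a standard deviation for each attribute, which are the parameters a Gaussian Naïve Bayes model estimates per class. The following Python sketch illustrates that idea under stated assumptions: the paper's models were built in RStudio, so this is a substitute illustration and not the authors' code. The function names, the two-attribute toy rows and the use of the population standard deviation are hypothetical choices.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute (mean, std) for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Population variance; an implementation may divide by n - 1 instead.
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, math.sqrt(var) + 1e-9))  # small floor avoids std = 0
        model[label] = (len(members) / len(rows), stats)
    return model


def log_gaussian(x, mean, std):
    """Log of the normal density with the given mean and standard deviation."""
    return -math.log(std * math.sqrt(2 * math.pi)) - ((x - mean) ** 2) / (2 * std ** 2)


def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior) + sum(
            log_gaussian(x, m, s) for x, (m, s) in zip(row, stats)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy rows of (alcohol, volatile acidity); the values are invented for illustration.
toy_rows = [(9.4, 0.70), (9.8, 0.88), (9.5, 0.76),
            (12.5, 0.30), (12.8, 0.35), (13.0, 0.28)]
toy_labels = [5, 5, 5, 7, 7, 7]
nb_model = fit_gaussian_nb(toy_rows, toy_labels)
predicted = predict(nb_model, (12.6, 0.32))
print(predicted)
```

The "naive" part is the sum over attributes in `predict`: each attribute contributes its own likelihood term, as if the attributes were independent given the class.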

V. RESULTS AND DISCUSSIONS


The dataset contains red wine data extracted from the UCI Machine Learning Repository and is used to predict wine quality. In this research, different machine learning algorithms are executed on the dataset in RStudio, which helps to find the accuracy of the algorithms and to identify the best of them for the given dataset. During the implementation, the data is divided into a training set and a testing set with probabilities 0.7 and 0.3 respectively. The results show that the accuracies obtained on the training set and the testing set are 55.91% and 55.89% respectively using the Naïve Bayes algorithm, 67.25% and 68.64% respectively using the SVM algorithm, and 65.83% and 65.46% respectively using the Random Forest algorithm. Since the training set receives the larger share of the data (probability 0.7), the training-set accuracies are taken for comparison; they show that the Support Vector Machine algorithm has the highest accuracy, followed by the Random Forest algorithm and then the Naïve Bayes algorithm. Tables I and II show the performance measure values for the training set and the testing set using the Naïve Bayes algorithm. Tables III and IV show the performance measure values for the training set and the testing set using the Support Vector Machine algorithm. Tables V and VI show the performance measure values for the training set and the testing set using the Random Forest algorithm.

Fig. 5. Confusion matrix of red wine dataset for training set using Naïve Bayes algorithm.

Fig. 6. Confusion matrix of red wine dataset for testing set using Naïve Bayes algorithm.

TABLE I. Performance measures of training set of red wine dataset using Naïve Bayes algorithm.

TABLE II. Performance measures of testing set of red wine dataset using Naïve Bayes algorithm.

TABLE III. Performance measures of training set of red wine dataset using Support Vector Machine algorithm.

TABLE IV. Performance measures of testing set of red wine dataset using Support Vector Machine algorithm.

TABLE V. Performance measures of training set of red wine dataset using Random Forest algorithm.

TABLE VI. Performance measures of testing set of red wine dataset using Random Forest algorithm.
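As a worked check of the comparison above, the short Python sketch below takes the six accuracies reported in Section V, applies Error = 1 - Accuracy from Section III-A, and ranks the algorithms. Only the accuracy values come from the paper; the variable and function names are hypothetical, and Python stands in for the RStudio environment the paper used.

```python
# Accuracies in percent as reported in Section V: (training set, testing set).
reported = {
    "Naive Bayes": (55.91, 55.89),
    "Support Vector Machine": (67.25, 68.64),
    "Random Forest": (65.83, 65.46),
}


def misclassification_error(accuracy_percent):
    """Error = 1 - Accuracy, expressed in percent and rounded to two decimals."""
    return round(100.0 - accuracy_percent, 2)


# Misclassification error per algorithm for (training set, testing set).
errors = {
    name: (misclassification_error(train), misclassification_error(test))
    for name, (train, test) in reported.items()
}

# Rank the algorithms from highest to lowest accuracy on each set.
ranking_by_train = sorted(reported, key=lambda n: reported[n][0], reverse=True)
ranking_by_test = sorted(reported, key=lambda n: reported[n][1], reverse=True)

for name in ranking_by_train:
    train_err, test_err = errors[name]
    print(f"{name}: training error {train_err}%, testing error {test_err}%")
```

With these figures the ordering is the same whether the training-set or the testing-set accuracies are used, so the choice of the training set as the basis for comparison does not change the ranking.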
VI. CONCLUSIONS
Data mining is nowadays one of the most important techniques for the analysis of data repositories. It examines the data and produces the required output. With the advancement of technology, it supports healthy competition in the market and thus benefits the customer. Because of its ability to analyse data, it is used in this research to compute different performance measures using different algorithms. In this research, accuracy, misclassification error, precision, recall, specificity and F-measure are determined. Since the training set contains about 70% of the data from the original dataset, the comparison is based on the training-set results. These results, obtained for red wine quality prediction in RStudio, show the Support Vector Machine to be the best algorithm with an accuracy of 67.25%, followed by Random Forest with an accuracy of 65.83% and then the Naïve Bayes algorithm with an accuracy of 55.91%.
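The paper describes the division of the data "with probabilities 0.7 and 0.3" and notes that the training set holds "about 70%" of the observations. One reading consistent with that wording, offered here as an assumption and not as the authors' procedure, is that each observation is assigned to the training set independently with probability 0.7. A minimal Python sketch of such a split follows; the function name, the seed and the stand-in row indices are hypothetical.

```python
import random


def probabilistic_split(rows, train_prob=0.7, seed=42):
    """Assign each row to the training set with probability train_prob.

    The resulting sizes are only approximately train_prob and 1 - train_prob,
    because every row is assigned independently.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_prob:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Stand-in for the 1599 observations: row indices instead of real measurements.
rows = list(range(1599))
train_rows, test_rows = probabilistic_split(rows)
train_share = len(train_rows) / len(rows)
print(len(train_rows), len(test_rows), round(train_share, 3))
```

A split of this kind yields a training share close to, but rarely exactly, 70%, which matches the "about 70%" phrasing. A fixed-size split that shuffles the rows and cuts at 70% would be an equally plausible alternative.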
REFERENCES
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, “Modelling
wine preferences by data mining from physicochemical properties,”
Decision Support Systems, Elsevier, 47(4): 547-553, 2009. ISSN: 0167-9236.
[2] S. Ebeler, “Linking Flavour Chemistry to Sensory Analysis of Wine,” in
Flavor Chemistry, Thirty Years of Progress, Kluwer Academic
Publishers, 1999, pp. 409-422.
[3] V. Preedy, and M. L. R. Mendez, “Wine Applications with Electronic
Noses,” in Electronic Noses and Tongues in Food Science, Cambridge,
MA, USA: Academic Press, 2016, pp. 137-151.
[4] A. Asuncion and D. Newman (2007), UCI Machine Learning
Repository, University of California, Irvine, [Online]. Available:
http://www.ics.uci.edu/~mlearn/MLRepository.html
[5] S. Kallithraka, I. S. Arvanitoyannis, P. Kefalas, A. El-Zajouli, E.
Soufleros, and E. Psarra, “Instrumental and sensory analysis of Greek
wines; implementation of principal component analysis (PCA) for
classification according to geographical origin,” Food Chemistry, 73(4):
501-514, 2001.
[6] N. H. Beltran, M. A. Duarte-Mermoud, V. A. S. Vicencio, S. A.
Salah, and M. A. Bustos, “Chilean wine classification using volatile
organic compounds data obtained with a fast GC analyzer,” IEEE Trans.
Instrum. Meas., 57: 2421-2436, 2008.
[7] S. Shanmuganathan, P. Sallis, and A. Narayanan, “Data mining
techniques for modelling seasonal climate effects on grapevine yield and
wine quality,” IEEE International Conference on Computational
Intelligence Communication Systems and Networks, pp. 82-89, July
2010.
[8] B. Chen, C. Rhodes, A. Crawford, and L. Hambuchen,
“Wineinformatics: applying data mining on wine sensory reviews
processed by the computational wine wheel,” IEEE International
Conference on Data Mining Workshop, pp. 142-149, Dec. 2014.
[9] K. Agrawal and H. Mohan, "Cardiotocography Analysis for Fetal State
Classification Using Machine Learning Algorithms," 2019 International
Conference on Computer Communication and Informatics (ICCCI),
Coimbatore, Tamil Nadu, India, 2019, pp. 1-6.
[10] K. Agrawal and H. Mohan, "Text Analysis: Techniques, Applications
and Challenges," presented in 2019 International Conference on
Computer Communication and Informatics (ICCCI), Coimbatore, Tamil
Nadu, India, 2019.
[11] UCI Machine Learning Repository, Wine quality data set, [Online].
Available: https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
[12] J. Han, M. Kamber, and J. Pei, “Classification: Advanced Methods,” in
Data Mining Concepts and Techniques, 3rd ed., Waltham, MA, USA:
Morgan Kaufmann, 2012, pp. 393-443.
[13] W. L. Martinez and A. R. Martinez, “Supervised Learning,” in
Computational Statistics Handbook with MATLAB, 2nd ed., Boca
Raton, FL, USA: Chapman & Hall/CRC, 2007, pp. 363-431.