EDA Mini Project Report
SUBMITTED BY
Name Roll No
1. Aum Patil 2361004
2. Abhijeet Bagal 2361006
The students named above are bonafide students of this institute. The work has
been carried out by them under the supervision of Ms. Madhuri B. Thorat and is
approved in partial fulfillment of the requirements of the Department of
Artificial Intelligence and Data Science, AISSMS IOIT, Pune.
1. Abstract
2. Acknowledgement
3. Introduction
5. Proposed System (System Workflow and Workflow Diagram)
6. System Requirements
8. Future Scope
9. Conclusion
10. References
ABSTRACT
The quality of red wine is a critical factor in consumer satisfaction and market
value. This project presents a comprehensive Exploratory Data Analysis (EDA)
approach to assess the factors influencing the quality of red wine. Using a
synthetically generated dataset of 5,000 records simulating real-world wine
attributes, the project analyzes key physicochemical properties such as alcohol
content, volatile acidity, citric acid, sulphates, and pH levels. Python libraries
including Pandas, Seaborn, Matplotlib, and NumPy were used to process,
visualize, and interpret the data. The EDA reveals significant correlations
between certain features and wine quality: most notably, alcohol and sulphates
show a strong positive correlation with quality, while volatile acidity shows a
negative correlation.
A variety of univariate, bivariate, and multivariate plots were used to uncover
data patterns, detect outliers, and identify quality-driving parameters. The
insights gained from this analysis not only demonstrate the power of EDA in
quality assurance but also provide a foundation for predictive modeling and
further optimization in wine production.
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to all those who guided and
supported me throughout the completion of this project titled “Red Wine
Quality Assurance using EDA in Python.” First and foremost, I would like to
thank my mentor and faculty for their valuable insights, encouragement, and
constructive feedback during each phase of this project. Their continuous
guidance helped me refine my understanding of data analysis and apply it
effectively in a practical setting. I am also thankful to my peers and friends for
their collaborative discussions and support, which contributed to improving the
quality of this work. A special thanks to the open-source Python community and
the developers of the libraries such as Pandas, Matplotlib, Seaborn, and
NumPy, without which this project would not have been possible. Lastly, I
would like to thank my family for their unwavering support and motivation
throughout the duration of this project. This project has been a great learning
experience and has strengthened my interest and skills in data science and
analytics.
INTRODUCTION
This project, titled "Red Wine Quality Assurance using Exploratory Data
Analysis (EDA) in Python," aims to analyze and understand the relationship
between different measurable features of red wine and their impact on its overall
quality rating. The dataset used in this study is synthetically generated to
resemble real-world red wine data, ensuring scalability and reliability for
analysis.
EDA plays a critical role in data science by helping analysts uncover hidden
patterns, detect anomalies, check assumptions, and build a better understanding
of the data before applying advanced models. By employing visualization and
statistical techniques, this project identifies the most influential factors affecting
wine quality and provides meaningful insights for quality control and decision-
making.
Problem Statement
To develop a data-driven solution that analyses the physicochemical properties of
red wine to identify the most influential factors affecting its quality. To
implement an end-to-end analytical model that enables wine producers to
interpret key variables such as acidity, alcohol content, and sulphates, thereby
supporting consistent quality assurance and informed decision-making in the
wine production process.
Objectives
The primary objectives of this project are as follows:
1. To perform Exploratory Data Analysis (EDA) on red wine data to
understand distribution, variance, and relationships between features.
2. To identify the key physicochemical factors that significantly influence
the quality of red wine.
3. To visualize the data using statistical plots and correlation heatmaps for
better interpretation.
4. To uncover hidden patterns and insights that can help in quality
assurance and optimization of wine production.
Dataset
• Name: Synthetic Red Wine Quality Dataset
• Size: 5,000 records × 12 features
• File Format: CSV (red_wine_quality_large.csv)
• Data Source: Custom-generated using Python's numpy and pandas
libraries based on statistical distributions of real-world wine features.
PROPOSED SYSTEM
The proposed system is designed to analyze red wine quality based on various
physicochemical parameters using Exploratory Data Analysis (EDA). The
system aims to discover meaningful patterns and relationships in the dataset to
assist wine producers in understanding and improving wine quality.
The system workflow is divided into the following key stages:
1. Dataset Generation
• A synthetic dataset of 5,000 entries is generated using Python, simulating
real-world red wine characteristics.
• Features such as acidity, alcohol content, sugar, sulphates, and quality
score are generated using statistically relevant distributions.
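The generation step can be sketched as follows. This is a minimal illustration and not the script used for the report: the column names, the distribution parameters, and the rule that links quality to alcohol, sulphates, and volatile acidity are all assumptions, and only 7 of the 12 features are produced.

```python
import numpy as np
import pandas as pd


def make_synthetic_wine(n_rows: int = 5000, seed: int = 42) -> pd.DataFrame:
    """Generate a toy red-wine table. All parameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    # Hypothetical means and spreads, clipped to keep values plausible.
    df = pd.DataFrame({
        "alcohol": rng.normal(10.4, 1.1, n_rows).clip(8.0, 15.0),
        "volatile_acidity": rng.normal(0.53, 0.18, n_rows).clip(0.1, 1.6),
        "citric_acid": rng.normal(0.27, 0.19, n_rows).clip(0.0, 1.0),
        "residual_sugar": rng.lognormal(0.8, 0.4, n_rows),
        "sulphates": rng.normal(0.66, 0.17, n_rows).clip(0.3, 2.0),
        "pH": rng.normal(3.31, 0.15, n_rows).clip(2.7, 4.0),
    })

    # Assumed scoring rule: alcohol and sulphates raise the score,
    # volatile acidity lowers it, and Gaussian noise is added on top.
    score = (
        5.6
        + 0.35 * (df["alcohol"] - 10.4)
        + 0.9 * (df["sulphates"] - 0.66)
        - 1.2 * (df["volatile_acidity"] - 0.53)
        + rng.normal(0.0, 0.6, n_rows)
    )
    # Round to an integer rating and keep it within an assumed 3 to 8 range.
    df["quality"] = score.round().clip(3, 8).astype(int)
    return df


wine = make_synthetic_wine()
# To persist the table under the file name used in this report:
# wine.to_csv("red_wine_quality_large.csv", index=False)
```

Fixing the seed keeps the generated table reproducible between runs, which makes the later plots and statistics comparable.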
3. Data Preprocessing
• The dataset is cleaned by checking for missing/null values.
• Descriptive statistics such as mean, median, and standard deviation are computed.
• Data types are validated and converted if necessary for analysis.
• Outliers are detected using boxplots.
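The preprocessing checks above might look like the sketch below. The tiny table and the helper names `coerce_numeric` and `iqr_outlier_mask` are hypothetical. The 1.5 × IQR rule is used here as the numeric counterpart of the whiskers a boxplot draws by default; the report itself only states that boxplots were used.

```python
import pandas as pd


def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every column to a numeric dtype; unparseable entries become NaN."""
    return df.apply(pd.to_numeric, errors="coerce")


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Hypothetical sample: alcohol stored as text, one missing pH, one extreme value.
raw = pd.DataFrame({
    "alcohol": ["9.4", "9.8", "10.1", "10.5", "11.0", "25.0"],
    "pH": [3.51, 3.20, None, 3.16, 3.30, 3.28],
})

clean = coerce_numeric(raw)                        # validate and convert data types
null_counts = clean.isna().sum()                   # missing values per column
stats = clean.agg(["mean", "median", "std"]).T     # descriptive statistics
outliers = clean[iqr_outlier_mask(clean["alcohol"])]  # rows flagged as outliers

print(null_counts, stats, outliers, sep="\n\n")
```

Whether flagged rows are dropped, capped, or kept is a separate decision; the sketch only reports them.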
4. Exploratory Data Analysis (EDA)
• Univariate Analysis: Distribution of individual features using histograms
and count plots.
• Bivariate Analysis: Pairwise relationships using scatterplots, boxplots,
and violin plots.
• Multivariate Analysis: Interaction between multiple features using pair
plots and heatmaps.
• Correlation Matrix: Heatmap to highlight relationships between features
and wine quality.
5. Insight Extraction
• Patterns such as “higher alcohol content correlates positively with
quality” are derived.
• Features that negatively impact quality, such as high volatile acidity, are
identified.
• Recommendations for improving wine quality are formed based on
findings.
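One way to turn the correlation matrix into a ranked list of quality drivers is sketched below. The helper `rank_quality_drivers` and the inline data are hypothetical, and a correlation ranking shows association only, not cause.

```python
import numpy as np
import pandas as pd


def rank_quality_drivers(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Return feature-to-target correlations, strongest absolute value first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Stand-in data with assumed effects: alcohol helps, volatile acidity hurts,
# and pH is generated independently of quality.
rng = np.random.default_rng(1)
n = 2000
alcohol = rng.normal(10.4, 1.1, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.31, 0.15, n)
score = 5.6 + 0.4 * (alcohol - 10.4) - 1.2 * (volatile_acidity - 0.53)
quality = np.clip(np.round(score + rng.normal(0, 0.6, n)), 3, 8).astype(int)

df = pd.DataFrame({
    "alcohol": alcohol,
    "volatile_acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

ranked = rank_quality_drivers(df)
print(ranked)

# Mean feature value per quality rating, useful for phrasing recommendations.
print(df.groupby("quality")[["alcohol", "volatile_acidity"]].mean())
```

The sign of each entry indicates the direction of the relationship, and the per-rating means give a concrete sense of how far apart low-rated and high-rated wines sit on each feature.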
SYSTEM REQUIREMENTS
To successfully implement the Red Wine Quality Assurance EDA project using
Python, a system with moderate specifications is sufficient.
Hardware Requirements
• A computer with an Intel i3 or AMD Ryzen processor (dual-core or
higher).
• Minimum 4 GB RAM (Recommended: 8 GB for better performance).
• At least 500 MB of free disk space for storing project files and the
dataset.
• Display of 13 inches or more with a resolution of 1366×768 or higher.
• Windows 10, macOS, or Linux operating system.
Software Requirements
• Python version 3.8 or higher.
• Jupyter Notebook, or an IDE like VS Code or PyCharm for writing and
running the code.
• Pandas library for data manipulation.
• NumPy library for numerical operations.
• Matplotlib for basic plotting and visualization.
• Seaborn for statistical visualizations such as boxplots and heatmaps.
• CSV file named red_wine_quality_large.csv for input data.
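The software requirements can be verified before running the analysis with a short check like the one below. The helper `check_environment` is a hypothetical name, and the check only confirms that each package can be found, not that a particular version is installed.

```python
import sys
from importlib.util import find_spec

# Packages named in the software requirements list.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def check_environment(min_python=(3, 8)):
    """Return a list of problems; an empty list means the setup looks fine."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    for name in REQUIRED:
        # find_spec locates a package without importing it.
        if find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


issues = check_environment()
print(issues if issues else "Environment looks fine")
```

Any missing package can then be installed with `pip install pandas numpy matplotlib seaborn`.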
Optional Tools
• Google Colab for running the project in the cloud without installation.
• Streamlit for converting the EDA into an interactive dashboard (for
future scope).
• Git / GitHub for version control and sharing the project online.
IMPLEMENTATION