
Exploring Data Distribution
Introduction
The distribution of data gives us useful insights while working on any data science or machine learning use case. Data distribution describes how the values in the data are spread: where they concentrate (central tendency), how widely they vary, how specific parts of the data behave, and whether any outliers are present.
To explore the data distribution, there are popular graphical methods that prove beneficial while working with the data. In this article, let us explore these methods.
Know more about your data: The Graphical Way
Histograms & KDE Density Plots
Histograms are the most popular and common data exploration tool among graphical methods. In a histogram, rectangular bars represent the frequency of a variable within each category or bin. Binning groups the values of a continuous variable into a set of intervals (buckets), and the height of each bar is the number of observations that fall into its interval.
Let us understand the histogram using the below code example on the house pricing dataset.
In this code example, the left plot is a histogram of SalePrice against frequency, and the right plot is a density plot of SalePrice with a KDE (kernel density estimate) curve. The KDE curve is a smoothed estimate of the probability density function of the data, so it traces the same shape as the histogram but on a density scale rather than a count scale.
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    # Load the house price data
    df = pd.read_csv("/content/house_price_data.csv")

    # Two side-by-side plots sharing the SalePrice axis
    figure, ax = plt.subplots(1, 2, sharex=True, figsize=(12, 6))

    # Left: histogram of raw counts
    sns.histplot(data=df, x="SalePrice", ax=ax[0])
    ax[0].set_ylabel("Frequency")
    ax[0].set_xlabel("SalePrice")
    ax[0].set_title("Frequency (Histogram)")

    # Right: density-scaled histogram with a KDE curve
    # (sns.distplot is deprecated; histplot with kde=True replaces it)
    sns.histplot(data=df, x="SalePrice", kde=True, stat="density", ax=ax[1])
    ax[1].set_ylabel("Density")
    ax[1].set_xlabel("SalePrice")
    ax[1].set_title("Density (KDE)")
Output
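To make the idea of a density curve more concrete, the following is a minimal sketch of a Gaussian kernel density estimate written by hand with only the standard library. The function name `kde_at`, the sample prices, and the bandwidth are all assumptions made for illustration; seaborn chooses a bandwidth automatically and its internals differ from this sketch.

```python
import math

def kde_at(x, samples, bandwidth):
    """Gaussian kernel density estimate at point x (hypothetical helper)."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    # Each sample contributes a small bell curve centred on itself.
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

# Made-up prices (in thousands), not taken from the house price dataset.
prices = [120, 135, 150, 155, 160, 210, 400]
bandwidth = 20  # picked by hand for illustration

# Evaluate the estimate on an evenly spaced grid.
step = 5
grid = list(range(0, 601, step))
density = [kde_at(x, prices, bandwidth) for x in grid]

# A density should integrate to roughly 1 (simple Riemann sum).
area = sum(d * step for d in density)
print(round(area, 3))
```

The estimate is highest where the samples cluster (around 150 here) and close to zero in the gap between 210 and 400, which is the shape the KDE curve draws on top of a histogram.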
In the below code example, we set the number of bins explicitly. We have used the penguins dataset to plot bill depth against count. Here bill depth is split into 15 bins and plotted on the x axis, with the count (frequency) of penguins in each bin on the y axis.
    # Using bins on the penguins dataset (bundled with seaborn)
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    data_pen = sns.load_dataset("penguins")

    # Split bill depth into 15 equal-width bins and count the rows in each
    sns.histplot(data=data_pen, x="bill_depth_mm", bins=15)
Output
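What the bins parameter does can be sketched without any plotting library. The example below is an illustration only: `bin_counts` is a hypothetical helper, and the bill depths are made-up values rather than rows from the penguins dataset. In practice `sns.histplot` or `numpy.histogram` performs this counting for you.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # Clamp so the maximum value is counted in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

# Made-up bill depths in millimetres, for illustration only.
depths = [13.1, 14.5, 15.0, 16.2, 17.8, 18.3, 18.7, 19.4, 20.0, 21.1]
counts, edges = bin_counts(depths, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

Each printed row corresponds to one bar of the histogram: the interval gives the bar's position on the x axis and the count gives its height.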
Boxplots
Boxplots are also known as box-and-whisker plots. A boxplot summarizes the data through its percentiles, chiefly the 25th, 50th, and 75th. The 50th percentile is the median. The box spans the 25th to the 75th percentile, a range known as the IQR (Interquartile Range), and points that lie far outside the box are drawn individually as possible outliers.
Let us understand the boxplot using the below code example on the house pricing dataset.
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    df = pd.read_csv("/content/house_price_data.csv")

    # One box of SalePrice per OverallQual rating
    figure = sns.boxplot(x="OverallQual", y="SalePrice", data=df)
Output
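The numbers that a boxplot draws can also be computed directly. The sketch below uses the standard library's `statistics.quantiles` on made-up prices (not the house price dataset) and applies the common 1.5 x IQR convention for flagging outliers, which is also the default whisker length in seaborn and matplotlib. The variable names are assumptions chosen for this illustration.

```python
import statistics

# Made-up sale prices (in thousands), for illustration only.
prices = [100, 120, 130, 150, 160, 180, 200, 220, 250, 600]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Fences under the 1.5 x IQR convention; values beyond them are
# drawn as individual outlier points on a boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(q1, median, q3, iqr)  # 135.0 170.0 215.0 80.0
print(outliers)             # [600]
```

Here the box would run from 135 to 215 with the median line at 170, and the value 600 would appear as a separate point above the upper whisker.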
Violin Plot
A violin plot looks similar to a boxplot; however, it also shows the estimated probability density of the variable as the width of each shape. It is used to compare the distributions of a variable across the groups under observation.
Let us understand the violin plot using the below code example on the house pricing dataset.
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    df = pd.read_csv("/content/house_price_data.csv")

    # One violin of SalePrice per MSSubClass category
    figure = sns.violinplot(x="MSSubClass", y="SalePrice", data=df)
Output
Conclusion
Histograms, density plots, boxplots, and violin plots are among the most popular and common methods to explore data distributions, and they are widely used by Machine Learning Engineers and Data Scientists. These plots give us a sense of how the data is distributed. Basic information regarding skewness, sparsity, etc. can also be determined from them. Plots like boxplots and violin plots can also indicate outlier points.