Data Visualization Module1
• As the world becomes more and more connected with an increasing number of electronic devices, the volume of data will
continue to grow exponentially. IDC predicts there will be 163 zettabytes (163 trillion gigabytes) of data by 2025.
• All of this data is hard for the human brain to comprehend. In fact, it is difficult for the human brain to comprehend
numbers larger than five without drawing some kind of analogy or abstraction. Data visualization designers can play a vital
role in creating those abstractions.
• After all, big data is useless if it can’t be comprehended and consumed in a useful way. That’s why data visualization plays an
important role in everything from economics to science and technology, to healthcare and human services. By turning complex
numbers and other pieces of information into graphs, content becomes easier to understand and use.
When to Use It
• Since large numbers are so difficult to comprehend in any meaningful way, and many of the most useful data sets contain
huge amounts of valuable data, data visualization has become a vital resource for decision-makers.
• To take advantage of all this data, many businesses rely on data visualizations to convey important information clearly
and efficiently, enabling decision-makers to understand difficult concepts and identify new patterns.
If you use a bar chart, here are the key design best practices:
• Use consistent colours and labelling throughout so that relationships are easier to identify
• Keep the y-axis labels short, and start the y-axis at 0 so that bar heights can be compared fairly
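The bar chart practices above can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the region names and sales figures are invented for the example, and the colour choice is arbitrary.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented purely for illustration.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 80]

fig, ax = plt.subplots()
# One series, one colour: consistent colouring makes the bars easy to compare.
bars = ax.bar(regions, sales, color="steelblue")
# Start the y-axis at 0 so that bar heights stay proportional to the values.
ax.set_ylim(bottom=0)
# Keep the axis label short and readable.
ax.set_ylabel("Sales (units)")
ax.set_title("Sales by region")
```

Call plt.show() afterwards to display the figure when running this as a script.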
When do I use a line chart visualization?
Use a line chart for the following reasons:
• to understand trends, patterns, and fluctuations in your data
• to compare different yet related data sets with multiple series
• to make projections beyond your data
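A short sketch of these three uses, under stated assumptions: the monthly figures for the two products are hypothetical, and the projection is a simple straight-line fit with np.polyfit, which is only one of many ways to extrapolate.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical monthly figures for two related series.
months = np.arange(1, 7)                          # months 1 to 6
product_a = np.array([10, 12, 15, 14, 18, 21])
product_b = np.array([8, 9, 9, 11, 12, 12])

# Fit a straight line to product A and extend it three months past the data.
slope, intercept = np.polyfit(months, product_a, deg=1)
future = np.arange(6, 10)                         # months 6 to 9
projection = slope * future + intercept

fig, ax = plt.subplots()
ax.plot(months, product_a, marker="o", label="Product A")
ax.plot(months, product_b, marker="s", label="Product B")
ax.plot(future, projection, linestyle="--", label="Product A (linear projection)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.legend()
```

The dashed line marks the projected part so that it is not mistaken for measured data.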
In a histogram, the continuous variable shown on the x-axis is broken into discrete intervals (bins), and the number of
data points that fall in each interval determines the height of its bar.
Histograms give an estimate of where values are concentrated, what the extremes are, and whether there are
any gaps or unusual values in your data set.
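The binning described above can be sketched like this. The sample is randomly generated for illustration only, and the choice of 10 bins is an assumption rather than a recommendation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: 1000 values drawn from a normal distribution.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50, scale=10, size=1000)

# np.histogram splits the value range into equal-width intervals
# and counts how many values fall into each one.
counts, edges = np.histogram(values, bins=10)

fig, ax = plt.subplots()
# Reuse the same bin edges so the bar heights match the counts above.
ax.hist(values, bins=edges, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
```

Printing counts and edges shows the numbers behind the bars: 10 counts and 11 bin edges.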
What is NumPy?
In Python, lists serve the purpose of arrays, but they are slow to process.
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
The array object in NumPy is called ndarray; it provides many supporting functions that
make working with arrays easy.
Arrays are used very frequently in data science, where speed and resources are very
important.
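As a small illustration of the ndarray object, the sketch below creates an array and applies arithmetic to every element at once. The values are hypothetical, and no timing is shown, since speed-ups depend on the data and the machine.

```python
import numpy as np

# Hypothetical values, for illustration only.
prices = np.array([10.0, 20.0, 30.0, 40.0])

print(type(prices))   # <class 'numpy.ndarray'>
print(prices.shape)   # (4,)

# Arithmetic applies to every element at once; no explicit Python loop is needed.
doubled = prices * 2

# The equivalent with a plain Python list needs a loop or comprehension.
doubled_list = [p * 2 for p in prices.tolist()]

total = prices.sum()  # one of the many supporting functions on ndarray
```

Both approaches give the same numbers; the array version hands the loop to NumPy's compiled code.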
What is Pandas?
The name "Pandas" refers to both "Panel Data" and "Python
Data Analysis".
• Pandas is an open-source library that provides high-performance data
manipulation in Python. It is built on top of the NumPy package, so NumPy is
required for Pandas to work. The name Pandas is derived from "panel data",
an econometrics term for multidimensional data sets. It is used for data analysis in
Python and was developed by Wes McKinney in 2008.
• Before Pandas, Python was capable of data preparation, but it provided only limited
support for data analysis. Pandas filled this gap and enhanced Python's data analysis
capabilities. It can perform the five significant steps required for processing and analysing
data, irrespective of the origin of the data: load, manipulate, prepare, model, and
analyze.
• Both Pandas and NumPy can be seen as essential libraries for any scientific
computation, including machine learning, due to their intuitive syntax and
high-performance matrix computation capabilities. These two libraries are also well suited to
data science applications.
Why Use Pandas?
Pandas can clean messy data sets, and make them readable and relevant.
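A minimal sketch of what cleaning a messy data set can look like. The table below is invented for illustration, and the steps chosen (removing duplicates, dropping rows with missing values, tidying text) are assumptions about what "messy" means here, not a fixed procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: stray spaces, a duplicate row, and a missing age.
raw = pd.DataFrame({
    "name": [" alice", "Bob", "Bob", "carol "],
    "age": [30, 25, 25, np.nan],
})

clean = (
    raw.drop_duplicates()                 # remove the repeated "Bob" row
       .dropna(subset=["age"])            # drop rows that have no age
       .assign(name=lambda df: df["name"].str.strip().str.title())  # tidy the text
       .reset_index(drop=True)            # renumber the remaining rows
)
print(clean)
```

Whether to drop or fill missing values depends on the data set; dropping is used here only because it is the simplest option to show.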
The subplot() function takes three arguments that describe the layout of the figure.
The layout is organized in rows and columns, which are represented by
the first and second arguments.
The third argument represents the index of the current plot.
plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.
plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
import matplotlib.pyplot as plt
import numpy as np

# Two data sets appear on the slide; here they alternate between the six panels.
x = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
y2 = np.array([10, 20, 30, 40])

# The figure has 2 rows and 3 columns; the third argument selects the panel.
plt.subplot(2, 3, 1)
plt.plot(x, y1)
plt.subplot(2, 3, 2)
plt.plot(x, y2)
plt.subplot(2, 3, 3)
plt.plot(x, y1)
plt.subplot(2, 3, 4)
plt.plot(x, y2)
plt.subplot(2, 3, 5)
plt.plot(x, y1)
plt.subplot(2, 3, 6)
plt.plot(x, y2)

plt.show()
Basic Plotting with Matplotlib