Data Visualization Module1

This document provides information about data visualization and related Python libraries NumPy and Pandas. It defines data visualization as a way to visually communicate quantitative data using visual representations like graphs, charts and plots. It discusses why data visualization is important given the large amounts of data created daily. It also provides summaries of NumPy and Pandas, describing them as Python libraries for numerical processing and data analysis respectively. NumPy works with numerical data and arrays while Pandas works with tabular data and provides tools like DataFrame for data analysis. It outlines some key differences between the two libraries in terms of usage, performance and data types handled.

Uploaded by

Diwakar RV
Copyright: © All Rights Reserved

Data Visualization

Subject Code: 20IS641A


Academic year: 2023
Semester: 6th C
Faculty Name: Anitha
Data Visualization Module 1
Data Visualization is a coherent way to visually communicate
quantitative content.
Depending on its attributes, the data may be represented in many
ways, such as a line graph, bar chart, pie chart, scatter plot, map, or
bubble graph.
Why do we use Data Visualization?
• According to IBM, 2.5 quintillion bytes of data are created every day. Research scientist Andrew McAfee and Professor Erik Brynjolfsson of MIT point out that “more data cross the internet every
second than were stored in the entire internet just 20 years ago.”

• As the world becomes more and more connected with an increasing number of electronic devices, the volume of data will
continue to grow exponentially. IDC predicts there will be 163 zettabytes (163 trillion gigabytes) of data by 2025.

• All of this data is hard for the human brain to comprehend. In fact, it is difficult for the human brain to comprehend
numbers larger than five without drawing some kind of analogy or abstraction. Data visualization designers can play a vital
role in creating those abstractions.

• After all, big data is useless if it can’t be comprehended and consumed in a useful way. That’s why data visualization plays an
important role in everything from economics to science and technology, to healthcare and human services. By turning complex
numbers and other pieces of information into graphs, content becomes easier to understand and use.
When to Use It
• Since large numbers are so difficult to comprehend in any meaningful way, and many of the most useful data sets contain
huge amounts of valuable data, data visualization has become a vital resource for decision-makers.

• To take advantage of all this data, many businesses see the value of data visualizations in the clear and efficient
comprehension of important information, enabling decision-makers to understand difficult concepts, identify new patterns,
and get data-driven insights in order to make better decisions.


  Bar charts organize data into rectangular bars that make it a breeze to compare related data sets.
When do I use a bar chart visualization?
Use a bar chart:
• to compare two or more values in the same category
• to compare parts of a whole
• when you don’t have too many groups (fewer than 10 works best)
• to understand how multiple similar data sets relate to each other

Don’t use a bar chart when:
• the category you’re visualizing has only one value associated with it
• you want to visualize continuous data

Best practices for a bar chart visualization

If you use a bar chart, here are the key design best practices:
• Use consistent colours and labelling throughout so that you can identify relationships more easily
• Keep the y-axis labels short, and start the y-axis at 0 so that bar heights stay proportional to the values they represent
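The best practices above can be sketched in Matplotlib as follows. This is a minimal illustration, not part of the original slides: the category names, the values and the output file name are made up for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical data: units sold per product category (fewer than 10 groups)
categories = ["Books", "Games", "Music", "Toys"]
units_sold = [120, 95, 60, 140]

fig, ax = plt.subplots()
bars = ax.bar(categories, units_sold, color="steelblue")  # one consistent colour
ax.set_ylim(bottom=0)             # value axis starts at 0
ax.set_xlabel("Category")
ax.set_ylabel("Units sold")       # short axis label
ax.set_title("Units sold by category")
fig.savefig("bar_chart.png")      # hypothetical file name
```

To view the chart in a window instead of saving it, call plt.show() in place of fig.savefig().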
When do I use a line chart visualization?
Use a line chart:
• to understand trends, patterns, and fluctuations in your data
• to compare different yet related data sets with multiple series
• to make projections beyond your data

Don’t use a line chart:
• to demonstrate an in-depth view of your data

Best practices for a line chart visualization

If you use a line chart, here are the key design best practices:
• Along with using a different colour for each category you’re comparing, make sure you also use solid lines to keep the
line chart clear and concise
• To avoid confusion, try not to compare more than 4 categories in one line chart
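A short Matplotlib sketch of these guidelines, assuming two hypothetical monthly series (the numbers, labels and file name are invented for illustration): each series gets its own colour, both use solid lines, and the chart stays well under four categories.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales (in thousands) for two related series
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
online = [10, 12, 15, 14, 18, 21]
in_store = [14, 13, 13, 12, 11, 10]

fig, ax = plt.subplots()
# A different colour per series, solid lines for both
ax.plot(months, online, color="tab:blue", linestyle="-", label="Online")
ax.plot(months, in_store, color="tab:orange", linestyle="-", label="In store")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (thousands)")
ax.legend()
fig.savefig("line_chart.png")     # hypothetical file name
```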
When do I use a scatter plot visualization?
Use a scatterplot:
• to show the relationship between two variables
• when you want a compact data visualization

Don’t use a scatterplot when readers need:
• to rapidly scan information
• to read clear and precise data points

Best practices for a scatter plot visualization

If you use a scatterplot, here are the key design best practices:
• Although trend lines are a great way to analyze the data on a scatterplot, ensure you stick to 1 or 2 trend
lines to avoid confusion
• Don’t forget to start at 0 for the y-axis
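As an illustration of a scatterplot with a single trend line, the sketch below fits a straight line with np.polyfit and draws it over the points. The data set (hours studied against exam score) and the file name are hypothetical, and a least-squares line is only one possible choice of trend line.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
score = np.array([52, 55, 61, 64, 70, 73, 79, 84])

# Least-squares straight line (degree-1 polynomial) used as the one trend line
slope, intercept = np.polyfit(hours, score, 1)

fig, ax = plt.subplots()
ax.scatter(hours, score, label="Observations")
ax.plot(hours, slope * hours + intercept, color="tab:red", label="Linear trend")
ax.set_ylim(bottom=0)             # y-axis starts at 0
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.legend()
fig.savefig("scatter_plot.png")   # hypothetical file name
```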
When do I use a pie chart visualization?
Use a pie chart:
• to compare relative values
• to compare parts of a whole
• to rapidly scan metrics

Don’t use a pie chart:
• to precisely compare data

Best practices for a pie chart visualization

If you use a pie chart, here are the key design best practices:
• Make sure that the pie slices add up to 100%. To make this easier, add the numerical values and percentages to
your pie chart
• Order the pieces of your pie according to size
• Use a pie chart only if you have up to 5 categories to compare. If you have too many categories, you won’t be
able to differentiate between the slices
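The pie chart guidelines can be sketched like this. The labels, shares and file name are assumptions made for the example; the slices are listed from largest to smallest, sum to 100, and are limited to five categories.

```python
import matplotlib.pyplot as plt

# Hypothetical shares, ordered by size and adding up to 100%
labels = ["Python", "R", "SQL", "Julia", "Other"]
shares = [45, 25, 15, 10, 5]

total = sum(shares)               # should be 100 for percentage data

fig, ax = plt.subplots()
# autopct prints each slice's percentage on the chart;
# startangle/counterclock lay the slices out clockwise from the top
ax.pie(shares, labels=labels, autopct="%1.0f%%", startangle=90, counterclock=False)
ax.set_title("Share of responses")
fig.savefig("pie_chart.png")      # hypothetical file name
```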
A histogram is a data visualization that shows the distribution of data over a continuous interval or a certain time
period.

It looks like a vertical bar chart, but each bar stands for an interval of a continuous variable rather than a separate category.

The continuous variable shown on the X-axis is broken into discrete intervals, and the number of data points that fall
in each interval determines the height of its bar.

Histograms give an estimate of where values are concentrated, what the extremes are, and whether there are
any gaps or unusual values in your data set.
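To make the binning step concrete, the sketch below uses np.histogram to break a small, made-up sample of exam scores into five equal-width intervals and count the values in each. The sample and the chosen range are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample: 12 exam scores
scores = np.array([52, 55, 58, 61, 64, 67, 70, 71, 73, 79, 84, 98])

# Break the range 50..100 into five equal-width intervals (bins)
counts, edges = np.histogram(scores, bins=5, range=(50, 100))

# Print a text histogram: one row per interval, one '#' per data point
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {'#' * int(n)}")
```

Here the 70-80 interval holds the most values, and the single value in 90-100 stands out as an extreme. Matplotlib's ax.hist(scores, bins=5, range=(50, 100)) draws the same bins as bars.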
What is NumPy?

NumPy is a Python library used for working with arrays.

It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely.

NumPy stands for Numerical Python.


What is NumPy?
• NumPy is written mostly in C and is an extension module for Python. It is a Python package used for
numerical computation and for processing single-dimensional and multidimensional arrays. Calculations on
NumPy arrays are faster than the same calculations on ordinary Python lists.
• Travis Oliphant created the NumPy package in 2005 by incorporating the features of the Numarray module
into its ancestor module, Numeric. NumPy can handle large amounts of data and is convenient for matrix
multiplication and data reshaping.
Why Use NumPy?

In Python we have lists that serve the purpose of arrays, but they are slow to process.

NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray. It provides a lot of supporting functions that
make working with ndarray very easy.

Arrays are very frequently used in data science, where speed and resources are very
important.
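The contrast between a Python list and an ndarray can be sketched as below: the list needs an explicit element-by-element loop, while the ndarray squares every element in one vectorized operation. The array size and repeat count are arbitrary choices for this example, and the measured timings depend on the machine, so no particular speed-up is assumed.

```python
import timeit
import numpy as np

n = 100_000
values = list(range(n))                 # plain Python list
arr = np.arange(n, dtype=np.int64)      # NumPy ndarray with one fixed element type

def square_list():
    return [v * v for v in values]      # element-by-element Python loop

def square_array():
    return arr * arr                    # one vectorized operation on the whole array

# Both approaches compute the same squares
same = square_array().tolist() == square_list()

# Time 20 repetitions of each approach; results vary by machine
t_list = timeit.timeit(square_list, number=20)
t_array = timeit.timeit(square_array, number=20)
print(f"list: {t_list:.4f}s  ndarray: {t_array:.4f}s  same result: {same}")
print(type(arr).__name__, arr.dtype, arr.shape)
```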
What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data" and "Python
Data Analysis".
What is Pandas?
• Pandas is an open-source library that provides high-performance data manipulation in Python. It is built
on top of the NumPy package, so NumPy is required for Pandas to operate. The name Pandas is derived from
"panel data", an econometrics term for multidimensional data sets. It is used for data analysis in
Python and was developed by Wes McKinney in 2008.
• Before Pandas, Python was capable of data preparation but provided only limited
support for data analysis. Pandas filled that gap and enhanced the data analysis capabilities of Python.
It can perform the five significant steps required for processing and analyzing data,
irrespective of the origin of the data: load, manipulate, prepare, model, and analyze.

• Both Pandas and NumPy can be seen as essential libraries for scientific
computation, including machine learning, thanks to their intuitive syntax and
high-performance matrix computation capabilities. These two libraries are also well suited to
data science applications.
Why Use Pandas?

Pandas allows us to analyze big data and make conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.
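As a small illustration of cleaning a messy data set, the sketch below removes a duplicate row, drops a row with a missing value, and converts a text column to numbers. The table contents and column names are hypothetical; real data usually needs cleaning steps chosen for the problem at hand.

```python
import pandas as pd

# Hypothetical messy data: a duplicate row, a missing age, and ages stored as text
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": ["34", "41", "41", None, "29"],
    "city": ["Pune", "Delhi", "Delhi", "Goa", "Pune"],
})

clean = (
    df.drop_duplicates()                                   # remove the repeated "Bob" row
      .dropna(subset=["age"])                              # drop rows with no age
      .assign(age=lambda d: pd.to_numeric(d["age"]))       # text -> numbers
      .reset_index(drop=True)
)

print(clean)
print("Mean age:", clean["age"].mean())
```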


Difference between Pandas and NumPy:
Some differences between Pandas and NumPy are listed below:
• Pandas mainly works with tabular data, whereas NumPy works with numerical
data.
• Pandas provides powerful tools such as DataFrame and Series, which are mainly used for
analyzing data, whereas NumPy offers a powerful object called the array (ndarray).
• Instacart, SendGrid, and Sighten are some of the well-known companies that use Pandas,
whereas NumPy is used by SweepSouth.
• Pandas covers the broader range of applications: it is mentioned in 73 company stacks
and 46 developer stacks, whereas NumPy is mentioned in 62 company stacks and 32 developer stacks.
• NumPy performs better for 50K rows or fewer.
• Pandas performs better than NumPy for 500K rows or more. Between 50K and 500K rows, performance depends on
the kind of operation.
• NumPy provides objects for multi-dimensional arrays, whereas Pandas offers an
in-memory 2D table object called DataFrame.
• NumPy consumes less memory than Pandas.
• Indexing Series objects is quite slow compared to indexing NumPy arrays.
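The difference between the two core objects can be sketched side by side: an ndarray holds values of one type addressed by position, while a DataFrame is a labelled table whose columns may have different types. The city names and temperatures are invented for the example.

```python
import numpy as np
import pandas as pd

# ndarray: a single element type, selected by integer position
arr = np.array([[1.5, 2.0], [3.5, 4.0], [5.5, 6.0]])
print(arr.dtype, arr[1, 0])             # row 1, column 0

# DataFrame: labelled rows and columns, each column with its own dtype
df = pd.DataFrame(
    {"city": ["Pune", "Goa", "Delhi"], "temp_c": [31.5, 29.0, 35.5]},
    index=["a", "b", "c"],
)
print(df.dtypes)
print(df.loc["b", "temp_c"])            # select by row and column label

# A column (Series) can hand its values back as a NumPy array
temps = df["temp_c"].to_numpy()
print(type(temps).__name__, temps.mean())
```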
Other – data visualization codes with output
The subplot() Function

The subplot() function takes three arguments that describe the layout of the figure.
The layout is organized in rows and columns, which are represented by
the first and second arguments.
The third argument represents the index of the current plot.

plt.subplot(1, 2, 1)
#the figure has 1 row, 2 columns, and this plot is the first plot.

plt.subplot(1, 2, 2)
#the figure has 1 row, 2 columns, and this plot is the second plot.
import matplotlib.pyplot as plt
import numpy as np

#the figure has 2 rows and 3 columns; the six plots fill positions 1 to 6
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 1)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 2)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 3)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 4)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])

plt.subplot(2, 3, 5)
plt.plot(x,y)

x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])

plt.subplot(2, 3, 6)
plt.plot(x,y)

plt.show()
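Matplotlib also offers a separate function, plt.subplots() (with an "s"), which creates the figure and the whole grid of axes in one call. The sketch below uses it to rebuild a 2 by 3 grid from the same two data sets; the loop, the plot titles and the output file name are choices made for this illustration rather than part of the original slides.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3])
datasets = [np.array([3, 8, 1, 10]), np.array([10, 20, 30, 40])]

# One call creates the figure and a 2x3 array of Axes objects
fig, axes = plt.subplots(2, 3)

# axes.flat walks the grid row by row; alternate between the two data sets
for i, ax in enumerate(axes.flat):
    ax.plot(x, datasets[i % 2])
    ax.set_title(f"Plot {i + 1}")

fig.tight_layout()                  # keep titles and tick labels from overlapping
fig.savefig("subplots_grid.png")    # hypothetical file name
```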
Basic Plotting with Matplotlib
