
Unit III Notes (First Half) 1

Effective data visualization is crucial in data science for exploratory data analysis, error detection, and communication of insights. It involves using various tools and techniques to analyze and present data clearly, ensuring that visualizations are aesthetically pleasing and informative. Key principles include maximizing data-ink ratio, minimizing misleading elements, and selecting appropriate chart types based on the data's purpose.

Uploaded by

ragulrr.22msc

Visualizing Data

Effective data visualization is an important aspect of data science, for at least
three distinct reasons:
 Exploratory data analysis: What does your data really look like? Plots and
visualizations are the best way I know of to do this.
 Error detection: Feeding unvisualized data to any machine learning
algorithm is asking for trouble. Problems with outlier points, insufficient
cleaning, and erroneous assumptions reveal themselves immediately when
properly visualizing your data.
 Communication: Can you present what you have learned effectively to
others? Meaningful results become actionable only after they are shared.

1. Exploratory Data Analysis


Exploratory Data Analysis (EDA) is the process in data science of analyzing and
visualizing data to understand its main characteristics, discover patterns and trends
(how the data behaves, repeats, or changes over time), detect outliers or errors,
test assumptions, and generate insights before applying formal modeling or
machine learning techniques.
i. Facing a New Data Set: What should you do when encountering a new data set?
Answer the basic questions:
 Who constructed this data set, when, and why? Understanding how your
data was obtained provides clues as to how relevant it is likely to be, and
whether we should trust it.
 How big is it? How rich is the data set in terms of the number of fields or
columns? How large is it as measured by the number of records or rows?
 What do the fields mean? Walk through each of the columns in your data
set, and be sure you understand what they are. Which fields are numerical
or categorical?

Look for Familiar or Interpretable Records
 Get to know a few data records closely (persons, places, or things).
 This helps you understand whether the data makes sense and spot errors.
 If familiar records don’t exist, focus on special cases like maximum or
minimum values.
Summary Statistics
 Check basic statistics (summary information) for each column to understand
the spread and center of the data.
 For numerical data, use minimum, maximum, median, and quartiles.
 For categorical data, count unique categories and find the most frequent
ones.
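The summary statistics listed above can be sketched in a few lines of Python. This is only an illustration: the `ages` and `cities` columns and the helper names `numeric_summary` and `categorical_summary` are made up for this example and are not part of the notes.

```python
import statistics
from collections import Counter

# Hypothetical sample columns, invented for illustration only.
ages = [23, 35, 31, 47, 29, 52, 38, 41]
cities = ["Chennai", "Madurai", "Chennai", "Salem",
          "Chennai", "Madurai", "Salem", "Chennai"]

def numeric_summary(values):
    """Return min, quartiles, median, and max of a numerical column."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
    }

def categorical_summary(values, top=1):
    """Return the number of unique categories and the most frequent ones."""
    counts = Counter(values)
    return {"unique": len(counts), "most_common": counts.most_common(top)}

print(numeric_summary(ages))
print(categorical_summary(cities))
```

Looking at the minimum and maximum first is often the quickest way to spot impossible values, such as a negative age.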
Pairwise Correlations
 Analyze correlations between variables to see how they are related.
 Good features should correlate strongly with the target but not too much
with each other.
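As a rough sketch of the pairwise-correlation check, the Pearson correlation coefficient can be computed by hand as below. The column names and values are hypothetical; in practice a data-analysis library would normally do this for every pair of columns at once.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns, invented for illustration only.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "practice_tests": [2, 4, 6, 8, 10],  # exactly 2 * hours_studied
    "score": [52, 58, 65, 70, 78],       # the target
}

# Each feature against the target: strong correlation is desirable here.
for name in ("hours_studied", "practice_tests"):
    print(name, "vs score:", round(pearson(columns[name], columns["score"]), 3))

# Feature against feature: a value near 1 means the two are redundant.
print("hours_studied vs practice_tests:",
      round(pearson(columns["hours_studied"], columns["practice_tests"]), 3))
```

In this made-up data both features track the score closely, but they are perfectly correlated with each other, so keeping both adds little.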
Class Breakdowns
 Break data by major categories (like gender or location).
 Compare distributions to see if meaningful differences exist between groups.
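A class breakdown can be sketched as a simple group-and-summarize step. The records below (a location and a score) and the function name `breakdown` are assumptions made for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical (location, score) records, invented for illustration only.
records = [
    ("urban", 72), ("rural", 65), ("urban", 80),
    ("rural", 58), ("urban", 76), ("rural", 61),
]

def breakdown(rows):
    """Group values by category and summarize each group."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for category, values in groups.items()
    }

summary = breakdown(records)
for category, stats in summary.items():
    print(category, stats)
```

If the group summaries differ clearly, the category is probably worth keeping as a feature or plotting separately.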
Plots of Distributions
 Use graphs to visually inspect data distributions.
 Look for patterns, trends, and outliers, and decide if data cleaning or
transformation is needed.
EDA helps you understand your data using records, statistics, correlations,
categories, and visualizations before modeling.
2. Visualization Tools
Visualization tools help us understand data using graphs and charts. The choice
of tool depends on the purpose of visualization. Visualization tasks are mainly
grouped into three categories:

1. Exploratory Data Analysis (EDA)

 Used for quick and interactive exploration of data


 Helps to find patterns, trends, and errors
 Tools: Excel, Python (IPython/Jupyter), R, Mathematica
 These tools hide complexity and provide reasonable default plots, which
can be customized if needed

2. Publication / Presentation-Quality Charts

 Used to create high-quality, clear, and attractive graphs

 Focuses on accurate and informative presentation

 Tools: Matplotlib, Gnuplot, R visualization libraries

 Allows full control over design, style, and layout

3. Interactive Visualization for External Applications

 Used to build interactive dashboards for users

 Helps non-technical users explore data easily

 Tools: Python dashboards, Tableau

 Supports interaction like filtering, zooming, and linked views


Visualization tools help explore data, present insights clearly, and build interactive
systems depending on the goal.
3. Developing a Visualization Aesthetic

Developing a visualization aesthetic means choosing and designing charts in a
clean, clear, and visually pleasing way so that people can easily understand the data.

📌 In simple words:
👉 Make your graphs look neat, meaningful, and easy to read.

This includes:

 Using the right type of chart

 Choosing clear colors and labels

 Avoiding clutter and confusion

 Highlighting important patterns or trends


The visual aesthetic and vocabulary are largely derived from the books of Edward
Tufte [Tuf83, Tuf90, Tuf97]. Tufte is an artist who has thought long and hard about
what makes a chart or graph informative and beautiful, basing a design aesthetic on
the following principles:

 Maximize data-ink ratio: Your visualization is supposed to show off your data.
So why is so much of what you see in charts the background grids, shading, and
tick marks?
 Minimize the lie factor: As a scientist, your data should reveal the truth,
ideally the truth you want to see revealed.
 Minimize chart junk: Modern visualization software often adds cool visual
effects that have little to do with your data set. Is your graphic interesting because
of your data, or in spite of it? For example, 3D effects, shadows, and animation are
rarely necessary.
 Use proper scales and clear labeling: Accurate interpretation of data depends
upon non-data elements like scale and labeling.
 Make effective use of color: The human eye has the power to discriminate
between small gradations in hue and color saturation. Are you using color to
highlight important properties of your data, or just to make an artistic statement?
 Exploit the power of repetition: Arrays of similar graphics with different but
related data elements provide a concise and powerful way to enable visual
comparisons.
i. Maximizing Data-Ink Ratio: In any graphic, some of the ink is used to
represent the actual underlying data, while the rest is employed on graphic effects.

Data-Ink Ratio = (Ink used to show data) / (Total ink used in the graphic)

Goal: 👉 Maximize the data-ink ratio (remove unnecessary lines, 3D effects,
and decorations).

ii. Minimizing the Lie Factor: A visualization seeks to tell a true story about what
the data is saying. Changing the data itself is an obvious lie, but even correct data
can be shown in a misleading way using bad charts. Tufte's lie factor measures this
distortion:

Lie Factor = (Size of effect shown in the graphic) / (Size of effect in the data)

A good chart keeps this value as close to 1 as possible.

Example: A company’s sales increased from 100 units to 110 units.

🔹 Actual effect in the data

Actual increase = 110 − 100 = 10 units
Percentage increase = 10%

🔹 Effect shown in the graphic

An honest bar chart draws the second bar 10% taller than the first:

Lie Factor = 10% / 10% = 1

Now suppose a bar chart shows the sales bar doubling in height, making it look
like sales increased by 100% (wrong):

Lie Factor = 100% / 10% = 10
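The sales example can be checked numerically using Tufte's definition of the lie factor (size of the effect shown in the graphic divided by size of the effect in the data). The bar heights below are hypothetical drawing sizes chosen for illustration, and the function names are not from the notes.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (0.10 means +10%)."""
    return (after - before) / before

def lie_factor(data_before, data_after, shown_before, shown_after):
    """Effect shown in the graphic divided by the effect in the data."""
    shown_effect = relative_change(shown_before, shown_after)
    data_effect = relative_change(data_before, data_after)
    return shown_effect / data_effect

# Sales rise from 100 to 110 units (a 10% increase in the data).
# Bar heights are hypothetical, in centimetres on the page.
honest = lie_factor(100, 110, shown_before=5.0, shown_after=5.5)       # bar grows 10%
misleading = lie_factor(100, 110, shown_before=5.0, shown_after=10.0)  # bar doubles

print("honest chart:", round(honest, 2))
print("misleading chart:", round(misleading, 2))
```

A bar that doubles in height for a 10% rise gives a lie factor of about 10, far from the ideal value of 1.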

Bad practices include:

 Presenting means without variance: The data values {100, 100, 100, 100,
100} and {200, 0, 100, 200, 0} tell different stories, even though both means
are 100.
 Presenting interpolations without the actual data: Regression lines and fitted
curves are effective at communicating trends and simplifying large data sets,
but without the underlying data points the viewer cannot judge how well
the curve fits.
 Distortions of scale: This happens when the shape or size of a chart (its
width vs. height, called the aspect ratio) changes how we see the data.
Even if the numbers are the same, stretching or squashing the chart can
make trends look bigger, smaller, faster, or slower than they really are.
 Eliminating tick labels from numerical axes: Tick labels are the numbers
along the axes (like 0, 10, 20…). If you hide these numbers, people
cannot tell the exact values of the data from the chart.
 Hiding the origin (zero point): Readers usually assume the y-axis starts at 0.
If the y-axis starts at some higher value instead, the biggest number looks
much bigger relative to the smallest than it really is.
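Two of these bad practices are easy to demonstrate with numbers. The sketch below first shows that the two data sets from the "means without variance" item share a mean but not a spread, and then shows how a y-axis that does not start at zero changes the apparent ratio of two bars. The `apparent_ratio` helper and the baseline of 90 are assumptions chosen for illustration.

```python
import statistics

steady = [100, 100, 100, 100, 100]
volatile = [200, 0, 100, 200, 0]

# Same mean, very different spread.
for name, values in (("steady", steady), ("volatile", volatile)):
    print(name,
          "mean =", statistics.mean(values),
          "std dev =", round(statistics.pstdev(values), 2))

def apparent_ratio(small, large, baseline=0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    return (large - baseline) / (small - baseline)

# Sales of 100 and 110 units drawn with and without a hidden origin.
print("y-axis from 0: ", apparent_ratio(100, 110))
print("y-axis from 90:", apparent_ratio(100, 110, baseline=90))
```

With the axis starting at 90, the second bar is drawn twice as tall as the first even though the value grew by only 10%.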
Chart Types

Step 1: Decide the purpose. Chart choice branches into three main purposes:

1. Distribution → Showing how values are spread.

2. Relationship → Showing how two or more variables relate.

3. Comparison → Showing differences among groups or over time.

Step 2: Distribution

 Goal: See how data values are spread.

 Chart options:

1. Histogram – Shows frequency of values.

2. Boxplot – Shows median, quartiles, and outliers.

3. Density plot / Line plot – Smooth curve showing distribution.

4. Dot plot / Strip chart – Each value is a dot.

Step 3: Relationship

 Goal: See how two variables are connected.

 Chart options:

1. Scatter plot – Shows correlation between variables.

2. Line plot – Shows trend over continuous variable (like time).

Step 4: Comparison

 Goal: Show differences across categories or time.

 Chart options:
o For time series / ordered data:

 Line chart

o For categorical comparison:

 Bar chart (vertical or horizontal)

o For parts of a whole:

 Pie chart

In summary:

 Decide your goal (distribution, relationship, comparison).

 Look at your data type (categorical, numerical, time series).

 Pick the chart that best communicates your message.
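The chart-selection procedure above can be written down as a small lookup. The chart names come from the notes; the dictionary layout and the function name `suggest_charts` are assumptions made for this sketch, and a real choice still depends on the message you want to communicate.

```python
# Chart options listed in the notes, keyed by the goal of the visualization.
CHART_OPTIONS = {
    "distribution": ["histogram", "boxplot", "density plot", "dot plot / strip chart"],
    "relationship": ["scatter plot", "line plot"],
    "comparison": ["line chart", "bar chart", "pie chart"],
}

# For comparisons, the notes pick the chart from the kind of data compared.
COMPARISON_BY_DATA = {
    "time series": "line chart",
    "categorical": "bar chart",
    "parts of a whole": "pie chart",
}

def suggest_charts(goal, data_kind=None):
    """Return candidate chart types for a goal, narrowed by data kind if known."""
    if goal not in CHART_OPTIONS:
        raise ValueError(f"unknown goal: {goal!r}")
    if goal == "comparison" and data_kind in COMPARISON_BY_DATA:
        return [COMPARISON_BY_DATA[data_kind]]
    return CHART_OPTIONS[goal]

print(suggest_charts("distribution"))
print(suggest_charts("comparison", "categorical"))
```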

Chart Types - Primary types of data visualizations.

6.3.1 Tabular Data - Tables of numbers can be beautiful things,
and are very effective ways to present data.

Although they may appear to lack the visual appeal of graphic
presentations, tables have several advantages over other
representations, including:

1. Representation of precision: The resolution of a number tells
you something about the process of how it was obtained:
 An average salary of $79,815 says something different than
$80,000.
 Such subtleties are generally lost on plots, but plainly clear
in numerical tables.

2. Representation of scale: When numbers are written in columns:

 Longer numbers mean bigger values


 Shorter numbers mean smaller values

If numbers are right-aligned, your eyes can easily compare:

 thousands vs lakhs

 hundreds vs thousands

3. Multivariate data (many variables)

When data has more than two variables, graphs become confusing.

 2 variables → easy graph


 3+ variables → hard to imagine

👉 Tables don’t have this problem. They can handle many columns easily.

4. Different types of data together (heterogeneous data)

Tables are best when data includes:

 Numbers (marks, salary)


 Text (names, cities)
 Categories (grade, role)
 Symbols or emojis (✔️)

Graphs struggle with mixed data, but tables handle all of it neatly.
5. Compactness: Tables are particularly useful for representing small numbers of points.
Two points in two dimensions can be drawn as a line, but why bother? A small table is
generally better than a sparse visual.

Best practices include:

1. Order rows to invite comparison: You have the freedom to order the rows in a table any
way you want, so take advantage of it. Sorting the rows according to the values of an
important column is generally a good idea.
2. Order columns to highlight importance, or pairwise relationships: Eyes darting from left
to right across the page cannot make effective visual comparisons, but neighboring fields
are easy to contrast.

3. Right-justify uniform-precision numbers.
4. Use emphasis, font, or color to highlight important entries.
5. Avoid excessive-length column descriptors.
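A few of these table practices (ordering rows by an important column, right-justifying numbers with uniform precision, and short column headers) can be sketched with plain string formatting. The names, roles, and salaries are invented, and the column widths are arbitrary choices for this example.

```python
# Hypothetical rows, invented for illustration only: (name, role, salary).
rows = [
    ("Asha", "Analyst", 79815.00),
    ("Ravi", "Engineer", 125000.50),
    ("Meena", "Intern", 9500.25),
]

def format_table(rows):
    """Sort by salary (largest first) and right-justify the numeric column."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    lines = [f"{'Name':<8}{'Role':<10}{'Salary':>12}"]
    for name, role, salary in ordered:
        # Two decimals everywhere, so digits of equal weight line up.
        lines.append(f"{name:<8}{role:<10}{salary:>12,.2f}")
    return "\n".join(lines)

print(format_table(rows))
```

Because the numbers are right-aligned with the same precision, the longer entries are visibly the larger values, which is the "representation of scale" advantage described earlier.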

6.3.2 Dot and Line Plots


Dot Chart (Dot Plot)
 A dot chart shows only dots for data values.
 Each dot represents one data point.
 No lines are drawn between points.
When to use:
 When data is separate or categorical
 When values exist only at specific points (like marks of students, income of states)
Example:
Marks of students in a test:
 Each student’s mark = one dot
👉 Dot charts show actual data clearly and avoid confusion.
Line Chart (Line Plot)
 A line chart shows dots connected by lines.
 It shows how data changes continuously.
When to use:
 When data changes over time (days, months, years)
 When values in between also make sense
Example:
Temperature over a week:
 Points are connected to show rising or falling trend
👉 Line charts help us see trends and patterns.

Advantages of Line Charts

1. Shows trend clearly
Line charts clearly show whether values are increasing, decreasing, or stable over time.

2. Good for time-based data
Best used when data changes with time (days, months, years).

3. Helps in interpolation and prediction
Lines help estimate values between known points and predict future behavior.

4. Easy comparison
Multiple lines can be drawn to compare different groups (e.g., sales of two products).

5. Highlights patterns
Seasonal effects, cycles, and sudden changes are easy to spot.
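The interpolation advantage mentioned in the list above is what a line segment between two observations implies: a straight-line estimate for the values in between. A minimal sketch, using hypothetical temperature readings and an illustrative function name:

```python
def interpolate(points, x):
    """Linearly interpolate y at x from (x, y) points sorted by x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            fraction = (x - x0) / (x1 - x0)
            return y0 + fraction * (y1 - y0)
    raise ValueError("x lies outside the observed range")

# Hypothetical (day, temperature in °C) readings, invented for illustration.
temperatures = [(1, 30.0), (3, 34.0), (5, 31.0)]

print("day 2:", interpolate(temperatures, 2))
print("day 4:", interpolate(temperatures, 4))
```

This only makes sense when in-between values are meaningful, which is why lines should not connect categorical points.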

Best Practices for Line Charts

 Use line charts only for continuous data
Do not connect points for categorical data (like states or names).

 Always show actual data points
Show dots along with the line so viewers can see real observations.

 Keep the chart simple
Avoid too many lines in one chart (2–4 lines maximum).

 Choose proper axis ranges
Start from zero when it makes sense, and avoid misleading truncation of axes.

 Use consistent scale and labels
Clearly label axes and units to avoid confusion.

 Use colors carefully
Use different colors or line styles to distinguish lines, but don’t overuse them.
6.3.3 Scatter Plots

Why large data is hard to show

 When a dataset has thousands of points, graphs can become messy.


 Too many dots overlap and form a dark blob (often called a “black ball”).
 This hides useful patterns.

👉 But if done correctly, scatter plots can clearly show even very large datasets.

What is a scatter plot?

 A scatter plot shows every data point as a dot.

 Each dot represents an (x, y) value.

Example:

 Height on x-axis

 Weight on y-axis

 Each person = one dot

Best practices for scatter plots (simple)

✔ Use the right dot size

 Biggest mistake: dots are too large

 Large dots overlap and hide data

✔ Handle overlapping points properly

 When many points have the same values (especially integers), they are drawn on top of one another, so the plot hides how many observations share each position.
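One common remedy for identical points, offered here as an assumption since the notes do not name a specific fix, is to add a small random offset ("jitter") to each point before plotting so that repeated values spread into a visible cluster. The data and the `jitter` function below are hypothetical.

```python
import random

def jitter(points, amount=0.2, seed=0):
    """Shift each (x, y) point by a small random offset of at most `amount`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# Hypothetical integer-valued pairs with many repeats.
ratings = [(3, 4), (3, 4), (3, 4), (5, 2), (5, 2), (1, 1)]

jittered = jitter(ratings)
distinct_before = len(set(ratings))
distinct_after = len(set(jittered))

print("distinct positions before jitter:", distinct_before)
print("distinct positions after jitter: ", distinct_after)
```

The offset should stay small relative to the spacing between values, so that points remain visibly near their true position.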

✔ Showing many variables (more than 2)

Problem:
 Humans can’t easily visualize 4 or more dimensions

Solution 1: Reduce dimensions

 Convert many variables into two new axes

 Techniques like PCA do this

Solution 2: Pairwise scatter plots (better)

 Draw many small scatter plots


 Each plot compares two variables
 Helps find:
o Relationships
o Correlations
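For pairwise scatter plots, it helps to see how many small plots are needed: one per unordered pair of variables, which is n(n − 1)/2 for n variables. The variable names below are hypothetical, and the actual drawing is left to whichever plotting tool is in use.

```python
from itertools import combinations

# Hypothetical variable names, invented for illustration only.
variables = ["height", "weight", "age", "income"]

# Each unordered pair of variables gets its own small scatter plot.
pairs = list(combinations(variables, 2))

for x_name, y_name in pairs:
    print(f"scatter plot: {x_name} (x-axis) vs {y_name} (y-axis)")

print("number of plots:", len(pairs))
```

The count grows quickly with the number of variables, so this approach works best for a modest number of columns.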
6.3.4 Bar Plots and Pie Charts

What bar charts and pie charts show

Both bar charts and pie charts show how data is divided into groups.

Use a Bar Chart when:

 You want to compare values accurately.

 You want to see which is largest or smallest.

 You want to track changes across categories or time.

Use a Pie Chart when:

 You want to show parts of a whole (percentages).

 You only have a few categories.

6.3.5 Histograms - understand distribution of data

 A histogram shows how data is distributed.

 Data is grouped into ranges (bins).

 Each bar shows how many values fall in that range.
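The binning step behind a histogram can be sketched without any plotting library. The marks, the range 0 to 100, and the choice of five equal-width bins are assumptions made for illustration.

```python
def histogram(values, low, high, bins):
    """Count how many values fall in each of `bins` equal-width ranges."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if not low <= value <= high:
            continue  # ignore values outside the plotted range
        # A value equal to `high` belongs to the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical test marks, invented for illustration only.
marks = [35, 42, 48, 51, 55, 58, 62, 67, 71, 74, 78, 83, 91, 100]
counts = histogram(marks, low=0, high=100, bins=5)

# Print a simple text histogram: one '#' per value in the bin.
for i, count in enumerate(counts):
    start = i * 20
    print(f"{start:>3}-{start + 20:<3} {'#' * count}")
```

Changing the number of bins changes the picture, so it is worth trying a few bin widths before drawing conclusions.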


6.3.6 Data Maps - quickly see where data is high or low

 A heatmap uses colors to show data values.


 Darker or brighter colors = higher values
 Lighter colors = lower values

👉 Instead of numbers, color intensity shows meaning.

Example:

 Student performance:
o Green → good
o Yellow → average
o Red → poor

Key points:

 Best for large datasets


 Makes patterns and clusters easy to see
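The student-performance example can be sketched as a mapping from values to colors. The thresholds (75 and 50), the student names, and the scores are all assumptions chosen for illustration; a real heatmap would usually use a continuous color scale.

```python
def color_for(score, good=75, average=50):
    """Map a score to a color bucket; the thresholds are illustrative assumptions."""
    if score >= good:
        return "green"
    if score >= average:
        return "yellow"
    return "red"

# Hypothetical scores per student across three tests.
scores = {
    "Asha": [82, 47, 66],
    "Ravi": [55, 91, 38],
}

for student, row in scores.items():
    print(student, [color_for(score) for score in row])
```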

6.4 Great Visualizations - Developing your own visualization aesthetic gives you
a language to talk about what you like and what you don't like.

1. Marey's Train Schedule

 Time of day → x-axis

 Stations from Paris to Lyon → y-axis

 Each line = one train

Instead of a table of times, Marey drew lines showing where each train is at every moment.

 Steep line → fast train

 Flat line → slow train

 Horizontal line → train stopped at a station.

 Where two lines cross = trains pass each other
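The reading of slopes in this chart is just distance divided by time. The sketch below uses hypothetical (hour, kilometre) points, not Marey's actual timetable, to show why a steeper line means a faster train and a horizontal line means a stop.

```python
def average_speed(start, end):
    """Slope between two (hour, km) points, in km per hour."""
    (t0, d0), (t1, d1) = start, end
    if t1 == t0:
        raise ValueError("the two points must be at different times")
    return (d1 - d0) / (t1 - t0)

# Hypothetical (hour of day, km from the first station) points.
express = ((8.0, 0.0), (10.0, 200.0))      # steep line
local = ((8.0, 0.0), (12.0, 200.0))        # flatter line
station_stop = ((10.0, 200.0), (10.5, 200.0))  # horizontal line

print("express:", average_speed(*express), "km/h")
print("local:  ", average_speed(*local), "km/h")
print("stopped:", average_speed(*station_stop), "km/h")
```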

2. Snow's Cholera Map (Cholera Disease)

3. New York's Weather Year
