DATA VISUALIZATION
What Is Visualization?
Ans: Visualization is the communication of information using graphical representations. Pictures have been used as a
mechanism for communication since before the formalization of written language. A single picture can contain a wealth of
information, and can be processed much more quickly than a comparable page of words. Pictures can also be
independent of local language, as a graph or a map may be understood by a group of people with no common tongue.
Visualization in Everyday Life
Relationship between Visualization and Other Fields
1.1 What Is the Difference between Visualization and Computer Graphics?
Ans: In all visualizations, one can clearly see the use of the graphics primitives (points, lines, areas, and volumes).
Beyond the use of graphics, the most important aspect of all visualizations is their connection to data.
Computer graphics focuses primarily on graphical objects and the organization of graphic primitives;
visualizations go one step further and are based on the underlying data, and may include spatial positions,
populations, or physical measures.
Consequently, visualization is the application of graphics to display data by mapping data to graphical
primitives and rendering the display graphics.
The field of visualization encompasses aspects from numerous other disciplines, including human-computer
interaction, perceptual psychology, databases, statistics, and data mining, to name a few.
While computer graphics can be used to define and generate the displays that are used to communicate the
information, the sources of data and the way users interact and perceive the data are all important components
to understand when presenting information.
Computer graphics is predominantly focused on the creation of interactive synthetic images and animations of
three-dimensional objects, most often where visual realism is one of the primary goals. A secondary
application of computer graphics is in art and entertainment, with video games, cartoons, advertisements, and
movie special effects as typical examples. Visualization, on the other hand, does not emphasize visual
realism as much as the effective communication of information.
1.2 Scientific Data Visualization vs. Information Visualization
Ans:
Scientific Data Visualization:
● Purpose: Scientific data visualization is primarily used for representing and communicating data
generated from scientific experiments, research, and analysis. Its purpose is to help scientists
and researchers understand complex datasets, identify patterns, trends, and anomalies, and draw
insights from the data.
● Audience: The primary audience for scientific data visualization includes scientists, researchers,
engineers, and domain experts who are analyzing and exploring data related to scientific
phenomena. The emphasis is on accuracy, precision, and facilitating scientific discoveries.
Information Visualization:
● Purpose: Information visualization, on the other hand, is focused on making complex information,
often non-scientific or non-technical in nature, more understandable and accessible to a general
audience. It's used to simplify and clarify data or information for decision-makers, journalists, the
general public, or anyone seeking insights from diverse datasets.
● Audience: The audience for information visualization includes a broad range of people who need
to comprehend information quickly and easily. This can include business professionals,
policymakers, journalists, and the general public. Clarity, accessibility, and the ability to draw
insights at a glance are paramount.
Types of Data:
● Scientific Data Visualization:
● Scientific data visualization typically deals with complex, quantitative data derived from
experiments, simulations, observations, or scientific measurements. This data often includes
variables, measurements, and parameters related to specific scientific domains.
● Information Visualization:
● Information visualization can deal with a wide variety of data, including text, numbers, categorical
data, and qualitative information. It's often used for representing business data, market trends,
news articles, social media content, or any other type of information that can be structured and
presented visually.
Visual Representation:
● Scientific Data Visualization:
● Scientific data visualization often involves techniques like scatter plots, heatmaps, 3D
visualizations, and specialized scientific visualizations (e.g., molecular structures, flow
simulations) that are tailored to the specific needs of scientists.
● Information Visualization:
● Information visualization commonly uses techniques like bar charts, line graphs, treemaps, word
clouds, network diagrams, and other general-purpose visualization methods that aim to provide
clarity and facilitate quick understanding.
The Visualization Process
Data modeling. The data to be visualized, whether from a file or a database, has to be structured to facilitate its
visualization. The name, type, range, and semantics of each attribute or field of a data record must be available in a
format that ensures rapid access and easy modification.
Data selection. Similar to clipping, data selection involves identifying the subset of the data that will be potentially
visualized. This can occur totally under user control or via algorithmic methods, such as cycling through time slices
or automatically detecting features of potential interest to the user.
Data to visual mappings. The heart of the visualization pipeline is performing the mapping of data values to
graphical entities or their attributes. Thus, one component of a data record may map to the size of an object, while
others might control the position or color of the object. This mapping often involves processing the data prior to
mapping, such as scaling, shifting, filtering, interpolating, or subsampling.
Scene parameter setting (view transformations). As in traditional graphics, the user must specify several
attributes of the visualization that are relatively independent of the data. These include color map selection (for
different domains, certain colors have clearly defined meaning), sound map selection (in case the auditory channels
will be conveying information as well), and lighting specifications (for 3D visualizations).
Rendering or generation of the visualization. The specific projection or rendering of the visualization objects
varies according to the mapping being used; techniques such as shading or texture mapping might be involved,
although many visualization techniques only require drawing lines and uniformly shaded polygons. Besides showing
the data itself, most visualizations also include supplementary information to facilitate interpretation, such as axes,
keys, and annotations.
The viewer may use the visualization for exploration, hypothesis testing, finding anomalies, clustering, or identifying trends.
Graphical process:
1. Modeling 2. Viewing 3. Clipping 4. Hidden surface removal 5. Projection 6. Rendering.
Pseudocode Conventions
• data—The working data table. This data table is assumed to contain only numeric values.
• m—The number of dimensions (columns) in the working data table. Dimensions are typically iterated over using j as the running dimension index.
• n—The number of records (rows) in the working data table. Records are typically iterated over using i as the running record index.
• Normalize(record, dimension), Normalize(record, dimension, min, max)—A function that maps the value for the
given record and dimension in the working data table to a value between min and max, or between zero and one if
min and max are not specified.
• Color(color)—A function that sets the color state of the graphics environment to the specified color.
• MapColor(record, dimension)—A function that sets the color state by applying the global color map to the normalized value of the given record and dimension.
• Circle(x, y, radius)—A function that draws a circle centered at (x, y) with the given radius.
• Polyline(xs, ys)—A function that draws a polyline (a series of connected line segments).
• Polygon(xs, ys)—A function that draws a filled polygon.
• GetLatitudes(record), GetLongitudes(record)—Functions that retrieve the arrays of latitude and longitude coordinates, respectively, of the geographic polygon.
• ProjectLatitudes(lats, scale), ProjectLongitudes(longs, scale)—Functions that project arrays of latitude values to
arrays of y values, and arrays of longitude values to arrays of x values, respectively.
THE SCATTER PLOT
Scatterplot(xDim, yDim, cDim, rDim, rMin, rMax)
1 for each record i ✄ For each record,
2 do x ← Normalize(i, xDim) ✄ derive the location,
3 y ← Normalize(i, yDim)
4 r ← Normalize(i, rDim, rMin, rMax) ✄ radius,
5 MapColor(i, cDim) ✄ and color, then
6 Circle(x, y, r) ✄ draw the record as a circle.
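The pseudocode above can be sketched in Python. Rather than drawing directly, this version returns the circle parameters each record maps to, which any graphics layer (e.g., matplotlib) could then render; the helper names mirror the pseudocode conventions and are assumptions of this sketch, not a standard API.

```python
def normalize(data, i, dim, lo=0.0, hi=1.0):
    """Map data[i][dim] into [lo, hi] using that column's min and max."""
    col = [row[dim] for row in data]
    cmin, cmax = min(col), max(col)
    t = (data[i][dim] - cmin) / (cmax - cmin) if cmax > cmin else 0.0
    return lo + t * (hi - lo)

def scatterplot(data, x_dim, y_dim, c_dim, r_dim, r_min, r_max):
    circles = []
    for i in range(len(data)):                           # for each record,
        circles.append({
            "x": normalize(data, i, x_dim),              # derive the location,
            "y": normalize(data, i, y_dim),
            "r": normalize(data, i, r_dim, r_min, r_max),  # radius,
            "color": normalize(data, i, c_dim),          # and a color value in [0,1]
        })                                               # to draw as a circle
    return circles

data = [[1, 10, 5, 2], [3, 30, 1, 8], [2, 20, 3, 5]]
circles = scatterplot(data, 0, 1, 2, 3, 0.02, 0.08)
# circles[0] -> {'x': 0.0, 'y': 0.0, 'r': 0.02, 'color': 1.0}
```

The "color" value would be passed through the global color map at render time.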
A scatter plot is a type of data visualization that is used to display individual data points, primarily
to show the relationship between two or more variables. It is a fundamental tool for
understanding how two variables are related and for identifying patterns or trends in the data.
Scatter plots are versatile and can be used in various fields, including statistics, data analysis,
scientific research, and business. They are a valuable tool for visually exploring and
understanding relationships between variables and identifying patterns or anomalies in data.
Variables: Scatter plots typically display two variables on a two-dimensional graph, one
on the horizontal (x-axis) and the other on the vertical (y-axis). These variables can be
numerical or quantitative, and the plot shows how changes in one variable relate to
changes in the other.
Data Points: Each data point in a scatter plot represents an individual observation or
data record.
Axes: The x-axis and y-axis represent the two variables being compared. The axis scales help
interpret the data points' positions.
Relationship Identification: Scatter plots are used to identify and visualize relationships
between variables. Depending on the pattern observed, you can classify the relationship
as positive (both variables increase together), negative (one variable increases while the
other decreases), or no correlation (no clear pattern).
Patterns and Trends: Different patterns and trends can be observed in scatter plots, such
as clusters, linear relationships, exponential relationships, or no discernible relationship.
Outliers: Scatter plots are useful for identifying outliers, which are data points that
significantly deviate from the general pattern of the data. Outliers may represent data
errors or important exceptions to the general trend.
Title and Labels: Scatter plots often include a title that describes what the plot is
showing and labels for the x-axis and y-axis to indicate the variables being represented.
Legend: If you have multiple datasets or categories displayed on the same scatter plot, a
legend can be included to distinguish between them.
Data Foundation
Data comes from many sources; it can be gathered from sensors or surveys, or it can be generated by
simulations and computations. Data can be raw (untreated), or it can be derived from raw data via some
process, such as smoothing, noise removal, scaling, or interpolation.
Types of data
Ordinal. The data take on numeric values:
• binary—assuming only values of 0 and 1;
• discrete—taking on only integer values or from a specific subset (e.g., (2, 4, 6));
• continuous—representing real values (e.g., in the interval [0, 5]).
Nominal. The data take on nonnumeric values:
• categorical—a value selected from a finite (often short) list of possibilities (e.g., red, blue,green);
• ranked—a categorical variable that has an implied ordering (e.g., small, medium, large);
• arbitrary—a variable with a potentially infinite range of values with no implied ordering
(e.g.,addresses).
Cardinal. Used to describe a data type representing non-negative (unsigned) integers, as used for counting.
Another way of categorizing data is by scale:
Ordering relation: an inherent ordering present in the data.
Distance metric: a distance defined between two records.
Existence of zero: a fixed constant value (an absolute zero).
Every visualization has a scale associated with it.
Structure within and between Records
Scalars, Vectors, and Tensors
Scalars: an individual variable in a record is called a scalar, e.g., the cost of an item or the age of an individual.
Vector: a composite variable value is known as a vector, e.g., a displacement in x and y.
Tensor: defined by its rank and by the dimensionality of the space within which it is defined; represented as an array or matrix.
Scalar = tensor of rank 0
Vector = tensor of rank 1
Rank-2 tensor = 2D matrix
Other Forms of Structure
Topology : how records are connected
Data Preprocessing
1. Metadata and Statistics
Metadata: information regarding a data set of interest.
Statistics:
Correlation analysis can help users eliminate redundant fields or highlight associations between dimensions that might not have been apparent otherwise.
Cluster analysis can help segment the data into groups exhibiting strong similarities.
Summary statistics: mean, median, mode.
2. Missing Values and Data Cleansing
Discard the bad record: throw away any data record containing a missing or erroneous field
Assign a sentinel value :Another popular strategy is to have a designated sentinel value for each variable in the
data set that can be assigned when the real value in a record is in question. For example, in a variable that has a
range of 0 to 100, one might use a value such as −5 to designate an erroneous or missing entry.
Assign the average value. A simple strategy for dealing with bad or missing data is to replace it with the
average value for that variable or dimension. The advantage is that it minimally affects the overall statistics
for that variable. The drawback of this method is that it may mask or obscure outliers.
Assign value based on nearest neighbor variables. The basic idea here is that if record A is missing an
entry for variable i, and record B is closer than any other record to A without considering variable i, then using the
value of variable i from record B as a substitute in A is a reasonable assumption. The problem with this approach,
however, is that variable i may be most dependent on only a subset of the other dimensions, rather than on all
dimensions, and so the best nearest neighbor based on all dimensions may not be the best substitute for this
particular dimension.
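The nearest-neighbor strategy above can be sketched briefly. In this sketch, None marks a missing entry (an assumption of the sketch, not a fixed convention), and the other records are assumed complete in the dimensions used for the distance.

```python
def impute_nearest(data, a, var):
    """Fill data[a][var] from the record nearest to A, ignoring variable var."""
    dims = [j for j in range(len(data[a])) if j != var]
    best, best_dist = None, float("inf")
    for b, row in enumerate(data):
        if b == a or row[var] is None:        # skip A itself and incomplete records
            continue
        dist = sum((row[j] - data[a][j]) ** 2 for j in dims) ** 0.5
        if dist < best_dist:
            best, best_dist = b, dist
    data[a][var] = data[best][var]            # copy the neighbor's value into A
    return data[a]

data = [[1.0, 2.0, None],
        [1.1, 2.1, 7.0],
        [9.0, 9.0, 3.0]]
impute_nearest(data, 0, 2)   # record 1 is nearest to record 0
# data[0] is now [1.0, 2.0, 7.0]
```

As the text notes, the weakness is that the neighbor nearest over all dimensions may not be the best predictor of the one missing dimension.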
Compute a substitute value. This approach, known as imputation, seeks to find values that have high statistical confidence.
Schafer has developed a model-based imputation technique.
3. Normalization
Normalization is the process of transforming a data set so that the results satisfy a particular
statistical property. A simple example of this is to transform the range of values a particular variable assumes so that
all numbers fall within the range of 0.0 to 1.0. Other forms of normalization convert the data such that each
dimension has a common mean and standard deviation.
dnormalized = (doriginal − dmin)/(dmax − dmin).
A nonlinear normalization, using the square root:
dsqrt-normalized = sqrt((doriginal − dmin)/(dmax − dmin))
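Both normalizations above can be sketched in a few lines; the square-root variant compresses large values, which can help with skewed data.

```python
def min_max_normalize(values):
    """Map values linearly into [0, 1] using their min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sqrt_normalize(values):
    """Nonlinear variant: square root of the min-max-normalized value."""
    return [x ** 0.5 for x in min_max_normalize(values)]

data = [2.0, 4.0, 6.0, 10.0]
print(min_max_normalize(data))  # [0.0, 0.25, 0.5, 1.0]
print(sqrt_normalize(data))     # [0.0, 0.5, ~0.707, 1.0]
```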
4. Sampling and Subsetting
When only discrete samples of a continuous phenomenon are available, gaps in the data can be filled via interpolation.
Linear interpolation: Given the value of a variable d at two locations A and B, we can estimate the value of that
variable at a location C that is between A and B by first calculating the percentage of the distance between A and
B where C lies. This percentage can then be used in conjunction with the amount the variable changes in value
between the two points to determine the amount the value should have changed by the time point C is reached.
(xC − xA)/(xB − xA)=(dC − dA)/(dB − dA)
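Solving the relation above for dC gives the familiar linear interpolation formula dC = dA + t(dB − dA), where t is the fractional distance of C between A and B:

```python
def lerp(x_a, x_b, d_a, d_b, x_c):
    """Estimate the value d at x_c, given d at x_a and x_b."""
    t = (x_c - x_a) / (x_b - x_a)   # fraction of the way from A to B
    return d_a + t * (d_b - d_a)    # value changes by the same fraction

# Value is 10 at x=0 and 30 at x=4; estimate it at x=1.
print(lerp(0.0, 4.0, 10.0, 30.0, 1.0))  # 15.0
```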
Bilinear interpolation: We can extend this concept to two dimensions (or to an arbitrary number of dimensions)
by repeating the procedure for each dimension. For example, a common task in two dimensions is to compute the
value of d at location (x, y) given a uniform grid of data values (i.e., the space between points is uniform in both
directions, as in an image). If the location coincides with a grid point, the answer is simply the value stored at that
location.
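The "repeat per dimension" idea can be sketched for a single unit grid cell: interpolate along x on the bottom and top edges, then along y between those two results.

```python
def bilinear(d00, d10, d01, d11, tx, ty):
    """dij = value at cell corner (i, j); tx, ty = fractions within the cell."""
    bottom = d00 + tx * (d10 - d00)      # interpolate along x at y = 0
    top    = d01 + tx * (d11 - d01)      # interpolate along x at y = 1
    return bottom + ty * (top - bottom)  # then interpolate along y

# Corner values 0, 10 (bottom) and 20, 30 (top); value at the cell center:
print(bilinear(0.0, 10.0, 20.0, 30.0, 0.5, 0.5))  # 15.0
```

At a corner (tx and ty equal to 0 or 1) the result is simply the stored corner value, matching the grid-point case described above.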
Non linear interpolation : a mathematical technique used to estimate values between data points in a way that
does not assume a linear relationship between the data points. In contrast to linear interpolation, which assumes a
straight-line relationship, nonlinear interpolation allows for more complex and curved relationships to be modeled.
Nonlinear interpolation is particularly useful when the data does not follow a linear trend or when a higher degree of
accuracy is needed in approximating values.
Subsetting: for very large data sets, clusters can be formed and a subset of the data visualized (clustering for visualization).
5. Segmentation
The data are separated into contiguous regions, where each region corresponds to a particular classification of the data.
For example, an MRI data set may have 256 possible values for each data point, and can be segmented into bone, muscle, and tissue.
A typical problem with segmentation is that the results may not coincide with regions that are semantically
homogeneous (undersegmented), or may consist of large numbers of tiny regions (oversegmented). One solution to
this problem is to follow the initial segmentation process with an iterative split-and-merge refinement stage.
6. Dimension Reduction
Visualizations can handle only a limited number of dimensions.
The goal is to reduce the number of dimensions while preserving the information contained in the data. This can be done manually,
or with techniques such as PCA, multidimensional scaling (MDS), and self-organizing maps (SOMs).
PCA (Principal Component Analysis)
Principal component analysis (PCA) is a popular method for dimension reduction [385]. PCA is a data-space
transformation technique that computes new dimensions/attributes which are linear combinations of the original data
attributes. The advantage of the new dimensions is that they can be sorted according to their contribution in
explaining the variance of the data. By selecting the most relevant new dimensions, a subspace of variables is
obtained that minimizes the average error of lost information.
1. Standardize the range of the continuous initial variables (so they share the same scale):
Z = (value − mean)/std
2. Compute the covariance matrix to find correlations, i.e., how the variables vary from the mean with respect to each other.
For p dimensions this is a p × p matrix; e.g., for variables x, y, z:
cov(x,x) cov(x,y) cov(x,z)
cov(y,x) cov(y,y) cov(y,z)
cov(z,x) cov(z,y) cov(z,z)
3. Compute the eigenvectors and eigenvalues of the covariance matrix to find the principal components.
4. Create a feature vector to decide which components to keep.
5. Recast the data along the principal component axes.
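The five steps above can be sketched with NumPy alone: standardize, form the covariance matrix, eigendecompose it, keep the top components, and project. For real use, a library implementation such as sklearn.decomposition.PCA would be the usual choice.

```python
import numpy as np

def pca(X, k):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # 1. standardize each variable
    C = np.cov(Z, rowvar=False)               # 2. p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]         #    sort by variance explained
    W = eigvecs[:, order[:k]]                 # 4. feature vector: top k components
    return Z @ W                              # 5. recast the data onto those axes

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Y = pca(X, 1)          # project 2D data onto its first principal component
print(Y.shape)         # (6, 1)
```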
Characteristics of a good data set:
1. Documentation: accuracy, units, etc.
2. Consistency: a. no contradictions b. no duplicate values c. no missing values d. a consistent character set
3. Uniformity in feature representation: a. same-size images b. consistent value encodings c. normalized values
4. Size of the data set
5. Diversity
UNIT 2
Visualization Process (Stages)
Data preprocessing and transformation:
1. Convert the raw data into something usable by the visualization system.
2. Raw data are mapped to fundamental data types for computer ingestion.
3. Deal with application-specific data issues such as missing values, errors in input, and data too large for processing.
4. Large data may require sampling, filtering, aggregation, or partitioning.
Mapping for visualizations:
1. Decide on a specific and optimal visual representation.
2. For example, choosing an appropriate type of graph.
3. Mapping is the process of creating and using maps to display and communicate data in a visual format.
View Transformation:
1. Transform geometry data into an image (Matplotlib, OpenGL, Seaborn).
Measure for Visualization:
Expressiveness: the visualization displays all of the information in the data, and only that information; 0 ≤ Mexp ≤ 1 (Mexp should not be more than 1).
Effectiveness: the visualization is interpreted accurately and rendered in a cost-effective manner. Meff = 1/(1 + interpret + render), so we
have 0 < Meff ≤ 1. The larger Meff is, the greater the visualization's effectiveness. If Meff is small, then either the
interpretation time is very large, or the rendering time is large. If Meff is large (close to 1), then both the interpretation
and the rendering time are very small.
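A tiny worked example of the effectiveness measure above; treating "interpret" and "render" as times in arbitrary units is an assumption here, since only their relative sizes matter when comparing visualizations.

```python
def effectiveness(interpret_time, render_time):
    """M_eff = 1 / (1 + interpret + render); larger is better."""
    return 1.0 / (1.0 + interpret_time + render_time)

fast = effectiveness(0.1, 0.1)    # quickly read, cheaply drawn
slow = effectiveness(4.0, 1.0)    # hard to read, costly to draw
print(fast > slow)                # True: smaller times give higher M_eff
```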
SEMIOLOGY
The science of graphical symbols and marks is called semiology.
This includes diagrams, networks, maps, plots, and other common visualizations.
Semiology uses the qualities of the plane and objects on the plane to produce similarity features, ordering features,
and proportionality features of the data that are visible for human consumption.
Semiology of graphical symbol
Symbols and Visualizations
• similarity in data structure ⇐⇒ visual similarity of corresponding symbols;
• order between data items ⇐⇒ visual order between corresponding symbols.
Without cognition, a graphical symbol is unusable.
Relationships and orderings in the data must be represented on screen, and no visual symbol should appear without a basis in
the data.
Features of graphics:
Graphics have three or more dimensions;
even a 2D graphic can have a z dimension.
THE EIGHT VISUAL VARIABLES
The concept of the "eight visual variables" was introduced by Jacques Bertin, a cartographer
and theorist in the field of information visualization. Bertin identified these fundamental visual
variables as elements that can be manipulated to represent and encode data visually. They are
used to create effective and meaningful visual representations. These eight variables can be
adjusted as necessary to maximize the effectiveness of a visualization to convey information.
These variables are:
Position: The most fundamental visual variable. The spatial arrangement of graphical elements
can convey quantitative or categorical information. For instance, a scatter plot uses the x and y
positions to represent data points.
Size: The area, length, or volume of graphical elements can represent quantitative information.
For example, in a bubble chart, the size of the bubble can represent a numerical value.
Shape: Different shapes or symbols can be used to represent categories or classes within a
dataset. Shapes can help distinguish between various data points or categories.
Value (Brightness): Varied shades of color or grayscale can represent numerical values. Darker
or lighter shades can encode higher or lower values respectively.
Color: Color can be used for both categorical and quantitative data. It can differentiate
categories or represent a spectrum of values. Color hue, saturation, and intensity can be
manipulated to convey different information.
Orientation: The angle or direction of graphical elements can represent data. For instance, in a
bar chart, the orientation of the bars can convey information.
Texture: Different textures or patterns can be used to represent different categories or values in
a dataset. However, texture is less commonly used due to its potential to create visual clutter.
Motion: Motion is a dynamic visual variable that can represent changes, time-based
sequences, or flow within a dataset. Animated visualizations or dynamic elements can show
transitions, changes over time, or movements within the data.
In the context of data visualization, motion can be used to display the evolution of data,
temporal patterns, or changes over time.
Understanding and effectively utilizing these visual variables are crucial for creating informative
and easy-to-understand data visualizations. By combining these variables in various ways, data
analysts and designers can create visualizations that effectively communicate the insights and
patterns within the data. The choice of visual variable or combination thereof depends on the
nature of the data and the story the visualization aims to convey.
HISTORICAL PERSPECTIVE
The history of data visualization spans centuries, evolving alongside technological
advancements, scientific progress, and the increasing need to comprehend complex
information.
Bertin (1967) Semiology of Graphics
Jacques Bertin's "Semiology of Graphics" (originally published in French in 1967) is a seminal
work in the field of information visualization and cartography.
Bertin presents the fundamentals of information encoding via graphic representations as a semiology, a
science dealing with sign systems.
Marks (or Graphical Primitives): Bertin emphasizes the importance of graphical primitives or
marks used to represent data. He categorizes marks into points, lines, and areas. Each of these
marks has distinct purposes and can represent different types of data. For instance, points can
represent individual data items, lines can depict connections or sequences, and areas can
illustrate groups or larger units within a dataset.
Positional Encoding (Two planar dimensions): Bertin underscores the significance of
positional encoding within a two-dimensional space. He highlights the power of using the x and
y axes (or other spatial dimensions) to encode quantitative or categorical data. Bertin's work
stresses that the arrangement of these marks within the graphical space is crucial for accurately
representing and distinguishing different data points or categories. The plane is marked by
implantations, classes of representations that constitute the elementary figures of plane geometry: points,
lines, and areas. These three types of figures are organized in the two planar dimensions by the
imposition, dividing graphics into four groups: diagrams, networks, maps, and symbols. With the
implantations and an imposition, a visualization is specified.
Retinal Attributes: In "Semiology of Graphics," Bertin introduced the concept of retinal
variables—attributes that engage human perception directly and can be used to represent data.
There are six retinal variables identified by experimental psychology: size (height, area, or number), value
(saturation), texture (fineness or coarseness), color (hue), orientation (angular
displacement), and shape.
Marks :Points, lines, and areas
Positional : Two planar dimensions
Retinal :Size, value, texture, color, orientation, and shape
Mackinlay (1986) APT
In 1986, Jock Mackinlay, a researcher at Xerox PARC, introduced the APT (A Presentation Tool)
system in his Ph.D. dissertation titled "Automating the Design of Graphical Presentations of
Relational Information." The APT system was an innovative tool that aimed to automate the
design process of graphical presentations of relational information.
APT is built on Mackinlay's basis set of primitive graphical languages.
Three principles are defined for composing two presentations by merging graphics that encode
the same information. First is the double-axes composition: the composition of two graphical
sentences that have identical horizontal and vertical axes. Second is the single-axis composition,
which aligns two sentences that have an identical horizontal or vertical axis. Third is the mark
composition, for merging mark sets by pairing each mark of one set with a compatible
mark of the other set.
Bergeron and Grinstein (1989) Visualization Reference Model
Wehrend and Lewis (1990) Catalog
Robertson (1990) Natural Scene Paradigm
Casner (1991) BOZ
Hibbard (1994) Lattice Model
AVE (1995)
Card, Mackinlay, and Shneiderman (1999) Spatial Substrate
Kamps (1999) EAVE
Taxonomies
A taxonomy is a means to convey a classification. Often hierarchical in nature, a taxonomy can
be used to group similar objects and define relationships.
Keller and Keller, in their book Visual Cues [236], classify visualization techniques based on the
type of data being analyzed and the user’s task(s). Similar to those identified earlier in this book,
the data types they consider are:
Classification of Data Types. 6 types of data exist:
1. One-dimensional data—e.g., temporal data, news data, stock prices, text documents
2. Two-dimensional data—e.g., maps, charts, floor plans, newspaper layouts
3. Multidimensional data—e.g., spreadsheets, relational tables
4. Text and hypertext—e.g., news articles, web documents
5. Hierarchies and graphs—e.g., telephone/network traffic, system dynamics models
6. Algorithm and software—e.g., software, execution traces, memory dumps.
Classification of Visualization Techniques. 5 classes of visualization techniques exist:
1. Standard 2D/3D displays—e.g., x, y- or x, y, z-plots, bar charts, line graphs;
2. Geometrically transformed displays—e.g., landscapes, scatterplot matrices, projection pursuit
techniques, prosection views, hyperslice, parallel coordinates
3. Iconic displays—e.g., Chernoff faces, needle icons, star icons, stick figure icons, color icons,
tilebars;
4. Dense pixel displays—e.g., recursive pattern, circle segments, graph sketches;
5. Stacked displays—e.g., dimensional stacking, hierarchical axes, worlds-within-worlds,
treemaps, cone trees.
The Gibsonian Affordance Theory
The Gibsonian Affordance Theory, proposed by psychologist James J. Gibson, is a concept
within ecological psychology that focuses on the relationship between individuals and their
environment. This theory emphasizes the affordances of the environment and objects within it.
Affordances, according to Gibson, are the action possibilities or opportunities for interaction that
the environment offers to an individual based on the properties and features of objects and the
environment itself. These action possibilities are perceived directly by individuals through their
perception of the environment.
Give five examples of semiotics based on Gibson's affordance theory.
1. A rigid bottle cap allows for twisting.
2. A hinged door allows for pushing or pulling.
3. A staircase allows for ascending or descending.
4. Roundabout Signage: Traffic signs and lane markings at a roundabout offer affordances for
drivers to understand the rules of navigation. Arrows, lane markings, and signages indicating
entry and exit points provide semiotic cues for safe and efficient navigation.
5. Elevator Buttons: The arrangement and symbols on elevator buttons offer affordances for
pressing to select floors. The numerals, usually arranged in a vertical sequence, indicate the
action of choosing a specific floor level.