U1T3 - White Paper - Data Visualization Techniques From Basics To Big Data With SAS Visual Analytics
Bar Charts
Scatter Plots
Pie Charts
Understanding Influence With Decision Trees
Visualizing Flow With Sankey Diagrams
Conclusion
Introduction

A picture is worth a thousand words – especially when you are trying to understand and gain insights from data. It is particularly relevant when you are trying to find relationships among hundreds, or even thousands, of variables to determine their relative importance.

Organizations of all types and sizes generate data each minute, hour and day. Everyone – from executives and departmental decision makers to call center workers and employees on production lines – hopes to learn things from collected data that can help them make better decisions, take smarter actions and operate more efficiently.

Regardless of how much data you have, one of the best ways to discern important relationships is through advanced analysis and high-performance data visualization. If sophisticated analyses can be performed quickly, even immediately, and results presented in ways that showcase patterns and allow querying and exploration, people across all levels in your organization can make faster, more effective decisions.

To create meaningful visuals of your data, there are some basics you should consider. Data size and column composition play an important role when selecting graphs to represent your data. This paper discusses some of the basic issues concerning data visualization and provides suggestions for addressing those issues. In addition, big data brings a unique set of challenges for creating visualizations. This paper covers some of those challenges and potential solutions as well.

If you are working with massive amounts of data, one challenge is how to display results of data exploration and analysis in a way that is not overwhelming. You may need a new way to look at the data – one that collapses and condenses the results in an intuitive fashion but still displays graphs and charts that decision makers are accustomed to seeing. And, in today’s on-the-go society, you may also need to make the results available quickly via mobile devices, and provide users with the ability to easily explore data on their own in real time.

SAS® Visual Analytics is a data visualization and business intelligence solution that uses intelligent autocharting to help business analysts and nontechnical users visualize data. It creates the best possible visual based on the data that is selected. The visualizations make it easy to see patterns and trends and identify opportunities for further analysis.

The heart and soul of SAS Visual Analytics is the SAS® LASR™ Analytic Server, which can execute and accelerate analytic computations through in-memory processing. The combination of high-performance analytics and an easy-to-use data exploration interface enables different types of users to create and interact with graphs so they can understand and derive value from their data faster than ever. This creates an unprecedented ability to solve difficult problems, improve business performance and mitigate risk – rapidly and confidently.

Tips to Get Started

There are a few basic concepts that can help you generate the best visuals for displaying your data:
• Understand the data you are trying to visualize, including its size and cardinality.
• Determine what you are trying to visualize and what kind of information you want to communicate.
• Know your audience and understand how it processes visual information.
• Use a visual that conveys the information in the best and simplest form for your audience.

What Is Data Cardinality?

Cardinality is the uniqueness of data values contained in a column. High cardinality means there is a large percentage of unique values (e.g., bank account numbers, because each item should be unique). Low cardinality means a column of data contains a large percentage of repeat values (such as a “gender” column).
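As a rough illustration of the cardinality definition above, the following Python sketch computes the share of distinct values in each column of a small, made-up table. The column names, the sample rows and the 0.5 cutoff used to label a column "high" are assumptions chosen for illustration; the paper does not define a numeric threshold.

```python
def cardinality_ratio(values):
    """Return the fraction of values in a column that are distinct."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample data: account numbers are unique, gender values repeat.
rows = [
    {"account": "A-1001", "gender": "F"},
    {"account": "A-1002", "gender": "M"},
    {"account": "A-1003", "gender": "F"},
    {"account": "A-1004", "gender": "M"},
]

HIGH_CARDINALITY_CUTOFF = 0.5  # assumed threshold, not taken from the paper

ratios = {}
for column in ("account", "gender"):
    ratios[column] = cardinality_ratio(row[column] for row in rows)
    label = "high" if ratios[column] > HIGH_CARDINALITY_CUTOFF else "low"
    print(f"{column}: {ratios[column]:.2f} ({label} cardinality)")
```

A ratio near 1 suggests a column that would produce one bar per row in a bar chart, which is a hint to aggregate or pick a different visual.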
The Basics: Charting 101

Here is a quick guide to help you decide which chart type (or graph) to use for your data.

Line Graphs

A line graph, or line chart, shows the relationship of one variable to another. Line graphs are most often used to track changes or trends over time. They are also useful when comparing multiple items over the same time period (see Figure 1). The stacking lines are used to compare the trend or individual values for several variables.

You may want to use line graphs when the change in a variable or variables clearly needs to be displayed and/or when trending or rate-of-change information is of value. It is also important to note that you shouldn’t pick a line chart merely because you have data points. Rather, the number of data points that you are working with may dictate the best visual to use. For example, if you only have 10 data points to display, the easiest way to understand those 10 points might be to simply list them in a particular order using a table.

When deciding to use a line chart, you should consider whether the relationship between data points needs to be conveyed. If it does, and the values on the X axis are continuous, a simple line chart may be what you need.

Figure 1: Line graphs show the relationship of one variable to another. Shown here, multiple category line graphs compare multiple
items over the same time period.
Bar Charts

Bar charts are most commonly used for comparing the quantities of different categories or groups. Values of a category are represented using the bars, and they can be configured with either vertical or horizontal bars, with the length or height of each bar representing the value.

When values are distinct enough that differences in the bars can be detected by the human eye, you can use a simple bar chart. However, when the values (bars) are very close together or there are large numbers of values (bars) that need to be displayed, it becomes more difficult to compare the bars to each other.

To help provide visual variance, bars can have different colors. The colors can be used to indicate such things as a particular status or range. Coloring the bars works best when most bars are in a different range or status. When all bars are in the same range or status, the color becomes irrelevant, and it is most visually helpful to keep the color consistent or have no coloring at all.

Another form of a bar chart is called the progressive bar chart, or waterfall chart. A waterfall chart shows how the initial value of a measure increases or decreases during a series of operations or transactions (see Figure 2). The first bar begins at the initial value, and each subsequent bar begins where the previous bar ends. The length and direction of a bar indicate the magnitude and type (positive or negative, for example) of the operation or transaction. The resulting chart is a stepped cascade that shows how the transactions or operations lead to the final value of the measure.

Figure 2: This bar graph – a waterfall chart – shows how the initial value of a measure increases or decreases during a series of
operations or transactions.
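The waterfall construction described above (each bar starts where the previous one ends) can be sketched in a few lines of Python. This is a minimal illustration, not how SAS Visual Analytics builds the chart; the function name and the sample values are hypothetical.

```python
def waterfall_bars(initial, changes):
    """Return the (start, end) position of each bar and the final value.

    The first bar begins at the initial value; every later bar begins
    where the previous bar ended.
    """
    bars = []
    running = initial
    for change in changes:
        bars.append((running, running + change))
        running += change
    return bars, running


# Hypothetical measure: start at 100, then four transactions.
bars, final_value = waterfall_bars(100, [30, -20, 15, -40])
for start, end in bars:
    direction = "up" if end >= start else "down"
    print(f"bar from {start} to {end} ({direction}, magnitude {abs(end - start)})")
print("final value:", final_value)
```

The sign of `end - start` gives the direction of each bar and its absolute value gives the length, which matches the magnitude-and-type reading in the text.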
Scatter Plots

A scatter plot (or X-Y plot) is a two-dimensional plot that shows the joint variation of two data items. In a scatter plot, each marker (symbols such as dots, squares and plus signs) represents an observation. The marker position indicates the value for each observation. Scatter plots also support grouping. When you assign more than two measures, a scatter plot matrix is produced. A scatter plot matrix is a series of scatter plots that displays every possible pairing of the measures that are assigned to the visualization.

Scatter plots are useful for examining the relationship, or correlations, between X and Y variables. Variables are said to be correlated if they have a dependency on, or are somehow influenced by, each other. For example, “profit” is often related to “revenue” – and the relationship that exists might be that as revenue increases, profit also increases (a positive correlation). A scatter plot is a good way to visualize these relationships in data.

Once you have plotted all of the data points using a scatter plot, you are able to visually determine whether data points are related. Scatter plots can help you gain a sense of how spread out the data might be or how closely related the data points are, as well as quickly identify patterns present in the distribution of the data (see Figure 3). Scatter plots are helpful when you have many data points. If you are working with a small set of data points, a bar chart or table may be a more effective way to display the information.
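To make the revenue and profit example concrete, the sketch below computes the Pearson correlation coefficient, one common way to quantify the linear relationship that a scatter plot shows visually. The paper does not say which correlation measure is intended, so the choice of Pearson's r, the function name and the sample figures are all assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical figures: profit rises as revenue rises.
revenue = [100, 150, 200, 250, 300]
profit = [12, 20, 26, 35, 41]

r = pearson_r(revenue, profit)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive correlation
```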

Bubble plots are a variation of scatter plots in which the markers are replaced with bubbles. A bubble plot displays the relationships among at least three measures. Two measures are represented by the plot axes. The third measure is represented by the size of the bubbles (see Figure 4). Each bubble represents an observation.

A bubble plot is useful for data sets with dozens to hundreds of values or when the values differ by several orders of magnitude. You can use color to represent an additional measure, and you can animate the bubbles to display changes in the data over time.

Figure 4: A bubble plot can be animated to show data changing over time.
Pie Charts
There is much debate around the value of pie charts, which
are used to compare the parts of a whole. However, they can
be difficult to interpret because the human eye has a hard time
estimating areas and comparing visual angles. Another
challenge with using a pie chart for analysis is that it is difficult
to compare slices of the pie that are similar in size but not
located next to each other.
If you do use pie charts, they are most effective when there are
limited components and when text and percentages are
included to describe the content. By providing additional
information, report consumers do not have to guess the
meaning and value of each slice. If you choose to use a pie
chart, the slices should be a percentage of the whole (see
Figure 5).

Figure 5: A pie chart helps you compare the percentages of different components.

When designing reports or dashboards, another consideration for the efficacy of a pie chart is the amount of space the pie chart requires in the sizing of the report. Because of the round shape, pie charts require extra real estate, so they may be less than ideal when developing dashboards for small screens or mobile devices. Other charts (like a bar chart) may provide a better way to represent the same information in less space (see Figure 6).

Of course, there are many other chart types you can use to present data and analytical results. The selection of charts usually will depend upon the number of categories and measures (or dimensions) you want to visualize. Even after following the tips outlined here and studying the examples, you may need to try different types of visuals and test them with your audience to make sure the correct information is being conveyed.

Figure 6: Alternatives to pie charts include line charts and bar charts.
Visualizing Big Data

Big data brings new challenges to visualization because of the speed, size and diversity of data that must be taken into account. The cardinality of the columns you are trying to visualize should also be considered.

One of the most common definitions of big data is data that is of such volume, variety and velocity that an organization must move beyond its comfort zone technologically to derive intelligence for effective decisions.
• Volume refers to the size of the data.
• Variety describes whether the data is structured, semistructured or unstructured.
• Velocity is the speed at which data pours in and how frequently it changes.

Building upon basic graphing and visualization techniques, SAS Visual Analytics has taken an innovative approach to addressing the challenges associated with visualizing big data. Using innovative, in-memory capabilities combined with SAS Analytics and data discovery, SAS provides new techniques based on core fundamentals of data analysis and the presentation of results.

Handling Large Data Volumes

One challenge when working with big data is how to display results of data exploration and analysis in a way that is meaningful and not overwhelming. You may need a new way to look at the data that collapses and condenses the results in an intuitive fashion but still displays graphs and charts that decision makers are accustomed to seeing. You may also need to make the results available quickly via mobile devices, and provide users with the ability to easily explore data on their own in real time.

When working with massive amounts of data, it can be difficult to immediately grasp what visual might be the best to use. The autocharting capability in SAS Visual Analytics takes a look at the data you wish to examine and then, based on the amount of data and the type of data, it presents the most appropriate visualization. This intelligent autocharting helps business analysts and nontechnical users easily visualize their data. They can build hierarchies on the fly, interactively explore data and display the data in different ways to answer specific questions or solve new problems without having to rely on constant assistance from IT to provide changing views of information.

In addition, “what does it mean” explanations in SAS Visual Analytics display information about the analysis that has been performed, and identify and explain the relationships between the variables that are displayed (see Figure 7). This makes analytics and the creation of data visualizations easy, even for those with nontechnical or limited analytic backgrounds.

Figure 7: SAS Visual Analytics provides autocharting and “what does it mean” pop-ups to help nontechnical users create and
understand data visualizations. The “what does it mean” pop-up (bottom) explains that the correlation shown in this binned box
plot indicates a strong linear relationship between Sales Rep Rating and Vendor Satisfaction.
Data volume can become an issue because traditional architectures and software may not be able to process huge amounts of data in a timely manner, thus requiring you to make compromises and aggregate the details you want to visualize. Even the most common descriptive statistics calculations can become complicated when you are dealing with big data and don’t want to be restricted by column limits, storage constraints and limited support for different data types. The SAS in-memory engine solves these issues by speeding up the task of data exploration, and a visual interface displays the results in an easy-to-understand visualization.

For example, what if you have a billion rows in a data set and want to create a scatter plot on two measures? It would be impossible to see so many data points. And the application creating the visual may not be able to plot a billion points in a timely or effective manner. One potential solution is to use binning.

Box plots are another example of how the volume of data can affect the visual being shown. A box plot is a graphical display of five statistics (the minimum, lower quartile, median, upper quartile and maximum) that summarize the distribution of a set of data. The lower quartile (25th percentile) is represented by the lower edge of the box, and the upper quartile (75th percentile) is represented by the upper edge of the box. The median (50th percentile) is represented by a central line that divides the box into sections. Extreme values are represented by whiskers that extend out from the edges of the box. Usually, these display well when using big data (see Figure 8).

Often, box plots are used to understand the outliers in the data. Generally speaking, the number of outliers in the data can be represented by 1 percent to 5 percent of the data. With traditionally sized data sets, viewing this proportion of the data is not necessarily hard to do. However, when you are working with massive amounts of data, viewing 1 percent to 5 percent of the data is challenging.

For example, if you were working with a billion rows of data, the outliers would represent 10 million to 50 million data points. If you bin the results and show a box plot with whiskers (Figure 8), you can view the distribution of the data and see the outliers – all calculated quickly on big data.

Figure 8: This box plot compares the distribution of data points within a category.
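The five statistics behind a box plot, and the scale of the outlier problem on a billion rows, can be checked with a short Python sketch. It uses the standard library's `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions; the paper does not say which convention SAS Visual Analytics applies, and the sample values are made up.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum of a data set."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)

# Outlier share of 1 percent to 5 percent applied to a billion rows.
total_rows = 1_000_000_000
low_estimate = total_rows * 1 // 100
high_estimate = total_rows * 5 // 100
print(f"{low_estimate:,} to {high_estimate:,} potential outlier rows")
```

Even the low end of that range is far more points than a screen can show individually, which is why summarizing the distribution first is attractive.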
Visualizing Semistructured and Unstructured Data With Word Clouds and Network Diagrams

The variety of big data brings challenges because semistructured and unstructured data require new visualization techniques. A word cloud visual (where the size of the word represents its frequency within a body of text) can be used on unstructured data as a way to display high- or low-frequency words (see Figure 9).

SAS Visual Analytics takes the concept of word clouds a step further by taking advantage of taxonomies and ontologies to make associations. Words are then organized into topics based on how the words are used. SAS Visual Analytics word clouds can display the hot topics of the day gleaned from such text analysis. Users can drill down by clicking on an individual topic to see exactly what words or phrases comprise that topic.

Another visualization technique that can be used for semistructured or unstructured data is the network diagram. Network diagrams view relationships in terms of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals, such as friendship, kinship, organizations, business relationships, etc.). These networks are often depicted in a diagram where nodes are represented as points and ties are represented as lines.

Network diagrams can be used in many applications and disciplines. For example, businesses analyze social networks to understand their interactions with customers, while counterintelligence and law enforcement might map a clandestine or covert organization such as an espionage ring, an organized crime family or a street gang. You can also superimpose the network diagram on a map, for example, to show the relationship or product sales across geographic areas (see Figure 10).
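The basic word cloud idea described earlier in this section, where a word's size reflects its frequency, can be sketched as a simple frequency count. This covers only the plain frequency version, not the taxonomy- and ontology-based topic grouping; the function name, the tokenizing rule and the proportional font-size scaling are assumptions made for illustration.

```python
import re
from collections import Counter


def word_sizes(text, max_size=30):
    """Map each word to a display size proportional to its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; the rest scale down linearly.
    return {word: max_size * count / top for word, count in counts.items()}


sizes = word_sizes("Data data DATA chart chart insight")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.0f}")
```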

While visualizing structured data is fairly simple, semistructured or unstructured data requires new visualization techniques, such as word clouds or network diagrams.

A correlation matrix combines big data and fast response times to quickly identify which variables among the millions or billions are related. It also shows how strong the relationship is between the variables. Darker boxes indicate a stronger correlation, and lighter boxes indicate a weaker correlation (see Figure 11). If you hover over a box, a summary of the relationship is shown. You can double-click on a box in the matrix for further details.

Figure 10: Network diagrams explore relationships within a data set, including connections across geographic areas.
Figure 11: In this correlation matrix, darker boxes indicate a stronger correlation; lighter boxes indicate a weaker correlation.
You can double-click on a box for further details.
Another concern with big data is cardinality because the data may have many unique values per column. If there are too many columns in your bar chart, you cannot see the labels for each bar and the graph becomes less meaningful.

SAS has adopted a method for dealing with high cardinality in SAS Visual Analytics – bar charts with an overview bar that zooms into the bar chart and enables information consumers to scroll through the entire chart. The level of zoom can also be controlled. If you compare Figure 12 to Figure 13, it is easy to see that Figure 13 presents the information more clearly.

Figure 13: An overview axis bar chart shows the high cardinality in big data more clearly. You can scroll through the entire chart.
Data Visualization Made Easy With Autocharting

In SAS Visual Analytics, intelligent autocharting produces the best visual based on what data you drag and drop onto the visual palette. It is important to note that autocharting may not always create the exact visualization you had in mind. In that case, you also can select a specific visual to build. However, when you are first exploring a new data set, autocharts are useful because they provide a quick view of the data. You then have the ability to switch to another specific visual as desired. For example, with autocharting, when a single measure is selected, distribution of that measure is shown (Figure 14).

The addition of a second measure results in either an autocharted heat map (Figure 15) or a scatter plot (Figure 3).

A category of data can be one of three types: standard, date or geographic. When the category type is standard, SAS Visual Analytics will show a frequency count of data (see Figure 16 below). If the category is a date, then a measure is also required and the visual will be a line graph (see Figure 1). If the category is geographic, then a map will be displayed.

Autocharting in SAS Visual Analytics also takes into account the cardinality of the data and adjusts the visuals accordingly. As mentioned previously, if cardinality is deemed high, a bar chart with an overview axis is displayed (see Figure 13).

Figure 14: Autocharting in SAS Visual Analytics produces a bar chart to show the distribution of a single measure.
Figure 15: With autocharting, two measures can either result in a heat map (above) or a scatter plot.
Figure 16: When the data category is standard, SAS Visual Analytics displays a frequency count in the form of a simple bar chart.
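The autocharting rules listed in this section can be summarized as a small decision function. The sketch below only restates those rules in Python; it is not the SAS Visual Analytics implementation, the function and argument names are hypothetical, and because the paper does not say how the product chooses between a heat map and a scatter plot for two measures, that case is returned as a single combined label.

```python
def autochart(measures=0, category_type=None, high_cardinality=False):
    """Suggest a visual from the number of measures and the category type.

    category_type is one of None, "standard", "date" or "geographic".
    """
    if category_type == "date":
        if measures < 1:
            raise ValueError("a date category also requires a measure")
        return "line graph"
    if category_type == "geographic":
        return "map"
    if category_type == "standard":
        if high_cardinality:
            return "bar chart with overview axis"
        return "frequency bar chart"
    if measures == 1:
        return "distribution bar chart"
    if measures == 2:
        return "heat map or scatter plot"
    raise ValueError("no autocharting rule is described for this selection")


print(autochart(measures=1))
print(autochart(measures=2))
print(autochart(measures=1, category_type="date"))
print(autochart(category_type="standard", high_cardinality=True))
```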
See Into the Future With Forecasting Techniques

Forecasting estimates future values for your data based on statistical trends. As such, it is an extremely important tool for organizational planning. Fortunately, SAS Visual Analytics can help you expand the culture of forecasting in your organization. Easy-to-use capabilities take the complexity out of forecasting, so that users of all skill levels can see for themselves what might happen in the future.

A simple menu guides users through the process of generating forecasting results. Select the date, time or date-time data items you want to use for the forecast. The software automatically chooses the most appropriate forecasting algorithm for the data chosen. You also have the option to select the forecasting intervals. When you click OK, a line chart is created, along with a clear explanation of the forecasting results in the “what does it mean” section at the bottom of the screen, as shown in Figure 17. This is just another way SAS Visual Analytics brings advanced analytics to nontechnical users in an approachable format.

When additional measures are added to the forecast (as shown in Figure 18), three things happen in SAS Visual Analytics:
1. Each variable is evaluated to determine whether it “influences” the forecast. Variables deemed to be influencers are added to the bottom of the screen for simulation purposes.
2. When influencers are found, the forecast is recalculated and refined. As you can see, the “confidence interval” (light blue bars) around the forecast in Figure 18 is much tighter than in Figure 17.
3. Users can manipulate the values of the influencing variables to see the potential impact on the forecast, in effect performing simulations.

Figure 17: With automated forecasting capabilities, SAS Visual Analytics chooses the most appropriate forecasting algorithm
for the selected data. “What does it mean” pop-ups (bottom of screen) provide explanations of analytic functions and data
correlations, so even nontechnical users can understand what the data means.
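As a minimal illustration of estimating future values from a statistical trend, the sketch below fits a straight line to past values by least squares and extends it forward. This is a deliberately simple stand-in: the paper says the software selects a forecasting algorithm automatically and does not name it, so the linear trend model, the function name and the sample series are all assumptions.

```python
def linear_trend_forecast(history, steps):
    """Fit y = intercept + slope * t by least squares and extend it `steps` periods."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_t = (n - 1) / 2              # mean of the time index 0, 1, ..., n-1
    mean_y = sum(history) / n
    numerator = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(history))
    denominator = sum((t - mean_t) ** 2 for t in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_t
    # Forecast the periods that follow the last observation.
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical series that grows by 2 units per period.
forecast = linear_trend_forecast([10, 12, 14, 16], steps=2)
print(forecast)
```

A real forecasting tool would also report a confidence interval around each projected value, which this sketch omits.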
Figure 18: When additional measures are added, these underlying factors are evaluated for their potential impact on the forecast,
the forecast is recalculated accordingly, and users can use these additional values to perform simulations.
Understanding Influence With Decision Trees

the decision tree. A strong relationship is defined as one where knowledge of the value of an input improves the ability to predict the value of the target.
Figure 19: On this decision tree, we can see the data segmented according to the various branching points. The right side shows
the various fine-tuning parameters available in “Custom Mode.”
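One common way to quantify how much knowing an input improves prediction of a target is information gain, the drop in entropy of the target after splitting on the input. The sketch below illustrates that idea on made-up data; the paper does not state which measure SAS Visual Analytics uses to build its decision trees, so information gain, the function names and the sample values are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(inputs, targets):
    """Reduction in target entropy obtained by grouping rows on the input value."""
    total = len(targets)
    groups = {}
    for value, target in zip(inputs, targets):
        groups.setdefault(value, []).append(target)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(targets) - remainder


# Hypothetical data: "region" separates buyers perfectly, "channel" not at all.
targets = ["buy", "buy", "no", "no"]
region = ["east", "east", "west", "west"]
channel = ["web", "store", "web", "store"]

gain_region = information_gain(region, targets)
gain_channel = information_gain(channel, targets)
print(f"region: {gain_region:.2f} bits, channel: {gain_channel:.2f} bits")
```

In this toy data, knowing the region removes all uncertainty about the target, while knowing the channel removes none, so region would be the stronger relationship.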
Visualizing Flow With Sankey Diagrams

Sankey diagrams use path analysis to show the dynamics of how transactions move through a system (e.g., how customers navigate your website). The diagrams display a series of linked nodes, where the width of each link indicates the frequency of the link or the value of a measure. This enables you to see flow patterns and recognize trends such as where customers enter your website, where they navigate to and where they exit. You can identify successful flow patterns or isolate flows that failed to deliver the desired action.

Figure 20: A Sankey diagram displays a series of linked nodes, where the width of each link
indicates the frequency of the link or value of the measure.
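The link widths in a Sankey diagram come from counting how often each step between two nodes occurs. The following sketch derives those counts from a handful of made-up website sessions; the page names and the function name are hypothetical, and a plotting library would still be needed to draw the diagram itself.

```python
from collections import Counter


def link_weights(paths):
    """Count how many times each (source, target) step occurs across all paths."""
    links = Counter()
    for path in paths:
        # Pair every node with the node that follows it in the same path.
        for source, target in zip(path, path[1:]):
            links[(source, target)] += 1
    return links


# Hypothetical visitor sessions, each an ordered list of pages.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "exit"],
    ["search", "products", "cart", "exit"],
]

links = link_weights(sessions)
for (source, target), weight in links.most_common():
    print(f"{source} -> {target}: {weight}")
```

The heaviest links show the dominant flow, and links ending in an exit node point to the places where visitors leave without completing the desired action.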
Conclusion
Visualizing your data can be both fun and challenging. It is
much easier to understand information in a visual compared to
a large table with lots of rows and columns. However, with the
many visually exciting choices available, it is possible that the
visual creator may end up presenting the information using the
wrong visualization. In some cases, there are specific visuals you
should use for certain data. In other instances, your audience
may dictate which visualization you present. In the latter
scenario, showing your audience an alternative visual that
conveys the data more clearly may provide just the information
that’s needed to truly understand the data.
The net effect is the ability to accelerate the analytics life cycle
and to perform the process more often, with more data. Users
can quickly view more options, ask more questions, make more
precise decisions and succeed faster than ever before.
To contact your local SAS office, please visit: sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are
trademarks of their respective companies. Copyright © 2014, SAS Institute Inc. All rights reserved.