
Goals, Process, and Challenges of Exploratory Data Analysis: An Interview Study

Kanit Wongsuphasawat* (Apple Inc.)    Yang Liu† (University of Washington)    Jeffrey Heer‡ (University of Washington)

arXiv:1911.00568v1 [cs.HC] 1 Nov 2019

ABSTRACT

How do analysis goals and context affect exploratory data analysis (EDA)? To investigate this question, we conducted semi-structured interviews with 18 data analysts. We characterize common exploration goals: profiling (assessing data quality) and discovery (gaining new insights). Though the EDA literature primarily emphasizes discovery, we observe that discovery only reliably occurs in the context of open-ended analyses, whereas all participants engage in profiling across all of their analyses. We describe the process and challenges of EDA highlighted by our interviews. We find that analysts must perform repetitive tasks (e.g., examine numerous variables), yet they may have limited time or lack domain knowledge to explore data. Analysts also often have to consult other stakeholders and oscillate between exploration and other tasks, such as acquiring and wrangling additional data. Based on these observations, we identify design opportunities for exploratory analysis tools, such as augmenting exploration with automation and guidance.

Index Terms: Human-centered computing—Visualization—Visualization techniques—Treemaps; Human-centered computing—Visualization—Visualization design and evaluation methods

1 INTRODUCTION

Exploratory data analysis (EDA), as introduced by Tukey [67], aims to complement formal confirmatory analysis with a “flexible attitude”, letting data exposure inform analysts’ modeling decisions [68]. With this attitude, analysts usually “explore” aspects of data by examining data values, derived statistics, and visualizations. Today, data exploration is widely adopted as a critical part of data science, both in industrial and scientific settings [36]. However, while analysts perform data exploration in various kinds of analyses, the EDA literature lacks a consistent definition of exploration goals. Moreover, little research has observed how analysis goals and context affect the day-to-day practice and challenges of EDA. Understanding these issues can inform the design of data exploration tools.

To better understand current EDA practices, we conducted semi-structured interviews with 18 analysts from academic and industrial settings. We asked the analysts to describe their analysis goals, tasks they performed, and challenges they faced in their exploration. We first describe observed analysis context and process. We discuss observed types of analyses that involve exploration and identify two common exploration goals: profiling (understanding what the data contain and assessing data quality) and discovery (gaining new insights). Though the EDA literature emphasizes discovery, we observe that all participants engage in profiling across all of their analyses, while discovery only reliably occurs in open-ended analyses, which participants perform less often. Based on the participants’ descriptions of their analysis process, we revise Kandel et al.’s model of the data analysis process [42] to include exploration. We also report the analysts’ context including tools, domain knowledge (or the lack thereof), and involved stakeholders.

Next, we discuss recurring observed challenges in the data analysis process and report how analysis goals and context impact them. We also describe how analysts handle challenges specific to exploration tasks including choosing variables to explore, handling repetitive tasks, and determining the end of an exploration. We find that analysts often have to explore numerous variable combinations, requiring them to apply domain knowledge to select and reduce the number of variables. As analysts perform repetitive tasks, they may curate analysis templates to automate their routines and help them follow best practices. Due to time limits, analysts may also need to move on to other tasks before completing their exploration.

Finally, we identify opportunities for data exploration tools. We argue that tools can help mitigate these observed challenges and facilitate rapid and systematic exploration by providing automation for routine tasks and guiding analysis practices. We also note a lack of support for data wrangling and navigation of analysis history within exploration tools.

2 BACKGROUND AND RELATED WORK

We build on the exploratory data analysis literature and complement prior work on understanding data analysis.

2.1 Exploratory Data Analysis

Exploratory data analysis stems from the collection of work by the statistician John Tukey in the 1960s and 1970s [24, 39, 40, 67]. His seminal book [67] compiles a collection of data visualization techniques as well as robust and non-parametric statistics for data exploration. Many communities including Statistics, Human-Computer Interaction, and Information Visualization have since contributed new data exploration tools and techniques (e.g., [15, 23, 27, 65, 66, 72]).

While Tukey did not explicitly define the goals of EDA, his and other writings about EDA [10, 12, 13, 28, 30, 35, 49, 54, 63, 70] mostly focus on the discovery of structure and patterns in the data, and consider EDA a step that precedes formal modeling or confirmatory analysis. However, some [17, 18, 73] argue that EDA also covers profiling [7, 55], or initial data examination to detect data quality issues. Some also state that EDA may occur without formal modeling [19]. As the prior literature lacks a consistent definition of EDA goals, our study helps clarify the nature and scope of EDA by providing evidence that EDA goals include both profiling and discovery. Though the EDA literature emphasizes discovery, we observe that discovery only reliably occurs in open-ended analyses, while all participants engage in profiling across all analyses. We also find that some analysts perform exploration to clean or summarize data without modeling involved. Besides characterizing goals, we also identify common challenges for data exploration and discuss how analysis goals and context affect them.

*e-mail: [email protected]. This work was done when the first author was at the University of Washington.
†e-mail: [email protected]
‡e-mail: [email protected]
2.2 Understanding Data Analysis

Many prior studies summarize high-level tasks and challenges in the data analysis process. Some focus on specific user groups and types of analysis. Kwon and Fisher [22] discuss visual analytic challenges for novices. Conversely, we study experts whose jobs primarily involve data analysis. A few studies [21, 44, 57] examine data analysis and sensemaking within intelligence agencies, which share many challenges with our findings due to the exploratory and collaborative nature of their work. However, these agencies often analyze text documents whereas our participants mostly explore structured data. For structured data, Guo [32] describes research programming practices while Fayyad et al. [26] discuss the process of algorithmic data mining.

Another group of works [8, 42, 73] discusses the general data analysis process. Closest to our work are the interview studies by Kandel et al. [42] and Alspaugh et al. [8], which derive analysis tasks and challenges based on interview data. However, Kandel et al. mostly focus on analysts that perform directed analyses (i.e., answering predefined questions) and largely overlook tasks and challenges specific to discovery. Meanwhile, Alspaugh et al. interview analysts about exploratory activities akin to this work, but focus only on open-ended exploration and do not discuss how analysis goals affect exploration challenges. In contrast, this study covers exploratory activities across open-ended and directed analyses. We complement these prior studies with the characterization of exploration goals (profiling and discovery) and details of how analysis goals and context affect EDA tasks and challenges. In §4, we also discuss how our characterization of the data analysis process differs from those of Kandel et al. and Alspaugh et al.

Some prior studies investigate specific issues in data analysis, e.g., the effects of latency [50] and multiple comparisons [77]. Some study specific tools such as computational notebooks [46, 60], interactive visualizations [11, 76], and dashboards [61]. In contrast, we study day-to-day practices of EDA, which involve many challenges and tools. For exploration challenges, Lam [48] discusses interaction costs for visualizing data such as repetitive physical motions and choosing data subsets. Kidd [47] observes that knowledge workers often focus on implications for decision-making rather than producing generalizable knowledge. Others examine low-level tasks for visual exploration. Amar et al. [9] present a taxonomy of low-level visual analytics tasks. A few studies [20, 38] identify operations that analysts perform to visualize data. Conversely, we focus on the high-level data exploration process.

Prior work also discusses some of the analysis challenges observed in our study. Many studies (e.g., [25, 41, 42, 45, 59]) discuss challenges for data wrangling such as data integration, data cleaning, and handling large data. Here we discuss how data wrangling couples with and impedes exploration.

Many researchers [14, 31, 38, 42, 58] have also noted the importance of analytic provenance. Ragan et al. [58] characterize types and purposes of provenance in visual analytics. Some studies [37, 42, 44] identify how and why analysts collaborate, and discuss impediments to collaboration. Rule et al. [60] describe the tension between exploring data and documenting insights for computational notebook users. Batch et al. [11] comment that visualization tools lack integration with data science workflows.

Though this study shares some findings with prior research, our work is the first, to our knowledge, to overview the day-to-day process and challenges of EDA for both profiling and discovery aspects, including examination of how analysts choose variables to explore and determine when an exploration should stop.

3 METHODS

To better understand day-to-day practices of exploratory data analysis, we conducted semi-structured interviews with experienced analysts across both academia and industry.

3.1 Participants

We interviewed 18 analysts (11 male, 7 female) from both academia and industry. As listed in Fig. 1, the participants worked on various research fields and industrial topics, and held a variety of job titles. In this paper, we use the term “analyst” to generally refer to any participant, as all participants’ jobs primarily involved data analysis.

To recruit participants, we emailed our contacts within our personal and professional networks to forward our recruiting emails to analysts in their organizations. We used a survey to screen for participants who had at least one year of data analysis experience and performed EDA at least once a month. The participants’ data analysis experience varied from 1-3 years to over 10 years. Most of them performed EDA on a daily or weekly basis, with the least frequent account being biweekly. While our recruitment strategy introduced potential sampling bias in the results, our primary goal is to characterize the space of day-to-day exploratory analysis processes and challenges, not to quantify how frequently each specific task occurs. To better quantify these results, other methods, such as surveys, could complement our findings.

3.2 Interview

We conducted semi-structured interviews with one interviewee at a time. Each interview lasted from 45 to 90 minutes. We interviewed analysts at their workplace when possible, and used video calls otherwise. For each interview, we began by describing the study objective, namely to understand current practices and difficulties of exploratory data analysis. We then asked open-ended questions and encouraged interviewees to describe their specific experiences, such as “walk us through a recent exploratory data analysis scenario.” Our questions aimed to learn about the following topics:

• What are your analysis goals and outcomes?
• What tasks do you perform during analyses?
• What tools do you use and how do you use them?
• How do you interact with other involved stakeholders?
• How do you choose parts of a dataset to explore?
• How long does an exploration take?
• How do you decide that an exploration is complete?
• What are the key challenges you face in exploratory analysis and how do you handle them?

3.3 Analysis

We analyzed the interview data using an iterative coding method. The first two authors independently coded all data. Throughout the coding process, we discussed disagreements and iteratively revised our codes to ensure consistency across coding sessions. The rest of this paper presents the results from this analysis (summarized in Fig. 1). We also include representative quotes from the interviews to support these results. We use P1-P18 to refer to the participants.

4 ANALYSIS PROCESS AND CONTEXT

From the interview responses, we first categorize analysis projects based on their overarching objectives and identify two kinds of exploration goals. We then report observed high-level tasks in the analysis process. We also discuss analysts’ context including tools, their operational and domain knowledge, and their collaboration with involved stakeholders.

4.1 Types of Analysis Projects

We asked the interviewees about the objectives of the projects that involved exploratory analysis. We observed four common project types, with varying levels of open-endedness.

Question Answering. All analysts (18/18) reported working on answering business and research questions, so they explored the data to check data quality before answering them. Many analysts (8/18) also noted that their questions, while predetermined, were sometimes open-ended and thus required exploration to discover answers, as P14 said:
[Figure 1, reflowed. The rotated column headers giving each participant’s job title are not recoverable from this extraction; per-participant check marks lost their alignment, so counts out of 18 participants are shown instead.]

Demographic
• Setting (I = Industry, A = Academia), P1–P18: I, A, A, A, A, I, A, I, I, I, I, I, I, I, A, I, I, I
• Data analysis experience (years), P1–P18: 3-5, 5-10, 1-3, 1-3, 5-10, 3-5, 3-5, 1-3, 3-5, 1-3, 1-3, >10, 1-3, 1-3, 5-10, >10, >10, 1-3

Project type
• Question Answering: 18/18 (with open-ended questions: 8/18)
• Open Exploration: 7/18
• Model Development: 10/18
• Data Preparation: 3/18

Exploration goals
• Profiling: 18/18
• Discovery: 13/18

Context: tools
• Excel / Google Sheets: 13/18
• BI tools (e.g., Tableau / PowerBI): 6/18
• General programming languages (R / Python / MATLAB / SAS): 15/18
• Data query languages (e.g., SQL / Scalding): 7/18
• Domain-specific / internal tools: 7/18

Context: analyst’s role
• Domain-specific analyst: 11/18
• Consultant: 9/18

Context: stakeholders
• Clients: 13/18
• Data owners: 10/18
• Analysis team members: 15/18

Data acquisition
• Locate existing data: 17/18
• Self-collected: 5/18

Data wrangling
• Combining multiple datasets: 12/18
• Dealing with data size: 15/18
• Converting data formats: 15/18
• Deriving new forms of data: 13/18
• Handling erroneous values: 13/18

Exploration: observed tasks
• Directly looking at raw data values: 12/18
• Computing summary statistics: 8/18
• Examining histograms & count plots: 15/18
• Computing box plots / density plots: 3/18
• Examining bivariate plots: 18/18
• Examining plots with more than 2 variables: 11/18

Exploration: process challenges
• Choosing variables to explore: use domain knowledge to select variables (18/18); drop irrelevant / redundant variables (10/18); apply statistics and modeling techniques (7/18); dimensionality reduction (6/18)
• Handling repetitive tasks: 7/18
• Exploring unstructured data: 7/18
• Determining the end of exploration: mentioned “no definite answer” (5/18); criterion: goal satisfaction (18/18); criterion: stakeholder feedback (9/18); criterion: time limit (9/18)

Modeling
• Projects involve modeling: 12/18

Reporting & sharing
• Adjusting reports to match analysis audience: 10/18
• Analysis sharing & provenance: 8/18
Figure 1: A matrix of interviewees, their corresponding analysis context, and high-level tasks they perform in the analysis process.

“A lot of my work is more long-term open-ended research questions such as: how can we characterize the health of the users on our platform?”

Analysts often produced analysis reports in the form of written documents and presentation slides. They also sometimes built interactive dashboards.

Open-Ended Exploration. While answering specific questions was more common, several analysts (7/18) noted that they sometimes broadly explored data to summarize and look for new insights without a specific question. P17, a data science consultant, reported that his clients once gave him their website’s data and asked “Please just tell me about my site.” P5, an astronomer, also said:

“Occasionally we get data that’s surprising like the universe does something we haven’t seen before and a telescope caught it. Then you sit down with the data and think ‘What do I do now?’”

Akin to question answering, analysts often produced reports to describe insights from the open-exploration process.

Model Development. Many analysts (10/18) reported cases where they performed exploratory analysis to prepare for modeling projects, such as training machine learning models or developing new metrics and rules. Besides the models, analysts might also deliver reports, or integrate the solutions into dashboards as their project outcomes.

Data Publishing. A few analysts (3/18) explored data while cleaning datasets for publishing on shared repositories, so others could use the datasets for other analyses.

4.2 Exploration Goals

We asked the analysts why they performed data exploration in their analysis projects. From their descriptions, we categorize their goals into two common categories:

Profiling. A common goal for all analysts (18/18) was to learn what the data contained and assess if the data were suitable for the analyses. By broadly looking at the data and their plots, analysts could learn about their shapes and distributions, and detect data quality issues such as missing data, extreme values, or inconsistent data types. They might also check specific assumptions of the data, both in terms of expectations based on domain knowledge and mathematical assumptions required for modeling. By profiling, they learned if the data were ready for the analyses or if they needed to further wrangle the data or acquire more data.
• Acquisition: obtain data by locating existing data or collecting the data themselves.
• Wrangling: transform data to have a suitable format for analysis and to handle data quality issues.
• Exploration: examine data’s values, statistics, and visualizations to profile data or discover new insights.
• Modeling: build and evaluate statistical models for testing hypotheses or making predictions.
• Reporting: share the analysis results.

Figure 2: The analysis process couples exploration with many tasks including acquisition, wrangling, modeling, and reporting.

Discovery. Many analysts (13/18) also explored data to discover new insights or hypotheses, as P17 described that his exploration goal was to “be open-minded and learn what the data could tell me.” For question answering and modeling projects, analysts might focus on developing intuitions about how to answer questions or formulate models, such as learning about potential relationships between variables or rankings of feature importance. Some insights also inspired the analysts to broadly explore other relevant factors while some helped them form and investigate specific questions.

Analysts’ focus on exploration goals depended on project objectives. While the EDA literature (reviewed in §2.1) mostly focuses on discovery, we observed that profiling was a more common goal. Projects with fixed questions generally centered on profiling, though surprising observations from profiling sometimes prompted analysts to investigate and discover the causes of the surprises. Meanwhile, open-ended analyses involved both goals. Analysts often first focused on profiling, and shifted their focus to discovering new insights when they felt more confident about the data.

4.3 High-Level Tasks in the Analysis Process

From the interviewees’ responses about the tasks they performed in their analyses, we characterize the data analysis process as an iterative process that couples five common high-level tasks: acquisition, wrangling, exploration, modeling, and reporting (as listed and defined in Fig. 2). Some projects might omit some tasks. For example, though exploration often preceded modeling, some analysts (6/18) explored data to clean or summarize data without modeling involved. Some data were also clean and did not require wrangling.

The analysts’ process coupled exploration with many tasks. The analysts regularly explored data to assess if the data were relevant during acquisition. Similarly, they often explored data to decide how to wrangle them. Exploration also helped them discover the need to collect or wrangle more data. In addition, the analysts often reported exploration results to other stakeholders and gathered feedback for more exploration. While we observed less coupling between modeling and exploration, a few analysts examined training data when they observed poor modeling results.

Our characterization of analysis tasks is similar to those of Kandel et al. [42] and Alspaugh et al. [8]. However, as Kandel et al. focus on analysts that typically perform directed analyses (answering predetermined questions), they only list profiling rather than exploration as one of the tasks. Alspaugh et al., whose study focuses on open-ended analyses, augment Kandel et al.’s model by adding exploration as an alternative task to modeling. In contrast, as our study covers exploratory tasks for both directed and open-ended analyses, we found that analysts often explored data prior to modeling. They also often performed similar exploration tasks (examining the data’s values and derived statistics and visualizations, as described in §7.1) to profile data or discover new insights. Thus, we revise Kandel et al.’s model by replacing profiling with a more general exploration task, which subsumes both profiling and discovery goals.

In §5-8, we discuss common challenges in these tasks and report how analysts handled them. Though analysts also explored variations of models and outputs, this paper focuses on data exploration. We consider model diagnostics beyond the scope of this paper.

4.4 Analysis Tools

The interviewees reported using and switching between multiple tools throughout their analyses. A few (P1, P11, P18) were application users who usually looked at and wrangled data in spreadsheets, and visually explored data in Tableau.

The rest (15/18) were programmers who primarily used one language among Python, MATLAB, R, and SAS to analyze data. They usually plotted data with APIs such as Matplotlib [1] and ggplot2 [72]. Several of them also used computational notebooks (e.g., Jupyter [56]) to keep history for repeating and revising their analyses. Some noted that they preferred exploring data via scripting instead of using graphical interfaces as they did not have to switch tools. However, the programmers switched to other tools in some cases. P6 sometimes explored data in Tableau when it could connect to the data sources. Several used spreadsheets to inspect raw data, though they rarely wrangled data in spreadsheets like the application users. Many utilized languages such as SQL and Scalding to fetch and manipulate the data. Some used Tableau [66], Google Data Studio [3], or Microsoft PowerBI [5] for reporting.

Several analysts sometimes had to use domain-specific tools. P3 explored biopsy images from a 3D scan with a specialized tool. A few industrial analysts also noted that their internal data platforms had some support for data wrangling and exploration. As domain-specific tools often had limited features, analysts preferred to use general-purpose tools if possible. However, their data often resided in domain-specific tools and exporting data was sometimes difficult.

4.5 Operational and Domain Knowledge

The analysts typically needed operational and domain knowledge in their analyses. They must know where the data were stored, and how the data were collected and processed. They also needed domain expertise to interpret the data and detect errors. Since analysts usually lacked some required knowledge, they had to learn more about the problem domains and consult other stakeholders.

Job roles also affected the levels of operational and domain knowledge that analysts had. We observed that the analysts had two kinds of job roles relative to their problem domains: domain-specific analysts (9/18) and consultants (7/18), with two (2/18) straddling both roles in different phases of their careers. In academia, most researchers focused on their research topics, but one (P7) was a statistician providing solutions to multiple research domains. In industry, there were both analysts embedded into product teams and consultants who served internal or external clients. As consultants typically worked with a broader set of domains, they often had less domain expertise and relied more on other stakeholders, as P17 said:

“Since I’m not embedded with the team, I don’t have the domain context. In this example where I saw elevated counts in the product’s telemetry, I didn’t know what it meant. I could guess, but I’m not on the team, so I have no idea.”

4.6 Stakeholders and Collaboration

We observed that analysts collaborated with a few types of stakeholders over the course of their analysis projects.

Clients. Most analysts (13/18) had clients who prompted them to perform the analyses and were the direct audience for the results. Some analysts were consultants who served external clients while some worked with internal clients within their organizations, such as product managers or executives. Analysts often interacted with clients in an iterative fashion. Besides reporting the final results, analysts might share preliminary results and ask the clients for feedback, such as verifying if the results matched the clients’ prior knowledge and checking if the analyses aligned with the project goals.
Data owners. Many analysts (10/18) interacted with data engineers or database administrators who curated, processed, or stored the data prior to their analyses. Clients were also sometimes data owners, directly providing the data for the analysts. Analysts often asked the data owners to provide additional information to help them locate, clean, and understand the data, since data owners had a better understanding of the format and meaning of the data as well as where the data were stored and how they were processed.

Analysis Team Members. Though the analysts primarily analyzed data on their own, most of them (15/18) were members of analysis teams. Thus, they regularly obtained feedback from fellow analysts and supervisors before presenting to clients. Typical feedback included additional questions to explore, technical advice for analysis techniques and implementation, and suggestions to make the reports easier to understand for the clients. Moreover, a few interviewees noted that they worked jointly with their colleagues on some projects. Two reported splitting the work so each team member could focus on an independent scope and make progress in parallel. Another mentioned that she and her colleague independently analyzed the same data and cross-checked if they arrived at the same results.

Besides supervisors and fellow analysts, a few interviewees had colleagues with more domain expertise in their teams. P3’s medical device research team had a pathologist to give opinions on tumor image analysis. P16, a data science consultant, also reported that his organization included business-oriented “solution managers”, whose duties were to bridge the communication gap between the clients and technical-oriented data scientists and help them define deliverables that matched the clients’ goals.

5 DATA ACQUISITION

We now discuss challenges for data exploration and relevant activities. The first step is to acquire the data necessary for the analysis. All but one interviewee (17/18) reported working with existing datasets. For business analysts, most data were from product logs or customer surveys, while many researchers worked on datasets jointly collected by their research communities. Only some (5/18) had participated in data collection, either by collecting the data themselves or requesting that certain data should be collected.

When working with existing data, finding relevant data was difficult for a few reasons. First, data were often distributed. Several analysts reported that their companies used multiple data storage infrastructures. A few researchers also mentioned that their datasets were collected and published by different research organizations. Thus, analysts typically had to search for data in many places. Moreover, data sources often had insufficient data descriptions, having uninformative column labels and missing or outdated documentation. As a result, analysts had to explore all potential datasets to assess if they were relevant to their analyses.

Some analysts consulted data owners to locate and understand the data. They often received connections to the data owners from their clients or colleagues. However, P14 noted that finding the right people to talk to was difficult since she worked in a remote office. Analysts also used keyword search to look for relevant datasets in their databases. However, as the same data could be named in many ways, they had to try many different keywords to find the data. For some analysts, their data sources might not have convenient search capability at all. Due to this problem, P2 noted that she was building a searchable database for her organization.

Consulting analysts often received their data from the clients. However, the provided data might lack the information necessary to achieve the project goals, requiring the analysts to search for more appropriate data or otherwise terminate the project.

6 DATA WRANGLING

We observed that analysts often coupled data wrangling with exploration. As analysts received new data, they might want to explore the data. However, data often came from many sources or had an improper format and size for analysis tools. Thus, analysts had to transform the data prior to exploration. Once they explored the data, they might discover that they needed to further handle erroneous values or rescale the data. Due to this coupling, some analysts even associated exploratory analysis with data cleaning.

Akin to prior work [42], several analysts reported that they often spent the majority of their analysis time wrangling and cleaning data. As exploration tools often lack support for some wrangling tasks, they had to switch between tools throughout the analysis and migrate data between these tools. We now identify commonly observed wrangling tasks that coupled with or impeded exploration.

6.1 Combining Multiple Datasets

Many analysts (12/18) had to join multiple datasets or integrate similar datasets from multiple sources, both of which presented many challenges. To understand the similarities and differences between datasets, they might have to profile the datasets while combining them. P6 and P10 complained that they often had to join data from over 20 tables. Three analysts also had to use many scripting languages to fetch the data from multiple different platforms.

One common challenge was the inconsistency between data sources. P5, an astronomer, reported that different telescopes published data using various time systems, so she spent a few days just to get the data on the same time systems before she could combine them. P11 also described joining data with different levels of granularity: “Voting data is collected at precinct level while health data is at a state level, and population data is served at zip code level”.
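To make the granularity mismatch concrete, the following is a minimal pandas sketch; the tables and column names are hypothetical, not from the study. The finer-grained table is rolled up to the coarser level before joining.

```python
import pandas as pd

# Hypothetical tables at different granularities.
population = pd.DataFrame({"zip": ["98101", "98052", "10001"],
                           "state": ["WA", "WA", "NY"],
                           "population": [45000, 68000, 23000]})
health = pd.DataFrame({"state": ["WA", "NY"],
                       "uninsured_rate": [0.06, 0.05]})

# Aggregate the fine-grained table to the coarser level first, so both
# sides of the join describe the same unit of analysis (here, a state).
pop_by_state = population.groupby("state", as_index=False)["population"].sum()
merged = pop_by_state.merge(health, on="state", how="left")
```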
6.2 Dealing with Data Size

Most analysts (15/18) had to deal with data size, which increased data processing time, impeded sharing, or even crashed their analysis tools. P14 mentioned that it took her a few days just to retrieve the data. P3 noted that it was “extremely difficult to share a 250GB file”. Several analysts complained that large datasets did not work in R. P11 was annoyed that his data crashed both Excel and Tableau.

The analysts applied a few strategies to handle large datasets. Some (8/18) reduced data size by sampling the data. P9 and P10 noted that their challenges for sampling included “figuring out how large of a sample size we needed and balancing how long it would take to run” as well as “determining how to get meaningful and representative samples”.

Some analysts (8/18) also reduced data size by filtering interesting or relevant subsets based on their domain knowledge or suggestions from domain experts. P15 also applied signal processing techniques to detect signals of interest from audio data, so she could explore just the relevant data. However, analysts might not know in advance how to filter the data until they explored the data.

Some interviewees (4/18) handled large datasets by aggregating them. One difficulty for aggregation was deciding the level of detail. For example, aggregating time series by milliseconds could make the aggregated data too large, while aggregating by year might eliminate important details for the analysis. However, as analysts sometimes lacked specific questions during exploration, they might not initially know the right aggregation level and thus had to re-aggregate the data many times during the exploration.
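In a pandas sketch, both strategies reduce to one-liners; the hard judgment calls the interviewees describe are the sample fraction and the aggregation level. The event log below is synthetic.

```python
import numpy as np
import pandas as pd

# Hypothetical event log: one row per event, at second granularity.
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", periods=100_000, freq="s"),
    "value": np.random.default_rng(0).normal(size=100_000),
})

# Strategy 1: iterate on a random sample; choosing a defensible
# fraction is the hard part the interviewees point out.
sample = df.sample(frac=0.01, random_state=0)

# Strategy 2: aggregate to a chosen level of detail. Too fine keeps the
# data large; too coarse (e.g., yearly) can hide important details.
daily = df.set_index("timestamp").resample("D")["value"].agg(["count", "mean"])
```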
6.3 Converting Data Formats

Most analysts (15/18) had to convert data into formats expected by their analysis tools. Common formatting tasks included converting file formats and character encodings as well as manipulating data layout such as splitting data columns and reshaping datasets into long formats. A common complaint was that data formatting was time-consuming. Several analysts also complained that they had to manually format spreadsheets that did not have rectangular shapes.
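As an illustration, a small pandas sketch of two such chores, with made-up data: reshaping a wide, spreadsheet-style table into long format, and splitting a combined column.

```python
import pandas as pd

# Hypothetical wide-format export: one column per year.
wide = pd.DataFrame({"country": ["US", "CA"],
                     "2017": [1.0, 2.0],
                     "2018": [1.5, 2.5]})

# Reshape into the long format that most analysis tools expect.
long = wide.melt(id_vars="country", var_name="year", value_name="value")

# Another common chore: splitting a combined column into parts.
parts = pd.Series(["WA-98101", "NY-10001"]).str.split("-", expand=True)
```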

6.4 Deriving New Forms of Data

Many analysts (13/18) derived new forms of data more appropriate for their analyses. Many often rescaled data by normalizing data into certain ranges (e.g., 0 to 1) or applying logarithmic transformation to make them more normally distributed. Several applied low-pass filters or calculated moving averages to reduce noise in the data. P14 and P17 coded new high-level categories from the original low-level categories. As we will discuss in §7.4, analysts also often derived tabular forms of unstructured data (e.g., by calculating statistics) so they could explore and analyze the new data.
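A short NumPy/pandas sketch of these derivations, on illustrative values; the window size and target range are per-dataset choices, not recommendations from the interviews.

```python
import numpy as np
import pandas as pd

s = pd.Series([120.0, 80.0, 3000.0, 95.0, 150.0])  # illustrative values

minmax = (s - s.min()) / (s.max() - s.min())        # rescale into [0, 1]
logged = np.log1p(s)                                # tame a right-skewed variable
smoothed = s.rolling(window=3, center=True).mean()  # reduce noise
```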
6.5 Handling Erroneous Values

Most analysts (13/18) had to handle data errors such as missing data and extreme values. Handling erroneous values was challenging since any decision to filter or impute the data required domain knowledge and might affect downstream analysis. Thus, analysts often explored other aspects of the data and consulted data owners before picking a filter condition or imputation method. Since they might later find that some errors were irrelevant to their analyses, analysts sometimes “piled up” errors and kept exploring until they knew which errors were important to handle.

7 DATA EXPLORATION

Once analysts wrangled their data to have a proper format and size, they would explore the data, which sometimes led them to acquire or wrangle more data. In this section, we first summarize the observed exploration process with a focus on tabular data, the common data form for all interviewees. We then discuss exploration challenges including choosing variables, handling repetitive tasks, exploring unstructured data, and determining when to stop exploration.

7.1 Observed Exploration Tasks

Analysts usually began exploring by checking what the data contained. For tabular data, analysts would look at table headers and, if available, read the data’s documentation. After knowing what the data were about, they would choose aspects in the data to explore (or stop exploring if the data were irrelevant). As we will discuss in §7.2, the analysts may reduce the number of variables if necessary.

Analysts applied various methods to examine tabular data. To profile the data, more than half of the analysts (12/18) directly looked at the data values (e.g., via a print command or spreadsheet software). Many (8/18) computed summary statistics such as the range and central tendencies for continuous variables and value counts for categorical variables. Most analysts (15/18) examined univariate distributions with histograms and count plots. P2 and P7 reported using box plots, while P12 used kernel density plots. Analysts sometimes wrangled a variable during exploration, e.g., by filtering irrelevant and missing values or rescaling the variable.

Analysts examined multivariate distributions for both profiling and discovery goals. They often checked certain distributions to verify their assumptions and investigated why some assumptions did not hold true. If their exploration goal included discovery, they would also explore various combinations of variables to see if they could learn interesting insights. Some of these insights might inspire them to further explore other relevant aspects of the data.

All analysts employed bivariate plots including bar, line, and scatter plots. A few (3/18) used 2D histograms, frequency tables, and contour plots. Many (11/18) also explored plots with more than two variables. In many cases, they encoded the third variable in a plot with colors. P4 and P16 also displayed surfaces of functions with two input variables using 3D plots. However, P16 noted it was sometimes difficult to see relationships from a 3D plot. To examine multiple variables at the same time, several used scatterplot matrices. P2 and P8 also used parallel coordinate plots. If there were too many variables, some analysts grouped variables into small batches to avoid making the scatterplot matrices too large. A few also grouped redundant variables identified via correlation plots.

The analysts reported that a straightforward exploration may take a few hours to a few days. However, the data were often dirty or incomplete, requiring them to acquire or wrangle more data before they continued exploring. Moreover, analysts often had to consult and get feedback from clients or colleagues. However, these stakeholders might not be immediately available to help, so the analysts had to switch to other projects while waiting. For these reasons, exploration may take several days or even weeks.
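For the programmers among our participants, this first pass reduces to a few commands. A minimal sketch with pandas and Matplotlib, on a hypothetical table whose column names are invented for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical table; in practice this would come from a file or query.
df = pd.DataFrame({"category": ["a", "b", "a", "c"],
                   "amount":   [10.0, 12.5, 9.8, 250.0],
                   "duration": [1.2, 3.4, 0.9, 2.2]})

print(df.head())                      # look directly at raw values
print(df.describe())                  # ranges and central tendencies
print(df["category"].value_counts())  # counts for a categorical variable

df["amount"].plot.hist(bins=10)             # univariate distribution
plt.show()
df.plot.scatter(x="amount", y="duration")   # one bivariate view
plt.show()
```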
7.2 How to Choose Variables to Explore?

One common challenge was choosing variables to explore. The interviewees generally reported that they were comfortable exploring datasets with up to one or a few dozen variables. However, many (12/18) had to analyze datasets with several dozens to hundreds of variables and mentioned that the number of variable combinations to explore was a challenge for them. P16 complained that picking variables was “too time-consuming”. P2 said that “choosing variables was harder than plotting itself.” P10 even said he “sometimes skipped plotting if there were too many variables.”

When there were fewer variables, analysts typically examined univariate distributions of all variables and, if possible, all bivariate distributions. If there were too many combinations, they often tried to choose around 10-20 variables using a number of criteria. In addition, they sometimes applied dimensionality reduction techniques.

All interviewees regularly applied domain knowledge to choose variables. For profiling, they often examined variables related to their assumptions based on prior knowledge or suggestions from involved stakeholders. For question answering and modeling, analysts might explore variables they considered relevant to their questions or likely to affect the dependent variables. For open-ended exploration, analysts might wander through data based on what they found interesting. Though a common difficulty was deciding what would be interesting for the audience, several analysts noted that they often explored relationships that might have implications for decision-making. P11 also “drew diagrams between variables with potential relationships” to pick variables.

More than half of the analysts (10/18) reported criteria for dropping variables. They often discarded variables that were parts of their datasets, but irrelevant to their analysis. As datasets often contained duplicate or similar variables, three analysts also used correlation plots to group redundant variables. For each group, they then picked the variable that was the most reliable, having no outstanding data quality issues, and the most understandable for their audience.

Several analysts (7/18) applied statistics and modeling techniques to select variables. Some built simple models, such as shallow decision trees or random forests, to determine important features. P2 and P8 examined variables that correlated with dependent variables. However, these approaches have some limitations. P17 noted that industrial datasets often contained duplicated variables, which might cause some of them to appear less important in the model-building approach. P2 also noted that “sometimes there were many things that were too correlated but not important”.

Besides selecting variable subsets, several analysts (6/18) utilized dimensionality reduction techniques to explore large numbers of variables. Many used principal component analysis (PCA) and plotted the top eigenvectors. P14 also plotted data with t-SNE [51, 71]. However, dimensionality reduction could lead to interpretation difficulty, as P12 noted: “If I have a hyper-dimension that’s combining 1,000 different variables, I can’t explain to my audience what it means.”
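Two of the aids above, correlation grouping and PCA, can be sketched briefly with pandas and scikit-learn, reusing the hypothetical df from the previous sketch; the 0.9 correlation cutoff is an arbitrary illustration, not a value from the interviews.

```python
import pandas as pd
from sklearn.decomposition import PCA

num = df.select_dtypes("number")  # `df` as in the previous sketch

# Correlation matrix to spot (near-)redundant variables; analysts then
# keep the most reliable, most understandable one from each group.
corr = num.corr().abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.9]

# Dimensionality reduction as an alternative view; the components can
# be hard to explain to an audience, as P12 notes.
coords = PCA(n_components=2).fit_transform(num.fillna(num.mean()))
```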
7.3 Handling Repetitive Tasks

We found that repetitive tasks also impeded exploration for many analysts (7/18). Some said that they often had to “reinvent the wheel”, performing similar tasks in each exploration. P8 also wished for a better way to visualize multiple variables at the same time:

“I wish there were a tool that I can just browse through a gallery of each variable’s plot. It would be awesome to just browse through each of the variable’s distribution and outliers, then move on to the next one.”

Despite the abundance of guidelines for data analysis and visualization, some analysts also noted that most tools did not incorporate such knowledge or make it easily accessible. Thus, they had to manually apply the knowledge themselves. One common challenge, especially for programmers, was recalling how to run specific analysis commands. Another common complaint was the lack of good defaults in tools. P13 complained that Matplotlib often required additional customization to make plots look good. P17 was annoyed that many plotting libraries dropped null values by default without indicating that some values were dropped.

To avoid repetitive tasks and ensure that they followed best practices, some programmers compiled templates for commands they often used. P17 even wrote a script to generate a Jupyter notebook that included basic summary plots of all variables in a given dataset and ran basic checks for data quality issues, so he could begin exploration by browsing the notebook without rewriting analysis commands every time. Though templates were useful for saving analysis time, different datasets often had their own subtleties, so analysts needed to adjust their templates based on the data.
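A stripped-down sketch of such a template (this is not P17’s actual script): a single function that runs basic quality checks and then shows one summary plot per variable, approximating the gallery P8 wishes for.

```python
import pandas as pd
import matplotlib.pyplot as plt

def profile(df: pd.DataFrame):
    """Basic data-quality checks plus one summary plot per variable."""
    print(df.isna().sum())   # missing values per column
    print(df.nunique())      # flags constant or ID-like columns
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col].plot.hist(bins=30, title=col)
        else:
            df[col].value_counts().head(20).plot.bar(title=col)
        plt.show()           # browse one plot, then move to the next
```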
7.4 Exploring Unstructured Data

While all analysts (18/18) regularly worked with structured data (e.g., tables and networks), several (7/18) sometimes analyzed unstructured data (e.g., text, audio, genomic sequences, and images). A common challenge was the lack of methods for exploring a large collection of unstructured data. Thus, analysts often derived new forms of data and explored the new data instead. P13 and P16 computed word frequencies for text data. P2 calculated missing call rates for genomic sequences. P15 also applied signal processing techniques to extract signals of interest from audio data. However, when it was difficult to derive a new form of data, the analysts might have to sample the data instead. For example, P3 profiled a large collection of image data by directly examining a small set of samples.
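For example, the word-frequency derivation reduces a text collection to a tabular form that standard exploration tools can summarize and plot (a toy sketch with a stand-in corpus):

```python
from collections import Counter

docs = ["the data is large", "the data is messy"]  # stand-in corpus

# Derive a tabular form (word frequencies) from unstructured text.
freq = Counter(word for doc in docs for word in doc.lower().split())
print(freq.most_common(10))
```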
7.5 When Does the Exploration End?

As exploratory analysis is open-ended by nature, a common challenge was deciding when the exploration should end so analysts could move on to the next tasks. When asked how they decided to end an exploration, a handful of interviewees (5/18) responded that they did not always have a definite answer if an exploration was complete. From the interviews, we found that analysts decided to end an exploration based on multiple factors including goal satisfaction, feedback from involved stakeholders, and time constraints.

Goal Satisfaction. All analysts (18/18) often ended an exploration once they were satisfied with their goal. For discovery in question answering and modeling projects, they concluded when they had an intuition on how to formulate the answers or the models. For profiling, analysts usually stopped when they had verified all assumptions and felt they had a good sense of the data. Analysts might move on if they thought they had done a sufficient job, as P17 said:

“If I reached a point where I no longer saw glaring issues, I’m done. It does not mean the data is clean. ... However, I’m not seeing any other issues in the data, so they’re small enough that I don’t need to care about them.”

The analysts’ confidence in whether they had done a sufficient job varied based on the exploration goals. Analysts generally felt confident about profiling, as P6 noted that “just looking at distributions is not that hard.” However, they were sometimes less confident, especially when the data were large. P3, who profiled samples of large image collections, reported that he felt “confident for 90% of the time”, but sometimes worried that he might have missed important errors in the data. For discovery, analysts generally felt less confident as the goal was more open-ended by nature. P5 even revealed that she never knew if she had comprehensively explored the data when exploring a dataset she had not seen before.

Stakeholder Feedback. Since determining if they had sufficiently achieved the exploration goals could be difficult, analysts generally performed multiple rounds of exploration where they received feedback from team members and clients in between. They then used feedback from team members and clients to assess if they needed to further explore the data. Some analysts described that they would stop profiling once their clients and colleagues no longer had concerns about the data. A few also noted that early feedback from colleagues sometimes helped them terminate a low-value project early and let them focus on more important projects.

For open-ended discovery, analysts often ended a round of exploration when they had shareable results. One industrial analyst (P9) mentioned that he usually stopped when he found a result “worth sitting down and discussing.” P5, who used a large public dataset for her research, mentioned that she stopped exploring when she “discovered enough material [to analyze] for a paper”. By writing a paper, she then received feedback from the research community, driving her to do further analysis. Analysts also sometimes stopped exploring if the data had nothing interesting.

Time Constraint. Half of the analysts (9/18) cited time limits as a major factor that prompted them to stop exploring. P16 said that “it is okay to explore data for a few weeks, but after that I will need to start the other parts of the work.” P17 also noted the pressing nature of his work: “we are developing models, and we have to deliver. It’s happened that we have some stones left unturned—sometimes we come back, sometimes we don’t.” For time-sensitive projects, analysts might skip some parts of the exploration, as P8 said:

“If I have to do it fast, I would not spend most of my time in exploratory analysis. I’ll do some spot checks like just checking the ranges. I would not even look at the distributions and just go right into modeling.”

Since analysts often had limited time to explore large amounts of data, it was difficult to perfectly explore all aspects of the data. Thus, they sometimes returned to exploration after moving on to modeling. A few of them also reported that poor modeling results led them to further explore if any data quality issues caused the problems.

8 REPORTING AND SHARING ANALYSIS

As data analysis is iterative and collaborative, analysts had to share their analysis results throughout the process. We now discuss common challenges for sharing analysis results.

8.1 Adjusting Reports to Match Analysis Audience

Many analysts (10/18) needed to adjust their reports to match their analysis goals and the audience’s background. P17 mentioned that his goal was to “produce insights for the audience with the least amount of effort for them to understand.” P18 also noted that “explaining complicated things in a simple way” was the hardest thing in data analysis.

We observed a few strategies for simplifying analysis reports. First, analysts typically avoided using sophisticated plots, such as box plots, in reports for stakeholders with less data analysis expertise. Moreover, while their explorations might have many delicate details, they often presented only the most important findings, such as ones that had implications for decision-making. However, a challenge was that their analysis audience had varying degrees of expectations. Some might even expect to explore the reports themselves, requiring the analysts to create dashboards for the reports.
The need to communicate with an audience also led the analysts to align their analyses with the audience’s background. When possible, they would choose concepts that the audience were familiar with. As discussed earlier in §7.2, one criterion to choose a variable from a group of redundant ones was whether the audience would understand it. P8 also reported that he avoided introducing a new metric in his analysis if there was a similar but widely-used metric.

8.2 Analysis Sharing and Provenance

Sharing analysis history across an organization was a common challenge for many analysts (8/18), as P14 said: “I often felt that I’m reinventing the wheel, but it’d take me a week to find somebody who already did something similar.” A few also reported that their companies tried to use collaboration platforms such as a wiki to share analysis summaries. However, these attempts eventually got abandoned because analysts did not want the extra work of writing a summary, in part because they had already presented their analysis via other forms of reports such as slides. P9 also noted the tension between doing more analysis and writing more reports:

“Given a fixed amount of time, do I answer more questions and go as far as I can or do I go slower and write more reports? Finding the balance is a bit hard.”

Analysts also had to revisit their own analysis history to repeat an analysis with new data, or to help them recall prior work when they summarized an analysis for reporting or switched from another project. As discussed earlier in §4.4, some analysts utilized computational notebooks to keep analysis history. However, some analysts had difficulty keeping analysis history, as P12 said:

“I and many other analysts I know often went through an awful lot of charts and later realized there were a few that we wanted but didn’t save along the way.”

9 DESIGN OPPORTUNITIES

Based on these interview results, we now identify design opportunities for improving data exploration tools.

9.1 Facilitate Rapid Exploration with Automation

From the interviews, we observe many challenges in exploratory data analysis that suggest opportunities to augment data exploration with automation and guidance.

First, as analysts often need to perform repetitive tasks and have limited time to explore data, tools should provide automation to help analysts focus on analyzing data rather than executing routine tasks. While some existing tools provide sensible defaults for plotting commands [72] or help automate chart design [52], these features are not yet available in popular analysis environments such as Python. More importantly, analysts still have to manually create charts one-by-one. As we observe that some analysts apply templates to automate chart generation and wish to browse charts without manually plotting them, tools can recommend charts for analysts to examine [74, 75].

As analysts noted in §7.3, tools can incorporate analysis practices into their recommendations. Since analysts should begin exploring data by examining univariate summaries of all variables [53, 63], tools can suggest these plots for the analysts [74, 75]. When an analyst plots an average of a variable, a tool may augment the plot with variance to convey uncertainty, or suggest robust statistics such as the median if there are outliers skewing the average (see the sketch at the end of this subsection). For large data, a tool may suggest approximate techniques such as sampling, online [...]

[...] are interested in, and thus overlook important insights in the data. Tools might reduce this risk by suggesting that analysts explore other aspects of the data and by promoting serendipitous discovery.

An important question is how to recommend data for analysts to explore. For profiling, tools may automatically detect and suggest variables with potential issues such as missing values or outliers [43]. For discovery, suggesting data is more challenging as the goal is open-ended. While prior work [64, 69, 74] leverages statistics for suggestions, we find that analysts mostly pick variables based on their interpretation of semantic relationships between variables, while statistical properties are sometimes irrelevant. Thus, tools should at least allow analysts to steer suggestions based on their interests. An open question is how to design an elicitation method that lets analysts convey domain knowledge such as how the variables could influence each other. Tools might then store and use this information to recommend relevant variables, and possibly help refute hypothesized relationships. As analysts in the same organizations often explore the same datasets at different times, tools may also leverage prior analyses to learn relevancy between variables.
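As a sketch of the robust-statistic suggestion above, a tool could trigger on a simple outlier test; the 1.5×IQR rule here is one common, assumed choice rather than a design from this study.

```python
import numpy as np

values = np.array([9.0, 10.0, 11.0, 10.0, 9.0, 180.0])  # one outlier

mean, median = values.mean(), np.median(values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
if outliers.any():
    # The kind of nudge a tool could surface next to a bar of averages.
    print(f"outliers present; median {median:.1f} may be more "
          f"representative than mean {mean:.1f}")
```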
9.2 Support Iterative and Collaborative Workflows
and go as far as I can or do I go slower and write more
reports? Finding the balance is a bit hard.” One observation is the lack of support for browsing and searching
Analysts also had to revisit their own analysis history to repeat an analysis with new data, or to help them recall prior work when they summarized an analysis for reporting or switched back from another project. As discussed earlier in §4.4, some analysts utilized computational notebooks to keep their analysis history. However, others had difficulty retaining that history, as P12 said:

“I and many other analysts I know often went through an awful lot of charts and later realized there were a few that we wanted but didn't save along the way.”
9 DESIGN OPPORTUNITIES

Based on these interview results, we now identify design opportunities for improving data exploration tools.
9.1 Facilitate Rapid Exploration with Automation

From the interviews, we observe many challenges in exploratory data analysis that suggest opportunities to augment data exploration with automation and guidance.

First, as analysts often need to perform repetitive tasks and have limited time to explore data, tools should provide automation to help analysts focus on analyzing data rather than executing routine tasks. While some existing tools provide sensible defaults for plotting commands [72] or help automate chart design [52], these features are not yet available in popular analysis environments such as Python. More importantly, analysts still have to manually create charts one by one. As we observe that some analysts apply templates to automate chart generation and wish to browse charts without manually plotting them, tools can recommend charts for analysts to examine [74, 75].

As analysts noted in §7.3, tools can also incorporate analysis best practices into their recommendations. Since analysts should begin exploring data by examining univariate summaries of all variables [53, 63], tools can suggest these plots for the analysts [74, 75]. When an analyst plots an average of a variable, a tool may augment the plot with variance to convey uncertainty, or suggest robust statistics such as the median if there are outliers skewing the average. For large data, a tool may suggest approximate techniques such as sampling or online aggregation [29, 34], or density-based plots such as histograms and binned scatterplots [16], instead of plotting all individual points.
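To make the shape of such automation concrete, the sketch below iterates over a pandas DataFrame, plots a default univariate summary for each column, and flags columns where outliers may make the median a better summary than the mean. The function name and the mean-versus-median heuristic are illustrative assumptions on our part, not features of an existing tool.

```python
import pandas as pd
import matplotlib.pyplot as plt

def suggest_univariate_plots(df: pd.DataFrame, max_categories: int = 20):
    """Plot a default univariate summary for every column, flagging
    columns where a robust statistic may be preferable to the mean."""
    for name, col in df.items():
        _, ax = plt.subplots()
        if pd.api.types.is_numeric_dtype(col):
            col.plot.hist(ax=ax, bins=30)
            # Heuristic (assumed, not validated): if the mean and median
            # diverge strongly, outliers may be skewing the average.
            mean, median, std = col.mean(), col.median(), col.std()
            if std > 0 and abs(mean - median) > 0.5 * std:
                ax.set_title(f"{name}: mean {mean:.2f} vs median {median:.2f} "
                             "(outliers may skew the mean; consider the median)")
            else:
                ax.set_title(name)
        else:
            # For categorical data, show the most frequent levels.
            col.value_counts().head(max_categories).plot.bar(ax=ax)
            ax.set_title(name)
    plt.show()
```

A recommender built on such a template could generate this whole batch of charts up front, letting the analyst browse rather than plot each variable by hand.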
Another key difficulty for data exploration is choosing variables to explore. We observe that analysts heavily rely on their judgment, including determining which variables are interesting and deciding if they have sufficiently explored the data. One potential risk is that analysts may be biased to focus on what they or their stakeholders are interested in, and thus overlook important insights in the data. Tools might reduce this risk by suggesting other aspects of the data for analysts to explore, promoting serendipitous discovery.
An important question is how to recommend data for analysts to explore. For profiling, tools may automatically detect and suggest variables with potential issues such as missing values or outliers [43]. For discovery, suggesting data is more challenging as the goal is open-ended. While prior work [64, 69, 74] leverages statistics for suggestions, we find that analysts mostly pick variables based on their interpretation of semantic relationships between variables, and statistical properties are sometimes irrelevant. Thus, tools should at least allow analysts to steer suggestions based on their interests. An open question is how to design an elicitation method that lets analysts convey domain knowledge, such as how the variables could influence each other. Tools might then store and use this information to recommend relevant variables, and possibly help refute hypothesized relationships. As analysts in the same organization often explore the same datasets at different times, tools may also leverage prior analyses to learn the relevance of variables.
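On the profiling side, a simple version of such a recommender could rank variables by data-quality signals so that the most suspicious columns are examined first. The sketch below scores columns by missing-value rates and Tukey's 1.5 × IQR fences; the equal weighting of the two signals is an illustrative assumption, not a published scoring rule.

```python
import pandas as pd

def rank_columns_for_profiling(df: pd.DataFrame) -> pd.DataFrame:
    """Score each column by potential quality issues (missing values,
    outliers) so the most suspicious columns are inspected first."""
    rows = []
    for name, col in df.items():
        missing_rate = col.isna().mean()
        outlier_rate = 0.0
        if pd.api.types.is_numeric_dtype(col):
            vals = col.dropna()
            if len(vals) > 0:
                q1, q3 = vals.quantile([0.25, 0.75])
                iqr = q3 - q1
                # Tukey's fences: flag points beyond 1.5 * IQR.
                mask = (vals < q1 - 1.5 * iqr) | (vals > q3 + 1.5 * iqr)
                outlier_rate = mask.mean()
        rows.append({"column": name,
                     "missing_rate": missing_rate,
                     "outlier_rate": outlier_rate,
                     # Illustrative score: weight both signals equally.
                     "score": missing_rate + outlier_rate})
    return pd.DataFrame(rows).sort_values("score", ascending=False)
```

An analyst could then call rank_columns_for_profiling(df).head(10) to decide where to begin profiling; a discovery-oriented recommender would instead need the semantic steering discussed above.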
9.2 Support Iterative and Collaborative Workflows

One observation is the lack of support for browsing and searching history [33] in exploration tools. If analysts can efficiently find analyses relevant to certain datasets and variables, they can better understand the data and avoid repeating existing work. Moreover, since the exploration of a dataset can be lengthy, tools should also provide interfaces to annotate important findings so that analysts can later revisit and summarize these findings for their reports. As analysts may not know if they have comprehensively explored the data, surfacing variable coverage [62] may help them identify unexplored directions and perform more comprehensive exploration.

Another key finding is the tight coupling between exploration and other tasks, which requires analysts to switch tools and migrate data. Exploration tools could benefit by either providing support for other tasks such as data wrangling [6] or tightly integrating with existing analysis ecosystems. For example, the JupyterLab data science environment [4] has an extension system that can integrate an exploration tool for Jupyter Notebook users. Moreover, tools should consider using a shared in-memory data format (e.g., [2]) to reduce the need to migrate data when switching tools. Finally, as analysts often have to create reports or presentations to share with other stakeholders, tools can provide scaffolding to help generate reports from existing analyses and annotations of important findings.
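For instance, under this design a notebook and a separate exploration tool could hand off a table through Apache Arrow's IPC format rather than repeatedly exporting and re-parsing CSV files. The sketch below uses the pyarrow package; the file name and the two-tool workflow are illustrative, not a prescribed integration.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.ipc  # ensures the pa.ipc submodule is loaded

# Producing tool (e.g., a notebook): serialize the table once.
df = pd.DataFrame({"user": ["a", "b", "c"], "visits": [3, 5, 2]})
table = pa.Table.from_pandas(df)
with pa.OSFile("analysis.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Consuming tool: memory-map the same file; the columnar data is
# read in place without copying or re-parsing.
with pa.memory_map("analysis.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.to_pandas())
```

Because both tools read the same columnar layout, switching between them no longer requires a lossy export step.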
10 CONCLUSION

This paper presents the results of an interview study on exploratory data analysis with 18 analysts across academia and industry. We characterize common exploration goals: profiling (assessing data quality) and discovery (gaining new insights). Though the EDA literature emphasizes discovery, we observe that discovery only reliably occurs in the context of open-ended analyses, whereas all participants engage in profiling across all of their analyses. We also describe how analysis goals and context affect the tasks and challenges in exploratory data analysis. We find that analysts must perform repetitive tasks, yet they may have limited time or lack the domain knowledge to explore data. Analysts also often have to consult other stakeholders and oscillate between exploration and other tasks, such as acquiring and wrangling additional data. Based on these observations, we conclude with design opportunities for data exploration tools, such as augmenting exploration with automation and guidance.
ACKNOWLEDGMENTS

We thank Interactive Data Lab members and the anonymous referees for their feedback. This work was done when the first author was at the University of Washington. This work was supported by a Moore Foundation Data-Driven Discovery Investigator Award.
REFERENCES

[1] Matplotlib documentation. http://matplotlib.org/.
[2] Apache Arrow. https://arrow.apache.org/, 2018. Accessed on Aug 1, 2018.
[3] Google Data Studio. https://datastudio.google.com/u/0/, 2018. Accessed on Aug 1, 2018.
[4] JupyterLab. https://github.com/jupyterlab/jupyterlab, 2018. Accessed on Aug 1, 2018.
[5] Microsoft Power BI. https://powerbi.microsoft.com/, 2018. Accessed on Aug 1, 2018.
[6] Tableau Prep. https://www.tableau.com/products/prep, 2018. Accessed on Aug 1, 2018.
[7] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: A survey. The VLDB Journal, 24(4):557–581, 2015.
[8] S. Alspaugh, N. Zokaei, A. Liu, C. Jin, and M. A. Hearst. Futzing and moseying: Interviews with professional data analysts on exploration practices. IEEE Transactions on Visualization and Computer Graphics, 2018.
[9] R. Amar, J. Eagan, and J. Stasko. Low-level components of analytic activity in information visualization. In Proc. IEEE Symposium on Information Visualization (InfoVis), pp. 111–117. IEEE, 2005.
[10] N. Andrienko and G. Andrienko. Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach. Springer Science & Business Media, 2006.
[11] A. Batch and N. Elmqvist. The interactive visualization gap in initial exploratory data analysis. IEEE Transactions on Visualization and Computer Graphics, 24(1):278–287, 2018.
[12] J. T. Behrens. Principles and procedures of exploratory data analysis. Psychological Methods, 2(2):131, 1997.
[13] J. T. Behrens and C.-H. Yu. Exploratory data analysis. Handbook of Psychology, 2:33–64, 2003.
[14] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo. VisTrails: Visualization meets data management. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 745–747. ACM, 2006.
[15] S. K. Card, J. D. Mackinlay, and B. Shneiderman. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, 1999.
[16] D. B. Carr, R. J. Littlefield, W. Nicholson, and J. Littlefield. Scatterplot matrix techniques for large N. Journal of the American Statistical Association, 82(398):424–436, 1987.
[17] C. Chatfield. The initial examination of data. Journal of the Royal Statistical Society, Series A (General), pp. 214–253, 1985.
[18] C. Chatfield. Exploratory data analysis. European Journal of Operational Research, 23(1):5–13, 1986.
[19] C. Chatfield. Problem Solving: A Statistician's Guide. Chapman and Hall/CRC, 1995.
[20] E. H.-H. Chi and J. T. Riedl. An operator interaction framework for visualization systems. In Proc. IEEE Symposium on Information Visualization, pp. 63–70. IEEE, 1998.
[21] G. Chin Jr., O. A. Kuchar, and K. E. Wolf. Exploring the analytical processes of intelligence analysts. In Proc. SIGCHI Conference on Human Factors in Computing Systems, pp. 11–20. ACM, 2009.
[22] B. C. Kwon, B. Fisher, and J. S. Yi. Visual analytic roadblocks for novice investigators. In Proc. IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 3–11. IEEE, 2011.
[23] W. S. Cleveland. The Elements of Graphing Data. Wadsworth Advanced Books and Software, Monterey, CA, 1985.
[24] W. S. Cleveland. The Collected Works of John W. Tukey: Graphics 1965–1985, vol. 5. CRC Press, 1988.
[25] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[26] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, et al. Knowledge discovery and data mining: Towards a unifying framework. In Proc. KDD, vol. 96, pp. 82–88, 1996.
[27] S. Few. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press, 2009.
[28] J. J. Filliben and A. Heckert. Exploratory data analysis. In Engineering Statistics Handbook. National Institute of Standards and Technology, 2005.
[29] D. Fisher, I. Popov, S. Drucker, et al. Trust me, I'm partially right: Incremental visualization lets analysts explore large datasets faster. In Proc. SIGCHI Conference on Human Factors in Computing Systems, pp. 1673–1682. ACM, 2012.
[30] A. Gelman. Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics, 13(4):755–779, 2004.
[31] D. Gotz and M. X. Zhou. Characterizing users' visual analytic activity for insight provenance. Information Visualization, 8(1):42–55, 2009.
[32] P. J. Guo. Software Tools to Facilitate Research Programming. PhD thesis, Stanford University, Stanford, CA, 2012.
[33] J. Heer, J. Mackinlay, C. Stolte, and M. Agrawala. Graphical histories for visualization: Supporting analysis, communication, and evaluation. IEEE Transactions on Visualization and Computer Graphics, 14(6), 2008.
[34] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In ACM SIGMOD Record, vol. 26, pp. 171–182. ACM, 1997.
[35] D. C. Hoaglin, F. Mosteller, and J. W. Tukey, eds. Understanding Robust and Exploratory Data Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 1983.
[36] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In Proc. ACM SIGMOD International Conference on Management of Data, pp. 277–281. ACM, 2015.
[37] P. Isenberg, A. Tang, and S. Carpendale. An exploratory study of visual information analysis. In Proc. SIGCHI Conference on Human Factors in Computing Systems, pp. 1217–1226. ACM, 2008.
[38] T. Jankun-Kelly, K.-L. Ma, and M. Gertz. A model and framework for visualization exploration. IEEE Transactions on Visualization and Computer Graphics, 13(2):357–369, 2007.
[39] L. Jones. The Collected Works of John W. Tukey: Philosophy and Principles of Data Analysis 1949–1964, vol. 3. CRC Press, 1986.
[40] L. Jones. The Collected Works of John W. Tukey: Philosophy and Principles of Data Analysis 1965–1986, vol. 4. CRC Press, 1987.
[41] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck, and P. Buono. Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 10(4):271–288, 2011.
[42] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. In Proc. IEEE Conference on Visual Analytics Science and Technology (VAST), 2012.
[43] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proc. Advanced Visual Interfaces (AVI), pp. 547–554. ACM, 2012.
[44] Y.-A. Kang and J. Stasko. Characterizing the intelligence analysis process: Informing visual analytics design through a longitudinal field study. In Proc. IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 21–30. IEEE, 2011.
[45] D. A. Keim, F. Mansmann, J. Schneidewind, and H. Ziegler. Challenges in visual data analysis. In Proc. Tenth International Conference on Information Visualization (IV), pp. 9–16. IEEE, 2006.
[46] M. B. Kery, M. Radensky, M. Arya, B. E. John, and B. A. Myers. The story in the notebook: Exploratory data science using a literate programming tool. In Proc. 2018 CHI Conference on Human Factors in Computing Systems, p. 174. ACM, 2018.
[47] A. Kidd. The marks are on the knowledge worker. In Proc. SIGCHI Conference on Human Factors in Computing Systems, pp. 186–191. ACM, 1994.
[48] H. Lam. A framework of interaction costs in information visualization. IEEE Transactions on Visualization and Computer Graphics, 14(6), 2008.
[49] G. Leinhardt and S. Leinhardt. Chapter 3: Exploratory data analysis: New tools for the analysis of empirical data. Review of Research in Education, 8(1):85–157, 1980.
[50] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 2014.
[51] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[52] J. Mackinlay, P. Hanrahan, and C. Stolte. Show Me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis), 13(6):1137–1144, 2007.
[53] D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics. WH Freeman/Times Books/Henry Holt & Co, 1989.
[54] S. Morgenthaler. Exploratory data analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):33–44, 2009.
[55] F. Naumann. Data profiling revisited. ACM SIGMOD Record, 42(4):40–49, 2014.
[56] F. Perez and B. E. Granger. Project Jupyter: Computational narratives as the engine of collaborative data science, 2015.
[57] P. Pirolli and S. Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proc. International Conference on Intelligence Analysis, vol. 5, pp. 2–4. McLean, VA, USA, 2005.
[58] E. D. Ragan, A. Endert, J. Sanyal, and J. Chen. Characterizing provenance in visualization and data analysis: An organizational framework of provenance types and purposes. IEEE Transactions on Visualization and Computer Graphics, 22(1):31–40, 2016.
[59] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[60] A. Rule, A. Tabard, and J. D. Hollan. Exploration and explanation in computational notebooks. In Proc. 2018 CHI Conference on Human Factors in Computing Systems, p. 32. ACM, 2018.
[61] A. Sarikaya, M. Correll, L. Bartram, M. Tory, and D. Fisher. What do we talk about when we talk about dashboards? IEEE Transactions on Visualization and Computer Graphics, 2019.
[62] A. Sarvghad, M. Tory, and N. Mahyar. Visualizing dimension coverage to support exploratory analysis. IEEE Transactions on Visualization and Computer Graphics, 23(1):21–30, 2017.
[63] H. J. Seltman. Experimental Design and Analysis. Online at: http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf, 2012.
[64] J. Seo and B. Shneiderman. A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2):96–113, 2005.
[65] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proc. IEEE Symposium on Visual Languages, pp. 336–343. IEEE, 1996.
[66] C. Stolte, D. Tang, and P. Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics, 8(1):52–65, 2002.
[67] J. W. Tukey. Exploratory Data Analysis, vol. 2. Reading, Mass., 1977.
[68] J. W. Tukey. We need both exploratory and confirmatory. The American Statistician, 34(1):23–25, 1980.
[69] M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. SeeDB: Efficient data-driven visualization recommendations to support visual analytics. Proceedings of the VLDB Endowment, 8(13):2182–2193, 2015.
[70] P. F. Velleman and D. C. Hoaglin. Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, 1981.
[71] M. Wattenberg, F. Viégas, and I. Johnson. How to use t-SNE effectively. Distill, 1(10):e2, 2016.
[72] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009.
[73] H. Wickham and G. Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media, Inc., 2016.
[74] G. Wills and L. Wilkinson. AutoVis: Automatic visualization. Information Visualization, 9(1):47–69, 2010.
[75] K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, and J. Heer. Voyager: Exploratory analysis via faceted browsing of visualization recommendations. IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis), 2016. https://idl.cs.washington.edu/papers/voyager.
[76] J. S. Yi, Y. A. Kang, J. T. Stasko, J. A. Jacko, et al. Toward a deeper understanding of the role of interaction in information visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6), 2007.
[77] E. Zgraggen, Z. Zhao, R. Zeleznik, and T. Kraska. Investigating the effect of the multiple comparisons problem in visual analysis. In Proc. 2018 CHI Conference on Human Factors in Computing Systems, p. 479. ACM, 2018.