Goals, Process, and Challenges of Exploratory Data Analysis: An Interview Study
…new insights). Though the EDA literature primarily emphasizes discovery, we observe that discovery only reliably occurs in the context of open-ended analyses, whereas all participants engage in profiling across all of their analyses. We describe the process and challenges of EDA highlighted by our interviews. We find that analysts must perform repetitive tasks (e.g., examine numerous variables), yet they may have limited time or lack domain knowledge to explore data. Analysts also often have to consult other stakeholders and oscillate between exploration and other tasks, such as acquiring and wrangling additional data. Based on these observations, we identify design opportunities for exploratory analysis tools, such as augmenting exploration with automation and guidance.

Index Terms: Human-centered computing—Visualization—Visualization techniques—Treemaps; Human-centered computing—Visualization—Visualization design and evaluation methods

…analysis process and report how analysis goals and context impact them. We also describe how analysts handle challenges specific to exploration tasks, including choosing variables to explore, handling repetitive tasks, and determining the end of an exploration. We find that analysts often have to explore numerous variable combinations, requiring them to apply domain knowledge to select and reduce the number of variables. As analysts perform repetitive tasks, they may curate analysis templates to automate their routines and help them follow best practices. Due to time limits, analysts may also need to move on to other tasks before completing their exploration.

Finally, we identify opportunities for data exploration tools. We argue that tools can help mitigate these observed challenges and facilitate rapid and systematic exploration by providing automation for routine tasks and guiding analysis practices. We also note a lack of support for data wrangling and navigation of analysis history within exploration tools.
[Figure 1 matrix, flattened below. The rotated per-participant occupation labels and the column positions of individual checkmarks were not recoverable from the extraction, so checkmark rows report counts over all 18 participants.]

Demographics (per participant, P1-P18)
  Analysis Setting, Industry (I) / Academia (A): I, A, A, A, A, I, A, I, I, I, I, I, I, I, A, I, I, I
  Data Analysis Experience (Years): 3-5, 5-10, 1-3, 1-3, 5-10, 3-5, 3-5, 1-3, 3-5, 1-3, 1-3, >10, 1-3, 1-3, 5-10, >10, >10, 1-3
Project Type
  Question Answering: 18/18
    Open-ended Questions: 8/18
  Open Exploration: 7/18
  Model Development: 10/18
  Data Preparation: 3/18
Exploration Goals
  Profiling: 18/18
  Discovery: 13/18
Context: Tools
  Excel / Google Sheets: 13/18
  BI Tools (e.g., Tableau/PowerBI): 6/18
  General Programming Languages (R/Python/MATLAB/SAS): 15/18
  Data Query Languages (e.g., SQL/Scalding): 7/18
  Domain-Specific/Internal Tools: 7/18
Context: Analyst's Role
  Domain-Specific Analysts: 11/18
  Consultant: 9/18
Context: Stakeholders
  Clients: 13/18
  Data Owners: 10/18
  Analysis Team Members: 15/18
Figure 1: A matrix of interviewees, their corresponding analysis context, and high-level tasks they perform in the analysis process.
“A lot of my work is more long-term open-ended research questions such as: how can we characterize the health of the users on our platform?”

Analysts often produced analysis reports in the form of written documents and presentation slides. They also sometimes built interactive dashboards.

Open-Ended Exploration. While answering specific questions was more common, several analysts (7/18) noted that they sometimes broadly explored data to summarize and look for new insights without a specific question. P17, a data science consultant, reported that his clients once gave him their website’s data and asked “Please just tell me about my site.” P5, an astronomer, also said:

“Occasionally we get data that’s surprising like the universe does something we haven’t seen before and a telescope caught it. Then you sit down with the data and think ‘What do I do now?’”

Akin to question answering, analysts often produced reports to describe insights from the open-exploration process.

Model Development. Many analysts (10/18) reported cases where they performed exploratory analysis to prepare for modeling projects such as training machine learning models or developing new metrics and rules. Besides the models, analysts might also deliver reports, or integrate the solutions into dashboards as their project outcomes.

Data Publishing. A few analysts (3/18) explored data while cleaning datasets for publishing on shared repositories, so others could use the datasets for other analyses.

4.2 Exploration Goals

We asked the analysts why they performed data exploration in their analysis projects. From their descriptions, we categorize their goals into two common categories:

Profiling. A common goal for all analysts (18/18) was to learn what the data contained and assess if the data were suitable for the analyses. By broadly looking at the data and their plots, analysts could learn about their shapes and distributions, and detect data quality issues such as missing data, extreme values, or inconsistent data types. They might also check specific assumptions about the data, both in terms of expectations based on domain knowledge and mathematical assumptions required for modeling. By profiling, they learned if the data were ready for the analyses or if they needed to further wrangle the data or acquire more data.
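To ground the profiling activities described above, the following sketch shows the kind of first-pass checks participants reported, written in Python with pandas and Matplotlib (tools named in §4.4). The file and column layout are hypothetical; the paper does not prescribe any particular code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; participants' actual data varied by domain.
df = pd.read_csv("sessions.csv")

# Shape, types, and a raw preview: check that the data match expectations.
print(df.shape)
print(df.dtypes)
print(df.head())

# Data quality: per-column missing values and duplicated rows.
print(df.isna().sum())
print(df.duplicated().sum())

# Distributions: summary statistics and histograms expose extreme values.
print(df.describe(include="all"))
df.hist(figsize=(10, 6))
plt.show()
```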
[Figure 2 task definitions]
Acquisition: Obtain data by locating existing data or collecting the data themselves.
Wrangling: Transform data to have a suitable format for analysis and to handle data quality issues.
Exploration: Examine data’s values, statistics, and visualizations to profile data or discover new insights.
Modeling: Build and evaluate statistical models for testing hypotheses or making predictions.
Reporting: Share the analysis results.
Figure 2: The analysis process couples exploration with many tasks including acquisition, wrangling, modeling, and reporting.
Discovery. Many analysts (13/18) also explored data to discover new insights or hypotheses, as P17 described that his exploration goal was to “be open-minded and learn what the data could tell me.” For question answering and modeling projects, analysts might focus on developing intuitions about how to answer questions or formulate models, such as learning about potential relationships between variables or rankings of feature importance. Some insights also inspired the analysts to broadly explore other relevant factors, while some helped them form and investigate specific questions.

Analysts’ focus on exploration goals depended on project objectives. While the EDA literature (reviewed in §2.2) mostly focuses on discovery, we observed that profiling was a more common goal. Projects with fixed questions generally centered on profiling, though surprising observations from profiling sometimes prompted analysts to investigate and discover the causes of the surprises. Meanwhile, open-ended analyses involved both goals. Analysts often first focused on profiling, and shifted their focus to discovering new insights when they felt more confident about the data.

4.3 High-Level Tasks in the Analysis Process

From the interviewees’ responses about the tasks they performed in their analyses, we characterize the data analysis process as an iterative process that couples five common high-level tasks: acquisition, wrangling, exploration, modeling, and reporting (as listed and defined in Fig. 2). Some projects might omit some tasks. For example, though exploration often preceded modeling, some analysts (6/18) explored data to clean or summarize data without any modeling involved. Some data were also clean and did not require wrangling.

The analysts’ process coupled exploration with many tasks. The analysts regularly explored data to assess if the data were relevant during acquisition. Similarly, they often explored data to decide how to wrangle them. Exploration also helped them discover the need to collect or wrangle more data. In addition, the analysts often reported exploration results to other stakeholders and gathered feedback for more exploration. While we observed less coupling between modeling and exploration, a few analysts examined training data when they observed poor modeling results.

Our characterization of analysis tasks is similar to those of Kandel et al. [42] and Alspaugh et al. [8]. However, as Kandel et al. focus on analysts who typically perform directed analyses (answering predetermined questions), they list only profiling rather than exploration as one of the tasks. Alspaugh et al., whose study focuses on open-ended analyses, augment Kandel et al.’s model by adding exploration as an alternative task to modeling. In contrast, as our study covers exploratory tasks for both directed and open-ended analyses, we found that analysts often explored data prior to modeling. They also often performed similar exploration tasks (examining the data’s values and derived statistics and visualizations, as described in §7.1) to profile data or discover new insights. Thus, we revise Kandel et al.’s model by replacing profiling with a more general exploration task, which subsumes both profiling and discovery goals.

In §5-8, we discuss common challenges in these tasks and report how analysts handled them. Though analysts also explored variations of models and outputs, this paper focuses on data exploration. We consider model diagnostics beyond the scope of this paper.

4.4 Analysis Tools

The interviewees reported using and switching between multiple tools throughout their analyses. A few (P1, P11, P18) were application users who usually looked at and wrangled data in spreadsheets, and visually explored data in Tableau.

The rest (15/18) were programmers who primarily used one language among Python, MATLAB, R, and SAS to analyze data. They usually plotted data with APIs such as Matplotlib [1] and ggplot2 [72]. Several of them also used computational notebooks (e.g., Jupyter [56]) to keep history for repeating and revising their analyses. Some noted that they preferred exploring data via scripting instead of using graphical interfaces, as they did not have to switch tools. However, the programmers switched to other tools in some cases. P6 sometimes explored data in Tableau when it could connect to the data sources. Several used spreadsheets to inspect raw data, though they rarely wrangled data in spreadsheets like the application users did. Many utilized languages such as SQL and Scalding to fetch and manipulate the data. Some used Tableau [66], Google Data Studio [3], or Microsoft PowerBI [5] for reporting.

Several analysts sometimes had to use domain-specific tools. P3 explored biopsy images from a 3D scan with a specialized tool. A few industrial analysts also noted that their internal data platforms had some support for data wrangling and exploration. As domain-specific tools often had limited features, analysts preferred to use general-purpose tools if possible. However, their data often resided in domain-specific tools, and exporting the data was sometimes difficult.
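As an illustration of the scripting-centric exploration these programmers described, a minimal notebook-style cell might look like the following; the file and column names are invented for the example, and the specific charts are ours, not a participant’s.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")  # hypothetical file and columns

# A cell like this is cheap to re-run and revise as questions emerge.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["duration"].dropna(), bins=30)
axes[0].set(title="duration", xlabel="seconds", ylabel="count")
axes[1].scatter(df["duration"], df["error_rate"], s=5, alpha=0.5)
axes[1].set(title="duration vs. error rate", xlabel="seconds", ylabel="rate")
fig.tight_layout()
plt.show()
```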
4.5 Operational and Domain Knowledge

The analysts typically needed operational and domain knowledge in their analyses. They had to know where the data were stored, and how the data were collected and processed. They also needed domain expertise to interpret the data and detect errors. Since analysts usually lacked some required knowledge, they had to learn more about the problem domains and consult other stakeholders.

Job roles also affected the levels of operational and domain knowledge that analysts had. We observed that the analysts had two kinds of job roles relative to their problem domains: domain-specific analysts (9/18) and consultants (7/18), with two (2/18) straddling both roles in different phases of their careers. In academia, most researchers focused on their research topics, but one (P7) was a statistician providing solutions to multiple research domains. In industry, there were both analysts embedded into product teams and consultants who served internal or external clients. As consultants typically worked with a broader set of domains, they often had less domain expertise and relied more on other stakeholders, as P17 said:

“Since I’m not embedded with the team, I don’t have the domain context. In this example where I saw elevated counts in the product’s telemetry, I didn’t know what it meant. I could guess, but I’m not on the team, so I have no idea.”

4.6 Stakeholders and Collaboration

We observed that analysts collaborated with a few types of stakeholders over the course of their analysis projects.

Clients. Most analysts (13/18) had clients who prompted them to perform the analyses and were the direct audience for the results. Some analysts were consultants who served external clients, while some worked with internal clients within their organizations, such as product managers or executives. Analysts often interacted with clients in an iterative fashion. Besides reporting the final results, analysts might share preliminary results and ask the clients for feedback, such as verifying if the results matched the clients’ prior knowledge and checking if the analyses aligned with the project goals.
Data Owners. Many analysts (10/18) interacted with data engineers or database administrators who curated, processed, or stored the data prior to their analyses. Clients were also sometimes data owners, directly providing the data for the analysts. Analysts often asked the data owners to provide additional information to help them locate, clean, and understand the data, since data owners had a better understanding of the format and meaning of the data as well as where the data were stored and how they were processed.

Analysis Team Members. Though the analysts primarily analyzed data on their own, most of them (15/18) were members of analysis teams. Thus, they regularly obtained feedback from fellow analysts and supervisors before presenting to clients. Typical feedback included additional questions to explore, technical advice for analysis techniques and implementation, and suggestions to make the reports easier to understand for the clients. Moreover, a few interviewees noted that they worked jointly with their colleagues on some projects. Two reported splitting the work so each team member could focus on an independent scope and make progress in parallel. Another mentioned that she and her colleague independently analyzed the same data and cross-checked if they arrived at the same results.

Besides supervisors and fellow analysts, a few interviewees had colleagues with more domain expertise in their teams. P3’s medical device research team had a pathologist to give opinions on tumor image analysis. P16, a data science consultant, also reported that his organization included business-oriented “solution managers”, whose duties were to bridge the communication gap between the clients and technical-oriented data scientists and help them define deliverables that matched the clients’ goals.

5 DATA ACQUISITION

We now discuss challenges for data exploration and relevant activities. The first step is to acquire the data necessary for the analysis. All but one interviewee (17/18) reported working with existing datasets. For business analysts, most data were from product logs or customer surveys, while many researchers worked on datasets jointly collected by their research communities. Only some (5/18) had participated in data collection, either by collecting the data themselves or requesting that certain data be collected.

When working with existing data, finding relevant data was difficult for a few reasons. First, data were often distributed. Several analysts reported that their companies used multiple data storage infrastructures. A few researchers also mentioned that their datasets were collected and published by different research organizations. Thus, analysts typically had to search for data in many places. Moreover, data sources often had insufficient data descriptions, with uninformative column labels and missing or outdated documentation. As a result, analysts had to explore all potential datasets to assess if they were relevant to their analyses.

Some analysts consulted data owners to locate and understand the data. They often received connections to the data owners from their clients or colleagues. However, P14 noted that finding the right people to talk to was difficult since she worked in a remote office. Analysts also used keyword search to look for relevant datasets in their databases. However, as the same data could be named in many ways, they had to try many different keywords to find the data. For some analysts, their data sources might not have convenient search capability at all. Due to this problem, P2 noted that she was building a searchable database for her organization.

Consulting analysts often received their data from the clients. However, the provided data might lack the information necessary to achieve the project goals, requiring the analysts to search for more appropriate data or otherwise terminate the project.

6 DATA WRANGLING

We observed that analysts often coupled data wrangling with exploration. As analysts received new data, they might want to explore the data. However, data often came from many sources or had an improper format and size for analysis tools. Thus, analysts had to transform the data prior to exploration. Once they explored the data, they might discover that they needed to further handle erroneous values or rescale the data. Due to this coupling, some analysts even associated exploratory analysis with data cleaning.

Akin to prior work [42], several analysts reported that they often spent the majority of their analysis time wrangling and cleaning data. As exploration tools often lack support for some wrangling tasks, they had to switch between tools throughout the analysis and migrate data between these tools. We now identify commonly observed wrangling tasks that coupled with or impeded exploration.

6.1 Combining Multiple Datasets

Many analysts (12/18) had to join multiple datasets or integrate similar datasets from multiple sources, both of which presented many challenges. To understand the similarities and differences between datasets, they might have to profile the datasets while combining them. P6 and P10 complained that they often had to join data from over 20 tables. Three analysts also had to use multiple scripting languages to fetch the data from multiple different platforms.

One common challenge was inconsistency between data sources. P5, an astronomer, reported that different telescopes published data using various time systems, so she spent a few days just getting the data onto the same time system before she could combine them. P11 also described joining data with different levels of granularity: “Voting data is collected at precinct level while health data is at a state level, and population data is served at zip code level”.
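P11’s granularity mismatch suggests a concrete pattern: roll the finer-grained tables up to a shared unit before joining. A hedged pandas sketch, with hypothetical tables and columns standing in for the voting, health, and population data he described:

```python
import pandas as pd

# Hypothetical tables at three granularities, echoing P11's example.
voting = pd.read_csv("voting_by_precinct.csv")   # precinct, state, votes
health = pd.read_csv("health_by_state.csv")      # state, uninsured_rate
population = pd.read_csv("pop_by_zip.csv")       # zip, state, population

# Roll the finer-grained tables up to the coarsest shared unit (state).
voting_state = voting.groupby("state", as_index=False)["votes"].sum()
pop_state = population.groupby("state", as_index=False)["population"].sum()

# Join on the shared key; outer joins expose states missing from a source.
merged = (voting_state
          .merge(health, on="state", how="outer")
          .merge(pop_state, on="state", how="outer"))
print(merged.isna().sum())  # profile the combined result for coverage gaps
```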
6.2 Dealing with Data Size

Most analysts (15/18) had to deal with data size, which increased data processing time, impeded sharing, or even crashed their analysis tools. P14 mentioned that it took her a few days just to retrieve the data. P3 noted that it was “extremely difficult to share a 250GB file”. Several analysts complained that large datasets did not work in R. P11 was annoyed that his data crashed both Excel and Tableau.

The analysts applied a few strategies to handle large datasets. Some (8/18) reduced data size by sampling the data. P9 and P10 noted that their challenges for sampling included “figuring out how large of a sample size we needed and balancing how long it would take to run” as well as “determining how to get meaningful and representative samples”.

Some analysts (8/18) also reduced data size by filtering interesting or relevant subsets based on their domain knowledge or suggestions from domain experts. P15 also applied signal processing techniques to detect signals of interest in audio data, so she could explore just the relevant data. However, analysts might not know how to filter the data until they had explored them.

Some interviewees (4/18) handled large datasets by aggregating them. One difficulty for aggregation was deciding the level of detail. For example, aggregating time series by milliseconds could make the aggregated data too large, while aggregating by year might eliminate important details for the analysis. However, as analysts sometimes lacked specific questions during exploration, they might not initially know the right aggregation level and thus had to re-aggregate the data many times during the exploration.
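Each of the three reduction strategies reported here (sampling, filtering, and aggregating) corresponds to a short operation in a scripting workflow. A sketch in pandas, with a hypothetical event log and invented column names:

```python
import pandas as pd

# Hypothetical event log with a timestamp column.
df = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Sampling: trade fidelity for speed; picking n is itself a judgment call.
sample = df.sample(n=100_000, random_state=0)

# Filtering: keep only the subset that domain knowledge marks as relevant.
errors = df[(df["status"] == "error") & (df["timestamp"] >= "2019-01-01")]

# Aggregating: choose a level of detail, e.g., hourly event counts; too fine
# stays large, too coarse hides detail, so re-aggregation is common.
hourly = df.set_index("timestamp").resample("H").size()
```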
6.3 Converting Data Formats

Most analysts (15/18) had to convert data into formats expected by their analysis tools. Common formatting tasks included converting file formats and character encodings as well as manipulating data layout, such as splitting data columns and reshaping datasets into long formats. A common complaint was that data formatting was time-consuming. Several analysts also complained that they had to manually format spreadsheets that did not have rectangular shapes.
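As one concrete instance of the layout manipulation described above, reshaping a wide spreadsheet into a long format and splitting a compound column are each a single pandas call; the table and column names are again hypothetical:

```python
import pandas as pd

# Hypothetical wide table: one row per region, one column per year.
wide = pd.read_csv("sales_wide.csv")  # columns: region, 2016, 2017, 2018

# Reshape to long format: one (region, year, sales) row per observation,
# the layout most plotting and analysis APIs expect.
long = wide.melt(id_vars="region", var_name="year", value_name="sales")

# Splitting a compound column is another common layout manipulation.
long[["country", "city"]] = long["region"].str.split("-", expand=True)
```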
6.4 Deriving New Forms of Data

Many analysts (13/18) derived new forms of data more appropriate for their analyses. Many often rescaled data by normalizing them into certain ranges (e.g., 0 to 1) or applying logarithmic transformations to make them more normally distributed. Several applied low-pass filters or calculated moving averages to reduce noise in the data. P14 and P17 coded new high-level categories from the original low-level categories. As we will discuss in §7.4, analysts also often derived tabular forms of unstructured data (e.g., by calculating statistics) so they could explore and analyze the new data.
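These derivations are short transformations in any scripting tool. A sketch of the three most cited (min-max rescaling, log transform, moving average), applied to a synthetic series purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed, noisy signal standing in for participants' domain data.
x = pd.Series(np.random.lognormal(mean=0.0, sigma=1.0, size=1000))

# Rescale into the range 0 to 1 (min-max normalization).
x_norm = (x - x.min()) / (x.max() - x.min())

# Logarithmic transformation to make the distribution more normal.
x_log = np.log(x)  # np.log1p(x) is safer when zeros are possible

# Moving average over a 10-sample window to reduce noise.
x_smooth = x.rolling(window=10, center=True).mean()
```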
…avoid making the scatterplot matrices too large. A few also grouped redundant variables identified via correlation plots.

The analysts reported that a straightforward exploration may take a few hours to a few days. However, the data were often dirty or incomplete, requiring them to acquire or wrangle more data before they continued exploring. Moreover, analysts often had to consult and get feedback from clients or colleagues. However, these stakeholders might not be immediately available to help, so the analysts had to switch to other projects while waiting. For these reasons, exploration may take several days or even weeks.
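The variable-selection tactics mentioned here, limiting scatterplot matrices to a few variables and using correlation plots to find redundant ones, can be sketched in pandas and Matplotlib; the file and variable names are illustrative only:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("features.csv")  # hypothetical feature table

# Restrict the scatterplot matrix to a few variables to keep it legible.
subset = ["age", "income", "tenure", "score"]  # picked via domain knowledge
pd.plotting.scatter_matrix(df[subset], figsize=(8, 8), diagonal="hist")

# A correlation matrix helps spot redundant variables to group or drop.
corr = df[subset].corr()
plt.matshow(corr)
plt.xticks(range(len(subset)), subset, rotation=45)
plt.yticks(range(len(subset)), subset)
plt.colorbar()
plt.show()
```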