0% found this document useful (0 votes)

10 views

ACM REP24 Paper Submission

Uploaded by

vcox4460

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views

ACM REP24 Paper Submission

Uploaded by

vcox4460

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Computational Experiment Comprehension using Provenance

Summarization
Nichole Boufford Joseph Wonsil Adam Pocock
University of British Columbia University of British Columbia Oracle Labs
Vancouver, British Columbia, Canada Vancouver, British Columbia, Canada Burlington, Massachusetts, USA
[email protected] [email protected] [email protected]

Jack Sullivan Margo Seltzer Thomas Pasquier

Oracle Labs University of British Columbia University of British Columbia
Burlington, Massachusetts, USA Vancouver, British Columbia, Canada Vancouver, British Columbia, Canada
[email protected] [email protected] [email protected]

ABSTRACT ACM Reference Format:

Scientists use complex multistep workflows to analyze data. How- Nichole Boufford, Joseph Wonsil, Adam Pocock, Jack Sullivan, Margo Seltzer,
and Thomas Pasquier. 2024. Computational Experiment Comprehension
ever, reproducing computational experiments is often difficult as
using Provenance Summarization. In Proceedings of 2023 ACM Conference
scientists’ software engineering practices are geared towards the on Reproducibility and Replicability (ACM REP ’24). ACM, New York, NY,
science, not the programming. In particular, reproducing a scientific USA, 19 pages. https://doi.org/XXXXXXX.XXXXXXX
workflow frequently requires information about its execution. This
information includes the precise versions of packages and libraries
used, the particular processor used to perform floating point compu- 1 INTRODUCTION
tation, and the language runtime used. This can be extracted from Historically, researchers and scientists have recorded experimen-
data provenance, the formal record of what happened during an ex- tal procedures in lab notebooks. These notebooks should contain
periment. However, data provenance is inherently graph-structured enough detail to allow another researcher to reproduce an experi-
and often large, which makes interpretation challenging. Rather ment. However, this paper-based approach does not translate well
than exposing data provenance through its graphical representa- to today’s computational world. Computational lab notebooks (e.g.,
tion, we propose a textual one and use a large language model to Jupyter [23]) seem an obvious solution. They automate experiment
generate it. We develop techniques for prompting large language recording, making them vastly superior to paper-based approaches.
models to automatically generate textual summaries of provenance However, computational lab notebooks still fall short of support-
data. We conduct a user study to compare the effectiveness of these ing automated reproducibility [42]. Even with access to all the
summaries to the more common node-link diagram representation. experiment materials such as analysis scripts and data sources, it is
The majority of participants are better able to extract useful infor- non-trivial to reproduce or even understand computational exper-
mation from the textual summaries. The textual summaries were iments [42, 53, 54]. This is problematic as scientists cannot build
particularly beneficial for scientists with low computational exper- upon prior work without a solid understanding of the computa-
tise. We discuss the qualitative results from our study to motivate tional experiment they are trying to reproduce. These challenges
future designs for reproducibility tools. contribute to what is known as the reproducibility crisis in sci-
ence [4].
CCS CONCEPTS Often, researchers reproduce experiments to establish a baseline
• Human-centered computing → User studies; Interaction paradigms; from which they can extend existing research. Researchers use
• Computing methodologies → Natural language generation; many tools that help with mechanical experiment reproducibility
• Information systems → Data provenance. such as code repositories [10], dataset repositories [29, 31] and com-
putational environment trackers [14, 43]. However, just because a
KEYWORDS researcher can run an experiment does not mean they have enough
knowledge to build upon it. Conversely, if a researcher cannot im-
Provenance, Reproducibility, User Study, Text Generation mediately reproduce an experiment result, they will need a better
understanding of the experiment to determine what is missing or
not working correctly. The information they need to develop this
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed understanding also might not exist in the code and data.
for profit or commercial advantage and that copies bear this notice and the full citation Data provenance addresses the issue of incomplete experiment
on the first page. Copyrights for components of this work owned by others than ACM tracking [40]. Provenance is metadata describing the history of
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a data objects [9]. Although data provenance addresses the mechan-
fee. Request permissions from [email protected]. ical problem in reproducibility by recording the missing pieces a
ACM REP ’24, June 18–20, 2024, Rennes, France user needs to run an experiment, it falls short on helping the user
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06. . . $15.00 understand the pipeline they are reproducing. Tools that use data
https://doi.org/XXXXXXX.XXXXXXX provenance typically present it visually as a directed acyclic graph
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

that represents the relationships between data objects. These ob-

jects are often scripts and datasets, whose attributes might contain
version information, and the relationships describe dependence or
causality. Unfortunately, since provenance graphs are large and
complicated, even experienced software engineers find interpreting
these graphs challenging [6].
Provenance systems typically store provenance data in a machine-
readable format that must be transformed for presentation to users.
In practically all prior work, the transformation produced a node-
link diagram (Fig. 1) [5, 21, 25, 30, 41, 46]. Large graphs are difficult
for humans to understand [33]. They often contain more informa-
tion than a human can store in their working memory and thus
impose too high a cognitive load. Figure 1: Simple provenance graph displayed as a node link
Imagine a new graduate student who is starting to work on a diagram. Process 1 executes the process_data.py script. This
problem that a previous student had worked on. To get started, script reads input.csv and writes to temp.data. Process 2 exe-
they need to reproduce the student’s experiment, and then they cutes the second script, analysis.py, which reads temp.data
can begin to extend the work. The previous student used version and parameters.csv and writes to two image files, plot_1.png
control, a data repository, and an environment manager, and still, and plot_2.png.
the new student cannot seem to get the same results. Although
their scripts run without errors, the new student is not sure if they
ran the scripts in the correct order or if the code has changed since less overwhelming than the node-link diagram. Our qualitative find-
the previous student reported their results. Even if the student was ings suggest several areas of future work to improve provenance
able to reproduce the correct result through trial and error, they are summarizations in the service of experimental reproducibility.
in no position to build upon this work, because they do not have a
good understanding of the workflow. Contributions
Fortunately, the previous student collected provenance during • We develop techniques for using large language models to
their last experiment execution. However, the provenance is not produce summaries from provenance graphs.
immediately helpful since it is only machine-readable. They use a • We demonstrate that these techniques produce high quality
graph database tool to view the provenance, but there are hundreds summaries.
of nodes! Since they cannot understand the provenance data, they • We conduct a user study to compare text-based provenance
are no closer to understanding the workflow than before. summaries and more typical node-link representations.
This scenario is all too common and shows that data provenance, • We show that many users find text-based provenance sum-
on its own, is not a complete solution. Although the answer to maries less cognitively demanding.
“Which data preprocessing script did I use to create the input data
Our code, datasets, and study materials are publicly available (fur-
for this trained model?” is in the provenance data, interpreting
ther details in Appendix A).
the data to find this information is nontrivial. Some researchers
evaluated different summarization and presentation techniques [6,
44], but these studies all assume that the right solution requires
2 BACKGROUND
exposing the graph-structured representation of the provenance We develop and evaluate textual provenance summarizations in the
to the user. We question that assumption; our goal is to facilitate context of reproducibility. We use the definition of reproducibility
a user’s understanding of the experiment. We hypothesize that from the National Academies of Science, Engineering and Medicine:
explaining what a provenance graph represents is a better approach “Reproducibility is obtaining consistent results using the same in-
to achieving that goal. put data, computational steps, methods, code, and conditions of
We present a text-based provenance summarization technique analysis” [34].
and evaluate it with a user study. Our text-based provenance sum-
marization technique is based on the observation that while ex- 2.1 Provenance
periment development is an iterative and complicated process, the Data provenance describes when data objects were created, when
ultimate experiment execution follows a relatively simple and logi- they were modified and their history of ownership [9]. In experi-
cal procedure. As such, we prefer to directly express this sequence mental workflow tracking, this translates into recording the details
rather than illustrating it with a complicated graph structure. Tra- about dependencies, creation, and modification of experiment code,
ditional lab notebooks describe experimental control flow using data, and outputs. While many tools designed for software engi-
natural language; we do the same. For our user study, we use a large neering capture version information, data provenance augments
language model (LLM) to generate a natural language summary of such tools by capturing an execution record and the relationships
a provenance graph. We find that users with less computational among different objects involved in a workflow. Prior work uses
expertise prefer the text-based explanation as it is more familiar and provenance data in experimental tracking systems to recreate com-
putational environments [13, 41] and record interactions between
applications and file systems [18].
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

Figure 2: Steps to create a text summary of a computational experiment.

Most provenance applications, including those used for exper- means that the context is limited to approximately 32000 charac-
iment and code management, present provenance as a node-link ters.
diagram [5, 21, 25, 30, 41, 46]. However, provenance graphs can
contain hundreds of elements, even for small tasks such as running 3 TEXT SUMMARIZATION
a computational notebook. Research shows that graphs containing We generate several high-quality summaries using GPT-4 [37] as a
more than 50 to 100 elements are hard to interpret [59] as they proof of concept for our user study. Fig. 2 shows the sequence of data
cannot fit in a human’s working memory [33]. Alternative prove- transformations involved in producing text summaries of compu-
nance visualizations [6, 44] have failed to see meaningful adoption tational experiments. We first run an experiment while recording
(see §6 for further discussion). Given that large graphs are hard provenance 1 . We then preprocess the provenance data 2 3
to understand, we propose natural language text summaries as an and then use the GPT-4 model from OpenAI [37] to generate the
alternative. Our intuition is that scientific experiments follow a summaries for our user study 4 . The LLM-generated summaries
logical control flow that we can describe using natural language. should contain 1) enough information that a user can understand
We know that scientists frequently read papers, lab reports and the experiment well enough to reproduce it and 2) no unnecessary
procedural documents, so we hypothesize that they might find a or false information. We outline further goals and expectations in
written format easier to understand and a better way to explain how §3.3.
to reproduce a computational experiment. While it was previously 1 Provenance capture We developed a system level provenance
impractical to generate these text summaries manually, we now collection tool to capture provenance during experiment [7]. Sys-
can generate them automatically using large language models. tem level provenance describes data at the granularity of system
calls, files and processes. We wrote our own tool because most ex-
2.2 Summarization using Generative AI isting system provenance collection tools have a large installation
Recent work in generative artificial intelligence shows that large overhead [32, 39]. Our tool uses eBPF [2], a Linux framework that
language models (LLMs) are able to effectively summarize large allows users to monitor operating system events without modifica-
quantities of text [17, 62]. Users interact with LLMs using a prompt- tions to the kernel. Previous work that uses eBPF for provenance
ing interface where they use natural language to instruct the model capture mainly focuses on security [27, 45], whereas our tool only
to answer a question or complete a task [63]. The input to an LLM is captures the information necessary for computational experiment
a natural language expression, called a prompt. The model outputs reproducibility. We call the provenance data captured by our system-
a response to that prompt, also in natural language. If the task is level tool a provenance log.
summarization, the user also provides the document as part of the
prompt. 3.1 Data Preprocessing
Many prior works uncover limitations of LLM summarization [22, For most LLMs, including GPT-4 [37], the context window is limited.
28, 47, 50]. It requires careful prompting to generate useful re- Since many of our provenance logs are larger than the context
sponses [60]. LLM responses are sometimes verbose, redundant, window, we need a more concise representation. Additionally, the
and unclearly organized. Additionally, with current generative AI provenance log we get from the data capture stage is a machine-
models, we cannot guarantee response correctness [8]. Lastly, LLMs readable JSON file. The JSON provenance format is long and verbose,
can process a limited amount of text at one time. The context window which increases the context size. Previous work shows that LLM
is the maximum amount of text a model can process. The context response quality degrades and loses information around the middle
consists of one or more prompt and model response pairs, similar of a document when the context is too long [28].
to a conversation. Since our prompt contains an instruction and We reduce log size by removing duplicate edges and converting
a provenance log, our instruction, provenance log and the model the JSON log to natural language. We perform both the edge re-
response together must be smaller than the context window. The duction and the natural language formatting automatically using
context window is measured in tokens; for GPT models, a token is Python scripts. Both of these methods also provide the benefit of
approximately equal to 4 characters. At the time of this study, the reducing noise in the input to the LLM. Duplicate edge reduction
largest context window available for GPT-4 is 8000 tokens. This helps prevent edges from being erroneously categorized as more
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

1 { " type " : " E n t i t y " , " i d " : " 4 5 5 4 " , " a n n o t a t i o n s " : { " inode_inum " : " 4 5 5 4 " , " uid " : " 0 " , " path " : " /
r o o t / u s r / b i n / p ython3 . 1 1 " } }
2 { " type " : " A c t i v i t y " , " id " : " 2 2 7 8 9 9 " , " annotations " : { " pid " : " 2 2 7 8 9 9 " } }
3 { " t y p e " : " Used " , " t o " : " 4 5 5 4 " , " from " : " 2 2 7 8 9 9 " , " a n n o t a t i o n s " : { " o p e r a t i o n " : " e x e c u t e " , "
datetime ":"2023 −10 −21 2 0 : 0 1 : 5 5 : 4 2 7 " } }

(a) Example machine-readable provenance log.

1 2023 −10 −21 2 0 : 0 1 : 5 5 : 4 2 7 p r o c e s s w i t h p i d : 2 2 7 8 9 9 e x e c u t e d f i l e : / r o o t / u s r / b i n / pyt hon3 . 1 1

(b) Provenance log converted to natural language format.

Figure 3: A provenance log in machine-readable JSON format (Fig. 3a) is converted to natural language format (Fig. 3b). The
machine-readable JSON format size is 88 tokens and the natural language log size is 30 tokens.

important than they are, and the natural language format aligns fine-tuning. Existing work has shown that prompt engineering ef-
more with an LLM’s training corpus than the JSON output of our fectively generates well-written summaries of long-form text [17].
system-level provenance collection tool. 4 Prompt Engineering We use GPT-4-0613 [37], the latest openly
2 Edge Reduction The operating system sometimes produces available model from OpenAI at the time of our study. OpenAI pro-
many system events for a single user action. For example, if a user vides guidelines and strategies for developing prompts [1]. We fol-
is modifying a file using a text editor, the operating system might lowed these guidelines and adjusted our prompts until we achieved
execute multiple writes in a row. Our provenance collection system a desirable output. Using clear and specific instructions achieved
will record each write event as an edge. Conceptually, there is no the best results. In Fig. 4, we show the final prompt we used to
difference between a single large write event and many consecutive generate the summaries for our user study. We discuss how we
small write events. Therefore, we use a simplified version of edge evaluated output quality and how we arrived at our final prompt
aggregation described by Xu et al. [57] to remove repeated edges in §3.3.
from the graph. This reduced log sizes by 43-53% for the logs in our Temperature Parameter Additionally, we set the GPT tempera-
study. ture parameter to 0 to ensure consistent responses. The temperature
3 Natural Language Formatting Through empirical experi- is a randomness control parameter for the GPT model. A lower tem-
ments, we found that converting the JSON logs into natural lan- perature means less randomness and a higher temperature means
guage sentences improved both log size and summary quality. The the outputs will have more variability. Higher temperatures some-
new log format follows a simple natural language structure where times introduces interesting prose and more high-level descriptions,
a short sentence describes each relationship in the graph. For exam- but the responses were inconsistent and more likely to contain false
ple, when a process writes to a file, this is recorded in the log as a information. Setting the temperature to 0, we get responses that
JSON object for each of: a process node, a file node, and an edge that are nearly the same each time, differing by a only few words, if any.
connects the two nodes. We simplify this relationship as “Process p Summarizing Large Provenance Logs Even after preprocessing,
writes to file f ”, where p and f are the identifiers for the process some of our provenance logs still exceeded the model context win-
and the file. Fig. 3 shows an example of the natural language log dow. The GPT-4 context window is 8,000 tokens at the time of our
format conversion. Since we can enumerate all the possible relation- study. In comparison, our processed provenance logs ranged from
ship types in our provenance graph, we can generate a mapping of 3945 to 12815 tokens. In cases where the log was too large, we used
sentences to relationships in the provenance graph. We can then prompt chaining, a technique that has been used for large, complex
automatically generate a log in natural language format, filling tasks [52, 55]. If a log exceeded the size of the context window, we
in the blanks with values from the provenance data. This format divided it into two or more logs. We define break points as edges in
produces higher quality summaries than the machine-readable log, the graph that correspond to a user executing a command. These
using the evaluation criteria in §3.3. The natural language format break points represent a natural break in the log information such
reduces the study provenance log size by an additional 58-63%. In as a user executing a python script from the command line. We
combination these two techniques reduce the logs to between 17 maximize the size of the first chunk and put the remainder in the
and 24% of their original size. The code for generating the natural second chunk, ensuring that the first section of the next chunk
language format is publicly available (details in Appendix A). starts at a break point. The model then summarizes the first section
of the log, and we give the response, the next section of the log, and
a second prompt back to the model, We repeat this process until
3.2 Prompting the model has summarized the entire log. This method generated
high quality summaries using the evaluation method described in
After preprocessing, we use LLM prompting to generate text sum-
§3.3.
maries from the preprocessed provenance logs. Prompt engineering
is the process of designing LLM prompts to achieve a desired re-
sponse. Prompt engineering does not require model training or
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

Figure 4: The prompt we used to generate our user study text summaries. Input provenance log denotes the log specific to each
task.

3.3 Summary Evaluation involves manually writing a “conversation history”. That is, one
There is currently no standard for evaluating LLM-generated sum- writes a prompt and then manually generates a desired response to
maries. Existing methods for evaluating LLM responses use both that prompt. The prompt/response pair is provided to the model as
qualitative and quantitative methods depending on the applica- an example before giving the model a prompt for which you want
tion [12]. Quantitative methods involve statistically measuring re- the model to produce a response. The model is likely to follow the
sponses compared to reference text written by a human. Recent response format from the conversation history when using this
research shows no strong correlation between statistical metrics technique. While the responses generated from prompts that in-
and summary quality [50]. We do not have a strict expected output cluded examples were of high quality, the examples counted against
structure for the text summaries; therefore, the statistical difference the context window limit, leaving fewer tokens available for the
between the generated and reference summaries is not meaningful. real provenance log. We did not use this example technique for
Therefore, we use qualitative methods to evaluate LLM-generated the summaries in our study. Rather, we opted to use a detailed
summaries. We define a rubric with four categories: instruction that uses less of the context window limit as shown in
Fig. 4.
Completeness Is all the necessary information included?
Conciseness Is any unnecessary information included?
Truthfulness Does it include any false information? 4 USER STUDY
Readability Is it easy to read and well formatted? We conduct a user study to evaluate whether users were better able
to understand workflows given text-based provenance summaries
For each of the four categories, we manually assign a score out
than they were when given node-link diagrams. The study uses a
of 4, giving a total score out of 16. We developed a prompt that
mixed methods approach. Participants are quantitatively evaluated
produced summaries that scored 16/16 for each of the provenance
on their ability to answer questions about several computational
logs used in the study. We used the prompt in Fig. 4 to summarize
experiments using only the provenance summarization. We then
logs smaller than the context window. We also use this prompt as
analyze qualitative feedback through long answer text responses
the first prompt in the chaining approach. The prompt (excluding
and audio recordings. Appendix D contains our study materials.
the input provenance) uses only 101 tokens, leaving the rest of the
8K context window for input logs and the output summary.
It took approximately one month of iteration to create our final 4.1 Study Methods
solution. We had many discussions with our team members to The study session consisted of a brief introduction and overview of
develop the rubric, refine the prompt and come to a consensus the study purpose, demographic questionnaire, activity, and post-
on the best responses. To develop our prompt, we started with a activity questionnaire.
basic prompt, “Summarize the following log.” and provided a small Study Activity Each participant completed four tasks. For each
log describing a user executing a python script. We adjusted the task, the participant was given either a node-link diagram prove-
prompt, using different wording and adding further instructions nance summary or an automatically generated text summary repre-
and context. As we increased the size of the input log, the model senting a computational experiment that they had not seen before.
required more specific prompting to steer it in the right direction. Participants used the provenance summary to answer questions
Once we engineered a prompt that consistently produced well about the computational experiment. The questions concerned in-
written summaries, we used this prompt to generate the summaries formation one would need to reproduce said experiment, such as,
in our user study. “Which scripts write to this data file?” and “How many output files
We also experimented with providing examples, another tech- are created during this experiment?”. We describe the study’s com-
nique discussed in the OpenAI guidelines. The example technique putational experiments (workflows) in Table 2. We used the one
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

(a) Average time to complete each task. (b) Average score for each task normalized (c) Normalized NASA Task Load Index
out of 100. (TLX).

Figure 5: Quantitative metrics showing performance using both graph (orange) and text (blue) provenance summaries.

Task IDDescription
Human Robot Interaction
Information Technology

Environmental Science
User executes a Python script and the script
0
creates a plot of input data.
Computer Science

User executes an R preprocessing script fol-

1
Bioinformatics

lowed by a Python model training script.

Visualization
Data Science

User executes a model training script three

2 times, each with a different learning rate passed
Forestry

Physics

as a command line argument.

User executes a script that reads in data but does
ID Occupation
not produce any output files. The user then ed-
1 Lecturer ✓
its the script using the vim text editor. The user
2 Data Scientist ✓ 3
finally executes the edited script, and this time,
3 Graduate Student ✓ ✓ the script execution produces a model check-
4 Data Scientist ✓ ✓ ✓ point file.
5 Graduate Student ✓ Table 2: Workflow description for each task in the user study.
6 Scientist ✓
7 Professor ✓ ✓ ✓
8 Database Admin ✓
9 Graduate Student ✓ ✓ ✓
10 Graduate Student ✓ information provided in the demographic survey, we categorized
11 Graduate Student ✓ ✓ the participants into three expertise categories based on their data
12 Graduate Student ✓ science and programming experience.
Table 1: Participant Fields
4.3 Quantitative Results
We evaluate three performance metrics for each task: score, time to
complete, and perceived cognitive load (Fig. 5). Overall, participants
shot approach described in §3.2 to develop the summaries for the were able to answer most questions correctly regardless of which
first and second tasks and the chaining approach to create the third representation they used. Time to complete and perceived cognitive
and fourth task summaries. We manually generated the node-link load varied across tasks.
diagrams to make them as readable as possible. Details on how we Time to Complete We measured the time to complete each task
manually generated the summaries are in Appendix B. Each task’s by recording how long the participant spent on that page in the
node-link diagrams and text summaries are available in Appen- survey (Fig. 5a). The average time to complete each task was similar
dix D. for tasks 1-3 independent of whether the participants used the text
summary or the node-link diagram summary. In task 3, we see that
4.2 Participants participants using the node-link diagram took slightly longer, on
Twelve people participated in the user study. The participants were average, to complete this task. We suspect this difference occurs
six graduate students, two data scientists, one research scientist, because task 4 contained the only long answer question that asks
one professor, one lecturer, and one database administrator. Their “why?” rather than “what?”. This question required participants to
fields spanned data science, bioinformatics, environmental science, consider the larger picture of the whole provenance graph and its
and forestry. Table 1 shows the complete list of fields. Using the implications.
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

(a) Which provenance summarization was more useful? (b) Which provenance summarization was more enjoyable?

Figure 6: Participants rank their preference for either the text summarization or node link diagram. Participants are categorized
by their computational expertise.

Question Score Each participant successfully answered most ques- At first glance, there is no trend in either direction. Some partic-
tions correctly regardless of representation (Fig. 5b). For tasks 1-3, ipants strongly prefer the node-link diagram, and others strongly
at least 8 out of 12 participants scored 100% and 11 out of 12 scored prefer the text, with a few in the middle. But, when we include par-
over 70% using either the text or node-link diagram summary. The ticipants’ overall experience with research programming, a stronger
scores were lowest for task 4, where only one participant scored trend emerges. Users with little experience (blue) find the text both
perfectly, although 10 out of 12 participants scored over 75%. The more useful and enjoyable. Users with intermediate (pink) to ad-
participants who scored the highest on this task were able to answer vanced (yellow) experience varied in whether they found the text
the multistep reasoning more easily with the text-based provenance or node-link summary more useful, but tended more towards the
summary than they were with the graph-based one. We assigned a graph in terms of enjoyment. We uncover some explanations for
score for this question manually, giving two points if they answered these trends using the long answer survey responses and audio
correctly with plausible reasoning and one point for partially cor- transcriptions. We outline the prominent themes below. In the fol-
rect responses (i.e. correct reasoning but incorrect answer or vice lowing sections, we refer to participants by number (eg. P0) to
versa). All other questions in the study had only one answer and preserve anonymity.
were marked as either correct or incorrect.
Text summaries are accessible for all expertise levels. As observed
Perceived Cognitive Load We measure perceived cognitive load
in Fig. 6a, users with less computational expertise preferred the text
using the NASA Task Load Index (TLX) standard scoring system
summaries. Less experienced participants were more comfortable
[19]. After each task, participants recorded their response to the
and less overwhelmed with the text summaries. For instance, P6
TLX questions in Appendix C. As with the other quantitative met-
felt the graphs required some background knowledge they did not
rics, the cognitive load scores are similar when comparing the two
have.
summarization methods (Fig. 5c).
The quantitative results show no obvious difference in overall Text summaries tell a story. P6 described the text summaries as
performance when using the text or the node-link summary. We “Text reads more like a storyline, which is more intuitive for me”. Mul-
begin to see larger differences when we look at user preference and tiple participants found the text summaries followed a logical order.
qualitative feedback. P12, who studies bioinformatics, remarked that they are required
to closely follow written protocols in their work. This experience
translated well to understanding the text summaries, which have
4.4 Qualitative Results a similar format to a written experiment protocol. However, the
graph differed from any data format they were familiar with and
In the post-activity survey, we asked questions regarding the entire
required more effort to understand. Advanced users, many of whom
study experience. At this point, the participants have completed two
found the graph more enjoyable, still appreciated aspects of the text
tasks using a text provenance summary and two tasks with a node-
summary. P8 notes “The text format felt more useful in identifying
link provenance summary. We asked them to rate their preference
the workflow steps in order.”
for either summary technique using a 5-level Likert scale [26]. The
participants recorded whether either was more useful during the Text summary requires attention to detail. Several participants
activities and whether either was more enjoyable than the other. who preferred the node-link diagram summarization found the text
We show the results of these questions in Fig. 6b. too long to read. P10 found it less enjoyable to “read through each
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

sentence and remember what is being done at each step”. While the The participants’ enthusiasm while sharing feedback on repro-
length of prose was a barrier for some participants, other partic- ducibility tools sparks optimism for future research in provenance
ipants found it helpful. Particularly, users who are confident and and reproducibility.
often read written protocols were comfortable extracting informa-
tion from the text summaries. 5.1 Design Recommendations
The text summaries lacked some structure compared to the
In the post-activity questionnaire, we asked participants if there
graphs. As in natural prose, the subjects and verbs do not appear are additional features they would like for a reproducibility tool
in the same place in each sentence in the text summaries. For P9, and if there is anything that would prevent them from using a
“The text summary tended to jump around more and was difficult to reproducibility tool. We give several recommendations based on
follow.” We discuss alternative text summary formats in §5.2. our takeaways from the qualitative analysis. These guidelines can
also be applied more broadly to any tools that assist with user
Advanced Users Identify Patterns in Graph. Users with high com-
comprehension of experimental workflows.
putational expertise often preferred the graph format. Many users
Visual Features For both visualization-based and text-based sum-
in this category enjoyed the extra details and workflow visualiza-
marizations, several users noted they would like highlighting, zoom-
tion for identifying relationships and patterns. P8 found the graph
ing, panning, and search. As P7 describes, “adding colors to file
“made it easier to identify relationships between different components.”
names, and scripts/outputs/paths [...] would make it more readable.”
Similarly, P9 found that when using the graph “it was easier to see
P3 also mentions “highlighting of linked routes (when you hover over
repeated steps and patterns.”
an item it shows all the related items)”. In graph summaries, users
Users noted that the text summary was harder to skim and
complained of difficulty tracing the edges between nodes. In text
quickly extract information. As such, several participants iden-
summaries, several participants noted that the text required users
tified areas where the text could be improved, potentially affording
to read the entire entry, sometimes multiple times, to ensure they
similar benefits to the graph. Several participants noted that key-
did not miss any details. We imagine that simply color coding and
word highlighting in the text might allow pattern matching similar
bolding keywords such as verbs (e.g., read, write, execute) and file
to the graph. We discuss this further in §5.1.
paths would help users to extract important details more quickly
Text summary can help users to get up to speed on node-link dia- and easily.
grams. Several users noted they would like to see both provenance Hide Low-Level Details With either provenance summarization,
representations in a real application. For less experienced users, users still felt overloaded with information on first impressions. P3
suggested “the ability to have hierarchical drop down tree to help
some noted they could use the text summary to help understand the
organize larger amounts of data”. Similarly, P9 wanted the tool to
node-link diagram. P6 would prefer to have “both text and [node-
link diagram] side by side [...] so that I could eventually learn how “allow the user to ‘zoom in’ on different parts of the experiment”.
to read [the node-link diagrams] with some practice.” Even users The option to view a high-level summary first and expand on the
who preferred the graph noted that “a plain or natural language details later might reduce the initial cognitive overload and make
commentary is always useful [alongside the graph]” (P9). the summaries more approachable.
Integration with Existing Tools Four participants expressed in-
terest in integration with tools they already use, such as Git [10] or
4.5 Remote Study RMarkdown [3]. Several would have liked a provenance summa-
We released a second version of our user study as an online survey rization directly linked to their code repository. Others mentioned
and allowed participants to complete the survey on their own. The it would help them to understand previous experiments if they inte-
remote version of the study had minor changes from the in-person grated a provenance summary into their computational notebook.
version including small changes to question wording, two addi- Installation and Use Overheads Many participants mentioned
tional demographic questions and an additional long-form answer that they would be unlikely to use any tool if the overhead for use
question for task 2. 10 participants completed the remote study. We was too high. This overhead includes installation and workflow
did not see any significant trends across the quantitative metrics. modifications. P6 noted “if set up would slow me down a lot, I might
All the participants that completed the remote study were cate- be less likely to use it.” Specifically, barriers include having to rewrite
gorized as intermediate or expert in their computational and data any of their existing code or switching programming environments.
science expertise. Participants’ overall preference for the text sum-
marization versus the node link diagram was similar to the initial 5.2 Text Summary Limitations and
study. The qualitative feedback matched the themes we identified Improvements
in the first study. Several participants remarked that they enjoyed
the visual cues in the node link diagram but also found the text Although we cannot guarantee perfect summaries using current
useful for answering questions about what happened during an models, our positive results using a generic large language model
experiment. leave us hopeful. We expect that using a domain specific LLM,
trained on experiment provenance data, would be better still. For
our user study, we generated text summaries using the out-of-the-
5 DISCUSSION & FUTURE WORK box GPT-4 model from OpenAI [37] with no fine-tuning. GPT-4
Our qualitative analysis yields several areas of improvement for is closed-source, and we assume OpenAI trained it with general-
text-based provenance summaries as well as reproducibility tools. purpose data. We expect that training or fine-tuning a model using
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

provenance graphs and summary data would reduce false infor-

mation and give more consistent structure. Fine-tuning involves
training a pre-trained model on a smaller task-specific dataset. We
envision using our evaluation criteria from §3.3 in this task-specific
dataset. Additionally, LLM providers such as OpenAI and Meta
are frequently releasing improved models with larger context win-
dows [51]. At the time of writing, OpenAI has already released a
new version of GPT-4 with a 128K context window [20] As such,
we are optimistic for a path forward.
With LLM limitations in mind, we also note that there are sev-
eral ways one could automatically generate similar, or even better,
summaries. The simplest method is to use rule-base methods to (a) Macko and Seltzer use node-link diagrams to visualize
populate text fields in a structured report [49]. A significant bene- provenance data.
fit of rule-based text generation is that it does not require model
training. Additionally, rule-based methods address some concerns
mentioned by study participants who wanted more structure in the
text. Conversely, the rigid structure of rule-based methods could
decrease the output quality, making the summaries less compelling
to read.
Lastly, extending the summary generation using LLMs, several
study participants expressed interest in using a tool similar to Chat-
GPT for assistance with reproducing experiments. ChatGPT is Ope-
nAI’s interactive tool for model prompting [36]. In this interactive
mode, users would be able to “chat” with the model to ask ques-
tions about the experiment and how to reproduce it. Querying the
provenance through this interface might reduce some confusion
for users were overwhelmed with too much data in the summaries. (b) Borkin et al. use a radial diagram to show file system
Assuming we could fine-tune a model for provenance summariza- provenance.
tion, one could imagine such a tool for querying provenance graphs.
The benefits would be that users could express queries in natural
language and get nuanced, simple responses.

6 RELATED WORK
Our work focuses on determining how receptive users are to a tool
conveying information from provenance. Given that we propose a
technique using LLMs to do so, we examine prior work on visualiz-
ing and summarizing provenance, as well as LLM summarization
techniques and limitations.
(c) Schreiber and Struminski use comics to present smart-
watch provenance data.
6.1 Provenance Graph Visualization
Provenance data are historically displayed using node-link dia- Figure 7: We compare the most common node-link visualiza-
grams [15, 21, 30]. Some applications such as Probe-It [15] include tion (Fig. 7a) with two alternative approaches (Fig. 7c, Fig. 7b).
additional views, graph-style visualizations have practically be-
come standard practice. Many tools store provenance data in graph
databases, e.g., Neo4j [35], and then use the tools that accompany
those systems or other graph-centric tools, e.g., GraphViz [16], to
explicitly represent provenance data. However, generic graph tools our node-link diagrams rather that use existing tools to generate
often produce illustrations that are cluttered and difficult to read. the graphs. We made this decision because we use a different prove-
Provenance tools for experimental workflow tracking also, unsur- nance abstraction than the language-level tools and some applica-
prisingly, use node-link diagram illustrations. Vistrails [5] captures tion specific provenance visualization tools such as VisTrails [5].
provenance for workflows in their applications and displays the Additionally, the graphs we generated using GraphViz [16] and
provenance data using node-link diagrams. Users must execute their Neo4J database viewer [35] were not well organized and did not
entire workflow in the Vistrails application to capture provenance. display all the information necessary for reproducibility comprehen-
Language-level provenance tools common in research program- sion. Therefore, we did not think it would be a fair comparison to
ming, such as RDataTracker [25] and noWorkflow [41], also use use these in the study. We manually created the graphs in our study
node-link diagrams. For our study, we chose to manually generate to highlight workflow-level detail necessary for reproducibility.
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

6.2 Alternative Provenance Summarization 7 CONCLUSION

Some prior work eschewed the node-link diagram and provided Our study demonstrates the viability of automatically generating
alternative visualizations of provenance data. Schreiber and Stru- text-based provenance summaries for enhancing user understand-
minski use comics to describe user sensor data, such as metrics from ing of data science workflows. We evaluate the text summarization
a smartwatch [44]. The comic provenance visualization in Fig. 7c approach with a user study, comparing text summaries to node-
provides an easy-to-read, high-level summary of self-tracking sen- link diagrams for presenting provenance data to users. We found
sor data [44]. Borkin et al. used a radial representation to represent that users with less programming expertise often prefer the text
file system provenance data (Fig. 7b [6]). They compare this radial summaries as they adopt a familiar format that is more approach-
representation to the node-link representation of Orbiter [30] and able. In contrast, users with advanced programming experience
find that system administrators are better able to interpret system enjoyed the extra details and visual cues provided in the graphs. We
level behavior using the radial diagrams. Similar to what we ob- provide direction to improve both the text and node-link diagram
served with text-based summaries, these tools lowered the cognitive summarizations in our qualitative analysis of our survey results.
load for some users while performing tasks. VinciDecoder [48] uses Not only do our results demonstrate the effectiveness of the textual
machine learning to summarize provenance graphs into forensic summaries, but the results also provide insight into how to present
reports. These forensic reports are also similar to our work, as the provenance data to enhance user comprehension.
reports are text-based and use machine learning techniques to pro-
duce them. However, since our summarizations are LLM-generated,
our reports have less regular structure than VinciDecoder’s. One ACKNOWLEDGMENTS
could envision adopting similar generation techniques to create We acknowledge the support of the Natural Sciences and Engi-
more structured experimental workflow provenance summariza- neering Research Council of Canada (NSERC). Nous remercions le
tions. While these visualization techniques perform well in some Conseil de recherches en sciences naturelles et en génie du Canada
domains, none are designed or evaluated specifically for computa- (CRSNG) de son soutien.
tional experiment comprehension.

6.3 LLM Summarization REFERENCES

[1] [n.d.]. Open AI Prompt Engineering Guide. https://platform.openai.com/docs/
Summarization is a popular application of large language mod- guides/prompt-engineering. Accessed: 2023-11-22.
els. Zhang et al. and Goyal et al. use LLMs to summarize new [2] (accessed January 23, 2024). eBPF. online. https://ebpf.io/..
[3] JJ Allaire, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi,
articles [17, 62]. The authors find that users prefer LLM-generated Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and
summaries to other summarization models [17] and like them much Richard Iannone. 2023. rmarkdown: Dynamic Documents for R. https://github.
as human-written summaries [62]. Similarly, Laskar et al. evalu- com/rstudio/rmarkdown R package version 2.25.
[4] Monya Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature 533,
ated how well various LLMs summarize meeting notes [24] and 7604 (2016), 452–454.
ultimately deployed such summaries in a real-world setting. Other [5] L. Bavoil, S.P. Callahan, P.J. Crossno, J. Freire, C.E. Scheidegger, C.T. Silva, and H.T.
work introduces new techniques for prompting LLMs to summarize Vo. 2005. VisTrails: enabling interactive multiple-view visualizations. In VIS 05.
IEEE Visualization, 2005. 135–142. https://doi.org/10.1109/VISUAL.2005.1532788
even longer text. Wu et al. propose their “Extract-then-Evaluate” [6] Michelle A. Borkin, Chelsea S. Yeh, Madelaine Boyd, Peter Macko, Krzysztof Z.
method [56], while Chang et al. and Zhang et al. demonstrate tech- Gajos, Margo Seltzer, and Hanspeter Pfister. 2013. Evaluation of Filesystem
Provenance Visualization Tools. IEEE Transactions on Visualization and Computer
niques based on iterative and incremental updates [11, 61]. Graphics 19, 12 (2013), 2476–2485. https://doi.org/10.1109/TVCG.2013.155
Despite the successes found in LLM summarization, they are not [7] Nichole Boufford. 2023. Thoth. https://github.com/ubc-systopia/thoth.
without their drawbacks. Yang et al. demonstrated that the predic- [8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
tions ChatGPT provides in a mental health care setting are unstable Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
and can change based on tweaking the severity of adjectives used Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
in the prompt [58]. They also found that while LLMs are capable of Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
providing explanations for their answer, those explanations are not Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners.
always correct and do not mean that the models are interpretable. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
[9] Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripdu-
Meanwhile, Shen et al. tested LLMs for automatic evaluation of man Sohan, Margo Seltzer, and Andy Hopper. 2014. A Primer on Provenance:
summaries, and found them inconsistent and not yet at the level Better Understanding of Data Requires Tracking Its History and Context. Queue
of human evaluators [47]. Liu et al. observe that LLMs frequently 12, 3 (2014), 10–23. https://doi.org/10.1145/2602649.2602651
[10] Scott Chacon and Ben Straub. 2014. Pro git. Apress.
overlook information in the middle of a document [28]. Tang et al. [11] Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2023. BooookScore: A
evaluates medical journal summaries produced by GPT 3.5 [38] and systematic exploration of book-length summarization in the era of LLMs. arXiv
ChatGPT [36] and demonstrate that the GPT-generated summaries preprint arXiv:2310.00785 (2023).
[12] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao
are often untruthful or used indecisive language that could lead Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang,
to misinformation. Unlike provenance logs, these summarizations’ Philip S. Yu, Qiang Yang, and Xing Xie. 2023. A Survey on Evaluation of Large
Language Models. arXiv:2307.03109 [cs.CL]
inputs (news articles, paper abstracts) are pure natural language [13] Fernando Chirigati, Rémi Rampin, Dennis Shasha, and Juliana Freire. 2016. Re-
sentences and paragraphs, which is what led us to create the natural proZip: Computational Reproducibility With Ease. In Proceedings of the 2016
language format from our provenance logs (§3). To our knowledge, International Conference on Management of Data (San Francisco, California, USA)
(SIGMOD ’16). ACM, 2085–2088. https://doi.org/10.1145/2882903.2899401
there is no published work on using LLMs for summarizing prove- [14] April Clyburne-Sherin, Xu Fei, and Seth Ariel Green. 2019. Computational
nance logs or log-structured data. reproducibility via containers in social psychology. Meta-Psychology 3 (2019).
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

[15] Nicholas Del Rio and Paulo Pinheiro Da Silva. 2007. Probe-it! visualization support [41] João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire.
for provenance. In International Symposium on Visual Computing. Springer, 732– 2017. NoWorkflow: A Tool for Collecting, Analyzing, and Managing Provenance
741. from Python Scripts. Proc. VLDB Endow. 10, 12 (aug 2017), 1841–1844. https:
[16] John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C North, and Gordon //doi.org/10.14778/3137765.3137789
Woodhull. 2002. Graphviz—open source graph drawing tools. In Graph Drawing: [42] João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire.
9th International Symposium, GD 2001 Vienna, Austria, September 23–26, 2001 2019. A large-scale study about quality and reproducibility of jupyter notebooks.
Revised Papers 9. Springer, 483–484. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories
[17] Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2023. News Summarization and (MSR). IEEE, 507–517.
Evaluation in the Era of GPT-3. arXiv:2209.12356 [cs.CL] [43] Chandrasekhar Ramakrishnan, Michele Volpi, Fernando Perez-Cruz, Lilian Gasser,
[18] Philip J Guo and Margo I Seltzer. 2012. Burrito: Wrapping your lab notebook in Firat Ozdemir, Patrick Paitz, Mohammad Alisafaee, Philipp Fischer, Ralf Gruben-
computational infrastructure. (2012). mann, Eliza Jean Harris, et al. 2023. Renku: a platform for sustainable data science.
[19] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX In Thirty-seventh Conference on Neural Information Processing Systems Datasets
(Task Load Index): Results of Empirical and Theoretical Research. In Human and Benchmarks Track.
Mental Workload, Peter A. Hancock and Najmedin Meshkati (Eds.). Advances [44] Andreas Schreiber and Regina Struminski. 2017. Visualizing Provenance using
in Psychology, Vol. 52. North-Holland, 139–183. https://doi.org/10.1016/S0166- Comics. In 9th USENIX Workshop on the Theory and Practice of Provenance (TaPP
4115(08)62386-9 2017). USENIX Association, Seattle, WA. https://www.usenix.org/conference/
[20] https://openai.com/blog/new-models-and-developer-products-announced-at de- tapp17/workshop-program/presentation/schreiber
vday. 2023. New models and developer products announced at DevDay. blog [45] R Sekar, Hanke Kimm, and Rohit Aich. 2023. eAudit: A Fast, Scalable and De-
post. ployable Audit Data Collection System. In 2024 IEEE Symposium on Security and
[21] Jane Hunter and Kwok Cheung. 2007. Provenance Explorer-a graphical interface Privacy (SP). IEEE Computer Society, 87–87.
for constructing scientific publication packages from provenance trails. Interna- [46] Omid Setayeshfar, Christian Adkins, Matthew Jones, Kyu Hyung Lee, and
tional Journal on Digital Libraries 7 (2007), 99–107. Prashant Doshi. 2021. Graalf: Supporting graphical analysis of audit logs for
[22] Shaoxiong Ji, Tianlin Zhang, Kailai Yang, Sophia Ananiadou, and Erik Cambria. forensics. Software Impacts 8 (2021), 100068.
2023. Rethinking Large Language Models in Mental Health Applications. arXiv [47] Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, and Lidong Bing.
preprint arXiv:2311.11267 (2023). 2023. Large Language Models are Not Yet Human-Level Evaluators for Abstrac-
[23] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, tive Summarization. In Findings of the Association for Computational Linguistics:
Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason EMNLP 2023. 4215–4233.
Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, [48] Azadeh Tabiban, Heyang Zhao, Yosr Jarraya, Makan Pourzandi, and Lingyu
and Jupyter development team. 2016. Jupyter Notebooks - a publishing format Wang. 2023. VinciDecoder: Automatically Interpreting Provenance Graphs Into
for reproducible computational workflows. In Positioning and Power in Academic Textual Forensic Reports With Application To OpenStack. In Secure IT Systems:
Publishing: Players, Agents and Agendas, Fernando Loizides and Birgit Scmidt 27th Nordic Conference, NordSec 2022, Reykjavic, Iceland, November 30–December
(Eds.). IOS Press, Netherlands, 87–90. https://eprints.soton.ac.uk/403913/ 2, 2022, Proceedings (Reykjavic, Iceland). Springer-Verlag, Berlin, Heidelberg,
[24] Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, and Shashi Bhushan 346–367. https://doi.org/10.1007/978-3-031-22295-5_19
Tn. 2023. Building Real-World Meeting Summarization Systems using Large [49] Azadeh Tabiban, Heyang Zhao, Yosr Jarraya, Makan Pourzandi, Mengyuan Zhang,
Language Models: A Practical Perspective. arXiv preprint arXiv:2310.19233 (2023). and Lingyu Wang. 2022. ProvTalk: towards interpretable multi-level provenance
[25] B.S. Lerner and E.R. Boose. 2014. RDataTracker: Collecting Provenance in an In- analysis in networking functions virtualization (NFV). In The Network and Dis-
teractive Scripting Environment. In USENIX Workshop on the Theory and Practice tributed System Security Symposium 2022 (NDSS ’22).
of Provenance (TaPP). https://doi.org/10.1007/978-3-319-16462-5_36 [50] Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G Nestor, Ali Soroush, Pierre A Elias,
[26] Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Ziyang Xu, Ying Ding, Greg Durrett, Justin F Rousseau, et al. 2023. Evaluating
psychology (1932). large language models on medical evidence summarization. npj Digital Medicine
[27] Soo Yee Lim, Bogdan Stelea, Xueyuan Han, and Thomas Pasquier. 2021. Secure 6, 1 (2023), 158.
namespaced kernel audit for containers. In Proceedings of the ACM Symposium [51] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas-
on Cloud Computing. 518–532. mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos-
[28] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, ale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv
Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models preprint arXiv:2307.09288 (2023).
use long contexts. arXiv preprint arXiv:2307.03172 (2023). [52] Dietrich Trautmann. 2023. Large Language Model Prompt Chaining for Long
[29] Yucheng Low, Ajit Banerjee, and Rajat Arya. 2021. XetHub. https://about.xethub. Legal Document Classification. arXiv preprint arXiv:2308.04138 (2023).
com/ [53] Ana Trisovic, Matthew Lau, Thomas Pasquier, and Merce Crosas. 2022. A large-
[30] Peter Macko and Margo Seltzer. 2011. Provenance Map Orbiter: Interactive scale study on research code quality and execution. Scientific Data 9 (02 2022),
Exploration of Large Provenance Graphs. In 3rd USENIX Workshop on the Theory 60. https://doi.org/10.1038/s41597-022-01143-6
and Practice of Provenance (TaPP 11). USENIX Association, Heraklion, Crete [54] Joseph Wonsil, Nichole Boufford, Prakhar Agrawal, Christopher Chen, Tianhang
Greece. https://www.usenix.org/conference/tapp11/provenance-map-orbiter- Cui, Akash Sivaram, and Margo Seltzer. 2023. Reproducibility as a service.
interactive-exploration-large-provenance-graphs Software: Practice and Experience 53, 7 (2023), 1543–1571. https://doi.org/10.1002/
[31] Andrew Mleczko, Sebastian Schuberth, Lars Schneider, and Brian M. Carlson. spe.3202 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.3202
2021. git-lfs. https://github.com/git-lfs/git-lfs [55] Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina,
[32] Kiran-Kumar Muniswamy-Reddy, David Holland, Uri Braun, and Margo Seltzer. Michael Terry, and Carrie J Cai. 2022. Promptchainer: Chaining large language
2006. Provenance-Aware Storage Systems. USENIX (01 2006), 43–56. model prompts through visual programming. In CHI Conference on Human Factors
[33] T. Munzner. 2015. Visualization Analysis and Design. CRC Press. 210 pages. in Computing Systems Extended Abstracts. 1–10.
https://books.google.de/books?id=NfkYCwAAQBAJ [56] Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, and Estevam Hr-
[34] National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility uschka. 2023. Less is More for Long Document Summary Evaluation by LLMs.
and Replicability in Science. The National Academies Press, Washington, DC. arXiv preprint arXiv:2309.07382 (2023).
https://doi.org/10.17226/25303 [57] Zhang Xu, Zhenyu Wu, Zhichun Li, Kangkook Jee, Junghwan Rhee, Xusheng
[35] Neo4j. 2012. Neo4j. http://neo4j.org/ Xiao, Fengyuan Xu, Haining Wang, and Guofei Jiang. 2016. High Fidelity Data
[36] OpenAI. 2023. ChatGPT. https://chat.openai.com/chat Reduction for Big Data Security Dependency Analyses. In Proceedings of the 2016
[37] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] ACM SIGSAC Conference on Computer and Communications Security (Vienna,
[38] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Austria) (CCS ’16). Association for Computing Machinery, New York, NY, USA,
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schul- 504–516. https://doi.org/10.1145/2976749.2978378
man, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Pe- [58] Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, and Sophia
ter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language Ananiadou. 2023. Towards interpretable mental health analysis with large lan-
models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL] guage models. In Proceedings of the 2023 Conference on Empirical Methods in
[39] Thomas Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David Eyers, Natural Language Processing. 6056–6077.
Margo Seltzer, and Jean Bacon. 2017. Practical Whole-System Provenance Capture. [59] Vahan Yoghourdjian, Yalong Yang, Tim Dwyer, Lee Lawrence, Michael Wybrow,
In Symposium on Cloud Computing (SoCC’17). ACM, ACM. and Kim Marriott. 2020. Scalability of network visualisation from a cognitive
[40] Thomas Pasquier, Matthew K. Lau, Ana Trisovic, Emery R. Boose, Ben Couturier, load perspective. IEEE transactions on visualization and computer graphics 27, 2
Mercè Crosas, Aaron M. Ellison, Valerie Gibson, Chris R. Jones, and Margo (2020), 1677–1687.
Seltzer. 2017. If these data could talk. Nature Scientific Data 4 (2017). https: [60] J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang.
//www.nature.com/articles/sdata2017114 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design
LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Nodes representing the same file are not duplicated, so it is clear
Machinery, New York, NY, USA, Article 437, 21 pages. https://doi.org/10.1145/ in the graph when a process reads a file that another process created.
3544548.3581388
[61] Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023. SummIt: Iterative Text In this situation when multiple processes have edges to a node, the
Summarization via ChatGPT. arXiv preprint arXiv:2305.14835 (2023). order the arrows point to the node are in execution order. The arrow
[62] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown,
and Tatsunori B. Hashimoto. 2023. Benchmarking Large Language Models for
attached at the top executed first, moving downwards to the last
News Summarization. arXiv:2301.13848 [cs.CL] arrow at the bottom which is the operation executed last.
[63] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou,
Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang,
Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang,
C TASK LOAD INDEX QUESTIONS
Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large (1) How mentally demanding was this task? (1-Very Low, 5-Very
Language Models. arXiv:2303.18223 [cs.CL]
High)
(2) How hurried or rushed were you during this task? (1-Very
A AVAILABILITY Low, 5-Very High)
The work presented in this paper is open-source. Detailed installa- (3) How successful would you rate yourself in accomplishing
tion instructions are available online. this task? (1-Perfect, 5-Failure)
(4) How hard did you have to work to accomplish your level of
• System Provenance Collection Tool. Available for download
performance? (1-Very Low, 5-Very High)
under GPL-2.0 license at https://github.com/ubc-systopia/
(5) How insecure, discouraged, irritated, stressed, and annoyed
thoth.
were you? (1-Very Low, 5-Very High)
• Provenance Summarization. Available for download under
(6) How useful was the provenance summary in answering the
Apache 2.0 license at https://doi.org/10.5281/zenodo.10672536.
questions? (1-Very Useful, 5-Not Useful)
• The user study documents are available at https://doi.org/10.
5281/zenodo.106725369.
D STUDY ACTIVITIES
B DIAGRAM CREATION D.1 Questions
It is common for graph users to visualize their data as node-link D.1.1 Task 1.
graphs with tools like Neo4j [35] and GraphViz [16]. However, these (1) What is the name of the dataset the student is using?
general-purpose tools are not a sufficient fit for this study. They (2) Which directory is the dataset saved in?
do not readily show in a static way all the necessary attributes for (3) What is the name of the file containing the experiment code?
nodes and edges a participant needs to see to answer the questions (4) Which directory is the experiment code located in?
from our tasks. Additionally, since we created automatically gener- (5) How many output files are produced? (Include intermediate
ated text summaries tailored towards reproducibility and workflow outputs)
executions, it would not be a fair comparison to use a general- (6) Which programming languages are used to conduct the anal-
purpose visualization. Therefore, we chose to make our own dia- ysis in this experiment?
grams.
D.1.2 Task 2.
We devised a new type of node-link diagram tailored towards
displaying information from provenance logs about a workflow. (1) How many times is the script train_model.py executed?
We manually created our node-link diagram visualizations using (2) How many times is the script preprocess.R executed?
a set of pre-defined rules rather than write a program to do so (3) Which scripts write to the file data.csv?
automatically. Making them manually allowed us a finer grain of (4) Which scripts read from the file data.csv?
control over the various aspects of the diagram to ensure legibility; (5) Which scripts write to the file temp_data.csv?
however, we believe the process could be automated with some (6) Which scripts read from the file "temp_data.csv"?
effort. (7) Which of the following are dependencies of train_model.py?
Our node-link diagrams highlight the processes that comprise D.1.3 Task 3.
the computational workflow. Overall, our diagrams display events (1) Where is the dataset located?
in order they executed from top to bottom; however, the operations (2) How many output files were created during this experiment
are not displayed proportional to the time they occurred, only the (including intermediate files)?
order. We represent each process with a large arrow that points (3) Please explain the difference between the first and second
downward to indicate the order of execution. Each process’ arrow executions of the train_model.py script in no more than two
receives a unique color, except for instances where that process has sentences.
spawned additional processes. The child processes are large arrows
of the same color placed directly to the right of the original process. D.1.4 Task 4.
The top of the arrow has a block containing the command that (1) What is the name of the dataset the student is using?
initiated the process. Smaller arrows attached to the right side of a (2) Which directory is the dataset saved in?
process show the various system operations the process performed (3) What is the name of the file containing the experiment code?
over the course of its existence. These arrows represent edges to (4) Which directory is the experiment code located in?
other nodes, such as files it reads or writes, libraries it loads, or (5) How many output files are produced? (Include intermediate
programs it executes. outputs)
Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

(6) Which programming languages are used to conduct the anal- D.3 Node Link Diagrams
ysis in this experiment? Fig. 12, Fig. 13, Fig. 14, Fig. 15 show the node link diagrams we
created and used in our study.
D.2 Text Summaries
Fig. 8, Fig. 9, Fig. 10, Fig. 11 show the text summaries used in our Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009
study. We used the single prompt method for task 1 and 2 and the
chaining method for task 2 and 3.
ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

Figure 8: Workflow 1 Text Summary

Figure 9: Workflow 2 Text Summary

Figure 10: Workflow 3 Text Summary

Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

Figure 11: Workflow 4 Text Summary

ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

Figure 12: Workflow 1 Node Link Diagram

Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

Figure 13: Workflow 2 Node Link Diagram

ACM REP ’24, June 18–20, 2024, Rennes, France Boufford et al.

Figure 14: Workflow 3 Node Link Diagram

Computational Experiment Comprehension using Provenance Summarization ACM REP ’24, June 18–20, 2024, Rennes, France

Figure 15: Workflow 4 Node Link Diagram

100 NLP Questions
100% (5)
100 NLP Questions
23 pages
2503.07080v3
No ratings yet
2503.07080v3
20 pages
Galaxy: A Comprehensive Approach For Supporting Accessible, Reproducible, and Transparent Computational Research in The Life Sciences
No ratings yet
Galaxy: A Comprehensive Approach For Supporting Accessible, Reproducible, and Transparent Computational Research in The Life Sciences
13 pages
Scientific Modeling With Massively Parallel SIMD Computers
No ratings yet
Scientific Modeling With Massively Parallel SIMD Computers
12 pages
Bachelor Thesis Informatik PDF
100% (3)
Bachelor Thesis Informatik PDF
8 pages
Oommen - NPJ - Rethinking Materials Simulations - Blending Direct Numerical Simulations With Neural Operators
No ratings yet
Oommen - NPJ - Rethinking Materials Simulations - Blending Direct Numerical Simulations With Neural Operators
14 pages
Boet Tiger 2015
No ratings yet
Boet Tiger 2015
9 pages
Nextflow in Bioinformatics Executors Performance - 2023 - Future Generation Co
No ratings yet
Nextflow in Bioinformatics Executors Performance - 2023 - Future Generation Co
12 pages
Pid 5184483
No ratings yet
Pid 5184483
6 pages
Entropy 23 01123 v2
No ratings yet
Entropy 23 01123 v2
21 pages
Evaluation of Evolutionary Programming
No ratings yet
Evaluation of Evolutionary Programming
4 pages
Final Short Success Prediction in A MOOC Environment
No ratings yet
Final Short Success Prediction in A MOOC Environment
7 pages
DFT Scan Cells Network Design
No ratings yet
DFT Scan Cells Network Design
3 pages
High-Speed Tracking With Kernelized Correlation Filters
No ratings yet
High-Speed Tracking With Kernelized Correlation Filters
14 pages
A Practical Survey On Faster and Lighter Transformers - 2023 - Fournier Et Al
No ratings yet
A Practical Survey On Faster and Lighter Transformers - 2023 - Fournier Et Al
40 pages
材料研究的自主性：碳纳米管生长的案例研究
No ratings yet
材料研究的自主性：碳纳米管生长的案例研究
6 pages
Bachelor Thesis Informatik Thema
100% (2)
Bachelor Thesis Informatik Thema
8 pages
Mathematical Problems in Engineering - 2020 - Yin - Wasserstein Generative Adversarial Network and Convolutional Neural
No ratings yet
Mathematical Problems in Engineering - 2020 - Yin - Wasserstein Generative Adversarial Network and Convolutional Neural
16 pages
Comparing The Partition Table and Hash Tables With ARC: Esocoh
No ratings yet
Comparing The Partition Table and Hash Tables With ARC: Esocoh
6 pages
MSR 2017 B
No ratings yet
MSR 2017 B
12 pages
Master Thesis Neural Network
100% (1)
Master Thesis Neural Network
4 pages
Region-Based_Convolutional_Networks_for_Accurate_Object_Detection_and_Segmentation
No ratings yet
Region-Based_Convolutional_Networks_for_Accurate_Object_Detection_and_Segmentation
17 pages
Artificial Intelligence in Pharma-Ceutical Product Formulation: Neural Computing
No ratings yet
Artificial Intelligence in Pharma-Ceutical Product Formulation: Neural Computing
10 pages
Controllable Data Generation by Deep Learning: A Review: Shiyu Wang Yuanqi Du Xiaojie Guo Bo Pan Zhaohui Qin Liang Zhao
No ratings yet
Controllable Data Generation by Deep Learning: A Review: Shiyu Wang Yuanqi Du Xiaojie Guo Bo Pan Zhaohui Qin Liang Zhao
38 pages
Expdbidbook
No ratings yet
Expdbidbook
31 pages
KidneyGridPortal-MGC2006
No ratings yet
KidneyGridPortal-MGC2006
7 pages
Computer Physics Communications: Vei Wang, Nan Xu, Jin-Cheng Liu, Gang Tang, Wen-Tong Geng
No ratings yet
Computer Physics Communications: Vei Wang, Nan Xu, Jin-Cheng Liu, Gang Tang, Wen-Tong Geng
19 pages
Practical Design Space Exploration: 1 Luigi Nardi 2 David Koeplinger 3 Kunle Olukotun
No ratings yet
Practical Design Space Exploration: 1 Luigi Nardi 2 David Koeplinger 3 Kunle Olukotun
12 pages
TorchDA_A Python package for performing data assimilation with deep learning forward and transformation functions
No ratings yet
TorchDA_A Python package for performing data assimilation with deep learning forward and transformation functions
17 pages
computers-13-00077-v2
No ratings yet
computers-13-00077-v2
3 pages
Open-source rasberry Pi-based operant box for translational behavioral testing in rodents
No ratings yet
Open-source rasberry Pi-based operant box for translational behavioral testing in rodents
11 pages
2018 Lafayette AdCo
No ratings yet
2018 Lafayette AdCo
7 pages
Anwyl-Irvine - Gorilla in our midst: An online behavioral experiment builder
No ratings yet
Anwyl-Irvine - Gorilla in our midst: An online behavioral experiment builder
20 pages
Valleti et al. - 2024 - Deep kernel methods learn better from cards to process optimization
No ratings yet
Valleti et al. - 2024 - Deep kernel methods learn better from cards to process optimization
20 pages
Lab Workbook - Advance-Peer 22AD2001A
No ratings yet
Lab Workbook - Advance-Peer 22AD2001A
150 pages
Transfer-Learning Bridging the Gap Between Real and Simulation Data for Machine Learning in Injection Molding
No ratings yet
Transfer-Learning Bridging the Gap Between Real and Simulation Data for Machine Learning in Injection Molding
6 pages
FABRIC Testbed From The Eyes of A Network Researcher
No ratings yet
FABRIC Testbed From The Eyes of A Network Researcher
12 pages
Well Test Analysis and Interpretation TH
No ratings yet
Well Test Analysis and Interpretation TH
9 pages
Student Projects in Computer Networking: Simulation Versus Coding
No ratings yet
Student Projects in Computer Networking: Simulation Versus Coding
6 pages
Btac 492
No ratings yet
Btac 492
9 pages
0 DLC Protocol Multi Species 2019 Nature Protocols
No ratings yet
0 DLC Protocol Multi Species 2019 Nature Protocols
27 pages
N-20206
No ratings yet
N-20206
10 pages
Stapor ASOC 2021
No ratings yet
Stapor ASOC 2021
12 pages
A Hybrid Quantum-Classical Neural Network Architecture For Binary Classification
No ratings yet
A Hybrid Quantum-Classical Neural Network Architecture For Binary Classification
9 pages
Elena Gaura, Michael Allen, Lewis Girod, James Brusey, Geoffrey Challen (Auth.), Elena Gaura, Michael Allen, Lewis Girod, James Brusey, Geoffrey Challen (Eds.)-Wireless Sensor Networks_ Deployments An
No ratings yet
Elena Gaura, Michael Allen, Lewis Girod, James Brusey, Geoffrey Challen (Auth.), Elena Gaura, Michael Allen, Lewis Girod, James Brusey, Geoffrey Challen (Eds.)-Wireless Sensor Networks_ Deployments An
307 pages
On The Analysis of Reinforcement Learning: Jeremy Stribling, Max Krohn and Dan Aguayo
No ratings yet
On The Analysis of Reinforcement Learning: Jeremy Stribling, Max Krohn and Dan Aguayo
8 pages
Scimakelatex 23918 None
No ratings yet
Scimakelatex 23918 None
6 pages
SAPENet: Self-Attention Based Prototype Enhancement Network For Few-Shot Learning
No ratings yet
SAPENet: Self-Attention Based Prototype Enhancement Network For Few-Shot Learning
11 pages
Querying and ReUsing Workflows With Visstrails
No ratings yet
Querying and ReUsing Workflows With Visstrails
4 pages
OS&UNIX Master Manual
No ratings yet
OS&UNIX Master Manual
107 pages
ImageNet Classification With Deep
No ratings yet
ImageNet Classification With Deep
7 pages
FHNW Bachelor Thesis Wegleitung
100% (2)
FHNW Bachelor Thesis Wegleitung
7 pages
Cache Coherence Considered Harmful
No ratings yet
Cache Coherence Considered Harmful
5 pages
Modified Rkr21-Ii Year Ii-Sem
No ratings yet
Modified Rkr21-Ii Year Ii-Sem
26 pages
Read-Write, Certifiable Communication: Sanfian Guptka and Todorov Randsey
No ratings yet
Read-Write, Certifiable Communication: Sanfian Guptka and Todorov Randsey
4 pages
A Few-Shot Approach For Relation Extraction Domain
No ratings yet
A Few-Shot Approach For Relation Extraction Domain
9 pages
Formas de Usar Vina
No ratings yet
Formas de Usar Vina
13 pages
Project Proposal: Efficient Algorithms For Molecular Dynamics Simulations and Other Dynamic Spatial Join Queries
No ratings yet
Project Proposal: Efficient Algorithms For Molecular Dynamics Simulations and Other Dynamic Spatial Join Queries
7 pages
Grid Computing
No ratings yet
Grid Computing
2 pages
Mastering matplotlib
From Everand
Mastering matplotlib
Duncan M. McGreggor
No ratings yet
Real-Time Critical Systems
From Everand
Real-Time Critical Systems
Jordan Lee Mauro-Buhagiar
3/5 (1)
1 s2.0 009286749090303V Main
No ratings yet
1 s2.0 009286749090303V Main
9 pages
Water Resources Research - 2013 - Sheng - Dynamic Coupling of Pore Scale and Reservoir Scale Models For Multiphase Flow
No ratings yet
Water Resources Research - 2013 - Sheng - Dynamic Coupling of Pore Scale and Reservoir Scale Models For Multiphase Flow
16 pages
ADA498437
No ratings yet
ADA498437
14 pages
2017 Front Matter
No ratings yet
2017 Front Matter
5 pages
Evolving Internal Length Scales in Plast
No ratings yet
Evolving Internal Length Scales in Plast
25 pages
2018 DrBabakEbrahimian ContinuumMechThermodyn
No ratings yet
2018 DrBabakEbrahimian ContinuumMechThermodyn
10 pages
European Journalof Environmentaland Civil Engineering
No ratings yet
European Journalof Environmentaland Civil Engineering
8 pages
2012 DrBabakEbrahimian MRC
No ratings yet
2012 DrBabakEbrahimian MRC
8 pages
2023 Emnlp-Main 799
No ratings yet
2023 Emnlp-Main 799
20 pages
Mini ProjectA17
0% (1)
Mini ProjectA17
25 pages
A Comprehensive Review On Feature Set Used For Anaphora Resolution
No ratings yet
A Comprehensive Review On Feature Set Used For Anaphora Resolution
90 pages
Algorithms: Evaluation of Diversification Techniques For Legal Information Retrieval
No ratings yet
Algorithms: Evaluation of Diversification Techniques For Legal Information Retrieval
24 pages
PDF Based Question &answering Using Langchain and Openai Api
No ratings yet
PDF Based Question &answering Using Langchain and Openai Api
58 pages
TEXT SUMMARIZATION USING NLP (Final-2)
No ratings yet
TEXT SUMMARIZATION USING NLP (Final-2)
40 pages
Youtube Transcript Summarizer Using Flask
No ratings yet
Youtube Transcript Summarizer Using Flask
9 pages
Final Defense Report-4
No ratings yet
Final Defense Report-4
29 pages
Unit-5 DL
No ratings yet
Unit-5 DL
35 pages
HLDC Hindi Legal Documents Corpus
No ratings yet
HLDC Hindi Legal Documents Corpus
17 pages
Paper-6 Data Mining and Natural Language Processing Methods For Extracting Opinions From Customer Reviews
No ratings yet
Paper-6 Data Mining and Natural Language Processing Methods For Extracting Opinions From Customer Reviews
7 pages
Lecture 07
No ratings yet
Lecture 07
59 pages
G-E: NLG Evaluation Using G - 4 With Better Human Alignment: VAL PT
No ratings yet
G-E: NLG Evaluation Using G - 4 With Better Human Alignment: VAL PT
12 pages
s2 Summarization and Execution
No ratings yet
s2 Summarization and Execution
8 pages
Question Bank
No ratings yet
Question Bank
13 pages
1.1 What Is Data Mining?
No ratings yet
1.1 What Is Data Mining?
6 pages
FEBS Project Prodigy
No ratings yet
FEBS Project Prodigy
11 pages
NLP-based Automatic Answer Evaluation
No ratings yet
NLP-based Automatic Answer Evaluation
5 pages
Text Processor For OCR AND FILE and Summarization
No ratings yet
Text Processor For OCR AND FILE and Summarization
3 pages
IBM-CBSE_AI_Project_Logbook (3) (2)
No ratings yet
IBM-CBSE_AI_Project_Logbook (3) (2)
32 pages
A Survey of Deep Learning in Sports Applications
No ratings yet
A Survey of Deep Learning in Sports Applications
20 pages
Group 1 BCT Draft 3
No ratings yet
Group 1 BCT Draft 3
21 pages
Final Document
No ratings yet
Final Document
46 pages
CCTV VIDEO Optimisation FINAL REPORT
No ratings yet
CCTV VIDEO Optimisation FINAL REPORT
91 pages
Baingan Lelo
No ratings yet
Baingan Lelo
7 pages
A Domain-Specific Automatic Text Summarization Using Fuzzy Logic
No ratings yet
A Domain-Specific Automatic Text Summarization Using Fuzzy Logic
13 pages
2022 Coling-1 540
No ratings yet
2022 Coling-1 540
8 pages
RAPTOR
No ratings yet
RAPTOR
23 pages
Introduction To NLP 2021
No ratings yet
Introduction To NLP 2021
13 pages
Text Summarization
No ratings yet
Text Summarization
13 pages

ACM REP24 Paper Submission

Uploaded by

ACM REP24 Paper Submission

Uploaded by

Computational Experiment Comprehension using Provenance

Jack Sullivan Margo Seltzer Thomas Pasquier

ABSTRACT ACM Reference Format:

that represents the relationships between data objects. These ob-

Figure 2: Steps to create a text summary of a computational experiment.

(a) Example machine-readable provenance log.

1 2023 −10 −21 2 0 : 0 1 : 5 5 : 4 2 7 p r o c e s s w i t h p i d : 2 2 7 8 9 9 e x e c u t e d f i l e : / r o o t / u s r / b i n / pyt hon3 . 1 1

(b) Provenance log converted to natural language format.

User executes an R preprocessing script fol-

lowed by a Python model training script.

User executes a model training script three

as a command line argument.

provenance graphs and summary data would reduce false infor-

6.2 Alternative Provenance Summarization 7 CONCLUSION

6.3 LLM Summarization REFERENCES

Figure 8: Workflow 1 Text Summary

Figure 9: Workflow 2 Text Summary

Figure 10: Workflow 3 Text Summary

Figure 11: Workflow 4 Text Summary

Figure 12: Workflow 1 Node Link Diagram

Figure 13: Workflow 2 Node Link Diagram

Figure 14: Workflow 3 Node Link Diagram

Figure 15: Workflow 4 Node Link Diagram

You might also like