Computational Experiment Comprehension using Provenance Summarization

Nichole Boufford, University of British Columbia, Vancouver, British Columbia, Canada ([email protected])
Joseph Wonsil, University of British Columbia, Vancouver, British Columbia, Canada ([email protected])
Adam Pocock, Oracle Labs, Burlington, Massachusetts, USA ([email protected])
Most provenance applications, including those used for experiment and code management, present provenance as a node-link diagram [5, 21, 25, 30, 41, 46]. However, provenance graphs can contain hundreds of elements, even for small tasks such as running a computational notebook. Research shows that graphs containing more than 50 to 100 elements are hard to interpret [59] as they cannot fit in a human's working memory [33]. Alternative provenance visualizations [6, 44] have failed to see meaningful adoption (see §6 for further discussion). Given that large graphs are hard to understand, we propose natural language text summaries as an alternative. Our intuition is that scientific experiments follow a logical control flow that we can describe using natural language. We know that scientists frequently read papers, lab reports, and procedural documents, so we hypothesize that they might find a written format easier to understand and a better way to explain how to reproduce a computational experiment. While it was previously impractical to generate these text summaries manually, we can now generate them automatically using large language models.
2.2 Summarization using Generative AI
Recent work in generative artificial intelligence shows that large language models (LLMs) are able to effectively summarize large quantities of text [17, 62]. Users interact with LLMs using a prompting interface where they use natural language to instruct the model to answer a question or complete a task [63]. The input to an LLM is a natural language expression, called a prompt. The model outputs a response to that prompt, also in natural language. If the task is summarization, the user also provides the document as part of the prompt.
Many prior works uncover limitations of LLM summarization [22, 28, 47, 50]. It requires careful prompting to generate useful responses [60]. LLM responses are sometimes verbose, redundant, and unclearly organized. Additionally, with current generative AI models, we cannot guarantee response correctness [8]. Lastly, LLMs can process a limited amount of text at one time. The context window is the maximum amount of text a model can process. The context consists of one or more prompt and model response pairs, similar to a conversation. Since our prompt contains an instruction and a provenance log, our instruction, provenance log, and the model response together must be smaller than the context window. The context window is measured in tokens; for GPT models, a token is approximately equal to 4 characters. At the time of this study, the largest context window available for GPT-4 is 8000 tokens. This means that the context is limited to approximately 32000 characters.
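To make the budget concrete, the sketch below checks whether an instruction and a provenance log fit in the 8000-token window before sending them to the model. It assumes the tiktoken tokenizer package and a reserved budget for the model's reply; neither appears in the paper, and the 4-characters-per-token figure is only a rule of thumb.

import tiktoken

CONTEXT_WINDOW = 8000  # GPT-4 context size, in tokens, at the time of the study

def fits_in_context(instruction: str, provenance_log: str, reply_budget: int = 1000) -> bool:
    # Count real tokens rather than relying on the 4-characters-per-token estimate.
    encoding = tiktoken.encoding_for_model("gpt-4")
    used = len(encoding.encode(instruction)) + len(encoding.encode(provenance_log))
    return used + reply_budget <= CONTEXT_WINDOW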
3 TEXT SUMMARIZATION
We generate several high-quality summaries using GPT-4 [37] as a proof of concept for our user study. Fig. 2 shows the sequence of data transformations involved in producing text summaries of computational experiments. We first run an experiment while recording provenance (1). We then preprocess the provenance data (2, 3) and then use the GPT-4 model from OpenAI [37] to generate the summaries for our user study (4). The LLM-generated summaries should contain 1) enough information that a user can understand the experiment well enough to reproduce it and 2) no unnecessary or false information. We outline further goals and expectations in §3.3.
(1) Provenance capture We developed a system-level provenance collection tool to capture provenance during an experiment [7]. System-level provenance describes data at the granularity of system calls, files, and processes. We wrote our own tool because most existing system provenance collection tools have a large installation overhead [32, 39]. Our tool uses eBPF [2], a Linux framework that allows users to monitor operating system events without modifications to the kernel. Previous work that uses eBPF for provenance capture mainly focuses on security [27, 45], whereas our tool only captures the information necessary for computational experiment reproducibility. We call the provenance data captured by our system-level tool a provenance log.

3.1 Data Preprocessing
For most LLMs, including GPT-4 [37], the context window is limited. Since many of our provenance logs are larger than the context window, we need a more concise representation. Additionally, the provenance log we get from the data capture stage is a machine-readable JSON file. The JSON provenance format is long and verbose, which increases the context size. Previous work shows that LLM response quality degrades and loses information around the middle of a document when the context is too long [28].
We reduce log size by removing duplicate edges and converting the JSON log to natural language. We perform both the edge reduction and the natural language formatting automatically using Python scripts. Both of these methods also provide the benefit of reducing noise in the input to the LLM. Duplicate edge reduction helps prevent edges from being erroneously categorized as more important than they are, and the natural language format aligns more with an LLM's training corpus than the JSON output of our system-level provenance collection tool.
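As a concrete illustration of the preprocessing input, the sketch below loads a provenance log in the format of Fig. 3 into node and edge records, assuming one JSON object per line; the helper name and structure are ours, not the released preprocessing scripts.

import json

def load_provenance_log(path: str):
    # Entity and Activity records become nodes; relationship records such as
    # "Used" become edges between a process ("from") and a file ("to").
    nodes, edges = {}, []
    with open(path) as log:
        for line in log:
            record = json.loads(line)
            if record["type"] in ("Entity", "Activity"):
                nodes[record["id"]] = record
            else:
                edges.append(record)
    return nodes, edges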
1 { " type " : " E n t i t y " , " i d " : " 4 5 5 4 " , " a n n o t a t i o n s " : { " inode_inum " : " 4 5 5 4 " , " uid " : " 0 " , " path " : " /
r o o t / u s r / b i n / p ython3 . 1 1 " } }
2 { " type " : " A c t i v i t y " , " id " : " 2 2 7 8 9 9 " , " annotations " : { " pid " : " 2 2 7 8 9 9 " } }
3 { " t y p e " : " Used " , " t o " : " 4 5 5 4 " , " from " : " 2 2 7 8 9 9 " , " a n n o t a t i o n s " : { " o p e r a t i o n " : " e x e c u t e " , "
datetime ":"2023 −10 −21 2 0 : 0 1 : 5 5 : 4 2 7 " } }
Figure 3: A provenance log in machine-readable JSON format (Fig. 3a) is converted to natural language format (Fig. 3b). The
machine-readable JSON format size is 88 tokens and the natural language log size is 30 tokens.
(2) Edge Reduction The operating system sometimes produces many system events for a single user action. For example, if a user is modifying a file using a text editor, the operating system might execute multiple writes in a row. Our provenance collection system will record each write event as an edge. Conceptually, there is no difference between a single large write event and many consecutive small write events. Therefore, we use a simplified version of edge aggregation described by Xu et al. [57] to remove repeated edges from the graph. This reduced log sizes by 43-53% for the logs in our study.
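A minimal sketch of one way this aggregation can be applied to the parsed log: an edge is dropped when it repeats the source, destination, and operation of the edge kept immediately before it, so a burst of small writes collapses into one write relationship. This is our reading of the simplified scheme, not the authors' released implementation (Appendix A).

def reduce_duplicate_edges(edges):
    # Collapse consecutive edges that repeat the same (from, to, operation) triple.
    reduced = []
    for edge in edges:
        key = (edge["from"], edge["to"], edge["annotations"]["operation"])
        if reduced:
            last = reduced[-1]
            if key == (last["from"], last["to"], last["annotations"]["operation"]):
                continue  # repeated event; keep only the first occurrence
        reduced.append(edge)
    return reduced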
(3) Natural Language Formatting Through empirical experiments, we found that converting the JSON logs into natural language sentences improved both log size and summary quality. The new log format follows a simple natural language structure where a short sentence describes each relationship in the graph. For example, when a process writes to a file, this is recorded in the log as a JSON object for each of: a process node, a file node, and an edge that connects the two nodes. We simplify this relationship as "Process p writes to file f", where p and f are the identifiers for the process and the file. Fig. 3 shows an example of the natural language log format conversion. Since we can enumerate all the possible relationship types in our provenance graph, we can generate a mapping of sentences to relationships in the provenance graph. We can then automatically generate a log in natural language format, filling in the blanks with values from the provenance data. This format produces higher quality summaries than the machine-readable log, using the evaluation criteria in §3.3. The natural language format reduces the study provenance log size by an additional 58-63%. In combination, these two techniques reduce the logs to between 17 and 24% of their original size. The code for generating the natural language format is publicly available (details in Appendix A).
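The mapping from relationships to sentences can be expressed as a small template table keyed by operation type, as sketched below. The templates and helper are illustrative stand-ins; the released code (Appendix A) defines the full mapping.

TEMPLATES = {
    "execute": "Process {pid} executes {path}.",
    "read": "Process {pid} reads from file {path}.",
    "write": "Process {pid} writes to file {path}.",
}

def edge_to_sentence(edge, nodes):
    # "from" refers to the Activity (process) node and "to" to the Entity (file)
    # node, following the record layout in Fig. 3.
    operation = edge["annotations"]["operation"]
    process = nodes[edge["from"]]
    entity = nodes[edge["to"]]
    return TEMPLATES[operation].format(
        pid=process["annotations"]["pid"],
        path=entity["annotations"]["path"],
    )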
3.2 Prompting
After preprocessing, we use LLM prompting to generate text summaries from the preprocessed provenance logs. Prompt engineering is the process of designing LLM prompts to achieve a desired response. Prompt engineering does not require model training or fine-tuning. Existing work has shown that prompt engineering effectively generates well-written summaries of long-form text [17].
(4) Prompt Engineering We use GPT-4-0613 [37], the latest openly available model from OpenAI at the time of our study. OpenAI provides guidelines and strategies for developing prompts [1]. We followed these guidelines and adjusted our prompts until we achieved a desirable output. Using clear and specific instructions achieved the best results. In Fig. 4, we show the final prompt we used to generate the summaries for our user study. We discuss how we evaluated output quality and how we arrived at our final prompt in §3.3.
Temperature Parameter Additionally, we set the GPT temperature parameter to 0 to ensure consistent responses. The temperature is a randomness control parameter for the GPT model. A lower temperature means less randomness and a higher temperature means the outputs will have more variability. Higher temperatures sometimes introduce interesting prose and more high-level descriptions, but the responses were inconsistent and more likely to contain false information. Setting the temperature to 0, we get responses that are nearly the same each time, differing by only a few words, if any.
Summarizing Large Provenance Logs Even after preprocessing, some of our provenance logs still exceeded the model context window. The GPT-4 context window is 8,000 tokens at the time of our study. In comparison, our processed provenance logs ranged from 3945 to 12815 tokens. In cases where the log was too large, we used prompt chaining, a technique that has been used for large, complex tasks [52, 55]. If a log exceeded the size of the context window, we divided it into two or more logs. We define break points as edges in the graph that correspond to a user executing a command. These break points represent a natural break in the log information, such as a user executing a Python script from the command line. We maximize the size of the first chunk and put the remainder in the second chunk, ensuring that the first section of the next chunk starts at a break point. The model then summarizes the first section of the log, and we give the response, the next section of the log, and a second prompt back to the model. We repeat this process until the model has summarized the entire log. This method generated high-quality summaries using the evaluation method described in §3.3.
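As a concrete illustration, the sketch below sends a preprocessed log to GPT-4-0613 with the temperature fixed at 0. It assumes the openai Python package and an API key in the environment; the short instruction string stands in for the full prompt shown in Fig. 4.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_log(natural_language_log: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,  # minimize randomness so repeated runs barely differ
        messages=[{
            "role": "user",
            "content": "Summarize the following provenance log so that a reader "
                       "could reproduce the experiment:\n" + natural_language_log,
        }],
    )
    return response.choices[0].message.content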
Figure 4: The prompt we used to generate our user study text summaries. Input provenance log denotes the log specific to each
task.
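The chaining procedure described above can be sketched as a loop that splits the log at break points, summarizes the first chunk, and then feeds the running summary plus the next chunk back to the model. The break-point test and follow-up wording below are illustrative assumptions rather than the exact study prompts, and summarize stands for a model call such as the one sketched earlier.

def split_at_break_points(log_lines, max_chars=24000):
    # Close a chunk only at a break point (a line describing a user-executed
    # command), keeping each chunk as large as the character budget allows.
    chunks, current = [], []
    for line in log_lines:
        is_break = "executes" in line
        if is_break and current and sum(len(l) for l in current) >= max_chars:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def summarize_with_chaining(log_lines, summarize, max_chars=24000):
    chunks = split_at_break_points(log_lines, max_chars)
    summary = summarize("Summarize the following provenance log:\n" + chunks[0])
    for chunk in chunks[1:]:
        summary = summarize(
            "Here is a summary of the experiment so far:\n" + summary
            + "\nContinue the summary using this next part of the log:\n" + chunk
        )
    return summary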
3.3 Summary Evaluation
There is currently no standard for evaluating LLM-generated summaries. Existing methods for evaluating LLM responses use both qualitative and quantitative methods depending on the application [12]. Quantitative methods involve statistically measuring responses compared to reference text written by a human. Recent research shows no strong correlation between statistical metrics and summary quality [50]. We do not have a strict expected output structure for the text summaries; therefore, the statistical difference between the generated and reference summaries is not meaningful. Therefore, we use qualitative methods to evaluate LLM-generated summaries. We define a rubric with four categories:
Completeness Is all the necessary information included?
Conciseness Is any unnecessary information included?
Truthfulness Does it include any false information?
Readability Is it easy to read and well formatted?
For each of the four categories, we manually assign a score out of 4, giving a total score out of 16. We developed a prompt that produced summaries that scored 16/16 for each of the provenance logs used in the study. We used the prompt in Fig. 4 to summarize logs smaller than the context window. We also use this prompt as the first prompt in the chaining approach. The prompt (excluding the input provenance) uses only 101 tokens, leaving the rest of the 8K context window for input logs and the output summary.
It took approximately one month of iteration to create our final solution. We had many discussions with our team members to develop the rubric, refine the prompt, and come to a consensus on the best responses. To develop our prompt, we started with a basic prompt, "Summarize the following log.", and provided a small log describing a user executing a Python script. We adjusted the prompt, using different wording and adding further instructions and context. As we increased the size of the input log, the model required more specific prompting to steer it in the right direction. Once we engineered a prompt that consistently produced well-written summaries, we used this prompt to generate the summaries in our user study.
We also experimented with providing examples, another technique discussed in the OpenAI guidelines. The example technique involves manually writing a "conversation history". That is, one writes a prompt and then manually generates a desired response to that prompt. The prompt/response pair is provided to the model as an example before giving the model a prompt for which you want the model to produce a response. The model is likely to follow the response format from the conversation history when using this technique. While the responses generated from prompts that included examples were of high quality, the examples counted against the context window limit, leaving fewer tokens available for the real provenance log. We did not use this example technique for the summaries in our study. Rather, we opted to use a detailed instruction that uses less of the context window limit, as shown in Fig. 4.
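In the chat format, the example technique amounts to prepending a hand-written prompt and response to the message list before the real request, as sketched below; the example log and summary are invented for illustration and are not the prompts used in the study.

example_log = "Process 42 executes analysis.py. Process 42 reads from file data.csv."
example_summary = "The user ran analysis.py, which reads the dataset data.csv."
real_log = "Process 7 executes train_model.py. ..."  # the actual preprocessed log goes here

messages = [
    {"role": "user", "content": "Summarize the following log:\n" + example_log},
    {"role": "assistant", "content": example_summary},  # hand-written target response
    {"role": "user", "content": "Summarize the following log:\n" + real_log},
]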
4 USER STUDY
We conduct a user study to evaluate whether users were better able to understand workflows given text-based provenance summaries than they were when given node-link diagrams. The study uses a mixed methods approach. Participants are quantitatively evaluated on their ability to answer questions about several computational experiments using only the provenance summarization. We then analyze qualitative feedback through long answer text responses and audio recordings. Appendix D contains our study materials.

4.1 Study Methods
The study session consisted of a brief introduction and overview of the study purpose, demographic questionnaire, activity, and post-activity questionnaire.
Study Activity Each participant completed four tasks. For each task, the participant was given either a node-link diagram provenance summary or an automatically generated text summary representing a computational experiment that they had not seen before. Participants used the provenance summary to answer questions about the computational experiment. The questions concerned information one would need to reproduce said experiment, such as, "Which scripts write to this data file?" and "How many output files are created during this experiment?". We describe the study's computational experiments (workflows) in Table 2. We used the one
Figure 5: Quantitative metrics showing performance using both graph (orange) and text (blue) provenance summaries. (a) Average time to complete each task. (b) Average score for each task normalized out of 100. (c) Normalized NASA Task Load Index (TLX).
Table 2: The computational experiments (workflows) used in the study. Task 0: User executes a Python script and the script creates a plot of input data. (Other labels recovered from this region: Human Robot Interaction, Information Technology, Environmental Science, Computer Science, Physics.)
Figure 6: Participants rank their preference for either the text summarization or node-link diagram. Participants are categorized by their computational expertise. (a) Which provenance summarization was more useful? (b) Which provenance summarization was more enjoyable?
Question Score Each participant answered most questions correctly regardless of representation (Fig. 5b). For tasks 1-3, at least 8 out of 12 participants scored 100% and 11 out of 12 scored over 70% using either the text or node-link diagram summary. The scores were lowest for task 4, where only one participant scored perfectly, although 10 out of 12 participants scored over 75%. The participants who scored the highest on this task were able to answer the multistep reasoning more easily with the text-based provenance summary than they were with the graph-based one. We assigned a score for this question manually, giving two points if they answered correctly with plausible reasoning and one point for partially correct responses (i.e., correct reasoning but incorrect answer or vice versa). All other questions in the study had only one answer and were marked as either correct or incorrect.
Perceived Cognitive Load We measure perceived cognitive load using the NASA Task Load Index (TLX) standard scoring system [19]. After each task, participants recorded their response to the TLX questions in Appendix C. As with the other quantitative metrics, the cognitive load scores are similar when comparing the two summarization methods (Fig. 5c).
The quantitative results show no obvious difference in overall performance when using the text or the node-link summary. We begin to see larger differences when we look at user preference and qualitative feedback.

4.4 Qualitative Results
In the post-activity survey, we asked questions regarding the entire study experience. At this point, the participants have completed two tasks using a text provenance summary and two tasks with a node-link provenance summary. We asked them to rate their preference for either summary technique using a 5-level Likert scale [26]. The participants recorded whether either was more useful during the activities and whether either was more enjoyable than the other. We show the results of these questions in Fig. 6b.
At first glance, there is no trend in either direction. Some participants strongly prefer the node-link diagram, and others strongly prefer the text, with a few in the middle. But, when we include participants' overall experience with research programming, a stronger trend emerges. Users with little experience (blue) find the text both more useful and enjoyable. Users with intermediate (pink) to advanced (yellow) experience varied in whether they found the text or node-link summary more useful, but tended more towards the graph in terms of enjoyment. We uncover some explanations for these trends using the long answer survey responses and audio transcriptions. We outline the prominent themes below. In the following sections, we refer to participants by number (e.g., P0) to preserve anonymity.
Text summaries are accessible for all expertise levels. As observed in Fig. 6a, users with less computational expertise preferred the text summaries. Less experienced participants were more comfortable and less overwhelmed with the text summaries. For instance, P6 felt the graphs required some background knowledge they did not have.
Text summaries tell a story. P6 described the text summaries as "Text reads more like a storyline, which is more intuitive for me". Multiple participants found the text summaries followed a logical order. P12, who studies bioinformatics, remarked that they are required to closely follow written protocols in their work. This experience translated well to understanding the text summaries, which have a similar format to a written experiment protocol. However, the graph differed from any data format they were familiar with and required more effort to understand. Advanced users, many of whom found the graph more enjoyable, still appreciated aspects of the text summary. P8 notes "The text format felt more useful in identifying the workflow steps in order."
Text summary requires attention to detail. Several participants who preferred the node-link diagram summarization found the text too long to read. P10 found it less enjoyable to "read through each
sentence and remember what is being done at each step". While the length of prose was a barrier for some participants, other participants found it helpful. Particularly, users who are confident and often read written protocols were comfortable extracting information from the text summaries.
The text summaries lacked some structure compared to the graphs. As in natural prose, the subjects and verbs do not appear in the same place in each sentence in the text summaries. For P9, "The text summary tended to jump around more and was difficult to follow." We discuss alternative text summary formats in §5.2.
Advanced users identify patterns in the graph. Users with high computational expertise often preferred the graph format. Many users in this category enjoyed the extra details and workflow visualization for identifying relationships and patterns. P8 found the graph "made it easier to identify relationships between different components." Similarly, P9 found that when using the graph "it was easier to see repeated steps and patterns."
Users noted that the text summary was harder to skim and quickly extract information. As such, several participants identified areas where the text could be improved, potentially affording similar benefits to the graph. Several participants noted that keyword highlighting in the text might allow pattern matching similar to the graph. We discuss this further in §5.1.
Text summary can help users to get up to speed on node-link diagrams. Several users noted they would like to see both provenance representations in a real application. For less experienced users, some noted they could use the text summary to help understand the node-link diagram. P6 would prefer to have "both text and [node-link diagram] side by side [...] so that I could eventually learn how to read [the node-link diagrams] with some practice." Even users who preferred the graph noted that "a plain or natural language commentary is always useful [alongside the graph]" (P9).

4.5 Remote Study
We released a second version of our user study as an online survey and allowed participants to complete the survey on their own. The remote version of the study had minor changes from the in-person version, including small changes to question wording, two additional demographic questions, and an additional long-form answer question for task 2. Ten participants completed the remote study. We did not see any significant trends across the quantitative metrics. All the participants that completed the remote study were categorized as intermediate or expert in their computational and data science expertise. Participants' overall preference for the text summarization versus the node-link diagram was similar to the initial study. The qualitative feedback matched the themes we identified in the first study. Several participants remarked that they enjoyed the visual cues in the node-link diagram but also found the text useful for answering questions about what happened during an experiment.

5 DISCUSSION & FUTURE WORK
Our qualitative analysis yields several areas of improvement for text-based provenance summaries as well as reproducibility tools. The participants' enthusiasm while sharing feedback on reproducibility tools sparks optimism for future research in provenance and reproducibility.

5.1 Design Recommendations
In the post-activity questionnaire, we asked participants if there are additional features they would like for a reproducibility tool and if there is anything that would prevent them from using a reproducibility tool. We give several recommendations based on our takeaways from the qualitative analysis. These guidelines can also be applied more broadly to any tools that assist with user comprehension of experimental workflows.
Visual Features For both visualization-based and text-based summarizations, several users noted they would like highlighting, zooming, panning, and search. As P7 describes, "adding colors to file names, and scripts/outputs/paths [...] would make it more readable." P3 also mentions "highlighting of linked routes (when you hover over an item it shows all the related items)". In graph summaries, users complained of difficulty tracing the edges between nodes. In text summaries, several participants noted that the text required users to read the entire entry, sometimes multiple times, to ensure they did not miss any details. We imagine that simply color coding and bolding keywords such as verbs (e.g., read, write, execute) and file paths would help users to extract important details more quickly and easily.
Hide Low-Level Details With either provenance summarization, users still felt overloaded with information on first impressions. P3 suggested "the ability to have hierarchical drop down tree to help organize larger amounts of data". Similarly, P9 wanted the tool to "allow the user to 'zoom in' on different parts of the experiment". The option to view a high-level summary first and expand on the details later might reduce the initial cognitive overload and make the summaries more approachable.
Integration with Existing Tools Four participants expressed interest in integration with tools they already use, such as Git [10] or RMarkdown [3]. Several would have liked a provenance summarization directly linked to their code repository. Others mentioned it would help them to understand previous experiments if they integrated a provenance summary into their computational notebook.
Installation and Use Overheads Many participants mentioned that they would be unlikely to use any tool if the overhead for use was too high. This overhead includes installation and workflow modifications. P6 noted "if set up would slow me down a lot, I might be less likely to use it." Specifically, barriers include having to rewrite any of their existing code or switching programming environments.

5.2 Text Summary Limitations and Improvements
Although we cannot guarantee perfect summaries using current models, our positive results using a generic large language model leave us hopeful. We expect that using a domain specific LLM, trained on experiment provenance data, would be better still. For our user study, we generated text summaries using the out-of-the-box GPT-4 model from OpenAI [37] with no fine-tuning. GPT-4 is closed-source, and we assume OpenAI trained it with general-purpose data.
6 RELATED WORK
Our work focuses on determining how receptive users are to a tool
conveying information from provenance. Given that we propose a
technique using LLMs to do so, we examine prior work on visualiz-
ing and summarizing provenance, as well as LLM summarization
techniques and limitations.
Figure 7: We compare the most common node-link visualization (Fig. 7a) with two alternative approaches (Fig. 7b, Fig. 7c). (c) Schreiber and Struminski use comics to present smartwatch provenance data.

6.1 Provenance Graph Visualization
Provenance data are historically displayed using node-link diagrams [15, 21, 30]. Although some applications, such as Probe-It [15], include additional views, graph-style visualizations have practically become standard practice. Many tools store provenance data in graph databases, e.g., Neo4j [35], and then use the tools that accompany those systems or other graph-centric tools, e.g., GraphViz [16], to explicitly represent provenance data. However, generic graph tools often produce illustrations that are cluttered and difficult to read. Provenance tools for experimental workflow tracking also, unsurprisingly, use node-link diagram illustrations. VisTrails [5] captures provenance for workflows in their applications and displays the provenance data using node-link diagrams. Users must execute their entire workflow in the VisTrails application to capture provenance. Language-level provenance tools common in research programming, such as RDataTracker [25] and noWorkflow [41], also use node-link diagrams.
For our study, we chose to manually generate our node-link diagrams rather than use existing tools to generate the graphs. We made this decision because we use a different provenance abstraction than the language-level tools and some application-specific provenance visualization tools such as VisTrails [5]. Additionally, the graphs we generated using GraphViz [16] and the Neo4j database viewer [35] were not well organized and did not display all the information necessary for reproducibility comprehension. Therefore, we did not think it would be a fair comparison to use these in the study. We manually created the graphs in our study to highlight workflow-level detail necessary for reproducibility.
[15] Nicholas Del Rio and Paulo Pinheiro Da Silva. 2007. Probe-it! visualization support for provenance. In International Symposium on Visual Computing. Springer, 732–741.
[16] John Ellson, Emden Gansner, Lefteris Koutsofios, Stephen C North, and Gordon Woodhull. 2002. Graphviz—open source graph drawing tools. In Graph Drawing: 9th International Symposium, GD 2001, Vienna, Austria, September 23–26, 2001, Revised Papers 9. Springer, 483–484.
[17] Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2023. News Summarization and Evaluation in the Era of GPT-3. arXiv:2209.12356 [cs.CL]
[18] Philip J Guo and Margo I Seltzer. 2012. Burrito: Wrapping your lab notebook in computational infrastructure. (2012).
[19] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload, Peter A. Hancock and Najmedin Meshkati (Eds.). Advances in Psychology, Vol. 52. North-Holland, 139–183. https://doi.org/10.1016/S0166-4115(08)62386-9
[20] OpenAI. 2023. New models and developer products announced at DevDay. Blog post. https://openai.com/blog/new-models-and-developer-products-announced-at-devday
[21] Jane Hunter and Kwok Cheung. 2007. Provenance Explorer-a graphical interface for constructing scientific publication packages from provenance trails. International Journal on Digital Libraries 7 (2007), 99–107.
[22] Shaoxiong Ji, Tianlin Zhang, Kailai Yang, Sophia Ananiadou, and Erik Cambria. 2023. Rethinking Large Language Models in Mental Health Applications. arXiv preprint arXiv:2311.11267 (2023).
[23] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, and Jupyter development team. 2016. Jupyter Notebooks - a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, Fernando Loizides and Birgit Scmidt (Eds.). IOS Press, Netherlands, 87–90. https://eprints.soton.ac.uk/403913/
[24] Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, and Shashi Bhushan Tn. 2023. Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective. arXiv preprint arXiv:2310.19233 (2023).
[25] B.S. Lerner and E.R. Boose. 2014. RDataTracker: Collecting Provenance in an Interactive Scripting Environment. In USENIX Workshop on the Theory and Practice of Provenance (TaPP). https://doi.org/10.1007/978-3-319-16462-5_36
[26] Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology (1932).
[27] Soo Yee Lim, Bogdan Stelea, Xueyuan Han, and Thomas Pasquier. 2021. Secure namespaced kernel audit for containers. In Proceedings of the ACM Symposium on Cloud Computing. 518–532.
[28] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172 (2023).
[29] Yucheng Low, Ajit Banerjee, and Rajat Arya. 2021. XetHub. https://about.xethub.com/
[30] Peter Macko and Margo Seltzer. 2011. Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs. In 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP 11). USENIX Association, Heraklion, Crete, Greece. https://www.usenix.org/conference/tapp11/provenance-map-orbiter-interactive-exploration-large-provenance-graphs
[31] Andrew Mleczko, Sebastian Schuberth, Lars Schneider, and Brian M. Carlson. 2021. git-lfs. https://github.com/git-lfs/git-lfs
[32] Kiran-Kumar Muniswamy-Reddy, David Holland, Uri Braun, and Margo Seltzer. 2006. Provenance-Aware Storage Systems. USENIX (2006), 43–56.
[33] T. Munzner. 2015. Visualization Analysis and Design. CRC Press. 210 pages. https://books.google.de/books?id=NfkYCwAAQBAJ
[34] National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. The National Academies Press, Washington, DC. https://doi.org/10.17226/25303
[35] Neo4j. 2012. Neo4j. http://neo4j.org/
[36] OpenAI. 2023. ChatGPT. https://chat.openai.com/chat
[37] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[38] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
[39] Thomas Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David Eyers, Margo Seltzer, and Jean Bacon. 2017. Practical Whole-System Provenance Capture. In Symposium on Cloud Computing (SoCC'17). ACM.
[40] Thomas Pasquier, Matthew K. Lau, Ana Trisovic, Emery R. Boose, Ben Couturier, Mercè Crosas, Aaron M. Ellison, Valerie Gibson, Chris R. Jones, and Margo Seltzer. 2017. If these data could talk. Nature Scientific Data 4 (2017). https://www.nature.com/articles/sdata2017114
[41] João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. 2017. noWorkflow: A Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1841–1844. https://doi.org/10.14778/3137765.3137789
[42] João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. 2019. A large-scale study about quality and reproducibility of Jupyter notebooks. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 507–517.
[43] Chandrasekhar Ramakrishnan, Michele Volpi, Fernando Perez-Cruz, Lilian Gasser, Firat Ozdemir, Patrick Paitz, Mohammad Alisafaee, Philipp Fischer, Ralf Grubenmann, Eliza Jean Harris, et al. 2023. Renku: a platform for sustainable data science. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[44] Andreas Schreiber and Regina Struminski. 2017. Visualizing Provenance using Comics. In 9th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2017). USENIX Association, Seattle, WA. https://www.usenix.org/conference/tapp17/workshop-program/presentation/schreiber
[45] R Sekar, Hanke Kimm, and Rohit Aich. 2023. eAudit: A Fast, Scalable and Deployable Audit Data Collection System. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 87–87.
[46] Omid Setayeshfar, Christian Adkins, Matthew Jones, Kyu Hyung Lee, and Prashant Doshi. 2021. Graalf: Supporting graphical analysis of audit logs for forensics. Software Impacts 8 (2021), 100068.
[47] Chenhui Shen, Liying Cheng, Xuan-Phi Nguyen, Yang You, and Lidong Bing. 2023. Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023. 4215–4233.
[48] Azadeh Tabiban, Heyang Zhao, Yosr Jarraya, Makan Pourzandi, and Lingyu Wang. 2023. VinciDecoder: Automatically Interpreting Provenance Graphs Into Textual Forensic Reports With Application To OpenStack. In Secure IT Systems: 27th Nordic Conference, NordSec 2022, Reykjavik, Iceland, November 30–December 2, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 346–367. https://doi.org/10.1007/978-3-031-22295-5_19
[49] Azadeh Tabiban, Heyang Zhao, Yosr Jarraya, Makan Pourzandi, Mengyuan Zhang, and Lingyu Wang. 2022. ProvTalk: towards interpretable multi-level provenance analysis in networking functions virtualization (NFV). In The Network and Distributed System Security Symposium 2022 (NDSS '22).
[50] Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G Nestor, Ali Soroush, Pierre A Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F Rousseau, et al. 2023. Evaluating large language models on medical evidence summarization. npj Digital Medicine 6, 1 (2023), 158.
[51] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[52] Dietrich Trautmann. 2023. Large Language Model Prompt Chaining for Long Legal Document Classification. arXiv preprint arXiv:2308.04138 (2023).
[53] Ana Trisovic, Matthew Lau, Thomas Pasquier, and Merce Crosas. 2022. A large-scale study on research code quality and execution. Scientific Data 9 (2022), 60. https://doi.org/10.1038/s41597-022-01143-6
[54] Joseph Wonsil, Nichole Boufford, Prakhar Agrawal, Christopher Chen, Tianhang Cui, Akash Sivaram, and Margo Seltzer. 2023. Reproducibility as a service. Software: Practice and Experience 53, 7 (2023), 1543–1571. https://doi.org/10.1002/spe.3202
[55] Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022. PromptChainer: Chaining large language model prompts through visual programming. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–10.
[56] Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, and Estevam Hruschka. 2023. Less is More for Long Document Summary Evaluation by LLMs. arXiv preprint arXiv:2309.07382 (2023).
[57] Zhang Xu, Zhenyu Wu, Zhichun Li, Kangkook Jee, Junghwan Rhee, Xusheng Xiao, Fengyuan Xu, Haining Wang, and Guofei Jiang. 2016. High Fidelity Data Reduction for Big Data Security Dependency Analyses. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, USA, 504–516. https://doi.org/10.1145/2976749.2978378
[58] Kailai Yang, Shaoxiong Ji, Tianlin Zhang, Qianqian Xie, Ziyan Kuang, and Sophia Ananiadou. 2023. Towards interpretable mental health analysis with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 6056–6077.
[59] Vahan Yoghourdjian, Yalong Yang, Tim Dwyer, Lee Lawrence, Michael Wybrow, and Kim Marriott. 2020. Scalability of network visualisation from a cognitive load perspective. IEEE Transactions on Visualization and Computer Graphics 27, 2 (2020), 1677–1687.
[60] J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages. https://doi.org/10.1145/3544548.3581388
[61] Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023. SummIt: Iterative Text Summarization via ChatGPT. arXiv preprint arXiv:2305.14835 (2023).
[62] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. Benchmarking Large Language Models for News Summarization. arXiv:2301.13848 [cs.CL]
[63] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]

A AVAILABILITY
The work presented in this paper is open-source. Detailed installation instructions are available online.
• System Provenance Collection Tool. Available for download under GPL-2.0 license at https://github.com/ubc-systopia/thoth.
• Provenance Summarization. Available for download under Apache 2.0 license at https://doi.org/10.5281/zenodo.10672536.
• The user study documents are available at https://doi.org/10.5281/zenodo.106725369.
B DIAGRAM CREATION
It is common for graph users to visualize their data as node-link graphs with tools like Neo4j [35] and GraphViz [16]. However, these general-purpose tools are not a sufficient fit for this study. They do not readily show in a static way all the necessary attributes for nodes and edges a participant needs to see to answer the questions from our tasks. Additionally, since we created automatically generated text summaries tailored towards reproducibility and workflow executions, it would not be a fair comparison to use a general-purpose visualization. Therefore, we chose to make our own diagrams.
We devised a new type of node-link diagram tailored towards displaying information from provenance logs about a workflow. We manually created our node-link diagram visualizations using a set of pre-defined rules rather than write a program to do so automatically. Making them manually allowed us a finer grain of control over the various aspects of the diagram to ensure legibility; however, we believe the process could be automated with some effort.
Our node-link diagrams highlight the processes that comprise the computational workflow. Overall, our diagrams display events in the order they executed from top to bottom; however, the operations are not displayed proportional to the time they occurred, only the order. We represent each process with a large arrow that points downward to indicate the order of execution. Each process' arrow receives a unique color, except for instances where that process has spawned additional processes. The child processes are large arrows of the same color placed directly to the right of the original process. The top of the arrow has a block containing the command that initiated the process. Smaller arrows attached to the right side of a process show the various system operations the process performed over the course of its existence. These arrows represent edges to other nodes, such as files it reads or writes, libraries it loads, or programs it executes. Nodes representing the same file are not duplicated, so it is clear in the graph when a process reads a file that another process created. In this situation, when multiple processes have edges to a node, the arrows pointing to the node are in execution order. The arrow attached at the top executed first, moving downwards to the last arrow at the bottom, which is the operation executed last.

C TASK LOAD INDEX QUESTIONS
(1) How mentally demanding was this task? (1-Very Low, 5-Very High)
(2) How hurried or rushed were you during this task? (1-Very Low, 5-Very High)
(3) How successful would you rate yourself in accomplishing this task? (1-Perfect, 5-Failure)
(4) How hard did you have to work to accomplish your level of performance? (1-Very Low, 5-Very High)
(5) How insecure, discouraged, irritated, stressed, and annoyed were you? (1-Very Low, 5-Very High)
(6) How useful was the provenance summary in answering the questions? (1-Very Useful, 5-Not Useful)

D STUDY ACTIVITIES
D.1 Questions
D.1.1 Task 1.
(1) What is the name of the dataset the student is using?
(2) Which directory is the dataset saved in?
(3) What is the name of the file containing the experiment code?
(4) Which directory is the experiment code located in?
(5) How many output files are produced? (Include intermediate outputs)
(6) Which programming languages are used to conduct the analysis in this experiment?
D.1.2 Task 2.
(1) How many times is the script train_model.py executed?
(2) How many times is the script preprocess.R executed?
(3) Which scripts write to the file data.csv?
(4) Which scripts read from the file data.csv?
(5) Which scripts write to the file temp_data.csv?
(6) Which scripts read from the file "temp_data.csv"?
(7) Which of the following are dependencies of train_model.py?
D.1.3 Task 3.
(1) Where is the dataset located?
(2) How many output files were created during this experiment (including intermediate files)?
(3) Please explain the difference between the first and second executions of the train_model.py script in no more than two sentences.
D.1.4 Task 4.
(1) What is the name of the dataset the student is using?
(2) Which directory is the dataset saved in?
(3) What is the name of the file containing the experiment code?
(4) Which directory is the experiment code located in?
(5) How many output files are produced? (Include intermediate outputs)
(6) Which programming languages are used to conduct the analysis in this experiment?

D.2 Text Summaries
Fig. 8, Fig. 9, Fig. 10, Fig. 11 show the text summaries used in our study. We used the single prompt method for tasks 1 and 2 and the chaining method for tasks 2 and 3.

D.3 Node Link Diagrams
Fig. 12, Fig. 13, Fig. 14, Fig. 15 show the node link diagrams we created and used in our study.