
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

Hao Zheng1,2,*, Xinyan Guan1,2,*, Hao Kong3, Jia Zheng1, Hongyu Lin1, Yaojie Lu1, Ben He1,2, Xianpei Han1, Le Sun1
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Shanghai Jiexin Technology
{zhenghao2022,guanxinyan2022,zhengjia,hongyu,luyaojie}@iscas.ac.cn
{xianpei,sunle}@iscas.ac.cn [email protected]
* These authors contributed equally.

Abstract

Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.

Figure 1: Comparison between our PPTAgent approach (left: edit-based generation, yielding visual support, an engaging look, and a proper introduction) and the conventional abstractive summarization method (right: tedious text, a boring layout, and an abrupt start). Our method, which begins by editing a reference slide, aligns more closely with the human presentation creation process.

1 Introduction

Presentations are a widely used medium for information delivery, valued for their visual effectiveness in engaging and communicating with audiences. However, creating high-quality presentations requires a captivating storyline, visually appealing layouts, and rich, impactful content (Fu et al., 2022). Consequently, creating well-rounded presentations demands advanced presentation skills and significant effort. Given the inherent complexity of presentation creation, there is growing interest in automating the presentation generation process (Mondal et al., 2024; Maheshwari et al., 2024) by leveraging the generalization capabilities of large language models (LLMs).

Existing approaches often adopt an end-to-end text-generation paradigm that focuses solely on textual content while neglecting layout design and presentation structure, making them impractical for real-world applications. For example, as shown in Figure 1, prior studies (Mondal et al., 2024; Sefid et al., 2021) treat presentation generation as an abstractive summarization task, focusing primarily on textual content while overlooking the interactive nature of presentations. This results in simplistic and visually uninspiring outputs that fail to engage audiences.

However, automatically creating visually rich and structurally clear presentations remains challenging due to the complexity of data formats and the lack of effective evaluation frameworks. First, most presentations are saved in PowerPoint's XML format, which is inherently tedious and redundant (Gryk, 2022). This complex format poses significant challenges for LLMs in interpreting the presentation layout and structure, let alone generating appealing slides in an end-to-end fashion. Second, and more importantly, the absence of comprehensive evaluation frameworks exacerbates this issue. Current metrics such as perplexity and ROUGE (Lin, 2004) fail to capture essential aspects of presentation quality such as narrative flow, visual design, and content impact. Moreover, ROUGE-based evaluation tends to reward excessive textual alignment with input documents, undermining the brevity and clarity crucial for effective presentations. These limitations highlight the urgent need for advancements in automated presentation generation, particularly in enhancing visual design and developing comprehensive evaluation frameworks.
Rather than creating complex presentations from scratch in a single pass, humans typically create presentations by selecting exemplary slides as references and then summarizing and transferring key content onto them (Duarte, 2010). Inspired by this process, we design PPTAgent to decompose presentation generation into an iterative, edit-based workflow, as illustrated in Figure 2. In the first stage, given a document and a reference presentation, PPTAgent analyzes the reference presentation to extract semantic information, providing a textual description that identifies the purpose and data model of each slide. In the Presentation Generation stage, PPTAgent generates a detailed presentation outline and assigns specific document sections and reference slides to each slide. For instance, the framework selects the opening slide as the reference slide to present meta-information, such as the title and icon. PPTAgent offers a suite of editing action APIs that empower LLMs to dynamically modify the reference slide. By breaking the process into discrete stages rather than generating end-to-end, this approach ensures consistency, adaptability, and seamless handling of complex formats.

Figure 2: Overview of the PPTAgent workflow. Stage I: Presentation Analysis involves analyzing the input presentation to cluster slides into groups and extract their content schemas. Stage II: Presentation Generation generates new presentations guided by the outline, incorporating feedback mechanisms to ensure robustness.

To comprehensively evaluate the quality of generated presentations, we propose PPTEval, a multidimensional evaluation framework. Inspired by Chen et al. (2024a) and Kwan et al. (2024), PPTEval leverages the MLLM-as-a-judge paradigm to enable systematic and scalable evaluation. Drawing from Duarte (2010), we categorize presentation quality into three dimensions: Content, Design, and Coherence, providing both quantitative scores and qualitative feedback for each dimension. Our human evaluation studies validated the reliability and effectiveness of PPTEval.

Results demonstrate that our method effectively generates high-quality presentations, achieving an average score of 3.67 across the three dimensions evaluated by PPTEval. These results, covering a diverse range of domains, also show a high success rate of 97.8%, underscoring the versatility and robustness of our approach.

Our main contributions can be summarized as follows:

• We propose PPTAgent, a novel framework that redefines automatic presentation generation as an edit-based workflow guided by reference presentations.

• We introduce PPTEval, the first comprehensive evaluation framework that assesses presentations across three key dimensions: Content, Design, and Coherence.

• We publicly release the PPTAgent and PPTEval codebase, along with a curated presentation dataset, to facilitate future research in automatic presentation generation.

2 PPTAgent

In this section, we first establish the formulation of the presentation generation task. Subsequently, we describe the framework of our proposed PPTAgent, which operates in two distinct stages. In stage I, we analyze the reference presentation by clustering similar slides and extracting their content schemas. This process aims to enhance the expressiveness of the reference presentation, thereby facilitating subsequent presentation generation. In stage II, given an input document and the analyzed reference presentation, we select the most suitable slides and generate the target presentation through an interactive editing process based on the selected slides. An overview of our proposed workflow is illustrated in Figure 2.

2.1 Problem Formulation

PPTAgent is designed to generate an engaging presentation via an edit-based process. We provide formal definitions for both PPTAgent and the conventional method, illustrating their divergence.

The conventional method for creating each slide $S$ can be described by Equation 1, where $n$ represents the number of elements on the slide and $C$ denotes the source content composed of sections and figures. Each element on the slide, $e_i$, is defined by its type, content, and styling attributes, such as (Textbox, "Hello", {border, size, position, ...}).

$$S = \sum_{i=1}^{n} e_i = f(C) \qquad (1)$$

Compared to the conventional method, PPTAgent adopts an edit-based generation paradigm for creating new slides, addressing challenges in processing spatial relationships and designing styles. This approach generates a sequence of actions that modify an existing slide, taking both the input document and the reference presentation as inputs. The process can be described by Equation 2, where $m$ represents the number of generated actions, each action $a_i$ is a line of executable code, and $R_j$ is the reference slide being edited.

$$A = \sum_{i=1}^{m} a_i = f(C \mid R_j) \qquad (2)$$

2.2 Stage I: Presentation Analysis

To facilitate presentation generation, we first cluster slides in the reference presentation and extract their content schemas. This structured semantic representation helps LLMs determine which slides to edit and what content to convey in each slide.

Slide Clustering  Slides can be categorized into two main types based on their functionality: slides that support the structure of the presentation (e.g., opening slides) and slides that convey specific content (e.g., bullet-point slides). We employ different clustering algorithms to cluster slides based on their textual or visual characteristics. For structural slides, we leverage LLMs to infer the functional role of each slide and group them accordingly, as these slides often exhibit distinctive textual features. For the remaining slides, which primarily focus on presenting specific content, we employ a hierarchical clustering approach leveraging image similarity, and we infer the layout pattern of each cluster using MLLMs. Further details regarding this method can be found in Appendix C.

Schema Extraction  After clustering slides to facilitate the selection of slide references, we further analyze their content schemas to ensure purposeful alignment of the editing. Given the complexity and fragmentation of real-world slides, we utilize the context perception capabilities of LLMs (Chen et al., 2024a) to extract diverse content schemas. Specifically, we define an extraction framework in which each element is represented by its category, modality, and content. Based on this framework, the schema of each slide is extracted through LLMs' instruction-following and structured output capabilities. Detailed instructions are provided in Appendix E.
2.3 Stage II: Presentation Generation

In this stage, we begin by generating an outline that specifies the reference slide and relevant content for each slide in the new presentation. For each slide, LLMs then iteratively edit the reference slide using interactive executable code actions to complete the generation process.

Outline Generation  Following human preferences, we instruct LLMs to create a structured outline composed of multiple entries. Each entry specifies the reference slide and relevant document section indices, as well as the title and description of the new slide. Utilizing the planning and summarizing capabilities of LLMs, we provide both the document and the semantic information extracted from the reference presentation to generate a coherent and engaging outline for the new presentation, which subsequently orchestrates the generation process.

Slide Generation  Guided by the outline, the slide generation process iteratively edits a reference slide to produce the new slide. To enable precise manipulation of slide elements, we implement five specialized APIs that allow LLMs to edit, remove, and duplicate text elements, as well as edit and remove visual elements. To further enhance the comprehension of slide structure, inspired by Feng et al. (2024) and Tang et al. (2023), we convert slides from their raw XML format into an HTML representation, which is more interpretable for LLMs. For each slide, LLMs receive two types of input: text retrieved from the source document based on section indices, and captions of available images. The new slide content is then generated following the guidance of the content schema.

Subsequently, LLMs leverage the generated content, the HTML representation of the reference slide, and the API documentation to produce executable editing actions. These actions are executed in a REPL(1) environment, where the system detects errors during execution and provides real-time feedback for self-correction. The self-correction mechanism leverages intermediate results to iteratively refine the editing actions, enhancing the robustness of the generation process.

(1) https://en.wikipedia.org/wiki/Read-eval-print_loop
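The following is a minimal sketch of this edit–execute–correct loop in Python. The five editing APIs are those listed in Table 5 (stubbed here); the helper names, the LLM client, and the feedback format are our own illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the edit-execute-correct loop described above.
# The editing APIs are those of Table 5; helper names and feedback
# format are illustrative assumptions.

def del_span(span_id): ...
def del_image(image_id): ...
def clone_paragraph(paragraph_id): ...
def replace_span(span_id, text): ...
def replace_image(image_id, image_path): ...

API_NAMESPACE = {f.__name__: f for f in
                 (del_span, del_image, clone_paragraph,
                  replace_span, replace_image)}

MAX_SELF_CORRECTIONS = 2  # the paper allows up to two iterations per slide

def generate_slide(llm, slide_html, slide_content, api_docs):
    feedback = None
    for _ in range(MAX_SELF_CORRECTIONS + 1):
        # e.g. ['replace_span(0, "AI Era")', 'replace_image(7, "Drone.jpg")']
        actions = llm.generate_actions(slide_html, slide_content,
                                       api_docs, feedback)
        try:
            for line in actions:
                exec(line, dict(API_NAMESPACE))  # REPL-style execution
            return True                          # all actions succeeded
        except Exception as err:
            # the intermediate error is fed back for self-correction
            feedback = f"action {line!r} failed: {err}"
    return False
```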
Figure 3: Illustration of the evaluation process in PPTEval, which assesses three key dimensions: content, design, and coherence. Content evaluates the quality of text and images within the slides. Design examines the visual consistency and appeal. Coherence focuses on the logical flow of the presentation. Each dimension is rated on a scale from 1 to 5, with detailed feedback provided for improvement.

3 PPTEval

To address the limitations of existing automated metrics for presentation evaluation, we introduce PPTEval, a comprehensive framework for assessing presentation quality from multiple perspectives. The framework provides scores on a 1-to-5 scale and offers detailed feedback to guide the improvement of future presentation generation methods. The overall evaluation process is depicted in Figure 3, with the detailed scoring criteria and examples provided in Appendix B.

Drawing from Duarte (2008, 2010), we have identified three key dimensions for evaluating presentation quality:

Content: The content dimension evaluates the information presented on the slides, focusing on both text and images. We assess content quality from three perspectives: the amount of information, the clarity and quality of textual content, and the support provided by visual content. High-quality textual content is characterized by clear, impactful text that conveys the proper amount of information. Additionally, images should complement and reinforce the textual content, making the information more accessible and engaging. To evaluate content quality, we employ MLLMs on slide images, as slides cannot be easily comprehended in a plain text format.

Design: Good design not only captures attention but also enhances content delivery. We evaluate the design dimension based on three aspects: color schemes, visual elements, and overall design. Specifically, the color scheme of the slides should have clear contrast to highlight the content while maintaining harmony. The use of visual elements, such as geometric shapes, can make the slide design more expressive. Finally, good design should adhere to basic design principles, such as avoiding overlapping elements and ensuring that the design does not interfere with content delivery.

Coherence: Coherence is essential for maintaining audience engagement in a presentation. We evaluate coherence based on the logical structure and the contextual information provided. Effective coherence is achieved when the model constructs a captivating storyline, enriched with contextual information that enables the audience to follow the content seamlessly. We assess coherence by analyzing the logical structure and contextual information extracted from the presentation.
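As a sketch of how this MLLM-as-a-judge flow can be wired up for the content dimension: the `mllm` client and its `ask` method are hypothetical, while the two-step describe-then-judge structure and the JSON output format follow the prompts in Figures 15 and 18.

```python
import json

DESCRIBE_CONTENT_PROMPT = "..."  # the description prompt of Figure 15
JUDGE_CONTENT_PROMPT = "..."     # the judging prompt of Figure 18

def evaluate_content(mllm, slide_image: bytes) -> dict:
    """Describe a slide with an MLLM, then judge the description on a
    1-5 scale, returning the judge's JSON object."""
    description = mllm.ask(DESCRIBE_CONTENT_PROMPT, image=slide_image)
    judgement = mllm.ask(JUDGE_CONTENT_PROMPT, text=description)
    return json.loads(judgement)  # e.g. {"reason": "...", "score": 4}
```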
4 Experiment

4.1 Dataset

Data Collection  Existing presentation datasets, such as Mondal et al. (2024); Sefid et al. (2021); Sun et al. (2021); Fu et al. (2022), have two main issues. First, they are mostly stored in PDF or JSON formats, which leads to a loss of semantic information, such as the structural relationships and styling attributes of elements. Additionally, these datasets are primarily derived from academic reports, limiting their diversity. To address these limitations, we introduce Zenodo10K, a new dataset sourced from Zenodo (European Organization For Nuclear Research and OpenAIRE, 2013), an open digital repository hosting diverse artifacts from different domains. We have curated 10,448 presentations from this source and made them publicly available to support further research. Following Mondal et al. (2024), we sampled 50 presentations across five domains to serve as reference presentations. Additionally, we collected 50 documents from the same domains to be used as input documents. Details of the sampling criteria are provided in Appendix A.

Data Preprocessing  We utilized VikParuchuri (2023) to extract both textual and visual content from the documents. The extracted textual content was then organized into sections using Qwen2.5-72B-Instruct (Yang et al., 2024). For the visual content, captions were generated using Qwen2-VL-72B-Instruct (Wang et al., 2024a). To minimize redundancy, we identified and removed duplicate images whose image embeddings had a cosine similarity score exceeding 0.85. Similarly, slides were excluded if their text embeddings had a cosine similarity score above 0.8 compared to the preceding slide, as suggested by Fu et al. (2022). Detailed statistics of the dataset are presented in Table 1.

            Document          Presentation
Domain      #Chars   #Figs    #Chars  #Figs  #Pages
Culture     12,708   2.9      6,585   12.8   14.3
Education   12,305   5.5      3,993   12.9   13.9
Science     16,661   4.8      5,334   24.0   18.4
Society     13,019   7.3      3,723   9.8    12.9
Tech        18,315   11.4     5,325   12.9   16.8

Table 1: Statistics of the dataset used in our experiments, detailing the number of characters ('#Chars') and figures ('#Figs'), as well as the number of pages ('#Pages').
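A sketch of the deduplication step described under Data Preprocessing, assuming the embeddings have already been computed; the function names are ours.

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedup_images(image_embs: list[np.ndarray]) -> list[int]:
    """Keep an image only if no already-kept image is more than
    0.85 cosine-similar to it (the paper's image threshold)."""
    kept: list[int] = []
    for i, emb in enumerate(image_embs):
        if all(cos_sim(emb, image_embs[j]) <= 0.85 for j in kept):
            kept.append(i)
    return kept

def dedup_slides(text_embs: list[np.ndarray]) -> list[int]:
    """Drop a slide whose text embedding exceeds 0.8 cosine similarity
    with the preceding slide's, following Fu et al. (2022)."""
    return [i for i, emb in enumerate(text_embs)
            if i == 0 or cos_sim(emb, text_embs[i - 1]) <= 0.8]
```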
4.2 Experimental Settings and Baseline

Models  We evaluate our method using three state-of-the-art models: GPT-4o-2024-08-06 (GPT-4o), Qwen2.5-72B-Instruct (Qwen2.5, Yang et al., 2024), and Qwen2-VL-72B-Instruct (Qwen2-VL, Wang et al., 2024a). These models are categorized according to the specific modalities they handle, whether textual or visual, as indicated by their subscripts. Specifically, we define configurations as combinations of a language model (LM) and a vision model (VM), such as Qwen2.5_LM+Qwen2-VL_VM.

During experiments, we allow up to two iterations of self-correction per slide generation task, producing 5 × 10 × 10 = 500 presentations per configuration. We use Chen et al. (2024b) and Wu et al. (2020) to compute the text and image embeddings, respectively. All open-source LLMs are deployed using the vLLM framework (Kwon et al., 2023) on a cluster of 8 NVIDIA A100 GPUs. The total computational cost of these experiments is approximately 500 GPU hours.

Baseline  We adopt the methodology described in Bandyopadhyay et al. (2024) as our baseline. This approach employs a multi-staged end-to-end model to generate narrative-rich presentations, with an image similarity-based ranking algorithm to add images to the slides. The baseline method is evaluated using either GPT-4o or Qwen2.5, as it does not require processing visual information. Each configuration generates 5 × 10 = 50 presentations, given that the baseline does not take a reference presentation as input; for the same reason, we do not report its success rate or FID.
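For reference, deploying one of the open-source models with vLLM looks roughly like the following. This is a sketch: the engine arguments used on the authors' cluster are not given in the paper, and the prompt is a placeholder.

```python
from vllm import LLM, SamplingParams

# Serve Qwen2.5-72B-Instruct across 8 GPUs via tensor parallelism.
llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=2048)

outputs = llm.generate(["Draft a presentation outline for ..."], params)
print(outputs[0].outputs[0].text)
```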
4.3 Evaluation Metrics

We evaluated presentation generation using the following metrics:

• Success Rate (SR) measures the robustness of the generation task as the percentage of presentations in which all slides are successfully generated.

• Perplexity (PPL) measures the likelihood of a language model generating the given sequence. Following Bandyopadhyay et al. (2024), we calculate the average perplexity of slides within a presentation using GPT-2. A lower perplexity score indicates that the textual content is more fluent.

• FID (Heusel et al., 2017) measures the similarity between the generated presentation and the exemplar presentation in the feature space. Due to the limited sample size, we calculate the FID using a 64-dimensional output vector.

• PPTEval measures the comprehensive quality of presentations across three dimensions: coherence, content, and design. We employ GPT-4o as the judge model.
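For concreteness, the slide-level perplexity metric can be computed along the following lines. This is our reading of the metric using the HuggingFace transformers GPT-2; the per-slide averaging follows the description above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def slide_perplexity(text: str) -> float:
    """GPT-2 perplexity of one slide's text (exp of mean token NLL)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

def presentation_perplexity(slide_texts: list[str]) -> float:
    """Average slide-level perplexity over the presentation."""
    return sum(map(slide_perplexity, slide_texts)) / len(slide_texts)
```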

Setting                                   Existing Metrics                    PPTEval
Language Model  Vision Model    SR(%)↑   PPL↓     FID↓    Content↑  Design↑  Coherence↑  Avg.↑

Baseline
GPT-4o_LM       –               –        110.6    –       2.98      2.33     3.24        2.85
Qwen2.5_LM      –               –        122.4    –       2.96      2.37     3.28        2.87

PPTAgent
GPT-4o_LM       GPT-4o_VM       97.8     459.7    7.48    3.25      3.24     4.39        3.62
Qwen2-VL_LM     Qwen2-VL_VM     43.0     322.3    7.32    3.13      3.34     4.07        3.51
Qwen2.5_LM      Qwen2-VL_VM     95.0     313.9    6.20    3.28      3.27     4.48        3.67

Ablation
PPTAgent (full)                 95.0     313.9    6.20    3.28      3.27     4.48        3.67
w/o Outline                     91.0     2304.3   6.94    3.24      3.30     3.36        3.30
w/o Schema                      78.8     164.8    7.12    3.08      3.23     4.04        3.45
w/o Structure                   92.2     189.9    7.66    3.28      3.25     3.45        3.32
w/o CodeRender                  74.6     231.0    7.03    3.27      3.34     4.38        3.66

Table 2: Performance comparison of the baseline, our proposed PPTAgent framework, and its ablation variants. Results are reported using existing metrics (Success Rate (SR), Perplexity (PPL), and FID (Heusel et al., 2017)) as well as our proposed PPTEval metrics, which assess Content, Design, Coherence, and their average score.

Domain      SR (%)   PPL     FID    PPTEval
Culture     93.0     185.3   5.00   3.70
Education   94.0     249.0   7.90   3.69
Science     96.0     500.6   6.07   3.56
Society     95.0     396.8   5.32   3.59
Tech        97.0     238.7   6.72   3.74

Table 3: Evaluation results under the Qwen2.5_LM+Qwen2-VL_VM configuration in different domains, using the success rate (SR), PPL, FID, and the average PPTEval score across the three evaluation dimensions.

4.4 Result & Analysis

Table 2 presents the performance comparison between PPTAgent and baseline methods, revealing the following:

PPTAgent Enhances LLMs' Presentation Generation Capabilities  As demonstrated in Table 2, our approach empowers LLMs to produce well-rounded presentations with a remarkable success rate, achieving ≥ 95% for both Qwen2.5_LM+Qwen2-VL_VM and GPT-4o_LM+GPT-4o_VM. This is a significant improvement over the highest accuracy of 10% for session-based template editing tasks reported in Guo et al. (2023). The improvement can be attributed to three main factors: 1) PPTAgent concentrates on content modification, thereby avoiding intricate styling operations; 2) our streamlined API design allows LLMs to execute tasks with ease; 3) the code interaction module enhances LLMs' comprehension of slides and offers opportunities for self-correction, enabling them to generate accurate actions robustly. Moreover, the detailed performance of Qwen2.5_LM+Qwen2-VL_VM across various domains, as illustrated in Table 3, underscores the robustness of our approach.
PPTAgent Significantly Improves Overall Presentation Quality  By adopting an edit-based paradigm, PPTAgent allows elements within the presentation to inherit well-designed styling attributes from existing presentations. When using GPT-4o, experimental results demonstrate comprehensive improvements over the baseline. We significantly surpass the baseline method in the design dimension under PPTEval (3.24 vs. 2.33), as the presentations generated by the baseline method lack basic design efforts. Furthermore, we achieved substantial enhancements in the coherence (4.39 vs. 3.28) and content (3.25 vs. 2.98) dimensions, as the semantic information extracted during the Presentation Analysis stage effectively guided the LLMs.

Open-Source LLMs Rival GPT-4o in Performance  GPT-4o consistently demonstrates outstanding performance across various evaluation metrics, highlighting its advanced capabilities. While Qwen2-VL exhibits limitations in linguistic proficiency due to the trade-offs of multimodal post-training, GPT-4o maintains a clear advantage in handling language tasks. However, the introduction of Qwen2.5 successfully mitigates these linguistic deficiencies, bringing its performance on par with GPT-4o and achieving the best overall performance. This underscores the significant potential of open-source LLMs as competitive and highly capable presentation agents.

Correlation   Content   Design   Coherence   Avg.
Pearson       0.70      0.90     0.55        0.71
Spearman      0.73      0.88     0.57        0.74

Table 4: The correlation scores between human ratings and LLM ratings under different dimensions (Coherence, Content, Design). All presented correlations exhibit a p-value below 0.05, indicating a statistically significant level of confidence.

Figure 4: The number of iterative self-corrections required to generate a single slide under different models (GPT-4o, Qwen2.5, Qwen2-VL; counts shown on a logarithmic scale, broken down into Iter-0, Iter-1, Iter-2, and Failed).

4.5 Ablation Study

To better understand the impact of each component in our proposed method, we performed ablation studies using four different configurations. Specifically, we evaluated the method by: (1) randomly selecting a slide as the edit target (w/o Outline), (2) omitting structural information during outline generation (w/o Structure), (3) replacing the slide representation with the method described in Guo et al. (2023) (w/o CodeRender), and (4) removing guidance from the slide schema (w/o Schema). These configurations were tested using Qwen2.5_LM+Qwen2-VL_VM.

Code Representation Enhances LLMs' Comprehension  As shown in Table 2, the removal of the Code Render component leads to a significant drop in the model's success rate (SR), from 95.0 to 74.6. This underscores the critical role of code representation in leveraging LLMs' coding capabilities to improve their overall comprehension.

Presentation Analysis is Essential for Generating Targeted Presentations  The removal of the outline and structural information significantly degrades coherence (from 4.48 to 3.36 and 3.45, respectively), underscoring their crucial role in maintaining logical flow. Furthermore, the absence of the slide schema hinders LLMs from generating targeted content effectively, resulting in a drop in success rate from 95.0 to 78.8.

4.6 Error Analysis

Figure 4 illustrates the number of iterations required to generate a slide using different models. Although GPT-4o exhibits superior self-correction capabilities compared to Qwen2.5, Qwen2.5 encounters fewer errors in the first iteration (Iter-0). Additionally, we observed that Qwen2-VL experiences errors more frequently and has poorer self-correction capabilities, likely due to its multimodal post-training (Wang et al., 2024a). Ultimately, all three models successfully corrected more than half of the errors, demonstrating that our iterative self-correction mechanism effectively ensures the success of the generation process.
4.7 Effectiveness of PPTEval

Human Agreement Evaluation  Although Chen et al. (2024a) have highlighted the impressive human-like discernment of LLMs in various generation tasks, it remains crucial to assess the correlation between LLM evaluations and human evaluations in the context of presentations. This necessity arises from findings by Laskar et al. (2024), which indicate that LLMs may not be adequate evaluators for complex tasks. Table 4 shows the correlation of ratings between humans and LLMs. The average Pearson correlation of 0.71 exceeds the scores of other evaluation methods (Kwan et al., 2024), indicating that PPTEval aligns well with human preferences.

Figure 5: Correlation heatmap between existing automated evaluation metrics (ppl, fid) and the content and vision dimensions of PPTEval.

Moreover, the heatmap in Figure 5 reveals the limitations of existing metrics when compared with the Content and Design dimensions of PPTEval. In our experiments, we observed that PPL predominantly captures text fluency and is susceptible to the fragmented nature of slide text, leading to ineffective measurements with frequent outliers. Similarly, FID merely quantifies stylistic similarity to reference presentations rather than design quality, as conformity to reference styles does not necessarily indicate superior design. These findings underscore the necessity of PPTEval for comprehensive and effective presentation evaluation.

5 Related Works

Automated Presentation Generation  Recently proposed methods for slide generation can be categorized into rule-based and template-based approaches, based on how they handle element placement. Rule-based methods, such as those proposed by Mondal et al. (2024) and Li et al. (2021), often focus on enhancing textual content but neglect the visual-centric nature of presentations, leading to outputs that lack engagement. Template-based methods, including Cachola et al. (2024) and industrial solutions like Tongyi, rely on pre-designed templates to create visually appealing presentations. However, their dependence on extensive manual effort for template annotation significantly limits scalability and flexibility.

LLM Agent  Numerous studies (Li et al., 2024; Deng et al., 2024; Wang et al., 2024c) have explored the potential of LLMs to act as agents assisting humans in a wide array of tasks. For example, Zheng et al. (2024) and Wang et al. (2024b) demonstrate the capability of LLMs to accomplish tasks by generating executable actions and correcting errors based on feedback. Furthermore, Guo et al. (2023) introduces an evaluation system that assesses the ability of LLMs to perform multi-turn, multimodal slide editing tasks using APIs, which inspired the use of LLMs for complex tasks as proposed in this study.

LLM as a Judge  LLMs have demonstrated strong capabilities in instruction following and context perception, leading to their widespread use as judges (Liu et al., 2023; Zheng et al., 2023). Further research by Zhuge et al. (2024) enhanced LLMs' abilities through external modules and functions, while Chen et al. (2024a) validated the feasibility of using multimodal large language models (MLLMs) as judges. Additionally, Kwan et al. (2024) introduced a multi-dimensional evaluation framework for multi-turn conversations, which inspired the development of our proposed PPTEval.

6 Conclusion

In this paper, we introduced PPTAgent, which conceptualizes presentation generation as a two-stage presentation editing task completed through the abilities of LLMs to understand and generate code. This approach leverages textual features and layout patterns to organize slides into different functional groups. Our experiments on data from multiple domains have demonstrated the superiority of our method. Moreover, our proposed PPTEval ensures the assessability of presentations. This research provides a new paradigm for generating slides under unsupervised conditions and offers fresh insights for future work in presentation generation.
7 Limitations

While our method demonstrates its capability to produce high-quality presentations, there remain inherent challenges that impact its universal applicability. For instance, achieving a success rate of over 95% on our dataset is impressive but not absolute, which might limit its application. Moreover, parsing slides with intricate nested group shapes often proves to be a bottleneck, leading to less consistent results. Additionally, although PPTAgent shows noticeable improvements in layout optimization over prior approaches, it still falls short of exploiting the full potential of visual cues for refining stylistic consistency. This often manifests in design flaws, such as overlapping elements, undermining the visual harmony of the generated slides. Addressing these limitations calls for future enhancements that integrate visual information into the generation process.

8 Ethical Considerations

In the construction of Zenodo10K, we utilized the publicly available API to scrape data while strictly adhering to the licensing terms associated with each artifact. Specifically, artifacts that were not permitted for modification or commercial use under their respective licenses were filtered out to ensure compliance with intellectual property rights. Additionally, all annotation personnel involved in the project were compensated at rates exceeding the minimum wage in their respective cities, reflecting our commitment to fair labor practices and ethical standards throughout the dataset's development process.

References

Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, and Apoorv Saxena. 2024. Enhancing presentation slide generation by LLMs with a multi-staged end-to-end approach. arXiv preprint arXiv:2406.06556.

Isabel Alyssa Cachola, Silviu Cucerzan, Allen Herring, Vuksan Mijovic, Erik Oveson, and Sujay Kumar Jauhar. 2024. Knowledge-centric templatic views of documents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15460–15476, Miami, Florida, USA. Association for Computational Linguistics.

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, and Lichao Sun. 2024a. MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark. arXiv preprint arXiv:2402.04788.

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024b. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.

Nancy Duarte. 2008. Slide:ology: The Art and Science of Creating Great Presentations, volume 1. O'Reilly Media, Sebastopol.

Nancy Duarte. 2010. Resonate: Present Visual Stories that Transform Audiences. John Wiley & Sons.

European Organization For Nuclear Research and OpenAIRE. 2013. Zenodo.

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. 2024. LayoutGPT: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36.

Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale Song. 2022. DOC2PPT: Automatic presentation slides generation from scientific documents. Proceedings of the AAAI Conference on Artificial Intelligence, 36(1):634–642.

Michael Robert Gryk. 2022. Human readability of data files. Balisage Series on Markup Technologies, 27.

Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, and Nan Duan. 2023. PPTC benchmark: Evaluating large language models for PowerPoint task completion. arXiv preprint arXiv:2311.01767.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. Preprint, arXiv:2401.16745.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.

Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, and Jimmy Huang. 2024. A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13785–13816, Miami, Florida, USA. Association for Computational Linguistics.

Da-Wei Li, Danqing Huang, Tingting Ma, and Chin-Yew Lin. 2021. Towards topic-aware slide generation for academic papers with unsupervised mutual learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13243–13251.

Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. 2024. AppAgent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.

Himanshu Maheshwari, Sambaran Bandyopadhyay, Aparna Garimella, and Anandhavelu Natarajan. 2024. Presentations are not always linear! GNN meets LLM for document-to-presentation transformation with attribution. arXiv preprint arXiv:2405.13095.

Ishani Mondal, S Shwetha, Anandhavelu Natarajan, Aparna Garimella, Sambaran Bandyopadhyay, and Jordan Boyd-Graber. 2024. Presentations by the humans and for the humans: Harnessing LLMs for generating persona-aware slides from documents. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2664–2684.

Athar Sefid, Prasenjit Mitra, and Lee Giles. 2021. SlideGen: An abstractive section-based slide generator for scholarly documents. In Proceedings of the 21st ACM Symposium on Document Engineering, pages 1–4.

Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy X. R. Wang. 2021. D2S: Document-to-slide generation via query-based text summarization. arXiv preprint arXiv:2105.03664.

Zecheng Tang, Chenfei Wu, Juntao Li, and Nan Duan. 2023. LayoutNUWA: Revealing the hidden layout expertise of large language models. arXiv preprint arXiv:2309.09506.

VikParuchuri. 2023. marker.

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024a. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024b. Executable code actions elicit better LLM agents. arXiv preprint arXiv:2402.01030.

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024c. OpenDevin: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741.

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual transformers: Token-based image representation and processing for computer vision. Preprint, arXiv:2006.03677.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.
A Data Sampling

To maintain a reasonable cost, we selected presentations ranging from 12 to 64 pages and documents with text lengths from 2,048 to 20,480 characters.

B Details of PPTEval

Through a Shanghai-based crowdsourcing platform, we recruited four graduate students to evaluate 50 randomly selected presentations from Zenodo10K, along with 100 presentations generated by the baseline method and our approach, respectively. The evaluations were conducted across the three dimensions proposed by PPTEval, based on the same scoring criteria listed in Appendix E, along with converted slide images. Some scoring examples are shown in Figure 6, and the detailed performance of Qwen2.5_LM+Qwen2-VL_VM across various domains is given in Table 3.

Figure 6: Scoring examples of PPTEval. Content panel — score 1: "Lack of content"; score 3: "The content is somewhat tedious and lacks the support of images"; score 5: "The content is impactful, with relevant images supporting it well". Design panel — score 2: "Monochromatic colors without visual elements"; score 4: "Harmonious color with the use of geometric shapes; however, some minor flaws diminish the overall design"; score 5: "Slide presents engaging design with consistent overall design".

C Layout Analysis

We present our hierarchical clustering algorithm for layout analysis in Algorithm 1, where slides are grouped into clusters using a similarity threshold θ of 0.65. To minimize clustering interference, we replace the text and images in the slides with placeholders beforehand. Examples of the extracted slide clusters are provided in Figure 7.

Figure 7: Example of slide clusters, with layout-pattern titles such as "Structured Overview with Bullets: text", "Diagrammatic Work Flow: picture", "Text and visuals interaction", and "Results Summary: text".

Algorithm 1 Slides Clustering Algorithm
1: Input: Similarity matrix of slides S ∈ R^(N×N), similarity threshold θ
2: Initialize: C ← ∅
3: while max(S) ≥ θ do
4:    (i, j) ← arg max(S)
5:    if ∃ c_k ∈ C such that (i ∈ c_k ∨ j ∈ c_k) then
6:       c_k ← c_k ∪ {i, j}
7:    else
8:       c_new ← {i, j}
9:       C ← C ∪ {c_new}
10:   end if
11:   Update S:
12:      S[:, i] ← 0, S[i, :] ← 0
13:      S[:, j] ← 0, S[j, :] ← 0
14: end while
15: Return: C
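For readers who prefer an executable form, the following is a direct Python transcription of Algorithm 1. It is a sketch: zeroing the diagonal to ignore self-similarity is our addition, and the rest mirrors the pseudocode line by line.

```python
import numpy as np

def cluster_slides(S: np.ndarray, theta: float = 0.65) -> list[set[int]]:
    """Greedy clustering of slides from a pairwise similarity matrix S,
    following Algorithm 1. Rows/columns of merged slides are zeroed so
    each slide is assigned at most once."""
    S = S.copy()
    np.fill_diagonal(S, 0.0)              # ignore self-similarity
    clusters: list[set[int]] = []
    while S.max() >= theta:
        i, j = np.unravel_index(np.argmax(S), S.shape)
        for c in clusters:
            if i in c or j in c:          # extend an existing cluster
                c.update({i, j})
                break
        else:                              # otherwise start a new cluster
            clusters.append({i, j})
        S[:, i] = 0; S[i, :] = 0          # remove merged slides from S
        S[:, j] = 0; S[j, :] = 0
    return clusters
```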

D Code Interaction

Our provided APIs and their corresponding functions are summarized in Table 5, with Figure 8 presenting an example of rendered HTML from a slide.

E Prompts

E.1 Prompts for Presentation Analysis

The prompts used for presentation analysis are illustrated in Figures 9, 10, and 11.

E.2 Prompts for Presentation Generation

The prompts used for generating presentations are shown in Figures 12, 13, and 14.

E.3 Prompts for PPTEval

The prompts used in PPTEval are depicted in Figures 15, 16, 17, 18, 19, and 20.
Figure 8: Example of rendering a slide into HTML format.

System Message:
You are a helpful assistant

Prompt:
Please analyze the slide elements and create a structured template schema in JSON format. The schema should:

1. Identify key content elements (both text and images) that make up the slide
2. For each element, specify:
- "description": A clear description of the element's purpose; do not mention any detail
- "type": "text" or "image", determined according to the tag of the element: "image" is assigned for <img> tags
- "data":
  * For text elements: the actual text content as a string or array at paragraph level (<p> or <li>), merging inline text segments (<span>)
  * For image elements: use the `alt` attribute of the <img> tag as the data of the image

Example format:
{
  "element_name": {
    "description": "purpose of this element", # do not mention any detail, just purpose
    "type": "text" or "image",
    "data": "actual text" or "<type>:<50-word description>" # detail here, cannot be empty or null
      or ["text1", "text2"] # Multiple text elements
      or ["logo:...", "logo:..."] # Multiple image elements
  }
}

Input:
{{slide}}

Please provide a schema that could be used as a template for creating similar slides.

Figure 11: Illustration of the prompt used to extract the slide schema.
System Message:
You are an expert presentation analyst specializing in categorizing PowerPoint slides, particularly skilled at identifying structural slides (such as Opening, Transitions, and Ending slides) that guide the flow of the presentation. Please follow the specified output format strictly when categorizing the slides.

Prompt:
Objective: Analyze a set of slides provided in plain text format. Your task is to identify structural slides (such as Opening and Ending) based on their content and categorize all other slides under "Content."

Instructions:
1. Categorize structural slides in the presentation (such as Opening, Ending); assign all other slides to "Content."
2. Category names for structural slides should be simple, reflect their function, and contain no specific entity names.
3. Opening and Ending slides are typically located at the beginning or end of the presentation and may consist of only one slide.
4. Other transition categories must contain multiple slides with partially identical text.

Output format requirements:
Use the Functional key to group all categorized structural slides, with category names that reflect only the slide's function (e.g., "Opening," "Ending") and do not describe any specific content.
Use the Content key to list all slides that do not fall into structural categories.

Example output:
```json
{
  "functional": {
    "opening": [1],
    "table of contents": [2, 5],
    "section header": [3, 6],
    "ending": [10]
  },
  "content": [4, 7, 8, 9]
}
```
Ensure that all slides are included in the categorization, with their corresponding slide numbers listed in the output.

Input: {{slides}}
Output:

Figure 9: Illustration of the prompt used for clustering structural slides.

System Message:
You are a helpful assistant

Prompt:
Analyze the content layout and media types in the provided slide images.
Your objective is to create a concise, descriptive title that captures purely the presentation pattern and structural arrangement of content elements.
Requirements:
Focus on HOW content is structured and presented, not WHAT the content is
Describe the visual arrangement and interaction between different content types (text, images, diagrams, etc.)
Avoid:
Any reference to specific topics or subjects
Business or industry-specific terms
Actual content descriptions

You cannot use the following layout names:
{{ existed_layoutnames }}

Example Outputs:
Hierarchical Bullet Points with Central Image
Presentation of Evolution Through a Timeline
Analysis Displayed Using a Structured Table
Growth Overview Illustrated with Multiple Charts
Picture and illustrative key points

Output: Provide a one-line layout pattern title.

Figure 10: Illustration of the prompt used to infer layout patterns.

System Message:
You are a professional presentation designer tasked with creating structured PowerPoint outlines. Each slide outline should include a slide title, a suitable layout from provided options, and concise explanatory notes. Your objective is to ensure that the outline adheres to the specified slide count and uses only the provided layouts. The final deliverable should be formatted as a JSON object. Please ensure that no layouts other than those provided are utilized in the outline.

Prompt:
Steps:

1. Understand the JSON Content:
Carefully analyze the provided JSON input.
Identify key sections and subsections.
{{ json_content }}

2. Generate the Outline:
Ensure that the number of slides matches the specified requirement.
Keep the flow between slides logical and ensure that the sequence of slides enhances understanding.
Make sure that the transitions between sections are smooth through functional layouts.
Carefully analyze the content and media types specified in the provided layouts.

For each slide, provide:
A Slide Title that clearly represents the content.
A Layout selected from provided layouts tailored to the slide's function.
Slide Description, which should contain concise and clear descriptions of the key points.

Please provide your output in JSON format.

Example Output:
{
  "Opening of the XX": {
    "layout": "layout1(media_type)",
    "subsection_keys": [],
    "description": "..."
  },
  "Introduction to the XX": {
    "layout": "layout2(media_type)", # select from given layouts (functional or content)
    "subsection_keys": ["Title of Subsection 1.1", "Title of Subsection 1.2"],
    "description": "..."
  }
}

Input:
Number of Slides: {{ num_slides }}
Image Information:
{{ image_information }}
# you can only use the following layouts
Content Layouts:
{{ layouts }}
Functional Layouts:
{{ functional_keys }}

Output:

Figure 12: Illustration of the prompt used for generating the outline.
Function Name Description
del_span Deletes a specific span.
del_image Deletes an image element.
clone_paragraph Creates a duplicate of an existing paragraph.
replace_span Replaces the content of a specific span.
replace_image Replaces an image with a new image.

Table 5: Definition and function of our provided APIs.
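For illustration, a model output in the action-sequence format that Figure 14 requests might look like the following. The calls use the APIs of Table 5; the element IDs and the command comments are hypothetical, echoing the example edits shown in Figure 2.

```python
# command: (title, text, 0, "Opening", "AI Era")
replace_span(0, "AI Era")
# command: (icon, image, 0, "old_icon.png", "Drone.jpg")
replace_image(7, "Drone.jpg")
```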

System Message:
You are an Editor agent for presentation content. You transform reference text and available images into structured slide content following schemas. You excel at following schema rules like content length and ensuring all content is strictly derived from provided reference materials. You never generate new content or use images not explicitly provided.

Prompt:
Generate slide content based on the provided schema.
Each schema element specifies its purpose and its default quantity.

Requirements:
1. Content Generation Rules:
- Follow default_quantity for elements, adjust when necessary
- All generated content must be based on reference text or image information
- Ensure text content meets character limits
- Generated text should use a concise and impactful presentation style
- For image elements, data should be the image path # eg: "images/logo.png"
- The type of images should be a critical factor in image selection; if no relevant image (similar type or purpose) is provided, leave it blank
2. Core Elements:
- Must extract essential content from reference text (e.g., slide_title, main_content) and maintain semantic consistency
- Must include images that support the main content (e.g., diagrams for explanations, visuals directly discussed in text)
3. Supporting Elements (e.g., presenters, logo images):
- Generate only when relevant content exists in reference text or image information

Generate content for each element and output in the following format:
{
  "element1": {
    "data": ["text1", "text2"] for text elements
      or ["/path/to/image", "..."] for image elements
  },
}

Input:
Schema:
{{schema}}

Outline of Presentation:
{{outline}}

Metadata of Presentation:
{{metadata}}

Reference Text:
{{text}}

Available Images:
{{images_info}}

Output: the keys in generated content should be the same as the keys in schema

Figure 13: Illustration of the prompt used for generating slide content.

System Message:
You are a Code Generator agent specializing in slide content manipulation. You precisely translate content edit commands into API calls by following HTML structure, distinguishing between tags, and maintaining proper parent-child relationships to ensure accurate element targeting.

Prompt:
Generate the sequence of API calls based on the provided commands, ensuring compliance with the specified rules and precise execution.
You must determine the parent-child relationships of elements based on indentation and ensure that all <span> and <img> elements are processed, leaving no unhandled content.

Each command follows this format: (element_class, type, quantity_change: int, old_data, new_data).

Steps

1. Quantity Adjustment:
- quantity_change Rules:
  - If quantity_change = 0, do not perform clone_paragraph or del_span operations. Only replace the content.
  - If quantity_change > 0, use clone_paragraph to add the corresponding number of paragraphs:
    - When cloning, prioritize paragraphs from the same element_class that already have special styles (e.g., bold, color) if available.
    - The paragraph_id for newly cloned paragraphs should be the current maximum paragraph_id of the parent element plus 1, while retaining the span_id within the cloned paragraph unchanged.
  - If quantity_change < 0, use del_span or del_image to reduce the corresponding number of elements. Always ensure to remove span elements from the end of the paragraph first.
- Restriction:
  - Each command's API call can only use either clone_paragraph or del_span/del_image according to the `quantity_change`, but not both.

2. Content Replacement:
- Text Content: Use replace_span to sequentially distribute new content into one or more <span> elements within a paragraph. Select appropriate tags for emphasized content (e.g., bold, special color, larger font).
- Image Content: Use replace_image to replace image resources.

3. Output Format:
- Add comments to each API call group, explaining the intent of the original command and the associated element_class.
- For cloning operations, annotate the paragraph_id of the newly created paragraphs.

Available APIs

{{api_docs}}

Example Input:
Please output only the API call sequence, one call per line, wrapped in ```python and ```, with comments for corresponding commands.

Figure 14: Illustration of the prompt used for generating editing actions.

System Message:
You are a help assistant

Prompt:
Please describe the input slide based on the following three dimensions:
1. The amount of information conveyed
Whether the slide conveys too lengthy or too little information, resulting in a large white space without colors or images.
2. Content Clarity and Language Quality
Check if there are any grammatical errors or unclear expressions of textual content.
3. Images and Relevance
Assess the use of visual aids such as images or icons, their presence, and how well they relate to the theme and content of the slides.

Provide an objective and concise description without comments, focusing exclusively on the dimensions outlined above.

Figure 15: Illustration of the prompt used to describe content in PPTEval.

System Message:
You are a help assistant

Prompt:
Please describe the input slide based on the following three dimensions:
1. Visual Consistency
Describe whether any style diminished the readability, like border overflow or blur, low contrast, or visual
noise.
2. Color Scheme
Analyze the use of colors in the slide, identifying the colors used and determining whether the design is
monochromatic (black and white) or colorful (gray counts in).
3. Use of Visual Elements
Describe whether the slide includes supporting visual elements, such as icons, backgrounds, images, or
geometric shapes (rectangles, circles, etc.).

Provide an objective and concise description without comments, focusing exclusively on the dimensions
outlined above.

Figure 16: Illustration of the prompt used to describe style in PPTEval.
System Message:
You are an expert presentation content extractor responsible for analyzing and summarizing key elements
and metadata of presentations. Your task is to extract and provide the following information:

Prompt:
1. Slide Descriptions: Provide a concise summary of the content and key points covered on each slide.
2. Presentation Metadata: Identify explicit background information(which means it should be a single
paragraph, not including in other paragraphs), such as the author, speaker, date, and other directly stated
details, from the opening and closing slides.

Example Output:
{
"slide_1": "This slide introduces the xx, xx.",
"slide_2": "...",
"background": {
"speaker": "speaker x",
"date": "date x"
}
}

Input:
{{presentation}}

Output:

Figure 17: Illustration of the prompt used to extract content in PPTEval.

System Message:
You are an unbiased presentation analysis judge responsible for evaluating the quality of slide content. Please carefully review the provided slide image, assessing its content, and provide your judgement in a JSON object containing the reason and score. Each score level requires that all evaluation criteria meet the standards of that level.

Prompt:
Scoring Criteria (Five-Point Scale):

1 Point (Poor):
The text on the slides contains significant grammatical errors or is poorly structured, making it difficult to understand.

2 Points (Below Average):
The slides lack a clear focus, the text is awkwardly phrased, and the overall organization is weak, making it hard to engage the audience.

3 Points (Average):
The slide content is clear and complete but lacks visual aids, resulting in insufficient overall appeal.

4 Points (Good):
The slide content is clear and well-developed, but the images have weak relevance to the theme, limiting the effectiveness of the presentation.

5 Points (Excellent):
The slides are well-developed with a clear focus, and the images and text effectively complement each other to convey the information successfully.

Example Output:
{
  "reason": "xx",
  "score": int
}

Input: {{descr}}
Let's think step by step and provide your judgment.

Figure 18: Illustration of the prompt used to evaluate content in PPTEval.

System Message:
You are an unbiased presentation analysis judge responsible for evaluating the coherence of the presentation. Please carefully review the provided summary of the presentation, assessing its logical flow and contextual information; each score level requires that all evaluation criteria meet the standards of that level.

Prompt:
Scoring Criteria (Five-Point Scale)

1 Point (Poor):
Terminology is inconsistent, or the logical structure is unclear, making it difficult for the audience to understand.

2 Points (Fair):
Terminology is consistent and the logical structure is generally reasonable, with minor issues in transitions.

3 Points (Average):
The logical structure is sound with fluent transitions; however, it lacks basic background information.

4 Points (Good):
The logical flow is reasonable and includes basic background information (e.g., speaker or acknowledgments/conclusion).

5 Points (Excellent):
The narrative structure is engaging and meticulously organized, with detailed and comprehensive background information included.

Example Output:
{
  "reason": "xx",
  "score": int
}

Input:
{{presentation}}

Let's think step by step and provide your judgment, focusing exclusively on the dimensions outlined above and strictly following the criteria.

Figure 20: Illustration of the prompt used to evaluate coherence in PPTEval.
System Message:
You are an unbiased presentation analysis judge responsible for evaluating the visual appeal of slides.
Please carefully review the provided description of the slide, assessing its aesthetics only, and provide
your judgment in a JSON object containing the reason and score. Each score level requires that all
evaluation criteria meet the standards of that level.

Prompt:
Scoring Criteria (Five-point scale):

1 Point (Poor):
There is a conflict between slide styles, making the content difficult to read.

2 Points (Fair):
The slide uses monotonous colors(black and white), ensuring readability while lacking visual appeal.

3 Points (Average):
The slide employs a basic color scheme; however, it lacks supplementary visual elements such as icons,
backgrounds, images, or geometric shapes(like rectangles), making it look plain.

4 Points (Good):
The slide uses a harmonious color scheme and contains some visual elements(like icons, backgrounds,
images, or geometric shapes); however, minor flaws may exist in the overall design.

5 Points (Excellent):
The style of the slide is harmonious and engaging, the use of supplementary visual elements like images
and geometric shapes enhances the slide’s overall visual appeal.

Example Output:
{
"reason": "xx",
"score": int
}

Input: {{descr}}
Let's think step by step and provide your judgment.

Figure 19: Illustration of the prompt used to evaluate style in PPTEval.
