PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Hao Zheng1,2,*, Xinyan Guan1,2,*, Hao Kong3, Jia Zheng1, Hongyu Lin1,
Yaojie Lu1, Ben He1,2, Xianpei Han1, Le Sun1
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Shanghai Jiexin Technology
{zhenghao2022,guanxinyan2022,zhengjia,hongyu,luyaojie}@iscas.ac.cn
{xianpei,sunle}@iscas.ac.cn, [email protected]
Abstract

… comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.

[Figure 1 contrasts two example slides: ours ("Content: Visual Support; Design: Engaging Look; Coherence: Proper Intro") versus the baseline's ("Content: Tedious Text; Design: Boring Layout; Coherence: Abrupt Start").]

Figure 1: Comparison between our PPTAgent approach (left) and the conventional abstractive summarization method (right). Our method, which begins by editing a reference slide, aligns more closely with the human presentation creation process.

* These authors contributed equally.

1 Introduction

Presentations are a widely used medium for information delivery, valued for their visual effectiveness in engaging and communicating with audiences. However, creating high-quality presentations requires a captivating storyline, visually appealing layouts, and rich, impactful content (Fu et al., 2022). Consequently, creating well-rounded presentations demands advanced presentation skills and significant effort. Given the inherent complexity of presentation creation, there is growing interest in automating the presentation generation process (Mondal et al., 2024; Maheshwari et al., 2024) by leveraging the generalization capabilities of large language models (LLMs).

Existing approaches often adopt an end-to-end text-generation paradigm, focusing solely on textual content while neglecting layout design and presentation structure, which makes them impractical for real-world applications. For example, as shown in Figure 1, prior studies (Mondal et al., 2024; Sefid et al., 2021) treat presentation generation as an abstractive summarization task, focusing primarily on textual content while overlooking the interactive nature of presentations. This results in simplistic and visually uninspiring outputs that fail to engage audiences.

However, automatically creating visually rich and structurally clear presentations remains challenging due to the complexity of data formats and the lack of effective evaluation frameworks. First, most presentations are saved in PowerPoint's XML format, which is inherently tedious and redundant (Gryk, 2022). This complex format poses significant challenges for LLMs in interpreting the presentation layout and structure, let alone generating appealing slides in an end-to-end fashion. Second, and more importantly, the absence of comprehensive evaluation frameworks exacerbates this issue. Current metrics like perplexity and ROUGE (Lin, 2004) fail to capture essential aspects of presentation quality such as narrative flow, visual design, and content impact. Moreover, ROUGE-based evaluation tends to reward excessive textual alignment with input documents, undermining the brevity and clarity crucial for effective presentations. These limitations highlight the urgent need for advancements in automated presentation generation, particularly in enhancing visual design and developing comprehensive evaluation frameworks.

[Figure 2 depicts the pipeline: (1) Slide Clustering and (2) Schema Extraction over the reference presentation (e.g., an "Opening" cluster with title, author, and icon elements); (3) Outline Generation mapping document sections to slides (e.g., "Slide 1: Opening — Doc Section: S1"); and (4) Slide Generation via editing actions such as replace_span(0, "AI Era") and replace_image(7, "Drone.jpg"), with feedback-driven self-correction (e.g., "check for existence").]

Figure 2: Overview of the PPTAgent workflow. Stage I: Presentation Analysis involves analyzing the input presentation to cluster slides into groups and extract their content schemas. Stage II: Presentation Generation generates new presentations guided by the outline, incorporating feedback mechanisms to ensure robustness.

Rather than creating complex presentations from scratch in a single pass, humans typically create presentations by selecting exemplary slides as references and then summarizing and transferring key content onto them (Duarte, 2010). Inspired by this process, we design PPTAgent to decompose presentation generation into an iterative, edit-based workflow, as illustrated in Figure 2. In the first stage, given a document and a reference presentation, PPTAgent analyzes the reference presentation to extract semantic information, producing a textual description that identifies the purpose and data model of each slide. In the Presentation Generation stage, PPTAgent generates a detailed presentation outline and assigns specific document sections and reference slides to each slide. For instance, the framework selects the opening slide as the reference slide to present meta-information, such as the title and icon. PPTAgent offers a suite of editing action APIs that empower LLMs to dynamically modify the reference slide. By breaking the process into discrete stages rather than generating end-to-end, this approach ensures consistency, adaptability, and seamless handling of complex formats.

To comprehensively evaluate the quality of generated presentations, we propose PPTEval, a multidimensional evaluation framework. Inspired by Chen et al. (2024a) and Kwan et al. (2024), PPTEval leverages the MLLM-as-a-judge paradigm to enable systematic and scalable evaluation. Drawing from Duarte (2010), we categorize presentation quality into three dimensions: Content, Design, and Coherence, providing both quantitative scores and qualitative feedback for each dimension. Our human evaluation studies validated the reliability and effectiveness of PPTEval.

Results demonstrate that our method effectively generates high-quality presentations, achieving an average score of 3.67 across the three dimensions evaluated by PPTEval. These results, covering a diverse range of domains together with a high success rate of 97.8%, showcase the versatility and robustness of our approach.
Our main contributions can be summarized as follows:

• We propose PPTAgent, a novel framework that redefines automatic presentation generation as an edit-based workflow guided by reference presentations.

• We introduce PPTEval, the first comprehensive evaluation framework that assesses presentations across three key dimensions: Content, Design, and Coherence.

• We publicly released the PPTAgent and PPTEval codebase, along with a curated presentation dataset, to facilitate future research in automatic presentation generation.

2 PPTAgent

In this section, we first establish the formulation of the presentation generation task. Subsequently, we describe the framework of our proposed PPTAgent, which operates in two distinct stages. In stage I, we analyze the reference presentation by clustering similar slides and extracting their content schemas. This process aims to enhance the expressiveness of the reference presentation, thereby facilitating subsequent presentation generation. In stage II, given an input document and the analyzed reference presentation, we select the most suitable slides and generate the target presentation through an interactive editing process based on the selected slides. An overview of our proposed workflow is illustrated in Figure 2.

2.1 Problem Formulation

PPTAgent is designed to generate an engaging presentation via an edit-based process. We provide formal definitions for both PPTAgent and the conventional method, illustrating their divergence. The conventional method for creating each slide S can be described by Equation 1, where n represents the number of elements on the slide and C denotes the source content composed of sections and figures. Each element on the slide, e_i, is defined by its type, content, and styling attributes, such as (Textbox, "Hello", {border, size, position, ...}).

    S = \sum_{i=1}^{n} e_i = f(C)    (1)

Compared to the conventional method, PPTAgent adopts an edit-based generation paradigm for creating new slides, addressing challenges in processing spatial relationships and designing styles. This approach generates a sequence of actions to modify existing slides. Within this paradigm, both the input document and the reference presentation serve as inputs. The process can be described by Equation 2, where m represents the number of generated actions, each action a_i is a line of executable code, and R_j is the reference slide being edited.

    A = \sum_{i=1}^{m} a_i = f(C \mid R_j)    (2)

2.2 Stage I: Presentation Analysis

To facilitate presentation generation, we first cluster slides in the reference presentation and extract their content schemas. This structured semantic representation helps LLMs determine which slides to edit and what content to convey in each slide.

Slide Clustering. Slides can be categorized into two main types based on their functionality: slides that support the structure of the presentation (e.g., opening slides) and slides that convey specific content (e.g., bullet-point slides). We employ different clustering algorithms to effectively cluster slides based on their textual or visual characteristics. For structural slides, we leverage LLMs to infer the functional role of each slide and group them accordingly, as these slides often exhibit distinctive textual features. For the remaining slides, which primarily focus on presenting specific content, we employ a hierarchical clustering approach leveraging image similarity. For each cluster, we infer its layout patterns using MLLMs. Further details regarding this method can be found in Appendix C.

Schema Extraction. After clustering slides to facilitate the selection of slide references, we further analyze their content schemas to ensure purposeful alignment of the editing. Given the complexity and fragmentation of real-world slides, we utilize the context perception capabilities of LLMs (Chen et al., 2024a) to extract diverse content schemas. Specifically, we define an extraction framework in which each element is represented by its category, modality, and content. Based on this framework, the schema of each slide is extracted through LLMs' instruction-following and structured output capabilities. Detailed instructions are provided in Appendix E.
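The edit-based formulation of Equation 2 can be made concrete with a small sketch: each action a_i is a line of Python invoking an editing API on a copy of the reference slide R_j. The `Slide` class and `apply_actions` helper below are hypothetical illustrations; only the action names `replace_span` and `replace_image` are taken from the examples in Figure 2.

```python
# Minimal sketch of Eq. 2: an action sequence A = [a_1, ..., a_m] is
# executed line by line against a copy of the reference slide R_j.
# The Slide class is an illustrative stand-in, not PPTAgent's implementation.
from copy import deepcopy

class Slide:
    def __init__(self, spans, images):
        self.spans = list(spans)    # indexed text elements
        self.images = list(images)  # indexed image elements

    def replace_span(self, idx, text):
        self.spans[idx] = text

    def replace_image(self, idx, path):
        self.images[idx] = path

def apply_actions(reference, actions):
    slide = deepcopy(reference)  # never mutate the reference slide itself
    for line in actions:         # each a_i is one line of executable code
        exec(line, {}, {"slide": slide})
    return slide

reference = Slide(spans=["Old Title"], images=["old.png"])
edited = apply_actions(reference, [
    'slide.replace_span(0, "AI Era")',
    'slide.replace_image(0, "Drone.jpg")',
])
```

Executing generated code against a copied slide is what lets the same reference slide serve as the starting point for many generated slides.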
2.3 Stage II: Presentation Generation

In this stage, we begin by generating an outline that specifies the reference slide and relevant content for each slide in the new presentation. For each slide, LLMs iteratively edit the reference slide using interactive executable code actions to complete the generation process.

[Figure: PPTEval overview. The logical structure of the evaluation target is extracted (Slide-1: Describe xx, ..., Slide-n: Conclude xx) and passed, together with slide images, to an MLLM judge, which returns per-dimension scores and feedback, e.g., Content: 5 ("The textual content is impactful and well supported by images"), Design: 4 ("Cohesive design, but overlaps reduce appeal"), Coherence: 4 ("Minor flaws presented in the logical structure").]

Table 2: Performance comparison of the baseline, our proposed PPTAgent framework, and its ablation variants. Results are reported using existing metrics (Success Rate (SR), Perplexity (PPL), and FID (Heusel et al., 2017)) as well as our proposed PPTEval metrics, which assess Content, Design, Coherence, and their average score.

Domain     SR (%)  PPL    FID   PPTEval
Culture    93.0    185.3  5.00  3.70
Education  94.0    249.0  7.90  3.69
Science    96.0    500.6  6.07  3.56
Society    95.0    396.8  5.32  3.59
Tech       97.0    238.7  6.72  3.74

Table 3: Evaluation results under the configuration of Qwen2-VL_LM + Qwen2-VL_VM in different domains, using the success rate (SR), PPL, FID, and the average PPTEval score across the three evaluation dimensions.

… presentation. We do not report the success rate and FID of the baseline method for the same reason.

4.3 Evaluation Metrics

We evaluated presentation generation using the following metrics:

• Success Rate (SR) measures the robustness of the generation task by determining the percentage of presentations in which all slides are successfully generated.

• Perplexity (PPL) measures the likelihood of a language model generating the given sequence. Following Bandyopadhyay et al. (2024), we calculate the average perplexity of slides within a presentation using GPT-2. A lower perplexity score indicates that the textual content is more fluent.

• FID (Heusel et al., 2017) measures the similarity between the generated presentation and the exemplar presentation in the feature space. Due to the limited sample size, we calculate the FID using a 64-dimensional output vector.

• PPTEval measures the comprehensive quality of presentations across three dimensions: coherence, content, and design. We employ GPT-4o as the judge model.

4.4 Result & Analysis

Table 2 presents the performance comparison between PPTAgent and baseline methods, revealing that:

PPTAgent Enhances LLMs' Presentation Generation Capabilities. As demonstrated in Table 2, our approach empowers LLMs to produce well-rounded presentations with a remarkable success rate, achieving ≥ 95% for both Qwen2.5_LM + Qwen2-VL_VM and GPT-4o_LM + GPT-4o_VM. This is a significant improvement over the highest accuracy of 10% for session-based template editing tasks reported in Guo et al. (2023). The improvement can be attributed to three main factors: 1) PPTAgent concentrates on content modification, thereby avoiding intricate styling operations. 2) Our streamlined API design allows LLMs to execute tasks with ease. 3) The code interaction module enhances LLMs' comprehension of slides and offers opportunities for self-correction, enabling them to generate accurate actions robustly. Moreover, the detailed performance of Qwen2.5_LM + Qwen2-VL_VM across various domains, as shown in Table 3, underscores the robustness of our approach.
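The self-correction behavior of the code interaction module (factor 3 above) amounts to an execute-with-feedback retry loop: a failed action's error message is returned to the model, which proposes a corrected action. The sketch below is a hedged illustration under that reading; `Slide`, `StubModel`, and the action strings are hypothetical stand-ins, not PPTAgent's actual code.

```python
# Illustrative execute-with-feedback loop: on failure, the error message
# is fed back to the model and the action is regenerated.
class Slide:
    def __init__(self):
        self.spans = ["Old Title"]

    def replace_span(self, idx, text):
        self.spans[idx] = text  # a bad index raises IndexError

def run_with_self_correction(model, slide, command, max_iters=3):
    feedback = None
    for _ in range(max_iters):
        action = model(command, feedback)  # draft or repair an action
        try:
            exec(action, {}, {"slide": slide})
            return action  # action executed successfully
        except Exception as err:
            feedback = f"{type(err).__name__}: {err}"  # returned to the model
    raise RuntimeError(f"gave up after {max_iters} attempts: {feedback}")

class StubModel:
    """Pretend LLM: first proposes an out-of-range span, then corrects it."""
    def __init__(self):
        self.calls = 0

    def __call__(self, command, feedback):
        self.calls += 1
        if feedback is None:
            return 'slide.replace_span(7, "AI Era")'  # fails: no index 7
        return 'slide.replace_span(0, "AI Era")'      # corrected after feedback

slide, model = Slide(), StubModel()
run_with_self_correction(model, slide, "set the title to 'AI Era'")
```

Bounding the loop with `max_iters` keeps a persistently failing slide from stalling the whole generation run, which is one way to account for the reported per-presentation success rates.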
Correlation   Content  Design  Coherence  Avg.
Pearson       0.70     0.90    0.55       0.71
Spearman      0.73     0.88    0.57       0.74

Table 4: The correlation scores between human ratings and LLM ratings under different dimensions (Content, Design, Coherence). All reported correlations have a p-value below 0.05, indicating a statistically significant level of confidence.

[Figure: distribution of self-correction iterations (Iter-0, Iter-1, Iter-2, Failed) for GPT-4o, Qwen2.5, and Qwen2-VL, with counts shown on a logarithmic axis (10^0 to 10^4).]
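The agreement scores in Table 4 are plain Pearson and Spearman correlations between paired human and model ratings. A dependency-free sketch (in practice one would use `scipy.stats.pearsonr`/`spearmanr`, which also report the p-values cited in the caption; the rating lists below are made-up examples):

```python
# Pearson correlation between paired ratings, and Spearman as Pearson
# computed over (tie-averaged) ranks. Pure-Python illustration.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank(values):
    # assign average ranks to ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(rank(x), rank(y))

human_scores = [5, 4, 4, 3, 2]  # hypothetical paired ratings
llm_scores = [5, 4, 3, 3, 2]
```

Spearman-over-ranks is less sensitive than Pearson to the coarse five-point scale, which is why both are usually reported together.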
Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. 2024. AppAgent v2: Advanced agent for flexible mobile interactions. arXiv preprint arXiv:2408.11824.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.

Himanshu Maheshwari, Sambaran Bandyopadhyay, Aparna Garimella, and Anandhavelu Natarajan. 2024. Presentations are not always linear! GNN meets LLM for document-to-presentation transformation with attribution. arXiv preprint arXiv:2405.13095.

Ishani Mondal, S Shwetha, Anandhavelu Natarajan, Aparna Garimella, Sambaran Bandyopadhyay, and Jordan Boyd-Graber. 2024. Presentations by the humans and for the humans: Harnessing LLMs for generating persona-aware slides from documents. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2664–2684.

Athar Sefid, Prasenjit Mitra, and Lee Giles. 2021. SlideGen: An abstractive section-based slide generator for scholarly documents. In Proceedings of the 21st ACM Symposium on Document Engineering, pages 1–4.

Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy X. R. Wang. 2021. D2S: Document-to-slide generation via query-based text summarization. arXiv preprint arXiv:2105.03664.

Zecheng Tang, Chenfei Wu, Juntao Li, and Nan Duan. 2023. LayoutNUWA: Revealing the hidden layout expertise of large language models. arXiv preprint arXiv:2309.09506.

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual transformers: Token-based image representation and processing for computer vision. Preprint, arXiv:2006.03677.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. 2024. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.
A Data Sampling

To maintain a reasonable cost, we selected presentations ranging from 12 to 64 pages and documents with text lengths from 2,048 to 20,480 characters.

B Details of PPTEval

Through a Shanghai-based crowdsourcing platform, we recruited four graduate students to evaluate 50 randomly selected presentations from Zenodo10K, along with 100 presentations generated by the baseline method and our approach, respectively. The evaluations were conducted across the three dimensions proposed by PPTEval, based on the same scoring criteria listed in Appendix E.

[Figure: example slides and judgements at different PPTEval scores. Content — Score 1: "Lack of content"; Score 3: "The content is somewhat tedious and lacks the support of images"; Score 5: "The content is impactful and well supported by relevant images". Design — Score 2: "Monochromatic colors without visual elements"; Score 4: "Harmonious color with the use of geometric shapes; however, some minor flaws diminish the overall design"; Score 5: "Slide presents engaging design with consistent overall design".]
C Layout Analysis

We present our hierarchical clustering algorithm for layout analysis in Algorithm 1, where slides are grouped into clusters using a similarity threshold θ of 0.65. To minimize clustering interference, we replace the text and images in the slides with placeholders beforehand. Moreover, examples of the extracted slide clusters are provided in Figure 7.

[Figure 7: Example of slide clusters, e.g., "Structured Overview with Bullet: text" and "Diagrammatic Work Flow: Picture".]

Our provided APIs and their corresponding functions are summarized in Table 5, with Figure 8 presenting an example of rendered HTML from a slide.

Algorithm 1 Slides Clustering Algorithm
 1: Input: Similarity matrix of slides S ∈ R^{N×N}, similarity threshold θ
 2: Initialize: C ← ∅
 3: while max(S) ≥ θ do
 4:   (i, j) ← arg max(S)
 5:   if ∃ c_k ∈ C such that (i ∈ c_k ∨ j ∈ c_k) then
 6:     c_k ← c_k ∪ {i, j}
 7:   else
 8:     c_new ← {i, j}
 9:     C ← C ∪ {c_new}
10:   end if
11:   Update S:
12:   S[:, i] ← 0, S[i, :] ← 0
13:   S[:, j] ← 0, S[j, :] ← 0
14: end while
15: Return: C

E Prompts

E.1 Prompts for Presentation Analysis

The prompts used for presentation analysis are illustrated in Figures 9, 10, and 11.

E.2 Prompts for Presentation Generation

The prompts used for generating presentations are shown in Figures 12, 13, and 14.

E.3 Prompts for PPTEval

The prompts used in PPTEval are depicted in Figures 15, 16, 17, 18, 19 and 20.
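Algorithm 1 translates almost line-for-line into Python. The sketch below follows the pseudocode as printed (greedy pairing, with both merged slides' rows and columns zeroed after each step) and assumes a zero diagonal; it is an illustration, not the repository's implementation.

```python
def cluster_slides(S, theta=0.65):
    """Greedy clustering over a symmetric similarity matrix S (Algorithm 1).

    Assumes S has a zero diagonal (a slide is never paired with itself).
    """
    S = [row[:] for row in S]  # work on a copy
    n = len(S)
    clusters = []
    while True:
        # (i, j) <- arg max(S): the most similar remaining pair
        i, j = max(((a, b) for a in range(n) for b in range(n)),
                   key=lambda p: S[p[0]][p[1]])
        if S[i][j] < theta:  # while-condition: max(S) >= theta
            break
        # merge into an existing cluster containing i or j, else open a new one
        for c in clusters:
            if i in c or j in c:
                c.update({i, j})
                break
        else:
            clusters.append({i, j})
        # zero rows/columns i and j so neither slide is paired again
        for k in range(n):
            S[i][k] = S[k][i] = S[j][k] = S[k][j] = 0
    return clusters
```

With θ = 0.65, pairs below the threshold are simply left unclustered, matching the algorithm's termination condition.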
[Prompt for schema extraction]

System Message:
You are a helpful assistant

Prompt:
Please analyze the slide elements and create a structured template schema in JSON format. The schema should:
1. Identify key content elements (both text and images) that make up the slide
2. For each element, specify:
- "description": A clear description of the element's purpose, do not mention any detail
- "type": "text" or "image", determined according to the tag of the element: "image" is assigned for <img> tags
- "data":
  * For text elements: The actual text content as string or array at paragraph level (<p> or <li>), merging inline text segments (<span>)
  * For image elements: Use the `alt` attribute of the <img> tag as the data of the image

[Prompt for slide categorization]

Instructions:
1. Categorize structural slides in the presentation (such as Opening, Ending); assign all other slides to "Content."
2. Category names for structural slides should be simple, reflect their function, and contain no specific entity names.
3. Opening and Ending slides are typically located at the beginning or end of the presentation and may consist of only one slide.
4. Other transition categories must contain multiple slides with partially identical text.
Example output:
```json
{
  "functional": {
    "opening": [1],
    "table of contents": [2, 5],
    "section header": [3, 6],
    "ending": [10]
  },
  "content": [4, 7, 8, 9]
}
```
Ensure that all slides are included in the categorization, with their corresponding slide numbers listed in the output.
Input: {{slides}}
Output:

[Prompt for layout analysis]

System Message:
You are a helpful assistant

Prompt:
Analyze the content layout and media types in the provided slide images. Your objective is to create a concise, descriptive title that captures purely the presentation pattern and structural arrangement of content elements.
Requirements:
- Focus on HOW content is structured and presented, not WHAT the content is
- Describe the visual arrangement and interaction between different content types (text, images, diagrams, etc.)
Avoid:
- Any reference to specific topics or subjects
- Business or industry-specific terms
- Actual content descriptions

[Prompt for outline generation]

System Message:
You are a professional presentation designer tasked with creating structured PowerPoint outlines. Each slide outline should include a slide title, a suitable layout from provided options, and concise explanatory notes. Your objective is to ensure that the outline adheres to the specified slide count and uses only the provided layouts. The final deliverable should be formatted as a JSON object. Please ensure that no layouts other than those provided are utilized in the outline.

Prompt:
Steps:
1. Understand the JSON Content:
Carefully analyze the provided JSON input.
Identify key sections and subsections.
{{ json_content }}
Example Output:
{
  "Opening of the XX": {
    "layout": "layout1(media_type)",
    "subsection_keys": [],
    "description": "..."
  },
  "Introduction to the XX": {
    "layout": "layout2(media_type)", # select from given layouts (functional or content)
    "subsection_keys": ["Title of Subsection 1.1", "Title of Subsection 1.2"],
    "description": "..."
  }
}
Input:
Number of Slides: {{ num_slides }}
Image Information:
{{ image_information }}
[Prompt for generating editing actions]

System Message:
You are a Code Generator agent specializing in slide content manipulation. You precisely translate content edit commands into API calls by following HTML structure, distinguishing between tags, and maintaining proper parent-child relationships to ensure accurate element targeting.

Prompt:
Generate the sequence of API calls based on the provided commands, ensuring compliance with the specified rules and precise execution.
You must determine the parent-child relationships of elements based on indentation and ensure that all <span> and <img> elements are processed, leaving no unhandled content.
Each command follows this format: (element_class, type, quantity_change: int, old_data, new_data).
Steps
1. Quantity Adjustment:
- quantity_change Rules:
  - If quantity_change = 0, do not perform clone_paragraph or del_span operations. Only replace the content.
  - If quantity_change > 0, use clone_paragraph to add the corresponding number of paragraphs:
    - When cloning, prioritize paragraphs from the same element_class that already have special styles (e.g., bold, color) if available.
    - The paragraph_id for newly cloned paragraphs should be the current maximum paragraph_id of the parent element plus 1, while retaining the span_id within the cloned paragraph unchanged.
  - If quantity_change < 0, use del_span or del_image to reduce the corresponding number of elements. Always ensure to remove span elements from the end of the paragraph first.
- Restriction:
  - Each command's API call can only use either clone_paragraph or del_span/del_image according to the `quantity_change`, but not both.
2. Content Replacement:
- Text Content: Use replace_span to sequentially distribute new content into one or more <span> elements within a paragraph. Select appropriate tags for emphasized content (e.g., bold, special color, larger font).
- Image Content: Use replace_image to replace image resources.
3. Output Format:
- Add comments to each API call group, explaining the intent of the original command and the associated element_class.
- For cloning operations, annotate the paragraph_id of the newly created paragraphs.
Available APIs
{{api_docs}}
Example Input:
Please output only the API call sequence, one call per line, wrapped in ```python and ```, with comments for corresponding commands.

[Prompt for generating slide content]

System Message:
You are an Editor agent for presentation content. You transform reference text and available images into structured slide content following schemas. You excel at following schema rules like content length and ensuring all content is strictly derived from provided reference materials. You never generate new content or use images not explicitly provided.

Prompt:
Generate slide content based on the provided schema. Each schema element specifies its purpose and its default quantity.
Requirements:
1. Content Generation Rules:
- Follow default_quantity for elements, adjust when necessary
- All generated content must be based on reference text or image information
- Ensure text content meets character limits
- Generated text should use a concise and impactful presentation style
- For image elements, data should be the image path # e.g.: "images/logo.png"
- Type of images should be a critical factor of image selection; if no relevant image (similar type or purpose) is provided, leave it blank
2. Core Elements:
- Must extract essential content from reference text (e.g., slide_title, main_content) and maintain semantic consistency
- Must include images that support the main content (e.g., diagrams for explanations, visuals directly discussed in text)
Input:
Schema:
{{schema}}
Outline of Presentation:
{{outline}}
Metadata of Presentation:
{{metadata}}
Reference Text:
{{text}}
Available Images:
{{images_info}}
Output: the keys in generated content should be the same as the keys in schema

Figure 13: Illustration of the prompt used for generating slide content.

[Prompt for describing content in PPTEval]

Prompt:
Please describe the input slide based on the following three dimensions:
1. The amount of information conveyed
Whether the slide conveys too lengthy or too little information, resulting in a large white space without colors or images.
2. Content Clarity and Language Quality
Check if there are any grammatical errors or unclear expressions of textual content.
3. Images and Relevance
Assess the use of visual aids such as images or icons, their presence, and how well they relate to the theme and content of the slides.

Provide an objective and concise description without comments, focusing exclusively on the dimensions outlined above.

Figure 15: Illustration of the prompt used to describe content in PPTEval.
You are a help assistant
Prompt:
Please describe the input slide based on the following three dimensions:
1. Visual Consistency
Describe whether any style diminished the readability, like border overflow or blur, low contrast, or visual
noise.
2. Color Scheme
Analyze the use of colors in the slide, identifying the colors used and determining whether the design is
monochromatic (black and white) or colorful (gray counts in).
3. Use of Visual Elements
Describe whether the slide include supporting visual elements, such as icons, backgrounds, images, or
geometric shapes (rectangles, circles, etc.).
Provide an objective and concise description without comments, focusing exclusively on the dimensions
outlined above.
[Prompt for summarizing the presentation]

Prompt:
1. Slide Descriptions: Provide a concise summary of the content and key points covered on each slide.
2. Presentation Metadata: Identify explicit background information (which means it should be a single paragraph, not included in other paragraphs), such as the author, speaker, date, and other directly stated details, from the opening and closing slides.
Example Output:
{
  "slide_1": "This slide introduces the xx, xx.",
  "slide_2": "...",
  "background": {
    "speaker": "speaker x",
    "date": "date x"
  }
}
Input:
{{presentation}}
Output:

[Prompt for evaluating content]

System Message:
You are an unbiased presentation analysis judge responsible for evaluating the quality of slide content. Please carefully review the provided slide image, assessing its content, and provide your judgement in a JSON object containing the reason and score. Each score level requires that all evaluation criteria meet the standards of that level.

Prompt:
Scoring Criteria (Five-Point Scale):

Figure 18: Illustration of the prompt used to evaluate content in PPTEval.

[Prompt for evaluating coherence]

System Message:
You are an unbiased presentation analysis judge responsible for evaluating the coherence of the presentation. Please carefully review the provided summary of the presentation, assessing its logical flow and contextual information; each score level requires that all evaluation criteria meet the standards of that level.

Prompt:
Scoring Criteria (Five-point scale):
Input:
{{presentation}}
Let's think step by step and provide your judgment, focusing exclusively on the dimensions outlined above and strictly following the criteria.
[Prompt for evaluating design]

Prompt:
Scoring Criteria (Five-point scale):
1 Point (Poor):
There is a conflict between slide styles, making the content difficult to read.
2 Points (Fair):
The slide uses monotonous colors (black and white), ensuring readability while lacking visual appeal.
3 Points (Average):
The slide employs a basic color scheme; however, it lacks supplementary visual elements such as icons, backgrounds, images, or geometric shapes (like rectangles), making it look plain.
4 Points (Good):
The slide uses a harmonious color scheme and contains some visual elements (like icons, backgrounds, images, or geometric shapes); however, minor flaws may exist in the overall design.
5 Points (Excellent):
The style of the slide is harmonious and engaging; the use of supplementary visual elements like images and geometric shapes enhances the slide's overall visual appeal.
Example Output:
{
  "reason": "xx",
  "score": int
}
Input: {{descr}}
Let's think step by step and provide your judgment.
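Since every judge prompt requests a JSON object with `reason` and an integer `score`, scoring reduces to parsing that object from the model reply. The sketch below is an assumption about typical LLM output formatting (replies may or may not wrap the JSON in a code fence), not the paper's actual parsing code.

```python
import json
import re

def parse_judgement(reply):
    """Extract (reason, score) from a judge reply containing a JSON object."""
    # tolerate an optional ```json ... ``` fence around the object
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    obj = json.loads(match.group(0))
    score = int(obj["score"])
    if not 1 <= score <= 5:  # judge prompts use a five-point scale
        raise ValueError(f"score {score} outside the five-point scale")
    return obj["reason"], score

reason, score = parse_judgement(
    '```json\n{"reason": "harmonious design", "score": 4}\n```'
)
```

Validating the score range catches the occasional judge reply that ignores the scale, so a malformed rating raises instead of silently skewing the average.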