ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

Jiaxu Tian1*, Xuehui Yu2*, Yaoxing Wang1*, Pan Wang2, Guangqian Guo1, Shan Gao1† (* equal contribution; † corresponding author)
Abstract

Content-aware layout generation aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structural and diversity problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation chain-of-thought (relation-CoT) reasoning to generate more reasonable and aesthetically coherent layouts grounded in design concepts. Specifically, we enhance layout annotations with explicit relation definitions, such as region, saliency, and margin between elements, with the goal of decomposing a layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. By balancing the prototype distribution, this sampler addresses the uniformity of generated layouts that arises from data bias. Extensive experimental results verify that ReLayout outperforms baselines and can generate structural and diverse layouts that are more aligned with human aesthetics and more explainable.

Introduction

Layout is an essential part of graphic design, aiming to convey information through the appropriate arrangement of elements such as logos and texts. Due to its importance, layout has various applications, spanning scenarios like documents (Li et al. 2019a; Zhong, Tang, and Yepes 2019), UIs (Raneburger, Popp, and Vanderdonckt 2012; Deka et al. 2017), magazines (Yang et al. 2016; Tabata et al. 2019) and posters (Guo et al. 2021; Lin et al. 2023). Among these, when the main visual element flows into an application, such as advertising posters, achieving harmony between the arrangement of elements and the canvas becomes one of the key goals. We call layout generation under the above condition content-aware layout generation.

This field is particularly challenging because it requires the integration of design elements, such as logos and text, with visual content to produce layouts that are both usable and aesthetically pleasing. Furthermore, the model must produce diverse layouts rather than repeating a single arrangement. To address these challenges, researchers have proposed various methods (Zheng et al. 2019; Horita et al. 2024; Hsu et al. 2023; Zhou et al. 2022) based on generative models (Goodfellow et al. 2020; Kingma 2013; Ho, Jain, and Abbeel 2020) to enhance the quality of generated layouts. Among these methods, RALF (Horita et al. 2024), a transformer-based (Vaswani 2017) method, has achieved notable advancements. It adopts a retrieval augmentation method to mitigate the data scarcity problem. Nevertheless, it treats layout generation only as a numerical problem and fails to capture semantics, which prevents the model from generating visually and textually coherent layouts.

Recently, two LLM-based methods (Lin et al. 2024; Seol, Kim, and Yoo 2024) have emerged, aiming to leverage the ability of large language models to generate high-quality layouts. For instance, LayoutPrompter (Lin et al. 2024) employs dynamic exemplar selection to generate layouts without requiring training, but it cannot take a canvas image as input and therefore misses a significant amount of information. PosterLlama (Seol, Kim, and Yoo 2024), the current SOTA, trains an MLLM to generate visually and textually coherent layouts. However, these methods remain limited to outputting element-level coordinate information (e.g., "where to place" individual elements) and focus only on layout-level outcomes; they lack the structural-level organization of element relations that bridges element-level positioning with layout-level design concepts. This limitation leads to two critical issues in layout generation: (1) a structural problem, where related elements fail to maintain proper spatial relationships, as illustrated in Figure LABEL:fig:error(a), where PosterLlama produces overlapping elements and incorrect alignments and fails to capture parallel relationships; and (2) a diversity problem, where the generated layouts lack rich structural variation, as shown in Figure LABEL:fig:error(b), where these methods, without explicit modeling of element relationships, degrade to similar structural arrangements.

To address these issues, we propose ReLayout, a content-aware layout generation framework based on an MLLM, drawing inspiration from how designers organize layouts through structural element relations. Our core contribution lies in explicitly modeling design logic through a CoT reasoning mechanism (Wei et al. 2022) that deciphers element relations. As illustrated in the layout relation-CoT construction in Figure 1, it decomposes layouts into recursive, nested hierarchical structures (e.g., tree representations) by defining a relation space encompassing saliency, region, and element. This structured approach enhances the model’s ability to generate semantically coherent layouts by leveraging relations between elements. Additionally, we introduce the layout prototype rebalance sampler, which quantifies the layout prototype in a three-dimensional feature space of saliency, region, and margin between elements based on the layout relation-CoT construction. By integrating feature clustering with weighted sampling, the sampler mitigates the long-tail distribution problem in the dataset, enabling balanced learning of diverse layout prototypes. User studies and visualizations demonstrate that ReLayout outperforms state-of-the-art methods, achieving significant improvements in usability and diversity. In summary, our contributions are as follows:

  • We propose ReLayout, a relation-CoT paradigm designed to address hierarchical layout design challenges, specifically tackling structural and diversity problems via explicit spatial relations and layout prototype balancing.

  • We introduce a layout relation-CoT construction mechanism that decomposes layout element relationships into a hierarchical structure while incorporating element relation annotations into existing layout datasets.

  • We develop a layout prototype rebalance sampler, which quantifies layout prototypes through feature clustering and employs weighted sampling to ensure adaptability across diverse real-world scenarios.

  • We propose two datasets enriched with more layout information based on the layout relation-CoT construction, which we will release publicly to the community.

Figure 1: Pipeline of ReLayout. We adopt the layout relation-CoT construction to add relation annotations on raw datasets. Then we use the layout prototype rebalance sampler to adjust the distribution of the new dataset for training.

Related Work

Automatic Layout Generation

Content-agnostic: Content-agnostic layout generation aims to create layouts independent of specific content. LayoutGAN (Li et al. 2019b) is the first method to introduce GAN for addressing this task; in addition, approaches involving VAE (Jiang et al. 2022; Jyothi et al. 2019) or Diffusion models (Chai, Zhuang, and Yan 2023; Zhang et al. 2023; Inoue et al. 2023) have also been employed to solve content-agnostic layout generation tasks. LayoutNUWA (Tang et al. ) is an LLM-based method that has achieved good performance using HTML format. This also demonstrates that LLMs have advantages over other generative methods in layout generation tasks.

Content-aware: Content-aware layout generation not only focuses on the quality of the generated layout like content-agnostic layout generation but also considers the harmony between the layout and the canvas. ContentGAN (Zheng et al. 2019) is the first to tackle the above problem. Starting from CGL-GAN (Zhou et al. 2022), subsequent works mostly begin leveraging saliency maps. DS-GAN (Hsu et al. 2023) uses a CNN-LSTM model to balance graphic and content-aware metrics. RADM (Li et al. 2023a) is the first diffusion-based method to incorporate textual content into layout tasks. RALF (Horita et al. 2024) leverages a retrieval augmentation method to mitigate the data scarcity problem. Thanks to the power of LLMs, LayoutPrompter (Lin et al. 2024) and PosterLlama (Seol, Kim, and Yoo 2024) demonstrate remarkable capabilities in the field of layout generation. The former achieves a training-free approach by selecting prompt examples with constraint layouts similar to test samples. The latter, PosterLlama, trains an adapter and fine-tunes the model to generate coherent visual and textual layouts. Among these works, LLM-based methods have become the mainstream method, with PosterLlama, the current SOTA, demonstrating outstanding performance.

However, they fail to capture the rich relationships between elements. In contrast, our method explicitly represents these relationships and decomposes the layout into smaller, structured, and recursive layouts. This leads to a layout that is both more visually appealing and more explainable.

Multi-modal Large Language Models

Advancements: LLMs have demonstrated remarkable capabilities in natural language understanding with billions of parameters. Based on this, MLLMs have achieved remarkable progress by integrating cross-modal data including visual, auditory, and other sensory data streams (Li et al. 2023b; Radford et al. 2021), thereby significantly expanding their range of applications, such as GPT-4 (Achiam et al. 2023), Gemini (Team et al. 2023), and Claude 3, as well as open-source models like InternVL (Chen et al. 2024a) and LLaVA-OneVision (Li et al. 2024). These models have been widely applied across diverse fields, including healthcare (Goyal et al. 2024; Yang et al. 2024) and agriculture (Peng et al. 2023; Tzachor et al. 2023).

Techniques: In recent years, several techniques have enhanced LLM capabilities. Few-shot learning (Brown et al. 2020) allows models to adapt to new tasks with minimal examples, reducing the need for large datasets. Chain-of-thought (CoT) prompting (Wei et al. 2022) improves reasoning by guiding models to break down complex problems step by step. LoRA fine-tuning (Hu et al. 2022) efficiently adapts models by adding small trainable matrices to specific layers, reducing memory and computation costs while maintaining strong performance.

In our work, we leverage InternVL as the base model and apply LoRA fine-tuning like PosterLlama to efficiently adapt it to layout generation tasks. Moreover, motivated by CoT, we improve our output format to guide the model in generating more reasonable and explainable layouts. Experimental results demonstrate the effectiveness of ReLayout.

Methods

Overview

Given a set of constraints, our goal is to generate a well-arranged layout. A layout $\mathcal{L}$ can be represented as a set of $N$ elements: $\mathcal{L}=\{\mathbf{e}_{1},\dots,\mathbf{e}_{N}\}=\{(c_{1},\mathbf{b}_{1}),\dots,(c_{N},\mathbf{b}_{N})\}$, where each element $\mathbf{e}_{i}$ consists of its class $c_{i}$ and corresponding bounding box $\mathbf{b}_{i}=[x_{i},y_{i},w_{i},h_{i}]$.
In our work, multi-modal inputs are a canvas image $\mathbf{C}$ and foreground elements $\mathcal{F}=\{(\mathbf{t}_{i},\mathbf{p}_{i})\}^{N}_{i=1}$, where $\mathbf{t}_{i}$ represents text (which can be empty) and $\mathbf{p}_{i}$ represents an element image (which can be empty except under condition constraints).

Furthermore, the constraints consist of two types: (1) content-aware constraints (avoiding occlusion of salient objects) and (2) user-specified constraints (e.g., generating bounding boxes conditioned on element categories).
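
As a rough illustration of the notation above, the layout and its multi-modal inputs could be held in a few small containers. This is only a sketch: the class and field names are hypothetical, and the top-left origin for $(x, y)$ is an assumption, since the coordinate convention is not fixed here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (x, y, w, h); a top-left origin is assumed for illustration only.
BBox = Tuple[float, float, float, float]


@dataclass
class Element:
    """One layout element e_i = (c_i, b_i)."""
    category: str  # class c_i, e.g. "text" or "logo"
    bbox: BBox     # bounding box b_i = [x_i, y_i, w_i, h_i]


@dataclass
class Foreground:
    """One foreground input (t_i, p_i); either part may be empty."""
    text: Optional[str] = None        # t_i
    image_path: Optional[str] = None  # p_i (hypothetical: path to the element image)


# A layout L is simply a collection of N elements.
Layout = List[Element]

example_layout: Layout = [
    Element("logo", (10, 10, 80, 40)),
    Element("text", (10, 60, 200, 30)),
]
```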

The pipeline of ReLayout, shown in Figure 1, consists of two key components: the layout relation-CoT construction and the layout prototype rebalance sampler. The layout relation-CoT construction explicitly models layout relations from three aspects: margin between elements, region, and saliency. These relations are used to train the MLLM and improve the usability of its outputs. Furthermore, these explicit relation models enable us to balance the samples across clusters from the perspective of design styles, so as to achieve better optimization and more diverse results. The inference procedure is illustrated in Figure 2(a). Unlike previous LLM-based layout generation methods, which directly generate layout coordinates, our method first predicts the structured relations (highlighted in orange) and then generates the layout coordinates based on the provided canvas image $\mathbf{C}$ and foreground elements $\mathcal{F}$.

Figure 2: (a) shows the ReLayout training process and how its output differs from that of previous methods. The bottom part shows the two key components of ReLayout. (b) illustrates the construction logic of the relation labels. (c) shows the layout dataset resampling process, which adjusts the dataset distribution to achieve a more balanced layout dataset.

Layout Relation-CoT Construction

To fully leverage the extensive knowledge of LLMs in layout design, we choose HTML to represent layouts. However, unlike previous LLM-based methods that represent layouts using HTML (Lin et al. 2024; Seol, Kim, and Yoo 2024), we introduce two types of relation spaces: region and saliency (see Figure 2(b)). These relational spaces are designed to address the shortcomings of previous methods, which often generate layouts that are poorly structured and lack human aesthetic appeal.

Figure 3: Examples of hierarchical decomposition of complex layouts based on different directions.

Region: Because LLMs are inherently more sensitive to highly structured data, we introduce the region. A region $\mathcal{R}$ serves as the fundamental unit of spatial arrangement, with its internal structure adhering to a single direction pattern. Each region can be understood as a small layout of its own, similar to the structure of a tree, and is therefore both nestable and recursive. This makes the resulting layout annotations highly structured, allowing complex overall arrangements to be generated through simple construction rules.

A region is defined by three key properties: $\mathcal{R}=\left(d,a,\mathbf{b}\right)$, where $d\in\left\{\textit{row},\textit{column}\right\}$ is the flex-direction, representing the arrangement direction of elements within the region, $a$ represents align-items, and $\mathbf{b}$ represents the region’s position and size. As illustrated in Step 2 and Step $2+n$ of Figure 2(b), regions are constructed step by step. We use Algorithm 1 and Figure 2(b) as examples to describe the specific steps of constructing a region. (1) We first project each element of the level-1 region onto the x-axis and the y-axis. (2) Using GroupByOverlap, we analyze the IoD (Intersection over Detection) (Yu et al. 2020) matrix of the projections to group bounding boxes into $G_{x}$ (x-axis groups) and $G_{y}$ (y-axis groups), where IoD is defined as the intersection between the detection box and the ignored region divided by the area of the detection box. (3) Based on the group counts and variances, we determine the layout direction. At this point, we have obtained the direction of the level-1 region in Step 2 of Figure 2(b). Finally, we recursively apply this process to each group to further subdivide the region, constructing a hierarchical structure like Step $2+n$ of Figure 2(b). Figure 3 illustrates how bounding boxes are converted into nested structures enriched with layout information by the heuristic Algorithm 1.

Furthermore, parallel $\mathcal{P}$ (see the second column of Figure LABEL:fig:error(a)) is a specialized type of region, sharing the same fundamental attributes. It is typically employed for the parallel presentation of two or more related elements. These elements maintain uniform visual sizes and align along a designated axis (either row or column) to ensure consistency and symmetry within the layout.

For each element within a region, we introduce an additional attribute, margin, to represent relative position, i.e., the spacing between elements. When the region is arranged in a row, this attribute is defined as margin-left, whereas in a column, it is specified as margin-top. Using this property, we can effectively control the overall layout compactness.
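
The margin attribute can be illustrated with a small sketch that derives margin-left (row) or margin-top (column) from the bounding boxes of a region's children. The function name is hypothetical, boxes are assumed to be (x, y, w, h) with a top-left origin, and giving the first element a margin of 0 is an assumption made for illustration.

```python
def element_margins(boxes, direction):
    """Spacing between consecutive elements of one region.

    boxes: list of (x, y, w, h) tuples (top-left origin assumed).
    direction: "row" -> margin-left, "column" -> margin-top.
    The first element gets margin 0 (an assumption of this sketch).
    """
    if not boxes:
        return []
    # Index of the start coordinate and of the extent along the layout axis.
    axis, size = (0, 2) if direction == "row" else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[axis])
    result = [0]
    for prev, cur in zip(ordered, ordered[1:]):
        # Gap between the far edge of the previous box and the start of this one.
        result.append(cur[axis] - (prev[axis] + prev[size]))
    return result


row_margins = element_margins([(0, 0, 10, 5), (15, 0, 10, 5), (30, 0, 5, 5)], "row")
column_margins = element_margins([(0, 0, 10, 5), (0, 8, 10, 5)], "column")
```

Smaller margins correspond to a more compact region; a negative value would indicate overlapping elements.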

Algorithm 1 Estimate Layout Direction
Input: Bounding boxes $\mathcal{B}$, overlap threshold $\phi$.
Output: (Direction, $G$)
$L_{x} \leftarrow$ project bounding boxes $\mathcal{B}$ to x-axis;
$L_{y} \leftarrow$ project bounding boxes $\mathcal{B}$ to y-axis;
$G_{x} \leftarrow$ GroupByOverlap($L_{x}$, $\phi$);
$G_{y} \leftarrow$ GroupByOverlap($L_{y}$, $\phi$);
if $|G_{x}|=1$ AND $|G_{y}|>1$ then
    return (“column”, $G_{y}$)
else if $|G_{y}|=1$ AND $|G_{x}|>1$ then
    return (“row”, $G_{x}$)
else
    $V_{x} \leftarrow$ ComputeGroupVariance($G_{x}$)
    $V_{y} \leftarrow$ ComputeGroupVariance($G_{y}$)
    if $V_{x} \leq V_{y}$ then
        return (“row”, $G_{x}$)
    else
        return (“column”, $G_{y}$)
    end if
end if
function GroupByOverlap($L$, $\phi$)
    $edges \leftarrow \emptyset$
    for each pair $(i,j)$ in $L$ do
        if IoD($L[i]$, $L[j]$) $\geq \phi$ then
            $edges \leftarrow edges \cup \{(i,j)\}$
        end if
    end for
    $groups \leftarrow$ FindConnectedComponents($edges$)
    return $groups$
end function
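
A minimal Python sketch of Algorithm 1 is given below for readers who prefer executable code. Several details are assumptions rather than part of the algorithm as stated: the 1-D IoD is taken as the intersection length divided by the shorter interval, ComputeGroupVariance is stood in for by the population variance of group sizes, the default threshold of 0.5 is arbitrary, and connected components are found with a simple union-find.

```python
from statistics import pvariance


def iod_1d(a, b):
    """Overlap of two 1-D intervals (start, end).

    Assumption: intersection length divided by the shorter interval length.
    """
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0


def group_by_overlap(intervals, phi):
    """Group interval indices whose pairwise IoD reaches phi (connected components)."""
    parent = list(range(len(intervals)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(intervals)):
        for j in range(i + 1, len(intervals)):
            if iod_1d(intervals[i], intervals[j]) >= phi:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(intervals)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def group_variance(groups):
    """Placeholder for ComputeGroupVariance: variance of group sizes (assumption)."""
    sizes = [len(g) for g in groups]
    return pvariance(sizes) if sizes else 0.0


def estimate_direction(boxes, phi=0.5):
    """boxes: list of (x, y, w, h). Returns (direction, groups of box indices)."""
    lx = [(x, x + w) for x, y, w, h in boxes]  # projections onto the x-axis
    ly = [(y, y + h) for x, y, w, h in boxes]  # projections onto the y-axis
    gx = group_by_overlap(lx, phi)
    gy = group_by_overlap(ly, phi)
    if len(gx) == 1 and len(gy) > 1:
        return "column", gy
    if len(gy) == 1 and len(gx) > 1:
        return "row", gx
    if group_variance(gx) <= group_variance(gy):
        return "row", gx
    return "column", gy


stacked = [(0, 0, 100, 20), (0, 30, 100, 20), (0, 60, 100, 20)]
side_by_side = [(0, 0, 20, 50), (30, 0, 20, 50)]
```

Calling `estimate_direction` again on the boxes of each returned group yields the nested regions described in the text.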

Saliency: Motivated by the observation that designers usually avoid placing elements over salient objects, we introduce salient blocks $\mathcal{S}$ to help the model better grasp the features of these objects intuitively. These blocks are represented as a series of bounding boxes and are seamlessly integrated into the HTML-based representation. To detect these salient blocks, we propose an iterative algorithm that efficiently identifies prominent areas through integral image computation. This algorithm, detailed in the supplementary materials, progressively selects non-overlapping rectangular regions by evaluating their saliency scores based on the density of white and black pixels, ensuring that the captured regions align with natural visual attention patterns. This unified representation allows the model to understand the spatial relationships between elements and the background more effectively. Moreover, Section Ablation Study and Analysis explains that adding salient blocks is crucial for the model to understand the background.
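
The full detection algorithm is given in the supplementary materials and is not reproduced here. The sketch below only illustrates the integral image primitive it relies on: after one pass over a binary saliency mask, the number of salient (white) pixels in any rectangle can be read off in constant time. Function names and the density score are illustrative assumptions.

```python
def integral_image(mask):
    """Summed-area table of a 2-D list of 0/1 values; shape (H+1) x (W+1)."""
    h, w = len(mask), len(mask[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += mask[r][c]
            sat[r + 1][c + 1] = sat[r][c + 1] + row_sum
    return sat


def box_sum(sat, x, y, w, h):
    """Number of salient pixels inside the rectangle with top-left (x, y)."""
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


def salient_density(sat, x, y, w, h):
    """Fraction of white pixels in the rectangle (hypothetical score)."""
    return box_sum(sat, x, y, w, h) / (w * h)


mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
sat = integral_image(mask)
```

A candidate rectangle with a density close to 1 covers mostly salient pixels, which is the kind of quantity an iterative block selection could rank candidates by.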

Sequence formalization: Our input sequence comprises a primary instruction, a task description (e.g., "layout generation with given class"), and an input HTML format. Four mask tokens (<X>, <Y>, <W>, <H>) are introduced to facilitate their prediction.

We combine Saliency and Region components to form a unified HTML format as the output sequence (refer to the ReLayout output shown in Figure 2(a)). This format provides an effective strategy for constructing relational CoT in LLM-based layout methods. Additionally, the CoT-annotated dataset generated on the PKU and CGL datasets will be open to the community for further research.

Layout Prototype Rebalance Sampler

Building upon the layout relation-CoT annotations, we propose the layout prototype rebalance sampler to address the limited diversity of previous methods. Through this process, our method ensures a more even distribution across diverse layout prototypes, providing the model with greater opportunities to learn and generalize over a broader range of layouts. As shown in Figure 2(c), our layout prototype rebalance sampler consists of three key operations: feature extraction, feature clustering, and rebalance sampling. Below, we provide a detailed explanation of each operation.

Feature extraction: The $i^{\text{th}}$ layout prototype is primarily characterized by three dimensions: $\left\{\mathcal{S}_{i},\mathcal{R}_{i},\mathcal{E}_{i}\right\}$.

The set of saliency bounding boxes in the $i^{\text{th}}$ layout is denoted as $\mathcal{S}_{i}$, given by: $\mathcal{S}_{i}=\left\{\mathbf{b}_{i,j}^{\text{s}}\right\}_{j=1}^{r_{i}}$. The number of saliency bounding boxes in layout $L_{i}$ is given by $r_{i}\in\left\{1,2,3,4\right\}$. The saliency feature vector for layout $L_{i}$ captures the weighted center of all saliency boxes. Specifically, the centroid coordinates are computed as the weighted average of geometric centers of the saliency boxes, where the weights are proportional to the area of each saliency box.
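
The area-weighted centroid described above can be sketched as follows. The function name is hypothetical, and boxes are assumed to be (x, y, w, h) with a top-left origin.

```python
def saliency_centroid(boxes):
    """Area-weighted average of the geometric centers of the saliency boxes.

    boxes: list of (x, y, w, h). Returns (cx, cy).
    """
    total_area = sum(w * h for _, _, w, h in boxes)
    if total_area == 0:
        return (0.0, 0.0)  # degenerate case; fallback chosen for this sketch
    cx = sum((x + w / 2) * w * h for x, y, w, h in boxes) / total_area
    cy = sum((y + h / 2) * w * h for x, y, w, h in boxes) / total_area
    return (cx, cy)


# Two boxes of equal area: the centroid lies midway between their centers.
equal = saliency_centroid([(0, 0, 2, 2), (4, 0, 2, 2)])
# A larger box pulls the centroid towards its own center.
unequal = saliency_centroid([(0, 0, 2, 2), (2, 0, 4, 4)])
```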

We define the set of regions in a layout as $\mathcal{R}_{i}=\{\mathbf{b}_{i,j}^{\text{r}},d_{i,j}\}_{j=1}^{s_{i}}$, where $d_{i,j}\in\{\text{row},\text{column}\}$ represents the region’s alignment direction. Then, we extract statistical features from $\mathcal{R}_{i}$ to describe their spatial distribution. It includes the total number of regions $s_{i}$, the standard deviations of their centroid coordinates $\sigma_{i}^{\text{x}}$ and $\sigma_{i}^{\text{y}}$, and the counts of row-aligned and column-aligned regions, $n_{i}^{\text{row}}$ and $n_{i}^{\text{column}}$, to roughly quantify the overall layout structure.
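
A possible way to assemble these region statistics into a feature vector is sketched below. The ordering of the entries and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import pstdev


def region_features(regions):
    """regions: list of ((x, y, w, h), direction), direction in {"row", "column"}.

    Returns [s_i, sigma_x, sigma_y, n_row, n_column].
    """
    if not regions:
        return [0, 0.0, 0.0, 0, 0]
    cxs = [x + w / 2 for (x, y, w, h), _ in regions]  # region centroids, x
    cys = [y + h / 2 for (x, y, w, h), _ in regions]  # region centroids, y
    n_row = sum(1 for _, d in regions if d == "row")
    n_column = sum(1 for _, d in regions if d == "column")
    return [len(regions), pstdev(cxs), pstdev(cys), n_row, n_column]


features = region_features([
    ((0, 0, 10, 10), "row"),
    ((0, 20, 10, 10), "column"),
])
```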

We define the element set of the $i^{\text{th}}$ layout as $\mathcal{E}_{i}=\{c_{i,j}\}_{j=1}^{t_{i}}$, where $t_{i}$ is the total number of elements, and $c_{i,j}$ represents the category of the $j^{\text{th}}$ element in the $i^{\text{th}}$ layout. We believe that the layout is highly related to the types and numbers of elements.
Therefore, we define element-level features as follows: $\mathbf{f}_{i}^{\text{e}}=\left(\sum_{j=1}^{t_{i}}\mathbb{I}(c_{i,j}=c_{k})\right)_{k=1}^{K}$, where $\mathbf{f}_{i}^{\text{e}}$ encodes the frequency of each element category $c_{k}$ within the layout. Here, $K$ denotes the predefined number of element categories (e.g., text, logo) in the dataset.
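
In code, the element-level feature is a per-category count vector. The sketch below assumes a three-category vocabulary in a fixed, arbitrary order; both the order and the lowercase names are illustrative.

```python
# Assumed category vocabulary c_1..c_K (order chosen for illustration).
CATEGORIES = ("logo", "banner", "text")


def element_features(layout_categories, vocabulary=CATEGORIES):
    """Count how many elements of each category appear in one layout."""
    return [sum(1 for c in layout_categories if c == k) for k in vocabulary]


f_e = element_features(["text", "text", "logo"])
```
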
Feature cluster: The final feature representation is constructed by weighted concatenation of the three feature dimensions:

$$\mathbf{f}_{i}=\alpha\mathbf{f}_{i}^{\text{s}}\oplus\beta\mathbf{f}_{i}^{\text{r}}\oplus\gamma\mathbf{f}_{i}^{\text{e}} \tag{1}$$

Using these aggregated feature vectors, we apply K-means clustering to group layouts with similar characteristics. We set the number of clusters $K=8$ to maintain reasonable group sizes for subsequent analysis.
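
The two steps above, weighted concatenation (Eq. 1) followed by K-means, can be sketched in plain Python. This is a toy illustration under stated assumptions: the weights default to 1, the clustering is a bare Lloyd's iteration with a deterministic initialisation from the first k points, and the example uses k=2 on four points rather than the eight clusters used in the paper.

```python
def concat_features(f_s, f_r, f_e, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted concatenation of saliency, region, and element features."""
    return ([alpha * v for v in f_s]
            + [beta * v for v in f_r]
            + [gamma * v for v in f_e])


def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm; returns one cluster label per point.

    Initial centroids are the first k points (an assumption of this sketch).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


f_i = concat_features([1, 2], [3], [4], alpha=2, beta=1, gamma=0.5)
toy_labels = kmeans([[0, 0], [10, 10], [0, 1], [10, 11]], k=2)
```

In practice a library implementation of K-means with several random restarts would be the more robust choice.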

Rebalance sampling: After obtaining $K$ clusters, we introduce a weighted sampling strategy to balance each cluster’s influence and prevent large clusters from dominating the training. Specifically, we assign a sampling weight to each cluster based on its size:

$$\mathbf{w}=\frac{\mathbf{cnt}^{1/\theta}}{\|\mathbf{cnt}^{1/\theta}\|_{1}}, \tag{2}$$

where $\|\mathbf{cnt}^{1/\theta}\|_{1}=\sum_{k=1}^{K}\text{cnt}_{k}^{1/\theta}$, $\text{cnt}_{k}$ represents the number of layouts in cluster $k$, and $\theta$ is a hyperparameter that controls the distribution of weights. Larger $\theta$ makes the weights more uniform, ensuring small clusters are sampled more. However, overly large $\theta$ may over-sample rare clusters, distorting the data distribution. Smaller $\theta$ gives higher weights to large clusters, preserving the original distribution. But this may under-sample small clusters, limiting the model’s ability to learn from rare cases.
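
A short sketch of Eq. (2) makes the role of $\theta$ concrete. The weight computation follows the formula directly; the sampling routine is one possible reading, in which a cluster is drawn with probability $w_k$ and a layout is then drawn uniformly within it. That two-stage scheme, the function names, and the fixed seed are assumptions.

```python
import random


def cluster_weights(counts, theta):
    """Eq. (2): w_k = cnt_k^(1/theta) / sum_k cnt_k^(1/theta)."""
    powered = [c ** (1.0 / theta) for c in counts]
    total = sum(powered)
    return [p / total for p in powered]


def sample_indices(labels, theta, n, seed=0):
    """Draw n layout indices: pick a cluster by w_k, then a member uniformly."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    keys = sorted(clusters)
    weights = cluster_weights([len(clusters[k]) for k in keys], theta)
    picked = rng.choices(keys, weights=weights, k=n)
    return [rng.choice(clusters[c]) for c in picked]


# With theta = 1 the original 90/10 split is preserved;
# with theta = 2 the small cluster's share rises from 0.10 to 0.25.
w_theta1 = cluster_weights([900, 100], theta=1.0)
w_theta2 = cluster_weights([900, 100], theta=2.0)
```

As $\theta$ grows, $1/\theta$ approaches 0 and every cluster receives nearly the same weight, which matches the trade-off described above.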

Experiments

Datasets

We use two publicly available e-commerce datasets, CGL (Zhou et al. 2022) and PKU (Hsu et al. 2023). The PKU dataset includes three element categories: Logo, Banner, and Text, while the CGL dataset has an additional category, Embellishment. CGL contains 60,548 annotated poster-layout pairs and 1,000 unannotated canvases. PKU consists of 9,974 annotated poster-layout pairs and 905 unannotated canvases. Notably, when designing text (especially text that needs an underlay), designers often treat the text and its underlay as a single unified element. To better reflect the practical value of the work, we define "Banner" as an element for which the Intersection over Union (IoU) or IoD (Yu et al. 2020) between the text and its underlay is greater than 0.95. We evaluate all baselines under this category setting. Finally, because the PKU and CGL datasets do not provide annotated validation and test splits, we divide each annotated dataset into train/validation/test sets with an approximate ratio of 8:1:1. Additionally, we create an extra hard split for each dataset. This hard split is selected from the test and validation sets based on the following conditions: (1) one region is nested within another, (2) a parallel relationship is present, and (3) the number of elements exceeds four.
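The Banner rule above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and IoD is assumed here to mean the intersection divided by the area of the text box; the paper cites Yu et al. (2020) for IoD without restating the denominator, so that choice and the function names are hypothetical.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    """Area of the overlap between two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def iou(a, b):
    """Intersection over Union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def iod(text, underlay):
    """Assumed IoD: intersection divided by the text box area."""
    a = area(text)
    return intersection(text, underlay) / a if a > 0 else 0.0

def is_banner(text, underlay, threshold=0.95):
    """Treat text + underlay as one Banner if IoU or IoD exceeds the threshold."""
    return iou(text, underlay) > threshold or iod(text, underlay) > threshold

# Text fully inside its underlay: IoU is low (0.4) but IoD is 1.0.
nested = is_banner((10, 10, 90, 30), (0, 0, 100, 40))
# Text only half covered by the underlay: neither measure passes.
partial = is_banner((10, 10, 90, 30), (50, 0, 150, 40))
```

Using either measure rather than IoU alone lets a small text box on a much larger underlay still count as a single Banner element.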

Table 1: Performance comparison on the C → S + P layout generation task on the PKU and CGL datasets. The best result is highlighted in bold, the second-best result is underlined, and the row corresponding to our method is marked in red. ΔVal, Ove, and FD are graphic metrics; Rea and Occ are content metrics.
Method | Test ΔVal↓ | Test Ove↓ | Test FD↓ | Test Rea↓ | Test Occ↓ | Hard ΔVal↓ | Hard Ove↓ | Hard FD↓ | Hard Rea↓ | Hard Occ↓
PKU Annotated Dataset
Real Data | 0.0000 (±0.0000) | 0.0035 (±0.0000) | - | 0.1545 (±0.0000) | 0.0639 (±0.0000) | 0.0000 (±0.0000) | 0.0047 (±0.0000) | - | 0.1673 (±0.0000) | 0.0387 (±0.0000)
LayoutPrompter | 0.0015 (±0.0000) | 0.0090 (±0.0000) | 8.0392 (±0.0000) | 0.1683 (±0.0000) | 0.1452 (±0.0000) | 0.0632 (±0.0000) | 0.0170 (±0.0000) | 16.7438 (±0.0000) | 0.1883 (±0.0000) | 0.1530 (±0.0000)
RALF | 0.0000 (±0.0000) | 0.0915 (±0.0023) | 15.5497 (±0.1499) | 0.1617 (±0.0005) | 0.0866 (±0.0024) | 0.0000 (±0.0000) | 0.1740 (±0.0031) | 26.7978 (±0.2282) | 0.1728 (±0.0003) | 0.0639 (±0.0010)
PosterLlama | 0.0002 (±0.0003) | 0.0211 (±0.0018) | 3.5318 (±0.2160) | 0.1612 (±0.0002) | 0.0863 (±0.0019) | 0.0007 (±0.0002) | 0.0318 (±0.0021) | 5.9256 (±0.1448) | 0.1727 (±0.0003) | 0.0659 (±0.0006)
InternVL2.5-8B | 0.0054 (±0.0009) | 0.0175 (±0.0004) | 2.6175 (±0.0905) | 0.1588 (±0.0003) | 0.0885 (±0.0016) | 0.0050 (±0.0014) | 0.0323 (±0.0014) | 4.3106 (±0.2717) | 0.1717 (±0.0002) | 0.0661 (±0.0025)
ReLayout (Ours) | 0.0001 (±0.0002) | 0.0086 (±0.0011) | 1.7865 (±0.1195) | 0.1600 (±0.0004) | 0.0857 (±0.0010) | 0.0004 (±0.0005) | 0.0109 (±0.0001) | 3.4615 (±0.1304) | 0.1727 (±0.0005) | 0.0637 (±0.0002)
CGL Annotated Dataset
Real Data | 0.0000 (±0.0000) | 0.0060 (±0.0000) | - | 0.1654 (±0.0000) | 0.0771 (±0.0000) | 0.0000 (±0.0000) | 0.0100 (±0.0000) | - | 0.1758 (±0.0000) | 0.0540 (±0.0000)
LayoutPrompter | 0.0125 (±0.0000) | 0.0094 (±0.0000) | 6.7951 (±0.0000) | 0.1787 (±0.0000) | 0.1510 (±0.0000) | 0.0184 (±0.0000) | 0.0124 (±0.0000) | 9.3699 (±0.0000) | 0.1932 (±0.0000) | 0.1313 (±0.0000)
RALF | 0.0147 (±0.0001) | 0.0283 (±0.0007) | 0.9277 (±0.0312) | 0.1649 (±0.0002) | 0.0744 (±0.0001) | 0.0213 (±0.0001) | 0.0478 (±0.0010) | 1.7152 (±0.0557) | 0.1760 (±0.0002) | 0.0518 (±0.0001)
PosterLlama | 0.0012 (±0.0003) | 0.0102 (±0.0010) | 4.4151 (±0.0129) | 0.1674 (±0.0001) | 0.0931 (±0.0004) | 0.0017 (±0.0004) | 0.0183 (±0.0013) | 7.1272 (±0.0236) | 0.1799 (±0.0003) | 0.0747 (±0.0005)
InternVL2.5-8B | 0.0062 (±0.0005) | 0.0114 (±0.0007) | 2.8395 (±0.1031) | 0.1649 (±0.0002) | 0.0796 (±0.0003) | 0.0098 (±0.0010) | 0.0195 (±0.0009) | 4.3051 (±0.0629) | 0.1765 (±0.0002) | 0.0588 (±0.0007)
ReLayout (Ours) | 0.0004 (±0.0002) | 0.0088 (±0.0003) | 1.9311 (±0.0120) | 0.1648 (±0.0001) | 0.0787 (±0.0001) | 0.0023 (±0.0001) | 0.0117 (±0.0006) | 3.1917 (±0.0215) | 0.1760 (±0.0001) | 0.0580 (±0.0004)

Baselines

We compare our method against the following three SOTA methods. (1) LayoutPrompter (Lin et al. 2024) employs a dynamic exemplar selection module to eliminate the need for LLM training. In our work, we use the GPT-3.5 turbo instruct model because the GPT-3 text-davinci-003 model used in the original paper is unavailable. (2) RALF (Horita et al. 2024) uses retrieval augmentation to address the data scarcity issue. Unlike the original work, which filtered out posters with more than 10 elements for PKU, we extend the maximum number of elements to 20, enabling more complex layouts and ensuring a fairer comparison. (3) PosterLlama (Seol, Kim, and Yoo 2024) builds upon the architecture of a multi-modal large language model and trains an adapter to improve the accuracy of content-aware text layout generation.

Implementation Details

Our model is fine-tuned on InternVL2.5-8B (Chen et al. 2024a), which utilizes the InternViT-300M (Chen et al. 2024b) vision encoder and the InternLM2.5-7B (Cai et al. 2024) language model. Each experiment is conducted on eight NVIDIA A800 GPUs. We follow the settings specified in InternVL for training and inference by default.

Figure 4: Qualitative comparison on the PKU and CGL datasets. Baseline layouts show noticeable errors, while ours meet basic requirements and align better with human aesthetics in margin and arrangement.

Evaluation Metrics

Following the evaluation metrics from previous works (Zhou et al. 2022; Kikuchi et al. 2021; Hsu et al. 2023), we apply five metrics. Additionally, we refine the overlap metric to ensure a more reasonable evaluation.
Graphic metrics: These metrics evaluate the graphic quality of the layout without considering the canvas. Validity (Val) is the ratio of elements whose area is greater than 0.1% of the canvas. All other metrics are calculated using only these valid elements. Due to the presence of small elements such as embellishments in CGL, we use ΔVal as the evaluation metric. In previous works (Horita et al. 2024; Seol, Kim, and Yoo 2024; Hsu et al. 2023), Overlap (Ove) is the average IoU across all element pairs. However, this has a notable limitation: when a layout contains one pair of completely overlapping elements along with many pairs that do not overlap at all, the average is diluted and fails to reflect the actual layout quality. In real-world use, however, a canvas with even one pair of extensively overlapping elements is considered a failure. Therefore, we use the maximum IoU to evaluate the generated layouts. We calculate Fréchet Distance (FD) in the feature space derived from bounding boxes and categories to evaluate overall layout quality.
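The difference between the average-IoU and maximum-IoU variants of Ove can be illustrated with a small sketch. It covers a single layout only, with boxes assumed to be `(x1, y1, x2, y2)` tuples; the function names and the toy layout are hypothetical, and how per-layout values are aggregated over a dataset is not shown.

```python
from itertools import combinations

def area(box):
    """Area of an (x1, y1, x2, y2) box."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over Union of two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def validity(boxes, canvas_w, canvas_h, min_ratio=0.001):
    """Fraction of elements larger than 0.1% of the canvas area."""
    if not boxes:
        return 0.0
    canvas = canvas_w * canvas_h
    return sum(area(b) > min_ratio * canvas for b in boxes) / len(boxes)

def mean_overlap(boxes):
    """Average IoU over all element pairs (the earlier definition of Ove)."""
    ious = [iou(a, b) for a, b in combinations(boxes, 2)]
    return sum(ious) / len(ious) if ious else 0.0

def max_overlap(boxes):
    """Maximum IoU over all element pairs (the refined definition)."""
    return max((iou(a, b) for a, b in combinations(boxes, 2)), default=0.0)

# Two identical boxes plus three well-separated ones: 10 pairs, one with IoU 1.
layout = [(0, 0, 10, 10), (0, 0, 10, 10),
          (20, 0, 30, 10), (40, 0, 50, 10), (60, 0, 70, 10)]
avg = mean_overlap(layout)   # diluted by the nine non-overlapping pairs
worst = max_overlap(layout)  # exposes the fully overlapping pair
```

In this toy layout the average is 0.1 while the maximum is 1.0, so only the maximum flags the layout as unusable.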
Content metrics: These metrics assess harmony between the generated layout and the canvas. Occlusion (Occ) calculates the pixel coverage ratio of layout elements over saliency maps. Readability score (Rea) evaluates text clarity using average pixel gradients, where lower scores indicate clearer text.
User study: In the layout generation field, the current metrics are insufficient to fully evaluate the quality of a layout. Therefore, we conduct user studies.

In terms of structure, we randomly select 300 images from the PKU dataset and invite 6 professional designers. For each image, we generate layouts using five different methods and present all layouts simultaneously in a shuffled order, with model names hidden from users. Users assess each row of results based on two criteria: (1) identify all layouts that meet basic usability standards (e.g., no overlap, no occlusion), denoted as $P_{\text{use}}$; and (2) select the single best layout according to professional design principles, considering appropriate margin, relative size, distance from products and overall visual harmony, denoted as $P_{\text{best}}$.

Table 2: User study on structural evaluation.
Metric | LayoutPrompter | RALF | PosterLlama | InternVL | ReLayout
$P_{\text{use}}$ | 36.0% | 50.3% | 71.3% | 78.3% | 91.0%
$P_{\text{best}}$ | 1.3% | 9.7% | 12.0% | 10.7% | 66.3%

Table 3: User study on diversity evaluation.
Metric | RALF | PosterLlama | InternVL | ReLayout
Score | 41 | 47 | 36 | 56
$\mathbf{cnt}$ | (18, 23, 9) | (14, 25, 11) | (21, 22, 7) | (11, 22, 17)

In terms of diversity, we randomly select 50 images from the PKU dataset and invite 6 professional designers. LayoutPrompter is excluded due to poor usability, leaving four methods for evaluation, each run with three different random seeds (0, 1, 2). For each image, results from all methods are displayed simultaneously in a shuffled order, with model names hidden from users to ensure unbiased assessment. Users are instructed to evaluate diversity based on differences in relative position (e.g., alignment) and text size; any variation in either aspect is considered a distinct style. Each row presents four methods (three images per method), and diversity is scored as 0 (one style), 1 (two styles), or 2 (three or more styles).
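The scoring rule can be written down and checked against Table 3. The sketch below assumes that each triple in the cnt row lists how many of the 50 images received a score of 0, 1, and 2 respectively; this reading is not stated explicitly, but it is consistent with the reported totals. The function names are hypothetical.

```python
def score_from_styles(n_styles):
    """0 for one style, 1 for two styles, 2 for three or more styles."""
    return min(n_styles, 3) - 1

def diversity_score(cnt):
    """Total score given cnt = (#images scored 0, #scored 1, #scored 2)."""
    return sum(score * n for score, n in enumerate(cnt))

# Values copied from Table 3: method -> (cnt, reported Score).
table3 = {
    "RALF": ((18, 23, 9), 41),
    "PosterLlama": ((14, 25, 11), 47),
    "InternVL": ((21, 22, 7), 36),
    "ReLayout": ((11, 22, 17), 56),
}

recomputed = {name: diversity_score(cnt) for name, (cnt, _) in table3.items()}
```

Under this reading, every triple sums to the 50 evaluated images and the recomputed totals match the Score row.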

Main Results

Since designers usually design elements first before arranging the overall layout, our experiments primarily focus on generating the positions and sizes of elements based on given categories and the auxiliary information that each model can support as input.
Quantitative comparison: Table 1 compares the methods on the test and hard splits of the PKU and CGL datasets. Our method ranks best or second-best on every metric. Specifically, on the PKU dataset, our method achieves the best performance on most metrics, with a particularly notable improvement in the Ove metric. On the CGL dataset, while our method does not achieve the best performance on every metric, it consistently outperforms the others on the Ove metric. Furthermore, when moving from the test split to the hard split, the degradation in our method's metrics is significantly smaller than that of other methods, highlighting the robustness of our approach. We attribute these improvements to the margin annotations in our relation-CoT and to the resampling strategy, which effectively balances the dataset. These results demonstrate that our method is better at generating structured layouts. Although RALF performs well on the Occ and FD metrics on the CGL dataset, its higher Ove score reduces its practical usability, which is also reflected in the visualizations in Figure 4. In addition, Table 2 shows that ReLayout aligns significantly better with human aesthetic preferences than the baselines. Furthermore, Table 3 shows that our method also achieves the highest diversity, exhibiting a greater number and variety of distinct layout styles under different seed settings.
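As a worked check of the robustness claim, the sketch below computes the absolute increase from the test split to the hard split for Ove and FD on PKU, using the mean values copied from Table 1. Measuring degradation as a simple absolute difference is an assumption made here for illustration; the paper does not state how degradation is quantified.

```python
# (test split, hard split) mean values copied from Table 1, PKU dataset.
pku = {
    "LayoutPrompter": {"Ove": (0.0090, 0.0170), "FD": (8.0392, 16.7438)},
    "RALF":           {"Ove": (0.0915, 0.1740), "FD": (15.5497, 26.7978)},
    "PosterLlama":    {"Ove": (0.0211, 0.0318), "FD": (3.5318, 5.9256)},
    "InternVL2.5-8B": {"Ove": (0.0175, 0.0323), "FD": (2.6175, 4.3106)},
    "ReLayout":       {"Ove": (0.0086, 0.0109), "FD": (1.7865, 3.4615)},
}

def degradation(metric):
    """Hard-split value minus test-split value for each method."""
    return {method: round(values[metric][1] - values[metric][0], 4)
            for method, values in pku.items()}

ove_drop = degradation("Ove")
fd_drop = degradation("FD")
smallest_ove = min(ove_drop, key=ove_drop.get)
smallest_fd = min(fd_drop, key=fd_drop.get)
```

On these two PKU metrics ReLayout shows the smallest absolute increase, although for FD the margin over InternVL2.5-8B is narrow.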

Table 4: Cross-dataset evaluation on PKU and CGL datasets.
Train | Test | Method | ΔVal↓ | Ove↓ | FD↓ | Rea↓ | Occ↓
PKU | CGL-hard | PosterLlama | 0.0225 (±0.0001) | 0.0311 (±0.0004) | 6.5679 (±0.1091) | 0.1758 (±0.0004) | 0.0688 (±0.0010)
PKU | CGL-hard | ReLayout (Ours) | 0.0167 (±0.0001) | 0.0100 (±0.0004) | 4.4413 (±0.0136) | 0.1715 (±0.0001) | 0.0631 (±0.0002)
CGL | PKU-hard | PosterLlama | 0.0019 (±0.0007) | 0.0205 (±0.0035) | 7.1093 (±0.2796) | 0.1726 (±0.0010) | 0.0694 (±0.0016)
CGL | PKU-hard | ReLayout (Ours) | 0.0010 (±0.0005) | 0.0120 (±0.0023) | 5.9011 (±0.1120) | 0.1730 (±0.0012) | 0.0660 (±0.0009)
Table 5: Ablation study on the hard split of PKU dataset.
Variant | Region | Saliency | Resample | ΔVal↓ | Ove↓ | FD↓ | Rea↓ | Occ↓
V0 | - | - | - | 0.0025 (±0.0014) | 0.0153 (±0.0009) | 8.7960 (±0.0224) | 0.1746 (±0.0006) | 0.0821 (±0.0006)
V1 | ✓ | - | - | 0.0021 (±0.0002) | 0.0379 (±0.0013) | 12.2290 (±0.2255) | 0.1967 (±0.0005) | 0.1188 (±0.0011)
V2 | ✓ | ✓ | - | 0.0014 (±0.0007) | 0.0150 (±0.0019) | 7.3406 (±0.0719) | 0.1769 (±0.0002) | 0.0754 (±0.0014)
V3 | ✓ | ✓ | ✓ | 0.0002 (±0.0001) | 0.0097 (±0.0004) | 4.9403 (±0.1903) | 0.1755 (±0.0006) | 0.0752 (±0.0007)

Qualitative comparison: Figure 4 visualizes the generated layouts, providing a comparison across different methods. It can be observed that, apart from the obvious errors marked in Figure 4, other methods also fall short in controlling element dimensions, maintaining spacing between elements, selecting layout arrangements, and achieving overall harmony. In contrast, ReLayout aligns more closely with human aesthetic preferences. Additionally, our method excels at generating diverse layouts and handling layout generation under various conditions, as demonstrated in the supplementary materials.

Out-of-domain generalization: To verify the generalization of our method, we conduct experiments using PKU as the training set and testing on CGL, and vice versa. As shown in Table 4, our method outperforms the current SOTA PosterLlama on most metrics. This indicates that ReLayout adapts well to real-world scenarios and generalizes strongly across datasets.

Ablation Study and Analysis

Effect of Each Module: We conduct a series of ablation experiments on the PKU test and hard split to evaluate the contributions of different modules in ReLayout. To simplify the setup, the ablation study uses a training set with only a single main condition from PKU: generating position and size given the category, text, and aspect ratio. In Table 5, V0 is direct fine-tuning, V1 adds only region annotations, V2 builds on V1 by incorporating saliency annotations, and V3 further introduces the layout prototype rebalance sampler. Several observations can be made. First, V1 shows relatively poor overall metrics, likely because the model focuses on the structure of elements while neglecting salient objects. Since there are no structure-related metrics, this effect cannot be quantified and must be analyzed through the visualizations in the supplementary materials. Second, V2 improves FD over direct fine-tuning (V0), especially on the hard split. Compared to the region-only setting (V1), all metrics improve, particularly Occ, which demonstrates the importance of saliency in content-aware tasks. Third, V3 corresponds to the full ReLayout and achieves the best results on the hard split for the ΔVal, Ove, FD, and Occ metrics. Notably, the improvements in Ove and FD are particularly significant.
Hyperparameter Analysis: We analyze the hyperparameter $\theta$ on the hard split of the PKU dataset; the results are shown in Table 6. Most metrics achieve their optimal values when $\theta=6$. This suggests that when $\theta$ is smaller, the original distribution is largely preserved, so rare layout prototypes are treated as noise and receive insufficient sampling. On the other hand, when $\theta$ is larger, the distribution shifts toward a more balanced state across groups, so certain rare layout prototypes are learned repeatedly, which degrades performance.

Table 6: Hyperparameter analysis.
$\theta$ | ΔVal↓ | Ove↓ | FD↓ | Rea↓ | Occ↓
3 | 0.0009 (±0.0004) | 0.0121 (±0.0001) | 4.7920 (±0.1461) | 0.1742 (±0.0004) | 0.0633 (±0.0006)
6 | 0.0004 (±0.0005) | 0.0109 (±0.0001) | 3.4615 (±0.1304) | 0.1727 (±0.0005) | 0.0637 (±0.0002)
10 | 0.0082 (±0.0003) | 0.0168 (±0.0002) | 5.0655 (±0.0347) | 0.1791 (±0.0005) | 0.0703 (±0.0006)
100 | 0.0075 (±0.0003) | 0.0146 (±0.0010) | 4.9287 (±0.2307) | 0.1783 (±0.0001) | 0.0808 (±0.0011)

Conclusion

In this work, we study content-aware layout generation tasks and address the issue in LLM-based methods where the relationships between elements have not been considered. We propose a novel method ReLayout, which consists of two modules. First, we enhance the model’s understanding of relationships by incorporating explicit relationship annotations, framed from the perspective of CoT. Second, we utilize relation annotations to cluster the dataset and adjust its distribution, thereby enhancing the quality of generated layouts. Moreover, extensive experiments validate the effectiveness of our method, particularly in visualization.

We also identify two limitations of ReLayout. First, current metrics can identify obviously inadequate layouts, but they cannot evaluate the suitability of layouts in real-world application scenarios. Moreover, small changes in these metrics do not necessarily indicate a decline in layout quality. Second, we have not applied relation annotations to other open-source MLLMs to verify whether this method is equally effective.

Future work can leverage our relation annotations for a more refined understanding in layout generation tasks. Additionally, we aim to maximize the effectiveness of our relation annotations through reinforcement learning.

References

  • Achiam et al. (2023) Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  • Cai et al. (2024) Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. 2024. Internlm2 technical report. arXiv preprint arXiv:2403.17297.
  • Chai, Zhuang, and Yan (2023) Chai, S.; Zhuang, L.; and Yan, F. 2023. Layoutdm: Transformer-based diffusion model for layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18349–18358.
  • Chen et al. (2024a) Chen, Z.; Wang, W.; Cao, Y.; Liu, Y.; Gao, Z.; Cui, E.; Zhu, J.; Ye, S.; Tian, H.; Liu, Z.; et al. 2024a. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271.
  • Chen et al. (2024b) Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. 2024b. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24185–24198.
  • Deka et al. (2017) Deka, B.; Huang, Z.; Franzen, C.; Hibschman, J.; Afergan, D.; Li, Y.; Nichols, J.; and Kumar, R. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, 845–854.
  • Goodfellow et al. (2020) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
  • Goyal et al. (2024) Goyal, S.; Rastogi, E.; Rajagopal, S. P.; Yuan, D.; Zhao, F.; Chintagunta, J.; Naik, G.; and Ward, J. 2024. Healai: A healthcare llm for effective medical documentation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 1167–1168.
  • Guo et al. (2021) Guo, S.; Jin, Z.; Sun, F.; Li, J.; Li, Z.; Shi, Y.; and Cao, N. 2021. Vinci: an intelligent graphic design system for generating advertising posters. In Proceedings of the 2021 CHI conference on human factors in computing systems, 1–17.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
  • Horita et al. (2024) Horita, D.; Inoue, N.; Kikuchi, K.; Yamaguchi, K.; and Aizawa, K. 2024. Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 67–76.
  • Hsu et al. (2023) Hsu, H. Y.; He, X.; Peng, Y.; Kong, H.; and Zhang, Q. 2023. Posterlayout: A new benchmark and approach for content-aware visual-textual presentation layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6018–6026.
  • Hu et al. (2022) Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2): 3.
  • Inoue et al. (2023) Inoue, N.; Kikuchi, K.; Simo-Serra, E.; Otani, M.; and Yamaguchi, K. 2023. Layoutdm: Discrete diffusion model for controllable layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10167–10176.
  • Jiang et al. (2022) Jiang, Z.; Sun, S.; Zhu, J.; Lou, J.-G.; and Zhang, D. 2022. Coarse-to-fine generative modeling for graphic layouts. In Proceedings of the AAAI conference on artificial intelligence, volume 36, 1096–1103.
  • Jyothi et al. (2019) Jyothi, A. A.; Durand, T.; He, J.; Sigal, L.; and Mori, G. 2019. Layoutvae: Stochastic scene layout generation from a label set. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9895–9904.
  • Kikuchi et al. (2021) Kikuchi, K.; Simo-Serra, E.; Otani, M.; and Yamaguchi, K. 2021. Constrained graphic layout generation via latent optimization. In Proceedings of the 29th ACM International Conference on Multimedia, 88–96.
  • Kingma (2013) Kingma, D. P. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Li et al. (2024) Li, B.; Zhang, Y.; Guo, D.; Zhang, R.; Li, F.; Zhang, H.; Zhang, K.; Zhang, P.; Li, Y.; Liu, Z.; et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
  • Li et al. (2023a) Li, F.; Liu, A.; Feng, W.; Zhu, H.; Li, Y.; Zhang, Z.; Lv, J.; Zhu, X.; Shen, J.; Lin, Z.; et al. 2023a. Relation-aware diffusion model for controllable poster layout generation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1249–1258.
  • Li et al. (2023b) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023b. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, 19730–19742. PMLR.
  • Li et al. (2019a) Li, J.; Yang, J.; Hertzmann, A.; Zhang, J.; and Xu, T. 2019a. LayoutGAN: Generating Graphic Layouts with Wireframe Discriminators. In International Conference on Learning Representations.
  • Li et al. (2019b) Li, J.; Yang, J.; Hertzmann, A.; Zhang, J.; and Xu, T. 2019b. Layoutgan: Generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767.
  • Lin et al. (2024) Lin, J.; Guo, J.; Sun, S.; Yang, Z.; Lou, J.-G.; and Zhang, D. 2024. Layoutprompter: Awaken the design ability of large language models. Advances in Neural Information Processing Systems, 36.
  • Lin et al. (2023) Lin, J.; Zhou, M.; Ma, Y.; Gao, Y.; Fei, C.; Chen, Y.; Yu, Z.; and Ge, T. 2023. Autoposter: A highly automatic and content-aware design system for advertising poster generation. In Proceedings of the 31st ACM International Conference on Multimedia, 1250–1260.
  • Peng et al. (2023) Peng, R.; Liu, K.; Yang, P.; Yuan, Z.; and Li, S. 2023. Embedding-based retrieval with llm for effective agriculture information extracting from unstructured data. arXiv preprint arXiv:2308.03107.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  • Raneburger, Popp, and Vanderdonckt (2012) Raneburger, D.; Popp, R.; and Vanderdonckt, J. 2012. An automated layout approach for model-driven WIMP-UI generation. In Proceedings of the 4th ACM SIGCHI symposium on Engineering interactive computing systems, 91–100.
  • Seol, Kim, and Yoo (2024) Seol, J.; Kim, S.; and Yoo, J. 2024. PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation. In European Conference on Computer Vision, 451–468. Springer.
  • Tabata et al. (2019) Tabata, S.; Yoshihara, H.; Maeda, H.; and Yokoyama, K. 2019. Automatic layout generation for graphical design magazines. In ACM SIGGRAPH 2019 Posters, 1–2.
  • Tang et al. (2024) Tang, Z.; Wu, C.; Li, J.; and Duan, N. 2024. LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models. In The Twelfth International Conference on Learning Representations.
  • Team et al. (2023) Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; Millican, K.; et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Tzachor et al. (2023) Tzachor, A.; Devare, M.; Richards, C.; Pypers, P.; Ghosh, A.; Koo, J.; Johal, S.; and King, B. 2023. Large language models and agricultural extension services. Nature food, 4(11): 941–948.
  • Vaswani (2017) Vaswani, A. 2017. Attention is all you need. Advances in Neural Information Processing Systems.
  • Wei et al. (2022) Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837.
  • Yang et al. (2016) Yang, X.; Mei, T.; Xu, Y.-Q.; Rui, Y.; and Li, S. 2016. Automatic generation of visual-textual presentation layout. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 12(2): 1–22.
  • Yang et al. (2024) Yang, Z.; Xu, X.; Yao, B.; Rogers, E.; Zhang, S.; Intille, S.; Shara, N.; Gao, G. G.; and Wang, D. 2024. Talk2care: An llm-based voice assistant for communication between healthcare providers and older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(2): 1–35.
  • Yu et al. (2020) Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; and Han, Z. 2020. Scale match for tiny person detection. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 1257–1265.
  • Zhang et al. (2023) Zhang, J.; Guo, J.; Sun, S.; Lou, J.-G.; and Zhang, D. 2023. Layoutdiffusion: Improving graphic layout generation by discrete diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7226–7236.
  • Zheng et al. (2019) Zheng, X.; Qiao, X.; Cao, Y.; and Lau, R. W. 2019. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4): 1–15.
  • Zhong, Tang, and Yepes (2019) Zhong, X.; Tang, J.; and Yepes, A. J. 2019. Publaynet: largest dataset ever for document layout analysis. In 2019 International conference on document analysis and recognition (ICDAR), 1015–1022. IEEE.
  • Zhou et al. (2022) Zhou, M.; Xu, C.; Ma, Y.; Ge, T.; Jiang, Y.; and Xu, W. 2022. Composition-aware graphic layout GAN for visual-textual presentation designs. In IJCAI.