Code Aesthetics with Agentic Reward Feedback

Bang Xiao¹,²,* Lingjie Jiang¹,³,* Shaohan Huang¹,† Tengchao Lv¹
Yupan Huang¹ Xun Wu¹ Lei Cui¹ Furu Wei¹
¹ Microsoft Research Asia
² Zhiyuan College, Shanghai Jiao Tong University
³ Peking University

* Equal contribution. † Corresponding author.
Project page: https://bangx7.github.io/code-aesthetics

Abstract

Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B–685B parameters, underscoring the effectiveness of our approach.

Figure 1: Performance comparison of different models on the OpenDesign benchmark. Left: static score evaluation. Right: interactive score evaluation.

1 Introduction

LLMs have become powerful assistants in our daily lives, helping us polish writing, refine code, and access knowledge [46, 10, 38]. Recently, coding LLMs have achieved great success in various code-related fields, such as code completion, bug fixing, and software engineering [2, 19, 23, 46]. While LLMs have demonstrated remarkable capabilities in single-text-modality coding tasks, they remain inadequate in visually-oriented tasks such as chart generation and webpage design, leading to poor visual outcomes like overlapping elements, inconsistent color schemes, and disorganized structures. Consequently, the aesthetic dimension of visually-oriented code generated by LLMs remains an underexplored area.

In this paper, we focus on assessing and improving LLMs’ ability in visually-oriented coding tasks, that is, programming tasks in which the correctness or quality of the code is inherently tied to its visual output. Typical examples include tasks that generate or manipulate visual artifacts such as web pages (HTML/CSS), plots and charts (e.g., Matplotlib [21], Seaborn [48], Plotly [25]), or graphical scenes (e.g., Python Turtle). Unlike purely algorithmic coding tasks, these tasks require the model to reason about visual structure, spatial layout, and aesthetic consistency, in addition to syntactic or functional correctness. For visually-oriented coding tasks, a natural question arises: do LLMs possess any awareness of the aesthetics of their own code? In other words, do they have a sense of aesthetics?

Building on these insights, we propose the concept of code aesthetics, which captures the aesthetic appeal of the execution result of visually-oriented code. Currently, reward methods for training coding LLMs often focus on a single textual modality, such as code executability and result correctness [17, 14, 31, 13]. These methods have significant limitations when applied to code aesthetics tasks: they cannot assess visual aesthetics and cannot interact with rendered visual interfaces such as webpages, which makes them ineffective as reward sources. To address this challenge, we propose agentic reward feedback, a new reward system consisting of three agents: (i) an execution agent, which checks code executability; (ii) a static aesthetics agent, which assesses aesthetics based on an image of the code execution result; and (iii) an interactive aesthetics agent, which evaluates the function and aesthetics of the rendered visual interface while interacting with its elements. When receiving a raw model output, the execution agent extracts the code blocks from the output and checks their executability. If the check passes, the static aesthetics agent and the interactive aesthetics agent run in parallel to assess the static and interactive aesthetics, respectively. The core idea is simple: a multi-agent system provides systematic reward feedback from textual, visual, and interactive perspectives, which better aligns the model’s sense of aesthetics with that of humans or advanced models. This approach addresses a key limitation of most open-source coding LLMs, which are confined to a single textual modality and thus lack awareness of the visual rendering of their code.

To achieve this goal, we first build AesCode-358K, a large-scale supervised instruction-tuning dataset covering two major code aesthetics tasks: Python-based plot generation and webpage design. Given the absence of existing benchmarks for evaluating webpage aesthetics, we construct the OpenDesign benchmark, which consists of $840$ real webpage design cases, to evaluate webpages from both visual (static) and interactive perspectives using the LLM-as-a-judge [57, 15] method. We then perform reinforcement learning using the GRPO [43] algorithm combined with our Agentic Reward framework (GRPO-AR) to train two models with different parameter scales: AesCoder-4B and AesCoder-7B. After supervised fine-tuning on the AesCode-358K dataset and reinforcement learning with GRPO-AR, our models achieve significant improvements on PandasPlotBench [16] and OpenDesign, showcasing the effectiveness of the AesCode-358K dataset and the GRPO-AR method.

The key contributions can be summarized as follows:

• We introduce the concept of code aesthetics and investigate whether LLM-generated code demonstrates its own design aesthetics.
• We construct the first dataset for code aesthetics, AesCode-358K, and introduce the first benchmark, OpenDesign, which is specifically designed to assess webpage design aesthetics.
• We propose a novel reward system for code aesthetics, agentic reward feedback, and combine it with the GRPO algorithm for more effective model training in code aesthetics tasks.

2 Related Works

Aesthetics of AI-Generated Contents.

With the rapid advancement of generative artificial intelligence [47, 42, 26], increasing attention has been directed to the aesthetic taste of AI-generated content (AIGC) [6, 49] and the alignment between AI aesthetics and human preferences [59, 29, 39]. Previous works include textual aesthetics [27, 11], which investigates methods to provide a cleaner layout and better coherence of LLM output [27], and image aesthetics [12, 50], which focuses on assessing and improving the aesthetic quality of images. However, all these methods rely on evaluating static images and may not be capable of assessing content such as webpages, which requires interaction. With the growing maturity of AI agents [1, 20], it becomes possible to integrate interactive evaluation into the assessment of content generated by large language models, thereby providing more comprehensive and systematic feedback.

Reward Systems in Reinforcement Learning.

In reinforcement learning, the reward serves as a scalar feedback signal that quantitatively evaluates the immediate desirability of an agent’s actions, thereby guiding the learning process toward behaviors that maximize cumulative long-term return [28]. In the context of training large language models, the sources of reward can be broadly categorized into two main types: (i) Model-based Rewards: This approach utilizes a pre-trained reward model to generate feedback [39, 5, 53, 8]. These models encode human preferences or expert knowledge, providing an automated and scalable source of reward. (ii) Rule-based Rewards: This type of reward is generated directly from human-defined rules or logic [43, 54, 33]. However, in complex tasks, relying solely on a single source of reward can induce biased behaviors, ultimately driving optimization in an incorrect direction. Some works have attempted to use agents that combine human preference rewards with verifiable signals to provide more reliable rewards [40].

3 The AesCode-358K Dataset

To investigate code aesthetics, we focus on domains where both the visual outcome and the implementation style matter. In this context, two representative areas are considered: Python-based plot generation, which emphasizes clarity and expressiveness in visualization, and webpage design, where aesthetic factors directly influence layout and user experience. In this section, we introduce AesCode-358K, a large-scale supervised instruction-tuning dataset designed for two key areas of code aesthetics.

3.1 Python-Based Plot Data Construction

We adapt instructions from the existing VisCode-200K dataset [35]. While the original dataset contains 200K data points, we find that some of the Python code snippets are either not executable or exhibit sub-optimal aesthetics, such as chart–legend overlap and improper font sizes. To ensure high quality, we use Qwen3-Coder-480B-A35B-Instruct-FP8 [46, 22] to regenerate the Python code.

We enforce quality control in two ways. First, we limit the Python environment to essential libraries like matplotlib, seaborn, and plotly to prevent unexpected imports. Second, we validate the code’s executability using Jupyter Notebook runtime checks, ensuring that the generated code runs without errors and produces the correct visualizations. After this rigorous filtering, we obtain 158K high-quality plot data points.
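The two quality-control steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the import allowlist is checked statically with `ast` (the paper restricts the environment itself), a subprocess stands in for the Jupyter Notebook runtime check, and the function names, the extra `numpy`/`pandas` entries, and the timeout are hypothetical.

```python
import ast
import subprocess
import sys

# Assumed allowlist: the paper names matplotlib, seaborn, and plotly;
# numpy and pandas are added here purely for illustration.
ALLOWED_MODULES = {"matplotlib", "seaborn", "plotly", "numpy", "pandas"}


def imports_allowed(code: str, allowed=ALLOWED_MODULES) -> bool:
    """Return True if the code parses and imports only allowlisted top-level modules."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports yield "" and are rejected
        else:
            continue
        for name in names:
            if name.split(".")[0] not in allowed:
                return False
    return True


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Run the snippet in a fresh interpreter; a stand-in for a notebook runtime check."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A sample would be kept only if both checks pass; verifying that the snippet "produces the correct visualizations" would need an additional check that is not sketched here.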

3.2 Webpage Design Data Construction

We develop a four-step process to create a large-scale webpage design dataset. First, we use GPT-4o to generate a seed keyword corpus across five webpage categories: General Website, 3D Design, Data Visualization, Game Dev, and UI Component. Next, GPT-4o is used to produce diverse webpage design instructions from these keywords. We then project the instructions into an embedding space and apply t-SNE [32] visualization to examine category overlap. To remove redundancy, we further apply large-scale clustering and retain only representative samples, resulting in a refined instruction dataset (details in Appendix B.2). Finally, we employ GPT-5 [38] and Qwen3-Coder-480B-A35B-Instruct-FP8 [46] to generate HTML code for each instruction. We present dataset statistics and keyword generation prompts in Appendix B.

To ensure the quality of the generated HTML code, we first confirmed that it was executable. We then rendered the webpages using Playwright (https://playwright.dev/) and Selenium (https://www.selenium.dev/) and asked GPT-5 to score the two outputs based on their rendered images. We selected the code with the higher score as our final data.

4 Agentic Reward Framework

For coding tasks, mainstream reward signals typically include execution or unit test success [17, 14], process-aware reward models [31, 13], and human preference feedback [44]. However, these approaches mainly focus on textual modality and lack visually-oriented reward signals, rendering them unsuitable for evaluating code aesthetics. In visually grounded code generation, we highlight three essential dimensions:

• Code Executability. The generated code must run successfully, which forms the fundamental requirement of all code-related tasks.
• Static Aesthetics. This dimension captures the visual quality of the rendered output. An effective design should be concise, well-structured, and visually coherent, with elements properly aligned and exhibiting a clear sense of design.
• Interactive Aesthetics. Beyond static visuals, interactive aspects are crucial for webpages, especially those featuring 3D objects or browser-based games. This dimension ensures that the generated content does not pursue static aesthetics at the expense of interactivity or functionality, thereby achieving functionally correct interactive aesthetics.

Based on these dimensions, we propose an agentic reward framework that leverages a multi-agent system to assess each aspect, integrates their evaluations, and generates comprehensive feedback for webpage design from multiple perspectives.

4.1 Execution Agent

The execution agent verifies whether the model’s output is executable and reports the result to the feedback system. Specifically, it assigns $s_{\text{exec}}=1$ if the output passes all validations, and $s_{\text{exec}}=-1$ otherwise. For a raw model output, the agent first attempts to extract the HTML code from the html code block; if none is found, the entire output is treated as HTML. Given that web browsers tolerate many structural and syntactic errors, strict execution checking is unsuitable for HTML. Instead, we use HTMLHint (https://htmlhint.com/) to implement a rule-based HTML checker that validates the basic syntax. The detailed rules can be seen in Appendix H.7.
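The extraction-with-fallback logic and the $\pm 1$ scoring described above might look like the sketch below. It assumes the code arrives in a fenced block tagged `html`; the function names are hypothetical, and `toy_checker` is only a placeholder for the HTMLHint rule set of Appendix H.7, which is not reproduced here.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the listing readable

# Assumption: the model wraps its page in a fenced block tagged "html".
HTML_BLOCK = re.compile(FENCE + r"html\s*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_html(raw_output: str) -> str:
    """Return the first fenced html block, or the whole output if none is found."""
    match = HTML_BLOCK.search(raw_output)
    if match:
        return match.group(1).strip()
    return raw_output.strip()


def execution_score(html: str, checker) -> int:
    """Map a boolean syntax check to the s_exec values used in the text (+1 or -1)."""
    return 1 if checker(html) else -1


def toy_checker(html: str) -> bool:
    """Hypothetical placeholder, NOT the HTMLHint rules: requires an <html> element."""
    lowered = html.lower()
    return "<html" in lowered and "</html>" in lowered
```

In a real setup, `checker` would wrap a call to the rule-based HTML checker; passing it in as a parameter keeps the extraction logic independent of that choice.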

4.2 Static Aesthetics Agent

The static aesthetics agent evaluates visual quality using full-page webpage screenshots. For an HTML file, it first hosts the page locally using playwright in headless mode, then captures a full-page screenshot for subsequent visual assessment. We identify three dimensions essential for evaluating a webpage screenshot:

• Instructional Alignment. Evaluates consistency between the page’s style and user instructions.
• Visual Elements. Assesses the effective use of modern design features such as lighting, transparency, and gradients.
• Layout and Cohesion. Examines whether the structure is functional, responsive, and visually coherent, with concise yet design-aware typography.

We select GPT-5 [38] as the judge for its strong multimodal reasoning ability. Using a chain-of-thought approach [52], the judge evaluates the full page screenshot and provides both a score and a rationale for each dimension. While both scores and explanations are required to ensure reliable evaluation [52, 56], we retain only the final aggregated score as the output of the static aesthetics agent. The detailed prompts are provided in Appendix H.2.

4.3 Interactive Aesthetics Agent

For webpage design, evaluation based only on static screenshots is insufficient, as it overemphasizes visual appearance while neglecting usability. This issue is particularly critical for interactive webpages such as 3D design platforms or browser-based games. To address this, we introduce the interactive aesthetics agent, which autonomously navigates, explores, and interacts with webpages to provide usability-aware feedback. Given the HTML code, the agent launches the page in a headless environment, interacts with its elements, and evaluates their interactive aesthetics. We adopt WebVoyager [24] as the basic web agent framework and GPT-4o [36] as the multimodal model for cost considerations.

Agent Planning.

At the start of evaluation, the agent generates an initial list of interaction candidates by reasoning about which elements are most relevant to the user instruction and webpage content. It then ranks these candidates and selects the top $N$ for execution. To ensure evaluations remain offline, interactions requiring internet access (e.g., social media logins) are excluded, focusing only on the core webpage functionality.

Agent Interacting and Scoring.

The agent then executes the planned interactions step by step. For each interaction, it compares the screenshots taken before and after the action and judges whether the webpage responds correctly, recording the attempt as a success or a failure. After completing all interactions, it outputs a binary score list indicating success ($1$) or failure ($0$) for each action, and aggregates them into a final interaction score: $s_{\text{interact}}=\sum_{i=1}^{N}s_{i}$. This score is then returned to the agentic reward framework (see Appendix H.3 for the full prompt).

Discussions.

Current web agents can handle most webpage operations [24], but may still struggle with certain corner cases, such as confusing webpage elements or being misled by irrelevant textual content [7, 51]. Such agent failures lead to a score of $0$ in the corresponding iteration, since we assign a score of $1$ only when the webpage responds correctly. This may cause the agent to make incorrect judgments, resulting in scores lower than the true values. On the other hand, agent failures also partially reveal non-standard or sub-optimal aspects of webpage design. Therefore, despite these limitations, using web agents as evaluators provides a reasonable proxy for assessing overall webpage aesthetics and interactivity.

4.4 Reward Aggregation

The results from the three agents are integrated by the agentic reward framework, which jointly evaluates execution, static aesthetics, and interactive aesthetics to provide comprehensive feedback on each webpage. Let $r_{\text{exec}}$, $r_{\text{static}}$, and $r_{\text{interact}}$ denote the rewards from the respective agents. The overall reward is then computed as

$$r=w_{\text{exec}}\cdot r_{\text{exec}}+w_{\text{static}}\cdot r_{\text{static}}+w_{\text{interact}}\cdot r_{\text{interact}} \tag{1}$$

where $w$ represents the weight assigned to each agent.
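As a concrete reading of Eq. (1) together with the interaction score of Sec. 4.3, the following sketch sums binary interaction outcomes and forms the weighted total. The weight values are placeholders chosen for illustration (this section does not state them), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Placeholder values; the actual weights are not given in this section.
    w_exec: float = 1.0
    w_static: float = 1.0
    w_interact: float = 1.0


def interaction_score(outcomes) -> int:
    """Sum binary per-interaction outcomes (1 = success, 0 = failure)."""
    return sum(1 for ok in outcomes if ok)


def aggregate_reward(r_exec: float, r_static: float, r_interact: float,
                     weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the three agent rewards, following Eq. (1)."""
    return (weights.w_exec * r_exec
            + weights.w_static * r_static
            + weights.w_interact * r_interact)
```

For example, with an execution reward of 1, a static score of 80, two successful interactions, and illustrative weights (1.0, 0.01, 0.5), the total is 1 + 0.8 + 1.0 = 2.8. Because the static score and the interaction count live on different scales, the weights also act as scale factors.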

5 AesCoder Training

5.1 Stage I: Supervised Fine-Tuning on AesCode-358K

We perform supervised fine-tuning on two models with different parameter scales on our AesCode-358K dataset: Qwen3-4B-Instruct-2507 [46] and Qwen2.5-Coder-7B-Instruct [22]. This validates the generalizability of the AesCode-358K dataset and establishes a robust foundation for the next stage of reinforcement learning.

5.2 Stage II: Reinforcement Learning with Agentic Reward Feedback

After supervised fine-tuning in stage I, the model acquires substantial high-quality knowledge. However, the model at this stage still exhibits limited generalization beyond the training distribution [9], especially in webpage design tasks. This limitation highlights the necessity of reinforcement learning (RL), which allows the model to adapt more flexibly and robustly to diverse and unseen scenarios. Thus, we perform reinforcement learning using the GRPO-AR method, which integrates the GRPO [43] algorithm with our Agentic Reward framework to enhance the model’s ability.

Data Preparation for RL.

To avoid overlap with the data in AesCode-358K, which the model has already “seen” in stage I, we select 20K RL samples from the WebSight v0.2 dataset [30], a large synthetic dataset containing HTML/CSS code and LLM-generated descriptions of the webpages. However, the webpage descriptions in WebSight v0.2 are relatively homogeneous and do not match the natural expression patterns of human users. We therefore use the original webpage descriptions as seeds and use GPT-4o [36] to generate user instructions with clearer semantic expression. The prompts are provided in Appendix H.6.

GRPO with Agentic Reward.

To generalize the model’s webpage design ability, we adopt our agentic reward system as a reliable and robust reward provider and perform reinforcement learning using the GRPO [43] algorithm. We call this training method GRPO-AR. For each prompt $p$ in our RL dataset $\mathcal{D}_{RL}$, GRPO-AR samples a group of outputs $\{o_{1},o_{2},\dots,o_{G}\}$ from the old policy model $\pi_{\theta_{old}}$, and our agentic reward framework assigns each output a total reward $r_{i}$ that combines the execution, static aesthetics, and interactive aesthetics perspectives, yielding $G$ rewards $\{r_{1},r_{2},\dots,r_{G}\}$. The advantage $\hat{A}_{i,t}$ is calculated as follows:

$$\hat{A}_{i,t}=\frac{r_{i}-\text{mean}(r)}{\text{std}(r)} \tag{2}$$
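A minimal sketch of the group-normalized advantage in Eq. (2) is given below. Two details are assumptions not stated in the text: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    The same advantage is shared by every token t of output o_i.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the G samples
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones, so only the relative ranking within a group drives the update.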

Accordingly, the policy model is optimized by maximizing the GRPO objective under our agentic reward framework (GRPO-AR):

$$\begin{aligned} \mathcal{J}_{\text{GRPO}}(\theta)={}&\mathbb{E}[p\sim\mathcal{D}_{RL},\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{\text{SFT}}}(O|p)]\\ &\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_{i}|}\sum_{t=1}^{|o_{i}|}\left\{\min\left[\frac{\pi_{\theta}(o_{i,t}|p,o_{i,<t})}{\pi_{\theta_{\text{SFT}}}(o_{i,t}|p,o_{i,<t})}\hat{A}_{i,t},\text{clip}\left(\frac{\pi_{\theta}(o_{i,t}|p,o_{i,<t})}{\pi_{\theta_{\text{SFT}}}(o_{i,t}|p,o_{i,<t})},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\right]-\beta\mathbb{D}_{\text{KL}}\left[\pi_{\theta}||\pi_{\text{ref}}\right]\right\}\end{aligned} \tag{3}$$
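To make the clipped term inside Eq. (3) concrete, the sketch below evaluates it for a single token from log-probabilities. It deliberately omits the KL penalty and the averaging over tokens and samples; the default `epsilon = 0.2` and the function name are illustrative assumptions, not values from the paper.

```python
import math


def clipped_token_objective(logp_new: float, logp_old: float,
                            advantage: float, epsilon: float = 0.2) -> float:
    """Per-token clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # probability ratio between the two policies
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage the objective stops growing once the ratio exceeds $1+\epsilon$, while with a negative advantage the unclipped (more pessimistic) value is kept, which is the usual behavior of this clipping scheme.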

6 The OpenDesign Benchmark

Design Arena (https://designarena.ai/) is a widely used platform for benchmarking web page design, supported by a community of hundreds of thousands of voters. It allows users to design web pages with various models and receive community feedback through voting. While effective, this voting process is time-consuming and impractical for large-scale evaluation.

To address this limitation, we introduce the OpenDesign Benchmark, which enables efficient and automated assessment of web page aesthetics using large language models. The benchmark includes 840 real-world web page cases and evaluates both static and interactive aspects of design. A detailed breakdown of categories and their case counts is provided in Appendix C.

6.1 Evaluation Mechanism

The benchmark assesses model performance from two perspectives: static aesthetics and interactive aesthetics. Static evaluation: given a prompt, the HTML generated by a model is rendered into a static image. The prompt and the image are then assessed by the static aesthetics agent (see Sec. 4.2), which produces a static aesthetics score. Interactive evaluation: using the same prompt and HTML code, the interactive aesthetics agent (see Sec. 4.3) assigns an interactive aesthetics score. The final benchmark score for a model is obtained by averaging these results across all benchmark cases.

6.2 Reliability Analysis of OpenDesign

To evaluate the quality and reliability of the OpenDesign benchmark, we adopt two complementary perspectives: (1) ranking consistency between OpenDesign and Design Arena, and (2) alignment between LLM scoring and human preference.

Ranking consistency between OpenDesign and Design Arena.

We compare the rankings of 10 mainstream foundation models against the Design Arena leaderboard (rankings taken as of September 22, 2025; Design Arena updates dynamically). We measure consistency using Spearman’s and Kendall’s rank correlation coefficients, obtaining strong agreement: Spearman = 0.98 ($p<1.5\times 10^{-6}$) and Kendall = 0.91 ($p<3.0\times 10^{-5}$). Additionally, OpenDesign achieves $66.7\%$ top-3 and $80.0\%$ top-5 overlap with Design Arena. These results indicate that OpenDesign closely reflects large-scale human judgment. Figure 3(a) plots model ranks across both benchmarks. Points align closely with the diagonal, confirming OpenDesign as a reliable proxy for human preferences in webpage aesthetics.
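For readers who want to reproduce this kind of consistency check on their own rankings, here is a dependency-free sketch of the three statistics. It assumes rankings without ties (so the closed-form Spearman formula and Kendall's tau-a apply) and does not compute p-values; the function names and the toy data in the usage note are illustrative and are not the paper's rankings.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b) -> float:
    """Spearman correlation for two tie-free rank lists over the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_overlap(order_a, order_b, k: int) -> float:
    """Fraction of the top-k items of one ordering that also appear in the other's top-k."""
    return len(set(order_a[:k]) & set(order_b[:k])) / k
```

With five models ranked `[1, 2, 3, 4, 5]` on one benchmark and `[1, 2, 3, 5, 4]` on the other, a single swapped pair gives a Spearman value of 0.9 and a Kendall value of 0.8. A top-3 overlap of $66.7\%$ corresponds to two shared models out of three, and a top-5 overlap of $80.0\%$ to four out of five.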

Alignment with Human Scoring.

We sampled 200 HTML page pairs generated by the 10 models under the same prompts. Two evaluator groups—GPT judge and 10 humans (3 professors, 7 graduate students)—performed pairwise comparisons (win/tie/lose), yielding $2{,}000$ annotations. Figure 3(b) shows agreement ratios: human-human = $68.7\%$ , GPT-human = $80.9\%$ . These are comparable to MT-Bench results ( $66\%$ and $70\%$ , respectively) [4, 57], supporting LLM-as-a-Judge as an effective, robust method for assessing code aesthetics.

7 Experiments and Results

7.1 Experimental Setup

We evaluate the model’s plot generation using PandasPlotBench [16] with the head descriptor and vis mode. For each case, the model generates code from an instruction; executability is checked, and if an image is produced, it is compared to the ground truth. GPT-4o scores each case from $0$ to $100$. This yields three metrics: (i) error rate, the proportion of cases that do not pass the executability check; (ii) average score, the average GPT-4o score over all test cases; and (iii) good rate, the proportion of scores higher than $75$. Webpage design ability is assessed using our OpenDesign benchmark (see Section 6). Training settings are provided in Appendix E.
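The three PandasPlotBench metrics can be computed from per-case scores roughly as follows. This sketch makes assumptions the text does not settle: a failed case is represented as `None`, it contributes a score of 0 to the average, and all three metrics are normalized by the total number of cases. The function name is hypothetical.

```python
from typing import Optional, Sequence


def summarize_plot_scores(scores: Sequence[Optional[float]],
                          good_threshold: float = 75.0) -> dict:
    """Compute error rate, average score, and good rate over all test cases.

    `None` marks a case that failed the executability check (assumed to score 0).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one test case")
    failed = sum(1 for s in scores if s is None)
    total = sum(s for s in scores if s is not None)
    good = sum(1 for s in scores if s is not None and s > good_threshold)
    return {
        "error_rate": failed / n,
        "average_score": total / n,
        "good_rate": good / n,
    }
```

For instance, scores of 80, a failure, 60, and 90 give an error rate of 0.25, an average of 57.5, and a good rate of 0.5 under these assumptions.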

Table 1: Performance comparison between proprietary and open-source models across various benchmarks. In PandasPlotBench, Err., Avg., Good. refer to error rate, average score, good rate respectively. In OpenDesign, Align., Aes., Struct. refer to the three score perspectives: instructional alignment with user instruction, visual elements aesthetics, and structural cohesion respectively. Total. means the total score of the sum of three aspects’ scores, and InterAes. refers to the score of interactive evaluation stage. Note: Lower is better for Err., higher is better for all other metrics. Best results are in bold, second-best results are underlined (among all open-source models together).

Model	Size	PandasPlotBench Err. (↓)	PandasPlotBench Avg. (↑)	PandasPlotBench Good. (↑)	OpenDesign Static Align. (↑)	OpenDesign Static Aes. (↑)	OpenDesign Static Struct. (↑)	OpenDesign Static Total. (↑)	OpenDesign InterAes. (↑)
Proprietary Models
GPT-4o-mini	-	0.15	64	0.57	14.29	14.13	12.77	41.19	0.40
GPT-4o	-	0.09	68	0.60	16.90	16.05	15.13	48.08	0.44
GPT-4.1	-	0.09	69	0.61	23.53	21.99	20.27	65.79	0.74
GPT-5 (minimal)	-	0.04	75	0.66	30.38	25.94	24.71	81.03	1.37
Claude Sonnet 4	-	0.04	74	0.65	29.60	25.92	25.53	81.05	0.92
Open-Source Large Language Models
Qwen3-Coder-30B-A3B	30B	0.07	72	0.62	27.04	23.79	22.75	73.66	0.52
GLM-4-32B-0414	32B	0.07	70	0.59	24.67	22.90	21.80	69.40	0.48
GLM-4.5-Air	110B	0.08	71	0.63	29.29	24.83	24.04	78.16	0.93
Qwen3-Coder-480B-A35B	480B	0.05	73	0.66	30.13	25.16	24.62	79.90	0.70
DeepSeek-V3.1	685B	0.09	69	0.58	29.35	24.37	24.00	77.72	0.88
DeepSeek-R1-0528	685B	0.08	70	0.63	30.02	24.69	24.09	78.86	0.77
Open-Source Small Language Models
Qwen3-4B-Instruct-2507	4B	0.13	65	0.55	27.52	23.01	22.73	73.26	0.67
Qwen2.5-Coder-7B-Instruct	7B	0.22	60	0.50	16.38	15.13	14.73	46.27	0.38
AesCoder-4B (Ours)	4B	0.09	70	0.63	30.42	26.19	25.31	81.92	1.04
AesCoder-7B (Ours)	7B	0.09	67	0.57	30.03	25.98	25.18	81.23	0.94

7.2 Main Results

As shown in Table 1, both AesCoder-4B and AesCoder-7B achieve consistent improvements over their respective baselines. On PandasPlotBench, they achieve lower error rates and higher reliability, indicating stronger capability in generating correct plotting code. On OpenDesign, AesCoder achieves substantial improvements in both static aesthetics (alignment, visual appeal, and structure) and interactive aesthetics, surpassing all other open-source models. In particular, AesCoder matches or outperforms models with 30B–685B parameters, establishing new state-of-the-art results among open-source systems.

When compared with proprietary models, AesCoder-4B not only surpasses GPT-4o and GPT-4.1 on both PandasPlotBench and OpenDesign, but also delivers results competitive with substantially larger systems. Although GPT-5 and Claude Sonnet 4 still retain a slight overall advantage, our models achieve comparable scores across several aesthetic dimensions. These findings underscore the effectiveness of GRPO-AR, demonstrating that reinforcement learning with agentic reward feedback consistently enhances performance across different architectures and scales.

We further conducted human evaluation (Appendix F), and the results show that our models consistently outperform strong open-source baselines (GLM-4-32B-0414 and Qwen3-Coder-30B-A3B-Instruct), which further validates our results.

7.3 Analysis

Generalization of agentic reward.

We further analyze the reward dynamics during reinforcement learning, as illustrated in Appendix G. Both Qwen2.5-Coder-7B-Instruct-SFT and Qwen3-4B-Instruct-2507-SFT exhibit steadily increasing reward scores with training steps. This consistent upward trend indicates that the agentic reward framework provides stable and informative feedback, enabling continuous improvement across different model families and sizes. The results highlight the robustness of the framework as a general training signal, independent of specific architecture choices.

Effect of Agentic Reward.

To isolate the contribution of the proposed agentic reward, we conduct a controlled comparison against a variant that does not incorporate it. Specifically, instead of leveraging the full agentic reward framework, we directly employ the underlying reward model to score model-generated HTML outputs along three static dimensions—Instructional Alignment, Visual Design and Aesthetics, and Structural Coherence and Usability—and use these scores as the sole reward signal (see Appendix H.4 for the exact prompt). The policy optimization strictly follows the same procedure as in Sec. 5.2, with the updates computed according to Eq. 3, thereby ensuring a fair comparison.

Table 2: Comparison with DPO, RFT, and ablations on Agentic Reward for Qwen3-4B-Instruct-2507 and Qwen2.5-Coder-7B-Instruct.

Training Strategy	Align	Aes	Struct	InterAes
Qwen3-4B-Instruct-2507
SFT	28.50	25.27	24.36	0.62
RFT	29.32	25.30	24.67	0.71
DPO	28.79	25.31	24.38	0.70
GRPO-AR w/o Agentic Reward (ablation)	29.16	25.20	24.67	0.71
GRPO-AR w/ Agentic Reward (ours)	30.42	26.19	25.31	1.04
Qwen2.5-Coder-7B-Instruct
SFT	28.85	25.23	24.37	0.70
RFT	29.73	25.35	24.85	0.75
DPO	29.75	25.33	24.87	0.71
GRPO-AR w/o Agentic Reward (ablation)	28.81	25.02	24.41	0.72
GRPO-AR w/ Agentic Reward (ours)	30.03	25.98	25.18	0.94

As reported in Table 2, this simplified variant consistently underperforms the full method that integrates agentic reward feedback. The performance gap highlights that merely reusing the reward model to directly score code in textual modality is insufficient. In contrast, our agentic reward framework, which incorporates multi-perspective evaluations including execution, static, and interactive aesthetics, provides richer and more reliable feedback. These results demonstrate that agentic reward is essential for aligning the model with both functional correctness and human-perceived aesthetics.

Comparison with DPO and RFT.

To further validate the effectiveness of our proposed method GRPO-AR, we additionally compare it with two RLHF methods: Direct Preference Optimization (DPO) [41] and Rejection Sampling Fine-Tuning (RFT) [55]. Both methods are applied to the Stage I checkpoint $\pi_{\theta_{\mathrm{SFT}}}$ , using the same training data as in Stage II to ensure a fair comparison. Implementation details of DPO and RFT are provided in Appendix D. As shown in Table 2, our method consistently surpasses both DPO and RFT on OpenDesign across static and interactive aesthetics. These improvements highlight that incorporating agentic reward feedback not only enhances the visual quality of generated webpages but also strengthens their usability and structural robustness, confirming the superiority of GRPO-AR.

8 Case Study

We further conduct case studies on the OpenDesign benchmark to qualitatively compare AesCoder-4B with Claude Sonnet 4 [3] and DeepSeek-R1-0528 [10]. We select five representative cases from the five categories in OpenDesign for comparison. As illustrated in Figure 4, AesCoder-4B achieves results that are superior to or on par with state-of-the-art models across all five web design task categories. These results highlight the effectiveness of our approach in aligning code generation with both usability and aesthetic quality.

9 Conclusion

In this work, we introduce the concept of code aesthetics and present AesCode-358K, OpenDesign, and an agentic reward framework (GRPO-AR) that jointly enhance executability, static design, and interactivity in code generation. Through supervised tuning and reinforcement learning with GRPO-AR, our AesCoder models achieve state-of-the-art results on PandasPlotBench and OpenDesign, rivaling much larger models. These results demonstrate that multi-agent reward feedback can effectively align coding LLMs with both functional correctness and human-perceived aesthetics, paving the way for more capable and user-friendly coding assistants.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. Claude code: Best practices for agentic coding. https://www.anthropic.com/engineering/claude-code-best-practices, 2025. Accessed: 2025-09-25.
[3] Anthropic. Introducing claude 4. https://www.anthropic.com/news/claude-4, May 2025.
[4] Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454. Association for Computational Linguistics, 2024.
[5] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
[6] Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip Yu, and Lichao Sun. A survey of ai-generated content (aigc). ACM Computing Surveys, 57(5):1–38, 2025.
[7] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent llm systems fail?, 2025.
[8] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. 2023.
[9] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025.
[10] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
[11] Paul C Dilley. Textual aesthetics. The Red Monastery Church: Beauty and Asceticism in Upper Egypt, page 175, 2016.
[12] Yubin Deng, Chen Change Loy, and Xiaoou Tang. Image aesthetic assessment: An experimental survey. IEEE Signal Processing Magazine, 34(4):80–106, 2017.
[13] Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation, 2025.
[14] Qiang Fu, Xiao Han, Wei Yang, Deheng Ye, Kaiwen Xiao, Jiate Liu, and Yiqin Zhu. Rltf: Reinforcement learning from unit test feedback, 2023.
[15] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, 2025.
[16] Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, and Egor Bogomolov. Drawing pandas: A benchmark for llms in generating plotting code, 2025.
[17] Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriele Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. ArXiv, abs/2410.02089, 2024.
[18] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
[19] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
[20] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[21] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.
[22] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024.
[23] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024.
[24] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024.
[25] Plotly Technologies Inc. Collaborative data science, 2015.
[26] Mladan Jovanovic and Mark Campbell. Generative artificial intelligence: Trends and prospects. Computer, 55(10):107–112, 2022.
[27] Lingjie Jiang, Shaohan Huang, Xun Wu, and Furu Wei. Textual aesthetics in large language models. arXiv preprint arXiv:2411.02930, 2024.
[28] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
[29] Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, and Pingfa Feng. Humanaesexpert: Advancing a multi-modality foundation model for human image aesthetic assessment. arXiv preprint arXiv:2503.23907, 2025.
[30] Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into html code with the websight dataset, 2024.
[31] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[32] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
[33] Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety. Advances in Neural Information Processing Systems, 37:108877–108901, 2024.
[34] Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, and Franck Dernoncourt. Gui agents: A survey, 2025.
[35] Yuansheng Ni, Ping Nie, Kai Zou, Xiang Yue, and Wenhu Chen. Viscoder: Fine-tuning llms for executable python visualization code generation. arXiv preprint arXiv:2506.03930, 2025.
[36] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2025-09-13.
[37] OpenAI. Gpt-5 system card. https://openai.com/index/gpt-5-system-card/, August 2025.
[38] OpenAI. Introducing gpt-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2025-09-13.
[39] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[40] Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, and Juanzi Li. Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv preprint arXiv:2502.19328, 2025.
[41] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024.
[42] Tam Sakirin and Siddartha Kusuma. A survey of generative artificial intelligence techniques. Babylonian Journal of Artificial Intelligence, 2023:10–14, 2023.
[43] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[44] Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxiang Wang. Pangu-coder2: Boosting large language models for code with ranking feedback, 2023.
[45] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[46] Qwen Team. Qwen3 technical report, 2025.
[47] Tijn van der Zant, Matthijs Kouw, and Lambert Schomaker. Generative artificial intelligence. In Philosophy and theory of artificial intelligence, pages 107–120. Springer, 2013.
[48] Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
[49] Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Hong Lin. Ai-generated content (aigc): A survey. arXiv preprint arXiv:2304.06632, 2023.
[50] Xun Wu, Shaohan Huang, and Furu Wei. Multimodal large language model is a human-aligned annotator for text-to-image generation. arXiv preprint arXiv:2404.15100, 2024.
[51] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), March 2024.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[53] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845, 2024.
[54] Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025.
[55] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023.
[56] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
[57] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[58] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Large language model-brained gui agents: A survey, 2025.
[59] Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, Zheng Zhang, Qi Dai, Chong Luo, Xin Geng, et al. Aligning vision models with human aesthetics in retrieval: Benchmarks and algorithms. Advances in Neural Information Processing Systems, 37:86399–86434, 2024.
[60] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024.

Appendix A LLM Usage Statement

A large language model (ChatGPT) was used to aid and polish the writing of the paper, including minor grammar correction and language refinement.

Appendix B Details of Web Page Data Construction

B.1 Keyword Corpus and Instruction Generation

We classified webpages into five categories: General Website, 3D Design, Data Visualization, Game Dev, and UI Component. Using GPT-4o, we generated 9K seed keywords for the General Website category, and 2.5K keywords for each of the remaining four categories. Table 3 summarizes the distribution.

Table 3: Seed keywords statistics across categories.

Category	General Website	3D Design	Data Visualization	Game Dev	UI Component
Seed Keywords	9,000	2,500	2,500	2,500	2,500

Based on the seed corpus, GPT-4o was asked to generate 20 non-redundant and semantically diverse instructions for each keyword. This resulted in a total of 400,000 webpage design instructions for further processing.

B.2 Semantic Analysis and Deduplication

We embedded all instructions using OpenAI's text-embedding-3-large model (3072 dimensions). From each category, 2,000 instructions were randomly sampled and visualized with t-SNE (perplexity = 30, max_iter = 1000). As shown in Figure 5, the raw dataset exhibited significant overlap across categories, along with several dense clusters.

To filter out redundancy, we applied K-Means clustering with $K=200\text{K}$ on the embedded vectors and kept only the sample nearest to each cluster center. This resulted in a refined dataset of 200K instructions. The t-SNE visualization of the refined dataset shows clearer class boundaries and reduced overlap across categories, demonstrating the effectiveness of our filtering.
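The filtering step can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the pipeline actually used: the helper names (`sq_dist`, `kmeans`, `dedup_by_centroid`), the random initialization, and the fixed iteration count are all hypothetical, and a run over 3072-dimensional embeddings with 200K clusters would need an optimized clustering library.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its closest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def dedup_by_centroid(points, k, seed=0):
    """Keep the index of the one point nearest to each non-empty cluster center."""
    centroids, assign = kmeans(points, k, seed=seed)
    keep = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            keep.append(min(members, key=lambda i: sq_dist(points[i], centroids[c])))
    return sorted(keep)


# Toy "embeddings": two tight groups of near-duplicates.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
kept = dedup_by_centroid(toy, k=2)
```

With `k` set to the target dataset size, each surviving index corresponds to one retained instruction, so near-duplicates that fall into the same cluster collapse to a single representative.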

Appendix C OpenDesign Benchmark Categories

Table 4: Distribution of OpenDesign Benchmark Categories (Total: 840 cases)

General Website	3D Design	Data Visualization	Game Dev	UI Component	Total
60.9%	14.6%	4.8%	13.6%	4.9%	100%

Appendix D Implementation Details for DPO and RFT

In this section, we describe the construction pipeline of training data for both DPO and RFT used in §7.3. We adopt the same set of queries as in GRPO-AR for offline sampling. For each query $q$ , we sample $N$ responses from the SFT policy $\pi_{\theta_{\mathrm{SFT}}}$ , yielding

$$\mathcal{O}(q)\;=\;\bigl\{\,o_{i}\,\bigr\}_{i=1}^{N}. \tag{4}$$

A reward model $R_{\phi}$ then scores each response conditioned on $q$ :

$$\mathcal{R}(q)\;=\;\bigl\{\,r(o_{i}\mid q)\;\big|\;o_{i}\in\mathcal{O}(q)\,\bigr\},\quad\text{where }r(o\mid q)\equiv R_{\phi}(o\mid q). \tag{5}$$

DPO.

For DPO, we construct a preference dataset by taking, for each $q$ , the highest- and lowest-scoring responses:

$$\mathcal{D}_{\mathrm{DPO}}\;=\;\Bigl\{\,(q,o_{w},o_{l})\ \Big|\ o_{w}=\arg\max_{\,o\in\mathcal{O}(q)}r(o\mid q),\ o_{l}=\arg\min_{\,o\in\mathcal{O}(q)}r(o\mid q)\Bigr\}. \tag{6}$$

We then optimize $\pi_{\theta}$ (initialized from $\pi_{\theta_{\mathrm{SFT}}}$ ) with the standard DPO objective [41]:

$$\max_{\theta}\ \mathbb{E}_{(q,o_{w},o_{l})\,\sim\,\mathcal{D}_{\mathrm{DPO}}}\!\left[\log\sigma\!\left(\beta\Bigl(\log\tfrac{\pi_{\theta}(o_{w}\mid q)}{\pi_{\theta_{\mathrm{SFT}}}(o_{w}\mid q)}-\log\tfrac{\pi_{\theta}(o_{l}\mid q)}{\pi_{\theta_{\mathrm{SFT}}}(o_{l}\mid q)}\Bigr)\right)\right], \tag{7}$$

where $\sigma(\cdot)$ is the sigmoid and $\beta>0$ is a scaling hyperparameter.
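For a single preference triple, the quantity inside the expectation of Eq. 7 can be evaluated as in the sketch below. It returns the negated objective so that it can be minimized as a loss. The function name `dpo_pair_loss` and the default `beta=0.1` are illustrative assumptions (the value of $\beta$ is not stated here), and the inputs are assumed to be sequence-level log-probabilities $\log\pi(o\mid q)$.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    logp_*     : log pi_theta(o | q) for the preferred (w) and rejected (l) response
    ref_logp_* : the same quantities under the frozen SFT reference policy
    beta       : scaling hyperparameter (0.1 is only a placeholder value)
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_init = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the preferred response's likelihood relative to the rejected one lowers the loss.
loss_after_shift = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
```

Because the policy is initialized from the SFT checkpoint, every pair starts at a loss of $\log 2$, and training reduces it by widening the log-ratio margin between $o_{w}$ and $o_{l}$.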

RFT.

For RFT, we select only the top-scoring response per query:

$$\mathcal{D}_{\mathrm{RFT}}\;=\;\Bigl\{\,(q,o)\ \Big|\ o=\arg\max_{\,o\in\mathcal{O}(q)}r(o\mid q)\Bigr\}. \tag{8}$$
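The selection rules in Eq. 6 and Eq. 8 can be sketched together, since both operate on the same scored samples. The function name `build_offline_datasets` and the input layout are hypothetical, and skipping queries whose best and worst responses coincide is an assumption of this sketch rather than something stated in the text.

```python
def build_offline_datasets(samples):
    """Build DPO triples and RFT pairs from reward-scored responses.

    samples maps each query q to a list of (response, reward) tuples,
    i.e. the sampled set O(q) together with the scores R(q).
    Returns (dpo_triples, rft_pairs).
    """
    dpo_triples, rft_pairs = [], []
    for q, scored in samples.items():
        idx = range(len(scored))
        best = max(idx, key=lambda i: scored[i][1])   # arg max over O(q)
        worst = min(idx, key=lambda i: scored[i][1])  # arg min over O(q)
        rft_pairs.append((q, scored[best][0]))
        # Assumption: a query with no reward spread yields no preference pair.
        if best != worst:
            dpo_triples.append((q, scored[best][0], scored[worst][0]))
    return dpo_triples, rft_pairs


toy_samples = {
    "q1": [("a", 0.2), ("b", 0.9), ("c", 0.5)],
    "q2": [("x", 0.4), ("y", 0.4)],  # tied rewards
}
dpo_data, rft_data = build_offline_datasets(toy_samples)
```

Here `q1` contributes the triple `("q1", "b", "a")` to the DPO set, while `q2` contributes only an RFT pair because its rewards are tied.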

The model is then trained with a standard supervised objective:

$$\mathcal{L}_{\mathrm{RFT}}(\theta)=-\,\mathbb{E}_{(q,o)\,\sim\,\mathcal{D}_{\mathrm{RFT}}}\left[\sum_{t=1}^{|o|}\log\pi_{\theta}\!\left(o_{t}\,\middle|\,q,o_{1:t-1}\right)\right]. \tag{9}$$
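As a small worked instance of Eq. 9, the loss is the negative summed token log-likelihood of each selected response, averaged over the dataset. The function name `rft_loss` and the toy log-probabilities are illustrative assumptions; in practice the per-token values come from the model's forward pass.

```python
def rft_loss(batch_token_logprobs):
    """Monte Carlo estimate of Eq. 9 over a batch.

    batch_token_logprobs holds one list per (q, o) example, containing
    log pi_theta(o_t | q, o_{1:t-1}) for every token t of the response o.
    """
    per_example = [-sum(token_logprobs) for token_logprobs in batch_token_logprobs]
    return sum(per_example) / len(per_example)


# Two responses: one with two tokens, one with a single token.
example_loss = rft_loss([[-1.0, -2.0], [-3.0]])
```

Note that the inner sum is not length-normalized, so longer responses contribute more to the objective than shorter ones.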

Implementation.

We implement both DPO and RFT with LLaMA-Factory [60] (https://github.com/hiyouga/LLaMA-Factory). For a fair comparison with GRPO-AR, we keep the same learning rate, batch size, and total number of training samples as in Stage II.

Appendix E Training Settings

For Stage I, all models are trained for 3 epochs with the AdamW optimizer, employing a 10% linear warmup followed by a cosine learning rate decay schedule. The maximum learning rate is set to $1\times 10^{-5}$, with a batch size of $128$ and a maximum sequence length of $8\text{k}$ tokens. Training the 7B model in the SFT phase takes approximately $2$ days on one node of 8×MI300 GPUs.
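The Stage I learning rate schedule can be sketched as a function of the optimizer step. The function name `lr_at`, the step-indexing convention, and the decay floor `min_lr=0.0` are assumptions of this sketch; only the 10% linear warmup, the cosine decay, and the $1\times 10^{-5}$ peak come from the settings above.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup over the first warmup_frac of training, then cosine decay
    from peak_lr down to min_lr (min_lr=0.0 is an assumed floor).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, the end of warmup, the midpoint, and the end.
schedule = [lr_at(s, total_steps=1000) for s in (0, 99, 550, 999)]
```

For a 1000-step run, the rate climbs linearly over the first 100 steps, reaches the peak at step 99, and then follows a half-cosine down toward the floor.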

For Stage II, we use VeRL [45] to conduct experiments. By default, we use the AdamW optimizer with a constant learning rate of $3\times 10^{-6}$ for the policy model, together with a batch size of 64 and a micro-batch size of 8. The rollout stage collects 64 prompts and samples 8 responses for each prompt. We set the KL coefficient to 0.001 and $\epsilon=0.5$ in Eq. 3 in all experiments. The RL phase takes approximately $7$ days on one node of 8×MI300 GPUs. In the agentic reward framework, we set $w_{\mathrm{exec}}=0.1$, $w_{\mathrm{static}}=0.8$, and $w_{\mathrm{interact}}=0.1$. Given the currently low success rate of GUI agents [58, 34, 24], we limit the number of interactive elements to $3$ during training. Additionally, when the GUI agent lists the interactive elements, we instruct it to prioritize them by importance. This ensures that the most critical and prominent elements are interacted with, thereby mitigating the impact of the GUI agent's limited success rate on GRPO-AR training.
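To make the weighting concrete, the sketch below combines three per-response scores with the weights listed above and caps the interactive elements at three. It assumes the final reward is a simple weighted sum of scores normalized to $[0,1]$; that combination rule, the names `agentic_reward` and `select_elements`, and the `(name, importance)` element format are illustrative assumptions, not the exact implementation.

```python
REWARD_WEIGHTS = {"exec": 0.1, "static": 0.8, "interact": 0.1}
MAX_INTERACTIVE_ELEMENTS = 3


def agentic_reward(r_exec, r_static, r_interact, weights=REWARD_WEIGHTS):
    """Weighted sum of the three reward signals (assumed to lie in [0, 1])."""
    return (weights["exec"] * r_exec
            + weights["static"] * r_static
            + weights["interact"] * r_interact)


def select_elements(elements, limit=MAX_INTERACTIVE_ELEMENTS):
    """Keep the `limit` most important interactive elements.

    elements is a list of (name, importance) tuples; the importance score is
    a hypothetical stand-in for the ordering produced by the GUI agent.
    """
    ranked = sorted(elements, key=lambda e: e[1], reverse=True)
    return [name for name, _ in ranked[:limit]]


# A page that executes, looks mediocre, and fails its interaction checks.
reward = agentic_reward(r_exec=1.0, r_static=0.5, r_interact=0.0)
chosen = select_elements([("footer-link", 0.1), ("nav-menu", 0.9),
                          ("submit-button", 0.8), ("theme-toggle", 0.4)])
```

With these weights the static aesthetics score dominates: in the toy case it contributes 0.4 of the total 0.5, while executability contributes the remaining 0.1.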

Appendix F Human Evaluation

To validate the effectiveness of our model, we select four mainstream models, Claude Sonnet 4 [3], GPT-5 [37], GLM-4-32B-0414 [18], and Qwen3-Coder-30B-A3B-Instruct [46], and randomly sample 100 test cases from OpenDesign, resulting in $100$ HTML pairs $\langle\pi_{\mathrm{ours}}(p),\pi_{\mathrm{others}}(p)\rangle$. We then perform the same human preference annotation as in Section 6. Results are shown in Figure 5. AesCoder achieves a win rate of over $55\%$ against mid- to large-scale open-source models (GLM-4-32B-0414 and Qwen3-Coder-30B-A3B-Instruct), and maintains a win rate near $50\%$ against state-of-the-art proprietary models (Claude Sonnet 4 and GPT-5), demonstrating the effectiveness of our agentic reward framework.