DeepForm: Reasoning Large Language Model for Communication System Formulation

Panlong Wu*, Ting Wang*, Yifei Zhong, Haoqi Zhang, Zitong Wang, and Fangxin Wang
*These authors contributed equally to the work and should be regarded as co-first authors.
Abstract

Communication system formulation is critical for advancing 6G and future wireless technologies, yet it remains a complex, expertise-intensive task. While Large Language Models (LLMs) offer potential, existing general-purpose models often lack the specialized domain knowledge and nuanced reasoning capabilities the task requires, and the high-quality, domain-specific training data needed to adapt a general LLM to communication system formulation is scarce. To bridge this gap, we introduce DeepForm, the first reasoning LLM designed specifically for automated communication system formulation. We also propose the Communication System Formulation Reasoning Corpus (CSFRC), the world's first large-scale, open-source dataset meticulously curated for this domain. Our framework employs a two-stage training strategy: first, Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) data to distill domain knowledge; second, a novel rule-based Reinforcement Learning (RL) algorithm, C-ReMax, based on ReMax, to cultivate advanced modeling capabilities and elicit sophisticated reasoning patterns such as self-correction and verification. Extensive experiments demonstrate that our model achieves state-of-the-art performance, significantly outperforming larger proprietary LLMs across diverse scenarios. We will release related resources to foster further research in this area after the paper is accepted.

Index Terms:
Large Language Model, Communication System Formulation

1 Introduction

As the cornerstone of modern information society, communication systems are undergoing a paradigm shift from 5G to 6G, playing a pivotal role in advancing emerging domains such as the Internet of Things (IoT), the Industrial Internet, and the Vehicular Internet. These domains rely heavily on ultra-reliable, low-latency, and high-capacity networks to enable seamless connectivity, real-time data exchange, and intelligent decision-making.

The complexity of 6G systems, characterized by their integration of artificial intelligence, terahertz communication, massive machine-type communications, network slicing, and many other cutting-edge technologies, necessitates an accurate modeling framework. Such models provide researchers and engineers with a precise analytical tool to simulate and optimize system behavior under diverse operational conditions. In the advancement of communication technology, communication system formulation has increasingly become a crucial link between theoretical design and practical implementation. By providing a precise framework for understanding and analyzing system behavior, accurate modeling not only enhances the characterization of communication system properties but also lays the groundwork for optimizing performance and facilitating real-world deployment.

However, developing an accurate communication system formulation presents numerous technical challenges. This type of modeling demands a robust understanding of communication principles and often integrates multiple mathematical theories, including information theory, queuing theory, and optimization theory, reflecting significant interdisciplinary complexity. Moreover, mainstream approaches to communication system formulation are currently fragmented across subfields. Distinct domains, such as integrated sensing and communication (ISAC), massive MIMO, intelligent reflecting surfaces (IRS), and millimeter-wave beamforming, have each evolved their own independent mathematical modeling methodologies. For instance, in ISAC systems, modelers are required to possess dual expertise in both communication signal processing and radar system design. This high degree of specialization has resulted in isolated "knowledge silos" within each subfield, hindering cross-disciplinary collaboration and integration. Consequently, overcoming these challenges requires not only a deep understanding of individual subfields but also efforts to bridge the gaps between them, fostering a more unified and cohesive approach to communication system formulation.

Recently, large language models (LLMs) such as GPT-4o [1], LLaMA [2], and Gemini [3] have been driving a paradigm shift in the field of artificial intelligence. These models, with billions of parameters, exhibit remarkable capabilities in contextual reasoning and knowledge emergence through self-supervised learning on large-scale data. For instance, GPT-4o, developed by OpenAI, not only offers complex semantic understanding, multi-turn logical reasoning, and creative content generation but also achieves expert-level performance in structured tasks such as code writing, mathematical proofs, and scientific discovery. In communication system formulation, LLMs offer new hope for improving efficiency by leveraging their pre-trained knowledge.

However, although current LLMs demonstrate outstanding intelligence in open-domain tasks, they still face a severe "capability gap" when applied to communication system formulation. Knowledge of communication systems is characterized by "deep specialization." General LLMs lack a comprehensive understanding of communication systems and fail to meet the demands of cutting-edge research. Furthermore, existing LLMs are deficient in the deep reasoning abilities specific to communication systems, which the complexity of system construction makes indispensable. As a result, small-scale open-source general LLMs perform poorly in communication system formulation. Large-scale LLMs, in turn, have high deployment costs: for instance, deploying an LLM like DeepSeek R1 requires 8 H100 GPUs, making it prohibitively expensive for many applications. Alternatively, relying on external LLM APIs for modeling introduces significant privacy and security risks, which hinder the widespread adoption of LLMs in communication system formulation.

This reveals the need to adapt a general LLM into a domain-specific LLM for communication system formulation, which requires fine-tuning on domain-specific data. However, this raises several challenges.

Insufficient high-quality communication system formulation training data. According to the scaling law [4], the quantity of high-quality training data for communication system formulation is essential to LLM performance. However, to date there are no large-scale, high-quality datasets in the field of communication system formulation.

Deep complexity of the communication system formulation task. Communication system formulation requires a deep understanding of communication systems as well as related mathematical knowledge, such as optimization theory and information theory. This requires injecting communication system reasoning ability into the LLM during training, which is highly difficult.

To fill this gap, we propose DeepForm, the first reasoning LLM for communication system formulation. By employing a data-driven approach, our framework significantly reduces the need for expert intervention, thereby lowering labor costs and shortening the modeling cycle. This ultimately enhances the overall efficiency of communication system formulation.

We construct the Communication System Formulation Reasoning Corpus (CSFRC), the world's first large-scale dataset specifically designed for complex communication system formulation. Curated from arXiv publications from 2015 to 2025 (10k+ samples related to communication system formulation), CSFRC addresses critical gaps in existing communication datasets through two sub-datasets. The first is the Reasoning-Centric Supervision Set. Its construction reflects the intrinsic complexity of communication system formulation: it is hard for an LLM to learn the required knowledge directly from final answers, and more suitable to learn it from detailed thinking processes. The second is the Rule-Centric Reinforcement Set, a sub-dataset of communication system formulation questions paired with their modeling formulations, designed so that the LLM can receive accurate rewards through rule-based RL.

Model training proceeds in two stages. In the first stage, we perform data distillation through supervised fine-tuning on the constructed Chain-of-Thought (CoT) data, giving the student LLM domain knowledge in communication system formulation. In the second stage, we propose a rule-based RL algorithm, C-ReMax, which builds on ReMax [5] and is inspired by the strong performance of rule-based RL on general mathematical problems [6]. The C-ReMax algorithm injects complex communication system formulation capability into the LLM: the LLM improves its policy by answering different types of communication system formulation questions and receiving feedback on whether each answer is correct. After training with this algorithm, the LLM acquires complex reasoning capability in communication system formulation, and behaviors such as self-correction, verification, and backtracking emerge.

In summary, our contributions are as follows:

  • To the best of our knowledge, we construct the world's first large-scale communication system formulation dataset, CSFRC, and will open-source it together with the related dataset construction code.

  • We are the first to propose and open source the reasoning LLM DeepForm for communication system formulation. Our training framework introduces a novel knowledge distillation approach and leverages the potential of rule-based RL in communication system formulation.

  • We conduct extensive experiments, and the results show that the trained model ultimately achieves state-of-the-art performance in communication system formulation.

2 Related Work

2.1 Domain-Specific LLMs in Other Areas

The powerful capabilities of LLMs have attracted many researchers to adapt general LLMs to other domains. In [7], the authors present HuatuoGPT, a medical LLM that combines distilled data from ChatGPT and real-world data from doctors to enhance its performance in medical consultations. In [8], the authors propose LawLLM, an intelligent legal system with legal reasoning and verifiable knowledge retrieval capabilities. The system is trained on a high-quality supervised fine-tuning dataset called Law-SFT. The authors also construct a comprehensive legal benchmark, Law-Eval, to evaluate intelligent legal systems from both objective and subjective dimensions. In [9], the authors present DISC-FinLLM, a Chinese financial large language model built using a Multiple Experts Fine-tuning Framework, which enhances general LLMs with capabilities in the finance area. In [10], the authors introduce OCEANGPT, the first large language model specialized in ocean science tasks, and present a novel instruction data generation framework called DOINSTRUCT. In [11], the authors present the StarWhisper Telescope system, an autonomous AI framework that integrates LLMs with specialized function calls and modular workflows to automate end-to-end astronomical observations. In [12], the authors present MutaPLM, a novel protein language modeling framework that explicitly models protein mutations for enhanced explanation and engineering capabilities through a protein delta network and cross-modal supervision. In [13], the authors present EcomGPT, a large language model fine-tuned on the EcomInstruct dataset, demonstrating superior zero-shot generalization capabilities in e-commerce tasks compared to ChatGPT.
In [14], the authors introduce MentaLLaMA, the first open-source instruction-following large language model series for interpretable mental health analysis on social media, along with the IMHI dataset, and demonstrate its effectiveness in correctness, explanation quality, and generalizability. In [15], the authors present DoctorGLM, a healthcare-focused language model fine-tuned from ChatGLM-6B using Chinese medical dialogue datasets and various techniques, achieving cost-effective deployment for medical purposes.

2.2 Applications of LLMs in Communication Systems

LLMs have recently garnered significant attention for their potential applications in various fields, including communication systems. In [16], the authors propose a wireless agent framework that adapts and enhances LLMs to address problems in wireless communication systems using prompt engineering, retrieval-augmented generation, and other techniques. Its practical applicability is demonstrated in network slicing management. In [17], the authors rethink generative semantic communication for multi-user systems in 6G and propose the M-GSC framework with a large language model as the shared knowledge base (SKB). They highlight three optimization strategies for M-GSC: extending the LLM-based SKB into a multi-agent system, offloading semantic encoding and decoding, and managing communication and computational resources. A case study provides preliminary validation of M-GSC's effectiveness in efficient decoding offloading. In [18], the authors propose CommLLM, a novel LLM-enhanced multi-agent system for 6G communications that integrates multi-agent data retrieval, collaborative planning, and evaluation-reflection modules. The system leverages natural language processing and LLMs to overcome challenges in 6G communication tasks by enabling self-learning, self-improvement, and efficient problem-solving. A case study on semantic communication demonstrates CommLLM's effectiveness in autonomously generating and refining communication models to meet specific design requirements. In [19], the authors propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting to address security concerns. It includes a Prompt Management Module, a Postprocessing Module, and a Task-specific Prompts Database to enhance the capabilities of open-source LLMs and integrate prompting methods.
In [20], a novel LLM-centric Intent Lifecycle (LC) management architecture is proposed for next-generation networks, enabling network configuration and management via natural language. In [21], the authors explore the integration of LLMs and graphs in dynamic networking, proposing a novel LLM-enabled graph framework for networking optimization and validating its effectiveness through a UAV networking case study. In [22], the authors evaluate the effectiveness of LLMs for intrusion detection in IoT networks, proposing a novel LLM-based framework that leverages techniques such as fine-tuning and embedding similarity. In [23], the authors present a novel framework for trustworthy zero-touch network and service management in 6G that integrates AI for anomaly detection, XAI for root cause analysis, and LLMs for generating user-friendly explanations and implementing corrective actions, demonstrating its efficiency through real-world experiments.

Figure 1: Total Workflow

3 Methodology

3.1 Data Preprocess Pipeline

We systematically retrieved arXiv articles containing wireless communication terminology through the arXiv API. The acquired PDF documents were converted to semantically structured markdown using the minerU library [24][25], employing LaTeX-aware parsing to preserve mathematical notation integrity:

$$f_{\text{parse}}: \text{PDF}(c_i) \xrightarrow{\texttt{minerU}} \mathcal{M}_i = \langle m_1, m_2, \ldots, m_K \rangle \tag{1}$$

where $\mathcal{M}_i$ represents the document object containing $K$ markdown elements with intact equation blocks ($m_k \in \mathcal{M}_i$).
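The retrieval step can be illustrated with a minimal sketch that only builds a query URL for the public arXiv API endpoint. The keyword list, the `build_arxiv_query` helper, and the paging values are hypothetical; the paper does not state the exact search terms, and the download and minerU conversion steps are not shown.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_arxiv_query(terms, start=0, max_results=100):
    """Build an arXiv API URL matching any of the given phrases in all fields."""
    # Hypothetical keyword list: the actual search terms are not given in the paper.
    clause = " OR ".join(f'all:"{t}"' for t in terms)
    params = {"search_query": clause, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_query(["wireless communication", "massive MIMO"], max_results=50)
print(url)
```

Paging through results would amount to calling the helper repeatedly with an increasing `start` value.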

The workflow of the CoT data processing pipeline for a single sample is shown in Fig. 2 and consists of the following parts.

3.1.1 Description-Formulation Pair Extraction

Technical reasoning pairs were extracted from system modeling sections (specifically targeting subsections labeled "System Model" or "Modeling Framework"). Through regular expression pattern matching, we identified contextual descriptions $d_j$ preceding mathematical formulations $f_j$ in the document flow:

$$\mathcal{D}_{\text{raw}} = \left\{ (d_j, f_j) \mid d_j \in \mathcal{C},\ f_j \in \mathcal{F},\ \text{pos}(d_j) < \text{pos}(f_j) \right\} \tag{2}$$

where $\mathcal{C}$ denotes contextual descriptions and $\mathcal{F}$ represents formal mathematical expressions. The condition $\text{pos}(d_j) < \text{pos}(f_j)$ indicates that, during the construction of the $\mathcal{D}_{\text{raw}}$ data pairs, all textual descriptions appearing before the position of $f_j$ are collected as the corresponding description $d_j$. In actual data processing, supplementary condition statements that follow the formulation, often introduced by words such as "where", are also appended to the description $d_j$.
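A minimal sketch of this pairing rule is given below. It assumes display equations are delimited by `$$ ... $$` in the converted markdown and that a trailing "where" clause ends at the first sentence boundary; both conventions, and the `extract_pairs` helper itself, are illustrative assumptions and not the authors' implementation.

```python
import re

# Matches display-math blocks delimited by $$ ... $$ (assumed markdown convention).
EQUATION = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def extract_pairs(section_md):
    """Return (description, formulation) pairs from one system-model section."""
    pairs = []
    matches = list(EQUATION.finditer(section_md))
    for idx, m in enumerate(matches):
        # d_j: all text located before the formulation f_j, i.e. pos(d_j) < pos(f_j).
        # Earlier equations are kept as context (an assumption).
        before = section_md[: m.start()]
        # Text between this equation and the next one (or the end of the section).
        nxt = matches[idx + 1].start() if idx + 1 < len(matches) else len(section_md)
        after = section_md[m.end():nxt].strip()
        # Append a trailing "where ..." clause, limited to its first sentence.
        where = ""
        if after.lower().startswith("where"):
            where = after.split(". ")[0].rstrip(".") + "."
        description = " ".join((before + " " + where).split())
        pairs.append((description, m.group(1).strip()))
    return pairs


sample = (
    "Consider a downlink system with K users. "
    "The received signal at user k is $$y_k = h_k^H x + n_k$$ "
    "where n_k is additive white Gaussian noise. "
    r"The achievable rate is $$R_k = \log_2(1 + \gamma_k)$$"
)
pairs = extract_pairs(sample)
for description, formulation in pairs:
    print(formulation, "<-", description)
```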

3.1.2 Prompt Compression Optimization

Leveraging the DeepSeek V3 API [6] for semantic compression, we applied iterative pruning to description elements:

$$\hat{d}_j = \text{V3\_compress}(d_j) \quad \text{s.t.} \quad \text{len}(\hat{d}_j) \leq L_{\text{max}} \tag{3}$$

where $L_{\text{max}} = 4096$ tokens is chosen to mitigate the computational overhead caused by excessively long prompts during training. This process yielded the compressed dataset $\mathcal{D}_{\text{comp}} = \{(\hat{d}_j, f_j)\}$.
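The length-constrained compression in Eq. (3) can be sketched as a loop that re-applies a compressor until the token budget is met. The `compress` and `count_tokens` callables are hypothetical stand-ins for the LLM compression call and the tokenizer, and the round limit and the choice to return `None` on failure are assumptions made for illustration.

```python
def compress_to_budget(description, compress, count_tokens,
                       max_tokens=4096, max_rounds=3):
    """Re-apply `compress` until the description fits within `max_tokens`.

    `compress` and `count_tokens` are hypothetical callables: the first stands
    in for an LLM compression request, the second for the model tokenizer.
    """
    current = description
    for _ in range(max_rounds):
        if count_tokens(current) <= max_tokens:
            return current
        current = compress(current)
    # Signal failure with None so the caller can decide to drop the sample.
    return current if count_tokens(current) <= max_tokens else None


def toy_compress(text):
    """Toy compressor: keep the first half of the words."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])


def toy_count(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return len(text.split())


long_text = " ".join(["word"] * 100)
short = compress_to_budget(long_text, toy_compress, toy_count, max_tokens=30)
failed = compress_to_budget(long_text, toy_compress, toy_count,
                            max_tokens=30, max_rounds=1)
```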

3.1.3 Data Augmentation and labeling

The construction of the final dataset employs a two-stage annotation and labeling procedure.

Candidate Generation. For each compressed prompt $\hat{d}_j$, reasoning paths $\hat{g}_j$ are sampled up to a maximum number of attempts $T_{\text{max}}$:

$$\mathcal{G}_j = \left\{ g_j^{(k)} \mid g_j^{(k)} \sim \pi_{\text{R1}}(\cdot \mid \hat{d}_j),\ k = 1, \ldots, T_{\text{max}} \right\}, \tag{4}$$

where $\pi_{\text{R1}}$ represents the generator policy utilizing rejection sampling with the DeepSeek R1 API [6] to evaluate whether a candidate answer aligns with the ground truth. The generation process terminates either when a candidate $g_j^{(k)}$ is confirmed to match the reference answer $f_j$ by the DeepSeek R1 model, or when the maximum number of attempts $T_{\text{max}}$ is reached. If no candidate is validated within $T_{\text{max}}$ trials, the sample proceeds to the Fallback Correction stage (if eligible).

Fallback Correction. When the candidate generation fails to produce the correct answer $f_j$, but the highest similarity among generated candidates satisfies $\text{Sim}(g_j^{(k)}, f_j) \geq \theta$ (Levenshtein similarity [26]):

$$\tilde{g}_j = \arg\max_{g \in \mathcal{G}_j} \log p_{\text{R1}}(f_j \mid g \oplus \hat{d}_j), \tag{5}$$

where $\oplus$ denotes context concatenation. In such cases, the reasoning path is completed based on the given question $\hat{d}_j$ and the correct answer $f_j$. These samples are labeled as "given-answer CoT samples" in the final dataset to indicate that the reasoning path was reconstructed using the fallback mechanism.

If $\max_k \text{Sim}(g_j^{(k)}, f_j) < \theta_2$, the sample is discarded without further processing.

The final dataset is constructed as follows:

$$\mathcal{D}_{\text{sft}} = \left\{ (\hat{d}_j, \tilde{g}_j, f_j, \text{label}) \mid j = 1, \ldots, M \right\}, \tag{6}$$

where $\text{label} \in \{\text{"generated"}, \text{"given-answer"}\}$ indicates whether the reasoning path was directly generated ("generated") or reconstructed via fallback correction ("given-answer").
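The two-stage labeling procedure can be sketched as follows. The `generate`, `judge`, and `reconstruct` callables are hypothetical stand-ins for the DeepSeek R1 calls, the selection in Eq. (5) is abstracted behind `reconstruct`, a single threshold `theta` is used for both the fallback and discard decisions, and the length normalization of the Levenshtein distance is an assumption; none of these details are confirmed by the text.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Levenshtein similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def label_sample(prompt, reference, generate, judge, reconstruct,
                 t_max=4, theta=0.6):
    """Return (reasoning, label), or None if the sample is discarded."""
    answers = []
    for _ in range(t_max):
        reasoning, answer = generate(prompt)
        if judge(answer, reference):  # rejection sampling: accept first match
            return reasoning, "generated"
        answers.append(answer)
    best = max(similarity(ans, reference) for ans in answers)
    if best >= theta:  # close enough: rebuild reasoning from the known answer
        return reconstruct(prompt, reference), "given-answer"
    return None  # too far from the reference: drop the sample


def exact(answer, ref):
    """Toy judge: exact string match instead of an LLM comparison."""
    return answer == ref


def rebuild(prompt, ref):
    """Toy stand-in for the answer-conditioned reasoning reconstruction."""
    return "rebuilt"


ref = "y = hx + n"
attempts = iter([("r1", "y = hx"), ("r2", "y = hx + n")])
accepted = label_sample("q", ref, lambda p: next(attempts), exact, rebuild)
fallback = label_sample("q", ref, lambda p: ("r", "y = hx + m"), exact, rebuild)
dropped = label_sample("q", ref, lambda p: ("r", "zzz"), exact, rebuild)
```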

The data composition is shown in Fig. 4, which provides an overview of the training data used in this study. As illustrated, the dataset encompasses a wide range of categories, reflecting comprehensive coverage of key topics in communication system formulation. Notably, no single named category dominates the dataset: the largest segment is "Others" (37.3%), which suggests a balanced distribution across various subdomains. This diversity is further reinforced by the presence of multiple specialized areas such as Integrated Sensing and Communication, Wireless and AI, Channel Modeling, and MIMO Technology, each contributing a meaningful proportion to the overall dataset.

Figure 2: CoT data detail

3.2 Stage One: Learning Complex Reasoning in Communication

Directly using an LLM to address communication-related formulation problems described in natural language often results in inaccuracies, primarily because the model cannot comprehensively capture implicit information. To address this issue, we enhance our model's capability to both define and solve such problems through supervised fine-tuning (SFT).

SFT is a parameter optimization paradigm in which a pre-trained large language model adapts to specific downstream tasks through labeled instruction-response pairs. The core components of SFT include the training dataset $\mathcal{D}_{\text{sft}}$ containing description and Chain-of-Thought (CoT) pairs $\{(x_i, y_i)\}$, and an optimization objective that minimizes the discrepancy between model predictions and ground-truth responses. This structured format enables explicit supervision for both reasoning trace generation and solution verification.

For a general LLM initially lacking domain knowledge in communication system formulation, SFT injects the domain knowledge by updating model parameters through gradient-based optimization. To train more efficiently, we freeze all the pre-trained parameters and only update LoRA [27] parameters. Because the weight updates of the LLM have a low-rank nature, we can insert low-rank matrices and update only those during training. Formally, for a pre-trained weight matrix $W_0$, LoRA approximates the parameter update $\Delta W$ as:

$$\Delta W = A \cdot B^{T} \quad \text{where} \quad A \in \mathbb{R}^{d \times r},\ B \in \mathbb{R}^{k \times r} \quad (r \ll \min(d, k)) \tag{7}$$

with $r$ denoting the rank hyperparameter. This decomposition reduces the trainable parameters from $\mathcal{O}(d \cdot k)$ to $\mathcal{O}(r(d + k))$.
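To make the parameter saving concrete, the short sketch below counts trainable entries for a dense update versus the factorization in Eq. (7) and forms $\Delta W = A B^{T}$ for a tiny example. The dimensions $d = k = 4096$ and $r = 8$ are illustrative assumptions, not the configuration used in the paper.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for a dense update vs. a rank-r factorization."""
    full = d * k            # entries of Delta W in R^{d x k}
    low_rank = r * (d + k)  # entries of A (d x r) plus B (k x r)
    return full, low_rank


def low_rank_update(A, B):
    """Compute Delta W = A @ B^T for nested-list matrices A (d x r), B (k x r)."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]


# Illustrative sizes only: one square projection matrix with rank 8.
full, low = lora_param_counts(d=4096, k=4096, r=8)
print(full, low, f"{low / full:.2%}")

# Tiny example with d=2, k=3, r=1: the result has shape d x k.
delta = low_rank_update([[1.0], [2.0]], [[3.0], [4.0], [5.0]])
print(delta)
```

Under these assumed sizes, the factorized update trains 65,536 values instead of 16,777,216, which is under half a percent of the dense count.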

In our framework, each sample in $\mathcal{D}_{\text{sft}}$ guides the LoRA-augmented model through two sequential phases:

1. Reasoning alignment phase: Given the problem description $\hat{d}_j$, the model generates intermediate reasoning tokens $\tilde{g}_j$ under cross-entropy supervision.
2. Solution verification phase: The model then produces the final solution $f_j$ while jointly predicting $\text{label}_j$ through multi-task learning.

The training process optimizes the model's conditional probability distribution $p_\theta(y \mid x)$ using a composite maximum likelihood estimation objective:

$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{sft}}} \left[ \log p_\theta(\tilde{g}, f, \text{label} \mid \hat{d}) \right], \tag{8}$$

where $\theta$ now explicitly incorporates the trainable LoRA parameters $A$ and $B$. This formulation enables gradient updates to focus on low-dimensional subspaces of the original weight matrices, preserving pre-trained knowledge while adapting to task-specific patterns. The optimization alternates between forward pass computation of token-level cross-entropy losses and backward propagation of gradients, typically using the AdamW [28] optimizer with linear learning rate decay.
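Because an autoregressive model factorizes the sequence probability by the chain rule, the log-likelihood in Eq. (8) reduces to a sum of token-level cross-entropy terms over the concatenated target sequence. The sketch below computes that quantity from raw logits in pure Python; the function names and the toy logits are illustrative assumptions, and a real training loop would rely on a deep learning framework.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(step_logits, target_ids):
    """Sum of token-level cross-entropy terms: -sum_t log p(y_t | x, y_<t)."""
    return -sum(log_softmax(logits)[tok]
                for logits, tok in zip(step_logits, target_ids))


def sft_loss(batch):
    """Monte Carlo estimate of L_SFT: mean sequence NLL over a batch."""
    return sum(sequence_nll(logits, targets) for logits, targets in batch) / len(batch)


# Toy check: uniform logits over a 4-token vocabulary, 3 target tokens.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
nll = sequence_nll(uniform, [0, 1, 2])
loss = sft_loss([(uniform, [0, 1, 2]), (uniform, [3, 3, 3])])
```

With uniform logits, each token contributes $\log 4$, so the toy sequence loss equals $3 \log 4$.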

3.3 Stage Two: Enhance Complex Reasoning with RL

Figure 3: RL workflow
Figure 4: Training dataset composition

We propose a rule-based RL algorithm, C-ReMax, based on ReMax [5].

The RL framework for C-ReMax is formulated as a Markov Decision Process defined by the tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where:

State Space 𝒮𝒮\mathcal{S}caligraphic_S: The state st𝒮subscript𝑠𝑡𝒮s_{t}\in\mathcal{S}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ caligraphic_S at the time step t𝑡titalic_t is represented by the concatenation of the input communication system formulation question 𝐪𝒱𝐪superscript𝒱\mathbf{q}\in\mathcal{V}^{*}bold_q ∈ caligraphic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and the historical token generated by the LLM (a1,,at1)subscript𝑎1subscript𝑎𝑡1(a_{1},...,a_{t-1})( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ), where 𝒱superscript𝒱\mathcal{V}^{*}caligraphic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT denotes the vocabulary space of the token sequences. Mathematically,

$$s_{t}=\mathbf{q}\oplus(a_{1},\dots,a_{t-1}) \tag{9}$$

where $\oplus$ denotes sequence concatenation.

Action Space $\mathcal{A}$: The action $a_{t}\in\mathcal{A}$ corresponds to selecting the next token from a discrete vocabulary $\mathcal{V}$, i.e., $\mathcal{A}=\mathcal{V}$. Actions are sampled autoregressively over $T$ time steps to generate the answer to the communication system formulation question, represented by $\mathbf{y}_{1:T}=(a_{1},\dots,a_{T})$.

Policy $\pi_{\theta}$: The LLM parameterizes the policy as a conditional probability distribution over tokens:

$$\pi_{\theta}(a_{t}|s_{t})=P_{\theta}(a_{t}|\mathbf{q},\mathbf{y}_{1:t-1})\quad\forall t\in[1,T], \tag{10}$$

where $\theta$ denotes the model parameters and $\mathbf{y}_{1:t-1}$ represents the previously generated tokens.

Reward $R$:

We use rules to assign the reward. The LLM receives the reward once it finishes answering the question: if the communication system formulation in its answer is correct, the reward is 1; otherwise, the reward is 0. In addition, we introduce a repetition reward, which is subtracted as a penalty to discourage repeated content and make the answer more readable.

The reward is defined as

$$R(\mathbf{a})=R_{a}(\mathbf{a})-R_{rp}(\mathbf{a}) \tag{11}$$

where $R_{a}(\mathbf{a})$ represents the accuracy reward and $R_{rp}(\mathbf{a})$ represents the repetition reward. The accuracy reward is defined as:

$$R_{a}(\mathbf{a})=\begin{cases}1,&\text{if }\mathbf{a}\text{ is equivalent to }\mathbf{a}_{true}\\0,&\text{otherwise}\end{cases} \tag{12}$$

where $\mathbf{a}_{true}$ is the correct modeling formulation and $\mathbf{a}$ is the LLM-generated answer.

The repetition reward is defined as:

$$R_{rp}(\mathbf{a})=\beta\left(1-\frac{1}{1+P(\mathbf{a})}\right) \tag{13}$$

where

$$P(\mathbf{a})=\frac{1}{\lambda|\mathbf{a}|}\sum_{n=n_{\min}}^{n_{\max}}\sum_{g_{n}\in\mathcal{G}_{n}(\mathbf{a})}\mathbf{1}[c_{\mathbf{a}}(g_{n})>1]\cdot c_{\mathbf{a}}(g_{n})^{2} \tag{14}$$

where $\lambda$ denotes the penalty coefficient, $\{n_{\min},n_{\max}\}$ denotes the range of n-gram lengths to be counted, and $\beta$ denotes the upper bound of the repetition reward. $\mathbf{a}$ denotes the generated token sequence, $\mathcal{G}_{n}(\mathbf{a})$ denotes all n-grams of length $n$ in $\mathbf{a}$, and $c_{\mathbf{a}}(g_{n})$ denotes the number of occurrences of the n-gram $g_{n}$ in $\mathbf{a}$. $\mathbf{1}[\cdot]$ is an indicator function, so only n-grams that occur more than once contribute to the sum.
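The repetition reward of Eqs. (13) and (14) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the default values of `n_min`, `n_max`, `lam`, and `beta`, and the whitespace tokenization are assumptions, and $\mathcal{G}_{n}(\mathbf{a})$ is interpreted here as the set of distinct n-grams.

```python
from collections import Counter

def repetition_penalty_score(tokens, n_min, n_max, lam):
    """P(a) of Eq. (14): squared counts of repeated n-grams over lam * |a|."""
    if not tokens:
        return 0.0
    total = 0.0
    for n in range(n_min, n_max + 1):
        # Count every n-gram of length n in the sequence.
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        # Assumption: G_n(a) holds distinct n-grams, so each repeated n-gram
        # contributes its squared count exactly once.
        total += sum(c * c for c in counts.values() if c > 1)
    return total / (lam * len(tokens))

def repetition_reward(tokens, n_min=2, n_max=4, lam=1.0, beta=1.0):
    """R_rp(a) of Eq. (13): maps P(a) in [0, inf) into [0, beta)."""
    p = repetition_penalty_score(tokens, n_min, n_max, lam)
    return beta * (1.0 - 1.0 / (1.0 + p))

# Toy inputs (assumed, whitespace-tokenized for simplicity).
clean = "maximize the sum rate subject to a power constraint".split()
loopy = "the rate the rate the rate the rate".split()

clean_penalty = repetition_reward(clean)
loopy_penalty = repetition_reward(loopy)
```

Under these assumed settings, the answer without repeated n-grams has $P(\mathbf{a})=0$ and therefore no penalty, while the looping answer accumulates squared counts of 25, 18, and 13 for $n=2,3,4$, giving $P(\mathbf{a})=56/8=7$ and $R_{rp}(\mathbf{a})=0.875$. Squaring the counts makes long loops far more costly than an occasional repeated phrase.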

One critical challenge in communication system formulation arises from its inherent complexity, which induces high gradient variance and hinders the LLM's ability to learn effectively from rule-based feedback signals. To reduce the gradient variance, C-ReMax contrasts a stochastic rollout $\text{seq}\sim\pi_{\theta}(\cdot|x)$ with a greedy rollout $\text{seq}_{\text{max}}=\arg\max_{a_{1:T}}\pi_{\theta}(a_{1:T}|x)$. The reward signal is defined as:

$$\hat{r}(x,\text{seq})=r_{m}(x,\text{seq})-r_{m}(x,\text{seq}_{\text{max}}),$$

where $r_{m}(\cdot)$ denotes the reward function, which in our setting is the rule-based reward of Eq. (11). The reward of the greedy rollout acts as a baseline that approximates the expected reward $\mathbb{E}_{a_{1:T}\sim\pi_{\theta}(\cdot|x)}[r(x,a_{1:T})]$, reducing variance while preserving the gradient direction.
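The effect of the greedy baseline is easiest to see with a binary accuracy reward. The sketch below is a simplified illustration under stated assumptions: it omits the repetition term, and `normalized_match` is a hypothetical stand-in for the equivalence rule, which the text does not specify.

```python
def accuracy_reward(answer, reference, is_equivalent):
    """Binary accuracy reward: 1 if the answer matches the reference, else 0."""
    return 1.0 if is_equivalent(answer, reference) else 0.0

def baseline_adjusted_reward(sampled, greedy, reference, is_equivalent):
    """r_hat = r(x, seq) - r(x, seq_max): sampled reward minus greedy reward."""
    return (accuracy_reward(sampled, reference, is_equivalent)
            - accuracy_reward(greedy, reference, is_equivalent))

def normalized_match(a, b):
    """Hypothetical equivalence rule: equality after removing all whitespace."""
    return "".join(a.split()) == "".join(b.split())

# Toy strings standing in for formulations (assumed for illustration).
reference = "max sum_k log2(1 + SINR_k)"
right = "max  sum_k log2(1+SINR_k)"
wrong = "min sum_k p_k"

r_hat_better = baseline_adjusted_reward(right, wrong, reference, normalized_match)
r_hat_same = baseline_adjusted_reward(right, right, reference, normalized_match)
r_hat_worse = baseline_adjusted_reward(wrong, right, reference, normalized_match)
```

With a 0/1 reward, the adjusted signal is $+1$ only when the sampled answer is correct and the greedy answer is not, $-1$ in the opposite case, and $0$ when both agree, so a sample contributes to the update only when exploration changes the outcome.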

We also incorporate a KL divergence constraint to prevent catastrophic forgetting of previously learned knowledge:

$$\mathcal{L}_{\text{KL}}=\sum_{t=1}^{T}\mathbb{E}_{x,a_{1:t}}\left[D_{\text{KL}}\left(\pi_{\theta}(\cdot|x,a_{1:t-1})\parallel\pi_{\text{REF}}(\cdot|x,a_{1:t-1})\right)\right], \tag{15}$$

where $\pi_{\text{REF}}$ denotes the reference model.

The final objective combines reward maximization and KL regularization:

$$\mathcal{L}(\theta)=-\mathbb{E}_{x\sim\rho,\text{seq}\sim\pi_{\theta}}\left[\log\pi_{\theta}(\text{seq}|x)\cdot\hat{r}(x,\text{seq})\right]+\lambda\mathcal{L}_{\text{KL}}(\theta), \tag{16}$$

where $\lambda$ balances reward and regularization. Gradients are computed via:

$$\nabla_{\theta}\mathcal{L}(\theta)=-\mathbb{E}_{x,\text{seq}}\left[\nabla_{\theta}\log\pi_{\theta}(\text{seq}|x)\cdot\hat{r}(x,\text{seq})\right]+\lambda\nabla_{\theta}\mathcal{L}_{\text{KL}}(\theta). \tag{17}$$
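To make Eqs. (16) and (17) concrete, the following toy sketch evaluates a single-sample gradient for a one-step softmax policy, i.e., a sequence of length $T=1$ whose parameters are the logits themselves. It is a didactic reduction, not the training code: the function names, the logits, and the reward table are assumptions, and $\hat{r}$ is treated as a constant when differentiating.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def c_remax_style_gradient(theta, ref_logits, sampled, rewards, kl_coef):
    """Single-sample gradient of the objective for a one-step softmax policy."""
    pi = softmax(theta)
    ref = softmax(ref_logits)
    greedy = pi.index(max(pi))                    # greedy rollout (argmax action)
    r_hat = rewards[sampled] - rewards[greedy]    # baseline-adjusted reward
    kl = kl_divergence(pi, ref)
    grad = []
    for k in range(len(theta)):
        # d/d theta_k of -log pi(sampled) * r_hat, with r_hat held constant.
        policy_term = -((1.0 if k == sampled else 0.0) - pi[k]) * r_hat
        # d/d theta_k of KL(pi || ref) for a softmax parameterization.
        kl_term = pi[k] * (math.log(pi[k] / ref[k]) - kl)
        grad.append(policy_term + kl_coef * kl_term)
    return grad, r_hat

theta = [0.0, 1.0, 0.5]        # current policy logits (toy values)
ref_logits = [0.0, 1.0, 0.5]   # reference policy equals the current policy
rewards = [1.0, 0.0, 0.0]      # only action 0 counts as a correct answer
grad, r_hat = c_remax_style_gradient(theta, ref_logits, sampled=0,
                                     rewards=rewards, kl_coef=0.001)
```

In this toy case the greedy action is incorrect and the sampled action is correct, so $\hat{r}=1$ and a gradient descent step raises the logit of the sampled action. If the sampled action coincides with the greedy one, $\hat{r}=0$ and only the KL term remains, which vanishes here because the policy equals the reference.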

3.3.1 Dataset construction for RL

4 Experiments

4.1 Experimental Setup

The experiment was conducted on two servers: one with 8 A100 GPUs and another with 8 A6000 GPUs, both running Ubuntu 20.04. All training was performed on the CSFRC dataset. The first stage utilized 5k data samples, while the second stage used 1.2k data samples.

Using the proposed method, we trained our models based on Qwen2.5-7B-Instruct [29]. In Stage 1, the models were fine-tuned on the SFT dataset for 5 epochs with a learning rate of $5\times 10^{-6}$ and a batch size of 32. We employed LoRA tuning with a rank of 256 and used a cosine learning rate schedule with a warmup ratio of 0.05. In Stage 2, we applied reinforcement learning (RL) with a learning rate of $2\times 10^{-6}$, a batch size of 64, and a total of 5 training epochs. The KL divergence coefficient was set to 0.001.
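For readers unfamiliar with the Stage 1 schedule, the sketch below shows one common form of a cosine learning rate schedule with warmup. Only the peak learning rate ($5\times 10^{-6}$) and the warmup ratio (0.05) come from the setup above; the linear warmup shape, the decay to zero, and the total of 1000 steps are assumptions made purely for illustration.

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale the learning rate linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000   # hypothetical; the number of optimizer steps is not stated
PEAK_LR = 5e-6       # Stage 1 learning rate from the setup
WARMUP_RATIO = 0.05  # warmup ratio from the setup

lr_mid_warmup = cosine_lr_with_warmup(25, TOTAL_STEPS, PEAK_LR, WARMUP_RATIO)
lr_peak = cosine_lr_with_warmup(50, TOTAL_STEPS, PEAK_LR, WARMUP_RATIO)
lr_final = cosine_lr_with_warmup(TOTAL_STEPS, TOTAL_STEPS, PEAK_LR, WARMUP_RATIO)
```

With these assumed values, the first 50 steps ramp up to the peak rate and the remaining 950 steps follow the cosine curve down to zero.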

Figure 5: (a) Performance comparison of different RL algorithms. (b) Performance comparison of SFT with and without CoT training data. (c) Performance comparison of RL with different datasets.

4.2 Experimental Results

Performance comparison with other LLMs

Fig. 7 shows the accuracy of different LLMs on the test dataset as a bar chart, where the x-axis lists six LLMs and the y-axis gives their accuracy. The models evaluated are "DeepForm (Ours)," "DeepSeek R1," "Qwen2.5-7B Instruct," "InternLM2.5 20B-Chat," "GPT-4o Mini," and "GLM-9B." DeepForm achieves the highest accuracy among all the LLMs with a model size of only 7B. It significantly outperforms the second-best LLM, DeepSeek R1, whose 671B parameters make it about 96 times larger than our model.

Performance comparison of different RL algorithms
We compare our algorithm with other RL algorithms to further verify its effectiveness. More specifically, we compare it with DPO [30] and KTO [31].

Because these algorithms learn from example answers, we construct their training datasets as follows.

Data construction of DPO. For DPO, we generate pairwise preferences over reasoning paths using DeepSeek R1:

$$\mathcal{D}_{\text{DPO}}=\left\{(\hat{d}_{j},g_{j}^{\text{chosen}},g_{j}^{\text{rejected}})\mid j=1,\dots,N\right\},$$

where $g_{j}^{\text{chosen}}$ and $g_{j}^{\text{rejected}}$ are two reasoning paths for the same question $\hat{d}_{j}$, annotated by DeepSeek R1 and Doubao1.5pro (developed by ByteDance) as the preferred and dispreferred outputs, respectively.

Data construction of KTO. KTO needs only a single generated reasoning path per sample, together with a label for that sample. We generate the reasoning path and annotate the single-sample preference using DeepSeek R1:

$$\mathcal{D}_{\text{KTO}}=\left\{(\hat{d}_{j},g_{j},s_{j})\mid j=1,\dots,K\right\},$$

where $s_{j}\in\{0,1\}$ is a binary label indicating whether the reasoning path $g_{j}$ is deemed acceptable ($s_{j}=1$) or not ($s_{j}=0$).
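The two dataset definitions differ only in the shape of each record, which the following sketch makes explicit. The class and field names are hypothetical and the strings are placeholders, not samples from the actual datasets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DPOExample:
    """One element of D_DPO: a question with a preferred and a dispreferred path."""
    question: str   # compressed description d_hat_j
    chosen: str     # g_j^chosen
    rejected: str   # g_j^rejected

@dataclass(frozen=True)
class KTOExample:
    """One element of D_KTO: a question, a single path, and a binary label."""
    question: str         # compressed description d_hat_j
    reasoning_path: str   # g_j
    acceptable: bool      # s_j = 1 (True) or s_j = 0 (False)

def acceptable_fraction(dataset):
    """Share of KTO records labeled acceptable, a quick check of label balance."""
    return sum(ex.acceptable for ex in dataset) / len(dataset)

# Placeholder records (illustrative only).
dpo_data = [DPOExample("question text", "preferred path", "dispreferred path")]
kto_data = [
    KTOExample("question text", "an acceptable path", True),
    KTOExample("question text", "an unacceptable path", False),
]
balance = acceptable_fraction(kto_data)
```

DPO needs two responses per question so that one can be ranked above the other, whereas KTO can use each labeled response independently.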

Ablation Study. We conducted an ablation study to analyze the impact of SFT and RL. The results are shown in Fig. 6. The LLM achieves an accuracy of $65.1\%$ after the SFT stage, which further increases to $71.4\%$ after the RL stage, a gain of 6.3 percentage points. These results show that both the SFT stage and the RL stage play a critical role in improving the LLM's ability in communication system formulation.

Figure 6: Ablation Study
Figure 7: LLM comparison

Performance comparison of SFT with and without CoT training data.

In this section, we conduct supervised fine-tuning (SFT) experiments on the Qwen2.5-7B-Instruct model using two distinct datasets to validate the necessity of the proposed chain-of-thought (CoT) data augmentation methodology. For the non-CoT-enhanced dataset, we employ $\mathcal{D}_{\text{comp}}=\{(\hat{d}_{j},f_{j})\}$, which pairs each compressed description $\hat{d}_{j}$ with its mathematical formulation $f_{j}$. Conversely, the CoT-enhanced dataset $\mathcal{D}_{\text{cot}}=\left\{(\hat{d}_{j},\tilde{g}_{j},f_{j},\text{label})\mid j=1,\dots,M\right\}$ is used for comparison to verify the effectiveness of the CoT augmentation approach. Both datasets are used to fine-tune the Qwen2.5-7B-Instruct model for 270 training steps under identical hyperparameter configurations. Average accuracy is then evaluated on the same validation dataset using the training prompts. The experimental results are summarized in Fig. 5(b).

As shown in Fig. 5(b), directly using the non-CoT-enhanced dataset not only fails to improve the model's answer accuracy but also somewhat diminishes its reasoning ability. We believe that for complex modeling problems, the reasoning process is a crucial component. Merely providing questions and answers can mislead the model's learning of modeling to some extent. These results demonstrate that sample quality plays a critical role in communication domain modeling, and that CoT data augmentation is essential for effective knowledge distillation in our dataset configuration.

Performance comparison with different difficulty level RL datasets. We compare the performance of the LLM after RL using datasets of varying difficulty. Two datasets were constructed: an easy dataset and a hard dataset. We trained the LLMs in three different ways: exclusively on the easy dataset, exclusively on the hard dataset, and through curriculum learning by first training on the easy dataset and then on the hard dataset. The results, shown in Fig. 5(c), indicate that the LLM trained only on the easy dataset achieved the highest accuracy. The LLMs trained on the hard dataset and through curriculum learning showed similar, but lower, accuracy levels. This discrepancy is likely due to the hard dataset causing instability during RL, as the LLM rarely receives positive rewards because it often fails to provide correct answers, making it difficult for the model to learn from feedback.

Figure 8: Comparison of performance with different reward settings

Performance comparison of different reward settings. In this part, we compare the performance of the LLM under different reward settings in RL.

We first vary the reward assigned when the LLM provides a wrong answer, comparing a reward of $0$ with a reward of $-1$. The two reward settings are as follows.

$$R_{a}(\mathbf{a})=\begin{cases}1,&\text{if }\mathbf{a}=\mathbf{a}_{\text{true}}\\0,&\text{otherwise}\end{cases} \tag{18a}$$
$$R_{a}(\mathbf{a})=\begin{cases}1,&\text{if }\mathbf{a}=\mathbf{a}_{\text{true}}\\-1,&\text{otherwise}\end{cases} \tag{18b}$$

In addition, we test the effect of adding a format reward. To address both semantic accuracy and format compliance, we use a composite reward function combining:

- Accuracy reward: Measures correctness of the response.

- Format reward: Ensures adherence to format requirement.

The format reward is defined as:

$$R_{f}(\mathbf{a})=\begin{cases}1,&\text{if format is correct}\\-0.1,&\text{otherwise}.\end{cases} \tag{19}$$

and the combined reward is defined as:

$$R(\mathbf{a})=R_{f}(\mathbf{a})+R_{a}(\mathbf{a}) \tag{20}$$
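The reward variants of Eqs. (18a) to (20) can be combined in a small sketch. The format rule is not specified in the text, so the `<answer>...</answer>` tag convention below is a hypothetical stand-in, and exact string equality stands in for the equivalence check; all function names are likewise assumptions.

```python
import re

# Assumed format rule: the final formulation is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(text):
    """Return the tagged answer, or None if the assumed format is violated."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).strip() if match else None

def format_reward(text):
    """R_f of Eq. (19): 1 if the format is correct, -0.1 otherwise."""
    return 1.0 if extract_answer(text) is not None else -0.1

def accuracy_reward(text, reference, wrong_value=0.0):
    """R_a: wrong_value=0.0 gives Eq. (18a); wrong_value=-1.0 gives Eq. (18b)."""
    return 1.0 if extract_answer(text) == reference else wrong_value

def combined_reward(text, reference, wrong_value=0.0):
    """R of Eq. (20): format reward plus accuracy reward."""
    return format_reward(text) + accuracy_reward(text, reference, wrong_value)

# Toy strings (assumed for illustration).
reference = "max sum_k R_k"
good = "Reasoning steps... <answer>max sum_k R_k</answer>"
wrong = "Reasoning steps... <answer>min sum_k p_k</answer>"
unformatted = "max sum_k R_k"

reward_good = combined_reward(good, reference)
reward_wrong = combined_reward(wrong, reference)
reward_wrong_penalized = combined_reward(wrong, reference, wrong_value=-1.0)
reward_unformatted = combined_reward(unformatted, reference)
```

Under these assumptions, a well-formatted correct answer earns 2, a well-formatted wrong answer earns 1 under Eq. (18a) or 0 under Eq. (18b), and an answer that violates the format earns $-0.1$ even when its content is correct, because the answer cannot be extracted.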

The performance comparison under different reward settings is illustrated in Fig. 8. The results indicate that changing the wrong-answer reward from $0$ to $-1$ leads to an accuracy drop of 7.7%. This may be attributed to the LLM frequently receiving negative rewards, which can cause instability during training. Additionally, including the format reward in the reward function results in a 2% decrease in performance. This decline could be due to the LLM reducing its exploration to avoid negative format rewards.

5 Conclusion

In this work, we present a comprehensive framework for adapting LLMs to the domain of communication system formulation, addressing two critical challenges: the lack of high-quality training data and the deep complexity of the task itself. Our contributions are threefold. First, we introduce CSFRC, the first large-scale open-source dataset for communication system formulation, which we will release to support further research. Second, we develop a novel two-stage LLM training framework tailored to communication system formulation, consisting of a CoT data distillation stage and a rule-based RL stage. Third, we train DeepForm, the first reasoning LLM for communication system formulation, and conduct extensive experiments to demonstrate its capability.

References

  • [1] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
  • [2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [3] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
  • [4] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
  • [5] Z. Li, T. Xu, Y. Zhang, Z. Lin, Y. Yu, R. Sun, and Z.-Q. Luo, “Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models,” arXiv preprint arXiv:2310.10505, 2023.
  • [6] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.
  • [7] H. Zhang, J. Chen, F. Jiang, F. Yu, Z. Chen, G. Chen, J. Li, X. Wu, Z. Zhiyi, Q. Xiao, X. Wan, B. Wang, and H. Li, “HuatuoGPT, towards taming language model to be a doctor,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 10859–10885. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.725/
  • [8] S. Yue, S. Liu, Y. Zhou, C. Shen, S. Wang, Y. Xiao, B. Li, Y. Song, X. Shen, W. Chen et al., “Lawllm: Intelligent legal system with legal reasoning and verifiable retrieval,” in International Conference on Database Systems for Advanced Applications, 2024, pp. 304–321.
  • [9] W. Chen, Q. Wang, Z. Long, X. Zhang, Z. Lu, B. Li, S. Wang, J. Xu, X. Bai, X. Huang et al., “Disc-finllm: A chinese financial large language model based on multiple experts fine-tuning,” CoRR, 2023.
  • [10] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and H. Chen, “Oceangpt: A large language model for ocean science tasks,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 3357–3372.
  • [11] C. Wang, X. Hu, Y. Zhang, X. Chen, P. Du, Y. Mao, R. Wang, Y. Li, Y. Wu, H. Yang et al., “Starwhisper telescope: Agent-based observation assistant system to approach ai astrophysicist,” arXiv preprint arXiv:2412.06412, 2024.
  • [12] Y. Luo, Z. Nie, M. Hong, S. Zhao, H. Zhou, and Z. Nie, “Mutaplm: Protein language modeling for mutation explanation and engineering,” Advances in Neural Information Processing Systems, vol. 37, pp. 79783–79818, 2024.
  • [13] Y. Li, S. Ma, X. Wang, S. Huang, C. Jiang, H.-T. Zheng, P. Xie, F. Huang, and Y. Jiang, “Ecomgpt: Instruction-tuning large language models with chain-of-task tasks for e-commerce,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 18582–18590.
  • [14] K. Yang, T. Zhang, Z. Kuang, Q. Xie, J. Huang, and S. Ananiadou, “Mentallama: interpretable mental health analysis on social media with large language models,” in Proceedings of the ACM Web Conference 2024, 2024, pp. 4489–4500.
  • [15] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, L. Huang, Q. Wang, and D. Shen, “Doctorglm: Fine-tuning your chinese doctor is not a herculean task,” arXiv preprint arXiv:2304.01097, 2023.
  • [16] J. Tong, W. Guo, J. Shao, Q. Wu, Z. Li, Z. Lin, and J. Zhang, “Wirelessagent: Large language model agents for intelligent wireless networks,” arXiv preprint arXiv:2505.01074, 2025.
  • [17] W. Yang, Z. Xiong, S. Mao, T. Q. S. Quek, P. Zhang, M. Debbah, and R. Tafazolli, “Rethinking generative semantic communication for multi-user systems with large language models,” IEEE Wireless Communications, pp. 1–9, 2025.
  • [18] F. Jiang, Y. Peng, L. Dong, K. Wang, K. Yang, C. Pan, D. Niyato, and O. A. Dobre, “Large language model enhanced multi-agent systems for 6g communications,” IEEE Wireless Communications, vol. 31, no. 6, pp. 48–55, 2024.
  • [19] B. Xiao, B. Kantarci, J. Kang, D. Niyato, and M. Guizani, “Efficient prompting for llm-based generative internet of things,” IEEE Internet of Things Journal, vol. 12, no. 1, pp. 778–791, 2025.
  • [20] A. Mekrache, A. Ksentini, and C. Verikoukis, “Intent-based management of next-generation networks: an llm-centric approach,” IEEE Network, vol. 38, no. 5, pp. 29–36, 2024.
  • [21] G. Sun, Y. Wang, D. Niyato, J. Wang, X. Wang, H. V. Poor, and K. B. Letaief, “Large language model (llm)-enabled graphs in dynamic networking,” IEEE Network, pp. 1–1, 2024.
  • [22] E. Nwafor, U. Baskota, M. S. Parwez, J. Blackstone, and H. Olufowobi, “Evaluating large language models for enhanced intrusion detection in internet of things networks,” in GLOBECOM 2024 - 2024 IEEE Global Communications Conference, 2024, pp. 3358–3363.
  • [23] A. Mekrache, M. Mekki, A. Ksentini, B. Brik, and C. Verikoukis, “On combining xai and llms for trustworthy zero-touch network and service management in 6g,” IEEE Communications Magazine, vol. 63, no. 4, pp. 154–160, 2025.
  • [24] B. Wang, C. Xu, X. Zhao, L. Ouyang, F. Wu, Z. Zhao, R. Xu, K. Liu, Y. Qu, F. Shang, B. Zhang, L. Wei, Z. Sui, W. Li, B. Shi, Y. Qiao, D. Lin, and C. He, “Mineru: An open-source solution for precise document content extraction,” 2024. [Online]. Available: https://arxiv.org/abs/2409.18839
  • [25] C. He, W. Li, Z. Jin, C. Xu, B. Wang, and D. Lin, “Opendatalab: Empowering general artificial intelligence with open datasets,” arXiv preprint arXiv:2407.13773, 2024.
  • [26] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, vol. 10, pp. 707–710, 1966.
  • [27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022.
  • [28] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations (ICLR). OpenReview.net, 2019.
  • [29] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
  • [30] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023.
  • [31] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela, “Kto: Model alignment as prospect theoretic optimization,” arXiv preprint arXiv:2402.01306, 2024.