Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

WenyanGPT: A Large Language Model for Classical Chinese Tasks

Xinyu Yao1, Mengdi Wang1, Bo Chen1,2∗, Xiaobing Zhao1,2

1 School of Information Engineering, Minzu University of China
2 National Language Resource Monitoring and Research Center of Minority Languages
{xinyu230101, wang_mengdi_wmd}@[Link], chenbomuc@[Link], nmzxb_cn@[Link]

Abstract

Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model’s training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.

Punctuation
问:请给这段古汉语文本添加标点符号:汝忆否四五年前某夕吾尝语曰与使吾先死也无宁汝先吾而死
答:汝忆否?四五年前某夕,吾尝语曰:“与使吾先死也,无宁汝先吾而死。”
Question: Please add punctuation marks to this ancient Chinese text: 汝忆否四五年前某夕吾尝语曰与使吾先死也无宁汝先吾而死
Answer: 汝忆否?四五年前某夕,吾尝语曰:“与使吾先死也,无宁汝先吾而死。”

NER
问:识别古汉语文本中的人物、地名、时间、官职名等实体:二十三年,罢宣慰司,立辽阳行省,以亦力撒合为参知政事。
答:'人物': ['亦力撒合'], '地点': [], '时间': ['二十三年'], '官职': ['宣慰司', '参知政事']
Question: Identify entities such as characters, place, time, and official positions in ancient Chinese texts: 二十三年,罢宣慰司,立辽阳行省,以亦力撒合为参知政事。
Answer: 'characters': ['亦力撒合'], 'place': [], 'time': ['二十三年'], 'official positions': ['宣慰司', '参知政事']

Translation
问:将这段古汉语文本翻译成现代汉语:古之学者必有师。
答:古代求学的人必定有老师。
Question: Translate this ancient Chinese text into modern Chinese: 古之学者必有师。
Answer: 古代求学的人必定有老师。

Figure 1: Examples of tasks from WenyanGPT. The model demonstrates advanced knowledge in Classical Chinese and shows strong performance in both Classical Chinese understanding and generation tasks.
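The prompt-response examples in Figure 1 all follow a uniform 问/答 (question/answer) pattern. As a rough illustration, such a task example can be held as an (instruction, input, output) record and rendered into a prompt; the field names below are our own choice for illustration, not the paper's released data schema.

```python
# Hypothetical sketch of a Figure 1-style task record; the field names
# ("instruction", "input", "output") are illustrative, not the paper's schema.

def format_prompt(record: dict) -> str:
    """Render a record in the 问/答 (question/answer) style of Figure 1."""
    return f"问:{record['instruction']}{record['input']}\n答:{record['output']}"

# Translation example taken verbatim from Figure 1.
translation_example = {
    "instruction": "将这段古汉语文本翻译成现代汉语:",
    "input": "古之学者必有师。",
    "output": "古代求学的人必定有老师。",
}

print(format_prompt(translation_example))
```

The same record shape covers all six tasks in the paper, which is what makes joint multitask fine-tuning straightforward.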
1 Introduction
Classical Chinese is an important component of Chinese culture, with a long history and profound cultural heritage. It is essential for understanding traditional Chinese culture. With the rapid development of artificial intelligence technology, intelligent processing of ancient Chinese texts offers a new solution for the preservation and inheritance of Classical Chinese. Modern technologies, such as digitalization and natural language processing (NLP), can efficiently preserve and spread traditional culture while fostering deeper and more innovative academic research. Maximizing the potential of AI in processing Classical Chinese texts has become an urgent need for cultural inheritance and academic development.

Early research in Classical Chinese language processing focused on tasks such as punctuation, word segmentation, part-of-speech tagging (POS), named entity recognition (NER), and translation. These tasks initially relied on traditional machine learning methods, such as Hidden Markov Models (HMMs) for POS [Huang et al., 2002] and Conditional Random Fields (CRFs) for punctuation [Huang et al., 2010] and NER [Yuan et al., 2019; Li, 2018]. In deep learning, RNNs, LSTMs, GRUs, and attention mechanisms have been applied to various tasks, including couplet generation and classical poetry generation [Yan et al., 2016; Yi et al., 2017], and punctuation and POS using BiLSTM-CRF models [Wang et al., 2019; Cheng et al., 2020; Zhang et al., 2023b; Chang et al., 2024]. As Transformer architectures emerged [Vaswani et al., 2017], studies began using large-scale parallel corpora to train models for translating Classical Chinese into Modern Chinese [Liu et al., 2018] and for generating classical poetry [Huang et al., 2020]. The introduction of pre-trained models including BERT [Kenton and Toutanova, 2019] and GPT [Radford and Narasimhan, 2018] provided new opportunities for intelligent Classical Chinese processing. Some research integrated ancient Chinese texts into the training data of general pre-trained models, improving the

∗ Corresponding author.
processing performance of Classical Chinese compared to typical pre-trained models [Tian et al., 2020; Wang et al., 2022; Wang et al., 2023a; Liu et al., 2023a]. Other studies used Classical Chinese corpora to continue pre-training and fine-tuning large language models, aiming to build conversational models for Classical Chinese [Zhang et al., 2024; Yang et al., 2024b; Cao et al., 2023; Cao et al., 2024].

However, challenges remain in Classical Chinese processing. Different tasks require training specialized models, and no effective universal model exists. Additionally, there is a lack of standardized evaluation benchmarks in this field, with existing evaluation tasks, datasets, and metrics being inconsistent, making it difficult to perform cross-task comparisons and systematic assessments of model performance.

To address these issues, we propose WenyanGPT, a large language model for Classical Chinese. Some examples of WenyanGPT are shown in Figure 1. We also construct the largest pre-training corpus for continued pre-training, enhancing the model’s domain adaptability. Additionally, we propose a framework for generating domain-specific instruction data for supervised fine-tuning in the development of WenyanGPT. To promote research in the intelligent processing of Classical Chinese, we build the WenyanBENCH evaluation dataset and conduct extensive experiments for detailed analysis. The main contributions are as follows:

• We propose WenyanGPT, a large language model focused on Classical Chinese. It demonstrates superior performance and wide applicability in tasks such as punctuation, POS, and translation.

• We release pre-training and instruction fine-tuning datasets, along with a novel method for constructing domain-specific fine-tuning data, providing valuable resources for future research.

• We introduce WenyanBENCH, an evaluation benchmark for Classical Chinese tasks, with extensive experiments verifying WenyanGPT’s leading performance across multiple tasks.

2 Related Work

2.1 PLMs
In 2017, Google introduced a new neural network architecture, the Transformer. It utilizes self-attention mechanisms to better handle long-distance dependencies and significantly improves training efficiency through parallel computation. Based on the Transformer, various LLMs have been proposed. BERT employs an encoder-only Transformer architecture and is pre-trained using masked language modeling and next-sentence prediction tasks. The GPT series uses a decoder-only Transformer architecture and an autoregressive language model (ALM). Over the course of the GPT series, the model size has steadily increased, from the original GPT to subsequent iterations, including GPT-2 [Radford et al., 2019], GPT-3 [Brown et al., 2020], and GPT-4 [Achiam et al., 2023], with continuous improvements in performance. PaLM [Chowdhery et al., 2023] uses the standard Transformer architecture in a decoder-only model with a modified SwiGLU activation function. This model, with 540 billion parameters, achieved human-level performance in 1-shot learning on the BIG-bench [Srivastava et al., 2023] dataset. In 2023, Meta AI released the LLaMA model [Touvron et al., 2023]. This model also follows a decoder-only Transformer architecture and excels in various NLP tasks after large-scale training. In 2024, LLaMA 3 [Dubey et al., 2024] was released, including a pre-trained version with 405 billion parameters and a post-training version, alongside the LLaMA Guard 3 model for input-output safety. Pre-trained language models have seen rapid development, with Transformer-based models becoming the mainstream technology in NLP.

2.2 Classical Chinese PLMs
Pre-trained language models have achieved widespread success in the field of NLP. However, studies show that general-domain models often lack specialized knowledge for tasks in specific domains. Models pre-trained with domain-specific data tend to perform better for specialized tasks [Ke et al., 2023; Gupta et al., 2023; Ibrahim et al., 2024; Taylor et al., 2022; Lehman et al., 2023; Liu et al., 2020]. In the field of Classical Chinese, several studies have extended models such as BERT, RoBERTa, and GPT by incorporating Classical Chinese corpora for pre-training, resulting in specialized models like AnchiBERT [Tian et al., 2020], SikuBERT and SikuRoBERTa [Wang et al., 2022], the GujiBERT and GujiGPT series [Wang et al., 2023a], and SikuGPT [Liu et al., 2023a]. These models show improved performance over general pre-trained models in Classical Chinese tasks. Instruction fine-tuning is another effective strategy: Supervised Fine-Tuning (SFT) can activate LLMs’ ability to understand and answer questions in a specific domain [Liu et al., 2023b; Xiong et al., 2023; Wang et al., 2023b; Yue et al., 2023; Huang et al., 2023; Cui et al., 2023; Yang et al., 2023b; Zhang et al., 2023a; Dan et al., 2023]. Classical Chinese LLMs [Zhang et al., 2024; Yang et al., 2024b; Cao et al., 2023; Cao et al., 2024] are in the early stages of development. For example, the "Xunzi"1 Classical Chinese large language model was trained on Classical Chinese-related corpora based on general models such as Qwen2.5 [Yang et al., 2024a], Baichuan2 [Yang et al., 2023a], and GLM-4 [Zeng et al., 2024]. It has shown excellent performance in intelligent tagging, information extraction, and other tasks. TongGu [Cao et al., 2024], through two-stage instruction fine-tuning, is capable of Classical Chinese punctuation, translation, and appreciation tasks. In this paper, WenyanGPT is fine-tuned on higher-quality pre-training data and a larger, more diverse instruction dataset, showing superior and more comprehensive task handling capabilities.

1 [Link] XunziALLM.

3 WenyanGPT
In order to obtain the WenyanGPT Classical Chinese model, we first construct a Classical Chinese pre-training corpus and continue pre-training based on LLaMA3-8B-Chinese (Section 3.1). Then, we propose a method to construct domain instruction data (Section 3.2). In our framework, instruction generation is manually constructed, guided by LLMs, and

tested to ensure the high quality of the fine-tuning data. The complete training process is shown in Figure 2.

Figure 2: Overall Training Framework of WenyanGPT. (Base model → continued pre-training on the Classical Chinese corpus → pre-trained model → instruction fine-tuning with generated domain instruction data → WenyanGPT, evaluated on the WenyanBENCH benchmark.)

3.1 Pre-training
The corpus used in the pre-training phase is sourced from authoritative websites such as Daizhige, Wenyanguji, and various Classical Chinese-related data collected and organized from GitHub. The detailed data sources and scale are shown in Table 1. We uniformly format and store data from these different sources, removing redundant information, errors, special symbols, and invalid characters. As a result, we obtain a clean, large-scale, high-quality Classical Chinese corpus of approximately 16GB. This corpus covers the Four Books and Five Classics, including Confucian classics, historical records, works of various philosophers, poetry, essays, drama, novels, miscellanies, and other literary genres. It also encompasses diverse fields such as local gazetteers, genealogies, religious texts, agriculture, law, medicine, astronomy, geography, craft books, and military texts. The corpus integrates both simplified and traditional Chinese characters, spanning from the Pre-Qin period to the Republic of China, providing rich materials for deep learning and research on ancient Chinese texts. We select LLaMA3-8B-Chinese as the base model and use the bfloat16 data format during training to improve efficiency. The hyper-parameter settings in pre-training are shown in Table 2.

Source                       Scale    Source                       Scale
Daizhige                     5.2G     Poetry-master                323M
[Link]                       1.6G     PoetrySplider                16M
network resource             1.1G     TCM-Ancient-Books            322M
chinese-novel-master-long    294M     text-to-picture-sidamingzhi  6.7M
chinese-gushiwen             23M      chtxt-main                   88M
Classical-Chinese            208M     chinese-poetry               115M
Classical-Modern             853M     guner2023                    63M
core-books-main              752M     kangxi-master                37M
core-texts                   232M     poems-db                     660M
GuWen-master                 2.5M     scripta-sinica               3.7G

Table 1: Sources and Scale of Classical Chinese Pre-training Corpus.

Hyper-parameter                Value
per device train batch size    16
gradient accumulation steps    1
learning rate                  1.0e-4
num train epochs               1
lr scheduler type              cosine
warmup ratio                   0.1

Table 2: Hyper-parameter Settings in Continued Pre-training.

3.2 Supervised Fine-Tuning
Based on continued pre-training, we perform supervised fine-tuning to better adapt the model to specific tasks and instructions. We use a high-quality set of instruction fine-tuning data, which we have previously collected and organized, to trigger the knowledge the model acquires during pre-training. The detailed process for constructing the instruction fine-tuning data is shown in Figure 3.

Data Selection and Initial Organization. We select relevant data from the Classical Chinese corpus, including three major categories: question-answer pairs, parallel corpora, and tagged corpora. Parallel corpora are used for translation tasks, while tagged corpora support fine-grained tasks such as punctuation and POS. In cases where the corpus lacks clear question-answer pairs, we generate supplementary data. During the selection phase, we prioritize high-quality data with clear content, standardized semantics, and task relevance to build the initial high-quality input-output pairs.

Manual Design of Task Instructions and Model Expansion. We design initial task instruction templates manually based on high-quality input and output, covering tasks such as Classical Chinese punctuation and translating Classical Chinese into Modern Chinese. Then, we use LLMs, such as the GPT and Qwen series, to expand the task instructions. On the one hand, we generate diverse instructions from existing ones; on the other hand, we allow LLMs to perform reverse reasoning from the existing high-quality input-output pairs to generate new instructions, ensuring diversity in the instructions. After the expansion, we conduct an initial screening of the generated instructions, removing those with unclear or unreasonable semantics, resulting in a seed instruction set.

Testing the Instruction Set and Optimizing Fine-Tuning Data. We randomly select high-quality input-output pairs and combine them with the seed instruction set to evaluate the model’s adherence to instructions across different task scenarios. We analyze the test results to refine the instructional design and identify the task instructions that produce higher-quality outputs. Then, we combine the optimized instruction set with high-quality input-output pairs to construct fine-tuning data for specific instructions, providing reliable data support for subsequent model training.

Generation and Supplementation of Specific Task Data. Due to the lack of specific task instruction data in the corpus, we manually design diverse initial instructions and inputs.

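The corpus clean-up described in Section 3.1 (removing redundant information, special symbols, and invalid characters) can be sketched roughly as follows; the concrete filtering rules here are our assumptions, since the paper does not publish its cleaning scripts.

```python
import re

# Illustrative sketch of the Section 3.1 corpus normalization: strip
# invalid characters, normalize whitespace, and deduplicate documents.
# The exact filters used for the 16GB corpus are assumptions.

def clean_document(text: str) -> str:
    text = re.sub(r"[\u0000-\u001f\ufffd]", "", text)  # control / invalid chars
    text = re.sub(r"\s+", " ", text).strip()           # normalize whitespace
    return text

def deduplicate(docs):
    """Keep the first occurrence of each cleaned, non-empty document."""
    seen, out = set(), []
    for doc in map(clean_document, docs):
        if doc and doc not in seen:
            seen.add(doc)
            out.append(doc)
    return out

corpus = ["古之学者必有师。", "古之学者必有师。", "  汝忆否?\ufffd "]
print(deduplicate(corpus))
```

In practice near-duplicate detection (e.g. hashing of normalized shingles) would replace the exact-match set used here.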
Figure 3: Instruction Fine-Tuning Data Construction Process. (Pipeline: high-quality (input, output) pairs extracted from the Classical Chinese corpus → manually designed initial instructions, expanded by larger models and by reverse reasoning from the pairs → filtering → seed instruction set → instruction testing → final high-quality SFT data as (instruction, input, output) triples, supplemented by artificially designed instructions with model-generated outputs.)
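A minimal sketch of the Figure 3 flow, with the LLM-based instruction expansion and the manual review stubbed out as simple rules (both stubs are our assumptions, for illustration only):

```python
# Sketch of the Figure 3 pipeline: expand seed instructions, filter them,
# then merge with high-quality (input, output) pairs into SFT records.
# LLM-based expansion and manual review are stubbed with toy rules.

def expand_instructions(seeds):
    """Stub for LLM-based paraphrase expansion of instructions."""
    return seeds + [s.replace("添加", "补全") for s in seeds if "添加" in s]

def filter_instructions(instructions, min_len=4):
    """Stub for screening out unclear or too-short instructions."""
    return [ins for ins in instructions if len(ins) >= min_len]

def build_sft_records(instructions, pairs):
    """Cross instructions with (input, output) pairs into SFT triples."""
    return [
        {"instruction": ins, "input": inp, "output": out}
        for ins in instructions
        for inp, out in pairs
    ]

seeds = ["请给这段古汉语文本添加标点符号", "断句"]
pairs = [("古之学者必有师", "古之学者必有师。")]
records = build_sft_records(filter_instructions(expand_instructions(seeds)), pairs)
print(len(records))
```

In the paper's actual pipeline the expansion step is done by GPT- and Qwen-series models and the filtering combines automatic checks with manual review; only the record structure is carried over here.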

Then we use LLMs including Qwen2.5-14B and Qwen2.5-72B to generate high-quality answers. We remove low-quality or irrelevant content through manual selection and automatic quality checks, forming another portion of the instruction data.

Data Integration and Output Validation. We integrate the data generated from the corpus and the supplementary data from the LLMs to form a complete instruction dataset. During integration, we ensure consistency between instructions and outputs, covering multiple Classical Chinese task scenarios. Through comprehensive data validation and optimization, we create a high-quality instruction dataset. In the end, we obtain about 1.85 million instruction fine-tuning examples. The detailed data sources and counts are shown in Table 3. We use these data to fine-tune the pre-trained model. The hyper-parameter settings in fine-tuning are shown in Table 4.

Task                Data Source         Num
Punctuation         Daizhige            1,073,017
POS                 evahan              9,952
NER                 Self-built          29,923
Translation         classical-modern    222,700
                    [Link]              302,724
Word explanation    [Link]              31,088
Reverse dictionary  chinese-dictionary  39,708
                    chinese-xinhua      138,810
Total                                   1,847,922

Table 3: Sources and Scale of Instruction Fine-Tuning Data.

Hyper-parameter                Value
per device train batch size    8
gradient accumulation steps    2
learning rate                  1.0e-4
num train epochs               1
lr scheduler type              cosine
warmup ratio                   0.1

Table 4: Hyper-parameter Settings in Fine-Tuning.

4 Benchmarking Classical Chinese Tasks
WenyanBench. In order to evaluate the model’s performance on Classical Chinese tasks, we devise a benchmark known as WenyanBench. WenyanBench shares the same data sources as the instruction fine-tuning data and has undergone duplicate data removal, as well as validation by both human reviewers and LLMs. For quality control, we sample a subset of the data. The distribution and detailed statistics of WenyanBench are shown in Table 5.

Task           Num     Task                Num
Punctuation    7,559   Translation         5,013
POS            1,247   Word Explanation    3,931
NER            3,741   Reverse Dictionary  4,462
Total          25,953

Table 5: Data Sources and Detailed Statistics of WenyanBench.

Tasks. Our benchmark includes six tasks related to Classical Chinese. Among them, we subdivide 14 types of punctuation marks in the punctuation task, divide Classical Chinese word classes into 17 categories in the POS task, such as nr for personal nouns and ns for place nouns, and define 4 categories for the NER task.

Metrics. For the WenyanBench benchmark, different evaluation metrics are used for different types of tasks (under-
Model                      Punctuation              POS                      NER
                           P(%)   R(%)   F1(%)     P(%)   R(%)   F1(%)     P(%)   R(%)   F1(%)
Qwen2.5-7B-Instruct        54.34  53.31  53.82     51.25  48.16  49.65     66.05  46.55  54.61
Baichuan2-7B-Chat          51.05  21.03  29.79     47.11  30.97  37.37     35.10  10.58  16.26
GLM-4-9B-Chat              52.39  55.00  53.66     49.90  54.98  52.32     52.19  45.42  48.57
Meta-Llama-3-8B-Instruct   55.05  22.41  31.85     25.73  17.06  20.52     47.50  57.48  52.01
Llama3-8B-Chinese-Chat     45.76  38.07  41.56     21.34  19.34  20.29     46.85  66.69  55.04
Xunzi-Qwen-1.5-7B-Chat     52.08  47.19  49.51     77.54  78.07  77.81     49.79  51.21  50.49
GPT-4o                     52.00  50.70  51.34     82.41  81.11  81.75     61.58  76.97  68.42
Deepseek-V3                56.33  61.94  59.01     79.12  79.18  79.15     56.83  79.75  66.36
WenyanGPT                  76.84  74.52  75.66     89.66  88.54  89.10     92.14  90.19  91.16

Table 6: Results for understanding tasks (Punctuation, POS, NER) on WenyanBench. The underlined results represent the second-best model’s F1 score.
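The Precision/Recall/F1 numbers in Table 6 follow the standard micro-averaged computation over predicted versus gold items; the matching granularity in the sketch below (a (token, tag) pair counts as correct only if both match) is our simplification, as the paper does not spell out its exact matching criteria.

```python
from collections import Counter

# Micro-averaged precision/recall/F1 over predicted vs. gold items, as used
# for the understanding tasks (punctuation, POS, NER). The (token, tag)
# matching granularity is an assumption for illustration.

def micro_prf(gold: list, pred: list):
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    tp = sum(min(gold_counts[k], pred_counts[k]) for k in pred_counts)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example using entities from the paper's Figure 1 and Table 8:
# one tag is wrong (桓公 tagged n instead of nr), so 2 of 3 items match.
gold = [("衞州吁", "nr"), ("桓公", "nr"), ("二十三年", "t")]
pred = [("衞州吁", "nr"), ("桓公", "n"), ("二十三年", "t")]
p, r, f1 = micro_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))
```

When the predicted and gold sequences have equal length, precision, recall, and F1 coincide, as in this toy example.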

Figure 4: The F1 scores of WenyanGPT for the subcategories of understanding tasks (including Punctuation, POS, and NER) on WenyanBench. (Subcategories shown: 14 punctuation marks; the POS tags p, c, nr, n, r, v, w, d, ns, t, y, u, m, a, f, j, q; and the NER categories characters, place, time, and official positions.)

standing tasks and generation tasks). For understanding tasks, evaluation primarily relies on Precision, Recall, and F1-Score. For generation tasks, BLEU and BERT-Score are used as evaluation metrics. BLEU measures the N-gram overlap between generated content and reference answers, while BERT-Score better captures the semantic similarity between the generated content and reference answers.

Evaluation Method. To efficiently assess model performance, we design a set of scripted tools to automatically compute BLEU, BERT-Score, and other metrics. These tools quickly and accurately quantify model outputs, providing clear feedback for model optimization. This automated evaluation approach improves evaluation efficiency and ensures the consistency and comparability of the results.

5 Experiments
The experiments evaluate WenyanGPT’s performance on understanding and generation tasks in Classical Chinese.

5.1 Experimental Setup
Baselines. The baselines include general-domain and Classical Chinese domain LLMs. The general-domain LLMs are Qwen2.5-7B-Instruct, Baichuan2-7B-Chat, GLM-4-9B-Chat, Meta-Llama-3-8B-Instruct, Llama3-8B-Chinese-Chat, GPT-4o [Hurst et al., 2024], and Deepseek-V3 [Liu et al., 2024]. The Classical Chinese domain LLM is Xunzi-Qwen1.5-7B-Chat.

Data and Evaluation. We use the WenyanBench benchmark for testing. The understanding tasks, including punctuation, POS, and NER, are evaluated by Precision, Recall, and F1-Score. The generation tasks include word explanation, translation, and reverse dictionary, where BLEU is used for word explanation and translation, and BERT-Score is used for reverse dictionary.

5.2 Experimental Analysis
WenyanGPT demonstrates a significant lead in language understanding tasks. The experimental results for understanding tasks are presented in Table 6. In the NER task, WenyanGPT’s precision, recall, and F1 score all exceed 90%, while the second-best model, GPT-4o, fails to surpass 77% in any of these metrics. In the punctuation task, WenyanGPT’s F1 score is 16.65 points higher than that of the second-best model, Deepseek-V3, reaching 75.66%. Additionally, in the POS task, WenyanGPT’s F1 score is 7.35 points higher than that of the second-best model, GPT-4o. These results highlight WenyanGPT’s overwhelming superiority in understanding tasks, particularly in NER and punctuation. This performance is attributed to the model’s extensive pre-training on Classical Chinese data, enabling it to handle complex linguistic phenomena and ensuring higher accuracy and stability in fundamental language understanding tasks, such as POS and NER.

WenyanGPT demonstrates excellent capabilities and high efficiency in subcategories of Classical Chinese compre-

Model                      Translation                   Word explanation              Reverse dictionary
                           BLEU1  BLEU2  BLEU3  BLEU4    BLEU1  BLEU2  BLEU3  BLEU4    P(%)   R(%)   F1(%)
Qwen2.5-7B-Instruct        0.37   0.23   0.17   0.14     0.16   0.09   0.07   0.05     68.43  68.99  68.66
Baichuan2-7B-Chat          0.33   0.20   0.14   0.11     0.14   0.08   0.05   0.04     64.81  66.21  65.42
GLM-4-9B-Chat              0.34   0.21   0.15   0.12     0.15   0.09   0.06   0.05     65.58  68.04  66.69
Meta-Llama-3-8B-Instruct   0.16   0.09   0.06   0.05     0.11   0.06   0.05   0.04     59.41  64.13  61.48
Llama3-8B-Chinese-Chat     0.26   0.15   0.10   0.08     0.11   0.06   0.04   0.03     61.80  65.18  63.28
Xunzi-Qwen1.5-7B-Chat      0.22   0.15   0.11   0.09     0.11   0.08   0.06   0.05     66.47  68.45  67.35
GPT-4o                     0.41   0.27   0.19   0.14     0.19   0.13   0.09   0.07     64.96  66.76  65.81
Deepseek-V3                0.30   0.19   0.13   0.10     0.20   0.14   0.11   0.08     71.93  71.84  71.88
WenyanGPT                  0.47   0.33   0.24   0.19     0.35   0.31   0.27   0.23     75.51  75.31  75.39

Table 7: Results for generation tasks (Translation, Word explanation, Reverse dictionary) on WenyanBench. The underlined results represent the second-best model’s BLEU1 score and BERT-Score F1 score.
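As a rough illustration of the BLEU1 column, the sketch below computes a simplified character-level modified unigram precision against a single reference, without the brevity penalty or multi-reference handling of full BLEU; the paper's actual scoring script may differ.

```python
from collections import Counter

# Simplified BLEU-style modified unigram precision at character level,
# single reference, no brevity penalty -- an illustration of the BLEU1
# column, not the paper's exact scoring script.

def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate), Counter(reference)
    # Clip each candidate character's count by its count in the reference.
    clipped = sum(min(n, ref[ch]) for ch, n in cand.items())
    return clipped / max(len(candidate), 1)

# Reference translation from Figure 1; the candidate differs in one character.
ref = "古代求学的人必定有老师。"
cand = "古代求学的人一定有老师。"
print(round(unigram_precision(cand, ref), 2))
```

Higher-order BLEU2-BLEU4 extend the same clipped-count idea to bigrams through 4-grams and combine them with a geometric mean.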

hension tasks. As shown in Figure 4, the F1 scores of our model in the three tasks’ subcategories are generally stable and high. Specifically, in the NER task, WenyanGPT’s F1 score remains above 80%, demonstrating its strong ability to correctly identify entities such as historical figures, places, and proper nouns in ancient Chinese texts. This performance shows that it has strong accuracy and robustness when processing ancient Chinese texts, and can effectively capture complex contextual relationships and changes in word meaning. Overall, WenyanGPT’s high F1 scores in ancient Chinese comprehension tasks not only reflect its efficiency in basic tasks, but also demonstrate its advantages and potential in processing fine-grained tasks in ancient Chinese.

WenyanGPT significantly enhances the quality of generated content in generation tasks. The experimental results for generation tasks are displayed in Table 7. WenyanGPT exhibits strong semantic preservation and contextual consistency, demonstrating superior and more stable performance in translation and word explanation tasks, with consistently high BLEU scores (BLEU1-BLEU4). In Classical Chinese translation, the model’s BLEU1 score of 0.47 exceeds that of the second-best model by 0.06. Furthermore, WenyanGPT’s F1 score in the reverse dictionary task is 3.47 points higher than that of the second-best model. Through meticulous pre-training and multitask training, WenyanGPT develops robust contextual consistency, enabling it to generate content that accurately reflects the intended meaning and context. This capability ensures high-quality semantic transmission and content coherence in both long-text generation and complex tasks, such as word explanation.

WenyanGPT achieves better performance than previous open-source LLMs in Classical Chinese tasks. Through large-scale pre-training and multitask joint optimization, WenyanGPT demonstrates substantial capabilities in Classical Chinese processing tasks, establishing a clear lead over existing mainstream LLMs. This validates the effectiveness of the proposed approach. The performance of various LLMs on WenyanBench is shown in Figure 5. It is evident that WenyanGPT achieves the highest score in all six Classical Chinese tasks.

Figure 5: Radar plot showing model performance on WenyanBench, with values normalized to a 0-1 scale. (Axes: Punctuation, Part-of-speech tagging, NER, Translation, Word explanation, Reverse dictionary; models compared: Qwen2.5-7B-Instruct, Baichuan2-7B-Chat, GLM-4-9B-Chat, Meta-Llama-3-8B-Instruct, Llama3-8B-Chinese-Chat, Xunzi-Qwen1.5-7B-Chat, GPT-4o, Deepseek-V3, WenyanGPT.)

WenyanGPT’s multitask training strategy, which integrates various Classical Chinese processing tasks, enhances the model’s ability to learn across tasks. The mutual reinforcement of shared linguistic features and semantic information between tasks notably boosts the model’s generalization capabilities. This inter-task synergy not only improves performance on individual tasks but also enables WenyanGPT to effectively handle multiple tasks simultaneously, maintaining stable performance, especially in complex tasks, and demonstrating superior adaptability across tasks.

5.3 Case Study
We provide response examples for understanding tasks (POS) and generation tasks (word explanation) from five LLMs: WenyanGPT, Deepseek-V3, GPT-4o, Qwen2.5-7B-Instruct, and Xunzi-Qwen1.5-7B-Chat, as shown in Table 8 and Table 9. By analyzing these examples, we can see the performance differences among the LLMs in processing tasks, especially in the understanding and generation of Classical Chinese.

Table 8 illustrates some typical errors in the POS task. Specifically, GPT-4o makes mistakes in tagging time words and proper nouns. For instance, it incorrectly tags the time

Case 1: Please segment the following Classical Chinese content and accurately tag the parts of speech: 四年春,衞州吁弑桓公而立。
Ground Truth: 四年/t 春/n ,/w 衞州吁/nr 弑/v 桓公/nr 而/c 立/v 。/w
WenyanGPT: 四年/t 春/n ,/w 衞州吁/nr 弑/v 桓公/nr 而/c 立/v 。/w
Deepseek-V3: 四/m 年/t 春/t ,/w 衞/ns 州吁/nr 弑/v 桓公/nr 而/c 立/v 。/w
GPT-4o: 四年/n 春/n ,/w 衞州/n 吁/v 弑/v 桓公/n 而/c 立/v 。/w
Qwen2.5-7B-Instruct: 四年/t 春/w ,/w 卫州吁/nr 射/v 桓公/nr 而/c 立/v 。/w
Xunzi-Qwen1.5-7B-Chat: 四年春/t ,/w 衞州吁/nr 杀/v 桓公/nr 而/c 立/v 。/w

Table 8: Response examples for the POS task from different LLMs. Errors in POS are marked in red, while text errors are highlighted in blue.

Case 2: Explain the meaning of "晕红" in the line "一片晕红才著雨,几丝柔绿乍和烟" from "浣溪沙·谁道飘零不可怜".
Ground Truth: 形容海棠花的色泽。(Describes the color of begonia flowers.)
WenyanGPT: 形容海棠花色泽鲜艳而浓丽。(Describes the bright and rich color of the begonia flower.)
Deepseek-V3: 指雨后花朵泛起的朦胧红晕。(Refers to the hazy red glow that appears on flowers after the rain.)
GPT-4o: 指花瓣在雨水滋润下呈现出的一种柔和而略带湿润光泽的红色。(Refers to the soft, slightly moist-lustered red that the petals take on when nourished by rain.)
Qwen2.5-7B-Instruct: 指花瓣上晕染开的淡红色,适合颜色柔和、渐变,类似于涂抹或被雨水滋润后散开的红晕。(Refers to the light red that spreads across the petals, soft and gradual in color, similar to a blush that spreads after being smeared or moistened by rain.)
Xunzi-Qwen1.5-7B-Chat: 指花瓣上的红色晕染,因为花瓣在雨中被淋湿,所以红色晕染更加明显。(Refers to the red smudge on the petals; because the petals are wet from the rain, the red smudge is more pronounced.)

Table 9: Response examples for the word explanation task from different LLMs.

expression "四年" (the fourth year) as a common noun, and it also struggles with recognizing the proper noun "衞州吁" (Wei Zhou Xu). Qwen2.5-7B-Instruct errs in tagging the part of speech of "春" (spring) and sometimes replaces characters in the original text with inappropriate alternatives. Xunzi-Qwen1.5-7B-Chat primarily fails to distinguish between time words and nouns, and mistakenly uses the simplified character "杀" instead of the correct "弑". These errors highlight the models’ difficulty in making precise distinctions between similar words.

For the word explanation task, WenyanGPT’s response is the closest to the Ground Truth. Although WenyanGPT adds some modifiers, it maintains simplicity and refinement in describing the color, which matches the tone of the poem. Deepseek-V3, GPT-4o, Qwen2.5-7B-Instruct, and Xunzi-Qwen1.5-7B-Chat do not specify the subject as "海棠花" (begonia flowers) and instead over-explain the moistness of the petals and the process of spreading color, deviating from the standard, concise description of "晕红" (blushing red). Specifically, the responses from Qwen2.5-7B-Instruct and Xunzi-Qwen1.5-7B-Chat extend the impact of the external environment, which does not fully align with the original text’s expression. This over-expansion shows that these LLMs fail to capture the intended simplicity and depth of the original description.

WenyanGPT demonstrates strong semantic understanding and retention capabilities in handling Classical Chinese tasks. Its ability to accurately tag parts of speech and generate concise, poetic word explanations demonstrates a deep grasp of the nuances of Classical Chinese. In contrast, other LLMs struggle with both understanding the subtle distinctions of Classical Chinese and producing responses that remain faithful to the content and meaning of the original text. This positions WenyanGPT as a powerful tool for handling complex tasks in the domain of Classical Chinese, far surpassing the other LLMs in terms of both accuracy and literariness.

6 Conclusion
We propose a comprehensive solution for Classical Chinese language processing challenges, including the development of WenyanGPT, a large language model focused on the Classical Chinese domain, and WenyanBENCH, an evaluation benchmark dataset for Classical Chinese tasks. We release pre-training and instruction fine-tuning datasets and describe the method for constructing the instruction fine-tuning dataset. Through systematic experiments and analysis, we demonstrate the significant impact of domain-specific pre-training and multi-task instruction fine-tuning on improving Classical Chinese processing capabilities. Our model outperforms existing mainstream LLMs in various downstream tasks. In the future, we intend to explore the potential of multimodal models by combining Classical Chinese texts with image data (such as inscriptions and manuscripts) to enhance processing capabilities.

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)

Acknowledgements

We would like to sincerely thank all reviewers for their valuable feedback and constructive comments. This work is supported by the National Social Science Fund of China (Grant No. 22&ZD035).