Large Language Models Understand Layout

Weiming Li (a), Manni Duan (a), Dong An (a) and Yan Shao (a, b, *)

(a) Zhejiang Lab, Hangzhou, China
(b) China Mobile, Hangzhou Research and Development Center, China
arXiv:2407.05750v1 [cs.CL] 8 Jul 2024

Abstract. Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is observed when the spatial markers are excluded from the original data. We perform a series of experiments with the GPT-3.5, Baichuan2, Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for further analysis. The experimental results reveal that the layout understanding ability of LLMs is mainly introduced by the coding data used for pretraining and is further enhanced at the instruction-tuning stage. In addition, layout understanding can be strengthened by integrating low-cost, auto-generated data produced by a novel text game. Finally, we show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.

(a) Layout
Here are three names mentioned in the context:
What is your name?          What is your name?
I'm James.                  I'm Oliver.
          What is your name?
          I'm Emma.
Question: What is the name mentioned in the top-left corner?
Answer: The name mentioned in the top-left corner is "James".

(b) Strip
Here are three names mentioned in the context: What is your name? What is your name? I'm James. I'm Oliver. What is your name? I'm Emma.
Question: What is the name mentioned in the top-left corner?
Answer: The name mentioned in the top-left corner is not specified in the given context.

Figure 1: Illustration of ChatGPT comprehending text layout.

1 Introduction
In recent years, large language models (LLMs) have emerged as a dominant force in the global artificial intelligence field, sparking extensive discussions among researchers about their potential and limitations [17, 2, 24]. Although LLMs are primarily designed for natural language processing (NLP) tasks, some studies demonstrate their additional abilities. For instance, they are employed to generate executable code and even achieve remarkable performance in Google coding interviews [6].

Beyond text understanding capability, we find that LLMs are capable of processing text layouts that are denoted by spatial markers. As shown in Figure 1, we conceptualize newline-separated plain text as a "visual" two-dimensional canvas, as text editors and browsers are intuitively two-dimensional. Three identical questions with distinct answers are arranged in different orientations, interspersed with space markers (denoted as layout). We inquire with ChatGPT about the answers in various orientations. Remarkably, ChatGPT provides accurate responses, and some other open-source LLMs also demonstrate reasonable results. For comparison, we exclude the space markers from the original data (denoted as strip), resulting in a substantial decline in performance.

This study initiates a comprehensive examination of LLMs' proficiency in understanding text layout, aiming to unravel insights into their performance and implications across various datasets and fine-tuning methodologies.

First, we build a dataset called TextLayoutQA to evaluate LLMs' text layout understanding capability. Through experiments with the GPT-3.5, Baichuan2 [45], Llama2 [39] and ChatGLM3 [48] models, we uncover that the incorporation of text layout information substantially enhances model performance, resulting in an 8∼25% gain compared to text without layout.

Furthermore, we explore the effects of the pre-training and instruction-tuning stages on LLMs' comprehension of text layout. We illustrate that although LLMs initially demonstrate a basic understanding during pre-training, their proficiency is further enhanced during the instruction-tuning stage.

Moreover, we explore the essential role of training data in shaping LLMs' understanding of text layout, emphasizing the necessity of datasets enriched with layout information, such as code and table data. Through instruction-tuning, we reveal the varying impacts of different types of datasets on LLMs' performance, providing detailed insights into their contributions and constraints.

Our findings not only illuminate the intrinsic capabilities of LLMs in comprehending text layout, but also carry profound implications for their broader applications. By unraveling the intricacies of LLMs' interaction with text layout information, we pave the way for leveraging this capability in tasks ranging from visual question answering (VQA) [26] to document analysis and beyond. Our code and datasets are available on GitHub: https://github.com/liweim/TextLayoutLLM.

* Corresponding Author. Email: [email protected].
The contribution of this paper can be summarized as:

1. To the best of our knowledge, we are the first to systematically analyze the text layout understanding capability of LLMs.
2. We introduce TextLayoutQA, a dataset designed to assess the text layout understanding capability of LLMs.
3. The origin of the text layout understanding capability is thoroughly investigated via instruction-tuning.
4. We propose a low-cost data generation method, based on a novel text game, that significantly enhances the text layout understanding capability.
5. We show that the text layout understanding capability can be applied to text-rich VQA problems and achieves good performance improvements.

2 Related Work

Text layout and language  Preprocessing for text layout is essential before conducting any NLP on the textual content of intricate documents. Hurst and Nasukawa [13] present a general approach that integrates language model and text spatial arrangement knowledge. By taking into account text language features and layout characteristics, this method accurately identifies the boundaries and structures of text blocks. However, it relies on rules and is limited to simple cases. Furthermore, it overlooks the relationship between text blocks, which is crucial for document comprehension.

LLMs' capability in text layout understanding  In recent years, LLMs, represented by the GPT series, have demonstrated strong text comprehension abilities. In this domain, we are aware of research efforts dedicated to exploring the performance of LLMs in spatial reasoning [17, 47, 4], as well as the application of LLMs in graph analysis, understanding, and visualization [2, 5, 9, 35, 41, 50, 51].

A survey by Joublin et al. [17] investigates planning and logical reasoning in spatial environments, finding ChatGPT adept at tracking objects and inferring spatial relationships. Their experiments span various tasks related to physical understanding, including optical and shadow projection, spatial viewpoint reasoning, predicting the impact of actions on objects, one-dimensional object sorting, two-dimensional box placement queries, simulated robot exploration such as navigating an apartment and searching for a ball in a room, and simulated robot task completion such as setting a table for a meal. However, the effectiveness of these efforts is limited since all "spatial" or "visual" features are interacted with in high-level terms. For instance, navigating the apartment involves users informing ChatGPT about the room's location and available doors, followed by describing the new room and door choices after the model makes a selection. In general, ChatGPT demonstrates a certain level of spatial understanding capability, although not in a "geometric" sense.

Bayani [2] explores ChatGPT's performance in visual tasks involving ASCII art [28] input, which is an image drawn using ASCII characters in plain text. They find that ChatGPT shows high-level performance in tasks evaluating visual and spatial capabilities, though there is still room for improvement. The study includes tasks like ASCII art recognition, for instance evaluating ChatGPT's ability to handle rotations, scaling, noise, and translation of box plots; ASCII art part identification, such as asking the model about the identity of specific parts of ASCII art images, like heads or tails; and ASCII art generation tasks, for example generating identical copies of ASCII art, removing noise from ASCII art and proportionally enlarging ASCII art.

LLMs' applications in text-rich VQA  VQA is a task that involves answering questions about images, spanning various formats such as receipts, web pages, tables, documents, or even natural images containing textual information. This task fundamentally requires models to understand multiple modalities. Previous work predominantly focuses on multimodal pre-trained models [18, 12, 38, 11, 21], aiming to leverage all modal information. However, Liu et al. [25] point out that LLMs struggle with text-rich VQA scenarios. The main reasons include the short token length, usually 256, of the textual input from the visual encoder to the text encoder in multimodal LLMs, resulting in a significant loss of textual information. Additionally, the low resolution of the image encoder, typically 224*224, compresses and loses a considerable amount of textual information in text-rich images. Under these circumstances, recent work has started exploring the performance of purely textual LLMs in answering questions using only serialized text from images [10], leveraging the high accuracy of optical character recognition (OCR) models in recognizing long text.

3 Layout Understanding Based on Spatial Markers

Text layout refers to the arrangement and presentation of text within a visual space or canvas. It involves the spatial organization of characters, words, and paragraphs to create a visually coherent and aesthetically pleasing display. It is a form of text representation and does not specifically refer to particular data types like code or tables. Generally, text layout encompasses factors such as newlines, indentation, alignment, font size, and spacing between characters and lines. In this study, we focus on the layout of plain texts, which means the texts do not have formats or font styles. We encode the spatial organization of the texts using spatial markers such as space and newline, forming a plain text that can be directly input to the LLMs.

In general NLP, text layout is often not explicitly considered because most traditional NLP tasks, such as sentiment analysis, named entity recognition, text classification, and machine translation, focus on understanding the content and meaning of the text rather than its visual presentation. Text layout becomes more relevant when dealing with tasks that involve visual or spatial understanding, such as OCR, document understanding, and certain computer vision tasks. In these cases, the physical placement of text on a page or within an image becomes crucial for accurate interpretation.

However, as research progresses and interdisciplinary approaches become more common, there is an increasing recognition of the importance of text layout understanding, even in traditional NLP. For example, understanding the structure and layout of documents can aid in tasks like information extraction or summarization. As the field evolves, there will be more integration of text layout considerations into a broader range of NLP applications.

The existing datasets that incorporate text layout elements are often related to transcribed documents and tables [27, 26], but they are not specifically designed to evaluate the text layout understanding capability. Under these circumstances, we introduce a generated dataset called TextLayoutQA specifically for this purpose.

Subsequently, we investigate whether the text layout understanding capability emerges during the pre-training stage or the instruction-tuning stage. Our hypothesis posits that training corpora with consecutive spaces, including programming code, tables, HTML and YAML-formatted data, may contribute to the text layout understanding capability of LLMs. To validate this hypothesis, we construct an instruction-tuning dataset that does not include data with consecutive spaces, the instruction-basic dataset. We perform instruction-tuning on LLMs of different types and sizes, utilizing the characteristic of catastrophic forgetting to induce LLMs to "forget" the text layout understanding capability. Subsequently, we add training corpora containing consecutive spaces, such as code and table corpora, to the instruction-basic dataset to observe whether the text layout understanding capability is recovered.
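To make the layout-versus-strip comparison concrete, the following is a minimal Python sketch of the strip conversion used throughout the paper, i.e., collapsing runs of spaces and newlines into a single space (the function name is illustrative; the authors' exact preprocessing may differ):

import re

def to_strip(layout_text: str) -> str:
    # Collapse runs of spaces and newlines into a single space,
    # discarding the spatial information carried by the markers.
    return re.sub(r"[ \n]+", " ", layout_text).strip()

layout = (
    "What is your name?    What is your name?\n"
    "I'm James.            I'm Oliver.\n"
)
print(to_strip(layout))
# What is your name? What is your name? I'm James. I'm Oliver.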
Furthermore, inspired by the well-known text game, the word search puzzle, we devise a novel text game aimed at enhancing text layout comprehension through gameplay. Starting pre-training from scratch with different training corpora would be the most direct validation method. However, due to the substantial computational resources required, it is beyond the scope of our team's capacity and could be considered as future work.

Finally, we apply the text layout understanding capabilities of LLMs to text-rich VQA tasks. We introduce a method named textLayoutParser that converts the original texts from VQA datasets to texts with layout. We observe that various LLMs yield better results on text with layout than on text without, highlighting the practical benefits and effectiveness of our research in real-world applications.

4 Experiments

4.1 Datasets

In this section, we describe all the datasets we use in the experiments. These include three public datasets, XfundQA, FetaQA [27] and DocVQA [26], along with a generated dataset for text layout understanding evaluation, named TextLayoutQA. Additionally, we propose various instruction-tuning datasets, including instruction-basic, instruction-code, instruction-table, instruction-generated and instruction-test.

XfundQA  A form QA dataset generated from the XFUND [44] dataset, which is a multilingual form understanding benchmark covering 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese) with manually annotated forms. Each language includes 199 forms, with 149 forms in the training set and 50 forms in the test set. The dataset involves two sub-tasks: semantic entity recognition and relation extraction. As our primary focus is on QA, we make the following modifications to the Chinese test set of XFUND:

1. Change the key-value relations to the QA format: "What is the value of the key '{key}'?"
2. Remove invalid QA pairs, including those with empty or invalid values and nested key-key-value relations.
3. Rewrite answers with multiple options to the selected one, such as changing "✓A □ B" to "A".

This modified dataset is named XfundQA. Since LLMs' outputs are usually long, we use recall as the evaluation metric, considering a prediction correct if the ground truth appears completely in the LLM's output.

FetaQA  A table QA dataset consisting of free-form table questions that require deep reasoning and understanding. Most questions are based on discontinuous blocks of information in the table. We conduct evaluations on the test set containing 2,003 samples. Consistent with the dataset's conventions, we use ROUGE-L [22] and BLEU-4 [29] as the evaluation metrics.

DocVQA  A document QA dataset consisting of printed and typed text as well as scanned documents with various layouts, some of which also include handwritten data. Evaluations are performed on the test set containing 5,188 samples. Following the conventional evaluation, we use the average normalized Levenshtein similarity (ANLS) [26] as the evaluation metric. Since LLMs' outputs are relatively long, the same LLM is used to rephrase the original output answers into shorter ones so that they are aligned with the references.

TextLayoutQA  A layout QA dataset generated specifically for testing the layout understanding capability of LLMs. This dataset revolves around enumerating items from various shopping lists arranged in random orientations within the text canvas. As illustrated in Figure 2a, shopping lists are randomly positioned in four orientations (top-left, top-right, bottom-left, bottom-right) on a newline-separated plain text canvas filled with space markers. Each shopping list is assigned a name (A, B, C, or D) and comprises different products. Both the name and the items within the same shopping list are first-letter aligned. For comparison, a version without layout information is constructed for each sample by replacing consecutive space and newline markers with a single space marker. A minimum of two consecutive space markers is maintained between any two shopping lists. Figure 2b illustrates the "without layout" version corresponding to Figure 2a. The paired samples, with and without layout, share the same set of three questions, as shown in Figure 2c.

The TextLayoutQA dataset comprises a total of 300 sample pairs, encompassing 900 questions. All questions require the output in list format. F-score is employed to evaluate LLMs' performance. The evaluation process is as follows: first, lists are extracted from the output using regular expressions. Subsequently, the F-score is calculated with each element in the list as a token. If the output does not contain a list, the F-score is calculated with words as tokens, disregarding characters besides words.

(a) Layout
Here are 4 shopping lists (A, B, C, D) with different products:
A                  B
footwear           lenses
movies
walkers
jet skis

C                  D
fortified wines    animal clothes
                   bulbs

(b) Strip
Here are 4 shopping lists (A, B, C, D) with different products: A B footwear lenses movies walkers jet skis C D fortified wines animal clothes bulbs

(c) QA set
Question: What products do shopping list B contain?
Answer: ["lenses"]
Question: What products do shopping list B and A contain?
Answer: ["lenses", "footwear", "movies", "walkers", "jet skis"]
Question: What products do shopping list in the bottom-right corner contain?
Answer: ["animal clothes", "bulbs"]

Figure 2: A pair example from the TextLayoutQA dataset with (a) and without (b) layout; both share the same QA set (c).

Instruction-basic dataset  An instruction-tuning dataset designed to diminish the text layout understanding capability of LLMs. Specifically, we randomly select 100k bilingual (English and Chinese) instances from publicly available instruction-tuning datasets [37, 14, 46, 16, 7, 49, 8, 43, 19, 31], deliberately excluding consecutive spaces (three or more spaces or two or more tabs), to form the instruction-basic dataset. The distribution of each sub-dataset in the instruction-basic dataset is shown in Table 1.
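A minimal sketch of the consecutive-space filter implied by this construction, assuming instances are dicts with hypothetical "prompt" and "response" fields (illustrative only, not the authors' actual code):

import re

def has_consecutive_spaces(text: str) -> bool:
    # True if the text contains three or more spaces in a row
    # or two or more tabs in a row.
    return re.search(r" {3,}|\t{2,}", text) is not None

def filter_instruction_basic(instances):
    # Keep only instances whose prompt and response are free of
    # consecutive spaces, as required for the instruction-basic dataset.
    return [x for x in instances
            if not has_consecutive_spaces(x["prompt"])
            and not has_consecutive_spaces(x["response"])]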
Table 1: Distribution of each sub-dataset in the instruction-basic dataset.
Dataset           Num     Ratio/%
MOSS              56,195  56.19
belle             20,881  20.88
firefly           8,929   8.92
CSL               3,289   3.28
hh-rlhf           2,234   2.23
COIG              2,104   2.10
HC3               1,577   1.57
Chain-of-Thought  1,200   1.20
prosocial-dialog  963     0.96
alpacaGPT4        851     0.85
gpt4tools         555     0.55
GPTeacher         431     0.431
alpaca            414     0.414
webGPT            173     0.173
dolly             128     0.128
Auto-CoT          59      0.059
GAOKAO            17      0.017

Table 2: Distribution of each sub-dataset in the instruction-code dataset.
Dataset     Num     Ratio/%
GPT4all     65,773  65.77
CodeAlpaca  18,911  18.91
COIG        11,048  11.04
GPTeacher   4,268   4.26

Instruction-code dataset  An instruction-tuning dataset designed to verify the influence of code corpora on the text layout understanding capability of LLMs. We randomly sample 100k bilingual (English and Chinese) data from diverse public code-related instruction-tuning datasets [1, 3, 49]. The distribution of each sub-dataset in the instruction-code dataset is shown in Table 2. To preserve the other capabilities of LLMs, these code data are combined with the data from the instruction-basic dataset, resulting in a 200k instruction-code dataset.

Instruction-table dataset  An instruction-tuning dataset designed to verify the influence of table corpora on the text layout understanding capability of LLMs. We randomly sample tables from the public table QA dataset WikiTableQuestions [30]. We introduce text layout by aligning the first characters of all elements in each column of the table using consecutive space markers. A minimum of two consecutive space markers is maintained between any two columns of elements. Distinct from utilizing the dataset's original QA pairs, we reformulate inquiries to elicit the value of each cell in the table, generating 100k new QA pairs. An example is depicted in Figure 3. These QA instances are combined with the data from the instruction-basic dataset, forming a 200k instruction-table dataset.

Given a table:

Year       Title         Role
2009-2013  We Speak NYC  Jorge / Fredy
2014-2019  Broad City    Jaime Castro
2015-2016  Alternatino   Arturo
2017       No Activity   Pedro
2019       Alternatino   Arturo

Question: What is the Role of Year 2009-2013?
Answer: Jorge / Fredy

Figure 3: An example of the instruction-table dataset.

Instruction-generated dataset  An instruction-tuning dataset designed to improve the text layout understanding capability of LLMs. Specifically, we propose a novel text game to generate data automatically, akin to the renowned text game word search puzzle (Figure 4), which challenges players to find hidden words within a grid of letters. These puzzles typically feature a rectangular or square grid filled with random letters, accompanied by a list of words to be found. The words can be oriented in various directions: horizontally, vertically, diagonally, and even backward.

Figure 4: An example of a word search puzzle.

Acknowledging the scarcity of single letters in training corpora, we adapt the word search puzzle into a new game named the sentence search puzzle. This game is designed to identify hidden sentences within a grid of words, with each word separated by consecutive spaces and each row maintaining a consistent length. The first letters of all words in each column are aligned. A minimum of two consecutive space markers is maintained between any two columns of elements. The sentences can be oriented in two directions: horizontally and vertically. An illustrative example is provided in Figure 5.

To mitigate the difficulty of the game, we include intermediate solving steps in the example provided. We mandate the game to output a list of sentences. During the evaluation step, similar to TextLayoutQA, we employ regular expressions to extract the list from the output.

We randomly generate 100k sentence search games. These instances are combined with the data from the instruction-basic dataset, forming a 200k instruction-generated dataset.

The sentence search puzzle is a game that involves a grid of words, where players are tasked with finding meaningful sentences hidden within the grid. The challenge lies in locating continuous words that make up meaningful sentences horizontally and vertically. The unused spaces in the grid are usually filled with random words to add complexity to the puzzle. Note: answer in the form of a list, for example: ['a', 'b']. If you do not know the answer, reply with the empty list []. Here is a toy example:
[word grid not preserved in this extraction]
First, search horizontally and find "good morning".
Then, search vertically and find "get some food".
So all the sentences hidden in this puzzle are: ["good morning", "get some food"].
Let's solve the following sentence search puzzle step by step:
[word grid not preserved in this extraction]
Answer:
First, search horizontally and find "i believe you".
Then, search vertically and find "what are you doing".
So all the sentences hidden in this puzzle are: ['"i believe you"', '"what are you doing"'].

Figure 5: An example of the instruction-generated dataset.
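Both the instruction-table data and the sentence search puzzle grids rely on the same first-letter column alignment with at least two consecutive spaces between columns. A minimal sketch of such a renderer (illustrative names, not the authors' implementation), shown here on part of the Figure 3 table:

def render_aligned(rows, min_gap=2):
    # rows: list of rows, each a list of cell strings of equal count
    # (e.g. a table header plus data rows, or a sentence-search word grid).
    # Pad every column so first characters are aligned and at least
    # `min_gap` spaces separate adjacent columns.
    widths = [max(len(row[j]) for row in rows) for j in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(widths[j] + min_gap) for j, cell in enumerate(row)]
        lines.append("".join(cells).rstrip())
    return "\n".join(lines)

table = [["Year", "Title", "Role"],
         ["2009-2013", "We Speak NYC", "Jorge / Fredy"],
         ["2017", "No Activity", "Pedro"]]
print(render_aligned(table))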
Instruction-test  A testing dataset designed to assess the efficacy of instruction-tuning, consisting of the four segments of data illustrated below, with a total of 800 samples. The evaluation metrics adopted for all instruction-tuning datasets are ROUGE-L and recall.

• Basic: includes five task types (generation, answering, classification, rewriting, and mathematics), obtained by sampling 100 samples per task type from the public dataset natural-instructions [42], which is a benchmark of 1,616 diverse NLP tasks and expert-written instructions.
• Code: we randomly select 100 samples from the public LeetCode dataset oa_leet10k (https://huggingface.co/datasets/cognitivecomputations/oa_leet10k).
• Table: we construct 100 samples using the same method as for building the instruction-table dataset, sourced from the public table QA dataset FeTaQA [27].
• Generate: we generate 100 samples using the same method as for building the instruction-generated dataset.

4.2 LLMs

In our experiments, we select several LLMs for evaluation on TextLayoutQA, including GLM3, Llama2, Baichuan2, and GPT-3.5-Turbo. These models cover various sizes and types of open-source and proprietary LLMs that are currently popular.

• GLM3: the latest generation of the open-source GLM series, featuring a single parameter size of 6B. The chat version, ChatGLM3-6B, exhibits strong performance among pre-trained models under 10B, with features such as fluent dialogue and a low deployment threshold.
• Llama2: the latest open-source LLMs released by Meta. Llama2 surpasses various open-source models on benchmarks and performs comparably to or better than GPT-4 on many test sets. Its open-source license permits commercial utilization, and it comes in a range of parameter sizes: 7B, 13B, and 70B. In our evaluation, we focus on the 7B and 13B parameter sizes.
• Baichuan2: a new generation of open-source LLMs introduced by Baichuan Intelligence, containing 7B and 13B parameters. Baichuan2 achieves optimal results among models of the same size on various general and domain benchmarks in Chinese, English, and multilingual contexts. Our evaluation encompasses both parameter sizes.
• GPT-3.5-Turbo: launched by OpenAI, GPT-3.5-Turbo is an advanced language model derived from GPT-3, which has 175B parameters.

4.3 Experimental Result and Analysis

4.3.1 General Evaluation

In this section, we first evaluate the text layout understanding capability of LLMs, subsequently analyze how the tokenizers of LLMs encode consecutive punctuation, and finally compare the performance of different spatial markers in an ablation study.

Table 3 shows the evaluation results of different LLMs on TextLayoutQA with (layout) and without (strip) text layout information. Compared to the strip version, various LLMs achieve a performance improvement of 8∼25% with text layout information, indicating the models' ability to understand text alignment, layout, and orientation.

Table 3: Evaluation results, measured by F-score, of different LLMs on TextLayoutQA with (Layout) and without (Strip) text layout.
LLMs           Strip  Layout  Difference
ChatGLM3-6B    33.52  49.52   +16.00
Llama2-7B      47.47  58.80   +11.33
Llama2-13B     53.45  61.93   +8.48
Baichuan2-7B   47.08  60.82   +13.74
Baichuan2-13B  47.00  68.69   +21.69
GPT-3.5-Turbo  61.80  87.77   +25.97

ChatGLM3, Llama2, Baichuan2, and GPT-3.5-Turbo all use byte pair encoding (BPE) [34] from SentencePiece [20] as their tokenizer. We find that the tokenizers of these LLMs encode different lengths of consecutive spaces with distinct tokens. Table 4 illustrates the maximum lengths of consecutive punctuation that can be encoded by different LLMs' tokenizers. GPT-3.5-Turbo has distinct tokens for most consecutive punctuation, while ChatGLM3, Llama2, and Baichuan2 only have distinct tokens for relatively limited consecutive punctuation. A commonality among them is that they all support encoding relatively long runs of consecutive spaces, indicating that consecutive spaces account for a certain proportion of the training corpora. Given that some programming languages like Python and certain file formats like YAML are sensitive to indentation, we assume that these corpora aid LLMs in grasping how consecutive spaces align texts, thereby acquiring text layout understanding capability.

Table 4: Maximum lengths of consecutive punctuation that can be encoded by different LLMs' tokenizers.
LLMs           Space  Tab  Newline  Exclamation  Comma  Full-stop
ChatGLM3       15     1    1        2            1      4
Llama2         15     1    1        2            1      4
Baichuan2      32     1    1        1            2      3
GPT-3.5-Turbo  81     20   14       5            4      9

In TextLayoutQA, we use space and newline as spatial markers. For the ablation study, we investigate three other characters as spatial markers: tab, caron (an accent mark), and a random vanilla character "a". Notably, newline is still used to separate text lines. Table 5 reports the performance of various LLMs on the TextLayoutQA dataset when deploying these characters as spatial markers. Generally, spaces consistently result in optimal performance for the majority of LLMs. The character "a" generally exhibits the poorest performance across various LLMs due to its lack of spatial semantics; in addition, LLMs do not generalize to interpret consecutive "a" characters as a spatial marker. Notably, although the caron marker is rare in corpora, it still outperforms the strip version for most LLMs.

Table 5: Evaluation results, measured by F-score, of different markers for encoding text layout information in TextLayoutQA.
LLMs           Strip  Space  Tab    Caron  a
ChatGLM3-6B    33.52  49.52  49.72  37.62  16.78
Llama2-7B      47.47  58.80  51.46  35.96  37.84
Llama2-13B     53.45  61.93  46.00  43.39  29.33
Baichuan2-7B   47.08  60.82  60.30  50.48  28.49
Baichuan2-13B  47.00  68.69  66.27  56.11  44.76
GPT-3.5-Turbo  61.80  87.77  87.36  72.86  45.79
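One way to reproduce the tokenizer analysis behind Table 4 is to probe how long a run of a repeated character still maps to a single token. The sketch below assumes Hugging Face tokenizers for the open-source models; the model identifier is only an example, and the resulting counts may differ slightly from Table 4 depending on tokenizer settings:

from transformers import AutoTokenizer

def max_single_token_run(tokenizer, char=" ", limit=128):
    # Longest run of a repeated character that the tokenizer still
    # encodes as a single token.
    best = 0
    for n in range(1, limit + 1):
        ids = tokenizer.encode(char * n, add_special_tokens=False)
        if len(ids) == 1:
            best = n
    return best

tok = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan2-7B-Chat",
                                    trust_remote_code=True)
for name, ch in [("space", " "), ("tab", "\t"), ("newline", "\n")]:
    print(name, max_single_token_run(tok, ch))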
4.3.2 The Origin of Layout Understanding Capability

In this section, we first delve into the stage at which the text layout understanding capability forms, subsequently examine the training corpora used by different LLMs, and finally explore the type of training corpora that fosters the text layout understanding capability.

Due to the lack of ability to strictly follow instructions, the outputs of the base models are difficult to align with the references of the QA tasks. We therefore employ perplexity as the metric to ensure a fair comparison of layout understanding capability between base and chat models. Perplexity is a widely used metric for assessing language models; lower perplexity indicates better modeling performance. By comparing the perplexity of different LLMs on the TextLayoutQA dataset with and without text layout, the results presented in Table 6 are obtained. Notably, all the base models exhibit a lower perplexity on text with layout compared to text without layout, suggesting that the base models inherently possess some level of text layout understanding during the pre-training stage. Following instruction-tuning, the chat models exhibit an even larger perplexity reduction from text without layout to text with layout than the base models. This indicates that instruction-tuning further enhances the text layout understanding capability. It should be noted that, to mitigate the influence of context length on perplexity, newline markers are used for padding at the beginning of the text without layout, with the padding length being the difference between the tokenized lengths of the text without layout and the text with layout.

Table 6: Perplexity of different LLMs on the TextLayoutQA dataset with (Layout) and without (Strip) text layout. Lower perplexity indicates better modeling performance.
LLMs           Type  Strip  Layout  Difference
ChatGLM3-6B    Base  6.87   4.98    -1.89
               Chat  5.58   3.56    -2.02
Llama2-7B      Base  2.33   1.85    -0.48
               Chat  2.95   2.26    -0.69
Llama2-13B     Base  2.15   1.81    -0.34
               Chat  3.06   2.27    -0.79
Baichuan2-7B   Base  1.90   1.40    -0.50
               Chat  3.09   1.53    -1.56
Baichuan2-13B  Base  1.89   1.33    -0.56
               Chat  3.03   1.35    -1.68

Table 7 presents the training corpora utilized by various LLMs during the pre-training stage. Notably, the training corpora for GLM3 and Llama2 are not explicitly published, so related information about GLM-130B and Llama is considered. We find that GLM, Llama, and GPT-3 all use datasets such as CommonCrawl, Wikipedia, and Books (Pile includes CommonCrawl, Wikipedia, and Books) in their pre-training. CommonCrawl is a large-scale, unstructured, multilingual web dataset containing over 8 years of web crawler data. Additionally, GLM and Llama utilize code-related sources like GitHub and StackExchange. We do find examples with various text layouts sourced from GitHub and StackExchange within the Pile dataset; specific examples can be found in the Appendix.

Table 7: Training corpora used by different LLMs.
LLMs       Training corpora
GLM        Pile, Chinese WudaoCorpora, Chinese corpora (including online forums, encyclopedia, and QA) crawled from the web
Llama      CommonCrawl [36], C4 [33], Github, Wikipedia, Books, ArXiv, StackExchange
Baichuan2  General internet webpages, books, research papers, codebases, and more
GPT-3      CommonCrawl, WebText [32], Books1, Books2, Wikipedia

We perform instruction-tuning on the instruction-basic, instruction-code, instruction-table and instruction-generated datasets using the Firefly [46] tuning framework. Each dataset is partitioned into training and validation sets with a ratio of 98:2. The training sets undergo 5 epochs with early stopping. As expected, adding different instruction types improves the corresponding ability; for further information, please refer to the Appendix.

Table 8 presents the performance of LLMs on TextLayoutQA following various types of instruction-tuning. Due to the characteristic of catastrophic forgetting, the text layout capability decreases after instruction-basic tuning compared to the chat model, except for the Llama2 series. This is because the Llama2 chat models tend to produce long responses and sometimes fail to follow the required output format. However, a significant recovery in text layout capability is observed after instruction-code tuning, underscoring the crucial role of code-related data in enhancing text layout understanding. In contrast, the model fine-tuned on instruction-table experiences a decrease in text layout capability, indicating that the table-related data do not contribute to the text layout capability. It is noteworthy that performance shows enhancement following instruction-generated tuning. For some small models, such as Llama2-7B and Baichuan2-7B, instruction-generated tuning even yields the best result among all instruction-tuning datasets. Considering that generating data is significantly more convenient and cost-effective than collecting it, this lays a promising path for pre-training LLMs.

Table 8: Evaluation results, measured by F-score, of different LLMs on TextLayoutQA after applying different instruction-tuning datasets. "Origin" indicates no instruction-tuning performed. "Basic", "Code", "Table", and "Generate" correspond to the instruction-basic, instruction-code, instruction-table, and instruction-generated datasets, respectively.
LLM            Origin  Basic  Code   Table  Generate
ChatGLM3-6B    49.52   45.36  63.36  28.98  59.52
Llama2-7B      58.80   64.37  66.10  61.77  66.59
Llama2-13B     61.93   72.31  73.69  66.71  66.26
Baichuan2-7B   60.82   59.93  60.77  53.79  63.78
Baichuan2-13B  68.69   66.14  72.06  65.93  69.15
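A rough sketch of the layout-versus-strip perplexity comparison of Section 4.3.2, including the newline padding that equalizes tokenized lengths (the model name and file names are placeholders, not the authors' script):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

layout_text = open("sample_layout.txt").read()
strip_text = open("sample_strip.txt").read()

# Pad the strip version with leading newlines so both inputs have a
# comparable tokenized length, mitigating the effect of context length.
diff = len(tokenizer.encode(layout_text)) - len(tokenizer.encode(strip_text))
strip_text = "\n" * max(diff, 0) + strip_text

print("layout:", perplexity(model, tokenizer, layout_text))
print("strip: ", perplexity(model, tokenizer, strip_text))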
4.3.3 Applications

In this section, we illustrate the utilization of LLMs' text layout understanding capability in the text-rich VQA domain. We introduce a method named textLayoutParser, designed to parse texts with diverse layouts from documents, including plain texts, forms, tables, images, and their combinations. The method involves the placement of text on a two-dimensional character canvas according to the text's coordinates. The detailed implementation is available in the Appendix. We evaluate the zero-shot performance on the test sets of three datasets: XfundQA, DocVQA, and FeTaQA. The prompts utilized for each dataset are provided in the Appendix.

XfundQA  We use the OCR output provided by the dataset and construct corpora with text layout using textLayoutParser. As a comparison, we replace consecutive spaces and newlines with a single space marker, forming corpora without text layout. The evaluation results of different LLMs on XfundQA with and without text layout are presented in Table 11. Notably, corpora with text layout lead to performance improvements ranging from 1.96% to 9.55% compared to corpora without text layout.

Table 11: Evaluation results, measured by recall, of different LLMs on XfundQA with (Layout) and without (Strip) text layout.
LLMs           Strip  Layout  Difference
ChatGLM3-6B    60.13  66.18   +6.05
Llama2-7B      57.41  66.96   +9.55
Llama2-13B     58.92  66.60   +7.68
Baichuan2-7B   64.70  66.66   +1.96
Baichuan2-13B  67.38  73.27   +5.89
GPT-3.5-Turbo  76.67  77.50   +3.03

DocVQA  We use the OCR output provided by the dataset and construct corpora with text layout using textLayoutParser. For comparison, consecutive spaces and newlines are replaced with a single space marker, forming corpora without text layout. Table 12 shows the evaluation results of different LLMs on the DocVQA test set with and without text layout. Compared to corpora without text layout, different LLMs achieve performance improvements of 2.67% to 4.27% on corpora with text layout.

Table 12: Evaluation results, measured by ANLS, of different LLMs on the DocVQA test set with (Layout) and without (Strip) text layout.
LLMs           Strip  Layout  Difference
ChatGLM3-6B    44.60  48.30   +3.70
Llama2-7B      38.50  41.81   +3.31
Llama2-13B     41.33  44.42   +3.09
Baichuan2-7B   33.50  36.17   +2.67
Baichuan2-13B  38.75  41.80   +3.05
GPT-3.5-Turbo  62.68  66.95   +4.27

FeTaQA  The FeTaQA dataset provides tables in array format; we convert the array table data into string format to serve as corpora without text layout. Additionally, corpora refactored by textLayoutParser are used as corpora with text layout. Table 9 presents the evaluation results of different LLMs on the FeTaQA test set with and without text layout. Notably, various LLMs show performance enhancements ranging from 0.71% to 4.69% (ROUGE-L) and 0.24% to 2.47% (BLEU-4) on corpora with text layout, compared to those without.

Table 9: Evaluation results, measured by ROUGE-L and BLEU-4, of different LLMs on the FeTaQA test set with (Layout) and without (Strip) text layout.
               ROUGE-L                      BLEU-4
LLMs           Strip  Layout  Difference    Strip  Layout  Difference
ChatGLM3-6B    28.79  31.28   +2.49         10.84  11.08   +0.24
Llama2-7B      19.71  24.03   +4.32         5.82   7.67    +1.85
Llama2-13B     27.07  30.49   +3.42         9.14   10.91   +1.77
Baichuan2-7B   32.26  34.26   +2.00         12.15  13.57   +1.42
Baichuan2-13B  34.46  39.15   +4.69         14.04  16.51   +2.47
GPT-3.5-Turbo  39.05  39.76   +0.71         16.21  16.63   +0.42

Different text layout encoding methods are tailored to specific cases. For instance, in the context of table QA, common table encoding techniques include employing identifiers to distinguish headers and rows (referred to as Linear) [23, 15] and representing each element as a col-row-value triplet to create a list (referred to as Triplet) [40]. Apart from our proposed method, we explore several other text layout encoding techniques for an ablation study. Examples of the different table encoding methods can be found in the Appendix. Table 10 provides a performance assessment of various table encoding methods on the FeTaQA test set. Our proposed method outperforms the others for ChatGLM3-6B, Llama2-7B, and GPT-3.5-Turbo. Conversely, for Baichuan2-13B, the Linear encoding method demonstrates superior results.

Table 10: Evaluation results, measured by ROUGE-L and BLEU-4, of different table encoding methods on the FeTaQA test set: "Array", which transforms the original array table data into string format; "Linear", which employs distinct identifiers to differentiate headers and rows; "Triplet", which formats each element as a col-row-value triplet to create a list; and "Ours", which utilizes spaces and newlines to align and separate elements within the table.
               ROUGE-L                          BLEU-4
LLMs           Array  Linear  Triplet  Ours     Array  Linear  Triplet  Ours
ChatGLM3-6B    28.79  31.01   31.25    31.28    10.84  10.85   11.05    11.08
Llama2-7B      19.71  23.69   22.84    24.03    5.82   7.63    6.98     7.67
Llama2-13B     27.07  28.80   26.40    30.49    9.14   9.92    9.22     10.91
Baichuan2-7B   32.26  31.87   31.03    34.26    12.15  12.21   11.55    13.57
Baichuan2-13B  34.46  40.08   32.57    39.15    14.04  16.94   12.53    16.51
GPT-3.5-Turbo  39.05  35.21   36.88    39.76    16.21  14.15   14.96    16.63

5 Conclusion

This study extensively investigates the potential of LLMs in text layout understanding by constructing the TextLayoutQA dataset for in-depth research. Experiments utilizing various LLMs demonstrate that, compared to text without layout, the performance of LLMs on datasets with text layout improves by 8∼25%, confirming their potential in text alignment, layout, and orientation understanding. Additional experiments show that during the pre-training phase, the base models already possess preliminary text layout understanding capabilities, which are further enhanced during instruction-tuning. Through ablation experiments with diverse instruction-tuning datasets, we find that training data is crucial for LLMs to acquire text layout understanding, particularly datasets containing text layouts such as code. In addition, text layout understanding can be enhanced by low-cost, auto-generated data produced by a novel text game. Subsequently, leveraging the text layout understanding capabilities of LLMs, we propose an approach named textLayoutParser to address text-rich VQA problems, achieving decent performance improvements on the XfundQA, FetaQA, and DocVQA datasets.

In summary, our research unveils capabilities in LLMs that have been underexplored, demonstrating their potential to enhance the performance on text-rich VQA problems, expanding the application scenarios of language-centric LLMs, and providing new perspectives for subsequent LLM corpora preparation.

6 Acknowledgments

This research was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2024C01020).
References

[1] Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and A. Mulyar. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. https://github.com/nomic-ai/gpt4all, 2023.
[2] D. Bayani. Testing the depth of chatgpt's comprehension via cross-modal tasks based on ascii-art: Gpt3.5's abilities in regard to recognizing and generating ascii-art are not totally lacking, 2023.
[3] S. Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
[4] L. Chen, L. Wang, H. Dong, Y. Du, J. Yan, F. Yang, S. Li, P. Zhao, S. Qin, S. Rajmohan, et al. Introspective tips: Large language model for in-context decision making, 2023.
[5] S. Di Bartolomeo, G. Severi, V. Schetinger, and C. Dunne. Ask and you shall receive (a graph drawing): Testing chatgpt's potential to apply graph layout algorithms, 2023.
[6] E. Dreibelbis. Chatgpt passes google coding interview for level 3 engineer with $183k salary. https://www.pcmag.com/news/chatgpt-passes-google-coding-interview-for-level-3-engineer-with-183k-salary.
[7] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
[8] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023.
[9] J. Guo, L. Du, and H. Liu. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking, 2023.
[10] N. Hegde, S. Paul, G. Madan, and G. Aggarwal. Analyzing the efficacy of an llm-only approach for image-based document question answering, 2023.
[11] R. Hu, A. Singh, T. Darrell, and M. Rohrbach. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9992–10002, 2020.
[12] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
[13] M. Hurst and T. Nasukawa. Layout and language: Integrating spatial and linguistic knowledge for layout understanding tasks. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, 2000.
[14] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma, and X. Li. Towards better instruction following language models for chinese: Investigating the impact of training data and evaluation, 2023.
[15] Z. Jiang, Y. Mao, P. He, G. Neubig, and W. Chen. Omnitab: Pretraining with natural and synthetic data for few-shot table-based question answering, 2022.
[16] JosephusCheung. Guanaco - generative universal assistant for natural-language adaptive context-aware omnilingual outputs. https://guanaco-model.github.io/, 2021.
[17] F. Joublin, A. Ceravola, J. Deigmoeller, M. Gienger, M. Franzius, and J. Eggert. A glimpse in chatgpt capabilities and its impact for ai research, 2023.
[18] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
[19] H. Kim, Y. Yu, L. Jiang, X. Lu, D. Khashabi, G. Kim, Y. Choi, and M. Sap. Prosocialdialog: A prosocial backbone for conversational agents. In EMNLP, 2022.
[20] T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
[21] K. Lee, M. Joshi, I. R. Turc, H. Hu, F. Liu, J. M. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[22] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
[23] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J.-G. Lou. Tapex: Table pre-training via learning a neural sql executor, 2021.
[24] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
[25] Y. Liu, Z. Li, H. Li, W. Yu, M. Huang, D. Peng, M. Liu, M. Chen, C. Li, L. Jin, et al. On the hidden mystery of ocr in large multimodal models, 2023.
[26] M. Mathew, D. Karatzas, and C. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.
[27] L. Nan, C. Hsieh, Z. Mao, X. V. Lin, N. Verma, R. Zhang, W. Kryściński, H. Schoelkopf, R. Kong, X. Tang, et al. Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
[28] K. O'Riordan. Ascii art. https://www.britannica.com/topic/ASCIIart.
[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[30] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables, 2015.
[31] B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[33] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[34] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units, 2015.
[35] Y. Shi, H. Ma, W. Zhong, G. Mai, X. Li, T. Liu, and J. Huang. Chatgraph: Interpretable text classification by converting chatgpt knowledge to graphs, 2023.
[36] J. R. Smith, H. Saint-Amand, M. Plamada, P. Koehn, C. Callison-Burch, and A. Lopez. Dirt cheap web-scale parallel text from the common crawl. Association for Computational Linguistics, 2013.
[37] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, Z. Yin, X. Huang, and X. Qiu. Moss: Training conversational language models from synthetic data. 2023.
[38] Z. Tang, Z. Yang, G. Wang, Y. Fang, Y. Liu, C. Zhu, M. Zeng, C. Zhang, and M. Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
[39] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
[40] S. Vakulenko and V. Savenkov. Tableqa: Question answering on tabular data, 2017.
[41] H. Wang, S. Feng, T. He, Z. Tan, X. Han, and Y. Tsvetkov. Can language models solve graph problems in natural language?, 2023.
[42] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, 2022.
[43] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
[44] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding, 2021.
[45] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, et al. Baichuan 2: Open large-scale language models, 2023.
[46] J. Yang. Firefly. https://github.com/yangjianxin1/Firefly, 2023.
[47] Y. Ye, H. You, and J. Du. Improved trust in human-robot collaboration with chatgpt. IEEE Access, 2023.
[48] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model, 2022.
[49] G. Zhang, Y. Shi, R. Liu, R. Yuan, Y. Li, S. Dong, Y. Shu, Z. Li, Z. Wang, C. Lin, W. Huang, and J. Fu. Chinese open instruction generalist: A preliminary release, 2023.
[50] J. Zhang. Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt, 2023.
[51] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang. Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, 2023.
Appendix A  Examples of LLMs' Text Layout Understanding Capability

During the early exploration of GPT-3.5-Turbo's QA ability, its remarkable capability to comprehend text alignment, layout, and orientation was discovered. Figure 6 shows some examples from this exploration.

Example 1

        Tom

Jean    Thomas    Lee

        David

Question: What is the text in the center?
Answer: The text in the center is "Thomas".

Example 2
Here are two bboxes:
-------
|     |
|     |  ------
|     |  |    |
|     |  |    |
-------  ------
Question: Which bbox is larger, left or right?
Hint: the bbox with more whitespace inside is larger.
Answer: Based on the given information, the left bbox is larger.

Example 3
Now we define the mathematical symbols with their visual representation using a 5x5 matrix, which is made of "0" and "1". For example:
The visual representation of 0 is:
11111
10001
10001
10001
11111
The visual representation of 1 is:
00100
00100
00100
00100
00100
The visual representation of 2 is:
11111
00001
11111
10000
11111
Question: What is the visual representation of 3?
Answer: The visual representation of 3 is:
11111
00001
11111
00001
11111
Question: What is the visual representation of x?
Answer: The visual representation of x is:
10001
01010
00100
01010
10001

Figure 6: Showcases of GPT-3.5-Turbo's text layout understanding capability.

Appendix B  Corpora with Layout Information on Github and StackExchange

By searching for data within Pile that potentially contains text layout information, we discover considerable relevant data from sources like GitHub and StackExchange. Figure 7 shows some examples.

Appendix C  Test Results of Using Different Types of Instructions to Tune LLMs

The test results of using different types of instructions to tune LLMs are presented in Table 13. Given that LLMs often produce long responses that do not align with the ground truth of the table subset, recall yields more reasonable results than ROUGE-L. It can be observed that, compared to the chat model, the code capability significantly decreases after tuning on instruction-basic, but it substantially recovers after tuning on instruction-code, except for Baichuan2-7B. In contrast to the model tuned on instruction-basic, table capability gains a considerable improvement after tuning on instruction-table, while layout capability obtains a remarkable improvement following instruction-generated tuning.

Appendix D  Method of textLayoutParser

The implementation of textLayoutParser includes four steps: text parsing, determination of the unit character size and coordinate conversion, filling text into the character matrix, and conversion of the character matrix to plain text.

Text Parsing  Utilize appropriate parsing methods based on different file formats to obtain the text content and its corresponding positional coordinates. For example, OCR can be used to extract text and coordinates from images, while the PyMuPDF Python library can be employed to parse PDF files. As for table data, we generate bounding boxes (bboxes) for each element in the table, including headers and cells, based on coordinates and text length. The generation process is as follows: each character is treated as a unit character, with an assumed spacing of 2 between adjacent elements in the same row and 1 between adjacent elements in the same column. The maximum text length over all elements in the j-th column is denoted as l. Given that the bbox of the element in the i-th row and (j-1)-th column is [x1, i, x2, i+1], the bbox of the element in the i-th row and j-th column (V_ij for short) is [x2 + 2, i, x2 + 2 + l, i + 1].

Determination of Unit Character Size and Coordinate Conversion  Determine a unit character size by analyzing the sizes of all text characters, filtering out characters smaller than this unit size. The remaining text coordinates are then converted using this unit character size. Define a text t with length n and bbox coordinates (x1, y1, x2, y2). The approximate character width and height can be calculated as (x2 - x1)/n and y2 - y1, respectively. Let the unit character's width be x0 and its height be y0. The coordinates of t after conversion become (x1/x0, y1/y0, x2/x0, y2/y0), rounded to the nearest integer.
Example 1: from Github
<html id=\"top\">
<head>
<meta charset=\"utf-8\">
<title>The Crosswalk Project</title>
<link rel=\"shortcut icon\" href=\"/assets/favicon.ico\" type=\"image/x-icon\" />
<link rel=\"icon\" href=\"/assets/favicon.ico\" type=\"image/x-icon\" />
<script>
WebFontConfig = {
custom: {
families: [’Clear Sans’],
urls: [’/css/fonts.css’]
},
google: {
families: [’Source Code Pro:n4,n6’]
},
timeout: 2000
};
</script>
</head>
</html>

Example 2: from Github


/*
* Summary:
* Selectors for feature type kCursiveConnectionType
*/
enum {
kUnconnectedSelector = 0,
kPartiallyConnectedSelector = 1,
kCursiveSelector = 2
};

Example 3: from StackExchange

** LOGGED HOURS **      ** SICK HOURS **        ** RESULT TABLE **

+--------+-------+      +--------+-------+      +--------+-------+-------+
|Name    | Hours |      |Name    | Hours |      |Name    |Hours  |Sick   |
+--------+-------+      +--------+-------+      +--------+-------+-------+
|David   |47     |      |David   |9      |      |David   |47     |9      |
+--------+-------+      +--------+-------+      +--------+-------+-------+
|David   |9      |                              |David   |9      |0      |
+--------+-------+                              +--------+-------+-------+

Example 4: from StackExchange
Switch flooding when bonding interfaces in Linux
+----+-----+
| Switch 1 | (layer2/3)
+----+-----+
|
+----+-----+
| Switch 2 |
+----+-----+
|
+----------+----------+
+-------------------------+ Switch 3 +-------------------------+
| +----+-----------+----+ |
| | | |
| | | |
| eth0 (B0:B0:B0:B0:B0:B0) | | eth4 (B4:B4:B4:B4:B4:B4) |
| +----+-----------+----+ |
| | Host B | |
| +----+-----------+----+ |
| eth1 (B1:B1:B1:B1:B1:B1) | | eth5 (B5:B5:B5:B5:B5:B5) |
| | | |
| | | |
+------------------------------+ +------------------------------+
Figure 7: Examples of data with layout information from Github and StackExchange.
Table 13: Test results, measured by ROUGE-L and recall, of using different types of instructions to tune LLMs. The "Instructions" column specifies the different instruction-tuning datasets: "Origin" indicates that no instruction-tuning was performed, while "Base", "Code", "Table", and "Generate" correspond to the instruction-base, instruction-code, instruction-table, and instruction-generated datasets, respectively. The "Code", "Table", "Generate", and "Others" columns under each metric specify the different subsets of the test set, with "Others" encompassing the results excluding the "Code", "Table", and "Generate" subsets.
LLMs           Instructions ROUGE-L                             Recall
                            Code     Table    Generate Others   Code     Table    Generate Others
ChatGLM3-6B    Origin       26.65    20.53    8.14     28.49    67.78    30.96    24.51    38.12
               Base         9.25     25.65    5.39     33.88    19.39    29.25    11.04    37.15
               Code         20.67    17.38    7.46     30.71    36.24    20.65    16.84    36.04
               Table        2.13     32.00    5.87     21.49    3.72     32.00    6.05     25.37
               Generate     4.79     15.05    64.33    33.92    7.14     15.98    67.16    38.31
Llama2-7B      Origin       19.28    24.34    0.07     17.41    66.92    40.44    0.11     38.17
               Base         10.59    25.63    9.83     32.25    26.69    45.40    27.04    39.52
               Code         31.23    40.50    8.06     27.61    68.58    46.24    18.31    35.36
               Table        23.32    62.72    5.73     32.91    54.87    63.17    17.62    37.96
               Generate     12.53    30.74    74.98    29.26    31.13    37.94    85.97    35.58
Llama2-13B     Origin       21.27    18.97    0.39     17.68    68.79    47.94    0.44     38.97
               Base         7.47     47.07    7.83     35.61    15.11    53.50    16.22    39.69
               Code         30.27    43.65    5.18     32.79    68.09    53.55    16.26    39.81
               Table        12.41    61.83    8.29     32.25    25.87    62.07    11.84    38.75
               Generate     17.53    43.23    77.05    34.13    32.70    46.67    80.18    39.93
Baichuan2-7B   Origin       27.34    18.05    3.79     22.72    70.00    47.98    5.90     41.59
               Base         27.81    18.12    2.60     22.34    70.62    47.48    5.53     40.95
               Code         26.83    17.30    4.32     22.39    69.97    47.98    9.49     40.74
               Table        27.83    61.29    10.19    25.31    55.01    61.67    12.30    37.64
               Generate     12.60    11.40    77.64    26.65    31.69    36.56    81.90    39.49
Baichuan2-13B  Origin       29.80    13.92    4.22     26.22    67.04    46.87    10.15    41.38
               Base         8.68     17.98    9.89     25.49    21.40    43.46    17.49    36.91
               Code         20.58    17.96    8.04     27.08    51.82    47.94    21.68    36.64
               Table        10.72    63.76    13.73    29.03    26.97    64.24    17.93    41.76
               Generate     15.91    19.52    75.77    24.08    37.57    41.45    80.28    35.48
Filling Text into the Character Matrix Using the coordinates, insert the text into a character matrix. Initialize a matrix with spaces as elements, setting the numbers of rows and columns to the maximum y-value and x-value after conversion of the text coordinates. Then, sequentially place the text into the corresponding indices of the matrix from left to right to ensure text continuity. For example, if the converted text coordinate is (10, 10, 20, 20) and the text length is 5, each character of the text is placed in the matrix indices (10, 10) to (15, 10) one by one.

Conversion of Character Matrix to Plain Text Convert the character matrix into plain text for LLMs. This process involves joining all characters in each row into one line of text, and then combining all lines of text using a newline character as a separator. To reduce the redundancy of the dense spaces and newline markers, we remove the first column wherever at least three consecutive columns are entirely filled with spaces, replace rows entirely filled with spaces with a newline character, and replace at least three consecutive newline markers with two newline markers.
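These last two steps can be sketched as follows (an illustrative sketch under one reading of the compression rules above, not the exact implementation; function names are placeholders):

import re


def fill_matrix(texts, boxes, width, height):
    """Step 3: place each text into a space-filled character matrix."""
    matrix = [[" "] * width for _ in range(height)]
    for text, (x1, y1, _x2, _y2) in zip(texts, boxes):
        for k, ch in enumerate(text):
            if y1 < height and x1 + k < width:
                matrix[y1][x1 + k] = ch
    return matrix


def matrix_to_text(matrix):
    """Step 4: join the matrix into plain text and compress the whitespace.

    A column is dropped when it and the next two columns are entirely spaces,
    so runs of three or more blank columns shrink to two; all-space rows
    become empty lines; three or more consecutive newlines collapse to two.
    """
    if not matrix:
        return ""
    width = len(matrix[0])
    blank = [all(row[j] == " " for row in matrix) for j in range(width)]
    keep = [not (blank[j] and j + 2 < width and blank[j + 1] and blank[j + 2])
            for j in range(width)]
    lines = ["".join(ch for ch, k in zip(row, keep) if k).rstrip() for row in matrix]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))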
Appendix E Prompt Designs for Different Datasets

Figure 8 illustrates the prompt designs for different datasets. (a) displays one-shot prompting for TextLayoutQA. (b)∼(d) display zero-shot prompting for XfundQA, DocVQA, and FeTaQA, respectively. (e) illustrates the 3-shot prompting for rephrasing answers in the DocVQA dataset. The instructions remain consistent across all LLMs except for the Llama2 series, as depicted in (f). Regarding LLM parameter settings, we utilize a temperature of 0.1, a maximum output length of 512, a top-p of 0.85, and a repetition penalty of 1.05.

(a) An example of one-shot prompting for TextLayoutQA dataset

Given some shopping lists with different products, you are supposed to enumerate the products of specific lists and answer questions in the form of a list, for example: ['a', 'b'], reply with the list only! If you don't know the answer, reply with the empty list [].

For example:
Here are 2 shopping lists (A, B) with different products:
A       B
apple   fish
banana  chair
car

Question: What products do shopping list B contain?
Answer: ['fish', 'chair']

Now answer the question below:
{context}

Question: {question}
Answer:

(b) An example of zero-shot prompting for XfundQA dataset

The following is a form composed of key-value pairs: "{context}".
Please answer according to the given form.
Note: The value usually appears near the key. Think carefully and answer with a few words.
Question: What is the value of the key "{question}"?
Answer:

(c) An example of zero-shot prompting for DocVQA dataset

Given the context:
{context}
Use few words to answer the question: {question}
Answer:

(d) An example of zero-shot prompting for FeTaQA dataset

Given a table:
{context}
Answer questions about the table.
Note: think step by step.
Question: {question}
Answer:

(e) An example of 3-shot prompting for rephrasing answers in DocVQA dataset
Given the question and answer pair, rephrase the answer to provide
the most straightforward response to the question with few words in
English.
Example 1:
Question: What is the name of the person in the CC field?
Answer: The name of the person in the CC field is Jo Spach.
Rephrased answer: Jo Spach
Example 2:
Question: What is the given document about?
Answer: The given document appears to be a summary of an
evaluation survey conducted by Telmark in a particular monthly
region in 2014. The survey aimed to evaluate the effectiveness
of Telmark’s promotional programs in the region. The document
provides information on various aspects of the survey, including the
number of stores that received promotional materials, the percentage
of stores that placed the materials in a visible location, and the
number of stores that participated in the promotion. Additionally,
the document includes information on the wholesale accounts sold
by Telmark in the region and the percentage of accounts that refused
the promotion.
Rephrased answer: region monthly telmark program evaluation
survey
Example 3:
Question: What is the % of Employees in 2012 based on graph
’Distribution of Value-Added’?
Answer: Based on the graph ’Distribution of Value-Added’, it can be
observed that the percentage of employees in 2012 is around 80%.
Rephrased answer: 80%
Now rephrase the answer based on the QA pair:
Question: {question}
Answer: {answer}
Rephrased answer:
(f) Prompt template for Llama2
<s>[INST] <<SYS>>
{system prompt}
<</SYS>>
{instruction} [/INST]
Figure 8: Prompt designs for different datasets.
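For reference, the Llama2 template in (f) and the decoding parameters listed above can be combined roughly as follows (an illustrative sketch assuming the Hugging Face transformers API; the checkpoint name, system prompt, and example context are placeholders, not values taken from the paper):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def llama2_prompt(system_prompt, instruction):
    """Wrap an instruction in the Llama2 template shown in (f)."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"


# Instantiate the DocVQA zero-shot prompt from (c) with toy values.
instruction = (
    "Given the context:\n{context}\n"
    "Use few words to answer the question: {question}\nAnswer:"
).format(context="Invoice No. 1234    Date: 2015-06-01",
         question="What is the invoice number?")

prompt = llama2_prompt("You are a helpful assistant.", instruction)
# "<s>" is written in the template, so the tokenizer's special tokens are skipped.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,        # decoding settings reported above
    top_p=0.85,
    repetition_penalty=1.05,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))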
Appendix F Examples of Different Table Encoding Methods

Figure 9 shows examples of different table encoding methods. The widely used table encoding methods include: arranging data in array format (Array), using unique identifiers to distinguish between headers and rows (Linear), and formatting each element as a column-row-value triplet to form a list (Triplet).
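As a concrete illustration, the "Linear", "Triplet", and "Ours" encodings can be produced along the following lines (an illustrative sketch, not the exact formatting code used in the experiments; the "Array" encoding is essentially the string form of the nested list itself):

def to_linear(table):
    """'Linear' encoding: [HEAD] for the header and [ROW] i for each data row."""
    header, *rows = table
    lines = ["[HEAD] " + " | ".join(header)]
    lines += [f"[ROW] {i} " + " | ".join(row) for i, row in enumerate(rows, start=1)]
    return "\n".join(lines)


def to_triplet(table):
    """'Triplet' encoding: one "Row-i | column | value" line per cell."""
    header, *rows = table
    return "\n".join(f"Row{i} | {col} | {val}"
                     for i, row in enumerate(rows, start=1)
                     for col, val in zip(header, row))


def to_ours(table, gap=2):
    """'Ours' encoding: pad each column to its maximum width and align with spaces."""
    widths = [max(len(row[j]) for row in table) for j in range(len(table[0]))]
    return "\n".join((" " * gap).join(cell.ljust(widths[j])
                                      for j, cell in enumerate(row)).rstrip()
                     for row in table)


table = [
    ["Year", "Title", "Role", "Channel"],
    ["2015", "Kuch Toh Hai Tere Mere Darmiyaan", "Sanjana Kapoor", "Star Plus"],
    ["2016", "Kuch Rang Pyar Ke Aise Bhi", "Khushi", "Sony TV"],
    ["2016", "Gangaa", "Aashi Jhaa", "&TV"],
]
print(to_ours(table))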
(a) Array
[['Year', 'Title', 'Role', 'Channel'],
['2015', 'Kuch Toh Hai Tere Mere Darmiyaan', 'Sanjana Kapoor', 'Star Plus'],
['2016', 'Kuch Rang Pyar Ke Aise Bhi', 'Khushi', 'Sony TV'],
['2016', 'Gangaa', 'Aashi Jhaa', '&TV']]
(b) Linear
[HEAD] Year | Title | Role | Channel
[ROW] 1 2015 | Kuch Toh Hai Tere Mere Darmiyaan | Sanjana Kapoor | Star Plus
[ROW] 2 2016 | Kuch Rang Pyar Ke Aise Bhi | Khushi | Sony TV
[ROW] 3 2016 | Gangaa | Aashi Jhaa | &TV
(c) Triplet
Row1 | Year | 2015
Row1 | Title | Kuch Toh Hai Tere Mere Darmiyaan
Row1 | Role | Sanjana Kapoor
Row1 | Channel | Star Plus
Row2 | Year | 2016
Row2 | Title | Kuch Rang Pyar Ke Aise Bhi
Row2 | Role | Khushi
Row2 | Channel | Sony TV
Row3 | Year | 2016
Row3 | Title | Gangaa
Row3 | Role | Aashi Jhaa
Row3 | Channel | &TV
(d) Ours
Year  Title                             Role            Channel
2015  Kuch Toh Hai Tere Mere Darmiyaan  Sanjana Kapoor  Star Plus
2016  Kuch Rang Pyar Ke Aise Bhi        Khushi          Sony TV
2016  Gangaa                            Aashi Jhaa      &TV
Figure 9: Different table encoding methods: "Array," which transforms the original array table data into string format; "Linear," which employs
distinct identifiers to differentiate headers and rows; "Triplet," which formats each element as a col-row-value triplet to create a list; and "Ours,"
which utilizes spaces and line breaks to align and separate elements within the table.