1. Change the key-value relations to the QA format: "What is the value of the key '{key}'?".
2. Remove invalid QA pairs, including those with empty or invalid values and nested key-key-value relations.
3. Rewrite answers with multiple options to the selected one, such as changing "✓A □ B" to "A".

This modified dataset is named XfundQA. Since LLMs' outputs are usually long, we use recall as the evaluation metric, considering a prediction correct if the ground truth appears completely in the LLM's output.

FeTaQA A table QA dataset consisting of free-form table questions that require deep reasoning and understanding. Most questions are based on discontinuous blocks of information in the table. We conduct evaluations on the test set containing 2,003 samples. Consistent with the dataset's conventions, we use ROUGE-L [22] and BLEU-4 [29] as the evaluation metrics.

[Figure 2: A pair example from the TextLayoutQA dataset with (a) and without (b) text layout; both variants share the same QA set (c). Panel (b) "Strip" presents four shopping lists (A, B, C, D) with their products collapsed onto a single line; panel (c) contains the QA set, e.g., "What products do shopping list B contain?" (answer: ["lenses"]), "What products do shopping list B and A contain?" (answer: ["lenses", "footwear", "movies", "walkers", "jet skis"]), and "What products do shopping list in the bottom-right corner contain?" (answer: ["animal clothes", "bulbs"]).]
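For concreteness, the following minimal sketch (ours, for illustration; the list names, product pool, and column width are arbitrary choices rather than the released generator) renders a set of shopping lists once with two-dimensional layout and once with the layout stripped, in the spirit of Figure 2:

import re
import random

PRODUCTS = ["footwear", "lenses", "movies", "walkers", "jet skis",
            "fortified wines", "animal clothes", "bulbs"]  # illustrative pool

def make_lists(names=("A", "B", "C", "D"), seed=0):
    """Randomly assign a few products to each named shopping list."""
    rng = random.Random(seed)
    pool = PRODUCTS[:]
    rng.shuffle(pool)
    lists, i = {}, 0
    for name in names:
        k = rng.randint(1, 2)
        lists[name] = pool[i:i + k]
        i += k
    return lists

def render_layout(lists, col_width=18):
    """(a) Layout: list names as column headers, products stacked below them."""
    names = list(lists)
    depth = max(len(v) for v in lists.values())
    rows = [[name.ljust(col_width) for name in names]]
    for r in range(depth):
        rows.append([(lists[n][r] if r < len(lists[n]) else "").ljust(col_width)
                     for n in names])
    return "\n".join("".join(row).rstrip() for row in rows)

def strip_layout(text):
    """(b) Strip: collapse runs of spaces and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

lists = make_lists()
layout = render_layout(lists)
print(layout)                # with layout
print(strip_layout(layout))  # without layout

Because both renderings share the same QA set, any performance gap between them isolates the contribution of the layout itself.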
DocVQA A document QA dataset consisting of printed and typed text as well as scanned documents with various layouts, some of which also include handwritten data. Evaluations are performed on the test set containing 5,188 samples. Following the dataset's conventions, ANLS is used as the evaluation metric.

[Figure 3: An example of the instruction-table dataset.]
Given a table:
Year        Title          Role
2009-2013   We Speak NYC   Jorge / Fredy
2014-2019   Broad City     Jaime Castro
2015-2016   Alternatino    Arturo
2017        No Activity    Pedro
2019        Alternatino    Arturo
Question: What is the Role of Year 2009-2013?
Answer: Jorge / Fredy

Instruction-basic dataset An instruction-tuning dataset designed to diminish the text layout understanding capability of LLMs. Specifically, we randomly select 100k bilingual (English and Chinese) instances from publicly available instruction-tuning datasets [37, 14, 46, 16, 7, 49, 8, 43, 19, 31], deliberately excluding instances that contain consecutive whitespace (three or more spaces, or two or more tabs), to form the instruction-basic dataset. The distribution of each sub-dataset in the instruction-basic dataset is shown in Table 1.

Table 1: Distribution of each sub-dataset in the instruction-basic dataset.
Dataset            Num      Ratio/%
MOSS               56,195   56.19
belle              20,881   20.88
firefly             8,929    8.92
CSL                 3,289    3.28
hh-rlhf             2,234    2.23
COI                 2,104    2.10
HC3                 1,577    1.57
Chain-of-Thought    1,200    1.20
prosocial-dialog      963    0.96
alpacaGPT4            851    0.85
gpt4tools             555    0.55
GPTeacher             431    0.431
alpaca                414    0.414
webGPT                173    0.173
dolly                 128    0.128
Auto-CoT               59    0.059
GAOKAO                 17    0.017
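As a point of reference, the consecutive-whitespace exclusion described above for the instruction-basic dataset can be sketched as a simple filter (our illustration; the authors' exact pattern is not published):

import re

# Reject instances containing three or more consecutive spaces or two or more
# consecutive tabs, so the instruction-basic dataset carries no layout-like whitespace.
HAS_LAYOUT_WHITESPACE = re.compile(r" {3,}|\t{2,}")

def keep_instance(text: str) -> bool:
    return HAS_LAYOUT_WHITESPACE.search(text) is None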
… corpora used for LLMs. Finally, we explore the type of training corpora that fosters the text layout understanding capability.

Because the base models lack the ability to strictly follow instructions, their outputs are difficult to align with the references of the QA tasks. We therefore employ perplexity as the metric to ensure a fair comparison of layout understanding capability between base and chat models. Perplexity is a widely used metric for assessing language models; lower perplexity indicates better modeling performance. Comparing the perplexity of different LLMs on the TextLayoutQA dataset with and without text layout yields the results presented in Table 6. Notably, all the base models exhibit lower perplexity on text with layout than on text without layout, suggesting that the base models already acquire some level of text layout understanding during the pre-training stage. Following instruction-tuning, all the chat models show an even larger perplexity reduction on text with layout relative to text without layout than their base counterparts, indicating that instruction-tuning further enhances the text layout understanding capability. It should be noted that, to mitigate the influence of context length on perplexity, newline markers are used for padding at the beginning of the text without layout, with the padding length being the difference between the tokenized lengths of the text with and without layout.

Table 6: Perplexity of different LLMs on the TextLayoutQA dataset with (Layout) and without (Strip) text layout. Lower perplexity indicates better modeling performance.
LLMs            Type   Strip   Layout   Difference
ChatGLM3-6B     Base   6.87    4.98     -1.89
ChatGLM3-6B     Chat   5.58    3.56     -2.02
Llama2-7B       Base   2.33    1.85     -0.48
Llama2-7B       Chat   2.95    2.26     -0.69
Llama2-13B      Base   2.15    1.81     -0.34
Llama2-13B      Chat   3.06    2.27     -0.79
Baichuan2-7B    Base   1.90    1.40     -0.50
Baichuan2-7B    Chat   3.09    1.53     -1.56
Baichuan2-13B   Base   1.89    1.33     -0.56
Baichuan2-13B   Chat   3.03    1.35     -1.68
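As a concrete illustration of this protocol, the comparison can be sketched with the Hugging Face transformers API as follows (a sketch only: the checkpoint name is a placeholder, and padding with literal newline characters only approximates the token-level padding rule described above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    """Standard causal-LM perplexity: exp of the mean token-level negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def compare(model, tokenizer, layout_text, strip_text, pad_char="\n"):
    """Pad the strip variant with newlines so both variants have similar token length.
    Note: repeated '\\n' characters may not map one-to-one onto tokens for every
    tokenizer, so this is an approximation of the padding rule described above."""
    n_layout = len(tokenizer(layout_text).input_ids)
    n_strip = len(tokenizer(strip_text).input_ids)
    padded_strip = pad_char * max(0, n_layout - n_strip) + strip_text
    return {"Strip": perplexity(model, tokenizer, padded_strip),
            "Layout": perplexity(model, tokenizer, layout_text)}

# Example usage with a placeholder checkpoint identifier:
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
# print(compare(model, tok, layout_text, strip_text))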
derstanding capability in the text-rich VQA domain. We introduce
Table 7 presents the training corpora utilized by various LLMs a method named textLayoutParser designed to parse texts with di-
during the pre-training stage. Notably, the training corpora for GLM3 verse layouts from documents, including plain texts, forms, tables,
and Llama2 are not explicitly published, so related information about images, and their combinations. The method involves the place-
GLM-130b and Llama is considered. We find that GLM, Llama, and ment of text on a two-dimensional character canvas according to
GPT-3 all use datasets such as CommonCrawl, Wikipedia, and Books the text’s coordinates. Detailed implementation is available in Ap-
(Pile includes CommonCrawl, Wikipedia, and Books) in their pre- pendix. We evaluate the zero-shot performance on the test sets of
training. CommonCrawl is a large-scale, unstructured, multilingual three datasets—XfundQA, DocVQA, and FeTaQA. The prompts uti-
web dataset containing over 8 years of web crawler data. Addition- lized for each dataset are provided in Appendix.
ally, GLM and Llama utilize code-related sources like GitHub and XfundQA We use the OCR output provided by the dataset and
StackExchange. We do find some examples with various text layouts construct corpora with text layout using textLayoutParser. As a com-
sourced from GitHub and StackExchange within the Pile dataset. The parison, we replace consecutive spaces and newlines with a single
specific examples can be referred to Appendix. space marker, forming corpora without text layout. The evaluation
We perform instruction-tuning on the instruction-basic, results of different LLMs on XfundQA with and without text layout
instruction-code, instruction-table and instruction-generated dataset are presented in Table 11. Notably, corpora with text layout lead to
using Firefly [46] tuning framework. Each dataset is partitioned into performance improvements ranging from 1.96% to 9.55% compared
training and validation sets with a ratio of 98:2. The training sets to corpora without text layout.
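For clarity, the two preprocessing variants and the recall-style check used for XfundQA can be sketched as follows (an illustration of the described procedure, not the authors' code; normalize is a hypothetical helper):

import re

def to_strip(text: str) -> str:
    # Corpus without layout: collapse consecutive spaces and newlines into one space.
    return re.sub(r"[ \n]+", " ", text).strip()

def normalize(s: str) -> str:
    # Illustrative normalization before matching.
    return re.sub(r"\s+", " ", s).strip().lower()

def is_correct(ground_truth: str, llm_output: str) -> bool:
    # XfundQA metric: a prediction counts as correct if the ground truth
    # appears completely in the LLM's (typically long) output.
    return normalize(ground_truth) in normalize(llm_output)

def recall(samples):
    # samples: iterable of (ground_truth, llm_output) pairs.
    samples = list(samples)
    return sum(is_correct(gt, out) for gt, out in samples) / len(samples)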
Table 9: Evaluation results, measured by ROUGE-L and BLEU-4, of different LLMs on the FeTaQA test set with (Layout) and without (Strip) text layout.
LLMs            ROUGE-L (Strip, Layout, Difference)    BLEU-4 (Strip, Layout, Difference)
ChatGLM3-6B 28.79 31.28 +2.49 10.84 11.08 +0.24
Llama2-7B 19.71 24.03 +4.32 5.82 7.67 +1.85
Llama2-13B 27.07 30.49 +3.42 9.14 10.91 +1.77
Baichuan2-7B 32.26 34.26 +2.00 12.15 13.57 +1.42
Baichuan2-13B 34.46 39.15 +4.69 14.04 16.51 +2.47
GPT-3.5-Turbo 39.05 39.76 +0.71 16.21 16.63 +0.42
Table 10: Evaluation results, measured by ROUGE-L and BLEU-4, of different table encoding methods on FeTaQA test set: "Array," which
transforms the original array table data into string format; "Linear," which employs distinct identifiers to differentiate headers and rows;
"Triplet," which formats each element as a col-row-value triplet to create a list; and "Ours," which utilizes spaces and newlines to align and
separate elements within the table.
LLMs            ROUGE-L (Array, Linear, Triplet, Ours)    BLEU-4 (Array, Linear, Triplet, Ours)
ChatGLM3-6B 28.79 31.01 31.25 31.28 10.84 10.85 11.05 11.08
Llama2-7B 19.71 23.69 22.84 24.03 5.82 7.63 6.98 7.67
Llama2-13B 27.07 28.80 26.40 30.49 9.14 9.92 9.22 10.91
Baichuan2-7B 32.26 31.87 31.03 34.26 12.15 12.21 11.55 13.57
Baichuan2-13B 34.46 40.08 32.57 39.15 14.04 16.94 12.53 16.51
GPT-3.5-Turbo 39.05 35.21 36.88 39.76 16.21 14.15 14.96 16.63
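To make the four encodings concrete, the following sketch shows how they could be produced from an array-format table (illustrative functions written from the caption above, not the authors' implementation; the Linear and Triplet forms mirror the examples in Figure 9 of the Appendix):

def encode_array(table):
    # "Array": the original array table data as a string.
    return str(table)

def encode_linear(table):
    # "Linear": distinct identifiers mark the header and each row.
    header, *rows = table
    lines = ["[HEAD] " + " | ".join(header)]
    lines += [f"[ROW] {i} " + " | ".join(row) for i, row in enumerate(rows, 1)]
    return "\n".join(lines)

def encode_triplet(table):
    # "Triplet": each cell becomes a row-column-value triplet.
    header, *rows = table
    return "\n".join(f"Row{i} | {col} | {val}"
                     for i, row in enumerate(rows, 1)
                     for col, val in zip(header, row))

def encode_ours(table):
    # "Ours": spaces align columns and newlines separate rows.
    widths = [max(len(row[j]) for row in table) for j in range(len(table[0]))]
    return "\n".join("  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
                     for row in table)

table = [["Year", "Title", "Role"],
         ["2009-2013", "We Speak NYC", "Jorge / Fredy"],
         ["2014-2019", "Broad City", "Jaime Castro"]]
print(encode_ours(table))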
Table 11: Evaluation results, measured by recall, of different LLMs on XfundQA with (Layout) and without (Strip) text layout.
LLMs            Strip   Layout   Difference
ChatGLM3-6B     60.13   66.18    +6.05
Llama2-7B       57.41   66.96    +9.55
Llama2-13B      58.92   66.60    +7.68
Baichuan2-7B    64.70   66.66    +1.96
Baichuan2-13B   67.38   73.27    +5.89
GPT-3.5-Turbo   76.67   77.50    +3.03
DocVQA We use the OCR output provided by the dataset and construct corpora with text layout using textLayoutParser. For comparison, consecutive spaces and newlines are replaced with a single space marker, forming corpora without text layout. Table 12 shows the evaluation results of different LLMs on the DocVQA test set with and without text layout. Compared to corpora without text layout, different LLMs achieved performance improvements of 2.67% to 4.27% on corpora with text layout.

Table 12: Evaluation results, measured by ANLS, of different LLMs on the DocVQA test set with (Layout) and without (Strip) text layout.
LLMs            Strip   Layout   Difference
ChatGLM3-6B     44.60   48.30    +3.70
Llama2-7B       38.50   41.81    +3.31
Llama2-13B      41.33   44.42    +3.09
Baichuan2-7B    33.50   36.17    +2.67
Baichuan2-13B   38.75   41.80    +3.05
GPT-3.5-Turbo   62.68   66.95    +4.27
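ANLS (Average Normalized Levenshtein Similarity) is the conventional DocVQA metric; a self-contained sketch of its usual formulation, with the customary 0.5 threshold, is given below (this reflects the metric's standard definition rather than code from this paper):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def nls(answer: str, prediction: str) -> float:
    a, p = answer.strip().lower(), prediction.strip().lower()
    if not a and not p:
        return 1.0
    sim = 1.0 - levenshtein(a, p) / max(len(a), len(p))
    return sim if sim >= 0.5 else 0.0  # scores below the 0.5 threshold count as 0

def anls(samples):
    # samples: iterable of (list_of_gold_answers, prediction) pairs.
    samples = list(samples)
    return sum(max(nls(g, pred) for g in golds) for golds, pred in samples) / len(samples)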
FeTaQA The FeTaQA dataset provides tables in array format; we convert the array table data into string format to serve as corpora without text layout, while corpora refactored by textLayoutParser serve as corpora with text layout. Table 9 presents the evaluation results of different LLMs on the FeTaQA test set with and without text layout. Notably, the various LLMs show performance enhancements ranging from 0.71% to 4.69% (ROUGE-L) and 0.24% to 2.47% (BLEU-4) on corpora with text layout, compared to those without.
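The two metrics are available in common open-source packages; since the paper does not specify its implementation, the following is one reasonable choice (rouge-score and NLTK):

from rouge_score import rouge_scorer                                     # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction   # pip install nltk

def rouge_l(prediction: str, reference: str) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def bleu_4(prediction: str, reference: str) -> float:
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], prediction.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)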
Different text layout encoding methods are tailored to specific cases. For instance, in the context of table QA, common table encoding techniques include employing identifiers to distinguish headers and rows (referred to as Linear) [23, 15] and representing each element as a col-row-value triplet to create a list (referred to as Triplet) [40]. Apart from our proposed method, we explore several other text layout encoding techniques for an ablation study; examples of the different table encoding methods can be found in the Appendix. Table 10 provides a performance assessment of the various table encoding methods on the FeTaQA test set. Our proposed method outperforms the others for ChatGLM3-6B, Llama2-7B, and GPT-3.5-Turbo. Conversely, for Baichuan2-13B, the Linear encoding method demonstrates superior results.

5 Conclusion

This study extensively investigates the potential of LLMs in text layout understanding by constructing the TextLayoutQA dataset for in-depth research. Experiments utilizing various LLMs demonstrate that, compared to text without layout, the performance of LLMs on datasets with text layout improves by 8∼25%, confirming their potential in text alignment, layout, and orientation understanding. Additional experiments show that the base models already possess preliminary text layout understanding capabilities after the pre-training phase, and that these capabilities are further enhanced during instruction-tuning. Through ablation experiments with diverse instruction-tuning datasets, we find that training data is crucial for LLMs to acquire text layout understanding, particularly datasets containing text layouts such as code. In addition, text layout understanding can be enhanced by low-cost auto-generated data produced via a novel text game. Subsequently, leveraging the text layout understanding capabilities of LLMs, we propose an approach named textLayoutParser to address text-rich VQA problems, achieving decent performance improvements on the XfundQA, FeTaQA, and DocVQA datasets. In summary, our research unveils underexplored capabilities of LLMs, demonstrating their potential to enhance performance on text-rich VQA problems, expanding the application scenarios of language-centric LLMs, and providing new perspectives for subsequent LLM corpora preparation.

6 Acknowledgments

This research was supported by the "Pioneer" and "Leading Goose" R&D Program of Zhejiang (No. 2024C01020).
References

[1] Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and A. Mulyar. Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo. https://github.com/nomic-ai/gpt4all, 2023.
[2] D. Bayani. Testing the depth of chatgpt's comprehension via cross-modal tasks based on ascii-art: Gpt3.5's abilities in regard to recognizing and generating ascii-art are not totally lacking, 2023.
[3] S. Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
[4] L. Chen, L. Wang, H. Dong, Y. Du, J. Yan, F. Yang, S. Li, P. Zhao, S. Qin, S. Rajmohan, et al. Introspective tips: Large language model for in-context decision making, 2023.
[5] S. Di Bartolomeo, G. Severi, V. Schetinger, and C. Dunne. Ask and you shall receive (a graph drawing): Testing chatgpt's potential to apply graph layout algorithms, 2023.
[6] E. Dreibelbis. Chatgpt passes google coding interview for level 3 engineer with $183k salary. https://www.pcmag.com/news/chatgpt-passes-google-coding-interviewfor-level-3-engineer-with-183k-salary.
[7] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
[8] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023.
[9] J. Guo, L. Du, and H. Liu. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking, 2023.
[10] N. Hegde, S. Paul, G. Madan, and G. Aggarwal. Analyzing the efficacy of an llm-only approach for image-based document question answering, 2023.
[11] R. Hu, A. Singh, T. Darrell, and M. Rohrbach. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9992–10002, 2020.
[12] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091, 2022.
[13] M. Hurst and T. Nasukawa. Layout and language: Integrating spatial and linguistic knowledge for layout understanding tasks. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, 2000.
[14] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma, and X. Li. Towards better instruction following language models for chinese: Investigating the impact of training data and evaluation, 2023.
[15] Z. Jiang, Y. Mao, P. He, G. Neubig, and W. Chen. Omnitab: Pretraining with natural and synthetic data for few-shot table-based question answering, 2022.
[16] JosephusCheung. Guanaco - generative universal assistant for natural-language adaptive context-aware omnilingual outputs. https://guanaco-model.github.io/, 2021.
[17] F. Joublin, A. Ceravola, J. Deigmoeller, M. Gienger, M. Franzius, and J. Eggert. A glimpse in chatgpt capabilities and its impact for ai research, 2023.
[18] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
[19] H. Kim, Y. Yu, L. Jiang, X. Lu, D. Khashabi, G. Kim, Y. Choi, and M. Sap. Prosocialdialog: A prosocial backbone for conversational agents. In EMNLP, 2022.
[20] T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
[21] K. Lee, M. Joshi, I. R. Turc, H. Hu, F. Liu, J. M. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[22] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
[23] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J.-G. Lou. Tapex: Table pre-training via learning a neural sql executor, 2021.
[24] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
[25] Y. Liu, Z. Li, H. Li, W. Yu, M. Huang, D. Peng, M. Liu, M. Chen, C. Li, L. Jin, et al. On the hidden mystery of ocr in large multimodal models, 2023.
[26] M. Mathew, D. Karatzas, and C. Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
[27] L. Nan, C. Hsieh, Z. Mao, X. V. Lin, N. Verma, R. Zhang, W. Kryściński, H. Schoelkopf, R. Kong, X. Tang, et al. Fetaqa: Free-form table question answering. Transactions of the Association for Computational Linguistics, 10:35–49, 2022.
[28] K. O'Riordan. Ascii art. https://www.britannica.com/topic/ASCIIart.
[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[30] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables, 2015.
[31] B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[32] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[33] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[34] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units, 2015.
[35] Y. Shi, H. Ma, W. Zhong, G. Mai, X. Li, T. Liu, and J. Huang. Chatgraph: Interpretable text classification by converting chatgpt knowledge to graphs, 2023.
[36] J. R. Smith, H. Saint-Amand, M. Plamada, P. Koehn, C. Callison-Burch, and A. Lopez. Dirt cheap web-scale parallel text from the common crawl. Association for Computational Linguistics, 2013.
[37] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan, X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng, Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu, Z. Yin, X. Huang, and X. Qiu. Moss: Training conversational language models from synthetic data. 2023.
[38] Z. Tang, Z. Yang, G. Wang, Y. Fang, Y. Liu, C. Zhu, M. Zeng, C. Zhang, and M. Bansal. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
[39] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
[40] S. Vakulenko and V. Savenkov. Tableqa: Question answering on tabular data, 2017.
[41] H. Wang, S. Feng, T. He, Z. Tan, X. Han, and Y. Tsvetkov. Can language models solve graph problems in natural language?, 2023.
[42] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, 2022.
[43] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
[44] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding, 2021.
[45] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, et al. Baichuan 2: Open large-scale language models, 2023.
[46] J. Yang. Firefly. https://github.com/yangjianxin1/Firefly, 2023.
[47] Y. Ye, H. You, and J. Du. Improved trust in human-robot collaboration with chatgpt. IEEE Access, 2023.
[48] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model, 2022.
[49] G. Zhang, Y. Shi, R. Liu, R. Yuan, Y. Li, S. Dong, Y. Shu, Z. Li, Z. Wang, C. Lin, W. Huang, and J. Fu. Chinese open instruction generalist: A preliminary release, 2023.
[50] J. Zhang. Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt, 2023.
[51] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang. Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, 2023.
Appendix A Examples of LLMs Text Layout Understanding Capability

During the early exploration of GPT-3.5-Turbo's QA ability, its remarkable capability to comprehend text alignment, layout, and orientation was discovered. Figure 6 shows some examples of the exploration.

[Figure 6: examples from the early exploration. Example 1 arranges the words "Tom", "Jean", "Thomas", "Lee", and "David" in a cross-shaped layout and asks "What is the text in the center?"; the model answers "The text in the center is 'Thomas'". Example 2 draws two ASCII bounding boxes of different sizes and asks "Which bbox is larger, left or right?", with the hint that the bbox with more whitespace inside is larger; the model answers that the left bbox is larger. Example 3 defines the visual representations of 0, 1, and 2 as 5x5 binary matrices and asks for the representations of "3" and "x", which the model produces.]

Appendix B Corpora with Layout Information on GitHub and StackExchange

By searching for data within Pile that potentially contains text layout information, we discover considerable relevant data from sources like GitHub and StackExchange. Figure 7 shows some examples.

Appendix C Test results of using different types of instructions to tune LLMs

The test results of using different types of instructions to tune LLMs are presented in Table 13. Given that LLMs often produce long responses that do not align with the ground truth of the table subset, recall yields more reasonable results than ROUGE-L. It can be observed that, compared to the chat model, the code capability significantly decreases after tuning on instruction-basic, but it substantially recovers after tuning on instruction-code, except for Baichuan2-7B. In contrast to the model tuned on instruction-basic, table capability gains a considerable improvement after tuning on instruction-table, while layout capability obtains a remarkable improvement following instruction-generated tuning.

Appendix D Method of textLayoutParser

The implementation of textLayoutParser includes four steps: text parsing, determination of unit character size and coordinate conversion, filling text into the character matrix, and conversion of the character matrix to plain text.

Text Parsing Utilize appropriate parsing methods based on different file formats to obtain text content and the corresponding positional coordinates. For example, OCR can be used to extract text and coordinates from images, while the PyMuPDF Python library can be employed to parse PDF files. For table data, we generate bounding boxes (bboxes) for each element in the table, including headers and cells, based on coordinates and text length. The generation process is as follows. Each character is treated as a unit character, with an assumed spacing of 2 between adjacent elements in the same row and 1 between adjacent elements in the same column. Let l denote the maximum text length over all elements in the jth column, and let the bbox of the element in the ith row and (j-1)th column be [x1, i, x2, i+1]. Then the bbox of the element in the ith row and jth column (Vij for short) is [x2 + 2, i, x2 + 2 + l, i + 1].
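A minimal sketch of this bbox-generation rule for table elements (our illustration of the recurrence above, not the released implementation):

def table_bboxes(table, col_gap=2):
    """Assign a character-level bbox [x1, y1, x2, y2] to every table element,
    following the rule above: row i occupies y in [i, i+1], each column is as
    wide as its longest entry, and adjacent elements in a row are 2 apart."""
    col_widths = [max(len(row[j]) for row in table) for j in range(len(table[0]))]
    bboxes = []
    for i, row in enumerate(table):
        x, row_boxes = 0, []
        for j, _ in enumerate(row):
            row_boxes.append([x, i, x + col_widths[j], i + 1])
            x += col_widths[j] + col_gap
        bboxes.append(row_boxes)
    return bboxes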
Determination of Unit Character Size and Coordinate Conversion Determine a unit character size by analyzing the sizes of all text characters, filtering out characters smaller than this unit size. The other text coordinates are then converted using this unit character size. Define a text t with length n and bbox coordinates (x1, y1, x2, y2). The approximate character width and height can be calculated as (x2 - x1)/n and y2 - y1, respectively. Let the unit character's width be x0 and its height be y0. The coordinates of t after conversion become (x1/x0, y1/y0, x2/x0, y2/y0), rounded to the nearest integer.
Filling Text into the Character Matrix Using the coordinates, insert the text into a character matrix. Initialize a matrix with spaces as elements, setting the numbers of rows and columns to the maximum y-value and x-value after conversion of the text coordinates. Then, sequentially place the text into the corresponding indices of the matrix from left to right to ensure text continuity. For example, if the converted text coordinate is (10, 10, 20, 20) and the text length is 5, each character of the text is placed in the matrix indices (10, 10) to (15, 10) one by one.

Conversion of Character Matrix to Plain Text Convert the character matrix into plain text for LLMs. This process involves joining all characters in each row into one line of text, and then combining all lines of text using a newline character as a separator. In order to reduce the redundancy of dense spaces and newline markers, we remove the first column of each run of at least three consecutive columns entirely filled with spaces, replace rows filled entirely with spaces with a newline character, and replace runs of at least three consecutive newline markers with two newline markers.
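Putting the last three steps together, the canvas construction can be sketched as follows (a compact re-implementation of the description above for illustration; the column-wise whitespace compression is omitted, and all names are ours):

import re

def render_text_layout(texts, unit_w, unit_h):
    """texts: iterable of (string, (x1, y1, x2, y2)) pairs in document coordinates.
    Returns a plain-text canvas following the steps described above."""
    # Coordinate conversion: express each bbox origin in unit-character cells.
    items = [(s, round(x1 / unit_w), round(y1 / unit_h))
             for s, (x1, y1, _x2, _y2) in texts]

    # Fill a space-initialized character matrix, writing each text left to
    # right from its converted origin so its characters stay contiguous.
    n_rows = max(y for _, _, y in items) + 1
    n_cols = max(x + len(s) for s, x, _ in items)
    grid = [[" "] * n_cols for _ in range(n_rows)]
    for s, x, y in items:
        for k, ch in enumerate(s):
            grid[y][x + k] = ch

    # Join rows into lines, then reduce redundant blank lines: all-space rows
    # become empty lines and runs of three or more newlines are replaced by two.
    text = "\n".join("".join(row).rstrip() for row in grid)
    return re.sub(r"\n{3,}", "\n\n", text)

# Example: two words on the same visual line and one word below them.
sample = [("Name:", (0, 0, 50, 10)), ("Alice", (70, 0, 120, 10)),
          ("Date:", (0, 20, 50, 30))]
print(render_text_layout(sample, unit_w=10, unit_h=10))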
[Prompt template used for the TextLayoutQA dataset:]

Given some shopping lists with different products, you are supposed to enumerate the products of specific lists and answer questions in the form of a list, for example: ['a', 'b'], reply with the list only! If you don't know the answer, reply with the empty list [].

For example:
Here are 2 shopping lists (A, B) with different products:
A         B
apple     fish
banana    chair
car

Question: What products do shopping list B contain?
Answer: ['fish', 'chair']

Now answer the question below:
{context}

Question: {question}
Answer:
[Examples of rephrased answers:]

Example 1:
Question: What is the name of the person in the CC field?
Answer: The name of the person in the CC field is Jo Spach.
Rephrased answer: Jo Spach
Example 2:
Question: What is the given document about?
Answer: The given document appears to be a summary of an
evaluation survey conducted by Telmark in a particular monthly
region in 2014. The survey aimed to evaluate the effectiveness
of Telmark’s promotional programs in the region. The document
provides information on various aspects of the survey, including the
number of stores that received promotional materials, the percentage
of stores that placed the materials in a visible location, and the
number of stores that participated in the promotion. Additionally,
the document includes information on the wholesale accounts sold
by Telmark in the region and the percentage of accounts that refused
the promotion.
Rephrased answer: region monthly telmark program evaluation
survey
Example 3:
Question: What is the % of Employees in 2012 based on graph
’Distribution of Value-Added’?
Answer: Based on the graph ’Distribution of Value-Added’, it can be
observed that the percentage of employees in 2012 is around 80%.
Rephrased answer: 80%
(b) Linear
[HEAD] Year | Title | Role | Channel
[ROW] 1 2015 | Kuch Toh Hai Tere Mere Darmiyaan | Sanjana Kapoor | Star Plus
[ROW] 2 2016 | Kuch Rang Pyar Ke Aise Bhi | Khushi | Sony TV
[ROW] 3 2016 | Gangaa | Aashi Jhaa | &TV
(c) Triplet
Row1 | Year | 2015
Row1 | Title | Kuch Toh Hai Tere Mere Darmiyaan
Row1 | Role | Sanjana Kapoor
Row1 | Channel | Star Plus
Row2 | Year | 2016
Row2 | Title | Kuch Rang Pyar Ke Aise Bhi
Row2 | Role | Khushi
Row2 | Channel | Sony TV
Row3 | Year | 2016
Row3 | Title | Gangaa
Row3 | Role | Aashi Jhaa
Row3 | Channel | &TV
(d) Ours
Year   Title                              Role             Channel
2015   Kuch Toh Hai Tere Mere Darmiyaan   Sanjana Kapoor   Star Plus
2016   Kuch Rang Pyar Ke Aise Bhi         Khushi           Sony TV
2016   Gangaa                             Aashi Jhaa       &TV
Figure 9: Different table encoding methods: "Array," which transforms the original array table data into string format; "Linear," which employs
distinct identifiers to differentiate headers and rows; "Triplet," which formats each element as a col-row-value triplet to create a list; and "Ours,"
which utilizes spaces and line breaks to align and separate elements within the table.