2023 Emnlp-Demo 45
Figure 2: Entities are multimodal content units. Here, spans of a sentence are used to identify its text among all symbols, while boxes map its visual coordinates on a page. Spans and boxes can include non-contiguous units, allowing great flexibility in Entities to handle layout nuances. A sentence split across columns/pages and interrupted by floating figures/footnotes would require multiple spans and bounding boxes to represent.

LayoutParser. To allow models of different modalities to work well together, we also developed the magelib library (§3.1).

3 Design of papermage

papermage comprises three parts: (1) magelib, a library for intuitively representing and manipulating visually-rich documents, (2) Predictors, implementations of models for analyzing scientific papers that unify disparate machine learning frameworks under a common interface, and (3) Recipes, combinations of Predictors that form multimodal pipelines.

3.1 Representing and manipulating visually-rich documents with magelib

In this section, we use code snippets to show how our library's abstractions and syntax are tailored for the visually-rich document problem domain.

Data Classes. magelib provides three base data classes for representing fundamental elements of visually-rich, structured documents: Document, Layers, and Entities. First, a Document might minimally store text as a string of symbols:

    >>> from papermage import Document
    >>> doc.symbols
    "Revolt: Collaborative Crowdsourcing ..."

But visually-rich documents are more than a linearized string. For example, analyzing a scientific paper requires access to its visuospatial layout (e.g., pages, blocks, lines), logical structure (e.g., title, abstract, figures, tables, footnotes, sections), semantic units (e.g., paragraphs, sentences, tokens), and more (e.g., citations, terms). In practice, this means different parts of doc.symbols can correspond to different paragraphs, sentences, tokens, etc. in the Document, each with its own set of corresponding coordinates representing its visual position on a page.

magelib represents structure using Layers that can be accessed as attributes of a Document (e.g., doc.sentences, doc.figures, doc.tokens) (Figure 1). Each Layer is a sequence of content units, called Entities, which store both textual (e.g., spans, strings) and visuospatial (e.g., bounding boxes, pixel arrays) information:

    >>> sentences = Layer(entities=[
    ...     Entity(...), Entity(...), ...
    ... ])

See Figure 2 for an example of how "sentences" in a scientific document are represented as Entities. §3.2 explains in more detail how a user can generate Entities.

Methods. magelib also provides a set of functions for building and interacting with data: augmenting a Document with additional Layers, traversing and spatially searching for matching Entities in one Layer, and cross-referencing between Layers (see Figure 3).

A Document that only contains doc.symbols can be augmented with additional Layers:

    >>> paragraphs = Layer(...)
    >>> sentences = Layer(...)
    >>> tokens = Layer(...)
    >>> doc.add(paragraphs, sentences, tokens)

Adding Layers automatically grants users the ability to iterate through Entities and cross-reference intersecting Entities across Layers:

    >>> for paragraph in doc.paragraphs:
    ...     for sentence in paragraph.sentences:
    ...         for token in sentence.tokens:
    ...             ...

magelib also supports cross-modality operations, for example, searching for textual Entities within a visual region on the PDF (see Figure 3 F):

    >>> query = Box(l=423, t=71, w=159, h=87)
    >>> selection = doc.find(query, "tokens")
    >>> [t.text for t in selection]
    ["Techniques", "for", "collecting", ...]
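To make the Layer/Entity data model above concrete, here is a minimal sketch of how spans, boxes, and layers could fit together. All names here (Span, Box, Entity.text, Document.add) are simplified stand-ins for illustration, not papermage's actual API.

```python
from dataclasses import dataclass, field

# Simplified stand-ins for the concepts described above; papermage's
# real classes are richer and have different signatures.

@dataclass(frozen=True)
class Span:
    start: int  # character offset into doc.symbols (inclusive)
    end: int    # character offset into doc.symbols (exclusive)

@dataclass(frozen=True)
class Box:
    l: float
    t: float
    w: float
    h: float
    page: int = 0

@dataclass
class Entity:
    spans: list                               # non-contiguous spans are allowed
    boxes: list = field(default_factory=list)

    def text(self, symbols: str) -> str:
        # Stitch together possibly non-contiguous pieces of the symbol string.
        return " ".join(symbols[s.start:s.end] for s in self.spans)

@dataclass
class Document:
    symbols: str
    layers: dict = field(default_factory=dict)

    def add(self, name: str, entities: list) -> None:
        self.layers[name] = entities

# A sentence split across a column break needs two spans and two boxes:
doc = Document(symbols="end of column one start of column two")
sentence = Entity(
    spans=[Span(0, 17), Span(18, 37)],
    boxes=[Box(l=72, t=700, w=200, h=12), Box(l=306, t=72, w=200, h=12)],
)
doc.add("sentences", [sentence])
```

Because an Entity holds lists of spans and boxes rather than a single range, the same representation covers both contiguous and fragmented content without special cases.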
[Figure 3 shows an excerpt from Chang et al. (2017) annotated with the accessors (A) doc.paragraphs[0]; (B) doc.paragraphs[0].sentences[2] or doc.sentences[2]; (C) doc.sentences[2].tokens[9:13] or doc.tokens[169:173]; (D) doc.figures[0]; (E) doc.captions[0]; and (F) user_query = Box(l, t, w, h, page=0), selected_tokens = doc.find(user_query, layer="tokens"), [token.text for token in selected_tokens] yielding ["Techniques", "for", "collecting", "labeled", "data", "perts", "for", "manual", "annotation", ...].]

Figure 3: Entities can be accessed flexibly in different ways: (A) accessing the Entity of the first paragraph in the Document via its own Layer; (B) accessing a sentence via the paragraph Entity or directly via the sentences Layer; (C) similarly, the same tokens can be accessed via the overlapping sentence Entity or directly via the tokens Layer of the Document (where the first tokens are the title of the paper); (D, E) figures, captions, tables, and keywords can be accessed in similar ways; (F) additionally, given a bounding box (e.g., of a user-selected region), papermage can find the corresponding Entities for a given Layer, in this case finding the tokens under the region. Excerpt from Chang et al. (2017).
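The spatial lookup in (F) boils down to a rectangle-intersection test between a query box and each entity's boxes. Below is an illustrative sketch of that idea using plain (l, t, w, h) tuples; it is an assumption about the approach, not papermage's find() implementation.

```python
# Hypothetical sketch of the box-based lookup in Figure 3 (F): return the
# entities of a layer whose boxes intersect a query rectangle. Boxes are
# (l, t, w, h) tuples here for simplicity; papermage's Box is a class.

def overlaps(a, b):
    # Two rectangles intersect iff they overlap on both the x and y axes.
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2] and
            a[1] < b[1] + b[3] and b[1] < a[1] + a[3])

def find(layer, query):
    return [e for e in layer if any(overlaps(box, query) for box in e["boxes"])]

tokens = [
    {"text": "Techniques", "boxes": [(430, 75, 40, 10)]},    # inside the query
    {"text": "INTRODUCTION", "boxes": [(60, 400, 80, 10)]},  # elsewhere on page
]
query = (423, 71, 159, 87)  # the example box from the text
selection = find(tokens, query)  # keeps only the "Techniques" token
```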
Protocols and Utilities. To instantiate a Document, magelib provides protocols and utilities like Parsers and Rasterizers, which hook into off-the-shelf PDF processing tools:5

    >>> import papermage as pm
    >>> parser = pm.PDF2TextParser()
    >>> doc = parser.parse("...pdf")
    >>> [token.text for token in doc.tokens]
    ["Revolt", ":", "Collaborative", ...]
    >>> doc.images
    None
    >>> rasterizer = pm.PDF2ImageRasterizer()
    >>> doc2 = rasterizer.rasterize("...pdf")
    >>> doc.images = doc2.images
    >>> doc.images
    [Image(np.array(...)), ...]

In this example, papermage runs PDF2TextParser (using pdfplumber) to extract the textual information from a PDF file. Then it runs PDF2ImageRasterizer (using pdf2image) to update the first Document with images of its pages.

5 PDFs are not the only way of representing visually-rich documents. For example, many scientific documents are distributed in XML format. As PDFs are the dominant distribution format of scientific documents, we focus our efforts on PDF-specific needs. Nevertheless, we also provide Parsers in magelib that can instantiate a Document from XML input. See Appendix A.1.

3.2 Interfacing with models for scientific document analysis through Predictors

In §3.1, we described how users create Layers by assembling collections of Entities. But how would they make Entities in the first place?

For example, to identify multimodal structures in visually-rich documents, researchers might want to build complex pipelines that run and combine output from many different models (e.g., computer vision models for extracting figures, NLP models for classifying body text). papermage provides a unified interface, called Predictors, to ensure models produce Entities that are compatible with the Document.

papermage includes several ready-to-use Predictors that leverage state-of-the-art models to extract specific document structures (Table 1). While magelib's abstractions are general for visually-rich documents, Predictors are optimized for parsing of scientific documents. They are designed to (1) be compatible with models from many different machine learning frameworks, (2) support inference with text-only, vision-only, and multimodal models, and (3) support both adaptation of off-the-shelf, pretrained models as well as
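The unified-interface idea can be pictured as an abstract base class that every wrapped model implements, so that heterogeneous frameworks all emit the same entity-shaped output. The class names and the regex-based example below are illustrative assumptions, not papermage's actual Predictor API.

```python
import re
from abc import ABC, abstractmethod

# Illustrative sketch of a unified predictor interface; papermage's real
# Predictor classes differ in names and signatures.

class Predictor(ABC):
    @abstractmethod
    def predict(self, doc: dict) -> list:
        """Return entity-like dicts with character spans into doc['symbols']."""

class NaiveSentencePredictor(Predictor):
    # Stand-in for a wrapped model (e.g., a spaCy pipeline): split on
    # periods and record the character span of each piece.
    def predict(self, doc: dict) -> list:
        return [{"spans": [(m.start(), m.end())]}
                for m in re.finditer(r"[^.]+\.?", doc["symbols"])]

doc = {"symbols": "First sentence. Second sentence."}
sentences = NaiveSentencePredictor().predict(doc)  # two span-bearing entities
```

Because every wrapper returns the same entity-shaped output regardless of the underlying framework, downstream code can consume predictions uniformly.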
Use case / Description / Examples:

Linguistic/Semantic. Segments doc into text units often used for downstream models. Examples: SentencePredictor wraps sciSpaCy (Neumann et al., 2019) and PySBD (Sadvilkar and Neumann, 2020) to segment sentences. WordPredictor is a custom scikit-learn model to identify broken words split across PDF lines or columns. ParagraphPredictor is a set of heuristics on top of both layout and logical structure models to extract paragraphs.

Layout Structure. Segments doc into visual block regions. Examples: BoxPredictor wraps models from LayoutParser (Shen et al., 2021), which provides vision models like EfficientDet (Tan et al., 2020) pretrained on scientific layouts (Zhong et al., 2019).

Logical Structure. Segments doc into organizational units like title, abstract, body, footnotes, caption, and more. Examples: SpanPredictor wraps Token Classifiers from Transformers (Wolfe et al., 2022), which provides both pretrained weights from VILA (Shen et al., 2022), as well as RoBERTa (Liu et al., 2019) and SciBERT (Beltagy et al., 2019) weights that we've finetuned on similar data.

Task-specific. Models for a given scientific document processing task can be used with papermage if wrapped as a Predictor following common patterns. Examples: As many practitioners depend on prompting a model through an API call, we implement APIPredictor, which interfaces with external APIs, such as GPT-3 (Brown et al., 2020), to perform tasks like question answering over a structured Document. We also implement SnippetRetrievalPredictor, which wraps models like Contriever (Izacard et al., 2022) to perform top-k within-document snippet retrieval. See §4 for how these two can be combined.
    Model       Full              Grobid Subset
                P     R     F1    P     R     F1
    GrobidCRF   40.6  38.3  39.1  81.2  76.7  78.9
    GrobidNN    42.0  36.5  37.6  84.1  73.0  78.2
    RoBERTa     75.9  80.0  76.8  82.6  83.9  83.2
    I-VILA      92.0  94.1  92.7  92.2  95.2  93.7

Predictors return a list of Entities, which can be grouped with group_by() to organize them based on predicted label value (e.g., tokens classified as "title" or "authors"). Finally, these predictions are passed to doc.annotate() to be added to the Document.
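Grouping predictions by label before annotation can be sketched as follows; group_by here is a hypothetical free function, not papermage's exact signature.

```python
from collections import defaultdict

# Illustrative sketch: organize predicted entities by their label value
# before attaching them to the Document. Names are assumptions.

def group_by(entities, key):
    groups = defaultdict(list)
    for entity in entities:
        groups[entity[key]].append(entity)
    return dict(groups)

predictions = [
    {"text": "papermage", "label": "title"},
    {"text": "Kyle Lo", "label": "authors"},
    {"text": "Zejiang Shen", "label": "authors"},
]
by_label = group_by(predictions, key="label")  # {"title": [...], "authors": [...]}
```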
Ethical Considerations

As a toolkit primarily designed to process scientific documents, there are two areas where papermage could cause harm or have unintended effects.

Extraction of bibliographic information. papermage could be used to parse author names, affiliations, and emails from scientific documents. Like any software, this extraction can be noisy, leading to incorrect parsing and thus mis-attribution of manuscripts. Further, since papermage relies on static PDF documents, rather than metadata dynamically retrieved from publishers, users of papermage need to consider how and when extracted names should no longer be associated with authors, a harmful practice called deadnaming (Queer in AI et al., 2023). We recommend that papermage users exercise caution when using our toolkit to extract metadata, cross-reference extracted content with other sources when possible, and design systems such that authors have the ability to manually edit any data about themselves.

Misrepresentation or fabrication of information in documents. In §3, we discussed how papermage can be easily extended to support high-level applications. Such applications might include question answering chatbots, or AI summarizers that perform information synthesis over one or more papermage documents. Such applications typically rely on generative models to produce their output, which might fabricate incorrect information or misstate claims. Developers should be vigilant when integrating papermage output into any downstream application, especially in systems that purport to represent information gathered from scientific publications.

Acknowledgements

We thank our teammates at Semantic Scholar for their help on this project. In particular: Rodney Kinney provided insight during discussions about how best to represent data extracted from documents; Paul Sayre provided feedback on initial designs of the library; Chloe Anastasiades, Dany Haddad and Egor Klevak tested earlier versions of the library; Tal August, Raymond Fok, and Andrew Head motivated the need for such a toolkit during their internships building augmented reading interfaces; Jaron Lochner and Kelsey MacMillan helped us get additional engineering support; and Oren Etzioni provided enthusiasm and support for continued investment in this toolkit.

This project was supported in part by NSF Grant OIA-2033558 and NSF Grant CNS-2213656.

Author Contributions

All authors contributed to the implementation of papermage and/or the writing of this paper.

Core contributors. Kyle Lo and Zejiang Shen initiated the project and co-wrote initial implementations of magelib and some Predictors. Later, Kyle Lo and Luca Soldaini refactored a majority of magelib and Predictors, and added Recipes. Benjamin Newman added new Predictors to support use-cases like those in the Vignette (§4). Joseph Chee Chang implemented an end-to-end web-based visual interface for papermage and helped iterate on papermage's designs. All core contributors helped with writing. Finally, Kyle Lo led all aspects of the project, including design and implementation, as well as mentorship of other contributors to the toolkit (see below).

Other contributors. Russell Authur, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Amanpreet Singh and Angele Zamarron each worked closely with Kyle Lo to contribute a Predictor to papermage. Erin Bransom and Bailey Kuehl helped with data annotation for training and evaluating those Predictors. Chris Wilhelm provided feedback on papermage's design and implemented faster indexing of Entities when building Layers. Finally, Marti Hearst, Daniel Weld, and Doug Downey helped with writing and overall advising on the project.

References

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 84–91, New Orleans, Louisiana. Association for Computational Linguistics.

Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A. Hearst, Andrew Head, and Kyle Lo. 2023. Paper Plain: Making medical research papers approachable to healthcare consumers with natural language processing. ACM Trans. Comput.-Hum. Interact., 30(5).

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative crowdsourcing for labeling machine learning datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI '17, page 2334–2346, New York, NY, USA. Association for Computing Machinery.

Joseph Chee Chang, Amy X. Zhang, Jonathan Bragg, Andrew Head, Kyle Lo, Doug Downey, and Daniel S. Weld. 2023. CiteSee: Augmenting citations in scientific papers with persistent and personalized historical context. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23, New York, NY, USA. Association for Computing Machinery.

Catherine Chen, Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, and Kyle Lo. 2023. Are layout-infused language models robust to layout distribution shifts? A case study with scientific documents. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13345–13360, Toronto, Canada. Association for Computational Linguistics.

Isaac Councill, C. Lee Giles, and Min-Yen Kan. 2008. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, Online. Association for Computational Linguistics.

Raymond Fok, Joseph Chee Chang, Tal August, Amy X. Zhang, and Daniel S. Weld. 2023a. Qlarify: Bridging scholarly abstracts and papers with recursively expandable summaries. arXiv, abs/2310.07581.

Raymond Fok, Hita Kambhamettu, Luca Soldaini, Jonathan Bragg, Kyle Lo, Marti Hearst, Andrew Head, and Daniel S. Weld. 2023b. Scim: Intelligent skimming support for scientific papers. In Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI '23, page 476–490, New York, NY, USA. Association for Computing Machinery.

Grobid. 2008–2023. Grobid. https://github.com/kermitt2/grobid.

Yu Gu, Robert Tinn, Hao Cheng, Michael R. Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3:1–23.

Andrew Head, Kyle Lo, Dongyeop Kang, Raymond Fok, Sam Skjonsberg, Daniel S. Weld, and Marti A. Hearst. 2021. Augmenting scientific papers with just-in-time, position-sensitive definitions of terms and symbols. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI '21, New York, NY, USA. Association for Computing Machinery.

Zhi Hong, Aswathy Ajith, James Pauloski, Eamon Duede, Kyle Chard, and Ian Foster. 2023. The diminishing returns of masked language models to science. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1270–1283, Toronto, Canada. Association for Computational Linguistics.

Po-Wei Huang, Abhinav Ramesh Kashyap, Yanxia Qin, Yajing Yang, and Min-Yen Kan. 2022a. Lightweight contextual logical structure recovery. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 37–48, Gyeongju, Republic of Korea. Association for Computational Linguistics.

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022b. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia.

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, and Iz Beltagy. 2020. SciREX: A challenge dataset for document-level information extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7506–7516, Online. Association for Computational Linguistics.

Hyeonsu B. Kang, Joseph Chee Chang, Yongsung Kim, and Aniket Kittur. 2022. Threddy: An interactive system for personalized thread-based exploration and organization of scientific literature. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, UIST '22, New York, NY, USA. Association for Computing Machinery.

Hyeonsu B. Kang, Sherry Tongshuang Wu, Joseph Chee Chang, and Aniket Kittur. 2023. Synergi: A mixed-initiative system for scholarly synthesis and sensemaking. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Association for Computing Machinery.

Tae Soo Kim, Matt Latzke, Jonathan Bragg, Amy X. Zhang, and Joseph Chee Chang. 2023. Papeos: Augmenting research papers with talk videos. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.

Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, Miles Crawford, Doug Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David Graham, Fangzhou Hu, Regan Huff, Daniel King, Sebastian Kohlmeier, Bailey Kuehl, Michael Langan, Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Kelsey MacMillan, Tyler Murray, Chris Newell, Smita Rao, Shaurya Rohatgi, Paul Sayre, Zejiang Shen, Amanpreet Singh, Luca Soldaini, Shivashankar Subramanian, Amber Tanaka, Alex D. Wade, Linda Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Madeleine Van Zuylen, and Daniel S. Weld. 2023. The Semantic Scholar Open Data Platform. ArXiv, abs/2301.10140.

Wei-Jen Ko, Te-yuan Chen, Yiyan Huang, Greg Durrett, and Junyi Jessy Li. 2020. Inquisitive question generation for high level text comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6544–6555, Online. Association for Computational Linguistics.

Benjamin Charles Germain Lee, Jaime Mears, Eileen Jakeway, Meghan Ferriter, Chris Adams, Nathan Yarasavage, Deborah Thomas, Kate Zwaard, and Daniel S. Weld. 2020. The Newspaper Navigator dataset: Extracting headlines and visual content from 16 million historic newspaper pages in Chronicling America. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, page 3055–3062, New York, NY, USA. Association for Computing Machinery.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-In Lee, and Moontae Lee. 2023. QASA: Advanced question answering on scientific articles. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19036–19052. PMLR.

Élise Lincker, Olivier Pons, Camille Guinaudeau, Isabelle Barbet, Jérôme Dupire, Céline Hudelot, Vincent Mousseau, and Caroline Huron. 2023. Layout and activity-based textbook modeling for automatic PDF textbook extraction. In iTextbooks@AIED.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.

Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anastasiades, Tal August, Russell Authur, Danielle Bragg, Erin Bransom, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Yen-Sung Chen, Evie Yu-Yen Cheng, Yvonne Chou, Doug Downey, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff, Dongyeop Kang, Tae Soo Kim, Rodney Kinney, Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bailey Kuehl, Michael Langan, Matt Latzke, Jaron Lochner, Kelsey MacMillan, Eric Marsh, Tyler Murray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti Palani, Soya Park, Caroline Paulic, Napol Rachatasumrit, Smita Rao, Paul Sayre, Zejiang Shen, Pao Siangliulue, Luca Soldaini, Huy Tran, Madeleine van Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caroline Wu, Jiangjiang Yang, Angele Zamarron, Marti A. Hearst, and Daniel S. Weld. 2023. The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces. ArXiv, abs/2303.14334.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. 2022. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy. Association for Computational Linguistics.

Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, and Kyle Lo. 2023. A question answering framework for decontextualizing user-facing snippets from scientific documents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Organizers of Queer in AI, Anaelia Ovalle, Arjun Subramonian, Ashwin Singh, Claas Voelcker, Danica J. Sutherland, Davide Locatelli, Eva Breznik, Filip Klubicka, Hang Yuan, Hetvi J, Huan Zhang, Jaidev Shriram, Kruno Lehman, Luca Soldaini, Maarten Sap, Marc Peter Deisenroth, Maria Leonor Pacheco, Maria Ryskina, Martin Mundt, Milind Agarwal, Nyx Mclean, Pan Xu, A Pranav, Raj Korpan, Ruchira Ray, Sarah Mathew, Sarthak Arora, St John, Tanvi Anand, Vishakha Agrawal, William Agnew, Yanan Long, Zijie J. Wang, Zeerak Talat, Avijit Ghosh, Nathaniel Dennler, Michael Noseworthy, Sharvani Jha, Emi Baylor, Aditya Joshi, Natalia Y. Bilenko, Andrew Mcnamara, Raphael Gontijo-Lopes, Alex Markham, Evyn Dong, Jackie Kay, Manu Saraswat, Nikhil Vytla, and Luke Stark. 2023. Queer In AI: A Case Study in Community-Led Participatory AI. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT '23, page 1882–1895, New York, NY, USA. Association for Computing Machinery.

Napol Rachatasumrit, Jonathan Bragg, Amy X. Zhang, and Daniel S. Weld. 2022. CiteRead: Integrating localized citation contexts into scientific paper reading. In 27th International Conference on Intelligent User Interfaces, IUI '22, page 707–719, New York, NY, USA. Association for Computing Machinery.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:1137–1149.

Nipun Sadvilkar and Mark Neumann. 2020. PySBD: Pragmatic sentence boundary disambiguation. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 110–114, Online. Association for Computational Linguistics.

Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, and Doug Downey. 2022. VILA: Improving structured content extraction from scientific PDFs using visual layout groups. Transactions of the Association for Computational Linguistics, 10:376–392.

Zejiang Shen, Ruochen Zhang, Melissa Dell, B. Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A unified toolkit for deep learning based document image analysis. In IEEE International Conference on Document Analysis and Recognition.

Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Hannaneh Hajishirzi. 2020. MedICaT: A dataset of medical images, captions, and textual references. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2112–2120, Online. Association for Computational Linguistics.

M. Tan, R. Pang, and Q. V. Le. 2020. EfficientDet: Scalable and efficient object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10778–10787, Los Alamitos, CA, USA. IEEE Computer Society.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony S. Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. ArXiv, abs/2211.09085.

pdf2image. 2023. pdf2image. https://github.com/Belval/pdf2image.

pdfplumber. 2023. pdfplumber. https://github.com/jsvine/pdfplumber.

Dominika Tkaczyk, Paweł Szostek, Mateusz Fedoryszak, Piotr Jan Dendek, and Lukasz Bolikowski. 2015. CERMINE: Automatic extraction of structured metadata from scientific literature. Int. J. Doc. Anal. Recognit., 18(4):317–335.

Amalie Trewartha, Nicholas Walker, Haoyan Huo, Sanghoon Lee, Kevin Cruse, John Dagdelen, Alex Dunn, Kristin Aslaug Persson, Gerbrand Ceder, and Anubhav Jain. 2022. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns, 3.

Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie (Yu-Yen) Cheng, Chelsea Hess Haupt, Matt Latzke, Bailey Kuehl, Madeleine van Zuylen, Linda M. Wagner, and Daniel S. Weld. 2021. Improving the accessibility of scientific documents: Current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users. ArXiv, abs/2105.00076.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

Rosalee Wolfe, John McDonald, Ronan Johnson, Ben Sturr, Syd Klinghoffer, Anthony Bonzani, Andrew Alexander, and Nicole Barnekow. 2022. Supporting mouthing in signed languages: New innovations and a proposal for future corpus building. In Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives, pages 125–130, Marseille, France. European Language Resources Association.

Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2579–2591, Online. Association for Computational Linguistics.
A Appendix

A.1 Comparison and Compatibility with XML

One can view Layers as capturing content hierarchy (e.g., tokens vs. sentences) similar to that of other structured document representations, like TEI XML trees. We note that Layers are stored as unordered attributes and don't require nesting. This allows for cross-layer referencing operations that don't adhere to strict nesting relationships. For example:

    for sentence in doc.sentences:
        for line in sentence.lines:
            ...

Recall that a sentence can begin or end midway through a line and cross multiple lines (§3.1). Similarly, not all lines are exactly contained within the boundaries of a sentence. As such, sentences and lines are not strictly nested within each other. This relationship would be difficult to encode in an XML format adhering to a document tree structure.

Regardless, the way we represent structure in documents is highly versatile. We demonstrate this by also implementing GrobidParser as an alternative to the PDF2TextParser in §3.1. GrobidParser invokes Grobid to process PDFs and reads the resulting TEI XML file generated by Grobid, converting each XML tag of a common level into an Entity of its own Layer. We use this to perform the evaluation in Table 2.

A.2 Additional magelib Protocols and Utilities

Serialization. Any Document and all of its Layers can be exported to a JSON format and perfectly reconstructed:

    import json

    with open("....json", "w") as f_out:
        json.dump(doc.to_json(), f_out)

    with open("...json", "r") as f_in:
        doc = Document.from_json(json.load(f_in))

A.3 Evaluating papermage's CoreRecipe against Grobid

Here, we detail how we performed the evaluation reported in §3.3 (Table 2). We also provide a full breakdown by category in Table 3.

As described earlier in the paper, Grobid is difficult to evaluate because it is developed with tight coupling between the PDF parser (pdfalto) and the models it employs to perform logical structure recovery over the resulting token stream. As such, there is no straightforward way to run just the model components of Grobid on an alternative token stream like that provided in the S2-VL (Shen et al., 2022) dataset.

To perform this baseline evaluation, we ran the original PDFs that were annotated for S2-VL through our GrobidParser using v0.7.3. Grobid also returns bounding boxes for some predicted categories (e.g., authors, abstract, paragraphs). We use these bounding boxes to create Entities that we annotate on a Document constructed manually from S2-VL data. Using magelib cross-layer referencing, we were able to match Grobid predictions to S2-VL data to perform this evaluation.

We found, however, that there are certain categories for which bounding box information was either not available (e.g., Titles) or for which Grobid simply did not return output (e.g., Figure text extraction). These are represented by zeros in Table 3, which contributes to the lower scores in Table 2 after macro averaging. For a more apples-to-apples comparison, we also included a "Grobid Subset" evaluation restricted to just those categories in S2-VL for which Grobid produced bounding box information.

In addition to Grobid, we evaluate two of our provided Transformer-based models. The RoBERTa-large (Liu et al., 2019) model is a Transformers token classification model that we finetuned on the S2-VL training set. The I-VILA model is a layout-infused Transformer model pretrained by Shen et al. (2022) on the S2-VL training set. As with Grobid, we ran our CoreRecipe using these two models on the original PDFs in S2-VL and performed a similar token mapping operation, since our PDF2TextParser also produces a different token stream than that provided in S2-VL.

Ultimately, the Transformer-based models performed better at this task than Grobid. This is unsurprising given the expected improvements of a Transformer model over a CRF or BiLSTM. The Transformer models were also trained on S2-VL data, which gave them an advantage over Grobid. Overall, this evaluation was intended to show how papermage enables cross-system comparisons even in the face of token stream incompatibilities, and to illustrate how much performance existing software systems that don't make use of state-of-the-art models leave on the table.
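The spatial matching used in this evaluation can be sketched as follows. This is a minimal illustration and not the magelib API: the Box type, the overlaps test, and label_tokens are hypothetical stand-ins for cross-referencing predicted region boxes against gold tokens by bounding-box intersection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box on a given page."""
    left: float
    top: float
    width: float
    height: float
    page: int

def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes on the same page intersect."""
    if a.page != b.page:
        return False
    return (a.left < b.left + b.width and b.left < a.left + a.width and
            a.top < b.top + b.height and b.top < a.top + a.height)

def label_tokens(token_boxes, predictions):
    """Give each gold token the label of the first predicted region that
    spatially overlaps it; tokens with no overlapping prediction get None."""
    labels = []
    for tok in token_boxes:
        label = None
        for pred_box, pred_label in predictions:
            if overlaps(tok, pred_box):
                label = pred_label
                break
        labels.append(label)
    return labels

# Two gold tokens on page 0; only the first falls inside a predicted region.
tokens = [Box(10, 10, 30, 10, 0), Box(10, 200, 30, 10, 0)]
preds = [(Box(0, 0, 100, 50, 0), "paragraph")]
print(label_tokens(tokens, preds))  # ['paragraph', None]
```

Tokens left unlabeled by this matching are exactly those that count against categories for which Grobid returned no bounding boxes.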
Structure     GROBID-CRF        GROBID-NN         RoBERTa           I-VILA
Category      P     R     F1    P     R     F1    P     R     F1    P     R     F1
Abstract 81.9 89.1 85.3 85.3 89.8 87.5 89.2 93.7 91.4 97.4 98.3 97.8
Author 55.2 42.6 48.1 75.1 14.0 23.6 87.5 73.5 79.9 65.5 96.9 78.2
Bibliography 96.5 98.6 97.5 95.5 97.6 96.5 93.6 93.3 93.5 99.7 98.2 99.0
Caption 70.3 70.0 70.2 70.2 69.7 70.0 80.0 77.3 78.6 93.1 89.6 91.3
Equation 71.1 85.3 77.6 71.1 85.3 77.6 55.0 85.7 67.0 90.7 94.2 92.4
Figure 0.0 0.0 0.0 0.0 0.0 0.0 88.9 82.3 85.4 99.8 96.8 98.3
Footer 0.0 0.0 0.0 0.0 0.0 0.0 56.1 59.9 57.9 96.8 78.1 86.5
Footnote 0.0 0.0 0.0 0.0 0.0 0.0 59.8 44.3 50.9 80.2 93.5 86.3
Header 0.0 0.0 0.0 0.0 0.0 0.0 40.5 84.3 54.7 92.9 99.1 95.9
Keywords 0.0 0.0 0.0 0.0 0.0 0.0 93.8 97.1 95.4 96.9 99.4 98.1
List 0.0 0.0 0.0 0.0 0.0 0.0 61.9 63.8 62.9 76.7 82.4 79.4
Paragraph 94.5 89.8 92.1 94.4 89.9 92.1 93.5 93.0 93.3 98.7 97.9 98.3
Section 83.0 79.4 81.1 83.0 79.4 81.1 67.7 82.7 74.4 96.2 91.6 93.9
Table 97.3 58.6 73.2 97.9 58.6 73.3 94.7 71.8 81.7 96.1 94.9 95.5
Title 0.0 0.0 0.0 0.0 0.0 0.0 76.3 96.7 85.3 98.7 99.9 99.3
Macro Avg (Full S2-VL)      40.6  38.3  39.1    42.0  36.5  37.6    75.9  80.0  76.8    92.0  94.1  92.7
Macro Avg (Grobid Subset)   81.2  76.7  78.9    84.1  73.0  78.2    82.6  83.9  83.2    92.2  95.2  93.7
Table 3: Evaluating CoreRecipe for logical structure recovery on S2-VL (Shen et al., 2022). These are per-category
metrics for Table 2. Metrics are computed for token-level classification, macro-averaged over categories. The
“Grobid Subset” limits evaluation to only categories for which Grobid returns bounding box information, which was
necessary for evaluation on S2-VL.
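To make the macro-averaging effect concrete, here is a small sketch (plain Python, not part of papermage) using a few of the GROBID-CRF F1 values from Table 3. Categories for which Grobid returned no bounding boxes contribute 0.0 to the Full S2-VL average but are dropped from the Grobid Subset average.

```python
# Macro averaging as in Tables 2 and 3: an unweighted mean over categories.
# Title and Figure score 0.0 because Grobid returned no boxes for them.

def macro_avg(scores):
    """Unweighted mean of per-category scores."""
    return sum(scores.values()) / len(scores)

f1 = {"Abstract": 85.3, "Paragraph": 92.1, "Title": 0.0, "Figure": 0.0}

full_s2vl = macro_avg(f1)  # zeros included, dragging the average down
grobid_subset = macro_avg({k: v for k, v in f1.items() if v > 0.0})

print(round(full_s2vl, 2), round(grobid_subset, 1))  # 44.35 88.7
```

The same mechanism explains why GROBID-CRF drops from 78.9 F1 on the Grobid Subset to 39.1 on Full S2-VL while the Transformer models are largely unaffected.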