
PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents

Kyle Lo^α∗  Zejiang Shen^{α,τ}∗  Benjamin Newman^α∗  Joseph Chee Chang^α∗
Russell Authur^α  Erin Bransom^α  Stefan Candra^α  Yoganand Chandrasekhar^α
Regan Huff^α  Bailey Kuehl^α  Amanpreet Singh^α  Chris Wilhelm^α  Angele Zamarron^α
Marti A. Hearst^β  Daniel S. Weld^{α,ω}  Doug Downey^{α,η}  Luca Soldaini^α∗

^α Allen Institute for AI  ^τ Massachusetts Institute of Technology
^β University of California, Berkeley  ^ω University of Washington  ^η Northwestern University
{kylel, lucas}@allenai.org

Abstract

Despite growing interest in applying natural language processing (NLP) and computer vision (CV) models to the scholarly domain, scientific documents remain challenging to work with. They're often in difficult-to-use PDF formats, and the ecosystem of models to process them is fragmented and incomplete. We introduce papermage, an open-source Python toolkit for analyzing and processing visually-rich, structured scientific documents. papermage offers clean and intuitive abstractions for seamlessly representing and manipulating both textual and visual document elements. papermage achieves this by integrating disparate state-of-the-art NLP and CV models into a unified framework, and provides turn-key recipes for common scientific document processing use-cases. papermage has powered multiple research prototypes of AI applications over scientific documents, along with Semantic Scholar's large-scale production system for processing millions of PDFs.

§ github.com/allenai/papermage

Figure 1: papermage's document creation and representation. (A) Recipes are turn-key methods for processing a PDF. (B) They compose models operating across different data modalities and machine learning frameworks to extract document structure, which we conceptualize as layers of annotation that store textual and visual information. (C) Users can access and manipulate layers.

1 Introduction

Research papers and textbooks are central to the scientific enterprise, and there is increasing interest in developing new tools for extracting knowledge from these visually-rich documents. Recent research has explored, for example, AI-powered reading support for math symbol definitions (Head et al., 2021), in-situ passage explanations or summaries (August et al., 2023; Rachatasumrit et al., 2022; Kim et al., 2023), automatic span highlighting (Chang et al., 2023; Fok et al., 2023b), interactive clipping and synthesis (Kang et al., 2022, 2023), and more. Further, extracting clean, properly-structured scientific text from PDF documents (Lo et al., 2020; Wang et al., 2020) forms a critical first step in pretraining language models of science (Beltagy et al., 2019; Lee et al., 2019; Gu et al., 2020; Luo et al., 2022; Taylor et al., 2022; Trewartha et al., 2022; Hong et al., 2023), automatic generation of more accessible paper formats (Wang et al., 2021), and developing datasets for scientific natural language processing (NLP) tasks over structured full text (Jain et al., 2020; Subramanian et al., 2020; Dasigi et al., 2021; Lee et al., 2023).

∗ Core contributors; see author contributions for details.
1 We use code snippets to illustrate our toolkit's core designs and abstractions. Exact syntax in the paper may differ from the actual code, as software will evolve beyond the paper and we opt to simplify syntax when needed for legibility and clarity. We refer readers to our public code for the latest documentation.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 495–507, December 6–10, 2023. ©2023 Association for Computational Linguistics

However, this type of NLP research on scientific
corpora is difficult because the documents come in difficult-to-use formats like PDF,2 and existing tools for working with the documents are limited. Typically, the first step in scientific document processing is to invoke a parser on a document file to convert it into a sequence of tokens and bounding boxes in inferred reading order. Parsers extract only the raw document content, and obtaining richer document structure (e.g., titles, authors, figures) or linguistic structure and semantics (e.g., sentences, discourse units, scientific claims) requires sending the token sequence through downstream models. Unlike more mature parsers (§2.1), these downstream models are often research prototypes (§2.2) that are limited to extracting only a subset of the structures needed for one's research (e.g., the same model may not provide both sentence splits and figure detection). As a result, users must write extensive custom code that strings pipelines of multiple models together. Research projects using models of different modalities (e.g., combining an image-based formula detector with a text-based definition extractor) can require hundreds of lines of code.

We introduce papermage, an open-source Python toolkit for processing scientific documents. Its contributions include (1) magelib, a library of primitives and methods for representing and manipulating visually-rich documents as multimodal constructs, (2) Predictors, a set of implementations that integrate different state-of-the-art scientific document analysis models into a unified interface, even if individual models are written in different frameworks or operate on different modalities, and (3) Recipes, which provide turn-key access to well-tested combinations of individual (often single-modality) modules to form sophisticated, extensible multimodal pipelines.

2 Related Work

2.1 Turn-key software for scientific documents

Processing visually-rich documents like scientific documents requires a joint understanding of both visual and textual information. In practice, this often requires combining different models into complex processing pipelines. For example, GROBID (Grobid, 2008–2023), a widely-adopted software tool for scientific document processing, uses twelve interdependent sequence labeling models3 to perform its full text extraction. Other similar tools include CERMINE (Tkaczyk et al., 2015) and ParsCit (Councill et al., 2008). While such software is often an ideal choice for off-the-shelf processing, it is not necessarily designed for easy extension and/or integration with newer research models.4

2.2 Models for scientific document processing

While the aforementioned software tools use CRF- or BiLSTM-based models, Transformer-based models have seen wide adoption among NLP researchers for their powerful processing capabilities. Recent years have seen the rise of layout-infused Transformers (Xu et al., 2019; Shen et al., 2022; Xu et al., 2021; Huang et al., 2022b; Chen et al., 2023) for processing visually-rich documents, including recovering logical structure (e.g., titles, abstracts) of scientific papers (Huang et al., 2022a). Similarly, computer vision (CV) researchers have also shown impressive capabilities of CNN-based object detection models (Ren et al., 2015; Tan et al., 2020) for segmenting visually-rich documents based on their layout. While these research models are powerful and extensible for research purposes, creating robust processing pipelines from them often requires significant "glue" code and stitching together software tools. For example, Lincker et al. (2023) bootstrap a sophisticated processing pipeline around a research model for processing children's textbooks.

2.3 Combining models and pipelines

papermage's use case lies between that of turn-key software and a framework for supporting research. Similar to Transformers (Wolfe et al., 2022)'s integration of different research models into standard interfaces, others have done similarly for the visually-rich document domain. LayoutParser (Shen et al., 2021) provides models for visually-rich documents and supports the creation of document processing pipelines. papermage, in fact, depends on LayoutParser for access to vision models, but is designed to also integrate text models which are omitted from

2 PDFs store text as character glyphs and their (x, y) positions on a page. Converting this data to usable text for NLP requires error-prone operations like inferring token boundaries, whitespacing, and reading order using visual positioning.
3 https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/#models
4 Most research in NLP requires that a researcher be able to manipulate models within Python. Yet, Grobid requires users to manage a separate service process and send PDFs through a client. In performing evaluation in §3.3, we also found it difficult to run only the model components isolated from PDF utilities, which makes comparison with other research models challenging without significant "glue" code.
LayoutParser. To allow models of different modalities to work well together, we also developed the magelib library (§3.1).

Figure 2: Entities are multimodal content units. Here, spans of a sentence are used to identify its text among all symbols, while boxes map its visual coordinates on a page. spans and boxes can include non-contiguous units, allowing great flexibility in Entities to handle layout nuances. A sentence split across columns/pages and interrupted by floating figures/footnotes would require multiple spans and bounding boxes to represent.

3 Design of papermage

papermage comprises three parts: (1) magelib, a library for intuitively representing and manipulating visually-rich documents, (2) Predictors, implementations of models for analyzing scientific papers that unify disparate machine learning frameworks under a common interface, and (3) Recipes, combinations of Predictors that form multimodal pipelines.

3.1 Representing and manipulating visually-rich documents with magelib

In this section, we use code snippets to show how our library's abstractions and syntax are tailored for the visually-rich document problem domain.

Data Classes. magelib provides three base data classes for representing fundamental elements of visually-rich, structured documents: Document, Layers, and Entities. First, a Document might minimally store text as a string of symbols:

>>> from papermage import Document
>>> doc.symbols
"Revolt: Collaborative Crowdsourcing ..."

But visually-rich documents are more than a linearized string. For example, analyzing a scientific paper requires access to its visuospatial layout (e.g., pages, blocks, lines), logical structure (e.g., title, abstract, figures, tables, footnotes, sections), semantic units (e.g., paragraphs, sentences, tokens), and more (e.g., citations, terms). In practice, this means different parts of doc.symbols can correspond to different paragraphs, sentences, tokens, etc. in the Document, each with its own set of corresponding coordinates representing its visual position on a page.

magelib represents structure using Layers that can be accessed as attributes of a Document (e.g., doc.sentences, doc.figures, doc.tokens) (Figure 1). Each Layer is a sequence of content units, called Entities, which store both textual (e.g., spans, strings) and visuospatial (e.g., bounding boxes, pixel arrays) information:

>>> sentences = Layer(entities=[Entity(...), Entity(...), ...])

See Figure 2 for an example of how "sentences" in a scientific document are represented as Entities. Section §3.2 explains in more detail how a user can generate Entities.

Methods. magelib also provides a set of functions for building and interacting with data: augmenting a Document with additional Layers, traversing and spatially searching for matching Entities in one Layer, and cross-referencing between Layers (see Figure 3).

A Document that only contains doc.symbols can be augmented with additional Layers:

>>> paragraphs = Layer(...)
>>> sentences = Layer(...)
>>> tokens = Layer(...)
>>> doc.add(paragraphs, sentences, tokens)

Adding Layers automatically grants users the ability to iterate through Entities and cross-reference intersecting Entities across Layers:

>>> for paragraph in doc.paragraphs:
...     for sent in paragraph.sentences:
...         for token in sent.tokens:
...             ...

magelib also supports cross-modality operations, for example, searching for textual Entities within a visual region on the PDF (see Figure 3 F):

>>> query = Box(l=423, t=71, w=159, h=87)
>>> selection = doc.find(query, "tokens")
>>> [t.text for t in selection]
["Techniques", "for", "collecting", ...]
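The flexibility described in Figure 2 (one Entity holding multiple spans and boxes) can be made concrete with a short, self-contained sketch. The class and field names below mirror magelib's concepts but are an illustrative reimplementation, not papermage's actual code:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int  # index into the document's symbols string
    end: int    # exclusive

@dataclass
class Box:
    l: float
    t: float
    w: float
    h: float
    page: int

@dataclass
class Entity:
    spans: list  # possibly non-contiguous character ranges
    boxes: list  # one box per visual fragment (e.g., per column)

    def text(self, symbols: str) -> str:
        # Stitch the entity's text together from its spans; this sketch
        # assumes spans omit the separating whitespace.
        return " ".join(symbols[s.start:s.end] for s in self.spans)

# A sentence split across two columns: two spans, two boxes.
symbols = "A sentence split across columns."
sentence = Entity(
    spans=[Span(0, 16), Span(17, 32)],
    boxes=[Box(l=72, t=700, w=180, h=12, page=0),   # bottom of column 1
           Box(l=300, t=80, w=180, h=12, page=0)],  # top of column 2
)
print(sentence.text(symbols))  # -> A sentence split across columns.
```

Because the spans and boxes live on the same Entity, text-side and layout-side consumers can share one object rather than maintaining parallel data structures.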
Figure 3: Illustrates how Entities can be accessed flexibly in different ways. (A) Accessing the Entity of the first paragraph in the Document via its own Layer: doc.paragraphs[0]. (B) Accessing a sentence via the paragraph Entity (doc.paragraphs[0].sentences[2]) or directly via the sentences Layer (doc.sentences[2]). (C) Similarly, the same tokens can be accessed via the overlapping sentence Entity (doc.sentences[2].tokens[9:13]) or directly via the tokens Layer of the Document (doc.tokens[169:173], where the first tokens are the title of the paper). (D, E) Figures, captions, tables and keywords can be accessed in similar ways: doc.figures[0], doc.captions[0]. (F) Additionally, given a bounding box (e.g., of a user-selected region), papermage can find the corresponding Entities for a given Layer, in this case finding the tokens under the region:

>>> user_query = Box(l, t, w, h, page=0)
>>> selected_tokens = doc.find(user_query, layer="tokens")
>>> [token.text for token in selected_tokens]
["Techniques", "for", "collecting", "labeled", "data", "perts", "for", "manual", "annotation", ...]

Excerpt from Chang et al. (2017).
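The traversal and cross-referencing in Figure 3's panels reduce to interval intersection over shared character offsets. A minimal sketch of that logic, using plain (start, end) tuples rather than papermage's real data structures:

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [a_start, a_end) and [b_start, b_end)
    # intersect iff each one starts before the other ends.
    return a_start < b_end and b_start < a_end

# Toy layers over one shared symbols string: (start, end) spans.
tokens = [(0, 3), (4, 9), (10, 14), (15, 21)]
sentences = [(0, 9), (10, 21)]

def tokens_in(sentence_span, token_spans):
    # Cross-reference: tokens whose spans intersect the sentence's span.
    s_start, s_end = sentence_span
    return [t for t in token_spans if overlaps(t[0], t[1], s_start, s_end)]

print(tokens_in(sentences[0], tokens))  # -> [(0, 3), (4, 9)]
```

The same predicate generalizes to the spatial case in panel (F) by testing box intersection on both axes instead of one character axis.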

Protocols and Utilities. To instantiate a Document, magelib provides protocols and utilities like Parsers and Rasterizers, which hook into off-the-shelf PDF processing tools:5

>>> import papermage as pm
>>> parser = pm.PDF2TextParser()
>>> doc = parser.parse("...pdf")
>>> [token.text for token in doc.tokens]
["Revolt", ":", "Collaborative", ...]
>>> doc.images
None
>>> rasterizer = pm.PDF2ImageRasterizer()
>>> doc2 = rasterizer.rasterize("...pdf")
>>> doc.images = doc2.images
>>> doc.images
[Image(np.array(...)), ...]

In this example, papermage runs PDF2TextParser (using pdfplumber) to extract the textual information from a PDF file. Then it runs PDF2ImageRasterizer (using pdf2image) to update the first Document with images of pages.

5 PDFs are not the only way of representing visually-rich documents. For example, many scientific documents are distributed in XML format. As PDFs are the dominant distribution format of scientific documents, we focus our efforts on PDF-specific needs. Nevertheless, we also provide Parsers in magelib that can instantiate a Document from XML input. See Appendix A.1.

3.2 Interfacing with models for scientific document analysis through Predictors

In §3.1, we described how users create Layers by assembling collections of Entities. But how would they make Entities in the first place?

For example, to identify multimodal structures in visually-rich documents, researchers might want to build complex pipelines that run and combine output from many different models (e.g., computer vision models for extracting figures, NLP models for classifying body text). papermage provides a unified interface, called Predictors, to ensure models produce Entities that are compatible with the Document.

papermage includes several ready-to-use Predictors that leverage state-of-the-art models to extract specific document structures (Table 1). While magelib's abstractions are general for visually-rich documents, Predictors are optimized for parsing of scientific documents. They are designed to (1) be compatible with models from many different machine learning frameworks, (2) support inference with text-only, vision-only, and multimodal models, and (3) support both adaptation of off-the-shelf, pretrained models as well as
Table 1: Types of Predictors implemented in papermage.

Linguistic/Semantic (segments doc into text units often used for downstream models): SentencePredictor wraps sciSpaCy (Neumann et al., 2019) and PySBD (Sadvilkar and Neumann, 2020) to segment sentences. WordPredictor is a custom scikit-learn model to identify broken words split across PDF lines or columns. ParagraphPredictor is a set of heuristics on top of both layout and logical structure models to extract paragraphs.

Layout Structure (segments doc into visual block regions): BoxPredictor wraps models from LayoutParser (Shen et al., 2021), which provides vision models like EfficientDet (Tan et al., 2020) pretrained on scientific layouts (Zhong et al., 2019).

Logical Structure (segments doc into organizational units like title, abstract, body, footnotes, caption, and more): SpanPredictor wraps Token Classifiers from Transformers (Wolfe et al., 2022), which provides both pretrained weights from VILA (Shen et al., 2022), as well as RoBERTa (Liu et al., 2019) and SciBERT (Beltagy et al., 2019) weights that we've finetuned on similar data.

Task-specific (models for a given scientific document processing task can be used with papermage if wrapped as a Predictor following common patterns): As many practitioners depend on prompting a model through an API call, we implement APIPredictor, which interfaces with external APIs, such as GPT-3 (Brown et al., 2020), to perform tasks like question answering over a structured Document. We also implement SnippetRetrievalPredictor, which wraps models like Contriever (Izacard et al., 2022) to perform top-k within-document snippet retrieval. See §4 for how these two can be combined.

development of new ones from scratch. Similarly to the Transformers library, a Predictor's implementation is typically independent from its configuration, allowing users to customize each Predictor by tweaking hyperparameters or loading a different set of weights.

Below, we showcase how a vision model and two text models (both neural and symbolic) can be applied in succession to a single Document. See Table 1 for a summary of supported Predictors.

>>> import papermage as pm
>>> cv = pm.BoxPredictor(...)
>>> tables, figures = cv.predict(doc)
>>> doc.add(tables, figures)

>>> nlp_neu = pm.SpanPredictor(...)
>>> titles, authors = nlp_neu.predict(doc)
>>> doc.add(titles, authors)

>>> nlp_sym = pm.SentencePredictor(...)
>>> sentences = nlp_sym.predict(doc)
>>> doc.add(sentences)

Predictors return a list of Entities, which can be grouped with group_by() to organize them based on predicted label value (e.g., tokens classified as "title" or "authors"). Finally, these predictions are passed to doc.annotate() to be added to the Document.

3.3 End-to-end processing with Recipes

Finally, papermage provides predefined combinations of Predictors, called Recipes, for users seeking high-quality options for turn-key processing of visually-rich documents:

>>> from papermage import CoreRecipe
>>> recipe = CoreRecipe()
>>> doc = recipe.run("...pdf")
>>> doc.captions[0].text
"Figure 1. ..."

Recipes can also be flexibly modified to support development. For example, our current default combines the pdfplumber PDF parsing utility with the I-VILA (Shen et al., 2022) research model. We show in Table 2 an evaluation comparing this against the same recipe but configured to (1) swap I-VILA for a RoBERTa model, as well as (2) swap both for Grobid API calls.

Model       Full (P / R / F1)       Grobid Subset (P / R / F1)
GrobidCRF   40.6 / 38.3 / 39.1      81.2 / 76.7 / 78.9
GrobidNN    42.0 / 36.5 / 37.6      84.1 / 73.0 / 78.2
RoBERTa     75.9 / 80.0 / 76.8      82.6 / 83.9 / 83.2
I-VILA      92.0 / 94.1 / 92.7      92.2 / 95.2 / 93.7

Table 2: Evaluating performance of CoreRecipe for logical structure recovery on S2-VL (Shen et al., 2022). Metrics are computed for token-level classification, macro-averaged over categories. The "Grobid Subset" limits evaluation to only categories for which Grobid returns bounding box information, which was necessary for evaluation on S2-VL. See Appendix A.3 for details.

We expect Recipes to appeal to two groups of users: end-to-end consumers, and developers of high-level applications. The former is comprised of developers and researchers who are looking for a one-step solution to multimodal scientific document analysis. The latter are likely developers and researchers looking to combine document structure primitives to build a complex application (see example in §4).
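The division of labor between §3.2 and §3.3 can be summarized in a short sketch: every predictor satisfies one contract (document symbols in, entities out), and a recipe is an ordered composition of predictors whose outputs become named layers. The names below (BasePredictor, SimpleRecipe, and the toy splitters) are illustrative assumptions, not papermage's real classes:

```python
from abc import ABC, abstractmethod

class BasePredictor(ABC):
    """Contract: consume document symbols, emit entities for one layer."""

    @abstractmethod
    def predict(self, symbols):
        ...

class PeriodSentencePredictor(BasePredictor):
    """Toy text-only predictor: emit (start, end, label) sentence spans.
    A real predictor would wrap sciSpaCy, PySBD, VILA, etc."""

    def predict(self, symbols):
        entities, start = [], 0
        for i, ch in enumerate(symbols):
            if ch == ".":
                entities.append((start, i + 1, "sentence"))
                start = i + 2  # skip the space after the period
        return entities

class WhitespaceTokenPredictor(BasePredictor):
    """Toy tokenizer: emit (start, end, label) spans for whitespace tokens."""

    def predict(self, symbols):
        entities, i = [], 0
        for word in symbols.split():
            start = symbols.index(word, i)
            entities.append((start, start + len(word), "token"))
            i = start + len(word)
        return entities

class SimpleRecipe:
    """Run named predictors in order, attaching each output as a layer."""

    def __init__(self, steps):
        self.steps = steps  # list of (layer_name, BasePredictor)

    def run(self, symbols):
        doc = {"symbols": symbols, "layers": {}}
        for name, predictor in self.steps:
            doc["layers"][name] = predictor.predict(symbols)
        return doc

recipe = SimpleRecipe([("tokens", WhitespaceTokenPredictor()),
                       ("sentences", PeriodSentencePredictor())])
doc = recipe.run("papermage parses PDFs. It builds layers.")
print(doc["layers"]["sentences"])
```

Because the recipe only depends on the predict() contract, swapping PeriodSentencePredictor for a model-backed predictor leaves the run loop untouched, which is the extensibility point this section emphasizes.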
4 Vignette: Building an Attributed QA System for Scientific Papers

How could researchers leverage papermage for their research? Here, we walk through a user scenario in which a researcher (Lucy) is prototyping an attributed QA system for science.

System Design. Drawing inspiration from Ko et al. (2020), Lee et al. (2023), Fok et al. (2023a), and Newman et al. (2023), Lucy is studying how language models can be used to resolve questions that arise while reading a paper (e.g., What does this mean? or What does this refer to?). In her prototype interface, a user can highlight a passage in a PDF and ask a question about it. A retrieval model then finds relevant passages from the rest of the paper. The prototype then uses the text of the retrieved passages along with the user question to prompt a language model to generate an answer. When presenting the answer to the user, the prototype also visually highlights the retrieved passages as supporting evidence to the generated answer.

Getting started quickly. As a researcher proficient in Python, it only takes Lucy minutes to install papermage using pip and successfully process a local PDF file by following the example code snippet for CoreRecipe in §3.3. In an interactive session, she familiarizes herself with the provided Layers by following the traversal, cross-referencing and querying examples in §3.1. She makes sure she can serialize and re-instantiate her Document (§A.2).

Formatting input. Before using papermage, Lucy has prior experience building QA pipelines, but has only dealt with documents as sentence-split text data (e.g., <List[str]>). Lucy realizes that she can reuse her prior text-only code with papermage by implementing a couple of wrappers to gain additional capabilities. First, she converts a user's highlighted passage from a visual selection to text following the example in Figure 3F. Next, she converts the Document to her required text format by following the traversal examples in §3.1 (e.g., using [s.text for s in doc.sentences]). Within a few lines of code, Lucy has everything she needs for text-only input to her QA pipeline.

Formatting output. Lucy runs her QA system on her newly acquired text data and now has (1) a model-generated answer and (2) several retrieved evidence passages. She realizes that she already has access to the evidence passages' bounding boxes via a call similar to how she defined the model input context (e.g., [s.boxes for s in doc.sentences]). She can easily pass this to the user interface to enable linking to and highlighting of those passages.

Defining a Predictor. The pattern Lucy has followed is used in many of our Predictor implementations: (1) gain access to text by traversing Layers (e.g., sentences), (2) perform all usual NLP computation on that text, and (3) format model output as Entities. This simple pattern allows users to reuse familiar models in existing frameworks and eschews lengthy onboarding to papermage. Lucy wraps her prompting and retrieval code in new classes: APIPredictor and SnippetRetrievalPredictor (see Table 1).

Fast iterations. Leveraging the bounding box data from papermage to visually highlight the retrieved passages, Lucy suspects the retrieval component is likely underperforming. She makes a simple edit from doc.sentences to doc.paragraphs and evaluates system performance under different input granularity. She also realizes the system often retrieves content outside the main body text. She restricts her traversal to filter out paragraphs that overlap with footnotes ([p.text for p in doc.paragraphs if len(p.footnotes) == 0]), making clever use of the cross-referencing functionality to detect when a paragraph is actually coming from a footnote. This example demonstrates the versatility of the affordances provided by magelib.

5 Conclusion

In this work, we've introduced papermage, an open-source Python toolkit for processing scientific documents. papermage was developed to supply high-quality data and reduce friction for research prototype development at Semantic Scholar. Today, it is being used in the production PDF processing pipeline to provide data for both the literature graph (Ammar et al., 2018; Kinney et al., 2023) and the paper-reading interface (Lo et al., 2023). It has also been used in working research prototypes which have since contributed to research publications (Fok et al., 2023b; Kim et al., 2023).6 We open-source papermage in hopes it will simplify research workflows that depend on scientific documents and promote extensions to other visually-rich documents like textbooks (Lincker et al., 2023) and digitized print media (Lee et al., 2020).

6 See a demo of such a prototype at papeo.app/demo.
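Lucy's footnote filter in §4 hinges on cross-referencing paragraphs against footnotes. Assuming simple axis-aligned (l, t, w, h) boxes, the underlying spatial test can be sketched as follows; this is illustrative, not papermage's actual geometry code:

```python
def boxes_overlap(a, b):
    # Boxes are (l, t, w, h); they overlap iff they intersect
    # on both the horizontal and vertical axes.
    al, at, aw, ah = a
    bl, bt, bw, bh = b
    return al < bl + bw and bl < al + aw and at < bt + bh and bt < at + ah

def body_paragraphs(paragraphs, footnotes):
    """Keep only paragraphs whose box touches no footnote box."""
    return [p for p in paragraphs
            if not any(boxes_overlap(p["box"], f) for f in footnotes)]

paragraphs = [
    {"text": "Main body text.", "box": (72, 100, 200, 300)},
    {"text": "1. A footnote.", "box": (72, 700, 200, 20)},
]
footnotes = [(72, 695, 200, 30)]
print([p["text"] for p in body_paragraphs(paragraphs, footnotes)])
# -> ['Main body text.']
```

The same overlap predicate underlies both directions of cross-referencing: finding the footnotes under a paragraph, or the paragraphs under a user-drawn selection box.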
Ethical Considerations

As a toolkit primarily designed to process scientific documents, there are two areas where papermage could cause harms or have unintended effects.

Extraction of bibliographic information. papermage could be used to parse author names, affiliations, and emails from scientific documents. Like any software, this extraction can be noisy, leading to incorrect parsing and thus mis-attribution of manuscripts. Further, since papermage relies on static PDF documents, rather than metadata dynamically retrieved from publishers, users of papermage need to consider how and when extracted names should no longer be associated with authors, a harmful practice called deadnaming (Queer in AI et al., 2023). We recommend that papermage users exercise caution when using our toolkit to extract metadata, cross-reference extracted content with other sources when possible, and design systems such that authors have the ability to manually edit any data about themselves.

Misrepresentation or fabrication of information in documents. In §3, we discussed how papermage can be easily extended to support high-level applications. Such applications might include question answering chatbots, or AI summarizers that perform information synthesis over one or more papermage documents. Such applications typically rely on generative models to produce their output, which might fabricate incorrect information or misstate claims. Developers should be vigilant when integrating papermage output into any downstream application, especially in systems that purport to represent information gathered from scientific publications.

Acknowledgements

We thank our teammates at Semantic Scholar for their help on this project. In particular: Rodney Kinney provided insight during discussions about how best to represent data extracted from documents; Paul Sayre provided feedback on initial designs of the library; Chloe Anastasiades, Dany Haddad and Egor Klevak tested earlier versions of the toolkit; Oren Etzioni provided enthusiasm and support for continued investment in this toolkit.

This project was supported in part by NSF Grant OIA-2033558 and NSF Grant CNS-2213656.

Author Contributions

All authors contributed to the implementation of papermage and/or the writing of this paper.

Core contributors. Kyle Lo and Zejiang Shen initiated the project and co-wrote initial implementations of magelib and some Predictors. Later, Kyle Lo and Luca Soldaini refactored a majority of magelib and the Predictors, and added Recipes. Benjamin Newman added new Predictors to support use-cases like those in the Vignette (§4). Joseph Chee Chang implemented an end-to-end web-based visual interface for papermage and helped iterate on papermage's designs. All core contributors helped with writing. Finally, Kyle Lo led all aspects of the project, including design and implementation, as well as mentorship of other contributors to the toolkit (see below).

Other contributors. Russell Authur, Stefan Candra, Yoganand Chandrasekhar, Regan Huff, Amanpreet Singh and Angele Zamarron each worked closely with Kyle Lo to contribute a Predictor to papermage. Erin Bransom and Bailey Kuehl helped with data annotation for training and evaluating those Predictors. Chris Wilhelm provided feedback on papermage's design and implemented faster indexing of Entities when building Layers. Finally, Marti Hearst, Daniel Weld, and Doug Downey helped with writing and overall advising on the project.

References

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni. 2018. Construction of the literature graph in semantic scholar. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:
the library; Tal August, Raymond Fok, and Andrew Human Language Technologies, Volume 3 (Indus-
Head motivated the need for such a toolkit dur- try Papers), pages 84–91, New Orleans - Louisiana.
ing their internships building augmented reading Association for Computational Linguistics.
interfaces; Jaron Lochner and Kelsey MacMillan Tal August, Lucy Lu Wang, Jonathan Bragg, Marti A.
helped us get additional engineering support; and Hearst, Andrew Head, and Kyle Lo. 2023. Paper

501
plain: Making medical research papers approachable the Association for Computational Linguistics: Hu-
to healthcare consumers with natural language pro- man Language Technologies, pages 4599–4610, On-
cessing. ACM Trans. Comput.-Hum. Interact., 30(5). line. Association for Computational Linguistics.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciB- Raymond Fok, Joseph Chee Chang, Tal August, Amy X.
ERT: A pretrained language model for scientific text. Zhang, and Daniel S. Weld. 2023a. Qlarify: Bridg-
In Proceedings of the 2019 Conference on Empirical ing scholarly abstracts and papers with recursively
Methods in Natural Language Processing and the expandable summaries. arXiv, abs/2310.07581.
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 3615– Raymond Fok, Hita Kambhamettu, Luca Soldaini,
3620, Hong Kong, China. Association for Computa- Jonathan Bragg, Kyle Lo, Marti Hearst, Andrew
tional Linguistics. Head, and Daniel S Weld. 2023b. Scim: Intelligent
skimming support for scientific papers. In Proceed-
Tom Brown, Benjamin Mann, Nick Ryder, Melanie ings of the 28th International Conference on Intelli-
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind gent User Interfaces, IUI ’23, page 476–490, New
Neelakantan, Pranav Shyam, Girish Sastry, Amanda York, NY, USA. Association for Computing Machin-
Askell, Sandhini Agarwal, Ariel Herbert-Voss, ery.
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Grobid. 2008–2023. Grobid. https://github.com/
Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- kermitt2/grobid.
teusz Litwin, Scott Gray, Benjamin Chess, Jack
Yu Gu, Robert Tinn, Hao Cheng, Michael R. Lucas,
Clark, Christopher Berner, Sam McCandlish, Alec
Naoto Usuyama, Xiaodong Liu, Tristan Naumann,
Radford, Ilya Sutskever, and Dario Amodei. 2020.
Jianfeng Gao, and Hoifung Poon. 2020. Domain-
Language models are few-shot learners. In Ad-
specific language model pretraining for biomedical
vances in Neural Information Processing Systems,
natural language processing. ACM Transactions on
volume 33, pages 1877–1901. Curran Associates,
Computing for Healthcare (HEALTH), 3:1 – 23.
Inc.
Andrew Head, Kyle Lo, Dongyeop Kang, Raymond
Joseph Chee Chang, Saleema Amershi, and Ece Kamar. Fok, Sam Skjonsberg, Daniel S. Weld, and Marti A.
2017. Revolt: Collaborative crowdsourcing for label- Hearst. 2021. Augmenting scientific papers with just-
ing machine learning datasets. In Proceedings of the in-time, position-sensitive definitions of terms and
2017 CHI Conference on Human Factors in Comput- symbols. In Proceedings of the 2021 CHI Conference
ing Systems, CHI ’17, page 2334–2346, New York, on Human Factors in Computing Systems, CHI ’21,
NY, USA. Association for Computing Machinery. New York, NY, USA. Association for Computing
Joseph Chee Chang, Amy X. Zhang, Jonathan Bragg, Machinery.
Andrew Head, Kyle Lo, Doug Downey, and Daniel S. Zhi Hong, Aswathy Ajith, James Pauloski, Eamon
Weld. 2023. Citesee: Augmenting citations in scien- Duede, Kyle Chard, and Ian Foster. 2023. The dimin-
tific papers with persistent and personalized historical ishing returns of masked language models to science.
context. In Proceedings of the 2023 CHI Conference In Findings of the Association for Computational
on Human Factors in Computing Systems, CHI ’23, Linguistics: ACL 2023, pages 1270–1283, Toronto,
New York, NY, USA. Association for Computing Canada. Association for Computational Linguistics.
Machinery.
Po-Wei Huang, Abhinav Ramesh Kashyap, Yanxia Qin,
Catherine Chen, Zejiang Shen, Dan Klein, Gabriel Yajing Yang, and Min-Yen Kan. 2022a. Lightweight
Stanovsky, Doug Downey, and Kyle Lo. 2023. Are contextual logical structure recovery. In Proceedings
layout-infused language models robust to layout dis- of the Third Workshop on Scholarly Document Pro-
tribution shifts? a case study with scientific docu- cessing, pages 37–48, Gyeongju, Republic of Korea.
ments. In Findings of the Association for Computa- Association for Computational Linguistics.
tional Linguistics: ACL 2023, pages 13345–13360,
Toronto, Canada. Association for Computational Lin- Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and
guistics. Furu Wei. 2022b. Layoutlmv3: Pre-training for doc-
ument ai with unified text and image masking. Pro-
Isaac Councill, C. Lee Giles, and Min-Yen Kan. 2008. ceedings of the 30th ACM International Conference
ParsCit: an open-source CRF reference string pars- on Multimedia.
ing package. In Proceedings of the Sixth Interna-
tional Conference on Language Resources and Eval- Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebas-
uation (LREC’08), Marrakech, Morocco. European tian Riedel, Piotr Bojanowski, Armand Joulin, and
Language Resources Association (ELRA). Edouard Grave. 2022. Unsupervised dense informa-
tion retrieval with contrastive learning. Transactions
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, on Machine Learning Research.
Noah A. Smith, and Matt Gardner. 2021. A dataset
of information-seeking questions and answers an- Sarthak Jain, Madeleine van Zuylen, Hannaneh Ha-
chored in research papers. In Proceedings of the jishirzi, and Iz Beltagy. 2020. SciREX: A challenge
2021 Conference of the North American Chapter of dataset for document-level information extraction. In

502
Proceedings of the 58th Annual Meeting of the Asso- 2019. BioBERT: a pre-trained biomedical language
ciation for Computational Linguistics, pages 7506– representation model for biomedical text mining.
7516, Online. Association for Computational Lin- Bioinformatics, 36(4):1234–1240.
guistics.
Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol
Hyeonsu B. Kang, Joseph Chee Chang, Yongsung Kim, Hwang, Jaehyeon Kim, Hong-In Lee, and Moontae
and Aniket Kittur. 2022. Threddy: An interactive Lee. 2023. QASA: Advanced question answering on
system for personalized thread-based exploration and scientific articles. In Proceedings of the 40th Inter-
organization of scientific literature. In Proceedings of national Conference on Machine Learning, volume
the 35th Annual ACM Symposium on User Interface 202 of Proceedings of Machine Learning Research,
Software and Technology, UIST ’22, New York, NY, pages 19036–19052. PMLR.
USA. Association for Computing Machinery.
Élise Lincker, Olivier Pons, Camille Guinaudeau, Is-
Hyeonsu B. Kang, Sherry Tongshuang Wu, Joseph Chee abelle Barbet, Jérôme Dupire, Céline Hudelot, Vin-
Chang, and Aniket Kittur. 2023. Synergi: A mixed- cent Mousseau, and Caroline Huron. 2023. Layout
initiative system for scholarly synthesis and sense- and activity-based textbook modeling for automatic
making. In Proceedings of the 36th Annual ACM pdf textbook extraction. In iTextbooks@AIED.
Symposium on User Interface Software and Technol-
ogy. Association for Computing Machinery. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Tae Soo Kim, Matt Latzke, Jonathan Bragg, Amy X. Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Zhang, and Joseph Chee Chang. 2023. Papeos: Aug- RoBERTa: A Robustly Optimized BERT Pretrain-
menting research papers with talk videos. In Proceed- ing Approach. ArXiv, abs/1907.11692.
ings of the 36th Annual ACM Symposium on User
Interface Software and Technology. Kyle Lo, Joseph Chee Chang, Andrew Head, Jonathan
Bragg, Amy X. Zhang, Cassidy Trier, Chloe Anas-
Rodney Kinney, Chloe Anastasiades, Russell Authur, tasiades, Tal August, Russell Authur, Danielle Bragg,
Iz Beltagy, Jonathan Bragg, Alexandra Buraczyn- Erin Bransom, Isabel Cachola, Stefan Candra, Yo-
ski, Isabel Cachola, Stefan Candra, Yoganand Chan- ganand Chandrasekhar, Yen-Sung Chen, Evie Yu-
drasekhar, Arman Cohan, Miles Crawford, Doug Yen Cheng, Yvonne Chou, Doug Downey, Rob
Downey, Jason Dunkelberger, Oren Etzioni, Rob Evans, Raymond Fok, Fangzhou Hu, Regan Huff,
Evans, Sergey Feldman, Joseph Gorney, David Gra- Dongyeop Kang, Tae Soo Kim, Rodney Kinney,
ham, Fangzhou Hu, Regan Huff, Daniel King, Se- Aniket Kittur, Hyeonsu Kang, Egor Klevak, Bai-
bastian Kohlmeier, Bailey Kuehl, Michael Langan, ley Kuehl, Michael Langan, Matt Latzke, Jaron
Daniel Lin, Haokun Liu, Kyle Lo, Jaron Lochner, Lochner, Kelsey MacMillan, Eric Marsh, Tyler Mur-
Kelsey MacMillan, Tyler Murray, Chris Newell, ray, Aakanksha Naik, Ngoc-Uyen Nguyen, Srishti
Smita Rao, Shaurya Rohatgi, Paul Sayre, Zejiang Palani, Soya Park, Caroline Paulic, Napol Rachata-
Shen, Amanpreet Singh, Luca Soldaini, Shivashankar sumrit, Smita Rao, Paul Sayre, Zejiang Shen, Pao
Subramanian, Amber Tanaka, Alex D. Wade, Linda Siangliulue, Luca Soldaini, Huy Tran, Madeleine van
Wagner, Lucy Lu Wang, Chris Wilhelm, Caroline Wu, Zuylen, Lucy Lu Wang, Christopher Wilhelm, Caro-
Jiangjiang Yang, Angele Zamarron, Madeleine Van line Wu, Jiangjiang Yang, Angele Zamarron, Marti A.
Zuylen, and Daniel S. Weld. 2023. The Semantic Hearst, and Daniel S. Weld. 2023. The Semantic
Scholar Open Data Platform. ArXiv, abs/2301.10140. Reader Project: Augmenting Scholarly Documents
through AI-Powered Interactive Reading Interfaces.
Wei-Jen Ko, Te-yuan Chen, Yiyan Huang, Greg Durrett, ArXiv, abs/2303.14334.
and Junyi Jessy Li. 2020. Inquisitive question gener-
ation for high level text comprehension. In Proceed- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kin-
ings of the 2020 Conference on Empirical Methods ney, and Daniel Weld. 2020. S2ORC: The semantic
in Natural Language Processing (EMNLP), pages scholar open research corpus. In Proceedings of the
6544–6555, Online. Association for Computational 58th Annual Meeting of the Association for Compu-
Linguistics. tational Linguistics, pages 4969–4983, Online. Asso-
ciation for Computational Linguistics.
Benjamin Charles Germain Lee, Jaime Mears, Eileen
Jakeway, Meghan Ferriter, Chris Adams, Nathan Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng
Yarasavage, Deborah Thomas, Kate Zwaard, and Zhang, Hoifung Poon, and Tie-Yan Liu. 2022.
Daniel S. Weld. 2020. The newspaper navigator Biogpt: Generative pre-trained transformer for
dataset: Extracting headlines and visual content from biomedical text generation and mining. Briefings
16 million historic newspaper pages in chronicling in bioinformatics.
america. In Proceedings of the 29th ACM Interna-
tional Conference on Information & Knowledge Man- Mark Neumann, Daniel King, Iz Beltagy, and Waleed
agement, CIKM ’20, page 3055–3062, New York, Ammar. 2019. ScispaCy: Fast and robust models
NY, USA. Association for Computing Machinery. for biomedical natural language processing. In Pro-
ceedings of the 18th BioNLP Workshop and Shared
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Task, pages 319–327, Florence, Italy. Association for
Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Computational Linguistics.

503
Benjamin Newman, Luca Soldaini, Raymond Fok, Ar- images, captions, and textual references. In Find-
man Cohan, and Kyle Lo. 2023. A question answer- ings of the Association for Computational Linguistics:
ing framework for decontextualizing user-facing snip- EMNLP 2020, pages 2112–2120, Online. Association
pets from scientific documents. In Proceedings of the for Computational Linguistics.
2023 Conference on Empirical Methods in Natural
Language Processing (EMNLP). M. Tan, R. Pang, and Q. V. Le. 2020. Efficientdet:
Scalable and efficient object detection. In 2020
Organizers of Queer in AI, Anaelia Ovalle, Arjun Sub- IEEE/CVF Conference on Computer Vision and Pat-
ramonian, Ashwin Singh, Claas Voelcker, Danica J. tern Recognition (CVPR), pages 10778–10787, Los
Sutherland, Davide Locatelli, Eva Breznik, Filip Klu- Alamitos, CA, USA. IEEE Computer Society.
bicka, Hang Yuan, Hetvi J, Huan Zhang, Jaidev
Shriram, Kruno Lehman, Luca Soldaini, Maarten Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas
Sap, Marc Peter Deisenroth, Maria Leonor Pacheco, Scialom, Anthony S. Hartshorn, Elvis Saravia, An-
Maria Ryskina, Martin Mundt, Milind Agarwal, Nyx drew Poulton, Viktor Kerkez, and Robert Stojnic.
Mclean, Pan Xu, A Pranav, Raj Korpan, Ruchira 2022. Galactica: A large language model for science.
Ray, Sarah Mathew, Sarthak Arora, St John, Tanvi ArXiv, abs/2211.09085.
Anand, Vishakha Agrawal, William Agnew, Yanan
Long, Zijie J. Wang, Zeerak Talat, Avijit Ghosh, pdf2image. 2023. pdf2image. https://github.
Nathaniel Dennler, Michael Noseworthy, Sharvani com/Belval/pdf2image.
Jha, Emi Baylor, Aditya Joshi, Natalia Y. Bilenko,
Andrew Mcnamara, Raphael Gontijo-Lopes, Alex pdfplumber. 2023. pdfplumber. https://github.
Markham, Evyn Dong, Jackie Kay, Manu Saraswat, com/jsvine/pdfplumber.
Nikhil Vytla, and Luke Stark. 2023. Queer In AI: A
Dominika Tkaczyk, Paweł Szostek, Mateusz Fedo-
Case Study in Community-Led Participatory AI. In
ryszak, Piotr Jan Dendek, and Lukasz Bolikowski.
Proceedings of the 2023 ACM Conference on Fair-
2015. Cermine: Automatic extraction of structured
ness, Accountability, and Transparency, FAccT ’23,
metadata from scientific literature. Int. J. Doc. Anal.
page 1882–1895, New York, NY, USA. Association
Recognit., 18(4):317–335.
for Computing Machinery.

Napol Rachatasumrit, Jonathan Bragg, Amy X. Zhang, Amalie Trewartha, Nicholas Walker, Haoyan Huo,
and Daniel S Weld. 2022. Citeread: Integrating lo- Sanghoon Lee, Kevin Cruse, John Dagdelen, Alex
calized citation contexts into scientific paper reading. Dunn, Kristin Aslaug Persson, Gerbrand Ceder, and
In 27th International Conference on Intelligent User Anubhav Jain. 2022. Quantifying the advantage of
Interfaces, IUI ’22, page 707–719, New York, NY, domain-specific pre-training on named entity recog-
USA. Association for Computing Machinery. nition tasks in materials science. Patterns, 3.

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Lucy Lu Wang, Isabel Cachola, Jonathan Bragg, Evie
Sun. 2015. Faster r-cnn: Towards real-time object de- (Yu-Yen) Cheng, Chelsea Hess Haupt, Matt Latzke,
tection with region proposal networks. IEEE Trans- Bailey Kuehl, Madeleine van Zuylen, Linda M. Wag-
actions on Pattern Analysis and Machine Intelligence, ner, and Daniel S. Weld. 2021. Improving the acces-
39:1137–1149. sibility of scientific documents: Current state, user
needs, and a system solution to enhance scientific pdf
Nipun Sadvilkar and Mark Neumann. 2020. PySBD: accessibility for blind and low vision users. ArXiv,
Pragmatic sentence boundary disambiguation. In abs/2105.00076.
Proceedings of Second Workshop for NLP Open
Source Software (NLP-OSS), pages 110–114, Online. Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar,
Association for Computational Linguistics. Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin
Eide, Kathryn Funk, Yannis Katsis, Rodney Michael
Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Kinney, Yunyao Li, Ziyang Liu, William Merrill,
Daniel S. Weld, and Doug Downey. 2022. VILA: Im- Paul Mooney, Dewey A. Murdick, Devvret Rishi,
proving structured content extraction from scientific Jerry Sheehan, Zhihong Shen, Brandon Stilson,
PDFs using visual layout groups. Transactions of the Alex D. Wade, Kuansan Wang, Nancy Xin Ru Wang,
Association for Computational Linguistics, 10:376– Christopher Wilhelm, Boya Xie, Douglas M. Ray-
392. mond, Daniel S. Weld, Oren Etzioni, and Sebastian
Kohlmeier. 2020. CORD-19: The COVID-19 open
Zejiang Shen, Ruochen Zhang, Melissa Dell, B. Lee, research dataset. In Proceedings of the 1st Work-
Jacob Carlson, and Weining Li. 2021. Layoutparser: shop on NLP for COVID-19 at ACL 2020, Online.
A unified toolkit for deep learning based document Association for Computational Linguistics.
image analysis. In IEEE International Conference
on Document Analysis and Recognition. Rosalee Wolfe, John McDonald, Ronan Johnson, Ben
Sturr, Syd Klinghoffer, Anthony Bonzani, Andrew
Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Alexander, and Nicole Barnekow. 2022. Supporting
Sachin Mehta, Madeleine van Zuylen, Sravanthi mouthing in signed languages: New innovations and
Parasa, Sameer Singh, Matt Gardner, and Hannaneh a proposal for future corpus building. In Proceedings
Hajishirzi. 2020. MedICaT: A dataset of medical of the 7th International Workshop on Sign Language

504
Translation and Avatar Technology: The Junction of
the Visual and the Textual: Challenges and Perspec-
tives, pages 125–130, Marseille, France. European
Language Resources Association.
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu
Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou.
2021. LayoutLMv2: Multi-modal pre-training for
visually-rich document understanding. In Proceed-
ings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 2579–2591, Online.
Association for Computational Linguistics.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu


Wei, and Ming Zhou. 2019. Layoutlm: Pre-training
of text and layout for document image understanding.
Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining.

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes.


2019. Publaynet: largest dataset ever for document
layout analysis. In 2019 International Conference on
Document Analysis and Recognition (ICDAR), pages
1015–1022. IEEE.

505
A Appendix

A.1 Comparison and Compatibility with XML

One can view Layers as capturing content hierarchy (e.g., tokens vs. sentences), similar to other structured document representations like TEI XML trees. We note that Layers are stored as unordered attributes and don't require nesting. This allows for cross-layer referencing operations that don't adhere to strict nesting relationships. For example:

    for sentence in doc.sentences:
        for line in sentence.lines:
            ...

Recall that a sentence can begin or end midway through a line and cross multiple lines (§3.1). Similarly, not all lines are exactly contained within the boundaries of a sentence. As such, sentences and lines are not strictly nested within each other. This relationship would be difficult to encode in an XML format adhering to a document tree structure.

Regardless, the way we represent structure in documents is highly versatile. We demonstrate this by also implementing GrobidParser as an alternative to the PDF2TextParser in §3.1. GrobidParser invokes Grobid to process PDFs and reads the resulting TEI XML file generated by Grobid, converting each XML tag of a common level into an Entity of its own Layer. We use this to perform the evaluation in Table 2.

A.2 Additional magelib Protocols and Utilities

Serialization. Any Document and all of its Layers can be exported to a JSON format and perfectly reconstructed:

    import json
    from papermage.magelib import Document

    with open("....json", "w") as f_out:
        json.dump(doc.to_json(), f_out)

    with open("....json", "r") as f_in:
        doc = Document.from_json(json.load(f_in))

A.3 Evaluating papermage's CoreRecipe against Grobid

Here, we detail how we performed the evaluation reported in §3.3 (Table 2). We also provide a full breakdown by category in Table 3.

As described earlier in the paper, Grobid is difficult to evaluate because it is developed with tight coupling between its PDF parser (pdfalto) and the models it employs to perform logical structure recovery over the resulting token stream. As such, there is no straightforward way to run just the model components of Grobid on an alternative token stream like the one provided in the S2-VL (Shen et al., 2022) dataset.

To perform this baseline evaluation, we ran the original PDFs that were annotated for S2-VL through our GrobidParser using v0.7.3. Grobid also returns bounding boxes for some predicted categories (e.g., authors, abstract, paragraphs). We use these bounding boxes to create Entities that we annotate on a Document constructed manually from S2-VL data. Using magelib cross-layer referencing, we were able to match Grobid predictions to S2-VL data to perform this evaluation.

We found, however, that for certain categories bounding box information was either not available (e.g., Titles) or simply not returned by Grobid (e.g., Figure text extraction). These are represented by zeros in Table 3, which contributes to the lower scores in Table 2 after macro-averaging. For a more apples-to-apples comparison, we also included a "Grobid Subset" evaluation restricted to just the categories in S2-VL for which Grobid produced bounding box information.

In addition to Grobid, we evaluate two of our provided Transformer-based models. The RoBERTa-large (Liu et al., 2019) model is a Transformers token classification model that we finetuned on the S2-VL training set. The I-VILA model is a layout-infused Transformer model pretrained by Shen et al. (2022) on the S2-VL training set. As with Grobid, we ran our CoreRecipe using these two models on the original PDFs in S2-VL and performed a similar token mapping operation, since our PDF2TextParser also produces a different token stream than the one provided in S2-VL.

Ultimately, the Transformer-based models performed better at this task than Grobid. This is unsurprising given the expected improvements of a Transformer model over a CRF or BiLSTM; the Transformer models were also trained on S2-VL data, which gave them an advantage over Grobid. Overall, this evaluation is intended to show how papermage enables cross-system comparisons, even in the presence of token stream incompatibilities, and to illustrate an upper bound on the performance left on the table by existing software systems that don't make use of state-of-the-art models.
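The token-matching evaluation described in A.3 can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the actual evaluation code: the Token dataclass and the box-containment test below are simplified stand-ins for magelib's Entity and bounding-box types, and box coordinates are assumed to share the tokens' page coordinate system.

```python
from dataclasses import dataclass

@dataclass
class Token:
    x: float     # token center, page coordinates
    y: float
    gold: str    # gold category from the annotations

def label_by_boxes(tokens, pred_boxes):
    """Assign each token the category of the first predicted
    bounding box (x0, y0, x1, y1, category) containing it;
    tokens outside every box get 'null'."""
    labels = []
    for t in tokens:
        label = "null"
        for (x0, y0, x1, y1, cat) in pred_boxes:
            if x0 <= t.x <= x1 and y0 <= t.y <= y1:
                label = cat
                break
        labels.append(label)
    return labels

def prf1(tokens, pred_labels, category):
    """Token-level precision/recall/F1 for one category."""
    tp = sum(1 for t, p in zip(tokens, pred_labels)
             if p == category and t.gold == category)
    n_pred = sum(1 for p in pred_labels if p == category)
    n_gold = sum(1 for t in tokens if t.gold == category)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Tiny hypothetical example: one Title box covers only the
# first of two gold Title tokens.
tokens = [Token(1, 1, "Title"), Token(5, 1, "Title"), Token(1, 5, "Paragraph")]
boxes = [(0, 0, 2, 2, "Title"), (0, 4, 2, 6, "Paragraph")]
labels = label_by_boxes(tokens, boxes)
```

Applying `prf1(tokens, labels, "Title")` to this example yields perfect precision but only 0.5 recall, mirroring how missing bounding boxes depress Grobid's per-category recall in Table 3.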
Category                    GROBID_CRF        GROBID_NN         RoBERTa           I-VILA
                            P     R     F1    P     R     F1    P     R     F1    P     R     F1
Abstract                    81.9  89.1  85.3  85.3  89.8  87.5  89.2  93.7  91.4  97.4  98.3  97.8
Author                      55.2  42.6  48.1  75.1  14.0  23.6  87.5  73.5  79.9  65.5  96.9  78.2
Bibliography                96.5  98.6  97.5  95.5  97.6  96.5  93.6  93.3  93.5  99.7  98.2  99.0
Caption                     70.3  70.0  70.2  70.2  69.7  70.0  80.0  77.3  78.6  93.1  89.6  91.3
Equation                    71.1  85.3  77.6  71.1  85.3  77.6  55.0  85.7  67.0  90.7  94.2  92.4
Figure                       0.0   0.0   0.0   0.0   0.0   0.0  88.9  82.3  85.4  99.8  96.8  98.3
Footer                       0.0   0.0   0.0   0.0   0.0   0.0  56.1  59.9  57.9  96.8  78.1  86.5
Footnote                     0.0   0.0   0.0   0.0   0.0   0.0  59.8  44.3  50.9  80.2  93.5  86.3
Header                       0.0   0.0   0.0   0.0   0.0   0.0  40.5  84.3  54.7  92.9  99.1  95.9
Keywords                     0.0   0.0   0.0   0.0   0.0   0.0  93.8  97.1  95.4  96.9  99.4  98.1
List                         0.0   0.0   0.0   0.0   0.0   0.0  61.9  63.8  62.9  76.7  82.4  79.4
Paragraph                   94.5  89.8  92.1  94.4  89.9  92.1  93.5  93.0  93.3  98.7  97.9  98.3
Section                     83.0  79.4  81.1  83.0  79.4  81.1  67.7  82.7  74.4  96.2  91.6  93.9
Table                       97.3  58.6  73.2  97.9  58.6  73.3  94.7  71.8  81.7  96.1  94.9  95.5
Title                        0.0   0.0   0.0   0.0   0.0   0.0  76.3  96.7  85.3  98.7  99.9  99.3
Macro Avg (Full S2-VL)      40.6  38.3  39.1  42.0  36.5  37.6  75.9  80.0  76.8  92.0  94.1  92.7
Macro Avg (Grobid Subset)   81.2  76.7  78.9  84.1  73.0  78.2  82.6  83.9  83.2  92.2  95.2  93.7

Table 3: Evaluating CoreRecipe for logical structure recovery on S2-VL (Shen et al., 2022). These are per-category
metrics for Table 2. Metrics are computed for token-level classification, macro-averaged over categories. The
“Grobid Subset” limits evaluation to only categories for which Grobid returns bounding box information, which was
necessary for evaluation on S2-VL.
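The two macro-average rows in Table 3 differ only in which categories enter the mean. A minimal sketch of that computation, using a handful of the GROBID_CRF scores from the table (the three-category dictionary is illustrative, not the full fifteen-category set):

```python
def macro_avg(per_category, subset=None):
    """Macro-average (P, R, F1) triples over categories.
    If `subset` is given, average only over those categories;
    this mirrors the "Grobid Subset" row, which drops categories
    for which Grobid returns no bounding boxes."""
    cats = subset if subset is not None else list(per_category)
    n = len(cats)
    sums = [0.0, 0.0, 0.0]
    for cat in cats:
        for i, v in enumerate(per_category[cat]):
            sums[i] += v
    return tuple(s / n for s in sums)

# (P, R, F1) per category; the zeroed Title row stands in for
# categories Grobid does not return.
scores = {
    "Abstract": (81.9, 89.1, 85.3),
    "Title": (0.0, 0.0, 0.0),
    "Paragraph": (94.5, 89.8, 92.1),
}
full = macro_avg(scores)
sub = macro_avg(scores, subset=["Abstract", "Paragraph"])
```

Averaging over the zeroed Title row drags the full-set mean well below the subset mean, which is exactly why the "Grobid Subset" row in Table 3 is so much higher than the "Full S2-VL" row for both Grobid variants.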

