
ArXiv Dataset for Research Benchmarking

The document discusses the arXiv as a valuable dataset for benchmarking next-generation models due to its extensive collection of 1.5 million pre-print articles across various scientific fields. It presents a pipeline for standardizing access to arXiv data, enabling the extraction of a 6.7 million edge citation graph and an 11 billion word corpus of full-text research articles. The authors aim to facilitate future research by providing a comprehensive resource for multi-modal, relational modeling using the arXiv's rich metadata and citation structures.


On the Use of ArXiv as a Dataset

Preprint · April 2019

DOI: 10.48550/arXiv.1905.00075


Presented at the Representation Learning on Graphs and Manifolds Workshop at ICLR 2019

ON THE USE OF ARXIV AS A DATASET

Colin B. Clement
Cornell University, Department of Physics
Ithaca, New York 14853-2501, USA
cc2285@[Link]

Matthew Bierbaum
Cornell University, Department of Information Science
Ithaca, New York 14853-2501, USA
mkb72@[Link]

Kevin O’Keeffe
Senseable City Lab, Massachusetts Institute of Technology
Cambridge, MA 02139
kokeeffe@[Link]

Alexander A. Alemi
Google Research
Mountain View, CA
alemi@[Link]

ABSTRACT

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature
from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print
features text, figures, authors, citations, categories, and other metadata. These rich,
multi-modal features, combined with the natural graph structure created by citation,
affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking
next-generation models. Here we take the first necessary steps toward this goal, by providing
a pipeline which standardizes and simplifies access to the arXiv’s publicly available data.
We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an
11 billion word corpus of full-text research articles. We present some baseline classification
results, and motivate application of more exciting generative graph models.

1 INTRODUCTION

Real-world datasets are typically multimodal (comprising images, text, time series, etc.) and
have complex relational structures well captured by a graph. Recently, advances have been made on
models which act on graphs, allowing the rich features and relational structures of real-world data to
be utilized (Hamilton et al., 2017b;a; Battaglia et al., 2018; Goyal & Ferrara, 2018; Nickel et al.,
2016). Many of these advances have been facilitated by the availability of large benchmark datasets:
for example, the ImageNet (Russakovsky et al., 2015) dataset has been widely used as a community
standard for image classification. We believe the arXiv can provide a similarly useful benchmark
for large-scale, multimodal, relational modelling.
The arXiv^1 is the de facto online manuscript pre-print service for Computer Science, Mathematics,
Physics, and many interdisciplinary communities. Since 1991 the arXiv has offered a place for
researchers to reliably share their work as it undergoes the process of peer review, and for many
researchers it is their primary source of literature. With over 1.5 million articles, a large multigraph
dataset can be built, including full-text articles, article metadata, and internal co-citations.
The arXiv has been used many times as a dataset. Liben-Nowell & Kleinberg (2007) used the
topology of the arXiv co-authorship graph to study link prediction. Dempsey et al. (2019) used
the authorship graph to test a hierarchically structured network model. Lopuszynski & Bolikowski
(2013) used the category labels of arXiv documents to train and assess an automatic text labelling
system. Dai et al. (2015) used a subset of the full text available on the arXiv to study the utility of
“paragraph vectors” for capturing document similarity. Alemi & Ginsparg (2015) used the fulltext
to evaluate a method for unsupervised text segmentation. Eger et al. (2019) and Liu et al. (2018)
built models to predict future research topic trends in machine learning and physics respectively.
The arXiv also formed the basis of the popular 2003 KDD Cup (Gehrke et al., 2003), in which
researchers competed for the prize of best algorithm for citation prediction, download estimation,
and data cleaning^2.

^1 [Link]

All these works used different subsets of arXiv’s data, limiting their potential impact, as future
researchers will be unable to directly compare their work to these existing results. The goal of this
paper is to improve this situation by providing an open-source pipeline to standardize, simplify, and
normalize access to the arXiv’s public data, providing a benchmark to facilitate the development of
models for multi-modal, relational data.

2 DATASET
We built a freely available, open-source pipeline^3 for collecting arXiv metadata from the Open
Archives Initiative (Lagoze & Van de Sompel, 2001), and for bulk PDF downloading from the arXiv^4.
Further, this pipeline converts the raw PDFs to plaintext, builds the intra-arXiv co-citation network
by searching the full text for arXiv ids, and cleans and normalizes author strings.

2.1 METADATA

Through its participation in the Open Archives Initiative,^5 the arXiv makes all article metadata^6
available, with updates made shortly after new articles are published^7. We provide code for using
these public APIs to download a full set of current arXiv metadata. As of 2019-03-01, metadata for
1,506,500 articles was available. For verification and ease of use, we provide a copy of
the metadata (less abstracts) as of the date we accessed it. An example listing is shown in Figure 1.
Each article includes an arXiv id (e.g. 0704.0001)^8 used to identify the article, the publicly
visible name of the submitter, a list of authors, title, abstract, versions and category listings, as well
as optional doi, journal-ref and report-no fields. Of particular note is the first category
listed, the primary category, of which there are 171 at this time. Notice that the list of authors is just a
single string of author names, potentially joined with commas or ‘and’s. We have provided a suggested
normalization script for splitting these author strings into a list of author names.
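As an illustration of the splitting step, the following is a minimal sketch and not the script shipped in our repository. The function name `split_authors` is hypothetical, and the sketch assumes names are separated only by commas and the word "and"; affiliations in parentheses and suffixes such as "Jr." are not handled.

```python
import re

def split_authors(author_string):
    """Split a raw `authors` string into a list of author names.

    Assumption: names are separated by ',', 'and', or ', and'.
    """
    # Try the longest separator first so ", and" is consumed as one unit.
    parts = re.split(r",\s*and\s+|\s+and\s+|,\s*", author_string)
    # Collapse internal whitespace and drop empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]

authors = split_authors(
    "Colin B. Clement, Matthew Bierbaum, Kevin P. O'Keeffe, and Alexander A. Alemi"
)
print(authors)
```

Because the separator "and" must be surrounded by whitespace or preceded by a comma, names that merely contain those letters (such as "Alexander") are left intact.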
The optional doi, journal-ref and report-no fields are not validated, but they can
potentially be used to find intersections between the arXiv dataset and
other scientific literature datasets. Population counts for the optional fields are shown in Table 1.

Field   id          submitter   comments    doi       journal-ref   report-no
Count   1,506,500   1,491,303   1,229,138   810,209   608,286       154,922

Table 1: Number of articles with the corresponding field populated. Note that the fields id,
abstract, authors, versions, and categories are always populated.

^2 The data for those challenges are available at [Link]kddcup/[Link]
^3 [Link]2.0
^4 [Link]
^5 [Link]
^6 [Link]
^7 Further details available at [Link]
^8 There are two forms of valid arXiv IDs, delineated by the year 2007, described in [Link]org/help/arxiv_identifier.
^9 [Link]

{'id': '1905.00075',
 'submitter': 'Colin B. Clement',
 'authors': 'Colin B. Clement, Matthew Bierbaum, Kevin P. O\'Keeffe, and Alexander A. Alemi',
 'title': 'On the Use of ArXiv as a Dataset',
 'comments': '7 pages, 3 figures, 2 tables',
 'journal-ref': '',
 'doi': '',
 'abstract': 'The arXiv has collected 1.5 million pre-prints over 28 years, hosting literature from physics, '
             'mathematics, computer science, biology, finance, statistics, electrical engineering, and economics. '
             'Each pre-print features text, figures, author lists, citation lists, categories, and other metadata. '
             'These rich, multi-modal features, combined with the natural relational graph structure created by '
             'citation, affiliation, and co-authorship makes the arXiv an exciting candidate for benchmarking '
             'next-generation models. Here we take the first necessary steps toward this goal, by providing a pipeline '
             'which standardizes and simplifies access to the arXiv’s publicly available data. We use this pipeline '
             'to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text '
             'research articles. We present some baseline classification results, and motivate application of more '
             'exciting relational neural network models.',
 'categories': ['[Link]'],
 'versions': ['v1']}

Figure 1: An example of what the metadata for this very article may look like if it were submitted
to the arXiv.

2.2 FULL TEXT

One advantage the arXiv has over other graph datasets is that it provides a very rich attribute at each
id node: the full raw text and figures of a research article. To extract the raw text from PDFs, we
provide a pipeline with two parts. First, a helper script downloads the full set of PDFs available through the
arXiv’s bulk download service^9. Since the arXiv hosts its data in requester-pays AWS S3 buckets,
this constitutes ∼1.1 TB and ∼$100 to fully download. For posterity, we have provided MD5
hashes of the PDFs at the state of the frozen metagraph extraction. Raw TeX source is also available
for the subset of articles that provide it. Second, we provide a standard PDF-to-text converter,
powered by pdftotext^10, to convert the PDFs to plaintext.
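To show how the published MD5 hashes could be used to check a local download, here is a minimal sketch. The helper names `md5_of_file` and `find_mismatches` are hypothetical, and the `expected` mapping (relative PDF path to hex digest) is an assumed format rather than the layout of the hash file in our repository.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(expected, root="."):
    """Return the sorted relative paths that are missing or have a different MD5.

    `expected` maps a relative path to its hex digest (assumed format).
    """
    bad = []
    for rel_path, digest in expected.items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path) or md5_of_file(full_path) != digest.lower():
            bad.append(rel_path)
    return sorted(bad)
```

Reading in chunks keeps memory use flat, which matters when the files being verified total around a terabyte.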
Using this pipeline, it is currently possible to extract a corpus of 1.37 million raw text documents.
Figure 2 shows an example of the text extracted from a PDF. Though the extracted text isn’t perfectly
clean, we believe it will still be useful for many tasks, and hope future contributions to our repository
will provide better data cleaning procedures.
The extracted raw-text dataset is ∼64 GB in size, totaling ∼11 billion words. An order of
magnitude larger than the common billion word corpus (Chelba et al., 2013), this large size makes the
arXiv raw text a competitive alternative to other full-text datasets. Moreover, the technical nature of
the arXiv distinguishes it from other full-text datasets. For example, the TeX data contained in the
arXiv presents an opportunity to study mathematical formulae in bulk, as is done in the NTCIR-11
Task: Math-2 (Aizawa et al., 2014).

2.3 CO-CITATIONS

While the arXiv does not currently provide a public API to access co-citations, our pipeline allows
a simple but large co-citation network to be extracted. We extracted this network by searching the
text of each article for valid arXiv ids, thereby finding which nodes should be linked to a given node
in the co-citation network. We provide a compressed binary of the resulting network at the
repository^11, so that researchers can study it directly and avoid the difficulty of constructing it themselves.
Table 2 summarizes the size and statistical structure of our co-citation network, compared with other
popular citation networks. Šubelj et al. (2014) also studied data from the arXiv, but as indicated in
the bottom row of Table 2, they used only the 34,546 articles from the 2003 KDD Cup challenge.
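The id search can be sketched as follows. This is an illustration under stated assumptions, not the regular expression used in our repository: the two patterns are our approximation of the old-style (e.g. hep-th/9901001) and new-style (e.g. 0704.0001, 1905.00075) identifier schemes, and the names `cited_ids` and `build_edges` are hypothetical.

```python
import re

# Assumed patterns for the two identifier schemes.
OLD_STYLE = r"[a-z-]+(?:\.[A-Z]{2})?/\d{7}"   # e.g. hep-th/9901001, math.GT/0309136
NEW_STYLE = r"\d{4}\.\d{4,5}"                 # e.g. 0704.0001, 1905.00075
# Capture the bare id; an optional version suffix such as "v2" is matched but dropped.
ARXIV_ID = re.compile(r"(?<![\w.])(" + OLD_STYLE + r"|" + NEW_STYLE + r")(?:v\d+)?(?!\d)")

def cited_ids(text, known_ids, self_id=None):
    """Return the sorted known arXiv ids mentioned in `text`, excluding `self_id`."""
    found = {match.group(1) for match in ARXIV_ID.finditer(text)}
    return sorted(i for i in found if i in known_ids and i != self_id)

def build_edges(fulltext_by_id):
    """Return (citing_id, cited_id) pairs for a dict mapping arXiv id to full text."""
    known = set(fulltext_by_id)
    return [
        (src, dst)
        for src, text in fulltext_by_id.items()
        for dst in cited_ids(text, known, src)
    ]
```

Filtering candidates against the set of known ids discards numbers that merely look like identifiers, and it is also why citations to works outside the arXiv never produce an edge.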
Table 2 reports standard statistics for the co-citation network. Our arXiv co-citation network contains
$O(10^6)$ nodes, an order of magnitude larger than the $O(10^5)$ nodes in the other citation networks.
The exponents of best fit for the degree distributions, $\alpha_{in}$ and $\alpha_{out}$, are consistent with the existing
citation networks (Šubelj et al., 2014), as is the average degree $\langle k \rangle$. 62% of the nodes are contained in
the largest weakly connected component, while 31% of the nodes are fully isolated, meaning their
in-degree $k_{in}$ and out-degree $k_{out}$ are both zero. Recall that our arXiv co-citation network only contains
publications which have been posted on the arXiv; a given paper which cites only papers published
elsewhere, and not on the arXiv, will have $k_{out} = 0$ in this set, which is one explanation for the large
number of isolated nodes.
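The statistics discussed above can be computed on any edge list with a short union-find routine. The sketch below is illustrative only: our reported values were computed with networkx, the function name `graph_summary` is hypothetical, and the average degree is assumed to be twice the edge count divided by the node count. With the rounded values in Table 2 that assumption gives 2 × 6.72 × 10^6 / 1.35 × 10^6 ≈ 9.96, close to the reported 9.933.

```python
from collections import Counter

def graph_summary(nodes, edges):
    """Return (mean_degree, largest_wcc_fraction, isolated_fraction).

    Edge direction is ignored, which is what "weakly connected" means.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
        parent[find(src)] = find(dst)

    total = len(parent)
    component_sizes = Counter(find(node) for node in parent)
    mean_degree = 2 * len(edges) / total          # assumed definition of <k>
    largest_wcc = max(component_sizes.values()) / total
    isolated = sum(1 for node in parent if degree[node] == 0) / total
    return mean_degree, largest_wcc, isolated

print(graph_summary(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]))
```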

^10 Version 0.61.1, available on most Debian systems from the apt package poppler-utils
^11 As part of one of the tagged releases: [Link]arxiv-public-datasets/releases


1 Published as a conference paper at ICLR 2019

2
3 O N THE U SE OF A R X IV AS A DATASET
4 Colin B. Clement
5 Cornell University, Department of Physics
6 Ithaca, New York 14853-2501, USA
7 cc2285@[Link]
8
9 Matthew Bierbaum
10 Cornell University, Department of Information Science
11 Ithaca, New York 14853-2501, USA
12 mkb72@[Link]
13
14 Kevin O K e e f f e
15 Senseable City Lab, Massachusetts Institute of Technology
16 Cambridge, MA 02139
17 kokeeffe@[Link]
18
19 Alexander A. Alemi
20 Google Research
21 Mountain View, CA
22 alemi@[Link]
23
24 A BSTRACT
25 The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific
fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures,
authors, citations, categories, and other
26 metadata. These rich, multi-modal features, combined with the natural graph
27 structure created by citation, affiliation, and co-authorship makes the arXiv
28 an exciting candidate for benchmarking next-generation models. Here we take the
29 first necessary steps toward this goal, by providing a pipeline which standardizes
30 and simplifies access to the a r X i v s publicly available data. We use this pipeline to
31 extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research
articles. We present some baseline classification results,
32 and motivate application of more exciting generative graph models.

Figure 2: Example text extracted from this PDF.

Table 2: Graph statistics for popular citation networks. All but the data for this work (first row)
were taken from Tables 1 and 2 in Šubelj et al. (2014). $\langle k \rangle$ is the average degree, and $\alpha_{in}$ and $\alpha_{out}$
are power-law exponents of best fit for the degree distribution. WCC refers to the largest weakly
connected component, computed using the Python package networkx. The power-law exponents
$\alpha_{in}$, $\alpha_{out}$ were found using the Python module powerlaw. When fitting data to a power law, the
package discards all data below an automatically computed threshold $x_{min}$. These thresholds for $k_{in}$
and $k_{out}$ were $x_{min} = 73$ and $x_{min} = 59$ respectively.

Dataset    N_nodes       N_edges       ⟨k⟩     α_in   α_out   % WCC

arXiv      1.35 × 10^6   6.72 × 10^6   9.933   2.93   3.93    62
WoS        1.40 × 10^5   6.4 × 10^5    9.11    2.39   3.88    97
CiteSeer   3.84 × 10^5   1.74 × 10^6   9.08    2.28   3.82    95
KDD2003    3.34 × 10^4   4.21 × 10^5   24.50   2.54   3.45    99.6
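For readers unfamiliar with power-law fitting, the following sketch shows the textbook continuous maximum-likelihood estimate of the exponent for a fixed threshold. It is not the procedure of the powerlaw module, which additionally selects the threshold automatically and can treat the data as discrete; the function name `powerlaw_alpha` is hypothetical.

```python
import math

def powerlaw_alpha(degrees, x_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Uses alpha = 1 + n / sum(ln(x / x_min)) over the n values with x >= x_min;
    values below the threshold are discarded.
    """
    tail = [x for x in degrees if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum
```

Only the tail above the threshold enters the estimate, which is why the thresholds are reported alongside the exponents.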

Beyond constructing and analyzing a co-citation network, the arXiv dataset can be used for many
tasks, such as relationally powered classification, author attribution, segmentation, clustering,
structured prediction, language modeling, link prediction and automatic summary generation. As a basic
demonstration, in Table 3 we show some baseline category classification results. These were
obtained by training logistic regression on 1.2 million arXiv articles to predict in which category (e.g.
[Link], [Link]) a given article resides. See Appendix A for a detailed explanation of the
experimental setup. Titles and abstracts were represented by vectors from a pre-trained instance^12 of the
Universal Sentence Encoder of Cer et al. (2018). We see that including more aspects of each
document (titles, abstracts, full text) and exposing their relations via co-citation leads to better predictive
power. This only scratches the surface of the tasks and models that could be applied to this rich dataset.

^12 From [Link]


Table 3: Baseline classification performance on a holdout set of 390k articles. Titles and abstracts
were embedded in a 512-dimensional subspace using the Universal Sentence Encoder, and the classifier
was trained on 1.2 million articles with logistic regression. ‘All’ refers to the concatenation of title,
abstract, full-text, and co-citation features. ‘All - X’ refers to the ablation of feature X from ‘All’. Top n is
the fraction of test articles whose correct class is among the n most confident predictions.
A detailed explanation of the features and methods can be found in Appendix A.

Features Top 1 Top 3 Top 5 Perplexity

Titles (T) 36.6% 59.3% 68.8% 12.7
Abstracts (A) 46.0% 70.7% 79.5% 7.5
Fulltext (F) 64.2% 79.4% 85.9% 4.6
Co-citation (C) 37.8% 49.4% 53.8% 18.5
All = T + A + F + C 78.4% 91.4% 94.5% 2.3
All - T 77.0% 90.7% 94.0% 2.5
All - A 74.7% 88.3% 91.9% 2.8
All - F 59.0% 79.8% 86.2% 4.6
All - C 75.5% 89.9% 93.6% 2.6
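The two kinds of metric in Table 3 can be sketched as follows. The function names are hypothetical, and the perplexity definition used here (the exponential of the mean negative log-probability assigned to the correct class) is an assumption, since the text does not spell out the formula.

```python
import math

def top_n_accuracy(probabilities, labels, n):
    """Fraction of rows whose true label is among the n most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Class indices ordered from most to least confident.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        hits += label in ranked[:n]
    return hits / len(labels)

def perplexity(probabilities, labels):
    """exp of the mean negative log-probability of the true class (assumed definition)."""
    nll = -sum(math.log(probs[label]) for probs, label in zip(probabilities, labels))
    return math.exp(nll / len(labels))
```

Under this definition a classifier that spreads its probability uniformly over k classes has perplexity k, so lower values indicate more confident correct predictions.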

3 CONCLUSION
As research moves increasingly towards structured relational modelling (Hamilton et al., 2017b;a;
Battaglia et al., 2018), there is a growing need for large-scale, relational datasets with rich anno-
tations. With its authorship, categories, abstracts, co-citations, and full text, the arXiv presents an
exciting opportunity to promote progress in relational modelling. We have provided an open-source
repository of tools that make it easy to download and standardize the data available from the arXiv.
Our preliminary classification baselines support the claim that each mode of the arXiv’s feature set
allows for greatly improved category inference. More sophisticated models that include relational
inductive biases—encoding the graph structures of the arXiv—will improve these results. Further,
this new benchmark dataset will allow more rapid progress in tasks such as link prediction, automatic
summary generation, text segmentation, and time-varying topic modeling of scientific disciplines.

ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for their helpful comments. CBC was funded by NSF
grant DMR-1719490. MB thanks the Allen Institute for Artificial Intelligence for funding. KPO
thanks the members of the MIT Senseable City Lab consortium for their support.

REFERENCES
Akiko Aizawa, Michael Kohlhase, Iadh Ounis, and Moritz Schubotz. Ntcir-11 math-2 task overview.
In NTCIR, volume 11, pp. 88–98. Citeseer, 2014.
Alexander A. Alemi and Paul Ginsparg. Text segmentation based on semantic word embeddings.
CoRR, abs/1503.05543, 2015. URL [Link]
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi,
Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261,
2018. URL [Link]
Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Lyn Untalan Limtiaco, Rhomni St. John,
Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, Yun hsuan Sung, Brian Strope,
and Ray Kurzweil. Universal sentence encoder. In submission to: EMNLP demonstration,
Brussels, Belgium, 2018. URL [Link]
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony
Robinson. One billion word benchmark for measuring progress in statistical language modeling.
arXiv preprint arXiv:1312.3005, 2013. URL [Link]


Andrew M. Dai, Christopher Olah, and Quoc V. Le. Document embedding with paragraph vectors.
CoRR, abs/1507.07998, 2015. URL [Link]
Walter Dempsey, Brandon Oselio, and Alfred Hero. Hierarchical network models for structured
exchangeable interaction processes. arXiv preprint arXiv:1901.09982, 2019.
Steffen Eger, Chao Li, Florian Netzer, and Iryna Gurevych. Predicting research trends from arxiv.
arXiv preprint arXiv:1903.02831, 2019.
Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. Overview of the 2003 kdd cup. ACM SIGKDD
Explorations Newsletter, 5(2):149–151, 2003.
Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance:
A survey. Knowledge-Based Systems, 151:78–94, 2018. URL [Link]
1705.02801.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017a. URL https:
//[Link]/abs/1706.02216.
William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods
and applications. 2017b. URL [Link]
Carl Lagoze and Herbert Van de Sompel. The open archives initiative: Building a low-barrier
interoperability framework. In Proceedings of the 1st ACM/IEEE-CS joint conference on Digital
libraries, pp. 54–62. ACM, 2001.
David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal
of the American society for information science and technology, 58(7):1019–1031, 2007.
Wenyuan Liu, Stanisław Saganowski, Przemysław Kazienko, and Siew Ann Cheong. Using machine
learning to predict the evolution of physics research. arXiv preprint arXiv:1810.12116, 2018.
Michal Lopuszynski and Lukasz Bolikowski. Tagging scientific publications using wikipedia and
natural language processing tools. comparison on the arxiv dataset. CoRR, abs/1309.0326, 2013.
URL [Link]
Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational
machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016. URL
[Link]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Lovro Šubelj, Dalibor Fiala, and Marko Bajec. Network-based statistical comparison of citation
topology of bibliographic databases. Scientific reports, 4:6496, 2014.


A LOGISTIC REGRESSION ARTICLE CLASSIFICATION BASELINE

ArXiv articles are assigned primary categories (e.g. [Link] is Artificial Intelligence and [Link] is
Computational Complexity) by the article submitter, and the assignment is then confirmed by the arXiv
moderation system. This label can be obtained for each article from the OAI metadata described in the main
article, and is the first element of a space-delimited string in the categories attribute. There
are, at the time of writing, L = 175 possible categories. Since more categories can be added in
the future and the metadata can be modified, please consult the frozen metadata file in the GitHub
repository release^13 for these 175 categories. This appendix explains how we developed the article
classification baselines using features from the titles, abstracts, full text, and co-citation network.
The code for performing this task can be found in the git repository^14.

A.1 BUILDING FEATURES

The title, abstract, and full text of each article are variable-length strings. Every article has both
a title and an abstract in the OAI metadata, but not all articles have a full-text PDF. In our frozen
dataset there are N = 1,506,500 articles with metadata, but only 1,357,536 have full text in the
arXiv. We vectorized each string into 512 dimensions using the pretrained Universal Sentence
Encoder,^15 substituting zeros for missing full text.
The intra-arXiv citation graph can be used via the $N \times N$ co-citation matrix, which is defined as

$$M_{ij} = \begin{cases} 1 & \text{if article } i \text{ cites article } j \text{ or vice versa} \\ 0 & \text{else.} \end{cases} \tag{1}$$

In order to prevent a leaking of the test set into the training set, using the train/test partition defined
below, we omitted citations in M from articles in the training set which connect to the test set, but
retained citations in the test set which connect to the training set.
We can also define the $N \times L$ category matrix in the standard one-hot fashion

$$C_{jl} = \begin{cases} 1 & \text{if article } j \text{ is in category } l \\ 0 & \text{else.} \end{cases} \tag{2}$$

Then the co-citation feature matrix is the $N \times L$ matrix product $MC$. Note that this feature uses only
nearest-neighbor citation graph relationships. We could include next-nearest neighbor relationships
and so on by calculating $MC + aM^2C + bM^3C + \dots$ for some constants $a$ and $b$. In this paper we
only used first-order connections via $MC$ as the co-citation feature vectors.
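The construction above, including the leak-prevention rule, can be sketched without forming the full matrix. This is an illustration rather than the code in our repository: the function name `cocitation_features` and its argument layout are hypothetical, and links between two test articles are kept here only because the rule above does not address them.

```python
def cocitation_features(citations, category_of, categories, train_ids):
    """Return {article: row of M C}, i.e. counts of linked articles per category.

    citations   -- iterable of (citing_id, cited_id) pairs
    category_of -- dict mapping article id to its primary category
    categories  -- ordered list of the L category names
    train_ids   -- set of article ids in the training partition
    """
    column = {name: idx for idx, name in enumerate(categories)}
    neighbors = {article: set() for article in category_of}
    for a, b in citations:
        if a == b or a not in category_of or b not in category_of:
            continue
        # Before masking, M is symmetric: i and j are linked if either cites the other.
        for i, j in ((a, b), (b, a)):
            # Leak prevention: a training row never sees a test-set neighbor.
            if i in train_ids and j not in train_ids:
                continue
            neighbors[i].add(j)
    features = {}
    for article, linked in neighbors.items():
        row = [0] * len(categories)
        for other in linked:
            row[column[category_of[other]]] += 1
        features[article] = row
    return features
```

Each row is simply a histogram of the primary categories of an article's citation neighbors, which is what the product of the masked co-citation matrix with the one-hot category matrix yields.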

A.2 TRAINING

Using vector embeddings from titles, abstracts, and full text, and co-citation features as described
above, we fed several combinations of these vectors concatenated in the obvious way into the
scikit-learn SGD classifier sklearn.linear_model.SGDClassifier. We used the
keyword arguments loss='log', tol=1e-6, max_iter=50, and alpha=1e-7 to define the
model, which uses up to 50 epochs and a very small quadratic regularization alpha on the weights and
biases.
With the features and model defined, we performed a train/test split by shuffling the data in place
randomly, and selecting the first $N_{train}$ = 1,200,000 for training. The remaining $N_{test}$ = 306,500
articles were used to evaluate the accuracy of the trained classifier, and the model perplexity, as
reported in Table 3 in the main text.
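The split described above amounts to a seeded shuffle followed by a slice, with $N_{test} = N - N_{train} = 1{,}506{,}500 - 1{,}200{,}000 = 306{,}500$. The sketch below is illustrative: the function name `train_test_split_ids` and the fixed seed are assumptions, and it returns new lists instead of shuffling in place.

```python
import random

def train_test_split_ids(article_ids, n_train, seed=0):
    """Shuffle the ids reproducibly and return (train_ids, test_ids)."""
    shuffled = list(article_ids)            # copy so the caller's order is untouched
    random.Random(seed).shuffle(shuffled)   # private generator, independent of global state
    return shuffled[:n_train], shuffled[n_train:]

train_ids, test_ids = train_test_split_ids([str(i) for i in range(10)], n_train=7)
print(len(train_ids), len(test_ids))
```

When reproducing the classifier itself, note that newer scikit-learn releases name the logistic loss 'log_loss' rather than 'log', so the keyword arguments listed above may need that one adjustment.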

^13 [Link]
^14 [Link]analysis/[Link]
^15 [Link]




Common questions

What are the primary advantages of using the arXiv dataset for developing machine learning models compared to other datasets?

What challenges are associated with converting arXiv PDFs to plain text, and how might these impact data quality?

In what ways can the rich multi-modal features of the arXiv dataset support the development of more sophisticated models for category inference?

How does the co-citation network extracted from arXiv compare with other citation networks in terms of size and connectivity?

What role do relational inductive biases play in enhancing the performance of models applied to the arXiv dataset?

How is the category matrix in the arXiv dataset constructed, and how does it support multi-label classification tasks?

Discuss the potential benefits and limitations of using arXiv's extracted full-text and metadata for link prediction tasks in scientific literature databases.

Why might the arXiv raw-text corpus be considered a competitive alternative to other large full-text datasets?

How does the arXiv dataset enable the exploration of mathematical formulae in bulk, and what potential research opportunities does this present?

What implications do the findings on the arXiv co-citation network have for understanding the dissemination of knowledge within academic circles?
