Mapping the Ownership of Public Firms

Attila Balogh*
* Department of Finance, University of Melbourne, Melbourne, 3010, Australia

ABSTRACT

This paper describes a dataset that captures quarterly investment holdings of institutional investment managers and maps the
ownership of US public firms. Schedule 13F reports are submitted to the Securities and Exchange Commission quarterly by all
institutional investment managers with at least $100 million in assets under management. Most academic research examining
the common ownership of corporations and the portfolio holdings of large investment managers is based on proprietary
commercial databases. This hinders the replication of prior work due to unequal access to these subscriptions and because the
data manipulation steps in commercial databases are often opaque. To overcome these limitations, the presented dataset is
created from the original regulatory filings; it is updated regularly and includes all information reported by investment managers
without alteration.

Background & Summary


An institutional investment manager that operates by using the U.S. mail or any other interstate commerce means in its business
activities, and has control over investment decisions for $100 million or more in securities as defined under Section 13(f), is
required to file a quarterly report of its holdings on Form 13F with the Securities and Exchange Commission (SEC) within
45 days of the end of a calendar quarter. Broadly speaking, an institutional investment manager is defined as either an entity
that engages in the investment, purchase, or sale of securities for its own account; or a person or entity that has the authority
to make investment decisions on behalf of another person or entity. This category encompasses entities such as investment
advisers, banks, insurance companies, broker-dealers, pension funds, and corporations.
The original holding reports are submitted in Extensible Markup Language (XML) format to the SEC’s Electronic Data
Gathering, Analysis, and Retrieval (EDGAR) system, which facilitates the creation of relational databases from filings.
Reporting in XML format commenced in May 2013, while earlier reports were filed in a plain-text format. The most
common source of institutional holdings information used in prior research is the Refinitiv (formerly Thomson Reuters) Institutional
Common Stock Holdings and Transactions product, which makes holding reports available commercially. The dataset described
in this paper is created by acquiring this information directly from EDGAR by downloading all individual filings and filing
metadata, and parsing the XML content. This process yields a richer set of data, facilitates replication by providing a link to the
original source for each observation, and offers additional benefits over commercial datasets, as described in this paper.
Project Layline is a research initiative that aims to leverage high performance and cloud computing to create publicly
accessible datasets for research in financial economics. It lowers barriers to entry by democratizing access to data and brings
increased transparency to the field by facilitating replication studies.1 The Layline Institutional Holding Reports dataset is
updated regularly to facilitate academic exploration of relevant and time-sensitive research questions.

Methods
The Python code developed for this project has two main components: acquisition and processing. The acquisition script
downloads the quarterly Master Index of EDGAR Dissemination Feed files for each year and quarter starting from 2013 and
ending with the most recent one.a It identifies all Form 13F-HR, 13F-HR/A, 13F-NT, and 13F-NT/A filings in the master index
and downloads both the metadata and the full filing to a local directory structure following a naming convention that is based on
Form type, the filer’s CIK, and each filing’s unique identifier, its Accession Number. The processing scripts parse the elements
in each XML file and save them to comma-separated values (CSV) files.
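The index-filtering step can be sketched as follows. This is a minimal illustration, not the project's acquisition script: it assumes the pipe-delimited layout of the master index (CIK, company name, form type, date filed, filename), and the `IndexEntry` class, the `local_path` directory layout, and the sample rows are hypothetical.

```python
from dataclasses import dataclass

# Form types named in the text above.
FORM_TYPES = {"13F-HR", "13F-HR/A", "13F-NT", "13F-NT/A"}


@dataclass(frozen=True)
class IndexEntry:
    cik: str
    company: str
    form_type: str
    date_filed: str
    filename: str

    @property
    def accession_number(self) -> str:
        # Assumes the filename ends in "<accession number>.txt".
        return self.filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]


def parse_master_index(text: str) -> list[IndexEntry]:
    """Return the Form 13F entries found in one master index file."""
    entries = []
    for line in text.splitlines():
        parts = line.split("|")
        # Data rows have exactly five fields; this skips the preamble,
        # the column header row, and the dashed separator line.
        if len(parts) != 5 or parts[0] == "CIK":
            continue
        entry = IndexEntry(*(p.strip() for p in parts))
        if entry.form_type in FORM_TYPES:
            entries.append(entry)
    return entries


def local_path(entry: IndexEntry) -> str:
    # Hypothetical layout: <form type>/<CIK>/<accession number>.txt
    form_dir = entry.form_type.replace("/", "-")
    return f"{form_dir}/{entry.cik}/{entry.accession_number}.txt"


# Made-up rows in the master index layout, for illustration only.
SAMPLE = """Description: Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1234567|EXAMPLE CAPITAL LLC|13F-HR|2013-08-14|edgar/data/1234567/0001234567-13-000001.txt
7654321|EXAMPLE CORP|10-Q|2013-08-06|edgar/data/7654321/0007654321-13-000050.txt
"""

for entry in parse_master_index(SAMPLE):
    print(entry.form_type, local_path(entry))
```

Replacing the slash in amendment form types keeps each form type in a single directory level; any other unambiguous encoding would serve equally well.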
As of the end of 2023, the raw data depository contains over half a million files. Because the EDGAR system limits the rate
of downloads to no more than ten items per second, the process of downloading all filings can take a considerable time.b The
acquisition script creates and updates a Structured Query Language (SQL) database using SQLite to track filings that have been
successfully downloaded. Running the script multiple times ensures that attempts are made to download filings that were missed
during prior executions either because the EDGAR service was temporarily unavailable, or due to a 403 Forbidden hypertext
transfer protocol (HTTP) standard response code in instances when the script inadvertently exceeded the download limit. While
the acquisition script incorporates rate limiting using the ratelimit Python package, the limit may be exceeded if multiple
acquisition scripts are running at the same time, or if other computers on the network are also in the process of downloading
filings.

a https://www.sec.gov/Archives/edgar/full-index/
b https://www.sec.gov/oit/announcement/new-rate-control-limits
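The retry-and-track logic described above might look like the following sketch. It is an assumption-laden illustration rather than the project's code: it substitutes a fixed pause from the standard library for the ratelimit package, the table schema is invented, and `fetch` is a hypothetical callable that downloads one filing and returns its HTTP status code.

```python
import sqlite3
import time
from typing import Callable, Iterable

MIN_INTERVAL = 0.1  # seconds between requests, i.e. at most ten per second


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the SQLite database that records download outcomes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        "accession TEXT PRIMARY KEY, status INTEGER)"
    )
    return conn


def download_pending(
    conn: sqlite3.Connection,
    accessions: Iterable[str],
    fetch: Callable[[str], int],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Attempt every filing not yet recorded with status 200."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT accession FROM downloads WHERE status = 200"
        )
    }
    for accession in accessions:
        if accession in done:
            continue  # fetched successfully in an earlier run
        status = fetch(accession)  # hypothetical: returns the HTTP status code
        conn.execute(
            "INSERT OR REPLACE INTO downloads (accession, status) VALUES (?, ?)",
            (accession, status),
        )
        conn.commit()
        sleep(MIN_INTERVAL)  # fixed pause keeps a single process under the limit
```

Calling `download_pending` again with the same list retries only the filings whose last recorded status was not 200, which mirrors the effect of running the acquisition script multiple times. A per-process pause cannot prevent the limit being exceeded when several processes or machines share one network address.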
The processing scripts create eight individual CSV files: the filing’s metadata and header information; submission summary;
cover page; summary page; other manager; other manager 2; infotable; and signatures. The variable names in the tables follow
the naming convention of the XML tags in the original filings.c An additional error log is also created to accompany each of the
eight CSV datasets. The logs list filings that were either unavailable to download and returned a 404 HTTP standard response code,
or that include strings that are not XML-compatible.
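A minimal sketch of the processing step for the infotable follows. It uses the standard library's `xml.etree.ElementTree` in place of lxml, and several details are assumptions: the sample XML and its namespace are made up, only a subset of the infotable fields is kept, and `index` is taken to be the position of the holding within the filing.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["accessionNumber", "index", "nameOfIssuer", "cusip", "value", "sshPrnamt"]


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_infotable(accession: str, xml_text: str):
    """Return (rows, error); error is None when the XML parses cleanly."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Entry destined for the error log of filings that are not valid XML.
        return [], {"accessionNumber": accession, "error": str(exc)}
    rows = []
    holdings = [el for el in root.iter() if local(el.tag) == "infoTable"]
    for i, holding in enumerate(holdings, start=1):
        # Collect the text of every leaf element under this holding.
        leaves = {
            local(el.tag): (el.text or "").strip()
            for el in holding.iter()
            if len(el) == 0
        }
        row = {"accessionNumber": accession, "index": i}
        row.update({name: leaves.get(name, "") for name in FIELDS[2:]})
        rows.append(row)
    return rows, None


def to_csv(rows) -> str:
    """Serialize parsed rows, with variable names taken from the XML tags."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up document for illustration; the namespace is a placeholder.
SAMPLE_XML = """<informationTable xmlns="urn:example:infotable">
  <infoTable>
    <nameOfIssuer>EXAMPLE CORP</nameOfIssuer>
    <cusip>000000000</cusip>
    <value>1500</value>
    <shrsOrPrnAmt><sshPrnamt>100</sshPrnamt><sshPrnamtType>SH</sshPrnamtType></shrsOrPrnAmt>
  </infoTable>
</informationTable>"""

rows, error = parse_infotable("0001234567-13-000001", SAMPLE_XML)
print(to_csv(rows))
```

Matching on local tag names keeps the sketch independent of namespace prefixes, which can vary between filings.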

Data cleaning steps


Whereas holding values were originally listed in thousands of dollars, the new reporting template from 2023 requires these values
to be reported in dollars. The supplied Stata code provides an approach to change prior filings to dollar values so that reports are
uniform across the sample. Care needs to be taken with filings that relate to reporting quarters prior to 2023 but were filed using
the new template. The schemaVersion field with a value of X0202 indicates filings under the new reporting template regardless
of reporting period, and the provided code changes the tableValueTotal and value fields based on schemaVersion.
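The scaling rule can be illustrated in Python as below. This is a sketch of the idea rather than a translation of the supplied Stata code: it assumes that every schemaVersion other than X0202 reports values in thousands of dollars, and the version string "X0101" in the example is only illustrative.

```python
NEW_TEMPLATE = "X0202"  # schemaVersion value named in the text


def value_in_dollars(value: int, schema_version: str) -> int:
    """Return a reported value in dollars.

    Assumption: any schemaVersion other than X0202 means the filing
    reported the value in thousands of dollars.
    """
    if schema_version == NEW_TEMPLATE:
        return value
    return value * 1000


# An old-template value of 1,500 (thousands) and a new-template value of
# 1,500,000 (dollars) describe the same position.
print(value_in_dollars(1500, "X0101"), value_in_dollars(1500000, "X0202"))
```

The same function applies to both the value field in the Infotable table and the tableValueTotal field in the Summary page table, with schemaVersion taken from the Submission table.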
There are also a number of common reporting errors. I verify the implied stock price, which is the ratio of the value and
sshPrnamt fields, by comparing it against the CRSP lowest bid price and highest ask price for the month of the report. If the
implied price is within the range, I assign the value of 1 to the isInrange variable and zero otherwise. For the last quarter of
2018, approximately 92 percent of holding items fall within the range.
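The range check amounts to a single comparison, sketched below. The function name and arguments are hypothetical, the bid and ask inputs stand in for the CRSP monthly extremes, and returning `None` when no shares are reported is an added assumption that the text does not specify.

```python
from typing import Optional


def is_in_range(
    value: float, ssh_prnamt: float, low_bid: float, high_ask: float
) -> Optional[int]:
    """Return 1 if value / sshPrnamt lies within [low_bid, high_ask], else 0.

    Returns None when no shares are reported, since the implied price
    is then undefined (an assumption of this sketch).
    """
    if ssh_prnamt <= 0:
        return None
    implied_price = value / ssh_prnamt
    return int(low_bid <= implied_price <= high_ask)


# 100 shares reported at $15,000 imply a price of $150 per share.
print(is_in_range(15_000, 100, low_bid=140.0, high_ask=160.0))
```

A value left in thousands of dollars implies a price roughly a thousand times too low, so this check also flags filings whose values were not converted to dollars.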

Data Records
The Layline Institutional Holding Reports dataset is made available at the Harvard Dataverse repository and it includes three
main types of regulatory filings pertaining to changes of firm ownership.d
Each dated version of the dataset comprises eight files in comma-separated values (CSV) format, one for each table of
the filing. The Header, Submission, Cover Page, Other Manager, Signature, Summary Page, Other Manager 2 and Infotable
tables make up the dataset and can be merged on the unique Accession Number identifier to create customized representations
of the data. Each table is also accompanied by an associated CSV error log file. It lists filings that were not successfully
downloaded from EDGAR, along with the hypertext transfer protocol (HTTP) standard response code; filings that include strings
that are not XML compatible; and filings that are not in XML format. The eight tables that make up the dataset are stored
separately because of the many-to-many relationship across them. Table 1 lists all variable names in each table and Table 2
provides an overview of the filings with the quarterly breakdown of the sample. The majority of the filings in the dataset are Form
13F-HR filings or their amendments. Since the XML reporting requirement became effective in May 2013, the first year is
expected to have fewer observations, not considering annual and seasonal trends.
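Merging two of the tables on the Accession Number can be sketched with the standard library alone. The CSV contents below are made up for illustration, and the helper names are hypothetical; only the accessionNumber key and the column names come from Table 1.

```python
import csv
import io


def read_rows(csv_text: str) -> list[dict]:
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_on_accession(left: list[dict], right: list[dict]) -> list[dict]:
    """Inner join: one output row per matching (left, right) pair."""
    by_accession: dict[str, list[dict]] = {}
    for row in right:
        by_accession.setdefault(row["accessionNumber"], []).append(row)
    merged = []
    for lrow in left:
        for rrow in by_accession.get(lrow["accessionNumber"], []):
            # On a column-name clash, the value from the left table wins.
            merged.append({**rrow, **lrow})
    return merged


# Made-up rows for illustration only.
COVER = "accessionNumber,filingManagerName\nA-1,EXAMPLE CAPITAL LLC\n"
INFOTABLE = (
    "accessionNumber,nameOfIssuer,value\n"
    "A-1,EXAMPLE CORP,1500\n"
    "A-1,SAMPLE INC,900\n"
    "B-2,OTHER CO,10\n"
)

for row in merge_on_accession(read_rows(COVER), read_rows(INFOTABLE)):
    print(row["filingManagerName"], row["nameOfIssuer"], row["value"])
```

Several tables share column names such as name and cik, so renaming those columns before merging avoids one table silently overwriting another.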

– Table 1 here –

– Table 2 here –

Technical Validation
This section introduces two methods to validate the presented datasets. The first validation involves running the acquisition
and processing scripts on multiple systems on distinct networks, and comparing the output datasets. Access to files
on the EDGAR system may be intermittent and the acquisition script may encounter 403 Forbidden or 404 Not Found HTTP
standard response codes in attempting to download certain filings. It is also possible that filings stored on a local or network file
system become corrupted. The occurrence of these errors is minimized by running the acquisition script multiple times, ensuring
that no 403 Forbidden response codes are returned and recorded in the error logs. The processing scripts are then executed on
each of the computer systems and the output files are saved for reference. In untabulated analysis, I compare the four
datasets and find them to be the same. The additional three validation datasets and the Stata code comparing them are made
available in the data repository.
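One simple way to compare output directories produced on different systems is to hash each CSV file, as in the sketch below. This is not the Stata comparison supplied in the repository: it assumes each system writes its CSV files to one directory, the function names are hypothetical, and a byte-level comparison is stricter than a row-level one because a difference in row order alone counts as a mismatch.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(reference: Path, other: Path) -> list[str]:
    """Names of CSV files missing from `other` or differing from `reference`."""
    mismatches = []
    for ref_file in sorted(reference.glob("*.csv")):
        candidate = other / ref_file.name
        if not candidate.exists() or file_digest(candidate) != file_digest(ref_file):
            mismatches.append(ref_file.name)
    return mismatches
```

An empty list from `differing_files` for each pair of systems indicates byte-identical outputs; a non-empty list identifies the tables to inspect row by row.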
In contrast to commercially available databases, the presented dataset offers institutional holding reports in their original
and unaltered form. It is intended to encourage the pursuit of new research questions that were not feasible using commercially
available products. The following section highlights some of the inconsistencies encountered in comparing the presented
dataset to the Refinitiv Institutional Common Stock Holdings and Transactions database.
The SEC also makes Form 13F datasets available after the end of each quarter in a tab-separated values file.e This paper
does not analyze this data source because a cursory review suggests that it is not a comprehensive
dataset.

c https://www.sec.gov/info/edgar/specifications/ownershipxmltechspec
d Layline Institutional Holding Reports: https://doi.org/10.7910/DVN/TZM1QT

Linking to original filings


The presented dataset includes a direct URL for each observation, giving researchers a direct method for cross-referencing
the observation with the regulatory filing as it was submitted to EDGAR. The dataset also includes the original
CIK identifier for both the reporting entity and the issuer, which are masked by Refinitiv and replaced by proprietary identifiers.
These features are important because there are reported holdings in the Refinitiv database that are challenging to trace back
to the original filing.
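As an illustration of how such a link can be formed, the sketch below builds the address of a filing's full-submission text file from the filer's CIK and the Accession Number, following the path layout used in the Filename column of the EDGAR master index. Whether the URL variable in the dataset uses exactly this form is an assumption, and the function name and example identifiers are hypothetical.

```python
ARCHIVES = "https://www.sec.gov/Archives/"


def filing_url(cik: str, accession_number: str) -> str:
    """URL of the full-submission text file for one filing.

    Follows the master index Filename layout,
    edgar/data/<CIK without leading zeros>/<accession number>.txt.
    """
    return f"{ARCHIVES}edgar/data/{int(cik)}/{accession_number}.txt"


# Made-up identifiers for illustration only.
print(filing_url("0001234567", "0001234567-13-000001"))
```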

Code availability
The code used for the data normalization and merging steps was created and run in Stata/MP 17.0 and it is made available in
the data repository. The acquisition and processing scripts are not shared publicly because downloading regulatory filings via
HTTP follows a standard procedure and parsing XML files in Python using the lxml library is also well-documented.

Acknowledgements
This research includes computations that were developed using the computational cluster Katana supported by Research
Technology Services at UNSW Sydney. I am grateful to David McFarlane for excellent research assistance.

Competing interests
The author declares no competing interests.

Figures & Tables

e E.g. https://www.sec.gov/dera/data/form-13f

Table 1. Variable Names: Form 13F - Quarterly reports filed by institutional managers

Header: URL, acceptanceDatetime, accessionNumber, type, publicDocumentCount, period, filingDate, dateOfFilingDateChange, effectivenessDate, name, cik, sic, IRSNumber, stateOfIncorporation, fiscalYearend, formType, act, fileNumber, filmNumber, businessStreet1, businessStreet2, businessCity, businessState, businessZip, businessPhone, mailingStreet1, mailingStreet2, mailingCity, mailingState, mailingZip, formerName, dateChanged

Submission: accessionNumber, schemaVersion, liveTestFlag, submissionType, confirmingCopyFlag, returnCopyFlag, overrideInternetFlag, cik, fileNumber, periodOfReport

Cover page: accessionNumber, reportCalendarOrQuarter, isAmendment, amendmentNo, amendmentType, confDeniedExpired, dateDeniedExpired, dateReported, reasonForNonConfidentiality, filingManagerName, filingManagerStreet1, filingManagerStreet2, filingManagerCity, filingManagerStateOrCountry, filingManagerZipCode, reportType, form13FFileNumber, provideInfoForInstruction5, additionalInformation

Infotable: accessionNumber, index, nameOfIssuer, titleOfClass, cusip, figi, value, sshPrnamt, sshPrnamtType, putCall, investmentDiscretion, otherManager, votingAuthoritySole, votingAuthorityShared, votingAuthorityNone

Signature: accessionNumber, name, title, phone, signature, city, stateOrCountry, signatureDate

Summary page: accessionNumber, otherIncludedManagersCount, tableEntryTotal, tableValueTotal, isConfidentialOmitted

Other manager: accessionNumber, sequenceNumber, form13FFileNumber, name, cik

Other manager 2: accessionNumber, sequenceNumber, form13FFileNumber, name, cik

This table provides the list of variable names in the Header, Submission, Cover page, Infotable, Signature, Summary page, Other manager and Other manager 2
tables extracted from Forms 13F-HR and 13F-NT regulatory filings from the SEC’s EDGAR system.

Table 2. Quarterly breakdown of filings

Period 13F-HR 13F-HR/A 13F-NT 13F-NT/A
2013 Q2 3,444 291 1,355 12
2013 Q3 3,455 257 1,349 6
2013 Q4 3,796 289 1,395 7
2014 Q1 3,801 258 1,400 3
2014 Q2 3,825 296 1,409 7
2014 Q3 3,813 262 1,431 5
2014 Q4 4,150 337 1,509 17
2015 Q1 4,144 297 1,510 11
2015 Q2 4,150 305 1,503 14
2015 Q3 4,128 245 1,509 5
2015 Q4 4,324 260 1,544 24
2016 Q1 4,275 270 1,515 6
2016 Q2 4,271 268 1,503 10
2016 Q3 4,241 243 1,487 11
2016 Q4 4,453 291 1,549 21
2017 Q1 4,426 262 1,538 21
2017 Q2 4,422 266 1,527 18
2017 Q3 4,406 235 1,513 19
2017 Q4 4,814 304 1,574 10
2018 Q1 4,789 275 1,570 33
2018 Q2 4,775 275 1,570 15
2018 Q3 4,773 248 1,576 28
2018 Q4 5,146 284 1,578 22
2019 Q1 5,108 291 1,577 15
2019 Q2 5,072 250 1,564 33
2019 Q3 5,050 215 1,547 4
2019 Q4 5,503 245 1,545 21
2020 Q1 5,449 243 1,531 17
2020 Q2 5,449 240 1,531 0
2020 Q3 5,429 227 1,544 7
2020 Q4 5,998 292 1,593 7
2021 Q1 6,014 274 1,612 17
2021 Q2 6,019 306 1,607 14
2021 Q3 6,009 273 1,618 9
2021 Q4 6,854 276 1,735 25
2022 Q1 6,821 293 1,718 8
2022 Q2 6,790 215 1,704 1
2022 Q3 6,755 207 1,694 7
2022 Q4 6,941 262 1,667 11
2023 Q1 1,964 14 116 0
Total 195,046 10,441 60,317 521

This table presents the quarterly breakdown of reports filed by institutional managers
based on the periodOfReport variable in the Submission table. The dataset includes Schedule
13F-HR holding reports, Schedule 13F-NT notices and their amendments. This table excludes
filings that are reported for periods prior to the second quarter of 2013, the starting period
for reporting in structured format.

References
1. Harvey, C. R. Editorial: Replication in financial economics. Critical Finance Rev. 8, 1–9, http://doi.org/10.1561/104.00000080 (2019).
