The Book of OHDSI
Preface
  Goals of this Book
  Structure of the Book
  Contributors
  Software Versions
  License
  How the Book Is Developed

2 Where to Begin
  2.1 Join the Journey
  2.2 Where You Fit In
  2.3 Summary

3 Open Science
  3.1 Open Science
  3.2 Open Science in Action: the Study-a-Thon
  3.3 Open Standards
  3.4 Open Source
  3.5 Open Data
  3.6 Open Discourse
  3.7 OHDSI and the FAIR Guiding Principles

5 Standardized Vocabularies
  5.1 Why Vocabularies, and Why Standardizing
  5.2 Concepts
  5.3 Relationships
  5.4 Hierarchy
  5.5 Internal Reference Tables
  5.6 Special Situations
  5.7 Summary
  5.8 Exercises

11 Characterization
  11.1 Database Level Characterization
  11.2 Cohort Characterization
  11.3 Treatment Pathways
  11.4 Incidence
  11.5 Characterizing Hypertensive Persons
  11.6 Database Characterization in ATLAS
  11.7 Cohort Characterization in ATLAS
  11.8 Cohort Characterization in R
  11.9 Cohort Pathways in ATLAS
  11.10 Incidence Analysis in ATLAS
  11.11 Summary
  11.12 Exercises

Appendix
A Glossary
Bibliography
Index
Preface
This is a book about the Observational Health Data Sciences and Informatics (OHDSI) collaborative. The OHDSI community wrote the book to serve as a central knowledge repository for all things OHDSI. The book is a living document, community-maintained through open-source development tools, and it evolves continuously. The online version, available for free at http://book.ohdsi.org, always represents the latest version. A physical copy of the book is available from Amazon at cost price.
Each section has multiple chapters, and, as appropriate, each chapter follows the sequence:
Introduction, Theory, Practice, Summary, and Exercises.
Contributors
Each chapter lists one or more chapter leads. These are the people who lead the writing of the chapter. However, there are many others who have contributed to the book, whom we would like to acknowledge here:
Software Versions
A large part of this book is about the open-source software of OHDSI, and this software will evolve over time. Although the developers do their best to offer a consistent and stable experience to the users, it is inevitable that over time improvements to the software will render some of the instructions in this book outdated. The community will update the online version of the book to reflect those changes, and new editions of the hard copy will be released over time. For reference, these are the version numbers of the software used in this version of the book:
Package Version
CaseControl 1.6.0
CaseCrossover 1.1.0
CohortMethod 3.1.0
Cyclops 2.0.2
DatabaseConnector 2.4.1
EmpiricalCalibration 2.0.0
EvidenceSynthesis 0.0.4
FeatureExtraction 2.2.4
MethodEvaluation 1.1.0
ParallelLogger 1.1.0
PatientLevelPrediction 3.0.6
SelfControlledCaseSeries 1.4.0
SelfControlledCohort 1.5.0
SqlRender 1.6.2
License
This book is licensed under the Creative Commons Zero v1.0 Universal license.
Chapter 1

The OHDSI Community
There are different types of observational databases which capture disparate patient-level data in source systems. These databases are as diverse as the healthcare system itself, reflecting different populations, care settings, and data capture processes. There are also different types of evidence that could be useful to inform decision-making, which can be classified by the analytic use cases of clinical characterization, population-level effect estimation, and patient-level prediction. Independent from the origin (source data) and desired destination (evidence), the challenge is further complicated by the breadth of clinical, scientific, and technical competencies that are required to undertake the journey. It requires a thorough understanding of health informatics, including the full provenance of the source data from the point-of-care interaction between a patient and provider through the administrative and clinical systems and into the final repository, with an appreciation of the biases that can arise as part of the health policies and behavioral incentives associated with the data capture and curation processes. It requires mastery of epidemiologic principles and statistical methods to translate a clinical question into an observational study design properly suited to produce a relevant answer. It requires the technical ability to implement and execute computationally efficient data science algorithms on datasets containing millions of patients with billions of clinical observations over years of longitudinal follow-up. It requires the clinical knowledge to synthesize what has been learned across an observational data network with evidence from other information sources, and to determine how this new knowledge should impact health policy and clinical practice.
Accordingly, it is quite rare that any one individual would possess the requisite skills and resources to successfully trek from data to evidence alone. Instead, the journey often requires collaboration across multiple individuals and organizations to ensure that the best available data are analyzed using the most appropriate methods to produce evidence that all stakeholders can trust and use in their decision-making processes.
the end of the OMOP journey needed to become the start of a new journey together. Despite OMOP's methodological research providing tangible insights into scientific best practices that could demonstrably improve the quality of evidence generated from observational data, adoption of those best practices was slow. Several barriers were identified, including: 1) fundamental concerns about observational data quality that were felt to be a higher priority to address before analytics innovations; 2) insufficient conceptual understanding of the methodological problems and solutions; 3) inability to independently implement solutions within the local environment; 4) uncertainty over whether these approaches were applicable to the clinical problems of interest. The one common thread to every barrier was the sense that no one person had everything they needed to enact change alone, but that with some collaborative support all issues could be overcome. Several areas of collaboration were needed:
opt to collaborate in the study. With the OHDSI distributed network, each data partner retains full autonomy over the use of their patient-level data, and continues to observe the data governance policies within their respective institutions.
The OHDSI developer community has created a robust library of open-source analytics tools atop the OMOP CDM to support three use cases: 1) clinical characterization for disease natural history, treatment utilization, and quality improvement; 2) population-level effect estimation to apply causal inference methods for medical product safety surveillance and comparative effectiveness; and 3) patient-level prediction to apply machine learning algorithms for precision medicine and disease interception. OHDSI developers have also built applications to support adoption of the OMOP CDM, data quality assessment, and facilitation of OHDSI network studies. These tools include back-end statistical packages built in R and Python, and front-end web applications developed in HTML and JavaScript. All OHDSI tools are open source and publicly available via GitHub (https://github.com/OHDSI).
OHDSI's open science community approach, coupled with its open-source tools, has enabled tremendous advances in observational research. One of the first OHDSI network analyses examined treatment pathways across three chronic diseases: diabetes, depression, and hypertension. Published in the Proceedings of the National Academy of Sciences, it was one of the largest observational studies ever conducted, with results from 11 data sources covering more than 250 million patients, and it revealed tremendous geographic differences and patient heterogeneity in treatment choices that had never previously been observable. (Hripcsak et al., 2016) OHDSI has developed new statistical methods for confounding adjustment (Tian et al., 2018) and for evaluating the validity of observational evidence for causal inference, (Schuemie et al., 2018a) and it has applied these approaches in multiple contexts, from an individual safety surveillance question in epilepsy (Duke et al., 2017) to the comparative effectiveness of second-line diabetes medications (Vashisht et al., 2018) to a large-scale population-level effect estimation study of the comparative safety of depression treatments. (Schuemie et al., 2018b) The OHDSI community has also established a framework for how to responsibly apply machine learning algorithms to observational healthcare data, (Reps et al., 2018) which has been applied across various therapeutic areas. (Johnston et al., 2019; Cepeda et al., 2018; Reps et al., 2019)
1.6 Summary
Chapter 2

Where to Begin
• General: for general discussion about the OHDSI community and how to get involved
• Implementers: for discussion about how to implement the Common Data Model and OHDSI analytics framework in your local environment
• Developers: for discussion around open-source development of OHDSI applications and other tools that leverage the OMOP CDM
• Researchers: for discussion around CDM-based research, including evidence generation, collaborative research, statistical methods and other topics of interest to the OHDSI Research Network
• CDM Builders: for discussion of ongoing CDM development, including requirements, vocabulary, and technical aspects
• Vocabulary Users: for discussion around vocabulary content
• Regional Chapters (e.g. Korea, China, Europe): for regional discussions in their native languages related to local OMOP implementations and OHDSI community activities
To begin posting your own topics, you will need to sign up for an account. Once you have a forums account, you are encouraged to introduce yourself in the General topic under the thread called "Welcome to OHDSI! Please introduce yourself". You are invited to reply and 1) introduce yourself and tell us a bit about what you do, and 2) let us know how you would like to help out in the community (e.g. software development, running studies, writing research papers, etc.). Now you are on your OHDSI journey! From here, you are encouraged to join in the discussion. The OHDSI community encourages using the forums (https://forums.ohdsi.org) as your way to ask questions, discuss new ideas and collaborate.

You can select topics to "watch": whenever a new post is added to a topic you are watching, you will receive an email and be able to reply to the post directly through your email. Watch the General topic to receive details about upcoming meeting agendas and collaboration opportunities, and to have the weekly OHDSI digest delivered directly to your inbox!
As a newcomer to the OHDSI community, you are encouraged to add this call series to your calendar to get acquainted with what is happening across the OHDSI network. If you would like to join an OHDSI call, please consult the OHDSI wiki. Community call topics vary from week to week. You can also consult the OHDSI Weekly Digest on the OHDSI forums for more information on weekly presentation topics. Newcomers are invited to introduce themselves on their first call and tell the community about themselves, their background and what brought them to OHDSI.
• Atlas & WebAPI. Objective: Atlas and WebAPI are part of the OHDSI open-source software architecture that aims to provide standardized analytic capabilities built on the foundation of the OMOP Common Data Model. Target audience: Java & JavaScript software developers aiming to improve and contribute to the open-source Atlas/WebAPI platform.

• CDM & Vocabulary. Objective: To continue to develop the OMOP Common Data Model for the purpose of systematic, standardized and large-scale analytics applied to clinical patient data, and to improve the quality of the Standardized Vocabularies by increasing their coverage of international coding systems and clinical aspects of patient care in order to support the standardized analytics developed by other working groups. Target audience: Anyone with an interest in improving the OMOP Common Data Model and Standardized Vocabularies to meet all needs and use cases.

• Genomics. Objective: Expand the OMOP CDM to incorporate genomic data from patients. The group will define a CDM-compatible schema that can store information for genetic variants from various sequencing processes. Target audience: Open to all.

• Population-Level Estimation. Objective: Develop scientific methods for observational research leading to population-level estimates of effects that are accurate, reliable, and reproducible, and facilitate the use of these methods by the community. Target audience: Open to all.

• Natural Language Processing. Objective: To promote the use of textual information from Electronic Health Records (EHRs) for observational studies under the OHDSI umbrella. To facilitate this objective, the group will develop methods and software that can be implemented to utilize clinical text for studies by the OHDSI community. Target audience: Open to all.

• Patient-Level Prediction. Objective: Establish a standardized process for developing accurate and well-calibrated patient-centered predictive models that can be utilized for multiple outcomes of interest and can be applied to observational healthcare data from any patient subpopulation of interest. Target audience: Open to all.

• Gold Standard Phenotype Library. Objective: To enable members of the OHDSI community to find, evaluate, and utilize community-validated cohort definitions for research and other activities. Target audience: Open to all with an interest in the curation and validation of phenotypes.

• FHIR Workgroup. Objective: To establish the roadmap for the OHDSI-FHIR integration and to make recommendations to the broader community for leveraging FHIR implementations and data in the EHR community for OHDSI-based observational studies, and for disseminating OHDSI data and research results through FHIR-based tools and APIs. Target audience: Open to all with an interest in interoperability.

• GIS. Objective: Expand the OMOP CDM and leverage OHDSI tools so that patients' environmental exposure histories can be related to their clinical phenotypes. Target audience: Open to all with an interest in health-related geographic attributes.

• Clinical Trials. Objective: Understand clinical trial use cases where the OHDSI platform & ecosystem can aid trials in any aspect, and assist in driving updates in OHDSI tools to support them. Target audience: Open to all with an interest in clinical trials.

• THEMIS. Objective: To develop standard conventions, above and beyond the OMOP CDM conventions, to ensure ETL protocols designed at each OMOP site are of the highest quality, reproducible and efficient.

• Metadata & Annotations. Objective: To define a standard process for storing human- and machine-authored metadata and annotations in the Common Data Model to ensure researchers can consume and create useful data artifacts about observational data sets. Target audience: Open to all.

• Patient Generated Health Data (PGHD). Objective: To develop ETL conventions, an integration process with clinical data, and an analytic process for PGHD, which is generated through smartphone apps and wearable devices. Target audience: Open to all.

• Women of OHDSI. Objective: To provide a forum for women within the OHDSI community to come together and discuss challenges they face as women working in science, technology, engineering and mathematics (STEM). We aim to facilitate discussions where women can share their perspectives, raise concerns, propose ideas on how the OHDSI community can support women in STEM, and ultimately inspire women to become leaders within the community and their respective fields. Target audience: Open to all who identify with this mission.

• Steering Committee. Objective: To uphold OHDSI's mission, vision and values by ensuring all OHDSI activities and events are aligned with the needs of our growing community. In addition, the group serves as an advisory group for the OHDSI coordinating center based at Columbia by providing guidance for OHDSI's future direction. Target audience: Leaders within the community.
with the community. Your questions are an important part of the OHDSI community.
Speak up and help us learn more about what evidence you are searching for!
I work in a healthcare leadership role. I may be a data owner and/or represent one. I am evaluating the utility of the OMOP CDM and OHDSI analytical tools for my organization. As an administrator or leader of an organization, you may have heard about OHDSI and are curious to know how the OMOP CDM could work for your use cases. You may start by looking through the OHDSI Past Events materials to see the body of research. You may join a Community Call and simply listen in. You may also find that Chapter 7 (Data Analytics Use Cases) helps you understand the kind of research the OMOP CDM and OHDSI analytics tools can enable. The OHDSI community is here for you in your journey. Don't be afraid to speak up and ask for examples if you have specific areas you're interested in. More than 200 organizations around the world are collaborating in OHDSI, so there are plenty of success stories to showcase the value of this community.
I am a student looking to learn more about OHDSI. You're in the right place! Consider joining an OHDSI Community Call and introducing yourself. You are encouraged to delve into the OHDSI tutorials and attend OHDSI Symposiums and face-to-face meetings to learn more about the methods and tools the OHDSI community offers. If you have a specific research interest, let us know by posting in the Researchers topic on the OHDSI Forums. Many organizations offer OHDSI-sponsored research opportunities (e.g. postdocs and research fellowships). The OHDSI Forums will give you the latest information on these opportunities and more.
2.3 Summary
Chapter 3

Open Science
The OHDSI community addresses these challenges in its own way, and it puts significant emphasis on the importance of generating medical evidence at scale. As stated in Schuemie et al. (2018b), while the current paradigm "centers on generating one estimate at a time using a unique study design with unknown reliability and publishing (or not) one estimate at a time," the OHDSI community "advocates for high-throughput observational studies using consistent and standardized methods, allowing evaluation, calibration and unbiased dissemination to generate a more reliable and complete evidence base." This is achieved by a combination of a network of medical data sources that map their data to the OMOP common data model, open-source analytics code that can be used and verified by all, and large-scale baseline data such as the condition occurrences published at howoften.org. In the following paragraphs, concrete examples are provided and the open science approach of OHDSI is detailed further using the four principles of Open Standards, Open Source, Open Data and Open Discourse as a guide. The chapter concludes with a brief reference to the FAIR principles and an outlook for OHDSI from an open-science perspective.
to medical practice. The OHDSI community has several annual OHDSI Symposia, held in the United States, Europe, and Asia, as well as dedicated communities of practice in, amongst others, China and Korea. These symposia discuss advancements in statistical methods, data and software tooling, the Standardized Vocabularies, and all other aspects of the OHDSI open-source community. The OHDSI forums (https://forums.ohdsi.org) and wiki (https://www.ohdsi.org/web/wiki) support thousands of researchers worldwide in practicing observational research. The community calls (https://www.ohdsi.org/web/wiki/doku.php?id=projects:overview) and the code, issues and pull requests on GitHub (https://github.com/ohdsi) constantly evolve the open community assets such as the code and the CDM, and in the OHDSI network studies, global observational research is practiced in an open and transparent way using hundreds of millions of patient records worldwide. Openness and open discourse are encouraged throughout the community, and this very book is written via an open process facilitated by the OHDSI wiki, community calls and a GitHub repository (https://github.com/OHDSI/TheBookOfOhdsi). It needs to be stressed, however, that without all the OHDSI collaborators, these processes and tools would be empty shells. Indeed, one could argue that the true value of the OHDSI community lies with its members, who share a vision of improving health through collaborative and open science, as discussed in Chapter 1.
3.7.1 Introduction

This last section of the chapter takes a look at the current state of the OHDSI community and tooling, using the FAIR Data Guiding Principles published in Wilkinson et al. (2016).
3.7.2 Findability

Any healthcare database that is mapped to OMOP and used for analytics should, from a scientific perspective, persist for future reference and reproducibility. The use of persistent identifiers for OMOP databases is not yet widespread, partly because these databases are often contained behind firewalls and on internal networks, and not necessarily connected to the internet. However, it is entirely possible to publish summaries of the databases as a descriptor record that can be referenced, for example for citation purposes. This method is followed in, for example, the EMIF catalogue (https://emifcatalogue.eu), which provides a comprehensive record of the database in terms of data-gathering purpose, sources, vocabularies and terms, access control mechanisms, license, consents, etc. (Oliveira et al., 2019) This approach is further developed in the IMI EHDEN project.
3.7.3 Accessibility

Accessibility of OMOP-mapped data through an open protocol is typically achieved through the SQL interface, which, combined with the OMOP CDM, provides a standardized and well-documented method for accessing OMOP data. However, as discussed above, OMOP sources are often not directly available over the internet for security reasons. Creating a secure worldwide healthcare data network that is accessible to researchers is an active research topic and an operational goal of projects like IMI EHDEN. However, results of analyses across multiple OMOP databases, as shown through OHDSI initiatives such as LEGEND and http://howoften.org, can be openly published.
3.7.4 Interoperability

Interoperability is arguably the strong suit of the OMOP data model and OHDSI tooling. In order to build a strong network of medical data sources worldwide which can be leveraged for evidence generation, achieving interoperability between healthcare data sources is key, and this is achieved through the OMOP model and the Standardized Vocabularies. Moreover, by sharing cohort definitions and statistical approaches, the OHDSI community goes beyond code mapping and also provides a platform to build an interoperable understanding of the analysis methods for healthcare data. Since healthcare systems such as hospitals are often the source of record for OMOP data, the interoperability of the OHDSI approach could be further enhanced by alignment with operational healthcare interoperability standards such as HL7 FHIR, HL7 CIMI and openEHR. The same is true for alignment with clinical interoperability standards such as CDISC and biomedical ontologies. Especially in areas such as oncology this is an important topic, and the Oncology Working Group and Clinical Trials Working Group in the OHDSI community provide good examples of forums where these issues are actively discussed. In terms of references to other data and specifically ontology terms, ATLAS and OHDSI Athena are important tools, as they allow the exploration of the OMOP Standardized Vocabularies in the context of other available medical coding systems.
3.7.5 Reusability

The FAIR principles around reusability focus on important issues such as the data license, provenance (clarifying how the data came into existence) and the link to relevant community standards. Data licensing is a complicated topic, especially across jurisdictions, and it would fall outside the scope of this book to cover it extensively. However, it is important to state that if you intend for your data (e.g. analysis results) to be freely used by others, it is good practice to explicitly provide these permissions via a data license. This is not yet a common practice for most data that can be found on the internet, and the OHDSI community is unfortunately not an exception here. Concerning the data provenance of OMOP databases, potential improvements exist for making metadata available in an automated way, including, for example, the CDM version, the Standardized Vocabularies release, custom code lists, etc. The OHDSI ETL tools do not currently produce this information automatically, but working groups such as the Data Quality Working Group and the Metadata Working Group are actively working on this. Another important aspect is the provenance of the underlying databases themselves; it is important to know if a hospital or GP information system was replaced or changed, and when known data omissions or other data issues occurred historically. Exploring ways to attach this metadata systematically in the OMOP CDM is the domain of the Metadata Working Group.
Chapter 4

The Common Data Model
Figure 4.1: Overview of all tables in the CDM version 6.0. Note that not all relationships
between tables are shown.
Note: While the data model itself is platform-independent, many of the tools that have been built to work with it require certain specifications. For more about this please see Chapter 8.
Records in the CONCEPT table contain detailed information about each concept (name, domain, class, etc.). Concepts, concept relationships, concept ancestors and other information relating to concepts are contained in the tables of the Standardized Vocabularies (see Chapter 5).
The notation of the standardized fields in the Event tables, and what each field contains:

• [Event]_ID: Unique identifier for each record, which serves as a foreign key establishing relationships across Event tables. For example, PERSON_ID uniquely identifies each individual, and VISIT_OCCURRENCE_ID uniquely identifies a Visit.

• [Event]_CONCEPT_ID: Foreign key to a Standard Concept record in the CONCEPT reference table. This is the main representation of the Event, serving as the primary basis for all standardized analytics. For example, CONDITION_CONCEPT_ID = 31967 contains the reference value for the SNOMED concept of "Nausea".

• [Event]_SOURCE_CONCEPT_ID: Foreign key to a record in the CONCEPT reference table. This Concept is the equivalent of the Source Value (below), and it may happen to be a Standard Concept, at which point it would be identical to the [Event]_CONCEPT_ID, or another non-standard concept. For example, CONDITION_SOURCE_CONCEPT_ID = 45431665 denotes the concept of "Nausea" in the Read terminology, and the analogous CONDITION_CONCEPT_ID is the Standard SNOMED-CT Concept 31967. The use of Source Concepts for standard analytics applications is discouraged, since only Standard Concepts represent the semantic content of an Event in an unambiguous way, and therefore Source Concepts are not interoperable.

• [Event]_TYPE_CONCEPT_ID: Foreign key to a record in the CONCEPT reference table, representing the origin of the source information, standardized within the Standardized Vocabularies. Note that despite the field name this is not the type of an Event or the type of a Concept, but declares the capture mechanism that created this record. For example, DRUG_TYPE_CONCEPT_ID discriminates whether a Drug record was derived from a dispensing Event in the pharmacy ("Pharmacy dispensing") or from an e-prescribing application ("Prescription written").

• [Event]_SOURCE_VALUE: Verbatim code or free-text string reflecting how this Event was represented in the source data. Its use is discouraged for standard analytics applications, as these Source Values are not harmonized across data sources. For example, CONDITION_SOURCE_VALUE might contain a record of "78702", corresponding to ICD-9 code 787.02 written in a notation omitting the dot.
Source Values are only provided for convenience and quality assurance (QA) purposes. They may contain information that is only meaningful in the context of a specific data source. The use of Source Values and Source Concepts is optional, though strongly recommended if the source data make use of coding systems. Standard Concepts, however, are mandatory. This mandatory use of Standard Concepts is what allows all CDM instances to speak the same language. For example, the condition "Pulmonary Tuberculosis" (TB, Figure 4.2) shows that the ICD9CM code for TB is 011.

Without context, the code 011 could be interpreted as "Hospital Inpatient (Including Medicare Part A)" from the UB04 vocabulary, or as "Nervous System Neoplasms without Complications, Comorbidities" from the DRG vocabulary. This is where Concept IDs, both Source and Standard, are valuable. The CONCEPT_ID value that represents the 011 ICD9CM code is 44828631. This differentiates the ICD9CM code from the UB04 and DRG codes. The ICD9CM TB Source Concept maps to Standard Concept 253954 from the SNOMED vocabulary.
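To see what this convention buys us in practice, here is a minimal sketch of a standardized query (the connection object and the "main" schema are assumptions, following the DatabaseConnector setup used in the exercises at the end of this chapter). Because it filters on the Standard Concept 253954 from the example above, the same query works unchanged against any CDM instance, regardless of which source vocabulary was used to code tuberculosis:

library(DatabaseConnector)

# Count condition records for Pulmonary Tuberculosis using the Standard
# Concept (253954), not the source-specific code:
sql <- "SELECT COUNT(*) AS tb_records
FROM @cdm.condition_occurrence
WHERE condition_concept_id = 253954;"
renderTranslateQuerySql(connection, sql, cdm = "main")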
To illustrate how these tables are used in practice, the data of one person will be used as
a common thread throughout the rest of the chapter.
Every step of this painful journey I had to convince everyone how much pain
I was in.
Lauren had been experiencing endometriosis symptoms for many years; however, it took
a ruptured cyst in her ovary before she was diagnosed. You can read more about Lauren
at https://endometriosisuk.org/laurensstory.
stitution or provider is visited. As a next-best solution, often the first record in the system is considered the start date of the Observation Period and the latest record is considered the end date.

Based on the encounter records, her OBSERVATION_PERIOD table might look something like this:
4.3.4 VISIT_OCCURRENCE

The VISIT_OCCURRENCE table houses information about a patient's encounters with the health care system. Within the OHDSI vernacular these are referred to as Visits and are considered to be discrete events. There are 12 top-level categories of Visits with an extensive hierarchy, depicting the many different circumstances in which healthcare might be delivered. The most common Visits recorded are inpatient, outpatient, emergency department and non-medical institution Visits.
• A patient may interact with multiple health care Providers during one Visit, as is often the case with inpatient stays. These interactions can be recorded in the VISIT_DETAIL table. While not covered in depth in this chapter, you can read more about the VISIT_DETAIL table in the CDM wiki (https://github.com/OHDSI/CommonDataModel/wiki).
4.3.5 CONDITION_OCCURRENCE
Records in the CONDITION_OCCURRENCE table are diagnoses, signs, or symptoms
of a condition either observed by a Provider or reported by the patient.
4.3.6 DRUG_EXPOSURE

The DRUG_EXPOSURE table captures records about the intent or actual introduction of a drug into the body of the patient. Drugs include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Drug exposures are inferred from clinical events associated with orders, prescriptions written, pharmacy dispensings, procedural administrations, and other patient-reported information.
4.3.7 PROCEDURE_OCCURRENCE

The PROCEDURE_OCCURRENCE table contains records of activities or processes ordered or carried out by a healthcare Provider on the patient with a diagnostic or therapeutic purpose. Procedures are present in various data sources in different forms with varying levels of standardization. For example:

• Medical claims include procedure codes that are submitted as part of a claim for health services rendered, including procedures performed.
• Electronic Health Records may capture procedures as orders.
4.5 Summary
4.6 Exercises

Prerequisites

For these first exercises you will need to review the CDM tables discussed earlier, and you will have to look up concepts in the Vocabulary, which can be done through ATHENA (http://athena.ohdsi.org/) or ATLAS (http://atlasdemo.ohdsi.org).
Exercise 4.1. John is an African American man born on August 4, 1974. Define an entry
in the PERSON table that encodes this information.
Exercise 4.2. John enrolled in his current insurance on January 1st, 2015. The data from his insurance database were extracted on July 1st, 2019. Define an entry in the OBSERVATION_PERIOD table that encodes this information.
Exercise 4.3. John was prescribed a 30day supply of Ibuprofen 200 MG Oral
tablets (NDC code: 76168009520) on May 1st, 2019. Define an entry in the
DRUG_EXPOSURE table that encodes this information.
Prerequisites
For these last three exercises we assume R, RStudio and Java have been installed as
described in Section 8.4.5. Also required are the SqlRender, DatabaseConnector, and
Eunomia packages, which can be installed using:
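One way to do this (a sketch, assuming installation from CRAN, with the Eunomia package installed from the OHDSI GitHub repository via the remotes package) is:

install.packages(c("SqlRender", "DatabaseConnector", "remotes"))
remotes::install_github("ohdsi/Eunomia")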
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
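For example, using the Eunomia package's getEunomiaConnectionDetails() function:

library(Eunomia)
connectionDetails <- getEunomiaConnectionDetails()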
The CDM database schema is “main”. This is a SQL query example to retrieve one row
of the CONDITION_OCCURRENCE table:
library(DatabaseConnector)

# Open the connection to the database
connection <- connect(connectionDetails)

# Parameterized SQL; @cdm will be replaced with the CDM schema name
sql <- "SELECT *
FROM @cdm.condition_occurrence
LIMIT 1;"

# Render the parameters, translate to the database dialect, and execute
result <- renderTranslateQuerySql(connection, sql, cdm = "main")
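You can inspect the returned data frame with head(result). When you are finished, close the connection with DatabaseConnector's disconnect function:

disconnect(connection)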
Exercise 4.4. Using SQL and R, retrieve all records of the condition “Gastrointestinal
hemorrhage” (with concept ID 192671).
Exercise 4.5. Using SQL and R, retrieve all records of the condition “Gastrointestinal
hemorrhage” using source codes. This database uses ICD10, and the relevant ICD10
code is “K92.2”.
Exercise 4.6. Using SQL and R, retrieve the observation period of the person with PERSON_ID 61.
Chapter 5

Standardized Vocabularies
Figure 5.1: 1660 London Bill of Mortality, showing the cause of death for deceased inhabitants using a classification system of 62 diseases known at the time.
(Germany), etc. Governments also control the marketing and sale of drugs and maintain national repositories of such certified drugs. Vocabularies are also used in the private sector, either as commercial products or for internal use, such as in electronic health record (EHR) systems or for medical insurance claim reporting.

As a result, each country, region, healthcare system and institution tends to have its own classifications that would most likely only be relevant where they are used. This myriad of vocabularies prevents interoperability of the systems they are used in. Standardization is the key that enables patient data exchange, unlocks health data analysis on a global level and allows systematic and standardized research, including performance characterization and quality assessment. To address that problem, multinational organizations have sprung up and started creating broad standards, such as the WHO mentioned above, the Systematized Nomenclature of Medicine (SNOMED) and Logical Observation Identifiers Names and Codes (LOINC). In the US, the Health IT Standards Committee (HITAC) recommends the use of SNOMED, LOINC and the drug vocabulary RxNorm to the National Coordinator for Health IT (ONC) as standards for use in a common platform for nationwide health information exchange across diverse entities.
OHDSI developed the OMOP CDM, a global standard for observational research. As part of the CDM, the OMOP Standardized Vocabularies are available for two main purposes:

The Standardized Vocabularies are available to the community free of charge and must be used with every OMOP CDM instance as its mandatory reference tables.
To download a zip file with all Standardized Vocabularies tables, select all the vocabularies you need for your OMOP CDM. Vocabularies with Standard Concepts (see Section 5.2.6) and very common usage are preselected. Add vocabularies that are used in your source data. Vocabularies that are proprietary have no select button. Click on the "License required" button to incorporate such a vocabulary into your list. The Vocabulary Team will contact you and ask you to demonstrate your license, or help you connect to the right folks to obtain one.
5.2 Concepts

All clinical events in the OMOP CDM are expressed as concepts, which represent the semantic notion of each event. They are the fundamental building blocks of the data records, making almost all tables fully normalized with few exceptions. Concepts are stored in the CONCEPT table (see Figure 5.2).

This system is meant to be comprehensive, i.e. there are enough concepts to cover any event relevant to the patient's healthcare experience (e.g. conditions, procedures, exposures to drugs, etc.) as well as some of the administrative information of the healthcare system (e.g. visits, care sites, etc.).
Figure 5.2: Standard representation of vocabulary concepts in the OMOP CDM. The example provided is the CONCEPT table record for the SNOMED code for Atrial Fibrillation.

LANGUAGE_CONCEPT_ID field. The name field is limited to 255 characters, which means that very long names get truncated and the full-length version is recorded as another synonym, which can hold up to 1000 characters.
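As an illustration, a concept record like the one shown in Figure 5.2 can be retrieved directly from the CONCEPT table. The sketch below assumes a DatabaseConnector connection (as set up in Chapter 4) to a database holding the vocabulary tables in the schema referenced by @cdm:

library(DatabaseConnector)

# Retrieve the Standard SNOMED concept for atrial fibrillation by name:
sql <- "SELECT concept_id, concept_name, domain_id, vocabulary_id,
       concept_class_id, standard_concept, concept_code
FROM @cdm.concept
WHERE vocabulary_id = 'SNOMED'
  AND standard_concept = 'S'
  AND concept_name = 'Atrial fibrillation';"
renderTranslateQuerySql(connection, sql, cdm = "main")  # adjust schema as needed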
5.2.3 Domains

Each concept is assigned a domain in the DOMAIN_ID field, which in contrast to the numerical CONCEPT_ID is a short case-sensitive unique alphanumeric ID for the domain. Examples of such domain identifiers are "Condition," "Drug," "Procedure," "Visit," "Device," "Specimen," etc. Ambiguous or pre-coordinated (combination) concepts can belong to a combination domain, but Standard Concepts (see Section 5.2.6) are always assigned a singular domain. Domains also direct to which CDM table and field a clinical event or event attribute is recorded. Domain assignments are an OMOP-specific feature applied during vocabulary ingestion using a heuristic laid out in Pallas. Source vocabularies tend to combine codes of mixed domains, to a varying degree (see Figure 5.3).

The domain heuristic follows the definitions of the domains. These definitions are derived from the table and field definitions in the CDM (see Chapter 4). The heuristic is not perfect; there are grey zones (see Section 5.6 "Special Situations"). If you find concept domains assigned incorrectly, please report it and help improve the process through a Forums or CDM issue post.
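The domain mix visible in Figure 5.3 can be inspected directly from the CONCEPT table. As a sketch (assuming, as before, a DatabaseConnector connection and vocabulary tables in the schema referenced by @cdm), the following counts CPT4 concepts per assigned domain:

library(DatabaseConnector)

sql <- "SELECT domain_id, COUNT(*) AS concept_count
FROM @cdm.concept
WHERE vocabulary_id = 'CPT4'
GROUP BY domain_id
ORDER BY concept_count DESC;"
renderTranslateQuerySql(connection, sql, cdm = "main")  # adjust schema as needed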
Figure 5.3: Domain assignment in the procedure vocabularies CPT4 and HCPCS. By intuition, these vocabularies should contain codes and concepts of a single domain, but in reality they are mixed.

5.2.4 Vocabularies

Each vocabulary has a short case-sensitive unique alphanumeric ID, which generally follows the abbreviated name of the vocabulary, omitting dashes. For example, ICD9CM has the vocabulary ID "ICD9CM". There are 111 vocabularies currently supported by OHDSI, of which 78 are adopted from external sources, while the rest are OMOP-internal vocabularies. These vocabularies are typically refreshed on a quarterly schedule. The source and version of each vocabulary are defined in the VOCABULARY reference file.
Concept class subdivision principle, and the vocabularies that follow it:

• Horizontal: all drug vocabularies, ATC, CDT, Episode, HCPCS, HemOnc, ICDs, MedDRA, OSM, Census
• Vertical: CIEL, HES Specialty, ICDO3, MeSH, NAACCR, NDFRT, OPCS4, PCORNET, Plan, PPI, Provider, SNOMED, SPL, UCUM
• Mixed: CPT4, ISBT, LOINC
• None: APC, all Type Concepts, Ethnicity, OXMIS, Race, Revenue Code, Sponsor, Supplier, UB04s, Visit
Horizontal concept classes allow you to determine a specific hierarchical level. For example, in the drug vocabulary RxNorm the concept class "Ingredient" defines the top level of the hierarchy. In the vertical model, members of a concept class can be of any hierarchical level, from the top to the very bottom.
Figure 5.4: Standard, non-standard source and classification concepts and their hierarchical relationships in the condition domain. SNOMED is used for most standard condition concepts (with some oncology-related concepts derived from ICDO3), MedDRA concepts are used for hierarchical classification concepts, and all other vocabularies contain non-standard or source concepts, which do not participate in the hierarchy.

the same meaning. In other words, there is no such thing as a "standard vocabulary." See Table 5.2 for examples.
Vocabularies used for Standard Concepts, source concepts and classification concepts, per domain:

• Condition: Standard Concepts: SNOMED, ICDO3; source concepts: SNOMED Veterinary; classification concepts: MedDRA
• Procedure: Standard Concepts: SNOMED, CPT4, HCPCS, ICD10PCS, ICD9Proc, OPCS4; source concepts: SNOMED Veterinary, HemOnc, NAACCR; classification concepts: none at this point
• Measurement: Standard Concepts: SNOMED, LOINC; source concepts: SNOMED Veterinary, NAACCR, CPT4, HCPCS, OPCS4, PPI; classification concepts: none at this point
• Drug: Standard Concepts: RxNorm, RxNorm Extension, CVX; source concepts: HCPCS, CPT4, HemOnc, NAACCR; classification concepts: ATC
• Device: Standard Concepts: SNOMED; source concepts: others, currently not normalized; classification concepts: none at this point
• Observation: Standard Concepts: SNOMED; source concepts: others; classification concepts: none at this point
• Visit: Standard Concepts: CMS Place of Service, ABMT, NUCC; source concepts: SNOMED, HCPCS, CPT4, UB04; classification concepts: none at this point
Table 5.3: Concepts with identical concept code 1001, but different
vocabularies, domains and concept classes.
5.2.10 Life-Cycle

Vocabularies are rarely permanent corpora with a fixed set of codes. Instead, codes and concepts are added and get deprecated. The OMOP CDM is a model to support longitudinal patient data, which means it needs to support concepts that were used in the past and might no longer be active, as well as supporting new concepts and placing them into context. There are three fields in the CONCEPT table that describe the possible life-cycle statuses: VALID_START_DATE, VALID_END_DATE, and INVALID_REASON. Their values differ depending on the concept's life-cycle status:

• Active or new concept
– Description: Concept in use.
– VALID_START_DATE: Day of instantiation of the concept; if that is not known, day of incorporation of the concept into the Vocabularies; if that is not known, 1970-1-1.
– VALID_END_DATE: Set to 2099-12-31 as a convention to indicate "Might become invalid in an undefined future, but active right now".
– INVALID_REASON: NULL
• Deprecated concept with no successor
– Description: Concept inactive and cannot be used as a Standard Concept (see Section 5.2.6).
– VALID_START_DATE: Day of instantiation of the concept; if that is not known, day of incorporation of the concept into the Vocabularies; if that is not known, 1970-1-1.
– VALID_END_DATE: Day in the past indicating deprecation, or, if that is not known, day of the vocabulary refresh at which the concept went missing from the vocabulary or was set to inactive.
– INVALID_REASON: "D"
• Upgraded concept with successor
– Description: Concept inactive, but with a defined successor. These are typically concepts which went through de-duplication.
– VALID_START_DATE: Day of instantiation of the concept, if that is not known
5.3 Relationships

Any two concepts can have a defined relationship, regardless of whether the two concepts belong to the same domain or vocabulary. The nature of the relationship is indicated by a short case-sensitive unique alphanumeric ID in the RELATIONSHIP_ID field of the CONCEPT_RELATIONSHIP table. Relationships are symmetrical, i.e. for each relationship an equivalent relationship exists where the content of the fields CONCEPT_ID_1 and CONCEPT_ID_2 is swapped, and the RELATIONSHIP_ID is changed to its opposite. For example, the "Maps to" relationship has the opposite relationship "Mapped from."

CONCEPT_RELATIONSHIP table records also have the life-cycle fields RELATIONSHIP_START_DATE, RELATIONSHIP_END_DATE and INVALID_REASON. However, only active records with INVALID_REASON = NULL are available through ATHENA. Inactive relationships are kept in the Pallas system for internal processing only. The RELATIONSHIP table serves as the reference with the full list of relationship IDs and their reverse counterparts.
The most important relationship ID pairs and their purposes:

• "Maps to" and "Mapped from": Mapping to Standard Concepts. Standard Concepts are mapped to themselves, non-standard concepts to Standard Concepts. Most non-standard and all Standard Concepts have this relationship to a Standard Concept. The former are stored in the *_SOURCE_CONCEPT_ID fields, and the latter in the *_CONCEPT_ID fields. Classification concepts are not mapped.
• "Maps to value" and "Value mapped from": Mapping to a concept that represents a value to be placed into the VALUE_AS_CONCEPT_ID fields of the MEASUREMENT and OBSERVATION tables.
"Equivalent concepts" means a concept carries the same meaning and, importantly, that its hierarchical descendants cover the same semantic space. If an equivalent concept is not available and the concept is not Standard, it is still mapped, but to a slightly broader concept (so-called "uphill mappings"). For example, ICD10CM W61.51 "Bitten by goose" has no equivalent in the SNOMED vocabulary, which is generally used for standard condition concepts. Instead, it is mapped to SNOMED 217716004 "Peck by bird," losing the context of the bird being a goose. Uphill mappings are only used if the loss of information is considered irrelevant to standard research use cases.

Some mappings connect a source concept to more than one Standard Concept. For example, ICD9CM 070.43 "Hepatitis E with hepatic coma" is mapped to both SNOMED 235867002 "Acute hepatitis E" as well as SNOMED 72836002 "Hepatic coma." The reason for this is that the original source concept is a pre-coordinated combination of two conditions, hepatitis and coma. SNOMED does not have that combination, which results in two records being written for the ICD9CM record, one with each mapped Standard Concept.
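As an illustration of how these mapping relationships are used in practice, the following sketch (assuming a DatabaseConnector connection and vocabulary tables in the schema referenced by @cdm) looks up the Standard Concept(s) that a source code maps to; for the ICD9CM code 070.43 from the example above it should return the two SNOMED concepts:

library(DatabaseConnector)

sql <- "SELECT src.concept_code    AS source_code,
       src.vocabulary_id    AS source_vocabulary,
       std.concept_id       AS standard_concept_id,
       std.concept_name     AS standard_concept_name
FROM @cdm.concept src
JOIN @cdm.concept_relationship rel
  ON rel.concept_id_1 = src.concept_id
 AND rel.relationship_id = 'Maps to'
JOIN @cdm.concept std
  ON std.concept_id = rel.concept_id_2
WHERE src.vocabulary_id = 'ICD9CM'
  AND src.concept_code = '070.43';"
renderTranslateQuerySql(connection, sql, cdm = "main")  # adjust schema as needed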
Relationships "Maps to value" have the purpose of splitting off a value for OMOP CDM tables following an entity-attribute-value (EAV) model. This is typically the case in the following situations:

In these situations, the source concept is a combination of the attribute (test or history) and the value (test result or disease). The "Maps to" relationship maps this source to the attribute concept, and the "Maps to value" to the value concept. See Figure 5.5 for an example.

Figure 5.5: One-to-many mapping between a source concept and Standard Concepts. A pre-coordinated concept is split into two concepts, one of which is the attribute (here the history of a clinical finding) and the other one is the value (peptic ulcer). While the "Maps to" relationship maps to concepts of the measurement or observation domains, the "Maps to value" concepts have no domain restriction.
Vocabulary team. They may serve as approximate mappings but are oftentimes less precise than the better curated mapping relationships. High-quality equivalence relationships (such as "Source RxNorm equivalent") are always duplicated by a "Maps to" relationship.
CONCEPT_ID_1 CONCEPT_ID_2
4000504 “Urethra part” 36713433 “Partial duplication of urethra”
4000504 “Urethra part” 433583 “Epispadias”
4000504 “Urethra part” 443533 “Epispadias, male”
4000504 “Urethra part” 4005956 “Epispadias, female”
The quality and comprehensiveness of these relationships vary depending on the quality of the original vocabulary. Generally, vocabularies that are used to draw Standard Concepts from, such as SNOMED, are chosen because of their better curation, and therefore tend to have higher-quality internal relationships as well.
5.4 Hierarchy

Within a domain, standard and classification concepts are organized in a hierarchical structure and stored in the CONCEPT_ANCESTOR table. This allows querying and retrieving concepts and all their hierarchical descendants. These descendants have the same attributes as their ancestor, but also additional or more defined ones.

The CONCEPT_ANCESTOR table is built automatically from the CONCEPT_RELATIONSHIP table by traversing all possible concepts connected through hierarchical relationships. These are the "Is a" / "Subsumes" pairs (see Figure 5.6), and other relationships connecting hierarchies across vocabularies. Whether a relationship participates in the hierarchy constructor is defined for each relationship ID by the flag DEFINES_ANCESTRY in the RELATIONSHIP reference table.

Figure 5.6: Hierarchy of the condition "Atrial fibrillation." First-degree ancestry is defined through "Is a" and "Subsumes" relationships, while all higher-degree relations are inferred and stored in the CONCEPT_ANCESTOR table. Each concept is also its own descendant, with both levels of separation equal to 0.

The ancestral degree, or the number of steps between ancestor and descendant, is captured in the MIN_LEVELS_OF_SEPARATION and MAX_LEVELS_OF_SEPARATION fields, defining the shortest and longest possible connection. Not all hierarchical relationships contribute equally to the levels-of-separation calculation. Whether a step is counted for the degree is determined by the IS_HIERARCHICAL flag in the RELATIONSHIP reference table for each relationship ID.
At the moment, a high-quality comprehensive hierarchy exists only for two domains: Drug and Condition. The Procedure, Measurement and Observation domains are only partially covered and are in the process of construction. The ancestry is particularly useful for the Drug domain, as it allows browsing all drugs with a given ingredient or all members of drug classes, irrespective of the country of origin, brand name or other attributes.
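For example, the following sketch (again assuming a DatabaseConnector connection; the ingredient concept ID used here is a hypothetical placeholder to be replaced with the RxNorm ingredient you are interested in) uses CONCEPT_ANCESTOR to list all standard drug concepts containing a given ingredient:

library(DatabaseConnector)

sql <- "SELECT drug.concept_id, drug.concept_name, drug.concept_class_id
FROM @cdm.concept_ancestor ca
JOIN @cdm.concept drug
  ON drug.concept_id = ca.descendant_concept_id
WHERE ca.ancestor_concept_id = @ingredient_concept_id
  AND drug.standard_concept = 'S'
  AND drug.domain_id = 'Drug';"
renderTranslateQuerySql(connection, sql,
                        cdm = "main",                     # adjust schema as needed
                        ingredient_concept_id = 1125315)  # hypothetical placeholder ID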
5.6.1 Gender

Gender in the OMOP CDM and Standardized Vocabularies denotes the biological sex at birth. Questions are often posed about how to represent alternative genders. These use cases have to be covered through records in the OBSERVATION table, where the self-defined gender of a person is stored (if the data asset contains such information).
5.6.5 Devices

Device concepts have no standardized coding scheme that could be used to source Standard Concepts. In many source data, devices are not even coded, or are contained in an external coding scheme. For this same reason, there is currently no hierarchical system available.

integrated system, where concepts from different origins and purposes all reside in the same domain-specific hierarchies.
5.7 Summary

– All events and administrative facts are represented in the OMOP Standardized Vocabularies as concepts, concept relationships, and the concept ancestor hierarchy.
– Most of these are adopted from existing coding schemes or vocabularies, while some of them are curated de novo by the OHDSI Vocabulary Team.
– All concepts are assigned a domain, which controls where the fact represented by the concept is stored in the CDM.
– Concepts of equivalent meaning in different vocabularies are mapped to one of them, which is designated the Standard Concept. The others are source concepts.
– Mapping is done through the concept relationships "Maps to" and "Maps to value".
– There is an additional class of concepts called classification concepts, which are non-standard, but in contrast to source concepts they participate in the hierarchy.
– Concepts have a life cycle over time.
– Concepts within a domain are organized into hierarchies. The quality of the hierarchy differs between domains, and the completion of the hierarchy system is an ongoing task.
– You are strongly encouraged to engage with the community if you believe you have found a mistake or inaccuracy.
5.8 Exercises

Prerequisites

For these first exercises you will need to look up concepts in the Standardized Vocabularies, which can be done through ATHENA (http://athena.ohdsi.org/) or ATLAS (http://atlasdemo.ohdsi.org).

Exercise 5.2. Which ICD10CM codes map to the Standard Concept for "Gastrointestinal hemorrhage"? Which ICD9CM codes map to this Standard Concept?

Exercise 5.3. What are the MedDRA preferred terms that are equivalent to the Standard Concept for "Gastrointestinal hemorrhage"?
Chapter 6

Extract Transform Load
6.1 Introduction

In order to get from the native/raw data to the OMOP Common Data Model (CDM) we have to create an extract, transform, and load (ETL) process. This process should restructure the data to the CDM and add mappings to the Standardized Vocabularies, and it is typically implemented as a set of automated scripts, for example SQL scripts. It is important that this ETL process is repeatable, so that it can be rerun whenever the source data are refreshed.

Creating an ETL is usually a large undertaking. Over the years, we have developed best practices, consisting of four major steps:

In this chapter we will discuss each of these steps in detail. Several tools have been developed by the OHDSI community to support some of these steps, and these will be discussed as well. We close this chapter with a discussion of CDM and ETL maintenance.
we are likely to get stuck in nitty-gritty details, while we should be focusing on the overall picture.

Two closely-integrated tools have been developed to support the ETL design process: White Rabbit and Rabbit-in-a-Hat.
White Rabbit's main function is to perform a scan of the source data, providing detailed information on the tables, fields, and values that appear in a field. The source data can be in comma-separated text files or in a database (MySQL, SQL Server, Oracle, PostgreSQL, Microsoft APS, Microsoft Access, Amazon Redshift). The scan will generate a report that can be used as a reference when designing the ETL, for instance by using it in conjunction with the Rabbit-in-a-Hat tool. White Rabbit differs from standard data profiling tools in that it attempts to prevent the display of personally identifiable information (PII) data values in the generated output data file.
Process Overview
The typical sequence for using the software to scan source data:
1. Set working folder, the location on the local desktop computer where results will
be exported.
2. Connect to the source database or CSV text file and test connection.
3. Select the tables of interest for the scan and scan the tables.
4. White Rabbit creates an export of information about the source data.
After downloading and installing the White Rabbit application (available at
https://github.com/OHDSI/WhiteRabbit), the first thing you need to do is set a working
folder. Any files that White Rabbit creates will be exported to this local folder. Use the
“Pick Folder” button shown in Figure 6.1 to navigate to the location in your local
environment where you would like the scan document to go.
Figure 6.1: The “Pick Folder” button allows the specification of a working folder for the
White Rabbit application.
Connection to a Database
White Rabbit supports delimited text files and various database platforms. Hover the
mouse over the various fields to get a description of what is required. More detailed
information can be found in the manual.
After connecting to a database, you can scan the tables contained therein. A scan generates
a report containing information on the source data that can be used to help design the
ETL. Using the Scan tab shown in Figure 6.2 you can either select individual tables in
the selected source database by clicking on “Add” (Ctrl + mouse click), or automatically
select all tables in the database by clicking on “Add all in DB”.
• Checking the “Scan field values” box tells White Rabbit that you would like to investigate
which values appear in the columns.
• “Min cell count” is an option when scanning field values. By default, this is set to
5, meaning values in the source data that appear less than 5 times will not appear in
the report. Individual data sets may have their own rules about what this minimum
cell count can be.
• “Rows per table” is an option when scanning field values. By default, White Rabbit
will scan 100,000 randomly selected rows in the table.
Once all settings are completed, press the “Scan tables” button. After the scan is completed
the report will be written to the working folder.
The tabs for each of the tables show each field, the values in each field, and the frequency
of each value. Each source table column will generate two columns in the Excel workbook.
One column will list all distinct values that have a count greater than the “Min cell count”
set at the time of the scan. If a list of unique values was truncated, the last value in the list
will be “List truncated”; this indicates that there are one or more additional unique source
values that appear fewer times than the number entered in the “Min cell count”. Next to
each distinct value will be a second column that contains the frequency (the number of
times that value occurs in the sample). These two columns (distinct values and frequency)
will repeat for all the source columns in the table profiled in the workbook.
The report is a powerful way to understand your source data because it highlights what
exists in it. For example, if the results shown in Figure 6.4 were given back on the “Sex”
column within one of the tables scanned, we can see that there were two common values
(1 and 2) that appeared 61,491 and 35,401 times respectively. White Rabbit will not define
1 as male and 2 as female; the data holder will typically need to define source codes unique
to the source system. However, these two values (1 & 2) are not the only values present
in the data, because we see this list was truncated. These other values appear with very
low frequency (defined by “Min cell count”) and often represent incorrect or highly
suspicious values. When generating an ETL we should not only plan to handle the
high-frequency gender concepts 1 and 2 but also the other low-frequency values that exist
within this column. For example, if those lower frequency genders were “NULL” we want
to make sure the ETL can handle processing that data and knows what to do in that
situation.
6.2.2 Rabbit-In-a-Hat
With the White Rabbit scan in hand, we have a clear picture of the source data. We also
know the full specification of the CDM. Now we need to define the logic to go from one
to the other. This design activity requires thorough knowledge of both the source data
and the CDM. The Rabbit-in-a-Hat tool that comes with the White Rabbit software is
specifically designed to support a team of experts in these areas. In a typical setting, the
ETL design team sits together in a room, while Rabbit-in-a-Hat is projected on a screen.
In a first round, the tabletotable mappings can be collaboratively decided, after which
fieldtofield mappings can be designed, while defining the logic by which values will be
transformed.
Process Overview
The typical sequence for using this software to generate documentation of an ETL:
1. Scanned results from White Rabbit completed.
2. Open scanned results; interface displays source tables and CDM tables.
3. Connect source tables to CDM tables where the source table provides information
for that corresponding CDM table.
4. For each source table to CDM table connection, further define the connection with
source column to CDM column detail.
5. Save the Rabbit-in-a-Hat work and export it to an MS Word document.
Figure 6.5: General flow of an ETL and which tables to map first.
Table 6.1 below shows the logic that was imposed on the Synthea patients table to convert
it to the CDM PERSON table. The “Destination Field” column shows where in the CDM
the data is being mapped to. The “Source field” column highlights the column from the
source table (in this case patients) that will be used to populate the CDM column. Finally,
the “Logic & comments” column gives explanations for the logic.
Table 6.1: ETL logic to convert the Synthea Patients table to the CDM PERSON table.
Each entry lists the destination field, the source field (if any), and the logic & comments.

• PERSON_ID (source field: none). Autogenerate. The PERSON_ID will be generated at
the time of implementation, because the id value from the source is a varchar value while
the PERSON_ID is an integer. The id field from the source is set as the
PERSON_SOURCE_VALUE to preserve that value and allow for error-checking if
necessary.
• GENDER_CONCEPT_ID (source field: gender). When gender = ‘M’ then set
GENDER_CONCEPT_ID to 8507, when gender = ‘F’ then set to 8532. Drop any rows
with missing/unknown gender. These two concepts were chosen as they are the only two
standard concepts in the gender domain. The choice to drop patients with unknown
genders tends to be site-based, though it is recommended they are removed, as people
without a gender are excluded from analyses.
• YEAR_OF_BIRTH (source field: birthdate). Take the year from birthdate.
• MONTH_OF_BIRTH (source field: birthdate). Take the month from birthdate.
• DAY_OF_BIRTH (source field: birthdate). Take the day from birthdate.
• BIRTH_DATETIME (source field: birthdate). With midnight as time 00:00:00. Here, the
source did not supply a time of birth, so the choice was made to set it at midnight.
• RACE_CONCEPT_ID (source field: race). When race = ‘WHITE’ then set as 8527,
when race = ‘BLACK’ then set as 8516, when race = ‘ASIAN’ then set as 8515, otherwise
set as 0. These concepts were chosen because they are the standard concepts belonging to
the race domain that most closely align with the race categories in the source.
• ETHNICITY_CONCEPT_ID (source fields: race, ethnicity). When race = ‘HISPANIC’,
or when ethnicity in (‘CENTRAL_AMERICAN’, ‘DOMINICAN’, ‘MEXICAN’,
‘PUERTO_RICAN’, ‘SOUTH_AMERICAN’), then set as 38003563, otherwise set as 0.
This is a good example of how multiple source columns can contribute to one CDM
column. In the CDM ethnicity is represented as either Hispanic or not Hispanic, so values
from both the source column race and the source column ethnicity determine this value.
• LOCATION_ID (source field: none).
• PROVIDER_ID (source field: none).
• CARE_SITE_ID (source field: none).
• PERSON_SOURCE_VALUE (source field: id).
• GENDER_SOURCE_VALUE (source field: gender).
• GENDER_SOURCE_CONCEPT_ID (source field: none).
• RACE_SOURCE_VALUE (source field: race).
• RACE_SOURCE_CONCEPT_ID (source field: none).
• ETHNICITY_SOURCE_VALUE (source field: ethnicity). In this case the
ETHNICITY_SOURCE_VALUE will have more granularity than the
ETHNICITY_CONCEPT_ID.
• ETHNICITY_SOURCE_CONCEPT_ID (source field: none).
For more examples on how the Synthea dataset was mapped to the CDM please see the
full specification document.3
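To make the flavor of such logic concrete, below is a sketch (not taken from the official
Synthea ETL) of how the gender, birth date, and source value rows of Table 6.1 might be
expressed in SQL, assuming a source table named patients with the columns described
above:

-- Sketch only: illustrates the PERSON mapping logic summarized in Table 6.1.
SELECT ROW_NUMBER() OVER (ORDER BY id) AS person_id,
       CASE gender WHEN 'M' THEN 8507 WHEN 'F' THEN 8532 END AS gender_concept_id,
       YEAR(birthdate) AS year_of_birth,
       MONTH(birthdate) AS month_of_birth,
       DAY(birthdate) AS day_of_birth,
       id AS person_source_value,
       gender AS gender_source_value
FROM patients
WHERE gender IN ('M', 'F'); -- drop rows with missing/unknown gender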
be included and mapped. Check the VOCABULARY table in the OMOP Vocabulary to
see which vocabularies are included. To extract the mapping from non-standard source
codes (e.g. ICD-10-CM codes) to standard concepts (e.g. SNOMED codes), we can use
the records in the CONCEPT_RELATIONSHIP table having relationship_id = “Maps
to”. For example, to find the standard concept ID for the ICD-10-CM code ‘I21’ (“Acute
Myocardial Infarction”), we can use the following SQL:
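A sketch of such a query, assuming the vocabulary tables live in a schema referenced
here as @vocab, is:

SELECT concept_id_2 AS standard_concept_id
FROM @vocab.concept_relationship
INNER JOIN @vocab.concept source_concept
  ON concept_id_1 = source_concept.concept_id
WHERE source_concept.vocabulary_id = 'ICD10CM'
  AND source_concept.concept_code = 'I21'
  AND relationship_id = 'Maps to';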
STANDARD_CONCEPT_ID
312327
Unfortunately, sometimes the source data uses coding systems that are not in the Vocabulary.
In this case, a mapping must be created from the source coding system to the Standard
Concepts. Code mapping can be a daunting task, especially when there are many
codes in the source coding system. There are several things that can be done to make the
task easier:
• Focus on the most frequently used codes. A code that is never used or infrequently
used is not worth the effort of mapping, since it will never be used in a real study.
• Make use of existing information whenever possible. For example, many national
drug coding systems have been mapped to ATC. Although ATC is not detailed
enough for many purposes, the concept relationships between ATC and RxNorm
can be used to make good guesses of what the right RxNorm codes are.
• Use Usagi.
6.3.1 Usagi
Usagi is a tool to aid the manual process of creating a code mapping. It can make suggested
mappings based on textual similarity of code descriptions. If the source codes are
only available in a foreign language, we have found that Google Translate
(https://translate.google.com/) often gives surprisingly good translations of the terms into
English. Usagi allows the user to search for the appropriate target concepts if the
automated suggestion is not correct. Finally, the user can indicate which mappings are
approved to be used in the ETL. Usagi is available on GitHub
(https://github.com/OHDSI/Usagi).
Process Overview
The typical sequence for using this software is:
1. Load codes from your source system (“source codes”) that you would like to map
to Vocabulary concepts.
2. Usagi will run a term similarity approach to map source codes to Vocabulary concepts.
3. Leverage the Usagi interface to check, and where needed improve, the suggested
mappings. Preferably, an individual who has experience with the coding system and
medical terminology should perform this review.
4. Export mapping to the Vocabulary’s SOURCE_TO_CONCEPT_MAP.
concepts in the Condition domain. By default, Usagi only maps to Standard Concepts,
but if the option “Filter standard concepts” is turned off, Usagi will also consider
Classification Concepts. Hover your mouse over the different filters for additional
information about the filter.
One special filter is “Filter by automatically selected concepts / ATC code”. If there is
information that you can use to restrict the search, you can do so by providing a list of
CONCEPT_IDs or an ATC code in the column indicated in the Auto concept ID column
(semicolondelimited). For example, in the case of drugs there might already be ATC
codes assigned to each drug. Even though an ATC code does not uniquely identify a
single RxNorm drug code, it does help limit the search space to only those concepts that
fall under the ATC code in the Vocabulary. To use the ATC code, follow these steps:
1. In the Column mapping section, switch from “Auto concept ID column” to “ATC
column”
2. In the Column mapping section, select the column containing the ATC code as
“ATC column”.
3. Turn on the “Filter by user selected concepts / ATC code” filter in the Filters section.
You can also use sources of information other than the ATC code to restrict the search. In
the example shown in the figure above, we used a partial mapping derived from UMLS
to restrict the Usagi search. In that case we will need to use the “Auto concept ID column”.
Once all your settings are finalized, click the “Import” button to import the file. The file
import will take a few minutes as it is running the term similarity algorithm to map source
codes.
ciated with this matched pair (matching scores are typically 0 to 1, with 1 being a confident
match); a score of 0.58 signifies that Usagi is not very sure of how well it has mapped
this Dutch code to SNOMED. Let us say in this case we are okay with this mapping; we
can approve it by hitting the green “Approve” button in the bottom right-hand portion of
the screen.
When using the manual search box, one should keep in mind that Usagi uses a fuzzy
search and does not support structured search queries; for example, it does not support
Boolean operators like AND and OR.
To continue our example, suppose we used the search term “Cough” to see if we could
find a better mapping. On the right of the Query section of the Search Facility there
is a Filters section, which provides options to trim down the results from the Vocabulary
when searching for the search term. In this case we know we want to find only standard
concepts, and we allow concepts to be found based on the names and synonyms of source
concepts in the vocabulary that map to those standard concepts.
When we apply these search criteria we find concept 254761 (“Cough”) and feel this may
be an appropriate Vocabulary concept to map to our Dutch code. In order to do that we
can hit the “Replace concept” button, after which you will see the “Selected Source Code”
section update, followed by the “Approve” button. There is also an “Add concept” button,
which allows multiple standardized Vocabulary concepts to map to one source code
(e.g. some source codes may bundle multiple diseases together while the standardized
vocabulary may not).
Concept Information
When looking for appropriate concepts to map to, it is important to consider the “social
life” of a concept. The meaning of a concept might depend partially on its place in the
hierarchy, and sometimes there are “orphan concepts” in the vocabulary with few or no
hierarchical relationships, which would be ill-suited as target concepts. Usagi will often
report the number of parents and children a concept has, and it is also possible to show
more information by pressing ALT + C or selecting View –> Concept information in the
top menu bar.
Figure 6.9 shows the concept information panel. It shows general information about a
concept, as well as its parents, children, and other source codes that map to the concept.
Users can use this panel to navigate the hierarchy and potentially choose a different target
concept.
Continue to move through this process, code by code, until all codes have been checked.
In the list of source codes at the top of the screen, by selecting the column heading you can
sort the codes. Often, we suggest going from the highest frequency codes to the lowest.
In the bottom left of the screen you can see the number of codes that have approved
mappings, and how many code occurrences that corresponds to.
Best Practices
Once you have created your map within Usagi, the best way to use it moving forward
is to export it and append it to the Vocabulary SOURCE_TO_CONCEPT_MAP table.
After selecting the SOURCE_VOCABULARY_ID, you give your export CSV a name
and save it to a location. The structure of the export CSV matches that of the
SOURCE_TO_CONCEPT_MAP table, so this mapping can be appended to the
Vocabulary’s SOURCE_TO_CONCEPT_MAP table. It would also make sense to append
a single row to the VOCABULARY table defining the SOURCE_VOCABULARY_ID
you defined in the step above. Finally, it is important to note that only mappings with the
“Approved” status will be exported into the CSV file; the mapping needs to be completed
in Usagi in order to export it.
It should be noted that after several independent attempts, we have given up on developing
the ‘ultimate’ user-friendly ETL tool. It is always the case that tools like that work well
for 80% of the ETL, but for the remaining 20% some low-level code needs to be written
that is specific to a source database.
Once the technical individuals are ready to start implementing, the ETL design document
should be shared with them. There should be enough information in the documentation
for them to get started; however, it should be expected that the developers have access to
the ETL designers to ask questions during their development process. Logic that may
be clear to the designers may be less clear to an implementer who might not be familiar
with the data and the CDM. The implementation phase should remain a team effort. It is
considered acceptable practice for the implementers and designers to iterate through CDM
creation and testing until both groups are in agreement that all logic has been executed
correctly.
• Review of the ETL design document, computer code, and code mappings. Any one
person can make mistakes, so at least one other person should always review what
was done.
– The largest issues in the computer code tend to come from how the source
codes in the native data are mapped to Standard Concepts. Mapping can get
tricky, especially when it comes to date-specific codes like NDCs. Be sure to
double check any area where mappings are done to ensure the correct source
vocabularies are translated to the proper concept IDs.
• Manually compare all information on a sample of persons in the source and target
data.
– It can be helpful to walk through one person’s data, ideally a person with a
large number of unique records. Tracing through a single person can highlight
issues if the data in the CDM is not how you expect it to look based on the
agreed upon logic.
• Compare overall counts in the source and target data.
– There may be some expected differences in counts depending on how you
chose to address certain issues. For instance, some collaborators choose to
drop any people with a NULL gender since those people will not be included
in analyses anyway. It may also be the case that visits in the CDM are constructed
differently than visits or encounters in the native data. Therefore, when comparing
overall counts between the source and CDM data be sure to account for and expect
these differences.
• Replicate a study that has already been performed on the source data on the CDM
version.
– This is a good way to understand any major differences between the source
data and the CDM version, though it is a little more time-intensive.
• Create unit tests meant to replicate a pattern in the source data that should be addressed
in the ETL. For example, if your ETL specifies that patients without gender
information should be dropped, create a unit test of a person without a gender and
assess how the builder handles it.
– Unit testing is very handy when evaluating the quality and accuracy of an ETL
conversion. It usually involves creating a much smaller dataset that mimics
the structure of the source data you are converting. Each person or record in
the dataset should test a specific piece of logic as written in the ETL document.
Using this method, it is easy to trace back issues and to identify failing
logic. The small size also enables the computer code to execute very quickly,
allowing for faster iterations and error identification.
These are high-level ways to approach quality control from an ETL standpoint. For more
detail on the data quality efforts going on within OHDSI, please see Chapter 15.
When performing an ETL, if there is a scenario that you are unsure how to handle, THEMIS
recommends that a question about the scenario is posed on the OHDSI Forums
(https://forums.ohdsi.org). Most likely if you have a question, others in the community
probably have it as well. THEMIS uses these discussions, as well as work group meetings
and face-to-face discussions, to help inform what other conventions need to be documented.
may lead to additional data being stored in the CDM. This might mean data that you
previously were not storing in the CDM might have a location in a new CDM version.
Changes to the existing CDM structure are less frequent, but they do occur. For
example, the CDM has adopted DATETIME fields over the original DATE fields, which
could cause an error in ETL processing. CDM versions are not released often, and sites
can choose when they migrate.
• The 80/20 rule. If you can avoid it, do not spend too much time manually mapping
source codes to concepts. Ideally, map the source codes that cover the majority
of your data. This should be enough to get you started, and you can address any
remaining codes in the future based on use cases.
• It’s ok if you lose data that is not of research quality. Often these are the records
that would be discarded before starting an analysis anyway; we just remove them
during the ETL process instead.
• A CDM requires maintenance. Just because you complete an ETL does not mean
you do not need to touch it ever again. Your raw data might change, there might be
a bug in the code, there may be new vocabulary or an update to the CDM. Plan to
allocate resources for these changes so your ETL is always up to date.
• For support starting the OHDSI CDM, performing your database conversion, or
running the analytics tools, please visit our Implementers Forum
(https://forums.ohdsi.org/c/implementers).
6.9 Summary
– There is a generally agreed upon process for how to approach an ETL, including:
design of the ETL by data experts and CDM experts together; creation of the code
mappings by people with medical knowledge; implementation of the ETL by a
technical person; and quality control involving everyone.
6.10 Exercises
Exercise 6.1. Put the steps of the ETL process in the proper order:
A) Data experts and CDM experts together design the ETL
B) A technical person implements the ETL
C) People with medical knowledge create the code mappings
D) All are involved in quality control
Exercise 6.2. Using OHDSI resources of your choice, spot four issues with the PERSON
record shown in Table 6.3 (table abbreviated for space):
Column Value
PERSON_ID A123B456
GENDER_CONCEPT_ID 8532
YEAR_OF_BIRTH NULL
MONTH_OF_BIRTH NULL
DAY_OF_BIRTH NULL
RACE_CONCEPT_ID 0
ETHNICITY_CONCEPT_ID 8527
PERSON_SOURCE_VALUE A123B456
GENDER_SOURCE_VALUE F
RACE_SOURCE_VALUE WHITE
ETHNICITY_SOURCE_VALUE NONE PROVIDED
Exercise 6.3. Let us try to generate VISIT_OCCURRENCE records. Here is some example
logic written for Synthea: Sort the data in ascending order by PATIENT, START, END.
Then, by PERSON_ID, collapse lines of claim as long as the time between the END of
one line and the START of the next is <= 1 day. Each consolidated inpatient claim is then
considered one inpatient visit; set:
• MIN(START) as VISIT_START_DATE
• MAX(END) as VISIT_END_DATE
• “IP” as PLACE_OF_SERVICE_SOURCE_VALUE
If you see a set of visits as shown in Figure 6.10 in your source data, how would you
expect the resulting VISIT_OCCURRENCE record(s) to look in the CDM?
Data Analytics
Chapter 7
Data Analytics Use Cases
7.1 Characterization
Characterization attempts to answer the question
What happened to them?
We can use the data to provide answers to questions about the characteristics of the persons
in a cohort or the entire database, about the practice of healthcare, and about how these
things change over time.
The data can provide answers to questions like:
• For patients newly diagnosed with atrial fibrillation, how many receive a prescription
for warfarin?
• What is the average age of patients who undergo hip arthroplasty?
• What is the incidence rate of pneumonia in patients over 65 years old?
Typical characterization questions are formulated as:
7.6 Summary
7.7 Exercises
Exercise 7.1. Which use case categories do these questions belong to?
1. Compute the rate of gastrointestinal (GI) bleeding in patients recently exposed to
NSAIDs.
2. Compute the probability that a specific patient experiences a GI bleed in the next
year, based on their baseline characteristics.
3. Estimate the increased risk of GI bleeding due to diclofenac compared to celecoxib.
Exercise 7.2. You wish to estimate the increased risk of GI bleeding due to diclofenac
compared to no exposure (placebo). Can this be done using observational healthcare data?
Chapter 8
OHDSI Analytics Tools
Figure 8.1: Different ways to implement an analysis against data in the CDM.
The third approach relies on our interactive analysis platform ATLAS, a web-based tool
that allows non-programmers to perform a wide range of analyses efficiently. ATLAS
makes use of the Methods Libraries but provides a simple graphical interface to design
analyses and in many cases generate the necessary R code to run the analysis. However,
ATLAS does not support all options available in the Methods Library. While it is expected
that the majority of studies can be performed through ATLAS, some studies may require
the flexibility offered by the second approach.
ATLAS and the Methods Library are not independent. Some of the more complicated
analytics that can be invoked in ATLAS are executed through calls to the packages in the
Methods Library. Similarly, cohorts used in the Methods Library are often designed in
ATLAS.
The first strategy views every analysis as a single individual study. The analysis must be
pre-specified in a protocol, implemented as code, and executed against the data, after which
the results can be compiled and interpreted. For every question, all steps must be repeated.
An example of such an analysis is the OHDSI study into the risk of angioedema associated
with levetiracetam compared with phenytoin. (Duke et al., 2017) Here, a protocol
was first written, analysis code using the OHDSI Methods Library was developed and
executed across the OHDSI network, and results were compiled and disseminated in a
journal publication.
The second strategy develops an application that allows users to answer a specific class of
questions in real time or near-real time. Once the application has been developed, users
can interactively define queries, submit them, and view the results. An example of this
strategy is the cohort definition and generation tool in ATLAS. This tool allows users
to specify cohort definitions of varying complexity, and execute the definition against a
database to see how many people meet the various inclusion and exclusion criteria.
The third strategy similarly focuses on a class of questions, but then attempts to exhaustively
generate all the evidence for the questions within the class. Users can then explore
the evidence as needed through a variety of interfaces. One example is the OHDSI study
into the effects of depression treatments. (Schuemie et al., 2018b) In this study all
depression treatments are compared for a large set of outcomes of interest across four large
observational databases. The full set of results, including 17,718 empirically calibrated
hazard ratios along with extensive study diagnostics, is available in an interactive web
app (http://data.ohdsi.org/SystematicEvidence/).
8.3 ATLAS
ATLAS is a free, publicly available, web-based tool developed by the OHDSI community
that facilitates the design and execution of analyses on standardized, patient-level,
observational data in the CDM format. ATLAS is deployed as a web application in combination
with the OHDSI WebAPI and is typically hosted on Apache Tomcat. Performing real-time
analyses requires access to the patient-level data in the CDM and is therefore typically
installed behind an organization’s firewall. However, there is also a public ATLAS
(http://www.ohdsi.org/web/atlas), and
although this ATLAS instance only has access to a few small simulated datasets, it can
still be used for many purposes including testing and training. It is even possible to fully
define an effect estimation or prediction study using the public instance of ATLAS, and
automatically generate the R code for executing the study. That code can then be run
in any environment with an available CDM without needing to install ATLAS and the
WebAPI.
A screenshot of ATLAS is provided in Figure 8.3. On the left is a navigation bar showing
the various functions provided by ATLAS:
Data Sources Data sources provides the capability to review descriptive, standardized
reporting for each of the data sources that you have configured within your ATLAS
platform. This feature uses the large-scale analytics strategy: all descriptives have
been pre-computed. Data sources is discussed in Chapter 11.
Vocabulary Search ATLAS provides the ability to search and explore the OMOP
standardized vocabulary to understand what concepts exist within those vocabularies
and how to apply those concepts in your standardized analysis against your data
sources. This feature is discussed in Chapter 5.
Concept Sets Concept sets provides the ability to create collections of logical expressions
that can be used to identify a set of concepts to be used throughout your standardized
analyses. Concept sets provide more sophistication than a simple list of
codes or values. A concept set is comprised of multiple concepts from the standardized
vocabulary in combination with logical indicators that allow a user to specify
that they are interested in including or excluding related concepts in the vocabulary
hierarchy. Searching the vocabulary, identifying the set of concepts, and specifying
the logic to be used to resolve a concept set provides a powerful mechanism for
defining the often obscure medical language used in analysis plans. These concept
sets can be saved within ATLAS and then used throughout your analysis as part of
cohort definitions or analysis specifications.
Cohort Definitions Cohort definitions provides the ability to construct a set of persons
who satisfy one or more criteria for a duration of time; these cohorts can then serve
as the basis of inputs for all of your subsequent analyses. This feature is discussed
in Chapter 10.
Characterizations Characterizations is an analytic capability that allows you to look
at one or more cohorts that you’ve defined and to summarize characteristics about
those patient populations. This feature uses the real-time query strategy, and is
discussed in Chapter 11.
Cohort Pathways Cohort pathways is an analytic tool that allows you to look at the
sequence of clinical events that occur within one or more populations. This feature
uses the real-time query strategy, and is discussed in Chapter 11.
Incidence Rates Incidence rates is a tool that allows you to estimate the incidence of
outcomes within target populations of interest. This feature uses the real-time query
strategy, and is discussed in Chapter 11.
Profiles Profiles is a tool that allows you to explore an individual patient’s longitudinal
observational data to summarize what is going on within that individual. This
feature uses the real-time query strategy.
Population Level Estimation Estimation is a capability that allows you to define a
population-level effect estimation study using a comparative cohort design, whereby
comparisons between one or more target and comparator cohorts can be explored
for a series of outcomes. This feature can be said to implement the real-time query
strategy, as no coding is required, and is discussed in Chapter 12.
Patient Level Prediction Prediction is a capability that allows you to apply machine
learning algorithms to conduct patient-level prediction analyses, whereby you can
predict an outcome within any given target exposure cohort. This feature can be said
to implement the real-time query strategy, as no coding is required, and is discussed in
Chapter 13.
Jobs Select the Jobs menu item to explore the state of processes that are running through
the WebAPI. Jobs are often long-running processes such as generating a cohort or
computing cohort characterization reports.
Configuration Select the Configuration menu item to review the data sources that have
been configured in the source configuration section.
Feedback The Feedback link will take you to the issue log for ATLAS so that you can log
a new issue or search through existing issues. If you have ideas for new features
or enhancements, this is also a place to note them for the development community.
8.3.1 Security
ATLAS and the WebAPI provide a granular security model to control access to features
or data sources within the overall platform. The security system is built leveraging the
Apache Shiro library. Additional information on the security system can be found in the
WebAPI documentation.
8.3.2 Documentation
Documentation for ATLAS can be found online in the ATLAS GitHub repository wiki.
This wiki includes information on the various application features as well as links to
online video tutorials.
allows for computing effect-size estimates for many exposures and outcomes, using various
analysis settings, and the package will automatically choose the optimal way to compute
all the required intermediary and final data sets. Steps that can be reused, such as
extraction of covariates, or fitting a propensity model that is used for one target-comparator
pair but multiple outcomes, will be executed only once. Where possible, computations
will take place in parallel to maximize the use of computational resources.
This computational efficiency allows for large-scale analytics, answering many questions
at once, and is also essential for including control hypotheses (e.g. negative controls) to
measure the operating characteristics of our methods, and perform empirical calibration
as described in Chapter 18.
8.4.3 Documentation
R provides a standard way to document packages. Each package has a package manual
that documents every function and data set contained in the package. All package manuals
are available online through the Methods Library website
(https://ohdsi.github.io/MethodsLibrary), through the package GitHub repositories, and
for those packages available through CRAN they can also be found on CRAN.
Furthermore, from within R the package manual can be consulted by using the question
mark. For example, after loading the DatabaseConnector package, typing the command
?connect brings up the documentation on the “connect” function.
In addition to the package manual, many packages provide vignettes. Vignettes are
long-form documentation that describe how a package can be used to perform certain
tasks. For example, one vignette
(https://ohdsi.github.io/CohortMethod/articles/MultipleAnalyses.html) describes how to
perform multiple analyses efficiently using the
CohortMethod package. Vignettes can also be found through the Methods Library website,
through the package GitHub repositories, and for those packages available through
CRAN they can also be found on CRAN.
The database server must hold the observational healthcare data in CDM format. The
Methods Library supports a wide array of database management systems including
traditional database systems (PostgreSQL, Microsoft SQL Server, and Oracle), parallel data
warehouses (Microsoft APS, IBM Netezza, and Amazon Redshift), as well as Big Data
platforms (Hadoop through Impala, and Google BigQuery).
The analytics workstation is where the Methods Library is installed and run. This can
either be a local machine, such as someone’s laptop, or a remote server running RStudio
Server. In all cases the requirements are that R is installed, preferably together with
RStudio. The Methods Library also requires that Java is installed. The analytics workstation
should also be able to connect to the database server; specifically, any firewall between
them should have the database server access ports opened to the workstation. Some of the
analytics can be computationally intensive, so having multiple processing cores and ample
memory can help speed up the analyses. We recommend having at least four cores
and 16 gigabytes of memory.
In Windows, both R and Java come in 32-bit and 64-bit architectures. If you
install R in both architectures, you must also install Java in both architectures. It
is recommended to only install the 64-bit version of R.
Installing R
1. Go to https://cran.r-project.org/, click on “Download R for Windows”, then “base”,
and download the latest R installer.
2. After the download has completed, run the installer. Use the default options
everywhere, with two exceptions: First, it is better not to install into Program Files.
Instead, just make R a subfolder of your C drive as shown in Figure 8.6. Second,
to avoid problems due to differing architectures between R and Java, disable the
32-bit architecture as shown in Figure 8.7.
Once completed, you should be able to select R from your Start Menu.
Installing Rtools
1. Go to https://cran.r-project.org/, click on “Download R for Windows”, then
“Rtools”, and select the very latest version of Rtools to download.
2. After the download has completed, run the installer. Select the default options
everywhere.
Installing RStudio
1. Go to https://www.rstudio.com/, select “Download RStudio” (or the “Download”
button under “RStudio”), opt for the free version, and download the installer for
Windows as shown in Figure 8.8.
2. After downloading, start the installer, and use the default options everywhere.
Installing Java
1. Go to https://java.com/en/download/manual.jsp, and select the Windows 64-bit
installer as shown in Figure 8.9. If you also installed the 32-bit version of R, you
must also install the other (32-bit) version of Java.
install.packages("SqlRender")
library(SqlRender)
translate("SELECT TOP 10 * FROM person;", "postgresql")
install.packages("drat")
drat::addRepo("OHDSI")
install.packages("CohortMethod")
8.5.1 Broadsea
Broadsea uses Docker container technology. The OHDSI tools are packaged along
with their dependencies into a single portable binary file called a Docker image. This image
can then be run on a Docker engine service, creating a virtual machine with all the software
installed and ready to run. Docker engines are available for most operating systems,
including Microsoft Windows, MacOS, and Linux. The Broadsea Docker image contains
the main OHDSI tools, including the Methods Library and ATLAS.
organization’s management tools and best practices. The architecture for OHDSI-on-AWS
is depicted in Figure 8.11.
8.6 Summary
Chapter 9
SQL and R
In OHDSI, we would like to be agnostic to the specific dialect a platform uses; we would
like to ‘speak’ the same SQL language across all OHDSI databases. For this reason
OHDSI developed the SqlRender package, an R package that can translate from one standard
dialect to any of the supported dialects that will be discussed later in this chapter.
This standard dialect, OHDSI SQL, is mainly a subset of the SQL Server SQL dialect.
The example SQL statements provided throughout this chapter will all use OHDSI SQL.
Each database platform also comes with its own software tools for querying the database
using SQL. In OHDSI we developed the DatabaseConnector package, one R package that
can connect to many database platforms. DatabaseConnector will also be discussed later
in this chapter.
So although one can query a database that conforms to the CDM without using any
OHDSI tools, the recommended path is to use the DatabaseConnector and SqlRender
packages. This allows queries that are developed at one site to be used at any other site
without modification. R itself also immediately provides features to further analyze the
data extracted from the database, such as performing statistical analyses and generating
(interactive) plots.
In this chapter we assume the reader has a basic understanding of SQL. We first review
how to use SqlRender and DatabaseConnector. If the reader does not intend to use these
packages these sections can be skipped. In Section 9.3 we discuss how to use SQL (in
this case OHDSI SQL) to query the CDM. The following section highlights how to use
the OHDSI Standardized Vocabulary when querying the CDM. We highlight the QueryLibrary,
a collection of commonly-used queries against the CDM that is publicly available.
We close this chapter with an example study estimating incidence rates, implementing
the study using SqlRender and DatabaseConnector.
9.1 SqlRender
The SqlRender package is available on CRAN (the Comprehensive R Archive Network),
and can therefore be installed using:
install.packages("SqlRender")
SqlRender supports a wide array of technical platforms including traditional database
systems (PostgreSQL, Microsoft SQL Server, SQLite, and Oracle), parallel data warehouses
(Microsoft APS, IBM Netezza, and Amazon Redshift), as well as Big Data platforms
(Hadoop through Impala, and Google BigQuery). The R package comes with a package
manual and a vignette that explores the full functionality. Here we describe some of the
main features.
If-Then-Else
Sometimes blocks of codes need to be turned on or off based on the values of one or
more parameters. This is done using the {Condition} ? {if true} : {if false}
syntax. If the condition evaluates to true or 1, the if true block is used, else the if false
block is shown (if present).
# Example parameterized query using the {Condition} ? {if true} syntax described above
sql <- "SELECT * FROM cohort {@x} ? {WHERE subject_id = 1};"
render(sql, x = TRUE)
render(sql, x = FALSE)

sql <- "SELECT * FROM cohort {@x IN (1,2,3)} ? {WHERE subject_id = 1};"
render(sql, x = 2)
There are limits to what SQL functions and constructs can be translated properly,
both because only a limited set of translation rules have been implemented in the
package, but also some SQL features do not have an equivalent in all dialects. This
is the primary reason why OHDSI SQL was developed as its own, new SQL dialect.
However, whenever possible we have kept to the SQL Server syntax to avoid reinventing
the wheel.
Despite our best efforts, there are quite a few things to consider when writing OHDSI
SQL that will run without error on all supported platforms. In what follows we discuss
these considerations in detail.
-- Simple selects:
SELECT * FROM table;
-- Nested queries:
SELECT * FROM (SELECT * FROM table_1) tmp WHERE a = b;
-- Creating tables:
CREATE TABLE table (field INT);
-- OVER clauses:
SELECT ROW_NUMBER() OVER (PARTITION BY a ORDER BY b)
AS "Row Number" FROM table;
-- UNIONs:
SELECT * FROM a UNION SELECT * FROM b;
-- INTERSECTIONs:
SELECT * FROM a INTERSECT SELECT * FROM b;
-- EXCEPT:
SELECT * FROM a EXCEPT SELECT * FROM b;
String Concatenation
String concatenation is one area where SQL Server is less specific than other dialects.
In SQL Server, one would write SELECT first_name + ' ' + last_name AS
full_name FROM table, but this should be SELECT first_name || ' ' ||
last_name AS full_name FROM table in PostgreSQL and Oracle. SqlRender tries
to guess when values that are being concatenated are strings. In the example above,
because we have an explicit string (the space surrounded by single quotation marks),
the translation will be correct. However, if the query had been SELECT first_name
+ last_name AS full_name FROM table, SqlRender would have had no clue the
two fields were strings, and would incorrectly leave the plus sign. One way to avoid this
ambiguity is to make the string nature of the concatenation explicit, for example by
including an explicit string literal or by using the CONCAT() function to concatenate
strings.
-- Using AS keyword
SELECT *
FROM my_table AS table_1
INNER JOIN (
SELECT * FROM other_table
) AS table_2
ON table_1.person_id = table_2.person_id;
However, Oracle will throw an error when the AS keyword is used. In the above example,
the first query will fail. It is therefore recommended to not use the AS keyword when
aliasing tables. (Note: we can’t make SqlRender handle this, because it can’t easily
distinguish between table aliases, where Oracle doesn’t allow AS to be used, and field
aliases, where Oracle requires AS to be used.)
Temp Tables
Temp tables can be very useful to store intermediate results, and when used correctly
can dramatically improve performance of queries. On most database platforms temp tables
have very nice properties: they’re only visible to the current user, are automatically
dropped when the session ends, and can be created even when the user has no write access.
Unfortunately, in Oracle temp tables are basically permanent tables, with the only
difference that the data inside the table is only visible to the current user. This is why, in
Oracle, SqlRender will try to emulate temp tables by
1. Adding a random string to the table name so tables from different users will not
conflict.
2. Allowing the user to specify the schema where the temp tables will be created.
For example:
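A sketch of what this looks like (the argument name for the emulation schema has
changed between SqlRender versions; older versions call it oracleTempSchema, newer
ones tempEmulationSchema):

library(SqlRender)
sql <- "SELECT * FROM #children;"
translate(sql, targetDialect = "oracle", oracleTempSchema = "temp_schema")
# The #children temp table is emulated as a table in temp_schema, with a
# session-specific prefix added to its name.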
Note that the user will need to have write privileges on temp_schema.
Also note that, because Oracle has a limit on table names of 30 characters, temp table
names are only allowed to be at most 22 characters long; otherwise the name will
become too long after appending the session ID.
Furthermore, remember that temp tables are not automatically dropped on Oracle, so you
will need to explicitly TRUNCATE and DROP all temp tables once you’re done with them
to prevent orphan tables accumulating in the Oracle temp schema.
Implicit Casts
One of the few points where SQL Server is less explicit than other dialects is that it allows
implicit casts. For example, this code will work on SQL Server:
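The code itself is not reproduced here; a sketch of the kind of statement that is meant
(creating a VARCHAR column and then comparing it to an integer) is:

CREATE TABLE #temp (txt VARCHAR(10));
INSERT INTO #temp SELECT '1';
SELECT * FROM #temp WHERE txt = 1;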
Even though txt is a VARCHAR field and we are comparing it with an integer, SQL
Server will automatically cast one of the two to the correct type to allow the comparison.
In contrast, other dialects such as PostgreSQL will throw an error when trying to compare
a VARCHAR with an INT.
You should therefore always make casts explicit. In the example above, the comparison
could be written as WHERE txt = CAST(1 AS VARCHAR) or as
WHERE CAST(txt AS INT) = 1.
Schemas and Databases
In SQL Server, tables reside in a schema within a database (for example
cdm_data.dbo.person), whereas on most other platforms tables simply reside in a schema
(for example cdm_data.person). To write OHDSI SQL that works in both situations,
it is preferred to use a single parameter for the full table prefix, for example
SELECT * FROM @databaseSchema.person,
where on SQL Server we can include both database and schema names in the value:
databaseSchema = "cdm_data.dbo". On other platforms, we can use the same
code, but now only specify the schema as the parameter value: databaseSchema =
"cdm_data".
The one situation where this will fail is the USE command, since USE cdm_data.dbo;
will throw an error. It is therefore preferred not to use the USE command, but always
specify the database / schema where a table is located.
A Shiny app is included in the SqlRender package for interactively editing source SQL
and generating rendered and translated SQL. The app can be started using:
launchSqlRenderDeveloper()
That will open the default browser with the app shown in Figure 9.1. The app is also
publicly available on the web.
In the app you can enter OHDSI SQL, select the target dialect as well as provide values
for the parameters that appear in your SQL, and the translation will automatically appear
at the bottom.
9.2 DatabaseConnector
DatabaseConnector is an R package for connecting to various database platforms using
Java’s JDBC drivers. The DatabaseConnector package is available on CRAN (the
Comprehensive R Archive Network), and can therefore be installed using:
install.packages("DatabaseConnector")
The package already includes the JDBC drivers for most supported platforms, but because
of licensing reasons the drivers for BigQuery, Netezza and Impala are not included and
must be obtained by the user. Type ?jdbcDrivers for instructions
on how to download these drivers. Once downloaded, you can use the pathToDriver
argument of the connect, dbConnect, and createConnectionDetails functions.
disconnect(conn)
Note that, instead of providing the server name, it is also possible to provide the JDBC
connection string if this is more convenient:
schema = "cdm")
conn <- connect(details)
9.2.2 Querying
The main functions for querying the database are the querySql and executeSql functions.
The difference between these functions is that querySql expects data to be returned by
the database, and can handle only one SQL statement at a time. In contrast, executeSql
does not expect data to be returned, and accepts multiple SQL statements in a single SQL
string.
Some examples:
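The original examples are not reproduced here; calls along these lines (a sketch, assuming
a connection object conn and a PERSON table in the connected schema) illustrate the
difference between the two functions:

querySql(conn, "SELECT COUNT(*) FROM person")
executeSql(conn, "TRUNCATE TABLE foo; DROP TABLE foo;")
# (assumes a table named foo exists; executeSql can run multiple statements at once)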
Both functions provide extensive error reporting: When an error is thrown by the server,
the error message and the offending piece of SQL are written to a text file to allow better
debugging. The executeSql function also by default shows a progress bar, indicating the
percentage of SQL statements that has been executed. If those attributes are not desired,
the package also offers the lowLevelQuerySql and lowLevelExecuteSql functions.
x <- renderTranslateQuerySql(conn,
sql = "SELECT TOP 10 * FROM @schema.person",
schema = "cdm_synpuf")
Note that the SQL Server-specific ‘TOP 10’ syntax will be translated to, for example,
‘LIMIT 10’ on PostgreSQL, and that the SQL parameter @schema will be instantiated
with the provided value ‘cdm_synpuf’.
data(mtcars)
insertTable(conn, "mtcars", mtcars, createTable = TRUE)
In this example, we’re uploading the mtcars data frame to a table called ‘mtcars’ on the
server, which will be automatically created.
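9.3 Querying the CDM
The first result below comes from a simple count of the rows in the PERSON table; a
query along these lines (a sketch, assuming the CDM tables are in a schema referenced
as @cdm) produces it:

SELECT COUNT(*) AS person_count
FROM @cdm.person;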
PERSON_COUNT
26299001
SELECT AVG(DATEDIFF(DAY,
observation_period_start_date,
observation_period_end_date) / 365.25) AS num_years
FROM @cdm.observation_period;
NUM_YEARS
1.980803
We can join tables to produce additional statistics. A join combines fields from multiple
tables, typically by requiring specific fields in the tables to have the same value. For
example, here we join the PERSON table to the OBSERVATION_PERIOD table on the
PERSON_ID fields in both tables. In other words, the result of the join is a new table-like
set that has all the fields of the two tables, but in all rows the PERSON_ID fields from the
two tables must have the same value. We can now for example compute the maximum
age at observation end by using the OBSERVATION_PERIOD_END_DATE field from
the OBSERVATION_PERIOD table together with the year_of_birth field of the PERSON
table:
SELECT MAX(YEAR(observation_period_end_date) -
year_of_birth) AS max_age
FROM @cdm.person
INNER JOIN @cdm.observation_period
ON person.person_id = observation_period.person_id;
MAX_AGE
90
A much more complicated query is needed to determine the distribution of age at the start
of observation. In this query, we first join the PERSON to the OBSERVATION_PERIOD
table to compute age at start of observation. We also compute the ordering for this joined
set based on age, and store it as order_nr. Because we want to use the result of this join
multiple times, we define it as a common table expression (CTE) (defined using WITH
... AS) that we call “ages,” meaning we can refer to ages as if it is an existing table.
We count the number of rows in ages to produce “n”, and then for each quantile find the
minimum age among the rows where order_nr is at least the fraction times n. For example,
to find the median we use the minimum age where order_nr >= 0.50 * n. The minimum
and maximum age are computed separately:
WITH ages
AS (
SELECT age,
ROW_NUMBER() OVER (
ORDER BY age
) order_nr
FROM (
SELECT YEAR(observation_period_start_date) - year_of_birth AS age
FROM @cdm.person
INNER JOIN @cdm.observation_period
ON person.person_id = observation_period.person_id
) age_computed
)
SELECT MIN(age) AS min_age,
MIN(CASE
WHEN order_nr < .25 * n
THEN 9999
ELSE age
END) AS q25_age,
MIN(CASE
WHEN order_nr < .50 * n
THEN 9999
ELSE age
END) AS median_age,
MIN(CASE
WHEN order_nr < .75 * n
THEN 9999
ELSE age
END) AS q75_age,
MAX(age) AS max_age
FROM ages
CROSS JOIN (
SELECT COUNT(*) AS n
FROM ages
) population_size;
More complex computations can also be performed in R instead of using SQL. For
example, we can get the same answer using this R code:
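The original R code is not reproduced here; a sketch of how the same age distribution
could be computed in R (assuming the CDM schema name is “cdm” and an open
connection conn) is:

sql <- "SELECT YEAR(observation_period_start_date) - year_of_birth AS age
        FROM @cdm.person
        INNER JOIN @cdm.observation_period
          ON person.person_id = observation_period.person_id;"
age <- renderTranslateQuerySql(conn, sql, cdm = "cdm")
quantile(age$AGE, c(0, 0.25, 0.50, 0.75, 1))

The next results, a list of condition source values and their record counts, belong to a
query on the most frequently used condition source codes; a sketch of such a query is:

SELECT TOP 10 condition_source_value,
       COUNT(*) AS code_count
FROM @cdm.condition_occurrence
GROUP BY condition_source_value
ORDER BY code_count DESC;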
CONDITION_SOURCE_VALUE CODE_COUNT
4019 49094668
25000 36149139
78099 28908399
319 25798284
31401 22547122
317 22453999
311 19626574
496 19570098
I10 19453451
3180 18973883
may wish to count the number of persons in the database stratified by gender, and it would
be convenient to resolve the GENDER_CONCEPT_ID field to a concept name:
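A sketch of such a query (again assuming the CDM tables in @cdm) is:

SELECT COUNT(*) AS subject_count,
       concept_name
FROM @cdm.person
INNER JOIN @cdm.concept
  ON gender_concept_id = concept_id
GROUP BY concept_name;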
SUBJECT_COUNT CONCEPT_NAME
14927548 FEMALE
11371453 MALE
A very powerful feature of the Vocabulary is its hierarchy. A very common query looks
for a specific concept and all of its descendants. For example, imagine we wish to count
the number of prescriptions containing the ingredient ibuprofen:
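A sketch of such a query, using the CONCEPT_ANCESTOR table to find all drugs that
descend from the ibuprofen ingredient concept, is:

SELECT COUNT(*) AS prescription_count
FROM @cdm.drug_exposure
INNER JOIN @cdm.concept_ancestor
  ON drug_concept_id = descendant_concept_id
INNER JOIN @cdm.concept ingredient
  ON ancestor_concept_id = ingredient.concept_id
WHERE ingredient.concept_name = 'Ibuprofen'
  AND ingredient.concept_class_id = 'Ingredient'
  AND ingredient.standard_concept = 'S';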
PRESCRIPTION_COUNT
26871214
9.5 QueryLibrary
QueryLibrary is a library of commonly-used SQL queries for the CDM. It is available as
an online application (http://data.ohdsi.org/QueryLibrary) shown in Figure 9.2, and as an
R package (https://github.com/OHDSI/QueryLibrary).
The purpose of the library is to help new users learn how to query the CDM. The queries
in the library have been reviewed and approved by the OHDSI community. The query
library is primarily intended for training purposes, but it is also a valuable resource for
experienced users.
The QueryLibrary makes use of SqlRender to output the queries in the SQL dialect of
choice. Users can also specify the CDM database schema, vocabulary database schema
(if separate), and the Oracle temp schema (if needed), so the queries will be automatically
rendered with these settings.
9.6.2 Exposure
We’ll define exposure as the first exposure to lisinopril. By first we mean no earlier
exposure to lisinopril. We require 365 days of continuous observation time prior to the
first exposure.
9.6.3 Outcome
We define angioedema as any occurrence of an angioedema diagnosis code during an
inpatient or emergency room (ER) visit.
9.6.4 Time-At-Risk
We will compute the incidence rate in the first week following treatment initiation,
irrespective of whether patients were exposed for the full week.
library(DatabaseConnector)
conn <- connect(dbms = "postgresql",
server = "localhost/postgres",
user = "joe",
password = "secret")
cdmDbSchema <- "cdm"
cohortDbSchema <- "scratch"
cohortTable <- "my_cohorts"
Here we have parameterized the database schema and table names, so we can easily adapt
them to different environments. The result is an empty table on the database server.
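Next, the exposure cohort is inserted into this table. The exact SQL is not reproduced
here; a sketch that follows the logic described below (the first lisinopril drug era per
person, with at least 365 days of prior observation; 1308216 is assumed to be the RxNorm
ingredient concept ID for lisinopril) is:

sql <- "
INSERT INTO @cohort_db_schema.@cohort_table (
  cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
SELECT 1 AS cohort_definition_id,
       first_era.person_id AS subject_id,
       first_era.drug_era_start_date AS cohort_start_date,
       first_era.drug_era_end_date AS cohort_end_date
FROM (
  SELECT person_id,
         MIN(drug_era_start_date) AS drug_era_start_date,
         MIN(drug_era_end_date) AS drug_era_end_date
  FROM @cdm_db_schema.drug_era
  WHERE drug_concept_id = 1308216 -- lisinopril (ingredient)
  GROUP BY person_id
) first_era
INNER JOIN @cdm_db_schema.observation_period
  ON first_era.person_id = observation_period.person_id
    AND drug_era_start_date >= observation_period_start_date
    AND drug_era_start_date <= observation_period_end_date
WHERE DATEDIFF(DAY, observation_period_start_date, drug_era_start_date) >= 365;
"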
renderTranslateExecuteSql(conn, sql,
cohort_db_schema = cohortDbSchema,
cohort_table = cohortTable,
cdm_db_schema = cdmDbSchema)
Here we use the DRUG_ERA table, a standard table in the CDM that is automatically
derived from the DRUG_EXPOSURE table. The DRUG_ERA table contains eras of
continuous exposure at the ingredient level. We can thus search for lisinopril, and this
will automatically identify all exposures to drugs containing lisinopril. We take the first
drug exposure per person, and then join to the OBSERVATION_PERIOD table, and because
a person can have several observation periods we must make sure we only join to
the period containing the drug exposure. We then require at least 365 days between the
OBSERVATION_PERIOD_START_DATE and the COHORT_START_DATE.
renderTranslateExecuteSql(conn, sql,
cohort_db_schema = cohortDbSchema,
cohort_table = cohortTable,
cdm_db_schema = cdmDbSchema)
ON tar.subject_id = angioedema.subject_id
AND tar.cohort_start_date <= angioedema.cohort_start_date
AND tar.cohort_end_date >= angioedema.cohort_start_date
WHERE cohort_definition_id = 2 -- Outcome
GROUP BY gender,
age
) events
ON days.gender = events.gender
AND days.age = events.age;
"
We first create “tar”, a CTE that contains all exposures with the appropriate time-at-risk.
Note that we truncate the time-at-risk at the OBSERVATION_PERIOD_END_DATE. We
also compute the age in 10-year bins, and identify the gender. The advantage of using
a CTE is that we can use the same set of intermediate results several times in a query.
In this case we use it to count the total amount of time-at-risk, as well as the number of
angioedema events that occur during the time-at-risk.
With the help of the ggplot2 package we can easily plot our results:
library(ggplot2)
ggplot(results, aes(x = age, y = ir, group = gender, color = gender)) +
geom_line() +
xlab("Age") +
ylab("Incidence (per 1,000 patient weeks)")
9.7.4 Clean Up
Don’t forget to clean up the table we created, and to close the connection:
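The cleanup SQL is not reproduced here; a sketch that removes the cohort table before
disconnecting is:

sql <- "TRUNCATE TABLE @cohort_db_schema.@cohort_table;
DROP TABLE @cohort_db_schema.@cohort_table;"
renderTranslateExecuteSql(conn, sql,
                          cohort_db_schema = cohortDbSchema,
                          cohort_table = cohortTable)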
disconnect(conn)
9.7.5 Compatibility
Because we use OHDSI SQL together with DatabaseConnector and SqlRender throughout,
the code we reviewed here will run on any database platform supported by OHDSI.
Note that for demonstration purposes we chose to create our cohorts using hand-crafted
SQL. It would probably have been more convenient to construct the cohort definitions in
ATLAS, and use the SQL generated by ATLAS to instantiate the cohorts. ATLAS also
produces OHDSI SQL, and can therefore easily be used together with SqlRender and
DatabaseConnector.
9.8 Summary
9.9 Exercises
Prerequisites
For these exercises we assume R, RStudio and Java have been installed as described
in Section 8.4.5. Also required are the SqlRender, DatabaseConnector, and Eunomia
packages, which can be installed using:
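For example (a sketch; your environment may require a different installation source or version):

install.packages(c("SqlRender", "DatabaseConnector", "Eunomia"))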
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
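For example, assuming the Eunomia package is installed:

library(Eunomia)
connectionDetails <- getEunomiaConnectionDetails()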
Exercise 9.1. Using SQL and R, compute how many people are in the database.
Exercise 9.2. Using SQL and R, compute how many people have at least one prescription
of celecoxib.
Exercise 9.3. Using SQL and R, compute how many diagnoses of gastrointestinal hemorrhage occur during exposure to celecoxib. (Hint: the concept ID for gastrointestinal hemorrhage is 192671.)
Defining Cohorts
Observational health data, also referred to as real-world data, are data related to patient health status and/or the delivery of health care that are routinely collected from a variety of sources. As such, OHDSI data stewards (OHDSI collaborators who maintain data in CDM format for their sites) may capture data from a number of sources including Electronic Health Records (EHR), health insurance claims and billing activities, product and disease registries, patient-generated data (including in home-use settings), and data gathered from other sources that can inform on health status, such as mobile devices. As these data were not collected for research purposes, they may not explicitly capture the clinical data elements we are interested in.
For example, a health insurance claims database is designed to capture all care provided for some condition (e.g. angioedema) so the associated costs can appropriately be reimbursed, and information on the actual condition is captured only as part of this aim. If we wish to use such observational data for research purposes, we will often have to write some logic that uses what is captured in the data to infer what we are really interested in. In other words, we often need to create a cohort using some definition of how a clinical event manifests. Thus, if we want to identify angioedema events in an insurance claims database, we may define logic requiring an angioedema diagnosis code recorded in an emergency room setting, to distinguish it from claims that merely describe follow-up care for some past angioedema occurrence. Similar considerations may apply for data captured during routine healthcare interactions logged in an EHR. As data are being used for a secondary purpose, we must be cognizant of what each database was originally designed to do. Each time we design a study, we must think through the nuances of how our cohort exists in a variety of healthcare settings.
This chapter serves to explain what is meant by creating and sharing cohort definitions, the methods for developing cohorts, and how to build your own cohorts using ATLAS or SQL.
A cohort is a set of persons who satisfy one or more inclusion criteria for a duration
of time.
It is important to realize that this definition of a cohort used in OHDSI might differ from that used by others in the field. For example, in many peer-reviewed scientific manuscripts, a cohort is suggested to be analogous to a code set of specific clinical codes (e.g. ICD-9/ICD-10, NDC, HCPCS, etc.). While code sets are an important piece in assembling a cohort, a cohort is not defined by a code set. A cohort requires specific logic for how to use the code set for the criteria (e.g. is it the first occurrence of the ICD-9/ICD-10 code? any occurrence?). A well-defined cohort specifies how a patient enters a cohort and how a patient exits a cohort.
There are unique nuances to utilizing OHDSI’s definition of a cohort, including:
• One person may belong to multiple cohorts
• One person may belong to the same cohort for multiple different time periods
• One person may not belong to the same cohort multiple times during the same
period of time
• A cohort may have zero or more members
There are two main approaches to constructing a cohort:
1. Rule-based cohort definitions use explicit rules to describe when a patient is in the cohort. Defining these rules typically relies heavily on the domain expertise of the individual designing the cohort to use their knowledge of the therapeutic area of interest to build rules for cohort inclusion criteria.
2. Probabilistic cohort definitions use a probabilistic model to compute a probability between 0 and 100% of the patient being in the cohort. This probability can be turned into a yes-no classification using some threshold, or in some study designs can be used as is. The probabilistic model is typically trained using machine learning (e.g. logistic regression) on some example data to automatically identify the relevant patient characteristics that are predictive.
The next sections will discuss these approaches in further detail.
Cohort entry event: The cohort entry event is defined by the CDM domain(s) where the data are stored, the concept set(s) representing the clinical activity (e.g. SNOMED codes for conditions, RxNorm codes for drugs), as well as any other specific attributes (e.g. age at occurrence, first diagnosis/procedure/etc., specifying start and end date, specifying visit type or criteria, days supply, etc.). The set of people having an entry event is referred to as the initial event cohort.
Inclusion criteria: Inclusion criteria are applied to the initial event cohort to further restrict the set of people. Each inclusion criterion is defined by the CDM domain(s) where the data are stored, concept set(s) representing the clinical activity, domain-specific attributes (e.g. days supply, visit type, etc.), and the temporal logic relative to the cohort index date. Each inclusion criterion can be evaluated to determine the impact of the criteria on the attrition of persons from the initial event cohort. The qualifying cohort is defined as all people in the initial event cohort that satisfy all inclusion criteria.
Cohort exit criteria: The cohort exit event signifies when a person no longer qualifies
for cohort membership. Cohort exit can be defined in multiple ways such as the end of
the observation period, a fixed time interval relative to the initial entry event, the last
event in a sequence of related observations (e.g. persistent drug exposure) or through
other censoring of observation period. Cohort exit strategy will impact whether a person
can belong to the cohort multiple times during different time intervals.
In the OHDSI tools there is no distinction between inclusion and exclusion criteria. All criteria are formulated as inclusion criteria. For example, the exclusion criterion "Exclude people with prior hypertension" can be formulated as the inclusion criterion "Include people with 0 occurrences of prior hypertension".
• Exclude: Exclude this concept (and any of its descendants if selected) from the
concept set.
• Descendants: Consider not only this concept, but also all of its descendants.
• Mapped: Allow searching for non-standard concepts.
For example, a concept set expression could contain two concepts as depicted in Table 10.1. Here we include concept 4329847 ("Myocardial infarction") and all of its descendants, but exclude concept 314666 ("Old myocardial infarction") and all of its descendants.
As shown in Figure 10.2, this will include "Myocardial infarction" and all of its descendants except "Old myocardial infarction" and its descendants. In total, this concept set expression implies nearly a hundred Standard Concepts. These Standard Concepts in turn reflect hundreds of source codes (e.g. ICD-9 and ICD-10 codes) that may appear in the various databases.
Figure 10.2: A concept set including ”Myocardial infarction” (with descendants), but
excluding ”Old myocardial infarction” (with descendants).
library, the entries of which are held to specific standards of design and evaluation. For additional information related to the GSPL, consult the OHDSI workgroup page.² Research within this workgroup includes APHRODITE (Banda et al., 2017) and the PheValuator tool (Swerdel et al., 2019), discussed in the prior section, as well as work done to share the Electronic Medical Records and Genomics (eMERGE) Phenotype Library across the OHDSI network (Hripcsak et al., 2019). If phenotype curation is your interest, consider contributing to this workgroup.
With this context in mind, we are now going to build our cohort. As we go through this exercise, we will approach building our cohort similarly to a standard attrition chart. Figure 10.3 shows the logical framework for how we want to build this cohort.
You can build a cohort in the user interface of ATLAS or you can write a query directly
against your CDM. We will briefly discuss both in this chapter.
² https://www.ohdsi.org/web/wiki/doku.php?id=projects:workgroups:goldlibrarywg
Before you do anything else, you are encouraged to change the name of the cohort from "New Cohort Definition" to your own unique name for this cohort. You may opt for a name like "New users of ACE inhibitors as first-line monotherapy for hypertension".
ATLAS will not allow two cohorts to have the exact same name. ATLAS will give you a pop-up error message if you choose a name already used by another ATLAS cohort.
Once you have chosen a name, you can save the cohort by clicking the save button.
based criteria, our question would be looking for patients with a specific drug or drug class. Since we want to find patients who initiate ACE inhibitor monotherapy as first-line treatment for hypertension, we want to choose a DRUG_EXPOSURE criterion. You may say, "but we also care about hypertension as a diagnosis". You are correct. Hypertension is another criterion we will build. However, the cohort start date is defined by the initiation of the ACE inhibitor treatment, which is therefore the initial event. The diagnosis of hypertension is what we call an additional qualifying criterion. We will return to this once we have built this criterion. We will click "Add Drug Exposure".
The screen will update with your selected criteria, but you are not done yet. As we see in Figure 10.6, ATLAS does not know what drug we are looking for. We need to tell ATLAS which concept set is associated with ACE inhibitors.
You will need to open the dialogue box that will allow you to retrieve a concept set to define ACE inhibitors.
When you have found terms that you would like to use to define this drug exposure, you can select a concept by clicking on it. You can return to your cohort definition by using the left arrow in the top left of Figure 10.7. You can refer back to Chapter 5 (Standardized Vocabularies) on how to navigate the vocabularies to find clinical concepts of interest.
Figure 10.8 shows our concept set expression. We selected all ACE inhibitor ingredients
we are interested in, and include all their descendants, thus including all drugs that contain
any of these ingredients. We can click on “Included concepts” to see all 21,536 concepts
implied by this expression, or we can click on “Included Source Codes” to explore all
source codes in the various coding systems that are implied.
the concept set repository of your ATLAS instance, as shown in Figure 10.9. In the example figure the user is retrieving concept sets stored in ATLAS. The user typed the name given to this concept set, "ace inhibitors", into the search box on the right. This shortened the concept set list to only concept sets with matching names. From there, the user can click on the row of the concept set to select it. (Note: the dialogue box will disappear once you have selected a concept set.) You will know this action is successful when the Any Drug box is updated with the name of the concept set you selected.
The current design of ATLAS may confuse some. Despite its appearance, the X shown next to a criterion is not intended to mean "No". It is an actionable feature that allows the user to delete the criterion. If you click the X, the criterion will go away. Thus, you need to leave the X in place to keep the criterion active.
Now you have built an initial qualifying event. To ensure you are capturing the first observed drug exposure, you will want to add a lookback window to know that you are looking at enough of the patient's history to know what comes first. It is possible that a patient with a short observation period may have received an exposure elsewhere that we do not see. We cannot control this, but we can mandate a minimum amount of time the patient must be in the data prior to the index date. You can do this by adjusting the continuous observation drop-downs. You could also click the box and type in a value for these windows. We will require 365 days of continuous observation prior to the initial event. You will update the initial event to require continuous observation of 365 days before the index date, as shown in Figure 10.10. This lookback window is at the discretion of your study team; you may choose differently in other cohorts. This creates, as best as we are able, a minimum period of time we see the patient to ensure we are capturing the first record. This criterion is about prior history and does not involve time after the index event. Therefore, we require 0 days after the index event. Our qualifying event is the first-ever use of ACE inhibitors. Thus, we limit initial events to the "earliest event" per person.
Figure 10.10: Setting the required continuous observation before the index date.
To further explain how this logic comes together, you can think about assembling patient timelines.
In Figure 10.11, each line represents a single patient that may be eligible to join the cohort. The filled-in stars represent times when the patient fulfills the specified criteria. As additional criteria are applied, you may see some stars in a lighter shade. This means that these patients have other records that satisfy the criteria, but there is another record that precedes them. By the time we get to the last criterion, we are looking at the cumulative view of patients who have ACE inhibitors for the first time and have 365 days of observation prior to that first occurrence. Logically, limiting to the initial event is redundant, though it is helpful to maintain our explicit logic in every selection we make. When you are building your own cohorts, you may opt to engage the Researchers section of the OHDSI Forum to get a second opinion on how to construct your cohort logic.
criteria. If you opt to add criteria in the "New inclusion criteria" section, you will get an attrition chart showing how many patients are lost by applying additional inclusion criteria. It is highly encouraged to utilize the Inclusion Criteria section so you can understand the impact of each rule on the overall success of the cohort definition. You may find that a certain inclusion criterion severely limits the number of people who end up in the cohort. You may choose to relax this criterion to get a larger cohort. This will ultimately be at the discretion of the expert consensus assembling this cohort.
You will now want to click "New inclusion criteria" to add a subsequent piece of logic about membership in this cohort. The functionality in this section is identical to the way we discussed building cohort criteria above. You may specify the criteria and add specific attributes. Our first additional criterion is to subset the cohort to only patients: with at least 1 occurrence of hypertension disorder between 365 days before and 0 days after the index date (first initiation of an ACE inhibitor). You will click "New inclusion criteria" to add a new criterion. You should name your criterion and, if desired, add a short description of what you are looking for. This is for your own purposes to recall what you built – it will not impact the integrity of the cohort you are defining.
Once you have annotated this new criterion, you will click on the "+Add criteria to group" button to build the actual criteria for this rule. This button functions similarly to "Add Initial Event" except we are no longer specifying an initial event. We could add multiple criteria to this group – which is why it specifies "add criteria to group". An example would be if you have multiple ways of finding a disease (e.g. logic for a CONDITION_OCCURRENCE, logic using a DRUG_EXPOSURE as a proxy for this condition, logic using a MEASUREMENT as a proxy for this condition). These would be separate domains and require different criteria but can be grouped into one criterion looking for this condition. In this case, we want to find a diagnosis of hypertension so we "Add condition occurrence". We will follow similar steps as we did with the initial event by attaching a concept set to this record. We also want to specify that the event starts between 365 days before and 0 days after the index date (the occurrence of the first ACE inhibitor use). Now check your logic against Figure 10.12.
You will then want to add another criterion to look for patients: with exactly 0 occurrences of hypertension drugs ALL days before and 1 day before the index start date (no exposure to hypertension drugs before the ACE inhibitor). This process begins as before by clicking the "New inclusion criteria" button, adding your annotations to this criterion and then clicking "+Add criteria to group". This is a DRUG_EXPOSURE so you will click "Add Drug Exposure", attach a concept set for hypertension drugs, and specify ALL days before and 0 days after (or "1 days before", which is equivalent, as seen in the figure) the index date. Make sure to confirm you have exactly 0 occurrences selected. Now check your logic against Figure 10.13.
and append the concept set for "ACE inhibitors". Now check your logic against Figure 10.15.
In the case of this cohort, there are no other censoring events. However, you may build other cohorts where you need to specify these criteria. You would proceed similarly to the way we have added other attributes to this cohort definition. You have now successfully finished creating your cohort. Make sure to hit the save button. Congratulations! Building a cohort is the most important building block of answering a question in the OHDSI tools. You can now use the "Export" tab to share your cohort definition with other collaborators in the form of SQL code or JSON files to load into ATLAS.
library(DatabaseConnector)
connDetails <- createConnectionDetails(dbms = "postgresql",
                                       server = "localhost/ohdsi",
                                       user = "joe",
                                       password = "supersecret")
conn <- connect(connDetails)
cdmDbSchema <- "my_cdm_data"
cohortDbSchema <- "scratch"
cohortTable <- "my_cohorts"
The last three lines define the cdmDbSchema, cohortDbSchema, and cohortTable variables. We will use these later to tell R where the data in CDM format live, and where the cohorts of interest have to be created. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, so for example cdmDbSchema <- "my_cdm_data.dbo".
FROM @cdm_db_schema.drug_exposure
INNER JOIN @cdm_db_schema.concept_ancestor
ON descendant_concept_id = drug_concept_id
WHERE ancestor_concept_id IN (@ace_i)
GROUP BY person_id;"
renderTranslateExecuteSql(conn,
                          sql,
                          cdm_db_schema = cdmDbSchema,
                          ace_i = aceI)
renderTranslateExecuteSql(conn,
                          sql,
                          cdm_db_schema = cdmDbSchema,
                          hypertension = hypertension)
Note that we SELECT DISTINCT, because otherwise, if a person has multiple hypertension diagnoses in their past, we would create duplicate cohort entries.
renderTranslateExecuteSql(conn,
                          sql,
                          cdm_db_schema = cdmDbSchema,
                          all_ht_drugs = allHtDrugs)
Note that we use a left join, and only allow rows where the person_id that comes from the DRUG_EXPOSURE table is NULL, meaning no matching record was found.
10.8.7 Monotherapy
We require there to be only one exposure to hypertension treatment in the first seven days
of the cohort entry:
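The SQL bound to sql is not shown here; a rough sketch of the intended logic, assuming the candidate cohort entries are held in a temp table named #cohort (a hypothetical name), could be:

sql <- "
DELETE FROM #cohort
WHERE person_id IN (
  SELECT cohort.person_id
  FROM #cohort cohort
  INNER JOIN @cdm_db_schema.drug_exposure
    ON cohort.person_id = drug_exposure.person_id
  INNER JOIN @cdm_db_schema.concept_ancestor
    ON descendant_concept_id = drug_concept_id
  WHERE ancestor_concept_id IN (@all_ht_drugs)
    AND drug_exposure_start_date >= cohort.cohort_start_date
    AND drug_exposure_start_date <= DATEADD(DAY, 7, cohort.cohort_start_date)
  GROUP BY cohort.person_id
  HAVING COUNT(*) > 1
);
"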
renderTranslateExecuteSql(conn,
                          sql,
                          cdm_db_schema = cdmDbSchema,
                          all_ht_drugs = allHtDrugs)
ends.era_end_date AS cohort_end_date
INTO #exposure_era
FROM (
SELECT exposure.person_id,
exposure.concept_id,
exposure.exposure_start_date,
MIN(events.end_date) AS era_end_date
FROM #exposure exposure
JOIN (
--cteEndDates
SELECT person_id,
concept_id,
DATEADD(DAY, - 1 * @max_gap, event_date) AS end_date
FROM (
SELECT person_id,
concept_id,
event_date,
event_type,
MAX(start_ordinal) OVER (
PARTITION BY person_id ,concept_id ORDER BY event_date,
event_type ROWS UNBOUNDED PRECEDING
) AS start_ordinal,
ROW_NUMBER() OVER (
PARTITION BY person_id, concept_id ORDER BY event_date,
event_type
) AS overall_ord
FROM (
-- select the start dates, assigning a row number to each
SELECT person_id,
concept_id,
exposure_start_date AS event_date,
0 AS event_type,
ROW_NUMBER() OVER (
PARTITION BY person_id, concept_id ORDER BY exposure_start_date
) AS start_ordinal
FROM #exposure exposure
UNION ALL
-- add the end dates with NULL as the row number, padding the end dates by
-- @max_gap to allow a grace period for overlapping ranges.
SELECT person_id,
concept_id,
DATEADD(day, @max_gap, exposure_end_date),
1 AS event_type,
NULL
FROM #exposure exposure
) rawdata
) events
renderTranslateExecuteSql(conn,
                          sql,
                          cdm_db_schema = cdmDbSchema,
                          max_gap = 30)
This code merges all subsequent exposures, allowing for a gap between exposures as
defined by the max_gap argument. The resulting drug exposure eras are written to a
temp table called #exposure_era.
Next, we simply join these ACE inhibitor exposure eras to our original cohort to use the
era end dates as our cohort end dates:
renderTranslateExecuteSql(conn,
                          sql,
                          cohort_db_schema = cohortDbSchema,
                          cohort_table = cohortTable)
Here we store the final cohort in the schema and table we defined earlier. We assign it a cohort definition ID of 1, to distinguish it from other cohorts we may wish to store in the same table.
10.8.9 Cleanup
Finally, it is always recommended to clean up any temp tables that were created, and to disconnect from the database server:
renderTranslateExecuteSql(conn, sql)
disconnect(conn)
10.9 Summary
– A cohort is a set of persons who satisfy one or more inclusion criteria for a duration of time.
– A cohort definition is the description of the logic used for identifying a particular cohort.
– Cohorts are used (and reused) throughout the OHDSI analytics tools to define, for example, the exposures and outcomes of interest.
– There are two major approaches to building a cohort: rule-based and probabilistic.
– Rule-based cohort definitions can be created in ATLAS, or using SQL.
10.10 Exercises
Prerequisites
For the first exercise, access to an ATLAS instance is required. You can use the instance
at http://atlasdemo.ohdsi.org, or any other instance you have access to.
Exercise 10.1. Use ATLAS to create a cohort definition following these criteria:
• New users of diclofenac
• Ages 16 or older
• With at least 365 days of continuous observation prior to exposure
• Without prior exposure to any NSAID (Non-Steroidal Anti-Inflammatory Drug)
• Without prior diagnosis of cancer
• With cohort exit defined as discontinuation of exposure (allowing for a 30-day gap)
Prerequisites
For the second exercise we assume R, RStudio and Java have been installed as described
in Section 8.4.5. Also required are the SqlRender, DatabaseConnector, and Eunomia
packages, which can be installed using:
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
Exercise 10.2. Use SQL and R to create a cohort for acute myocardial infarction (AMI) in the existing COHORT table, following these criteria:
• An occurrence of a myocardial infarction diagnosis (concept 4329847 "Myocardial infarction" and all of its descendants, excluding concept 314666 "Old myocardial infarction" and any of its descendants).
• During an inpatient or ER visit (concepts 9201, 9203, and 262 for "Inpatient visit", "Emergency Room Visit", and "Emergency Room and Inpatient Visit", respectively).
Characterization
Use-cases for characterization include disease natural history, treatment utilization and quality improvement. In this chapter we will describe the methods for characterization. We will use a population of hypertensive persons to demonstrate how to use ATLAS and R to perform these characterization tasks.
depression respectively. The events for each person were then aggregated to a set of
summary statistics and visualized for each condition and for each database.
11.4 Incidence
Incidence rates and proportions are statistics that are used in public health to assess the occurrence of a new outcome in a population during a time-at-risk (TAR). Figure 11.2 aims to show the components of an incidence calculation for a single person:
In Figure 11.2, a person has a period of time where they are observed in the data, denoted by their observation start and end time. Next, the person has a point in time where they enter and exit a cohort by meeting some eligibility criteria. The time-at-risk window then denotes when we seek to understand the occurrence of an outcome. If the outcome falls into the TAR, we count that as an incidence of the outcome.
An incidence proportion provides a measure of the new outcomes per person in the population during the time-at-risk. Stated another way, this is the proportion of the population of interest that developed the outcome in a defined timeframe.
An incidence rate is a measure of the number of new outcomes during the cumulative TAR for the population. When a person experiences the outcome in the TAR, their contribution to the total person-time stops at the occurrence of the outcome event. The cumulative TAR is referred to as person-time and is expressed in days, months or years.
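As a small worked example (numbers invented purely for illustration):

# Illustrative numbers: 10 new cases among 1,000 persons who together
# contribute 800 years of time-at-risk.
newCases <- 10
personsAtRisk <- 1000
personYearsAtRisk <- 800

incidenceProportion <- newCases / personsAtRisk        # 0.01, i.e. 1%
incidenceRate <- 1000 * newCases / personYearsAtRisk   # 12.5 per 1,000 person-years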
When calculated for therapies, incidence proportions and incidence rates of use of a given therapy are classic population-level DUS.
We make use of ATLAS and R to explore a database to understand its composition for studying hypertensive populations. Then, we will use these same tools to describe the natural history and treatment patterns of hypertensive populations.
To search for a specific condition of interest, click on the Table tab to reveal the full list of
conditions in the database with person count, prevalence and records per person. Using
the filter box on the top, we can filter down the entries in the table based on concept name
containing the term “hypertension”:
We can explore a detailed drilldown report of a condition by clicking on a row. In this
case, we will select “essential hypertension” to get a breakdown of the trends of the se
lected condition over time and by gender, the prevalence of the condition by month, the
type recorded with the condition and the age at first occurrence of the diagnosis:
Now that we have reviewed the database's characteristics for the presence of hypertension concepts and the trends over time, we can also explore drugs used to treat hypertensive persons. The process to do this follows the same steps, except we use the Drug Era report to explore the antihypertensive drugs recorded in the database.
Figure 11.4: Atlas Data Sources: Conditions with ”hypertension” found in the concept
name
Figure 11.5: Atlas Data Sources: Essential hypertension drill down report
11.7.1 Design
A characterization analysis requires at least one cohort and at least one feature to characterize. For this example, we will use two cohorts. The first cohort will define persons initiating a treatment for hypertension as their index date, with at least one diagnosis of hypertension in the year prior. We will also require that persons in this cohort have at least one year of observation after initiating the hypertensive drug (Appendix B.6). The second cohort is identical to the first cohort described, with a requirement of having at least three years of observation instead of one (Appendix B.7).
Cohort Definitions
We assume the cohorts have already been created in ATLAS as described in Chapter 10.
Click to select the cohorts as shown in Figure 11.6. Next, we'll define the features to use for characterizing these two cohorts.
Feature Selection
ATLAS comes with nearly 100 preset feature analyses that are used to perform characterization across the clinical domains modeled in the OMOP CDM. Each of these preset feature analyses performs aggregation and summarization functions on clinical observations for the selected target cohorts. These calculations provide potentially thousands of features to describe the cohorts' baseline and post-index characteristics. Under the hood, ATLAS is utilizing the OHDSI FeatureExtraction R package to perform the characterization for each cohort. We will cover the use of FeatureExtraction and R in more detail in the next section.
Click to select the features to characterize. Below is a list of features we will use to characterize these cohorts:
The figure above shows the list of features selected along with a description of what each feature will characterize for each cohort. The features that start with the name "Demographics" will calculate the demographic information for each person at the cohort start date. For the features that start with a domain name (i.e. Visit, Procedure, Condition, Drug, etc.), these will characterize all recorded observations in that domain. Each domain feature has four options for the time window preceding the cohort start, namely:
• Any time prior: uses all available time prior to cohort start that falls into the person's observation period
• Long term: 365 days prior up to and including the cohort start date.
• Medium term: 180 days prior up to and including the cohort start date.
• Short term: 30 days prior up to and including the cohort start date.
Subgroup Analysis
What if we were interested in creating different characteristics based on gender? We can use the "subgroup analyses" section to define new subgroups of interest to use in our characterization.
To create a subgroup, click the add button and add your criteria for subgroup membership. This step is similar to the criteria used to identify cohort enrollment. In this example, we'll define a set of criteria to identify females amongst our cohorts:
Subgroup analyses in ATLAS are not the same as strata. Strata are mutually exclusive, while subgroups may include the same persons based on the criteria chosen.
11.7.2 Executions
Once we have our characterization designed, we can execute this design against one or
more databases in our environment. Navigate to the Executions tab and click on the
Generate button to start the analysis on a database:
Once the analysis is complete, we can view reports by clicking on the "All Executions" button and, from the list of executions, selecting "View Reports". Alternatively, you can click "View latest result" to view the last execution performed.
11.7.3 Results
The results provide a tabular view of the different features for each cohort selected in the
design. In figure 11.10, a table provides a summary of all conditions present in the two
cohorts in the preceding 365 days from the cohort start. Each covariate has a count and
percentage for each cohort and the female subgroup we defined within each cohort.
We used the search box to filter the results to see what proportion of persons have a cardiac arrhythmia in their history, in an effort to understand what cardiovascular-related diagnoses are observed in the populations. We can use the Explore link next to the cardiac arrhythmia concept to open a new window with more details about the concept for a single cohort, as shown in Figure 11.11:
Since we have characterized all condition concepts for our cohorts, the explore option
enables a view of all ancestor and descendant concepts for the selected concept, in this
case cardiac arrhythmia. This exploration allows us to navigate the hierarchy of concepts
to explore other cardiac diseases that may appear for our hypertensive persons. Like in
the summary view, the count and percentage are displayed.
We can also use the same characterization results to find conditions that are contraindicated for some antihypertensive treatment, such as angioedema. To do this, we'll follow the same steps above but this time search for 'edema' as shown in Figure 11.12:
Once again, we'll use the explore feature to see the characteristics of edema in the hypertension population to find the prevalence of angioedema:
Here we find that a portion of this population has a record of angioedema in the year prior to starting an antihypertensive medication.
Figure 11.14: Characterization results of age for each cohort and sub group.
While domain covariates are computed using a binary indicator (i.e. was a record of the
code present in the prior timeframe), some variables provide a continuous value such as
the age of persons at cohort start. In the example above, we show the age for the 2 cohorts
characterized expressed with the count of persons, mean age, median age and standard
deviation.
In this example, we will define a custom feature that will identify the count of persons in
each cohort that have a drug era of ACE inhibitors in their history after cohort start:
The criteria defined above assume that they will be applied to a cohort start date. Once we have defined the criteria and saved them, we can apply them to the characterization design we created in the previous section. To do this, open the characterization design and navigate to the Feature Analysis section. Click the add button and from the menu select the new custom features. They will now appear in the feature list for the characterization design. As described earlier, we can execute this design against a database to produce the characterization for this custom feature:
FeatureExtraction creates covariates in two distinct ways: person-level features and aggregate features. Person-level features are useful for machine learning applications. In this section, we'll focus on using aggregate features that are useful for generating baseline covariates that describe the cohort of interest. Additionally, we'll focus on the latter two ways of constructing covariates, pre-specified and custom analyses, and leave using the default set as an exercise for the reader.
library(FeatureExtraction)
connDetails <- createConnectionDetails(dbms = "postgresql",
                                       server = "localhost/ohdsi",
                                       user = "joe",
                                       password = "supersecret")
cdmDbSchema <- "my_cdm_data"
cohortsDbSchema <- "scratch"
cohortsDbTable <- "my_cohorts"
cdmVersion <- "5"
The last four lines define the cdmDbSchema, cohortsDbSchema, and cohortsDbTable
variables, as well as the CDM version. We will use these later to tell R where the data in
CDM format live, where the cohorts of interest have been created, and what version CDM
is used. Note that for Microsoft SQL Server, database schemas need to specify both the
database and the schema, so for example cdmDbSchema <- "my_cdm_data.dbo".
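The covariate settings themselves are not shown here; a minimal sketch using FeatureExtraction's pre-specified analyses (the parameter names come from createCovariateSettings) could be:

settings <- createCovariateSettings(useDemographicsGender = TRUE,
                                    useDemographicsAgeGroup = TRUE,
                                    useConditionOccurrenceAnyTimePrior = TRUE)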
This will create binary covariates for gender, age (in 5-year age groups), and each concept observed in the condition_occurrence table any time prior to (and including) the cohort start date.
Many of the prespecified analyses refer to a short, medium, or long term time window.
By default, these windows are defined as:
• Long term: 365 days prior up to and including the cohort start date.
• Medium term: 180 days prior up to and including the cohort start date.
• Short term: 30 days prior up to and including the cohort start date.
However, the user can change these values. For example:
settings <- createCovariateSettings(useDrugEraLongTerm = TRUE,
                                    useDrugEraShortTerm = TRUE,
                                    longTermStartDays = -180,
                                    shortTermStartDays = -14,
                                    endDays = -1)
This redefines the long-term window as 180 days prior up to (but not including) the cohort start date, and redefines the short-term window as 14 days prior up to (but not including) the cohort start date.
Again, we can also specify which concept IDs should or should not be used to construct
covariates:
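For instance, a hedged sketch (the concept ID is purely illustrative, cohortId = 1 refers to a hypothetical cohort, and the schema variables are those defined earlier):

settings <- createCovariateSettings(useConditionOccurrenceLongTerm = TRUE,
                                    excludedCovariateConceptIds = c(192671),
                                    addDescendantsToExclude = TRUE)

covariateData2 <- getDbCovariateData(connectionDetails = connDetails,
                                     cdmDatabaseSchema = cdmDbSchema,
                                     cohortDatabaseSchema = cohortsDbSchema,
                                     cohortTable = cohortsDbTable,
                                     cohortId = 1,
                                     covariateSettings = settings,
                                     aggregated = TRUE)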
The use of aggregated = TRUE in the examples above indicates to FeatureExtraction to provide summary statistics. Excluding this flag will compute covariates for each person in the cohort.
summary(covariateData2)
covariateData2$covariates
covariateData2$covariatesContinuous
In Figure 11.17, the person is part of the target cohort with a defined start and end date. Then, the numbered line segments represent where that person also is identified in an event cohort for a duration of time. Event cohorts allow us to describe any clinical event of interest that is represented in the CDM, such that we are not constrained to creating a pathway for a single domain or concept.
11.9.1 Design
To start, we will continue to use the cohorts initiating a first-line therapy for hypertension with 1 and 3 years of follow-up (Appendix B.6, B.7). Use the import button to bring the two cohorts into the design. Next we'll define the event cohorts by creating a cohort for each first-line hypertensive drug of interest. For this, we'll start by creating a cohort of ACE inhibitor users and define the cohort end date as the end of continuous exposure. We'll do the same for 8 other hypertensive medications; these definitions are found in Appendix B.8-B.16. Once complete, use the import button to bring these into the Event Cohort section of the pathway design:
When complete, your design should look like the one above. Next, we'll need to decide on a few additional analysis settings:
• Combination window: This setting allows you to define a window of time, in days, in which overlap between events is considered a combination of events. For example, if two drugs represented by two event cohorts (event cohort 1 and event cohort 2) overlap within the combination window, the pathways algorithm will combine them into "event cohort 1 + event cohort 2".
• Minimum cell count: Event cohorts with fewer than this number of people will be censored (removed) from the output to protect privacy.
Figure 11.19: Event cohorts for pathway design for initiating a first-line antihypertensive therapy.
• Max path length: This refers to the maximum number of sequential events to
consider for the analysis.
11.9.2 Executions
Once we have our pathway analysis designed, we can execute this design against one or
more databases in our environment. This works the same way as we described for cohort
characterization in ATLAS. Once complete, we can review the results of the analysis.
The results of a pathway analysis are broken into 3 sections: The legend section displays
the total number of persons in the target cohort along with the number of persons that had
1 or more events in the pathway analysis. Below that summary are the color designations
for each of the cohorts that appear in the sunburst plot in the center section.
The sunburst plot is a visualization that represents the various event pathways taken by persons over time. The center of the plot represents the cohort entry, and the first color-coded ring shows the proportion of persons in each event cohort. In our example, the center of the circle represents hypertensive persons initiating a first-line therapy. Then, the first ring in the sunburst plot shows the proportion of persons that initiated a type of first-line therapy defined by the event cohorts (i.e. ACE inhibitors, angiotensin receptor blockers, etc.). The second set of rings represents the second event cohort for persons. In certain event sequences, a person may never have a second event cohort observed in the data, and that proportion is represented by the grey portion of the ring.
Clicking on a section of the sunburst plot will display the path details on the right. Here we can see that the largest proportion of people in our target cohort initiated a first-line therapy with ACE inhibitors, and from that group, a smaller proportion started a thiazide or thiazide-like diuretic.
11.10.1 Design
We assume the cohorts used in this example have already been created in ATLAS as described in Chapter 10. The Appendix provides the full definitions of the target cohorts (Appendix B.2, B.5) and the outcome cohorts (Appendix B.3, B.4, B.9).
On the Definition tab, choose the New users of ACE inhibitors cohort and the New users of Thiazide or Thiazide-like diuretics cohort. Close the dialog to confirm that these cohorts are added to the design. Next we add our outcome cohorts: from the dialog box, select the outcome cohorts of acute myocardial infarction events, angioedema events and angiotensin receptor blocker (ARB) use. Again, close the window to confirm that these cohorts are added to the outcome cohorts section of the design.
Next, we will define the time at risk window for the analysis. As shown above, the time
at risk window is defined relative to the cohort start and end dates. Here we will define
the time at risk start as 1 day after cohort start for our target cohorts. Next, we’ll define
the time at risk to end at the cohort end date. In this case, the definition of the ACEi and
THZ cohorts have a cohort end date when the drug exposure ends.
ATLAS also provides a way to stratify the target cohorts as part of the analysis specification:
To do this, click the New Stratify Criteria button and follow the same steps described in Chapter 11. Now that we have completed the design, we can move to executing our design against one or more databases.
11.10.2 Executions
Click the Generation tab and then the generate button to reveal a list of databases to use to execute the analysis:
Select one or more databases and click the Generate button to start the analysis, analyzing all combinations of targets and outcomes specified in the design.
Figure 11.26: Incidence Rate analysis output New ACEi users with AMI outcome.
of cases per 1,000 people. The time-at-risk, in years, is calculated for the target cohort. The incidence rate is expressed as the number of cases per 1,000 person-years.
We can also view the incidence metrics for the strata that we defined in the design. The
same metrics mentioned above are calculated for each stratum. Additionally, a treemap
visualization provides a representation of the proportion of each stratum represented by
the boxed areas. The color represents the incidence rate as shown in the scale along the
bottom.
We can gather the same information to see the incidence of new use of ARBs amongst the ACEi population. Using the dropdown at the top, change the outcome to ARB use and click the report button to reveal the details.
As shown, the metrics calculated are the same, but the interpretation is different, since the input (ARB use) references a drug utilization estimate instead of a health outcome.
11.11 Summary
Figure 11.27: Incidence Rate New users of ACEi receiving ARBs treatment during
ACEi exposure.
11.12 Exercises
Prerequisites
For these exercises, access to an ATLAS instance is required. You can use the instance at
http://atlasdemo.ohdsi.org, or any other instance you have access to.
Exercise 11.1. We would like to understand how celecoxib is used in the real world. To
start, we would like to understand what data a database has on this drug. Use the ATLAS
Data Sources module to find information on celecoxib.
Exercise 11.2. We would like to better understand the disease natural history of celecoxib users. Create a simple cohort of new users of celecoxib using a 365-day washout period (see Chapter 10 for details on how to do this), and use ATLAS to create a characterization of this cohort, showing comorbid conditions and drug exposures.
Exercise 11.3. We are interested in understanding how often gastrointestinal (GI) bleeds occur any time after people initiate celecoxib treatment. Create a cohort of GI bleed events, simply defined as any occurrence of concept 192671 ("Gastrointestinal hemorrhage") or any of its descendants. Compute the incidence rate of these GI events after celecoxib initiation, using the exposure cohort defined in the previous exercise.
Population-Level Estimation
Chapter leads: Martijn Schuemie, David Madigan, Marc Suchard & Patrick Ryan
Observational healthcare data, such as administrative claims and electronic health records, offer opportunities to generate real-world evidence about the effect of treatments that can meaningfully improve the lives of patients. In this chapter we focus on population-level effect estimation, which refers to the estimation of average causal effects of exposures (e.g. medical interventions such as drug exposures or procedures) on specific health outcomes of interest. In what follows, we consider two different estimation tasks:
• Direct effect estimation: estimating the effect of an exposure on the risk of an outcome, as compared to no exposure.
• Comparative effect estimation: estimating the effect of an exposure (the target exposure) on the risk of an outcome, as compared to another exposure (the comparator exposure).
In both cases, the patient-level causal effect contrasts a factual outcome, i.e., what happened to the exposed patient, with a counterfactual outcome, i.e., what would have happened had the exposure not occurred (direct) or had a different exposure occurred (comparative). Since any one patient reveals only the factual outcome (the fundamental problem of causal inference), the various effect estimation designs employ different analytic devices to shed light on the counterfactual outcomes.
Use-cases for population-level effect estimation include treatment selection, safety surveillance, and comparative effectiveness. Methods can test specific hypotheses one at a time (e.g. 'signal evaluation') or explore multiple hypotheses at once (e.g. 'signal detection'). In all cases, the objective remains the same: to produce a high-quality estimate of the causal effect.
In this chapter we first describe various population-level estimation study designs, all of which are implemented as R packages in the OHDSI Methods Library. We then detail the design of an example estimation study, followed by step-by-step guides of how to implement the design using ATLAS and R. Finally, we review the various outputs generated
Figure 12.1: The new-user cohort design. Subjects observed to initiate the target treatment are compared to those initiating the comparator treatment. To adjust for differences between the two treatment groups several adjustment strategies can be used, such as stratification, matching, or weighting by the propensity score, or by adding baseline characteristics to the outcome model. The characteristics included in the propensity model or outcome model are captured prior to treatment initiation.
The cohort method attempts to emulate a randomized clinical trial (Hernan and Robins, 2016). Subjects that are observed to initiate one treatment (the target) are compared to subjects initiating another treatment (the comparator) and are followed for a specific amount of time following treatment initiation, for example the time they stay on the treatment. We can specify the questions we wish to answer in a cohort study by making the five choices highlighted in Table 12.1.
Choice | Description
Target cohort | A cohort representing the target treatment
Comparator cohort | A cohort representing the comparator treatment
Outcome cohort | A cohort representing the outcome of interest
Time-at-risk | At what time (often relative to the target and comparator cohort start and end dates) do we consider the risk of the outcome?
Model | The model used to estimate the effect while adjusting for differences between the target and comparator
The choice of model specifies, among others, the type of outcome model. For example, we could use a logistic regression, which evaluates whether or not the outcome has occurred, and produces an odds ratio. A logistic regression assumes the time-at-risk is of the same length for both target and comparator, or is irrelevant. Alternatively, we could choose a Poisson regression which estimates the incidence rate ratio, assuming a constant incidence rate. Often a Cox regression is used which considers time to first outcome to estimate the hazard ratio, assuming proportional hazards between target and comparator.
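In the OHDSI CohortMethod package this choice surfaces as the modelType argument of fitOutcomeModel. A hedged sketch, assuming a study population object (here called studyPop) has been created in earlier steps:

library(CohortMethod)

# "logistic", "poisson", and "cox" correspond to the three outcome models above.
outcomeModel <- fitOutcomeModel(population = studyPop,
                                modelType = "cox")
outcomeModel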
The new-user cohort method inherently is a method for comparative effect estimation, comparing one treatment to another. It is difficult to use this method to compare a treatment against no treatment, since it is hard to define a group of unexposed people that is comparable with the exposed group. If one wants to use this design for direct effect estimation, the preferred way is to select a comparator treatment for the same indication as the exposure of interest, where the comparator treatment is believed to have no effect on the outcome. Unfortunately, such a comparator might not always be available.
A key concern is that the patients receiving the target treatment may systematically differ from those receiving the comparator treatment. For example, suppose the target cohort is on average 60 years old, whereas the comparator cohort is on average 40 years old. Comparing target to comparator with respect to any age-related health outcome (e.g. stroke) might then show substantial differences between the cohorts. An uninformed investigator might reach the conclusion there is a causal association between the target treatment and stroke as compared to the comparator. More prosaically or commonplace, the investigator might conclude that there exist target patients that experienced stroke that would not have done so had they received the comparator. This conclusion could well be entirely incorrect! Maybe those target patients disproportionately experienced stroke simply because they are older; maybe the target patients that experienced stroke might well have done so even if they had received the comparator. In this context, age is a "confounder." One mechanism to deal with confounders in observational studies is through propensity scores.
The propensity score (PS) is the probability of a subject receiving the target treatment based on what we can observe in the data on and before the time of treatment initiation (irrespective of the treatment they actually received). This is a straightforward predictive modeling application; we fit a model (e.g. a logistic regression) that predicts whether a subject receives the target treatment, and use this model to generate predicted probabilities (the PS) for each subject. Unlike in a standard randomized trial, different patients will have different probabilities of receiving the target treatment. The PS can be used in several ways, including matching target subjects to comparator subjects with similar PS, stratifying the study population based on the PS, or weighting subjects using Inverse Probability of Treatment Weighting (IPTW) derived from the PS. When matching we can select just one comparator subject for each target subject, or we can allow more than one comparator subject per target subject, a technique known as variable ratio matching (Rassen et al., 2012).
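A hedged sketch of these steps in the CohortMethod package (cohortMethodData and studyPop are assumed to come from earlier data extraction and study population steps):

library(CohortMethod)

# Fit a propensity model and attach propensity scores to the study population.
ps <- createPs(cohortMethodData = cohortMethodData,
               population = studyPop)

# One-to-one matching on the propensity score.
matchedPop <- matchOnPs(ps, maxRatio = 1)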
For example, suppose we use one-on-one PS matching, and that Jan has an a priori probability of 0.4 of receiving the target treatment and in fact receives the target treatment. If we can find a patient (named Jun) that also had an a priori probability of 0.4 of receiving the target treatment but in fact received the comparator, the comparison of Jan and Jun's outcomes is like a mini-randomized trial, at least with respect to measured confounders. This comparison will yield an estimate of the Jan-Jun causal contrast that is as good as the one randomization would have produced. Estimation then proceeds as follows: for every patient that received the target, find one or more matched patients that received the comparator but had the same prior probability of receiving the target. Compare the outcome for the target patient with the outcomes for the comparator patients within each of these matched groups.
We typically include the day of treatment initiation in the covariate capture window because many relevant data points, such as the diagnosis leading to the treatment, are recorded on that date. On this day the target and comparator treatment themselves are also recorded, but these should not be included in the propensity model, because they are the very thing we are trying to predict. We must therefore explicitly exclude the target and comparator treatment from the set of covariates.
Some have argued that a data-driven approach to covariate selection that does not depend on clinical expertise to specify the "right" causal structure runs the risk of erroneously including so-called instrumental variables and colliders, thus increasing variance and potentially introducing bias (Hernan et al., 2002). However, these concerns are unlikely to have a large impact in real-world scenarios (Schneeweiss, 2018). Furthermore, in medicine the true causal structure is rarely known, and when different researchers are asked to identify the 'right' covariates to include for a specific research question, each researcher invariably comes up with a different list, thus making the process irreproducible. Most importantly, our diagnostics, such as inspection of the propensity model, evaluating balance on all covariates, and including negative controls, would identify most problems related to colliders and instrumental variables.
12.1.3 Caliper
Since propensity scores fall on a continuum from 0 to 1, exact matching is rarely possible.
Instead, the matching process finds patients that match the propensity score of a target
patient(s) within some tolerance known as a “caliper.” Following Austin (2011), we use
a default caliper of 0.2 standard deviations on the logit scale.
\[
\ln\left(\frac{F}{1-F}\right) = \ln\left(\frac{S}{1-S}\right) - \ln\left(\frac{P}{1-P}\right)
\]
Where 𝐹 is the preference score, 𝑆 is the propensity score, and 𝑃 is the proportion of
patients receiving the target treatment.
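A minimal sketch of this conversion in R, assuming a vector of propensity scores ps and the observed proportion of patients receiving the target treatment p:

ps <- c(0.10, 0.40, 0.70)  # illustrative propensity scores
p <- 0.30                  # illustrative proportion receiving the target

logit <- function(x) log(x / (1 - x))
preferenceScore <- plogis(logit(ps) - logit(p))  # inverse logit of the difference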
Walker et al. (2013) discuss the concept of “empirical equipoise.” They accept exposure
pairs as emerging from empirical equipoise if at least half of the exposures are to patients
with a preference score of between 0.3 and 0.7.
Choice | Description
Target cohort | A cohort representing the treatment
Outcome cohort | A cohort representing the outcome of interest
Time-at-risk | At what time (often relative to the target cohort start and end dates) do we consider the risk of the outcome?
Control time | The time period used as the control time
12.1.5 Balance
Good practice always checks that the PS adjustment succeeds in creating balanced groups of patients. Figure 12.19 shows the standard OHDSI output for checking balance. For each patient characteristic, this plots the standardized difference between the means of the two exposure groups, before and after PS adjustment. Some guidelines recommend an after-adjustment standardized difference upper bound of 0.1 (Rubin, 2001).
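The standardized difference itself is not spelled out here; for a binary covariate the commonly used formula, sketched with invented numbers, is:

# Illustrative only: a covariate present in 30% of the target cohort and
# 25% of the comparator cohort.
p1 <- 0.30
p2 <- 0.25
stdDiff <- (p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
stdDiff  # about 0.11, just above the 0.1 rule of thumb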
Figure 12.2: The self-controlled cohort design. The rate of outcomes during exposure to the target is compared to the rate of outcomes in the time pre-exposure.
The self-controlled cohort (SCC) design (Ryan et al., 2013a) compares the rate of outcomes during exposure to the rate of outcomes in the time just prior to the exposure. The four choices shown in Table 12.2 define a self-controlled cohort question.
Because the same subjects that make up the exposed group are also used as the control group, no adjustment for between-person differences needs to be made. However, the method is vulnerable to other differences, such as differences in the baseline risk of the outcome between different time periods.
Figure 12.3: The case-control design. Subjects with the outcome ('cases') are compared to subjects without the outcome ('controls') in terms of their exposure status. Often, cases and controls are matched on various characteristics such as age and sex.
Choice | Description
Outcome cohort | A cohort representing the cases (the outcome of interest)
Control cohort | A cohort representing the controls. Typically the control cohort is automatically derived from the outcome cohort using some selection logic
Target cohort | A cohort representing the treatment
Nesting cohort | Optionally, a cohort defining the subpopulation from which cases and controls are drawn
Time-at-risk | At what time (often relative to the index date) do we consider exposure status?
The case-control design compares "cases," i.e., subjects that experience the outcome of interest, with "controls," i.e., subjects that did not experience the outcome of interest. The choices in Table 12.3 define a case-control question. Often, one selects controls to match cases based on characteristics such as age and sex to make them more comparable. Another widespread practice is to nest the analysis within a specific subgroup of people, for example people that have all been diagnosed with one of the indications of the exposure of interest.
Figure 12.4: The case-crossover design. The time around the outcome is compared to a control date set at a predefined interval prior to the outcome date.
| Choice | Description |
| --- | --- |
| Outcome cohort | A cohort representing the cases (the outcome of interest) |
| Target cohort | A cohort representing the treatment |
| Time-at-risk | At what time (often relative to the index date) do we consider exposure status? |
| Control time | The time period used as the control time |
Because the outcome date is always later than the control date, the method will be positively biased if the overall frequency of exposure increases over time (or negatively biased if there is a decrease). To address this, the case-time-control design (Suissa, 1995) was developed, which adds controls, matched for example on age and sex, to the case-crossover design to adjust for exposure trends.
Figure 12.5: The Self-Controlled Case Series design. The rate of outcomes during exposure is compared to the rate of outcomes when not exposed.
The Self-Controlled Case Series (SCCS) design (Farrington, 1995; Whitaker et al., 2006) compares the rate of outcomes during exposure to the rate of outcomes during all unexposed time, including before, between, and after exposures. It is a Poisson regression that is conditioned on the person. Thus, it seeks to answer the question: "Given that a patient has the outcome, is the outcome more likely during exposed time compared to unexposed time?"
| Choice | Description |
| --- | --- |
| Target cohort | A cohort representing the treatment |
| Outcome cohort | A cohort representing the outcome of interest |
| Time-at-risk | At what time (often relative to the target cohort start and end dates) do we consider the risk of the outcome? |
| Model | The model to estimate the effect, including any adjustments for time-varying confounders |
(THZ), which could be just as effective in managing hypertension and its associated risks such as acute myocardial infarction (AMI), but without increasing the risk of angioedema.
The following will demonstrate how to apply our population-level estimation framework to observational healthcare data to address the following comparative estimation questions:

What is the risk of angioedema in new users of ACE inhibitors compared to new users of thiazide and thiazide-like diuretics?

What is the risk of acute myocardial infarction in new users of ACE inhibitors compared to new users of thiazide and thiazide-like diuretics?
Since these are comparative effect estimation questions we will apply the cohort method
as described in Section 12.1.
12.6.3 Outcome
We define angioedema as any occurrence of an angioedema condition concept during an inpatient or emergency room (ER) visit, and require there to be no angioedema diagnosis recorded in the seven days prior. We define AMI as any occurrence of an AMI condition concept during an inpatient or ER visit, and require there to be no AMI diagnosis recorded in the 180 days prior.
12.6.4 Time-At-Risk
We define time-at-risk to start on the day after treatment initiation, and stop when exposure stops, allowing for a 30-day gap between subsequent drug exposures.
12.6.5 Model
We fit a PS model using the default set of covariates, including demographics, conditions, drugs, procedures, measurements, observations, and several comorbidity scores. We exclude ACEi and THZ from the covariates. We perform variable-ratio matching and condition the Cox regression on the matched sets.
Table 12.6: Main design choices for our comparative cohort study.

| Choice | Value |
| --- | --- |
| Target cohort | New users of ACE inhibitors as first-line monotherapy for hypertension. |
| Comparator cohort | New users of thiazides or thiazide-like diuretics as first-line monotherapy for hypertension. |
| Outcome cohort | Angioedema or acute myocardial infarction. |
| Time-at-risk | Starting the day after treatment initiation, stopping when exposure stops. |
| Model | Cox proportional hazards model using variable-ratio matching. |
In the Estimation design function, there are three sections: Comparisons, Analysis Settings, and Evaluation Settings. We can specify multiple comparisons and multiple analysis settings, and ATLAS will execute all combinations of these as separate analyses. Here we discuss each section:
Note that we can select multiple outcomes for a target-comparator pair. Each outcome will be treated independently, and will result in a separate analysis.
Concepts to Include
When selecting concepts to include, we can specify which covariates we would like to generate, for example to use in a propensity model. When specifying covariates here, all other covariates (aside from those specified) are left out. We usually want to include all baseline covariates, letting the regularized regression build a model that balances all covariates. The only reason we might want to specify particular covariates is to replicate an existing study that manually picked covariates. These inclusions can be specified in this comparison section or in the analysis section, because sometimes they pertain to a specific comparison (e.g. known confounders in a comparison), and sometimes they pertain to an analysis (e.g. when evaluating a particular covariate selection strategy).
Concepts to Exclude
Rather than specifying which concepts to include, we can instead specify concepts to exclude. When we submit a concept set in this field, we use every covariate except those corresponding to the submitted concepts. When using the default set of covariates, which includes all drugs and procedures occurring on the day of treatment initiation, we must exclude the target and comparator treatment, as well as any concepts that are directly related to these. For example, if the target exposure is an injectable, we should not only exclude the drug, but also the injection procedure from the propensity model. In this example, the covariates we want to exclude are ACEi and THZ. Figure 12.8 shows that we select a concept set that includes all these concepts, including their descendants.
After selecting the negative controls and covariates to exclude, the lower half of the comparison window should look like Figure 12.9.
Figure 12.9: The comparison window showing concept sets for negative controls and
concepts to exclude.
Study Population
There are a wide range of options to specify the study population, which is the set of
subjects that will enter the analysis. Many of these overlap with options available when
designing the target and comparator cohorts in the cohort definition tool. One reason for
using the options in Estimation instead of in the cohort definition is reusability; we can
define the target, comparator, and outcome cohorts completely independently, and add
dependencies between these at a later point in time. For example, if we wish to remove
people who had the outcome before treatment initiation, we could do so in the definitions
of the target and comparator cohort, but then we would need to create separate cohorts for
every outcome! Instead, we can choose to have people with prior outcomes be removed
in the analysis settings, and now we can reuse our target and comparator cohorts for our
two outcomes of interest (as well as our negative control outcomes).
The study start and end dates can be used to limit the analyses to a specific period. The
study end date also truncates risk windows, meaning no outcomes beyond the study end
date will be considered. One reason for selecting a study start date might be that one of the
drugs being studied is new and did not exist in an earlier time. Automatically adjusting
for this can be done by answering “yes” to the question “Restrict the analysis to the
period when both exposures are present in the data?”. Another reason to adjust study
start and end dates might be that medical practice changed over time (e.g., due to a drug
warning) and we are only interested in the time where medicine was practiced a specific
way.
The option “Should only the first exposure per subject be included?” can be used to
restrict to the first exposure per patient. Often this is already done in the cohort definition,
as is the case in this example. Similarly, the option “The minimum required continuous
observation time prior to index date for a person to be included in the cohort” is often
already set in the cohort definition, and can therefore be left at 0 here. Having observed
time (as defined in the OBSERVATION_PERIOD table) before the index date ensures
that there is sufficient information about the patient to calculate a propensity score, and
is also often used to ensure the patient is truly a new user, and therefore was not exposed
before.
"Remove subjects that are in both the target and comparator cohort?" defines, together with the option "If a subject is in multiple cohorts, should time-at-risk be censored when the new time-at-risk starts to prevent overlap?", what happens when a subject is in both target and comparator cohorts. The first setting has three choices:
• "Keep All", indicating to keep the subjects in both cohorts. With this option it might be possible to double-count subjects and outcomes.
• "Keep First", indicating to keep the subject in the first cohort that occurred.
• "Remove All", indicating to remove the subject from both cohorts.
If the options "keep all" or "keep first" are selected, we may wish to censor the time when a person is in both cohorts. This is illustrated in Figure 12.10. By default, the time-at-risk is defined relative to the cohort start and end date. In this example, the time-at-risk starts one day after cohort entry, and stops at cohort end. Without censoring, the time-at-risk for the two cohorts might overlap. This is especially problematic if we choose to keep all, because any outcome that occurs during this overlap (as shown) will be counted twice. If we choose to censor, the first cohort's time-at-risk ends when the second cohort's time-at-risk starts.
We can choose to remove subjects that have the outcome prior to the risk window
start, because often a second outcome occurrence is the continuation of the first one.
For instance, when someone develops heart failure, a second occurrence is likely, which
means the heart failure probably never fully resolved in between. On the other hand,
some outcomes are episodic, and it would be expected for patients to have more than
one independent occurrence, like an upper respiratory infection. If we choose to remove
people that had the outcome before, we can select how many days we should look back
when identifying prior outcomes.
Our choices for our example study are shown in Figure 12.11. Because our target and
comparator cohort definitions already restrict to the first exposure and require observation
time prior to treatment initiation, we do not apply these criteria here.
Figure 12.10: Time-at-risk (TAR) for subjects who are in both cohorts, assuming time-at-risk starts the day after treatment initiation, and stops at exposure end.
Covariate Settings
Here we specify the covariates to construct. These covariates are typically used in the
propensity model, but can also be included in the outcome model (the Cox proportional
hazards model in this case). If we click to view details of our covariate settings, we
can select which sets of covariates to construct. However, the recommendation is to
use the default set, which constructs covariates for demographics, all conditions, drugs,
procedures, measurements, etc.
We can modify the set of covariates by specifying concepts to include and/or exclude.
These settings are the same as the ones found in Section 12.7.1 on comparison settings.
The reason why they can be found in two places is because sometimes these settings
are related to a specific comparison, as is the case here because we wish to exclude the
drugs we are comparing, and sometimes the settings are related to a specific analysis.
When executing an analysis for a specific comparison using specific analysis settings,
the OHDSI tools will take the union of these sets.
Figure 12.12 shows our choices for this study. Note that we have selected to add descendants to the concepts to exclude, which we defined in the comparison settings in Figure 12.9.
Time At Risk
Time-at-risk is defined relative to the start and end dates of our target and comparator cohorts. In our example, we set the cohort start date to treatment initiation, and the cohort end date to when exposure stops (for at least 30 days). We set the start of time-at-risk to one day after cohort start, so one day after treatment initiation. A reason to set the time-at-risk start later than the cohort start is that we may want to exclude outcome events that occur on the day of treatment initiation if we do not believe it biologically plausible that they can be caused by the drug.
We set the end of the time-at-risk to the cohort end, so when exposure stops. We could choose to set the end date later if, for example, we believe events closely following treatment end may still be attributable to the exposure. In the extreme we could set the time-at-risk end to a large number of days (e.g. 99999) after the cohort end date, meaning we will effectively follow up subjects until observation end. Such a design is sometimes referred to as an intent-to-treat design.
A patient with zero days at risk adds no information, so the minimum days at risk is
normally set at one day. If there is a known latency for the side effect, then this may be
increased to get a more informative proportion. It can also be used to create a cohort more
similar to that of a randomized trial it is being compared to (e.g., all the patients in the
randomized trial were observed for at least N days).
A golden rule in designing a cohort study is to never use information that falls after
the cohort start date to define the study population, as this may introduce bias. For
example, if we require everyone to have at least a year of timeatrisk, we will
likely have limited our analyses to those who tolerate the treatment well. This
setting should therefore be used with extreme care.
If we choose to include all covariates in the outcome model, it may make sense to use regularization when fitting the model if there are many covariates. Note that no regularization will be applied to the treatment variable, to allow for unbiased estimation.
Figure 12.15 shows our choices for this study. Because we use variable-ratio matching, we must condition the regression on the strata (i.e. the matched sets).
In Section 12.7.1 we selected a concept set representing the negative control outcomes.
However, we need logic to convert concepts to cohorts to be used as outcomes in our
analysis. ATLAS provides standard logic with three choices. The first choice is whether
to use all occurrences or just the first occurrence of the concept. The second choice
determines whether occurrences of descendant concepts should be considered. For
example, occurrences of the descendant “ingrown nail of foot” can also be counted as
an occurrence of the ancestor “ingrown nail.” The third choice specifies which domains
should be considered when looking for the concepts.
Download the zip file. The zip file contains an R package, with the usual required folder structure for R packages (Wickham, 2015). To use this package we recommend using RStudio. If you are running RStudio locally, unzip the file, and double click the .Rproj file to open it in RStudio. If you are running RStudio on an RStudio Server, click to upload and unzip the file, then click on the .Rproj file to open the project.
Once you have opened the project in RStudio, you can open the README file and follow the instructions. Make sure to change all file paths to existing paths on your system.
A common error message that may appear when running the study is "High correlation between covariate(s) and treatment detected." This indicates that when fitting the propensity model, some covariates were observed to be highly correlated with the exposure. Please review the covariates mentioned in the error message, and exclude them from the set of covariates if appropriate (see Section 12.1.2).
For our example study we will rely on the CohortMethod package to execute our study.
CohortMethod extracts the necessary data from a database in the CDM and can use a
large set of covariates for the propensity model. In the following example we first only
consider angioedema as outcome. In Section 12.8.6 we then describe how this can be
extended to include AMI and the negative control outcomes.
library(CohortMethod)
connDetails <- createConnectionDetails(dbms = "postgresql",
                                       server = "localhost/ohdsi",
                                       user = "joe",
                                       password = "supersecret")
# Example schema and table names; replace with your own:
cdmDbSchema <- "my_cdm_data"
cohortDbSchema <- "scratch"
cohortTable <- "my_cohorts"
cdmVersion <- "5"
The last four lines define the cdmDbSchema, cohortDbSchema, and cohortTable variables, as well as the CDM version. We will use these later to tell R where the data in CDM format live, where the cohorts of interest have been created, and which CDM version is used. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, so for example cdmDbSchema <- "my_cdm_data.dbo".
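Before extracting the data, we need to define the covariate settings (the cs object used below). The sketch below is an assumption based on FeatureExtraction's createDefaultCovariateSettings function; the concept ID vectors are illustrative placeholders for the full lists of ACEi and THZ ingredient concept IDs used in the actual study:

library(FeatureExtraction)
# Hypothetical placeholder vectors: the real study lists every ingredient
# concept ID in the ACE inhibitor and thiazide/thiazide-like diuretic classes.
aceiConceptIds <- c(1308216)  # e.g. lisinopril (illustrative)
thzConceptIds <- c(974166)    # e.g. hydrochlorothiazide (illustrative)

# Exclude the exposures themselves (and their descendants) from the covariates:
cs <- createDefaultCovariateSettings(
  excludedCovariateConceptIds = c(aceiConceptIds, thzConceptIds),
  addDescendantsToExclude = TRUE)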
Now we can tell CohortMethod to extract the cohorts, construct covariates, and extract
all necessary data for our analysis:
# Load data:
cmData <- getDbCohortMethodData(connectionDetails = connDetails,
                                cdmDatabaseSchema = cdmDbSchema,
                                oracleTempSchema = NULL,
                                targetId = 1,
                                comparatorId = 2,
                                outcomeIds = 3,
                                studyStartDate = "",
                                studyEndDate = "",
                                exposureDatabaseSchema = cohortDbSchema,
                                exposureTable = cohortTable,
                                outcomeDatabaseSchema = cohortDbSchema,
                                outcomeTable = cohortTable,
                                cdmVersion = cdmVersion,
                                firstExposureOnly = FALSE,
                                removeDuplicateSubjects = FALSE,
                                restrictToCommonPeriod = FALSE,
                                washoutPeriod = 0,
                                covariateSettings = cs)
cmData
## CohortMethodData object
##
## Treatment concept ID: 1
## Comparator concept ID: 2
## Outcome concept ID(s): 3
There are many parameters, but they are all documented in the CohortMethod manual. The createDefaultCovariateSettings function is described in the FeatureExtraction package. In short, we are pointing the function to the table containing our cohorts and specifying which cohort definition IDs in that table identify the target, comparator, and outcome. We instruct that the default set of covariates should be constructed, including covariates for all conditions, drug exposures, and procedures that were found on or before the index date. As mentioned in Section 12.1 we must exclude the target and comparator treatments from the set of covariates, and here we achieve this by listing all ingredients in the two classes and telling FeatureExtraction to also exclude all descendants, thus excluding all drugs that contain these ingredients.
All data about the cohorts, outcomes, and covariates are extracted from the server and stored in the cohortMethodData object. This object uses the package ff to store information in a way that ensures R does not run out of memory, even when the data are large, as mentioned in Section 8.4.2.
We can use the generic summary() function to view some more information of the data
we extracted:
summary(cmData)
Creating the cohortMethodData object can take considerable computing time, so it is a good idea to save it for future sessions. Because it uses ff, we cannot use R's regular save function; instead, we use the saveCohortMethodData function:

saveCohortMethodData(cmData, "AceiVsThzForAngioedema")
We can use the loadCohortMethodData() function to load the data in a future session.
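The next step is to define the study population, for instance removing subjects with prior outcomes and applying the time-at-risk definition. The call below is a minimal sketch assuming CohortMethod's createStudyPopulation function; argument names and defaults may differ between package versions, and the risk window settings simply mirror the design choices in Table 12.6:

studyPop <- createStudyPopulation(cohortMethodData = cmData,
                                  outcomeId = 3,                # angioedema
                                  removeSubjectsWithPriorOutcome = TRUE,
                                  minDaysAtRisk = 1,
                                  riskWindowStart = 1,          # day after treatment initiation
                                  startAnchor = "cohort start",
                                  riskWindowEnd = 0,
                                  endAnchor = "cohort end")     # when exposure stops

The getAttritionTable call that follows shows how many subjects were removed by each of these criteria.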
getAttritionTable(studyPop)
The createPs function uses the Cyclops package to fit a large-scale regularized logistic regression. To fit the propensity model, Cyclops needs to know the hyperparameter value which specifies the variance of the prior. By default Cyclops will use cross-validation to estimate the optimal hyperparameter. However, be aware that this can take a long time. You can use the prior and control parameters of the createPs function to specify Cyclops' behavior, including using multiple CPUs to speed up the cross-validation. Here we use the PS to perform variable-ratio matching:
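The propensity score fitting and matching calls themselves are not shown in this excerpt; the following is a sketch using CohortMethod's createPs and matchOnPs functions. The caliper settings follow Section 12.1.3, and maxRatio = 100 is an assumption used here to obtain variable-ratio matching:

ps <- createPs(cohortMethodData = cmData, population = studyPop)

matchedPop <- matchOnPs(population = ps,
                        caliper = 0.2,
                        caliperScale = "standardized logit",
                        maxRatio = 100)   # variable-ratio matching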
To run the full study, we also define the outcomes of interest and the negative control outcome concepts:

# Outcomes of interest:
ois <- c(3, 4) # Angioedema, AMI

# Negative controls:
ncs <- c(434165,436409,199192,4088290,4092879,44783954,75911,137951,77965,
         376707,4103640,73241,133655,73560,434327,4213540,140842,81378,
         432303,4201390,46269889,134438,78619,201606,76786,4115402,
         45757370,433111,433527,4170770,4092896,259995,40481632,4166231,
         433577,4231770,440329,4012570,4012934,441788,4201717,374375,
         4344500,139099,444132,196168,432593,434203,438329,195873,4083487,
         4103703,4209423,377572,40480893,136368,140648,438130,4091513,
         4202045,373478,46286594,439790,81634,380706,141932,36713918,
         443172,81151,72748,378427,437264,194083,140641,440193,4115367)
Next, we specify what arguments should be used when calling the various functions described previously in our example with one outcome. We then combine these into a single analysis settings object, to which we give a unique analysis ID and a description, and collect one or more analysis settings objects into a list. We can then run the study, including all comparisons and analysis settings; a sketch of these steps is shown below.
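The code for these steps is not shown in this excerpt. The following is a minimal sketch assuming the multiple-analyses functions of the CohortMethod package (the create...Args helpers, createCmAnalysis, createTargetComparatorOutcomes, runCmAnalyses, and summarizeAnalyses); argument names may differ between package versions, and the output folder name is a placeholder:

# Arguments for each step, mirroring the single-outcome example above:
getDbCmDataArgs <- createGetDbCohortMethodDataArgs(covariateSettings = cs)
createStudyPopArgs <- createCreateStudyPopulationArgs(
  removeSubjectsWithPriorOutcome = TRUE,
  minDaysAtRisk = 1,
  riskWindowStart = 1,
  startAnchor = "cohort start",
  riskWindowEnd = 0,
  endAnchor = "cohort end")
createPsArgs <- createCreatePsArgs()
matchOnPsArgs <- createMatchOnPsArgs(caliper = 0.2,
                                     caliperScale = "standardized logit",
                                     maxRatio = 100)
fitOutcomeModelArgs <- createFitOutcomeModelArgs(modelType = "cox",
                                                 stratified = TRUE)

# Combine the arguments into a single analysis settings object and put it in a list:
cmAnalysis <- createCmAnalysis(analysisId = 1,
                               description = "Variable-ratio matching, Cox regression",
                               getDbCohortMethodDataArgs = getDbCmDataArgs,
                               createStudyPopArgs = createStudyPopArgs,
                               createPs = TRUE,
                               createPsArgs = createPsArgs,
                               matchOnPs = TRUE,
                               matchOnPsArgs = matchOnPsArgs,
                               fitOutcomeModel = TRUE,
                               fitOutcomeModelArgs = fitOutcomeModelArgs)
cmAnalysisList <- list(cmAnalysis)

# The target-comparator-outcomes combinations, including the negative controls:
tcos <- createTargetComparatorOutcomes(targetId = 1,
                                       comparatorId = 2,
                                       outcomeIds = c(ois, ncs))

# Run all analyses for all comparisons:
result <- runCmAnalyses(connectionDetails = connDetails,
                        cdmDatabaseSchema = cdmDbSchema,
                        exposureDatabaseSchema = cohortDbSchema,
                        exposureTable = cohortTable,
                        outcomeDatabaseSchema = cohortDbSchema,
                        outcomeTable = cohortTable,
                        outputFolder = "cmOutput",         # placeholder folder
                        cmAnalysisList = cmAnalysisList,
                        targetComparatorOutcomesList = list(tcos))

# Effect-size estimates for all outcomes in one table:
analysisSummary <- summarizeAnalyses(result, "cmOutput")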
The result object contains references to all the artifacts that were created, for example the fitted outcome model for AMI. We can also retrieve the effect size estimates for all outcomes with a single command, as shown at the end of the sketch above.
In general it is a good idea to also inspect the propensity model itself, especially if the model is very predictive. That way we may discover which variables are most predictive. Table 12.7 shows the top predictors in our propensity model. Note that if a variable is too predictive, the CohortMethod package will throw an informative error rather than attempt to fit a model that is already known to be perfectly predictive.
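A hedged example of retrieving the most predictive covariates from the fitted propensity model, assuming the ps and cmData objects from the sketches above and CohortMethod's getPsModel helper:

# Coefficients of the propensity model, with covariate names:
propensityModel <- getPsModel(ps, cmData)
head(propensityModel)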
Table 12.7: Top 10 predictors in the propensity model for ACEi and THZ. Positive values mean subjects with the covariate are more likely to receive the target treatment. "(Intercept)" indicates the intercept of this logistic regression model.

| Beta | Covariate |
| --- | --- |
| 1.42 | condition_era group during day -30 through 0 days relative to index: Edema |
| 1.11 | drug_era group during day 0 through 0 days relative to index: Potassium Chloride |
| 0.68 | age group: 05-09 |
| 0.64 | measurement during day -365 through 0 days relative to index: Renin |
| 0.63 | condition_era group during day -30 through 0 days relative to index: Urticaria |
| 0.57 | condition_era group during day -30 through 0 days relative to index: Proteinuria |
| 0.55 | drug_era group during day -365 through 0 days relative to index: INSULINS AND ANALOGUES |
| 0.54 | race = Black or African American |
| 0.52 | (Intercept) |
| 0.50 | gender = MALE |
Figure 12.19: Covariate balance, showing the absolute standardized difference of means before and after propensity score matching. Each dot represents a covariate.
Figure 12.20: Attrition diagram. The counts shown at the top are those that meet our
target and comparator cohort definitions. The counts at the bottom are those that enter
our outcome model, in this case a Cox regression.
Since the sample size is fixed in retrospective studies (the data have already been collected), and the true effect size is unknown, it is less meaningful to compute the power given an expected effect size. Instead, the CohortMethod package provides the computeMdrr function to compute the minimum detectable relative risk (MDRR). In our example study the MDRR is 1.69.
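For example, assuming the matched study population object from the sketches above, the MDRR could be computed as follows (the alpha and power values shown are conventional defaults, not prescribed by the text):

computeMdrr(population = matchedPop,
            alpha = 0.05,
            power = 0.8,
            twoSided = TRUE,
            modelType = "cox")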
To gain a better understanding of the amount of follow-up available we can also inspect the distribution of follow-up time. We defined follow-up time as time at risk, so not censored by the occurrence of the outcome. The getFollowUpDistribution function can provide a simple overview as shown in Figure 12.21, which suggests the follow-up time for both cohorts is comparable.
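A sketch of these calls, again assuming the matched population object from the earlier sketches:

getFollowUpDistribution(population = matchedPop)
# Or as a plot, as shown in Figure 12.21:
plotFollowUpDistribution(population = matchedPop)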
Figure 12.21: Distribution of follow-up time for the target and comparator cohorts.
12.9.4 Kaplan-Meier
One last check is to review the Kaplan-Meier plot, showing the survival over time in both cohorts. Using the plotKaplanMeier function we can create Figure 12.22, in which we can check, for example, whether our assumption of proportionality of hazards holds. The Kaplan-Meier plot automatically adjusts for stratification or weighting by PS. In this case, because variable-ratio matching is used, the survival curve for the comparator group is adjusted to mimic what the curve would have looked like for the target group had they been exposed to the comparator instead.
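A minimal call, assuming the matched population object from the earlier sketches:

plotKaplanMeier(population = matchedPop)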
We observe a hazard ratio of 4.32 (95% confidence interval: 2.45-8.08) for angioedema, which tells us that ACEi appears to increase the risk of angioedema compared to THZ. Similarly, we observe a hazard ratio of 1.13 (95% confidence interval: 0.59-2.18) for AMI, suggesting little or no effect on AMI. Our diagnostics, as reviewed earlier, give no reason for doubt. However, ultimately the quality of this evidence, and whether we choose to trust it, depends on many factors that are not covered by the study diagnostics, as described in Chapter 14.
12.10 Summary
12.11 Exercises
Prerequisites
For these exercises we assume R, RStudio, and Java have been installed as described in Section 8.4.5. Also required are the SqlRender, DatabaseConnector, Eunomia, and CohortMethod packages, which can be installed using:
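A sketch of one possible installation route, assuming the packages are available from CRAN and the OHDSI GitHub organization; the exact instructions may differ depending on package versions and your platform:

install.packages(c("SqlRender", "DatabaseConnector", "remotes"))
remotes::install_github("ohdsi/Eunomia")
remotes::install_github("ohdsi/CohortMethod")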
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
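Assuming the Eunomia package's getEunomiaConnectionDetails helper:

connectionDetails <- Eunomia::getEunomiaConnectionDetails()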
The CDM database schema is “main”. These exercises also make use of several cohorts.
The createCohorts function in the Eunomia package will create these in the COHORT
table:
Eunomia::createCohorts(connectionDetails)
Problem Definition
What is the risk of gastrointestinal (GI) bleed in new users of celecoxib compared to new users of diclofenac?

The celecoxib new-user cohort has COHORT_DEFINITION_ID = 1. The diclofenac new-user cohort has COHORT_DEFINITION_ID = 2. The GI bleed cohort has COHORT_DEFINITION_ID = 3. The ingredient concept IDs for celecoxib and diclofenac are 1118084 and 1124300, respectively. Time-at-risk starts on the day of treatment initiation, and stops at the end of observation (a so-called intent-to-treat analysis).
Exercise 12.1. Using the CohortMethod R package, use the default set of covariates and extract the CohortMethodData from the CDM. Create the summary of the CohortMethodData.
Exercise 12.3. Fit a Cox proportional hazards model without using any adjustments.
What could go wrong if you do this?
Exercise 12.4. Fit a propensity model. Are the two groups comparable?
Exercise 12.6. Fit a Cox proportional hazards model using the PS strata. Why is the
result different from the unadjusted model?
Patient-Level Prediction
Several guidelines exist for developing and reporting prediction models. For example, the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement (https://www.equator-network.org/reporting-guidelines/tripod-statement/) provides clear recommendations for reporting prediction model development and validation and addresses some of the concerns related to transparency.
Massive-scale, patient-specific predictive modeling has become a reality due to OHDSI, where the Common Data Model (CDM) allows for uniform and transparent analysis at an unprecedented scale. The growing network of databases standardized to the CDM enables external validation of models in different healthcare settings on a global scale. We believe this provides an immediate opportunity to serve large communities of patients who are most in need of improved quality of care. Such models can inform truly personalized medical care, hopefully leading to sharply improved patient outcomes.
In this chapter we describe OHDSI's standardized framework for patient-level prediction (Reps et al., 2018) and discuss the PatientLevelPrediction R package that implements established best practices for development and validation. We start by providing the necessary theory behind the development and evaluation of patient-level prediction and give a high-level overview of the implemented machine learning algorithms. We then discuss an example prediction problem and provide step-by-step guidance on its definition and implementation using ATLAS or custom R code. Finally, we discuss the use of Shiny applications for the dissemination of study results.
As shown in Table 13.1, to define a prediction problem we have to define t=0 by a target cohort, the outcome we like to predict by an outcome cohort, and the time-at-risk.
Furthermore, we have to make design choices for the model we would like to develop, and determine the observational datasets to perform internal and external validation.
Table 13.1: Main design choices for a prediction problem.

| Choice | Description |
| --- | --- |
| Target cohort | How do we define the cohort of persons for whom we wish to predict? |
| Outcome cohort | How do we define the outcome we want to predict? |
| Time-at-risk | In which time window relative to t=0 do we want to make the prediction? |
| Model | What algorithms do we want to use, and which potential predictor variables do we include? |
This conceptual framework works for all types of prediction problems.
Based on this example data, and assuming the time at risk is the year following the index
date (the target cohort start date), we can construct the covariates and the outcome status.
A covariate indicating “Essential hypertension in the year prior” will have the value 0
(not present) for person ID 1 (the condition occurred after the index date), and the value
1 (present) for person ID 2. Similarly, the outcome status will be 0 for person ID 1 (this
person had no entry in the outcome cohort), and 1 for person ID 2 (the outcome occurred
within a year following the index date).
A model that perfectly discriminates the training data will also have fit that noise. We therefore may prefer to define a decision boundary that does not perfectly discriminate in our training data but captures the "real" complexity. Techniques such as regularization aim to maximize model performance while minimizing complexity.
Each supervised learning algorithm has a different way to learn the decision boundary, and it is not straightforward to determine which algorithm will work best on your data. As the No Free Lunch theorem states, no single algorithm will always outperform the others on all prediction problems. Therefore, we recommend trying multiple supervised learning algorithms with various hyperparameter settings when developing patient-level prediction models.
A range of supervised learning algorithms is available in the PatientLevelPrediction package.
Note that for regularized logistic regression the variance of the prior is optimized by maximizing the out-of-sample likelihood in a cross-validation, so the starting variance has little impact on the performance of the resulting model. However, picking a starting variance that is too far from the optimal value may lead to long fitting times.
13.3.6 AdaBoost
AdaBoost is a boosting ensemble technique. Boosting works by iteratively adding classifiers, but gives more weight in the cost function to the data points that were misclassified by prior classifiers when training the next classifier. We use the sklearn AdaBoostClassifier implementation in Python.
For evaluation we must use a different dataset than was used to develop the model,
or else we run the risk of favoring models that are overfitted (see Section 13.3) and
may not perform well for new patients.
We distinguish between
• Internal validation: Using different sets of data extracted from the same database
to develop and evaluate the model.
• External validation: Developing the model in one database, and evaluating in
another database.
There are two ways to perform internal validation:
• A holdout set approach splits the labelled data into two independent sets: a train
set and a test set (the hold out set). The train set is used to learn the model and the
test set is used to evaluate it. We can simply divide our patients randomly into a
train and test set, or we may choose to:
– Split the data based on time (temporal validation), for example training on
data before a specific date, and evaluating on data after that date. This may
inform us on whether our model generalizes to different time periods.
– Split the data based on geographic location (spatial validation).
• Cross-validation is useful when the data are limited. The data are split into n equally sized sets, where n needs to be prespecified (e.g. n = 10). For each of these sets, a model is trained on all data except the data in that set and used to generate predictions for the held-out set. In this way, all data are used once to evaluate the model-building algorithm. In the patient-level prediction framework we use cross-validation to pick the optimal hyperparameters.
External validation aims to assess model performance on data from another database, i.e. outside of the settings in which it was developed. This measure of model transportability is important because we want to apply our models beyond the database on which they were trained. Different databases may represent different patient populations, different healthcare systems, and different data-capture processes. We believe that the external validation of prediction models on a large set of databases is a crucial step in model acceptance and implementation in clinical practice.
| Patient ID | Predicted risk | Predicted class at 0.5 threshold | Has outcome during time-at-risk | Type |
| --- | --- | --- | --- | --- |
| 1 | 0.8 | 1 | 1 | TP |
| 2 | 0.1 | 0 | 0 | TN |
| 3 | 0.7 | 1 | 0 | FP |
| 4 | 0 | 0 | 0 | TN |
| 5 | 0.05 | 0 | 0 | TN |
| 6 | 0.1 | 0 | 0 | TN |
| 7 | 0.9 | 1 | 1 | TP |
| 8 | 0.2 | 0 | 1 | FN |
| 9 | 0.3 | 0 | 0 | TN |
| 10 | 0.5 | 1 | 0 | FP |
If a patient is predicted to have the outcome and has the outcome (during the time-at-risk), then this is called a true positive (TP). If a patient is predicted to have the outcome but does not have the outcome, then this is called a false positive (FP). If a patient is predicted to not have the outcome and does not have the outcome, then this is called a true negative (TN). Finally, if a patient is predicted to not have the outcome but does have the outcome, then this is called a false negative (FN).
The following threshold-based metrics can be calculated:
• accuracy: (TP + TN) / (TP + TN + FP + FN)
• sensitivity: TP / (TP + FN)
• specificity: TN / (TN + FP)
• positive predictive value: TP / (TP + FP)
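As a worked illustration, using the counts from the hypothetical example table above (2 TP, 5 TN, 2 FP, 1 FN), these metrics can be computed directly:

tp <- 2; tn <- 5; fp <- 2; fn <- 1   # counts from the example table above

accuracy <- (tp + tn) / (tp + tn + fp + fn)   # 0.70
sensitivity <- tp / (tp + fn)                 # 0.67
specificity <- tn / (tn + fp)                 # 0.71
ppv <- tp / (tp + fp)                         # 0.50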
Note that these values can either decrease or increase if the threshold is lowered. Lowering the threshold of a classifier may increase the denominator by increasing the number of results returned. If the threshold was previously set too high, the new results may all be true positives, which will increase the positive predictive value. If the previous threshold was about right or too low, further lowering the threshold will introduce false positives, decreasing the positive predictive value. For sensitivity the denominator does not depend on the classifier threshold (TP + FN is a constant). This means that lowering the classifier threshold may increase sensitivity by increasing the number of true positive results. It is also possible that lowering the threshold may leave sensitivity unchanged, while the positive predictive value fluctuates.
Discrimination
Discrimination is the ability to assign a higher risk to patients who will experience the outcome during the time at risk. The Receiver Operating Characteristic (ROC) curve is created by plotting 1 - specificity on the x-axis and sensitivity on the y-axis at all possible thresholds. An example ROC plot is presented later in this chapter in Figure 13.17. The area under the receiver operating characteristic curve (AUC) gives an overall measure of discrimination, where a value of 0.5 corresponds to randomly assigning the risk and a value of 1 means perfect discrimination. Most published prediction models obtain AUCs between 0.6 and 0.8.
The AUC provides a way to determine how different the predicted risk distributions are between the patients who experience the outcome during the time at risk and those who do not. If the AUC is high, then the distributions will be mostly disjoint, whereas when there is a lot of overlap, the AUC will be closer to 0.5, as shown in Figure 13.3.
Figure 13.3: How the ROC plots are linked to discrimination. If the two classes have
similar distributions of predicted risk, the ROC will be close to the diagonal, with AUC
close to 0.5.
For rare outcomes even a model with a high AUC may not be practical, because for every positive above a given threshold there could also be many negatives (i.e. the positive predictive value will be low). Depending on the severity of the outcome and the cost (health risk and/or monetary) of the intervention, a high false positive rate may be unwanted. When the outcome is rare, another measure known as the area under the precision-recall curve (AUPRC) is therefore recommended. The AUPRC is the area under the line generated by plotting the sensitivity (also known as the recall) on the x-axis and the positive predictive value (also known as the precision) on the y-axis.
Calibration
Calibration is the ability of the model to assign the correct risk. For example, if the model assigned one hundred patients a risk of 10%, then ten of the patients should experience the outcome during the time at risk. If the model assigned 100 patients a risk of 80%, then eighty of the patients should experience the outcome during the time at risk. Calibration is generally assessed by partitioning the patients into deciles based on the predicted risk, and in each group calculating the mean predicted risk and the fraction of patients who experienced the outcome during the time at risk. We then plot these ten points (predicted risk on the y-axis and observed risk on the x-axis) and see whether they fall on the x = y line, indicating the model is well calibrated. An example calibration plot is presented later in this chapter in Figure 13.18. We also fit a linear model using the points to calculate the intercept (which should be close to zero) and the gradient (which should be close to one). If the gradient is greater than one, the model is assigning a higher risk than the true risk, and if the gradient is less than one, the model is assigning a lower risk than the true risk. Note that we also implemented smooth calibration curves in our R package to better capture the nonlinear relationship between predicted and observed risk.
We may also want to apply additional inclusion criteria to the target population, or perform sensitivity analyses with subpopulations of the target cohort. For this we have to answer the following questions:
• What is the minimum amount of observation time we require before the start of the target cohort? This choice could depend on the available patient time in the training data, but also on the time we expect to be available in the data sources we want to apply the model to in the future. The longer the minimum observation time, the more baseline history is available for each person for feature extraction, but the fewer patients will qualify for analysis. Moreover, there could be clinical reasons to choose a shorter or longer lookback period. For our example, we will use a 365-day prior history as lookback period (washout period).
• Can patients enter the target cohort multiple times? In the target cohort definition, a person may qualify for the cohort multiple times during different spans of time, for example if they had different episodes of a disease or separate periods of exposure to a medical product. The cohort definition does not necessarily apply a restriction to only let the patients enter once, but in the context of a particular patient-level prediction problem we may want to restrict the cohort to the first qualifying episode. In our example, a person can only enter the target cohort once, since our criterion is based on the first use of an ACE inhibitor.
• Do we allow persons to enter the target cohort if they experienced the outcome before qualifying for the target cohort? Depending on the particular patient-level prediction problem, there may be a desire to predict the incident first occurrence of an outcome, in which case patients who have previously experienced the outcome are not at risk for having a first occurrence and therefore should be excluded from the target cohort. In other circumstances, there may be a desire to predict prevalent episodes, whereby patients with prior outcomes can be included in the analysis and the prior outcome itself can be a predictor of future outcomes. For our prediction example, we will choose not to include those with prior angioedema.
• How do we define the period in which we will predict our outcome relative to the target cohort start? We have to make two decisions to answer this question. First, does the time-at-risk window start at the date of the start of the target cohort or later? Arguments to make it start later could be that we want to avoid outcomes that were entered late in the record but actually occurred before the start of the target cohort, or that we want to leave a gap in which interventions to prevent the outcome could theoretically be implemented. Second, we need to define the time-at-risk by setting the risk window end, as some specification of days offset relative to the target cohort start or end dates. For our problem we will predict in a time-at-risk window starting 1 day after the start of the target cohort up to 365 days later.
• Do we require a minimum amount of time-at-risk? We have to decide if we want to include patients that did not experience the outcome but did leave the database earlier than the end of our time-at-risk period. These patients may experience the outcome when we no longer observe them. For our prediction problem we decide to answer this question with "yes," requiring a minimum time-at-risk for that reason. Furthermore, we have to decide if this constraint also applies to persons who experienced the outcome, or whether we will include all persons with the outcome irrespective of their total time at risk. For example, if the outcome is death, then persons with the outcome are likely censored before the full time-at-risk period is complete.
| Choice | Value |
| --- | --- |
| Target cohort | Patients who have just started on an ACE inhibitor for the first time. Patients are excluded if they have less than 365 days of prior observation time or have prior angioedema. |
| Outcome cohort | Angioedema. |
| Time-at-risk | 1 day until 365 days from cohort start. We will require at least 364 days at risk. |
| Model | Gradient Boosting Machine with hyperparameters ntree: 5000, max depth: 4, 7, or 10, and learning rate: 0.001, 0.01, 0.1, or 0.9. Covariates will include gender, age, conditions, drugs, drug groups, and visit count. Data split: 75% train, 25% test, randomly assigned by person. |
In the Prediction design function, there are four sections: Prediction Problem Settings, Analysis Settings, Execution Settings, and Training Settings. Here we discuss each section:
To select a target population cohort we need to have previously defined it in ATLAS. Instantiating cohorts is described in Chapter 10. The Appendix provides the full definitions of the target (Appendix B.1) and outcome (Appendix B.4) cohorts used in this example. To add a target population to the cohort, click on the "Add Target Cohort" button. Adding outcome cohorts works similarly by clicking the "Add Outcome Cohort" button. When done, the dialog should look like Figure 13.4.
Model Settings
We can pick one or more supervised learning algorithms for model development. To add a supervised learning algorithm, click on the "Add Model Settings" button. A dropdown containing all the models currently supported in the ATLAS interface will appear. We can select the supervised learning model we want to include in the study by clicking on the name in the dropdown menu. This will then show a view for that specific model, allowing the selection of the hyperparameter values. If multiple values are provided, a grid search is performed across all possible combinations of values to select the optimal combination using cross-validation.
For our example we select gradient boosting machines, and set the hyperparameters as specified in Figure 13.5.
Covariate Settings
We have defined a set of standard covariates that can be extracted from the observational data in the CDM format. In the covariate settings view, it is possible to select which of the standard covariates to include. We can define different types of covariate settings, and each model will be created separately with each specified covariate setting.
To add a covariate setting into the study, click on the "Add Covariate Settings" button. This will open the covariate setting view.
The first part of the covariate settings view is the exclude/include option. Covariates are generally constructed for any concept. However, we may want to include or exclude specific concepts, for example if a concept is linked to the target cohort definition. To only include certain concepts, create a concept set in ATLAS and then, under "What concepts do you want to include in baseline covariates in the patient-level prediction model? (Leave blank if you want to include everything)", select the concept set. We can automatically add all descendant concepts to the concepts in the concept set by answering "yes" to the question "Should descendant concepts be added to the list of included concepts?" The same process can be repeated for the question "What concepts do you want to exclude in baseline covariates in the patient-level prediction model? (Leave blank if you want to include everything)", allowing covariates corresponding to the selected concepts to be removed. The final option, "A comma delimited list of covariate IDs that should be restricted to", enables us to provide a comma-separated set of covariate IDs (rather than concept IDs) that will be the only covariates included in the model. This option is for advanced users only. Once done, the inclusion and exclusion settings should look like Figure 13.6.
The next section enables the selection of non-time-bound variables.
The standard covariates enable three flexible time intervals for the covariates:
• end days: when to end the time intervals relative to the cohort start date [default is 0]
• long term [default 365 days to end days prior to cohort start date]
• medium term [default 180 days to end days prior to cohort start date]
• short term [default 30 days to end days prior to cohort start date]
Once done, this section should look like Figure 13.8.
The next option is the covariates extracted from the era tables:
• Condition: Construct covariates for each condition concept ID and time interval selected; if a patient has the concept ID with an era (i.e., the condition starts or ends during the time interval, or starts before and ends after the time interval) during the specified time interval prior to the cohort start date in the condition era table, the covariate value is 1, otherwise 0.
• Condition group: Construct covariates for each condition concept ID and time interval selected; if a patient has the concept ID or any descendant concept ID with an era during the specified time interval prior to the cohort start date in the condition era table, the covariate value is 1, otherwise 0.
• Drug: Construct covariates for each drug concept ID and time interval selected; if a patient has the concept ID with an era during the specified time interval prior to the cohort start date in the drug era table, the covariate value is 1, otherwise 0.
• Drug group: Construct covariates for each drug concept ID and time interval selected; if a patient has the concept ID or any descendant concept ID with an era during the specified time interval prior to the cohort start date in the drug era table, the covariate value is 1, otherwise 0.
The overlapping time interval setting means that the drug or condition era should start prior to the cohort start date and end after the cohort start date, so it overlaps with the cohort start date. The era start option restricts to condition or drug eras that start during the selected time interval.
Once done, this section should look like Figure 13.9.
The next option selects covariates corresponding to concept IDs in each domain for the selected time intervals:
• Condition: Construct covariates for each condition concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the condition occurrence table, the covariate value is 1, otherwise 0.
• Condition Primary Inpatient: One binary covariate per condition observed as a primary diagnosis in an inpatient setting in the condition_occurrence table.
• Drug: Construct covariates for each drug concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the drug exposure table, the covariate value is 1, otherwise 0.
• Procedure: Construct covariates for each procedure concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the procedure occurrence table, the covariate value is 1, otherwise 0.
• Measurement: Construct covariates for each measurement concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the measurement table, the covariate value is 1, otherwise 0.
• Measurement Value: Construct covariates for each measurement concept ID with a value and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the measurement table, the covariate value is the measurement value, otherwise 0.
• Measurement range group: Binary covariates indicating whether measurements are below, within, or above the normal range.
• Observation: Construct covariates for each observation concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the observation table, the covariate value is 1, otherwise 0.
• Device: Construct covariates for each device concept ID and time interval selected; if a patient has the concept ID recorded during the specified time interval prior to the cohort start date in the device table, the covariate value is 1, otherwise 0.
• Visit Count: Construct covariates for each visit and time interval selected, and count the number of visits recorded during the time interval as the covariate value.
• Visit Concept Count: Construct covariates for each visit, domain and time interval
selected and count the number of records per domain recorded during the visit type
and time interval as the covariate value.
The distinct count option counts the number of distinct concept IDs per domain and time
interval.
Once done, this section should look like Figure 13.10.
The final option is whether to include commonly used risk scores as covariates. Once
done, the risk score settings should look like Figure 13.11.
Population Settings
The population settings dialog is where additional inclusion criteria can be applied to the target population, and is also where the time-at-risk is defined. To add a population setting into the study, click on the "Add Population Settings" button. This will open up the population settings view.
The first set of options enables the user to specify the time-at-risk period. This is the time interval in which we look to see whether the outcome of interest occurs. If a patient has the outcome during the time-at-risk period then we will classify them as "Has outcome", otherwise they are classified as "No outcome". "Define the time-at-risk window start, relative to target cohort entry:" defines the start of the time-at-risk, relative to the target cohort start or end date. Similarly, "Define the time-at-risk window end:" defines the end of the time-at-risk.
"Minimum lookback period applied to target cohort" specifies the minimum baseline period, the minimum number of days prior to the cohort start date that a patient must be continuously observed. The default is 365 days. Expanding the minimum lookback will give a more complete picture of a patient (as they must have been observed for longer) but will filter out patients who do not have the minimum number of days of prior observation.
If "Should subjects without time at risk be removed?" is set to yes, then a value for "Minimum time at risk:" is also required. This allows removing people who are lost to follow-up (i.e. who have left the database during the time-at-risk period). For example, if the time-at-risk period is 1 day from cohort start until 365 days from cohort start, then the full time-at-risk interval is 364 days (365 - 1). If we only want to include patients who are observed for the whole interval, then we set the minimum time at risk to 364. If we are happy as long as people are in the time-at-risk for the first 100 days, then we set the minimum time at risk to 100. In this case, as the time-at-risk starts 1 day from the cohort start, a patient will be included if they remain in the database for at least 101 days from the cohort start date. If we set "Should subjects without time at risk be removed?" to "No", then all patients will be kept, even those who drop out from the database during the time-at-risk.
The option "Include people with outcomes who are not observed for the whole at risk period?" is related to the previous option. If set to "yes", then people who experience the outcome during the time-at-risk are always kept, even if they are not observed for the specified minimum amount of time.
The option “Should only the first exposure per subject be included?” is only useful
if our target cohort contains patients multiple times with different cohort start dates. In
this situation, picking “yes” will result in only keeping the earliest target cohort date per
patient in the analysis. Otherwise a patient can be in the dataset multiple times.
Setting "Remove patients who have observed the outcome prior to cohort entry?" to "yes" will remove patients who have the outcome prior to the time-at-risk start date, so the model is built for patients who have never experienced the outcome before. If "no" is selected, then patients could have had the outcome before cohort entry. Often, having had the outcome prior is very predictive of having the outcome during the time-at-risk.
Once done, the population settings dialog should look like Figure 13.12.
Now that we are finished with the Analysis Settings, the entire dialog should look like
Figure 13.13.
• "Percentage of the data to be used as the test set (0-100%)": Select the percentage of data to be used as test data (default = 25%).
• "The number of folds used in the cross validation": Select the number of folds for cross-validation used to select the optimal hyperparameters (default = 3).
• "The seed used to split the test/train set when using a person type testSplit (optional):": Select the random seed used to split the train/test set when using a person type test split.
For our example we make the choices shown in Figure 13.15.
The PatientLevelPrediction package can be installed in R using:

install.packages("drat")
drat::addRepo("OHDSI")
install.packages("PatientLevelPrediction")
Some of the machine learning algorithms require additional software to be installed. For
a full description of how to install the PatientLevelPrediction package, see the “Patient
Level Prediction Installation Guide” vignette.
To use the study R package we recommend using RStudio. If you are running RStudio locally, unzip the file generated by ATLAS, and double click the .Rproj file to open it in RStudio. If you are running RStudio on an RStudio Server, click to upload and unzip the file, then click on the .Rproj file to open the project.
Once you have opened the project in RStudio, you can open the README file and follow the instructions. Make sure to change all file paths to existing paths on your system.
library(PatientLevelPrediction)
connDetails <- createConnectionDetails(dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret")

# Example locations; replace with the schema and table names in your environment:
cdmDbSchema <- "my_cdm_data"
cohortsDbSchema <- "scratch"
cohortsDbTable <- "my_cohorts"
cdmVersion <- "5"
The last four lines define the cdmDbSchema, cohortsDbSchema, and cohortsDbTable
variables, as well as the CDM version. We will use these later to tell R where the data in
CDM format live, where the cohorts of interest have been created, and which CDM version
is used. Note that for Microsoft SQL Server, database schemas need to specify both the
database and the schema, so for example cdmDbSchema <- "my_cdm_data.dbo".
First it makes sense to verify that the cohort creation has succeeded by counting the num
ber of cohort entries:
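A minimal sketch of such a count query, using DatabaseConnector and SqlRender with the cohort location variables defined above (the query generated by the study package may differ slightly):

library(DatabaseConnector)
connection <- connect(connDetails)
sql <- "SELECT cohort_definition_id, COUNT(*) AS count
        FROM @cohorts_db_schema.@cohorts_db_table
        GROUP BY cohort_definition_id;"
sql <- SqlRender::render(sql,
                         cohorts_db_schema = cohortsDbSchema,
                         cohorts_db_table = cohortsDbTable)
sql <- SqlRender::translate(sql, targetDialect = "postgresql")  # matching the dbms above
querySql(connection, sql)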
## cohort_definition_id count
## 1 1 527616
## 2 2 3201
Now we can tell PatientLevelPrediction to extract all necessary data for our analysis. Co
variates are extracted using the FeatureExtraction package. For more detailed infor
mation on the FeatureExtraction package see its vignettes. For our example study we
decided to use these settings:
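As a sketch, covariate settings along these lines can be created with the FeatureExtraction package (the specific choices below are illustrative rather than the exact settings of the example study):

library(FeatureExtraction)
covariateSettings <- createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  useConditionGroupEraLongTerm = TRUE,
  useDrugGroupEraLongTerm = TRUE,
  useVisitConceptCountLongTerm = TRUE,
  longTermStartDays = -365,   # covariates from the year before index
  endDays = -1)               # up to the day before index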
The final step for extracting the data is to run the getPlpData function and input the
connection details, the database schema where the cohorts are stored, the cohort definition
IDs for the cohort and outcome, and the washout period which is the minimum number
of days prior to cohort index date that the person must have been observed to be included
into the data, and finally input the previously constructed covariate settings.
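A sketch of this call is shown below; argument names follow the PatientLevelPrediction version current when this chapter was written, and the cohort and outcome IDs are illustrative:

plpData <- getPlpData(connectionDetails = connDetails,
                      cdmDatabaseSchema = cdmDbSchema,
                      cohortDatabaseSchema = cohortsDbSchema,
                      cohortTable = cohortsDbTable,
                      cohortId = 1,                    # target cohort ID (illustrative)
                      outcomeDatabaseSchema = cohortsDbSchema,
                      outcomeTable = cohortsDbTable,
                      outcomeIds = 2,                  # outcome cohort ID (illustrative)
                      washoutPeriod = 365,             # minimum days of prior observation
                      covariateSettings = covariateSettings)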
There are many additional parameters for the getPlpData function which are all doc
umented in the PatientLevelPrediction manual. The resulting plpData object uses the
package ff to store information in a way that ensures R does not run out of memory,
even when the data are large.
Creating the plpData object can take considerable computing time, and it is probably a
good idea to save it for future sessions. Because plpData uses ff, we cannot use R’s
regular save function. Instead, we’ll have to use the savePlpData function:
savePlpData(plpData, "angio_in_ace_data")
We can use the loadPlpData() function to load the data in a future session.
We can now create the study population, applying the population settings described above (the outcomeId value is illustrative and should be the ID of the outcome cohort):
population <- createStudyPopulation(plpData = plpData,
                                    outcomeId = 2,
                                    removeSubjectsWithPriorOutcome = TRUE,
                                    priorOutcomeLookback = 9999,
                                    riskWindowStart = 1,
                                    riskWindowEnd = 365,
                                    addExposureDaysToStart = FALSE,
                                    addExposureDaysToEnd = FALSE,
                                    minTimeAtRisk = 364,
                                    requireTimeAtRisk = TRUE,
                                    includeAllOutcomes = TRUE,
                                    verbosity = "DEBUG")
The runPlp function uses the population, plpData, and model settings to train and evaluate
the model. We can use the testSplit (person/time) and testFraction parameters
to split the data in a 75%-25% split and run the patient-level prediction pipeline:
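A sketch of such a call, assuming a model-settings object created with setGradientBoostingMachine; the nfold argument name is taken from the package documentation of the time and may differ in newer versions:

gbmModel <- setGradientBoostingMachine()   # gradient boosting machine with default hyperparameter grid
gbmResults <- runPlp(population = population,
                     plpData = plpData,
                     modelSettings = gbmModel,
                     testSplit = "person",
                     testFraction = 0.25,
                     nfold = 3)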
Under the hood the package will now use the R xgboost package to fit a gradient boosting
machine model using 75% of the data and will evaluate the model on the remaining
25%. A results data structure is returned containing information about the model, its
performance, etc.
In the runPlp function there are several parameters to save the plpData, plpResults,
plpPlots, evaluation, etc. objects which are all set to TRUE by default.
We can save the trained model with the savePlpModel function and the full results structure with savePlpResult. To generate a set of standard plots describing the model's performance we can use:
plotPlp(gbmResults, "plots")
To make things easier we also provide the externalValidatePlp function for performing
external validation that also extracts the required data. Assuming we ran result <-
runPlp(...), we can extract the data required for the model and evaluate it on new
data, for example validation cohorts stored in the table mainschema.dob.cohort with
IDs 1 and 2 and CDM data in the schema cdmschema.dob.
To interactively explore the results we can launch the Shiny application included in the package:
viewPlp(plpResult)
The Shiny app opens with a summary of the performance metrics on the test and train
sets (see Figure 13.16). The results show that the AUC on the train set was 0.78 and this
dropped to 0.74 on the test set. The test set AUC is the more accurate measure. Overall,
the model appears to be able to discriminate those who will develop the outcome in new
users of ACE inhibitors, but it is slightly overfit as the performance on the train set is higher
than on the test set. The ROC plot is presented in Figure 13.17.
The calibration plot in Figure 13.18 shows that generally the observed risk matches the
predicted risk as the dots are around the diagonal line. The demographic calibration plot in
Figure 13.19 however shows that the model is not well calibrated for the younger patients,
as the blue line (the predicted risk) differs from the red line (the observed risk) for those
aged below 40. This may indicate we need to remove the under 40s from the target
population (as the observed risk for the younger patients is nearly zero).
Finally, the attrition plot shows the loss of patients from the labelled data based on inclu
sion/exclusion criteria (see Figure 13.20). The plot shows that we lost a large portion of
the target population due to them not being observed for the whole time at risk (1 year
follow up). Interestingly, not as many patients with the outcome lacked the complete time
at risk.
The interactive Shiny app will start at the summary page as shown in Figure 13.21.
Figure 13.21: The Shiny summary page containing key hold-out set performance metrics
for each model trained.
The summary table displays, for each trained model:
• basic information about the model (e.g., database information, classifier type, time-at-risk settings, target population and outcome names)
• hold-out target population count and incidence of outcome
• discrimination metrics: AUC, AUPRC
To the left of the table is the filter option, where we can specify the development/validation
databases to focus on, the type of model, the time at risk settings of interest and/or the
cohorts of interest. For example, to pick the models corresponding to the target population
“New users of ACE inhibitors as first line monotherapy for hypertension”, select this in
the Target Cohort option.
To explore a model, click on the corresponding row; the selected row will be highlighted.
With a row selected, we can now explore the model settings used when developing the
model by clicking on the Model Settings tab:
Similarly, we can explore the population and covariate settings used to generate the model
in the other tabs.
Figure 13.22: To view the model settings used when developing the model.
This summary view shows the selected prediction question in the standard format, a
threshold selector and a dashboard containing key threshold-based metrics such as positive predictive value (PPV), negative predictive value (NPV), sensitivity and specificity
(see Section 13.4.2). In Figure 13.23 we see that at a threshold of 0.00482 the sensitivity
is 83.4% (83.4% of patients with the outcome in the following year have a risk greater
than or equal to 0.00482) and the PPV is 1.2% (1.2% of patients with a risk greater than
or equal to 0.00482 have the outcome in the following year). As the incidence of the
outcome within the year is 0.741%, identifying patients with a risk greater than or equal
to 0.00482 would find a subgroup of patients that have nearly double the risk of the popu
lation average risk. We can adjust the threshold using the slider to view the performance
at other values.
To look at the overall discrimination of the model click on the “Discrimination” tab to
view the ROC plot, precision-recall plot, and distribution plots. The line on the plots
corresponds to the selected threshold point. Figure 13.24 shows the ROC and precision-recall
plots. The ROC plot shows the model was able to discriminate between those who
will have the outcome within the year and those who will not. However, the performance
looks less impressive when we see the precision-recall plot, as the low incidence of the
outcome means there is a high false positive rate.
Figure 13.24: The ROC and precision-recall plots used to assess the overall discrimination
ability of the model.
To inspect the final model, select the option from the left-hand menu. This
will open a view containing plots for each variable in the model, shown in Figure 13.27,
Figure 13.25: The predicted risk distribution for those with and without the outcome. The
more these overlap, the worse the discrimination.
and a table summarizing all the candidate covariates, shown in Figure 13.28. The vari
able plots are separated into binary variables and continuous variables. The x-axis is the
prevalence/mean in patients without the outcome and the y-axis is the prevalence/mean
in patients with the outcome. Therefore, any variable’s dot falling above the diagonal
is more common in patients with the outcome and any variable’s dot falling below the
diagonal is less common in patients with the outcome.
Figure 13.27: Model summary plots. Each dot corresponds to a variable included in the
model.
The table in Figure 13.28 displays the name and value (the coefficient if using a generalized
linear model, or the variable importance otherwise) for all the candidate covariates, the outcome mean
(the mean value for those who have the outcome) and the non-outcome mean (the mean value
for those who do not have the outcome).
Predictive models are not causal models, and predictors should not be mistaken for
causes. There is no guarantee that modifying any of the variables in Figure 13.28
will have an effect on the risk of the outcome.
13.10 Summary
13.11 Exercises
Prerequisites
For these exercises we assume R, RStudio and Java have been installed as described
in Section 8.4.5. Also required are the SqlRender, DatabaseConnector, Eunomia and
PatientLevelPrediction packages, which can be installed using:
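# One way to install these packages; the OHDSI packages are installed from GitHub here,
# and the exact installation mechanism may differ depending on package versions:
install.packages(c("SqlRender", "DatabaseConnector", "remotes"))
remotes::install_github("ohdsi/Eunomia")
remotes::install_github("ohdsi/PatientLevelPrediction")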
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
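# Connection details for the simulated database bundled with the Eunomia package:
connectionDetails <- Eunomia::getEunomiaConnectionDetails()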
The CDM database schema is “main”. These exercises also make use of several cohorts.
The createCohorts function in the Eunomia package will create these in the COHORT
table:
Eunomia::createCohorts(connectionDetails)
Problem Definition
In patients that started using NSAIDs for the first time, predict who will
develop a gastrointestinal (GI) bleed in the next year.
The NSAID new-user cohort has COHORT_DEFINITION_ID = 4. The GI bleed cohort
has COHORT_DEFINITION_ID = 3.
Exercise 13.1. Using the PatientLevelPrediction R package, define the covariates you
want to use for the prediction and extract the PLP data from the CDM. Create the summary
of the PLP data.
Exercise 13.2. Revisit the design choices you have to make to define the final target
population and specify these using the createStudyPopulation function. What will
the effect of your choices be on the final size of the target population?
Exercise 13.3. Build a prediction model using LASSO and evaluate its performance using
the Shiny application. How well is your model performing?
Evidence Quality
Chapter 14
Evidence Quality
Reliable evidence should be repeatable, meaning that researchers should expect to pro
duce identical results when applying the same analysis to the same data for any given
question. Implicit in this minimum requirement is the notion that evidence is the result
of the execution of a defined process with a specified input, and should be free of manual
intervention or post-hoc decision-making along the way. More ideally, reliable evidence
should be reproducible such that a different researcher should be able to perform the
same task of executing a given analysis on a given database and expect to produce an
identical result as the first researcher. Reproducibility requires that the process is fully
specified, generally in both human-readable and computer-executable form such that no
study decisions are left to the discretion of the investigator. The most efficient solution
to achieve repeatability and reproducibility is to use standardized analytics routines that
have defined inputs and outputs, and apply these procedures against version-controlled
databases.
We are more likely to be confident that our evidence is reliable if it can be shown to be
replicable, such that the same question addressed using the identical analysis against sim
ilar data yield similar results. For example, evidence generated from an analysis against
an administrative claims database from one large private insurer may be strengthened if
replicated on claims data from a different insurer. In the context of populationlevel ef
fect estimation, this attribute aligns well with Sir Austin Bradford Hill’s causal viewpoint
on consistency, “Has it been repeatedly observed by different persons, in different places,
circumstances and times?…whether chance is the explanation or whether a true hazard
has been revealed may sometimes be answered only by a repetition of the circumstances
and the observations.” (Hill, 1965) In the context of patientlevel prediction, replicabil
ity highlights the value of external validation and the ability to evaluate performance of
a model that was trained on one database by observing its discriminative accuracy and
calibration when applied to a different database. In circumstances where identical anal
yses are performed against different databases and still show consistently similar results,
we gain further confidence that our evidence is generalizable. A key value of the
OHDSI research network is the diversity represented by different populations, geogra
phies and data capture processes. Madigan et al. (2013b) showed that effect estimates
can be sensitive to choice of data. Recognizing that each data source carries with it inher
ent limitations and unique biases that limit our confidence in singular findings, there is
tremendous power in observing similar patterns across heterogeneous datasets because it
greatly diminishes the likelihood that sourcespecific biases alone can explain the findings.
When network studies show consistent population-level effect estimates across multiple
claims and EHR databases across the US, Europe and Asia, they should be recognized as
stronger evidence about the medical intervention that can have a broader scope to impact
medical decisionmaking.
Reliable evidence should be robust, meaning that the findings should not be overly sen
sitive to the subjective choices that can be made within an analysis. If there are alterna
tive statistical methods that can be considered potentially reasonable for a given study,
then it can provide reassurance to see that the different methods yield similar results, or
conversely can give caution if discordant results are uncovered. (Madigan et al., 2013a)
For population-level effect estimation, sensitivity analyses can include high-level study
design choices, such as whether to apply a comparative cohort or self-controlled case series
design, or can focus on analytical considerations embedded within a design, such as
whether to perform propensity score matching, stratification or weighting as a confound
ing adjustment strategy within the comparative cohort framework.
Last, but potentially most important, evidence should be calibrated. It is not sufficient to
have an evidence generating system that produces answers to unknown questions if the
performance of that system cannot be verified. A closed system should be expected to
have known operating characteristics, which should be able to be measured and communi
cated as context for interpreting any results that the system produces. Statistical artifacts
should be able to be empirically demonstrated to have well-defined properties, such as a
95% confidence interval having 95% coverage probability or a cohort with a predicted
probability of 10% having an observed proportion of events in 10% of the population. An
observational study should always be accompanied by study diagnostics that test assump
tions around the design, methods, and data. These diagnostics should be centered on
evaluating the primary threats to study validity: selection bias, confounding, and mea
surement error. Negative controls have been shown to be a powerful tool for identifying
and mitigating systematic error in observational studies. (Schuemie et al., 2016, 2018a,b)
The four components of evidence quality are summarized in Table 14.1.
Table 14.1: The four components of evidence quality.

Component of Evidence Quality | What it Measures
Data Quality | Are the data completely captured with plausible values in a manner that is conformant to agreed-upon structure and conventions?
Clinical Validity | To what extent does the analysis conducted match the clinical intention?
Software Validity | Can we trust that the process transforming and analyzing the data does what it is supposed to do?
Method Validity | Is the methodology appropriate for the question, given the strengths and weaknesses of the data?
14.4 Summary
– Evidence quality comprises four components:
* Data Quality
* Clinical Validity
* Software Validity
* Method Validity
– When communicating evidence, we should express the uncertainty arising
from the various challenges to evidence quality.
Chapter 15
Data Quality
Most of the data used for observational healthcare research were not collected for research
purposes. For example, electronic health records (EHRs) aim to capture the information
needed to support the care of patients, and administrative claims are collected to provide
grounds for allocating costs to payers. Many have questioned whether it is appropriate
to use such data for clinical research, with van der Lei (1991) even stating that “Data
shall be used only for the purpose for which they were collected.” The concern is that
because the data were not collected for the research that we would like to do, they are not
guaranteed to have sufficient quality. If the quality of the data is poor (garbage in), then
the quality of the result of research using that data must be poor as well (garbage out). An
important aspect of observational healthcare research therefore deals with assessing data
quality, aiming to answer the question: are the data of sufficient quality for our research purposes?
Note that it is unlikely that our data are perfect, but they may be good enough for our
purposes.
DQ cannot be observed directly, but methodology has been developed to assess it. Two
types of DQ assessments can be distinguished (Weiskopf and Weng, 2013): assessments
to evaluate DQ in general, and assessments to evaluate DQ in the context of a specific
study.
In this chapter we will first review possible sources of DQ problems, after which we’ll
discuss the theory of general and study-specific DQ assessments, followed by a step-by-step
description of how these assessments can be performed using the OHDSI tools.
understanding of a CDM instance, the DQD goes table by table and field by field to quan
tify the number of records in a CDM that do not conform to the given specifications. In
all, over 1,500 checks are performed, each one organized into the Kahn framework. For
each check the result is compared to a threshold whereby a FAIL is considered to be any
percentage of violating rows falling above that value. Table 15.1 shows some example
checks.
Table 15.1: Example data quality rules in the Data Quality Dashboard.

Fraction of violating rows | Check description | Threshold | Status
0.34 | A yes or no value indicating if the provider_id in the VISIT_OCCURRENCE is the expected data type based on the specification. | 0.05 | FAIL
0.99 | The number and percent of distinct source values in the measurement_source_value field of the MEASUREMENT table mapped to 0. | 0.30 | FAIL
0.09 | The number and percent of records that have a value in the drug_concept_id field in the DRUG_ERA table that do not conform to the ingredient class. | 0.10 | PASS
0.02 | The number and percent of records with a value in the verbatim_end_date field of the DRUG_EXPOSURE that occurs prior to the date in the DRUG_EXPOSURE_START_DATE field of the DRUG_EXPOSURE table. | 0.05 | PASS
0.00 | The number and percent of records that have a duplicate value in the procedure_occurrence_id field of the PROCEDURE_OCCURRENCE. | 0.00 | PASS
Within the tool the checks are organized in multiple ways, one being into table, field,
and concept level checks. Table checks are those done at a highlevel within the CDM,
for example determining if all required tables are present. The field level checks are
carried out in such a way to evaluate every field within every table for conformance to
CDM specifications. These include making sure all primary keys are truly unique and all
standard concept fields contain concepts ids in the proper domain, among many others.
Concept level checks go a little deeper to examine individual concept ids. Many of these
fall into the plausibility category of the Kahn framework such as ensuring that gender
specific concepts are not attributed to persons of incorrect gender (i.e. prostate cancer in
a female patient).
15.2. Data Quality in General 295
ACHILLES and DQD are executed against the data in the CDM. DQ issues iden
tified this way may be due to the conversion to the CDM, but may also reflect DQ
issues already present in the source data. If the conversion is at fault, it is usually
within our control to remedy the problem, but if the underlying data are at fault the
only course of action may be to delete the offending records.
source("Framework.R")
declareTest(101, "Person gender mappings")
add_enrollment(member_id = "M000000102", gender_of_member = "male")
add_enrollment(member_id = "M000000103", gender_of_member = "female")
expect_person(PERSON_ID = 102, GENDER_CONCEPT_ID = 8507)
expect_person(PERSON_ID = 103, GENDER_CONCEPT_ID = 8532)
In this example, the framework generated by Rabbit-in-a-Hat is sourced, loading the func
tions that are used in the remainder of the code. We then declare we will start testing per
son gender mappings. The source schema has an ENROLLMENT table, and we use the
add_enrollment function created by Rabbit-in-a-Hat to create two entries with different
values for the MEMBER_ID and GENDER_OF_MEMBER fields. Finally, we specify
the expectation that after the ETL two entries should exist in the PERSON table with
various expected values.
Note that the ENROLLMENT table has many other fields, but we do not care much about
what values these other fields have in the context of this test. However, leaving those val
ues (e.g. date of birth) empty might cause the ETL to discard the record or throw an error.
To overcome this problem while keeping the test code easy to read, the add_enrollment
function will assign default values (the most prevalent values as observed in the WhiteRabbit
scan report) to field values that are not explicitly specified by the user.
Similar unit tests can be created for all other logic in an ETL, typically resulting in hun
dreds of tests. When we are done defining the test, we can use the framework to generate
two sets of SQL statements, one to create the fake source data, and one to create the tests
on the ETLed data:
Figure 15.1: Unit testing an ETL (Extract-Transform-Load) process using the Rabbit-in-a-Hat
testing framework.
The test SQL returns a table that will look like Table 15.2. In this table we see that we
passed the two tests we defined earlier.
ID | Description | Status
101 | Person gender mappings | PASS
101 | Person gender mappings | PASS
The power of these unit tests is that we can easily rerun them any time the ETL process
is changed.
these codes over time to help identify temporal issues associated with specific source
codes. The example output in Figure 15.2 shows a (partial) breakdown of a concept
set called “Depressive disorder.” The most prevalent concept in this concept set in the
database of interest is concept 440383 (“Depressive disorder”). We see that three source
codes in the database map to this concept: ICD-9 code 311, and ICD-10 codes F32.8
and F32.89. On the left we see that the concept as a whole first shows a gradual increase
over time, but then shows a sharp drop. If we look at the individual codes, we see that
this drop can be explained by the fact that the ICD-9 code stops being used at the time
of the drop. Even though this is the same time the ICD-10 codes start being used, the
combined prevalence of the ICD-10 codes is much smaller than that of the ICD-9 code.
This specific example was due to the fact that the ICD-10 code F32.9 (“Major depressive
disorder, single episode, unspecified”) should have also mapped to the concept. This
problem has since been resolved in the Vocabulary.
Even though the previous example demonstrates a chance finding of a source code
that was not mapped, in general identifying missing mappings is more difficult than
checking mappings that are present. It requires knowing which source codes should
map but don’t. A semiautomated way to perform this assessment is to use the
findOrphanSourceCodes function in the MethodEvaluation R package. This function
allows one to search the vocabulary for source codes using a simple text search, and it
checks whether these source codes map to a specific concept or to one of the descendants
of that concept. The resulting set of source codes is subsequently restricted to only those
that appear in the CDM database at hand. For example, in a study the concept “Gan
grenous disorder” (439928) and all of its descendants was used to find all occurrences of
gangrene. To evaluate whether this truly includes all source codes indicating gangrene,
several terms (e.g. “gangrene”) were used to search the descriptions in the CONCEPT
and SOURCE_TO_CONCEPT_MAP tables to identify source codes. An automated
search is then used to evaluate whether each gangrene source code appearing in the
data indeed directly or indirectly (through ancestry) maps to the concept “Gangrenous
disorder.” The result of this evaluation is shown in Figure 15.3, revealing that the ICD-10
code J85.0 (“Gangrene and necrosis of lung”) was only mapped to concept 4324261
(“Pulmonary necrosis”), which is not a descendant of “Gangrenous disorder.”
library(Achilles)
connDetails <- createConnectionDetails(dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret")

# Example values; replace with the CDM schema and version of your environment:
cdmDbSchema <- "my_cdm_data"
cdmVersion <- "5.3"
The last two lines define the cdmDbSchema variable, as well as the CDM version. We will
use these later to tell R where the data in CDM format live, and which CDM version
is used. Note that for Microsoft SQL Server, database schemas need to specify both the
database and the schema, so for example cdmDbSchema <- "my_cdm_data.dbo".
Next, we run ACHILLES:
result <- achilles(connectionDetails,
                   cdmDatabaseSchema = cdmDbSchema,
                   resultsDatabaseSchema = cdmDbSchema,
                   sourceName = "My database",
                   cdmVersion = cdmVersion)
This function will create several tables in the resultsDatabaseSchema, which we’ve
set here to the same database schema as the CDM data.
We can view the ACHILLES database characterization. This can be done by pointing
ATLAS to the ACHILLES results databases, or by exporting the ACHILLES results to a
set of JSON files:
exportToJson(connectionDetails,
cdmDatabaseSchema = cdmDbSchema,
resultsDatabaseSchema = cdmDbSchema,
outputPath = "achillesOut")
The JSON files will be written to the achillesOut subfolder, and can be used together
with the AchillesWeb web application to explore the results. For example, Figure 15.4
shows the ACHILLES data density plot. This plot shows that the bulk of the data starts in
2005. However, there also appear to be a few records from around 1961, which is likely
an error in the data.
Figure 15.4: The data density plot in the ACHILLES web viewer.
Another example is shown in Figure 15.5, revealing a sudden change in the prevalence
of a diabetes diagnosis code. This change coincides with changes in the reimbursement
rules in this specific country, leading to more diagnoses but probably not a true increase
in prevalence in the underlying population.
Figure 15.5: Monthly rate of diabetes coded in the ACHILLES web viewer.
DataQualityDashboard::executeDqChecks(connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDbSchema,
resultsDatabaseSchema = cdmDbSchema,
cdmSourceName = "My database",
outputFolder = "My output")
The above function will execute all available data quality checks on the schema specified.
It will then write a table to the resultsDatabaseSchema which we have here set to
the same schema as the CDM. This table will include all information about each check
run including the CDM table, CDM field, check name, check description, Kahn category
and subcategory, number of violating rows, the threshold level, and whether the check
passes or fails, among others. In addition to a table this function also writes a JSON file
to the location specified as the outputFolder. Using this JSON file we can launch a
web viewer to inspect the results.
viewDqDashboard(jsonPath)
The variable jsonPath should be the path to the JSON file containing the results of the data quality checks.
When you first open the Dashboard you will be presented with the overview table, as seen
in Figure 15.6. This will show you the total number of checks run in each Kahn category
broken out by context, the number and percent that pass in each, as well as the overall
pass rate.
Figure 15.6: Overview of Data Quality Checks in the Data Quality Dashboard.
Clicking on Results in the lefthand menu will take you to the drilldown results for each
check that was run (Figure 15.7). In this example, the table shows a check run to
determine the completeness of individual CDM tables, i.e., the number and percent of
persons in the CDM that have at least one record in the specified table. In this case
the five tables listed are all empty which the Dashboard counts as a fail. Clicking on
the icon will open a window that displays the exact query that was run on your data
to produce the results listed. This allows for easy identification of the rows that were
considered failures by the Dashboard.
Figure 15.7: Drilldown into Data Quality Checks in the Data Quality Dashboard.
library(MethodEvaluation)
json <- readChar("cohort.json", file.info("cohort.json")$size)
sql <- readChar("cohort.sql", file.info("cohort.sql")$size)
checkCohortSourceCodes(connectionDetails,
cdmDatabaseSchema = cdmDbSchema,
cohortJson = json,
cohortSql = sql,
outputFile = "output.html")
We can open the output file in a web browser as shown in Figure 15.8. Here we see
that the angioedema cohort definition has two concept sets: “Inpatient or ER visit”, and
“Angioedema”. In this example database the visits were found through database-specific
source codes “ER” and “IP”, that are not in the Vocabulary, although they were mapped
during the ETL to standard concepts. We also see that angioedema is found through one
ICD-9 and two ICD-10 codes. We clearly see the point in time of the cutover between
the two coding systems when we look at the sparklines for the individual codes, but for
the concept set as a whole there is no discontinuity at that time.
Next, we can search for orphan source codes, which are source codes that do not map to
standard concept codes. Here we look for the Standard Concept “Angioedema,” and then
we look for any codes and concepts that have “Angioedema” or any of the synonyms we
provide as part of their name:
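A sketch of such a call is shown below; the findOrphanSourceCodes function is part of the MethodEvaluation package, but the argument names and the synonym list here are assumptions for illustration:

orphans <- findOrphanSourceCodes(connectionDetails,
                                 cdmDatabaseSchema = cdmDbSchema,
                                 conceptName = "Angioedema",
                                 conceptSynonyms = c("Angioneurotic edema",
                                                     "Giant urticaria"))
View(orphans)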
The only potential orphan found that is actually used in the data is “Angioneurotic edema,
sequela”, which should not be mapped to angioedema. This analysis therefore did not
reveal any missing codes.
15.7 Summary
15.8 Exercises
Prerequisites
For these exercises we assume R, RStudio and Java have been installed as described in
Section 8.4.5. Also required are the SqlRender, DatabaseConnector, ACHILLES, and
Eunomia packages, which can be installed using:
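# One way to install these packages; the OHDSI packages are installed from GitHub here,
# and the exact installation mechanism may differ depending on package versions:
install.packages(c("SqlRender", "DatabaseConnector", "remotes"))
remotes::install_github("ohdsi/Achilles")
remotes::install_github("ohdsi/Eunomia")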
The Eunomia package provides a simulated dataset in the CDM that will run inside your
local R session. The connection details can be obtained using:
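# Connection details for the simulated database bundled with the Eunomia package:
connectionDetails <- Eunomia::getEunomiaConnectionDetails()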
Chapter 16
Clinical Validity
Chapter leads: Joel Swerdel, Seng Chan You, Ray Chen & Patrick Ryan
The likelihood of transforming matter into energy is something akin to shooting
birds in the dark in a country where there are only a few birds. (Einstein, 1935)
The vision of OHDSI is “A world in which observational research produces a compre
hensive understanding of health and disease.” Retrospective designs provide a vehicle
for research using existing data but can be riddled with threats to various aspects of valid
ity as discussed in Chapter 14. It is not easy to isolate clinical validity from quality of data
and statistical methodology, but here we will focus on three aspects in terms of clinical
validity: Characteristics of health care databases, Cohort validation, and Generalizability
of the evidence. Let’s go back to the example of population-level estimation (Chapter
12). We tried to answer the question “Do ACE inhibitors cause angioedema compared
to thiazide or thiazide-like diuretics?” In that example, we demonstrated that ACE
inhibitors caused more angioedema than thiazide or thiazide-like diuretics. This chapter is
dedicated to answering the question: “To what extent does the analysis conducted match the
clinical intention?”
and payers whereby services provided to patients by providers are sufficiently justified to
enable agreement on payments by the responsible parties. Data elements in EHR records
are captured to support clinical care and administrative operations, and they commonly
only reflect the information that providers within a given health system feel are necessary
to document the current service and provide necessary context for anticipated follow-up
care within their health system. They may not represent a patient’s complete medical
history and may not integrate data from across health systems.
To generate reliable evidence from observational data, it is useful for a researcher to un
derstand the journey that the data undergoes from the moment that a patient seeks care
through the moment that the data reflecting that care are used in an analysis. As an exam
ple, “drug exposure” can be inferred from various sources of observational data, including
prescriptions written by clinicians, pharmacy dispensing records, hospital procedural ad
ministrations, or patient selfreported medication history. The source of data can impact
our level of confidence in the inference we draw about which patients did or did not use
the drug, as well as when and for how long. The data capture process can result in under
estimation of exposure, such as if free samples or over-the-counter drugs are not recorded,
or overestimation of exposure, such as if a patient doesn’t fill the prescription written or
doesn’t adherently consume the prescription dispensed. Understanding the potential bi
ases in exposure and outcome ascertainment, and more ideally quantifying and adjusting
for these measurement errors, can improve our confidence in the validity of the evidence
we draw from the data we have available.
This description highlights several attributes useful to reinforce when considering clinical
validity: 1) it makes it clear that we are talking about something that is observable (and
therefore possible to be captured in our observational data); 2) it includes the notion of
time in the phenotype specification (since a state of a person can change); 3) it draws
a distinction between the phenotype as the desired intent vs. the phenotype algorithm,
which is the implementation of the desired intent.
OHDSI has adopted the term “cohort” to define the set of persons satisfying one or more
inclusion criteria for a duration of time. A “cohort definition” represents the logic neces
sary to instantiate a cohort against an observational database. In this regard, the cohort
definition (or phenotype algorithm) is used to produce a cohort, which is intended to rep
resent the phenotype, being the persons who belong to the observable clinical state of
interest.
Most types of observational analyses, including clinical characterization, population-level
effect estimation, and patient-level prediction, require one or more cohorts to be estab
lished as part of the study process. To evaluate the validity of the evidence produced by
these analyses, one must consider this question for each cohort: to what extent do the per
sons identified in the cohort based on the cohort definition and the available observational
data accurately reflect the persons who truly belong to the phenotype?
To return to the population-level estimation example (Chapter 12), “Do ACE inhibitors
cause angioedema compared to thiazide or thiazidelike diuretics?”, we must define three
cohorts: a target cohort of persons who are new users of ACE inhibitors, a comparator
cohort of persons who are new users of thiazide diuretics, and an outcome cohort of per
sons who develop angioedema. How confident are we that all use of ACE inhibitors or
thiazide diuretics is completely captured, such that “new users” can be identified by the
first observed exposure, without concern of prior (but unobserved) use? Can we com
fortably infer that persons who have a drug exposure record for ACE inhibitors were in
fact exposed to the drug, and those without a drug exposure were indeed unexposed? Is
there uncertainty in defining the duration of time that a person is classified in the state of
“ACE inhibitor use,” either when inferring cohort entry at the time the drug was started or
cohort exit when the drug was discontinued? Have persons with a condition occurrence
record of “Angioedema” actually experienced rapid swelling beneath the skin, differen
tiated from other types of dermatologic allergic reactions? What proportion of patients
who developed angioedema received medical attention that would give rise to the observa
tional data used to identify these clinical cases based on the cohort definition? How well
can the angioedema events which are potentially druginduced be disambiguated from
the events known to be caused by other agents, such as food allergy or viral infection? Is
disease onset sufficiently well captured that we have confidence in drawing a temporal
association between exposure status and outcome incidence? Answering these types of
questions is at the heart of clinical validity.
In this chapter, we will discuss the methods for validating cohort definitions. We first
describe the metrics used to measure the validity of a cohort definition. Next, we describe
two methods to estimate these metrics: 1) clinical adjudication through source record
verification, and 2) PheValuator, a semi-automated method using diagnostic predictive
modeling.
The true and false results from the cohort definition are determined by applying the defi
nition to a group of persons. Those included in the definition are considered positive for
the health condition and are labeled “True.” Those persons not included in the cohort def
inition are considered negative for the health condition and are labeled “False”. While the
absolute truth of a person’s health state considered in the cohort definition is very difficult
to determine, there are multiple methods to establish a reference gold standard, two of
which will be described later in the chapter. Regardless of the method used, the labeling
of these persons is the same as described for the cohort definition.
In addition to errors in the binary indication of phenotype designation, the timing of the
health condition may also be incorrect. For example, while the cohort definition may
correctly label a person as belonging to a phenotype, the definition may incorrectly specify
the date and time when a person without the condition became a person with the condition.
This error would add bias to studies using survival analysis results, e.g., hazard ratios, as
an effect measure.
The next step in the process is to assess the concordance of the gold standard with the
cohort definition. Those persons that are labeled by both the gold standard method and
the cohort definition as “True” are called “True Positives.” Those persons that are labeled
by the gold standard method as “False” and by the cohort definition as “True” are called
“False Positives,” i.e., the cohort definition misclassified these persons as having the con
dition when they do not. Those persons that are labeled by both the gold standard method
and the cohort definition as “False” are called “True Negatives.” Those persons that are
labeled by the gold standard method as “True” and by the cohort definition as “False” are
called “False Negatives,” i.e., the cohort definition incorrectly classified these persons
as not having the condition, when it fact they do belong to the phenotype. Using the
counts from the four cells in the confusion matrix, we can quantify the accuracy of the
cohort definition in classifying phenotype status in a group of persons. There are standard
performance metrics for measuring cohort definition performance:
1. Sensitivity of the cohort definition – what proportion of the persons who truly
belong to the phenotype in the population were correctly identified to have the
health outcome based on the cohort definition? This is determined by the following
formula:
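Sensitivity = True Positives / (True Positives + False Negatives)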
2. Specificity of the cohort definition – what proportion of the persons who do not
belong to the phenotype in the population were correctly identified to not have the
health outcome based on the cohort definition? This is determined by the following
formula:
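Specificity = True Negatives / (True Negatives + False Positives)
3. Positive predictive value (PPV) of the cohort definition – what proportion of the
persons identified by the cohort definition as having the health condition truly
belong to the phenotype? This is determined by the following formula:
PPV = True Positives / (True Positives + False Positives)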
4. Negative predictive value (NPV) of the cohort definition – what proportion of the
persons identified by the cohort definition to not have the health condition actually
did not belong to the phenotype? This is determined by the following formula:
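NPV = True Negatives / (True Negatives + False Negatives)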
Perfect scores for these measures are 100%. Due to the nature of observational data,
perfect scores are usually far from the norm. Rubbo et al. (2015) reviewed studies val
idating cohort definitions for myocardial infarction. Of the 33 studies they examined,
only one cohort definition in one dataset obtained a perfect score for PPV. Overall, 31 of
the 33 studies reported PPVs ≥ 70%. They also found, however, that of the 33 studies
only 11 reported sensitivity and 5 reported specificity. PPV is a function of sensitivity,
specificity, and prevalence. Datasets with different values for prevalence will produce
different values for PPV with sensitivity and specificity held constant. Without sensitiv
ity and specificity, correcting for bias due to imperfect cohort definitions is not possible.
Additionally, the misclassification of the health condition may be differential, meaning
the cohort definition performs differently on one group of persons relative to the compar
ison group, or non-differential, when the cohort definition performs similarly on both
comparison groups. Prior cohort definition validation studies have not tested for potential
differential misclassification, even though it can lead to strong bias in effect estimates.
Once the performance metrics have been established for the cohort definition, these may
be used to adjust the results for studies using these definitions. In theory, adjusting
study results for these measurement error estimates has been well established. In prac
tice, though, because of the difficulty in obtaining the performance characteristics, these
adjustments are rarely considered. The methods used to determine the gold standard are
described in the remainder of this section.
Source record verification, often referred to as chart review, involves the review of a person’s
records by one or more domain experts with sufficient knowledge to competently classify the clinical
condition or characteristic of interest. Chart review generally follows the following steps:
1. Obtain permission from local institutional review board (IRB) and/or persons as
needed to conduct study including chart review.
2. Generate cohort using cohort definition to be evaluated. Sample a subset of the
persons to manually review if there are insufficient resources to adjudicate the entire
cohort.
3. Identify one or more persons with sufficient clinical expertise to review person
records.
4. Determine guidelines for adjudicating whether a person is positive or negative for
the desired clinical condition or characteristic.
5. Clinical experts review and adjudicate all available data for the persons within the
sample to classify each person as to whether they belong to the phenotype or not.
6. Tabulate persons according to the cohort definition classification and clinical adju
dication classification into a confusion matrix, and calculate the performance char
acteristics possible from the data collected.
Results from a chart review are typically limited to the evaluation of one performance
characteristic, positive predictive value (PPV). This is because the cohort definition un
der evaluation only generates persons that are believed to have the desired condition or
characteristics. Therefore, each person in the sample of the cohort is classified as either
a true positive or false positive based on the clinical adjudication. Without knowledge
of all persons in the phenotype in the entire population (including those not identified by
the cohort definition), it is not possible to identify the false negatives, and thereby fill in
the remainder of the confusion matrix to generate the remaining performance characteris
tics. Potential methods of identifying all persons in the phenotype across the population
include chart review of the entire database, which is generally not feasible unless the over
all population is small, or the utilization of comprehensive clinical registries in which all
true cases have already been flagged and adjudicated, such as tumor registries (see ex
ample below). Alternatively, one can sample persons who do not qualify for the cohort
definition to produce a subset of predicted negatives, and then repeating steps 3-6 of the
chart review above to check whether these patients are truly lacking the clinical condition
or characteristic of interest can identify true negatives or false negatives. This would al
low the estimation of negative predictive value (NPV), and if an appropriate estimate of
the phenotype prevalence is available, then sensitivity and specificity can be estimated.
There are a number of limitations to clinical adjudication through source record verifi
cation. As alluded to earlier, chart review can be a very time-consuming and resource-intensive
process, even just for the evaluation of a single metric such as PPV. This limi
tation significantly impedes the practicality of evaluating an entire population to fill out
a complete confusion matrix. In addition, multiple steps in the above process have the
potential to bias the results of the study. For example, if records are not equally accessible
in the EHR, if there is no EHR, or if individual patient consent is required, then the subset
under evaluation may not be truly random and could introduce sampling or selection bias.
In addition, manual adjudication is susceptible to human error or misclassification and
thereby may not represent a perfectly accurate metric. There can often be disagreement
between clinical adjudicators due to the data in the person’s record being vague, subjec
tive, or of low quality. In many studies, the process involves a majority-rules decision
for consensus, which yields a binary classification for persons that does not reflect the
inter-rater discordance.
1. Submitted proposal and obtained IRB consent for OHDSI cancer phenotyping
study.
2. Developed a cohort definition for prostate cancer: Using ATHENA and ATLAS to
explore the vocabulary, we created a cohort definition to include all patients with
a condition occurrence for Malignant Tumor of Prostate (concept ID 4163261), ex
cluding Secondary Neoplasm of Prostate (concept ID 4314337) or Non-Hodgkin’s
Lymphoma of Prostate (concept ID 4048666).
3. Generated cohort using ATLAS and randomly selected 100 patients for manual re
view, mapping each PERSON_ID back to patient MRN using mapping tables. 100
patients were selected in order to achieve our desired level of statistical precision
for the performance metric of PPV.
4. Manually reviewed records in the various EHRs—both inpatient and outpatient—
in order to determine whether each person in the random subset was a true or false
positive.
5. Manual review and clinical adjudication were performed by one physician
(although ideally in future more rigorous validation studies would be done by a
higher number of reviewers to assess for consensus and interrater reliability).
6. Determination of a reference standard was based on clinical documentation, pathol
ogy reports, labs, medications and procedures as documented in the entirety of the
available electronic patient record.
7. Patients were labeled as 1) prostate cancer 2) no prostate cancer or 3) unable to
determine.
8. A conservative estimate of PPV was calculated as: prostate cancer / (prostate cancer
+ no prostate cancer + unable to determine).
9. Then, using the tumor registry as an additional gold standard to identify a reference
standard across the entire CUIMC population, we counted the number of persons in
the tumor registry which were and were not accurately identified by the cohort def
inition, which allowed us to estimate sensitivity using these values as true positives
and false negatives.
10. Using the estimated sensitivity, PPV, and prevalence, we could then estimate
specificity for this cohort definition. As noted previously, this process was time-consuming
and resource-intensive.
A review of validation efforts for myocardial infarction (MI) cohort definitions by Rubbo
et al. (2015) found that there was significant heterogeneity in the cohort definitions used
in the studies as well as in the validation methods and the results reported. The authors
concluded that for acute myocardial infarction there is no gold standard cohort definition
available. They noted that the process was both costly and time-consuming. Due to
that limitation, most studies had small sample sizes in their validation leading to wide
variations in the estimates for the performance characteristics. They also noted that in
the 33 studies, while all the studies reported positive predictive value, only 11 studies
reported sensitivity and only five studies reported specificity. As mentioned previously,
without estimates of sensitivity and specificity, statistical correction for misclassification
bias cannot be performed.
16.4 PheValuator
The OHDSI community has developed a different approach to constructing a gold stan
dard by using diagnostic predictive models. (Swerdel et al., 2019) The general idea is to
emulate the ascertainment of the health outcome similar to the way clinicians would in a
source record validation, but in an automated way that can be applied at scale. The tool
has been developed as an opensource R package called PheValuator.1 PheValuator uses
functions from the PatientLevelPrediction package.
4. Apply the fitted model to estimate the probability of the outcome for a holdout set
of persons who will be used to evaluate cohort definition performance: The set of
predictors from the model can be applied to a person’s data to estimate the predicted
probability that the person belongs to the phenotype. We use these predictions as a
probabilistic gold standard.
5. Evaluate the performance characteristics of the cohort definitions: We compare
the predicted probability to the binary classification of a cohort definition (the test
conditions for the confusion matrix). Using the test conditions and the estimates
for the true conditions, we can fully populate the confusion matrix and estimate the
entire set of performance characteristics, i.e., sensitivity, specificity, and predictive
values.
The primary limitation to using this approach is that the estimation of the probability of
a person having the health outcome is limited by the data in the database. Depending on
the database, important information, such as clinician notes, may not be available.
Figure 16.2: An extremely specific cohort definition (xSpec) for myocardial infarction.
Figure 16.3: An extremely sensitive cohort definition (xSens) for myocardial infarction.
set the excludedConcepts parameter to 4329847, the concept Id for Myocardial infarction,
and we would also set the addDescendantsToExclude parameter to TRUE, indicating that
any descendants of the excluded concepts should also be excluded.
There are several parameters that may be used to specify the characteristics of the persons
included in the modeling process. We can set the ages of the persons included in the
modeling process by setting the lowerAgeLimit to the lower bounds of age desired in
the model and the upperAgeLimit to the upper bounds. We may wish to do this if the
cohort definitions for a planned study will be created for a certain age group. For example,
if the cohort definition to be used in a study is for Type 1 diabetes mellitus in children, you
may want to limit the ages used to develop the diagnostic predictive model to ages 5 to 17
years old for example. In doing so, we will produce a model with features that are likely
more closely related to the persons selected by the cohort definitions to be tested. We
can also specify which sex is included in the model by setting the gender parameter to
the concept ID for either male or female. By default, the parameter is set to include both
males and females. This feature may be useful in sex-specific health outcomes such as
prostate cancer. We can set the time frame for person inclusion based on the first visit in
the person’s record by setting the startDate and endDate parameters to the lower and
upper bounds of the date range, respectively. Finally, the mainPopnCohort parameter
may be used to specify a large population cohort from which all persons in the target and
outcome cohorts will be selected. In most instances this will be set to 0, indicating no
limitation on selecting persons for the target and outcome cohorts. There may be times,
however, when this parameter is useful for building a better model, possibly in cases
where the prevalence of the health outcome is extremely low, perhaps 0.01% or lower.
For example:
setwd("c:/temp")
library(PheValuator)
connectionDetails <- createConnectionDetails(
dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret")
In this example, we used the cohorts defined in the “my_results” database, specify
ing the location of the cohort table (cohortDatabaseSchema, cohortDatabaseTable
“my_results.cohort”) and where the model will find the conditions, drug exposures,
etc. to inform the model (cdmDatabaseSchema “my_cdm_data”). The persons
included in the model will be those whose first visit in the CDM is between January
1, 2010 and December 31, 2017. We are also specifically excluding the concept
IDs 312327, 314666, and their descendants which were used to create the xSpec
cohort. Their ages at the time of first visit will be between 18 and 90. With the
parameters above, the name of the predictive model output from this step will be:
“c:/temp/lr_results_5XMI_train_myCDM_ePPV0.75_20181206V1.rds”
Additional parameters can be used to restrict the persons included in this step. This could include specifying the lower and upper age limits (by setting, as ages,
the lowerAgeLimit and upperAgeLimit arguments, respectively), the sex (by setting
the gender parameter to the concept IDs for male and/or female), the starting and
ending dates (by setting, as dates, the startDate and endDate arguments, respectively),
and designating a large population from which to select the persons by setting the
mainPopnCohort to the cohort Id for the population to use.
For example:
setwd("c:/temp")
connectionDetails <- createConnectionDetails(
dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret")
In this example, the parameters specify that the function should use the model file:
“c:/temp/lr_results_5XMI_train_myCDM_ePPV0.75_20181206V1.rds” to produce the
evaluation cohort file: “c:/temp/lr_results_5XMI_eval_myCDM_ePPV0.75_20181206V1.rds”
The model and the evaluation cohort files created in this step will be used in the evaluation
of the cohort definitions provided in the next step.
The evaluation process is illustrated in Figure 16.4.
In part A of Figure 16.4, we examined the persons from the cohort definition to be tested
and found those persons from the evaluation cohort (created in the previous step) who
were included in the cohort definition (Person IDs 016, 019, 022, 023, and 025) and those
from the evaluation cohort who were not included (Person Ids 017, 018, 020, 021, and
024). For each of these included/excluded persons, we had previously determined the
probability of the health outcome using the predictive model (p(O)).
We estimated the values for True Positives, True Negatives, False Positives, and False
Negatives as follows (Part B of Figure 16.4):
1. If the cohort definition included a person from the evaluation cohort, i.e., the cohort definition considered the person a “positive,” the predicted probability for the health outcome indicated the expected value of the counts contributed by that person to the True Positives, and one minus the probability indicated the expected value of the counts contributed by that person to the False Positives. We added all the expected values of counts across persons to get the total expected value. For example, Person ID 016 had a predicted probability of 99% for the presence of the health outcome, so 0.99 was added to the True Positives (expected value of counts added 0.99) and 1.00 − 0.99 = 0.01 was added to the False Positives (0.01 expected value). This was repeated for all the persons from the evaluation cohort included in the cohort definition (i.e., Person IDs 019, 022, 023, and 025).
2. Similarly, if the cohort definition did not include a person from the evaluation cohort, i.e., the cohort definition considered the person a “negative,” one minus the predicted probability for the phenotype was the expected value of counts contributed by that person to the True Negatives, and the predicted probability itself was the expected value of counts contributed to the False Negatives. For example, Person ID 017 had a predicted probability of 1% for the presence of the health outcome (and, correspondingly, 99% for the absence of the health outcome), so 1.00 − 0.01 = 0.99 was added to the True Negatives and 0.01 was added to the False Negatives. This was repeated for all the persons from the evaluation cohort not included in the cohort definition (i.e., Person IDs 018, 020, 021, and 024).
After adding these values over the full set of persons in the evaluation cohort, we filled the four cells of the confusion matrix with the expected values of counts for each cell, and we were able to create point estimates of the phenotype algorithm performance characteristics like sensitivity, specificity, and positive predictive value (Part C of Figure 16.4). We emphasize that these expected cell counts can only be used for the point estimates, not to assess the variance of the estimates. In the example, the sensitivity, specificity, PPV, and NPV were 0.99, 0.63, 0.42, and 0.99, respectively.
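This expected-value calculation can be written in a few lines of R. The following stand-alone sketch uses made-up predicted probabilities and inclusion flags purely to illustrate the logic described above:
# p: predicted probability of the health outcome for each person in the
#    evaluation cohort; included: whether the cohort definition included them
p        <- c(0.99, 0.01, 0.02, 0.95, 0.20, 0.97, 0.05, 0.90, 0.03, 0.85)
included <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

tp <- sum(p[included])       # expected true positive count
fp <- sum(1 - p[included])   # expected false positive count
tn <- sum(1 - p[!included])  # expected true negative count
fn <- sum(p[!included])      # expected false negative count

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)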
Determining the performance characteristics of the cohort definition uses the function testPhenotype. This function uses the output from the prior two steps where we created the model and evaluation cohorts. We would set the modelFileName parameter to the RDS file output from the createPhenoModel function, in this example “c:/temp/lr_results_5XMI_train_myCDM_ePPV0.75_20181206V1.rds”, and the resultsFileName parameter to the RDS file output from the createEvalCohort function, in this example “c:/temp/lr_results_5XMI_eval_myCDM_ePPV0.75_20181206V1.rds”. To test the cohort definition we wish to use in our study, we set cohortPheno to the cohort ID for that cohort definition. We can set the phenText parameter to a human-readable description of the cohort definition, such as “MI Occurrence, Hospital In-Patient Setting”, and the testText parameter to a human-readable description of the xSpec definition, such as “5 X MI”. The output from this step is a data frame that contains the performance characteristics for the cohort definition tested. The cutPoints parameter is a list of values that will be used to develop the performance characteristics results. The performance characteristics are usually calculated using the “expected values” as described in Figure 16.4. To retrieve the performance characteristics based on the expected values, we include “EV” in the list for the cutPoints parameter. We may also want to see the performance characteristics based on specific predicted probabilities, i.e., cut points. For example, if we wanted to see the performance characteristics when all those at or above a predicted probability of 0.5 are considered positive for the health outcome and all those under a predicted probability of 0.5 are considered negative, we would add “0.5” to the cutPoints parameter list. For example:
setwd("c:/temp")
connectionDetails <- createConnectionDetails(
dbms = "postgresql",
server = "localhost/ohdsi",
user = "joe",
password = "supersecret")
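A call to testPhenotype built from the parameters described above might look roughly like this (a sketch; the connection and schema arguments are assumed to follow the earlier steps, and the cohort ID is hypothetical):
phenoResult <- testPhenotype(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "my_cdm_data",
  cohortDatabaseSchema = "my_results",
  cohortDatabaseTable = "cohort",
  modelFileName = "c:/temp/lr_results_5XMI_train_myCDM_ePPV0.75_20181206V1.rds",
  resultsFileName = "c:/temp/lr_results_5XMI_eval_myCDM_ePPV0.75_20181206V1.rds",
  cohortPheno = 1769702,    # hypothetical ID of the cohort definition to test
  phenText = "MI Occurrence, Hospital In-Patient Setting",
  testText = "5 X MI",
  cutPoints = c("EV", 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))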
In this example, a wide range of prediction thresholds are provided (cutPoints), including the expected value (“EV”). Given that parameter setting, the output from this step will include performance characteristics at each of these prediction thresholds as well as at the expected value.
The results of an observational study are affected not only by its design and analytic methods, but also by the choice of data source. Madigan et al. (2013b) demonstrated that the choice of database affects the results of an observational study. They systematically investigated heterogeneity in the results for 53 drug-outcome pairs and two study designs (cohort studies and self-controlled case series) across 10 observational databases. Even though they held the study design constant, substantial heterogeneity in effect estimates was observed.
Across the OHDSI network, observational databases vary considerably in the populations they represent (e.g., pediatric vs. elderly, privately-insured employees vs. publicly-insured unemployed), the care settings where data are captured (e.g., inpatient vs. outpatient, primary vs. secondary/specialty care), the data capture processes (e.g., administrative claims, EHRs, clinical registries), and the national and regional health systems in which care is delivered. These differences can manifest as heterogeneity observed when studying disease and the effects of medical interventions, and can also influence the confidence we have in the quality of each data source that may contribute evidence within a network study. While all databases within the OHDSI network are standardized to the CDM, it is important to reinforce that standardization does not reduce the true inherent heterogeneity that is present across populations, but simply provides a consistent framework to investigate and better understand the heterogeneity across the network. The OHDSI research network provides the environment to apply the same analytic process on various databases across the world, so that researchers can interpret results across multiple data sources while holding other methodological aspects constant. OHDSI’s collaborative approach to open science in network research, where researchers across participating data partners work together alongside those with clinical domain knowledge and methodologists with analytical expertise, is one way of reaching a collective level of understanding of the clinical validity of data across a network. This shared understanding should serve as a foundation for building confidence in the evidence generated using these data.
16.6 Summary
Chapter 17
Software Validity
An analysis that involves manual steps is problematic not only because it is not reproducible, but also because it lacks transparency; we do not know exactly what was done to produce the result, so we also cannot verify that no mistakes were made.
Every analysis generating evidence must therefore be fully automated. By automated we
mean the analysis should be implemented as a single script, and we should be able to redo
the entire analysis from database in CDM format to results, including tables and figures,
with a single command. The analysis can be of arbitrary complexity, perhaps producing
just a single count, or generating empirically calibrated estimates for millions of research
questions, but the same principle applies. The script can invoke other scripts, which in
turn can invoke even lower-level analysis processes.
The analysis script can be implemented in any computer language, although in OHDSI
the preferred language is R. Thanks to the DatabaseConnector R package, we can connect
directly to the data in CDM format, and many advanced analytics are available through
the other R packages in the OHDSI Methods Library.
Users can install the Methods Library in R directly from the master branches in the GitHub repositories, or through a system known as “drat” that is always up-to-date with the master branches. A number of the Methods Library packages are available through R’s Comprehensive R Archive Network (CRAN), and this number is expected to increase over time.
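For example, installing one of the packages (CohortMethod is used here purely as an illustration, and the “OHDSI” drat repository name is an assumption to verify against the package installation instructions) could be done in any of these ways:
# From the master branch on GitHub:
install.packages("remotes")
remotes::install_github("OHDSI/CohortMethod")

# From the OHDSI drat repository:
install.packages("drat")
drat::addRepo("OHDSI")
install.packages("CohortMethod")

# From CRAN, once the package has been published there:
install.packages("CohortMethod")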
Reasonable software development and testing methodologies are employed by OHDSI to
maximize the accuracy, reliability and consistency of the Methods Library performance.
Importantly, as the Methods Library is released under the terms of the Apache License
V2, all source code underlying the Methods Library, whether it be in R, C++, SQL, or
Java is available for peer review by all members of the OHDSI community, and the public
in general. Thus, all the functionality embodied within the Methods Library is subject to
continuous critique and improvement relative to its accuracy, reliability and consistency.
17.2.2 Documentation
All packages in the Methods Library are documented through R’s internal documentation
framework. Each package has a package manual that describes every function available in
the package. To promote alignment between the function documentation and the function implementation, the documentation is generated directly from annotations maintained in the source code.
All Methods Library source code is available to end users. Feedback from the community is facilitated using GitHub’s issue tracking system and the OHDSI forums.
All leaders of the OHDSI Population-Level Estimation Workgroup and OHDSI Patient-Level Prediction Workgroup hold PhDs from accredited academic institutions and have published extensively in peer-reviewed journals.
The OHDSI Methods Library is hosted on the GitHub system. GitHub’s disaster recovery
facilities are described at https://github.com/security.
A large set of automated validation tests is maintained and upgraded by OHDSI to enable
the testing of source code against known data and known results. Each test begins with
specifying some simple input data, then executes a function in one of the packages on this
input, and evaluates whether the output is exactly what would be expected. For simple
functions, the expected result is often obvious (for example when performing propensity
score matching on example data containing only a few subjects); for more complicated
functions the expected result may be generated using combinations of other functions
available in R (for example, Cyclops, our large-scale regression engine, is tested among
others by comparing results on simple problems with other regression routines in R). We
aim for these tests in total to cover 100% of the lines of executable source code.
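To illustrate the style of such a test (this is a sketch, not actual Methods Library test code), a unit test written with the testthat framework could look like this:
library(testthat)

test_that("1-to-1 matching never selects more comparators than targets", {
  # Simple, hand-crafted input for which the expected behavior is obvious
  population <- data.frame(
    rowId = 1:6,
    treatment = c(1, 1, 0, 0, 0, 0),
    propensityScore = c(0.1, 0.9, 0.1, 0.2, 0.9, 0.8))
  matched <- CohortMethod::matchOnPs(population, maxRatio = 1)
  expect_lte(sum(matched$treatment == 0), sum(matched$treatment == 1))
})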
These tests are automatically performed when changes are made to a package (specifically,
when changes are pushed to the package repository). Any errors noted during testing
automatically trigger emails to the leadership of the Workgroups, and must be resolved
prior to release of a new version of a package. The source code and expected results for
these tests are available for review and use in other applications as may be appropriate.
These tests are also available to end users and/or system administrators and can be run as
part of their installation process to provide further documentation and objective evidence
as to the accuracy, reliability and consistency of their installation of the Methods Library.
17.3.2 Simulation
For more complex functionality it is not always obvious what the expected output should
be given the input. In these cases simulations are sometimes used, generating input given
a specific statistical model, and establishing whether the functionality produces results
in line with this known model. For example, in the SelfControlledCaseSeries package
simulations are used to verify that the method is able to detect and appropriately model
temporal trends in simulated data.
17.4 Summary
Chapter 18
Method Validity
Many of these diagnostics are generated by the packages in the OHDSI Methods Library. For example, Section 12.9 lists a wide range of diagnostics generated by the CohortMethod package, including the following (a sketch of the corresponding function calls follows this list):
• Propensity score distribution to assess initial comparability of cohorts.
• Propensity model to identify potential variables that should be excluded from the
model.
• Covariate balance to evaluate whether propensity score adjustment has made the
cohorts comparable (as measured through baseline covariates).
• Attrition to observe how many subjects were excluded in the various analysis steps,
which may inform on the generalizability of the results to the initial cohorts of
interest.
• Power to assess whether enough data is available to answer the question.
• Kaplan-Meier curve to assess typical time to onset, and whether the proportionality assumption underlying Cox models is met.
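A sketch of how these diagnostics are typically produced with the CohortMethod package, assuming a cohortMethodData object, a propensity score data frame ps, and a matched population matchedPop have already been created as described in Chapter 12:
library(CohortMethod)

plotPs(ps)                                    # propensity score distribution
getPsModel(ps, cohortMethodData)              # covariates driving the propensity model
balance <- computeCovariateBalance(matchedPop, cohortMethodData)
plotCovariateBalanceScatterPlot(balance)      # covariate balance before/after adjustment
drawAttritionDiagram(matchedPop)              # attrition across the analysis steps
computeMdrr(matchedPop, alpha = 0.05, power = 0.8)  # power: minimum detectable relative risk
plotKaplanMeier(matchedPop)                   # time to onset and proportionality check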
Other study designs require different diagnostics to test the different assumptions in those designs. For example, for the self-controlled case series (SCCS) design we may check the necessary assumption that the end of observation is independent of the outcome. This assumption is often violated in the case of serious, potentially lethal, events such as myocardial infarction. We can evaluate whether the assumption holds by generating the plot shown in Figure 18.1, which shows histograms of the time to observation period end for those that are censored, and those that are uncensored. In our data we consider those whose observation period ends at the end date of data capture (the date when observation stopped for the entire database, for example the date of extraction, or the study end date) to be uncensored, and all others to be censored. In Figure 18.1 we see only minor differences between the two distributions, suggesting our assumption holds.
Figure 18.1: Time to observation end for those that are censored, and those that are uncensored.
We should select negative controls that are comparable to our hypothesis of interest, which means we typically select exposure-outcome pairs that either have the same exposure as the hypothesis of interest (so-called “outcome controls”) or the same outcome (“exposure controls”). Our negative controls should further meet these criteria:
• The exposure should not cause the outcome. One way to think of causation is to
think of the counterfactual: could the outcome be caused (or prevented) if a patient
was not exposed, compared to if the patient had been exposed? Sometimes this
is clear, for example ACEi are known to cause angioedema. Other times this is
far less obvious. For example, a drug that may cause hypertension can therefore
indirectly cause cardiovascular diseases that are a consequence of the hypertension.
• The exposure should also not prevent or treat the outcome. This is just another
causal relationship that should be absent if we are to believe the true effect size
(e.g. the hazard ratio) is 1.
• The negative control should exist in the data, ideally with sufficient numbers. We
try to achieve this by prioritizing candidate negative controls based on prevalence.
• Negative controls should ideally be independent. For example, we should avoid
having negative controls that are either ancestors of each other (e.g. “ingrown nail”
and “ingrown nail of foot”) or siblings (e.g. “fracture of left femur” and “fracture
of right femur”).
• Negative controls should ideally have some potential for bias. For example, the last digit of someone’s social security number is basically a random number; it is not caused by any exposure, but because it has no potential for bias it would tell us little about a method’s ability to address bias, making it a poor negative control.
If we were simply to add simulated outcomes during exposure, these new outcomes would not be confounded, and we may therefore be optimistic in our evaluation of our capacity to deal with confounding for positive controls. To preserve confounding, we want the new outcomes to show similar associations with baseline subject-specific covariates as the original outcomes. To achieve this, for each outcome we train a model to predict the survival rate with respect to the outcome during exposure using covariates captured prior to exposure. These covariates include demographics, as well as all recorded diagnoses, drug exposures, measurements, and medical procedures. An L1-regularized Poisson regression (Suchard et al., 2013) using 10-fold cross-validation to select the regularization hyperparameter fits the prediction model. We then use the predicted rates to sample simulated outcomes during exposure to increase the true effect size to the desired magnitude. The resulting positive control thus contains both real and simulated outcomes. Figure 18.2 depicts this process. Note that although this procedure simulates several important sources of bias, it does not capture all. For example, some effects of measurement error are not present. The synthetic positive controls imply constant positive predictive value and sensitivity, which may not be true in reality.
Although we refer to a single true “effect size” for each control, different methods estimate
different statistics of the treatment effect. For negative controls, where we believe no
causal effect exists, all such statistics, including the relative risk, hazard ratio, odds ratio,
incidence rate ratio, both conditional and marginal, as well as the average treatment effect
in the treated (ATT) and the overall average treatment effect (ATE) will be identical to 1.
Our process for creating positive controls synthesizes outcomes with a constant incidence
rate ratio over time and between patients, using a model conditioned on the patient where
this ratio is held constant, up to the point where the marginal effect is achieved. The true
effect size is thus guaranteed to hold as the marginal incidence rate ratio in the treated.
Under the assumption that our outcome model used during synthesis is correct, this also
holds for the conditional effect size and the ATE. Since all outcomes are rare, odds ratios
are all but identical to the relative risk.
Using the estimates for the negative and positive controls, we can compute the following metrics:
• Area Under the receiver operator Curve (AUC): the ability to discriminate between positive and negative controls.
• Coverage: how often the true effect size is within the 95% confidence interval.
• Mean precision: precision is computed as 1 / (standard error)²; higher precision means narrower confidence intervals. We use the geometric mean to account for the skewed distribution of the precision.
• Mean squared error (MSE): mean squared error between the log of the effect size point estimate and the log of the true effect size.
• Type 1 error: for negative controls, how often was the null rejected (at α = 0.05)? This is equivalent to the false positive rate and 1 − specificity.
• Type 2 error: for positive controls, how often was the null not rejected (at α = 0.05)? This is equivalent to the false negative rate and 1 − sensitivity.
• Non-estimable: for how many of the controls was the method unable to produce an estimate? There can be various reasons why an estimate cannot be produced, for example because there were no subjects left after propensity score matching, or because no subjects remained having the outcome.
Depending on our use case, we can evaluate whether these operating characteristics are
suitable for our goal. For example, if we wish to perform signal detection, we may care
about type 1 and type 2 error, or if we are willing to modify our 𝛼 threshold, we may
inspect the AUC instead.
Formally, we assume that $\beta_i$, the bias associated with negative control $i$, comes from a Gaussian distribution:

$$\beta_i \sim N(\mu, \sigma^2)$$

where $N(a, b)$ denotes a Gaussian distribution with mean $a$ and variance $b$, and we estimate $\mu$ and $\sigma^2$ by maximizing the following likelihood:

$$L(\mu, \sigma | \hat{\theta}, \hat{\tau}) \propto \prod_{i=1}^{n} \int p(\hat{\theta}_i | \beta_i, \theta_i, \hat{\tau}_i) \, p(\beta_i | \mu, \sigma) \, \mathrm{d}\beta_i$$

Let $\hat{\theta}_{n+1}$ denote the log of the effect estimate for a new outcome of interest, and let $\hat{\tau}_{n+1}$ denote the corresponding estimated standard error. Under the null hypothesis, and assuming the bias for the new outcome arises from the same distribution, we have:

$$\hat{\theta}_{n+1} \sim N(\hat{\mu}, \hat{\sigma}^2 + \hat{\tau}^2_{n+1})$$
When $\hat{\theta}_{n+1}$ is smaller than $\hat{\mu}$, the one-sided calibrated p-value for the new pair is then

$$\phi\left(\frac{\hat{\theta}_{n+1} - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \hat{\tau}^2_{n+1}}}\right)$$

where $\phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. When $\hat{\theta}_{n+1}$ is bigger than $\hat{\mu}$, the one-sided calibrated p-value is then

$$1 - \phi\left(\frac{\hat{\theta}_{n+1} - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \hat{\tau}^2_{n+1}}}\right)$$
Formally, we assume that $\beta_i$, the bias associated with pair $i$, again comes from a Gaussian distribution, but this time using a mean and standard deviation that are linearly related to $\theta_i$, the true effect size:

$$\beta_i \sim N(\mu(\theta_i), \sigma^2(\theta_i))$$

where $\mu(\theta_i) = a + b \times \theta_i$ and $\sigma^2(\theta_i) = c + d \times |\theta_i|$. We estimate $a$, $b$, $c$, and $d$ by maximizing the following likelihood:

$$l(a, b, c, d | \theta, \hat{\theta}, \hat{\tau}) \propto \prod_{i=1}^{n} \int p(\hat{\theta}_i | \beta_i, \theta_i, \hat{\tau}_i) \, p(\beta_i | a, b, c, d, \theta_i) \, \mathrm{d}\beta_i ,$$
We compute a calibrated CI that uses the systematic error model. Let $\hat{\theta}_{n+1}$ again denote the log of the effect estimate for a new outcome of interest, and let $\hat{\tau}_{n+1}$ denote the corresponding estimated standard error. From the assumptions above, and assuming $\beta_{n+1}$ arises from the same systematic error model, we have:

$$\hat{\theta}_{n+1} \sim N\!\left(\theta_{n+1} + \hat{a} + \hat{b} \times \theta_{n+1},\; \hat{c} + \hat{d} \times |\theta_{n+1}| + \hat{\tau}^2_{n+1}\right).$$
We find the lower bound of the calibrated 95% CI by solving this equation for $\theta_{n+1}$:

$$\Phi\!\left(\frac{\hat{\theta}_{n+1} - \theta_{n+1} - \hat{a} - \hat{b} \times \theta_{n+1}}{\sqrt{\hat{c} + \hat{d} \times |\theta_{n+1}| + \hat{\tau}^2_{n+1}}}\right) = 0.025,$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. We find the upper bound similarly for probability 0.975. We define the calibrated point estimate by using probability 0.5.
Both p-value calibration and confidence interval calibration are implemented in the EmpiricalCalibration package.
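For example, given estimates for the negative controls (and, for CI calibration, controls with known true effect sizes), the calibration calls look roughly like this (object names are illustrative):
library(EmpiricalCalibration)

# P-value calibration: fit the empirical null on the negative control estimates
null <- fitNull(logRr = ncEstimates$logRr, seLogRr = ncEstimates$seLogRr)
calibrateP(null, oiEstimates$logRr, oiEstimates$seLogRr)

# Confidence interval calibration: fit the systematic error model on controls
# with known true effect sizes, then calibrate the CI of the outcome of interest
model <- fitSystematicErrorModel(logRr = controlEstimates$logRr,
                                 seLogRr = controlEstimates$seLogRr,
                                 trueLogRr = controlEstimates$trueLogRr)
calibrateConfidenceInterval(oiEstimates$logRr, oiEstimates$seLogRr, model)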
Applying the same study design across different databases can produce vastly different effect size estimates (Madigan et al., 2013b), suggesting that either the effect differs greatly for different populations, or that the design does not adequately address the different biases found in the different databases. In fact, we observe that accounting for residual bias in a database through empirical calibration of confidence intervals can greatly reduce between-study heterogeneity. (Schuemie et al., 2018a)
One way to express between-database heterogeneity is the $I^2$ score, describing the percentage of total variation across studies that is due to heterogeneity rather than chance. (Higgins et al., 2003) A naive categorization of values for $I^2$ would not be appropriate for all circumstances, although one could tentatively assign adjectives of low, moderate, and high to $I^2$ values of 25%, 50%, and 75%. In a study estimating the effects of many depression treatments using a new-user cohort design with large-scale propensity score adjustment, Schuemie et al. (2018b) observed only 58% of the estimates to have an $I^2$ below 25%. After empirical calibration this increased to 83%.
To evaluate the validity of a specific study, the diagnostics and control outcomes available in the various packages should be used, as described in the next sections.
Figure 18.3: A concept set containing the concepts defining the target and comparator
exposures.
Next, we go to the “Explore Evidence” tab, and click on the button. Generating the evidence overview will take a few minutes, after which you can click on the button. This will open the list of outcomes as shown in Figure 18.4.
This list shows condition concepts, along with an overview of the evidence linking the
condition to any of the exposures we defined. For example, we see the number of publications that link the exposures to the outcomes found in PubMed using various strategies, the
number of product labels of our exposures of interest that list the condition as a possible
adverse effect, and the number of spontaneous reports. By default the list is sorted to show candidate negative controls first. It is then sorted by the “Sort Order,” which represents the prevalence of the condition in a collection of observational databases. The higher the Sort Order, the higher the prevalence. Although the prevalence in these databases might not correspond with the prevalence in the database in which we wish to run the study, it is likely a good approximation.
Figure 18.4: Candidate control outcomes with an overview of the evidence found in literature, product labels, and spontaneous reports.
The next step is to manually review the candidate list, typically starting at the top, so with
the most prevalent condition, and working our way down until we are satisfied we have
enough. One typical way to do this is to export the list to a CSV (comma separated values)
file, and have clinicians review these, considering the criteria mentioned in Section 18.2.1.
For our example study we select the 76 negative controls listed in Appendix C.1.
library(MethodEvaluation)
# Create a data frame with all negative control exposure-outcome pairs,
# using only the target exposure (ACEi = 1). The vector ncs is assumed to
# hold the concept IDs of the selected negative control outcomes.
eoPairs <- data.frame(exposureId = 1,
                      outcomeId = ncs)
Note that we must mimic the time-at-risk settings used in our estimation study design. The synthesizePositiveControls function will extract information about the exposures and negative control outcomes, fit outcome models per exposure-outcome pair, and synthesize outcomes. The positive control outcome cohorts will be added to the cohort table specified by cohortDbSchema and cohortTable. The resulting pcs data frame contains the information on the synthesized positive controls.
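Such a call might look roughly like the sketch below; apart from exposureOutcomePairs, the effect sizes, and the time-at-risk settings implied by the text, the argument names and values are assumptions to check against the MethodEvaluation documentation:
pcs <- synthesizePositiveControls(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  outcomeDatabaseSchema = cohortDbSchema,   # negative control outcome cohorts
  outcomeTable = cohortTable,
  outputDatabaseSchema = cohortDbSchema,    # assumed names for where the synthetic
  outputTable = cohortTable,                # positive control cohorts are written
  exposureOutcomePairs = eoPairs,
  effectSizes = c(1.5, 2, 4),               # illustrative target effect sizes
  riskWindowStart = 0,                      # mimic the study's time-at-risk settings
  riskWindowEnd = 0,
  addExposureDaysToEnd = TRUE)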
Next we must execute the same study used to estimate the effect of interest to also estimate effects for the negative and positive controls. Setting the set of negative controls in the comparisons dialog in ATLAS instructs ATLAS to compute estimates for these controls. Similarly, specifying that positive controls be generated in the Evaluation Settings includes these in our analysis. In R, the negative and positive controls should be treated as any other outcome. All estimation packages in the OHDSI Methods Library readily allow estimation of many effects in an efficient manner.
Figure 18.5: Estimates for the negative (true hazard ratio = 1) and positive controls (true
hazard ratio > 1). Each dot represents a control. Estimates below the dashed line have a
confidence interval that doesn’t include the true effect size.
Based on these estimates we can compute the metrics shown in Table 18.1 using the
computeMetrics function in the MethodEvaluation package.
Metric           Value
AUC              0.96
Coverage         0.97
Mean precision   19.33
MSE              2.08
Type 1 error     0.00
Type 2 error     0.18
Non-estimable    0.08
We see that coverage and type 1 error are very close to their nominal values of 95% and
5%, respectively, and that the AUC is very high. This is certainly not always the case.
Note that although in Figure 18.5 not all confidence intervals include one when the true
hazard ratio is one, the type 1 error in Table 18.1 is 0%. This is an exceptional situation,
caused by the fact that confidence intervals in the Cyclops package are estimated using
likelihood profiling, which is more accurate than traditional methods but can result in
asymmetric confidence intervals. The p-value instead is computed assuming symmetrical confidence intervals, and this is what was used to compute the type 1 error.
library(EmpiricalCalibration)
plotCalibrationEffect(logRrNegatives = ncEstimates$logRr,
seLogRrNegatives = ncEstimates$seLogRr,
logRrPositives = oiEstimates$logRr,
seLogRrPositives = oiEstimates$seLogRr,
showCis = TRUE)
Figure 18.6: P-value calibration: estimates below the dashed line have a conventional p < 0.05. Estimates in the shaded area have calibrated p < 0.05. The narrow band around the edge of the shaded area denotes the 95% credible interval. Dots indicate negative controls. Diamonds indicate outcomes of interest.
In Figure 18.6 we see that the shaded area almost exactly overlaps with the area denoted by the dashed lines, indicating hardly any bias was observed for the negative controls. One of the outcomes of interest (AMI) is above the dashed line and the shaded area, indicating we cannot reject the null according to both the uncalibrated and calibrated p-value. The other outcome (angioedema) clearly stands out from the negative controls, and falls well within the area where both uncalibrated and calibrated p-values are smaller than 0.05.
oiEstimates$p
Figure 18.7: Effect size estimates and 95% confidence intervals (CI) from five different databases and a meta-analytic estimate when comparing ACE inhibitors to thiazides and thiazide-like diuretics for the risk of angioedema.
Figure 18.8: Calibrated effect size estimates and 95% confidence intervals (CI) from five different databases and a meta-analytic estimate for the hazard ratio of angioedema when comparing ACE inhibitors to thiazides and thiazide-like diuretics.
We see that the estimates from the matched and stratified analysis are in strong agreement,
with the confidence intervals for stratification falling completely inside of the confidence
intervals for matching. This suggests that our uncertainty around this design choice does
not impact the validity of our estimates. Stratification does appear to give us more power
(narrower confidence intervals), which is not surprising since matching results in loss
of data, whereas stratification does not. The price for this could be an increase in bias,
due to within-strata residual confounding, although we see no evidence of increased bias
reflected in the calibrated confidence intervals.
Study diagnostics allow us to evaluate design choices even before fully executing
a study. It is recommended not to finalize the protocol before generating and reviewing all study diagnostics. To avoid p-hacking (adjusting the design to achieve a desired result), this should be done while blinded to the effect size estimate of interest.
The OHDSI Methods Benchmark can be used to assess the performance of a method when a context-specific empirical evaluation is not (yet) available. The benchmark consists of 200 carefully selected negative controls that can be stratified into eight categories, with the controls in each category either sharing the same exposure or the same outcome. From these 200 negative controls, 600 synthetic positive controls are derived as described in Section 18.2.2. To evaluate a method, it must be used to produce effect size estimates for all controls, after which the metrics described in Section 18.2.3 can be computed. The benchmark is publicly available, and can be deployed as described in the Running the OHDSI Methods Benchmark vignette in the MethodEvaluation package.
We have run all the methods in the OHDSI Methods Library through this benchmark, with various analysis choices per method. For example, the cohort method was evaluated using propensity score matching, stratification, and weighting. This experiment was executed on four large observational healthcare databases. The results, viewable in an online Shiny app1, show that although several methods show high AUC (the ability to distinguish positive controls from negative controls), most methods in most settings demonstrate high type 1 error and low coverage of the 95% confidence interval, as shown in Figure 18.9. This emphasizes the need for empirical evaluation and calibration: if no empirical evaluation is performed, which is true for almost all published observational studies, we must assume a prior informed by the results in Figure 18.9, and conclude that it is likely that the true effect size is not contained in the 95% confidence interval!
Our evaluation of the designs in the Methods Library also shows that empirical calibration
restores type 1 error and coverage to their nominal values, although often at the cost of
increasing type 2 error and decreasing precision.
18.5 Summary
1 http://data.ohdsi.org/MethodEvalViewer/
Figure 18.9: Coverage of the 95% confidence interval for the methods in the Methods Library. Each dot represents the performance of a specific set of analysis choices. The dashed line indicates nominal performance (95% coverage). SCCS = Self-Controlled Case Series, GI = gastrointestinal, IBD = inflammatory bowel disease.
– Study diagnostics can be used to guide analytic design choices and adapt the protocol, as long as the researcher remains blinded to the effect of interest to avoid p-hacking.
Part V
OHDSI Studies
Chapter 19
Study steps
Here we aim to provide a general step-by-step guide to the design and implementation of an observational study with the OHDSI tools. We will break out each stage of the study process and then describe the steps generically, and in some cases discuss specific aspects of the main study types: (1) characterization, (2) population-level estimation (PLE), and (3) patient-level prediction (PLP), described in earlier chapters of the Book of OHDSI. To do so, we will synthesize many elements discussed in the previous chapters in a way that is accessible for the beginner. At the same time, this chapter can stand alone for a reader who wants practical high-level explanations with options to pursue more in-depth materials in other chapters as needed. Finally, we will illustrate throughout with a few key examples.
In addition, we will summarize guidelines and best practices for observational studies
as recommended by the OHDSI community. Some principles that we will discuss are
generic and shared with best practice recommendations found in many other guidelines
for observational research while other recommended processes are more specific to the
OHDSI framework. We will therefore highlight where OHDSI-specific approaches are enabled by the OHDSI tool stack.
Throughout the chapter, we assume that an infrastructure of OHDSI tools, R and SQL
are available to the reader and therefore we do not discuss any aspects of setting up this
infrastructure in this chapter (see Chapters 8 and 9 for guidance). We also assume our
reader is interested in running a study primarily on data at their own site using a database
in OMOP CDM (for OMOP ETL, see Chapter 6). However, we emphasize that once
a study package is prepared as discussed below, it can in principle be distributed and
executed at other sites. Additional considerations specific to running OHDSI network
studies, including organizational and technical details, are discussed in detail in Chapter
20.
19.1.3 Protocol
An observational study plan should be documented in the form of a protocol created prior
to executing a study. At a minimum, a protocol describes the primary study question,
the approach, and metrics that will be used to answer the question. The study population should be described to a level of detail such that the study population may be fully reproduced by others. In addition, all methods or statistical procedures and the form of
expected study results, such as metrics, tables, and graphs, should be described. Often, a protocol will also describe a set of pre-analyses designed to assess the feasibility or statistical power of the study. Furthermore, protocols may contain descriptions of variations on the primary study question referred to as sensitivity analyses. Sensitivity analyses are designed to evaluate the potential impact of study design choices on the overall study findings and should be described in advance whenever possible. Sometimes unanticipated issues arise that may necessitate a protocol amendment after a protocol is completed. If this becomes necessary, it is critical to document the change and the reasons for the change in the protocol itself. Particularly in the case of PLE or PLP, a completed study protocol will ideally be recorded in an independent platform (such as clinicaltrials.gov or OHDSI’s studyProtocols sandbox) where its versions and any amendments can be tracked independently with timestamps. It is also often the case that your institution or the owner of the data source will require the opportunity to review and approve your protocol prior to study execution.
The OHDSI approach supports the inclusion of feasibility and study diagnostics within
the protocol by again enabling these steps to be performed relatively simply within a
common framework and tools (see section 19.2.4 below).
It is recommended to record such a study package in the git environment. This study package contains all parameters and versioning stamps for the code base. As mentioned previously, observational studies are often asking questions with the potential to impact public health decisions and policy. Therefore, before acting on any findings, they should ideally be replicated in multiple settings by different researchers. The only way to achieve such a goal is for every detail required to fully reproduce a study to be mapped out explicitly and not left to guesswork or misinterpretation. To support this best practice, the OHDSI tools are designed to aid in the translation from a protocol in the form of a written document into a computer- or machine-readable study package. One tradeoff of this framework is that not every use case or customized analysis can easily be addressed with the existing OHDSI tools. As the community grows and evolves, however, more functionality to address a larger array of use cases is being added. Anyone involved in the community may raise suggestions for new functionality driven by a novel use case.
OHDSI studies are premised on observational databases being translated into the OMOP
common data model (CDM). All OHDSI tools and downstream analytics steps make an
assumption that the data representation conforms to the specifications of the CDM (see
Chapter 4). It is therefore also critical that the ETL process (see Chapter 6) for doing so
is well-documented for your specific data sources, as this process may introduce artifacts or differences between databases at different sites. The purpose of the OMOP CDM is to move in the direction of reducing site-specific data representation, but this is far from a perfect process and still remains a challenging area that the community seeks to improve.
It therefore remains critical when executing studies to collaborate with individuals at your
site, or at external sites when executing network studies, who are intimately familiar with
any source data that has been transformed into the OMOP CDM.
In addition to the CDM, the OMOP standardized vocabulary system (Chapter 5) is also
a critical component of working with the OHDSI framework to obtain interoperability
across diverse data sources. The standardized vocabulary seeks to define a set of standard
concepts within each vocabulary domain to which all other source vocabulary systems
are mapped. In this way, two different databases which use a different source vocabulary
system for drugs, diagnoses or procedures will be comparable when transformed into the
CDM. The OMOP vocabularies also contain hierarchies which are useful in identifying
the appropriate codes for a particular cohort definition. Again, it is recommended best
practice to implement the vocabulary mappings and use the codes of OMOP standardized
vocabularies in downstream queries in order to gain the full benefits of ETLing your
database into the OMOP CDM and using the OMOP vocabulary.
One example is glycosylated hemoglobin (HbA1c) levels, a lab measurement that reflects a patient’s blood sugar levels averaged over the prior 3 months. These values may or may not
be available for all patients. If unavailable for all or even a portion of patients, you will
have to consider whether other clinical criteria for severity of T2DM can be identified
and used instead. Alternatively, if the HbA1c values are available for only a subset of
patients, you will also need to evaluate whether focusing on this subset of patients only
would lead to unwanted bias in the study. See Chapter 7 for additional discussion of the
issue of missing data.
Another common issue is the lack of information about a particular care setting. In the PLE example described above, the suggested outcome was hospitalization for heart failure. If a given database does not have any inpatient information, one may need to consider a different outcome to evaluate the comparative effectiveness of different T2DM treatment approaches. In other databases, outpatient diagnosis data may not be available and therefore one would need to consider the design of the cohort.
For example, a characterization study may simply aim to describe the characteristics of all patients who have a T2DM diagnosis code in their medical record. In this case, it may not be appropriate to apply further qualifying criteria to attempt to remove erroneously coded T1DM patients.
Once the definition of a study population or populations is described, the OHDSI tool
ATLAS is a good starting point to create the relevant cohorts. ATLAS and the cohort
generation process are described in detail in Chapters 8 and 10. Briefly, ATLAS provides
a user interface (UI) to define and generate cohorts with detailed inclusion criteria. Once
cohorts are defined in ATLAS, a user can directly export their detailed definitions in a
humanreadable format for incorporation in a protocol. If for some reason an ATLAS
instance is not connected to an observational health database, ATLAS can still be used to
create a cohort definition and directly export the underlying SQL code for incorporation
into a study package to be run separately on a SQL database server. Directly using ATLAS
is recommended when possible because ATLAS provides some advantages above and
beyond the creation of SQL code for the cohort definition (see below). Finally, there may
be some rare situations where a cohort definition can not be implemented with the ATLAS
UI and requires manual custom SQL code.
The ATLAS UI enables defining cohorts based on numerous selection criteria. Criteria
for cohort entry and exit as well as baseline criteria can be defined on the basis of any
domains of the OMOP CDM such as conditions, drugs, procedures, etc. where standard
codes must be specified for each domain. In addition, logical filters on the basis of these domains, as well as time-based filters to define study periods and baseline timeframes, can be defined within ATLAS. ATLAS can be particularly helpful when selecting codes for each criterion. ATLAS incorporates a vocabulary-browsing feature which can be used to build sets of codes required for your cohort definitions. This feature relies solely on the OMOP standard vocabularies and has options to include all descendants in the vocabulary hierarchy (see Chapter 5). Note therefore that this feature requires that all codes have been appropriately mapped to standard codes during the ETL process (see Chapter 6). If the
best codesets to use in your inclusion criteria are not clear, this may be a place where some
exploratory analysis may be warranted in cohort definitions. Alternatively a more formal
sensitivity analysis could be considered to account for different possible definitions of a
cohort using different codesets.
When a cohort is created, summary characteristics of the patient demographics and frequencies of the most frequent drugs and conditions observed can be created and viewed in ATLAS.
For a PLE study, propensity score diagnostics can be used to evaluate the overlap between the populations in the target and comparator groups. These steps are described in detail in Chapter 12. In addition, using these final matched cohorts, the statistical power can then be calculated.
In some cases, work in the OHDSI community examines the statistical power only after a study is run, by reporting a minimal detectable relative risk (MDRR) given the available sample sizes. This approach may be more useful when running high-throughput, automated studies across many databases and sites. In this scenario, a study’s power in any given database is perhaps better explored after all analyses have been performed rather than by pre-filtering.
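For a new-user cohort PLE study, such an MDRR can be computed from the study population with CohortMethod (a sketch, assuming a study population object as produced in Chapter 12):
library(CohortMethod)
# Minimum detectable relative risk given the available sample size and outcome counts
computeMdrr(population = studyPop,
            alpha = 0.05,
            power = 0.8,
            twoSided = TRUE,
            modelType = "cox")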
As shown in Figure 19.1, assembling the final study protocol in human-readable form should be performed in parallel with preparing all the machine-readable study code that is incorporated into the final study package. These latter steps are referred to as study implementation in the diagram below. This will include export of the finalized study package from ATLAS and/or development of any custom code that may be required. The completed study package can then be used to execute only the preliminary diagnostics steps, which in turn can be described in the protocol. For example, in the case of a new-user cohort PLE study to examine the comparative effectiveness of two treatments, the preliminary execution of study diagnostics steps will require cohort creation, propensity score creation, and matching to confirm that the target and comparator populations have sufficient overlap for the study to be feasible. Once this is determined, power calculations can be performed with the matched target and comparator cohorts intersected with the outcome cohort to obtain outcome counts, and the results of these calculations can be described in the protocol. On the basis of these diagnostics results, a decision can then be made whether or not to move forward with executing the final outcome model. In the context of a characterization or a PLP study, there may be similar steps that need to be completed at this stage, although we don’t attempt to outline all scenarios here.
The study can then be executed according to the parameters outlined in the protocol. It may also be necessary to test and debug a study package to ensure it runs appropriately in your environment.
In a well-defined study where sample sizes are sufficient and data quality is reasonable, the interpretation of results will often be straightforward. Similarly, because most of the work of creating a final report, other than writing up the final results, is done in the planning and creation of the protocol, the final write-up of a report or manuscript for publication will often be straightforward as well.
There are, however, some common situations where interpretation becomes more challenging and should be approached with caution:
1. Sample sizes are borderline for significance and confidence intervals become large
2. Specific to PLE: p-value calibration with negative controls may reveal substantial bias
3. Unanticipated data quality issues come to light during the process of running the study
For any given study, it will be up to the discretion of the study authors to report on any of the concerns above and temper their interpretation of study results accordingly. As with the protocol development process, we also recommend that the study findings and interpretations be reviewed by clinical experts and stakeholders prior to releasing a final report or submitting a manuscript for publication.
19.3 Summary
Chapter 20
OHDSI Network Research
– Access to free tools: OHDSI publishes free, open source tools for data characterization and standardized analytics (e.g. browsing the clinical concepts, defining and characterizing cohorts, running Population-Level Estimation and Patient-Level Prediction studies).
– Participate in a premier research community: Author and publish network research, collaborate with leaders across various disciplines and stakeholder groups.
– Opportunity to benchmark care: Network studies can enable clinical characterization and quality improvement benchmarks across data partners.
The results of an observational study can be influenced by many factors that vary by the location of the data source, such as adherence, genetic diversity, environmental factors, and overall health status. These are factors that may not have been possible to vary in the context of a clinical trial, even if one exists for your study question. A typical motivation to run an observational study in a network is therefore to increase the diversity of data sources, and potentially study populations, to understand how well the results generalize. In other words, can the study findings be replicated across multiple sites, or do they differ, and if they differ, can any insights be gleaned as to why?
Network studies, therefore, offer the opportunity to investigate the effects of “real world”
factors on observational studies’ findings by examining a broad array of settings and data
sources.
The OHDSI approach to network research uses the OMOP CDM and standardized tools and study packages which fully specify all parameters for running a study. OHDSI standardized analytics are designed specifically to reduce artifacts and improve the efficiency and scalability of network studies.
Network studies are an important part of the OHDSI research community. However, there
is no mandate that an OHDSI study be packaged and shared across the entire OHDSI
network. You may still conduct research using the OMOP CDM and OHDSI methods
library within a single institution or limit a research study to only select institutions. These research contributions are equally important to the community. It is at the discretion of each investigator whether a study is designed to run on a single database, conduct a study across a limited set of partners, or open the study to full participation across the OHDSI network. This chapter intends to speak to the open-to-all network studies that the OHDSI community conducts.
Elements of an Open OHDSI Network Study: When conducting an open OHDSI network study, you are committing to fully transparent research. There are a few components that make OHDSI research unique. These include:
• All documentation, study code and subsequent results are made publicly available
on the OHDSI GitHub.
• Investigators must create and publish a public study protocol detailing the scope
and intent of the analysis to be performed.
• Investigators must create a study package (typically with R or SQL) with code that
is CDM compliant.
• Investigators are encouraged to attend OHDSI Community Calls to promote and
recruit collaborators for their OHDSI network study.
• At the end of the analysis, aggregate study results are made available in the OHDSI
GitHub.
• Where possible, investigators are encouraged to publish study R Shiny Applications to data.ohdsi.org.
In the next section we will talk about how to create your own network study as well as
the unique design and logistical considerations for implementing a network study.
When you design a study for a single database, it is easy to make assumptions that hold true in the data you are utilizing for your analysis. For example, if you were assembling an angioedema cohort you may opt to pick only concept codes for angioedema that are represented in your CDM. This may be problematic if your data are from a specific care setting (e.g. primary care, ambulatory settings) or specific to a region (e.g. US-centric). Your code selections might be biasing your cohort definition.
In an OHDSI network study, you are no longer designing and building a study package just for your data. You are building a study package to be run across multiple sites across the globe. You will never see the underlying data for participating sites outside of your own institution. OHDSI network studies only share results files. Your study package can only use data that are available in the domains of the CDM. You will need an exhaustive approach to concept set creation to represent the diversity of care settings in which observational health data are captured. OHDSI study packages often use the same cohort definition across all sites. This means that you must think holistically to avoid biasing a cohort definition to only represent a subset of eligible data (e.g. claims-centric data or EHR-specific data) in the network. You are encouraged to write an exhaustive cohort definition that can be ported across multiple CDMs. OHDSI study packages use the same set of parameterized code across all sites, with only minor customizations for connecting into the database layer and storing local results. Later on, we will discuss the implications for interpreting clinical findings from diverse datasets.
In addition to clinical coding variation, you will need to design anticipating variations in the local technical infrastructure. Your study code will no longer be running in a single technical environment. Each OHDSI network site makes its own independent choice of database layer. This means that you cannot hard-code a study package to a specific database dialect. The study code needs to be parameterized to a type of SQL that can be easily modified to the operators in that dialect. Fortunately, the OHDSI Community has solutions such as ATLAS, DatabaseConnector, and SqlRender to help you generalize your study package for CDM compliance across different database dialects. OHDSI investigators are encouraged to solicit help from other network study sites to test and validate that the study package can be executed in different environments. When coding errors come up, OHDSI researchers can utilize the OHDSI Forums to discuss and debug packages.
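For example, SqlRender lets you write the study SQL once in OHDSI-parameterized SQL and translate it to each site's dialect at run time:
library(SqlRender)

sql <- "SELECT COUNT(*) FROM @cdm_schema.person WHERE year_of_birth > @year;"
sql <- render(sql, cdm_schema = "my_cdm_data", year = 2000)

# The same statement, translated to the dialect in use at a participating site:
translate(sql, targetDialect = "postgresql")
translate(sql, targetDialect = "sql server")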
The experience a site has with running an OHDSI network study will also impact the personnel required.
• Registering the study with the Institutional Review Board (or equivalent), if required
• Receiving Institutional Review Board approval to execute the study, if required
• Receiving database level permissions to read/write a schema to the approved CDM
• Ensuring configuration of a functional RStudio environment to execute the study
package
• Reviewing the study code for any technical anomalies
• Working with a local IT team to permit and install any dependent R packages
needed to execute the package within technical constraints
Each site will have a local data analyst who executes the study package. This individual must review the output of the study package to ensure no sensitive information is transmitted, even though all the data in the CDM have already been de-identified. When you are using pre-built OHDSI methods such as Population-Level Effect Estimation (PLE) and Patient-Level Prediction (PLP), there are configurable settings for the minimum cell count for a given analysis. The data analyst is required to review these thresholds and ensure they follow local governance policies.
When sharing study results, the data analyst must comply with all local governance policies, inclusive of the method of results transmission and adherence to approval processes for external publication of results. OHDSI network studies do not share patient-level data. In other words, patient-level data from different sites are never pooled in a central environment. Study packages create results files designed to contain aggregate results (e.g. summary statistics, point estimates, diagnostic plots, etc.) and do not share patient-level information. Many organizations do not require data sharing agreements to be executed between the participating study team members. However, depending on the institutions involved and the data sources, it may be necessary to have more formal data sharing agreements in place and signed by specific study team members. If you are a data owner interested in participating in network studies, you are encouraged to consult your local governance team to understand what policies are in place and must be fulfilled to join OHDSI community studies.
A study will move to execution when the study lead reaches out to the OHDSI community to formally announce a new OHDSI network study and formally begins recruiting participating sites. The study lead will publish the study protocol to the OHDSI GitHub. The study lead will announce the study on the weekly OHDSI Community Call and OHDSI Forum, inviting participating centers and collaborators. As sites opt in to participate, a study lead will communicate directly with each site and provide information on the GitHub repository where the study protocol and code are published as well as instructions on how to execute the study package. Ideally, a network study will be performed in parallel by all sites, so the final results are shared concurrently, ensuring that no site’s team members are biased by knowledge of another team’s findings.
At each site, the study team will ensure the study follows institutional procedures for receiving approval to participate in the study, execute the package, and share results externally. This will likely include receiving Institutional Review Board (IRB) exemption or approval or equivalent for the specified protocol. When the study is approved to run,
the site data scientists/statisticians will follow the study lead’s instructions to access the
OHDSI study package and generate results in the standardized format following OHDSI
guidelines. Each participating site will follow internal institutional processes regarding
data sharing rules. Sites should not share results unless approval or exemption is obtained
from IRB or other institutional approval processes.
The study lead will be responsible for communicating how they want to receive results
(e.g. via SFTP or a secure Amazon S3 bucket) and the timeframe for turning around
results. Sites may specify if the method of transmission is out of compliance with internal
protocol and a workaround may be developed accordingly.
During the execution phase, the collective study team (inclusive of the study lead and participating site teams) may iterate on results, if reasonable adjustments are required. If the scope and extent of the protocol evolve beyond what is approved, it is the responsibility of the participating site to communicate this to their organization by working with the study lead to update the protocol, then resubmit the protocol for review and re-approval by the local IRB.
It is ultimately the responsibility of the study lead and any supporting data scientist/statistician to aggregate results across centers and perform meta-analysis, as appropriate. The OHDSI community has validated methodologies to aggregate results files shared from multiple network sites into a single answer. The EvidenceSynthesis package is a freely available R package containing routines for combining evidence and diagnostics across multiple sources, such as multiple data sites in a distributed study. This includes functions for performing meta-analysis and producing forest plots.
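To illustrate this final aggregation step, the sketch below pools per-site hazard ratio estimates using a generic random-effects meta-analysis from the meta package (rather than the EvidenceSynthesis routines themselves); the site names, estimates, and standard errors are invented for the example.

library(meta)

# Hypothetical per-site results, as they might appear in shared results files
siteResults <- data.frame(
  site = c("Site A", "Site B", "Site C"),
  logHr = log(c(1.15, 1.32, 1.08)),  # per-site log hazard ratios
  seLogHr = c(0.10, 0.15, 0.12))     # per-site standard errors

# Combine the per-site estimates into a single pooled estimate
ma <- metagen(TE = logHr,
              seTE = seLogHr,
              studlab = site,
              data = siteResults,
              sm = "HR")
summary(ma)
forest(ma)  # forest plot of per-site and pooled estimates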
The study lead will need to monitor site participation and help eliminate barriers to executing the package by regularly checking in with participating sites. Study execution is not one-size-fits-all at each site. There may be challenges related to the database layer (e.g. access rights or schema permissions) or to the analytics tools in the local environment (e.g. being unable to install required packages or to access databases through R). The participating site will be in the driver's seat and will communicate what barriers exist to executing the study. It is ultimately at the discretion of the participating site to enlist appropriate resources to help resolve issues encountered in their local CDM.
While OHDSI studies can be executed rapidly, it is advised to allow a reasonable amount of time for all participating sites to execute the study and receive appropriate approvals to publish results. Newer OHDSI network sites may find that the first network study they participate in takes longer than normal as they work through issues with environment configuration, such as database permissions or analytics library updates. Support is available from the OHDSI community, and issues can be posted to the OHDSI Forum as they are encountered.
A study lead should set study milestones in the protocol and communicate the anticipated closure date in advance to assist with managing the overall study timeline. If the timeline is not adhered to, it is the responsibility of the study lead to inform participating sites of updates to the study schedule and manage the overall progress of study execution.
Not sure where to publish your OHDSI network study? Consult JANE (Journal/Author Name Estimator), a tool which takes your abstract and scans publications for relevance and fit.
Study teams are also encouraged to present their OHDSI Network Studies on weekly OHDSI community calls and at OHDSI Symposia across the globe.
ARACHNE is a platform designed to streamline and automate the process of conducting network studies. ARACHNE uses OHDSI standards and establishes a consistent, transparent, secure and compliant observational research process across multiple organizations. ARACHNE standardizes the communication protocol used to access the data and exchange analysis results, while enabling authentication and authorization for restricted content. It brings participating organizations (data providers, investigators, sponsors and data scientists) into a single collaborative study team and facilitates end-to-end observational study coordination. The tool enables the creation of a complete, standards-based R, Python and SQL execution environment, including approval workflows controlled by the data custodian.
ARACHNE is built to provide seamless integration with other OHDSI tools, including ACHILLES reports, and the ability to import ATLAS design artifacts, create self-contained packages and automatically execute them across multiple sites. The future vision is to eventually enable multiple networks to be linked together for the purpose of conducting research not only between organizations within a single network, but also between organizations in different networks.
Study Execution: Where possible, study leads are encouraged to utilize ATLAS, the OHDSI Methods Library and OHDSI study skeletons to create study code that uses standardized analytics packages as much as possible. Study code should always be written in a CDM-compliant, database-layer-agnostic way using OHDSI packages. Be sure to parameterize all functions and variables (e.g. do not hard-code database connections or local hard drive paths, and do not assume a certain operating system). When recruiting participating sites, a study lead should ensure that each network site is CDM compliant and regularly updates the OMOP Standardized Vocabularies. A study lead should perform due diligence to ensure that each network site has performed and documented data quality checks on their CDM (e.g. ensuring the ETL has followed THEMIS business rules and conventions, and that the correct data were placed into the correct CDM tables and fields). Each data analyst is advised to update their local R packages to the latest OHDSI package versions before executing the study package.
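For example, a study package might take connection details and schema names as parameters rather than hard-coding them; the sketch below follows that pattern using DatabaseConnector, with environment variable names and schema values chosen purely for illustration.

library(DatabaseConnector)

# Connection details are built from parameters, not hard-coded in the script.
# The environment variable names are illustrative; each site can use whatever
# secret-management approach its governance policies require.
connectionDetails <- createConnectionDetails(
  dbms = Sys.getenv("DBMS"),          # e.g. "postgresql"
  server = Sys.getenv("DB_SERVER"),
  user = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASSWORD"),
  port = Sys.getenv("DB_PORT"))

# Schema names are passed in as parameters so the same code runs unchanged
# against any site's CDM (values below are placeholders).
cdmDatabaseSchema <- "cdm"
resultsDatabaseSchema <- "results"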
Results and dissemination: A study lead should ensure each site follows local governance rules before results are shared. Open, reproducible science means that everything that is designed and executed becomes available. OHDSI network studies are fully transparent, with all documentation and subsequent results published to the OHDSI GitHub repository or the data.ohdsi.org R Shiny server. As you prepare your manuscript, the study lead should review the principles of the OMOP CDM and Standardized Vocabularies to ensure the journal understands how data can vary across OHDSI network sites. For example, if you are performing a network study that uses claims databases and EHRs, you may be asked by journal reviewers to explain how the integrity of the cohort definition was maintained across multiple data types. A reviewer may want to understand how the OMOP observation period (as discussed in Chapter 4) compares to an eligibility file, a file that exists in claims databases to record when a person is and is not covered by an insurance provider. Such a question is inherently about an artifactual element of the databases themselves and about how the ETL transforms source records into observation periods in the CDM. In this case, the network study lead may find it helpful to reference how the OMOP OBSERVATION_PERIOD table is created and describe how observation periods are derived from the encounters in the source system. The manuscript discussion may need to acknowledge a limitation of EHR data: unlike claims data, which reflect all paid encounters for the period of time a person is covered, an EHR does not record when a person sees a provider who uses a different EHR system, and thus breaks in observation periods may occur when the person seeks care from an out-of-EHR provider. This is an artifact of how data exist in the system in which they are captured. It is not a clinically meaningful difference, but it may confuse those who are unfamiliar with how OMOP derives the observation period table, so it is worth explaining in the discussion section to clarify this unfamiliar convention. Similarly, a study lead may find it useful to describe how the terminology service provided by the OMOP Standardized Vocabularies enables a clinical concept to be represented consistently wherever it is captured. There are always decisions made in mapping source codes to standard concepts; however, THEMIS conventions and CDM quality checks can help provide information on where information should go and how well a database adhered to that principle.
20.6 Summary
Glossary
Common Data Model (CDM) A convention for representing healthcare data that allows portability of analysis (the same analysis unmodified can be executed on multiple datasets). See Chapter 4.
Comparative Effectiveness A comparison of the effects of two different exposures on
an outcome of interest. See Chapter 12.
Condition A diagnosis, a sign, or a symptom, which is either observed by a provider or
reported by the patient.
Confounding Confounding is a distortion (inaccuracy) in the estimated measure of association that occurs when the primary exposure of interest is mixed up with some other factor that is associated with the outcome.
Covariate Data element (e.g., weight) that is used in a statistical model as an independent variable.
Data quality The state of completeness, validity, consistency, timeliness and accuracy
that makes data appropriate for a specific use.
Device A foreign physical object or instrument which is used for diagnostic or therapeutic purposes through a mechanism beyond chemical action. Devices include implantable objects (e.g. pacemakers, stents, artificial joints), medical equipment and supplies (e.g. bandages, crutches, syringes), other instruments used in medical procedures (e.g. sutures, defibrillators) and material used in clinical care (e.g. adhesives, body material, dental material, surgical material).
Drug A Drug is a biochemical substance formulated in such a way that when administered to a Person it will exert a certain physiological effect. Drugs include prescription and over-the-counter medicines, vaccines, and large-molecule biologic therapies. Radiological devices ingested or applied locally do not count as Drugs.
Domain A Domain defines the set of allowable Concepts for the standardized fields
in the CDM tables. For example, the “Condition” Domain contains Concepts
that describe a condition of a patient, and these Concepts can only be stored
in the condition_concept_id field of the CONDITION_OCCURRENCE and
CONDITION_ERA tables.
Electronic Health Record (EHR) Data generated during the course of care and recorded in an electronic system.
Epidemiology The study of the distribution, patterns and determinants of health and disease conditions in defined populations.
Evidence-based medicine The use of empirical and scientific evidence in making decisions about the care of individual patients.
ETL (Extract-Transform-Load) The process of converting data from one format to another, for example from a source format to the CDM. See Chapter 6.
Matching Many population-level effect estimation approaches attempt to identify the causal effects of exposures by comparing outcomes in exposed patients to those same outcomes in unexposed patients (or patients exposed to A versus B). Since these two patient groups might differ in ways other than exposure, "matching" attempts to create exposed and unexposed patient groups that are as similar as possible, at least with respect to measured patient characteristics.
Measurement A structured value (numerical or categorical) obtained through systematic and standardized examination or testing of a person or a person's sample.
Cohort definitions
Inclusion Rules
Inclusion Criteria #1: has hypertension diagnosis in 1 yr prior to treatment
Having all of the following criteria:
• at least 1 occurrences of a condition occurrence of Hypertensive disorder (Table
B.3) where event starts between 365 days Before and 0 days After index start date
Inclusion Criteria #2: Has no prior antihypertensive drug exposures in medical history
Having all of the following criteria:
• exactly 0 occurrences of a drug exposure of Hypertension drugs (Table B.4) where
event starts between all days Before and 1 days Before index start date
Inclusion Criteria #3: Is only taking ACE as monotherapy, with no concomitant combination treatments
Having all of the following criteria:
with continuous observation of at least 0 days prior and 0 days after event index date, and
limit initial events to: all events per person.
For people matching the Primary Events, include: Having any of the following criteria:
Date Offset Exit Criteria. This cohort definition end date will be the index event’s start
date plus 7 days
B.4 Angioedema
Initial Event Cohort
People having any of the following:
• a condition occurrence of Angioedema (Table B.7)
with continuous observation of at least 0 days prior and 0 days after event index date, and
limit initial events to: all events per person.
For people matching the Primary Events, include: Having any of the following criteria:
• at least 1 occurrences of a visit occurrence of Inpatient or ER visit (Table B.8)
where event starts between all days Before and 0 days After index start date and
event ends between 0 days Before and all days After index start date
Limit cohort of initial events to: all events per person.
Limit qualifying cohort to: all events per person.
Inclusion Rules
Inclusion Criteria #1: has hypertension diagnosis in 1 yr prior to treatment
Having all of the following criteria:
• at least 1 occurrences of a condition occurrence of Hypertensive disorder (Table
B.10) where event starts between 365 days Before and 0 days After index start date
Inclusion Criteria #2: Has no prior antihypertensive drug exposures in medical history
Having all of the following criteria:
• exactly 0 occurrences of a drug exposure of Hypertension drugs (Table B.11) where
event starts between all days Before and 1 days Before index start date
Inclusion Criteria #3: Is only taking ACE as monotherapy, with no concomitant combination treatments
Having all of the following criteria:
Inclusion Rules
Having all of the following criteria:
• exactly 0 occurrences of a drug exposure of Hypertension drugs (Table B.13) where
event starts between all days Before and 1 days Before index start date
• and at least 1 occurrences of a condition occurrence of Hypertensive disorder (Table
B.14) where event starts between 365 days Before and 0 days After index start date
Limit cohort of initial events to: earliest event per person. Limit qualifying cohort to:
earliest event per person.
Negative controls
This Appendix contains negative controls used in various chapters of the book.
Protocol template
1. Table of contents
2. List of abbreviations
3. Abstract
4. Amendments and Updates
5. Milestones
6. Rationale and Background
7. Study Objectives
• Primary Hypotheses
• Secondary Hypotheses
• Primary Objectives
• Secondary Objectives
8. Research methods
• Study Design
• Data Source(s)
• Study population
• Exposures
• Outcomes
• Covariates
9. Data Analysis Plan
• Calculation of time-at-risk
• Model Specification
• Pooling effect estimates across databases
• Analyses to perform
• Output
• Evidence Evaluation
10. Study Diagnostics
• Sample Size and Study Power
• Cohort Comparability
• Systematic Error Assessment
Suggested Answers
This Appendix contains suggested answers for the exercises in the book.
Exercise 4.2
Based on the description in the exercise, John’s record should look like Table E.2.
Exercise 4.3
Based on the description in the exercise, John’s record should look like Table E.3.
Exercise 4.4
To find the set of records, we can query the CONDITION_OCCURRENCE table:
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT *
FROM @cdm.condition_occurrence
WHERE condition_concept_id = 192671;"
renderTranslateQuerySql(connection, sql, cdm = "main")
Exercise 4.5
To find the set of records, we can query the CONDITION_OCCURRENCE table using
the CONDITION_SOURCE_VALUE field:
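A minimal sketch of such a query, where 'some_source_code' is only a placeholder for the source code referenced in the exercise:
library(DatabaseConnector)
connection <- connect(connectionDetails)
# Placeholder source code; substitute the code the exercise refers to
sql <- "SELECT *
FROM @cdm.condition_occurrence
WHERE condition_source_value = 'some_source_code';"
renderTranslateQuerySql(connection, sql, cdm = "main")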
Exercise 4.6
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT *
FROM @cdm.observation_period
WHERE person_id = 61;"
renderTranslateQuerySql(connection, sql, cdm = "main")
Exercise 5.2
ICD-10-CM codes:
• K29.91 “Gastroduodenitis, unspecified, with bleeding”
• K92.2 “Gastrointestinal hemorrhage, unspecified”
ICD-9-CM codes:
• 578 “Gastrointestinal hemorrhage”
• 578.9 “Hemorrhage of gastrointestinal tract, unspecified”
Exercise 5.3
MedDRA preferred terms:
• “Gastrointestinal haemorrhage” (Concept ID 35707864)
• “Intestinal haemorrhage” (Concept ID 35707858)
Exercise 6.2
Exercise 6.3
Column               Value
VISIT_OCCURRENCE_ID  1
PERSON_ID            11
VISIT_START_DATE     2004-09-26
VISIT_END_DATE       2004-09-30
VISIT_CONCEPT_ID     9201
VISIT_SOURCE_VALUE   inpatient
1. Characterization
2. Patient-level prediction
3. Population-level estimation
Exercise 7.2
Probably not. Defining a non-exposure cohort that is comparable to your diclofenac exposure cohort is often impossible, since people take diclofenac for a reason. This precludes a between-person comparison. It might be possible to do a within-person comparison, for example by identifying, for each patient in the diclofenac cohort, time when they are not exposed, but a similar problem occurs here: these times are likely incomparable, because there are reasons why someone is exposed at one time and not at another.
Exercise 9.1
To compute the number of people in the database, we query the PERSON table:
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT COUNT(*) AS person_count
FROM @cdm.person;"
renderTranslateQuerySql(connection, sql, cdm = "main")
## PERSON_COUNT
## 1 2694
Exercise 9.2
To compute the number of people with at least one prescription of celecoxib, we can query
the DRUG_EXPOSURE table. To find all drugs containing the ingredient celecoxib, we
join to the CONCEPT_ANCESTOR and CONCEPT tables:
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT COUNT(DISTINCT(person_id)) AS person_count
FROM @cdm.drug_exposure
INNER JOIN @cdm.concept_ancestor
ON drug_concept_id = descendant_concept_id
INNER JOIN @cdm.concept ingredient
ON ancestor_concept_id = ingredient.concept_id
WHERE LOWER(ingredient.concept_name) = 'celecoxib'
AND ingredient.concept_class_id = 'Ingredient'
AND ingredient.standard_concept = 'S';"
renderTranslateQuerySql(connection, sql, cdm = "main")
## PERSON_COUNT
## 1 1844
Note that we use COUNT(DISTINCT(person_id)) to find the number of distinct persons, considering that a person might have more than one prescription. Also note that we use the LOWER function to make our search for "celecoxib" case-insensitive.
Alternatively, we can use the DRUG_ERA table, which is already rolled up to the ingredient level:
library(DatabaseConnector)
connection <- connect(connectionDetails)
# Same ingredient lookup as above, but counting distinct persons with a
# celecoxib drug era
sql <- "SELECT COUNT(DISTINCT(person_id)) AS person_count
FROM @cdm.drug_era
INNER JOIN @cdm.concept ingredient
ON drug_concept_id = ingredient.concept_id
WHERE LOWER(ingredient.concept_name) = 'celecoxib'
AND ingredient.concept_class_id = 'Ingredient'
AND ingredient.standard_concept = 'S';"
renderTranslateQuerySql(connection, sql, cdm = "main")
## PERSON_COUNT
## 1 1844
Exercise 9.3
To compute the number of diagnoses during exposure we extend our previous query by joining to the CONDITION_OCCURRENCE table. We join to the CONCEPT_ANCESTOR table to find all condition concepts that imply a gastrointestinal haemorrhage:
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT COUNT(*) AS diagnose_count
FROM @cdm.drug_era
INNER JOIN @cdm.concept ingredient
ON drug_concept_id = ingredient.concept_id
INNER JOIN @cdm.condition_occurrence
ON condition_start_date >= drug_era_start_date
AND condition_start_date <= drug_era_end_date
INNER JOIN @cdm.concept_ancestor
ON condition_concept_id = descendant_concept_id
WHERE LOWER(ingredient.concept_name) = 'celecoxib'
AND ingredient.concept_class_id = 'Ingredient'
AND ingredient.standard_concept = 'S'
AND ancestor_concept_id = 192671;"
renderTranslateQuerySql(connection, sql, cdm = "main")
## DIAGNOSE_COUNT
## 1 41
Note that in this case it is essential to use the DRUG_ERA table instead of the DRUG_EXPOSURE table, because drug exposures with the same ingredient can overlap, but drug eras cannot. Overlapping exposures could lead to double counting. For example, imagine a person received two drugs containing celecoxib at the same time. This would be recorded as two drug exposures, so any diagnoses occurring during the exposure would be counted twice. The two exposures will be merged into a single non-overlapping drug era.
Figure E.1: Cohort entry event settings for new users of diclofenac
The concept set expression for diclofenac should look like Figure E.2, including the ingredient 'Diclofenac' and all of its descendants, thus including all drugs containing the ingredient diclofenac.
Next, we require no prior exposure to any NSAID, as shown in Figure E.3.
The concept set expression for NSAIDs should look like Figure E.4, including the NSAIDs class and all of its descendants, thus including all drugs containing any NSAID.
The concept set expression for "Broad malignancies" should look like Figure E.6, including the high-level concept "Malignant neoplastic disease" and all of its descendants.
Finally, we define the cohort exit criteria as discontinuation of exposure (allowing for a 30-day gap), as shown in Figure E.7.
Exercise 10.2
For readability we here split the SQL into two steps. We first find all condition occurrences of myocardial infarction, and store these in a temp table called "#diagnoses":
library(DatabaseConnector)
connection <- connect(connectionDetails)
sql <- "SELECT person_id AS subject_id,
condition_start_date AS cohort_start_date
INTO #diagnoses
FROM @cdm.condition_occurrence
WHERE condition_concept_id IN (
SELECT descendant_concept_id
FROM @cdm.concept_ancestor
WHERE ancestor_concept_id = 4329847 -- Myocardial infarction
)
AND condition_concept_id NOT IN (
SELECT descendant_concept_id
FROM @cdm.concept_ancestor
WHERE ancestor_concept_id = 314666 -- Old myocardial infarction
);"
We then select only those that occur during an inpatient or ER visit, using some unique
COHORT_DEFINITION_ID (we selected ‘1’):
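A sketch of this second step is shown below; the cohort schema and table names (@cohort_db_schema, @cohort_table) are placeholders, and the visit concept IDs are the standard concepts for inpatient (9201), emergency room (9203), and combined ER and inpatient (262) visits:
sql <- "SELECT 1 AS cohort_definition_id,
  cohort_start_date,
  cohort_start_date AS cohort_end_date,
  subject_id
INTO @cohort_db_schema.@cohort_table
FROM #diagnoses
INNER JOIN @cdm.visit_occurrence
  ON subject_id = person_id
  AND cohort_start_date >= visit_start_date
  AND cohort_start_date <= visit_end_date
WHERE visit_concept_id IN (9201, 9203, 262);"
renderTranslateExecuteSql(connection, sql,
                          cdm = "main",
                          cohort_db_schema = "main",
                          cohort_table = "mi_cohort")  # placeholder table name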
Note that an alternative approach would have been to join the conditions to the visits based on the VISIT_OCCURRENCE_ID, instead of requiring the condition date to fall within the visit start and end date. This would likely be more accurate, as it would guarantee that the condition was recorded in relation to the inpatient or ER visit. However, many observational databases do not record the link between visit and diagnosis, and we therefore chose to use the dates instead, likely giving us a higher sensitivity but perhaps lower specificity.
Note also that we ignored the cohort end date. Often, when a cohort is used to define an outcome we are only interested in the cohort start date, and there is no point in creating an (ill-defined) cohort end date.
It is recommended to clean up any temp tables when no longer needed:
sql <- "TRUNCATE TABLE #diagnoses;
DROP TABLE #diagnoses;"
renderTranslateExecuteSql(connection, sql)
E.7 Characterization
Exercise 11.1
In ATLAS we click on and select the data source we’re interested in.
We could select the Drug Exposure report, select the "Table" tab, and search for "celecoxib" as shown in Figure E.8. Here we see that this particular database has exposures to various formulations of celecoxib. We could click on any of these drugs to get a more detailed view, for example showing age and gender distributions for these drugs.
Exercise 11.2
Click on and then "New cohort" to create a new cohort. Give the cohort a meaningful name (e.g. "Celecoxib new users") and go to the "Concept Sets" tab. Click on "New Concept Set", and give your concept set a meaningful name (e.g. "Celecoxib"). Open the module, search for "celecoxib", restrict the Class to "Ingredient" and Standard Concept to "Standard", and click the to add the concept to your concept set as shown in Figure E.9.
Figure E.9: Selecting the standard concept for the ingredient ”celecoxib”.
Click on the left arrow shown at the top left of Figure E.9 to return to your cohort definition. Click on "+Add Initial Event" and then "Add Drug Era". Select your previously created concept set for the drug era criterion. Click on "Add attribute…" and select "Add First Exposure Criteria." Set the required continuous observation to at least 365 days before the index date. The result should look like Figure E.10. Leave the Inclusion Criteria, Cohort Exit, and Cohort Eras sections as they are. Make sure to save the cohort definition by clicking , and close it by clicking .
Now that we have our cohort defined, we can characterize it. Click on and then "New Characterization". Give your characterization a meaningful name (e.g. "Celecoxib new users characterization"). Under Cohort Definitions, click on "Import" and select your recently created cohort definition. Under "Feature Analyses", click on "Import" and select at least one condition analysis and one drug analysis, for example "Drug Group Era Any Time Prior" and "Condition Group Era Any Time Prior". Your characterization definition should now look like Figure E.11. Make sure to save the characterization settings by clicking .
Click on the "Executions" tab, and click on "Generate" for one of the data sources. It may take a while for the generation to complete. When done, we can click on "View latest results". The resulting screen will look something like Figure E.12, showing for example that pain and arthropathy are commonly observed, which should not surprise us as these are indications for celecoxib. Lower on the list we may see conditions we were not expecting.
Exercise 11.3
Click on and then "New cohort" to create a new cohort. Give the cohort a meaningful name (e.g. "GI bleed") and go to the "Concept Sets" tab. Click on "New Concept Set", and give your concept set a meaningful name (e.g. "GI bleed"). Open the module, search for "Gastrointestinal hemorrhage", and click the next to the top concept to add the concept to your concept set as shown in Figure E.13.
Click on the left arrow shown at the top left of Figure E.13 to return to your cohort definition. Open the "Concept Sets" tab again, and check "Descendants" next to the GI hemorrhage concept, as shown in Figure E.14.
Return to the "Definition" tab, click on "+Add Initial Event" and then "Add Condition Occurrence". Select your previously created concept set for the condition occurrence criterion. The result should look like Figure E.15. Leave the Inclusion Criteria, Cohort Exit, and Cohort Eras sections as they are. Make sure to save the cohort definition by clicking , and close it by clicking .
Now that we have our cohort defined, we can compute the incidence rate. Click on
and then “New Analysis”. Give your analysis a meaningful name
(e.g. “Incidence of GI bleed after celecoxib initiation”). Click “Add Target Cohort” and
select our celecoxib new user cohort. Click on “Add Outcome Cohort” and add our new
GI bleed cohort. Set the Time At Risk to end 1095 days after the start date. The analysis
should now look like Figure E.16. Make sure to save the analysis settings by clicking .
Click on the “Generation” tab, and click on “Generate”. Select one of the data sources and
click “Generate”. When done, we can see the computed incidence rate and proportion, as
shown in Figure E.17.
E.8 Population-Level Estimation
Exercise 12.1
We specify the default set of covariates, but we must exclude the two drugs we're comparing, including all their descendants, because otherwise our propensity model will become perfectly predictive:
library(CohortMethod)
nsaids <- c(1118084, 1124300) # celecoxib, diclofenac
covSettings <- createDefaultCovariateSettings(
excludedCovariateConceptIds = nsaids,
addDescendantsToExclude = TRUE)
# Load data:
cmData <- getDbCohortMethodData(
connectionDetails = connectionDetails,
cdmDatabaseSchema = "main",
targetId = 1,
comparatorId = 2,
outcomeIds = 3,
exposureDatabaseSchema = "main",
exposureTable = "cohort",
outcomeDatabaseSchema = "main",
outcomeTable = "cohort",
covariateSettings = covSettings)
summary(cmData)
## 3 479 479
##
## Covariates:
## Number of covariates: 389
## Number of non-zero covariate values: 26923
Exercise 12.2
We create the study population following the specifications, and output the attrition diagram:
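A minimal sketch of this step with CohortMethod (the exact settings are assumptions based on the exercise specifications):
studyPop <- createStudyPopulation(cohortMethodData = cmData,
                                  outcomeId = 3,
                                  removeSubjectsWithPriorOutcome = TRUE,
                                  riskWindowStart = 0,
                                  riskWindowEnd = 99999)
drawAttritionDiagram(studyPop)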
We see that we did not lose any subjects compared to the original cohorts, probably be
cause the restrictions used here were already applied in the cohort definitions.
Exercise 12.3
We fit a simple outcome model using a Cox regression:
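A sketch of this unadjusted model, fitted on the study population created above:
outcomeModel <- fitOutcomeModel(population = studyPop,
                                modelType = "cox")
outcomeModel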
It is likely that celecoxib users are not exchangeable with diclofenac users, and that these baseline differences already lead to different risks of the outcome. If we do not adjust for these differences, as in this analysis, we are likely producing biased estimates.
Exercise 12.4
We fit a propensity model on our study population, using all covariates we extracted. We
then show the preference score distribution:
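A sketch of these two steps with CohortMethod:
ps <- createPs(cohortMethodData = cmData, population = studyPop)
computePsAuc(ps)  # area under the curve of the propensity model
plotPs(ps)        # preference score distribution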
Note that this distribution looks a bit odd, with several spikes. This is because we are using a very small simulated dataset. Real preference score distributions tend to be much smoother.
The propensity model achieves an AUC of 0.63, suggesting there are differences between the target and comparator cohorts. We see quite a lot of overlap between the two groups, suggesting PS adjustment can make them more comparable.
Exercise 12.5
We stratify the population based on the propensity scores, and compute the covariate
balance before and after stratification:
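A sketch of these steps, assuming five propensity score strata:
strataPop <- stratifyByPs(ps, numberOfStrata = 5)
balance <- computeCovariateBalance(strataPop, cmData)
plotCovariateBalanceScatterPlot(balance)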
We see that various baseline covariates showed a large (>0.3) standardized difference of means before stratification (x-axis). After stratification, balance is improved, with the maximum standardized difference <= 0.1.
Exercise 12.6
We fit an outcome model using a Cox regression, but stratify it by the PS strata:
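A sketch of the stratified model:
adjustedModel <- fitOutcomeModel(population = strataPop,
                                 modelType = "cox",
                                 stratified = TRUE)
adjustedModel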
Exercise 13.1
library(PatientLevelPrediction)
covSettings <- createCovariateSettings(
useDemographicsGender = TRUE,
useDemographicsAge = TRUE,
useConditionGroupEraLongTerm = TRUE,
useConditionGroupEraAnyTimePrior = TRUE,
useDrugGroupEraLongTerm = TRUE,
useDrugGroupEraAnyTimePrior = TRUE,
useVisitConceptCountLongTerm = TRUE,
longTermStartDays = -365,
endDays = -1)
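The covariate settings are then used to extract the prediction data summarized below. A rough sketch of that extraction is shown here; the cohort and outcome IDs, schema names, and the exact getPlpData signature (which differs between PatientLevelPrediction versions) are all assumptions:
plpData <- getPlpData(connectionDetails = connectionDetails,
                      cdmDatabaseSchema = "main",
                      cohortDatabaseSchema = "main",
                      cohortTable = "cohort",
                      cohortId = 4,          # assumed target cohort ID
                      covariateSettings = covSettings,
                      outcomeDatabaseSchema = "main",
                      outcomeTable = "cohort",
                      outcomeIds = 3)        # assumed outcome cohort ID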
summary(plpData)
##
## Outcome counts:
## Event count Person count
## 3 479 479
##
## Covariates:
## Number of covariates: 245
## Number of non-zero covariate values: 54079
Exercise 13.2
We create a study population for the outcome of interest (in this case the only outcome for which we extracted data), removing subjects who experienced the outcome before they started the NSAID, and requiring 364 days of time-at-risk:
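A rough sketch of this step (argument names follow older PatientLevelPrediction releases and the values are assumptions):
population <- createStudyPopulation(plpData = plpData,
                                    outcomeId = 3,
                                    removeSubjectsWithPriorOutcome = TRUE,
                                    requireTimeAtRisk = TRUE,
                                    minTimeAtRisk = 364,
                                    riskWindowStart = 1,
                                    riskWindowEnd = 365)
nrow(population)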
## [1] 2578
In this case we have lost a few people by removing those who had the outcome before, and by requiring a time-at-risk of at least 364 days.
Exercise 13.3
We run a LASSO model by first creating a model settings object, and then calling the runPlp function. In this case we do a person split, training the model on 75% of the data and evaluating on 25% of the data:
lassoModel <- setLassoLogisticRegression(seed = 0)
lassoResults <- runPlp(population = population,
                       plpData = plpData,
                       modelSettings = lassoModel,
                       testSplit = 'person',
                       testFraction = 0.25,
                       nfold = 2,
                       splitSeed = 0)
Note that for this example we set the random seeds both for the LASSO cross-validation and for the train-test split to make sure the results will be the same on multiple runs.
We can now view the results using the Shiny app:
viewPlp(lassoResults)
This will launch the app as shown in Figure E.18. Here we see an AUC on the test set
of 0.645, which is better than random guessing, but maybe not good enough for clinical
practice.
E.10 Data Quality
Exercise 15.1
To run ACHILLES:
library(ACHILLES)
result <- achilles(connectionDetails,
cdmDatabaseSchema = "main",
resultsDatabaseSchema = "main",
sourceName = "Eunomia",
cdmVersion = "5.3.0")
Exercise 15.2
To run the Data Quality Dashboard:
DataQualityDashboard::executeDqChecks(
connectionDetails,
cdmDatabaseSchema = "main",
resultsDatabaseSchema = "main",
cdmSourceName = "Eunomia",
outputFolder = "C:/dataQualityExample")
Exercise 15.3
To view the list of data quality checks:
DataQualityDashboard::viewDqDashboard(
"C:/dataQualityExample/Eunomia/results_Eunomia.json")