Using Electronically Available Inpatient Hospital Data For Research
Mandar Apte, M.B.B.S., Ms.P.H.1, Matthew Neidell, Ph.D.2, E. Yoko Furuya, M.D., M.S.3,4, David Caplan, B.S.5, Sherry Glied, Ph.D.6,*,
Elaine Larson, R.N., Ph.D.7
Abstract
Despite a push to create electronic health records and a plethora of healthcare data from disparate sources, there are no data from
a single electronic source that provide a full picture of a patient’s hospital course. This paper describes a process to utilize electronically
available inpatient hospital data for research. We linked several different sources of extracted data, including clinical, procedural,
administrative, and accounting data, using patients’ medical record numbers to compile a cohesive, comprehensive account of patient
encounters. Challenges encountered included (1) interacting with distinct administrative units to locate data elements; (2) finding a
secure, central location to house the data; (3) appropriately defining health measures of interest; (4) obtaining and linking these data
to create a usable format for conducting research; and (5) dealing with missing data. Although the resulting data set is incredibly rich
and likely to prove useful for a wide range of clinical and comparative effectiveness research questions, there are multiple challenges
associated with linking hospital data to improve the quality of patient care. Clin Trans Sci 2011; Volume 4: 338–345
Keywords: surveillance, electronic data, ICD9 codes.
Introduction
For the past decade there has been a national commitment to enhance health information technology and develop electronic health records. These efforts are intended to monitor and improve evidence-based practice and quality of care and secure patient information in a highly mobile environment. For example, the HITECH provisions of the American Recovery and Reinvestment Act of 2009 (Public Law 111–5) included $20 billion in spending to spur the adoption of electronic health records. Hospitals across the country have developed and/or adopted electronic methods to collect and store data, but electronic databases have often been designed for a specific purpose or department such as laboratory, radiology, pharmacy, patient tracking, clinician orders, central supply, or billing. Hence, many hospitals have a number of such databases which function well for one purpose but are unlinked and do not “speak to each other.” As a result, with few exceptions, such as the healthcare facilities of the Department of Veterans Affairs (http://www.ehealth.va.gov/VistA.asp), there may be a plethora of healthcare data available regarding therapies provided, test results, and costs of care, but often no single electronic source that provides a full picture of a patient’s hospital course(s).
In general, the United States has been slow to adopt electronic health records.1,2 As of 2005 only 5% of hospitals used computerized physician order entry,3 and even fewer had unified electronic health records. Hence, the current potential for using data to conduct comparative effectiveness research and monitor and improve the quality of patient care is limited and little is known about how these data can be used. In an ongoing NIH-funded study of healthcare-associated infections and predictors and costs of antimicrobial resistance among patients in a large hospital system (Distribution of the Costs of Antimicrobial Resistant Infections, 5R01NR10822), we found that relevant data were not readily available from a single source. A major limitation of commonly available data such as ICD-9-CM codes is that they identify only health end points that are relevant for billing purposes. In addition, several studies have shown that ICD-9-CM representations of clinical events such as infections are inadequate for clinical research since they do not match well with clinical definitions.4 Faced with this situation, we identified relevant data sources and developed algorithms to collate data from a variety of electronic sources. The purpose of this paper is to describe the process we used to combine various sources of electronically available inpatient hospital data for health services research.
Methods
Sample and setting
Data were extracted from various electronic databases from four sites in a large healthcare system in metropolitan New York City: the New York-Presbyterian Hospital (NYPH) System. NYPH is the largest hospital system in the largest metropolitan region in the United States and includes a community hospital, pediatric hospital, and two tertiary/quaternary care hospitals that provide care to a diverse range of patients. Although the database was developed to study healthcare-associated infections, and hence this paper disproportionately focuses on these outcomes, the approach is generalizable to a wide range of clinical research topics.
Data extraction
Clinical Data Warehouse (CDW).
The four hospital sites share a CDW that enables hospital or university personnel engaged in either clinical research or activities related to hospital treatment, payment, or operations to perform analytic queries on clinical data across patients. The Warehouse integrates data from over 20 clinical electronic sources and organizes the data by subject. We extracted the following data elements from the CDW: (1) laboratory results, including microbiologic results from blood, urine, and respiratory
1School of Nursing, Columbia University, New York, New York, USA
2Associate Professor, Department of Health Policy and Management, Mailman School of Public Health, Columbia University, New York, New York, USA
3Department of Medicine, Division of Infectious Diseases, Columbia University, New York, New York, USA
4NewYork-Presbyterian Hospital, Columbia University Medical Center, New York, New York, USA
5Information Services Division, Business Solutions Group, New York-Presbyterian Hospital, Columbia University Medical Center, USA
6Department of Health Policy and Management, Mailman School of Public Health, Columbia University, New York, New York, USA
7School of Nursing and Mailman School of Public Health, Columbia University, New York, New York, USA
*Dr. Glied is currently on leave at the Department of Health and Human Services (HHS), where she is Assistant Secretary for Planning and Evaluation. Her contributions to this paper were made prior to her appointment at HHS and the paper does not reflect the official views of HHS.
Correspondence: E Larson ([email protected])
DOI: 10.1111/j.1752-8062.2011.00353.x
cultures, all cultures taken from possible surgical sites, and urine microscopy results; (2) patient location, including hospital unit, room and bed occupied for each day of hospital stay as well as patient’s home address; and (3) detailed accounts of medications administered and procedures performed, including use of central venous (CV) catheters.
Operating room data.
Data on procedures performed in the operating room were obtained from the perioperative services of each institution. Data included the date and time of entry in the operating room, commencement of and recovery from anesthesia, time of incision and closure, procedure descriptions and type of anesthesia used.
Administrative data.
Administrative data from the admission, discharge, transfer (ADT), billing, and coding and abstraction systems included admission and discharge dates, ICD-9-CM principal and secondary diagnosis and procedure codes with associated codes for diagnoses present on admission, and admission source and discharge destinations.
Cost accounting data.
Financial information for each discharge was obtained from the cost accounting system, including total charges and insurance/payer information. In addition, details for each item charged to the patient’s stay were collected, including date of service, charge amount, and UB-92 revenue codes (maintained by the National Uniform Billing Committee), which identify specific accommodations.
Data from the electronic health record system.
Data on urinary catheter output were obtained through mediated queries to flowsheets in the physician and nursing order entry system (Eclipsys XA, http://www.allscripts.com/).
Linking data.
Patient information was linked across the multiple data sets using the unique account number associated with each hospital admission where available. Where account numbers were not available, source data were matched to the correct hospital stay using the unique medical record number and the date/time stamps associated with the source data. Once data sets were linked and processed, they were de-identified by replacing account numbers and medical record numbers with unique identification numbers.
Algorithms for identifying infections
To study the cost of antimicrobial resistant infections, infection outcomes needed to be defined across multiple domains and axes: the type of infection, the date an infection occurred, the causative organism and its antimicrobial susceptibility pattern. Our team of clinicians and researchers developed electronic algorithms to identify hospital stays with any of four types of infections: blood stream infection, urinary tract infection, pneumonia and surgical site infections. We used the surveillance definitions from the Centers for Disease Control and Prevention National Healthcare Safety Network (NHSN, http://www.cdc.gov/nhsn/about.html) for healthcare-associated infections5–7 as a starting point to identify elements of these definitions which could be mapped to available electronic data. Using a combination of microbiologic results, urine microscopy results, and ICD-9-CM diagnosis codes, we identified patients as having an infection (cases), not having an infection (controls), and patients whom we could not clearly categorize (noncase, noncontrol). We separately identified cases for organisms of interest (those often associated with multidrug resistance) and for any organism.8 The Appendix provides a detailed description of the algorithms used.
Variables constructed
Using the data sources described earlier, we coded categories of variables for the final data set. A data dictionary describing all variables is available upon request from the authors. A limited set of patient demographics was also collected, namely age and zip code of residence, which could be used to link neighborhood level characteristics from external data sets such as the decennial census of housing and population. Admission and discharge variables included the date of admission, length of hospital stay, whether the patient died in the hospital, several variants of diagnosis related groups (DRGs), and measures of risk of mortality and severity of illness based on output from 3M’s grouper software, which uses a proprietary algorithm to assign an APR–DRG to each discharge.9 Several measures of the health status of the patient were collected, including prior hospitalizations, diabetes, chronic dermatitis, trauma, burns, and history of substance abuse. ICD-9-CM diagnosis codes for conditions present on admission were used to calculate a weighted Charlson score as a measure of patients’ health status at admission.10 Several measures of procedure based risk factors were collected, including the use of medications, CV catheterization, urinary catheterization, mechanical ventilation, cardiac catheterization, catheter angiography, vascular stenting, dialysis, surgical procedure, general anesthesia, intubation, and ICU stay. All of these variables included both the date the procedure started and the date it ended. We also coded patients in whom an infection occurred, including details on the organism responsible, antibiotic susceptibility pattern and when the infection occurred. Financial variables collected included the total charges for the encounter, total payments received, along with information on the source of payment, and daily itemization of charges.
Given that some of the events varied throughout the course of a patient’s hospital stay (e.g., presence of a urinary catheter) while some were fixed throughout the stay (e.g., malignancy, diabetes), we created both time varying and time invariant variables. To allow for the construction of time varying variables, the unit of analysis was the patient-day, so each patient encounter contributed one observation for each day in his or her length of hospital stay. This data set construction is analogous to the structure often used for discrete time survival models, hence making it possible to model risk factors for infections.
Imputation
The rollout of the electronic health record (Eclipsys; H/P Technologies, Phoenix, AZ, USA) was staggered at the four hospitals for the time period of our analysis. Because this system was primarily used in our data set to record the use of CV catheters, urinary catheters, and the administration of medication, these observations were frequently missing for earlier years. Because this pattern of “missingness” was due solely to the introduction of the new system, we imputed these variables to maintain a full sample.
We used two imputation procedures. To identify whether or not one of the three events (CV catheterization, urinary
catheterization, and the administration of medication) had occurred for a patient, we used multiple imputation by chained equations11 using logistic regression with all other available variables in the data set as predictors for the three events.
Once we imputed these three variables, we then needed to impute the day the event started and the duration of the event. Because start and end dates must be restricted to occur within a patient’s hospital stay (i.e., we could not predict a CV catheter to be inserted on a patient’s tenth day if he only stayed in the hospital for 9 days) and the distributions of start day and duration are skewed, we performed hotdeck imputation, which replaces data for the missing observations (“recipients”) with data from nonmissing observations in the same sample that have similar characteristics (“donors”).12 The “recipients” consisted of patients whose CV catheter, urinary catheter, and use of medications had just been imputed in the first step. The “donors” consisted of patients who had one of these events, but with the same length of stay as the recipient and a similar predicted probability of start day and duration. This predicted probability was obtained by estimating separate count models for start day and duration using all other variables in the data sets as predictors, using only the sample in which one of the events had occurred. Similar predicted probability was defined by grouping the predicted start day and duration into deciles.
Data extraction, manipulation and analysis were conducted using TOAD for DB2 version 3.1.1 (Quest Software, Aliso Viejo, CA, USA) and SAS version 9.1.3 (SAS Institute, Cary, NC, USA), and Stata version 10.1 (Stata Corp, College Station, TX, USA) was used for imputation.
Results
Table 1 displays the summary of discharges for each hospital separately by year for all inpatient discharges from 2006 to 2008. Nearly 320,000 discharges occurred during this time period, with small changes in discharges at each hospital over the 3-year period. Given the different target populations, there were considerably more discharges at the two tertiary care hospitals. Table 2 and Table 3 display the number of discharges in which patients were identified as being infected according to our algorithms, separately by site, organism, and hospital. Consistent with the number of discharges across hospitals, there were more infections at the tertiary care hospitals. Table 4 displays the summary statistics of a subset of variables in the final data set.
Hospital/year    2006       2007       2008       Total
Community        13,706     13,515     13,570     40,791
Pediatric        16,551     18,375     19,260     54,186
Tertiary1        41,524     41,586     40,724     123,834
Tertiary2        33,547     33,926     33,661     101,134
Total            105,328    107,402    107,215    319,945
Table 1. Summary of discharges by hospital and year.
Hospital         BSI1       UTI2       PNU3       SSI4
Community        937        3,285      256        80
Pediatric        1,145      1,163      176        137
Tertiary1        3,024      7,728      1,101      835
Tertiary2        3,241      8,241      1,706      705
Total            8,347      20,417     3,239      1,757
Notes: 1BSI, blood stream infection; 2UTI, urinary tract infection; 3PNU, pneumonia; 4SSI, surgical site infection.
Table 2. Number of infections by any organism and hospital.
As one way of assessing the validity of our imputation, we compared the distribution of nonmissing observations to the distribution of missing (imputed) observations. Figures 1 and 2 display histograms for CV catheter, with Figure 1 showing the results for imputing the first day of insertion and Figure 2 showing the results for the duration of insertion (results are comparable for urinary catheter and medication administration). In both figures, the white bars represent cases where CV catheter data were complete (observed) and the dark bars represent cases in which CV catheter data were imputed. These figures demonstrate that our imputation procedure was generally effective in replicating the distribution of these variables.
Discussion
Although a fully integrated database is essential for comparative effectiveness and outcomes research, the initial development phase of this project posed a number of challenges and required considerable time. In fact, the process required almost 2 years of work of a team including a clinician, economist, epidemiologist, and an experienced programmer and statistician. Major challenges that we encountered are discussed below, and included identifying and obtaining permission for access to data sources, limitations regarding extraction of text-based data, and technical issues regarding merging various systems across institutions.
In many healthcare systems, departments or service lines often operate independently; it is thus not surprising that silos or fiefdoms develop to facilitate getting work accomplished efficiently. Considerable effort was required in this project to first identify within each department and across settings the “proprietor” or steward/manager of specific data sources and then to work with them to obtain the necessary permissions to access and use the data. There were no specific protocols or guidelines in place to clarify how this should be done, and in some cases it was
difficult to determine who actually had the right to grant access to data for anyone outside their specific area. Over a period of months, we had multiple conversations with various individuals to develop our own list of individuals with the authority to grant access. Because multiple and varying electronic data collection systems had been purchased or internally developed by many individual departments or divisions, this was one of the most time consuming tasks we encountered. To facilitate future efforts to consolidate databases, we recommend that healthcare systems begin to identify the various sources of clinical, administrative and financial data and develop policies and procedures to access and use the data.
Natural language processing (NLP) algorithms have been used in a number of clinical applications to extract useful information for research.13–15 In this study, we considered using NLP algorithms to extract data from text-based records such as nursing notes and radiology reports. Although this was possible, we chose not to pursue NLP for several reasons. First, a huge investment in additional time and resources would have been necessary and we did not see sufficient value added to make the cost worth the effort. Second, even when NLP algorithms are established, they are often not sufficiently sensitive to assure efficient and accurate retrieval of useable information.15 Most importantly, our first priority was to create a system that was potentially generalizable across institutions in which the required NLP expertise might not be available. Although our study is limited by the fact that we do not have data extracted from text notations, this is also an advantage in terms of generalizability and sustainability.
Finally, and not surprisingly, we encountered technical issues regarding merging various software and data formats across institutions. Despite the fact that the four hospitals in this study were part of a single large hospital system, the institutions varied with regard to the electronic record systems used. In fact, during the study period, one of the hospitals changed electronic medical records systems and, as noted in Methods, some data elements were not available for the entire study period at all sites, necessitating the application of imputation methods. Such technical problems require considerable programming expertise.
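The hot-deck step described under Imputation can be illustrated with a short sketch. This is not the Stata code used in the study. It is a minimal Python illustration which assumes that each patient record already carries a length of stay (`los`) and a model-predicted start day (`predicted`), and that donors also carry an observed `start_day`. These field names, the use of pooled decile cut points, and the choice to leave a recipient unimputed when no donor shares its cell are assumptions made for the example.

```python
import bisect
import random
import statistics


def decile_of(value, cuts):
    """Return 0-9: the decile bin that `value` falls in, given 9 cut points."""
    return bisect.bisect_right(cuts, value)


def hot_deck(donors, recipients, field, rng):
    """Impute `field` for each recipient from a randomly chosen matching donor.

    A donor matches when it has the same length of stay and its predicted
    value lies in the same decile as the recipient's (assumed matching rule).
    """
    # Decile cut points from the pooled predicted values (an assumption).
    pooled = [p["predicted"] for p in donors] + [p["predicted"] for p in recipients]
    cuts = statistics.quantiles(pooled, n=10)

    # Group observed donor values into (length of stay, decile) cells.
    cells = {}
    for d in donors:
        key = (d["los"], decile_of(d["predicted"], cuts))
        cells.setdefault(key, []).append(d[field])

    imputed = []
    for r in recipients:
        key = (r["los"], decile_of(r["predicted"], cuts))
        pool = cells.get(key)
        # No donor in the cell: leave the value missing rather than guess.
        imputed.append(rng.choice(pool) if pool else None)
    return imputed


# Hypothetical toy data, not study data.
donors = [
    {"los": 9, "predicted": 2.0, "start_day": 2},
    {"los": 9, "predicted": 2.0, "start_day": 3},
    {"los": 20, "predicted": 12.0, "start_day": 14},
]
recipients = [
    {"los": 9, "predicted": 2.0},  # shares a cell with the first two donors
    {"los": 5, "predicted": 2.0},  # no donor has a 5-day stay
]
result = hot_deck(donors, recipients, "start_day", random.Random(0))
```

Because a donor must have the same length of stay as the recipient, a donated start day can never fall after the recipient's discharge, which is the constraint that motivated hot-deck imputation in Methods.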
Figure 2. Result of imputation for duration of central venous catheter.
Clearly, the extensive resources required to overcome such challenges are not justifiable if the database remains static for a short period of time, because the data will quickly become outdated and less relevant for research or quality monitoring. Hence, we are now in the process of incorporating the database into the institution’s Clinical Data Warehouse as a datamart, and setting up automatic feeds to update the data on a continuous, ongoing basis. The database to date has been used to examine clinical problems related to infections such as identifying risk factors for multidrug resistant infections, examining the relationship between short bowel syndrome and incidence of bloodstream infection, and correlating measures of glucose control and risk of surgical site infection in diabetics and nondiabetics. Additional data elements can be added to the database for investigators seeking to test other specific hypotheses. We plan to widely disseminate information regarding the availability of these data to investigators within and outside the study institutions.
Although the algorithms developed to identify infections were specific to the focus of our grant on healthcare acquired
Conclusion
Given that it is not always possible to design randomized clinical trials to understand the impact of various clinical interventions, researchers must often instead rely on retrospectively collected data from various sources. In analyses using retrospective data, it becomes more important to account for the full range of experiences patients encounter in the healthcare system. Detailed information on these encounters is often recorded electronically, but these data are typically stored in distinct databases, thus limiting researchers’ ability to compile a cohesive, comprehensive account of patient encounters.
In this paper, we have described the steps we have taken to compile such a database from a major hospital system in New York City as part of a larger study to examine the impact of antimicrobial-resistant infections on the costs to society. Several obstacles were encountered in this process that are likely to be common across other settings, including: (1) interacting with distinct administrative units to locate data elements; (2) finding a secure, central location to house the data; (3) appropriately defining health measures of interest; (4) obtaining and linking these data to create a usable format for conducting research; and (5) dealing with missing data. Although some of the steps we have taken to address these issues are context specific, these steps are likely to serve as a general guideline for creating such data sets in other large healthcare systems.
The resulting data set is an incredibly rich one that is likely to prove useful for a wide range of clinical research questions. Looking ahead, a major focus centers on maintaining the sustainability of these data to ensure they can be regularly updated to include additional years of data as they become available.
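For readers building a similar data set, the linkage rule described under "Linking data" (account number where available, otherwise medical record number plus date/time stamp, followed by de-identification) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the SAS code used in the study: the field names (`account`, `mrn`, `admit`, `discharge`, `timestamp`) and the sequential study identifiers are hypothetical.

```python
from datetime import datetime


def build_indexes(stays):
    """Index hospital stays by account number and by medical record number."""
    by_account = {s["account"]: s for s in stays}
    by_mrn = {}
    for s in stays:
        by_mrn.setdefault(s["mrn"], []).append(s)
    return by_account, by_mrn


def link_to_stay(record, by_account, by_mrn):
    """Return the stay a source record belongs to, or None if unmatched."""
    account = record.get("account")
    if account is not None:
        return by_account.get(account)
    # Fallback: same MRN and a timestamp inside the admit-discharge window.
    for stay in by_mrn.get(record["mrn"], []):
        if stay["admit"] <= record["timestamp"] <= stay["discharge"]:
            return stay
    return None


def deidentify(stays):
    """Swap account numbers and MRNs for arbitrary study identifiers."""
    patient_ids = {}
    out = []
    for stay_id, s in enumerate(stays, start=1):
        # One study identifier per patient, reused across that patient's stays.
        patient_id = patient_ids.setdefault(s["mrn"], len(patient_ids) + 1)
        # Dates are carried over here only to keep the illustration short.
        out.append({"stay_id": stay_id, "patient_id": patient_id,
                    "admit": s["admit"], "discharge": s["discharge"]})
    return out


# Hypothetical toy data, not study data.
stays = [
    {"account": "A1", "mrn": "M1",
     "admit": datetime(2007, 1, 1, 8, 0), "discharge": datetime(2007, 1, 5, 12, 0)},
    {"account": "A2", "mrn": "M1",
     "admit": datetime(2007, 3, 10, 9, 0), "discharge": datetime(2007, 3, 12, 15, 0)},
]
by_account, by_mrn = build_indexes(stays)
lab = {"account": None, "mrn": "M1", "timestamp": datetime(2007, 1, 3, 10, 0)}
matched = link_to_stay(lab, by_account, by_mrn)
deid = deidentify(stays)
```

A source record with a timestamp that falls outside every stay for its medical record number is returned as unmatched, so such records can be reviewed rather than silently attached to the wrong admission.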
Acknowledgments
The study was funded by NIH/NINR Grant R01 NR010822, Distribution of the Costs of Antimicrobial Resistant Infections. We gratefully acknowledge the administrative support of Bevin Cohen and statistical expertise of Jennifer Hill and Haomiao Jia.
References
1. Balfour DC 3rd, Evans S, Januska J, Lee HY, Lewis SJ, Nolan SR, Noga M, Stemple C, Thapar K. Health information technology: results of a roundtable. J Manag Care Pharm. 2009; 15(1 Suppl. A): 10–17.
2. Poon EG, Jha AK, Christino M, Honour MM, Fernandopulle R, Middleton B, Newhouse J, Leape L, Bates DW, Blumenthal D, et al. Assessing the level of health information technology in the United States: a snapshot. BMC Med Informat Decis Mak. 2006; 6: 1.
3. Jha AK, Ferris TG, Donelan K, DesRoches C, Shields A, Rosenbaum S, Blumenthal D. How common are electronic health records in the United States? A summary of the evidence. Health Aff (Millwood). 2006; 25: w496–w507.
4. Sherman ER, Heydon KH, St. John KH, Teszner E, Rettig SL, Alexander SK, Zaoutis TZ, Coffin SE. Administrative data fail to accurately identify cases of healthcare-associated infection. Infect Contr Hosp Epidemiol. 2006; 27: 332–337.
5. Stevenson KB, Khan Y, Dickman J, Gillenwater T, Kulich P, Taylor D, Santangelo J, Lundy J, Jarjoura D. Administrative coding data, compared with CDC/NHSN criteria, are poor indicators of health care-associated infections. Am J Infect Control. 2008; 36: 155–164.
6. Horan TC, Andrus M, Dudeck MA. CDC/NHSN surveillance definition of health care-associated infection and criteria for specific types of infections in the acute care setting. Am J Infect Control. 2008; 36: 309–332.
7. National Healthcare Safety Network (NHSN) [Internet]. Atlanta: Centers for Disease Control and Prevention. Available from: http://www.cdc.gov/nhsn/library.html (accessed on November 1, 2010).
8. Landers T, Apte M, Hyman S, Furuya Y, Glied S, Larson E. A comparison of methods to detect urinary tract infections using electronic data. Jt Comm J Qual Patient Saf. 2010; 36: 411–417.
9. All Patient Refined Diagnosis Related Groups (APR-DRGs). Version 20.0. Available from: http://www.hcup-us.ahrq.gov/db/nation/nis/APR-DRGsV20MethodologyOverviewandBibliography.pdf. Wallingford, CT: 3M Health Information Systems; 2003. Accessed September 29, 2011.
10. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987; 40: 373–383.
11. Van Buuren S, Brand J, Groothuis-Oudshoorn CD, Rubin D. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006; 76: 1049–1064.
12. Allison PD. Missing Data (Quantitative Applications in the Social Sciences), Thousand Oaks, CA: Sage, 2002: 27–72.
13. Xu H, Jiang M, Oetjens M, Bowton EA, Ramirez AH, Jeff JM, Basford MA, Pulley JM, Cowan JD, Wang X, et al. Facilitating pharmacogenetic studies using electronic health records and natural language processing: a case study of warfarin. J Am Med Inform Assoc. 2011; 18: 387–391.
14. Womack JA, Scotch M, Gibert C, Chapman W, Yin M, Justice AC, Brandt C. A comparison of two approaches to text processing: facilitating chart reviews of radiology reports in electronic medical records. Perspect Health Inf Manag. 2010; 7: 1a.
15. Mendonca EA, Haas J, Shagina L, Larson E, Friedman C. Extracting information on pneumonia in infants using natural language processing of radiology reports. J Biomed Inform. 2005; 38: 314–321.
Pneumonia