Genomic Epidemiology Challenges

This article discusses challenges in developing data infrastructure for genomic epidemiology that were exacerbated by the COVID-19 pandemic. It highlights key challenges including unstable data sources, rapid development of new tools, and the need for timely reporting. It then provides design principles to address these, such as ensuring clean data and modular and reusable components. Finally, it describes the Swiss SARS-CoV-2 Sequencing Consortium's implementation using a relational database and containerized microservices.


Epidemics 39 (2022) 100576

Contents lists available at ScienceDirect

Epidemics
journal homepage: www.elsevier.com/locate/epidemics

Advancing genomic epidemiology by addressing the bioinformatics bottleneck: Challenges, design principles, and a Swiss example
Chaoran Chen 1, Sarah Nadeau 1, Ivan Topolsky, Niko Beerenwinkel, Tanja Stadler ∗
Department of Biosystems Science and Engineering, ETH Zürich, Basel, CH 4058, Switzerland
Swiss Institute of Bioinformatics, Lausanne, CH 1015, Switzerland

ARTICLE INFO

Keywords: Genomic epidemiology, SARS-CoV-2, Data infrastructure, Relational database, Microservices

ABSTRACT

The SARS-CoV-2 pandemic led to a huge increase in global pathogen genome sequencing efforts, and the resulting data are becoming increasingly important to detect variants of concern, monitor outbreaks, and quantify transmission dynamics. However, this rapid up-scaling in data generation brought with it many IT infrastructure challenges. In this paper, we report on developing an improved system for genomic epidemiology. We (i) highlight key challenges that were exacerbated by the pandemic situation, (ii) provide data infrastructure design principles to address them, and (iii) give an implementation example developed by the Swiss SARS-CoV-2 Sequencing Consortium (S3C) in response to the COVID-19 pandemic. Finally, we discuss remaining challenges to data infrastructure for genomic epidemiology. Improving these infrastructures will help better detect, monitor, and respond to future public health threats.

∗ Corresponding author at: Department of Biosystems Science and Engineering, ETH Zürich, Basel, CH 4058, Switzerland.
E-mail address: [email protected] (T. Stadler).
1 These authors contributed equally.

https://doi.org/10.1016/j.epidem.2022.100576
Received 21 January 2022; Received in revised form 5 April 2022; Accepted 5 May 2022
Available online 14 May 2022
1755-4365/© 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

0. Introduction

An increasingly important tool to help fight pathogenic diseases is genomic epidemiology. The analysis of pathogen genome sequences allows us to learn about pathogen evolution and epidemic or endemic transmission dynamics (Kraemer et al., 2019; Grenfell et al., 2004). However, the SARS-CoV-2 pandemic has highlighted a growing disparity between global sequencing data generation capacities and analysis capacities (Black et al., 2020). As Hodcroft et al. (2021) underscore, we seem to be drowning in data rather than swimming in information.

Genome sequence data are becoming increasingly important for epidemic response, as highlighted during the SARS-CoV-2 pandemic. In December 2019, when an unknown respiratory disease was identified in Wuhan, China, the first whole genome sequence from the causal virus helped classify the new human pathogen SARS-CoV-2 (Wu et al., 2020) and establish its likely origins (Andersen et al., 2020). Then, comparison of mutational differences in genomes collected from different regions helped distinguish imported cases from community transmission (Worobey et al., 2020). Next, genome surveillance efforts identified more transmissible variants of concern, e.g. the alpha variant (World Health Organization, 2021) in the UK in late 2020 (Volz et al., 2021). Finally, phylogenetic and phylodynamic methods use genome sequence data to quantify epidemic dynamics, including the reproductive number, transmission routes, effects of public health measures, and the role of super-spreading (Nadeau et al., 2021; Du Plessis et al., 2021; Miller et al., 2020). Thus, pathogen genome sequence data are instrumental for disease detection, outbreak tracking, and quantifying transmission dynamics.

The wealth and geographic distribution of available genomic data underlying these and other analyses indicate that many groups around the world have developed their own infrastructures for genomic epidemiology. So far, several large national initiatives have published descriptions of their technical infrastructures. In particular, Nicholls et al. (2021), Matthews et al. (2018), and Egli et al. (2019) describe UK-, Canadian-, and Swiss-specific infrastructures that enable linking of genome sequence data with associated metadata and integrate data from multiple regional contributors. Other examples are available as code bases, for instance that of the Spanish SARS-CoV-2 Sequencing Consortium (Spanish SARS-CoV-2 sequencing consortium, 2022).

Despite these successes, developing a data infrastructure for genome-based surveillance and genomic epidemiology remains a challenge (Black et al., 2020; Bernasconi et al., 2021). In the COVID-19 pandemic, bioinformatics capacity has proven to be a key bottleneck in pandemic response (Hodcroft et al., 2021). This is particularly true in countries without a well-supported national initiative, or in the period before such an initiative is established. As a US-focused report (Committee on Data Needs to Monitor Evolution of SARS-CoV-2 et al., 2020) highlights, a key priority for pandemic preparedness is to improve upon existing systems to integrate clinical and genomic data and better coordinate between different public health stakeholders. In this paper, we share lessons learned in the Swiss SARS-CoV-2 Sequencing Consortium (S3C) pertaining to three challenges that were particularly exacerbated by the COVID-19 pandemic: unstable data sources, rapid development of new tools, and the need for timely reporting. We outline design principles to address these challenges and describe our implementation of a relational database and containerized microservices as an example. Finally, we highlight remaining challenges in data management for genomic epidemiology.

The S3C began generating and analyzing SARS-CoV-2 genome sequences in March 2020. The Consortium started as a partnership between two academic groups, an associated academic sequencing facility, and a large Swiss medical diagnostics company (S3C, 2021). Since then, S3C has partnered with three core sequencing facilities in Switzerland to sequence over 44,000 samples from companies, hospitals, and research institutions. These data are made available on GISAID (Elbe and Buckland-Merrett, 2017) and the European Nucleotide Archive. To meet the demands of a growing genomic surveillance program in Switzerland, S3C benefited from early data infrastructure design choices that enabled rapid extension to new data sources, types, and users.

In the following sections we describe major implementation challenges for data infrastructure in light of the pandemic and outline design principles to address them. In particular, we discuss S3C’s implementation of a relational database and microservices-based approach as an example fulfilling these design criteria using open source tools. Finally, we consider remaining challenges in data infrastructure for genomic epidemiology that must be met to improve future public health response to pathogenic diseases.

1. Unstable data sources

Emerging public health threats bring great uncertainties, including in data availability and formats. The basic data necessary for genomic surveillance are pathogen genome sequences and minimal patient metadata, e.g., sample collection date and location. Coupling these data and analyzing them in aggregate allows public health officials to track transmission and monitor key mutations. However, the format of these data may shift over the course of an outbreak, and new data may become available. For example, accommodating genomic restructuring by the pathogen itself (e.g., by insertion, deletion, recombination, or reassortment), annotating samples with the presence or absence of newly discovered key mutations, and newly available or re-formatted metadata all represent shifts in the basic data required for effective genomic surveillance. Furthermore, it might not be possible to define a fixed and sensible file format for data exchange in the early stages of outbreak response due to time pressure.

Recommendation: ensure clean data

Unreliable and shifting source data can quickly lead to messy data with, for example, missing values and different spellings of the same entity. Ideally, infrastructure developers will work with data submitters to develop a standardized data dictionary with clearly defined permitted values for each variable. However, it is also essential to strictly validate data upon import as a double-check. It should also be anticipated that changes and corrections to the data will be necessary over time. Therefore, data should be maintained in a non-redundant form so that changes to one attribute can be easily made without the danger of causing inconsistencies. Data relations should be tracked so that the effects of changes to one attribute on others are easy to identify. Data types should be strictly enforced so that changes to data formats are rapidly detected and mistakes are not incorporated. Finally, it should be easy to define custom data types and add attributes as new data are made available.

Fig. 1. An illustration of how three key entities (tests, plates, and sequences) are stored in database tables and the mapping table that links the information from each.

Example: relational database

Relational database management systems provide a good way to fulfill these design criteria. In a relational database management system, data are stored in a collection of tables, also known as the “relational format”. Each table is independent from the others, but they may be linked (related) via shared keys, i.e. information common to two or
more tables. This allows us to formulate complex queries by joining different tables together.

A relational database approach helps keep data clean in the face of unstable data sources. Each table’s columns have fixed data types and it is possible to define custom types with a limited set of allowed values. Foreign keys, CHECK constraints and triggers allow definitions of arbitrarily complex validations. Invalid entries are rejected upon import so we know when corrections are necessary. This is especially important in the S3C, since we accept partially human-edited Excel files and non-documented output data from PCR machines as input. Non-redundancy between tables makes it easier to correct mistakes in these data when they arise. Finally, new and corrected data are simultaneously available to all database users.

Several relational database management systems are available. The S3C uses PostgreSQL,2 which is freely available and open-source. In our implementation, we have three core database tables, one each for tests (samples), plates of RNA extracts, and SARS-CoV-2 genome sequences (Fig. 1). The test table contains sample metadata from the originating laboratory, the plate table tracks where each plate was sent for sequencing and when, and the sequence table stores the assembled SARS-CoV-2 whole-genome sequence and associated quality control statistics. Finally, a mapping table links the respective keys from each table. These tables represent the core of our database, though we have added other tables through time to accommodate new data. For example, we store the identifiers assigned by public databases and additional sample metadata provided by the Swiss Federal Office for Public Health (FOPH).

2 https://www.postgresql.org/.

2. New tools

State-of-the-art computational tools are also likely to change, or even to be newly developed, over the course of a public health response. This is exemplified in the COVID-19 pandemic by evolving nomenclature systems. Lineage assignment tools were frequently updated to keep up with nomenclature changes as new lineages arose. For example, the popular pangolin software for assigning SARS-CoV-2 genome sequences to global lineages has had 75 releases since its development in April 2020 (O’Toole et al., 2021).

Recommendation: modular analysis workflows

Analysis workflows should be modular, rather than monolithic pipelines. It should be easy to update one component or swap it out for a different tool without having to re-run a full suite of analysis programs on the entire cohort. This modular structure allows individual components to be adapted or re-used for other pathogens or other projects. For use cases where software version tracking is especially important, workflow and software versions can be stored alongside the data in the database.

Example: containerized microservices

A microservices approach separates different tasks performed by different tools into loosely-coupled programs that operate autonomously, each performing a single, well-defined task. For the S3C, we implemented a growing set of microservices that import, export, and process data by adding or extracting data from the database (Fig. 2). The microservices each have their own code base, and, depending on the task, they are written in different languages.

Fig. 2. Containerized microservices operate autonomously to add or extract data from the database.

We used a containerization technology to deploy these microservices. This packages software applications together with their dependencies into single units, called containers. For example, a Pango lineage assigner requires the pangolin tool (O’Toole et al., 2021), a Nextclade importer needs Nextclade (Aksamentov et al., 2021), and the metadata importer has to mount a network folder. The services can be written in different programming languages, perhaps even different versions of the same language to accommodate different dependencies.

Most services act only upon missing data. For example, we have a Nextclade importer service that runs the Nextclade program and imports resulting quality scores and mutations. This service queries the database every ten minutes and looks for entries in the sequence table where Nextclade quality scores were previously unpopulated. Other services avoid redundancy by maintaining a database table that stores a state, e.g. filenames which have already been processed and should not be re-imported. For example, our metadata importer service operates in this way.

The containerized microservices allow fast adoption of new or updated tools. Since they are packaged and deployed independently, they can be started or stopped without impacting other services. The containerization further serves to isolate each tool and remove dependency conflicts between tools. Finally, since services only act upon missing data or when a state is changed, we avoid redundant computation. A complementary approach to achieving analysis modularity would be to use scientific workflow systems, such as Snakemake (Mölder et al., 2021) or Nextflow (Di Tommaso et al., 2017). These systems can be used together with containerization technologies and further simplify tracking of component software versions and workflow revisions used to generate output files.

3. Timely reporting

Timely reporting is crucial for an evidence-based public health response. Turn-around times for SARS-CoV-2 sequences to be made available on GISAID vary from a few days to a few weeks post-sampling, or more. Sample transport logistics, sequencing capacities, bioinformatics analysis, and report preparation all contribute to this turn-around time. Here, we focus on how to ensure rapid final reporting, as this is the aspect data managers have the most influence on.

Recommendation: Multiple levels of querying

A data management system needs to support rapid, ad-hoc querying in addition to generation of regular, stable reports. The former is necessary for early outbreak detection and detection of new variants of concern, while the latter is essential for longer-term monitoring. Ideally, the system should be able to expose an application programming interface (API) for safe public data sharing.
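
The two levels of querying named in this recommendation can be sketched against a toy database. This is only an illustration under stated assumptions, not the S3C implementation: the `sequence` table, its columns, the `weekly_counts` view, and the sample rows are all hypothetical, and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example is self-contained. A view (a stored query) serves the regular, stable report, while a one-off SELECT answers an ad hoc question.

```python
import sqlite3

# Hypothetical toy schema; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequence (
        sample_name     TEXT PRIMARY KEY,
        collection_week TEXT NOT NULL,   -- e.g. '2021-W03'
        lineage         TEXT
    );

    -- Stable report: a view is a stored query, so reading it always
    -- reflects the current contents of the underlying table.
    CREATE VIEW weekly_counts AS
        SELECT collection_week, COUNT(*) AS n_sequences
        FROM sequence
        GROUP BY collection_week;
""")

# Made-up example rows.
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("s1", "2021-W03", "B.1.1.7"),
        ("s2", "2021-W03", "B.1.177"),
        ("s3", "2021-W04", "B.1.1.7"),
    ],
)


def weekly_report(connection):
    """Regular report: read the predefined view."""
    return connection.execute(
        "SELECT collection_week, n_sequences FROM weekly_counts "
        "ORDER BY collection_week"
    ).fetchall()


def samples_with_lineage(connection, lineage):
    """Ad hoc question: which samples were assigned a given lineage?"""
    rows = connection.execute(
        "SELECT sample_name FROM sequence WHERE lineage = ? "
        "ORDER BY sample_name",
        (lineage,),
    ).fetchall()
    return [name for (name,) in rows]


report = weekly_report(conn)
alpha_samples = samples_with_lineage(conn, "B.1.1.7")
```

Because the report reads from a view rather than a stored copy, newly imported or corrected rows appear in it on the next read without any extra bookkeeping.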


Fig. 3. A SQL query that finds the samples with the S:N501Y mutation.
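
A query of the kind this caption describes could look roughly like the sketch below. The schema is an assumption made purely for illustration: the tables `sequence` and `mutation`, their columns, and the example rows are hypothetical and are not taken from the S3C database, and SQLite is used in place of PostgreSQL so that the example runs on its own. The foreign key and CHECK constraint hint at how a relational schema can reject invalid rows on import.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: one row per sequenced sample, one row per mutation call.
conn.executescript("""
    CREATE TABLE sequence (
        sample_name     TEXT PRIMARY KEY,
        collection_date TEXT NOT NULL
    );
    CREATE TABLE mutation (
        sample_name TEXT NOT NULL REFERENCES sequence(sample_name),
        gene        TEXT NOT NULL,
        aa_mutation TEXT NOT NULL CHECK (length(aa_mutation) >= 3),
        PRIMARY KEY (sample_name, gene, aa_mutation)
    );
""")

# Made-up example rows.
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("s1", "2020-12-20"), ("s2", "2020-12-28"), ("s3", "2021-01-05")],
)
conn.executemany(
    "INSERT INTO mutation VALUES (?, ?, ?)",
    [
        ("s1", "S", "N501Y"),
        ("s1", "S", "D614G"),
        ("s2", "S", "D614G"),
        ("s3", "S", "N501Y"),
    ],
)

# Declarative query: state which rows are wanted (samples carrying a given
# mutation in a given gene); the database decides how to execute the join.
QUERY = """
    SELECT s.sample_name, s.collection_date
    FROM sequence AS s
    JOIN mutation AS m ON m.sample_name = s.sample_name
    WHERE m.gene = ? AND m.aa_mutation = ?
    ORDER BY s.collection_date
"""


def samples_with_mutation(connection, gene, aa_mutation):
    """Return (sample_name, collection_date) rows that carry the mutation."""
    return connection.execute(QUERY, (gene, aa_mutation)).fetchall()


n501y_samples = samples_with_mutation(conn, "S", "N501Y")
```

Passing the gene and mutation as parameters means the same query can be reused for any newly reported mutation without editing the SQL text.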

Example: Database queries

Relational database systems support querying in several ways, fulfilling the above design criteria. One way to interact with data in a relational database is by directly using structured query language (SQL), which is a high-level and declarative language specifically designed for efficient querying. In SQL, the user describes (declares) what data should be added or retrieved, but not exactly how. The database system then works behind the scenes to optimize the necessary computations and return the desired information (Fig. 3). SQL is widely used by data analysts and does not require prior programming experience. Graphical user interfaces, for example DataGrip,3 allow users to manually add or modify data and submit queries. For those who are programmers, popular languages like R and Python have packages like dplyr and pandas that enable reading data from a database directly into data frames.

3 https://www.jetbrains.com/datagrip/.

For recurring queries, for instance for regular reporting, the database enables easy aggregation and reporting using “views”. These are derived tables that aggregate data from existing tables according to a query. For reporting purposes, we created a number of views, for instance a billing view that contains the number of sequenced and submitted samples per week and a surveillance view that aggregates per-sample lineage assignment and mutation information for the Swiss FOPH. These views are automatically updated with the correction or addition of data. We also have a microservice that exports the mutation information view on a daily basis to a drop-point for the Swiss FOPH.

Finally, for monitoring purposes, a relational database can also serve as the back-end to dashboards or websites. We offer two public-facing websites to interact with sequencing and case data stored in our database. One is a dashboard focused on Swiss case data4 and the other enables monitoring of global SARS-CoV-2 variants5 (Chen et al., 2021).

4 https://ibz-shiny.ethz.ch/covidDashboard/?_inputs_&tab=%22ts%22.
5 https://cov-spectrum.org.

Discussion

The COVID-19 pandemic has underscored both the utility of genomic epidemiology for public health response and remaining challenges in supporting related data infrastructure. Here we highlighted three challenges that were exacerbated by the rapidly changing pandemic situation: unstable data sources, rapid development of new tools, and the need for timely reporting. Then, we outlined general design principles to address these challenges. As an example, we described the S3C’s implementation of a relational database and containerized microservices.

These design choices directly enabled genome-based outbreak detection, monitoring, and public health response in the Swiss SARS-CoV-2 epidemic. Even before a new variant could be reliably called by lineage classification tools, we could quickly query Swiss data for mutations characterizing variants of concern. This enabled us to detect the first instances of the Beta, Gamma, and Delta variants in Switzerland. Our database also enabled us to quickly develop two public-facing websites for epidemic monitoring. Finally, we collaborate with the Swiss FOPH as members of the Swiss National COVID-19 Science Task Force6 to link genome sequences to patient metadata. Lineage assignment and mutation data are passed back to the FOPH to support the health authorities in their pandemic response.

6 https://sciencetaskforce.ch.

Many labs around the world have developed a data infrastructure for genomic epidemiology over the course of the COVID-19 pandemic. In fact, there are over 4000 unique submitting labs in the GISAID EpiCoV database as of January 2022. Unfortunately, a paucity of published examples makes it difficult to compare the strengths and weaknesses of various implementations in light of the challenges outlined by Black et al. (2020) and Bernasconi et al. (2021) and highlighted here. The largest pathogen genome sequencing consortium in the world is that of COG-UK. Like S3C, they use a relational database. On top of it, they developed an API and a web interface for the collaborators to submit and retrieve data (Nicholls et al., 2021). In comparison, we did not define a fixed metadata or sequence data format but adapted to the data provided by collaborators. Our aim was to reduce overhead for our collaborators. However, as data inputs stabilize, a future improvement would be to develop a more robust procedure for defining formats and updating data. An improved technical interface for data upload and correction by sequence submitters like that of COG-UK would also help.

There are also larger outstanding challenges to developing data infrastructures for genomic epidemiology. First, genome sequencing efforts are highly skewed towards high-income countries. In an interconnected world, local variants and fast epidemic spread are of global concern no matter where they arise. Expanding the technical and personnel resources for genome sequencing and data management in low and middle-income countries would enable a better, more coordinated public health response. Second, mistakes are common, from
sequencing errors introducing spurious mutations, to sample contamination, to metadata errors. SARS-CoV-2 sequences and their metadata are regularly modified or deleted from public repositories. While some mistakes are inevitable, better tools for tracking changes to sequence data and their metadata would make correcting mistakes easier and promote reproducible science and transparency. Finally, we need robust infrastructures for safe linking of patient metadata with genome data. It can be a challenge to establish standardized, anonymized identifiers at the relevant scale for national sequencing projects, particularly in countries with decentralized health care services. Strong partnerships with government health ministries will help here, with metadata like vaccination and hospitalization status being provided to ensure actionable results for public health response.

In conclusion, generating pathogen genome sequence data and linking it to case-level metadata facilitates a rapid, evidence-based public health response to evolving infectious pathogens. Effective and timely generation of these data in rapidly changing situations relies on robust and agile data infrastructures, and improvements in the area should be a priority for pandemic preparedness.

Funding

TS, SN and CC are supported by the Swiss National Science Foundation (grant number 31CA30_196267). NB and IT are supported by the SIB Swiss Institute of Bioinformatics.

CRediT authorship contribution statement

Chaoran Chen: Conceptualization, Data curation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Sarah Nadeau: Conceptualization, Data curation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Ivan Topolsky: Data curation, Resources, Software, Writing – review & editing. Niko Beerenwinkel: Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing. Tanja Stadler: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code availability

Our code is openly available under the LGPL-license on GitHub at https://github.com/cevo-public/harvester-database-and-automation.

References

Aksamentov, I., Roemer, C., Hodcroft, E.B., Neher, R.A., 2021. Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Softw. 6 (67), 3773. http://dx.doi.org/10.21105/joss.03773.
Andersen, K.G., Rambaut, A., Lipkin, W.I., Holmes, E.C., Garry, R.F., 2020. The proximal origin of SARS-CoV-2. Nat. Med. 2020 26:4 (ISSN: 1546-170X) 26, 450–452, https://www.nature.com/articles/s41591-020-0820-9.
Bernasconi, A., Canakoglu, A., Masseroli, M., Pinoli, P., Ceri, S., 2021. A review on viral data sources and search systems for perspective mitigation of COVID-19. Brief. Bioinform. (ISSN: 1477-4054) 22 (2), 664–675. http://dx.doi.org/10.1093/bib/bbaa359.
Black, A., MacCannell, D.R., Sibley, T.R., Bedford, T., 2020. Ten recommendations for supporting open pathogen genomic analysis in public health. Nat. Med. (ISSN: 1546-170X) 26 (6), 832–841. http://dx.doi.org/10.1038/s41591-020-0935-z.
Chen, C., et al., 2021. CoV-Spectrum: analysis of globally shared SARS-CoV-2 data to identify and characterize new variants. Bioinformatics (ISSN: 1367-4803) 38 (6), 1735–1737. http://dx.doi.org/10.1093/bioinformatics/btab856.
Committee on Data Needs to Monitor Evolution of SARS-CoV-2, et al., 2020. Genomic Epidemiology Data Infrastructure Needs for SARS-CoV-2: Modernizing Pandemic Response Strategies. National Academies Press, Washington, D.C., ISBN: 978-0-309-68091-2, http://dx.doi.org/10.17226/25879, Pages: 1-110, https://www.nap.edu/catalog/25879.
Spanish SARS-CoV-2 sequencing consortium, 2022. FISABIO-NGS / SARS-CoV2-mapping. https://gitlab.com/fisabio-ngs/sars-cov2-mapping.
Di Tommaso, P., et al., 2017. Nextflow enables reproducible computational workflows. Nature Biotechnol. (ISSN: 1546-1696) 35 (4), 316–319. http://dx.doi.org/10.1038/nbt.3820.
Du Plessis, L., et al., 2021. Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK. Science 371 (6530), 708–712. http://dx.doi.org/10.1126/science.abf2946.
Egli, A., et al., 2019. Improving the quality and workflow of bacterial genome sequencing and analysis: paving the way for a Switzerland-wide molecular epidemiological surveillance platform. Swiss Med. Weekly (49), https://smw.ch/article/doi/smw.2018.14693.
Elbe, S., Buckland-Merrett, G., 2017. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob. Chall. 1 (1), 33–46, https://onlinelibrary.wiley.com/doi/abs/10.1002/gch2.1018.
Grenfell, B.T., Pybus, O.G., Gog, J.R., Wood, J.L.N., Daly, J.M., Mumford, J.A., Holmes, E.C., 2004. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303 (5656), 327–332. http://dx.doi.org/10.1126/science.1090727.
Hodcroft, E.B., et al., 2021. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature 2021 591:7848 (ISSN: 14764687) 591, 30–33. http://dx.doi.org/10.1038/d41586-021-00525-x.
Kraemer, M.U.G., et al., 2019. Reconstruction and prediction of viral disease epidemics. Epidemiol. Infect. 147, e34. http://dx.doi.org/10.1017/S0950268818002881.
Matthews, T.C., et al., 2018. The integrated rapid infectious disease analysis (IRIDA) platform. BioRxiv http://dx.doi.org/10.1101/381830.
Miller, D., et al., 2020. Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel. Nature Commun. (ISSN: 2041-1723) http://dx.doi.org/10.1038/s41467-020-19248-0.
Mölder, F., et al., 2021. Sustainable data analysis with snakemake [version 2; peer review: 2 approved]. F1000Research 10 (33), http://dx.doi.org/10.12688/f1000research.29032.2.
Nadeau, S.A., Vaughan, T.G., Scire, J., Huisman, J.S., Stadler, T., 2021. The origin and early spread of SARS-CoV-2 in Europe. Proc. Natl. Acad. Sci. (ISSN: 0027-8424) 118, http://dx.doi.org/10.1073/PNAS.2012008118.
Nicholls, S.M., et al., 2021. CLIMB-COVID: continuous integration supporting decentralised sequencing for SARS-CoV-2 genomic surveillance. Genome Biol. (ISSN: 1474-760X) 22 (1), 196. http://dx.doi.org/10.1186/s13059-021-02395-y.
O’Toole, A., et al., 2021. Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool. Virus Evol. (ISSN: 2057-1577) 7 (2), http://dx.doi.org/10.1093/ve/veab064, veab064.
S3C, 2021. Swiss SARS-CoV-2 sequencing consortium (S3C). https://bsse.ethz.ch/cevo/research/sars-cov-2/swiss-sars-cov-2-sequencing-consortium.html.
Volz, E., et al., 2021. Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England. Nature 2021 593:7858 (ISSN: 1476-4687) 593, 266–269. http://dx.doi.org/10.1038/s41586-021-03470-x.
World Health Organization, 2021. Tracking SARS-CoV-2 variants. https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/.
Worobey, M., et al., 2020. The emergence of SARS-CoV-2 in Europe and North America. Science 370 (6516), 564–570, https://www.science.org/doi/abs/10.1126/science.abc8169.
Wu, F., et al., 2020. A new coronavirus associated with human respiratory disease in China. Nature (ISSN: 14764687) 579, 265–269. http://dx.doi.org/10.1038/s41586-020-2008-3.
