GBE
Patchwork: Alignment-Based Retrieval
and Concatenation of Phylogenetic Markers from
Genomic Data
Downloaded from https://academic.oup.com/gbe/article/15/12/evad227/7470721 by UFRJ - CFCH/IFCS user on 12 September 2024
Felix Thalén 1,2, Clara G. Köhne 1, and Christoph Bleidorn 1,*
1 Department for Animal Evolution and Biodiversity, Georg-August-Universität Göttingen, Göttingen 37073, Germany
2 Cardio-CARE AG, Medizincampus Davos, Davos Wolfgang 7265, Switzerland
Accepted: December 06, 2023
Abstract
Low-coverage whole-genome sequencing (also known as “genome skimming”) is becoming an increasingly affordable approach to large-scale phylogenetic analyses. While already routinely used to recover organellar genomes, genome skimming is rather rarely utilized for recovering single-copy nuclear markers. One reason might be that only few tools exist to work with this data type within a phylogenomic context, especially to deal with fragmented genome assemblies. We here present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented short-read assemblies as well as directly from sequence reads. Patchwork is an alignment-based tool that utilizes the sequence aligner DIAMOND and is written in the programming language Julia. Homologous regions are obtained via a sequence similarity search, followed by a “hit stitching” phase, in which adjacent or overlapping regions are merged into a single unit. The novel sliding window algorithm trims away any noncoding regions from the resulting sequence. We demonstrate the utility of Patchwork by recovering near-universal single-copy orthologs within a benchmarking study, and we additionally assess the performance of Patchwork in comparison with other programs. We find that Patchwork allows for accurate retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing depths with high computational speed, outperforming existing software targeting similar tasks. Patchwork is released under the GNU General Public License version 3. Installation instructions, additional documentation, and the source code itself are all available via GitHub at https://github.com/fethalen/Patchwork.
Key words: genome skimming, low-coverage sequencing, museomics, phylogenomics, short reads, single-copy genes.
Significance
Even though current sequencing and computational methods allow for the completion of high-quality genomes for all life on earth, the availability of material for sequencing has become a major bottleneck in phylogenomic studies, especially since material stored in museum collections, or collected during barcoding campaigns, is often not suitable for reconstructing high-quality, highly continuous genomes. At the same time, the output of short-read sequencing machines is increasing, and prices for these techniques are dropping. Short-read data are still routinely used to recover organellar genomes, but this so-called genome skimming approach is rather rarely utilized for recovering single-copy nuclear markers. We present a new software tool called Patchwork for mining phylogenetic markers from highly fragmented genome assemblies, as well as directly from short sequence reads. We demonstrate the accuracy of this new approach and show in a benchmarking study that it also outperforms existing software for similar tasks. Patchwork allows users to compile preselected gene sets from low-coverage short-read sequencing data sets and is thereby ideally suited for including material from museum collections in phylogenomic studies.
© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
Genome Biol. Evol. 15(12) https://doi.org/10.1093/gbe/evad227 Advance Access publication 12 December 2023
Introduction
Advancements in high-throughput sequencing techniques have revolutionized the field of phylogenetics and ultimately our understanding of the tree of life (Lemmon and Lemmon 2013). The availability of genomic and transcriptomic data for basically all desired taxa and for a reasonable price has transformed the field to phylogenomics: genome-scale phylogenetic systematic analyses (McCormack et al. 2013). Some challenges remain, however, as many studies still show incongruent results, low branch support, or lacking resolution (Philippe et al. 2017; Steenwyk et al. 2023). Even though complete genomes are becoming available for more and more eukaryotes, the access to high-molecular-weight DNA is the bottleneck in the quest for sequencing genomes of all life on earth (Blom 2021; Dahn et al. 2022). Nowadays and in the past, most large-scale phylogenomic studies were conducted using either transcriptome sequencing or genome subsampling methods such as target enrichment, which focuses on a set of preselected loci (Bleidorn 2017).

Transcriptome sequencing offers a way to sequence only the expressed portion of a genome without prior sequence knowledge (Stark et al. 2019). Unfortunately, this approach requires freshly collected material or specifically stored material, e.g. deeply flash frozen or in RNAlater. Furthermore, smaller specimens may need to be pooled together to attain sufficient amounts of mRNA, and such practice risks mixing up individuals with undetected genetic variation.

Unfortunately, a large number of collected specimens exist only in natural history museum collections, and most of these are ethanol preserved and thus not usable for transcriptomic studies (Call et al. 2021). As taxon sampling is considered one of the most important factors for accurate phylogenetic tree reconstruction (Heath et al. 2008), it would be a missed opportunity to leave the potential of natural history collections untapped. Target enrichment approaches, on the other hand, require prior knowledge of target sequences (e.g. from well-annotated genomes) for the construction of oligonucleotide probes. Moreover, the number of enriched targets is limited by the number of oligonucleotides included in the enrichment kit of choice, and the efficiency of such approaches decreases as the bait-to-target distance increases (Bragg et al. 2016). Another downside is that the data produced are difficult to reuse for other types of genomic or evolutionary studies.

A viable alternative to assemble taxon-rich phylogenomic data sets is low-coverage whole-genome sequencing (LC-WGS; also known as “shallow genome sequencing” or “genome skimming”) using short-read technologies such as Illumina sequencing (Dodsworth 2015). Relying solely on this approach has been shown to be inadequate for the reconstruction of highly contiguous reference-quality genomes (Rhie et al. 2021). However, due to the introduction of newer sequencing platforms (e.g. Illumina's NovaSeq sequencing platform), short-read WGS became relatively cheap, and prices are even expected to drop with Ultima Genomics, another highly competitive sequencing platform, entering the market (Simmons et al. 2023). Moreover, short-read sequencing library construction also allows highly fragmented DNA to be used as input (Hu et al. 2021), thereby enabling the use of material from museum collections from all around the world (Raxworthy and Smith 2021). Consequently, LC-WGS can be used to generate data from various sources of targeted organisms to retrieve marker loci on a genome scale. While this so-called “genome skimming” approach has frequently been used to reconstruct organellar genomes or other high-copy fractions of eukaryote genomes (Richter et al. 2015; Jin et al. 2020), it seems currently underutilized to retrieve single-copy nuclear markers (Liu et al. 2021). One reason is that short-read assemblies of eukaryotic genomes tend to be highly discontinuous, and automated annotation of such large, fragmented genomes remains difficult (Salzberg 2019), as they are characterized by the presence of “genes in pieces,” where introns interrupt coding sequences (Rogozin et al. 2005). Depending on the coverage, short-read draft genomes are characterized by low N50s in the range of few (if at all) kilobase pairs (kb; Salzberg et al. 2012), and consequently, exons of a single gene usually end up on several contigs.

The disuse of genome skimming in large-scale phylogenetics could potentially be ascribed to the lack of suitable data analysis methods (Zhang et al. 2019). Existing software tools for working with LC-WGS data in a phylogenomic context, such as aTRAM 2 (Allen et al. 2017, 2018), ALiBaSeq (Knyshov et al. 2021), and GeMoMa (Keilwagen et al. 2016, 2018), are either written in an interpreted language (e.g. Perl or Python) that does not allow the program to scale well with the large biological data sets that are commonplace today (e.g. aTRAM 2, ALiBaSeq) or need well-annotated reference genomes or transcriptomes (e.g. GeMoMa). A recent addition to the portfolio of available tools for such programs is Read2Tree, which directly infers trees from unassembled data (Dylus et al. 2023).

To address the limitations typically associated with working with genome skimming data, we present Patchwork, an alignment-based tool for mining phylogenetic markers directly from WGS data. Patchwork utilizes the sequence aligner DIAMOND (Buchfink et al. 2021) and is written in the programming language Julia (Bezanson et al. 2017) to achieve the best possible speed, thus allowing Patchwork to scale well with today's genome-scale data sets. In addition, our implementation focuses on ease of use, and our program handles each step in the analysis, from start to finish. Using our new approach, we targeted universal single-copy orthologs
FIG. 1.—Graphical overview of the Patchwork algorithm. First, a) query sequences are aligned to the provided reference sequence. These alignments may
or may not be overlapping. b) Overlapping alignments are realigned but only in the area in which they overlap. The best-scoring alignment is retained while all
others are discarded. c) Nonaligned residues are then removed, and d) the remaining regions are concatenated into a single, continuous sequence.
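The four steps named in the Fig. 1 caption (align, resolve overlaps, drop unaligned residues, concatenate) can be illustrated with a minimal Python sketch. This is not Patchwork's implementation, which is written in Julia and realigns overlapping regions. The `Hit` record, the `stitch` function, and the example hits are hypothetical names and data introduced here for illustration. The sketch also assumes gapless hits with exactly one translated residue per reference position, and it resolves overlaps by whole-hit score rather than by realignment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical record for one translated query hit on a reference protein."""
    ref_start: int  # 0-based, inclusive position on the reference
    ref_end: int    # exclusive end position on the reference
    score: float    # alignment score, e.g. a bit score
    residues: str   # translated query residues, one per reference position


def stitch(hits: List[Hit], ref_length: int) -> str:
    """Merge hits along the reference, keeping the best-scoring hit per position."""
    best: List[Optional[str]] = [None] * ref_length
    # Process hits from lowest to highest score so that, wherever hits overlap,
    # the higher-scoring hit overwrites the lower-scoring one (step b).
    for hit in sorted(hits, key=lambda h: h.score):
        if len(hit.residues) != hit.ref_end - hit.ref_start:
            raise ValueError("sketch assumes one residue per reference position")
        for offset, pos in enumerate(range(hit.ref_start, hit.ref_end)):
            best[pos] = hit.residues[offset]
    # Reference positions without any hit are dropped (step c) and the
    # remaining residues are concatenated into one sequence (step d).
    return "".join(residue for residue in best if residue is not None)


# Three hypothetical hits; the first two overlap at reference positions 3 and 4,
# and reference position 8 is not covered by any hit.
hits = [
    Hit(ref_start=0, ref_end=5, score=50.0, residues="MKTAY"),
    Hit(ref_start=3, ref_end=8, score=30.0, residues="GYIAK"),
    Hit(ref_start=9, ref_end=12, score=20.0, residues="QRL"),
]
print(stitch(hits, ref_length=12))  # MKTAYIAKQRL
```

In the overlap, the residues of the higher-scoring hit win, and the uncovered reference position simply disappears from the stitched sequence.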
(USCOs), which are available based on careful analysis of a curated database (OrthoDB, www.orthodb.org). A set of 954 metazoan-specific USCOs has been validated against 364 metazoan genomes and shown to be indeed (i) single-copy and (ii) nearly universally present (Manni et al. 2021).

Results
Patchwork is a reference- and alignment-based method for mining phylogenetic markers from WGS data, using either assembled contigs or reads as input (Fig. 1). The aim of Patchwork is to capture multiexon or fragmented genes, scattered across different contigs or reads. One or more reference protein sequences guide the “stitching” process, where the best-scoring translated query nucleotide sequences for any given region are merged into continuous stretches of amino acid sequences. Merged sequences go through a masking step in which unaligned residues, ambiguous amino acid characters (letters that do not determine a unique amino acid; they are B, J, X, or Z, where B = D or N, J = I or L, X = unknown, and Z = E or Q), and stop codons are removed from query sequences. Optionally, the removal of stop codons and ambiguous amino acid characters may be skipped by providing the --retain-stops and --retain-ambiguous flags, respectively. Finally, Patchwork implements a sliding window–based alignment trimming step to remove poorly aligned residues (e.g. due to the presence of putative noncoding regions) from the resulting sequences. The output is available as nucleotide or amino acid sequences.

Benchmark
To assess the performance of our approach, we (i) test an ideal case where the query and reference species are identical, (ii) test a case where the query and reference are 2 distant species, and (iii) compare Patchwork v.0.5.1 with ALiBaSeq v.1.2 (Knyshov et al. 2021) and aTRAM v.2.4.3 (Allen et al. 2017). Throughout these benchmarks, we use Illumina short-read nucleotide sequences from the marine annelid Dimorphilus gyrociliatus (accession PRJEB37657 in the European Nucleotide Archive). A highly contiguous (N50 = 2.24 Mb) and complete (95.8% BUSCO genes recovered, metazoa_odb10) annotated version of the compact 73.8-Mb genome of this annelid is publicly available (Martín-Durán et al. 2021).

As we only used short-read data sets at different coverages for our benchmark analyses, we created highly discontinuous assemblies with low N50s as typical for
FIG. 2.—Percent identity and query coverage in markers based on a Patchwork analysis of a SPAdes assembly of D. gyrociliatus, targeting 815 single-copy
orthologs from the same species.
real-world low-coverage genomic data sets. We assembled these sequence reads using SPAdes and subsequently used Patchwork to search for near-USCOs (Seppey et al. 2019), using a preannotated set of USCOs from that same species as a reference. Next, we used that same assembly of D. gyrociliatus to search for USCOs, this time using USCOs from the leech Helobdella robusta as the reference, a clitellate annelid that diverged at least 400 mya from D. gyrociliatus (Erséus et al. 2020). Finally, we compared our program to ALiBaSeq (Knyshov et al. 2021) and aTRAM 2 (Allen et al. 2017). For this comparison, we also subsampled the aforementioned D. gyrociliatus short reads in order to simulate various sequencing coverages. We decided not to include the software GeMoMa (Keilwagen et al. 2016) in this comparison, as it heavily relies on the availability of reference genomic or transcriptomic data. Read2Tree (Dylus et al. 2023) has also not been included in the comparisons, as its focus is tree inference and not marker retrieval.

We compared the retrieved translated and stitched contigs, hereafter called “recovered markers,” to the reference D. gyrociliatus USCOs. For each reference sequence, the evaluation included percent identical positions out of all aligned positions as well as percent of reference sequence positions covered by the recovered markers. Patchwork automatically generates these statistics and produces a detailed output for each reference as well as an aggregated output over all references.

Effect of Genome Fragmentation on Accuracy
In the initial setup, we assessed the accuracy of Patchwork using a high-quality query assembly and 815 USCO reference sequences from the same species, D. gyrociliatus, thereby exploring the program's performance for the hypothetical case where the entire set of reference sequences should be recoverable as exactly matching stitched contigs from the query sequences. We retrieved all of the initial 815 markers. On average, 95.9% of all aligned positions were identical matches, with a mean query coverage of 92.2%; this equals a combined measure of 88.4% identical matches for all reference positions, whether aligned to query residues or not (Fig. 2 and Table 1).

Table 1
Results from Patchwork when using a D. gyrociliatus SPAdes assembly as the query and USCOs from a long-read assembly of D. gyrociliatus as a reference
Variable          Mean     Min    Median  Max
Reference length  447.606  77     351.0   2,748
Query length      407.953  27     322.0   2,553
Matches           385.075  24     306.0   2,549
Mismatches        21.207   0      0.0     1,075
Deletions         42.131   0      5.0     1,097
Query coverage    92.181   5.22   98.71   100.0
Identity          95.887   30.91  100.0   100.0

Effect of Reference Divergence
In the second iteration, we aligned a set of high-coverage query assemblies against a very distant reference set, in order to estimate the program's performance when using highly divergent sequences as reference. For this purpose, the same D. gyrociliatus SPAdes assembly as in the previous evaluation served as query sequence set, and 957 near-USCOs from the annotated genome of the leech H. robusta were used as a reference. We retrieved 943 out of the 957 H. robusta reference sequences. Of these, 769 successfully aligned back to 1 and only 1 of the 778
FIG. 3.—Percent identity and query coverage in markers based on a Patchwork analysis of a SPAdes assembly of D. gyrociliatus, targeting 957 single-copy
orthologs from the leech H. robusta.
D. gyrociliatus USCOs that were considered homologous to the H. robusta set (Fig. 3 and Table 2). For these 769 retrieved markers, the average percent identity measure was 89%, with a mean query coverage of 74.5%. Put differently, the recovered markers had an average of 67.2% identical matches against all reference positions.

Table 2
Results from Patchwork when using a D. gyrociliatus SPAdes assembly as the query and USCOs from H. robusta as a reference
Variable          Mean     Min    Median  Max
Reference length  448.025  77     351.0   2,748
Query length      309.93   31     259.0   2,326
Matches           249.126  15     195.0   2,326
Mismatches        30.234   0      7.0     355
Deletions         8.319    0      2.0     268
Query coverage    74.540   5.41   82.78   100.0
Identity          89.008   25.46  96.99   100.0
The recovered markers were evaluated against the set of 778 D. gyrociliatus USCOs that were considered homologous to sequences in the H. robusta reference set.

Program Comparison
In the third setup, we compared the performance and runtime for Patchwork to that of ALiBaSeq (Knyshov et al. 2021) and aTRAM 2 (Allen et al. 2018), using a D. gyrociliatus short-read data set at different sequence coverage levels (1×, 3×, 5×, 10×, 20×, and 40×). While Patchwork can use both reads and assembled contigs as an input, ALiBaSeq uses assembled contigs, and aTRAM 2 is read based. Performance was assessed using a combined measure for accuracy and completeness of the recovered USCO markers annotated, hereafter called “total percent identity.” Patchwork with D. gyrociliatus assembly data performs best over almost all data sets (Fig. 4), reaching as much as 62% total percent identity for data with at least 10× coverage. Only with a coverage of 1×, D. gyrociliatus read-based data seem to be better suited for marker retrieval. Cutoff thresholds during the assembly might lead to discarding part of the sequence data that is retained when using reads, therefore causing the latter to achieve higher query coverage for the 1× data set. Note that query coverage improves especially for read data when tantan masking in DIAMOND is disabled (i.e. by providing --masking 0 as an argument). For data sets with higher coverage, running Patchwork on read data still achieves well over 50% total percent identity. Using read data therefore is a valid option that could be considered if the compute resources necessary for assembling the sequences are scarce. The performance of Patchwork stays approximately constant for data sets with coverages of at least 10×, independent of the data type used. By comparison, ALiBaSeq achieves approximately 7% less total percent identity than Patchwork with assemblies for all data sets and performs only slightly better than Patchwork with read data for a coverage over 10×. aTRAM 2, on the other hand, performs comparatively poorly, with a maximum total percent identity of about 22% for the data set with 20× coverage. This is mostly due to the small number of recovered markers; the markers themselves generally have a high percent identity value. For a coverage of 1×, aTRAM 2 was unable to recover any stitched contigs at all. The program was also not evaluated for the data set of 40× coverage as it had not completed within the cluster's maximum runtime of 5 d.

Both Patchwork and ALiBaSeq are very fast; the programs terminated in under 5 min when using assembly
FIG. 4.—Accuracy and completeness of the recovered marker sequences for the different D. gyrociliatus data sets when run against a reference set of H.
robusta USCOs. Accuracy and completeness were jointly measured as percent identical out of all aligned positions multiplied with the total percentage of
aligned nongap positions. This integrated measure avoids a distorted performance estimation, e.g. due to small number of recovered markers but high percent
identity in the aligned positions. Patchwork was run with D. gyrociliatus assemblies unless indicated differently. ALiBaSeq received assemblies, while aTRAM 2
received reads as input.
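The combined measure defined in the Fig. 4 caption (percent identity over aligned positions multiplied by the percentage of aligned nongap positions) can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not code from Patchwork: the function name `total_percent_identity`, the use of `-` as the gap character, the choice of reference length as the denominator for coverage, and the toy alignment are all hypothetical.

```python
def total_percent_identity(ref_aln: str, qry_aln: str, gap: str = "-") -> dict:
    """Return identity, coverage, and their product for one pairwise alignment.

    Assumption made for this sketch: coverage is the share of reference
    residues that are aligned to a query residue (i.e. not to a gap).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    # Number of reference residues, ignoring gap columns in the reference.
    ref_len = sum(1 for r in ref_aln if r != gap)
    # Columns in which both the reference and the query carry a residue.
    aligned = [(r, q) for r, q in zip(ref_aln, qry_aln) if r != gap and q != gap]
    matches = sum(1 for r, q in aligned if r == q)
    identity = 100.0 * matches / len(aligned) if aligned else 0.0
    coverage = 100.0 * len(aligned) / ref_len if ref_len else 0.0
    return {
        "identity": identity,
        "coverage": coverage,
        "total": identity * coverage / 100.0,
    }


# Toy alignment: 6 of 10 reference positions are aligned, 5 of them identical.
stats = total_percent_identity("MKTAYIAKQR", "MKTGYI----")
print(round(stats["identity"], 1), round(stats["coverage"], 1), round(stats["total"], 1))

# The same product applied to the mean values reported for the same-species
# benchmark (95.9% identity, 92.2% query coverage):
print(round(0.959 * 0.922 * 100, 1))  # 88.4
```

Because the product penalizes both mismatches and missing positions, a program that recovers few markers at high identity still receives a low score, which is the distortion the caption says the measure is meant to avoid.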
FIG. 5.—Program runtime for each D. gyrociliatus data set. Patchwork was run both as a script and as a compiled program. It received D. gyrociliatus
assemblies unless indicated differently. ALiBaSeq was run on assemblies, while aTRAM 2 received reads as input.
[Figure 6: phylogenetic tree. Tip labels: Brueelia antiqua, Bothriometopus macronemis, Haematopinus macronemis, Proechionopthirus fluctus, Antarctophtirus microchir, Linognathus spicatus, Hoplopleura arborcicola, Neohaematopinus pacificus, Pedicinius badius, Pthirus gorillae, Pthirus pubis, Pediculus schaeffi B, Pediculus schaeffi A, Pediculus humanus A 2013, Pediculus humanus B 2013. Node support values shown range from 81 to 100. Tree scale: 0.1.]
FIG. 6.—Phylogenetic analysis of Phthiraptera relationships as recovered from a Maximum Likelihood analysis of a combined supermatrix using USCOs as
recovered by Patchwork. Analysis was conducted using IQ-TREE 2 including model and partition finding. Bootstrap values from 1,000 pseudoreplicates are
given at the branches.
data (Fig. 5). The runtimes fluctuated only slightly between data sets. Using Patchwork with reads required more time for larger data sets, but even for the largest evaluated data set, it finished after half an hour. By comparison, running aTRAM 2 took days for all data sets.

Patchwork in a Phylogenomic Context
To demonstrate how our software could be utilized in a phylogenomic pipeline, we used it to retrieve a set of 957 metazoan-specific USCOs from a phthirapteran data set (Allen et al. 2017). When reusing a set of 15 lice Operational Taxonomic Units (Hexapoda and Phthiraptera), we were able to retrieve all 957 USCOs, for all taxa. The resulting alignments contained few gaps for any marker; i.e. most markers were well above the 90% aligned position trimming threshold. The trimmed alignment contained 3,454,320 positions in total, compared to 5,383,303 before trimming (i.e. ∼64% of positions were retained after trimming). Our phylogenetic reconstruction resulted in a well-supported tree (Fig. 6), which is largely congruent with the original analysis (Allen et al. 2017), with the exception of the position of Haematopinus macronemis. However, this placement is the only part of the tree that is not well supported, and the reasons for the incongruence are unclear; one possibility is a slightly different choice of phylogenetic markers. In general, the approach worked very well, and for 951 of 954 USCOs, nearly complete exonic data could be retrieved.

Discussion
Patchwork is a new software for quickly mining phylogenetic markers from WGS data. Since Patchwork can retrieve homologous regions even in distantly related taxa, this program lends itself especially well to recovering phylogenetic markers for phylogenomic studies. It is simultaneously an efficient way of increasing marker occupancy in poorly assembled genomes and/or in the presence of multilocus exons. Finally, Patchwork allows the user to combine 2 different data types (i.e. transcriptomic and genomic data) into a single data set, thus further enabling an even larger taxon sampling and encouraging data reusability.

Special consideration should be taken to avoid the creation of chimeric sequences. One way in which such sequences may arise is when orthologous (i.e. genes related via a speciation event) and paralogous (i.e. genes related via a gene duplication event) sequences are merged together. To circumvent this issue, we recommend that the user limits the use of reference sequences to near-USCOs. Different lineage-specific sets of such USCOs are available based on carefully analyzed sets of homologous genes from a curated database (Manni et al. 2021). Besides their use in evaluating the quality of genomic and metagenomic data, USCOs have also become prominent as preselected marker sets in phylogenomic analyses (Sahbou et al. 2022) and have been recently proposed as a unifying framework for DNA-based species delimitation (Dietz et al. 2023). Many programs, e.g. the aforementioned program BUSCO, exist for retrieving such sequences from an already assembled
genome, and these could be used as reference sequences (Waterhouse et al. 2018). Additionally, several downstream analysis tools are available to control for the presence of possible cross-contamination, (unexpected) paralogous copies, or other artifacts confounding systematic studies (Lozano-Fernandez 2022). To control for the possible artifactual inclusion of stretches of noncoding sequences, the tool PREQUAL could be used to detect and remove such regions (Whelan et al. 2018). Finally, multiple alignment tools such as MACSE (Ranwez et al. 2018) can be used to deal with putative confounding problems from the occurrence of premature stop codons, which might occur when working with low-coverage genomic data.

The accuracy and the robustness of the results depend on how closely related the target and the reference species under study are. The difficulty lies in accurately predicting noncoding regions in aligned contigs; because alignment trimming relies on gap-excluded identity, choosing the correct cutoff threshold becomes easier as the level of identity approaches 100% (the identity of noncoding regions is likely to stay the same, while the identity to coding regions increases). On the upside, high-quality genomes for practically all major lineages exist and are readily available online (Formenti et al. 2022). Moreover, and not surprisingly, the coverage of the input read data sets correlates with the performance of retrieving single-copy marker genes. Similar to a previous study (Liu et al. 2021), we also find that a coverage of 10× and more should be targeted when designing genome skimming studies. However, as seen in the proof of principle, even lower coverages enable the construction of phylogenomic data matrices. For very low-coverage data sets, the read-based mode outperforms assembly-based analyses. For the latter, assembly size seems to be more important than contiguity.

In summary, Patchwork allows the retrieval of (putatively) single-copy genes from genome skimming data sets at different sequencing coverage with high computational speed. Availability and quality of biological specimens are becoming the major bottleneck for phylogenomic studies. Especially for phylogenomic studies relying on collection-based material, Patchwork offers a fast and efficient way for marker retrieval from short-read sequence data sets.

Materials and Methods
Patchwork is implemented in Julia (Bezanson et al. 2017), a just-in-time (JIT)–compiled programming language that is typically faster than interpreted languages such as Python or R. Existing Julia bioinformatics packages such as BioAlignments.jl (https://github.com/BioJulia/BioAlignments.jl) and BioSequences.jl (https://github.com/BioJulia/BioSequences.jl) were used to speed up the development process. Patchwork is obtainable from GitHub (https://github.com/fethalen/Patchwork), is distributed under the GPLv3 license, and targets both Linux and macOS (Windows users may run Patchwork by using the Windows Subsystem for Linux).

In order to facilitate reproducibility, a Docker container (Merkel 2014) of Patchwork is also distributed via the BioContainers framework (da Veiga Leprevost et al. 2017). Similarly, we also provide an Apptainer definition file for users of the Apptainer/Singularity platform. Apptainer (formerly known as Singularity; Kurtzer et al. 2017) is another container platform that targets shared systems such as High-Performance Computing platforms, which are commonplace at universities today.

Most phylogenomic studies include more than a handful of taxa, and concatenating these manually gets increasingly tedious as the data set size increases. Therefore, Patchwork also includes a set of complementary tools for streamlining the downstream analysis. For example, the script multi_patchwork.sh lets the user run Patchwork on multiple input files and concatenate homologous sequences from different taxa into 1 file.

Initial Alignment and Database Construction
First, all reference protein sequences, regardless of whether they are spread across multiple FASTA files or not, are pooled together into a single FASTA file, from which a DIAMOND database is created. There is also the option to use an existing DIAMOND-formatted database or a BLAST output file in a tabular format by using the --database or --tabular options, respectively. These files are both provided in the output of Patchwork and can thus be reutilized when trying out different parameters. In either case, DIAMOND's BLASTX algorithm is used to align translated nucleotide sequences to 1 or more reference protein sequences.

Like DIAMOND, Patchwork, by default, scores alignments using the substitution matrix BLOSUM62 (Henikoff and Henikoff 1996), a gap open penalty of 11, and a gap extension penalty of 1. Other built-in or custom substitution matrices may be used in place of the default option. User-chosen gap open penalties and gap extension penalties may also be set, as long as they fall within the limits set by the substitution matrix of choice. For the users’ convenience, Patchwork supports a number of different DIAMOND options that can usually be provided in the same manner as in DIAMOND itself.

For all Patchwork benchmarks, we observed that disabling DIAMOND's tantan masking (Frith 2011), by setting --masking 0, as described in Table 2, yielded higher query coverages. This effect was more pronounced for read data sets but could also be detected in assembled data sets. On the other hand, the number of exact matches in all aligned positions (i.e. percent identity) between the query and the reference decreased slightly. When
combining both measures, however, disabling tantan masking improved the overall results.
Since the alignment search is likely to result in more than 1 hit per reference region, certain measures are taken to ensure that none of these hits overlap. These are “hit stitching” (also known as contig or exon stitching; i.e. merging of overlapping regions), removal of unaligned residues, and concatenation of nonoverlapping regions.

Hit Stitching
During “hit stitching,” all alignments made between the query region and the target sequence are merged in such a way that only the highest-scoring segment pair (HSP) for each region is retained. This results in a single, continuous sequence, and, as a consequence, some hits may be removed entirely (see also Fig. 1).
The “hit stitching” algorithm works as follows: First, query regions are sorted according to how they align to the target sequence, from first to last, and are added to the stack. Next, each pair of query regions on the stack is checked for overlaps. In case of an overlap, first, all regions are sorted by the first and last position at which they align to the reference sequence. The first region is added to the stack. Its start and end coordinates are then compared with those of the following region to check whether they overlap. If they do not overlap, the next adjacent region is added to the stack and compared with the following region. If they do overlap, however, the region that is currently at the top of the stack is removed. The overlapping parts of this region and the next region are realigned to identify the best-scoring sequence at that particular interval. Then, based on the realignment score, the sequences are sliced such that the best-scoring sequence is retained at the overlapping region and the nonoverlapping, flanking parts of both regions, if any, are preserved as well. Thus, a maximum of 3 sliced region parts are then added to the stack as new, separate regions: the sequence part preceding the overlap, which originates from the first region; the highest-scoring sequence at the overlap, which may be from either of the 2; and the sequence part that follows the overlap, which originates from the second region. The algorithm then continues in the same manner, comparing the topmost region of the stack with the following region, until all overlaps are removed and all regions have been added to the stack. This procedure may require multiple iterations, since in every run, only pairs of consecutive regions are compared and merged.
Different aligned regions from the same contig are allowed to be stitched together. While “hit stitching” may result in the creation of chimeric sequences (i.e. 2 or more biological sequences incorrectly joined together), this procedure has the potential to increase coverage and to (correctly) join 2 or more regions that are located on separate contigs due to incomplete assembly or sequencing errors.

Alignment Masking
At this step, unaligned residues, ambiguous amino acid characters, and stop codons (also known as “termination codons”) are all removed from the resulting query sequence. Query sequences may contain residues that do not align to any particular region of the subject sequence. Such regions may be noncoding regions or simply insertions. In either case, unaligned residues are removed on the basis that inserts are less likely to constitute phylogenetically informative sites and risk introducing untranslated regions and therefore biasing the downstream analysis. Similarly, ambiguous amino acids are most likely noninformative, and stop codons are a clear indicator that noncoding characters have been included in the alignment. Although such regions are likely to be removed in the subsequent step (see above), the user may choose to keep stop codons and/or ambiguous amino acid characters by providing the flags --retain-stops and/or --retain-ambiguous.

Sliding Window–Based Alignment Trimming
One side effect of aligning translated nucleotide sequences to amino acid sequences is that one might recover noncoding portions of DNA, provided that the following 2 conditions are fulfilled: (i) the noncoding DNA is located in between 2 or more coding portions and (ii) there is a sequence region in the reference sequence that the noncoding region can align to. In the resulting alignment, noncoding portions are characterized by many indels, interspersed with occasional matches. The alignment of noncoding portions of DNA can already be observed in the alignments produced by DIAMOND, and thus, this side effect does not stem from Patchwork itself. In fact, the Patchwork algorithm will only include noncoding parts if nothing else aligns better to the affected region of the reference sequence.
To mitigate this effect, we have implemented a sliding window–based alignment trimming approach to rid the alignments of these unwanted regions. This works by scanning the alignment from left to right, cutting all regions where the average distance between query and reference is above the user-provided distance threshold. The window size and the distance threshold can both be set by the user, but need not be, since we implemented default values for both. This step can also be skipped in its entirety. This approach tries to avoid cases where a single bad, but correct, match would otherwise have been cut out.

Concatenation and Realignment of Remaining Regions
Finally, the resulting set of ordered, nonoverlapping sequence regions is concatenated into 1 continuous
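As an illustration of the overlap handling described under “Hit Stitching”, the following Python sketch resolves overlaps between hits on reference coordinates with a stack. It is not Patchwork's Julia implementation: the `Hit` record, the single `score` per hit (used here in place of realigning the overlapping parts), and the function name `stitch` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    start: int     # first aligned reference position (inclusive)
    end: int       # last aligned reference position (inclusive)
    score: float   # hypothetical stand-in for the realignment score
    source: str    # label of the contig or read the hit came from


def stitch(hits: List[Hit]) -> List[Hit]:
    """Return ordered, nonoverlapping regions covering the same reference span."""
    pending = sorted(hits, key=lambda h: (h.start, h.end))
    stack: List[Hit] = []
    while pending:
        nxt = pending.pop(0)
        if not stack or stack[-1].end < nxt.start:
            stack.append(nxt)              # no overlap with the top of the stack
            continue
        top = stack.pop()                  # overlapping region is removed ...
        ov_start, ov_end = nxt.start, min(top.end, nxt.end)
        best = top if top.score >= nxt.score else nxt
        pieces = []
        if top.start < ov_start:           # flank preceding the overlap
            pieces.append(Hit(top.start, ov_start - 1, top.score, top.source))
        pieces.append(Hit(ov_start, ov_end, best.score, best.source))
        tail = top if top.end > nxt.end else nxt
        if tail.end > ov_end:              # flank following the overlap
            pieces.append(Hit(ov_end + 1, tail.end, tail.score, tail.source))
        # ... and up to 3 sliced parts are queued again as separate regions.
        pending = sorted(pieces + pending, key=lambda h: (h.start, h.end))
    return stack
```

Each split replaces two overlapping regions by pieces that cover their union exactly once, so the total amount of overlap shrinks with every pass and the loop terminates.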
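The masking rules from “Alignment Masking” can be sketched in a few lines of Python. This is a simplified illustration, not Patchwork's code: the function name, the representation of the alignment as two gapped strings, and the set of ambiguity codes (`B`, `J`, `Z`, `X`) are assumptions. Only the meaning of the two keyword arguments mirrors the flags --retain-stops and --retain-ambiguous named in the text.

```python
from typing import Tuple

# Assumed set of ambiguous amino acid codes; the tool may use a different one.
AMBIGUOUS = frozenset("BJZX")


def mask_alignment(query: str, reference: str,
                   retain_stops: bool = False,
                   retain_ambiguous: bool = False) -> Tuple[str, str]:
    """Drop unaligned residues, stop codons ('*'), and ambiguous residues."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    kept_q, kept_r = [], []
    for q, r in zip(query, reference):
        if r == "-":
            continue                      # query residue has no reference partner
        if q == "*" and not retain_stops:
            continue                      # stop codon in the translated query
        if q in AMBIGUOUS and not retain_ambiguous:
            continue                      # ambiguous amino acid character
        kept_q.append(q)
        kept_r.append(r)
    return "".join(kept_q), "".join(kept_r)
```

Removing whole columns keeps the two strings the same length, which makes the result directly usable by a later column-wise trimming step.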
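The sliding-window trimming idea (cut every window whose average query-to-reference distance exceeds a threshold) can be sketched as follows. The per-column distance used here (0 for an identical residue, 1 for a mismatch or gap) and the default values for `window` and `max_distance` are assumptions chosen for illustration; the text does not state Patchwork's distance measure or defaults.

```python
from typing import Tuple


def sliding_window_trim(query: str, reference: str,
                        window: int = 4,
                        max_distance: float = 0.5) -> Tuple[str, str]:
    """Remove alignment columns that fall in any window that is too divergent."""
    if len(query) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    n = len(query)
    if n == 0:
        return "", ""
    # Assumed distance: 0 if both residues are identical, 1 otherwise.
    dist = [0 if q == r and q != "-" else 1 for q, r in zip(query, reference)]
    drop = [False] * n
    # Scan from left to right; a short alignment is treated as a single window.
    for i in range(max(n - window + 1, 1)):
        cols = range(i, min(i + window, n))
        if sum(dist[j] for j in cols) / len(cols) > max_distance:
            for j in cols:
                drop[j] = True
    keep = [j for j in range(n) if not drop[j]]
    return ("".join(query[j] for j in keep),
            "".join(reference[j] for j in keep))
```

Because the decision is made per window rather than per column, a single mismatch surrounded by matches stays in the alignment, whereas a run of indels is removed together with the columns that share a window with it.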
sequence. The concatenated sequence is then realigned to the reference to obtain the final output sequence and alignment score.

Benchmark
Patchwork v.0.5.1 was continuously run using Julia v.1.8.2 and DIAMOND v.2.0.13 (Buchfink et al. 2021), with the options --ultra-sensitive --frameshift 15 --masking 0. All analyses were performed on the high-performance computing cluster maintained by the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG), running the Scientific Linux release 7.9 (Nitrogen) operating system with a Linux kernel of version 3.10.0. All runs were allocated 32 Intel Xeon Platinum 9242 CPUs running at 2.30 GHz. Elapsed time was calculated as reported by Slurm.

Effect of Genome Fragmentation on Accuracy
A publicly available set of Illumina short-read sequences of D. gyrociliatus (Martín-Durán et al. 2021) was used for the query set. We used SPAdes v.3.15.3 to generate the de novo assembly, using a K-mer size of 55. A set of 815 D. gyrociliatus USCOs retrieved from the published high-quality genome assembly (GenBank, accession: GCA_904063045.1) served as the reference.

Effect of Reference Divergence
We reused the de novo assembly from the previous evaluation for the query, while a set of 957 near-USCOs from the annotated genome of the leech H. robusta (GenBank, accession: GCA_000326865.1) was used as reference sequences. We used the same parameter settings described above. For this evaluation, we did not use Patchwork's own accuracy and completeness assessment, because the true number of identical matches and the amount of query coverage are not known between the 2 divergent species D. gyrociliatus and H. robusta. We therefore chose to compare the recovered markers to a subset of the D. gyrociliatus USCOs described in the previous benchmark. More specifically, only those D. gyrociliatus USCOs that produced a hit when searching against the H. robusta USCOs with DIAMOND v2.0.13 in ultrasensitive mode were used, since only these were considered “recoverable” in this setup. The resulting D. gyrociliatus USCO set contains 778 sequences; 37 sequences were discarded. The set of recovered markers was searched against the reference USCO set using DIAMOND in --ultra-sensitive mode. For each reference sequence, we retrieved only the marker that produced the highest bit score during the alignment step. We then evaluated percent identical positions out of all aligned positions as well as percent of reference sequence positions covered by the recovered markers.

Program Comparison
In order to generate data sets at different sequencing coverages, we subsampled the trimmed D. gyrociliatus reads downloaded from NCBI GenBank. Corresponding read pairs were selected randomly from the paired-end data. Subsampling was done using Subsample.jl, a Julia package distributed together with Patchwork. The resulting data sets have coverages of 1×, 3×, 5×, 10×, 20×, and 40×. For each of the data sets, we produced a short-read-only de novo assembly, as ALiBaSeq is designed for assembly data, while aTRAM 2 requires read data and Patchwork can process both. We used the assembler SPAdes v.3.15.3 (Nurk et al. 2013), with a K-mer size of 33, and the quality of the assembly was assessed using QUAST v.5.0.2 (Gurevich et al. 2013). We aligned the D. gyrociliatus reads and assemblies against the same set of H. robusta USCOs mentioned before.
ALiBaSeq v.1.2 was run with the D. gyrociliatus assemblies described above. The program requires BLAST; the version used here was 2.11.0. The program builds a database from the D. gyrociliatus sequences and searches this database with the H. robusta sequences before stitching the hits together. We set the parameters according to the guide for a protein-based search without reciprocal search, as explained in their documentation on GitHub (see the README file): -x a [extract all hits and join into (super)contigs] -f S [single alignment table (TBLASTN result file)] -e 1e-10 [e-value cutoff for further processing of TBLASTN hits] -c 1 [extract single best (super)contig] --amalgamate-hits [scoring scheme for (super)contigs] --is [enable contig stitching] --ac aa-tdna [search protein “baits” (H. robusta USCOs) against tDNA “target” database (D. gyrociliatus reads)].
We ran aTRAM v.2.4.3 with the sampled D. gyrociliatus read data sets. The program further requires BLAST, as well as a de novo assembler, and exonerate. We used BLAST v.2.11.0 and exonerate v.2.2.0 (Slater and Birney 2005) and employed SPAdes v.3.15.3 for the assembly step. The full aTRAM 2 pipeline consists of 3 consecutive steps: firstly, the preparation of a database from the D. gyrociliatus reads, secondly, the assembly of different loci, and lastly, a reference-guided stitching process. The parameter settings for the core module of aTRAM 2 as well as the stitcher were as follows: --evalue 1e-10 --file-filter “*.filtered_contigs.fasta” --overlap N.
Patchwork v.0.5.1 was run with both the sampled D. gyrociliatus read data sets and the assemblies we produced for these sampled read data. We ran the uncompiled program using Julia v.1.8.2 as well as the compiled version on each data set in order to perform runtime comparisons. Patchwork achieves all its objectives in a 1-step procedure, i.e. it can be called with a single command, unlike ALiBaSeq and aTRAM 2. The program builds a
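The subsampling step in “Program Comparison” amounts to drawing read pairs at random until a target coverage is reached. The sketch below is a hypothetical Python illustration and not the Subsample.jl implementation: the function names, the coverage formula (coverage × genome size / (2 × read length)), and the fixed random seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def pairs_for_coverage(coverage: float, genome_size: int, read_length: int) -> int:
    """Number of read pairs needed for a target coverage (2 reads per pair)."""
    return round(coverage * genome_size / (2 * read_length))


def subsample_pairs(forward: Sequence[str], reverse: Sequence[str],
                    n_pairs: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Pick the same random records from both files so that mates stay paired."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse reads must have equal counts")
    if n_pairs > len(forward):
        raise ValueError("cannot sample more pairs than are available")
    rng = random.Random(seed)             # seeded for reproducible subsets
    idx = sorted(rng.sample(range(len(forward)), n_pairs))
    return [forward[i] for i in idx], [reverse[i] for i in idx]
```

Sampling indices once and applying them to both mate files is what keeps corresponding read pairs together, as required for paired-end assembly.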
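The evaluation used for the reference-divergence benchmark (keep the best-scoring marker per reference sequence, then report percent identity and percent of the reference covered) can be sketched as below. The `MarkerHit` record and its field names are hypothetical; in practice the values would be parsed from the tabular output of the alignment search, and reference lengths would have to be supplied separately.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MarkerHit:
    marker: str        # recovered marker (query of the evaluation search)
    reference: str     # reference USCO the marker aligned to
    n_identical: int   # identical positions in the alignment
    n_aligned: int     # aligned (nongap) positions
    ref_start: int     # first covered reference position (inclusive)
    ref_end: int       # last covered reference position (inclusive)
    bitscore: float


def best_hit_per_reference(hits: Iterable[MarkerHit]) -> Dict[str, MarkerHit]:
    """Keep only the hit with the highest bit score for each reference."""
    best: Dict[str, MarkerHit] = {}
    for h in hits:
        if h.reference not in best or h.bitscore > best[h.reference].bitscore:
            best[h.reference] = h
    return best


def percent_identity(hit: MarkerHit) -> float:
    return 100 * hit.n_identical / hit.n_aligned


def percent_reference_covered(hit: MarkerHit, reference_length: int) -> float:
    return 100 * (hit.ref_end - hit.ref_start + 1) / reference_length
```

Reporting both numbers side by side matters because a short, near-identical fragment would otherwise look as good as a full-length marker.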
DIAMOND database from the H. robusta sequences and, after obtaining D. gyrociliatus hits, proceeds to stitch them together. All nondefault parameter settings for Patchwork were as described above. They were used for both read-based runs and assembly-based runs.
We ran the 3 programs with their respective parameter settings on the different D. gyrociliatus data sets against the H. robusta USCO set described above, which contains 957 sequences. The aTRAM 2 run for the 40× coverage read data set was ended prematurely because it had not terminated after 5 d. In a following step, the recovered markers produced by each program for each data set were evaluated with respect to completeness and accuracy of the resulting sequences by comparing them to the same set of 778 D. gyrociliatus USCOs mentioned above, again because only this subset could be recovered by the programs in this setup. ALiBaSeq and aTRAM 2 output DNA sequences that contain the ambiguous nucleotide N in all positions that could not be recovered during stitching. These Ns were removed for the subsequent evaluation steps because they distort the query coverage measure; the amount of a reference sequence covered by the recovered marker is artificially increased due to the uninformative inserted Ns.
Completeness and accuracy were measured jointly as percent identical aligned positions multiplied by the total amount of aligned, or recovered, positions (here called $p_{\mathrm{identical,cov}}$):

\[
p_{\mathrm{identical,cov}} = \frac{n_{\mathrm{match}}}{n_{\mathrm{aligned}}} \cdot \mathrm{cov},
\qquad
\mathrm{cov} = \frac{\sum_{s_{\mathrm{recovered}}} \mathrm{length}(s_{\mathrm{recovered}})}{\sum_{s_{\mathrm{USCOs}}} \mathrm{length}(s_{\mathrm{USCOs}})},
\]

$n_{\mathrm{match}}$ being the total number of exact matches in all alignments between recovered markers and reference USCOs and $n_{\mathrm{aligned}}$ the total number of aligned, i.e. nongap, positions. The coverage $\mathrm{cov}$ was computed as the ratio of the total lengths of all recovered markers $s_{\mathrm{recovered}}$ and all reference USCOs $s_{\mathrm{USCOs}}$. We chose to combine the measures for accuracy of the recovered markers, i.e. percent identical out of all aligned positions, and completeness or query coverage, i.e. percent recovered positions, in order to avoid a distorted outcome. For example, a program might recover only a very small number of markers but these with high percent identity, such that using only the percent identity measure would have resulted in an overestimation of the program's performance.

Patchwork in a Phylogenomic Context
We retrieved the raw reads from the NCBI Sequence Read Archive (SRA accessions SRR5088465, SRR5088468, SRR5308129, SRR5308123, SRR5088469, SRR5088471, SRR5088472, SRR5088473, SRR1182279, SRR5308136, SRR5308138, SRR5088474, SRR5088475, SRR5308112, and SRR5088466) using prefetch, vdb-validate, and fasterq-dump (with the flag --split-spot), all from the NCBI SRA Toolkit (Leinonen et al. 2011). We ran Patchwork v.0.5.1 with each of the specimens as query input and a set of 957 near-USCOs from the leech H. robusta as reference sequences (see Table 2 for parameter settings). A multiple sequence alignment (MSA) was constructed for each of these 957 loci using MAFFT (Katoh and Standley 2013) with the options --globalpair --ep 0.123. The resulting alignments were trimmed with trimAl (Capella-Gutiérrez et al. 2009), removing all positions with more than 90% gaps but retaining at least 60% of each alignment (options -gt 0.9 and -cons 60, respectively). We used FASconCAT-G (Kück and Longo 2014) to concatenate the trimmed alignments into a supermatrix. This supermatrix was then input into IQ-TREE 2 (Minh et al. 2020), alongside its corresponding gene partition file, to reconstruct the phylogeny using the maximum likelihood (ML) approach. We ran IQ-TREE 2 with extended model selection and tree inference, calculating 1,000 replicates for the ultrafast bootstrap (command line options -m MFP and -B 1000, respectively).

Acknowledgments
This work used the Scientific Compute Cluster at GWDG, the joint data center of Max Planck Society for the Advancement of Science (MPG) and University of Göttingen. We acknowledge support by the Open Access Publication Funds of the Göttingen University.

Funding
This work was supported by the German Research Foundation (DFG) BL787/8-1.

Data Availability
The data underlying this article are available via GitHub at https://github.com/animal-evolution-and-biodiversity/benchmarking-patchwork. Patchwork is distributed under the GPLv3 license via GitHub at https://github.com/fethalen/patchwork.

Literature Cited
Allen JM, Boyd B, Nguyen NP, Vachaspati P, Warnow T, Huang DI, Grady PGS, Bell KC, Cronk QCB, Mugisha L, et al. Phylogenomics from whole genome sequences using aTRAM. Syst Biol. 2017:66(5):786–798. https://doi.org/10.1093/sysbio/syw105.
Allen JM, LaFrance R, Folk RA, Johnson KP, Guralnick RP. aTRAM 2.0: an improved, flexible locus assembler for NGS data. Evol Bioinform. 2018:14:1176934318774546. https://doi.org/10.1177/1176934318774546.
Bezanson J, Edelman A, Karpinski S, Shah VB. Julia: a fresh approach to numerical computing. SIAM Rev. 2017:59(1):65–98. https://doi.org/10.1137/141000671.
Bleidorn C. Phylogenomics. An introduction. Cham: Springer International Publishing; 2017.
Blom MPK. Opportunities and challenges for high-quality biodiversity tissue archives in the age of long-read sequencing. Mol Ecol. 2021:30(23):5935–5948. https://doi.org/10.1111/mec.15909.
Bragg JG, Potter S, Bi K, Moritz C. Exon capture phylogenomics: efficacy across scales of divergence. Mol Ecol Res. 2016:16(5):1059–1068. https://doi.org/10.1111/1755-0998.12449.
Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Meth. 2021:18(4):366–368. https://doi.org/10.1038/s41592-021-01101-x.
Call E, Mayer C, Twort V, Dietz L, Wahlberg N. Museomics: phylogenomics of the moth family Epicopeiidae (Lepidoptera) using target enrichment. Insect Syst Divers. 2021:5(2):6. https://doi.org/10.1093/isd/ixaa021.
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 2009:25(15):1972–1973. https://doi.org/10.1093/bioinformatics/btp348.
Dahn HA, Mountcastle J, Balacco J, Winkler S, Bista I, Schmitt AD, Pettersson OV, Formenti G, Oliver K, Smith M, et al. Benchmarking ultra-high molecular weight DNA preservation methods for long-read and long-range sequencing. GigaScience 2022:11:giac068. https://doi.org/10.1093/gigascience/giac068.
da Veiga Leprevost F, Grüning BA, Alves Aflitos S, Röst HL, Uszkoreit J, Barsnes H, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 2017:33(16):2580–2582. https://doi.org/10.1093/bioinformatics/btx192.
Dietz L, Eberle J, Mayer C, Kukowka S, Bohacz C, Baur H, Espeland M, Huber BA, Hutter C, Mengual X, et al. Standardized nuclear markers improve and homogenize species delimitation in Metazoa. Methods Ecol Evol. 2023:14(2):543–555. https://doi.org/10.1111/2041-210X.14041.
Dodsworth S. Genome skimming for next-generation biodiversity analysis. Trends Plant Sci. 2015:20(9):525–527. https://doi.org/10.1016/j.tplants.2015.06.012.
Dylus D, Altenhoff A, Majidian S, Sedlazeck FJ, Dessimoz C. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol. 2023. https://doi.org/10.1038/s41587-023-01753-4.
Erséus C, Williams BW, Horn KM, Halanych KM, Santos SR, James SW, Des Creuzé Châtelliers M, Anderson FE. Phylogenomic analyses reveal a Palaeozoic radiation and support a freshwater origin for clitellate annelids. Zool Scr. 2020:49(5):614–640. https://doi.org/10.1111/zsc.12426.
Formenti G, Theissinger K, Fernandes C, Bista I, Bombarely A, Bleidorn C, Ciofi C, Crottini A, Godoy JA, Höglund J, et al. The era of reference genomes in conservation genomics. Trends Ecol Evol. 2022:37(3):197–202. https://doi.org/10.1016/j.tree.2021.11.008.
Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011:39(4):e23. https://doi.org/10.1093/nar/gkq1212.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 2013:29(8):1072–1075. https://doi.org/10.1093/bioinformatics/btt086.
Heath T, Hedtke SM, Hillis DM. Taxon sampling and the accuracy of phylogenetic analyses. J Syst Evol. 2008:46:239–257. 10.3724/SP.J.1002.2008.08016.
Henikoff JG, Henikoff S. Blocks database and its applications. Meth Enzymol. 1996:266:88–105. https://doi.org/10.1016/S0076-6879(96)66008-X.
Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing technologies: an overview. Human Immunol. 2021:82(11):801–811. https://doi.org/10.1016/j.humimm.2021.02.012.
Jin J-J, Yu W-B, Yang J-B, Song Y, dePamphilis CW, Yi T-S, Li D-Z. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020:21(1):241. https://doi.org/10.1186/s13059-020-02154-5.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013:30(4):772–780. https://doi.org/10.1093/molbev/mst010.
Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 2018:19(1):189. https://doi.org/10.1186/s12859-018-2203-5.
Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 2016:44(9):e89. https://doi.org/10.1093/nar/gkw092.
Knyshov A, Gordon ERL, Weirauch C. New alignment-based sequence extraction software (ALiBaSeq) and its utility for deep level phylogenetics. PeerJ 2021:9:e11019. https://doi.org/10.7717/peerj.11019.
Kück P, Longo GC. FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. Frontiers Zool. 2014:11(1):81. https://doi.org/10.1186/s12983-014-0081-x.
Kurtzer GM, Sochat V, Bauer MW. Singularity: scientific containers for mobility of compute. PLoS One 2017:12(5):e0177459. https://doi.org/10.1371/journal.pone.0177459.
Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011:39(Database):D19–D21. https://doi.org/10.1093/nar/gkq1019.
Lemmon EM, Lemmon AR. High-throughput genomic data in systematics and phylogenetics. Annu Rev Ecol Evol Syst. 2013:44(1):99–121. https://doi.org/10.1146/annurev-ecolsys-110512-135822.
Liu B-B, Liu B-B, Ma Z-Y, Ren C, Hodel RGJ. Capturing single-copy nuclear genes, organellar genomes, and nuclear ribosomal DNA from deep genome skimming data for plant phylogenetics: a case study in Vitaceae. Appl Plant Sci. 2021:11(4):e11537. https://doi.org/10.1111/jse.12806.
Lozano-Fernandez J. A practical guide to design and assess a phylogenomic study. Genome Biol Evol. 2022:14(9):evac129. https://doi.org/10.1093/gbe/evac129.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021:38(10):4647–4654. https://doi.org/10.1093/molbev/msab199.
Martín-Durán JM, Vellutini BC, Marlétaz F, Cetrangolo V, Cvetesic N, Thiel D, Henriet S, Grau-Bové X, Carrillo-Baltodano AM, Gu W, et al. Conservative route to genome compaction in a miniature annelid. Nat Ecol Evol. 2021:5(2):231–242. https://doi.org/10.1038/s41559-020-01327-6.
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT. Applications of next-generation sequencing to phylogeography and phylogenetics. Mol Phylogenet Evol. 2013:66(2):526–538. https://doi.org/10.1016/j.ympev.2011.12.007.
Merkel D. Docker: lightweight linux containers for consistent development and deployment. Linux J. 2014:239(2):2. 10.5555/2600239.2600241.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020:37(5):1530–1534. https://doi.org/10.1093/molbev/msaa015.
Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A, Prjibelski AD, Pyshkin A, Sirotkin A, Sirotkin Y, et al. Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. J Comput Biol. 2013:20(10):714–737. https://doi.org/10.1089/cmb.2013.0084.
Philippe H, de Vienne DM, Ranwez V, Roure B, Baurain D. Pitfalls in supermatrix phylogenomics. Eur J Taxon. 2017:283:1–25. 10.5852/ejt.2017.283.
Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol. 2018:35(10):2582–2584. https://doi.org/10.1093/molbev/msy159.
Raxworthy CJ, Smith BT. Mining museums for historical DNA: advances and challenges in museomics. Trends Ecol Evol. 2021:36(11):1049–1060. https://doi.org/10.1016/j.tree.2021.07.009.
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 2021:592(7856):737–746. https://doi.org/10.1038/s41586-021-03451-0.
Richter S, Schwarz F, Hering L, Böggemann M, Bleidorn C. The utility of genome skimming for phylogenomic analyses as demonstrated for glycerid relationships (Annelida, Glyceridae). Genome Biol Evol. 2015:7(12):3443–3462. https://doi.org/10.1093/gbe/evv224.
Rogozin IB, Sverdlov AV, Babenko VN, Koonin EV. Analysis of evolution of exon-intron structure of eukaryotic genes. Brief Bioinformatics 2005:6(2):118–134. https://doi.org/10.1093/bib/6.2.118.
Sahbou A-E, Iraqi D, Mentag R, Khayi S. BuscoPhylo: a webserver for Busco-based phylogenomic analysis for non-specialists. Sci Rep. 2022:12(1):17352. https://doi.org/10.1038/s41598-022-22461-0.
Salzberg SL. Next-generation genome annotation: we still struggle to get it right. Genome Biol. 2019:20(1):92. https://doi.org/10.1186/s13059-019-1715-2.
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012:22(3):557–567. https://doi.org/10.1101/gr.131383.111.
Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019:1962:227–245. https://doi.org/10.1007/978-1-4939-9173-0_14.
Simmons SK, Lithwick-Yanai G, Adiconis X, Oberstrass F, Iremadze N, Geiger-Schuller K, Thakore PI, Frangieh CJ, Barad O, Almogy G, et al. Mostly natural sequencing-by-synthesis for scRNA-seq using Ultima sequencing. Nat Biotechnol. 2023:41(2):204–211. https://doi.org/10.1038/s41587-022-01452-6.
Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005:6(1):31. https://doi.org/10.1186/1471-2105-6-31.
Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019:20(11):631–656. https://doi.org/10.1038/s41576-019-0150-2.
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phylogenomics era. Nat Rev Genet. 2023:24(12):834–850. https://doi.org/10.1038/s41576-023-00620-x.
Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol. 2018:35(3):543–548. https://doi.org/10.1093/molbev/msx319.
Whelan S, Irisarri I, Burki F. PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics 2018:34(22):3929–3930. https://doi.org/10.1093/bioinformatics/bty448.
Zhang F, Ding Y, Zhu C-D, Zhou X, Orr MC, Scheu S, Luan Y-X. Phylogenomics from low-coverage whole-genome sequencing. Methods Ecol Evol. 2019:10(4):507–517. https://doi.org/10.1111/2041-210X.13145.

Associate editor: Dennis Lavrov