SQMtools: automated processing and visual analysis of 'omics data with R and anvi'o
* Correspondence: fpusan@gmail.com
† Fernando Puente-Sánchez and Natalia García-García contributed equally to this work.
Systems Biology Department, Centro Nacional de Biotecnología (CNB-CSIC), C/ Darwin n° 3,
Campus de Cantoblanco, 28049 Madrid, Spain
Abstract
Background: The dramatic decrease in sequencing costs over the last decade has
boosted the adoption of high-throughput sequencing applications as a standard
tool for the analysis of environmental microbial communities. Nowadays even small
research groups can easily obtain raw sequencing data. After that, however,
non-specialists are faced with the double challenge of choosing among an
ever-increasing array of analysis methodologies, and navigating the vast amounts of
results returned by these approaches.
Results: Here we present a workflow that relies on the SqueezeMeta software for
the automated processing of raw reads into annotated contigs and reconstructed
genomes (bins). A set of custom scripts seamlessly integrates the output into the
anvi’o analysis platform, allowing filtering and visual exploration of the results.
Furthermore, we provide a software package with utility functions to expose the
SqueezeMeta results to the R analysis environment.
Conclusions: Altogether, our workflow allows non-expert users to go from raw
sequencing reads to custom plots with only a few powerful, flexible and well-
documented commands.
Keywords: Metagenomics, Metatranscriptomics, Microbial ecology, Automatic,
Pipeline, Visualization
Background
The advent of high-throughput sequencing technologies made it possible to directly
sequence the different microbial genomes present in a given sample (metagenomics),
as well as to measure the expression profiles of those genomes (metatranscriptomics),
in a culture-independent manner. This revolutionized the field of microbial ecology,
as it made it possible to profile the taxonomic composition and functional
potential of microbial communities [1] and eventually to recover full genomes from
environmental bacteria that in some cases could not be studied by other means [2].
Current sequencing costs (as of the end of 2019) are around $0.01 per megabase of
DNA [3], which has led to the popularization of metagenomics as a powerful and
affordable tool to study microbial communities. In parallel, access to the required
high-performance computing infrastructures is becoming more common, either via
© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to
the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The
images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise
in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit [Link] The Creative Commons Public Domain
Dedication waiver ([Link] applies to the data made available in this article, unless
otherwise stated in a credit line to the data.
Puente-Sánchez et al. BMC Bioinformatics (2020) 21:358 Page 2 of 11
package for differential abundance analysis [8], the vegan package for multivariate
analysis [9], or the caret package for machine learning [10].
Results
To illustrate the utility and features of SQMtools, we used SqueezeMeta to analyse a
set of 16 human gut metagenomes from two populations with different lifestyles:
urban-living Italian adults and hunter-gatherers from the Hadza people
(Supplementary Table S1; [11]). We were interested in identifying taxonomic and functional
differences between the microbiomes of both populations. To that end, we loaded the
data into R using the SQMtools package, and analysed it using the workflow illustrated
in Fig. 1 and detailed in the Supplementary Material (which includes all the code
necessary to generate the figures in this article, and other examples showing additional
features). We also show how to expose the data to other R packages such as vegan in
order to perform more advanced analyses. Finally, we use anvi’o to visually explore
subsets of the metagenomes, as well as the bins obtained by the SqueezeMeta pipeline.
Briefly, the SQMtools package contains the following types of functions:
Fig. 1 Main workflow of the SQMtools pipeline. Raw reads are processed automatically by SqueezeMeta,
which integrates assembly, ORF prediction, annotation and binning. The results can then be easily filtered
and explored either with R (by creating a SQM object) or anvi’o (by creating an anvi’o database) in order to
produce custom plots with the information of interest. Arrows correspond to function calls (e.g. going from
raw reads to a krona chart would involve three function calls). The workflow is exemplified in detail in the
Supplementary Information
Subset functions: generate a new SQM object containing a subset of functions, taxa,
or bins of the parent SQM object.
Combine functions: generate a new SQM object aggregating the data from two or
more SQM objects.
Plot functions: make different R plots from the data (taxonomy, functions …)
contained in a SQM object.
Export functions: create files (krona charts, KEGG pathway maps, tables …) from
the data contained in a SQM object.
Fig. 2 General overview of the example dataset. a Genus-level taxonomic composition of the Hadza (blue)
and Italian (orange) samples, as depicted by the SQMtools package. b, c Non-metric multidimensional
scaling (NMDS) showing the clustering of Hadza vs Italian samples based on their (b) genus-level taxonomy
and (c) KEGG orthology functional profiling. Differences between clusters were significant (Permutational
Multivariate Analysis of Variance p < 0.005)
Fig. 3 Detailed overview of the biosynthesis of aromatic amino acids. a KEGG pathway map for
Phenylalanine, Tyrosine and Tryptophan biosynthesis. Reactions are coloured based on the log2 fold-
change of their average copy numbers in Italian (orange) vs Hadza (blue) samples. Asterisks indicate
reactions performed by KEGG orthologs that were significantly enriched in Hadza (blue) or Italian (orange)
samples (DESeq2 p adj. < 0.05). Black asterisks indicate reactions in which a different KEGG ortholog was
enriched in Hadza and Italian samples (note that the same reaction can be performed by more than one
KEGG ortholog). Asterisks within parentheses indicate reactions in which some, but not all KEGG orthologs
were significantly enriched in a group. b anvi’o plot showing the taxonomy and distribution across samples
of the contigs containing genes related to the biosynthesis of aromatic amino acids. c Heatmap depicting
the TPM across samples of selected KEGG orthologs related to the biosynthesis of aromatic amino acids.
Asterisks indicate significant enrichment in Hadza or Italian samples as described above
Fig. 4 Taxonomic distribution of selected functions. Barplots showing the weight (in percentage of total
reads) and taxonomic distribution of six broad functional categories across the different samples
Discussion
There are several software packages for the analysis and visualization of metagenomic
data, with MEGAN [12] and STAMP [13] being among the most popular. Our workflow
differs from them in that 1) it covers all the steps of the analysis, from raw reads
to custom figures and statistics, and 2) results can be effortlessly loaded into the R
analysis environment, where they can be explored with a set of convenience functions
included in the SQMtools package, but also directly analysed with other tools from the R
ecosystem.
SQMtools provides different ways to look at the abundance of ORFs, contigs, bins, taxa
and functions in the metagenome (Table 1). In addition to read and base counts, which
have a straightforward interpretation, we also provide TPM and copy numbers in order
to take into account both sequencing depth and feature length. The TPM (transcripts
per million) metric was introduced by Wagner et al. [14] as an improved way to
account for gene length and sequencing depth in transcriptomic experiments: we find it
equally useful in metagenomics. The TPM of a feature (be it a transcript, a gene or a
functional category) is the number of times that we would find that feature when
randomly sampling 1 million features, given the abundances of the different
features in our sample. As an alternative to TPM, we also provide copy numbers,
calculated as the ratio between the coverage of the function of interest and the coverage
of the RecA/RadA recombinase universal single-copy gene. By default, the
subsetTax and subsetBins functions will rescale TPM and copy numbers so that the
resulting values relate to the taxa present in the subset, rather than to the whole
metagenome (e.g. copy number of a function per genome of the selected taxa).
Details on the interpretation of those values in the different subsets are shown in
Table 1.
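The effect of rescaling TPM within a taxonomic subset can be illustrated with a minimal Python sketch. It is not the SQMtools implementation (SQMtools is an R package); the function name `tpm`, the taxon and function labels, and the RPK values are all hypothetical, and the sketch assumes that rescaling amounts to recomputing TPM over the features retained in the subset.

```python
def tpm(rpk):
    """Scale a dict of feature -> reads per kilobase so that the values sum to one million."""
    total = sum(rpk.values())
    return {feature: 1e6 * value / total for feature, value in rpk.items()}


# Hypothetical RPK values for (taxon, function) pairs in a single sample.
rpk_by_taxon = {
    ("Prevotella", "K00001"): 30.0,
    ("Prevotella", "K00002"): 10.0,
    ("Bacteroides", "K00001"): 60.0,
}

# TPM relative to the whole metagenome.
tpm_full = tpm(rpk_by_taxon)

# Keep a single taxon, then rescale so that values relate to that taxon only
# (assumption: rescaling is a renormalisation over the retained features).
subset = {key: value for key, value in rpk_by_taxon.items() if key[0] == "Prevotella"}
tpm_rescaled = tpm(subset)
```

In this toy example the function K00001 in Prevotella accounts for 300,000 TPM in the whole metagenome but 750,000 TPM once values are rescaled to the Prevotella subset, which is the distinction summarized in Table 1.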
Table 1 Abundance metrics included in SQMtools and their interpretation in the full metagenome
and in taxonomic and functional subsets

| Scaling method | Abund | TPM | Copy number | Percent |
|---|---|---|---|---|
| Applies to | Functions / Taxonomy | Functions | Functions | Taxonomy |
| Calculation | Reads per function or taxon | 10^6 × reads per kilobase of each function / total reads per kilobase of all functions | Coverage of each function / Coverage of RecA | 100 × reads per taxon / Total reads |
| Full data: interpretation in the full data | Reads from each function or taxon in the whole metagenome | Genes from each function per million genes in the whole metagenome | Average copies of each function per genome in the whole metagenome | Percentage of each taxonomic group in the whole metagenome |
| subsetTax / subsetBins: rescaled in the new subset? | Not applicable | Yes, by taxonomy | Yes, by taxonomy | No |
| subsetTax / subsetBins: interpretation in the subset | Reads from each function or taxon in the selected taxon or bin | Genes from each function per million genes in the selected taxon or bin | Average copies of the function per genome from the selected taxon or bin | Percentage of each of the selected taxonomic groups in the whole metagenome |
| subsetFun: rescaled in the new subset? | Not applicable | No | No | No |
| subsetFun: interpretation in the subset | Reads from each function or taxon in the subset | Genes from each function in the subset per million genes in the metagenome | Average copies of the functions in the subset per genome in the metagenome | Percentage of each taxonomic group in the whole metagenome, counting only the reads assigned to the functions in the subset |
Note that TPMs are based on proportions calculated over an arbitrary count total (in
this case, the number of reads sequenced per sample) and as such are compositional
(see [15] for a detailed discussion on the problem and its implications). For
visualization/exploration purposes, we often find it useful to compare TPMs from
different samples (e.g. in Fig. 3c) but in general statistical tests should be performed using
raw reads and composition-aware methods [15]. Copy numbers, on the other hand, are
ratios and should be less affected by compositionality issues [16]. We nonetheless still
recommend caution when performing statistical tests. The debate over the best way to
analyse microbiome data is still ongoing (see e.g. two alternative avenues in [17, 18])
and older methods such as DESeq2 might still outperform composition-aware methods
for some applications (see Methods). In SQMtools, we provide common normalizations
that can facilitate data exploration as well as the raw abundance data required to
perform more advanced analyses, but we do not promote any particular statistical method
as we expect the field to continue evolving.
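As an illustration of what a composition-aware transformation of raw counts can look like, the following Python sketch applies a centred log-ratio (CLR) transform, one of the approaches discussed in the compositional data literature [15]. It is offered only as an example and is not a method provided or recommended by SQMtools; the function name `clr`, the pseudocount of 0.5 used to handle zero counts, and the read counts are assumptions made for this sketch.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform of the raw read counts of one sample.

    Each count is log-transformed and the mean of the logs (the log of the
    geometric mean) is subtracted, so values are expressed relative to the
    sample itself rather than to an arbitrary sequencing total.
    """
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical raw read counts for four features in one sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)

# Without zeros (and without a pseudocount) the transform does not depend on
# sequencing depth: multiplying every count by ten gives the same values.
shallow = clr([10, 20, 30], pseudocount=0.0)
deep = clr([100, 200, 300], pseudocount=0.0)
```

The choice of pseudocount is itself a modelling decision that affects the results for low-abundance features, which is one reason why the caution recommended above still applies.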
Conclusions
While many tools for metagenomics are available [19], to our knowledge we are the
first to provide a robust, comprehensive and well-curated software suite covering all
the steps of the analysis. This ensures full compatibility between the different included
tools, and facilitates installation (a single instruction if using the conda package
manager). Integration with R and anvi’o is effortless, relieving users of the burden of
parsing the complex output files that are common to metagenomic studies. Overall,
our workflow allows non-expert users to go from raw sequencing reads to custom plots
with only a few powerful, flexible and well-documented commands, while also
facilitating the incorporation of more advanced statistical methods into their analyses.
Methods
Software implementation
The SqueezeMeta to anvi’o interface is implemented in python3, and consists of two
scripts. Firstly, the [Link] script parses a whole SqueezeMeta project
(annotated ORFs, contigs and bins) into anvi’o [6], generating databases that can be directly
used with the anvi-interactive or anvi-refine tools from the anvi’o suite. Secondly, the
[Link] script integrates a search engine for prefiltering the data before launching
the anvi’o interactive session. The SQM to R interface is implemented in the SQMtools
R package, which contains several utility functions. The main components of the
package are described in the next four sections. All analyses were performed using
SqueezeMeta v1.1.1 and SQMtools v0.4.5.
1. Obtain the RPK (reads per kilobase) of each feature, by dividing the number of
mapped reads to that feature in that sample by the feature length in kilobases.
2. Calculate the TPM of each feature by dividing its RPK by the sum of RPKs of all
the features in that sample, and multiplying by a million.
For the sake of consistency with previous works, we maintain the nomenclature
“TPM”, even when we use it to measure the abundances of features other than transcripts.
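The two steps above can be sketched in Python as follows. This is a minimal illustration rather than the SQMtools code (which is written in R); the function names, feature identifiers, read counts and lengths are hypothetical.

```python
def reads_per_kilobase(mapped_reads, length_bp):
    """Step 1: divide the mapped reads by the feature length expressed in kilobases."""
    return mapped_reads / (length_bp / 1000.0)


def tpm_from_counts(read_counts, lengths_bp):
    """Step 2: divide each RPK by the sum of all RPKs in the sample, times one million."""
    rpk = {
        feature: reads_per_kilobase(read_counts[feature], lengths_bp[feature])
        for feature in read_counts
    }
    total_rpk = sum(rpk.values())
    return {feature: 1e6 * value / total_rpk for feature, value in rpk.items()}


# Hypothetical mapped read counts and lengths (in base pairs) for one sample.
read_counts = {"orf_1": 100, "orf_2": 100, "orf_3": 300}
lengths_bp = {"orf_1": 1000, "orf_2": 2000, "orf_3": 3000}

tpm_values = tpm_from_counts(read_counts, lengths_bp)
```

Here orf_1 and orf_3 end up with the same TPM (400,000) even though orf_3 recruited three times as many reads, because it is also three times as long; orf_2 receives 200,000.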
For functional categories, we provide the following abundance metrics: mapped read
counts, mapped base counts, TPM and copy number. The TPM of a given functional
category is calculated as described above, with the RPK obtained by aggregating the
reads mapping to genes from that category and dividing them by the average length of
the genes from that category in the assembly. Copy number is calculated by dividing the
aggregated coverage of the genes from that functional category by the coverage of
COG0468, which corresponds to the RecA/RadA recombinase, a universal single-copy
gene. An alternative way of calculating copy numbers using the median coverage of 15
different Universal Single Copy Genes is also available and described in the USiCGs
section of the SQMtools manual.
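The copy number calculation can be sketched in Python as a simple ratio of coverages. This is an illustration under stated assumptions, not the SQMtools implementation: the function name `copy_number`, the KEGG identifiers and all coverage values are hypothetical, coverage is taken here to be an already computed per-function value, and the list of single-copy gene coverages holds three invented values instead of the 15 genes mentioned above.

```python
from statistics import median


def copy_number(function_coverage, reference_coverage):
    """Coverage of a function relative to the coverage of a single-copy reference."""
    return function_coverage / reference_coverage


# Hypothetical aggregated coverages in one sample; COG0468 is the RecA/RadA recombinase.
coverage = {"COG0468": 20.0, "K00001": 50.0, "K00002": 10.0}

# Default approach: normalise by the coverage of RecA/RadA.
copies = {
    function: copy_number(value, coverage["COG0468"])
    for function, value in coverage.items()
    if function != "COG0468"
}

# Alternative approach: normalise by the median coverage of several single-copy genes.
usicg_coverages = [18.0, 20.0, 25.0]  # hypothetical values
copies_median = copy_number(coverage["K00001"], median(usicg_coverages))
```

With these values K00001 is present in 2.5 copies per genome on average and K00002 in 0.5 copies. Using the median over several genes makes the reference less sensitive to an atypical coverage of any single one of them.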
Figure generation
Heatmaps and barplots are generated using ggplot2. We also use KronaTools [23] for
generating Krona charts, and pathview [24] for generating annotated KEGG pathway
maps. Anvi’o plots were generated with anvi’o version 6.1 [6].
Computational resources
All tests were run on a server with a 24-core Intel Xeon E5–2620 v2 CPU at 2.10 GHz /
core and 256 Gb of RAM, using 12 processors. Under these conditions, SqueezeMeta
(including assembly, annotation and binning of the 16 samples) ran in 113 h (ca. four
and a half days). Loading results into SQMtools took only a few minutes. Loading results
into anvi’o took an additional 130 h (ca. five and a half days).
Alternatively, we were also able to run SqueezeMeta on the 16 samples on a
laptop with an 8-core Intel Core i7 8750H CPU at 2.20 GHz / core, 16 Gb of RAM,
and an extra 16 Gb of swap memory on an SSD. The process took 80 h. Results were
loaded into SQMtools as described above, but we were unable to load them into anvi’o as
the system ran out of memory, causing the anvi-profile script to eventually stall.
Supplementary information
Supplementary information accompanies this paper at [Link]
Additional file 1.
Abbreviations
MAG: Metagenome-Assembled Genome; KEGG: Kyoto Encyclopedia of Genes and Genomes; RPK: Reads Per Kilobase;
TPM: Transcripts Per Million; COG: Clusters of Orthologous Groups of proteins; PFAM: Protein FAMilies database;
Gb: Gigabytes; GHz: Gigahertz
Acknowledgements
The authors thank Giuseppe D’Auria (FISABIO) for the initial contribution of the SQMtools – KronaTools interface. We
acknowledge support of the publication fee by the CSIC Open Access Publication Support Initiative through its Unit of
Information Resources for Research (URICI).
Authors’ contributions
FP-S and JT devised the package. FP-S designed and wrote the core SQMtools functions, FP-S and NG-G wrote the
code involved in figure generation. All authors contributed to the SQM to anvi’o interface. FP-S and NG-G wrote the
manuscript. The authors discussed the manuscript and approved the final version.
Funding
Computational resources were provided by the Spanish Ministry of Economy and Competitiveness grants
CTM2016–80095-C2–1-R and PID2019-110011RB-C31. Study design, data analysis and data interpretation were supported by
grants IJC2018–035180-I (awarded to F.P-S) and SEV-2013-0347-17-2 (Severo Ochoa program, awarded to N.G-G) from
the Spanish Ministry of Science and Innovation. Publication fees were covered by grant IJC2018-035180-I and the CSIC
Open Access Publication Support Initiative through its Unit of Information Resources for Research (URICI).
Competing interests
The authors declare that they have no competing interests.
References
1. Eisen J. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes.
PLoS Biol. 2007;5:e82.
2. Pedrós-Alió C, Acinas SG, Logares R, Massana R. Marine microbial diversity as seen by high throughput sequencing.
Hoboken: Wiley; 2018. p. 47–98.
3. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.
[Link]/sequencingcostsdata. Accessed 10/07/2020.
4. Attwood TK, Blackford S, Brazas MD, Davies A, Schneider MV. A global perspective on evolving bioinformatics and data
science training needs. Brief Bioinform. 2017;20:398–404.
5. Tamames J, Puente-Sanchez F. SqueezeMeta, a highly portable, fully automatic metagenomic analysis pipeline. Front
Microbiol. 2019;9:3349.
6. Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, et al. Anvi’o: an advanced analysis and visualization
platform for ‘omics data. PeerJ. 2015;3:e1319.
7. R Core Team. R: a language and environment for statistical computing; 2013.
8. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
Genome Biol. 2014;15:550.
9. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D. Vegan: community ecology package, R package
version 2.5–6; 2007.
10. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28:1–26.
11. Rampelli S, Schnorr SL, Consolandi C, Turroni S, Severgnini M, Peano C, et al. Metagenome sequencing of the Hadza
hunter-gatherer gut microbiota. Curr Biol. 2015;25:1682–93.
12. Huson DH, Beier S, Flade I, Górska A, El-Hadidi M, Mitra S, et al. MEGAN community edition-interactive exploration and
analysis of large-scale microbiome sequencing data. PLoS Comput Biol. 2016;12:e1004957.
13. Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: statistical analysis of taxonomic and functional profiles.
Bioinformatics. 2014;30:3123–4.
14. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent
among samples. Theor Biosci. 2012;131:281–5.
15. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not
optional. Front Microbiol. 2017;8:2224.
16. Morton JT, Marotz C, Washburne A, Silverman J, Zaramela LS, Edlund A, et al. Establishing microbial composition
measurement standards with reference frames. Nat Commun. 2019;10:1–11.
17. Quinn TP, Erb I, Gloor G, Notredame C, Richardson MF, Crowley TM. A field guide for the compositional analysis of any-
omics data. GigaScience. 2019;8:giz107.
18. Cruz GNF, Christoff AP, De Oliveira LFV. Equivolumetric protocol generates library sizes proportional to total microbial
load in next-generation sequencing. bioRxiv; 2020.
19. Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief
Bioinform. 2019;20:1125–36.
20. Kanehisa M. The KEGG database. Silico Simul Biol Processes. 2002;247:91–103.
21. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version
includes eukaryotes. BMC Bioinformatics. 2003;4:41.
22. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, et al. The Pfam protein families database in 2019. Nucleic
Acids Res. 2019;47:D427–32.
23. Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a web browser. BMC Bioinformatics.
2011;12:385.
24. Luo W, Brouwer C. Pathview: an R/bioconductor package for pathway-based data integration and visualization.
Bioinformatics. 2013;29(14):1830–1.
25. Quinn TP, Crowley TM, Richardson MF. Benchmarking differential expression analysis tools for RNA-Seq: normalization-
based vs log-ratio transformation-based methods. BMC Bioinformatics. 2018;19:274.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.