Annotate Microbial Reference Data
Sample to Insight
Tutorial
We also include optional examples showing how to use these databases in downstream analyses
and, in an optional advanced section, we cover creating a Gene Database for virulence resistance
analysis.
Please refer to the CLC Microbial Genomics Module manual for detailed descriptions of
the tools mentioned in this tutorial.
General tips
• Tools can be launched from the Workbench Toolbox, as described in this tutorial, or
alternatively, click on the Launch button in the toolbar and use the Quick Launch tool
to find and launch tools.
• Within wizard windows you can use the Reset button to change settings to their default
values.
• You can access the in-built manual by clicking on Help buttons or by going to the
"Help" menu and choosing "Plugin Help" | "CLC Microbial Genomics Module Help".
Prerequisites
For this tutorial, you must be working with CLC Genomics Workbench 21.0 or
higher and have the CLC Microbial Genomics Module installed.
Please refer to the CLC Microbial Genomics Module manual for information about module
installation and licensing.
3. Create a new folder for the tutorial data, for example named "Annotate sequence list
tutorial".
You should now see the following elements in the tutorial folder:
• Resistance genes, containing the NCBI subset of resistance genes from the QMI-AR
Nucleotide Database.
• 16S amplicons, containing the SILVA 16S rRNA amplicon database downsampled to contain
50% of the original sequences.
5. Import the example paired-end reads by going to: File | Import | Illumina
You should now see a data element called Simulated_wastewater_reads (paired) in the
Navigation Area.
• Other column names can be recognized and checked for consistency by the software, either
by using the "Named columns" option or renaming the columns within the tool.
This is covered further in this tutorial, and full details can be found in the manual.
When creating custom databases, there are additional requirements for particular database
types. These are described in the examples in this tutorial.
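An annotation table can also be pre-checked outside the Workbench before it is imported. The sketch below is not part of the CLC software: it assumes the table has been exported to CSV text and that the sequence names are available as a plain list. The function name `check_annotation_table` and its parameters are hypothetical. It renames column headings (for example "Sequence Name" to "Name") and reports names that appear on only one side.

```python
import csv
import io


def check_annotation_table(csv_text, sequence_names, rename=None, key="Name"):
    """Rename columns, then compare the key column against sequence names.

    Returns (rows, missing, unused):
      rows    - list of dicts using the renamed column headings
      missing - sequence names with no row in the table
      unused  - table entries that match no sequence
    """
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        {rename.get(col, col): value for col, value in row.items()}
        for row in reader
    ]
    if rows and key not in rows[0]:
        raise ValueError(f"annotation table has no '{key}' column")
    table_names = {row[key] for row in rows}
    wanted = set(sequence_names)
    return rows, sorted(wanted - table_names), sorted(table_names - wanted)


# Hypothetical two-row table, for illustration only.
example = "Sequence Name,Taxonomy\nseq1,k__Bacteria\nseq2,k__Archaea\n"
rows, missing, unused = check_annotation_table(
    example, ["seq1", "seq3"], rename={"Sequence Name": "Name"}
)
print("sequences without annotations:", missing)
print("annotation rows without a sequence:", unused)
```

An empty `missing` list suggests that every sequence will receive annotations when the table is imported.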
2. Select "Microbial genomes" from the tutorial folder and then click on Next.
3. Check the "Download Taxonomy" option and uncheck other options as shown on figure 1.
Click on Next.
Color names and coloring
In the "Preview and mappings" area, the "Named columns" option
is enabled, so columns with headings the software recognizes are checked and the status of the
column contents is indicated using colors. Here, we see:
• The "Name", "Accession", "Linear", "Assembly ID", "FTP Path" and "TaxID" columns are
shaded green. This indicates these column names are known to the software and contain
information consistent with that expected for this column type.
• The "Size" and "Start of Sequence" columns are shaded red. This indicates the column
names are known to the software, but that there is a problem with the contents, and that
these values will not be imported. Size and sequence are not expected as annotations to
be applied this way. The software determines them directly from each sequence, so they
cannot be updated.
• The "Source" column is white. This means the column heading has no special meaning in
the software, and the values will be imported as standard annotations.
Please see the manual for full details about the coloring of columns.
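The three colors can be read as a small decision rule: an unrecognized heading gives white, a recognized heading with consistent contents gives green, and a recognized heading with problematic contents gives red. The sketch below illustrates that rule only. The `VALIDATORS` table and the checks in it are assumptions made for this example and do not reproduce the software's actual list of recognized headings or its consistency checks.

```python
# Hypothetical validators: the real set of recognized headings and the
# checks applied to them are defined by the software, not by this sketch.
VALIDATORS = {
    "Name": lambda value: bool(value.strip()),
    "TaxID": lambda value: value.isdigit(),
    # Size is derived from the sequence itself, so no value is importable.
    "Size": lambda value: False,
}


def column_status(heading, values):
    """Return 'green', 'red' or 'white' for one column of the preview."""
    check = VALIDATORS.get(heading)
    if check is None:
        return "white"  # heading has no special meaning: plain annotation
    return "green" if all(check(v) for v in values) else "red"


# Invented preview rows, for illustration only.
preview = {
    "Name": ["genome_1", "genome_2"],
    "TaxID": ["100", "200"],
    "Size": ["5000", "6000"],
    "Source": ["example", "example"],
}
for heading, values in preview.items():
    print(f"{heading}: {column_status(heading, values)}")
```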
5. Click on Next, keep the "Create report" option checked, and choose to save the output to a new
subfolder, for example named "Annotated Microbial Reference DB".
Depending on your hardware and internet connection, the tool may take several minutes to run.
8. Open the output sequence list from the "Annotated Microbial Reference DB" folder.
9. Switch to the Table view by clicking on the Table view icon in the bottom left corner,
as seen in figure 3, to see a table of the annotations present on each sequence.
Figure 3: Click on the Table view icon, highlighted by a red box here, to see a table of the
annotations on each sequence
10. Inspect the taxonomy column. The taxonomy matching the TaxID for each sequence was
downloaded from the NCBI and then added as a Taxonomy annotation to the sequence.
You now have an annotated sequence list which can be used as a microbial reference database.
In the following optional section, we will try using it to analyze the simulated wastewater reads.
Optional: Using the annotated sequence list as a database for taxonomic profiling
You can run taxonomic profiling on the simulated wastewater reads by using the sequence
list you just annotated to create a taxonomic profiling index. To do so, follow the steps below:
2. Select the "Microbial genomes (Metadata Annotated)" from the "Annotated Microbial
Reference DB" as input.
3. Choose to Save the index in the "Annotated Microbial Reference DB" folder and click Finish.
The tool will take several minutes to run. When it is done, you will have an index for
taxonomic profiling.
4. Next, we will use this index to analyse the taxonomies of the simulated wastewater sample.
From the Toolbox, choose:
Metagenomics | Taxonomic analysis | Taxonomic Profiling
6. Select the index created in the previous step. Leave the other settings on
default (figure 4). Click on Next and save the output to a new subfolder, for example named
"Taxonomic profile". The tool will now run and may take several minutes to complete.
7. Inspect the taxonomic profile in the output folder. In the Stacked visualisation,
aggregate and color features by Species (figure 5). You will see there are 8 different
species represented.
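Aggregating by Species amounts to summing the abundances of all features that share the same species-level name. The following is a minimal sketch of that idea, assuming each feature carries a semi-colon separated 7-level taxonomy string and a read count. The function name is hypothetical and the input rows are invented placeholders, not values from the tutorial data.

```python
from collections import defaultdict


def aggregate_by_level(features, level=6):
    """Sum counts over features sharing the same name at `level`.

    `features` is an iterable of (taxonomy, count) pairs, where taxonomy is
    a semi-colon separated string. Level 6 is the 0-based position of the
    species in a 7-level taxonomy.
    """
    totals = defaultdict(int)
    for taxonomy, count in features:
        ranks = [rank.strip() for rank in taxonomy.split(";")]
        has_level = len(ranks) > level and ranks[level]
        name = ranks[level] if has_level else "Unknown"
        totals[name] += count
    return dict(totals)


# Placeholder features, for illustration only.
features = [
    ("k;p;c;o;f;g;species_a", 10),
    ("k;p;c;o;f;g;species_a", 5),
    ("k;p;c;o;f;g2;species_b", 7),
    ("k;p", 3),  # taxonomy that stops above species level
]
totals = aggregate_by_level(features)
print(len(totals), "groups:", totals)
```

Passing a smaller `level` (for example 5 for genus) gives the coarser aggregation that the same visualisation offers for higher ranks.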
For more information on taxonomic profiling, we recommend you complete the Taxonomic
Profiling of Whole Shotgun Metagenomic Data tutorial, which can be found here:
https://resources.qiagenbioinformatics.com/tutorials/Taxonomic_Profiling.pdf.
2. Select "Resistance genes" from the tutorial folder location and then click on Next.
5. In the import area click Browse and select the "Resistance_genes_annotations.xlsx" table,
as shown in figure 6.
6. In the Preview and mappings area, inspect the coloring of the table. The headings are checked
by the software and colored accordingly. For descriptions of the color coding, see Color
names and coloring.
7. After confirming that the preview looks as expected, with a Name and Phenotype field, click
on Next.
8. Keep the "Create report" option checked, and choose to save the output to a new subfolder, for
example titled "Annotated resistance genes".
Optional: Using the annotated sequence list as a gene database for finding resistance
You can find resistance in the simulated wastewater reads using the annotated sequence list
you just created. First, the metagenome reads must be assembled. To do so, follow the steps
below:
3. Set execution mode to Longer contigs and leave the other settings on default, as seen in
figure 7. Click on Next.
4. Save the assembled metagenomes in a new subfolder, for example named "Assembled
metagenome".
The tool will run and output a contig list. We will use the annotated sequence list we created as
the nucleotide resistance database to search for resistance genes in the metagenome assembly.
7. Select the "Resistance genes" from the "Annotated resistance genes" folder, as seen in
figure 8. Leave the other settings on default. Click on Next.
The tool outputs a resistance table. Open and inspect the table. You will observe that a number
of resistance genes were found.
Figure 8: Select the annotated resistance genes to search for resistance genes
For more information on the tools for detecting antibiotic resistance, we recommend you complete
the Antibiotic Resistance Analysis tutorial, which can be found here:
https://resources.qiagenbioinformatics.com/tutorials/Antimicrobial_Resistance.pdf.
2. Select "Protein sequences" from the tutorial folder location and then click on Next.
5. In the import area click Browse and select the "Protein_sequences_annotations.xlsx" table,
as shown in figure 9.
6. In the Preview and mappings area, inspect the coloring of the table. The headings are checked
by the software and colored accordingly. For descriptions of the color coding, see Color
names and coloring. For GO-terms to be recognized, the input file must contain a
column named "GO-terms".
8. Keep the "Create report" option checked, and choose to save the output to a new subfolder, for
example titled "Annotated Protein DB".
11. Open the output sequence list from the "Annotated Protein DB" folder.
12. Switch to the Table view by clicking on the Table view icon in the bottom left corner.
13. Inspect the GO-terms column. The sequences have been annotated with GO-terms. The
GO-terms annotation has special meaning which can be seen by clicking on a row in the
"GO-terms" column. This will take you to the GO description of this gene.
Optional: Using the annotated protein sequence list to build a functional profile
We will use the metagenome assembly of the wastewater sample we built previously with the
annotated protein sequence list to build a GO functional profile.
In order to do so, the assembly must first be annotated with CDS regions containing GO
annotations. We will use the Annotate with DIAMOND tool for this.
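The GO annotations used here originate from the "GO-terms" column of the annotation table imported earlier. A table can be screened for that column, and for well-formed identifiers, before import. The sketch below assumes GO identifiers of the form `GO:` followed by seven digits; the function name is hypothetical, and the sketch deliberately makes no assumption about which separator the software expects between several terms in one cell.

```python
import re

# GO identifiers consist of the prefix "GO:" and seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def parse_go_terms(header, row):
    """Return the GO identifiers found in the "GO-terms" cell of one row."""
    if "GO-terms" not in header:
        raise ValueError('annotation table needs a column named "GO-terms"')
    cell = row[header.index("GO-terms")]
    return GO_ID.findall(cell)


# Invented row, for illustration only.
header = ["Name", "GO-terms"]
row = ["protein_1", "GO:0005524; GO:0016301"]
print(parse_go_terms(header, row))
```

A row for which the function returns an empty list has no recognizable GO identifier in that cell and is worth checking by hand.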
2. As input select the "Simulated_wastewater_reads (paired) contig list" and click on Next.
3. Select Protein Sequence List as the reference sequence, then locate the
"Protein sequences" protein sequence list in the "Annotated Protein DB" folder. Leave
the other options as default. The wizard parameters should appear as in figure 10. Click
on Next.
4. Leave the next two wizard steps on default by clicking Next twice.
The tool will run and output a contig list with "(DIAMOND annotations)".
Open the output list and switch to the Annotation Table view to see a number of
CDS annotations. We will use these annotations to build a functional profile.
The first step in building a functional profile is mapping the reads to the annotated contigs.
7. As input select the raw "Simulated_wastewater_reads (paired)" reads and click on Next.
We now have a read mapping and are ready to build the GO functional profile. If you do not
already have a GO database downloaded, you should download one now using
Databases | Functional Analysis | Download GO Database
This database is not limited to this tutorial, so save it in your general database location.
12. As input select the read mapping created in the previous step and click on Next.
15. Choose to save the output in a new location, for example named "Wastewater functional
profile".
Figure 11: Use the annotated contig list and downloaded GO database to build the functional profile
Inspect the output profile to see that a number of different GO terms are represented.
For more information on functional analysis, including how to compare different samples,
we recommend you complete the Whole Metagenome Functional Analysis tutorial, which
can be found here:
https://resources.qiagenbioinformatics.com/tutorials/Microbial_Analysis_Functional.pdf.
2. Select "16S amplicons" from the tutorial folder location and then click on Next.
3. Check "Set clustering similarity fraction annotation", then set the "Clustering similarity
fraction" to 0.97. This can also be set when running OTU clustering in case it was not set
when creating the database. Click on Next.
5. In the import area click Browse and select the "16S_amplicons_annotations.xlsx" table,
as shown in figure 12.
8. Open the "16S amplicons" sequence list and inspect the sequence names.
We can see that the sequence names match what is in the Sequence Name column in
the annotation file. We can therefore safely use this column to match sequences with
metadata.
10. Open Create Annotated Sequence List and repeat the above steps until you arrive at the
Select input files and map columns to attribute step.
11. Click on the second row containing "Sequence Name" in the preview (figure 13) and rename
this to "Name". Now, the table contains a "Name" and "Taxonomy" column and you can
click on Next.
12. Keep the "Create report" option checked, and choose to save the output to a new subfolder, for
example titled "Annotated 16S DB".
15. Open the output sequence list from the "Annotated 16S DB" folder.
16. Switch to the Table view by clicking on the Table view icon in the bottom left corner to
see a table of annotations present on each sequence.
17. Inspect the taxonomy column. The taxonomies were automatically detected as being QIIME
formatted and converted to 7-step taxonomy.
This conversion allows the taxonomies to be used as database input both for OTU clustering
and for creating taxonomic profiling indexes. Taxonomies can be specified in QIIME format
(starting with "k__" and comma or semi-colon separated), as seen here, or as semi-colon
separated strings.
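To make the QIIME format concrete, the sketch below splits such a string into seven named ranks. It is an illustration under stated assumptions, not the conversion the software performs: the rank names, the handling of empty ranks, and the function name `qiime_to_seven_levels` are all choices made for this example.

```python
import re

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
PREFIX = re.compile(r"^[kpcofgs]__")


def qiime_to_seven_levels(taxonomy):
    """Convert a QIIME-style taxonomy string to a dict of seven ranks.

    Accepts comma or semi-colon separators, strips rank prefixes such as
    "k__", and fills missing or empty ranks with empty strings.
    """
    parts = [part.strip() for part in re.split(r"[;,]", taxonomy)]
    names = [PREFIX.sub("", part) for part in parts if part]
    names = names[:7] + [""] * (7 - len(names))
    return dict(zip(RANKS, names))


# Example string in QIIME format with an unassigned species.
example = (
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Streptococcaceae; g__Streptococcus; s__"
)
levels = qiime_to_seven_levels(example)
print(";".join(levels[rank] for rank in RANKS))
```

The final line prints the same taxonomy as a semi-colon separated string, the other form mentioned above.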
Optional: Using the annotated sequence list as reference database for OTU clustering
If you wish to try using the created database for OTU clustering, we recommend using the data
from the OTU clustering step by step tutorial, which can be found here:
https://resources.qiagenbioinformatics.com/tutorials/OTU_Clustering_Steps.pdf. Simply replace
the 16S_97_otus_GG database with the one you just created.
2. Select "Virulence genes" from the tutorial folder location and then click on Next.
6. In the Preview and mappings area, inspect the coloring of the table. The headings are checked
by the software and colored accordingly. For descriptions of the color coding, see Color
names and coloring. Here, all fields appear green, so we can click on Next.
7. Keep the "Create report" option checked, and choose to save the output to a new subfolder, for
example titled "Annotated virulence genes".
10. Open the output sequence list from the "Annotated virulence genes" folder.
11. Switch to the Table view by clicking on the Table view icon in the bottom left corner.
12. Inspect the Virulence factor and Gene ID columns. These fields have special meaning.
Clicking on a row in the "Virulence factor" or "Gene ID" columns will take you to a
description of this virulence gene.
Optional: Using the annotated sequence list as a virulence database for finding
virulence
We will use the Microbial genome database we imported and annotated previously and add
virulence annotations.
2. As input select the "Microbial genomes" from the "Annotated Microbial Reference DB"
folder and click on Next.
3. Select the "Virulence genes" nucleotide sequence list from the "Annotated virulence genes"
folder as the DB. Leave the other options as default. The wizard parameters
should appear as in figure 15. Click on Next.
4. In the last step, save the output table in the "Annotated Microbial Reference DB" folder.
The tool runs and may take several minutes to complete. Open and inspect the Find resistance
table. In the contigs column, we can see that three of the references were found to have virulence
genes. None of these were detected in taxonomic profiling, and it is therefore unlikely that the
sample contains any particularly virulent strain.
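The reasoning in this paragraph is a comparison of two sets: the references that carry virulence genes, and the references detected by taxonomic profiling. A minimal sketch of that comparison follows. The identifiers are placeholders and not the tutorial's actual results; in practice the names would be taken from the contigs column of the output table and from the taxonomic profile.

```python
def virulent_and_detected(virulence_hits, profiled):
    """Return references that carry virulence genes and were also detected."""
    return sorted(set(virulence_hits) & set(profiled))


# Placeholder identifiers, not the tutorial's actual results.
virulence_hits = ["reference_A", "reference_B", "reference_C"]
profiled = ["reference_D", "reference_E"]

overlap = virulent_and_detected(virulence_hits, profiled)
print(overlap or "no virulence-carrying reference was detected in the sample")
```

An empty overlap corresponds to the conclusion drawn above; any name in the overlap would point to a detected organism that deserves a closer look.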