This is a walkthrough demonstrating metagenomic workflows on the Terra cloud platform, using Illumina data from an arboviral surveillance project at the University of Florida. The data set consists of two (downsampled) samples generated by Illumina sequencing with metagenomic protocols. In this Terra exercise, we additionally add some samples from the Ebola and Lassa exercises used previously.
The exercise will utilize the Broad Viral Genomics group’s viral metagenomic classification pipeline (Dockstore link, GitHub source, Manual) and practice execution on the Terra platform, but this pipeline is known to work on other cloud platforms as well as on the command line using miniWDL.
Briefly, the classify_single pipeline consists of the following steps:
Note: this pipeline does not yet incorporate bracken for more accurate abundance estimation, but the principles for interpretation of results are the same (aside from any quantitative inferences).
For simplicity, we have already loaded in the read data and the reference databases. If you are starting from scratch on a new data set, what we did to populate these workspaces was:

- Loaded the fastq_to_ubam and classify_single workflows into the workspace.
- Created a metagenomics table with four rows representing a 2x2 combination of samples and kraken2 databases:

entity:metagenomics_id | sample | fastq1 | k2_db |
---|---|---|---|
LongBoat-PlusPF | LongBoat | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst |
Palmetto-PlusPF | Palmetto | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst |
LongBoat-MiniK2 | LongBoat | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz | gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst |
Palmetto-MiniK2 | Palmetto | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz | gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst |
- Ran the fastq_to_ubam workflow on all rows of the metagenomics table, with platform_name = “ILLUMINA”, library_name = “l1”, sample_name = this.sample, and fastq_1 = this.fastq1.
- Set the following workspace data attributes:
  - workspace.kraken2_db_pluspf = gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst
  - workspace.kraken2_db_broad_custom = gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst
  - workspace.kraken2_db_minik2 = gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst
  - workspace.krona_taxonomy_tab = gs://pathogen-public-dbs/v1/krona.taxonomy-20221213.tab.zst
  - workspace.ncbi_taxdump = gs://pathogen-public-dbs/v1/taxdump-20221213.tar.gz
  - workspace.spikein_db = gs://pathogen-public-dbs/v0/ERCC_96_nopolyA.fasta
  - workspace.trim_clip_db = gs://pathogen-public-dbs/v0/contaminants.clip_db.fasta
- Created additional rows in the metagenomics table including a combination of four EBOV and four LASV samples (from previous exercises) and two kraken2 databases (PlusPF and a Broad custom):

entity:metagenomics_id | sample | k2_db | unmapped_bam |
---|---|---|---|
G5723.1-PlusPF | G5723.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5731.1-PlusPF | G5731.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5732.1-PlusPF | G5732.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5735.2-PlusPF | G5735.2 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
LASV_NGA_2016_0409-Broad | LASV_NGA_2016_0409 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0409-PlusPF | LASV_NGA_2016_0409 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0668-Broad | LASV_NGA_2016_0668 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0668-PlusPF | LASV_NGA_2016_0668 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0811-Broad | LASV_NGA_2016_0811 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0811-PlusPF | LASV_NGA_2016_0811 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_1423-Broad | LASV_NGA_2016_1423 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_1423-PlusPF | LASV_NGA_2016_1423 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
The above steps do not take very long (a few minutes here and there) but were not worth spending workshop time on. They are generalizable to any scenario or organism where you want to run your own fastq reads against our kraken2-based classification pipeline.
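As an aside, load tables like the first one above can be generated programmatically: Terra data-table load files are tab-separated, and the first column header must be named `entity:<table>_id`. A minimal sketch, using the sample and database URIs from the table above:

```python
import csv
import io

# Samples and kraken2 databases to combine (2x2), as in the table above
samples = {
    "LongBoat": "gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz",
    "Palmetto": "gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz",
}
k2_dbs = {
    "PlusPF": "gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst",
    "MiniK2": "gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst",
}

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
# Terra requires the first column header to be entity:<table>_id
writer.writerow(["entity:metagenomics_id", "sample", "fastq1", "k2_db"])
for sample, fastq in samples.items():
    for db_name, db_uri in k2_dbs.items():
        writer.writerow([f"{sample}-{db_name}", sample, fastq, db_uri])

tsv = buf.getvalue()
print(tsv)
```

Writing `tsv` to a file and uploading it via the Data tab would recreate the four rows shown above.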
Off-the-shelf kraken2 databases are distributed by Johns Hopkins University (the authors of Kraken) at this site, and redistributed in public Google Cloud buckets by the Broad.
We will use the same workspace that you have already cloned from the previous Ebola read alignment exercise.
Click on the Workflows tab on the top. This should lead to a list of analysis workflows that have already been preloaded into your workspace. One of them is classify_single. Click on classify_single.
This will lead to a workflow launch page where you will need to set the parameters and inputs before launching your analyses. Make sure to set the following:
- The workflow should be classify_single.
- “Version:” should already be set to master, but make sure it is set as such.
- The root entity type should be set to metagenomics.
- Select all rows of the metagenomics table so that we launch 16 jobs at the same time, one for each sample x database combination in the table. Hit the OK button on the lower right of the pop up box. This should return you to the workflow setup page, which should now say that it will run on “16 selected metagenomicss” (yes, it’s just adding an “s” to the table name).
- classify_single.kraken2_db_tgz (required) should be set to this.k2_db
- classify_single.krona_taxonomy_db_kraken2_tgz (required) should be set to workspace.krona_taxonomy_tab
- classify_single.ncbi_taxdump_tgz (required) should be set to workspace.ncbi_taxdump
- classify_single.reads_bams (required) should be set to this.unmapped_bam
- classify_single.spikein_db (required) should be set to workspace.spikein_db
- classify_single.trim_clip_db (required) should be set to workspace.trim_clip_db
The resulting workflow launch page should look like this when you are ready:
Click “RUN ANALYSIS” (which should be dark blue if you’ve filled in all inputs properly). This will take you to a job submission status page for your newly launched workflow, showing 16 rows in the bottom table corresponding to the 16 jobs that have been launched.
You will receive an email when each of your submissions completes (along with information about whether it succeeded or failed). Additionally, you can click on the JOB HISTORY tab at the top of your workspace to check on the status of your analyses in progress. When classify_single is complete, you can move on to evaluating the results. The job submission page for your submission under the Job History tab should look something like this when the submissions are complete:
Depending on some predictable and some unpredictable factors, the classify_single jobs should complete within about 30 minutes for the minikraken2 database and about 60 minutes for the PlusPF database, but may take longer. No user intervention is required while you await results, and no connectivity or power is required on the client side during this time. The runtime should be largely independent of whether you launched jobs on 1 or 1,000 samples. Some intermediate outputs are viewable before the full analysis completes, but it’s often easier to wait for the final results to be loaded into the table.
About 1 day after job completion, total cloud compute costs are displayed on this page (in USD). Prior to that, the run costs are listed as “N/A”. Kraken2 analysis costs and runtime tend to scale more with the size of the kraken2 database than the size of the input sequencing data (unless it is an extremely large volume of sequencing data).
You can examine the outputs and results of each step of each job via the Job History page; however, for large submissions, it is easier to view the saved top-level outputs in the data table, in this case the metagenomics table. The metagenomics table now has a number of additional output columns, including krona plots for viewing metagenomics results, text summary files, subsetted/filtered read sets, fastqc plots on subsetted reads, and viral de novo contigs.
There are two key outputs that you might inspect for each sample in the metagenomics table: a text-based summary file produced by kraken2, provided in the kraken2_summary_report column, and an interactive krona plot rendering, provided in kraken2_krona_plot.
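The kraken2 summary report is a plain-text, tab-separated table, which makes it easy to post-process. A minimal sketch of pulling species-level counts out of a report; the snippet of report text here is made up for illustration, but the six-column layout (percent of reads, reads in clade, reads assigned directly, rank code, NCBI taxid, indented name) is kraken2's standard report format:

```python
# A fabricated fragment of a kraken2 summary report, for illustration only
example_report = """\
 80.00\t200000\t200000\tU\t0\tunclassified
 20.00\t50000\t120\tR\t1\troot
 15.00\t37500\t37500\tS\t11269\tPiry virus
  0.01\t25\t25\tS\t11070\tDengue virus type 4
"""

def species_hits(report_text, min_clade_reads=1):
    """Return {name: clade_read_count} for species-level (rank code 'S') rows."""
    hits = {}
    for line in report_text.splitlines():
        pct, clade, direct, rank, taxid, name = line.split("\t")
        if rank == "S" and int(clade) >= min_clade_reads:
            hits[name.strip()] = int(clade)
    return hits

print(species_hits(example_report))
# {'Piry virus': 37500, 'Dengue virus type 4': 25}
```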
You can click on the live links for any file element in the Terra data table and download them, or preview them in your browser. As an example, click on the LongBoat.kraken2.krona.html live link in the kraken2_krona_plot column of the LongBoat-PlusPF row of the metagenomics data table. This will open a “File Details” box where you can access this file three different ways; for example, you can copy the provided gsutil cp command to your terminal and it will download the file that way. Repeat the above steps for all four results to open them in separate tabs.
Below are what the outputs should look like for your run.
Krona plots of the arboviral data, two different JHU-distributed databases (MiniK2 and PlusPF):
The dominant viral hit is Piry virus in both samples, but if you expand the Viruses wedge you will see a small number of Dengue virus type 4 reads in both. The Broad-database results (LongBoat, Palmetto) do not differ significantly from the PlusPF results and were not included in your exercise data set.
Both the minikraken2 and PlusPF databases perform similarly for viral reads; the primary difference is the detection of eukaryotic reads by the PlusPF database (minikraken2 does not contain any eukaryotic sequences). Plasmodium relictum is a malaria-like pathogen that infects avian hosts.
Krona plots from the previously used Ebola example data against the JHU PlusPF database:
Notice the similar trends between the number of Zaire ebolavirus reads classified by kraken2 and the number of reads_aligned in the ebov table from the assemble_refbased workflow in the Ebola exercise. There will be differences due to the sensitivity of the methodology (kraken2 vs minimap2), as well as read deduplication (assemble_refbased counts are deduplicated, metagenomic hits are not) and other filters (assemble_refbased filters to properly paired aligned reads), but the general trend and orders of magnitude are similar.
sample | kraken2 EBOV hits | minimap2 reads aligned |
---|---|---|
G5723.1 | 14043 | 8240 |
G5731.1 | 200354 | 90008 |
G5732.1 | 5857 | 1770 |
G5735.2 | 420 | 434 |
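This order-of-magnitude agreement is easy to verify from the table above: the log10 ratio of the kraken2 count to the minimap2 count stays below 1 for every sample:

```python
import math

# (kraken2 EBOV clade hits, minimap2 reads aligned) per sample, from the table above
counts = {
    "G5723.1": (14043, 8240),
    "G5731.1": (200354, 90008),
    "G5732.1": (5857, 1770),
    "G5735.2": (420, 434),
}

for sample, (kraken2_hits, minimap2_aligned) in counts.items():
    log_ratio = math.log10(kraken2_hits / minimap2_aligned)
    # kraken2 counts are undeduplicated and less filtered, so they tend to run
    # higher, but remain within one order of magnitude (|log10 ratio| < 1)
    print(f"{sample}: log10 ratio = {log_ratio:+.2f}")
    assert abs(log_ratio) < 1
```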
Krona plots from the previously used Lassa example data against PlusPF and a custom Broad kraken2 database:
Here are the number of Lassa virus reads detected by each kraken2 database:
sample | PlusPF (Arenaviridae) | Broad custom (Lassa mammarenavirus) |
---|---|---|
LASV_NGA_2016_0409 | 3021 | 9022 |
LASV_NGA_2016_0668 | 400 | 12204 |
LASV_NGA_2016_0811 | 91 | 1332 |
LASV_NGA_2016_1423 | 4310 | 689845 |
As you can see, there are significant differences in detection sensitivity between PlusPF and the custom Broad kraken2 databases for Lassa virus, as well as specificity (the Broad database classifies LASV to the species level; the PlusPF generally does not). This is not surprising, as Lassa virus is only about 70% conserved at the nucleotide level across the species; with an average of 1 SNP every 3 bp, no k-mer-based method will match these reads unless a close enough genome is represented in the database. Default JHU kraken2 databases include only one representative genome per viral species (the RefSeq genome), so they tend to perform quite poorly at Lassa virus detection.
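The k-mer argument above can be made quantitative with a back-of-the-envelope calculation: if each base matches independently with probability p, a k-mer survives with probability p^k. At ~70% identity and kraken2's default k-mer length of 35, essentially no k-mers match exactly, while at EBOV-like conservation most do (this independence assumption is a crude simplification, but it illustrates the point):

```python
# Probability that a k-mer contains zero mismatches, assuming each base
# matches independently with probability p (a crude approximation)
def exact_kmer_match_prob(p, k):
    return p ** k

# Lassa-like divergence: ~70% identity, kraken2 default k=35
p_lassa = exact_kmer_match_prob(0.70, 35)
print(f"{p_lassa:.1e}")  # on the order of 1e-6: essentially no matching k-mers

# A far more conserved virus at, say, 99% identity keeps most of its k-mers
p_conserved = exact_kmer_match_prob(0.99, 35)
print(f"{p_conserved:.2f}")  # ~0.70
```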
By contrast, Ebola virus is much more genetically conserved, and RefSeq-only kraken databases perform quite well at detection. Dengue virus, while diverse, has one RefSeq genome for each of the four types, meaning that RefSeq captures DENV diversity quite well.
To address the sensitivity issues with RefSeq, the Broad Viral Genomics group occasionally builds custom databases that are intended to encapsulate PlusPF with additional viral genomes for particularly diverse taxa. The most recent build of this database was in 2020 (gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst).
Other example krona plots from outside data sets:
Among other outputs, kraken2 will classify every read in your input data; that is, it will assign an NCBI taxid to every read you gave it. Our pipeline additionally uses this output to create two subsetted read sets of your input data:

- Dehosted reads: everything that did not classify to Vertebrata or lower. This includes all non-vertebrate reads plus unclassified reads and those that classified at a higher taxonomic rank than the subphylum level. Reads are contained in the cleaned_reads_unaligned_bam output, with the numeric count provided in read_counts_depleted and the FastQC plot provided in cleaned_fastqc.
- Acellular reads: everything that did not classify to Vertebrata, other sequences (synthetic constructs), or Bacteria, and then PCR deduplicated via an alignment-free approach. Reads are contained in the deduplicated_reads_unaligned output, with the numeric count provided in read_counts_dedup.

The dehosted reads are typically the data set you will submit to SRA/ENA for public data release, and these can be used for downstream analyses whether alignment-based or assembly-based (in the latter case, you can skip any dehosting steps to save time and cost, since it has already been done here).
A FastQC plot is generated on this subset as well, because we have found over the years that metagenomic reads from direct patient specimens sometimes exhibit different RNA degradation/quality between host nucleic acids and RNA viral nucleic acids within the same sample; BioAnalyzer/TapeStation outputs and FastQC plots on the total raw data may obscure this difference if it exists. Our team does not always proactively look at these plots unless problems are encountered, but they are easy to generate and helpful if needed.
The acellular reads are often the “target” data of interest (unless you are interested in the bacteria) and are often a very small fraction of the input data. These subsetted BAM files are often small enough to download and work with in local analysis or visualization tools, and would be the appropriate input for SPAdes and other focused analysis steps.
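Both subsetting rules above boil down to excluding reads whose classification falls within certain clades. A toy sketch of that logic, with hypothetical reads and a hand-written lineage lookup (the real pipeline resolves lineages from the NCBI taxdump, not a hard-coded table):

```python
# Hypothetical lineage lookup: taxon -> set of ancestors (including itself)
LINEAGE = {
    "Homo sapiens": {"Homo sapiens", "Vertebrata", "Eukaryota"},
    "Escherichia coli": {"Escherichia coli", "Bacteria"},
    "Piry virus": {"Piry virus", "Viruses"},
    "unclassified": {"unclassified"},
}

def subset_reads(classified, exclude_clades):
    """Keep reads whose classification does NOT fall within any excluded clade."""
    return [read for read, taxon in classified
            if not (LINEAGE[taxon] & exclude_clades)]

reads = [("r1", "Homo sapiens"), ("r2", "Escherichia coli"),
         ("r3", "Piry virus"), ("r4", "unclassified")]

# Dehosted: drop anything under Vertebrata
dehosted = subset_reads(reads, {"Vertebrata"})
# Acellular: additionally drop Bacteria and synthetic constructs
acellular = subset_reads(reads, {"Vertebrata", "Bacteria", "other sequences"})
print(dehosted)   # ['r2', 'r3', 'r4']
print(acellular)  # ['r3', 'r4']
```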
The classify_single workflow will additionally take the subset of reads classified as acellular (all viral and unclassified reads, typically representing a small minority of the input data) and perform de novo assembly via SPAdes. The resulting contigs are provided in the contigs_fasta output column of the metagenomics table. You can download these fasta files from the table view; they should not be particularly large files (in this workshop’s data set, all of these files are less than 0.5MB).
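Once downloaded, these contig fasta files are small enough to inspect with a few lines of code, for example to compute contig lengths and an N50 (the toy fasta here is inlined for illustration; in practice you would read the downloaded file):

```python
def contig_lengths(fasta_text):
    """Return a list of contig lengths from FASTA-formatted text."""
    lengths, current = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest contig length such that contigs of at least that length
    cover half the total assembly."""
    half, running = sum(lengths) / 2, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy two-contig FASTA: one 1000 bp contig and one 300 bp contig
toy_fasta = ">contig1\nACGT" + "A" * 996 + "\n>contig2\n" + "C" * 300 + "\n"
lengths = contig_lengths(toy_fasta)
print(lengths, n50(lengths))  # [1000, 300] 1000
```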
For highly diverse viral taxa, k-mer based read classification will have sensitivity limitations, especially when utilizing RefSeq-only databases and/or if the full diversity of the species is not captured well in INSDC at all (see LASV results above). Custom kraken database builds can help, but not for more novel or uncharacterized viral taxa, which would require a different detection approach.
Utilizing de novo contigs instead of raw reads for detection provides more statistical power per sequence for distant matches via BLAST or BLAST-like approaches. This workshop does not go into detail on how to employ these approaches; however, the most common and simple next step researchers take for investigation is to run the contigs through NCBI BLAST against nt (blastn) or nr (blastx).
See also: