This is a walkthrough demonstrating metagenomic workflows on the Terra cloud platform, using Illumina data from an arboviral surveillance project at the University of Florida. The data set consists of two (downsampled) samples generated by Illumina sequencing with metagenomic protocols. In this Terra exercise, we additionally add some samples from the Ebola and Lassa exercises used previously.
The exercise will utilize the Broad Viral Genomics group’s viral metagenomic classification pipeline (Dockstore link, GitHub source, Manual) and practice execution on the Terra platform, but this pipeline is known to work on other cloud platforms as well as on the command line using miniWDL.
Briefly, the classify_single pipeline consists of the following steps:
Note: this pipeline does not yet incorporate bracken for more accurate abundance estimation, but the principles for interpretation of results are the same (aside from any quantitative inferences).
For simplicity, we have already loaded in the read data and the reference databases. If you are starting from scratch on a new data set, what we did to populate these workspaces was:

- Loaded the fastq_to_ubam and classify_single workflows into the workspace.
- Created a metagenomics table with four rows representing a 2x2 combination of samples and kraken2 databases:

entity:metagenomics_id | sample | fastq1 | k2_db |
---|---|---|---|
LongBoat-PlusPF | LongBoat | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst |
Palmetto-PlusPF | Palmetto | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst |
LongBoat-MiniK2 | LongBoat | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz | gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst |
Palmetto-MiniK2 | Palmetto | gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz | gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst |
- Ran the fastq_to_ubam workflow on all rows of the metagenomics table, with platform_name = “ILLUMINA”, library_name = “l1”, sample_name = this.sample, and fastq_1 = this.fastq1.
- Set the following workspace data attributes:
  - workspace.kraken2_db_pluspf = gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst
  - workspace.kraken2_db_broad_custom = gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst
  - workspace.kraken2_db_minik2 = gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst
  - workspace.krona_taxonomy_tab = gs://pathogen-public-dbs/v1/krona.taxonomy-20221213.tab.zst
  - workspace.ncbi_taxdump = gs://pathogen-public-dbs/v1/taxdump-20221213.tar.gz
  - workspace.spikein_db = gs://pathogen-public-dbs/v0/ERCC_96_nopolyA.fasta
  - workspace.trim_clip_db = gs://pathogen-public-dbs/v0/contaminants.clip_db.fasta
- Created additional rows in the metagenomics table including a combination of four EBOV and four LASV samples (from previous exercises) and two kraken2 databases (PlusPF and a Broad custom):

entity:metagenomics_id | sample | k2_db | unmapped_bam |
---|---|---|---|
G5723.1-PlusPF | G5723.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5731.1-PlusPF | G5731.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5732.1-PlusPF | G5732.1 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
G5735.2-PlusPF | G5735.2 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from ebov table) |
LASV_NGA_2016_0409-Broad | LASV_NGA_2016_0409 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0409-PlusPF | LASV_NGA_2016_0409 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0668-Broad | LASV_NGA_2016_0668 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0668-PlusPF | LASV_NGA_2016_0668 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0811-Broad | LASV_NGA_2016_0811 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_0811-PlusPF | LASV_NGA_2016_0811 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_1423-Broad | LASV_NGA_2016_1423 | gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst | (copied from de_novo_assembly table) |
LASV_NGA_2016_1423-PlusPF | LASV_NGA_2016_1423 | gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst | (copied from de_novo_assembly table) |
The above steps do not take very long (a few minutes here and there) but were not worth spending workshop time on. They are generalizable to any scenario or organism where you want to run your own fastq reads against our kraken2-based classification pipeline.
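As an aside, load tables like the first one above can be generated programmatically: Terra data-table load files are tab-separated, and the first column header must be named `entity:<table>_id`. A minimal sketch, using the sample and database URIs from the table above:

```python
import csv
import io

# Samples and kraken2 databases to combine (2x2), as in the table above
samples = {
    "LongBoat": "gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Longboat-250k.fastq.gz",
    "Palmetto": "gs://fc-9c0b667b-b35d-4554-937e-ea40fcd95c0d/input_data/metagenomics/Palmetto-250k.fastq.gz",
}
k2_dbs = {
    "PlusPF": "gs://pathogen-public-dbs/jhu/k2_pluspf_20221209.tar.zst",
    "MiniK2": "gs://pathogen-public-dbs/jhu/minikraken2_v2_8GB_201904_retar.tar.zst",
}

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
# Terra requires the first column header to be entity:<table>_id
writer.writerow(["entity:metagenomics_id", "sample", "fastq1", "k2_db"])
for sample, fastq in samples.items():
    for db_name, db_uri in k2_dbs.items():
        writer.writerow([f"{sample}-{db_name}", sample, fastq, db_uri])

tsv = buf.getvalue()
print(tsv)
```

Writing `tsv` to a file and uploading it via the Data tab would recreate the four rows shown above.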
Off-the-shelf kraken2 databases are distributed by Johns Hopkins University (the authors of Kraken) at this site, and redistributed in public Google Cloud buckets by the Broad.
We will use the same workspace that you have already cloned from the previous Ebola read alignment exercise.
Click on the Workflows tab on the top. This should lead to a list of analysis workflows that have already been preloaded into your workspace. One of them is classify_single. Click on classify_single.
This will lead to a workflow launch page where you will need to set the parameters and inputs before launching your analyses. Make sure to set the following:
- The workflow should be classify_single.
- “Version:” should already be set to master, but make sure it is set as such.
- The root entity type should be set to metagenomics.
- Select all rows of the metagenomics table so that we launch 16 jobs at the same time, one for each sample x database combination in the table. Hit the OK button on the lower right of the pop up box. This should return you to the workflow setup page, which should now say that it will run on “16 selected metagenomicss” (yes, it’s just adding an “s” to the table name).
- classify_single.kraken2_db_tgz (required) should be set to this.k2_db
- classify_single.krona_taxonomy_db_kraken2_tgz (required) should be set to workspace.krona_taxonomy_tab
- classify_single.ncbi_taxdump_tgz (required) should be set to workspace.ncbi_taxdump
- classify_single.reads_bams (required) should be set to this.unmapped_bam
- classify_single.spikein_db (required) should be set to workspace.spikein_db
- classify_single.trim_clip_db (required) should be set to workspace.trim_clip_db
The resulting workflow launch page should look like this when you are ready:
Click “RUN ANALYSIS” (which should be dark blue if you’ve filled in all inputs properly). This will take you to a job submission status page for your newly launched workflow, showing 16 rows in the bottom table corresponding to the 16 jobs that have been launched.
You will receive an email when each of your submissions completes (along with information about whether it succeeded or failed). Additionally, you can click on the JOB HISTORY tab at the top of your workspace to check on the status of your analyses in progress. When classify_single is complete, you can move on to evaluating the results. The job submission page for your submission under the Job History tab should look something like this when the submissions are complete:
Depending on some predictable and some unpredictable factors, the classify_single jobs should complete within about 30 minutes for the minikraken2 database and about 60 minutes for the PlusPF database, but may take longer. No user intervention is required while you await results, and no connectivity or power is required on the client side during this time. The runtime should be largely independent of whether you launched jobs on 1 or 1,000 samples. Some intermediate outputs are viewable before the full analysis completes, but it’s often easier to wait for the final results to be loaded into the table.
About 1 day after job completion, total cloud compute costs are displayed on this page (in USD). Prior to that, the run costs are listed as “N/A”. Kraken2 analysis costs and runtime tend to scale more with the size of the kraken2 database than the size of the input sequencing data (unless it is an extremely large volume of sequencing data).
You can examine the outputs and results of each step of each job via the Job History page; however, for large submissions, it is easier to view the saved top-level outputs in the data table, in this case the metagenomics table. The metagenomics table now has a number of additional output columns, including krona plots for viewing metagenomics results, text summary files, subsetted/filtered read sets, fastqc plots on subsetted reads, and viral de novo contigs.
There are two key outputs that you might inspect for each sample in the metagenomics table: a text-based summary file produced by kraken2, provided in the kraken2_summary_report column, and an interactive krona plot rendering, provided in kraken2_krona_plot.
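The kraken2 summary report is a plain-text, tab-separated table, which makes it easy to post-process. A minimal sketch of pulling species-level counts out of a report; the snippet of report text here is made up for illustration, but the six-column layout (percent of reads, reads in clade, reads assigned directly, rank code, NCBI taxid, indented name) is kraken2's standard report format:

```python
# A fabricated fragment of a kraken2 summary report, for illustration only
example_report = """\
 80.00\t200000\t200000\tU\t0\tunclassified
 20.00\t50000\t120\tR\t1\troot
 15.00\t37500\t37500\tS\t11269\tPiry virus
  0.01\t25\t25\tS\t11070\tDengue virus type 4
"""

def species_hits(report_text, min_clade_reads=1):
    """Return {name: clade_read_count} for species-level (rank code 'S') rows."""
    hits = {}
    for line in report_text.splitlines():
        pct, clade, direct, rank, taxid, name = line.split("\t")
        if rank == "S" and int(clade) >= min_clade_reads:
            hits[name.strip()] = int(clade)
    return hits

print(species_hits(example_report))
# {'Piry virus': 37500, 'Dengue virus type 4': 25}
```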
You can click on the live links for any file element in the Terra data table and download them, or preview them in your browser. As an example, click on the LongBoat.kraken2.krona.html live link in the kraken2_krona_plot column of the LongBoat-PlusPF row of the metagenomics data table. This will open a “File Details” box where you can access this file three different ways; for example, you can copy the provided gsutil cp command to your terminal and it will download the file that way. Repeat the above steps for all four results to open them in separate tabs.
Below are what the outputs should look like for your run.
Krona plots of the arboviral data, two different JHU-distributed databases (MiniK2 and PlusPF):
The dominant viral hit is Piry virus in both samples, but if you expand the Viruses wedge you will see a small number of Dengue virus type 4 reads in both. The Broad-database results (LongBoat, Palmetto) do not differ significantly from the PlusPF results and were not included in your exercise data set.
Both the minikraken2 and PlusPF databases perform similarly for viral reads; the primary difference is the detection of eukaryotic reads by the PlusPF database (minikraken2 does not contain any eukaryotic sequences). Plasmodium relictum is a malaria-like pathogen that infects avian hosts.
Krona plots from the previously used Ebola example data against the JHU PlusPF database:
Notice the similar trends between the number of Zaire ebolavirus reads classified by kraken2 and the number of reads_aligned in the ebov table from the assemble_refbased workflow in the Ebola exercise. There will be differences due to the sensitivity of the methodology (kraken2 vs minimap2), as well as read deduplication (assemble_refbased counts are deduplicated, metagenomic hits are not) and other filters (assemble_refbased filters to properly paired aligned reads), but the general trend and orders of magnitude are similar.
sample | kraken2 EBOV hits | minimap2 reads aligned |
---|---|---|
G5723.1 | 14043 | 8240 |
G5731.1 | 200354 | 90008 |
G5732.1 | 5857 | 1770 |
G5735.2 | 420 | 434 |
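This order-of-magnitude agreement is easy to verify from the table above: the log10 ratio of the kraken2 count to the minimap2 count stays below 1 for every sample:

```python
import math

# (kraken2 EBOV clade hits, minimap2 reads aligned) per sample, from the table above
counts = {
    "G5723.1": (14043, 8240),
    "G5731.1": (200354, 90008),
    "G5732.1": (5857, 1770),
    "G5735.2": (420, 434),
}

for sample, (kraken2_hits, minimap2_aligned) in counts.items():
    log_ratio = math.log10(kraken2_hits / minimap2_aligned)
    # kraken2 counts are undeduplicated and less filtered, so they tend to run
    # higher, but remain within one order of magnitude (|log10 ratio| < 1)
    print(f"{sample}: log10 ratio = {log_ratio:+.2f}")
    assert abs(log_ratio) < 1
```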
Krona plots from the previously used Lassa example data against PlusPF and a custom Broad kraken2 database:
Here are the number of Lassa virus reads detected by each kraken2 database:
sample | PlusPF (Arenaviridae) | Broad custom (Lassa mammarenavirus) |
---|---|---|
LASV_NGA_2016_0409 | 3021 | 9022 |
LASV_NGA_2016_0668 | 400 | 12204 |
LASV_NGA_2016_0811 | 91 | 1332 |
LASV_NGA_2016_1423 | 4310 | 689845 |
As you can see, there are significant differences in detection sensitivity between PlusPF and the custom Broad kraken2 databases for Lassa virus, as well as specificity (the Broad database classifies LASV to the species level; the PlusPF generally does not). This is not surprising, as Lassa virus is only about 70% conserved at the nucleotide level across the species; with an average of 1 SNP every 3 bp, no k-mer-based method will match these reads unless a close enough genome is represented in the database. Default JHU kraken2 databases include only one representative genome per viral species (the RefSeq genome), so they tend to perform quite poorly at Lassa virus detection.
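The k-mer argument above can be made quantitative with a back-of-the-envelope calculation: if each base matches independently with probability p, a k-mer survives with probability p^k. At ~70% identity and kraken2's default k-mer length of 35, essentially no k-mers match exactly, while at EBOV-like conservation most do (this independence assumption is a crude simplification, but it illustrates the point):

```python
# Probability that a k-mer contains zero mismatches, assuming each base
# matches independently with probability p (a crude approximation)
def exact_kmer_match_prob(p, k):
    return p ** k

# Lassa-like divergence: ~70% identity, kraken2 default k=35
p_lassa = exact_kmer_match_prob(0.70, 35)
print(f"{p_lassa:.1e}")  # on the order of 1e-6: essentially no matching k-mers

# A far more conserved virus at, say, 99% identity keeps most of its k-mers
p_conserved = exact_kmer_match_prob(0.99, 35)
print(f"{p_conserved:.2f}")  # ~0.70
```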
By contrast, Ebola virus is much more genetically conserved, and RefSeq-only kraken databases perform quite well at detection. Dengue virus, while diverse, has one RefSeq genome for each of the four types, meaning that RefSeq captures DENV diversity quite well.
To address the sensitivity issues with RefSeq, the Broad Viral Genomics group occasionally builds custom databases that are intended to encapsulate PlusPF with additional viral genomes for particularly diverse taxa. The most recent build of this database was in 2020 (gs://pathogen-public-dbs/v1/kraken2-broad-20200505.tar.zst).
Other example krona plots from outside data sets:
Among other outputs, kraken2 will classify every read in your input data; that is, it will assign an NCBI taxid to every read you gave it. Our pipeline additionally uses this output to create two subsetted read sets of your input data:

- Dehosted reads: everything that did not classify to Vertebrata or lower. This includes all non-vertebrate reads plus unclassified reads and those that classified at a higher taxonomic rank than the subphylum level. Reads are contained in the cleaned_reads_unaligned_bam output, with the numeric count provided in read_counts_depleted and the FastQC plot provided in cleaned_fastqc.
- Acellular reads: everything that did not classify to Vertebrata, other sequences (synthetic constructs), or Bacteria, and then PCR deduplicated via an alignment-free approach. Reads are contained in the deduplicated_reads_unaligned output, with the numeric count provided in read_counts_dedup.

The dehosted reads are typically the data set you will submit to SRA/ENA for public data release, and these can be used for downstream analyses whether alignment-based or assembly-based (in the latter case, you can skip any dehosting steps to save time and cost, since it has already been done here).
A FastQC plot is generated on this subset as well, because we have found over the years that metagenomic reads from direct patient specimens sometimes exhibit different RNA degradation/quality between host nucleic acids and RNA viral nucleic acids within the same sample; BioAnalyzer/TapeStation outputs and FastQC plots on the total raw data may obscure this difference if it exists. Our team does not always proactively look at these plots unless problems are encountered, but they are easy to generate and helpful if needed.
The acellular reads are often the “target” data of interest (unless you are interested in the bacteria) and are often a very small fraction of the input data. These subsetted BAM files are often small enough to download and work with in local analysis or visualization tools, and would be the appropriate input for SPAdes and other focused analysis steps.
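Both subsetting rules above boil down to excluding reads whose classification falls within certain clades. A toy sketch of that logic, with hypothetical reads and a hand-written lineage lookup (the real pipeline resolves lineages from the NCBI taxdump, not a hard-coded table):

```python
# Hypothetical lineage lookup: taxon -> set of ancestors (including itself)
LINEAGE = {
    "Homo sapiens": {"Homo sapiens", "Vertebrata", "Eukaryota"},
    "Escherichia coli": {"Escherichia coli", "Bacteria"},
    "Piry virus": {"Piry virus", "Viruses"},
    "unclassified": {"unclassified"},
}

def subset_reads(classified, exclude_clades):
    """Keep reads whose classification does NOT fall within any excluded clade."""
    return [read for read, taxon in classified
            if not (LINEAGE[taxon] & exclude_clades)]

reads = [("r1", "Homo sapiens"), ("r2", "Escherichia coli"),
         ("r3", "Piry virus"), ("r4", "unclassified")]

# Dehosted: drop anything under Vertebrata
dehosted = subset_reads(reads, {"Vertebrata"})
# Acellular: additionally drop Bacteria and synthetic constructs
acellular = subset_reads(reads, {"Vertebrata", "Bacteria", "other sequences"})
print(dehosted)   # ['r2', 'r3', 'r4']
print(acellular)  # ['r3', 'r4']
```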
The classify_single workflow will additionally take the subset of reads classified as acellular (all viral and unclassified reads, typically representing a small minority of the input data) and perform de novo assembly via SPAdes. The resulting contigs are provided in the contigs_fasta output column of the metagenomics table. You can download these fasta files from the table view; they should not be particularly large files (in this workshop’s data set, all of these files are less than 0.5MB).
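Once downloaded, these contig fasta files are small enough to inspect with a few lines of code, for example to compute contig lengths and an N50 (the toy fasta here is inlined for illustration; in practice you would read the downloaded file):

```python
def contig_lengths(fasta_text):
    """Return a list of contig lengths from FASTA-formatted text."""
    lengths, current = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest contig length such that contigs of at least that length
    cover half the total assembly."""
    half, running = sum(lengths) / 2, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy two-contig FASTA: one 1000 bp contig and one 300 bp contig
toy_fasta = ">contig1\nACGT" + "A" * 996 + "\n>contig2\n" + "C" * 300 + "\n"
lengths = contig_lengths(toy_fasta)
print(lengths, n50(lengths))  # [1000, 300] 1000
```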
For highly diverse viral taxa, k-mer based read classification will have sensitivity limitations, especially when utilizing RefSeq-only databases and/or if the full diversity of the species is not captured well in INSDC at all (see LASV results above). Custom kraken database builds can help, but not for more novel or uncharacterized viral taxa, which would require a different detection approach.
Utilizing de novo contigs instead of raw reads for detection provides more statistical power per sequence for distant matches via BLAST or BLAST-like approaches. This workshop does not go into detail on how to employ these approaches; however, the most common and simple next step researchers take for investigation is to run the contigs through NCBI BLAST against nt (blastn) or nr (blastx).
See also: