viral-workshops

Viral read alignment, variant calling, and consensus calling

This is a walkthrough demonstrating alignment, variant calling, and consensus calling workflows on the Terra cloud platform on Ebola virus Illumina data.

Contents

  1. Viral read alignment, variant calling, and consensus calling
    1. Description of data set
    2. Description of workflows
    3. Terra workspace setup
    4. Analysis walkthrough
      1. Clone the workspace
      2. Run align_and_plot
      3. Run assemble_refbased
      4. Wait for job completion
    5. Evaluating results
      1. Metrics
      2. Coverage plot outputs
    6. Other related resources

Description of data set

The data comes from febrile Ebola patients in Sierra Leone. The data and findings are described in Park, et al, 2015 and deposited in NCBI SRA and Genbank under PRJNA257197.

This exercise focuses on data from four samples (SRR1972917, SRR1972918, SRR1972919, SRR1972920) of a wide range of sequencing quality (SRR1972920 is relatively poor coverage with much of the genome uncovered, SRR1972918 is very deeply covered). Each is 101bp paired end Illumina data sequenced from libraries generated on patient blood samples via metagenomic RNA-seq laboratory approaches. All patients were sampled in 2014 from Sierra Leone, and all genomes belong to the SL3 clade of the Makona variant of Zaire ebolavirus. This exercise will utilize the de facto standard Makona C15 reference genome (KJ660346.2, March 2014, Guinea) for alignments and variant calling.

Description of workflows

The exercise will utilize workflows from the Broad Viral Genomics group. This will include the read alignment pipeline (Dockstore link, GitHub source, Manual) and the reference based consensus calling pipeline (Dockstore link, GitHub source, Manual) and practice execution on the Terra platform, but these pipelines are known to work on other cloud platforms as well as on the command line using miniWDL.

The align_and_plot workflow simply aligns reads to a reference (using minimap2, bwa, or novoalign), creates coverage plots, and calculates various metrics of interest (number of aligned reads, etc).

The assemble_refbased workflow performs the same alignment to a reference, optionally trims the alignments for primers if provided a bed file (for sequencing protocols involving PCR amplification followed by tagmentation), produces plots and metrics, and calls a consensus assembly and intrahost variants.

Terra workspace setup

For simplicity, we have already loaded in the read data and the reference genomes. If you are starting from scratch on a new organism, what we did to populate these workspaces was:

  1. Imported the following workflows from the Broad Institute Viral Genomics Dockstore collection: fetch_sra_to_bam, fetch_annotations, align_and_plot, assemble_refbased.
  2. Created a Terra table called ebov with the four SRA accessions:
entity:ebov_id sra_accession biosample_accession
G5723.1 SRR1972917 SAMN03254208
G5731.1 SRR1972918 SAMN03254209
G5732.1 SRR1972919 SAMN03254210
G5735.2 SRR1972920 SAMN03254213
  1. Ran the fetch_sra_to_bam workflow on all rows of the ebov table to download reads from all four SRA accessions, populating more columns of the table with raw reads and basic run/sample metadata.
  2. Ran the fetch_annotations workflow (on file paths, not data tables) to download the reference genome (KJ660346.2) and manually added a pointer to the output fasta file in the Workspace Data table as workspace.ref_genome_ebov.

The above steps do not take very long (a few minutes here and there) but were not worth spending the time on in this workshop. But these steps are generalizable to any scenario or organism where you want to align SRA reads against a Genbank reference genome.

Analysis walkthrough

Clone the workspace

A workspace for these exercises has been created in advance, and contains the required input data organized into distinct tables. The tables can be explored from the “Data” tab.

Before beginning the exercise, the pre-made workspace will be copied to a “clone” that will be yours to use for these exercises. Using a cloned workspace will ensure that the compute jobs and their outputs you see are yours alone.

image

image

Run align_and_plot

Click on the Workflows tab on the top. This should lead to a list of analysis workflows that have already been preloaded into your workspace. One of them is align_and_plot. Click on align_and_plot.

This will lead to a workflow launch page where you will need to set the parameters and inputs before launching your analyses. Make sure to set the following:

The resulting workflow launch page should look like this when you are ready:

image

Click “RUN ANALYSIS” (which should be dark blue if you’ve filled in all inputs properly). This will take you to a job submission status page for your newly launched workflow, showing four rows in the bottom table corresponding to the four jobs that have been launched.

Run assemble_refbased

Click on the Workflows tab on the top. This should lead to a list of analysis workflows that have already been preloaded into your workspace. One of them is assemble_refbased. Click on assemble_refbased.

This will lead to a workflow launch page where you will need to set the parameters and inputs before launching your analyses. Make sure to set the following:

The resulting workflow launch page should look like this when you are ready (optional input not shown here, as it is a page or two down below):

image

Click “RUN ANALYSIS” (which should be dark blue if you’ve filled in all inputs properly). This will take you to a job submission status page for your newly launched workflow, showing four rows in the bottom table corresponding to the four jobs that have been launched.

If you are running this workflow on PCR amplicon sequencing libraries, you would also need to supply the following inputs (these do not apply to this workshop’s data set, which is metagenomic):

Wait for job completion

You will receive an email when each of your submissions complete (along with information about whether they succeeded or failed). Additionally, you can also click on the JOB HISTORY tab at the top of your workspace to check on the status of your analyses in progress. When both assemble_refbased and align_and_plot are complete, you can move on to evaluating the results. The Job History tab should look like this when the submissions are complete:

image

Depending on some predictable and some unpredictable factors, the align_and_plot jobs should complete within 15 minutes and the assemble_refbased within 30, but they may take much longer (2-3x longer). No user intervention is required while you await results, and no connectivity or power is required at the client side during this time. The runtime should be somewhat independent of whether you launched jobs on 1 or 1,000 samples.

Evaluating results

You can examine the outputs and results of each step of each job via the Job History page, however, for large submissions, it is easier to view the saved top level outputs in the data table–in this case, the ebov table. The ebov table now has a number of additional output columns, including aligned BAM files, coverage plots, consensus genomes (FASTA), variants (VCF), and various numeric counts and metrics.

Metrics

align_and_plot produces a few key outputs of interest:

Typically the aligned read counts and coverage plots are looked at first.

assemble_refbased produces a few key outputs of interest:

Typically, the assembly_fasta is used for most downstream analyses (phylogenetic, typing, species-specific characterization) as well as data release and sharing. The other outputs listed above are used to evaluate sample data quality and filter which assemblies are included in downstream analysis.

Coverage plot outputs

The numeric counts and metrics give a holistic summary of the sequencing performance of any sample, but inspecting coverage plots can reveal any biases or issues at certain parts of the target genome and how much of that genome was recovered. Coverage plots are generated by assemble_refbased in two output files: a visual plot (PDF format) is provided in the align_to_ref_merged_coverage_plot output column. A two-column tab-delimited table indicating read depth coverage at each genomic position is provided in the align_to_ref_merged_coverage_tsv output column – this table is used to generate the PDF plot, and can be used to regenerate such plots using your preferred plotting software. The plots provided by this pipeline are after alignment, primer trimming (optional/if applicable), and PCR deduplication (unless disabled), and should exactly match the alignments in the aligned BAM file align_to_ref_merged_aligned_trimmed_only_bam.

You can click on the live links for any file element in the Terra data table and download them, or preview them in your browser. As an example, click on the SRR1972917.coverage_plot.pdf live link in the coverage_plot column of the G5723.1 row of the ebov data table. This will open a “File Details” box where you can access this file three different ways:

  1. Click the blue DOWNLOAD button to download it to your computer. You can then open the downloaded file in a browser to view.
  2. If you have the gcloud API CLI installed in a command-line environment, and copy and paste the gcloud storage cp command to your terminal, the file will be downloaded that way.
  3. If you click the “View this file in the Google Cloud Storage Browser”, an external page will be opened in the web browser showing the file where it resites in its Google Storage bucket. This directory will contain many other files, but look for the link to the file you were looking for originally (SRR1972917.coverage_plot.pdf), and click it. An “Object details” page will open with information and several links for this particular file. Clicking the “Authenticated URL” link will open the krona plot in your web browser.

Repeat the above steps for all four results to open in separate tabs.

Below are what the outputs should look like for our four EBOV genomes. The x-axis coordinates match the full reference genome provided to assemble_refbased. The y-axis is autoscaled and differs for each plot (by orders of magnitude in this case). You can observe that SRR1972918 has extremely high coverage, SRR1972920 has extremely poor coverage, and all samples struggled with some difficult sequence content around reference position 2400.

image image image image

See also: