viral-workshops

Sequencing run demultiplexing

This is a walkthrough demonstrating demultiplexing of data from a viral sequencing run of pooled sequencing libraries.

Contents

  1. Sequencing run demultiplexing
    1. Inspect and upload the sample sheet for the run to be demultiplexed
      1. Sample sheet columns required for demultiplexing
      2. Additional sample sheet columns helpful for data submission
        1. Sample sheet columns for commonly used library-preparation protocols
      3. Additional sample sheet columns helpful for internal QC checks
    2. Import the demux_deplete workflow
    3. Configure and execute the demux_deplete workflow
    4. Inspect the output of demultiplexing
      1. Demultiplexing metrics
      2. MultiQC reports
      3. Spike-in read counts for evaluating cross-talk or contamination between samples
      4. Data files

Inspect and upload the sample sheet for the run to be demultiplexed

View the sample sheet file provided, called SampleSheet_reprep_06_ss2.tsv [1].

Sample sheet columns required for demultiplexing

For the demux_deplete workflow, a sample sheet supplies the sample-specific index information needed to assign reads to output files for individual samples in a pool of sequencing libraries. The sample sheet is formatted as a text file with tab-separated values (tsv).

The sample sheet must have the following required columns:

sample    barcode_1    barcode_2    library_id_per_sample

If a sequencing run includes the same pool of libraries across multiple flowcell lanes, demux_deplete outputs a single file per sample name that contains multiple read groups (sam/bam/cram @RG lines), one for each lane. Each read group is identified by a unique combination of the flowcell ID, the library ID, and the lane number (stored as the LB value in a sam- or bam-formatted file).
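The required columns listed above can be checked before a sheet is uploaded. The following is a minimal sketch, not part of demux_deplete: the function name check_sample_sheet and the specific checks (missing columns, empty sample names, repeated index pairs within one sheet) are assumptions chosen for illustration, and the workflow may apply different or additional rules.

```python
import csv
import io

# Column names taken from the required-columns list in this walkthrough.
REQUIRED_COLUMNS = ("sample", "barcode_1", "barcode_2", "library_id_per_sample")


def check_sample_sheet(tsv_text):
    """Return a list of problems found in a tab-separated sample sheet.

    Hypothetical helper for illustration; it is not the validation that
    demux_deplete itself performs.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems

    seen_index_pairs = {}
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        sample = (row["sample"] or "").strip()
        if not sample:
            problems.append(f"line {line_no}: empty sample name")
        # Two rows sharing both indices cannot be told apart within one lane.
        pair = (
            (row["barcode_1"] or "").strip().upper(),
            (row["barcode_2"] or "").strip().upper(),
        )
        if pair in seen_index_pairs:
            problems.append(
                f"line {line_no}: index pair already used on line {seen_index_pairs[pair]}"
            )
        else:
            seen_index_pairs[pair] = line_no
    return problems
```

Calling check_sample_sheet(open(path).read()) on a local copy of the sheet returns an empty list when none of these checks fail.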

Additional sample sheet columns helpful for data submission

Additional columns can be included to carry metadata that is helpful for downstream processes and for creating the files needed to submit data to public databases (e.g. NCBI BioSample and SRA):

library_strategy    library_source    library_selection

The values entered for the library_strategy, library_source and library_selection fields must conform to a controlled vocabulary of terms specified by NCBI; see the “Library and Platform Terms” tab of the SRA metadata submission template (MS Excel *.xlsx file) for a current list of valid values.

Sample sheet columns for commonly used library-preparation protocols

For metagenomic sequencing, these three values should be:

library_strategy    library_source    library_selection
RNA-Seq             VIRAL RNA         cDNA

For amplicon sequencing, these values should be:

library_strategy    library_source    library_selection
AMPLICON            VIRAL RNA         PCR
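The two sets of values above can be applied to every row of a sample sheet in one step. This is a rough sketch under stated assumptions: the protocol keys "metagenomic" and "amplicon" and the function add_library_terms are names invented here, and only the values given in this walkthrough are included, not the full NCBI vocabulary.

```python
# Values copied from the two tables in this walkthrough; the dictionary keys
# ("metagenomic", "amplicon") are labels made up for this sketch.
LIBRARY_TERMS = {
    "metagenomic": {
        "library_strategy": "RNA-Seq",
        "library_source": "VIRAL RNA",
        "library_selection": "cDNA",
    },
    "amplicon": {
        "library_strategy": "AMPLICON",
        "library_source": "VIRAL RNA",
        "library_selection": "PCR",
    },
}


def add_library_terms(rows, protocol):
    """Return copies of sample sheet rows with the three submission columns set."""
    if protocol not in LIBRARY_TERMS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    terms = LIBRARY_TERMS[protocol]
    # Build new dicts so the caller's rows are left unmodified.
    return [{**row, **terms} for row in rows]
```

For example, add_library_terms([{"sample": "S1"}], "amplicon") yields a row whose library_strategy is AMPLICON and whose library_selection is PCR.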

Additional sample sheet columns helpful for internal QC checks

amplicon_set    control    spike_in    viral_ct    batch_lib

A template sheet is available here

In the Data tab, click Files on the left-hand pane. If a folder called samplesheets/ does not exist, click New folder and create a folder called samplesheets [2]. Upload the sample sheet TSV provided, flowcell_data.tsv. Once uploaded, right-click the uploaded file in Terra, and click Copy Link Address to copy the full path to the clipboard.

Click the flowcell table in the left-hand pane. Find the column called samplesheets, hover over the cell, and click the pencil icon to edit the samplesheet value(s) for the row present. In each of the four list entries, paste and replace the placeholder values with the full path copied in the previous step.

Import the demux_deplete workflow

The workflow used here for demultiplexing pooled sequence libraries, demux_deplete, is listed on Dockstore, a registry of published bioinformatics workflows.

Navigate to https://dockstore.org, and either search for the workflow by name or navigate to it by clicking Organizations, then Broad Institute of MIT and Harvard, and finally Viral Genomics. In the list of workflows shown, scroll, locate, and click on the workflow named broadinstitute/viral-pipelines/demux_deplete.

On the page for demux_deplete, buttons are present on the right side of the page below Launch with to import the workflow for execution on one of several bioinformatics platforms.

Click the Terra button. Terra will display a page for importing the workflow. In the drop-down menu, select the destination workspace, and click Import. This adds demux_deplete to the group of pipelines listed under the Workflows tab of the workspace. When a workflow is first imported, Terra immediately opens its configuration settings.

Configure and execute the demux_deplete workflow

Access the demux_deplete workflow by clicking on the Workflows tab, and then the demux_deplete workflow.

Leave the Version drop-down menu at its default value, and click the radio button labeled Run workflow(s) with inputs defined by data table.

In the drop-down menu to the right of Select root entity type:, select flowcell, and click the Select Data button. Select the only row currently present in the flowcell table. Click OK.

Note that in the demux_deplete workflow, the samplesheets input field accepts a list of files. The list of sample sheet files provided should include one sample sheet per flowcell lane. If the lanes contain differing pools of sequencing libraries, the sample sheet files should be listed in the same order as the lanes. If the lanes contain the same pools, a single sample sheet should be listed multiple times, once per lane to demultiplex.
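The lane-ordering rule described above can be sketched as a small helper. The function name samplesheets_for_lanes and the gs:// path in the usage example are hypothetical; the sketch only illustrates the "one entry per lane" rule and is not how Terra or demux_deplete builds the list.

```python
import json


def samplesheets_for_lanes(sheet_paths, lane_count):
    """Return one sample sheet path per flowcell lane, in lane order.

    A single path means every lane holds the same pool, so it is repeated
    once per lane. Otherwise the caller must supply exactly one path per
    lane, already ordered by lane number.
    """
    if lane_count < 1:
        raise ValueError("lane_count must be at least 1")
    if len(sheet_paths) == 1:
        return list(sheet_paths) * lane_count
    if len(sheet_paths) != lane_count:
        raise ValueError(
            f"expected 1 or {lane_count} sample sheets, got {len(sheet_paths)}"
        )
    return list(sheet_paths)


if __name__ == "__main__":
    # Hypothetical bucket path, shown only to illustrate the shape of the list.
    same_pool = samplesheets_for_lanes(
        ["gs://example-bucket/samplesheets/SampleSheet.tsv"], lane_count=4
    )
    print(json.dumps(same_pool, indent=2))
```

With a four-lane flowcell carrying one pool, the result is the same path listed four times, which matches the four list entries edited in the flowcell table earlier in this walkthrough.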

Demultiplexing jobs are executed in parallel for all lanes of a flowcell for which sample sheets are present, and the resulting sequence reads are merged on a per-library basis.

Configure the following workflow inputs to use data from the rows selected from the flowcell table:

Configure the following workflow inputs to reference the databases referenced in the workflow data table; these will be used to remove human reads from the data and to count the number of reads present that align to the spike-in sequences listed in the workspace.spikein_db file.

Set the following parameter so rows will be created for the demultiplexed data in the library and sample tables:

Click the Save button above the workflow input text boxes.

If input data have been selected and all of the required workflow inputs are specified, the Run button should be blue. Click the button to begin demultiplexing and depletion. A modal dialog will appear providing an opportunity to enter a comment about the compute job. Enter a comment if desired, and click Launch.

WAIT FOR DEMULTIPLEXING TO COMPLETE

Inspect the output of demultiplexing

If the input to demultiplexing is drawn from a row or rows in the flowcell table, outputs from demultiplexing will be added to the source row(s). This includes numeric metrics, file paths to files containing per-sample sequencing reads, and various metadata.

Demultiplexing metrics

The file listed in the demux_metrics column for a demultiplexed flowcell contains metrics from Picard’s IlluminaBasecallsToSam, including the number of reads per sample name. The metrics file also lists the sequencing indices associated with each sample in the sample sheet. A zero or near-zero read count for a given sample may indicate that the indices for that sample were entered incorrectly in the sample sheet and need to be corrected before demultiplexing again. The field demux_outlierBarcodes lists a file of abundant indices that were not included in the sample sheet; checking this file can suggest sample sheet corrections.
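Samples with zero or near-zero read counts can be picked out of a downloaded metrics file with a few lines of Python. This is only a sketch: it assumes the file is a tab-separated table preceded by "#" comment lines, and the column names BARCODE_NAME and READS as well as the threshold of 100 reads are assumptions that should be checked against the actual file.

```python
import csv
import io


def low_read_samples(metrics_text, name_column="BARCODE_NAME",
                     reads_column="READS", min_reads=100):
    """Return (sample, reads) pairs whose read count is below min_reads.

    The column names and the threshold are assumptions for this sketch;
    adjust them to match the header row of the real metrics file.
    """
    # Drop comment lines and blank lines so only the table remains.
    table_lines = [
        line for line in metrics_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    flagged = []
    for row in reader:
        reads = int(row[reads_column])
        if reads < min_reads:
            flagged.append((row[name_column], reads))
    return flagged
```

Any sample returned by this function is a candidate for a closer look at its barcode_1 and barcode_2 values in the sample sheet.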

MultiQC reports

The multiqc_report_raw and multiqc_report_cleaned columns list combined reports containing quality metrics from FastQC (and potentially other tools), for raw reads and human-depleted reads, respectively. These show base quality scores by position in the reads, quality as a function of flowcell location, and other metrics of read quality.

Spike-in read counts for evaluating cross-talk or contamination between samples

The spikein_counts column lists a file containing a table of read counts, per sample, for reads mapping to the known sequences of ERCC and SDSI synthetic controls. These controls are typically added to (“spiked into”) each sample in a pool early in the library preparation process, with a distinct spike-in for each sample. In the ideal case, the spikein_counts report should list a moderate read count for only one spike-in per sample. If a sample has reads mapping to multiple synthetic controls, that may indicate cross-talk or contamination between samples, or “index hopping”.
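The check described above, one spike-in per sample, can be sketched as follows. The layout of the spikein_counts file is not specified here, so this sketch starts from counts already loaded into a nested dictionary; the function name flag_spikein_crosstalk, the min_reads threshold, and the counts in the usage example are all invented for illustration.

```python
def flag_spikein_crosstalk(counts_by_sample, min_reads=10):
    """Return samples with reads for more than one spike-in.

    counts_by_sample maps sample name -> {spike-in name: read count}.
    A spike-in counts as detected when it has at least min_reads reads;
    the threshold is an arbitrary choice for this sketch.
    """
    flagged = {}
    for sample, counts in counts_by_sample.items():
        detected = sorted(name for name, n in counts.items() if n >= min_reads)
        if len(detected) > 1:
            flagged[sample] = detected
    return flagged


if __name__ == "__main__":
    # Made-up counts: S2 shows reads for two different spike-ins.
    example = {
        "S1": {"ERCC-00002": 850, "ERCC-00004": 0},
        "S2": {"ERCC-00002": 40, "ERCC-00004": 900},
    }
    print(flag_spikein_crosstalk(example))
```

A flagged sample is not proof of contamination; it marks where the raw counts deserve a manual look.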

Data files

The (unmapped) sequence reads from demux_deplete used for subsequent analysis are contained in per-sample *.bam files:

  1. A software-focused text editor is recommended for editing sample sheets, such as Visual Studio Code or Sublime Text. The bio-utils (VSCode) or “ACTG” (Sublime Text) add-ons may be helpful for viewing and manipulating index sequences. The rainbow_csv package (Sublime Text) enhances display of TSV files. 

  2. The samplesheets/ folder is used here to ease organization. The sample sheet files can be stored elsewhere as long as their full file paths are listed correctly in the table row(s) used as input for demultiplexing.