
Single-sample

Introduction

Extending SV detection to small datasets

The Single Sample pipeline is designed to facilitate running the methods developed for the cohort-mode GATK-SV pipeline on small datasets, or in clinical contexts where batching large numbers of samples is not an option. To do so, it uses precomputed data, SV calls, and model parameters generated by the cohort pipeline on a reference panel of similar samples. The pipeline integrates this precomputed information with signals extracted from the input CRAM file to produce, in a computationally tractable and reproducible manner, a call set similar in quality to results from the cohort pipeline.

Terra workspace

Users should clone the Terra single-sample workspace which is configured with a demo sample to run on. Refer to the following sections for instructions on how to run the pipeline on your data using this workspace.

Data

Case sample

The default workspace includes an NA12878 input WGS CRAM file configured in the workspace samples data table. This file is part of the high-coverage (30X) WGS data for the 1000 Genomes Project samples generated by the New York Genome Center and hosted in AnVIL.

Reference panel

The reference panel configured in the workspace consists of data and calls computed from 156 publicly available samples chosen from the NYGC/AnVIL 1000 Genomes high-coverage data described above.

Inputs to the pipeline for the reference panel include:

  • A precomputed SV callset VCF and joint-called depth-based CNV call files
  • Raw calls for the reference panel samples from Manta and Wham
  • Trained models for calling copy number variation with GATK gCNV in case mode
  • Parameters learned by the cohort-mode pipeline while training machine learning models on the reference panel samples

These resources are primarily configured in the "Workspace Data" for this workspace. However, several of the resources need to be passed to the workflow as large lists of files or strings. Due to Terra limitations on uploading data containing lists to the workspace data table, these resources are specified directly in the workflow configuration.

Reference resources

The pipeline uses a number of resource and data files computed for the hg38 reference:

  • Reference sequences and indices
  • Genome annotation tracks such as segmental duplication and RepeatMasker tracks
  • Data used for annotation of called variants, including GENCODE gene annotations and gnomAD site allele frequencies

Single sample workflow

What does it do?

The gatk-sv-single-sample workflow calls structural variants on a single input CRAM or BAM by running the GATK-SV Single Sample Pipeline end-to-end.

What does it require as input?

The workflow accepts a single CRAM or BAM file as input, configured in the following parameters:

| Input Type | Input Name | Description |
|---|---|---|
| String | sample_id | Case sample identifier |
| File | bam_or_cram_file | Path to the GCS location of the input CRAM or BAM file |
| String | batch | Arbitrary name to be assigned to the run |
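For orientation, the three case-sample parameters above could be supplied in a Cromwell/Terra-style inputs JSON like the sketch below. The GATKSVPipelineSingleSample namespace and the gs:// bucket path are illustrative assumptions, not values taken from this workspace:

```json
{
  "GATKSVPipelineSingleSample.sample_id": "SAMPLE-001",
  "GATKSVPipelineSingleSample.bam_or_cram_file": "gs://my-bucket/SAMPLE-001.cram",
  "GATKSVPipelineSingleSample.batch": "my_single_sample_run"
}
```

In the Terra workspace these values are normally drawn from the samples data table rather than typed by hand.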

Additional workspace-level inputs

  • Reference resources for hg38
  • Input data for the reference panel
  • The set of docker images used in the pipeline.

Please contact GATK-SV developers if you are interested in customizing these inputs beyond their defaults.

What does it return as output?

| Output Type | Output Name | Description |
|---|---|---|
| File | final_vcf | SV VCF output for the pipeline. Includes all sites genotyped as variant in the case sample and genotypes for the reference panel. Sites are annotated with overlap of functional genome elements and allele frequencies of matching variants in gnomAD. |
| File | final_vcf_idx | Index file for final_vcf |
| File | final_bed | Final output in BED format. Filter status, list of variant samples, and all VCF INFO fields are reported as additional columns. |
| File | metrics_file | Metrics computed from the input data and intermediate and final VCFs. Includes metrics on the SV evidence and on the number of variants called, broken down by type and size range. |
| File | qc_file | Quality-control check file. This extracts several key metrics from the metrics_file and compares them to pre-specified threshold values. If any QC checks evaluate to FAIL, further diagnostics may be required. |
| File | ploidy_matrix | Matrix of contig ploidy estimates computed by GATK gCNV |
| File | ploidy_plots | Plots of contig ploidy generated from ploidy_matrix |
| File | non_genotyped_unique_depth_calls | VCF of depth-based calls made in the case sample that did not pass genotyping checks and do not match a depth-based call from the reference panel. If very high sensitivity is desired, examine this file for additional large CNV calls. |
| File | non_genotyped_unique_depth_calls_idx | Index file for non_genotyped_unique_depth_calls |
| File | pre_cleanup_vcf | VCF output in a representation used internally in the pipeline. This file is less compliant with the VCF spec and is intended for debugging purposes. |
| File | pre_cleanup_vcf_idx | Index file for pre_cleanup_vcf |
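Once the run completes, a quick way to summarize final_vcf is to tally records by SVTYPE. The sketch below builds a tiny, fabricated stand-in VCF (space-delimited for readability; real VCFs are tab-delimited) so the commands are self-contained; in practice you would point the pipeline at the downloaded final_vcf instead:

```shell
# Fabricated stand-in for final_vcf -- not real pipeline output.
cat > toy.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 10000 sv1 N <DEL> . PASS SVTYPE=DEL;SVLEN=-500
chr1 20000 sv2 N <DUP> . PASS SVTYPE=DUP;SVLEN=1200
chr2 30000 sv3 N <DEL> . LOW_QUAL SVTYPE=DEL;SVLEN=-50
EOF

# Count PASS records per SVTYPE, pulled from the INFO column ($8).
grep -v '^#' toy.vcf \
  | awk '$7 == "PASS" { match($8, /SVTYPE=[^;]+/);
         print substr($8, RSTART + 7, RLENGTH - 7) }' \
  | sort | uniq -c
```

For real data, `bcftools view -f PASS` followed by a similar tally is the more conventional route, assuming bcftools is installed.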

Example time and cost run on sample data

| Sample Name | Sample Size | Time | Cost |
|---|---|---|---|
| NA12878 | 18.17 GiB | 23 hrs | ~$7.34 |

To use this workflow on your own data

If you would like to run this workflow on your own samples (which must be medium-to-high coverage WGS data):

  • Clone the workspace into a Terra project you have access to. Select us-central1 for the region. If you must use a different region, you will need to copy all GATK-SV docker images to the other region before running the pipeline. See the docker images section for details.
  • In the cloned workspace, upload rows describing your samples to the Sample and (optionally) Participant data tables. Ensure that each row you add to the Sample table has the sample_id and bam_or_cram_file columns populated appropriately.
  • There is no need to modify values in the workspace data or method configuration. If you are interested in modifying the reference genome resources or reference panel, please contact the GATK team for support as listed below.
  • Launch the workflow from the "Workflows" tab, selecting your samples as the inputs.
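The Sample table rows can be uploaded as a Terra load file: a TSV whose first column header follows Terra's entity:sample_id convention. A minimal sketch, in which the bucket path is a placeholder:

```tsv
entity:sample_id	bam_or_cram_file
SAMPLE-001	gs://my-bucket/SAMPLE-001.cram
```

Additional columns in the TSV become additional attributes on each Sample row.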

Quality control

Please check the qc_file output to screen for data quality issues. This file provides a table of quality metrics and suggested acceptable ranges.

warning

If any metric falls outside the recommended range, all variants in the output VCF will automatically be flagged with a non-passing FILTER status.
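The exact layout of qc_file is not specified here; assuming a whitespace-delimited table with metric, value, and status columns (an illustrative guess built below as a toy file), a quick screen for failing checks might look like:

```shell
# Toy stand-in for qc_file with an assumed metric/value/status layout.
cat > toy_qc.tsv <<'EOF'
metric value status
mean_insert_size 412 PASS
nondiploid_contigs 1 FAIL
snv_count 4012345 PASS
EOF

# List any metrics whose status is FAIL; adjust the column index
# if the real qc_file uses a different layout.
awk '$3 == "FAIL" { print $1 }' toy_qc.tsv
```

An empty result means every check passed; any printed metric names warrant further diagnostics before trusting the call set.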