Input data
GATK-SV requires the following input data:
- Sequencing alignments in BAM or CRAM format that are:
- Short-read, paired-end Illumina (e.g. NovaSeq)
- Deep whole-genome coverage (~30x); RNA-seq and targeted (exome) libraries are not supported
- Indexed (have a companion .bai or .crai file)
- Aligned to hg38 with either GATK Best Practices and bwa-mem, or Illumina DRAGEN v3.4.12 or v3.7.8
- (Joint calling mode only) Family structure definitions file in PED format. This file is required even if your dataset does not contain related individuals.
Note that the supported alignment pipeline versions have been extensively tested for robustness and accuracy. While other versions of DRAGEN may work as well, they have not been validated with GATK-SV. We do not recommend mixing aligners within call sets.
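As a quick pre-flight check for the indexing requirement above, the following minimal sketch looks for an index file next to each alignment file. The function names are hypothetical, and the naming patterns (the index extension either appended to the file name or replacing its extension) are common samtools/htslib conventions assumed here, not something GATK-SV prescribes.

```python
from pathlib import Path


def candidate_index_paths(alignment_path):
    """Return the index paths conventionally used for a BAM or CRAM file."""
    path = Path(alignment_path)
    if path.suffix == ".bam":
        # Assumed conventions: sample.bam.bai or sample.bai
        return [Path(str(path) + ".bai"), path.with_suffix(".bai")]
    if path.suffix == ".cram":
        # Assumed conventions: sample.cram.crai or sample.crai
        return [Path(str(path) + ".crai"), path.with_suffix(".crai")]
    raise ValueError(f"Not a BAM or CRAM file: {alignment_path}")


def has_index(alignment_path):
    """True if any conventional index file exists next to the alignment file."""
    return any(p.exists() for p in candidate_index_paths(alignment_path))
```

This only works for local files; alignments stored in cloud buckets would need the equivalent check through the storage client in use.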
Sample Exclusion
We recommend filtering out samples with a high percentage of improperly paired or chimeric reads as technical outliers prior to running GatherSampleEvidence. Samples with high rates of anomalous reads may indicate issues with library preparation, degradation, or contamination and can lead to poor variant set quality. Samples failing these criteria often require longer run times and higher compute costs.
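The text above does not give numeric cutoffs, so the sketch below is only one possible way to flag technical outliers. It assumes the per-sample percentage of improperly paired or chimeric reads has already been computed with another tool, and it uses a median plus a multiple of the median absolute deviation (MAD) as the cutoff. Both the function name and the default multiplier are illustrative assumptions, not GATK-SV recommendations.

```python
from statistics import median


def flag_outliers(metric_by_sample, n_mads=5.0):
    """Return sample IDs whose metric exceeds median + n_mads * MAD.

    metric_by_sample maps sample ID -> percentage of anomalous reads.
    n_mads is an illustrative choice, not a validated threshold.
    """
    values = list(metric_by_sample.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that the outliers
    # themselves barely influence.
    mad = median(abs(v - center) for v in values)
    cutoff = center + n_mads * mad
    return sorted(s for s, v in metric_by_sample.items() if v > cutoff)
```

If most samples share an identical value, the MAD is zero and every sample above the median is flagged, so the result should be reviewed rather than applied blindly.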
Sample IDs
GATK-SV restricts sample names (IDs) in order to avoid parsing errors (e.g. with the use of the grep command). While future releases will remove some of these restrictions, users must currently modify their sample IDs to meet the following requirements.
Sample IDs must:
- Be unique within the cohort
- Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)
Sample IDs should not:
- Contain only numeric characters, e.g. 10004928
- Be a substring of another sample ID in the same cohort
- Contain any of the following substrings: chr, name, DEL, DUP, CPX, CHROM
The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.
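The requirements above can be checked before launching any workflow. The following is a minimal sketch, not part of GATK-SV. The function name and message strings are hypothetical, and the substring check is assumed to be case-sensitive, exactly as the substrings are written above.

```python
import re
from collections import Counter

# Substrings listed above as ones that IDs should not contain.
DISCOURAGED_SUBSTRINGS = ("chr", "name", "DEL", "DUP", "CPX", "CHROM")


def check_sample_ids(sample_ids):
    """Return {sample_id: [problems]} for IDs that break a rule above."""
    counts = Counter(sample_ids)
    problems = {}
    for sample_id, count in counts.items():
        found = []
        if count > 1:
            found.append("not unique")
        if not re.fullmatch(r"[A-Za-z0-9_]+", sample_id):
            found.append("illegal characters")
        if re.fullmatch(r"[0-9]+", sample_id):
            found.append("numeric only")
        for substring in DISCOURAGED_SUBSTRINGS:
            if substring in sample_id:  # assumed case-sensitive
                found.append(f"contains '{substring}'")
        if any(sample_id != other and sample_id in other for other in counts):
            found.append("substring of another ID")
        if found:
            problems[sample_id] = found
    return problems
```

For example, `check_sample_ids(["NA12878", "NA128"])` reports `NA128` as a substring of another ID and leaves `NA12878` unreported. The same function can be run on family, batch, and cohort IDs.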
Users should set sample IDs in GatherSampleEvidence with the sample_id input, which need not match the sample name defined in the BAM/CRAM header. GetSampleID.wdl can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline. Alternatively, this script transforms a list of sample IDs to fit these requirements.
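To illustrate what such a transformation might look like, here is a minimal sketch. It is not the script or the WDL referenced above, and its rules are assumptions: disallowed characters become underscores, and purely numeric IDs receive a hypothetical prefix. It does not resolve discouraged substrings or IDs that are substrings of one another.

```python
import re


def make_safe_id(raw_id, numeric_prefix="s"):
    """Rewrite an ID to contain only alphanumerics and underscores.

    numeric_prefix is a hypothetical choice for IDs that would
    otherwise be purely numeric.
    """
    safe = re.sub(r"[^A-Za-z0-9_]", "_", raw_id)
    if re.fullmatch(r"[0-9]+", safe):
        safe = numeric_prefix + safe
    return safe


def build_id_map(raw_ids):
    """Map raw IDs to safe IDs, failing if two raw IDs collide."""
    mapping = {}
    seen = {}  # safe ID -> raw ID that produced it
    for raw_id in raw_ids:
        safe = make_safe_id(raw_id)
        if safe in seen and seen[safe] != raw_id:
            raise ValueError(
                f"{raw_id!r} and {seen[safe]!r} both map to {safe!r}"
            )
        seen[safe] = raw_id
        mapping[raw_id] = safe
    return mapping
```

Keeping the raw-to-safe mapping, rather than only the transformed list, makes it possible to translate results back to the original identifiers later.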
Sample IDs can be replaced again in GatherBatchEvidence. To do so, set the parameter rename_samples = True and provide updated sample IDs via the samples parameter.
Note that the following inputs will need to be updated with the transformed sample IDs:
- Sample ID list for GatherSampleEvidence or GatherBatchEvidence
- PED file
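Updating the PED file means rewriting the individual, paternal, and maternal ID columns consistently. The sketch below assumes a dictionary from old to new sample IDs and the standard PED column order (family, individual, father, mother, sex, phenotype). The function name is hypothetical, and family IDs are left untouched.

```python
def rename_ped_line(line, id_map):
    """Apply id_map to the individual, paternal, and maternal ID columns."""
    if line.startswith("#") or not line.strip():
        return line  # keep header and blank lines unchanged
    fields = line.rstrip("\n").split("\t")
    for column in (1, 2, 3):  # individual, father, mother
        # IDs absent from id_map (including 0 for a missing parent)
        # are left as they are.
        fields[column] = id_map.get(fields[column], fields[column])
    return "\t".join(fields) + "\n"
```

Applied line by line, this keeps parent references in step with the renamed individuals, which matters because a renamed child whose parents keep their old IDs would no longer link to them.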
PED file format
The PED file format is described here. Note that GATK-SV imposes additional requirements:
- The file must be tab-delimited.
- The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown. Sex chromosome aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
- All family, individual, and parental IDs must conform to the sample ID requirements.
- Missing parental IDs should be entered as 0.
- Header lines are allowed if they begin with a # character.
- To validate the PED file, you may use src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list.
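For intuition about what the format rules above mean in practice, here is a minimal per-line check. It is a sketch written for illustration and is not the validate_ped.py script, whose actual checks may differ. It assumes the standard six leading PED columns and only tests the character set of each ID, not the other sample ID requirements.

```python
import re

ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_ped_line(line):
    """Return a list of problems found on one PED line (empty if none)."""
    if line.startswith("#"):
        return []  # header lines are allowed
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return ["fewer than 6 tab-delimited columns"]
    family, individual, father, mother, sex = fields[:5]
    problems = []
    labelled = (("family", family), ("individual", individual),
                ("paternal", father), ("maternal", mother))
    for label, value in labelled:
        # A missing parent entered as 0 passes this check.
        if not ID_PATTERN.fullmatch(value):
            problems.append(
                f"{label} ID is empty or has illegal characters: {value!r}"
            )
    if sex not in ("0", "1", "2"):
        problems.append(f"sex must be 0, 1, or 2, found {sex!r}")
    return problems
```

A space-delimited line fails the column count check, which is the most common way a PED file exported from a spreadsheet or another tool violates the tab-delimited rule.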