Input data

GATK-SV requires the following input data:

Sequencing alignments in BAM or CRAM format that are:
- Short-read, paired-end Illumina (e.g. Novaseq)
- Deep whole-genome coverage (~30x); RNA-seq and targeted (exome) libraries are not supported
- Indexed (have a companion .bai or .crai file)
- Aligned to hg38 with either GATK Best Practices and bwa-mem, or Illumina DRAGEN v3.4.12 or v3.7.8
(Joint calling mode only) Family structure definitions file in PED format. This file is required even if your dataset does not contain related individuals.

Note that the supported alignment pipeline versions have been extensively tested for robustness and accuracy. While other versions of DRAGEN may work as well, they have not been validated with GATK-SV. We do not recommend mixing aligners within call sets.

Sample Exclusion

We recommend filtering out samples with a high percentage of improperly paired or chimeric reads as technical outliers prior to running GatherSampleEvidence. Samples with high rates of anomalous reads may indicate issues with library preparation, degradation, or contamination and can lead to poor variant set quality. Samples failing these criteria often require longer run times and higher compute costs.

Sample IDs

GATK-SV imposes certain restrictions on sample names (IDs) in order to avoid certain parsing errors (e.g. with the use of the grep command). While future releases will obviate some of these restrictions, users must modify their sample IDs according to the following requirements.

Sample IDs must:

Be unique within the cohort
Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)

Sample IDs should not:

Contain only numeric characters, e.g. 10004928
Be a substring of another sample ID in the same cohort
Contain any of the following substrings: chr, name, DEL, DUP, CPX, CHROM

The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs. This script can be used to transform a list of sample IDs to meet safe ID requirements.

Users should assign sample IDs in GatherSampleEvidence with the sample_id input, which needs not match the sample name defined in the BAM/CRAM header. Alternatively, sample IDs can be replaced again in GatherBatchEvidence by setting the parameter rename_samples = True and providing updated sample IDs via the samples parameter.

Note that following inputs will need to be updated with the transformed sample IDs:

Sample ID list for GatherSampleEvidence or GatherBatchEvidence
PED file

PED file format

The PED file format is described here. Note that GATK-SV imposes additional requirements:

The file must be tab-delimited.
The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown. Sex chromosome aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
All family, individual, and parental IDs must conform to the sample ID requirements.
Missing parental IDs should be entered as 0.
Header lines are allowed if they begin with a # character.
To validate the PED file, you may use src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list.

Sample Exclusion​

Sample IDs​

Sample IDs must:​

Sample IDs should not:​

PED file format​

Sample Exclusion

Sample IDs

Sample IDs must:

Sample IDs should not:

PED file format