Resource files

This page describes common resource files used in the pipeline. All required files are publicly available in Google Cloud Storage buckets, and their URIs are listed in the hg38 resources JSON.
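As a minimal sketch of how a URI might be looked up by resource name, the snippet below parses a resources JSON and returns one entry. The JSON excerpt, the bucket name, and the helper `lookup_resource` are illustrative assumptions, not the actual file contents.

```python
import json

# Hypothetical excerpt; the real hg38 resources JSON holds the actual URIs.
EXAMPLE_JSON = """
{
  "genome_file": "gs://example-bucket/hg38.genome",
  "par_bed": "gs://example-bucket/hg38.par.bed"
}
"""


def lookup_resource(json_text, name):
    """Return the value stored under `name`, with a clear error if it is absent."""
    resources = json.loads(json_text)
    if name not in resources:
        raise KeyError(f"resource '{name}' not found in resources JSON")
    return resources[name]


print(lookup_resource(EXAMPLE_JSON, "par_bed"))
```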

Note: Some resources contain sensitive data and are stored in a secure bucket for development purposes. These files are not required to run GATK-SV.

Reference resources

allosome_file

Reference fasta index file containing only allosomal contigs.

autosome_file

Reference fasta index file containing only autosomal contigs.

bin_exclude

Block-compressed bed file of intervals to exclude in the call set.

cnmops_exclude_list

Plain text bed file of N-masked regions in the reference.

contig_ploidy_priors

Plain text TSV file of prior probabilities for contig ploidies, used by GATK-gCNV.
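A quick sanity check on such a table is that each contig's priors sum to 1. The sketch below assumes a header with a `CONTIG_NAME` column followed by `PLOIDY_PRIOR_<n>` columns; the column names, the example values, and the helper names are assumptions for illustration and should be checked against the actual file.

```python
import csv
import io

# Hypothetical example table; values are illustrative only.
EXAMPLE_TSV = (
    "CONTIG_NAME\tPLOIDY_PRIOR_0\tPLOIDY_PRIOR_1\tPLOIDY_PRIOR_2\tPLOIDY_PRIOR_3\n"
    "chr1\t0.01\t0.01\t0.97\t0.01\n"
    "chrX\t0.01\t0.49\t0.49\t0.01\n"
)


def read_ploidy_priors(tsv_text):
    """Map each contig name to its list of prior probabilities, in column order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    prior_columns = [c for c in reader.fieldnames if c.startswith("PLOIDY_PRIOR_")]
    return {row["CONTIG_NAME"]: [float(row[c]) for c in prior_columns] for row in reader}


def contigs_with_bad_priors(priors, tolerance=1e-6):
    """Return contigs whose priors do not sum to 1 within the tolerance."""
    return [contig for contig, probs in priors.items() if abs(sum(probs) - 1.0) > tolerance]


priors = read_ploidy_priors(EXAMPLE_TSV)
print(contigs_with_bad_priors(priors))
```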

cytobands

Block-compressed bed file of cytoband intervals.

sd_locs_vcf

Plain text VCF of SNP sites at which to collect site depth (SD) evidence.

depth_exclude_list

Block-compressed bed file of intervals over which to exclude overlapping depth-only calls.

empty_file

Empty file; used to satisfy some workflow code paths.

exclude_intervals_for_gcnv_filter_intervals

Plain text bed file of intervals to exclude from GATK-gCNV.

external_af_ref_bed

Block-compressed bed file of SV sites for external allele frequency annotation.

genome_file

Tab-delimited table of primary contigs with contig name in the first column and length in the second column.
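The two-column layout described above can be read with a few lines of code. This is a minimal sketch; the function name `parse_genome_file` and the example rows are hypothetical.

```python
def parse_genome_file(text):
    """Parse a tab-delimited contig table into a {contig_name: length} dict."""
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Illustrative rows only; read the real table from the genome_file resource.
example = "chr1\t248956422\nchr2\t242193529\n"
print(parse_genome_file(example))
```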

manta_region_bed

Block-compressed bed file of intervals to call with Manta.

mei_bed

Block-compressed bed file of mobile element insertions in the reference genome.

melt_std_vcf_header

Text file containing the VCF header for raw MELT calls.

noncoding_bed

Plain text bed file of non-coding elements in the reference genome.

par_bed

Plain text bed file of pseudoautosomal regions.

pesr_exclude_list

Block-compressed bed file of intervals for filtering calls. Variants generated with non-CNV tools (Dragen, Manta, MELT, Scramble, Wham) that have either end in any of these intervals are hard-filtered.
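The "either end" rule can be sketched as follows. This is an illustration of the logic described above, not the pipeline's implementation: the variant dictionary layout, the lowercase caller names, and the helper functions are assumptions. BED intervals are treated as 0-based half-open and variant positions as 1-based.

```python
# Callers named in the description above, lowercased for comparison (assumed spelling).
NON_CNV_CALLERS = {"dragen", "manta", "melt", "scramble", "wham"}


def position_in_intervals(contig, pos, intervals):
    """True if 1-based `pos` lies in any BED-style (contig, start, end) interval."""
    return any(c == contig and start < pos <= end for c, start, end in intervals)


def is_hard_filtered(variant, intervals):
    """Apply the either-end rule to a variant from a non-CNV caller."""
    if variant["caller"] not in NON_CNV_CALLERS:
        return False
    return (
        position_in_intervals(variant["contig"], variant["start"], intervals)
        or position_in_intervals(variant["end_contig"], variant["end"], intervals)
    )


exclude = [("chr1", 100, 200)]
variant = {"caller": "manta", "contig": "chr1", "start": 150, "end_contig": "chr1", "end": 5000}
print(is_hard_filtered(variant, exclude))
```

Note that a variant spanning an excluded interval without either breakpoint inside it is not filtered under this rule.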

preprocessed_intervals

Intervals for read count collection and CNV calling.

primary_contigs_fai

Reference fasta index file containing only the primary contigs, i.e. chr1, ..., chr22, chrX, and chrY.
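A subset like this could be derived from a full `.fai` index by keeping only the lines whose first column is a primary contig. The sketch below assumes the standard tab-delimited `.fai` layout with the contig name in the first column; the function name and example lines are hypothetical, and this is not necessarily how the published file was generated.

```python
PRIMARY_CONTIGS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def subset_fai_to_primary(fai_text):
    """Keep only .fai lines whose first column is a primary contig name."""
    keep = set(PRIMARY_CONTIGS)
    kept = [line for line in fai_text.splitlines() if line.split("\t")[0] in keep]
    return "".join(line + "\n" for line in kept)


# Illustrative index lines (name, length, offset, line bases, line width).
example_fai = (
    "chr1\t1000\t6\t60\t61\n"
    "chrM\t500\t1100\t60\t61\n"
    "chr1_KI270706v1_random\t300\t1700\t60\t61\n"
    "chrY\t800\t2100\t60\t61\n"
)
print(subset_fai_to_primary(example_fai), end="")
```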

primary_contigs_list

Text file of primary contig names.

contigs_header

Plain text VCF header section of primary contig sequences.
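For orientation, standard VCF contig header lines have the form `##contig=<ID=...,length=...>`. The sketch below builds such lines from (name, length) pairs; it assumes the resource consists of lines in this form, and the actual file may carry additional fields. The function name and example values are hypothetical.

```python
def contig_header_lines(contigs):
    """Build VCF ##contig header lines from (name, length) pairs."""
    return [f"##contig=<ID={name},length={length}>" for name, length in contigs]


for line in contig_header_lines([("chr1", 248956422), ("chr2", 242193529)]):
    print(line)
```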

protein_coding_gtf

Protein-coding sequence definitions for functional annotation in General Transfer Format. This GTF was created by subsetting the GENCODE GRCh38 basic gene annotation GTF with the script scripts/inputs/preprocess_gtf.py. Transcripts annotated as either Ensembl canonical or MANE Select Plus Clinical, and as either protein-coding or from nonsense-mediated decay, were retained. The GENCODE version is included in the filename.
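The retention rule described above can be illustrated with a small line filter. This is a sketch only and is not the `preprocess_gtf.py` script: the attribute spellings in `KEEP_TAGS` and `KEEP_TYPES` are assumptions about how GENCODE encodes these annotations and should be verified against the GTF in use. Records without a `transcript_type` attribute are dropped in this sketch.

```python
import re

KEEP_TAGS = {"Ensembl_canonical", "MANE_Plus_Clinical"}  # assumed tag spellings
KEEP_TYPES = {"protein_coding", "nonsense_mediated_decay"}  # assumed transcript_type values


def keep_gtf_line(line):
    """True for comment lines and for records matching both a kept tag and a kept type."""
    if line.startswith("#"):
        return True
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return False  # not a well-formed GTF record
    attributes = fields[8]
    tags = set(re.findall(r'\btag "([^"]+)"', attributes))
    types = re.findall(r'\btranscript_type "([^"]+)"', attributes)
    return bool(tags & KEEP_TAGS) and bool(types) and types[0] in KEEP_TYPES


def filter_gtf(lines):
    """Return only the lines that pass keep_gtf_line."""
    return [line for line in lines if keep_gtf_line(line)]
```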

reference_dict

Reference FASTA dictionary file (*.dict). See this article for more information.

reference_fasta

Reference FASTA file (*.fasta). See this article for more information.

reference_index

Reference FASTA index file (*.fasta.fai). See this article for more information.

rmsk

Block-compressed bed file of RepeatMasker intervals.

segdups

Block-compressed bed file of segmental duplication intervals.

seed_cutoffs

TSV of cutoff priors for genotyping.

single_sample_qc_definitions

TSV of recommended ranges for single-sample QC metrics.

wgd_scoring_mask

Plain text bed file of intervals used for whole-genome dosage (WGD) scoring, over which coverage bias is assessed.

wham_include_list_bed_file

Plain text bed file of intervals to call with Wham.

sl_cutoff_table

Cutoffs used in the genotype filtering model trained using data generated by All of Us.

aou_recalibrate_gq_model_file

Genotype filtering model trained using data generated by All of Us.

hgdp_recalibrate_gq_model_file

Genotype filtering model trained using data generated from the Human Genome Diversity Project.

recalibrate_gq_genome_tracks

List of block-compressed bed files, each containing intervals from a separate genome track.

Benchmarking datasets

ccdg_abel_site_level_benchmarking_dataset

Benchmarking variant set from Abel et al. 2020.

gnomad_v2_collins_sample_level_benchmarking_dataset

Benchmarking genotypes from gnomAD-SV-v2 (see Collins et al. 2020). Not public data.

gnomad_v2_collins_site_level_benchmarking_dataset

Benchmarking variant set from gnomAD-SV-v2 (see Collins et al. 2020).

hgsv_byrska_bishop_sample_level_benchmarking_dataset

Benchmarking genotypes from Byrska-Bishop et al. 2022.

hgsv_byrska_bishop_sample_renaming_tsv

Sample renaming manifest for the Byrska-Bishop benchmarking genotypes.

hgsv_byrska_bishop_site_level_benchmarking_dataset

Benchmarking variant set from Byrska-Bishop et al. 2022.

hgsv_ebert_sample_level_benchmarking_dataset

Benchmarking genotypes from Ebert et al. 2021.

ssc_belyeu_sample_level_benchmarking_dataset

Benchmarking genotypes from the Simons Simplex Collection, derived from Belyeu et al. 2021. Not public data.

ssc_belyeu_site_level_benchmarking_dataset

Benchmarking variant set from the Simons Simplex Collection, derived from Belyeu et al. 2021. Not public data.

ssc_sanders_sample_level_benchmarking_dataset

Benchmarking genotypes from the Simons Simplex Collection, derived from Sanders et al. 2015. Not public data.

thousand_genomes_site_level_benchmarking_dataset

Benchmarking variant set from the 1000 Genomes Project Phase 3 SV call set (see Sudmant et al. 2015).

asc_site_level_benchmarking_dataset

Benchmarking variant set from the Autism Spectrum Consortium. Not public data.

hgsv_site_level_benchmarking_dataset

Benchmarking variant set from Werling et al. 2018. Not public data.

collins_2017_sample_level_benchmarking_dataset

Benchmarking genotypes from Collins et al. 2017. Not public data.

sanders_2015_sample_level_benchmarking_dataset

Benchmarking genotypes from Sanders et al. 2015. Not public data.

werling_2018_sample_level_benchmarking_dataset

Benchmarking genotypes from Werling et al. 2018. Not public data.
