Resource files
This page contains descriptions of common resource files used in the pipeline. All required files are publicly available in Google Cloud Storage Buckets. URIs for these files are available in the hg38 resources json.
Some resources contain sensitive data and are stored in a secure bucket for development purposes. These files are not required to run GATK-SV.
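As a rough illustration of how the resource names on this page map to URIs, the sketch below loads a resources json and looks up one entry. It assumes the json is a flat object mapping resource names to URIs; the excerpt, the bucket path, and the helper `get_resource_uri` are hypothetical placeholders, not the actual file contents or part of GATK-SV.

```python
import json

# Hypothetical excerpt; the real hg38 resources json will contain different values.
EXAMPLE_RESOURCES_JSON = """
{
  "genome_file": "gs://example-bucket/hg38.genome",
  "par_bed": "gs://example-bucket/hg38.par.bed"
}
"""


def get_resource_uri(resources: dict, name: str) -> str:
    """Return the URI for a named resource, with a clear error if it is missing."""
    try:
        return resources[name]
    except KeyError:
        raise KeyError(f"Resource '{name}' not found in resources json") from None


resources = json.loads(EXAMPLE_RESOURCES_JSON)
print(get_resource_uri(resources, "par_bed"))
```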
Reference resources
allosome_file
Reference fasta index file containing only allosomal contigs.
autosome_file
Reference fasta index file containing only autosomal contigs.
bin_exclude
Block-compressed bed file of intervals to exclude in the call set.
cnmops_exclude_list
Plain text bed file of N-masked regions in the reference.
contig_ploidy_priors
Plain text TSV file of prior probabilities for contig ploidies, used by GATK-gCNV.
cytobands
Block-compressed bed file of cytoband intervals.
sd_locs_vcf
Plain text VCF of SNP sites at which to collect site depth (SD) evidence.
depth_exclude_list
Block-compressed bed file of intervals over which to exclude overlapping depth-only calls.
empty_file
Empty file; used to satisfy some workflow code paths.
exclude_intervals_for_gcnv_filter_intervals
Plain text bed file of intervals to exclude from GATK-gCNV.
external_af_ref_bed
Block-compressed bed file of SV sites for external allele frequency annotation.
genome_file
Tab-delimited table of primary contigs with contig name in the first column and length in the second column.
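The two-column layout described above can be read into a mapping of contig name to length, as in the minimal sketch below. The function name `read_genome_file` and the example contigs and lengths are made up for illustration; only the column order comes from the description.

```python
import io


def read_genome_file(handle):
    """Parse a tab-delimited table (contig name, length) into {contig: length}."""
    lengths = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Hypothetical two-contig table; names and lengths are invented.
example = io.StringIO("chrA\t1000\nchrB\t250\n")
contig_lengths = read_genome_file(example)
print(contig_lengths)
```

With a real file, the same function could be called on an open text handle, e.g. `with open(path) as f: read_genome_file(f)`.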
manta_region_bed
Block-compressed bed file of intervals to call with Manta.
mei_bed
Block-compressed bed file of mobile element insertions in the reference genome.
melt_std_vcf_header
Text file containing the VCF header for raw MELT calls.
noncoding_bed
Plain text bed file of non-coding elements in the reference genome.
par_bed
Plain text bed file of pseudoautosomal regions.
pesr_exclude_list
Block-compressed bed file of intervals for filtering calls. Variants generated with non-CNV tools (Manta, MELT, Scramble, Wham) that have either end in any of these intervals are hard-filtered.
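The either-end rule described above can be sketched as follows. This is an illustration only, not the pipeline's implementation: it assumes 0-based, half-open BED intervals, represents each variant end as a hypothetical `(contig, position)` pair, and uses a simple linear scan rather than an indexed lookup.

```python
from collections import defaultdict


def load_intervals(bed_lines):
    """Group BED intervals by contig as lists of (start, end) tuples."""
    by_contig = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        contig, start, end = line.split("\t")[:3]
        by_contig[contig].append((int(start), int(end)))
    return by_contig


def position_in_intervals(intervals, contig, pos):
    """True if pos falls inside any half-open interval on the given contig."""
    return any(start <= pos < end for start, end in intervals.get(contig, []))


def is_hard_filtered(intervals, end1, end2):
    """Apply the either-end rule: filter if either variant end is excluded."""
    return (position_in_intervals(intervals, *end1)
            or position_in_intervals(intervals, *end2))


# Hypothetical exclusion interval and variant ends.
exclude = load_intervals(["chrA\t100\t200\n"])
print(is_hard_filtered(exclude, ("chrA", 150), ("chrA", 500)))
print(is_hard_filtered(exclude, ("chrA", 50), ("chrA", 500)))
```

The first call has one end inside the excluded interval and the second has neither, so only the first variant would be filtered under these assumptions.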
preprocessed_intervals
Intervals for read count collection and CNV calling.
primary_contigs_fai
Reference fasta index file containing only the primary contigs, i.e. chr1, ..., chr22, chrX, and chrY.
primary_contigs_list
Text file of primary contig names.
contigs_header
Plain text VCF header section of primary contig sequences.
protein_coding_gtf
Protein-coding sequence definitions in General Transfer Format (GTF), used for functional annotation.
reference_dict
Reference FASTA dictionary file (*.dict). See this article for more information.
reference_fasta
Reference FASTA file (*.fasta). See this article for more information.
reference_index
Reference FASTA index file (*.fasta.fai). See this article for more information.
rmsk
Block-compressed bed file of RepeatMasker intervals.
segdups
Block-compressed bed file of segmental duplication intervals.
seed_cutoffs
TSV of cutoff priors for genotyping.
single_sample_qc_definitions
TSV of recommended ranges for single-sample QC metrics.
wgd_scoring_mask
Plain text bed file of whole-genome dosage (WGD) score intervals over which to assess coverage bias.
wham_include_list_bed_file
Plain text bed file of intervals to call with Wham.
aou_recalibrate_gq_model_file
Genotype filtering model trained using data generated by All of Us.
hgdp_recalibrate_gq_model_file
Genotype filtering model trained using data generated from the Human Genome Diversity Project.
recalibrate_gq_genome_tracks
List of block-compressed bed files, each containing intervals from a separate genome track.
Benchmarking datasets
ccdg_abel_site_level_benchmarking_dataset
Benchmarking variant set from Abel et al. 2020.
gnomad_v2_collins_sample_level_benchmarking_dataset
Benchmarking genotypes from gnomAD-SV-v2 (see Collins et al. 2020). Not public data.
gnomad_v2_collins_site_level_benchmarking_dataset
Benchmarking variant set from gnomAD-SV-v2 (see Collins et al. 2020).
hgsv_byrska_bishop_sample_level_benchmarking_dataset
Benchmarking genotypes from Byrska-Bishop et al. 2022.
hgsv_byrska_bishop_sample_renaming_tsv
Sample renaming manifest for the Byrska-Bishop benchmarking genotypes.
hgsv_byrska_bishop_site_level_benchmarking_dataset
Benchmarking variant set from Byrska-Bishop et al. 2022.
hgsv_ebert_sample_level_benchmarking_dataset
Benchmarking genotypes from Ebert et al. 2021.
ssc_belyeu_sample_level_benchmarking_dataset
Benchmarking genotypes from the Simons Simplex Collection, derived from Belyeu et al. 2021. Not public data.
ssc_belyeu_site_level_benchmarking_dataset
Benchmarking variant set from the Simons Simplex Collection, derived from Belyeu et al. 2021. Not public data.
ssc_sanders_sample_level_benchmarking_dataset
Benchmarking genotypes from the Simons Simplex Collection, derived from Sanders et al. 2015. Not public data.
thousand_genomes_site_level_benchmarking_dataset
Benchmarking variant set from the 1000 Genomes Project Phase 3 SV call set (see Sudmant et al. 2015).
asc_site_level_benchmarking_dataset
Benchmarking variant set from the Autism Spectrum Consortium. Not public data.
hgsv_site_level_benchmarking_dataset
Benchmarking variant set from Werling et al. 2018. Not public data.
collins_2017_sample_level_benchmarking_dataset
Benchmarking genotypes from Collins et al. 2017. Not public data.
sanders_2015_sample_level_benchmarking_dataset
Benchmarking genotypes from Sanders et al. 2015. Not public data.
werling_2018_sample_level_benchmarking_dataset
Benchmarking genotypes from Werling et al. 2018. Not public data.