AnnotateVcf

Adds annotations, such as the inferred function and allele frequencies of variants, to a VCF.

Annotation methods include:

  • Functional annotation - The GATK tool SVAnnotate is used to annotate SVs with their inferred functional consequences on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements.
  • Allele frequency (AF) annotation - Annotates SVs with their allele frequencies across all samples, within each sex, and within specific subpopulations.
  • Allele frequency annotation with an external callset - Annotates SVs with the allele frequencies of their overlapping SVs in another callset, e.g. the gnomAD-SV reference callset.
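
As a rough illustration of the allele frequency annotations listed above, the sketch below computes an allele frequency as the alternate allele count divided by the number of called alleles, both overall and per group. The genotype encoding, the helper names `allele_frequency` and `grouped_frequencies`, and the sample labels are all hypothetical. This is not the workflow's implementation, which may treat sex chromosomes and multiallelic sites differently.

```python
from collections import defaultdict


def allele_frequency(genotypes):
    """Return AC / AN over called alleles, or None if no allele is called.

    genotypes: iterable of tuples such as (0, 1); None marks a no-call allele.
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    if not called:
        return None
    return sum(1 for a in called if a > 0) / len(called)


def grouped_frequencies(genotypes_by_sample, group_by_sample):
    """Compute an allele frequency per group label (e.g. sex or population)."""
    groups = defaultdict(list)
    for sample, gt in genotypes_by_sample.items():
        label = group_by_sample.get(sample)
        if label is not None:  # samples without a label are left out
            groups[label].append(gt)
    return {label: allele_frequency(gts) for label, gts in groups.items()}


# Hypothetical genotypes for a single variant
genotypes = {"s1": (0, 1), "s2": (1, 1), "s3": (0, 0), "s4": (None, None)}
populations = {"s1": "AFR", "s2": "AFR", "s3": "EUR", "s4": "EUR"}

overall = allele_frequency(genotypes.values())
by_pop = grouped_frequencies(genotypes, populations)
```

The same grouping function could be reused with a sample-to-sex mapping to mimic sex-specific frequencies.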

The following diagram illustrates the recommended invocation order:

Inputs

vcf

Any SV VCF. Running on the genotype filtered VCF is recommended.

prefix

Prefix for the output VCF, such as the cohort name. May be alphanumeric with underscores.
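
A quick pre-flight check on the prefix might look like the sketch below. It assumes that "alphanumeric with underscores" means one or more ASCII letters, digits, or underscores; the helper `is_valid_prefix` is hypothetical and not part of the workflow.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are allowed
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the whole prefix consists of allowed characters."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None
```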

protein_coding_gtf

Coding transcript definitions. See here.

Optional noncoding_bed

Non-coding reference intervals, see here.

Optional promoter_window

Promoter window size. See here.

Optional max_breakend_as_cnv_length

Max size for treating BND records as CNVs. See here.

Optional svannotate_additional_args

Additional arguments for GATK-SVAnnotate.

Optional sample_pop_assignments

Two-column file of sample IDs and population assignments. A sample whose population is "." is ignored. If provided, population-specific allele frequencies are annotated.
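
The sketch below shows one way to read such a file before launching the workflow. It assumes the file is tab-delimited with no header line, which the description above does not specify; the function `read_pop_assignments` and the sample IDs are hypothetical.

```python
import csv
import io


def read_pop_assignments(handle):
    """Map sample ID to population, skipping samples whose population is "."."""
    assignments = {}
    for row in csv.reader(handle, delimiter="\t"):  # assumption: tab-delimited
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        sample, population = row
        if population == ".":
            continue  # "." means the sample is ignored
        assignments[sample] = population
    return assignments


# Hypothetical file contents
example = io.StringIO("sample_1\tAFR\nsample_2\t.\nsample_3\tEUR\n")
assignments = read_pop_assignments(example)
```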

Optional sample_keep_list

If provided, subset samples to this list in the output VCF.

Optional ped_file

Family structures and sex assignments determined in EvidenceQC. See PED file format. If provided, sex-specific allele frequencies will be annotated.

Optional par_bed

Pseudo-autosomal region (PAR) bed file. If provided, variants overlapping PARs will be annotated with the PAR field.

sv_per_shard

Shard size for parallel processing. Decreasing this may help if the workflow is running too slowly.
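
To see how the shard size relates to the number of parallel pieces, consider the following minimal sketch, which splits a sequence of records into consecutive chunks of at most `sv_per_shard` items. The `shard` function is hypothetical and only illustrates the idea; it is not how the workflow itself splits the VCF.

```python
from itertools import islice


def shard(records, sv_per_shard):
    """Yield consecutive lists of at most sv_per_shard records."""
    if sv_per_shard < 1:
        raise ValueError("sv_per_shard must be a positive integer")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, sv_per_shard))
        if not chunk:
            return
        yield chunk


# Ten hypothetical records split with a shard size of 4
shards = list(shard(range(10), 4))
shard_sizes = [len(s) for s in shards]
```

A smaller shard size yields more, smaller shards, which can then be processed in parallel.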

Optional external_af_ref_bed

Reference SV set (see here). If provided, annotates variants with allele frequencies from the reference population.

Optional external_af_ref_prefix

External AF annotation prefix. Required if providing external_af_ref_bed.

Optional external_af_population

Population names in the external SV reference set, e.g. "ALL", "AFR", "AMR", "EAS", "EUR". Required if providing external_af_ref_bed; the names must match the populations in the bed file.
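
The dependency between the three external AF inputs can be checked before submission with a sketch like the one below. It assumes the inputs are held in a plain dictionary keyed by input name; the function `check_external_af_inputs` is hypothetical, and it does not verify that the population names match the bed file, since that file's layout is not described here.

```python
def check_external_af_inputs(inputs):
    """Return a list of problems with the external AF inputs (empty if none)."""
    problems = []
    if inputs.get("external_af_ref_bed") is None:
        return problems  # external AF annotation is not requested
    for name in ("external_af_ref_prefix", "external_af_population"):
        if not inputs.get(name):
            problems.append(
                f"{name} is required when external_af_ref_bed is provided"
            )
    return problems
```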

Optional use_hail

Default: false. Use Hail for VCF concatenation. This should only be used for projects with over 50k samples. If enabled, gcs_project must also be provided. Does not work on Terra.

Optional gcs_project

Google Cloud project ID. Required only if enabling use_hail.
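
The pairing of use_hail and gcs_project can be enforced with a small guard such as the sketch below, which assumes the inputs are held in a plain dictionary keyed by input name. The function `check_hail_inputs` is hypothetical.

```python
def check_hail_inputs(inputs):
    """Raise ValueError if use_hail is enabled without a gcs_project."""
    # use_hail defaults to false when it is not set
    if inputs.get("use_hail", False) and not inputs.get("gcs_project"):
        raise ValueError("gcs_project is required when use_hail is true")
```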

Outputs

annotated_vcf

Output VCF.