
CleanVcf

WDL source code

Performs various VCF clean-up steps including:

  • Adjusting genotypes on allosomal contigs
  • Collapsing overlapping CNVs into multi-allelic CNVs
  • Revising genotypes in overlapping CNVs
  • Removing redundant CNVs
  • Stitching large CNVs
  • VCF formatting clean-up

The following diagram illustrates the recommended invocation order:

Inputs

cohort_name

Cohort name. The guidelines outlined in the sample ID requirements section apply here.

complex_genotype_vcfs

Array of contig-sharded VCFs containing genotyped complex variants, generated in GenotypeComplexVariants.

complex_resolve_bothside_pass_list

Array of variant lists with bothside SR support for all batches, generated in ResolveComplexVariants.

complex_resolve_background_fail_list

Array of variant lists with low SR signal-to-noise ratio for all batches, generated in ResolveComplexVariants.

ped_file

Family structures and sex assignments determined in EvidenceQC. See PED file format.
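Because the sex assignments in ped_file feed the genotype adjustments on allosomal contigs, it can help to tally them before launching the workflow. The sketch below is not part of the pipeline. It assumes the conventional six-column PED layout (family ID, sample ID, father, mother, sex, phenotype) with sex coded as 1 for male, 2 for female, and anything else treated as unknown. The function name count_sex_assignments is hypothetical.

```python
from collections import Counter

# Assumed PED sex coding: 1 = male, 2 = female; any other value counts as unknown.
SEX_LABELS = {"1": "male", "2": "female"}


def count_sex_assignments(ped_lines):
    """Tally sex assignments from an iterable of PED-formatted lines."""
    counts = Counter()
    for line in ped_lines:
        line = line.strip()
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 PED columns, got {len(fields)}: {line!r}")
        counts[SEX_LABELS.get(fields[4], "unknown")] += 1
    return counts
```

To check a file on disk, pass an open file handle, for example `count_sex_assignments(open("cohort.ped"))`, where cohort.ped is a placeholder path.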

max_shards_per_chrom_step1, min_records_per_shard_step1, samples_per_step2_shard, max_samples_per_shard_step3, clean_vcf1b_records_per_shard, clean_vcf5_records_per_shard

These parameters control parallelism in scattered tasks. Please examine the WDL source code to see how each is used.

Optional outlier_samples_list

Text file of sample IDs to exclude when identifying multi-allelic CNVs. Most users do not need this feature unless excessive multi-allelic CNVs driven by low-quality samples are observed.

Optional use_hail

Default: false. Use Hail for VCF concatenation. Enable this only for projects with more than 50k samples. If enabled, gcs_project must also be provided. Does not work on Terra.

Optional gcs_project

Google Cloud project ID. Required only if enabling use_hail.
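The dependency between use_hail and gcs_project can be checked before submission. The following is a minimal sketch, not part of the pipeline. It assumes inputs are supplied as a JSON object whose keys are prefixed with the workflow name, here taken to be CleanVcf. The helper build_clean_vcf_inputs and the example values are hypothetical, and only a few of the inputs are shown.

```python
import json


def build_clean_vcf_inputs(cohort_name, use_hail=False, gcs_project=None, workflow="CleanVcf"):
    """Build a partial inputs dict, enforcing that use_hail requires gcs_project."""
    if use_hail and not gcs_project:
        raise ValueError("gcs_project must be provided when use_hail is true")
    inputs = {
        f"{workflow}.cohort_name": cohort_name,
        f"{workflow}.use_hail": use_hail,
    }
    # gcs_project is optional, so include it only when it is set.
    if gcs_project:
        inputs[f"{workflow}.gcs_project"] = gcs_project
    return inputs


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(build_clean_vcf_inputs("my_cohort"), indent=2))
```

Calling the helper with `use_hail=True` and no `gcs_project` raises a ValueError, so the misconfiguration surfaces locally rather than after the workflow starts.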

Outputs

cleaned_vcf

Genome-wide output VCF.