CleanVcf
Performs various VCF clean-up steps including:
- Adjusting genotypes on allosomal contigs
- Collapsing overlapping CNVs into multi-allelic CNVs
- Revising genotypes in overlapping CNVs
- Removing redundant CNVs
- Stitching large CNVs
- VCF formatting clean-up
The following diagram illustrates the recommended invocation order:
Inputs
cohort_name
Cohort name. The guidelines outlined in the sample ID requirements section apply here.
complex_genotype_vcfs
Array of contig-sharded VCFs containing genotyped complex variants, generated in GenotypeComplexVariants.
complex_resolve_bothside_pass_list
Array of variant lists with bothside SR support for all batches, generated in ResolveComplexVariants.
complex_resolve_background_fail_list
Array of variant lists with low SR signal-to-noise ratio for all batches, generated in ResolveComplexVariants.
ped_file
Family structures and sex assignments determined in EvidenceQC. See PED file format.
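Because the sex assignments in this file drive the genotype adjustments on allosomal contigs, it can be worth checking them before launching the workflow. The sketch below is a minimal, hypothetical helper (`read_ped_sexes` is not part of the pipeline) that assumes the conventional six-column PED layout: family ID, sample ID, paternal ID, maternal ID, sex (1 = male, 2 = female, anything else unknown), phenotype. Skipping lines that start with `#` is also an assumption; the PED file format section remains the authority on what the pipeline accepts.

```python
import io

# Conventional PED sex codes; any other value is reported as "unknown".
SEX_CODES = {"1": "male", "2": "female"}


def read_ped_sexes(handle):
    """Return {sample_id: sex} from an open, whitespace-delimited PED file."""
    sexes = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        # Assumption: blank lines and '#' comment/header lines are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"line {line_number}: expected 6 columns, got {len(fields)}")
        sample_id, sex_code = fields[1], fields[4]
        sexes[sample_id] = SEX_CODES.get(sex_code, "unknown")
    return sexes


# Hypothetical trio used only for illustration.
example_ped = io.StringIO(
    "FAM1\tchild\tfather\tmother\t1\t0\n"
    "FAM1\tfather\t0\t0\t1\t0\n"
    "FAM1\tmother\t0\t0\t2\t0\n"
)
sexes = read_ped_sexes(example_ped)
```

Samples that come back as "unknown" are the ones to review against the EvidenceQC results.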
max_shards_per_chrom_step1, min_records_per_shard_step1, samples_per_step2_shard, max_samples_per_shard_step3, clean_vcf1b_records_per_shard, clean_vcf5_records_per_shard
These parameters control parallelism in scattered tasks. Please examine the WDL source code to see how each is used.
Optional outlier_samples_list
Text file of sample IDs to exclude when identifying multi-allelic CNVs. Most users do not need this feature unless they observe excessive multi-allelic CNVs driven by low-quality samples.
Optional use_hail
Default: false. Use Hail for VCF concatenation. This should only be used for projects with over 50k samples. If enabled, gcs_project must also be provided. Does not work on Terra.
Optional gcs_project
Google Cloud project ID. Required only if enabling use_hail.
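The inputs above can be assembled into a workflow inputs file programmatically. The following is a minimal sketch, not part of the pipeline: the helper name `build_clean_vcf_inputs`, all paths, and all values are hypothetical. It assumes the usual `WorkflowName.input_name` key convention for WDL inputs JSON and takes the value shapes from the descriptions above. It also enforces the one cross-input rule stated here, namely that gcs_project must accompany use_hail.

```python
import json


def build_clean_vcf_inputs(required, parallelism, use_hail=False,
                           gcs_project=None, outlier_samples_list=None):
    """Build a CleanVcf inputs dict from plain dicts of input name -> value."""
    # Rule from the input descriptions: gcs_project is required with use_hail.
    if use_hail and not gcs_project:
        raise ValueError("gcs_project must be provided when use_hail is true")

    inputs = {}
    for name, value in {**required, **parallelism}.items():
        inputs[f"CleanVcf.{name}"] = value

    # Optional inputs are written only when they are actually set.
    if outlier_samples_list is not None:
        inputs["CleanVcf.outlier_samples_list"] = outlier_samples_list
    if use_hail:
        inputs["CleanVcf.use_hail"] = True
        inputs["CleanVcf.gcs_project"] = gcs_project
    return inputs


# Hypothetical values for illustration only.
required = {
    "cohort_name": "my_cohort",
    "complex_genotype_vcfs": ["gs://my-bucket/chr1.vcf.gz"],
    "complex_resolve_bothside_pass_list": ["gs://my-bucket/bothside_pass.txt"],
    "complex_resolve_background_fail_list": ["gs://my-bucket/background_fail.txt"],
    "ped_file": "gs://my-bucket/my_cohort.ped",
}
# Placeholder shard sizes; check the WDL source for suitable values.
parallelism = {
    "max_shards_per_chrom_step1": 200,
    "min_records_per_shard_step1": 5000,
    "samples_per_step2_shard": 100,
    "max_samples_per_shard_step3": 2000,
    "clean_vcf1b_records_per_shard": 10000,
    "clean_vcf5_records_per_shard": 5000,
}

inputs = build_clean_vcf_inputs(required, parallelism)
inputs_json = json.dumps(inputs, indent=2)
```

Leaving unset optional inputs out of the file lets the workflow fall back to its own defaults, such as use_hail being false.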
Outputs
cleaned_vcf
Genome-wide cleaned VCF.