
ClusterBatch

WDL source code

Clusters SV calls across a batch. For each caller, redundant variants are merged across samples into representative variant records based on interval overlap criteria. Some variants are hard-filtered if they overlap predefined intervals known to pose challenges to SV and CNV callers (e.g. centromeres). GATK-SVCluster is the primary tool used for variant clustering.
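To make "interval overlap criteria" concrete, here is a minimal sketch of clustering calls by reciprocal overlap. It is an illustration only and not the GATK-SVCluster implementation: the `Interval` class, the `cluster` function, the single-linkage strategy, and the 0.5 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical record for one SV call in one sample."""
    sample: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Fraction of the shared span relative to the longer of the two intervals."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def cluster(calls, min_overlap=0.5):
    """Single-linkage clustering: link any pair meeting the overlap threshold."""
    parent = list(range(len(calls)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            if reciprocal_overlap(calls[i], calls[j]) >= min_overlap:
                parent[find(j)] = find(i)

    groups = {}
    for i, call in enumerate(calls):
        groups.setdefault(find(i), []).append(call)
    return list(groups.values())


calls = [
    Interval("sample_1", "chr1", 100, 200),
    Interval("sample_2", "chr1", 110, 210),
    Interval("sample_3", "chr1", 500, 600),
]
clusters = cluster(calls)
```

In this toy input the first two calls share 90% of their length and fall into one cluster, while the third stays on its own. A real clustering tool would also have to consider SV type, breakpoint evidence, and caller-specific parameters.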

The following diagram illustrates the recommended invocation order:

Note: GenerateBatchMetrics is the primary downstream module in batch processing. JoinRawCalls is required for genotype filtering but does not need to be run until later in the pipeline.

Inputs

batch

An identifier for the batch. Should match the name used in GatherBatchEvidence.

*_vcf_tar

Standardized VCF tarballs from GatherBatchEvidence.

del_bed, dup_bed

Merged CNV call files (.bed.gz) from GatherBatchEvidence.

ped_file

Family structures and sex assignments determined in EvidenceQC. See PED file format.
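As a quick sanity check before launching the workflow, a PED file can be parsed and its sex assignments tallied. The sketch below assumes the conventional six-column PED layout (family, individual, father, mother, sex, phenotype) with sex coded as 1 for male and 2 for female. The function names are hypothetical, and any pipeline-specific requirements are described in the PED file format page rather than here.

```python
from collections import Counter

# Conventional six-column PED layout (an assumption for this sketch).
PED_COLUMNS = ("family_id", "sample_id", "father_id", "mother_id", "sex", "phenotype")
SEX_LABELS = {"1": "male", "2": "female"}


def parse_ped(lines):
    """Parse whitespace-delimited PED lines into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < len(PED_COLUMNS):
            raise ValueError(f"expected at least 6 columns, got {len(fields)}: {line!r}")
        records.append(dict(zip(PED_COLUMNS, fields[:6])))
    return records


def sex_counts(records):
    """Tally sex assignments; any other code is reported as 'unknown'."""
    return Counter(SEX_LABELS.get(r["sex"], "unknown") for r in records)


example_lines = [
    "FAM1\tchild\tdad\tmom\t2\t0",
    "FAM1\tdad\t0\t0\t1\t0",
    "FAM1\tmom\t0\t0\t2\t0",
]
records = parse_ped(example_lines)
counts = sex_counts(records)
```

To check a file on disk, pass an open file handle to `parse_ped`, since it accepts any iterable of lines.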

Optional N_IQR_cutoff_plotting

If provided, SV counts per sample are plotted. The value sets the number of interquartile range (IQR) multiples used as the cutoff for flagging outlier samples. Example value: 4.
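The cutoff can be read as a multiplier on the IQR of per-sample SV counts. The sketch below shows one common way to apply such a rule, flagging samples whose counts fall below Q1 - k * IQR or above Q3 + k * IQR. Whether the workflow uses exactly this rule, and how it computes quartiles, is an assumption here, and `flag_outliers` is a hypothetical helper.

```python
import statistics


def flag_outliers(counts, n_iqr_cutoff):
    """Return samples whose SV count lies more than n_iqr_cutoff IQRs outside Q1..Q3.

    counts: dict mapping sample ID to SV count (needs at least two samples).
    """
    values = sorted(counts.values())
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - n_iqr_cutoff * iqr
    upper = q3 + n_iqr_cutoff * iqr
    return {s: c for s, c in counts.items() if c < lower or c > upper}


sv_counts = {"s1": 100, "s2": 102, "s3": 98, "s4": 101, "s5": 99, "s6": 400}
outliers = flag_outliers(sv_counts, n_iqr_cutoff=4)
```

With these toy counts the quartiles are 99.25 and 101.75, so the IQR is 2.5 and the upper bound at a cutoff of 4 is 111.75. Only `s6` is flagged.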

Optional stripy_vcfs

Single-sample STRipy VCFs to merge for the batch. In the Terra joint-calling workspace, these are produced by the optional standalone StripyWorkflow sample workflow and passed from the stripy_vcf sample attribute.

Outputs

clustered_*_vcf

Clustered variants for each caller, in VCF format (depth corresponds to the depth-based CNV callers cnMOPS and GATK-gCNV).

Optional clustered_sv_counts, clustered_sv_count_plots, clustered_outlier_samples_preview, clustered_outlier_samples_with_reason, clustered_num_outlier_samples

SV count QC tables and plots. Enabled by providing N_IQR_cutoff_plotting.

Optional merged_stripy_vcf, merged_stripy_vcf_index

Batch-level merged STRipy VCF and index, present when stripy_vcfs is provided.