ClusterBatch
Clusters SV calls across a batch. For each caller, redundant variants are merged across samples into representative variant records based on interval overlap criteria. Variants are hard-filtered if they overlap predefined intervals known to pose challenges to SV and CNV callers (e.g. centromeres). GATK-SVCluster is the primary tool used for variant clustering.
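To make the idea of merging by interval overlap concrete, the following is a minimal Python sketch, not the GATK-SVCluster implementation. It assumes reciprocal overlap as the criterion and a greedy single-linkage grouping. The `Interval` class, the `reciprocal_overlap` and `cluster` functions, and the 0.5 threshold are all hypothetical names and values chosen for illustration; the actual criteria and parameters used by the module are not described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A simplified SV call (hypothetical structure, for illustration only)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive; assumed to be greater than start
    sample: str


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def cluster(intervals, min_ro=0.5):
    """Greedy single-linkage grouping: a call joins the first cluster that
    contains any member meeting the assumed reciprocal-overlap threshold."""
    clusters = []
    for iv in sorted(intervals, key=lambda x: (x.chrom, x.start, x.end)):
        for members in clusters:
            if any(reciprocal_overlap(iv, m) >= min_ro for m in members):
                members.append(iv)
                break
        else:
            clusters.append([iv])
    return clusters


calls = [
    Interval("chr1", 1000, 2000, "sampleA"),
    Interval("chr1", 1050, 2100, "sampleB"),
    Interval("chr1", 5000, 6000, "sampleC"),
]
clusters = cluster(calls)
for members in clusters:
    print([m.sample for m in members])
```

In this toy input, the calls from sampleA and sampleB overlap almost entirely and fall into one group, while the sampleC call stays on its own. A real clustering tool would also account for SV type, breakpoint distance, and other criteria.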
The following diagram illustrates the recommended invocation order:
GenerateBatchMetrics is the primary downstream module in batch processing. JoinRawCalls is required for genotype filtering but does not need to be run until later in the pipeline.
Inputs
batch
An identifier for the batch. Should match the name used in GatherBatchEvidence.
*_vcf_tar
Standardized VCF tarballs from GatherBatchEvidence.
del_bed, dup_bed
Merged CNV call files (.bed.gz) from GatherBatchEvidence.
ped_file
Family structures and sex assignments determined in EvidenceQC. See PED file format.
Optional N_IQR_cutoff_plotting
If provided, SV counts per sample are plotted. The value is the number of interquartile range (IQR) multiples used as the cutoff for flagging outlier samples. Example value: 4.
Optional stripy_vcfs
Single-sample STRipy VCFs to merge for the batch. In the Terra joint-calling workspace, these are produced by the optional standalone StripyWorkflow sample workflow and passed from the stripy_vcf sample attribute.
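As an illustration of how an IQR-multiple cutoff such as N_IQR_cutoff_plotting can flag outlier samples, here is a minimal Python sketch. It assumes the conventional fence rule (a sample is an outlier if its count falls below Q1 - N * IQR or above Q3 + N * IQR); the exact rule the workflow applies, and whether it is evaluated per SV type, is not stated in this section. The function name `flag_outliers` and the sample counts are hypothetical.

```python
import statistics


def flag_outliers(counts, n_iqr):
    """Return sample IDs whose SV count lies outside the assumed IQR fences.

    counts: mapping of sample ID to SV count (needs at least two samples).
    n_iqr:  number of IQR multiples used as the cutoff.
    """
    values = sorted(counts.values())
    # Quartile cut points; the "inclusive" method treats the data as a
    # complete population rather than a sample.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - n_iqr * iqr, q3 + n_iqr * iqr
    return sorted(s for s, c in counts.items() if c < low or c > high)


# Hypothetical per-sample SV counts for one caller.
sv_counts = {"s1": 100, "s2": 105, "s3": 98, "s4": 102, "s5": 101, "s6": 400}
outliers = flag_outliers(sv_counts, 4)
print(outliers)
```

With these made-up counts and a cutoff of 4, only s6 falls outside the fences. A larger cutoff widens the fences and flags fewer samples.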
Outputs
clustered_*_vcf
Clustered variants for each caller (depth corresponds to depth-based CNV callers cnMOPS and GATK-gCNV) in VCF format.
Optional clustered_sv_counts, clustered_sv_count_plots, clustered_outlier_samples_preview, clustered_outlier_samples_with_reason, clustered_num_outlier_samples
SV count QC tables and plots. Enabled by providing N_IQR_cutoff_plotting.
Optional merged_stripy_vcf, merged_stripy_vcf_index
Batch-level merged STRipy VCF and index, present when stripy_vcfs is provided.