ClusterBatch
Clusters SV calls across a batch. For each caller, redundant variants are merged across samples into representative variant records based on interval overlap criteria. Variants that overlap predefined intervals known to pose challenges to SV and CNV callers (e.g. centromeres) are hard-filtered. GATK-SVCluster is the primary tool used for variant clustering.
The following diagram illustrates the recommended invocation order:
GenerateBatchMetrics is the primary downstream module in batch processing. JoinRawCalls is required for genotype filtering but does not need to be run until later in the pipeline.
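To make the idea of merging redundant calls by interval overlap concrete, here is a minimal Python sketch. It is not the GATK-SVCluster algorithm or its parameters: the `Call` record, the reciprocal-overlap measure, the `min_ro=0.5` threshold, single-linkage grouping, and the median-based representative are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import median_low


@dataclass(frozen=True)
class Call:
    """Hypothetical minimal SV call record (not a GATK-SV type)."""
    sample: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    svtype: str


def reciprocal_overlap(a, b):
    """Fraction of the longer-relative overlap shared by both intervals."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def cluster_calls(calls, min_ro=0.5):
    """Single-linkage grouping of same-type calls (assumed threshold)."""
    parent = list(range(len(calls)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            same_type = calls[i].svtype == calls[j].svtype
            if same_type and reciprocal_overlap(calls[i], calls[j]) >= min_ro:
                parent[find(j)] = find(i)

    groups = {}
    for i, call in enumerate(calls):
        groups.setdefault(find(i), []).append(call)
    return list(groups.values())


def representative(group):
    """Summarize a cluster as one record with median coordinates."""
    first = group[0]
    return (
        first.chrom,
        median_low(c.start for c in group),
        median_low(c.end for c in group),
        first.svtype,
        sorted(c.sample for c in group),
    )


calls = [
    Call("s1", "chr1", 1000, 2000, "DEL"),
    Call("s2", "chr1", 1050, 2050, "DEL"),
    Call("s3", "chr1", 5000, 6000, "DEL"),
    Call("s4", "chr1", 1000, 2000, "DUP"),
]
clusters = cluster_calls(calls)
records = [representative(group) for group in clusters]
```

In this toy input, the two overlapping deletions from `s1` and `s2` collapse into one record, while the distant deletion and the duplication stay separate. The pairwise loop is quadratic and only suitable for illustration.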
Inputs
batch
An identifier for the batch. Should match the name used in GatherBatchEvidence.
*_vcf_tar
Standardized VCF tarballs from GatherBatchEvidence
del_bed, dup_bed
Merged CNV call files (.bed.gz) from GatherBatchEvidence
ped_file
Family structures and sex assignments determined in EvidenceQC. See PED file format.
Optional N_IQR_cutoff_plotting
If provided, plot SV counts per sample. This number is used as the cutoff of interquartile range multiples for flagging outlier samples. Example value: 4.
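The role of N_IQR_cutoff_plotting as a multiple of the interquartile range can be sketched as follows. This is a rough illustration only: the function name `flag_outliers`, the choice of quartile fences (Q1 and Q3 rather than, say, the median), the quantile method, and the `"low"`/`"high"` reason labels are assumptions, and the workflow's actual rule may differ.

```python
import statistics


def flag_outliers(counts, n_iqr_cutoff):
    """Flag samples whose SV count lies more than n_iqr_cutoff * IQR
    outside the first or third quartile (assumed rule, see lead-in)."""
    values = sorted(counts.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - n_iqr_cutoff * iqr
    high = q3 + n_iqr_cutoff * iqr
    return {
        sample: ("low" if count < low else "high")
        for sample, count in counts.items()
        if count < low or count > high
    }


# Hypothetical per-sample SV counts for one caller and SV type.
sv_counts = {"s1": 100, "s2": 105, "s3": 98, "s4": 102, "s5": 101, "s6": 400}
outliers = flag_outliers(sv_counts, n_iqr_cutoff=4)
```

With the example value of 4, only `s6` falls outside the fences in this made-up data. A smaller cutoff flags more samples.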
Outputs
clustered_*_vcf
Clustered variants for each caller (depth corresponds to depth-based CNV callers cnMOPS and GATK-gCNV) in VCF format.
Optional clustered_sv_counts, clustered_sv_count_plots, clustered_outlier_samples_preview, clustered_outlier_samples_with_reason, clustered_num_outlier_samples
SV count QC tables and plots. Enabled by providing N_IQR_cutoff_plotting.