EvidenceQC
Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.
We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.
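As a minimal sketch of how sex assignments could be derived from ploidy estimates and written into a PED file: the function names, the tolerance value, and the assumption that the chrX and chrY copy-number estimates have already been parsed into floats are all hypothetical and not part of EvidenceQC. The sex codes follow the standard PED convention (1 = male, 2 = female, 0 = unknown), with 0 used for sex aneuploidies as recommended above.

```python
def assign_ped_sex(chrx_cn: float, chry_cn: float, tol: float = 0.3) -> int:
    """Return a PED sex code: 1 = male, 2 = female, 0 = aneuploid or ambiguous.

    `tol` is a hypothetical tolerance: estimates further than `tol` from an
    integer copy number (for example, mosaics) are treated as ambiguous.
    """
    x, y = round(chrx_cn), round(chry_cn)
    if abs(chrx_cn - x) > tol or abs(chry_cn - y) > tol:
        return 0
    if (x, y) == (1, 1):
        return 1
    if (x, y) == (2, 0):
        return 2
    return 0  # any other karyotype (e.g. XXY, XYY, X0) is a sex aneuploidy


def update_ped_line(line: str, sex_by_sample: dict) -> str:
    """Replace the sex column (5th field) of one tab-delimited PED line."""
    fields = line.rstrip("\n").split("\t")
    sample_id = fields[1]  # 2nd PED column is the individual ID
    if sample_id in sex_by_sample:
        fields[4] = str(sex_by_sample[sample_id])
    return "\t".join(fields)
```

Samples missing from `sex_by_sample` keep their existing PED sex code, so the update can be applied to a PED file that covers more samples than the current batch.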
The following diagram illustrates the recommended invocation order:
Preliminary Sample QC
Sample filtering at this stage, after EvidenceQC, prevents very poor-quality samples from interfering with the results for the rest of the callset. In general, borderline samples are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be further opportunities for filtering (as part of FilterBatch) before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:
- Chromosome X and Y ploidy plots: check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
- Whole-genome dosage score (WGD): examine the distribution and check that it is centered around 0. The distribution of WGD for PCR- samples is expected to be slightly lower than 0, and the distribution for PCR+ samples is expected to be slightly greater than 0. Refer to the gnomAD-SV paper for more information on the WGD score. Optionally filter outliers.
- Low outliers for each SV caller: these are samples with much lower than typical numbers of SV calls per contig for each caller. An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
- High outliers for each SV caller: optionally filter outliers; samples with many more SV calls than average may be poor quality.
- Autosomal aneuploidies: remove affected samples based on the per-batch binned coverage plots of each chromosome.
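The outlier checks above can be sketched as follows. This is an illustration only: it assumes per-sample values (WGD scores or per-caller SV counts) have already been loaded into a dictionary, and it uses interquartile-range fences with a hypothetical multiplier `k`, which may differ from the rule EvidenceQC applies when it writes its outlier files.

```python
import statistics


def flag_outliers(values_by_sample: dict, k: float = 1.5):
    """Return (low, high) lists of sample IDs outside the IQR fences.

    `k` is a hypothetical fence multiplier; tune it to your cohort.
    """
    values = list(values_by_sample.values())
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low_cut, high_cut = q1 - k * iqr, q3 + k * iqr
    low = sorted(s for s, v in values_by_sample.items() if v < low_cut)
    high = sorted(s for s, v in values_by_sample.items() if v > high_cut)
    return low, high


def zero_call_samples(call_counts: dict):
    """Return sample IDs for which a caller produced no SV calls at all."""
    return sorted(s for s, n in call_counts.items() if n == 0)
```

For example, `flag_outliers({"s1": -0.02, "s2": 0.01, "s3": 0.0, "s4": 0.03, "s5": -0.01, "s6": 0.9})` flags only `s6` as a high outlier. Whether to remove a flagged sample remains a judgment call for your cohort.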
In the joint calling mode Terra workspace, we provide a Jupyter notebook, SampleQC.ipynb, for sample QC and filtering.
Batching
For larger cohorts, samples should be split up into batches of about 100-500 samples with similar characteristics. We recommend batching based on overall coverage and dosage score (WGD), which is generated in EvidenceQC. You may also wish to batch samples based on other characteristics that could impact SV calling, such as mean insert size or PCR status. An example batching process is outlined below:
- Divide the cohort by chromosome X ploidy (less than 2, greater than or equal to 2) based on copy ratio estimates from EvidenceQC. This way, males and females are batched separately before being merged back together into batches with equal sex balance
- Partition the samples by median coverage from EvidenceQC, grouping samples with similar median coverage together
- Partition the samples further by dosage score (WGD) from EvidenceQC, grouping samples with similar WGD score together
- Optionally, partition the samples further by mean insert size if available, grouping samples with similar mean insert size together
- Merge corresponding male and female partitions together to generate roughly equally sized batches of 100-500 samples with roughly equal sex balance
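The example process above can be sketched in a few lines. This is not the implementation used in Batching.ipynb: the `Sample` class, the function names, and the partition counts `n_cov` and `n_wgd` are hypothetical, `chrx_cn` is assumed to be the chromosome X copy number from EvidenceQC already rounded to an integer, and the optional insert-size step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    chrx_cn: int        # assumed: rounded chrX copy number from EvidenceQC
    median_cov: float   # median coverage from EvidenceQC
    wgd: float          # dosage score from EvidenceQC


def split_evenly(items, n):
    """Split a list into n contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def partition(samples, n_cov, n_wgd):
    """Partition by median coverage, then by WGD within each coverage bin."""
    parts = []
    by_cov = sorted(samples, key=lambda s: s.median_cov)
    for cov_chunk in split_evenly(by_cov, n_cov):
        by_wgd = sorted(cov_chunk, key=lambda s: s.wgd)
        parts.extend(split_evenly(by_wgd, n_wgd))
    return parts


def make_batches(samples, n_cov=2, n_wgd=2):
    """Batch each chrX ploidy group separately, then merge matching partitions."""
    low_x = [s for s in samples if s.chrx_cn < 2]
    high_x = [s for s in samples if s.chrx_cn >= 2]
    pairs = zip(partition(low_x, n_cov, n_wgd), partition(high_x, n_cov, n_wgd))
    return [[s.sample_id for s in a + b] for a, b in pairs]
```

The number of batches is `n_cov * n_wgd`, so these values should be chosen so that each merged batch lands in the 100-500 sample range.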
In the joint calling mode Terra workspace, we provide a Jupyter notebook, Batching.ipynb, for batch creation.
Inputs
All array inputs of sample data must match in order. For example, the order of the samples array should match that of the counts array.
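One way to keep the arrays aligned is to build them from a single per-sample record list rather than assembling each array separately. The sketch below is illustrative: the helper names are hypothetical, the dictionary keys mirror the samples and counts inputs without any workflow prefix, and the check assumes counts files are named `<sample ID>.counts.tsv.gz`, which may not hold for your data.

```python
def build_inputs(records):
    """Build aligned `samples` and `counts` arrays from per-sample records.

    Each record is a dict with hypothetical keys 'sample_id' and 'counts'.
    """
    ids = [r["sample_id"] for r in records]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate sample IDs")
    return {"samples": ids, "counts": [r["counts"] for r in records]}


def check_aligned(samples, counts):
    """Raise ValueError if the two arrays are not in matching order.

    Assumes the naming convention <sample ID>.counts.tsv.gz (an assumption).
    """
    if len(samples) != len(counts):
        raise ValueError("samples and counts differ in length")
    for sample_id, path in zip(samples, counts):
        name = path.rsplit("/", 1)[-1]  # works for local and gs:// paths
        if name != f"{sample_id}.counts.tsv.gz":
            raise ValueError(f"{path} does not match sample {sample_id}")
```

Running `check_aligned` on the arrays before submitting the workflow catches swapped or missing entries early.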
batch
A name for the batch of samples being run. Can be alphanumeric with underscores.
samples
Sample IDs. Must match those used in GatherSampleEvidence.
counts
Binned read counts (.counts.tsv.gz) from GatherSampleEvidence.
*_vcfs
Raw SV call VCFs (.vcf.gz) from GatherSampleEvidence. May be omitted if a caller was not run.
Optional run_vcf_qc
Default: false. Run raw call VCF QC analysis.
Optional run_ploidy
Default: true. Run ploidy estimation.
Optional melt_insert_size
Mean insert size for each sample. Produces QC tables and plots if available.
Outputs
WGD_*
Per-sample whole-genome dosage scores with plots
bincov_median
Median coverage per sample
bincov_matrix
Binned read depth matrix for the submitted batch
ploidy_*
Ploidy estimates and sex assignments, with plots
Optional *_qc_low, *_qc_high
Outlier samples detected by call counts.
Optional qc_table
QC summary table. Enable with run_ploidy.