ScoreGenotypes

WDL source code

Calibrate quality scores to provide as input to a machine learning model based on xgboost, which is then used to filter genotypes. The output VCF contains the following updated fields:

SL : Scaled logit scores
GQ : Updated genotype quality rescaled using SL
OGQ : Original GQ score before recalibration

The following diagram illustrates the recommended invocation order:

Model features

The model uses the following features:

Genotype properties:
- Non-reference and no-call allele counts
- Genotype quality (GQ)
- Supporting evidence types (EV) and respective genotype qualities (PE_GQ, SR_GQ, RD_GQ)
- Raw call concordance (CONC_ST)
Variant properties:
- Variant type (SVTYPE) and size (SVLEN)
- Calling algorithms (ALGORITHMS)
- Supporting evidence types (EVIDENCE)
- Two-sided SR support flag (BOTHSIDES_SUPPORT)
- Evidence overdispersion flag (PESR_GT_OVERDISPERSION)
- SR noise flag (HIGH_SR_BACKGROUND)
- Raw call concordance (STATUS, NON_REF_GENOTYPE_CONCORDANCE, VAR_PPV, VAR_SENSITIVITY, TRUTH_AF)
Reference context with respect to UCSC Genome Browser tracks:
- RepeatMasker
- Segmental duplications
- Simple repeats
- K-mer mappability (umap_s100 and umap_s24)

Model availability

For ease of use, we provide a model pre-trained on high-quality data with truth data derived from long-read calls:

gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model

See the SV "Genotype Filter" section on page 34 of the All of Us Genomic Quality Report C2022Q4R9 CDR v7 for further details on model optimization. The generation and release of this model was made possible by the All of Us program (see here).

SL scores

All valid genotypes are annotated with a scaled logit (SL) score, which is rescaled to non-negative adjusted GQ values on [1, 99]. Note that the rescaled GQ values should not be interpreted as probabilities. Original genotype qualities are retained in the OGQ field.

A more positive SL score indicates higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the MainVcfQc workflow to review call set quality (see below for recommended practices).

Modes

The combination of this and the following FilterGenotypes workflow can be run in one of two modes:

(Recommended) The user does not provide truth labels to be optimized SL cutoffs based on, and instead explicitly provides a set of SL cutoffs through the sl_cutoff_table parameter in FilterGenotypes, e.g.
```
sv_type	min_size	max_size	sl_cutoff
DEL	-1	500	93
DEL	500	10000	150
DUP	-1	500	-51
DUP	500	10000	-4
INS	-1	-1	-13
INV	-1	-1	-19
```
wherein each row corresponds to a sl_cutoff between the size range of min_size (inclusive) and max_size (exclusive), and a -1 corresponds to a non-existent size limit in either direction.

Genotypes with SL scores less than the cutoffs are set to no-call (./.). The above values were taken directly from Appendix N of the All of Us Genomic Quality Report C2022Q4R9 CDR v7 . Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in this script for all available options.
(Advanced) The user provides truth labels for a subset of non-reference calls through the truth_json parameter in ScoreGenotypes, e.g.
```
 {
   "sample_1": 
   {
     "good_variant_ids": ["variant_1", "variant_3"], 
     "bad_variant_ids": ["variant_5", "variant_10"]
   },
   "sample_2": 
   {
     "good_variant_ids": ["variant_2", "variant_13"], 
     "bad_variant_ids": ["variant_8", "variant_11"]
   }
 }
```
wherein "good_variant_ids" and "bad_variant_ids" are lists of variant IDs corresponding to non-reference (i.e. het or hom-var) sample genotypes that are true positives and false positives, respectively.

This approach allows SL cutoffs to automatically be optimized, and are hence eventually passed in through the optimized_sl_cutoff_table parameter in FilterGenotypes. These cutoffs are optimized by maximizing the F-score with "beta" parameter fmax_beta, which modulates the weight given to precision over recall (lower values give higher precision).

Inputs

`vcf`

Input VCF generated from SVConcordance.

Optional `output_prefix`

Default: use input VCF filename. Prefix for the output VCF, such as the cohort name. May be alphanumeric with underscores.

`gq_recalibrator_model_file`

GQ-Recalibrator model. A public model is listed as aou_recalibrate_gq_model_file here.

`recalibrate_gq_args`

Arguments to pass to the GQ recalibration tool. Users should leave this with the default configuration in Terra.

`genome_tracks`

Genome tracks for sequence context annotation. Users should leave this with the default configuration in Terra.

Optional `fmax_beta`

Default: 0.4. If providing a truth set, defines the beta parameter for F-score optimization.

Optional `truth_json`

Truth labels for input variants. If provided, the workflow will attempt to optimize filtering cutoffs automatically using the F-score.

Optional `optimize_vcf_records_per_shard`

Default: 50000. Shard size for scattered cutoff optimization tasks. Decrease this if those steps are running slowly.

Outputs

`unfiltered_recalibrated_vcf`

VCF that contains SL genotype scores that have not yet gone through filtering.

Optional `vcf_optimization_table`

Table of cutoff optimization metrics. Only generated if truth_json is provided.

Optional `sl_cutoff_table`

Table that contains optimized SL cutoffs. Only generated if truth_json is provided.

Optional `sl_cutoff_qc_tarball`

Cutoff optimization and QC plots. Only generated if truth_json is provided.

Model features​

Model availability​

SL scores​

Modes​

Inputs​

vcf​

Optional output_prefix​

gq_recalibrator_model_file​

recalibrate_gq_args​

genome_tracks​

Optional fmax_beta​

Optional truth_json​

Optional optimize_vcf_records_per_shard​

Outputs​

unfiltered_recalibrated_vcf​

Optional vcf_optimization_table​

Optional sl_cutoff_table​

Optional sl_cutoff_qc_tarball​