FilterGenotypes
Performs genotype quality recalibration using a machine learning model based on xgboost and filters genotypes. The output VCF contains the following updated fields:
- SL: Scaled logit scores (see here)
- GQ: Updated genotype quality rescaled using SL
- OGQ: Original GQ score before recalibration
- HIGH_NCR: Filter status assigned to variants exceeding a threshold proportion of no-call genotypes. This will also be applied to variants with genotypes that have already been filtered in the input VCF.
The following diagram illustrates the recommended invocation order:
Model features
The model uses the following features:
- Genotype properties:
  - Non-reference and no-call allele counts
  - Genotype quality (GQ)
  - Supporting evidence types (EV) and respective genotype qualities (PE_GQ, SR_GQ, RD_GQ)
  - Raw call concordance (CONC_ST)
- Variant properties:
  - Variant type (SVTYPE) and size (SVLEN)
  - Calling algorithms (ALGORITHMS)
  - Supporting evidence types (EVIDENCE)
  - Two-sided SR support flag (BOTHSIDES_SUPPORT)
  - Evidence overdispersion flag (PESR_GT_OVERDISPERSION)
  - SR noise flag (HIGH_SR_BACKGROUND)
  - Raw call concordance (STATUS, NON_REF_GENOTYPE_CONCORDANCE, VAR_PPV, VAR_SENSITIVITY, TRUTH_AF)
- Reference context with respect to UCSC Genome Browser tracks:
  - RepeatMasker
  - Segmental duplications
  - Simple repeats
  - K-mer mappability (umap_s100 and umap_s24)
Model availability
For ease of use, we provide a model pre-trained on high-quality data with truth data derived from long-read calls:
gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/gatk-sv-recalibrator.aou_phase_1.v1.model
See the SV "Genotype Filter" section on page 34 of the All of Us Genomic Quality Report C2022Q4R9 CDR v7 for further details on model training. The generation and release of this model was made possible by the All of Us program (see here).
SL scores
All valid genotypes are annotated with a "scaled logit" (SL) score, which is rescaled to non-negative adjusted GQ values on [1, 99]. Note that the rescaled GQ values should not be interpreted as probabilities. Original genotype qualities are retained in the OGQ field.
A more positive SL score indicates higher probability that the given genotype is not homozygous for the reference allele. Genotypes are therefore filtered using SL thresholds that depend on SV type and size. This workflow also generates QC plots using the MainVcfQc workflow to review call set quality (see below for recommended practices).
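As a minimal sketch of type- and size-dependent SL filtering, the Python below looks up a cutoff by SV type and size class and sets a genotype to no-call when its SL score falls below that cutoff. The cutoff values are purely illustrative. The size-bin boundaries (SMALL_MAX, MEDIUM_MAX) and all function names are hypothetical placeholders, not values taken from the workflow; the real bins are defined by the filtering script.

```python
from typing import Dict, Optional, Tuple

# HYPOTHETICAL size-bin boundaries in bp, for illustration only.
SMALL_MAX = 500
MEDIUM_MAX = 10_000

# Illustrative cutoffs keyed by (SV type, size class); None means "any size".
CUTOFFS: Dict[Tuple[str, Optional[str]], float] = {
    ("DEL", "small"): 93,
    ("DEL", "medium"): 150,
    ("DUP", "small"): -51,
    ("DUP", "medium"): -4,
    ("INS", None): -13,
    ("INV", None): -19,
}


def size_class(svlen: int) -> str:
    """Assign a size class using the hypothetical boundaries above."""
    if svlen < SMALL_MAX:
        return "small"
    if svlen < MEDIUM_MAX:
        return "medium"
    return "large"


def sl_cutoff(svtype: str, svlen: int, cutoffs=CUTOFFS) -> Optional[float]:
    """Return the cutoff for this variant, or None if no cutoff is defined."""
    if svtype in ("DEL", "DUP"):
        return cutoffs.get((svtype, size_class(svlen)))
    return cutoffs.get((svtype, None))


def filter_genotype(gt: str, sl: float, svtype: str, svlen: int) -> str:
    """Set the genotype to no-call (./.) when SL is below the cutoff.

    Assumption: the cutoff is applied to whatever genotype is passed in;
    variants without a defined cutoff are left unchanged.
    """
    cutoff = sl_cutoff(svtype, svlen)
    if cutoff is not None and sl < cutoff:
        return "./."
    return gt


print(filter_genotype("0/1", 50, "DEL", 300))   # SL below the small-DEL cutoff
print(filter_genotype("0/1", 100, "DEL", 300))  # SL above the small-DEL cutoff
```

Keying the table on (type, size class) keeps the lookup explicit and makes it easy to see which variant classes have no cutoff at all.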
Modes
This workflow can be run in one of two modes:
- (Recommended) The user explicitly provides a set of SL cutoffs through the sl_filter_args parameter, e.g.
    "--small-del-threshold 93 --medium-del-threshold 150 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13 --inv-threshold -19"
  Genotypes with SL scores less than the cutoffs are set to no-call (./.). The above values were taken directly from Appendix N of the All of Us Genomic Quality Report C2022Q4R9 CDR v7. Users should adjust the thresholds depending on data quality and desired accuracy. Please see the arguments in this script for all available options.
- (Advanced) The user provides truth labels for a subset of non-reference calls, and SL cutoffs are automatically optimized. These truth labels should be provided as a json file in the following format:
    {
      "sample_1":
      {
        "good_variant_ids": ["variant_1", "variant_3"],
        "bad_variant_ids": ["variant_5", "variant_10"]
      },
      "sample_2":
      {
        "good_variant_ids": ["variant_2", "variant_13"],
        "bad_variant_ids": ["variant_8", "variant_11"]
      }
    }
  where "good_variant_ids" and "bad_variant_ids" are lists of variant IDs corresponding to non-reference (i.e. het or hom-var) sample genotypes that are true positives and false positives, respectively. SL cutoffs are optimized by maximizing the F-score with "beta" parameter fmax_beta, which modulates the weight given to precision over recall (lower values give higher precision).
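To illustrate how the beta parameter shifts the chosen cutoff, the following Python sketch scans candidate SL cutoffs over a small set of labeled scores and keeps the one with the highest F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). It is not the workflow's implementation: the function names and the toy scores are hypothetical, and it optimizes a single cutoff rather than one per SV type and size.

```python
from typing import List, Optional, Tuple


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_cutoff(
    good_scores: List[float], bad_scores: List[float], beta: float = 0.4
) -> Tuple[Optional[float], float]:
    """Return (cutoff, score) maximizing F-beta over the observed SL values.

    A genotype is kept when its SL score is >= the cutoff, so "good"
    genotypes kept are true positives and "bad" ones kept are false positives.
    """
    best: Tuple[Optional[float], float] = (None, -1.0)
    for cutoff in sorted(set(good_scores) | set(bad_scores)):
        tp = sum(s >= cutoff for s in good_scores)
        fp = sum(s >= cutoff for s in bad_scores)
        if tp == 0:
            continue  # nothing retained; precision is undefined
        precision = tp / (tp + fp)
        recall = tp / len(good_scores)
        score = f_beta(precision, recall, beta)
        if score > best[1]:
            best = (cutoff, score)
    return best


# Toy SL scores (hypothetical) for true-positive and false-positive genotypes.
good = [120, 95, 60, 10]
bad = [80, 5, -20]

print(best_cutoff(good, bad, beta=0.4))  # favors precision: stricter cutoff
print(best_cutoff(good, bad, beta=1.0))  # balances precision and recall
```

On this toy data, beta = 0.4 selects a cutoff of 95 (no false positives retained), while beta = 1.0 selects 10 (all true positives retained), which matches the statement that lower beta values give higher precision.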
In both modes, the workflow additionally filters variants based on the "no-call rate", the proportion of genotypes that were filtered in a given variant. Variants exceeding the no_call_rate_cutoff are assigned a HIGH_NCR filter status.
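The no-call rate check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow's code: the helper names are hypothetical, a genotype counts as no-call only when every allele is missing, and "exceeding" is read as strictly greater than the cutoff (which is consistent with a cutoff of 1 disabling the filter).

```python
from typing import List


def is_no_call(gt: str) -> bool:
    """True when every allele in the genotype string is missing ('.')."""
    alleles = gt.replace("|", "/").split("/")
    return all(a == "." for a in alleles)


def no_call_rate(genotypes: List[str]) -> float:
    """Fraction of sample genotypes that are no-call for one variant."""
    if not genotypes:
        return 0.0
    return sum(is_no_call(gt) for gt in genotypes) / len(genotypes)


def apply_high_ncr(filters: List[str], genotypes: List[str], cutoff: float = 0.05) -> List[str]:
    """Return the FILTER values for a variant, adding HIGH_NCR if warranted."""
    if no_call_rate(genotypes) <= cutoff:
        return filters
    # Drop the PASS / missing placeholders before adding a failing filter.
    kept = [f for f in filters if f not in ("PASS", ".")]
    if "HIGH_NCR" not in kept:
        kept.append("HIGH_NCR")
    return kept


# 2 of 20 genotypes are no-call: rate 0.1 exceeds the default cutoff of 0.05.
gts = ["./."] * 2 + ["0/1"] * 8 + ["0/0"] * 10
print(no_call_rate(gts))
print(apply_high_ncr(["PASS"], gts))
```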
QC recommendations
We strongly recommend performing call set QC after this module. By default, QC plotting is enabled with the run_qc argument. Users should carefully inspect the main plots from the main_vcf_qc_tarball. Please see the MainVcfQc module documentation for more information on interpreting these plots and recommended QC criteria.
Inputs
Optional vcf
Input VCF generated from SVConcordance.
Optional output_prefix
Default: use input VCF filename. Prefix for the output VCF, such as the cohort name. May be alphanumeric with underscores.
ploidy_table
Table of sample ploidies generated in JoinRawCalls.
gq_recalibrator_model_file
GQ-Recalibrator model. A public model is listed as aou_recalibrate_gq_model_file here.
recalibrate_gq_args
Arguments to pass to the GQ recalibration tool. Users should leave this with the default configuration in Terra.
genome_tracks
Genome tracks for sequence context annotation. Users should leave this with the default configuration in Terra.
Optional no_call_rate_cutoff
Default: 0.05. Threshold fraction of samples that must have no-call genotypes in order to filter a variant. Set to 1 to disable.
Optional fmax_beta
Default: 0.4. If providing a truth set, defines the beta parameter for F-score optimization.
Optional truth_json
Truth labels for input variants. If provided, the workflow will attempt to optimize filtering cutoffs automatically using the F-score. If provided, sl_filter_args is ignored.
Optional sl_filter_args
Arguments for the SL filtering script. This should be used to set SL cutoffs for filtering (refer to description above). Overridden by truth_json.
Optional run_qc
Default: true. Enable running MainVcfQc automatically. By default, filtered variants will be excluded from the plots.
Optional optimize_vcf_records_per_shard
Default: 50000. Shard size for scattered cutoff optimization tasks. Decrease this if those steps are running slowly.
Optional filter_vcf_records_per_shard
Default: 20000. Shard size for scattered GQ recalibration tasks. Decrease this if those steps are running slowly.
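For users running outside Terra, the Python sketch below assembles a partial inputs JSON for the recommended mode. It assumes the common WDL convention of prefixing each input with the workflow name (taken here to be FilterGenotypes); the bucket paths for vcf and ploidy_table are hypothetical placeholders, and inputs that the descriptions above say to leave at their default configuration (recalibrate_gq_args, genome_tracks) are deliberately omitted, so the result is not a complete set of inputs.

```python
import json

WORKFLOW = "FilterGenotypes"  # assumed workflow name used as the key prefix

params = {
    # Hypothetical paths; replace with your own files.
    "vcf": "gs://my-bucket/my_cohort.concordance.vcf.gz",
    "ploidy_table": "gs://my-bucket/my_cohort.ploidy.tsv",
    "output_prefix": "my_cohort",
    # Public pre-trained model listed under "Model availability".
    "gq_recalibrator_model_file": (
        "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/"
        "gatk-sv-recalibrator.aou_phase_1.v1.model"
    ),
    # Recommended mode: explicit SL cutoffs (example values from the Modes section).
    "sl_filter_args": (
        "--small-del-threshold 93 --medium-del-threshold 150 "
        "--small-dup-threshold -51 --medium-dup-threshold -4 "
        "--ins-threshold -13 --inv-threshold -19"
    ),
    "no_call_rate_cutoff": 0.05,
    "run_qc": True,
}


def to_inputs_json(workflow: str, values: dict) -> str:
    """Prefix each key with the workflow name and serialize to JSON."""
    return json.dumps({f"{workflow}.{k}": v for k, v in values.items()}, indent=2)


inputs_json = to_inputs_json(WORKFLOW, params)
print(inputs_json)
```

Because truth_json overrides sl_filter_args, only one of the two should be set for a given run.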
Outputs
filtered_vcf
Filtered VCF.
Optional main_vcf_qc_tarball
QC plots generated with MainVcfQc. Only generated if using run_qc.
Optional vcf_optimization_table
Table of cutoff optimization metrics. Only generated if truth_json is provided.
Optional sl_cutoff_qc_tarball
Cutoff optimization and QC plots. Only generated if truth_json is provided.
unfiltered_recalibrated_vcf
Supplemental output of the VCF after assigning SL genotype scores but before applying filtering.