gnomad_qc.v4.annotations.generate_variant_qc_annotations
Script to generate annotations for variant QC on gnomAD v4.
usage: gnomad_qc.v4.annotations.generate_variant_qc_annotations.py
[-h] [--slack-channel SLACK_CHANNEL] [--overwrite] [--test-dataset]
[--test-n-partitions [TEST_N_PARTITIONS]] [--compute-info]
[--compute-info-split-n-alleles COMPUTE_INFO_SPLIT_N_ALLELES]
[--compute-info-over-split-n-alleles] [--combine-compute-info]
[--compute-info-n-partitions COMPUTE_INFO_N_PARTITIONS] [--retain-cdfs]
[--cdf-k CDF_K] [--split-info] [--export-info-vcf]
[--export-split-info-vcf] [--run-vep] [--validate-vep]
[--vep-version VEP_VERSION] [--generate-trio-stats]
[--releasable-trios-only] [--generate-sibling-stats]
[--create-variant-qc-annotation-ht] [--impute-features]
[--n-partitions N_PARTITIONS] [--export-true-positive-vcfs]
[--transmitted-singletons] [--sibling-singletons]
Named Arguments
- --slack-channel
Slack channel to post results and notifications to.
- --overwrite
Overwrite data
Default: False
- --test-dataset
Use the test dataset as input.
Default: False
- --test-n-partitions
Use only 2 partitions of the VDS as input for testing purposes.
- --split-info
Split info HT.
Default: False
- --export-info-vcf
Export info as VCF.
Default: False
- --export-split-info-vcf
Export split info as VCF.
Default: False
- --run-vep
Generate VEP annotations.
Default: False
- --validate-vep
Validate that variants in protein-coding genes are correctly annotated by VEP.
Default: False
- --vep-version
Version of VEPed context Table to use in vep_or_lookup_vep.
Default: “105”
- --generate-sibling-stats
Calculate sibling variant sharing stats.
Default: False
Compute info HT.
Arguments relevant to computing the info HT.
- --compute-info
Compute info HT.
Default: False
- --compute-info-split-n-alleles
Number of alleles at a site to filter results on. If --compute-info-over-split-n-alleles is used, the results are filtered to sites with greater than or equal to the value supplied; otherwise the results are filtered to sites with less than the value supplied. By default, no sites are filtered.
- --compute-info-over-split-n-alleles
Whether to filter to sites with greater than or equal to the value supplied to --compute-info-split-n-alleles. By default, sites are filtered to those with less than that value, or not filtered at all if --compute-info-split-n-alleles is not supplied.
Default: False
- --combine-compute-info
Whether to combine the output from running --compute-info --compute-info-split-n-alleles with and without the --compute-info-over-split-n-alleles flag.
Default: False
- --compute-info-n-partitions
Number of desired partitions for the info HT.
Default: 5000
- --retain-cdfs
If set, retains the cumulative distribution functions (CDFs) for all info annotations that are computed as a median aggregation. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage.
Default: False
- --cdf-k
Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time.
Default: 200
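The threshold behavior described for --compute-info-split-n-alleles and --compute-info-over-split-n-alleles can be sketched in plain Python. This is only an illustration of the documented rule, not the script's code; the function name keep_site and the example allele counts are hypothetical.

```python
from typing import Optional


def keep_site(n_alleles: int, split_n_alleles: Optional[int], over: bool = False) -> bool:
    """Return True if a site with `n_alleles` alleles passes the documented filter."""
    if split_n_alleles is None:
        # No threshold supplied: no sites are filtered.
        return True
    if over:
        # With --compute-info-over-split-n-alleles: keep sites at or above the threshold.
        return n_alleles >= split_n_alleles
    # Without the flag: keep sites strictly below the threshold.
    return n_alleles < split_n_alleles


# Hypothetical allele counts for four sites, with a threshold of 5.
sites = [2, 3, 5, 8]
under = [n for n in sites if keep_site(n, 5)]
over = [n for n in sites if keep_site(n, 5, over=True)]
```

Under this rule the two runs cover disjoint sets of sites that together make up the full input, which is consistent with --combine-compute-info merging the two outputs.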
Arguments used to generate trio stats.
- --generate-trio-stats
Calculate trio stats.
Default: False
- --releasable-trios-only
Only include releasable trios. This option is only valid when --generate-trio-stats is set.
Default: False
Variant QC annotation HT parameters
- --create-variant-qc-annotation-ht
Creates an annotated HT with features for variant QC.
Default: False
- --impute-features
If set, imputation is performed for variant QC features.
Default: False
- --n-partitions
Desired number of partitions for the variant QC annotation HT.
Default: 5000
Export true positive VCFs
Arguments used to define true positive variant set.
- --export-true-positive-vcfs
Exports true positive variants (--transmitted-singletons and/or --sibling-singletons) to VCF files.
Default: False
- --transmitted-singletons
Include transmitted singletons in the exports of true positive variants to VCF files.
Default: False
- --sibling-singletons
Include sibling singletons in the exports of true positive variants to VCF files.
Default: False
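As a quick way to see how the documented defaults behave, here is a minimal argparse sketch covering a handful of the flags above. It is a hypothetical reconstruction from this help text, not the script's actual parser: the use of store_true, the int types, and the example value 500 are all assumptions.

```python
import argparse

# Hypothetical reconstruction of a subset of the documented flags.
# Argument types and actions are assumptions inferred from the listed defaults.
parser = argparse.ArgumentParser(description="Sketch of a subset of the documented flags.")
parser.add_argument("--overwrite", action="store_true")
parser.add_argument("--compute-info", action="store_true")
parser.add_argument("--compute-info-split-n-alleles", type=int, default=None)
parser.add_argument("--compute-info-over-split-n-alleles", action="store_true")
parser.add_argument("--compute-info-n-partitions", type=int, default=5000)
parser.add_argument("--cdf-k", type=int, default=200)
parser.add_argument("--vep-version", default="105")

# Defaults when no flags are passed.
defaults = parser.parse_args([])

# Example invocation: compute info only for sites at or above an
# arbitrary, illustrative threshold of 500 alleles.
args = parser.parse_args(
    [
        "--compute-info",
        "--compute-info-split-n-alleles", "500",
        "--compute-info-over-split-n-alleles",
    ]
)
```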
Module Functions
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_METHODS = ['AS', 'quasi', 'set_long_AS_missing']
List of info methods computed for variant QC.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_FEATURES = ['AS_MQRankSum', 'AS_pab_max', 'AS_MQ', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SOR', 'AS_FS']
List of info features to be used for variant QC.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.NON_INFO_FEATURES = ['variant_type', 'allele_type', 'n_alt_alleles', 'was_mixed', 'has_star']
List of features to be used for variant QC that are not in the info field.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.TRUTH_DATA = ['hapmap', 'omni', 'mills', 'kgp_phase1_hc']
List of truth datasets to be used for variant QC.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.extract_as_pls(lpl_expr, allele_idx)
Extract PLs for a specific allele from an LPL array expression.
PL/LPL represents the normalized Phred-scaled likelihoods of the possible genotypes from all considered alleles (or local alleles).
If three alleles are considered, LPL genotype indexes are: [0/0, 0/1, 1/1, 0/2, 1/2, 2/2].
- If we want to extract the PLs for each alternate allele, we need to extract:
allele 1: [0/0, 0/1, 1/1]
allele 2: [0/0, 0/2, 2/2]
- Example:
LPL: [138, 98, 154, 26, 0, 14]
Extract allele 1 PLs: [0/0, 0/1, 1/1] -> [138, 98, 154]
Extract allele 2 PLs: [0/0, 0/2, 2/2] -> [138, 26, 14]
- Parameters:
lpl_expr (ArrayExpression) – LPL ArrayExpression.
allele_idx (Int32Expression) – The index of the alternate allele to extract PLs for.
- Return type:
ArrayExpression
- Returns:
ArrayExpression of PLs for the specified allele.
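The index arithmetic behind this extraction can be sketched on plain Python lists. This is an illustration only, not the Hail implementation; the helper name extract_allele_pls is hypothetical. It relies on the standard VCF genotype ordering, in which genotype a/b (with a <= b) sits at index b * (b + 1) / 2 + a.

```python
from typing import List


def extract_allele_pls(lpl: List[int], allele_idx: int) -> List[int]:
    """Return the [0/0, 0/i, i/i] PLs for alternate allele i = allele_idx (i >= 1)."""
    # Standard VCF ordering: genotype a/b (a <= b) is at index b * (b + 1) // 2 + a.
    het_idx = allele_idx * (allele_idx + 1) // 2  # position of 0/i
    hom_idx = het_idx + allele_idx                # position of i/i
    return [lpl[0], lpl[het_idx], lpl[hom_idx]]


# LPL from the documented example, ordered [0/0, 0/1, 1/1, 0/2, 1/2, 2/2].
lpl = [138, 98, 154, 26, 0, 14]
allele_1 = extract_allele_pls(lpl, 1)
allele_2 = extract_allele_pls(lpl, 2)
```

With the example LPL, allele 1 maps to indices 0, 1, 2 and allele 2 to indices 0, 3, 5, reproducing the values listed in the example.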
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.recompute_as_qualapprox_from_lpl(mt)
Recompute AS_QUALapprox from LPL.
QUALapprox is the (Phred-scaled) probability that all reads at the site are hom-ref, so QUALapprox is PL[0]. To get the QUALapprox for just one allele, pull out the PLs for just that allele, then normalize by subtracting the smallest element from all the entries (so the best genotype is 0) and then use the normalized PL[0] value for that allele’s QUALapprox.
Note
The first element of AS_QUALapprox is always None.
If the allele is a star allele, we set QUALapprox for that allele to 0.
If GQ == 0 and PL[0] for the allele == 1, we set QUALapprox for the allele to 0.
- Example:
- Starting Values:
alleles: ['G', '*', 'A', 'C', 'GCTT', 'GT', 'T']
LGT: 1/2
LA: [0, 1, 6]
LPL: [138, 98, 154, 26, 0, 14]
QUALapprox: 138
- Use extract_as_pls to get PLs for each allele:
allele 1: [138, 98, 154]
allele 2: [138, 26, 14]
- Normalize PLs by subtracting the smallest element from all the PLs:
allele 1: [138-98, 98-98, 154-98] -> [40, 0, 56]
allele 2: [138-14, 26-14, 14-14] -> [124, 12, 0]
Use the first element of the allele specific PLs to generate AS_QUALapprox: [None, 40, 124]
Set QUALapprox to 0 for the star allele: [None, 0, 124]
- Parameters:
mt (MatrixTable) – Input MatrixTable.
- Return type:
ArrayExpression
- Returns:
AS_QUALapprox ArrayExpression recomputed from LPL.
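The worked example can be reproduced with a small pure-Python sketch that operates on one entry's values rather than on a MatrixTable. The function name and signature are hypothetical, and the GQ rule is read here as applying to the normalized PL[0], which is an assumption about the wording of the note.

```python
from typing import List, Optional


def as_qualapprox_from_lpl(
    lpl: List[int], local_alleles: List[str], gq: Optional[int] = None
) -> List[Optional[int]]:
    """Sketch of recomputing AS_QUALapprox for one entry from its LPL."""
    result: List[Optional[int]] = [None]  # the first (reference) element is always None
    for i in range(1, len(local_alleles)):
        het_idx = i * (i + 1) // 2
        pls = [lpl[0], lpl[het_idx], lpl[het_idx + i]]  # [0/0, 0/i, i/i]
        qual = pls[0] - min(pls)  # normalized PL[0] for this allele
        if local_alleles[i] == "*":
            qual = 0  # star alleles get QUALapprox 0
        elif gq == 0 and qual == 1:
            qual = 0  # assumed reading of the GQ == 0 and PL[0] == 1 rule
        result.append(qual)
    return result


# Values from the documented example: LA [0, 1, 6] selects 'G', '*', 'T'.
alleles = ["G", "*", "A", "C", "GCTT", "GT", "T"]
la = [0, 1, 6]
local_alleles = [alleles[i] for i in la]
as_qualapprox = as_qualapprox_from_lpl([138, 98, 154, 26, 0, 14], local_alleles)
```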
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.correct_as_annotations(mt, set_to_missing=False)
Correct allele specific annotations that are longer than the length of LA.
For some entries in the MatrixTable, the following annotations are longer than LA, when they should be the same length as LA:
AS_SB_TABLE
AS_RAW_MQ
AS_RAW_ReadPosRankSum
AS_RAW_MQRankSum
This function corrects these annotations by either dropping the alternate allele with the index corresponding to the min value of AS_RAW_MQ, or setting them to missing if set_to_missing is True.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
set_to_missing (bool) – Whether to set the annotations to missing instead of correcting them.
- Return type:
StructExpression
- Returns:
StructExpression with corrected allele specific annotations.
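One possible reading of this correction, sketched on plain Python lists for a single annotation. The helper name is hypothetical, and the sketch assumes that index 0 of each array is the reference allele and that an over-long array has exactly one extra alternate-allele element; neither detail is confirmed by the description above.

```python
from typing import List, Optional, Sequence


def correct_long_annotation(
    values: Sequence,
    as_raw_mq: Sequence[float],
    n_local_alleles: int,
    set_to_missing: bool = False,
) -> Optional[List]:
    """Sketch: fix an allele-specific annotation that is longer than LA."""
    if len(values) <= n_local_alleles:
        return list(values)  # already the expected length, nothing to correct
    if set_to_missing:
        return None  # use a missing value instead of correcting
    # Assumption: index 0 is the reference, so search alternate alleles only.
    drop_idx = min(range(1, len(as_raw_mq)), key=lambda i: as_raw_mq[i])
    return [v for i, v in enumerate(values) if i != drop_idx]


# Hypothetical entry with 3 local alleles but 4-element annotations.
as_raw_mq = [100.0, 5.0, 90.0, 80.0]
corrected = correct_long_annotation([10, 20, 30, 40], as_raw_mq, n_local_alleles=3)
missing = correct_long_annotation([10, 20, 30, 40], as_raw_mq, 3, set_to_missing=True)
```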
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_compute_info(mt, max_n_alleles=None, min_n_alleles=None, retain_cdfs=False, cdf_k=200)
Run compute info on a MatrixTable.
Note
Adds a fix for AS_QUALapprox by recomputing from LPL because some were found to have different lengths than LA.
- Creates a Table with three different methods of computing info annotations:
quasi_info: Compute info annotations using the quasi-allele specific method defined in default_compute_info.
AS_info: Compute info annotations using aggregation of the allele specific annotations in ‘gvcf_info’ after recomputing AS_QUALapprox from LPL, and fixing the length of AS_SB_TABLE, AS_RAW_MQ, AS_RAW_ReadPosRankSum and AS_RAW_MQRankSum.
set_long_AS_missing_info: Compute info annotations using aggregation of the allele specific annotations in ‘gvcf_info’ after setting AS_SB_TABLE, AS_RAW_MQ, AS_RAW_ReadPosRankSum and AS_RAW_MQRankSum to missing if they have the incorrect length.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
max_n_alleles (Optional[int]) – Maximum number of alleles for the site to be included in computations.
min_n_alleles (Optional[int]) – Minimum number of alleles for the site to be included in computations.
retain_cdfs (bool) – If True, retains the cumulative distribution functions (CDFs) for all info annotations that are computed as a median aggregation. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.
cdf_k (int) – Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.
- Return type:
Table
- Returns:
Table with info annotations.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_reformatted_info_fields(ht, info_method=None)
Reformat ht info annotations to contain all expected info fields.
- Parameters:
ht (Table) – Input Table.
info_method (Optional[str]) – Shorthand name of method used to compute the info annotation that should be reformatted. Choices are ‘AS’, ‘quasi’, or ‘set_long_AS_missing’. If None, all info annotations will be reformatted.
- Return type:
Table
- Returns:
Table with reformatted info annotations.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_info_ht_for_vcf_export(ht, info_method)
Get info HT for VCF export.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.split_info(info_ht)
Generate an info Table with split multi-allelic sites from the multi-allelic info Table.
Note
gnomad_methods’ annotate_allele_info splits multi-allelic sites before the info annotation is split to ensure that all sites in the returned Table are annotated with allele info.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_trio_stats(vds, fam_ped, fam_ht, releasable_only=False)
Generate trio transmission stats from a VariantDataset and pedigree info.
- Parameters:
vds (VariantDataset) – VariantDataset to generate trio stats from.
fam_ped (Pedigree) – Pedigree containing trio info.
fam_ht (Table) – Table containing trio info.
releasable_only (bool) – Whether to only include releasable trios. Releasable trios are those where all three samples (proband, maternal, and paternal) are marked as ‘releasable’.
- Return type:
Table
- Returns:
Table containing trio stats.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_sib_stats(mt, rel_ht)
Generate stats for the number of alternate alleles in common between sibling pairs.
- Parameters:
mt (MatrixTable) – MatrixTable to generate sibling stats from.
rel_ht (Table) – Table containing relatedness info for pairs in mt.
- Return type:
Table
- Returns:
Table containing sibling stats.
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.create_variant_qc_annotation_ht(info_ht, trio_stats_ht, sib_stats_ht, impute_features=True, n_partitions=5000)
Create a Table with all necessary annotations for variant QC.
Annotations that are included:
- Features for RF:
variant_type
allele_type
n_alt_alleles
has_star
AS_QD
AS_pab_max
AS_MQRankSum
AS_SOR
AS_ReadPosRankSum
- Training sites (bool):
transmitted_singleton
sibling_singleton
fail_hard_filters - (ht.QD < 2) | (ht.FS > 60) | (ht.MQ < 30)
- Parameters:
info_ht (Table) – Info Table with split multi-allelics.
trio_stats_ht (Table) – Table with trio statistics.
sib_stats_ht (Table) – Table with sibling statistics.
impute_features (bool) – Whether to impute features using feature medians (this is done by variant type).
n_partitions (int) – Number of partitions to use for final annotated Table.
- Return type:
Table
- Returns:
Hail Table with all annotations needed for variant QC.
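Two pieces of this description lend themselves to a small illustration: the fail_hard_filters expression and median imputation by variant type. The sketch below uses plain Python dictionaries instead of a Hail Table; the helper names and example rows are hypothetical, and missing-value handling in Hail may differ from this simplified version.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def fail_hard_filters(qd: float, fs: float, mq: float) -> bool:
    """Mirror of the documented expression (QD < 2) | (FS > 60) | (MQ < 30)."""
    return qd < 2 or fs > 60 or mq < 30


def impute_by_variant_type(rows: List[Dict], feature: str) -> List[Dict]:
    """Fill missing values of `feature` with the median for the row's variant_type."""
    observed = defaultdict(list)
    for row in rows:
        if row[feature] is not None:
            observed[row["variant_type"]].append(row[feature])
    medians = {vt: median(vals) for vt, vals in observed.items()}
    return [
        {**row, feature: medians.get(row["variant_type"])}
        if row[feature] is None
        else dict(row)
        for row in rows
    ]


# Hypothetical rows: one SNV is missing AS_QD and receives the SNV median.
rows = [
    {"variant_type": "snv", "AS_QD": 10.0},
    {"variant_type": "snv", "AS_QD": None},
    {"variant_type": "snv", "AS_QD": 20.0},
    {"variant_type": "indel", "AS_QD": 4.0},
]
imputed = impute_by_variant_type(rows, "AS_QD")
```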
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_tp_ht_for_vcf_export(ht, transmitted_singletons=False, sibling_singletons=False)
Get Tables with raw and adj true positive variants to export as a VCF for use in VQSR.
- Parameters:
ht (Table) – Input Table with transmitted singleton and sibling singleton information.
transmitted_singletons (bool) – Whether to include transmitted singletons in the true positive variants.
sibling_singletons (bool) – Whether to include sibling singletons in the true positive variants.
- Return type:
Dict[str, Table]
- Returns:
Dictionary of ‘raw’ and ‘adj’ true positive variant sites Tables.
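The shape of the returned dictionary can be illustrated with a plain-Python sketch that filters a list of rows once per genotype group. The helper name and the per-group field names (for example transmitted_singleton_raw) are hypothetical and chosen for illustration; the actual Table schema is not described here.

```python
from typing import Dict, List


def select_true_positives(
    rows: List[Dict],
    transmitted_singletons: bool = False,
    sibling_singletons: bool = False,
) -> Dict[str, List[Dict]]:
    """Sketch: build 'raw' and 'adj' true positive site lists from flagged rows."""
    result = {}
    for group in ("raw", "adj"):
        # Field names below are hypothetical stand-ins for the real annotations.
        result[group] = [
            row
            for row in rows
            if (transmitted_singletons and row[f"transmitted_singleton_{group}"])
            or (sibling_singletons and row[f"sibling_singleton_{group}"])
        ]
    return result


# Hypothetical sites with per-group singleton flags.
rows = [
    {"site": "a", "transmitted_singleton_raw": True, "transmitted_singleton_adj": True,
     "sibling_singleton_raw": False, "sibling_singleton_adj": False},
    {"site": "b", "transmitted_singleton_raw": False, "transmitted_singleton_adj": False,
     "sibling_singleton_raw": True, "sibling_singleton_adj": False},
]
ts_only = select_true_positives(rows, transmitted_singletons=True)
both = select_true_positives(rows, transmitted_singletons=True, sibling_singletons=True)
```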
- gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_variant_qc_annotation_resources(test, overwrite, over_n_alleles=None, combine_compute_info=False, true_positive_type=None, releasable_trios_only=False)
Get PipelineResourceCollection for all resources needed in the variant QC annotation pipeline.
- Parameters:
test (bool) – Whether to gather all resources for testing.
overwrite (bool) – Whether to overwrite resources if they exist.
over_n_alleles (Optional[bool]) – Whether to use a temporary info TableResource for results. When True, use a temporary info TableResource for only sites that have more alleles than the value passed to --compute-info-split-n-alleles. When False, use a temporary info TableResource for only sites with fewer alleles. When None, the finalized info HT is used instead of a temporary location. Default is None.
combine_compute_info (bool) – Whether the input for --compute-info should be the two temporary files (with and without the --compute-info-over-split-n-alleles flag) produced by running --compute-info with --compute-info-split-n-alleles.
true_positive_type (Optional[str]) – Type of true positive variants to use for true positive VCF path resource. Default is None.
releasable_trios_only (bool) – Whether to only include releasable trios in the trio stats.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the variant QC annotation pipeline.