gnomad_qc.v4.annotations.generate_variant_qc_annotations

Script to generate annotations for variant QC on gnomAD v4.

usage: gnomad_qc.v4.annotations.generate_variant_qc_annotations.py
       [-h] [--slack-channel SLACK_CHANNEL] [--overwrite] [--test-dataset]
       [--test-n-partitions [TEST_N_PARTITIONS]] [--compute-info]
       [--compute-info-split-n-alleles COMPUTE_INFO_SPLIT_N_ALLELES]
       [--compute-info-over-split-n-alleles] [--combine-compute-info]
       [--compute-info-n-partitions COMPUTE_INFO_N_PARTITIONS] [--retain-cdfs]
       [--cdf-k CDF_K] [--split-info] [--export-info-vcf]
       [--export-split-info-vcf] [--run-vep] [--validate-vep]
       [--vep-version VEP_VERSION] [--generate-trio-stats]
       [--releasable-trios-only] [--generate-sibling-stats]
       [--create-variant-qc-annotation-ht] [--impute-features]
       [--n-partitions N_PARTITIONS] [--export-true-positive-vcfs]
       [--transmitted-singletons] [--sibling-singletons]

Named Arguments

--slack-channel

Slack channel to post results and notifications to.

--overwrite

Overwrite existing data.

Default: False

--test-dataset

Use the test dataset as input.

Default: False

--test-n-partitions

Use only 2 partitions of the VDS as input for testing purposes.

--split-info

Split info HT.

Default: False

--export-info-vcf

Export info as VCF.

Default: False

--export-split-info-vcf

Export split info as VCF.

Default: False

--run-vep

Generate VEP annotations.

Default: False

--validate-vep

Validate that variants in protein-coding genes are correctly annotated by VEP.

Default: False

--vep-version

Version of VEPed context Table to use in vep_or_lookup_vep.

Default: “105”

--generate-sibling-stats

Calculate sibling variant sharing stats.

Default: False

Compute info HT.

Arguments relevant to computing the info HT.

--compute-info

Compute info HT.

Default: False

--compute-info-split-n-alleles

Number of alleles at a site to filter results on. If --compute-info-over-split-n-alleles is used, the results are filtered to sites with greater than or equal to the value supplied; otherwise, the results are filtered to sites with fewer alleles than the value supplied. By default, no sites are filtered.

--compute-info-over-split-n-alleles

Whether to filter to sites with greater than or equal to the number of alleles supplied to --compute-info-split-n-alleles. By default, sites are filtered to those with fewer alleles than that value, or not filtered at all if --compute-info-split-n-alleles is not supplied.

Default: False

--combine-compute-info

Whether to combine the outputs of running --compute-info --compute-info-split-n-alleles with and without the --compute-info-over-split-n-alleles flag.

Default: False
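The allele-count split described for --compute-info-split-n-alleles and --compute-info-over-split-n-alleles can be sketched in plain Python. This is only an illustration of the documented filtering rule, not the script's Hail code; the function name keep_site and the example allele counts are hypothetical.

```python
from typing import Optional


def keep_site(n_alleles: int, split_n_alleles: Optional[int], over: bool = False) -> bool:
    """Return True if a site with `n_alleles` alleles passes the filter (hypothetical helper)."""
    if split_n_alleles is None:
        # No threshold supplied: no sites are filtered.
        return True
    if over:
        # With the "over" flag: keep sites at or above the threshold.
        return n_alleles >= split_n_alleles
    # Default: keep sites strictly below the threshold.
    return n_alleles < split_n_alleles


# Hypothetical allele counts for five sites and a hypothetical threshold.
site_n_alleles = [2, 3, 5, 8, 2]
threshold = 5

under = [n for n in site_n_alleles if keep_site(n, threshold)]
over = [n for n in site_n_alleles if keep_site(n, threshold, over=True)]
```

Because the two runs use complementary conditions, every site lands in exactly one of the two subsets, which is what makes a later combining step (--combine-compute-info) possible.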

--compute-info-n-partitions

Number of desired partitions for the info HT.

Default: 5000

--retain-cdfs

If True, retains the cumulative distribution functions (CDFs) for all info annotations that are computed as a median aggregation. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.

Default: False

--cdf-k

Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.

Default: 200

Arguments used to generate trio stats.

--generate-trio-stats

Calculate trio stats.

Default: False

--releasable-trios-only

Only include releasable trios. This option is only valid when --generate-trio-stats is set.

Default: False

Variant QC annotation HT parameters

--create-variant-qc-annotation-ht

Creates an annotated HT with features for variant QC.

Default: False

--impute-features

If set, imputation is performed for variant QC features.

Default: False

--n-partitions

Desired number of partitions for variant QC annotation HT.

Default: 5000

Export true positive VCFs

Arguments used to define true positive variant set.

--export-true-positive-vcfs

Exports true positive variants (--transmitted-singletons and/or --sibling-singletons) to VCF files.

Default: False

--transmitted-singletons

Include transmitted singletons in the exports of true positive variants to VCF files.

Default: False

--sibling-singletons

Include sibling singletons in the exports of true positive variants to VCF files.

Default: False

Module Functions

gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_METHODS

List of info methods computed for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_FEATURES

List of info features to be used for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.NON_INFO_FEATURES

List of features to be used for variant QC that are not in the info field.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.TRUTH_DATA

List of truth datasets to be used for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.extract_as_pls(...)

Extract PLs for a specific allele from an LPL array expression.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.recompute_as_qualapprox_from_lpl(mt)

Recompute AS_QUALapprox from LPL.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.correct_as_annotations(mt)

Correct allele specific annotations that are longer than the length of LA.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_compute_info(mt)

Run compute info on a MatrixTable.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_reformatted_info_fields(ht)

Reformat ht info annotations to contain all expected info fields.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_info_ht_for_vcf_export(ht, ...)

Get info HT for VCF export.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.split_info(info_ht)

Generate an info Table with split multi-allelic sites from the multi-allelic info Table.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_trio_stats(...)

Generate trio transmission stats from a VariantDataset and pedigree info.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_sib_stats(mt, ...)

Generate stats for the number of alternate alleles in common between sibling pairs.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.create_variant_qc_annotation_ht(...)

Create a Table with all necessary annotations for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_tp_ht_for_vcf_export(ht)

Get Tables with raw and adj true positive variants to export as a VCF for use in VQSR.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_variant_qc_annotation_resources(...)

Get PipelineResourceCollection for all resources needed in the variant QC annotation pipeline.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.main(args)

Generate all variant annotations needed for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_METHODS = ['AS', 'quasi', 'set_long_AS_missing']

List of info methods computed for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.INFO_FEATURES = ['AS_MQRankSum', 'AS_pab_max', 'AS_MQ', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SOR', 'AS_FS']

List of info features to be used for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.NON_INFO_FEATURES = ['variant_type', 'allele_type', 'n_alt_alleles', 'was_mixed', 'has_star']

List of features to be used for variant QC that are not in the info field.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.TRUTH_DATA = ['hapmap', 'omni', 'mills', 'kgp_phase1_hc']

List of truth datasets to be used for variant QC.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.extract_as_pls(lpl_expr, allele_idx)

Extract PLs for a specific allele from an LPL array expression.

PL/LPL represents the normalized Phred-scaled likelihoods of the possible genotypes from all considered alleles (or local alleles).

If three alleles are considered, LPL genotype indexes are: [0/0, 0/1, 1/1, 0/2, 1/2, 2/2].

If we want to extract the PLs for each alternate allele, we need to extract:
  • allele 1: [0/0, 0/1, 1/1]

  • allele 2: [0/0, 0/2, 2/2]

Example:
  • LPL: [138, 98, 154, 26, 0, 14]

  • Extract allele 1 PLs: [0/0, 0/1, 1/1] -> [138, 98, 154]

  • Extract allele 2 PLs: [0/0, 0/2, 2/2] -> [138, 26, 14]

Parameters:
  • lpl_expr – LPL array expression.

  • allele_idx – Index of the allele to extract PLs for.

Return type:

ArrayExpression

Returns:

ArrayExpression of PLs for the specified allele.
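The extraction described above can be reproduced on plain Python lists. This is a minimal sketch, not the Hail implementation: it assumes the standard VCF ordering of unphased diploid genotypes, where genotype a/b with a <= b sits at index b*(b+1)/2 + a. The names diploid_gt_index and extract_as_pls_sketch are hypothetical.

```python
from typing import List, Sequence


def diploid_gt_index(a: int, b: int) -> int:
    """Index of the unphased genotype a/b in VCF PL ordering."""
    lo, hi = sorted((a, b))
    return hi * (hi + 1) // 2 + lo


def extract_as_pls_sketch(lpl: Sequence[int], allele_idx: int) -> List[int]:
    """Return the PLs for 0/0, 0/allele_idx and allele_idx/allele_idx."""
    genotypes = [(0, 0), (0, allele_idx), (allele_idx, allele_idx)]
    return [lpl[diploid_gt_index(a, b)] for a, b in genotypes]


lpl = [138, 98, 154, 26, 0, 14]
allele_1_pls = extract_as_pls_sketch(lpl, 1)  # [138, 98, 154]
allele_2_pls = extract_as_pls_sketch(lpl, 2)  # [138, 26, 14]
```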

gnomad_qc.v4.annotations.generate_variant_qc_annotations.recompute_as_qualapprox_from_lpl(mt)

Recompute AS_QUALapprox from LPL.

QUALapprox is the (Phred-scaled) probability that all reads at the site are hom-ref, so QUALapprox is PL[0]. To get the QUALapprox for just one allele, pull out the PLs for just that allele, then normalize by subtracting the smallest element from all the entries (so the best genotype is 0) and then use the normalized PL[0] value for that allele’s QUALapprox.

Note

  • The first element of AS_QUALapprox is always None.

  • If the allele is a star allele, we set QUALapprox for that allele to 0.

  • If GQ == 0 and PL[0] for the allele == 1, we set QUALapprox for the allele to 0.

Example:
Starting Values:
  • alleles: [‘G’, ‘*’, ‘A’, ‘C’, ‘GCTT’, ‘GT’, ‘T’]

  • LGT: 1/2

  • LA: [0, 1, 6]

  • LPL: [138, 98, 154, 26, 0, 14]

  • QUALapprox: 138

Use extract_as_pls to get PLs for each allele:
  • allele 1: [138, 98, 154]

  • allele 2: [138, 26, 14]

Normalize PLs by subtracting the smallest element from all the PLs:
  • allele 1: [138-98, 98-98, 154-98] -> [40, 0, 56]

  • allele 2: [138-14, 26-14, 14-14] -> [124, 12, 0]

Use the first element of the allele specific PLs to generate AS_QUALapprox: [None, 40, 124]

Set QUALapprox to 0 for the star allele: [None, 0, 124]

Parameters:

mt (MatrixTable) – Input MatrixTable.

Return type:

ArrayExpression

Returns:

AS_QUALapprox ArrayExpression recomputed from LPL.
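The worked example above can be followed step by step for a single entry using plain Python lists. This is a sketch under stated assumptions rather than the Hail code: the function name is hypothetical, genotype indexes follow the standard VCF PL ordering, and the GQ == 0 special case from the note is deliberately not modelled.

```python
from typing import List, Optional, Sequence


def recompute_as_qualapprox_sketch(
    alleles: Sequence[str], la: Sequence[int], lpl: Sequence[int]
) -> List[Optional[int]]:
    """Recompute per-allele QUALapprox for one entry (hypothetical helper)."""
    result: List[Optional[int]] = [None]  # The first element is always None.
    for local_idx in range(1, len(la)):
        if alleles[la[local_idx]] == "*":
            # Star alleles get a QUALapprox of 0.
            result.append(0)
            continue
        # PLs for 0/0, 0/i and i/i in VCF genotype ordering.
        het_idx = local_idx * (local_idx + 1) // 2
        hom_idx = het_idx + local_idx
        pls = [lpl[0], lpl[het_idx], lpl[hom_idx]]
        # Normalize so the best genotype is 0, then take PL[0].
        result.append(pls[0] - min(pls))
    return result


alleles = ["G", "*", "A", "C", "GCTT", "GT", "T"]
la = [0, 1, 6]
lpl = [138, 98, 154, 26, 0, 14]
as_qualapprox = recompute_as_qualapprox_sketch(alleles, la, lpl)
```

For these inputs the sketch yields [None, 0, 124], matching the final line of the example.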

gnomad_qc.v4.annotations.generate_variant_qc_annotations.correct_as_annotations(mt, set_to_missing=False)

Correct allele specific annotations that are longer than the length of LA.

For some entries in the MatrixTable, the following annotations are longer than LA, when they should be the same length as LA:

  • AS_SB_TABLE

  • AS_RAW_MQ

  • AS_RAW_ReadPosRankSum

  • AS_RAW_MQRankSum

This function corrects these annotations by either dropping the alternate allele with the index corresponding to the min value of AS_RAW_MQ, or setting them to missing if set_to_missing is True.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • set_to_missing (bool) – Whether to set the annotations to missing instead of correcting them.

Return type:

StructExpression

Returns:

StructExpression with corrected allele specific annotations.
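The correction can be illustrated on plain Python lists. This sketch rests on several assumptions that the description above does not confirm: all four annotations are over-long together and by exactly one element, index 0 holds the reference allele, and the minimum of AS_RAW_MQ is searched over alternate positions only. The function name and the example values are hypothetical.

```python
from typing import Dict, List, Optional

AS_FIELDS = ["AS_SB_TABLE", "AS_RAW_MQ", "AS_RAW_ReadPosRankSum", "AS_RAW_MQRankSum"]


def correct_as_annotations_sketch(
    la: List[int], annotations: Dict[str, list], set_to_missing: bool = False
) -> Dict[str, Optional[list]]:
    """Trim or blank allele-specific annotations that are longer than LA."""
    expected = len(la)
    if all(len(annotations[f]) == expected for f in AS_FIELDS):
        # Lengths already agree with LA: nothing to correct.
        return dict(annotations)
    if set_to_missing:
        # Alternative behaviour: set the over-long annotations to missing.
        return {f: None for f in AS_FIELDS}
    mq = annotations["AS_RAW_MQ"]
    # Assumption: only alternate positions (index >= 1) are candidates.
    drop = min(range(1, len(mq)), key=lambda i: mq[i])
    return {
        f: [v for i, v in enumerate(annotations[f]) if i != drop] for f in AS_FIELDS
    }


la = [0, 1, 2]
entry = {
    "AS_SB_TABLE": [[10, 12], [3, 4], [0, 0], [5, 6]],
    "AS_RAW_MQ": [1000.0, 900.0, 50.0, 800.0],
    "AS_RAW_ReadPosRankSum": [None, 0.1, 0.0, -0.2],
    "AS_RAW_MQRankSum": [None, 0.3, 0.0, 0.4],
}
corrected = correct_as_annotations_sketch(la, entry)
blanked = correct_as_annotations_sketch(la, entry, set_to_missing=True)
```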

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_compute_info(mt, max_n_alleles=None, min_n_alleles=None, retain_cdfs=False, cdf_k=200)

Run compute info on a MatrixTable.

Note

Adds a fix for AS_QUALapprox by recomputing from LPL because some were found to have different lengths than LA.

Creates a Table with three different methods of computing info annotations:
  • quasi_info: Compute info annotations using the quasi-allele specific method defined in default_compute_info.

  • AS_info: Compute info annotations using aggregation of the allele specific annotations in ‘gvcf_info’ after recomputing AS_QUALapprox from LPL, and fixing the length of AS_SB_TABLE, AS_RAW_MQ, AS_RAW_ReadPosRankSum and AS_RAW_MQRankSum.

  • set_long_AS_missing_info: Compute info annotations using aggregation of the allele specific annotations in ‘gvcf_info’ after setting AS_SB_TABLE, AS_RAW_MQ, AS_RAW_ReadPosRankSum and AS_RAW_MQRankSum to missing if they have the incorrect length.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • max_n_alleles (Optional[int]) – Maximum number of alleles for the site to be included in computations.

  • min_n_alleles (Optional[int]) – Minimum number of alleles for the site to be included in computations.

  • retain_cdfs (bool) – If True, retains the cumulative distribution functions (CDFs) for all info annotations that are computed as a median aggregation. Keeping the CDFs is useful for annotations that require calculating the median across combined datasets at a later stage. Default is False.

  • cdf_k (int) – Parameter controlling the accuracy vs. memory usage tradeoff when retaining CDFs. A higher value of cdf_k results in a more accurate CDF approximation but increases memory usage and computation time. Default is 200.

Return type:

Table

Returns:

Table with info annotations.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_reformatted_info_fields(ht, info_method=None)

Reformat ht info annotations to contain all expected info fields.

Parameters:
  • ht (Table) – Input Table.

  • info_method (Optional[str]) – Shorthand name of method used to compute the info annotation that should be reformatted. Choices are ‘AS’, ‘quasi’, or ‘set_long_AS_missing’. If None, all info annotations will be reformatted.

Return type:

Table

Returns:

Table with reformatted info annotations.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_info_ht_for_vcf_export(ht, info_method)

Get info HT for VCF export.

Parameters:
  • ht (Table) – Input info HT.

  • info_method (str) – Info method to use. One of ‘AS’, ‘quasi’, or ‘set_long_AS_missing’.

Return type:

Table

Returns:

Info HT for VCF export.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.split_info(info_ht)

Generate an info Table with split multi-allelic sites from the multi-allelic info Table.

Note

gnomad_methods’ annotate_allele_info splits multi-allelic sites before the info annotation is split to ensure that all sites in the returned Table are annotated with allele info.

Parameters:

info_ht (Table) – Info Table with unsplit multi-allelics.

Return type:

Table

Returns:

Info Table with split multi-allelics.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_trio_stats(vds, fam_ped, fam_ht, releasable_only=False)

Generate trio transmission stats from a VariantDataset and pedigree info.

Parameters:
  • vds (VariantDataset) – VariantDataset to generate trio stats from.

  • fam_ped (Pedigree) – Pedigree containing trio info.

  • fam_ht (Table) – Table containing trio info.

  • releasable_only (bool) – Whether to only include releasable trios. Releasable trios are those where all three samples (proband, maternal, and paternal) are marked as ‘releasable’.

Return type:

Table

Returns:

Table containing trio stats.
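The releasable-trio rule (all three members must be marked releasable) can be sketched on plain Python data. The Trio type, the sample IDs, and the releasable lookup are hypothetical stand-ins for the pedigree and sample metadata the function actually receives.

```python
from typing import Dict, List, NamedTuple


class Trio(NamedTuple):
    """Hypothetical trio record holding three sample IDs."""

    proband: str
    father: str
    mother: str


def releasable_trios(trios: List[Trio], releasable: Dict[str, bool]) -> List[Trio]:
    """Keep only trios in which all three samples are marked releasable."""
    return [t for t in trios if all(releasable.get(s, False) for s in t)]


trios = [Trio("p1", "f1", "m1"), Trio("p2", "f2", "m2")]
releasable = {"p1": True, "f1": True, "m1": True, "p2": True, "f2": False, "m2": True}
kept = releasable_trios(trios, releasable)
```

Here the second trio is excluded because one parent is not releasable.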

gnomad_qc.v4.annotations.generate_variant_qc_annotations.run_generate_sib_stats(mt, rel_ht)

Generate stats for the number of alternate alleles in common between sibling pairs.

Parameters:
  • mt (MatrixTable) – MatrixTable to generate sibling stats from.

  • rel_ht (Table) – Table containing relatedness info for pairs in mt.

Return type:

Table

Returns:

Table containing sibling stats.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.create_variant_qc_annotation_ht(info_ht, trio_stats_ht, sib_stats_ht, impute_features=True, n_partitions=5000)

Create a Table with all necessary annotations for variant QC.

Annotations that are included:

Features for RF:
  • variant_type

  • allele_type

  • n_alt_alleles

  • has_star

  • AS_QD

  • AS_pab_max

  • AS_MQRankSum

  • AS_SOR

  • AS_ReadPosRankSum

Training sites (bool):
  • transmitted_singleton

  • sibling_singleton

  • fail_hard_filters - (ht.QD < 2) | (ht.FS > 60) | (ht.MQ < 30)

Parameters:
  • info_ht (Table) – Info Table with split multi-allelics.

  • trio_stats_ht (Table) – Table with trio statistics.

  • sib_stats_ht (Table) – Table with sibling statistics.

  • impute_features (bool) – Whether to impute features using feature medians (this is done by variant type).

  • n_partitions (int) – Number of partitions to use for final annotated Table.

Return type:

Table

Returns:

Hail Table with all annotations needed for variant QC.
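Two of the steps described above, the fail_hard_filters expression and median imputation by variant type, can be sketched with plain Python. This is an illustration only, not the Hail code: rows are modelled as dictionaries, missing values as None, and the helper names and example values are hypothetical.

```python
from statistics import median
from typing import Dict, List


def fail_hard_filters(qd: float, fs: float, mq: float) -> bool:
    """Mirror of (QD < 2) | (FS > 60) | (MQ < 30) for scalar values."""
    return qd < 2 or fs > 60 or mq < 30


def impute_by_variant_type(rows: List[dict], feature: str) -> List[dict]:
    """Fill missing feature values with the median of the same variant type."""
    observed: Dict[str, List[float]] = {}
    for row in rows:
        if row[feature] is not None:
            observed.setdefault(row["variant_type"], []).append(row[feature])
    medians = {vt: median(vals) for vt, vals in observed.items()}
    return [
        {**row, feature: medians.get(row["variant_type"])}
        if row[feature] is None
        else dict(row)
        for row in rows
    ]


rows = [
    {"variant_type": "snv", "AS_QD": 10.0},
    {"variant_type": "snv", "AS_QD": 20.0},
    {"variant_type": "snv", "AS_QD": None},
    {"variant_type": "indel", "AS_QD": 4.0},
    {"variant_type": "indel", "AS_QD": None},
]
imputed = impute_by_variant_type(rows, "AS_QD")
```

Computing medians per variant type keeps, for example, indel values from influencing the fill value used for SNVs.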

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_tp_ht_for_vcf_export(ht, transmitted_singletons=False, sibling_singletons=False)

Get Tables with raw and adj true positive variants to export as a VCF for use in VQSR.

Parameters:
  • ht (Table) – Input Table with transmitted singleton and sibling singleton information.

  • transmitted_singletons (bool) – Whether to include transmitted singletons in the true positive variants.

  • sibling_singletons (bool) – Whether to include sibling singletons in the true positive variants.

Return type:

Dict[str, Table]

Returns:

Dictionary of ‘raw’ and ‘adj’ true positive variant sites Tables.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.get_variant_qc_annotation_resources(test, overwrite, over_n_alleles=None, combine_compute_info=False, true_positive_type=None, releasable_trios_only=False)

Get PipelineResourceCollection for all resources needed in the variant QC annotation pipeline.

Parameters:
  • test (bool) – Whether to gather all resources for testing.

  • overwrite (bool) – Whether to overwrite resources if they exist.

  • over_n_alleles (Optional[bool]) – Whether to use a temporary info TableResource for results. When True, use a temporary info TableResource for only sites with at least as many alleles as the value passed to --compute-info-split-n-alleles. When False, use a temporary info TableResource for only sites with fewer alleles. When None, the finalized info HT is used instead of a temporary location. Default is None.

  • combine_compute_info (bool) – Whether the input for --compute-info should be the two temporary files (with and without the --compute-info-over-split-n-alleles flag) produced by running --compute-info with --compute-info-split-n-alleles.

  • true_positive_type (Optional[str]) – Type of true positive variants to use for true positive VCF path resource. Default is None.

  • releasable_trios_only (bool) – Whether to only include releasable trios in the trio stats.

Return type:

PipelineResourceCollection

Returns:

PipelineResourceCollection containing resources for all steps of the variant QC annotation pipeline.

gnomad_qc.v4.annotations.generate_variant_qc_annotations.main(args)

Generate all variant annotations needed for variant QC.