gnomad.utils.vep

gnomad.utils.vep.CURRENT_VEP_VERSION

Versions of VEP used in gnomAD data; the latest version is 105.

gnomad.utils.vep.CSQ_CODING

Constant containing all coding consequences.

gnomad.utils.vep.CSQ_SPLICE

Constant containing all splice consequences.

gnomad.utils.vep.POSSIBLE_REFS

Constant containing supported references.

gnomad.utils.vep.VEP_CONFIG_PATH

Constant that contains the local path to the VEP config file.

gnomad.utils.vep.VEP_CSQ_FIELDS

Constant that defines the order of VEP annotations used in VCF export, currently stored in a dictionary with the VEP version as the key.

gnomad.utils.vep.VEP_CSQ_HEADER

Constant that contains description for VEP used in VCF export.

gnomad.utils.vep.LOFTEE_LABELS

Constant that contains annotations added by LOFTEE.

gnomad.utils.vep.LOF_CSQ_SET

Set containing loss-of-function consequence strings.

gnomad.utils.vep.get_vep_help([vep_config_path])

Return the output of vep --help which includes the VEP version.

gnomad.utils.vep.get_vep_context([ref])

Get VEP context resource for the genome build ref.

gnomad.utils.vep.vep_or_lookup_vep(ht[, ...])

VEP a table, or lookup variants in a reference database.

gnomad.utils.vep.get_most_severe_consequence_expr(...)

Get the most severe consequence from a collection of consequences.

gnomad.utils.vep.add_most_severe_consequence_to_consequence(tc)

Add a most_severe_consequence field to a transcript consequence or array of transcript consequences.

gnomad.utils.vep.process_consequences(t[, ...])

Add most_severe_consequence into [vep_root].transcript_consequences, and worst_csq_by_gene, any_lof into [vep_root].

gnomad.utils.vep.filter_vep_to_canonical_transcripts(mt)

Filter VEP transcript consequences to those in the canonical transcript.

gnomad.utils.vep.filter_vep_to_mane_select_transcripts(mt)

Filter VEP transcript consequences to those in the MANE Select transcript.

gnomad.utils.vep.filter_vep_to_synonymous_variants(mt)

Filter VEP transcript consequences to those with a most severe consequence of 'synonymous_variant'.

gnomad.utils.vep.filter_vep_to_gene_list(t, ...)

Filter VEP transcript consequences to those in a set of genes.

gnomad.utils.vep.vep_struct_to_csq(vep_expr)

Given a VEP Struct, return an array of VEP VCF CSQ strings (one per consequence in the struct).

gnomad.utils.vep.get_most_severe_consequence_for_summary(ht)

Prepare a hail Table for summary statistics generation.

gnomad.utils.vep.filter_vep_transcript_csqs(t)

Filter VEP transcript consequences based on specified criteria, and optionally filter to variants where transcript consequences is not empty after filtering.

gnomad.utils.vep.filter_vep_transcript_csqs_expr(...)

Filter VEP transcript consequences based on specified criteria, and optionally filter to variants where transcript consequences is not empty after filtering.

gnomad.utils.vep.add_most_severe_csq_to_tc_within_vep_root(t)

Add most_severe_consequence annotation to 'transcript_consequences' within the vep root annotation.

gnomad.utils.vep.explode_by_vep_annotation(t)

Explode the specified VEP annotation on the input Table/MatrixTable.

gnomad.utils.vep.CURRENT_VEP_VERSION = '105'

Versions of VEP used in gnomAD data; the latest version is 105.

gnomad.utils.vep.CSQ_CODING = ['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_donor_5th_base_variant', 'splice_region_variant', 'splice_donor_region_variant', 'splice_polypyrimidine_tract_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant', 'coding_transcript_variant']

Constant containing all coding consequences.

gnomad.utils.vep.CSQ_SPLICE = ['splice_acceptor_variant', 'splice_donor_variant', 'splice_region_variant']

Constant containing all splice consequences.

gnomad.utils.vep.POSSIBLE_REFS = ('GRCh37', 'GRCh38')

Constant containing supported references.

gnomad.utils.vep.VEP_CONFIG_PATH = 'file:///vep_data/vep-gcloud.json'

Constant that contains the local path to the VEP config file.

gnomad.utils.vep.VEP_CSQ_FIELDS = {'101': 'Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info', '105': 'Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|UNIPROT_ISOFORM|SOURCE|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|LoF|LoF_filter|LoF_flags|LoF_info'}

Constant that defines the order of VEP annotations used in VCF export, currently stored in a dictionary with the VEP version as the key.

gnomad.utils.vep.VEP_CSQ_HEADER = 'Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|UNIPROT_ISOFORM|SOURCE|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|LoF|LoF_filter|LoF_flags|LoF_info'

Constant that contains description for VEP used in VCF export.

gnomad.utils.vep.LOFTEE_LABELS = ['HC', 'LC', 'OS']

Constant that contains annotations added by LOFTEE.

gnomad.utils.vep.LOF_CSQ_SET = {'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}

Set containing loss-of-function consequence strings.

gnomad.utils.vep.get_vep_help(vep_config_path=None)[source]

Return the output of vep --help which includes the VEP version.

Warning

If no vep_config_path is supplied, this function will only work for Dataproc clusters created with hailctl dataproc start --vep. It assumes that the command is /path/to/vep.

Parameters:

vep_config_path (Optional[str]) – Optional path to use as the VEP config file. If None, the VEP_CONFIG_URI environment variable is used.

Returns:

VEP help string

gnomad.utils.vep.get_vep_context(ref=None)[source]

Get VEP context resource for the genome build ref.

Parameters:

ref (Optional[str]) – Genome build. If None, hl.default_reference is used.

Return type:

VersionedTableResource

Returns:

VEPed context resource

gnomad.utils.vep.vep_or_lookup_vep(ht, reference_vep_ht=None, reference=None, vep_config_path=None, vep_version=None)[source]

VEP a table, or lookup variants in a reference database.

Warning

If reference_vep_ht is supplied, no check is performed to confirm reference_vep_ht was generated with the same version of VEP / VEP configuration as the VEP referenced in vep_config_path.

Parameters:
  • ht – Input Table

  • reference_vep_ht – A reference database with VEP annotations (must be in top-level vep)

  • reference – If reference_vep_ht is not specified, find a suitable one in reference (if None, grabs from hl.default_reference)

  • vep_config_path – vep_config to pass to hl.vep (if None, a suitable one for reference is chosen)

  • vep_version – Version of VEPed context Table to use (if None, the default vep_context resource will be used)

Returns:

VEPed Table

gnomad.utils.vep.get_most_severe_consequence_expr(csq_expr, csq_order=None)[source]

Get the most severe consequence from a collection of consequences.

This is for a given transcript, as there are often multiple annotations for a single transcript: e.g. splice_region_variant&intron_variant -> splice_region_variant

Parameters:
  • csq_expr (ArrayExpression) – ArrayExpression of consequences.

  • csq_order (Optional[List[str]]) – Optional list indicating the order of VEP consequences, sorted from high to low impact. Default is None, which uses the value of the CSQ_ORDER global.

Return type:

StringExpression

Returns:

Most severe consequence in csq_expr.
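
The selection rule described above can be illustrated with a small pure-Python sketch that picks the term appearing earliest in a severity-ordered list. This does not use Hail and is not the library's implementation; `most_severe` and the shortened `CSQ_ORDER_SUBSET` list are hypothetical names introduced only for this example.

```python
from typing import List, Optional

# Shortened, illustrative ordering (high to low impact). This is an assumption
# for the sketch; the documented default is the CSQ_ORDER global.
CSQ_ORDER_SUBSET = [
    "stop_gained",
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
    "intron_variant",
]


def most_severe(csqs: List[str], csq_order: Optional[List[str]] = None) -> Optional[str]:
    """Return the first term of csq_order that is present in csqs, else None."""
    order = CSQ_ORDER_SUBSET if csq_order is None else csq_order
    present = set(csqs)
    for term in order:
        if term in present:
            return term
    return None


# Multiple terms for one transcript, as in the example above.
terms = "splice_region_variant&intron_variant".split("&")
print(most_severe(terms))  # splice_region_variant
```

Terms that are absent from the ordering are ignored in this sketch, so a list containing only unknown terms yields None.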

gnomad.utils.vep.add_most_severe_consequence_to_consequence(tc, csq_order=None, most_severe_csq_field='most_severe_consequence')[source]

Add a most_severe_consequence field to a transcript consequence or array of transcript consequences.

For a single transcript consequence, tc should be a StructExpression with a consequence_terms field, e.g. Struct(consequence_terms=[‘missense_variant’]). For an array of transcript consequences, tc should be an ArrayExpression of StructExpressions with a consequence_terms field.

Parameters:
  • tc (Union[StructExpression, ArrayExpression]) – Transcript consequence or array of transcript consequences to annotate.

  • csq_order (Optional[List[str]]) – Optional list indicating the order of VEP consequences, sorted from high to low impact. Default is None, which uses the value of the CSQ_ORDER global.

  • most_severe_csq_field (str) – Field name to use for most severe consequence. Default is ‘most_severe_consequence’.

Return type:

Union[StructExpression, ArrayExpression]

Returns:

Transcript consequence or array of transcript consequences annotated with the most severe consequence.

gnomad.utils.vep.process_consequences(t, vep_root='vep', penalize_flags=True, csq_order=None, has_polyphen=True)[source]

Add most_severe_consequence into [vep_root].transcript_consequences, and worst_csq_by_gene, any_lof into [vep_root].

most_severe_consequence is the worst consequence for a transcript.

Each transcript consequence is annotated with a csq_score which is a combination of the index of the consequence’s most_severe_consequence in csq_order and an extra deduction for loss-of-function consequences, and polyphen predictions if has_polyphen is True. Lower scores translate to higher severity.

The score adjustment is as follows:
  • lof == ‘HC’ & NO lof_flags (-1000 if penalize_flags, -500 if not)

  • lof == ‘HC’ & lof_flags (-500)

  • lof == ‘OS’ (-20)

  • lof == ‘LC’ (-10)

  • everything else (0)

Note

From gnomAD v4.0 on, the PolyPhen annotation was removed from the VEP Struct in the release HTs. When using this function with gnomAD v4.0 or later, set has_polyphen to False.

Parameters:
  • t (Union[MatrixTable, Table]) – Input Table or MatrixTable.

  • vep_root (str) – Root for VEP annotation (probably “vep”).

  • penalize_flags (bool) – Whether to penalize LOFTEE flagged variants, or treat them as equal to HC.

  • csq_order (Optional[List[str]]) – Optional list indicating the order of VEP consequences, sorted from high to low impact. Default is None, which uses the value of the CSQ_ORDER global.

  • has_polyphen (bool) – Whether the input VEP Struct has a PolyPhen annotation which will be used to modify the consequence score. Default is True.

Return type:

Union[MatrixTable, Table]

Returns:

Table or MatrixTable with better formatted consequences.
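
To make the loss-of-function score adjustment listed above concrete, here is a pure-Python sketch. It assumes the index and the adjustment are simply added together, and it leaves out the PolyPhen contribution because its values are not stated here. The function names `lof_adjustment` and `csq_score`, the short ordering, and the flag value are hypothetical and exist only for illustration.

```python
from typing import List, Optional


def lof_adjustment(lof: Optional[str], lof_flags: Optional[str], penalize_flags: bool = True) -> int:
    """Score deduction for LOFTEE annotations, following the list above."""
    if lof == "HC" and not lof_flags:
        return -1000 if penalize_flags else -500
    if lof == "HC":
        return -500
    if lof == "OS":
        return -20
    if lof == "LC":
        return -10
    return 0


def csq_score(
    most_severe_consequence: str,
    lof: Optional[str],
    lof_flags: Optional[str],
    csq_order: List[str],
    penalize_flags: bool = True,
) -> int:
    """Index in csq_order plus the LOF deduction (additive combination is assumed)."""
    return csq_order.index(most_severe_consequence) + lof_adjustment(lof, lof_flags, penalize_flags)


# Illustrative ordering, high to low impact.
order = ["stop_gained", "missense_variant", "synonymous_variant"]
scores = {
    "HC, no flags": csq_score("stop_gained", "HC", None, order),
    "HC, flagged": csq_score("stop_gained", "HC", "EXAMPLE_FLAG", order),
    "missense": csq_score("missense_variant", None, None, order),
}
# Lower scores mean higher severity, so the minimum is the worst consequence.
print(min(scores, key=scores.get))  # HC, no flags
```

With penalize_flags set to False, flagged and unflagged HC consequences receive the same deduction in this sketch, which matches the "treat them as equal to HC" description of that parameter.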

gnomad.utils.vep.filter_vep_to_canonical_transcripts(mt, vep_root='vep', filter_empty_csq=False)[source]

Filter VEP transcript consequences to those in the canonical transcript.

Parameters:
  • mt (Union[MatrixTable, Table]) – Input Table or MatrixTable.

  • vep_root (str) – Name used for VEP annotation. Default is ‘vep’.

  • filter_empty_csq (bool) – Whether to filter out rows where ‘transcript_consequences’ is empty. Default is False.

Return type:

Union[MatrixTable, Table]

Returns:

Table or MatrixTable with VEP transcript consequences filtered.

gnomad.utils.vep.filter_vep_to_mane_select_transcripts(mt, vep_root='vep', filter_empty_csq=False)[source]

Filter VEP transcript consequences to those in the MANE Select transcript.

Parameters:
  • mt (Union[MatrixTable, Table]) – Input Table or MatrixTable.

  • vep_root (str) – Name used for VEP annotation. Default is ‘vep’.

  • filter_empty_csq (bool) – Whether to filter out rows where ‘transcript_consequences’ is empty. Default is False.

Return type:

Union[MatrixTable, Table]

Returns:

Table or MatrixTable with VEP transcript consequences filtered.

gnomad.utils.vep.filter_vep_to_synonymous_variants(mt, vep_root='vep', filter_empty_csq=False)[source]

Filter VEP transcript consequences to those with a most severe consequence of ‘synonymous_variant’.

Parameters:
  • mt (Union[MatrixTable, Table]) – Input Table or MatrixTable.

  • vep_root (str) – Name used for VEP annotation. Default is ‘vep’.

  • filter_empty_csq (bool) – Whether to filter out rows where ‘transcript_consequences’ is empty. Default is False.

Return type:

Union[MatrixTable, Table]

Returns:

Table or MatrixTable with VEP transcript consequences filtered.

gnomad.utils.vep.filter_vep_to_gene_list(t, genes, match_by_gene_symbol=False, vep_root='vep', filter_empty_csq=False)[source]

Filter VEP transcript consequences to those in a set of genes.

Note

Filtering to a list of genes by their ‘gene_id’ or ‘gene_symbol’ will filter to all variants that are annotated to the gene, including [‘upstream_gene_variant’, ‘downstream_gene_variant’], which will not be the same as if you filter to a gene interval. If you only want variants inside certain gene boundaries and a faster filter, you can first filter t to an interval list and then apply this filter.

Parameters:
  • t (Union[MatrixTable, Table]) – Input Table or MatrixTable.

  • genes (List[str]) – Genes of interest to filter VEP transcript consequences to.

  • match_by_gene_symbol (bool) – Whether to match values in genes to VEP transcript consequences by ‘gene_symbol’ instead of ‘gene_id’. Default is False.

  • vep_root (str) – Name used for VEP annotation. Default is ‘vep’.

  • filter_empty_csq (bool) – Whether to filter out rows where ‘transcript_consequences’ is empty. Default is False.

Returns:

Table or MatrixTable with VEP transcript consequences filtered.

gnomad.utils.vep.vep_struct_to_csq(vep_expr, csq_fields='Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|UNIPROT_ISOFORM|SOURCE|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|LoF|LoF_filter|LoF_flags|LoF_info', has_polyphen_sift=True)[source]

Given a VEP Struct, return an array of VEP VCF CSQ strings (one per consequence in the struct).

The fields and their order will correspond to those passed in csq_fields, which corresponds to the VCF header that is required to interpret the VCF CSQ INFO field.

Note that the order is flexible and that all fields that are in the default value are supported. These fields will be formatted in the same way that their VEP CSQ counterparts are.

Other fields can be added if their names are the same as those in the struct. Their values will be the result of calling hl.str(), so they may differ from their usual VEP CSQ representation.

Parameters:
  • vep_expr (StructExpression) – The input VEP Struct

  • csq_fields (str) – The | delimited list of fields to include in the CSQ (in that order), default is the CSQ fields of the CURRENT_VEP_VERSION.

  • has_polyphen_sift (bool) – Whether the input VEP Struct has PolyPhen and SIFT annotations. Default is True.

Return type:

ArrayExpression

Returns:

The corresponding CSQ strings
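
As a simplified picture of how a |-delimited field list drives the output order, the following pure-Python sketch builds one CSQ-style string from a dictionary. It is only an illustration: it uses the CSQ field names directly as dictionary keys and converts every value with str(), whereas the real function works on a Hail StructExpression and formats the supported fields like their VEP CSQ counterparts. The name `to_csq_string` and the sample values are hypothetical.

```python
from typing import Any, Dict


def to_csq_string(consequence: Dict[str, Any], csq_fields: str) -> str:
    """Join the values of the requested fields with '|', in the requested order."""
    values = []
    for field in csq_fields.split("|"):
        value = consequence.get(field)
        # Missing values become empty strings so field positions stay aligned.
        values.append("" if value is None else str(value))
    return "|".join(values)


example = {"Allele": "T", "Consequence": "missense_variant", "SYMBOL": "GENE1"}
print(to_csq_string(example, "Allele|Consequence|SYMBOL|LoF"))  # T|missense_variant|GENE1|
```

Because the positions are fixed by csq_fields, the same string is what a VCF header must describe for the CSQ INFO field to be interpretable.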

gnomad.utils.vep.get_most_severe_consequence_for_summary(ht, csq_order=['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_donor_5th_base_variant', 'splice_region_variant', 'splice_donor_region_variant', 'splice_polypyrimidine_tract_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant', 'coding_transcript_variant', 'mature_miRNA_variant', '5_prime_UTR_variant', '3_prime_UTR_variant', 'non_coding_transcript_exon_variant', 'non_coding_exon_variant', 'intron_variant', 'NMD_transcript_variant', 'non_coding_transcript_variant', 'nc_transcript_variant', 'upstream_gene_variant', 'downstream_gene_variant', 'TFBS_ablation', 'TFBS_amplification', 'TF_binding_site_variant', 'regulatory_region_ablation', 'regulatory_region_amplification', 'feature_elongation', 'regulatory_region_variant', 'feature_truncation', 'intergenic_variant', 'sequence_variant'], loftee_labels=['HC', 'LC', 'OS'])[source]

Prepare a hail Table for summary statistics generation.

Adds the following annotations:
  • most_severe_csq: Most severe consequence for variant

  • protein_coding: Whether the variant is present on a protein-coding transcript

  • lof: Whether the variant is a loss-of-function variant

  • no_lof_flags: Whether the variant has no LOFTEE flags (True if no flags)

Assumes input Table is annotated with VEP and that VEP annotations have been filtered to canonical transcripts.

Parameters:
  • ht (Table) – Input Table.

  • csq_order (List[str]) – Order of VEP consequences, sorted from high to low impact. Default is CSQ_ORDER.

  • loftee_labels (List[str]) – Annotations added by LOFTEE. Default is LOFTEE_LABELS.

Return type:

Table

Returns:

Table annotated with VEP summary annotations.

gnomad.utils.vep.filter_vep_transcript_csqs(t, vep_root='vep', synonymous=True, canonical=True, ensembl_only=True, filter_empty_csq=True, **kwargs)[source]

Filter VEP transcript consequences based on specified criteria, and optionally filter to variants where transcript consequences is not empty after filtering.

If filter_empty_csq parameter is set to True, the Table/MatrixTable is filtered to variants where ‘transcript_consequences’ within the VEP annotation is not empty after the specified filtering criteria is applied.

Note

By default, the Table/MatrixTable is filtered to variants where ‘transcript_consequences’ within the VEP annotation is not empty after filtering to Ensembl canonical transcripts with a most severe consequence of ‘synonymous_variant’.

Parameters:
  • t (Union[Table, MatrixTable]) – Input Table or MatrixTable.

  • vep_root (str) – Root for VEP annotation. Default is ‘vep’.

  • synonymous (bool) – Whether to filter to variants where the most severe consequence is ‘synonymous_variant’. Default is True.

  • canonical (bool) – Whether to filter to only canonical transcripts. Default is True.

  • ensembl_only (bool) – Whether to filter to only Ensembl transcripts. This option is useful for deduplicating transcripts that are the same between RefSeq and Ensembl. Default is True.

  • filter_empty_csq (bool) – Whether to filter out rows where ‘transcript_consequences’ is empty, after filtering ‘transcript_consequences’ to the specified criteria. Default is True.

  • kwargs – Filtering criteria to apply to the VEP transcript consequences using filter_vep_transcript_csqs_expr. See that function for more details.

Return type:

Union[Table, MatrixTable]

Returns:

Table or MatrixTable with VEP transcript consequences filtered.

gnomad.utils.vep.filter_vep_transcript_csqs_expr(csq_expr, synonymous=False, canonical=False, mane_select=False, ensembl_only=False, protein_coding=False, loftee_labels=None, no_lof_flags=False, csqs=None, keep_csqs=True, genes=None, keep_genes=True, match_by_gene_symbol=False, additional_filtering_criteria=None)[source]

Filter VEP transcript consequences based on specified criteria, and optionally filter to variants where transcript consequences is not empty after filtering.

Note

If csqs is not None or synonymous is True, and ‘most_severe_consequence’ is not already annotated on the csq_expr elements, the most severe consequence will be added to the csq_expr for filtering.

Parameters:
  • csq_expr (Union[StructExpression, ArrayExpression]) – VEP transcript consequences StructExpression or ArrayExpression.

  • synonymous (bool) – Whether to filter to variants where the most severe consequence is ‘synonymous_variant’. Default is False.

  • canonical (bool) – Whether to filter to only canonical transcripts. Default is False.

  • mane_select (bool) – Whether to filter to only MANE Select transcripts. Default is False.

  • ensembl_only (bool) – Whether to filter to only Ensembl transcripts. This option is useful for deduplicating transcripts that are the same between RefSeq and Ensembl. Default is False.

  • protein_coding (bool) – Whether to filter to only protein-coding transcripts. Default is False.

  • loftee_labels (Optional[List[str]]) – List of LOFTEE labels to filter to. Default is None, which filters to all LOFTEE labels.

  • no_lof_flags (bool) – Whether to filter to consequences with no LOFTEE flags. Default is False.

  • csqs (Optional[List[str]]) – Optional list of consequence terms to filter to. Transcript consequences are filtered to those where ‘most_severe_consequence’ is in the list of consequence terms csqs. Default is None.

  • keep_csqs (bool) – Whether to keep transcript consequences that are in csqs. If set to False, transcript consequences that are in csqs will be removed. Default is True.

  • genes (Optional[List[str]]) – Optional list of genes to filter VEP transcript consequences to. Default is None.

  • keep_genes (bool) – Whether to keep transcript consequences that are in genes. If set to False, transcript consequences that are in genes will be removed. Default is True.

  • match_by_gene_symbol (bool) – Whether to match values in genes to VEP transcript consequences by ‘gene_symbol’ instead of ‘gene_id’. Default is False.

  • additional_filtering_criteria (Optional[List[Union[BooleanExpression, Callable]]]) – Optional list of additional filtering criteria to apply to the VEP transcript consequences.

Return type:

Union[BooleanExpression, ArrayExpression]

Returns:

BooleanExpression indicating whether the consequence should be filtered or an ArrayExpression of the filtered VEP transcript consequences.
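
How a few of these criteria combine can be sketched in pure Python over a list of dictionaries standing in for transcript consequences. This is not the library's code and covers only a subset of the parameters. It assumes that a canonical transcript is marked with canonical equal to 1 and that Ensembl transcript IDs start with "ENST"; the function name `filter_transcript_csqs` and the sample records are hypothetical.

```python
from typing import Any, Dict, List, Optional


def filter_transcript_csqs(
    csqs: List[Dict[str, Any]],
    canonical: bool = False,
    ensembl_only: bool = False,
    genes: Optional[List[str]] = None,
    keep_genes: bool = True,
    match_by_gene_symbol: bool = False,
) -> List[Dict[str, Any]]:
    """Keep only the transcript consequences that pass every requested criterion."""
    gene_field = "gene_symbol" if match_by_gene_symbol else "gene_id"
    kept = []
    for tc in csqs:
        if canonical and tc.get("canonical") != 1:  # assumed encoding
            continue
        if ensembl_only and not str(tc.get("transcript_id", "")).startswith("ENST"):
            continue
        if genes is not None and (tc.get(gene_field) in genes) != keep_genes:
            continue
        kept.append(tc)
    return kept


tcs = [
    {"transcript_id": "ENST0001", "gene_id": "ENSG01", "gene_symbol": "GENE1", "canonical": 1},
    {"transcript_id": "NM_0001", "gene_id": "1234", "gene_symbol": "GENE1", "canonical": 1},
    {"transcript_id": "ENST0002", "gene_id": "ENSG02", "gene_symbol": "GENE2", "canonical": None},
]
kept = filter_transcript_csqs(tcs, canonical=True, ensembl_only=True)
print([tc["transcript_id"] for tc in kept])  # ['ENST0001']
```

Setting keep_genes to False inverts the gene criterion, so the listed genes are removed instead of retained.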

gnomad.utils.vep.add_most_severe_csq_to_tc_within_vep_root(t, vep_root='vep')[source]

Add most_severe_consequence annotation to ‘transcript_consequences’ within the vep root annotation.

Parameters:
  • t (Union[Table, MatrixTable]) – Input Table or MatrixTable.

  • vep_root (str) – Root for vep annotation (probably vep).

Return type:

Union[Table, MatrixTable]

Returns:

Table or MatrixTable with most_severe_consequence annotation added.

gnomad.utils.vep.explode_by_vep_annotation(t, vep_annotation='transcript_consequences', vep_root='vep')[source]

Explode the specified VEP annotation on the input Table/MatrixTable.

Parameters:
  • t (Union[Table, MatrixTable]) – Input Table or MatrixTable.

  • vep_annotation (str) – Name of annotation in vep_root to explode.

  • vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.

Return type:

Union[Table, MatrixTable]

Returns:

Table or MatrixTable with exploded VEP annotation.
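
The effect of exploding an array annotation can be pictured with a pure-Python sketch over rows represented as nested dictionaries: each element of the array becomes its own row, with the array field replaced by that single element. This is an illustration only, not the Hail-based implementation; `explode_rows` and the sample rows are hypothetical, and dropping rows with an empty array is a choice made by this sketch.

```python
from typing import Any, Dict, List


def explode_rows(
    rows: List[Dict[str, Any]],
    vep_annotation: str = "transcript_consequences",
    vep_root: str = "vep",
) -> List[Dict[str, Any]]:
    """Emit one output row per element of row[vep_root][vep_annotation]."""
    exploded = []
    for row in rows:
        for element in row[vep_root][vep_annotation]:
            # Copy the row and its VEP struct, swapping the array for one element.
            new_vep = {**row[vep_root], vep_annotation: element}
            exploded.append({**row, vep_root: new_vep})
    return exploded


rows = [
    {"locus": "1:100", "vep": {"transcript_consequences": [{"gene_symbol": "A"}, {"gene_symbol": "B"}]}},
    {"locus": "1:200", "vep": {"transcript_consequences": []}},
]
for out in explode_rows(rows):
    print(out["locus"], out["vep"]["transcript_consequences"]["gene_symbol"])
```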