gnomad.utils.vep

gnomad.utils.vep.POSSIBLE_REFS

Constant containing the supported references.

gnomad.utils.vep.VEP_CONFIG_PATH

Constant that contains the local path to the VEP config file

gnomad.utils.vep.VEP_CSQ_FIELDS

Constant that defines the order of VEP annotations used in VCF export.

gnomad.utils.vep.VEP_CSQ_HEADER

Constant that contains the description for VEP used in VCF export.

gnomad.utils.vep.LOFTEE_LABELS

Constant that contains annotations added by LOFTEE.

gnomad.utils.vep.LOF_CSQ_SET

Set containing loss-of-function consequence strings.

gnomad.utils.vep.get_vep_help([vep_config_path])

Return the output of vep --help, which includes the VEP version.

gnomad.utils.vep.get_vep_context([ref])

Get VEP context resource for the genome build ref.

gnomad.utils.vep.vep_or_lookup_vep(ht[, …])

VEP a table, or lookup variants in a reference database.

gnomad.utils.vep.add_most_severe_consequence_to_consequence(tc)

Add most_severe_consequence annotation to transcript consequences.

gnomad.utils.vep.process_consequences(mt[, …])

Add most_severe_consequence into [vep_root].transcript_consequences, and worst_csq_by_gene, any_lof into [vep_root].

gnomad.utils.vep.filter_vep_to_canonical_transcripts(mt)

Filter VEP transcript consequences to those in the canonical transcript.

gnomad.utils.vep.filter_vep_to_synonymous_variants(mt)

Filter VEP transcript consequences to those with a most severe consequence of synonymous_variant.

gnomad.utils.vep.vep_struct_to_csq(vep_expr)

Given a VEP Struct, return an array of VEP VCF CSQ strings (one per consequence in the struct).

gnomad.utils.vep.get_most_severe_consequence_for_summary(ht)

Prepare a hail Table for summary statistics generation.

gnomad.utils.vep.POSSIBLE_REFS = ('GRCh37', 'GRCh38')

Constant containing the supported references.

gnomad.utils.vep.VEP_CONFIG_PATH = 'file:///vep_data/vep-gcloud.json'

Constant that contains the local path to the VEP config file

gnomad.utils.vep.VEP_CSQ_FIELDS = 'Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info'

Constant that defines the order of VEP annotations used in VCF export.
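
Because the field order is fixed, a single pipe-delimited CSQ entry can be mapped back to named fields by position. The following is a minimal plain-Python sketch, not part of gnomad.utils.vep: the helper name parse_csq_entry is hypothetical, and the field string is shortened to the first four entries of VEP_CSQ_FIELDS to keep the example small.

```python
# First four fields of VEP_CSQ_FIELDS, shortened for illustration only.
CSQ_FIELDS = "Allele|Consequence|IMPACT|SYMBOL"


def parse_csq_entry(entry: str, csq_fields: str = CSQ_FIELDS) -> dict:
    """Map one pipe-delimited CSQ entry to a {field name: value} dict.

    Hypothetical helper: it relies only on the positional correspondence
    between the field list and the values in the entry.
    """
    names = csq_fields.split("|")
    values = entry.split("|")
    if len(names) != len(values):
        raise ValueError(
            f"Expected {len(names)} values but found {len(values)}"
        )
    return dict(zip(names, values))


record = parse_csq_entry("T|missense_variant|MODERATE|BRCA2")
print(record["Consequence"])
```

Passing the full VEP_CSQ_FIELDS string as csq_fields would handle complete entries in the same way, provided the entry was written with that field order.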

gnomad.utils.vep.VEP_CSQ_HEADER = 'Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info'

Constant that contains the description for VEP used in VCF export.

gnomad.utils.vep.LOFTEE_LABELS = ['HC', 'LC', 'OS']

Constant that contains annotations added by LOFTEE.

gnomad.utils.vep.LOF_CSQ_SET = {'frameshift_variant', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained'}

Set containing loss-of-function consequence strings.
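
As an illustration of how such a set can be used, the plain-Python sketch below checks whether any consequence term of a transcript is a loss-of-function term. The set is copied from the value shown above; the helper name has_lof_consequence is hypothetical and is not part of the module.

```python
# Copied from the documented value of LOF_CSQ_SET.
LOF_CSQ_SET = {
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
}


def has_lof_consequence(consequence_terms) -> bool:
    """Return True if any of the given terms is in LOF_CSQ_SET.

    Hypothetical helper for illustration only.
    """
    return any(term in LOF_CSQ_SET for term in consequence_terms)


# VEP joins multiple terms for one transcript with "&" in the CSQ string.
terms = "stop_gained&splice_region_variant".split("&")
print(has_lof_consequence(terms))
```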

gnomad.utils.vep.get_vep_help(vep_config_path=None)[source]

Return the output of vep --help, which includes the VEP version.

Warning

If no vep_config_path is supplied, this function will only work for Dataproc clusters created with hailctl dataproc start --vep. It assumes that the command is /path/to/vep.

Parameters

vep_config_path (Optional[str]) – Optional path to use as the VEP config file. If None, the VEP_CONFIG_URI environment variable is used.

Returns

VEP help string
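
The fallback described for vep_config_path can be sketched in plain Python as follows. This is only an illustration of the documented behavior (explicit path first, then the VEP_CONFIG_URI environment variable); the helper name resolve_vep_config_path is hypothetical, and raising an error when neither is available is an assumption of this sketch, not documented behavior.

```python
import os
from typing import Optional


def resolve_vep_config_path(vep_config_path: Optional[str] = None) -> str:
    """Pick the VEP config path: explicit argument, else VEP_CONFIG_URI.

    Hypothetical helper; the error for a missing value is an assumption.
    """
    if vep_config_path is not None:
        return vep_config_path
    env_path = os.environ.get("VEP_CONFIG_URI")
    if env_path is None:
        raise ValueError(
            "No vep_config_path given and VEP_CONFIG_URI is not set"
        )
    return env_path
```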

gnomad.utils.vep.get_vep_context(ref=None)[source]

Get VEP context resource for the genome build ref.

Parameters

ref (Optional[str]) – Genome build. If None, hl.default_reference is used

Return type

VersionedTableResource

Returns

VEPed context resource

gnomad.utils.vep.vep_or_lookup_vep(ht, reference_vep_ht=None, reference=None, vep_config_path=None, vep_version=None)[source]

VEP a table, or lookup variants in a reference database.

Warning

If reference_vep_ht is supplied, no check is performed to confirm reference_vep_ht was generated with the same version of VEP / VEP configuration as the VEP referenced in vep_config_path.

Parameters
  • ht – Input Table

  • reference_vep_ht – A reference database with VEP annotations (the annotations must be in the top-level vep field)

  • reference – Genome build used to find a suitable reference database if reference_vep_ht is not specified (if None, hl.default_reference is used)

  • vep_config_path – vep_config to pass to hl.vep (if None, a suitable one for reference is chosen)

  • vep_version – Version of VEPed context Table to use (if None, the default vep_context resource will be used)

Returns

VEPed Table
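
The "lookup, otherwise annotate" pattern can be illustrated without Hail. The plain-Python sketch below assumes that variants found in the reference reuse the stored annotation and that only the remaining variants are annotated directly; that split is an assumption of this sketch rather than something stated above. The function name annotate_or_lookup, the dict-based reference, and the annotate callback are all hypothetical stand-ins for the Table-based implementation.

```python
from typing import Callable, Dict, Hashable, Iterable


def annotate_or_lookup(
    variants: Iterable[Hashable],
    reference: Dict[Hashable, str],
    annotate: Callable[[Hashable], str],
) -> Dict[Hashable, str]:
    """Reuse annotations from `reference`; call `annotate` for the rest.

    Hypothetical stand-in: `reference` plays the role of reference_vep_ht
    and `annotate` plays the role of running VEP on a variant.
    """
    result = {}
    missing = []
    for variant in variants:
        if variant in reference:
            result[variant] = reference[variant]
        else:
            missing.append(variant)
    # Only variants absent from the reference need the expensive step.
    for variant in missing:
        result[variant] = annotate(variant)
    return result
```

As the warning above notes for reference_vep_ht, nothing in this pattern checks that the stored annotations and the freshly computed ones come from the same VEP version or configuration.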

gnomad.utils.vep.add_most_severe_consequence_to_consequence(tc)[source]

Add most_severe_consequence annotation to transcript consequences.

This is for a given transcript, as there are often multiple annotations for a single transcript: e.g. splice_region_variant&intron_variant -> splice_region_variant

Parameters

tc (StructExpression) – Transcript consequence struct to annotate

Return type

StructExpression
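
The reduction in the example above (splice_region_variant&intron_variant -> splice_region_variant) amounts to picking the term that comes first in a severity ranking. The plain-Python sketch below shows the idea; it is not the Hail implementation, the helper name most_severe is hypothetical, and the ranking is a shortened subset that keeps the relative order of the default csq_order list documented for get_most_severe_consequence_for_summary.

```python
# Shortened severity ranking, most severe first (illustrative subset).
CSQ_ORDER = [
    "splice_acceptor_variant",
    "stop_gained",
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
    "intron_variant",
]


def most_severe(consequence_terms, csq_order=CSQ_ORDER):
    """Return the term that appears earliest in csq_order, or None.

    Hypothetical helper; terms missing from csq_order are ignored.
    """
    rank = {term: index for index, term in enumerate(csq_order)}
    known = [term for term in consequence_terms if term in rank]
    if not known:
        return None
    return min(known, key=rank.__getitem__)


print(most_severe("splice_region_variant&intron_variant".split("&")))
```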

gnomad.utils.vep.process_consequences(mt, vep_root='vep', penalize_flags=True)[source]

Add most_severe_consequence into [vep_root].transcript_consequences, and worst_csq_by_gene, any_lof into [vep_root].

most_severe_consequence is the worst consequence for a transcript.

Parameters
  • mt (Union[MatrixTable, Table]) – Input MT

  • vep_root (str) – Root for vep annotation (probably vep)

  • penalize_flags (bool) – Whether to penalize LOFTEE flagged variants, or treat them as equal to HC

Return type

Union[MatrixTable, Table]

Returns

MT with better formatted consequences

gnomad.utils.vep.filter_vep_to_canonical_transcripts(mt, vep_root='vep')[source]

Filter VEP transcript consequences to those in the canonical transcript.

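Parameters
  • mt (Union[MatrixTable, Table]) – Input MT

  • vep_root (str) – Root for vep annotation (probably vep)

Return type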

Union[MatrixTable, Table]

gnomad.utils.vep.filter_vep_to_synonymous_variants(mt, vep_root='vep')[source]

Filter VEP transcript consequences to those with a most severe consequence of synonymous_variant.

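Parameters
  • mt (Union[MatrixTable, Table]) – Input MT

  • vep_root (str) – Root for vep annotation (probably vep)

Return type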

Union[MatrixTable, Table]

gnomad.utils.vep.vep_struct_to_csq(vep_expr, csq_fields='Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info')[source]

Given a VEP Struct, return an array of VEP VCF CSQ strings (one per consequence in the struct).

The fields and their order will correspond to those passed in csq_fields, which corresponds to the VCF header that is required to interpret the VCF CSQ INFO field.

Note that the order is flexible and that all fields that are in the default value are supported. These fields will be formatted in the same way that their VEP CSQ counterparts are.

Other fields can also be added if their names are the same as those in the struct. Their values will be the result of calling hl.str(), so they may differ from their usual VEP CSQ representation.

Parameters
  • vep_expr (StructExpression) – The input VEP Struct

  • csq_fields (str) – The | delimited list of fields to include in the CSQ (in that order)

Return type

ArrayExpression

Returns

The corresponding CSQ strings
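
The relationship between csq_fields and the resulting string can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes the consequence is a dict whose keys already match the CSQ field names and that missing values become empty strings, whereas the real function works on a Hail VEP Struct and applies VEP-specific formatting. The helper name struct_to_csq is hypothetical.

```python
def struct_to_csq(consequence: dict, csq_fields: str) -> str:
    """Join the values of `consequence` in the order given by csq_fields.

    Hypothetical helper: missing or None values are written as empty
    strings, and every other value is converted with str().
    """
    parts = []
    for name in csq_fields.split("|"):
        value = consequence.get(name)
        parts.append("" if value is None else str(value))
    return "|".join(parts)


csq = struct_to_csq(
    {"Allele": "T", "Consequence": "missense_variant", "SYMBOL": "BRCA2"},
    "Allele|Consequence|IMPACT|SYMBOL",
)
print(csq)
```

Reordering the names in csq_fields reorders the output accordingly, which is why the same field string has to appear in the VCF header for the CSQ INFO field to be interpretable.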

gnomad.utils.vep.get_most_severe_consequence_for_summary(ht, csq_order=['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_region_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant', 'mature_miRNA_variant', '5_prime_UTR_variant', '3_prime_UTR_variant', 'non_coding_transcript_exon_variant', 'non_coding_exon_variant', 'intron_variant', 'NMD_transcript_variant', 'non_coding_transcript_variant', 'nc_transcript_variant', 'upstream_gene_variant', 'downstream_gene_variant', 'TFBS_ablation', 'TFBS_amplification', 'TF_binding_site_variant', 'regulatory_region_ablation', 'regulatory_region_amplification', 'feature_elongation', 'regulatory_region_variant', 'feature_truncation', 'intergenic_variant'], loftee_labels=['HC', 'LC', 'OS'])[source]

Prepare a hail Table for summary statistics generation.

Adds the following annotations:
  • most_severe_csq: Most severe consequence for variant

  • protein_coding: Whether the variant is present on a protein-coding transcript

  • lof: Whether the variant is a loss-of-function variant

  • no_lof_flags: Whether the variant is free of LOFTEE flags (True if the variant has no flags)

Assumes input Table is annotated with VEP and that VEP annotations have been filtered to canonical transcripts.

Parameters
  • ht (Table) – Input Table.

  • csq_order (List[str]) – Order of VEP consequences, sorted from high to low impact. Default is CSQ_ORDER.

  • loftee_labels (List[str]) – Annotations added by LOFTEE. Default is LOFTEE_LABELS.

Return type

Table

Returns

Table annotated with VEP summary annotations.