gnomad.utils.transcript_annotation

Utils module containing generic functions that are useful for adding transcript expression-aware annotations.

gnomad.utils.transcript_annotation.summarize_transcript_expression(mt)

Summarize a transcript expression MatrixTable by transcript, gene, and tissue.

gnomad.utils.transcript_annotation.get_expression_proportion(ht)

Calculate the proportion of expression of transcript to gene per tissue.

gnomad.utils.transcript_annotation.filter_expression_ht_by_tissues(ht)

Filter a Table with a row annotation for each tissue to only include specified tissues.

gnomad.utils.transcript_annotation.tissue_expression_ht_to_array(ht)

Convert a Table with a row annotation for each tissue to a Table with tissues as an array.

gnomad.utils.transcript_annotation.tx_filter_variants_by_csqs(ht)

Prepare a Table of variants with VEP transcript consequences for annotation.

gnomad.utils.transcript_annotation.tx_annotate_variants(ht, ...)

Annotate variants with transcript-based expression values or expression proportion from GTEx.

gnomad.utils.transcript_annotation.tx_aggregate_variants(ht)

Aggregate transcript-based expression values or expression proportion from GTEx.

gnomad.utils.transcript_annotation.perform_tx_annotation_pipeline(ht, ...)

One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.

gnomad.utils.transcript_annotation.summarize_transcript_expression(mt, transcript_expression_expr='transcript_tpm', tissue_expr='tissue', summary_agg_func=None)

Summarize a transcript expression MatrixTable by transcript, gene, and tissue.

The summary_agg_func argument allows the user to specify a Hail aggregation function to use to summarize the expression by tissue. By default, the median is used.

The returned Table has a row annotation for each tissue containing a struct with the summarized tissue expression value (‘transcript_expression’) and the proportion of expression of transcript to gene per tissue (‘expression_proportion’).

Returned Table Schema example:

Row fields:
    'transcript_id': str
    'gene_id': str
    'tissue_1': struct {
      transcript_expression: float64,
      expression_proportion: float64
    }
    'tissue_2': struct {
      transcript_expression: float64,
      expression_proportion: float64
    }

Key: ['transcript_id', 'gene_id']
Parameters:
  • mt (MatrixTable) – MatrixTable of transcript (rows) expression quantifications (entry) by sample (columns).

  • transcript_expression_expr (Union[NumericExpression, str]) – Entry expression indicating transcript expression quantification. Default is ‘transcript_tpm’.

  • tissue_expr (Union[StringExpression, str]) – Column expression indicating tissue type. Default is ‘tissue’.

  • summary_agg_func (Optional[Callable]) – Optional aggregation function to use to summarize the transcript expression quantification by tissue. Example: hl.agg.mean. Default is None, which will use a median aggregation.

Return type:

Table

Returns:

A Table of summarized transcript expression by tissue.
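
To make the summarization step concrete, here is a minimal plain-Python sketch of the same idea. It is not the Hail implementation: it only shows per-sample expression values being grouped by tissue and reduced with the median (the documented default) or another aggregation. The helper name `summarize_by_tissue` and the toy identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_tissue(expression, sample_tissue, agg_func=median):
    """Summarize per-sample expression values by tissue for each transcript.

    expression: {(transcript_id, gene_id): {sample_id: tpm}}
    sample_tissue: {sample_id: tissue}
    agg_func: callable applied to the list of values observed in each tissue.
    """
    summary = {}
    for key, per_sample in expression.items():
        # Collect the values of all samples that belong to the same tissue.
        by_tissue = defaultdict(list)
        for sample_id, tpm in per_sample.items():
            by_tissue[sample_tissue[sample_id]].append(tpm)
        summary[key] = {t: agg_func(v) for t, v in by_tissue.items()}
    return summary


# Toy data: one transcript, three liver samples and one lung sample.
expression = {("ENST1", "ENSG1"): {"s1": 2.0, "s2": 4.0, "s3": 9.0, "s4": 1.0}}
sample_tissue = {"s1": "liver", "s2": "liver", "s3": "liver", "s4": "lung"}

median_summary = summarize_by_tissue(expression, sample_tissue)
mean_summary = summarize_by_tissue(expression, sample_tissue, agg_func=mean)
```

Swapping `agg_func` here plays the role that summary_agg_func plays in the documented function, for example passing hl.agg.mean instead of relying on the median default.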

gnomad.utils.transcript_annotation.get_expression_proportion(ht)

Calculate the proportion of expression of transcript to gene per tissue.

Parameters:

ht (Table) – Table of summarized transcript expression by tissue.

Return type:

StructExpression

Returns:

StructExpression containing the proportion of expression of transcript to gene per tissue.
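
The proportion can be illustrated with a small plain-Python sketch. It rests on two assumptions that the text above does not state: that gene-level expression is the sum of the expression of the gene's transcripts, and that a zero gene total yields a missing value. The helper name `expression_proportion` and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def expression_proportion(transcript_expression):
    """Return each transcript's share of its gene's expression, per tissue.

    transcript_expression: {(transcript_id, gene_id): {tissue: value}}
    Assumption: gene-level expression is the sum over the gene's transcripts.
    """
    # Sum transcript expression by gene and tissue.
    gene_totals = defaultdict(lambda: defaultdict(float))
    for (_, gene_id), by_tissue in transcript_expression.items():
        for tissue, value in by_tissue.items():
            gene_totals[gene_id][tissue] += value

    proportions = {}
    for (transcript_id, gene_id), by_tissue in transcript_expression.items():
        proportions[(transcript_id, gene_id)] = {
            # Assumption: report None when the gene is not expressed at all.
            tissue: (value / gene_totals[gene_id][tissue]
                     if gene_totals[gene_id][tissue] else None)
            for tissue, value in by_tissue.items()
        }
    return proportions


toy = {
    ("ENST1", "ENSG1"): {"liver": 3.0, "lung": 0.0},
    ("ENST2", "ENSG1"): {"liver": 1.0, "lung": 0.0},
}
props = expression_proportion(toy)
```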

gnomad.utils.transcript_annotation.filter_expression_ht_by_tissues(ht, tissues_to_keep=None, tissues_to_filter=None)

Filter a Table with a row annotation for each tissue to only include specified tissues.

Parameters:
  • ht (Table) – Table with a row annotation for each tissue.

  • tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the Table. Default is all non-key row fields in the Table.

  • tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the Table.

Return type:

Table

Returns:

Table with only specified tissues.
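
The interaction of the two list arguments can be sketched in plain Python. This is only an illustration on a list of tissue names: the helper `select_tissues` is hypothetical, and applying the exclusion list after the keep list is an assumption about how the two arguments combine.

```python
def select_tissues(all_tissues, tissues_to_keep=None, tissues_to_filter=None):
    """Return the tissues that remain after applying the keep and exclude lists."""
    # With no keep list, start from every available tissue.
    kept = list(all_tissues) if tissues_to_keep is None else list(tissues_to_keep)
    # Assumption: exclusions are applied after the keep list.
    excluded = set(tissues_to_filter or [])
    return [t for t in kept if t not in excluded]


tissues = ["liver", "lung", "testis"]
everything = select_tissues(tissues)
without_testis = select_tissues(tissues, tissues_to_filter=["testis"])
lung_only = select_tissues(
    tissues, tissues_to_keep=["lung", "testis"], tissues_to_filter=["testis"]
)
```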

gnomad.utils.transcript_annotation.tissue_expression_ht_to_array(ht, tissues_to_keep=None, tissues_to_filter=None, annotations_to_extract=('transcript_expression', 'expression_proportion'))

Convert a Table with a row annotation for each tissue to a Table with tissues as an array.

The output is a Table in one of two formats:
  • An annotation of ‘tissue_expression’ containing an array of structs by tissue, where each element of the array is the Table’s row value for a given tissue.

    Example:

    'tissue_expression': array<struct {
        transcript_expression: float64,
        expression_proportion: float64
    }>
    
  • One array annotation for each field defined in the ‘annotations_to_extract’ argument, where each array is an array of the given field values by tissue.

    Example:

    'transcript_expression': array<float64>
    'expression_proportion': array<float64>
    

The order of tissues in the array is indicated by the “tissues” global annotation.

Parameters:
  • ht (Table) – Table with a row annotation for each tissue.

  • tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the tissue expression array. Default is all non-key row fields in the Table.

  • tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the tissue expression array.

  • annotations_to_extract (Union[Tuple[str], List[str], None]) – Optional list of tissue struct fields to extract into top level array annotations. If None, the returned Table will contain a single top level annotation ‘tissue_expression’ that contains an array of structs by tissue. Default is (‘transcript_expression’, ‘expression_proportion’).

Return type:

Table

Returns:

Table with requested tissue struct annotations pulled into arrays of tissue values and a ‘tissues’ global annotation indicating the order of tissues in the arrays.
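
The two output formats can be illustrated on a single row with plain Python dictionaries standing in for Hail structs. This is a sketch of the reshaping only. The helper `tissues_to_arrays` and the toy values are hypothetical, and the `tissues` list plays the role of the ‘tissues’ global annotation that fixes the array order.

```python
def tissues_to_arrays(
    row,
    tissues,
    annotations_to_extract=("transcript_expression", "expression_proportion"),
):
    """Reshape one row of per-tissue structs into arrays ordered by `tissues`."""
    if annotations_to_extract is None:
        # Format 1: a single array of structs, one element per tissue.
        return {"tissue_expression": [row[t] for t in tissues]}
    # Format 2: one array per requested field, each ordered by tissue.
    return {a: [row[t][a] for t in tissues] for a in annotations_to_extract}


row = {
    "liver": {"transcript_expression": 3.0, "expression_proportion": 0.75},
    "lung": {"transcript_expression": 1.0, "expression_proportion": 0.5},
}
tissues = ["liver", "lung"]

as_field_arrays = tissues_to_arrays(row, tissues)
as_struct_array = tissues_to_arrays(row, tissues, annotations_to_extract=None)
```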

gnomad.utils.transcript_annotation.tx_filter_variants_by_csqs(ht, filter_to_cds=True, gencode_ht=None, filter_to_genes=None, match_by_gene_symbol=False, filter_to_csqs=None, ignore_splicing=True, filter_to_protein_coding=True, vep_root='vep')

Prepare a Table of variants with VEP transcript consequences for annotation.

Note

When filter_to_cds is set to True, the returned Table is additionally filtered to consequences with a defined ‘amino_acids’ annotation. This removes certain consequences, such as ‘stop_retained_variant’, that pass the filter on the combined CDS intervals but do not belong to the CDS of the transcript they fall on.

Parameters:
  • ht (Table) – Table of variants with ‘vep’ annotations.

  • gencode_ht (Optional[Table]) – Optional Gencode resource Table containing CDS interval information. This is only used when filter_to_cds is set to True. Default is None, which will use the default version of the Gencode Table resource for the reference build of the input Table ht.

  • filter_to_cds (bool) – Whether to filter to CDS regions. If True, the result is further filtered to consequences with a defined ‘amino_acids’ annotation. Default is True.

  • filter_to_genes (Optional[List[str]]) – Optional list of genes to filter to. Default is None.

  • match_by_gene_symbol (bool) – Whether to match by gene symbol instead of gene ID. Default is False.

  • filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is None.

  • ignore_splicing (bool) – If True, ignore splice consequences. Default is True.

  • filter_to_protein_coding (bool) – Whether to filter to protein coding transcripts. Default is True.

  • vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.

Return type:

Table

Returns:

Table of variants with preprocessed/filtered transcript consequences prepared for annotation.

gnomad.utils.transcript_annotation.tx_annotate_variants(ht, tx_ht, tissues_to_filter=None, vep_root='vep', vep_annotation='transcript_consequences')

Annotate variants with transcript-based expression values or expression proportion from GTEx.

Parameters:
  • ht (Table) – Table of variants to annotate. It should contain the nested field {vep_root}.{vep_annotation}.

  • tx_ht (Table) – Table of transcript expression information.

  • tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the output. Default is None.

  • vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.

  • vep_annotation (str) – Name of annotation under vep_root, one of the processed consequences: [“transcript_consequences”, “worst_csq_by_gene”, “worst_csq_for_variant”, “worst_csq_by_gene_canonical”, “worst_csq_for_variant_canonical”]. For example, if you want to annotate each variant with the worst consequence in each gene it falls on and the transcript expression, you would use “worst_csq_by_gene”. Default is “transcript_consequences”.

Return type:

Table

Returns:

Input Table with transcript expression information annotated.

gnomad.utils.transcript_annotation.tx_aggregate_variants(ht, additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'))

Aggregate transcript-based expression values or expression proportion from GTEx.

Parameters:
  • ht (Table) – Table of variants annotated with transcript expression information.

  • additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation.

Return type:

Table

Returns:

Table of variants with transcript expression information aggregated.
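
The grouping behavior described for additional_group_by can be sketched in plain Python. This is a simplified illustration, not the Hail implementation: each record carries a single hypothetical `expression` value rather than per-tissue values, and the helper name `aggregate_expression` is invented for the example.

```python
from collections import defaultdict


def aggregate_expression(records, additional_group_by=None):
    """Sum expression values grouped by locus, gene_id, and any extra fields."""
    # "locus" and "gene_id" are always part of the grouping key.
    group_fields = ("locus", "gene_id") + tuple(additional_group_by or ())
    sums = defaultdict(float)
    for rec in records:
        key = tuple(rec[field] for field in group_fields)
        sums[key] += rec["expression"]
    return dict(sums)


records = [
    {"locus": "1:100", "gene_id": "ENSG1", "lof": "HC", "expression": 1.5},
    {"locus": "1:100", "gene_id": "ENSG1", "lof": "LC", "expression": 2.5},
]

by_locus_gene = aggregate_expression(records)
by_locus_gene_lof = aggregate_expression(records, additional_group_by=("lof",))
```

With no additional fields the two records collapse into one group. Adding `lof` to the grouping keeps them separate.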

gnomad.utils.transcript_annotation.perform_tx_annotation_pipeline(ht, tx_ht, tissues_to_filter=None, vep_root='vep', vep_annotation='transcript_consequences', filter_to_csqs=['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_region_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant'], additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'), **kwargs)

One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.

Parameters:
  • ht (Table) – Table of variants to annotate. It should contain the nested field {vep_root}.{vep_annotation}.

  • tx_ht (Table) – Table of transcript expression information.

  • tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the output.

  • vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.

  • vep_annotation (str) – Name of annotation under vep_root. Default is ‘transcript_consequences’.

  • filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is the list of consequences shown in the signature.

  • additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation.

Return type:

Table

Returns:

Table of variants with transcript expression information aggregated.
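
A call to the pipeline might look like the sketch below, which uses only the parameters documented above. The wrapper name `annotate_variants_with_expression` and its `excluded_tissues` argument are hypothetical, and the input Tables are assumed to already have the documented structure. The import is placed inside the function so that the sketch can be loaded without Hail or gnomad installed.

```python
def annotate_variants_with_expression(variant_ht, tx_ht, excluded_tissues=None):
    """Run the documented pipeline on a variant Table and an expression Table.

    variant_ht: Hail Table containing the nested field vep.transcript_consequences.
    tx_ht: Hail Table of transcript expression information.
    excluded_tissues: optional list of tissue names to drop from the output.
    """
    # Lazy import: requires the gnomad package (and Hail) at call time only.
    from gnomad.utils.transcript_annotation import perform_tx_annotation_pipeline

    return perform_tx_annotation_pipeline(
        variant_ht,
        tx_ht,
        tissues_to_filter=excluded_tissues,
        vep_root="vep",
        vep_annotation="transcript_consequences",
    )
```

Leaving filter_to_csqs and additional_group_by unset keeps the defaults shown in the signature.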