gnomad.utils.transcript_annotation
Utils module containing generic functions that are useful for adding transcript expression-aware annotations.
summarize_transcript_expression | Summarize a transcript expression MatrixTable by transcript, gene, and tissue.
get_expression_proportion | Calculate the proportion of expression of transcript to gene per tissue.
filter_expression_ht_by_tissues | Filter a Table with a row annotation for each tissue to only include specified tissues.
tissue_expression_ht_to_array | Convert a Table with a row annotation for each tissue to a Table with tissues as an array.
tx_filter_variants_by_csqs | Prepare a Table of variants with VEP transcript consequences for annotation.
tx_annotate_variants | Annotate variants with transcript-based expression values or expression proportion from GTEx.
tx_aggregate_variants | Aggregate transcript-based expression values or expression proportion from GTEx.
perform_tx_annotation_pipeline | One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.
- gnomad.utils.transcript_annotation.summarize_transcript_expression(mt, transcript_expression_expr='transcript_tpm', tissue_expr='tissue', summary_agg_func=None)
Summarize a transcript expression MatrixTable by transcript, gene, and tissue.
The summary_agg_func argument allows the user to specify a Hail aggregation function to use to summarize the expression by tissue. By default, the median is used.
The returned Table has a row annotation for each tissue containing a struct with the summarized tissue expression value (‘transcript_expression’) and the proportion of expression of transcript to gene per tissue (‘expression_proportion’).
Returned Table Schema example:
    Row fields:
        'transcript_id': str
        'gene_id': str
        'tissue_1': struct {
            transcript_expression: float64,
            expression_proportion: float64
        }
        'tissue_2': struct {
            transcript_expression: float64,
            expression_proportion: float64
        }
    Key: ['transcript_id', 'gene_id']
- Parameters:
mt (MatrixTable) – MatrixTable of transcript (rows) expression quantifications (entry) by sample (columns).
transcript_expression_expr (Union[NumericExpression, str]) – Entry expression indicating transcript expression quantification. Default is ‘transcript_tpm’.
tissue_expr (Union[StringExpression, str]) – Column expression indicating tissue type. Default is ‘tissue’.
summary_agg_func (Optional[Callable]) – Optional aggregation function to use to summarize the transcript expression quantification by tissue. Example: hl.agg.mean. Default is None, which will use a median aggregation.
- Return type:
Table
- Returns:
A Table of summarized transcript expression by tissue.
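The per-tissue summarization described above can be sketched in plain Python, without Hail. This is only an illustration of the grouping logic, not the library's implementation: the toy records, the `summarize` helper, and its `agg_func` parameter are hypothetical stand-ins for the MatrixTable entries and `summary_agg_func`.

```python
from collections import defaultdict
from statistics import fmean, median

# Hypothetical toy data: (transcript_id, gene_id, tissue, TPM), one tuple per sample.
samples = [
    ("ENST_A", "ENSG_1", "liver", 10.0),
    ("ENST_A", "ENSG_1", "liver", 14.0),
    ("ENST_A", "ENSG_1", "liver", 30.0),
    ("ENST_A", "ENSG_1", "lung", 2.0),
    ("ENST_A", "ENSG_1", "lung", 4.0),
]


def summarize(records, agg_func=median):
    """Group TPM values by (transcript, gene) and tissue, then aggregate each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for transcript_id, gene_id, tissue, tpm in records:
        grouped[(transcript_id, gene_id)][tissue].append(tpm)
    return {
        key: {tissue: agg_func(values) for tissue, values in by_tissue.items()}
        for key, by_tissue in grouped.items()
    }


# The median is the default, mirroring the documented default behavior.
median_summary = summarize(samples)
# Passing another aggregation plays the role of summary_agg_func=hl.agg.mean.
mean_summary = summarize(samples, agg_func=fmean)
```

In the real function the aggregation runs over MatrixTable columns grouped by tissue, and each tissue becomes a row annotation rather than a dictionary key.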
- gnomad.utils.transcript_annotation.get_expression_proportion(ht)
Calculate the proportion of expression of transcript to gene per tissue.
- Parameters:
ht (Table) – Table of summarized transcript expression by tissue.
- Return type:
StructExpression
- Returns:
StructExpression containing the proportion of expression of transcript to gene per tissue.
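As a rough illustration of what "proportion of expression of transcript to gene" means, the sketch below divides each transcript's value by a gene-level total per tissue. It rests on two assumptions that the text above does not state: that the gene-level value is the sum over the gene's transcripts, and that a zero gene total yields a missing value (`None`). The `expression_proportion` helper and the toy data are hypothetical.

```python
from collections import defaultdict

# Hypothetical summarized expression: (transcript_id, gene_id) -> {tissue: value}.
summary = {
    ("ENST_A", "ENSG_1"): {"liver": 6.0, "lung": 0.0},
    ("ENST_B", "ENSG_1"): {"liver": 2.0, "lung": 0.0},
}


def expression_proportion(summary):
    """Return transcript / gene expression per tissue for every transcript."""
    # Assumption: gene-level expression is the sum over the gene's transcripts.
    gene_totals = defaultdict(lambda: defaultdict(float))
    for (_, gene_id), by_tissue in summary.items():
        for tissue, value in by_tissue.items():
            gene_totals[gene_id][tissue] += value

    proportions = {}
    for (transcript_id, gene_id), by_tissue in summary.items():
        proportions[(transcript_id, gene_id)] = {
            # Assumption: report None when the gene is not expressed in a tissue.
            tissue: (value / gene_totals[gene_id][tissue])
            if gene_totals[gene_id][tissue] > 0
            else None
            for tissue, value in by_tissue.items()
        }
    return proportions


proportions = expression_proportion(summary)
```

The library returns a StructExpression with one field per tissue; the nested dictionary here only mirrors that shape.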
- gnomad.utils.transcript_annotation.filter_expression_ht_by_tissues(ht, tissues_to_keep=None, tissues_to_filter=None)
Filter a Table with a row annotation for each tissue to only include specified tissues.
- Parameters:
ht (Table) – Table with a row annotation for each tissue.
tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the Table. Default is all non-key row fields in the Table.
tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the Table.
- Return type:
Table
- Returns:
Table with only specified tissues.
- gnomad.utils.transcript_annotation.tissue_expression_ht_to_array(ht, tissues_to_keep=None, tissues_to_filter=None, annotations_to_extract=('transcript_expression', 'expression_proportion'))
Convert a Table with a row annotation for each tissue to a Table with tissues as an array.
- The output is a Table in one of two formats:
  - An annotation of ‘tissue_expression’ containing an array of structs by tissue, where each element of the array is the Table’s row value for a given tissue (returned when ‘annotations_to_extract’ is None).
    Example:
    'tissue_expression': array<struct { transcript_expression: float64, expression_proportion: float64 }>
  - One array annotation for each field defined in the ‘annotations_to_extract’ argument, where each array holds the given field’s values by tissue.
    Example:
    'transcript_expression': array<float64>
    'expression_proportion': array<float64>
The order of tissues in the arrays is indicated by the “tissues” global annotation.
- Parameters:
ht (Table) – Table with a row annotation for each tissue.
tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the tissue expression array. Default is all non-key row fields in the Table.
tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the tissue expression array.
annotations_to_extract (Union[Tuple[str], List[str], None]) – Optional list of tissue struct fields to extract into top-level array annotations. If None, the returned Table will contain a single top-level annotation ‘tissue_expression’ that contains an array of structs by tissue. Default is (‘transcript_expression’, ‘expression_proportion’).
- Return type:
Table
- Returns:
Table with requested tissue struct annotations pulled into arrays of tissue values and a ‘tissues’ global annotation indicating the order of tissues in the arrays.
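The two output formats, and the effect of `tissues_to_keep` and `tissues_to_filter`, can be mimicked on a single row with plain Python. This is a sketch of the reshaping only; the `tissues_to_arrays` helper, the `KEY_FIELDS` constant, and the toy row are hypothetical, and the returned tissue list stands in for the ‘tissues’ global annotation.

```python
# Hypothetical row: key fields plus one struct-like dict per tissue.
row = {
    "transcript_id": "ENST_A",
    "gene_id": "ENSG_1",
    "liver": {"transcript_expression": 6.0, "expression_proportion": 0.75},
    "lung": {"transcript_expression": 1.0, "expression_proportion": 0.5},
    "testis": {"transcript_expression": 0.0, "expression_proportion": 0.0},
}
KEY_FIELDS = ("transcript_id", "gene_id")


def tissues_to_arrays(
    row,
    tissues_to_keep=None,
    tissues_to_filter=None,
    annotations_to_extract=("transcript_expression", "expression_proportion"),
):
    """Return (reshaped_row, tissues), where tissues records the array order."""
    # Default to every non-key field, as described for tissues_to_keep.
    if tissues_to_keep is not None:
        tissues = list(tissues_to_keep)
    else:
        tissues = [field for field in row if field not in KEY_FIELDS]
    if tissues_to_filter is not None:
        tissues = [t for t in tissues if t not in tissues_to_filter]

    out = {key: row[key] for key in KEY_FIELDS}
    if annotations_to_extract is None:
        # Format 1: a single array of per-tissue structs.
        out["tissue_expression"] = [row[t] for t in tissues]
    else:
        # Format 2: one array per extracted field.
        for field in annotations_to_extract:
            out[field] = [row[t][field] for t in tissues]
    return out, tissues


arrays, tissue_order = tissues_to_arrays(row, tissues_to_filter=["testis"])
structs, _ = tissues_to_arrays(row, annotations_to_extract=None)
```

Keeping the tissue order alongside the arrays is what makes the positional values interpretable, which is the role the “tissues” global annotation plays in the returned Table.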
- gnomad.utils.transcript_annotation.tx_filter_variants_by_csqs(ht, filter_to_cds=True, gencode_ht=None, filter_to_genes=None, match_by_gene_symbol=False, filter_to_csqs=None, ignore_splicing=True, filter_to_protein_coding=True, vep_root='vep')
Prepare a Table of variants with VEP transcript consequences for annotation.
Note
When filter_to_cds is True, the returned Table is further filtered to consequences with a defined ‘amino_acids’ annotation. This removes consequences, such as ‘stop_retained_variant’, that are kept by the CDS interval filter but do not belong to the CDS of the transcript they fall on.
- Parameters:
ht (Table) – Table of variants with ‘vep’ annotations.
gencode_ht (Optional[Table]) – Optional Gencode resource Table containing CDS interval information. This is only used when filter_to_cds is set to True. Default is None, which will use the default version of the Gencode Table resource for the reference build of the input Table ht.
filter_to_cds (bool) – Whether to filter to CDS regions. Default is True. When True, the Table is further filtered to consequences with a defined ‘amino_acids’ annotation.
filter_to_genes (Optional[List[str]]) – Optional list of genes to filter to. Default is None.
match_by_gene_symbol (bool) – Whether to match by gene symbol instead of gene ID. Default is False.
filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is None.
ignore_splicing (bool) – If True, ignore splice consequences. Default is True.
filter_to_protein_coding (bool) – Whether to filter to protein coding transcripts. Default is True.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
- Return type:
Table
- Returns:
Table of variants with preprocessed/filtered transcript consequences prepared for annotation.
- gnomad.utils.transcript_annotation.tx_annotate_variants(ht, tx_ht, tissues_to_filter=None, vep_root='vep', vep_annotation='transcript_consequences')
Annotate variants with transcript-based expression values or expression proportion from GTEx.
- Parameters:
ht (Table) – Table of variants to annotate. It should contain the nested fields: {vep_root}.{vep_annotation}.
tx_ht (Table) – Table of transcript expression information.
tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the output. Default is None.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
vep_annotation (str) – Name of annotation under vep_root, one of the processed consequences: [“transcript_consequences”, “worst_csq_by_gene”, “worst_csq_for_variant”, “worst_csq_by_gene_canonical”, “worst_csq_for_variant_canonical”]. For example, to annotate each variant with the worst consequence in each gene it falls on, together with the transcript expression, use “worst_csq_by_gene”. Default is “transcript_consequences”.
- Return type:
Table
- Returns:
Input Table with transcript expression information annotated.
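Conceptually, the annotation step attaches expression values to each consequence found under {vep_root}.{vep_annotation}. The plain-Python sketch below assumes the lookup is keyed by (transcript_id, gene_id), matching the key of the summarized expression Table; that join key, the `annotate_variant` helper, and the toy data are assumptions made for illustration, not the library's code.

```python
# Hypothetical expression lookup keyed by (transcript_id, gene_id).
tx_expression = {
    ("ENST_A", "ENSG_1"): {"liver": 0.75, "lung": 0.5},
    ("ENST_B", "ENSG_1"): {"liver": 0.25, "lung": 0.5},
}

# Hypothetical variant with a VEP-like nested consequence list.
variant = {
    "locus": "chr1:100",
    "vep": {
        "transcript_consequences": [
            {"transcript_id": "ENST_A", "gene_id": "ENSG_1"},
            {"transcript_id": "ENST_X", "gene_id": "ENSG_9"},
        ]
    },
}


def annotate_variant(
    variant,
    tx_expression,
    tissues_to_filter=None,
    vep_root="vep",
    vep_annotation="transcript_consequences",
):
    """Attach per-tissue expression proportion to each consequence of a variant."""
    excluded = set(tissues_to_filter or ())
    annotated = []
    for csq in variant[vep_root][vep_annotation]:
        # Assumption: consequences join to expression on (transcript_id, gene_id).
        expression = tx_expression.get((csq["transcript_id"], csq["gene_id"]))
        if expression is not None:
            expression = {t: v for t, v in expression.items() if t not in excluded}
        annotated.append({**csq, "expression_proportion": expression})
    return annotated


annotated = annotate_variant(variant, tx_expression, tissues_to_filter=["lung"])
```

Consequences on transcripts that are absent from the expression lookup are kept with a missing value in this sketch; how the library treats them is not stated above.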
- gnomad.utils.transcript_annotation.tx_aggregate_variants(ht, additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'))
Aggregate transcript-based expression values or expression proportion from GTEx.
- Parameters:
ht (Table) – Table of variants annotated with transcript expression information.
additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation. Default is (‘alleles’, ‘gene_symbol’, ‘most_severe_consequence’, ‘lof’, ‘lof_flags’).
- Return type:
Table
- Returns:
Table of variants with transcript expression information aggregated.
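The grouping and sum aggregation can be illustrated with a small plain-Python sketch. It assumes each input row carries one array of per-tissue values that is summed element-wise within a group; the `aggregate` helper, the row layout, and the field chosen for `additional_group_by` are hypothetical.

```python
# Hypothetical annotated rows: one per (variant, transcript) pair, with values by tissue.
rows = [
    {"locus": "chr1:100", "gene_id": "ENSG_1",
     "most_severe_consequence": "missense_variant", "expression_proportion": [0.75, 0.5]},
    {"locus": "chr1:100", "gene_id": "ENSG_1",
     "most_severe_consequence": "missense_variant", "expression_proportion": [0.25, 0.25]},
    {"locus": "chr1:100", "gene_id": "ENSG_1",
     "most_severe_consequence": "synonymous_variant", "expression_proportion": [0.0, 0.25]},
]


def aggregate(rows, additional_group_by=("most_severe_consequence",)):
    """Sum per-tissue values within groups of locus, gene_id, and any extra fields."""
    # "locus" and "gene_id" are always part of the group key, as documented.
    group_fields = ("locus", "gene_id") + tuple(additional_group_by or ())
    sums = {}
    for row in rows:
        key = tuple(row[field] for field in group_fields)
        values = row["expression_proportion"]
        if key not in sums:
            sums[key] = list(values)
        else:
            # Element-wise sum keeps the tissue order of the arrays intact.
            sums[key] = [a + b for a, b in zip(sums[key], values)]
    return sums


by_consequence = aggregate(rows)
by_gene_only = aggregate(rows, additional_group_by=None)
```

Passing `additional_group_by=None` collapses all rows for a locus and gene into one group, which is the coarsest grouping the parameter description allows.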
- gnomad.utils.transcript_annotation.perform_tx_annotation_pipeline(ht, tx_ht, tissues_to_filter=None, vep_root='vep', vep_annotation='transcript_consequences', filter_to_csqs=['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_region_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant'], additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'), **kwargs)
One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.
- Parameters:
ht (Table) – Table of variants to annotate. It should contain the nested fields: {vep_root}.{vep_annotation}.
tx_ht (Table) – Table of transcript expression information.
tissues_to_filter (Optional[List[str]]) – Optional list of tissues to exclude from the output.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
vep_annotation (str) – Name of annotation under vep_root. Default is ‘transcript_consequences’.
filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is the list of coding consequences shown in the signature.
additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation. Default is (‘alleles’, ‘gene_symbol’, ‘most_severe_consequence’, ‘lof’, ‘lof_flags’).
- Return type:
Table
- Returns:
Table of variants with transcript expression information aggregated.
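A call to the one-stop function might look like the sketch below, which uses only argument names taken from the signature above. The `run_tx_pipeline` wrapper is hypothetical, and it assumes `ht` is a VEP-annotated variant Table and `tx_ht` a transcript expression Table prepared beforehand. The import is placed inside the function so that defining the wrapper does not itself require gnomad or Hail to be installed.

```python
def run_tx_pipeline(ht, tx_ht, tissues_to_filter=None):
    """Filter, annotate, and aggregate variants in one call (hypothetical wrapper)."""
    # Imported lazily: only calling this function needs gnomad and Hail.
    from gnomad.utils.transcript_annotation import perform_tx_annotation_pipeline

    return perform_tx_annotation_pipeline(
        ht,
        tx_ht,
        tissues_to_filter=tissues_to_filter,
        vep_root="vep",
        vep_annotation="transcript_consequences",
        # Same grouping fields as the documented default, written out for clarity.
        additional_group_by=(
            "alleles",
            "gene_symbol",
            "most_severe_consequence",
            "lof",
            "lof_flags",
        ),
    )
```

Leaving `filter_to_csqs` unset keeps the consequence list from the signature; pass a shorter list to restrict the analysis to specific consequences.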