gnomad.utils.transcript_annotation
Utils module containing generic functions that are useful for adding transcript expression-aware annotations.
REPRODUCTIVE_TISSUES | List of reproductive tissues in GTEx.
CELL_LINES | List of cell lines in GTEx.
get_tissues_to_exclude | Get list of tissues to exclude from pext analyses and mean pext across tissues.
summarize_transcript_expression | Summarize a transcript expression MatrixTable by transcript, gene, and tissue.
get_expression_proportion | Calculate the proportion of expression of transcript to gene per tissue.
filter_expression_ht_by_tissues | Filter a Table with a row annotation for each tissue to only include specified tissues.
tissue_expression_ht_to_array | Convert a Table with a row annotation for each tissue to a Table with tissues as an array.
tx_filter_variants_by_csqs | Prepare a Table of variants with VEP transcript consequences for annotation.
tx_annotate_variants | Annotate variants with transcript-based expression values or expression proportion from GTEx.
tx_aggregate_variants | Aggregate transcript-based expression values or expression proportion from GTEx.
perform_tx_annotation_pipeline | One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.
clean_tissue_name_for_browser | Clean and format a tissue name for browser compatibility.
create_tx_annotation_by_region | Create transcript annotation by region for loading into the gnomAD browser.
- gnomad.utils.transcript_annotation.REPRODUCTIVE_TISSUES = ['Cervix_Ectocervix', 'Cervix_Endocervix', 'FallopianTube', 'Ovary', 'Prostate', 'Testis', 'Uterus', 'Vagina']
List of reproductive tissues in GTEx.
- gnomad.utils.transcript_annotation.CELL_LINES = ['Cells_EBV_transformedlymphocytes', 'Cells_Transformedfibroblasts', 'Cells_Culturedfibroblasts']
List of cell lines in GTEx.
- gnomad.utils.transcript_annotation.get_tissues_to_exclude(t, reproductive=True, cell_lines=True, min_samples=100)[source]
Get list of tissues to exclude from pext analyses and mean pext across tissues.
Note
The default value of 100 for the min_samples parameter is used because for small sample sizes, the variation in the number of expressed genes is higher. For sample sizes over 100, the number of expressed genes reaches saturation, leading to more stable estimates. See the following study for more details: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08125-9.
- Parameters:
t (Union[Table, MatrixTable]) – Table/MatrixTable with ‘tissue’ annotation for each sample. If a MatrixTable is input, samples are expected as columns.
reproductive (bool) – Whether to exclude reproductive tissues. Default is True.
cell_lines (bool) – Whether to exclude cell lines. Default is True.
min_samples (Optional[int]) – Optional minimum number of samples required for a tissue to be included. If None, tissues will not be excluded based on sample size. Default is 100.
- Return type:
List[str]
- Returns:
List of tissues to exclude from pext analyses and mean pext across tissues.
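The exclusion rules described above can be illustrated without Hail. The following is a minimal pure-Python sketch, not the library's implementation: the function name `tissues_to_exclude_sketch` and the input (a plain list with one tissue label per sample) are assumptions made for illustration, and whether the real function restricts the result to tissues present in `t` is not stated here.

```python
from collections import Counter
from typing import Iterable, List, Optional

# Constants copied from the module documentation above.
REPRODUCTIVE_TISSUES = [
    "Cervix_Ectocervix", "Cervix_Endocervix", "FallopianTube", "Ovary",
    "Prostate", "Testis", "Uterus", "Vagina",
]
CELL_LINES = [
    "Cells_EBV_transformedlymphocytes",
    "Cells_Transformedfibroblasts",
    "Cells_Culturedfibroblasts",
]


def tissues_to_exclude_sketch(
    sample_tissues: Iterable[str],
    reproductive: bool = True,
    cell_lines: bool = True,
    min_samples: Optional[int] = 100,
) -> List[str]:
    """Hypothetical stand-in: one tissue label per sample instead of a Hail Table."""
    counts = Counter(sample_tissues)
    exclude = set()
    if reproductive:
        exclude.update(REPRODUCTIVE_TISSUES)
    if cell_lines:
        exclude.update(CELL_LINES)
    if min_samples is not None:
        # Tissues with fewer samples than the threshold are excluded.
        exclude.update(t for t, n in counts.items() if n < min_samples)
    return sorted(exclude)


# Made-up sample counts, for illustration only.
sample_tissues = ["Liver"] * 150 + ["Kidney_Medulla"] * 4 + ["Testis"] * 300
excluded = tissues_to_exclude_sketch(sample_tissues, cell_lines=False)
```

In this toy input, the low-count tissue and every reproductive tissue end up in `excluded`, while the cell lines are left out of it because `cell_lines=False`.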
- gnomad.utils.transcript_annotation.summarize_transcript_expression(mt, transcript_expression_expr='transcript_tpm', tissue_expr='tissue', summary_agg_func=None)[source]
Summarize a transcript expression MatrixTable by transcript, gene, and tissue.
The summary_agg_func argument allows the user to specify a Hail aggregation function to use to summarize the expression by tissue. By default, the median is used.
The returned Table has a row annotation for each tissue containing a struct with the summarized tissue expression value (‘transcript_expression’) and the proportion of expression of transcript to gene per tissue (‘expression_proportion’).
Returned Table Schema example:
Row fields:
    'transcript_id': str
    'gene_id': str
    'tissue_1': struct {
        transcript_expression: float64,
        expression_proportion: float64
    }
    'tissue_2': struct {
        transcript_expression: float64,
        expression_proportion: float64
    }
Key: ['transcript_id', 'gene_id']
- Parameters:
mt (MatrixTable) – MatrixTable of transcript (rows) expression quantifications (entry) by sample (columns).
transcript_expression_expr (Union[NumericExpression, str]) – Entry expression indicating transcript expression quantification. Default is ‘transcript_tpm’.
tissue_expr (Union[StringExpression, str]) – Column expression indicating tissue type. Default is ‘tissue’.
summary_agg_func (Optional[Callable]) – Optional aggregation function to use to summarize the transcript expression quantification by tissue. Example: hl.agg.mean. Default is None, which will use a median aggregation.
- Return type:
Table
- Returns:
A Table of summarized transcript expression by tissue.
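To make the per-tissue summarization concrete, here is a small pure-Python sketch of the idea, assuming one (transcript, gene, tissue, TPM) tuple per sample. It is not the Hail implementation: `summarize_by_tissue` and the toy values are hypothetical, and `statistics.median` stands in for the default median aggregation.

```python
from collections import defaultdict
from statistics import mean, median

# Toy entries: (transcript_id, gene_id, tissue, TPM), one per sample.
entries = [
    ("tx1", "geneA", "Liver", 10.0),
    ("tx1", "geneA", "Liver", 14.0),
    ("tx1", "geneA", "Liver", 15.0),
    ("tx1", "geneA", "Lung", 2.0),
    ("tx1", "geneA", "Lung", 4.0),
]


def summarize_by_tissue(entries, agg_func=median):
    """Group sample-level values by (transcript, gene, tissue) and summarize each group."""
    grouped = defaultdict(list)
    for transcript_id, gene_id, tissue, tpm in entries:
        grouped[(transcript_id, gene_id, tissue)].append(tpm)

    summary = defaultdict(dict)
    for (transcript_id, gene_id, tissue), values in grouped.items():
        summary[(transcript_id, gene_id)][tissue] = agg_func(values)
    return dict(summary)


median_summary = summarize_by_tissue(entries)               # default: median
mean_summary = summarize_by_tissue(entries, agg_func=mean)  # analogous to passing hl.agg.mean
```

Passing a different function for `agg_func` mirrors the role of `summary_agg_func`: only the per-group summary changes, not the grouping.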
- gnomad.utils.transcript_annotation.get_expression_proportion(ht)[source]
Calculate the proportion of expression of transcript to gene per tissue.
- Parameters:
ht (Table) – Table of summarized transcript expression by tissue.
- Return type:
StructExpression
- Returns:
StructExpression containing the proportion of expression of transcript to gene per tissue.
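As an illustration of the transcript-to-gene proportion, the sketch below assumes that the gene-level expression in a tissue is the sum of the summarized expression of that gene's transcripts. That assumption, the function name, and the choice to return None when the gene total is zero are all part of the sketch, not statements about the library.

```python
from collections import defaultdict

# Toy summarized expression: {(transcript_id, gene_id): {tissue: value}}.
tx_expression = {
    ("tx1", "geneA"): {"Liver": 6.0, "Lung": 1.0},
    ("tx2", "geneA"): {"Liver": 2.0, "Lung": 3.0},
}


def expression_proportion_sketch(tx_expression):
    """Divide each transcript's value by its gene's per-tissue total (assumed to be a sum)."""
    gene_totals = defaultdict(lambda: defaultdict(float))
    for (_, gene_id), by_tissue in tx_expression.items():
        for tissue, value in by_tissue.items():
            gene_totals[gene_id][tissue] += value

    proportions = {}
    for (transcript_id, gene_id), by_tissue in tx_expression.items():
        proportions[(transcript_id, gene_id)] = {
            tissue: (
                value / gene_totals[gene_id][tissue]
                if gene_totals[gene_id][tissue] > 0
                else None  # sketch-only choice for genes with no expression
            )
            for tissue, value in by_tissue.items()
        }
    return proportions


proportions = expression_proportion_sketch(tx_expression)
```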
- gnomad.utils.transcript_annotation.filter_expression_ht_by_tissues(ht, tissues_to_keep=None, tissues_to_exclude=None)[source]
Filter a Table with a row annotation for each tissue to only include specified tissues.
- Parameters:
ht (Table) – Table with a row annotation for each tissue.
tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the Table. Default is all non-key row fields in the Table.
tissues_to_exclude (Optional[List[str]]) – Optional list of tissues to exclude from the Table.
- Return type:
Table
- Returns:
Table with only specified tissues.
- gnomad.utils.transcript_annotation.tissue_expression_ht_to_array(ht, tissues_to_keep=None, tissues_to_exclude=None, annotations_to_extract=('transcript_expression', 'expression_proportion'))[source]
Convert a Table with a row annotation for each tissue to a Table with tissues as an array.
- The output is a Table in one of two formats:
  - An annotation of ‘tissue_expression’ containing an array of structs by tissue, where each element of the array is the Table’s row value for a given tissue.
    Example:
    'tissue_expression': array<struct { transcript_expression: float64, expression_proportion: float64 }>
  - One array annotation for each field defined in the ‘annotations_to_extract’ argument, where each array is an array of the given field values by tissue.
    Example:
    'transcript_expression': array<float64>
    'expression_proportion': array<float64>
The order of tissues in the array is indicated by the “tissues” global annotation.
- Parameters:
ht (Table) – Table with a row annotation for each tissue.
tissues_to_keep (Optional[List[str]]) – Optional list of tissues to keep in the tissue expression array. Default is all non-key row fields in the Table.
tissues_to_exclude (Optional[List[str]]) – Optional list of tissues to exclude from the tissue expression array.
annotations_to_extract (Union[Tuple[str], List[str], None]) – Optional list of tissue struct fields to extract into top level array annotations. If None, the returned Table will contain a single top level annotation ‘tissue_expression’ that contains an array of structs by tissue. Default is (‘transcript_expression’, ‘expression_proportion’).
- Return type:
Table
- Returns:
Table with requested tissue struct annotations pulled into arrays of tissue values and a ‘tissues’ global annotation indicating the order of tissues in the arrays.
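The two output formats can be sketched for a single row in plain Python. This is only an illustration under stated assumptions: the row is modeled as a dict of per-tissue dicts, `tissue_structs_to_arrays` is a hypothetical name, and the returned tissue list plays the role of the ‘tissues’ global annotation.

```python
# Toy row: one struct-like dict per tissue.
row = {
    "Liver": {"transcript_expression": 6.0, "expression_proportion": 0.75},
    "Lung": {"transcript_expression": 1.0, "expression_proportion": 0.25},
    "Testis": {"transcript_expression": 9.0, "expression_proportion": 0.5},
}


def tissue_structs_to_arrays(
    row,
    tissues_to_keep=None,
    tissues_to_exclude=None,
    annotations_to_extract=("transcript_expression", "expression_proportion"),
):
    """Return (tissue order, annotations) for one row of per-tissue structs."""
    tissues = list(tissues_to_keep) if tissues_to_keep is not None else list(row)
    if tissues_to_exclude:
        tissues = [t for t in tissues if t not in tissues_to_exclude]

    if annotations_to_extract is None:
        # Format 1: a single array of structs, ordered by tissue.
        annotations = {"tissue_expression": [row[t] for t in tissues]}
    else:
        # Format 2: one array per extracted field, ordered by tissue.
        annotations = {
            field: [row[t][field] for t in tissues]
            for field in annotations_to_extract
        }
    return tissues, annotations


tissues, arrays = tissue_structs_to_arrays(row, tissues_to_exclude=["Testis"])
_, structs = tissue_structs_to_arrays(row, annotations_to_extract=None)
```

Keeping the tissue order alongside the arrays is what lets a consumer map an array index back to a tissue name.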
- gnomad.utils.transcript_annotation.tx_filter_variants_by_csqs(ht, filter_to_cds=True, gencode_ht=None, filter_to_genes=None, match_by_gene_symbol=False, filter_to_csqs=None, ignore_splicing=True, filter_to_protein_coding=True, vep_root='vep', include_polyphen_prioritization=False)[source]
Prepare a Table of variants with VEP transcript consequences for annotation.
Note
When filter_to_cds is set to True, the returned Table is further filtered to consequences with a defined ‘amino_acids’ annotation. This removes certain consequences, such as ‘stop_retained_variant’, that are retained by the CDS interval filter but do not belong to the CDS of the transcript they fall on.
- Parameters:
ht (Table) – Table of variants with ‘vep’ annotations.
gencode_ht (Optional[Table]) – Optional Gencode resource Table containing CDS interval information. This is only used when filter_to_cds is set to True. Default is None, which will use the default version of the Gencode Table resource for the reference build of the input Table ht.
filter_to_cds (bool) – Whether to filter to CDS regions. If True, the Table is further filtered to consequences with a defined ‘amino_acids’ annotation. Default is True.
filter_to_genes (Optional[List[str]]) – Optional list of genes to filter to. Default is None.
match_by_gene_symbol (bool) – Whether to match by gene symbol instead of gene ID. Default is False.
filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is None.
ignore_splicing (bool) – If True, ignore splice consequences. Default is True.
filter_to_protein_coding (bool) – Whether to filter to protein coding transcripts. Default is True.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
include_polyphen_prioritization (bool) – Whether to include PolyPhen prioritization when processing VEP consequences. Default is False.
- Return type:
Table
- Returns:
Table of variants with preprocessed/filtered transcript consequences prepared for annotation.
- gnomad.utils.transcript_annotation.tx_annotate_variants(ht, tx_ht, tissues_to_exclude=None, tissues_to_exclude_from_mean=None, vep_root='vep', vep_annotation='transcript_consequences')[source]
Annotate variants with transcript-based expression values or expression proportion from GTEx.
- Parameters:
ht (Table) – Table of variants to annotate. It should contain the nested fields: {vep_root}.{vep_annotation}.
tx_ht (Table) – Table of transcript expression information.
tissues_to_exclude (Optional[List[str]]) – Optional list of tissues to exclude from the output. Default is None.
tissues_to_exclude_from_mean (Optional[List[str]]) – Optional list of tissues to exclude when calculating the mean expression proportion across all tissues. Default is None.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
vep_annotation (str) – Name of annotation under vep_root, one of the processed consequences: [“transcript_consequences”, “worst_csq_by_gene”, “worst_csq_for_variant”, “worst_csq_by_gene_canonical”, “worst_csq_for_variant_canonical”]. For example, if you want to annotate each variant with the worst consequence in each gene it falls on and the transcript expression, you would use “worst_csq_by_gene”. Default is “transcript_consequences”.
- Return type:
Table
- Returns:
Input Table with transcript expression information annotated.
- gnomad.utils.transcript_annotation.tx_aggregate_variants(ht, additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'))[source]
Aggregate transcript-based expression values or expression proportion from GTEx.
- Parameters:
ht (Table) – Table of variants annotated with transcript expression information.
additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation.
- Return type:
Table
- Returns:
Table of variants with transcript expression information aggregated.
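The grouping and sum aggregation can be pictured with a short pure-Python sketch. It assumes one record per variant and transcript consequence with a single numeric field to sum. The record layout, the `expression_proportion` field name at this level, and the function name are illustrative assumptions rather than the Hail implementation.

```python
from collections import defaultdict

# Toy records: one per (variant, transcript consequence).
records = [
    {"locus": "chr1:100", "gene_id": "geneA", "gene_symbol": "A", "expression_proportion": 0.5},
    {"locus": "chr1:100", "gene_id": "geneA", "gene_symbol": "A", "expression_proportion": 0.25},
    {"locus": "chr1:100", "gene_id": "geneB", "gene_symbol": "B", "expression_proportion": 0.5},
]


def aggregate_by_group(records, additional_group_by=("gene_symbol",),
                       value_field="expression_proportion"):
    """Sum value_field over records sharing locus, gene_id, and any additional fields."""
    group_fields = ("locus", "gene_id") + tuple(additional_group_by or ())
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[field] for field in group_fields)
        totals[key] += record[value_field]
    return dict(totals)


with_symbol = aggregate_by_group(records)
locus_gene_only = aggregate_by_group(records, additional_group_by=None)
```

With `additional_group_by=None` the key collapses to (locus, gene_id), matching the behavior described for the parameter.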
- gnomad.utils.transcript_annotation.perform_tx_annotation_pipeline(ht, tx_ht, tissues_to_exclude=None, tissues_to_exclude_from_mean=None, vep_root='vep', vep_annotation='transcript_consequences', filter_to_csqs=['transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant', 'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost', 'initiator_codon_variant', 'transcript_amplification', 'inframe_insertion', 'inframe_deletion', 'missense_variant', 'protein_altering_variant', 'splice_donor_5th_base_variant', 'splice_region_variant', 'splice_donor_region_variant', 'splice_polypyrimidine_tract_variant', 'incomplete_terminal_codon_variant', 'start_retained_variant', 'stop_retained_variant', 'synonymous_variant', 'coding_sequence_variant', 'coding_transcript_variant'], additional_group_by=('alleles', 'gene_symbol', 'most_severe_consequence', 'lof', 'lof_flags'), **kwargs)[source]
One-stop usage of tx_filter_variants_by_csqs, tx_annotate_variants and tx_aggregate_variants.
Note
The default additional_group_by is used to create the gnomAD annotation-level pext release, and only additional_group_by=[“gene_symbol”] is used to create the gnomAD base-level pext release.
- Parameters:
ht (Table) – Table of variants to annotate. It should contain the nested fields: {vep_root}.{vep_annotation}.
tx_ht (Table) – Table of transcript expression information.
tissues_to_exclude (Optional[List[str]]) – Optional list of tissues to exclude from the output. Default is None.
tissues_to_exclude_from_mean (Optional[List[str]]) – Optional list of tissues to exclude when calculating the mean expression proportion across all tissues. Default is None.
vep_root (str) – Name used for root VEP annotation. Default is ‘vep’.
vep_annotation (str) – Name of annotation under vep_root. Default is ‘transcript_consequences’.
filter_to_csqs (Optional[List[str]]) – Optional list of consequences to filter to. Default is the list of coding consequences shown in the signature above.
additional_group_by (Union[Tuple[str], List[str], None]) – Optional list of additional fields to group by before sum aggregation. If None, the returned Table will be grouped by only “locus” and “gene_id” before the sum aggregation.
- Return type:
Table
- Returns:
Table of variants with transcript expression information aggregated.
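A possible way to produce both release flavors mentioned in the note is sketched below. The wrapper `run_pext_pipelines` and its argument names are hypothetical. Only `perform_tx_annotation_pipeline` and its parameters come from the signature above. The import is done lazily inside the function so the sketch can be defined without Hail installed, and it has not been run against real data here.

```python
def run_pext_pipelines(variant_ht, tx_ht, tissues_to_exclude_from_mean=None):
    """Hypothetical wrapper: build annotation-level and base-level pext Tables.

    variant_ht: Table of variants with VEP annotations (assumed to be under 'vep').
    tx_ht: Table of transcript expression information.
    """
    # Lazy import: requires the gnomad package (and Hail) only when called.
    from gnomad.utils.transcript_annotation import perform_tx_annotation_pipeline

    # Annotation-level: rely on the default additional_group_by.
    annotation_level_ht = perform_tx_annotation_pipeline(
        variant_ht,
        tx_ht,
        tissues_to_exclude_from_mean=tissues_to_exclude_from_mean,
    )

    # Base-level: group by gene symbol only, as described in the note.
    base_level_ht = perform_tx_annotation_pipeline(
        variant_ht,
        tx_ht,
        tissues_to_exclude_from_mean=tissues_to_exclude_from_mean,
        additional_group_by=["gene_symbol"],
    )
    return annotation_level_ht, base_level_ht
```

A list such as the one returned by get_tissues_to_exclude could be passed as `tissues_to_exclude_from_mean`, although that pairing is a suggestion of this sketch.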
- gnomad.utils.transcript_annotation.clean_tissue_name_for_browser(tissue_name)[source]
Clean and format a tissue name for browser compatibility.
This function converts uppercase letters to lowercase and adds underscores between words where necessary. Additionally, it replaces certain combined words with their corresponding formatted versions.
- Parameters:
tissue_name (str) – Tissue name to clean and format.
- Return type:
str
- Returns:
Cleaned and formatted tissue name.
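The kind of transformation described above might look like the following sketch. It is an approximation, not the library's code: the regular expression, the function name, and especially the `COMBINED_WORD_REPLACEMENTS` mapping are assumptions, since the actual list of combined words is not shown on this page, so real outputs may differ.

```python
import re

# Hypothetical replacements for combined words; the library's real list is not shown here.
COMBINED_WORD_REPLACEMENTS = {
    "transformedlymphocytes": "transformed_lymphocytes",
    "culturedfibroblasts": "cultured_fibroblasts",
}


def clean_tissue_name_sketch(tissue_name: str) -> str:
    """Insert underscores at lowercase-to-uppercase boundaries, then lowercase."""
    # "FallopianTube" -> "Fallopian_Tube"
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", tissue_name)
    name = name.lower()
    # Split known combined words (assumed mapping).
    for combined, formatted in COMBINED_WORD_REPLACEMENTS.items():
        name = name.replace(combined, formatted)
    return name


cleaned = [
    clean_tissue_name_sketch(name)
    for name in ["FallopianTube", "Cervix_Ectocervix", "Cells_EBV_transformedlymphocytes"]
]
```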
- gnomad.utils.transcript_annotation.create_tx_annotation_by_region(ht)[source]
Create transcript annotation by region for loading into the gnomAD browser.
This function processes a Hail Table to create transcript annotations by region. It calculates the mean expression proportion, handles missing values, and organizes the data by genomic regions. Regions are split based on changes in the following fields: ‘gene_id’, ‘exp_prop_mean’, and ‘tissues’.
Example input Table:

locus | gene_id | exp_prop_mean | tissue1 | tissue2
chr1:1 | gene1 | 0.5 | 0.2 | 0.3
chr1:2 | gene1 | 0.5 | 0.2 | 0.3
chr1:3 | gene1 | 0.6 | 0.3 | 0.4
chr1:4 | gene2 | 0.7 | 0.5 | 0.6
chr1:5 | gene2 | 0.7 | 0.5 | 0.6
chr1:6 | gene2 | 0.8 | 0.6 | 0.7

Example output Table:

gene_id | regions
gene1 | [{'chrom': 'chr1', 'start': 1, 'stop': 2, 'mean': 0.5, 'tissues': {'tissue1': 0.2, 'tissue2': 0.3}}, {'chrom': 'chr1', 'start': 3, 'stop': 3, 'mean': 0.6, 'tissues': {'tissue1': 0.3, 'tissue2': 0.4}}]
gene2 | [{'chrom': 'chr1', 'start': 4, 'stop': 5, 'mean': 0.7, 'tissues': {'tissue1': 0.5, 'tissue2': 0.6}}, {'chrom': 'chr1', 'start': 6, 'stop': 6, 'mean': 0.8, 'tissues': {'tissue1': 0.6, 'tissue2': 0.7}}]
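The region-splitting rule can be reproduced on the example above with a few lines of plain Python. This sketch assumes rows arrive sorted by locus and models each row as a dict. It is not the Hail implementation, `build_regions` is a hypothetical name, and it does not address how gaps between positions or missing values are handled.

```python
def build_regions(rows):
    """Merge consecutive rows with identical gene_id, mean, and tissue values into regions."""
    regions_by_gene = {}
    previous_key = None
    for row in rows:  # assumed sorted by locus
        chrom, position = row["locus"].split(":")
        position = int(position)
        key = (row["gene_id"], row["exp_prop_mean"], tuple(sorted(row["tissues"].items())))
        regions = regions_by_gene.setdefault(row["gene_id"], [])
        if key == previous_key and regions and regions[-1]["chrom"] == chrom:
            # Same gene and same values as the previous row: extend the open region.
            regions[-1]["stop"] = position
        else:
            # Any change in gene_id, mean, or tissue values starts a new region.
            regions.append({
                "chrom": chrom,
                "start": position,
                "stop": position,
                "mean": row["exp_prop_mean"],
                "tissues": dict(row["tissues"]),
            })
        previous_key = key
    return regions_by_gene


# Rows taken from the example input above.
rows = [
    {"locus": "chr1:1", "gene_id": "gene1", "exp_prop_mean": 0.5, "tissues": {"tissue1": 0.2, "tissue2": 0.3}},
    {"locus": "chr1:2", "gene_id": "gene1", "exp_prop_mean": 0.5, "tissues": {"tissue1": 0.2, "tissue2": 0.3}},
    {"locus": "chr1:3", "gene_id": "gene1", "exp_prop_mean": 0.6, "tissues": {"tissue1": 0.3, "tissue2": 0.4}},
    {"locus": "chr1:4", "gene_id": "gene2", "exp_prop_mean": 0.7, "tissues": {"tissue1": 0.5, "tissue2": 0.6}},
    {"locus": "chr1:5", "gene_id": "gene2", "exp_prop_mean": 0.7, "tissues": {"tissue1": 0.5, "tissue2": 0.6}},
    {"locus": "chr1:6", "gene_id": "gene2", "exp_prop_mean": 0.8, "tissues": {"tissue1": 0.6, "tissue2": 0.7}},
]
regions_by_gene = build_regions(rows)
```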