gnomad_qc.v4.annotations.insilico_predictors

Script to generate Hail Tables with in silico predictors.

usage: gnomad_qc.v4.annotations.insilico_predictors.py [-h]
                                                       [--slack-channel SLACK_CHANNEL]
                                                       [--overwrite] [--cadd]
                                                       [--spliceai]
                                                       [--pangolin] [--revel]
                                                       [--phylop]
                                                       [--revel-unmatched-transcripts]

Named Arguments

--slack-channel

Slack channel to post results and notifications to.

--overwrite

Overwrite data

Default: False

--cadd

Create CADD HT

Default: False

--spliceai

Create SpliceAI HT

Default: False

--pangolin

Create Pangolin HT

Default: False

--revel

Create REVEL HT.

Default: False

--phylop

Create PhyloP HT.

Default: False

--revel-unmatched-transcripts

Get alternative REVEL score for variants in MANE transcripts in v4.1 release.

Default: False
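
The flags above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could be built, not the module's actual `get_script_argument_parser()`; the helper name `build_parser` is hypothetical, and only the flag names, help strings, and defaults are taken from the listing above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the script's argument parser."""
    parser = argparse.ArgumentParser(
        description="Script to generate Hail Tables with in silico predictors."
    )
    parser.add_argument(
        "--slack-channel",
        help="Slack channel to post results and notifications to.",
    )
    # Every remaining option is a boolean switch that defaults to False.
    switches = {
        "--overwrite": "Overwrite data",
        "--cadd": "Create CADD HT",
        "--spliceai": "Create SpliceAI HT",
        "--pangolin": "Create Pangolin HT",
        "--revel": "Create REVEL HT.",
        "--phylop": "Create PhyloP HT.",
        "--revel-unmatched-transcripts": (
            "Get alternative REVEL score for variants in MANE transcripts "
            "in v4.1 release."
        ),
    }
    for flag, help_text in switches.items():
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["--cadd", "--revel", "--overwrite"])
print(args.cadd, args.revel, args.phylop)
```

Note that `argparse` converts dashes to underscores, so `--revel-unmatched-transcripts` is read back as `args.revel_unmatched_transcripts`.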

Module Functions

`gnomad_qc.v4.annotations.insilico_predictors.get_sift_polyphen_from_vep`(ht)	Get the max SIFT and PolyPhen scores from VEP 105 annotations.
`gnomad_qc.v4.annotations.insilico_predictors.create_cadd_grch38_ht`()	Create a Hail Table with CADD scores for GRCh38.
`gnomad_qc.v4.annotations.insilico_predictors.create_spliceai_grch38_ht`()	Create a Hail Table with SpliceAI scores for GRCh38.
`gnomad_qc.v4.annotations.insilico_predictors.create_pangolin_grch38_ht`()	Create a Hail Table with Pangolin score for splicing for GRCh38.
`gnomad_qc.v4.annotations.insilico_predictors.create_revel_grch38_ht`()	Create a Hail Table with REVEL scores for GRCh38.
`gnomad_qc.v4.annotations.insilico_predictors.create_phylop_grch38_ht`()	Convert PhyloP scores to Hail Table.
`gnomad_qc.v4.annotations.insilico_predictors.get_revel_for_unmatched_transcripts`()	Create Tables with alternative REVEL scores for variants in v4.1 release.
`gnomad_qc.v4.annotations.insilico_predictors.main`(args)	Generate Hail Tables with in silico predictors.
`gnomad_qc.v4.annotations.insilico_predictors.get_script_argument_parser`()	Get script argument parser.

gnomad_qc.v4.annotations.insilico_predictors.get_sift_polyphen_from_vep(ht)

Get the max SIFT and PolyPhen scores from VEP 105 annotations.

This retrieves the maximum SIFT and PolyPhen scores for a variant’s MANE Select transcript or, if no MANE Select transcript exists, for its canonical transcript.

Parameters: ht (Table) – VEP 105 annotated Hail Table.
Return type: Table
Returns: Table annotated with max SIFT and PolyPhen scores.
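
The transcript preference described above can be illustrated in plain Python. This is a sketch of the selection logic only, not the Hail implementation; the function name `max_sift_polyphen`, the output keys, and the per-transcript dictionary fields (`mane_select`, `canonical`, `sift_score`, `polyphen_score`) are assumptions made for this example.

```python
from typing import Optional


def max_sift_polyphen(transcripts: list[dict]) -> dict[str, Optional[float]]:
    """Return max SIFT and PolyPhen scores over the preferred transcripts.

    Preference: transcripts flagged as MANE Select; if there are none,
    fall back to transcripts flagged as canonical.
    """
    preferred = [t for t in transcripts if t.get("mane_select")]
    if not preferred:
        preferred = [t for t in transcripts if t.get("canonical") == 1]

    def _max(field: str) -> Optional[float]:
        # Ignore transcripts that carry no score for this predictor.
        values = [t[field] for t in preferred if t.get(field) is not None]
        return max(values) if values else None

    return {
        "sift_max": _max("sift_score"),
        "polyphen_max": _max("polyphen_score"),
    }


# Hypothetical consequences for one variant; the transcript ID is made up.
consequences = [
    {"mane_select": "hypothetical_id", "canonical": 1,
     "sift_score": 0.02, "polyphen_score": 0.91},
    {"mane_select": None, "canonical": 0,
     "sift_score": 0.40, "polyphen_score": 0.99},
]
print(max_sift_polyphen(consequences))
```

In this example the second transcript has higher scores, but it is ignored because a MANE Select transcript is available.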

gnomad_qc.v4.annotations.insilico_predictors.create_cadd_grch38_ht()

Create a Hail Table with CADD scores for GRCh38.

The combined CADD scores in the returned table are from the following sources:

all SNVs: cadd.v1.6.whole_genome_SNVs.tsv.bgz (81G) downloaded from https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz. It contains 8,812,917,339 SNVs.
gnomAD v3.0 indels: cadd.v1.6.gnomad.genomes.v3.0.indel.tsv.bgz (1.1G) downloaded from https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.indel.tsv.gz. It contains 100,546,109 indels from gnomAD v3.0.
gnomAD v3.1 indels: cadd.v1.6.gnomad.genomes.v3.1.indels.new.ht was run on gnomAD v3.1 with CADD v1.6 in 2020. It contains 166,122,720 new indels from gnomAD v3.1 compared to v3.0.
gnomAD v3.1 complex indels: cadd.v1.6.gnomad.genomes.v3.1.indels.complex.ht was run on gnomAD v3.1 with CADD v1.6 in 2020. It contains 2,307 complex variants that do not fit Hail’s criteria for an indel and thus exist in a separate table from the gnomAD v3.1 indels.
gnomAD v4 exomes indels: cadd.v1.6.gnomad.exomes.v4.0.indels.new.tsv.bgz (368M) was run on gnomAD v4 with CADD v1.6 in 2023. It contains 32,561,253 indels that are new in gnomAD v4.
gnomAD v4 genomes indels: cadd.v1.6.gnomad.genomes.v4.0.indels.new.tsv.bgz (13M) was run on gnomAD v4 with CADD v1.6 in 2023. It contains 904,906 indels that are new in gnomAD v4 genomes because of the addition of HGDP/TGP samples.

Note

~1.9M indels were duplicated in gnomAD v3.0 and v4.0 or in gnomAD v3.1 and v4.0. However, CADD generates only one score per locus, so we keep only the latest prediction, v4.0, for these loci. The resulting CADD HT has 9,110,177,520 rows.

Return type: Table
Returns: Hail Table with CADD scores for GRCh38.
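
The "keep only the latest prediction" rule for duplicated indels can be sketched with ordinary dictionaries. This is an illustration of the idea rather than the Hail code the function runs; the helper name `combine_cadd_sources`, the variant keys, and the scores are all made up for the example.

```python
def combine_cadd_sources(sources: list[dict]) -> dict:
    """Merge per-variant score dicts; later sources override earlier ones.

    `sources` must be ordered from oldest to newest so that, for a variant
    scored more than once, the most recent prediction is the one kept.
    """
    combined: dict = {}
    for scores in sources:
        combined.update(scores)
    return combined


# Hypothetical variant keys of the form (locus, alleles).
shared = ("chr1:1000", ("A", "AT"))
v3_0_indels = {shared: 3.4, ("chr1:2000", ("G", "GC")): 1.2}
v4_0_indels = {shared: 3.9, ("chr2:500", ("T", "TA")): 0.5}

combined = combine_cadd_sources([v3_0_indels, v4_0_indels])
print(len(combined), combined[shared])
```

The shared indel appears once in the output and carries the v4.0 score, which is why the combined table has fewer rows than the sum of its inputs.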

gnomad_qc.v4.annotations.insilico_predictors.create_spliceai_grch38_ht()

Create a Hail Table with SpliceAI scores for GRCh38.

SpliceAI scores are from the following resources:

Precomputed SNVs: spliceai_scores.masked.snv.hg38.vcf.bgz, downloaded from https://basespace.illumina.com/s/5u6ThOblecrh
Precomputed indels: spliceai_scores.masked.indel.hg38.vcf.bgz, downloaded from https://basespace.illumina.com/s/5u6ThOblecrh
gnomAD v3 indels: gnomad_v3_indel.spliceai_masked.vcf.bgz, computed on v3.1 indels by Illumina in 2020.
gnomAD v4 indels: gnomad_v4_new_indels.spliceai_masked.vcf.bgz, computed on v4 indels that are new compared to v3 indels by Illumina in February 2023.
gnomAD v3 and v4 unscored indels: spliceai_scores.masked.gnomad_v3_v4_unscored_indels.hg38.vcf.bgz, a further set of indels that had not been scored in v3 or v4, computed by Illumina in September 2023.

Return type: Table
Returns: Hail Table with SpliceAI scores for GRCh38.

gnomad_qc.v4.annotations.insilico_predictors.create_pangolin_grch38_ht()

Create a Hail Table with Pangolin score for splicing for GRCh38.

Note

The score was based on the splicing prediction tool Pangolin: Zeng, T., Li, Y.I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol 23, 103 (2022). https://doi.org/10.1186/s13059-022-02664-4

No precomputed scores exist for all possible variants. The scores were generated only for variants in gene bodies in gnomAD v4 genomes (=v3 genomes) and v4 exomes, using code from developers at Invitae: https://github.com/invitae/pangolin. All v4 genomes variants (except ~20M bug-affected and ~3M new variants from HGDP/TGP samples, noted below) were run on Pangolin v1.3.12; the others were run on Pangolin v1.4.4.

Return type: Table
Returns: Hail Table with Pangolin score for splicing for GRCh38.

gnomad_qc.v4.annotations.insilico_predictors.create_revel_grch38_ht()

Create a Hail Table with REVEL scores for GRCh38.

Note

Starting with gnomAD v4, we use REVEL scores for only MANE Select and canonical transcripts. Even when a variant falls on multiple MANE/canonical transcripts of different genes, the scores are equal.

REVEL scores were downloaded from https://rothsj06.dmz.hpc.mssm.edu/revel-v1.3_all_chromosomes.zip (size ~648M, ~82,100,677 variants).

REVEL’s Ensembl IDs are not from Ensembl 105, so we filter to transcripts that are in Ensembl 105. The Ensembl 105 ID file was downloaded from the Ensembl 105 archive. It contains the following columns:

Transcript stable ID

Ensembl Canonical

MANE Select

This deprecates the has_duplicate field present in gnomAD v3.1.

Return type: Table
Returns: Hail Table with REVEL scores for GRCh38.
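
The transcript filter described in the note can be sketched as a lookup against the Ensembl 105 ID file. This is a plain-Python illustration under stated assumptions, not the function's Hail code: the helper name `filter_revel_rows`, the row layout, and the transcript IDs are hypothetical, and the lookup is assumed to have been built from the three columns listed above.

```python
def filter_revel_rows(revel_rows: list[dict], ensembl_105: dict) -> list[dict]:
    """Keep REVEL rows whose transcript exists in Ensembl 105 and is
    either MANE Select or Ensembl Canonical there."""
    kept = []
    for row in revel_rows:
        info = ensembl_105.get(row["transcript_id"])
        if info is None:
            continue  # transcript ID is absent from Ensembl 105
        if info["mane_select"] or info["canonical"]:
            kept.append(row)
    return kept


# Hypothetical lookup keyed by transcript stable ID.
ensembl_105 = {
    "ENST_A": {"canonical": True, "mane_select": True},
    "ENST_B": {"canonical": False, "mane_select": False},
}
revel_rows = [
    {"transcript_id": "ENST_A", "revel": 0.71},
    {"transcript_id": "ENST_B", "revel": 0.64},
    {"transcript_id": "ENST_RETIRED", "revel": 0.55},
]
print(filter_revel_rows(revel_rows, ensembl_105))
```

Only the first row survives: the second transcript is neither MANE Select nor canonical, and the third is not present in Ensembl 105 at all.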

gnomad_qc.v4.annotations.insilico_predictors.create_phylop_grch38_ht()

Convert PhyloP scores to Hail Table.

PhyloP scores in BigWig format were downloaded from https://cgl.gi.ucsc.edu/data/cactus/241-mammalian-2020v2-hub/Homo_sapiens/241-mammalian-2020v2.bigWig and converted to bedGraph format with bigWigToBedGraph from the UCSC kent packages (https://hgdownload.cse.ucsc.edu/admin/exe/) with the following command:

./bigWigToBedGraph ~/Downloads/241-mammalian-2020v2.bigWig ~/Downloads/241-mammalian-2020v2.bedGraph

The bedGraph file was bgzipped before being imported into Hail.

Note

Unlike the other in silico predictors, the PhyloP HT is keyed by locus only. Since PhyloP has one value per position, we exploded the intervals to store the HT by locus. As a result, we have PhyloP scores for 2,852,623,265 loci from 2,648,607,958 intervals.

Return type: Table
Returns: Hail Table with PhyloP scores for GRCh38.
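
Exploding an interval into per-position rows can be illustrated on a single bedGraph line. This is a sketch of the idea in plain Python, not the Hail code used by the function; the helper name `explode_bedgraph_line` and the example score are made up. It relies on the bedGraph convention of 0-based, half-open intervals and emits 1-based positions.

```python
from typing import Iterator


def explode_bedgraph_line(line: str) -> Iterator[tuple[str, int, float]]:
    """Yield one (chrom, 1-based position, score) row per base in the interval."""
    chrom, start, end, score = line.split()
    # bedGraph start is 0-based and end is exclusive, so the covered
    # 1-based positions run from start + 1 through end inclusive.
    for pos in range(int(start) + 1, int(end) + 1):
        yield chrom, pos, float(score)


rows = list(explode_bedgraph_line("chr1\t10\t13\t0.25"))
print(rows)  # [('chr1', 11, 0.25), ('chr1', 12, 0.25), ('chr1', 13, 0.25)]
```

One interval of length three becomes three locus rows sharing the same score, which is why the table has more loci than source intervals.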

gnomad_qc.v4.annotations.insilico_predictors.get_revel_for_unmatched_transcripts()

Create Tables with alternative REVEL scores for variants in v4.1 release.

Return type: None

Note

REVEL was computed using transcripts from Ensembl v64. In the gnomAD v4.0 and v4.1 release Tables, transcript information from Ensembl v105 and variant information (locus and alleles combination) were used to ascertain variant REVEL scores for MANE Select or canonical transcripts only. This means that variants within 2,414 MANE Select transcripts in gnomAD v4.0 and v4.1 are missing REVEL scores because those transcripts are not present in Ensembl v64.

To address this, we annotated the variants within the 2,414 genes with the maximum REVEL score found at the specific locus and allele, rather than the score for the MANE Select transcript.

The exomes TSV adds REVEL scores to 1,936,321 out of 2,284,296 (84.77%) missense variants within the 2,414 genes. The genomes TSV adds REVEL scores to 528,204 out of 620,799 (85.08%) missense variants within the 2,414 genes.
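
Taking the maximum REVEL score at a given locus and allele, regardless of transcript, can be sketched as a simple group-and-reduce. This is an illustration only, not the function's Hail implementation; the helper name `max_revel_per_variant`, the row layout, and the scores are hypothetical.

```python
def max_revel_per_variant(rows: list[tuple]) -> dict:
    """Map each (locus, alleles) pair to the highest REVEL score seen for it.

    Each input row is (locus, alleles, score); a variant may appear several
    times, once per transcript that REVEL scored it on.
    """
    best: dict = {}
    for locus, alleles, score in rows:
        key = (locus, alleles)
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Hypothetical per-transcript REVEL rows for two variants.
revel_by_transcript = [
    ("chr1:1000", ("A", "G"), 0.31),
    ("chr1:1000", ("A", "G"), 0.58),
    ("chr1:1000", ("A", "T"), 0.12),
]
print(max_revel_per_variant(revel_by_transcript))
```

Variants that REVEL never scored on any transcript do not appear in the result, which is consistent with the alternative scores covering only part of the missense variants in these genes.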

gnomad_qc.v4.annotations.insilico_predictors.main(args)

Generate Hail Tables with in silico predictors.