gnomad.variant_qc.pipeline
create_binned_ht - Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.
score_bin_agg - Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.
generate_trio_stats - Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.
generate_sib_stats - Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.
train_rf_model - Perform random forest (RF) training using a Table annotated with features and training data.
- gnomad.variant_qc.pipeline.create_binned_ht(ht, n_bins=100, singleton=True, biallelic=True, adj=True, add_substrat=None)
Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.
This is meant as a default wrapper for compute_ranked_bin.
Note
- The following fields should be present:
score
ac - expected that this is the adj filtered allele count
ac_raw - expected that this is the raw allele count before adj filtering
- Computes bin numbers stratified by SNV / Indels and with the following optional sub bins
singletons
biallelics
biallelic singletons
adj
adj biallelics
adj singletons
adj biallelic singletons
- Parameters:
ht (Table) – Input table
n_bins (int) – Number of bins to bin into
singleton (bool) – Should bins be stratified by singletons
biallelic (bool) – Should bins be stratified by bi-allelic variants
adj (bool) – Should bins be stratified by adj filtering
add_substrat (Optional[Dict[str, BooleanExpression]]) – Any additional stratifications for adding bins
- Return type:
Table
- Returns:
table with bin number for each variant
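The idea of splitting a ranked score into n_bins equally-sized bins can be illustrated without Hail. The following is a minimal pure-Python sketch, not the logic of create_binned_ht or compute_ranked_bin: the function name assign_equal_sized_bins, the choice to rank higher scores first, the 1-based bin numbering, and the example scores are all assumptions made for illustration. It also omits the SNV / indel stratification and the optional sub bins listed above.

```python
from typing import Dict


def assign_equal_sized_bins(scores: Dict[str, float], n_bins: int = 100) -> Dict[str, int]:
    """Rank variants by score and split the ranking into n_bins near-equal bins.

    Assumptions (illustrative only): higher scores rank first and bins are
    numbered starting from 1.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = len(ranked)
    # Integer arithmetic maps ranks 0..total-1 onto bins 1..n_bins.
    return {variant: (rank * n_bins) // total + 1 for rank, variant in enumerate(ranked)}


# Hypothetical scores for eight variants, split into four bins of two variants each.
example_scores = {f"v{i}": float(i) for i in range(8)}
bins = assign_equal_sized_bins(example_scores, n_bins=4)
```

In this sketch the two highest-scoring variants share the first bin and the two lowest-scoring variants share the last one.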
- gnomad.variant_qc.pipeline.score_bin_agg(ht, fam_stats_ht)
Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.
Note
This function uses ht._parent to get the origin Table from the GroupedTable for the aggregation.
This can easily be combined with the GroupedTable returned by compute_grouped_binned_ht. For example:
    binned_ht = create_binned_ht(...)
    grouped_binned_ht = compute_grouped_binned_ht(binned_ht)
    agg_ht = grouped_binned_ht.aggregate(score_bin_agg(**grouped_binned_ht, ...))
Note
The following annotations should be present:
- In ht:
score
singleton
positive_train_site
negative_train_site
ac_raw - expected that this is the raw allele count before adj filtering
ac - expected that this is the allele count after adj filtering
ac_qc_samples_unrelated_raw - allele count before adj filtering for unrelated samples passing sample QC
info - struct that includes QD, FS, and MQ in order to add an annotation for fail_hard_filters
- In truth_ht:
omni
mills
hapmap
kgp_phase1_hc
- In fam_stats_ht:
n_de_novos_adj
n_de_novos_raw
n_transmitted_raw
n_untransmitted_raw
- Automatic aggregations that will be done are:
min_score - minimum of score annotation per group
max_score - maximum of score annotation per group
n - count of variants per group
n_ins - count of insertions per group
n_del - count of deletions per group
n_ti - count of transitions per group
n_tv - count of transversions per group
n_1bp_indel - count of one base pair indels per group
n_mod3bp_indel - count of indels with a length divisible by three per group
n_singleton - count of singletons per group
fail_hard_filters - count of variants per group with QD < 2 | FS > 60 | MQ < 30
n_vqsr_pos_train - count of variants that were a VQSR positive train site per group
n_vqsr_neg_train - count of variants that were a VQSR negative train site per group
n_clinvar - count of ClinVar variants per group
n_de_novos_singleton_adj - count of singleton de novo variants after adj filtration
n_de_novo_singleton - count of raw unfiltered singleton de novo variants
n_de_novos_adj - count of adj filtered de novo variants
n_de_novos - count of raw unfiltered de novo variants
n_trans_singletons - count of transmitted singletons
n_untrans_singletons - count of untransmitted singletons
n_omni - count of omni truth variants
n_mills - count of mills truth variants
n_hapmap - count of hapmap truth variants
n_kgp_phase1_hc - count of 1000 genomes phase 1 high confidence truth variants
- Parameters:
ht (GroupedTable) – Table that aggregation will be performed on
fam_stats_ht (Table) – Path to family statistics HT
- Return type:
Dict[str, Aggregation]
- Returns:
a dictionary containing aggregations to perform on ht
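To make a few of the per-group aggregations concrete, here is a small pure-Python sketch of min_score, max_score, n, n_singleton, and fail_hard_filters computed per bin. It is not the Hail implementation of score_bin_agg: the helper names fails_hard_filters and aggregate_by_bin, the row layout (plain dicts with a bin key), and the example rows are hypothetical. Only the hard-filter thresholds (QD < 2, FS > 60, MQ < 30) come from the list above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def fails_hard_filters(qd: float, fs: float, mq: float) -> bool:
    """Return True when a variant meets any of QD < 2, FS > 60, MQ < 30."""
    return qd < 2 or fs > 60 or mq < 30


def aggregate_by_bin(variants: Iterable[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
    """Group rows by their 'bin' value and compute a few summary statistics."""
    groups = defaultdict(list)
    for row in variants:
        groups[row["bin"]].append(row)

    summary = {}
    for bin_id, rows in groups.items():
        scores = [r["score"] for r in rows]
        summary[bin_id] = {
            "min_score": min(scores),
            "max_score": max(scores),
            "n": len(rows),
            "n_singleton": sum(1 for r in rows if r["singleton"]),
            "fail_hard_filters": sum(
                1 for r in rows if fails_hard_filters(r["QD"], r["FS"], r["MQ"])
            ),
        }
    return summary


# Hypothetical rows: two variants in bin 1 and one variant in bin 2.
rows = [
    {"bin": 1, "score": 0.9, "singleton": True, "QD": 12.0, "FS": 3.0, "MQ": 60.0},
    {"bin": 1, "score": 0.7, "singleton": False, "QD": 1.5, "FS": 3.0, "MQ": 60.0},
    {"bin": 2, "score": 0.2, "singleton": True, "QD": 8.0, "FS": 75.0, "MQ": 25.0},
]
summary = aggregate_by_bin(rows)
```

A variant that trips more than one threshold, such as the bin 2 row, is still counted once in fail_hard_filters.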
- gnomad.variant_qc.pipeline.generate_trio_stats(mt, autosomes_only=True, bi_allelic_only=True)
Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.
Note
Expects that mt is a trio matrix table that was annotated with adj. If dealing with a sparse MT, hl.experimental.densify must be run first.
By default this pipeline function will filter mt to only autosomes and bi-allelic sites.
- Parameters:
mt (MatrixTable) – A Trio Matrix Table returned from hl.trio_matrix. Must be dense
autosomes_only (bool) – If set, only autosomal intervals are used.
bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation
- Return type:
Table
- Returns:
Table with trio stats
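The effect of the autosomes_only and bi_allelic_only defaults can be pictured as a site-level predicate. The sketch below is plain Python rather than Hail and is not how generate_trio_stats filters a MatrixTable: the helper keep_site and the GRCh38-style contig names (chr1 to chr22) are assumptions for illustration.

```python
from typing import Sequence

# Assumption: GRCh38-style contig names; other references may use "1".."22".
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def keep_site(
    contig: str,
    alleles: Sequence[str],
    autosomes_only: bool = True,
    bi_allelic_only: bool = True,
) -> bool:
    """Return True if a site passes the requested filters."""
    if autosomes_only and contig not in AUTOSOMES:
        return False
    # A bi-allelic site has exactly one reference and one alternate allele.
    if bi_allelic_only and len(alleles) != 2:
        return False
    return True


# Hypothetical sites as (contig, alleles) pairs.
sites = [("chr1", ["A", "T"]), ("chrX", ["G", "C"]), ("chr2", ["A", "T", "G"])]
kept = [site for site in sites if keep_site(*site)]
```

Setting either flag to False relaxes the corresponding filter, so keep_site("chrX", ["G", "C"], autosomes_only=False) passes in this sketch.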
- gnomad.variant_qc.pipeline.generate_sib_stats(mt, relatedness_ht, i_col='i', j_col='j', relationship_col='relationship', autosomes_only=True, bi_allelic_only=True)
Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.
This is meant as a default wrapper for generate_sib_stats_expr.
This function takes a hail Table with a row for each pair of individuals i,j in the data that are related (it’s OK to have unrelated samples too).
The relationship_col should be a column specifying the relationship between each two samples as defined by the constants in gnomad.utils.relatedness. This relationship_col will be used to filter to only pairs of samples that are annotated as SIBLINGS.
Note
By default this pipeline function will filter mt to only autosomes and bi-allelic sites.
- Parameters:
mt (MatrixTable) – Input Matrix table
relatedness_ht (Table) – Input relationship table
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
relationship_col (str) – Column containing the relationship for the sample pair as defined by the constants in gnomad.utils.relatedness.
autosomes_only (bool) – If set, only autosomal intervals are used.
bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation
- Return type:
Table
- Returns:
A Table with the sibling shared variant counts
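The pair filtering and shared-variant counting described above can be sketched in plain Python. This is only an illustration of the idea, not the behavior of generate_sib_stats or generate_sib_stats_expr: the function count_sibling_shared_variants, the carriers mapping (sample to the set of variants it carries), the example data, and the literal label "siblings" are assumptions. The real label is whatever the SIBLINGS constant in gnomad.utils.relatedness defines.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Assumed relationship label, standing in for the SIBLINGS constant.
SIBLINGS = "siblings"


def count_sibling_shared_variants(
    carriers: Dict[str, Set[str]],
    pairs: Iterable[Tuple[str, str, str]],
) -> Counter:
    """Count, per variant, the sibling pairs in which both samples carry it."""
    shared = Counter()
    for i, j, relationship in pairs:
        # Pairs with any other relationship (or unrelated pairs) are ignored.
        if relationship != SIBLINGS:
            continue
        shared.update(carriers.get(i, set()) & carriers.get(j, set()))
    return shared


# Hypothetical data: four samples, two sibling pairs, one non-sibling pair.
carriers = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"b"}, "s4": {"a", "b"}}
pairs = [
    ("s1", "s2", "siblings"),
    ("s1", "s3", "parent-child"),
    ("s3", "s4", "siblings"),
]
shared_counts = count_sibling_shared_variants(carriers, pairs)
```

Here variant b is shared within both sibling pairs, while variant a is carried by s1 and s4, who are not paired as siblings, so it is not counted.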
- gnomad.variant_qc.pipeline.train_rf_model(ht, rf_features, tp_expr, fp_expr, fp_to_tp=1.0, num_trees=500, max_depth=5, test_expr=False)
Perform random forest (RF) training using a Table annotated with features and training data.
Note
- This function uses train_rf and extends it by:
Adding an option to apply the resulting model to test variants which are withheld from training.
Using a false positive (FP) to true positive (TP) ratio to determine which variants to use for RF training.
- The returned Table includes the following annotations:
rf_train: indicates if the variant was used for training of the RF model.
rf_label: indicates if the variant is a TP or FP.
rf_test: indicates if the variant was used in testing of the RF model.
features: global annotation of the features used for the RF model.
features_importance: global annotation of the importance of each feature in the model.
test_results: results from testing the model on variants defined by test_expr.
- Parameters:
ht (Table) – Table annotated with features for the RF model and the positive and negative training data.
rf_features (List[str]) – List of column names to use as features in the RF training.
tp_expr (BooleanExpression) – TP training expression.
fp_expr (BooleanExpression) – FP training expression.
fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used.
num_trees (int) – Number of trees in the RF model.
max_depth (int) – Maximum tree depth in the RF model.
test_expr (BooleanExpression) – An expression specifying variants to hold out for testing and use for evaluation only.
- Return type:
Tuple[Table, PipelineModel]
- Returns:
Table with TP and FP training sets used in the RF training and the resulting RF model.
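One way to read the fp_to_tp parameter is as a target ratio that is reached by downsampling whichever class is in excess. The sketch below shows that reading in plain Python. It is an assumption about the balancing step, not the code used by train_rf_model or train_rf: the function balance_training_sets, its seed argument, the use of rounding down, and the example inputs are hypothetical. The only detail taken from the parameter description is that a value of 0 keeps all training examples.

```python
import random
from typing import List, Sequence, Tuple


def balance_training_sets(
    tps: Sequence[str],
    fps: Sequence[str],
    fp_to_tp: float = 1.0,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Downsample TPs or FPs so that len(fps) / len(tps) is close to fp_to_tp."""
    # A ratio of 0 means "use every training example".
    if fp_to_tp <= 0 or not tps or not fps:
        return list(tps), list(fps)

    rng = random.Random(seed)
    target_fp = int(len(tps) * fp_to_tp)
    if len(fps) > target_fp:
        # Too many FPs for the requested ratio: keep all TPs, sample FPs.
        return list(tps), rng.sample(list(fps), target_fp)
    # Otherwise TPs are in excess: keep all FPs, sample TPs.
    target_tp = int(len(fps) / fp_to_tp)
    return rng.sample(list(tps), target_tp), list(fps)


# Hypothetical variant identifiers: 100 TPs and 30 FPs.
tp_ids = [f"tp{i}" for i in range(100)]
fp_ids = [f"fp{i}" for i in range(30)]
kept_tp, kept_fp = balance_training_sets(tp_ids, fp_ids, fp_to_tp=1.0)
all_tp, all_fp = balance_training_sets(tp_ids, fp_ids, fp_to_tp=0)
```

With a ratio of 1.0 the 100 TPs are sampled down to match the 30 FPs, while a ratio of 0 leaves both sets untouched.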