gnomad.variant_qc.pipeline

`gnomad.variant_qc.pipeline.create_binned_ht`(ht)	Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.
`gnomad.variant_qc.pipeline.score_bin_agg`(ht, ...)	Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.
`gnomad.variant_qc.pipeline.generate_trio_stats`(mt)	Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.
`gnomad.variant_qc.pipeline.generate_sib_stats`(mt, ...)	Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.
`gnomad.variant_qc.pipeline.train_rf_model`(ht, ...)	Perform random forest (RF) training using a Table annotated with features and training data.

gnomad.variant_qc.pipeline.create_binned_ht(ht, n_bins=100, singleton=True, biallelic=True, adj=True, add_substrat=None)[source]

Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.

This is meant as a default wrapper for compute_ranked_bin.

Note

The following fields should be present:

score
ac - expected that this is the adj filtered allele count
ac_raw - expected that this is the raw allele count before adj filtering

Computes bin numbers stratified by SNVs / indels, with the following optional sub-bins:

singletons
biallelics
biallelic singletons
adj
adj biallelics
adj singletons
adj biallelic singletons

Parameters:

ht (Table) – Input table
n_bins (int) – Number of bins to bin into
singleton (bool) – Should bins be stratified by singletons
biallelic (bool) – Should bins be stratified by bi-allelic variants
adj (bool) – Should bins be stratified by adj filtering
add_substrat (Optional[Dict[str, BooleanExpression]]) – Any additional stratifications for adding bins

Return type:

Table

Returns:

table with bin number for each variant
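
The idea of binning a score into n_bins equally-sized bins can be illustrated without Hail. The following is a minimal sketch, not the compute_ranked_bin implementation: the function name ranked_bins is hypothetical, the sort direction (highest score in bin 1) is an assumption, and the SNV / indel and sub-bin stratification is left out. To mimic a stratified bin, one would apply the same function separately to each stratum.

```python
from typing import Dict, List


def ranked_bins(scores: Dict[str, float], n_bins: int = 100) -> Dict[str, int]:
    """Assign each variant to one of n_bins equally sized, 1-based bins by score rank."""
    # Assumption: higher scores rank first, so bin 1 holds the top-scoring variants.
    ordered: List[str] = sorted(scores, key=lambda v: scores[v], reverse=True)
    total = len(ordered)
    # rank * n_bins // total maps ranks 0..total-1 onto 0..n_bins-1 in equal chunks.
    return {variant: rank * n_bins // total + 1 for rank, variant in enumerate(ordered)}


# Four variants split into two bins of two variants each.
scores = {"v1": 0.9, "v2": 0.1, "v3": 0.5, "v4": 0.7}
bins = ranked_bins(scores, n_bins=2)
```

Because the bin is derived from the rank rather than from the score value, every bin holds (up to rounding) the same number of variants regardless of how the scores are distributed.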

gnomad.variant_qc.pipeline.score_bin_agg(ht, fam_stats_ht)[source]

Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.

Note

This function uses ht._parent to get the origin Table from the GroupedTable for the aggregation

This can easily be combined with the GroupedTable returned by compute_grouped_binned_ht. For example:

binned_ht = create_binned_ht(...)
grouped_binned_ht = compute_grouped_binned_ht(binned_ht)
agg_ht = grouped_binned_ht.aggregate(**score_bin_agg(grouped_binned_ht, ...))

Note

The following annotations should be present:

In ht:

score
singleton
positive_train_site
negative_train_site
ac_raw - expected that this is the raw allele count before adj filtering
ac - expected that this is the allele count after adj filtering
ac_qc_samples_unrelated_raw - allele count before adj filtering for unrelated samples passing sample QC
info - struct that includes QD, FS, and MQ in order to add an annotation for fail_hard_filters

In truth_ht:

omni
mills
hapmap
kgp_phase1_hc

In fam_stats_ht:

n_de_novos_adj
n_de_novos_raw
n_transmitted_raw
n_untransmitted_raw

Automatic aggregations that will be done are:

min_score - minimum of score annotation per group
max_score - maximum of score annotation per group
n - count of variants per group
n_ins - count of insertions per group
n_del - count of deletions per group
n_ti - count of transitions per group
n_tv - count of transversions per group
n_1bp_indel - count of one base pair indels per group
n_mod3bp_indel - count of indels with a length divisible by three per group
n_singleton - count of singletons per group
fail_hard_filters - count of variants per group with QD < 2 | FS > 60 | MQ < 30
n_vqsr_pos_train - count of variants that were a VQSR positive train site per group
n_vqsr_neg_train - count of variants that were a VQSR negative train site per group
n_clinvar - count of ClinVar variants per group
n_de_novos_singleton_adj - count of singleton de novo variants after adj filtration
n_de_novo_singleton - count of raw unfiltered singleton de novo variants
n_de_novos_adj - count of adj filtered de novo variants
n_de_novos - count of raw unfiltered de novo variants
n_trans_singletons - count of transmitted singletons
n_untrans_singletons - count of untransmitted singletons
n_omni - count of omni truth variants
n_mills - count of mills truth variants
n_hapmap - count of hapmap truth variants
n_kgp_phase1_hc - count of 1000 genomes phase 1 high confidence truth variants

Parameters:

ht (GroupedTable) – Table that aggregation will be performed on
fam_stats_ht (Table) – Table containing family statistics

Return type:

Dict[str, Aggregation]

Returns:

a dictionary containing aggregations to perform on ht
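
To make a few of the per-group counts listed above concrete, here is a plain-Python sketch of how a single variant might contribute to n_ti, n_tv, n_ins, n_del, n_1bp_indel, n_mod3bp_indel and fail_hard_filters. It is an illustration only and does not reproduce the Hail aggregations built by score_bin_agg; the helper names (variant_classes, fails_hard_filters, aggregate_bin) and the dictionary layout of each variant are assumptions, and missing values are not handled.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def variant_classes(ref: str, alt: str) -> Tuple[str, ...]:
    """Return the count names that one ref/alt pair contributes to."""
    if len(ref) == 1 and len(alt) == 1:
        return ("n_ti",) if (ref, alt) in TRANSITIONS else ("n_tv",)
    size = abs(len(alt) - len(ref))
    if size == 0:
        # Same-length multi-base substitutions are ignored in this sketch.
        return ()
    classes = ["n_ins" if len(alt) > len(ref) else "n_del"]
    if size == 1:
        classes.append("n_1bp_indel")
    if size % 3 == 0:
        classes.append("n_mod3bp_indel")
    return tuple(classes)


def fails_hard_filters(qd: float, fs: float, mq: float) -> bool:
    """Hard filter thresholds as listed above: QD < 2 | FS > 60 | MQ < 30."""
    return qd < 2 or fs > 60 or mq < 30


def aggregate_bin(variants: Iterable[Dict]) -> Dict[str, int]:
    """Count variant classes and hard-filter failures for one group (bin)."""
    counts: Counter = Counter()
    for v in variants:
        counts["n"] += 1
        counts.update(variant_classes(v["ref"], v["alt"]))
        if fails_hard_filters(v["QD"], v["FS"], v["MQ"]):
            counts["fail_hard_filters"] += 1
    return dict(counts)


example_bin = [
    {"ref": "A", "alt": "G", "QD": 10.0, "FS": 1.0, "MQ": 60.0},
    {"ref": "C", "alt": "A", "QD": 1.5, "FS": 1.0, "MQ": 60.0},
    {"ref": "A", "alt": "AT", "QD": 10.0, "FS": 70.0, "MQ": 60.0},
    {"ref": "ACGT", "alt": "A", "QD": 10.0, "FS": 1.0, "MQ": 60.0},
]
bin_counts = aggregate_bin(example_bin)
```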

gnomad.variant_qc.pipeline.generate_trio_stats(mt, autosomes_only=True, bi_allelic_only=True)[source]

Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.

Note

Expects that mt is a trio MatrixTable that was annotated with adj. If dealing with a sparse MT, hl.experimental.densify must be run first.

By default this pipeline function will filter mt to only autosomes and bi-allelic sites.

Parameters:

mt (MatrixTable) – A Trio Matrix Table returned from hl.trio_matrix. Must be dense
autosomes_only (bool) – If set, only autosomal intervals are used.
bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation

Return type:

Table

Returns:

Table with trio stats
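
The effect of the two default filters (autosomes_only and bi_allelic_only) can be sketched as a simple per-site predicate. This is plain Python for illustration, not the Hail filtering the function performs; the helper name keep_site is hypothetical, and the GRCh38-style contig names ("chr1" to "chr22") are an assumption.

```python
from typing import Sequence

# Assumption: GRCh38-style contig names; GRCh37 would use "1".."22" instead.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def keep_site(
    contig: str,
    alleles: Sequence[str],
    autosomes_only: bool = True,
    bi_allelic_only: bool = True,
) -> bool:
    """Return True if a site passes the default autosome / bi-allelic filters."""
    if autosomes_only and contig not in AUTOSOMES:
        return False
    # A bi-allelic site has exactly one reference and one alternate allele.
    if bi_allelic_only and len(alleles) != 2:
        return False
    return True


kept = [
    keep_site("chr1", ["A", "G"]),
    keep_site("chrX", ["A", "G"]),
    keep_site("chr2", ["A", "G", "T"]),
    keep_site("chrX", ["A", "G", "T"], autosomes_only=False, bi_allelic_only=False),
]
```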

gnomad.variant_qc.pipeline.generate_sib_stats(mt, relatedness_ht, i_col='i', j_col='j', relationship_col='relationship', autosomes_only=True, bi_allelic_only=True)[source]

Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.

This is meant as a default wrapper for generate_sib_stats_expr.

This function takes a hail Table with a row for each pair of individuals i,j in the data that are related (it’s OK to have unrelated samples too).

The relationship_col should be a column specifying the relationship between each two samples as defined by the constants in gnomad.utils.relatedness. This relationship_col will be used to filter to only pairs of samples that are annotated as SIBLINGS.

Note

By default this pipeline function will filter mt to only autosomes and bi-allelic sites.

Parameters:

mt (MatrixTable) – Input Matrix table
relatedness_ht (Table) – Input relationship table
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
relationship_col (str) – Column containing the relationship for the sample pair as defined by the constants in gnomad.utils.relatedness.
autosomes_only (bool) – If set, only autosomal intervals are used.
bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation

Return type:

Table

Returns:

A Table with the sibling shared variant counts
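
The logic described above, restricting the relatedness table to sibling pairs and then counting the variants both siblings carry, can be sketched in plain Python. This is an illustration under stated assumptions, not the generate_sib_stats_expr implementation: the SIBLINGS value below stands in for the constant in gnomad.utils.relatedness, the helper names are hypothetical, and genotypes are reduced to a set of carrier samples per variant.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: placeholder for the SIBLINGS constant in gnomad.utils.relatedness.
SIBLINGS = "siblings"


def sibling_pairs(
    relatedness: Iterable[Dict[str, str]],
    i_col: str = "i",
    j_col: str = "j",
    relationship_col: str = "relationship",
) -> List[Tuple[str, str]]:
    """Keep only the sample pairs annotated as siblings."""
    return [
        (row[i_col], row[j_col])
        for row in relatedness
        if row[relationship_col] == SIBLINGS
    ]


def shared_variant_counts(
    pairs: List[Tuple[str, str]], carriers: Dict[str, Set[str]]
) -> Dict[str, int]:
    """For each variant, count the sibling pairs in which both siblings carry it."""
    return {
        variant: sum(1 for i, j in pairs if i in samples and j in samples)
        for variant, samples in carriers.items()
    }


relatedness = [
    {"i": "s1", "j": "s2", "relationship": "siblings"},
    {"i": "s1", "j": "s3", "relationship": "parent-child"},
    {"i": "s4", "j": "s5", "relationship": "siblings"},
]
carriers = {
    "1:100:A:G": {"s1", "s2", "s3"},
    "1:200:C:T": {"s1", "s4", "s5"},
    "1:300:G:A": {"s2"},
}
pairs = sibling_pairs(relatedness)
shared = shared_variant_counts(pairs, carriers)
```

Rows with other relationships (here the parent-child pair) are simply ignored, which is why it is fine for the relatedness table to contain more than sibling pairs.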

gnomad.variant_qc.pipeline.train_rf_model(ht, rf_features, tp_expr, fp_expr, fp_to_tp=1.0, num_trees=500, max_depth=5, test_expr=False)[source]

Perform random forest (RF) training using a Table annotated with features and training data.

Note

This function uses train_rf and extends it by:

Adding an option to apply the resulting model to test variants which are withheld from training.
Using a false positive (FP) to true positive (TP) ratio to determine which variants to use for RF training.

The returned Table includes the following annotations:

rf_train: indicates if the variant was used for training of the RF model.
rf_label: indicates if the variant is a TP or FP.
rf_test: indicates if the variant was used in testing of the RF model.
features: global annotation of the features used for the RF model.
features_importance: global annotation of the importance of each feature in the model.
test_results: results from testing the model on variants defined by test_expr.

Parameters:

ht (Table) – Table annotated with features for the RF model and the positive and negative training data.
rf_features (List[str]) – List of column names to use as features in the RF training.
tp_expr (BooleanExpression) – TP training expression.
fp_expr (BooleanExpression) – FP training expression.
fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used.
num_trees (int) – Number of trees in the RF model.
max_depth (int) – Maximum tree depth in the RF model.
test_expr (BooleanExpression) – An expression specifying variants to hold out for testing and use for evaluation only.

Return type:

Tuple[Table, PipelineModel]

Returns:

Table with TP and FP training sets used in the RF training and the resulting RF model.
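
The arithmetic behind the fp_to_tp ratio can be sketched as follows. This is a hedged illustration of one way to balance the two training classes, not the sampling code used by train_rf_model: the function name balanced_training_counts is hypothetical, and the choice to downsample whichever class is over-represented is an assumption. The only behavior taken from the description above is that fp_to_tp=0 keeps all training examples.

```python
import math
from typing import Tuple


def balanced_training_counts(
    n_tp: int, n_fp: int, fp_to_tp: float = 1.0
) -> Tuple[int, int]:
    """Return (TPs to keep, FPs to keep) so that FP / TP is about fp_to_tp."""
    if fp_to_tp <= 0:
        # A ratio of 0 disables balancing: all training examples are used.
        return n_tp, n_fp
    desired_fp = fp_to_tp * n_tp
    if desired_fp <= n_fp:
        # Too many FPs: keep every TP and downsample the FPs.
        return n_tp, math.floor(desired_fp)
    # Too few FPs: keep every FP and downsample the TPs instead.
    return math.floor(n_fp / fp_to_tp), n_fp


counts_excess_fp = balanced_training_counts(100, 400, fp_to_tp=1.0)
counts_excess_tp = balanced_training_counts(400, 100, fp_to_tp=1.0)
counts_unbalanced = balanced_training_counts(100, 400, fp_to_tp=0)
```

In practice the kept variants would be drawn at random from each class, and any variants selected by test_expr would be excluded before the counts are taken.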