gnomad.variant_qc.pipeline

gnomad.variant_qc.pipeline.create_binned_ht(ht)

Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.

gnomad.variant_qc.pipeline.score_bin_agg(ht, ...)

Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.

gnomad.variant_qc.pipeline.generate_trio_stats(mt)

Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.

gnomad.variant_qc.pipeline.generate_sib_stats(mt, ...)

Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.

gnomad.variant_qc.pipeline.train_rf_model(ht, ...)

Perform random forest (RF) training using a Table annotated with features and training data.

gnomad.variant_qc.pipeline.create_binned_ht(ht, n_bins=100, singleton=True, biallelic=True, adj=True, add_substrat=None)

Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.

This is meant as a default wrapper for compute_ranked_bin.

Note

The following fields should be present:
  • score

  • ac - expected that this is the adj filtered allele count

  • ac_raw - expected that this is the raw allele count before adj filtering

Computes bin numbers stratified by SNV / Indels and with the following optional sub bins
  • singletons

  • biallelics

  • biallelic singletons

  • adj

  • adj biallelics

  • adj singletons

  • adj biallelic singletons

Parameters:
  • ht (Table) – Input table

  • n_bins (int) – Number of bins to bin into

  • singleton (bool) – Should bins be stratified by singletons

  • biallelic (bool) – Should bins be stratified by bi-allelic variants

  • adj (bool) – Should bins be stratified by adj filtering

  • add_substrat (Optional[Dict[str, BooleanExpression]]) – Any additional stratifications for adding bins

Return type:

Table

Returns:

table with bin number for each variant
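To make "equally-sized bins" concrete, here is a minimal pure-Python sketch of rank-based binning. It is not the gnomad or Hail implementation: the function name assign_rank_bins, the choice to rank higher scores first, and the tie-breaking rule are all assumptions made for illustration, and the sketch ignores the SNV / indel stratification and the optional sub bins described above.

```python
from typing import Dict


def assign_rank_bins(scores: Dict[str, float], n_bins: int = 100) -> Dict[str, int]:
    """Assign each variant a 1-based bin so that bins hold roughly equal counts.

    Assumption: higher scores are ranked first. The ordering used by
    compute_ranked_bin may differ.
    """
    # Rank variants from highest to lowest score (ties broken by variant ID).
    ranked = sorted(scores, key=lambda variant: (-scores[variant], variant))
    n = len(ranked)
    # Map the 0-based rank r to bin floor(r * n_bins / n) + 1.
    return {variant: (rank * n_bins) // n + 1 for rank, variant in enumerate(ranked)}


# Hypothetical scores for six variants, binned into three bins of two.
scores = {"v1": 0.9, "v2": 0.1, "v3": 0.5, "v4": 0.7, "v5": 0.3, "v6": 0.2}
bins = assign_rank_bins(scores, n_bins=3)
print(bins)
```

Because the bin depends only on rank, each bin receives about len(scores) / n_bins variants regardless of how the scores are distributed.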

gnomad.variant_qc.pipeline.score_bin_agg(ht, fam_stats_ht)

Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.

Note

This function uses ht._parent to get the original Table from the GroupedTable for the aggregation.

This can easily be combined with the GroupedTable returned by compute_grouped_binned_ht. For example:

binned_ht = create_binned_ht(...)
grouped_binned_ht = compute_grouped_binned_ht(binned_ht)
agg_ht = grouped_binned_ht.aggregate(score_bin_agg(**grouped_binned_ht, ...))

Note

The following annotations should be present:

In ht:
  • score

  • singleton

  • positive_train_site

  • negative_train_site

  • ac_raw - expected that this is the raw allele count before adj filtering

  • ac - expected that this is the allele count after adj filtering

  • ac_qc_samples_unrelated_raw - allele count before adj filtering for unrelated samples passing sample QC

  • info - struct that includes QD, FS, and MQ in order to add an annotation for fail_hard_filters

In truth_ht:
  • omni

  • mills

  • hapmap

  • kgp_phase1_hc

In fam_stats_ht:
  • n_de_novos_adj

  • n_de_novos_raw

  • n_transmitted_raw

  • n_untransmitted_raw

Automatic aggregations that will be done are:
  • min_score - minimum of score annotation per group

  • max_score - maximum of score annotation per group

  • n - count of variants per group

  • n_ins - count of insertions per group

  • n_del - count of deletions per group

  • n_ti - count of transitions per group

  • n_tv - count of transversions per group

  • n_1bp_indel - count of one base pair indels per group

  • n_mod3bp_indel - count of indels with a length divisible by three per group

  • n_singleton - count of singletons per group

  • fail_hard_filters - count of variants per group with QD < 2 | FS > 60 | MQ < 30

  • n_vqsr_pos_train - count of variants that were a VQSR positive train site per group

  • n_vqsr_neg_train - count of variants that were a VQSR negative train site per group

  • n_clinvar - count of ClinVar variants

  • n_de_novos_singleton_adj - count of singleton de novo variants after adj filtration

  • n_de_novo_singleton - count of raw unfiltered singleton de novo variants

  • n_de_novos_adj - count of adj filtered de novo variants

  • n_de_novos - count of raw unfiltered de novo variants

  • n_trans_singletons - count of transmitted singletons

  • n_untrans_singletons - count of untransmitted singletons

  • n_omni - count of omni truth variants

  • n_mills - count of mills truth variants

  • n_hapmap - count of hapmap truth variants

  • n_kgp_phase1_hc - count of 1000 genomes phase 1 high confidence truth variants

Parameters:
  • ht (GroupedTable) – Table that aggregation will be performed on

  • fam_stats_ht (Table) – Table of family statistics

Return type:

Dict[str, Aggregation]

Returns:

a dictionary containing aggregations to perform on ht
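As a rough illustration of a few of the aggregations listed above, the following pure-Python sketch computes them for a single group of variants stored as plain dicts. It is not the Hail aggregation dictionary that score_bin_agg returns: the function name summarize_group and the record layout (ref, alt, score, info) are hypothetical, and only the hard-filter thresholds (QD < 2, FS > 60, MQ < 30) are taken from the list.

```python
from typing import Dict, List, Union


def summarize_group(variants: List[dict]) -> Dict[str, Union[int, float]]:
    """Compute a subset of the listed aggregations for one non-empty group."""

    def length_change(variant: dict) -> int:
        # Positive for insertions, negative for deletions, zero for SNVs.
        return len(variant["alt"]) - len(variant["ref"])

    def fails_hard_filters(variant: dict) -> bool:
        info = variant["info"]
        return info["QD"] < 2 or info["FS"] > 60 or info["MQ"] < 30

    scores = [variant["score"] for variant in variants]
    changes = [length_change(variant) for variant in variants]
    return {
        "min_score": min(scores),
        "max_score": max(scores),
        "n": len(variants),
        "n_ins": sum(change > 0 for change in changes),
        "n_del": sum(change < 0 for change in changes),
        "n_1bp_indel": sum(abs(change) == 1 for change in changes),
        "n_mod3bp_indel": sum(change != 0 and change % 3 == 0 for change in changes),
        "fail_hard_filters": sum(fails_hard_filters(variant) for variant in variants),
    }


# Hypothetical group: one SNV, one 1 bp insertion, one 3 bp deletion.
group = [
    {"ref": "A", "alt": "G", "score": 0.9, "info": {"QD": 10.0, "FS": 5.0, "MQ": 60.0}},
    {"ref": "A", "alt": "AT", "score": 0.4, "info": {"QD": 1.5, "FS": 5.0, "MQ": 60.0}},
    {"ref": "ACGT", "alt": "A", "score": 0.2, "info": {"QD": 12.0, "FS": 70.0, "MQ": 60.0}},
]
summary = summarize_group(group)
print(summary)
```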

gnomad.variant_qc.pipeline.generate_trio_stats(mt, autosomes_only=True, bi_allelic_only=True)

Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.

Note

Expects that mt is a trio MatrixTable that was annotated with adj. If dealing with a sparse MT, hl.experimental.densify must be run first.

By default this pipeline function will filter mt to only autosomes and bi-allelic sites.

Parameters:
  • mt (MatrixTable) – A Trio Matrix Table returned from hl.trio_matrix. Must be dense

  • autosomes_only (bool) – If set, only autosomal intervals are used.

  • bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation

Return type:

Table

Returns:

Table with trio stats
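The default site filtering controlled by autosomes_only and bi_allelic_only can be pictured with the pure-Python predicate below. This is only a sketch of the idea, not how the pipeline filters a MatrixTable: the helper keep_site is hypothetical, and the GRCh38-style contig names (chr1 to chr22) are an assumption.

```python
from typing import Sequence

# Assumption: GRCh38-style contig names. GRCh37 would use "1" to "22".
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def keep_site(
    contig: str,
    alleles: Sequence[str],
    autosomes_only: bool = True,
    bi_allelic_only: bool = True,
) -> bool:
    """Return True if a site passes the requested filters."""
    if autosomes_only and contig not in AUTOSOMES:
        return False
    # A bi-allelic site has exactly one reference and one alternate allele.
    if bi_allelic_only and len(alleles) != 2:
        return False
    return True


print(keep_site("chr1", ["A", "G"]))
print(keep_site("chrX", ["A", "G"]))
print(keep_site("chr2", ["A", "G", "T"]))
```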

gnomad.variant_qc.pipeline.generate_sib_stats(mt, relatedness_ht, i_col='i', j_col='j', relationship_col='relationship', autosomes_only=True, bi_allelic_only=True)

Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.

This is meant as a default wrapper for generate_sib_stats_expr.

This function takes a hail Table with a row for each pair of individuals i,j in the data that are related (it’s OK to have unrelated samples too).

The relationship_col should be a column specifying the relationship between each two samples as defined by the constants in gnomad.utils.relatedness. This relationship_col will be used to filter to only pairs of samples that are annotated as SIBLINGS.

Note

By default this pipeline function will filter mt to only autosomes and bi-allelic sites.

Parameters:
  • mt (MatrixTable) – Input Matrix table

  • relatedness_ht (Table) – Input relationship table

  • i_col (str) – Column containing the 1st sample of the pair in the relationship table

  • j_col (str) – Column containing the 2nd sample of the pair in the relationship table

  • relationship_col (str) – Column containing the relationship for the sample pair, as defined by the constants in gnomad.utils.relatedness.

  • autosomes_only (bool) – If set, only autosomal intervals are used.

  • bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation

Return type:

Table

Returns:

A Table with the sibling shared variant counts
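The sketch below illustrates the idea of filtering the relatedness rows to sibling pairs and counting, per variant, how many sibling pairs share it. It uses plain Python data structures rather than Hail, and several things are assumptions: the function count_sib_shared, the carriers mapping from sample to the variants it carries, and the "siblings" label, which stands in for the constant defined in gnomad.utils.relatedness.

```python
from collections import Counter
from typing import Dict, List, Set

# Assumed label; the real constant is defined in gnomad.utils.relatedness.
SIBLINGS = "siblings"


def count_sib_shared(
    relatedness_rows: List[dict],
    carriers: Dict[str, Set[str]],
    i_col: str = "i",
    j_col: str = "j",
    relationship_col: str = "relationship",
) -> Dict[str, int]:
    """Count, per variant, the sibling pairs in which both samples carry it."""
    shared: Counter = Counter()
    for row in relatedness_rows:
        # Pairs with any other relationship are ignored.
        if row[relationship_col] != SIBLINGS:
            continue
        variants_i = carriers.get(row[i_col], set())
        variants_j = carriers.get(row[j_col], set())
        for variant in variants_i & variants_j:
            shared[variant] += 1
    return dict(shared)


# Hypothetical relatedness rows and per-sample carried variants.
rows = [
    {"i": "s1", "j": "s2", "relationship": "siblings"},
    {"i": "s1", "j": "s3", "relationship": "parent-child"},
    {"i": "s4", "j": "s5", "relationship": "siblings"},
]
carriers = {
    "s1": {"v1", "v2"},
    "s2": {"v2", "v3"},
    "s3": {"v1"},
    "s4": {"v2"},
    "s5": {"v2", "v4"},
}
sib_counts = count_sib_shared(rows, carriers)
print(sib_counts)
```

Note that v1 is shared by s1 and s3 but is not counted, because that pair is not annotated as siblings.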

gnomad.variant_qc.pipeline.train_rf_model(ht, rf_features, tp_expr, fp_expr, fp_to_tp=1.0, num_trees=500, max_depth=5, test_expr=False)

Perform random forest (RF) training using a Table annotated with features and training data.

Note

This function uses train_rf and extends it by:
  • Adding an option to apply the resulting model to test variants which are withheld from training.

  • Using a false positive (FP) to true positive (TP) ratio to determine which variants to use for RF training.

The returned Table includes the following annotations:
  • rf_train: indicates if the variant was used for training of the RF model.

  • rf_label: indicates if the variant is a TP or FP.

  • rf_test: indicates if the variant was used in testing of the RF model.

  • features: global annotation of the features used for the RF model.

  • features_importance: global annotation of the importance of each feature in the model.

  • test_results: results from testing the model on variants defined by test_expr.

Parameters:
  • ht (Table) – Table annotated with features for the RF model and the positive and negative training data.

  • rf_features (List[str]) – List of column names to use as features in the RF training.

  • tp_expr (BooleanExpression) – TP training expression.

  • fp_expr (BooleanExpression) – FP training expression.

  • fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used.

  • num_trees (int) – Number of trees in the RF model.

  • max_depth (int) – Maximum tree depth in the RF model.

  • test_expr (BooleanExpression) – An expression specifying variants to hold out for testing and use for evaluation only.

Return type:

Tuple[Table, PipelineModel]

Returns:

Table with TP and FP training sets used in the RF training and the resulting RF model.
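One plausible way to read the fp_to_tp parameter is as a target ratio reached by downsampling whichever class is over-represented. The sketch below works through that arithmetic in plain Python. It is an assumption about the balancing logic, not a description of what train_rf_model does internally, and the helper training_keep_fractions is hypothetical.

```python
from typing import Tuple


def training_keep_fractions(
    n_tp: int, n_fp: int, fp_to_tp: float = 1.0
) -> Tuple[float, float]:
    """Return (fraction of TPs to keep, fraction of FPs to keep).

    Assumption: only one class is ever downsampled, and fp_to_tp=0 keeps
    every training example.
    """
    if fp_to_tp <= 0 or n_tp == 0 or n_fp == 0:
        return 1.0, 1.0
    desired_fp = fp_to_tp * n_tp
    if desired_fp < n_fp:
        # Too many FPs: keep all TPs and downsample the FPs.
        return 1.0, desired_fp / n_fp
    # Too few FPs: keep all FPs and downsample the TPs to match the ratio.
    return n_fp / desired_fp, 1.0


print(training_keep_fractions(n_tp=100, n_fp=400, fp_to_tp=1.0))
print(training_keep_fractions(n_tp=400, n_fp=100, fp_to_tp=1.0))
print(training_keep_fractions(n_tp=100, n_fp=400, fp_to_tp=0))
```

With 100 TPs and 400 FPs at a ratio of 1.0, a quarter of the FPs would be kept so that the two classes end up the same size.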