gnomad.variant_qc.pipeline
| create_binned_ht | Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins. |
| score_bin_agg | Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics. |
| generate_trio_stats | Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj. |
| generate_sib_stats | Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht. |
| train_rf_model | Perform random forest (RF) training using a Table annotated with features and training data. |
- gnomad.variant_qc.pipeline.create_binned_ht(ht, n_bins=100, singleton=True, biallelic=True, adj=True, add_substrat=None)
- Annotate each row of ht with a bin based on binning the score annotation into n_bins equally-sized bins.
- This is meant as a default wrapper for compute_ranked_bin.
- Note: The following fields should be present:
- score 
- ac - expected that this is the adj filtered allele count 
- ac_raw - expected that this is the raw allele count before adj filtering 
 
 - Computes bin numbers stratified by SNVs / indels, with the following optional sub-bins:
- singletons 
- biallelics 
- biallelic singletons 
- adj 
- adj biallelics 
- adj singletons 
- adj biallelic singletons 
 
 - Parameters:
- ht (Table) – Input table
- n_bins (int) – Number of bins to bin into
- singleton (bool) – Whether bins should be stratified by singletons
- biallelic (bool) – Whether bins should be stratified by bi-allelic variants
- adj (bool) – Whether bins should be stratified by adj filtering
- add_substrat (Optional[Dict[str, BooleanExpression]]) – Any additional stratifications for adding bins
 
- Return type:
- Table
- Returns:
- table with bin number for each variant 
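
The idea of rank-based, equally-sized bins computed separately for SNVs and indels can be illustrated in plain Python. This is only a sketch of the concept, not the Hail implementation behind compute_ranked_bin: the helper names `ranked_bins` and `stratified_bins`, the ascending sort order, and the exact bin formula are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def ranked_bins(scores: Dict[str, float], n_bins: int) -> Dict[str, int]:
    """Map each variant ID to a 1-based bin derived from its score rank."""
    # Ascending sort order is an assumption of this sketch.
    ordered = sorted(scores, key=lambda variant_id: scores[variant_id])
    n = len(ordered)
    # Integer arithmetic spreads n ranks over n_bins bins of near-equal size.
    return {vid: rank * n_bins // n + 1 for rank, vid in enumerate(ordered)}


def stratified_bins(
    variants: List[Tuple[str, float, bool]], n_bins: int
) -> Dict[str, int]:
    """Bin (variant_id, score, is_snv) records separately for SNVs and indels."""
    by_type: Dict[bool, Dict[str, float]] = defaultdict(dict)
    for variant_id, score, is_snv in variants:
        by_type[is_snv][variant_id] = score
    bins: Dict[str, int] = {}
    for type_scores in by_type.values():
        bins.update(ranked_bins(type_scores, n_bins))
    return bins


# Ten SNVs and ten indels, each group binned independently into 5 bins.
variants = [(f"snv{i}", i / 10, True) for i in range(10)]
variants += [(f"indel{i}", i / 20, False) for i in range(10)]
bins = stratified_bins(variants, n_bins=5)
snv_bin_sizes = Counter(b for vid, b in bins.items() if vid.startswith("snv"))
```

Because each stratum is ranked on its own, an indel and an SNV with the same score can land in different bins. The optional sub-bins (singletons, biallelics, adj) would follow the same pattern with a different grouping key.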
 
- gnomad.variant_qc.pipeline.score_bin_agg(ht, fam_stats_ht)
- Make dict of aggregations for min/max of score, number of ClinVar variants, number of truth variants, and family statistics.
- Note: This function uses ht._parent to get the origin Table from the GroupedTable for the aggregation.
- This can easily be combined with the GroupedTable returned by compute_grouped_binned_ht. For example:

    binned_ht = create_binned_ht(...)
    grouped_binned_ht = compute_grouped_binned_ht(binned_ht)
    agg_ht = grouped_binned_ht.aggregate(**score_bin_agg(grouped_binned_ht, ...))

- Note: The following annotations should be present:
- In ht:
- score 
- singleton 
- positive_train_site 
- negative_train_site 
- ac_raw - expected that this is the raw allele count before adj filtering 
- ac - expected that this is the allele count after adj filtering 
- ac_qc_samples_unrelated_raw - allele count before adj filtering for unrelated samples passing sample QC 
- info - struct that includes QD, FS, and MQ in order to add an annotation for fail_hard_filters 
 
- In truth_ht:
- omni 
- mills 
- hapmap 
- kgp_phase1_hc 
 
- In fam_stats_ht:
- n_de_novos_adj 
- n_de_novos_raw 
- n_transmitted_raw 
- n_untransmitted_raw 
 
 - Automatic aggregations that will be done are:
- min_score - minimum of score annotation per group 
- max_score - maximum of score annotation per group 
- n - count of variants per group 
- n_ins - count of insertions per group 
- n_del - count of deletions per group 
- n_ti - count of transitions per group 
- n_tv - count of transversions per group 
- n_1bp_indel - count of one base pair indels per group 
- n_mod3bp_indel - count of indels with a length divisible by three per group 
- n_singleton - count of singletons per group 
- fail_hard_filters - count of variants per group with QD < 2 | FS > 60 | MQ < 30 
- n_vqsr_pos_train - count of variants that were a VQSR positive train site per group 
- n_vqsr_neg_train - count of variants that were a VQSR negative train site per group 
- n_clinvar - count of clinvar variants 
- n_de_novos_singleton_adj - count of singleton de novo variants after adj filtration 
- n_de_novo_singleton - count of raw unfiltered singleton de novo variants 
- n_de_novos_adj - count of adj filtered de novo variants 
- n_de_novos - count of raw unfiltered de novo variants 
- n_trans_singletons - count of transmitted singletons 
- n_untrans_singletons - count of untransmitted singletons 
- n_omni - count of omni truth variants 
- n_mills - count of mills truth variants 
- n_hapmap - count of hapmap truth variants 
- n_kgp_phase1_hc - count of 1000 genomes phase 1 high confidence truth variants 
 
 - Parameters:
- ht (GroupedTable) – Table that aggregation will be performed on
- fam_stats_ht (Table) – Family statistics Table
 
- Return type:
- Dict[str, Aggregation]
- Returns:
- a dictionary containing aggregations to perform on ht 
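
To make a few of the per-group aggregations concrete, the following plain-Python sketch computes them for one group of variants. It does not use Hail and is not the library's code: the `Variant` record, the `summarize_bin` helper, and the simplified indel definition (ref and alt of different lengths) are assumptions for illustration only.

```python
from typing import Dict, List, NamedTuple

# Purine <-> purine and pyrimidine <-> pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


class Variant(NamedTuple):
    ref: str
    alt: str
    score: float


def summarize_bin(variants: List[Variant]) -> Dict[str, float]:
    """Compute a subset of the per-group statistics listed above."""
    snvs = [v for v in variants if len(v.ref) == 1 and len(v.alt) == 1]
    # Simplification: any length difference between ref and alt is an indel.
    indels = [v for v in variants if len(v.ref) != len(v.alt)]
    return {
        "min_score": min(v.score for v in variants),
        "max_score": max(v.score for v in variants),
        "n": len(variants),
        "n_ins": sum(len(v.alt) > len(v.ref) for v in indels),
        "n_del": sum(len(v.alt) < len(v.ref) for v in indels),
        "n_ti": sum((v.ref, v.alt) in TRANSITIONS for v in snvs),
        "n_tv": sum((v.ref, v.alt) not in TRANSITIONS for v in snvs),
        "n_1bp_indel": sum(abs(len(v.alt) - len(v.ref)) == 1 for v in indels),
        "n_mod3bp_indel": sum(
            abs(len(v.alt) - len(v.ref)) % 3 == 0 for v in indels
        ),
    }


bin_variants = [
    Variant("A", "G", 0.9),     # transition
    Variant("A", "C", 0.4),     # transversion
    Variant("A", "AT", 0.2),    # 1 bp insertion
    Variant("ACGT", "A", 0.7),  # 3 bp deletion
]
stats = summarize_bin(bin_variants)
```

In the pipeline these statistics are expressed as Hail aggregations and evaluated once per group of the GroupedTable rather than on Python lists.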
 
- gnomad.variant_qc.pipeline.generate_trio_stats(mt, autosomes_only=True, bi_allelic_only=True)
- Run generate_trio_stats_expr with variant QC pipeline defaults to get trio stats stratified by raw and adj.
- Note: Expects that mt is a trio MatrixTable that was annotated with adj. If dealing with a sparse MT, hl.experimental.densify must be run first.
- By default, this pipeline function will filter mt to only autosomes and bi-allelic sites.
- Parameters:
- mt (MatrixTable) – A trio MatrixTable returned from hl.trio_matrix. Must be dense
- autosomes_only (bool) – If set, only autosomal intervals are used.
- bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation
 
- Return type:
- Table
- Returns:
- Table with trio stats 
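
As a rough illustration of what trio stats stratified by raw and adj could look like for a single site, the sketch below counts transmitted, untransmitted, and de novo alleles in plain Python. The genotype rules used here (one heterozygous parent and one homozygous-reference parent for transmission, two homozygous-reference parents for de novos) and the `TrioGenotypes` / `trio_counts` names are assumptions for illustration; the criteria actually applied by generate_trio_stats_expr may differ.

```python
from typing import Dict, List, NamedTuple


class TrioGenotypes(NamedTuple):
    proband: int  # number of alt alleles: 0, 1, or 2
    father: int
    mother: int
    adj: bool     # assumed True when all three genotypes pass adj filtering


def trio_counts(trios: List[TrioGenotypes]) -> Dict[str, int]:
    """Count per-site trio events, once for raw and once for adj genotypes."""
    counts = {
        f"{name}_{stratum}": 0
        for name in ("n_transmitted", "n_untransmitted", "n_de_novos")
        for stratum in ("raw", "adj")
    }
    for trio in trios:
        parents = sorted((trio.father, trio.mother))
        if parents == [0, 1] and trio.proband == 1:
            event = "n_transmitted"
        elif parents == [0, 1] and trio.proband == 0:
            event = "n_untransmitted"
        elif parents == [0, 0] and trio.proband == 1:
            event = "n_de_novos"
        else:
            continue  # configuration not counted in this sketch
        counts[f"{event}_raw"] += 1  # raw counts every qualifying trio
        if trio.adj:
            counts[f"{event}_adj"] += 1  # adj counts only passing genotypes
    return counts


trios = [
    TrioGenotypes(proband=1, father=1, mother=0, adj=True),   # transmitted
    TrioGenotypes(proband=0, father=0, mother=1, adj=True),   # untransmitted
    TrioGenotypes(proband=1, father=0, mother=0, adj=False),  # de novo, raw only
    TrioGenotypes(proband=1, father=0, mother=0, adj=True),   # de novo
    TrioGenotypes(proband=2, father=1, mother=1, adj=True),   # not counted
]
site_counts = trio_counts(trios)
```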
 
- gnomad.variant_qc.pipeline.generate_sib_stats(mt, relatedness_ht, i_col='i', j_col='j', relationship_col='relationship', autosomes_only=True, bi_allelic_only=True)
- Generate a hail table with counts of variants shared by pairs of siblings in relatedness_ht.
- This is meant as a default wrapper for generate_sib_stats_expr.
- This function takes a Hail Table with a row for each pair of related individuals i, j in the data (it is fine for the Table to contain unrelated samples too).
- The relationship_col should be a column specifying the relationship between each pair of samples, as defined by the constants in gnomad.utils.relatedness. relationship_col is used to filter to only pairs of samples that are annotated as SIBLINGS.
- Note: By default, this pipeline function will filter mt to only autosomes and bi-allelic sites.
- Parameters:
- mt (MatrixTable) – Input MatrixTable
- relatedness_ht (Table) – Input relationship table
- i_col (str) – Column containing the first sample of the pair in the relationship table
- j_col (str) – Column containing the second sample of the pair in the relationship table
- relationship_col (str) – Column containing the relationship for the sample pair, as defined by the constants in gnomad.utils.relatedness.
- autosomes_only (bool) – If set, only autosomal intervals are used.
- bi_allelic_only (bool) – If set, only bi-allelic sites are used for the computation
 
- Return type:
- Table
- Returns:
- A Table with the sibling shared variant counts 
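
The two steps described above, filtering the relationship table to sibling pairs and then counting shared variants, can be sketched in plain Python. Nothing here is the library's code: the `"siblings"` string is assumed to mirror the SIBLINGS constant in gnomad.utils.relatedness, "shared" is assumed to mean that both siblings carry a non-reference allele, and the helper names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

SIBLINGS = "siblings"  # assumed value of the gnomad.utils.relatedness constant


def sibling_pairs(
    relatedness_rows: List[Dict[str, str]],
    i_col: str = "i",
    j_col: str = "j",
    relationship_col: str = "relationship",
) -> List[Tuple[str, str]]:
    """Keep only the sample pairs annotated as siblings."""
    return [
        (row[i_col], row[j_col])
        for row in relatedness_rows
        if row[relationship_col] == SIBLINGS
    ]


def shared_variant_counts(
    carriers: Dict[str, Set[str]], pairs: List[Tuple[str, str]]
) -> Dict[str, int]:
    """For each variant, count sibling pairs in which both samples carry it."""
    return {
        variant: sum(i in samples and j in samples for i, j in pairs)
        for variant, samples in carriers.items()
    }


relatedness_rows = [
    {"i": "s1", "j": "s2", "relationship": "siblings"},
    {"i": "s1", "j": "s3", "relationship": "parent-child"},
    {"i": "s4", "j": "s5", "relationship": "siblings"},
]
# Variant ID -> samples carrying at least one non-reference allele.
carriers = {
    "v1": {"s1", "s2", "s3"},
    "v2": {"s4"},
    "v3": {"s1", "s2", "s4", "s5"},
}
pairs = sibling_pairs(relatedness_rows)
shared = shared_variant_counts(carriers, pairs)
```

The column-name defaults match the `i_col`, `j_col`, and `relationship_col` defaults in the signature above, so a relationship table with different column names only needs those arguments changed.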
 
- gnomad.variant_qc.pipeline.train_rf_model(ht, rf_features, tp_expr, fp_expr, fp_to_tp=1.0, num_trees=500, max_depth=5, test_expr=False)
- Perform random forest (RF) training using a Table annotated with features and training data.
- Note: This function uses train_rf and extends it by:
- Adding an option to apply the resulting model to test variants which are withheld from training. 
- Using a false positive (FP) to true positive (TP) ratio to determine which variants to use for RF training. 
 
 - The returned Table includes the following annotations:
- rf_train: indicates if the variant was used for training of the RF model. 
- rf_label: indicates if the variant is a TP or FP. 
- rf_test: indicates if the variant was used in testing of the RF model. 
- features: global annotation of the features used for the RF model. 
- features_importance: global annotation of the importance of each feature in the model. 
- test_results: results from testing the model on variants defined by test_expr. 
 
 - Parameters:
- ht (Table) – Table annotated with features for the RF model and the positive and negative training data.
- rf_features (List[str]) – List of column names to use as features in the RF training.
- tp_expr (BooleanExpression) – TP training expression.
- fp_expr (BooleanExpression) – FP training expression.
- fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used.
- num_trees (int) – Number of trees in the RF model.
- max_depth (int) – Maximum tree depth in the RF model.
- test_expr (BooleanExpression) – An expression specifying variants to hold out for testing and use for evaluation only.
 
- Return type:
- Tuple[Table, PipelineModel]
- Returns:
- Table with TP and FP training sets used in the RF training and the resulting RF model.
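
One way to read the fp_to_tp parameter is as a target ratio reached by downsampling whichever training class is over-represented. The sketch below shows that arithmetic in plain Python; `training_sample_probs` is a hypothetical helper, and the rule it applies is an assumption about how such balancing could work, not a transcription of the library's sampling code.

```python
from typing import Tuple


def training_sample_probs(
    n_tp: int, n_fp: int, fp_to_tp: float = 1.0
) -> Tuple[float, float]:
    """Return (prob_tp, prob_fp): the fraction of each class to keep."""
    if fp_to_tp <= 0 or n_tp == 0 or n_fp == 0:
        # fp_to_tp == 0 means no balancing: keep every training example.
        return 1.0, 1.0
    desired_fp = fp_to_tp * n_tp
    if desired_fp <= n_fp:
        # Enough FPs are available: keep all TPs and downsample FPs.
        return 1.0, desired_fp / n_fp
    # Too few FPs for the requested ratio: keep all FPs and downsample TPs.
    return n_fp / desired_fp, 1.0


excess_fp = training_sample_probs(n_tp=1000, n_fp=4000, fp_to_tp=1.0)
excess_tp = training_sample_probs(n_tp=1000, n_fp=500, fp_to_tp=1.0)
no_balancing = training_sample_probs(n_tp=1000, n_fp=4000, fp_to_tp=0)
```

With 1,000 TPs and 4,000 FPs at a 1:1 ratio, a quarter of the FPs are kept; with only 500 FPs, half of the TPs are kept instead, so the expected FP:TP ratio in the training set is 1 in both cases.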