gnomad.variant_qc.evaluation
- compute_ranked_bin: Return a table with a bin for each row based on the ranking of score_expr.
- compute_grouped_binned_ht: Group a Table that has been annotated with bins (compute_ranked_bin or create_binned_ht).
- compute_binned_truth_sample_concordance: Determine the concordance (TP, FP, FN) between a truth sample within the callset and the sample's truth data, grouped by bins computed using compute_ranked_bin.
- create_truth_sample_ht: Compute a table comparing a truth sample in callset vs the truth.
- add_rank: Add rank based on the score_expr.
- gnomad.variant_qc.evaluation.compute_ranked_bin(ht, score_expr, bin_expr={'bin': True}, compute_snv_indel_separately=True, n_bins=100, desc=True)[source]
Return a table with a bin for each row based on the ranking of score_expr.
The bin is computed by dividing the score_expr into n_bins bins containing approximately equal numbers of elements. This is done by ranking the rows by score_expr (and a random number in cases where multiple variants have the same score) and then assigning the variant to a bin based on its ranking.
If compute_snv_indel_separately is True, all items in bin_expr will be stratified by SNVs / indels for the ranking and bin calculation. Because SNV and indel rows are mutually exclusive, they are re-combined into a single annotation. For example, given the following four variants and scores and n_bins of 2:
Variant  Type   Score  bin (compute_snv_indel_separately=False)  bin (compute_snv_indel_separately=True)
Var1     SNV    0.1    1                                         1
Var2     SNV    0.2    1                                         2
Var3     Indel  0.3    2                                         1
Var4     Indel  0.4    2                                         2
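The rank-then-bin procedure described above can be sketched in plain Python. This is an illustration only, not the library's Hail implementation: the function name ranked_bins, the seed parameter, and the formula rank * n_bins // count + 1 (with a 0-based rank) are assumptions made for this sketch. Ranking in ascending score order (desc=False here) yields the bin values listed in the example.

```python
import random
from collections import defaultdict


def ranked_bins(rows, n_bins=100, desc=True, snv_indel_separately=True, seed=0):
    """Assign each (name, variant_type, score) row a 1-based bin by score rank."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for name, variant_type, score in rows:
        group = variant_type if snv_indel_separately else "all"
        # The random number only matters when two rows share a score.
        groups[group].append((score, rng.random(), name))

    bins = {}
    for members in groups.values():
        members.sort(reverse=desc)
        count = len(members)
        for rank, (_, _, name) in enumerate(members):
            # Assumed formula: equal-sized slices of the 0-based rank.
            bins[name] = rank * n_bins // count + 1
    return bins


rows = [("Var1", "SNV", 0.1), ("Var2", "SNV", 0.2),
        ("Var3", "Indel", 0.3), ("Var4", "Indel", 0.4)]
combined = ranked_bins(rows, n_bins=2, desc=False, snv_indel_separately=False)
separate = ranked_bins(rows, n_bins=2, desc=False, snv_indel_separately=True)
```

When the rows are ranked together, the two SNVs fill the first bin and the two indels the second; when ranked separately, each variant type is split across both bins.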
Note
The bin_expr defines which data the bin(s) should be computed on. E.g., to get biallelic specific binning and singleton specific binning, the following could be used:
bin_expr={ 'biallelic_bin': ~ht.was_split, 'singleton_bin': ht.singleton }
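To illustrate how each bin_expr entry restricts which rows take part in a ranking, here is a small plain-Python sketch. The helper bins_for_subsets is hypothetical, predicates stand in for Hail boolean expressions, and leaving a bin as None for rows outside a subset is an assumption of this sketch rather than documented behavior.

```python
def bins_for_subsets(rows, bin_expr, n_bins=100, desc=True):
    """Compute one bin per bin_expr entry, ranking only the rows it selects."""
    out = [{name: None for name in bin_expr} for _ in rows]
    for name, keep in bin_expr.items():
        selected = [i for i, row in enumerate(rows) if keep(row)]
        selected.sort(key=lambda i: rows[i]["score"], reverse=desc)
        for rank, i in enumerate(selected):
            # Same assumed formula as a plain rank-based binning.
            out[i][name] = rank * n_bins // len(selected) + 1
    return out


rows = [
    {"score": 0.9, "was_split": False, "singleton": True},
    {"score": 0.5, "was_split": True, "singleton": False},
    {"score": 0.1, "was_split": False, "singleton": False},
]
bin_expr = {
    "biallelic_bin": lambda row: not row["was_split"],
    "singleton_bin": lambda row: row["singleton"],
}
result = bins_for_subsets(rows, bin_expr, n_bins=2)
```

Each annotation is ranked independently, so a row can sit in different bins depending on which subset it is compared against.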
- Parameters:
ht (Table) – Input Table
score_expr (NumericExpression) – Expression containing the score
bin_expr (Dict[str, BooleanExpression]) – Specific row grouping(s) to perform ranking and binning on (see note)
compute_snv_indel_separately (bool) – Should all bin_expr items be stratified by SNVs / indels
n_bins (int) – Number of bins to bin the data into
desc (bool) – Whether to bin the score in descending order
- Return type:
Table
- Returns:
Table with the requested bin annotations
- gnomad.variant_qc.evaluation.compute_grouped_binned_ht(bin_ht, checkpoint_path=None)[source]
Group a Table that has been annotated with bins (compute_ranked_bin or create_binned_ht).
The table will be grouped by bin_id (bin, biallelic, etc.), contig, snv, bi_allelic and singleton.
Note
If performing an aggregation following this grouping (such as score_bin_agg), the aggregation function will need to use ht._parent to get the original Table from the GroupedTable.
- Parameters:
bin_ht (Table) – Input Table with a bin_id annotation
checkpoint_path (Optional[str]) – If provided, an intermediate checkpoint table is created with all required annotations before shuffling.
- Return type:
GroupedTable
- Returns:
Table grouped by bin(s)
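As a rough picture of the grouping keys, the sketch below counts rows per (bin_id, bin value, contig, snv, bi_allelic, singleton) combination in plain Python. The helper group_counts is hypothetical, including the bin value itself in the key is an assumption, and counting is just one example of an aggregation that could follow the grouping.

```python
from collections import Counter


def group_counts(rows, bin_ids):
    """Count rows per grouping key, once for every bin annotation they carry."""
    counts = Counter()
    for row in rows:
        for bin_id in bin_ids:
            bin_value = row.get(bin_id)
            if bin_value is None:
                continue  # the row was not part of this binning
            key = (bin_id, bin_value, row["contig"], row["snv"],
                   row["bi_allelic"], row["singleton"])
            counts[key] += 1
    return counts


rows = [
    {"bin": 1, "biallelic_bin": 1, "contig": "chr1", "snv": True,
     "bi_allelic": True, "singleton": False},
    {"bin": 1, "biallelic_bin": None, "contig": "chr1", "snv": True,
     "bi_allelic": False, "singleton": False},
    {"bin": 1, "biallelic_bin": 2, "contig": "chr1", "snv": True,
     "bi_allelic": True, "singleton": False},
]
grouped = group_counts(rows, ["bin", "biallelic_bin"])
```

A row with several bin annotations contributes to one group per annotation, which is why the same variant can appear under both "bin" and "biallelic_bin".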
- gnomad.variant_qc.evaluation.compute_binned_truth_sample_concordance(ht, binned_score_ht, n_bins=100, add_bins={})[source]
Determine the concordance (TP, FP, FN) between a truth sample within the callset and the sample's truth data, grouped by bins computed using compute_ranked_bin.
Note
- The input ht should contain three row fields:
  - score: value to use for binning
  - GT: a CallExpression containing the genotype of the evaluation data for the sample
  - truth_GT: a CallExpression containing the genotype of the truth sample
- The input binned_score_ht should contain:
  - score: value used to bin the full callset
  - bin: the full callset bin
- add_bins can be used to add additional global and truth sample binning to the final binned truth sample concordance HT. The keys in add_bins must be present in binned_score_ht, and the values in add_bins should be expressions on ht that define a subset of variants to bin in the truth sample. For example, to look at the global and truth sample binning on only bi-allelic variants, add_bins could be set to {'biallelic_bin': ht.biallelic}.
The table is grouped by global/truth sample bin and variant type and contains TP, FP and FN.
- Parameters:
ht (Table) – Input HT
binned_score_ht (Table) – Table with the bin annotation for each variant
n_bins (int) – Number of bins to bin the data into
add_bins (Dict[str, BooleanExpression]) – Dictionary of additional global bin columns (key) and the expr to use for binning the truth sample (value)
- Return type:
Table
- Returns:
Binned truth sample concordance HT
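The TP / FP / FN tally per bin and variant type can be sketched in plain Python as follows. The classification rules are assumptions for illustration (TP: non-reference in both GT and truth_GT; FP: non-reference in GT only; FN: non-reference in truth_GT only), genotypes are modelled as tuples of allele indices with None for a missing call, and concordance_by_bin is a hypothetical helper, not the library function.

```python
from collections import defaultdict


def is_non_ref(genotype):
    """True when the genotype is called and carries at least one alt allele."""
    return genotype is not None and any(allele != 0 for allele in genotype)


def concordance_by_bin(rows):
    """Tally TP, FP and FN per (bin, snv) group."""
    tallies = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in rows:
        called = is_non_ref(row["GT"])
        truth = is_non_ref(row["truth_GT"])
        key = (row["bin"], row["snv"])
        if called and truth:
            tallies[key]["tp"] += 1
        elif called:
            tallies[key]["fp"] += 1
        elif truth:
            tallies[key]["fn"] += 1
    return dict(tallies)


rows = [
    {"bin": 1, "snv": True, "GT": (0, 1), "truth_GT": (0, 1)},   # TP
    {"bin": 1, "snv": True, "GT": (1, 1), "truth_GT": (0, 0)},   # FP
    {"bin": 1, "snv": True, "GT": None, "truth_GT": (0, 1)},     # FN
    {"bin": 2, "snv": False, "GT": (0, 1), "truth_GT": None},    # FP
    {"bin": 2, "snv": False, "GT": (0, 0), "truth_GT": (0, 0)},  # not counted
]
tallies = concordance_by_bin(rows)
```

Rows that are reference or missing in both sources fall into none of the three categories and are skipped.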
- gnomad.variant_qc.evaluation.create_truth_sample_ht(mt, truth_mt, high_confidence_intervals_ht)[source]
Compute a table comparing a truth sample in callset vs the truth.
- Parameters:
mt (MatrixTable) – MT of truth sample from callset to be compared to truth
truth_mt (MatrixTable) – MT of truth sample
high_confidence_intervals_ht (Table) – High confidence interval HT
- Return type:
Table
- Returns:
Table containing both the callset truth sample and the truth data
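One way to picture the comparison table is an outer join of the two genotype sets on the variant key, restricted to high-confidence intervals. The plain-Python sketch below makes several assumptions not stated in this reference: the outer-join behavior, inclusive interval bounds, the output field names GT and truth_GT, and the hypothetical helper truth_sample_table.

```python
def truth_sample_table(callset_gt, truth_gt, intervals):
    """Join callset and truth genotypes keyed by (contig, pos, ref, alt).

    intervals is a list of (contig, start, end) with inclusive bounds.
    """
    def in_intervals(key):
        contig, pos = key[0], key[1]
        return any(c == contig and start <= pos <= end
                   for c, start, end in intervals)

    # Outer join: keep variants seen in either source.
    keys = sorted(k for k in set(callset_gt) | set(truth_gt) if in_intervals(k))
    return [
        {"variant": k, "GT": callset_gt.get(k), "truth_GT": truth_gt.get(k)}
        for k in keys
    ]


callset = {("chr1", 100, "A", "T"): (0, 1), ("chr1", 500, "G", "C"): (1, 1)}
truth = {("chr1", 100, "A", "T"): (0, 1), ("chr1", 200, "C", "G"): (0, 1)}
table = truth_sample_table(callset, truth, [("chr1", 50, 250)])
```

Keeping variants found in only one source matters for downstream concordance, since those rows are the candidates for false positives and false negatives.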
- gnomad.variant_qc.evaluation.add_rank(ht, score_expr, subrank_expr=None)[source]
Add rank based on the score_expr. Rank is added for snvs and indels separately.
If one or more subrank_expr are provided, then subrank is added based on all sites for which the boolean expression is true.
In addition, variant counts (SNVs and indels separately) are added as a global (rank_variant_counts).
- Parameters:
ht (Table) – Input Hail Table containing variants (with QC annotations) to be ranked
score_expr (NumericExpression) – The Table annotation by which ranking should be scored
subrank_expr (Optional[Dict[str, BooleanExpression]]) – Any subranking to be added in the form name_of_subrank: subrank_filtering_expr
- Return type:
Table
- Returns:
Table with rankings added
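The ranking and subranking described above can be sketched in plain Python. Several details are assumptions of this sketch rather than documented behavior: ranks are 0-based, ordering is by ascending score, rows outside a subrank get None, and add_rank_sketch is a hypothetical name with predicates standing in for Hail boolean expressions.

```python
from collections import Counter


def add_rank_sketch(rows, subrank_expr=None):
    """Rank rows by score within SNVs and indels, plus optional subranks."""
    subrank_expr = subrank_expr or {}
    ranked = [dict(row, rank=None, **{name: None for name in subrank_expr})
              for row in rows]
    order = sorted(range(len(rows)), key=lambda i: rows[i]["score"])
    counters = Counter()  # next free rank per (rank name, is_snv)
    for i in order:
        row = ranked[i]
        is_snv = row["snv"]
        row["rank"] = counters[("rank", is_snv)]
        counters[("rank", is_snv)] += 1
        for name, keep in subrank_expr.items():
            if keep(row):
                row[name] = counters[(name, is_snv)]
                counters[(name, is_snv)] += 1
    counts = {"snv": counters[("rank", True)],
              "indel": counters[("rank", False)]}
    return ranked, counts


rows = [
    {"score": 0.3, "snv": True, "singleton": True},
    {"score": 0.1, "snv": True, "singleton": False},
    {"score": 0.2, "snv": False, "singleton": True},
    {"score": 0.4, "snv": True, "singleton": True},
]
ranked, counts = add_rank_sketch(
    rows, {"singleton_rank": lambda row: row["singleton"]})
```

The returned counts play the role of the rank_variant_counts global: one total per variant type.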