gnomad.variant_qc.evaluation

`gnomad.variant_qc.evaluation.compute_ranked_bin`(ht, ...)	Return a table with a bin for each row based on the ranking of score_expr.
`gnomad.variant_qc.evaluation.compute_grouped_binned_ht`(bin_ht)	Group a Table that has been annotated with bins (compute_ranked_bin or create_binned_ht).
`gnomad.variant_qc.evaluation.compute_binned_truth_sample_concordance`(ht, ...)	Determine the concordance (TP, FP, FN) between a truth sample within the callset and the sample's truth data, grouped by bins computed using compute_ranked_bin.
`gnomad.variant_qc.evaluation.create_truth_sample_ht`(mt, ...)	Compute a table comparing a truth sample in callset vs the truth.
`gnomad.variant_qc.evaluation.add_rank`(ht, ...)	Add rank based on the score_expr.

gnomad.variant_qc.evaluation.compute_ranked_bin(ht, score_expr, bin_expr={'bin': True}, compute_snv_indel_separately=True, n_bins=100, desc=True)[source]

Return a table with a bin for each row based on the ranking of score_expr.

The bin is computed by dividing the score_expr into n_bins bins containing approximately equal numbers of elements. This is done by ranking the rows by score_expr (and a random number in cases where multiple variants have the same score) and then assigning the variant to a bin based on its ranking.

If compute_snv_indel_separately is True, all items in bin_expr will be stratified by SNVs / indels for the ranking and bin calculation. Because SNV and indel rows are mutually exclusive, they are re-combined into a single annotation. For example, if we have the following four variants and scores and n_bins of 2:

Variant	Type	Score	bin (compute_snv_indel_separately=False)	bin (compute_snv_indel_separately=True)
Var1	SNV	0.1	1	1
Var2	SNV	0.2	1	2
Var3	Indel	0.3	2	1
Var4	Indel	0.4	2	2
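
The ranking-and-binning idea can be sketched in plain Python, without Hail. This is an illustration under stated assumptions, not the library's implementation: the helper name ranked_bins, its arguments, and the bin formula (rank * n_bins // n + 1 on a 0-based rank) are all hypothetical. With ascending ranking (desc=False in this sketch), it reproduces both bin columns of the table above.

```python
import random
from collections import defaultdict


def ranked_bins(rows, n_bins=100, separately=True, desc=True, seed=0):
    """Assign each (name, variant_type, score) row a 1-based bin by score rank.

    Hypothetical helper for illustration only; not part of gnomad.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for name, variant_type, score in rows:
        # Stratify by variant type only when requested; otherwise use one group.
        key = variant_type if separately else "all"
        # The random number acts as the tie-breaker for equal scores.
        groups[key].append((score, rng.random(), name))

    bins = {}
    for members in groups.values():
        members.sort(reverse=desc)
        n = len(members)
        for rank, (_, _, name) in enumerate(members):
            # Assumed formula: map the 0-based rank onto bins 1..n_bins.
            bins[name] = rank * n_bins // n + 1
    return bins


variants = [
    ("Var1", "SNV", 0.1),
    ("Var2", "SNV", 0.2),
    ("Var3", "Indel", 0.3),
    ("Var4", "Indel", 0.4),
]
together = ranked_bins(variants, n_bins=2, separately=False, desc=False)
apart = ranked_bins(variants, n_bins=2, separately=True, desc=False)
print(together)
print(apart)
```

Because each stratum is ranked on its own, every stratum spans the full range of bins when ranked separately, which is why Var2 and Var3 change bins between the two columns.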

Note

The bin_expr defines which data the bin(s) should be computed on. E.g., to get biallelic specific binning and singleton specific binning, the following could be used:

bin_expr={
    'biallelic_bin': ~ht.was_split,
    'singleton_bin': ht.singleton
}
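
As a rough plain-Python analogue of what bin_expr expresses, the sketch below ranks and bins each named subset independently. The helper bins_for_subsets, the row layout, and the choice to leave the bin as None for rows outside a subset are assumptions made for illustration; ties are not randomized here, and scores are binned in descending order.

```python
def bins_for_subsets(rows, bin_predicates, n_bins=100):
    """Return one {bin_name: bin or None} dict per row.

    Hypothetical helper for illustration only; not part of gnomad.
    """
    result = [dict.fromkeys(bin_predicates) for _ in rows]
    for bin_name, predicate in bin_predicates.items():
        # Only rows selected by this predicate take part in this ranking.
        selected = [i for i, row in enumerate(rows) if predicate(row)]
        selected.sort(key=lambda i: rows[i]["score"], reverse=True)
        for rank, i in enumerate(selected):
            result[i][bin_name] = rank * n_bins // len(selected) + 1
    return result


rows = [
    {"score": 0.9, "was_split": False, "singleton": True},
    {"score": 0.7, "was_split": True, "singleton": False},
    {"score": 0.5, "was_split": False, "singleton": False},
    {"score": 0.3, "was_split": False, "singleton": True},
]
binned = bins_for_subsets(
    rows,
    {
        "bin": lambda row: True,
        "biallelic_bin": lambda row: not row["was_split"],
        "singleton_bin": lambda row: row["singleton"],
    },
    n_bins=2,
)
for row, bins in zip(rows, binned):
    print(row["score"], bins)
```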

Parameters:

ht (Table) – Input Table
score_expr (NumericExpression) – Expression containing the score
bin_expr (Dict[str, BooleanExpression]) – Specific row grouping(s) to perform ranking and binning on (see note)
compute_snv_indel_separately (bool) – Should all bin_expr items be stratified by SNVs / indels
n_bins (int) – Number of bins to bin the data into
desc (bool) – Whether to bin the score in descending order

Return type:

Table

Returns:

Table with the requested bin annotations

gnomad.variant_qc.evaluation.compute_grouped_binned_ht(bin_ht, checkpoint_path=None)[source]

Group a Table that has been annotated with bins (compute_ranked_bin or create_binned_ht).

The table will be grouped by bin_id (bin, biallelic, etc.), contig, snv, bi_allelic and singleton.

Note

If performing an aggregation following this grouping (such as score_bin_agg), the aggregation function will need to use ht._parent to get the original Table from the GroupedTable for the aggregation.
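
The grouping described above can be pictured with a plain-Python analogue that counts rows per group. The row layout (a bins dict mapping each bin_id to its bin value), the helper name group_binned_rows, and the inclusion of the bin value itself in the group key are assumptions for illustration; the real function returns a Hail GroupedTable rather than counts.

```python
from collections import Counter


def group_binned_rows(rows):
    """Count rows per (bin_id, contig, snv, bi_allelic, singleton, bin) group.

    Hypothetical helper for illustration only; not part of gnomad.
    """
    grouped = Counter()
    for row in rows:
        for bin_id, bin_value in row["bins"].items():
            if bin_value is None:
                # Rows without a bin for this bin_id do not join any group.
                continue
            key = (
                bin_id,
                row["contig"],
                row["snv"],
                row["bi_allelic"],
                row["singleton"],
                bin_value,
            )
            grouped[key] += 1
    return grouped


rows = [
    {"contig": "chr1", "snv": True, "bi_allelic": True, "singleton": False,
     "bins": {"bin": 1, "biallelic_bin": 1}},
    {"contig": "chr1", "snv": True, "bi_allelic": True, "singleton": False,
     "bins": {"bin": 1, "biallelic_bin": 2}},
    {"contig": "chr1", "snv": False, "bi_allelic": False, "singleton": True,
     "bins": {"bin": 1, "biallelic_bin": None}},
]
grouped = group_binned_rows(rows)
for key, count in sorted(grouped.items()):
    print(key, count)
```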

Parameters:

bin_ht (Table) – Input Table with a bin_id annotation
checkpoint_path (Optional[str]) – If provided, an intermediate checkpoint table is created with all required annotations before shuffling.

Return type:

GroupedTable

Returns:

Table grouped by bin(s)

gnomad.variant_qc.evaluation.compute_binned_truth_sample_concordance(ht, binned_score_ht, n_bins=100, add_bins={})[source]

Determine the concordance (TP, FP, FN) between a truth sample within the callset and the sample's truth data, grouped by bins computed using compute_ranked_bin.

Note

The input ht should contain three row fields:

score: value to use for binning
GT: a CallExpression containing the genotype of the evaluation data for the sample
truth_GT: a CallExpression containing the genotype of the truth sample

The input binned_score_ht should contain:

score: value used to bin the full callset
bin: the full callset bin

add_bins can be used to add additional global and truth sample binning to the final binned truth sample concordance HT. The keys in add_bins must be present in binned_score_ht, and the values in add_bins should be expressions on ht that define a subset of variants to bin in the truth sample. For example, to look at the global and truth sample binning on only bi-allelic variants, add_bins could be set to {'biallelic_bin': ht.biallelic}.

The table is grouped by global/truth sample bin and variant type and contains TP, FP and FN.
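
To make the TP / FP / FN bookkeeping concrete, here is a plain-Python sketch of per-bin counting. It assumes a call is a TP when both GT and truth_GT are non-reference, an FP when only GT is non-reference, and an FN when only truth_GT is non-reference, with a missing genotype treated like homozygous reference. Those rules, the helper names, and the tuple encoding of genotypes are assumptions for illustration and may differ from the library's exact definitions.

```python
def is_non_ref(gt):
    """True when a genotype (tuple of allele indices, or None) has an alt allele."""
    return gt is not None and any(allele > 0 for allele in gt)


def binned_concordance(rows):
    """Count TP / FP / FN per bin for rows of (bin, GT, truth_GT).

    Hypothetical helper for illustration only; not part of gnomad.
    """
    counts = {}
    for bin_id, gt, truth_gt in rows:
        tally = counts.setdefault(bin_id, {"tp": 0, "fp": 0, "fn": 0})
        called, expected = is_non_ref(gt), is_non_ref(truth_gt)
        if called and expected:
            tally["tp"] += 1  # variant called and present in the truth data
        elif called:
            tally["fp"] += 1  # variant called but absent from the truth data
        elif expected:
            tally["fn"] += 1  # variant in the truth data but not called
    return counts


rows = [
    (1, (0, 1), (0, 1)),  # both non-reference
    (1, (1, 1), (0, 0)),  # called only
    (2, (0, 0), (0, 1)),  # truth only
    (2, None, (1, 1)),    # missing call, truth non-reference
    (2, (0, 1), None),    # call non-reference, missing truth
]
concordance = binned_concordance(rows)
print(concordance)
```

Note that under these assumed rules a heterozygous call against a homozygous-alternate truth genotype still counts as a TP, since only the presence of a non-reference allele is compared.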

Parameters:

ht (Table) – Input HT
binned_score_ht (Table) – Table with the bin annotation for each variant
n_bins (int) – Number of bins to bin the data into
add_bins (Dict[str, BooleanExpression]) – Dictionary of additional global bin columns (key) and the expr to use for binning the truth sample (value)

Return type:

Table

Returns:

Binned truth sample concordance HT

gnomad.variant_qc.evaluation.create_truth_sample_ht(mt, truth_mt, high_confidence_intervals_ht)[source]

Compute a table comparing a truth sample in callset vs the truth.

Parameters:

mt (MatrixTable) – MT of truth sample from callset to be compared to truth
truth_mt (MatrixTable) – MT of truth sample
high_confidence_intervals_ht (Table) – High confidence interval HT

Return type:

Table

Returns:

Table containing both the callset truth sample and the truth data

gnomad.variant_qc.evaluation.add_rank(ht, score_expr, subrank_expr=None)[source]

Add rank based on the score_expr. Rank is added for SNVs and indels separately.

If one or more subrank_expr are provided, then subrank is added based on all sites for which the boolean expression is true.

In addition, variant counts (SNVs and indels separately) are added as a global (rank_variant_counts).
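
The behaviour described above can be sketched in plain Python. The helper rank_rows and the row layout are hypothetical, and several details are assumptions rather than documented behaviour: ranks start at 0, the highest score is ranked first, and each subrank is also computed separately for SNVs and indels.

```python
from collections import Counter


def rank_rows(rows, subrank_predicates=None):
    """Return (ranked rows, variant counts) for rows with 'score' and 'snv' fields.

    Hypothetical helper for illustration only; not part of gnomad.
    """
    subrank_predicates = subrank_predicates or {}
    # Copy each row and start with every rank field unset.
    ranked = [
        dict(row, rank=None, **dict.fromkeys(subrank_predicates)) for row in rows
    ]
    counts = Counter()
    # Visit rows from highest to lowest score; each counter is the next rank.
    for row in sorted(ranked, key=lambda r: r["score"], reverse=True):
        kind = "snv" if row["snv"] else "indel"
        row["rank"] = counts[kind]
        counts[kind] += 1
        for name, predicate in subrank_predicates.items():
            if predicate(row):
                row[name] = counts[(kind, name)]
                counts[(kind, name)] += 1
    return ranked, {"snv": counts["snv"], "indel": counts["indel"]}


rows = [
    {"id": "a", "score": 0.9, "snv": True, "singleton": False},
    {"id": "b", "score": 0.8, "snv": False, "singleton": True},
    {"id": "c", "score": 0.6, "snv": True, "singleton": True},
    {"id": "d", "score": 0.4, "snv": True, "singleton": True},
]
ranked, variant_counts = rank_rows(
    rows, {"singleton_rank": lambda row: row["singleton"]}
)
summary = [(r["id"], r["rank"], r["singleton_rank"]) for r in ranked]
print(summary)
print(variant_counts)
```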

Parameters:

ht (Table) – input Hail Table containing variants (with QC annotations) to be ranked
score_expr (NumericExpression) – the Table annotation by which ranking should be scored
subrank_expr (Optional[Dict[str, BooleanExpression]]) – Any subranking to be added in the form name_of_subrank: subrank_filtering_expr

Return type:

Table

Returns:

Table with rankings added