gnomad.assessment.validity_checks
generic_field_check – Check generic logical condition cond_expr involving annotations in a Hail Table when n_fail is absent and print the results to stdout.
make_filters_expr_dict – Make Hail expressions to measure % variants filtered under varying conditions of interest.
make_group_sum_expr_dict – Compute the sum of call stats annotations for a specified group of annotations, compare to the annotated version, and display the result in stdout.
compare_row_counts – Check if the row counts in two Tables are the same.
summarize_variant_filters – Summarize variants filtered under various conditions in input MatrixTable or Table.
generic_field_check_loop – Loop through all conditional checks for a given hail Table.
compare_subset_freqs – Perform validity checks on frequency data in input Table.
sum_group_callstats – Compute the sum of annotations for a specified group of annotations, and compare to the annotated version.
summarize_variants – Get summary of variants in a MatrixTable or Table.
check_raw_and_adj_callstats – Perform validity checks on raw and adj data in input Table/MatrixTable.
check_sex_chr_metrics – Perform validity checks for annotations on the sex chromosomes.
compute_missingness – Check amount of missingness in all row annotations.
vcf_field_check – Check that all VCF fields and descriptions are present in input Table and VCF header dictionary.
check_global_and_row_annot_lengths – Check that the lengths of row annotations match the lengths of associated global annotations.
pprint_global_anns – Pretty print global annotations.
validate_release_t – Perform a battery of validity checks on a specified group of subsets in a MatrixTable containing variant annotations.
count_vep_annotated_variants_per_interval – Calculate the count of VEP annotated variants in vep_ht per interval defined by interval_ht.
- gnomad.assessment.validity_checks.generic_field_check(ht, check_description, display_fields, cond_expr=None, verbose=False, show_percent_sites=False, n_fail=None, ht_count=None)
Check generic logical condition cond_expr involving annotations in a Hail Table when n_fail is absent and print the results to stdout.
Displays the number of rows (and the percentage of rows, if show_percent_sites is True) in the Table that fail the check, i.e. rows that do not satisfy the desired condition (check_description). The failure count is either supplied as n_fail (previously computed) or obtained by filtering the Table with cond_expr. If that count is 0, the Table passes the check; otherwise, it fails.
Note
cond_expr and check_description are opposites and should never be the same. E.g., If cond_expr filters for instances where the raw AC is less than adj AC, then it is checking sites that fail to be the desired condition (check_description) of having a raw AC greater than or equal to the adj AC.
- Parameters:
ht (Table) – Table containing annotations to be checked.
check_description (str) – String describing the condition being checked; is displayed in stdout summary message.
display_fields (StructExpression) – StructExpression containing annotations to be displayed in case of failure (for troubleshooting purposes); these fields are also displayed if verbose is True.
cond_expr (BooleanExpression) – Optional logical expression referring to annotations in ht to be checked.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks.
show_percent_sites (bool) – Show percentage of sites that fail checks. Default is False.
n_fail (Optional[int]) – Optional number of sites that fail the conditional checks (previously computed). If not supplied, cond_expr is used to filter the Table and obtain the count of sites that fail the checks.
ht_count (Optional[int]) – Optional number of sites within hail Table (previously computed). If not supplied, a count of sites in the Table is performed.
- Return type:
None
- Returns:
None
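The pass/fail logic described above can be illustrated without Hail. The following is a minimal plain-Python sketch, assuming rows are simple dicts; sketch_field_check, the field names, and the message wording are hypothetical and are not part of gnomad.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


def sketch_field_check(
    rows: List[Row],
    check_description: str,
    cond: Optional[Callable[[Row], bool]] = None,
    show_percent_sites: bool = False,
    n_fail: Optional[int] = None,
) -> str:
    """Summarize one check on plain dicts (hypothetical helper, not the gnomad API)."""
    if n_fail is None:
        if cond is None:
            raise ValueError("Supply either cond or n_fail.")
        # cond selects the rows that FAIL the described condition.
        n_fail = sum(1 for row in rows if cond(row))
    if n_fail == 0:
        return f"PASSED {check_description} check"
    percent = f" ({100 * n_fail / len(rows):.2f}%)" if show_percent_sites else ""
    return f"FAILED {check_description} check: {n_fail} failing rows{percent}"


rows = [
    {"AC_raw": 5, "AC_adj": 3},
    {"AC_raw": 2, "AC_adj": 4},  # raw < adj, so this row fails
]
summary = sketch_field_check(
    rows,
    check_description="AC_raw >= AC_adj",
    cond=lambda row: row["AC_raw"] < row["AC_adj"],
    show_percent_sites=True,
)
print(summary)
```

Note that the condition passed in is the opposite of the description, mirroring the relationship between cond_expr and check_description.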
- gnomad.assessment.validity_checks.make_filters_expr_dict(ht, extra_filter_checks=None, variant_filter_field='RF')
Make Hail expressions to measure % variants filtered under varying conditions of interest.
- Checks for:
Total number of variants
- Fraction of variants removed due to:
Any filter
Inbreeding coefficient filter in combination with any other filter
AC0 filter in combination with any other filter
variant_filter_field filtering in combination with any other filter
Only inbreeding coefficient filter
Only AC0 filter
Only filtering defined by variant_filter_field
- Parameters:
ht (Table) – Table containing ‘filter’ annotation to be examined.
extra_filter_checks (Optional[Dict[str, Expression]]) – Optional dictionary containing filter condition name (key) and extra filter expressions (value) to be examined.
variant_filter_field (str) – String of variant filtration used in the filters annotation on ht (e.g. RF, VQSR, AS_VQSR). Default is “RF”.
- Return type:
Dict[str, Expression]
- Returns:
Dictionary containing Hail aggregation expressions to examine filter flags.
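As a rough illustration of the fractions listed above, here is a plain-Python sketch that works on a list of per-variant filter sets rather than Hail aggregation expressions. The function name, the output keys, and the filter labels "InbreedingCoeff" and "AC0" are assumptions made for the example; "in combination with any other filter" is interpreted here as "the filter is present, alone or with others".

```python
from typing import Callable, Dict, FrozenSet, List


def sketch_filter_fractions(
    filters_per_variant: List[FrozenSet[str]], variant_filter_field: str = "RF"
) -> Dict[str, float]:
    """Compute filter fractions on plain sets (hypothetical, not the gnomad API)."""
    n = len(filters_per_variant)

    def frac(pred: Callable[[FrozenSet[str]], bool]) -> float:
        return sum(1 for f in filters_per_variant if pred(f)) / n

    out: Dict[str, float] = {
        "n_variants": n,
        "frac_any_filter": frac(lambda f: len(f) > 0),
    }
    for name in ("InbreedingCoeff", "AC0", variant_filter_field):
        # Filter present, possibly together with other filters.
        out[f"frac_{name}"] = frac(lambda f, name=name: name in f)
        # Filter present and no other filter set.
        out[f"frac_{name}_only"] = frac(lambda f, name=name: f == frozenset({name}))
    return out


variants = [
    frozenset(),
    frozenset({"RF"}),
    frozenset({"RF", "AC0"}),
    frozenset({"AC0"}),
]
fractions = sketch_filter_fractions(variants)
print(fractions)
```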
- gnomad.assessment.validity_checks.make_group_sum_expr_dict(t, subset, label_groups, sort_order=['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group'], delimiter='-', metric_first_field=True, metrics=['AC', 'AN', 'nhomalt'])
Compute the sum of call stats annotations for a specified group of annotations, compare to the annotated version, and display the result in stdout.
For example, if subset1 consists of pop1, pop2, and pop3, check that t.info.AC-subset1 == sum(t.info.AC-subset1-pop1, t.info.AC-subset1-pop2, t.info.AC-subset1-pop3).
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table containing call stats annotations to be summed.
subset (str) – String indicating sample subset.
label_groups (Dict[str, List[str]]) – Dictionary containing an entry for each label group, where key is the name of the grouping, e.g. “sex” or “pop”, and value is a list of all possible values for that grouping (e.g. [“XY”, “XX”] or [“afr”, “nfe”, “amr”]).
sort_order (List[str]) – List containing order to sort label group combinations. Default is SORT_ORDER.
delimiter (str) – String to use as delimiter when making group label combinations. Default is “-”.
metric_first_field (bool) – If True, metric precedes subset in the Table’s fields, e.g. AC-hgdp. If False, subset precedes metric, hgdp-AC. Default is True.
metrics (List[str]) – List of metrics to sum and compare to annotated versions. Default is [“AC”, “AN”, “nhomalt”].
- Return type:
Dict[str, Dict[str, Union[Int64Expression, StructExpression]]]
- Returns:
Dictionary of sample sum field check expressions and display fields.
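The subset1 example above, together with the delimiter and metric_first_field options, can be sketched in plain Python as follows. This is only an illustration on a dict of info values; sketch_group_sum_matches is a hypothetical name, and the real function builds Hail expressions over label_groups rather than returning a bool.

```python
from typing import Dict, List


def sketch_group_sum_matches(
    info: Dict[str, int],
    subset: str,
    pops: List[str],
    metric: str = "AC",
    delimiter: str = "-",
    metric_first_field: bool = True,
) -> bool:
    """Check metric-subset == sum(metric-subset-pop) on a plain dict (hypothetical)."""

    def field(*labels: str) -> str:
        # Drop an empty subset so the whole callset maps to e.g. "AC-pop1".
        parts = [label for label in (subset, *labels) if label]
        parts = [metric, *parts] if metric_first_field else [*parts, metric]
        return delimiter.join(parts)

    total = info[field()]
    summed = sum(info[field(pop)] for pop in pops)
    return total == summed


info = {
    "AC-subset1": 6,
    "AC-subset1-pop1": 1,
    "AC-subset1-pop2": 2,
    "AC-subset1-pop3": 3,
}
print(sketch_group_sum_matches(info, "subset1", ["pop1", "pop2", "pop3"]))
```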
- gnomad.assessment.validity_checks.compare_row_counts(ht1, ht2)
Check if the row counts in two Tables are the same.
- gnomad.assessment.validity_checks.summarize_variant_filters(t, variant_filter_field='RF', problematic_regions=['lcr', 'segdup', 'nonpar'], single_filter_count=False, site_gt_check_expr=None, extra_filter_checks=None, n_rows=50, n_cols=140)
Summarize variants filtered under various conditions in input MatrixTable or Table.
- Summarize counts for:
Total number of variants
- Fraction of variants removed due to:
Any filter
Inbreeding coefficient filter in combination with any other filter
AC0 filter in combination with any other filter
variant_filter_field filtering in combination with any other filter
Only inbreeding coefficient filter
Only AC0 filter
Only variant_filter_field filtering
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table to be checked.
variant_filter_field (str) – String of variant filtration used in the filters annotation on t (e.g. RF, VQSR, AS_VQSR). Default is “RF”.
problematic_regions (List[str]) – List of regions considered problematic to run filter check in. Default is [“lcr”, “segdup”, “nonpar”].
single_filter_count (bool) – If True, explode the Table’s filter column and give a supplemental total count of each filter. Default is False.
site_gt_check_expr (Dict[str, BooleanExpression]) – Optional dictionary of strings and boolean expressions typically used to log how many monoallelic or 100% heterozygous sites are in the Table.
extra_filter_checks (Optional[Dict[str, Expression]]) – Optional dictionary containing filter condition name (key) and extra filter expressions (value) to be examined.
n_rows (int) – Number of rows to display; used only when showing percentages of filtered variants grouped by multiple conditions. Default is 50.
n_cols (int) – Number of columns to display; used only when showing percentages of filtered variants grouped by multiple conditions. Default is 140.
- Return type:
None
- Returns:
None
- gnomad.assessment.validity_checks.generic_field_check_loop(ht, field_check_expr, verbose, show_percent_sites=False, ht_count=None)
Loop through all conditional checks for a given hail Table.
This loop allows aggregation across the hail Table once, as opposed to aggregating during every conditional check.
- Parameters:
ht (Table) – Table containing annotations to be checked.
field_check_expr (Dict[str, Dict[str, Any]]) – Dictionary whose keys are conditions being checked and values are the expressions for filtering to condition.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks.
show_percent_sites (bool) – Show percentage of sites that fail checks. Default is False.
ht_count (int) – Previously computed count of sites within hail Table. Default is None.
- Return type:
None
- Returns:
None
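The benefit of aggregating once can be seen in a plain-Python analogue: all conditions are evaluated in a single pass over the rows instead of one pass per condition. This is a sketch only; sketch_check_loop and the example checks are hypothetical, and the real function works on Hail aggregation expressions.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def sketch_check_loop(
    rows: List[Row], checks: Dict[str, Callable[[Row], bool]]
) -> Dict[str, int]:
    """Count failing rows for every check in one pass (hypothetical helper)."""
    n_fail = {description: 0 for description in checks}
    for row in rows:  # single pass over the data
        for description, fails in checks.items():
            if fails(row):
                n_fail[description] += 1
    return n_fail


rows = [
    {"AC_raw": 5, "AC_adj": 3, "AN_raw": 10, "AN_adj": 8},
    {"AC_raw": 2, "AC_adj": 4, "AN_raw": 10, "AN_adj": 10},
    {"AC_raw": 0, "AC_adj": 0, "AN_raw": 6, "AN_adj": 8},
]
# Keys describe the desired condition; values select rows that violate it.
checks = {
    "AC_raw >= AC_adj": lambda r: r["AC_raw"] < r["AC_adj"],
    "AN_raw >= AN_adj": lambda r: r["AN_raw"] < r["AN_adj"],
    "AC_raw > 0": lambda r: r["AC_raw"] <= 0,
}
n_fail = sketch_check_loop(rows, checks)
for description, count in n_fail.items():
    print(f"{description}: {count} failing rows")
```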
- gnomad.assessment.validity_checks.compare_subset_freqs(t, subsets, verbose, show_percent_sites=True, delimiter='-', metric_first_field=True, metrics=['AC', 'AN', 'nhomalt'])
Perform validity checks on frequency data in input Table.
- Check:
- Number of sites where callset frequency is equal to a subset frequency (raw and adj)
e.g. sites that fail the desired condition t.info.AC-adj != t.info.AC-subset1-adj
Total number of sites where the raw allele count annotation is defined
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
subsets (List[str]) – List of sample subsets.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks.
show_percent_sites (bool) – If True, show the percentage and count of overall sites that fail; if False, only show the number of sites that fail.
delimiter (str) – String to use as delimiter when making group label combinations. Default is “-”.
metric_first_field (bool) – If True, metric precedes subset, e.g. AC-non_v2-. If False, subset precedes metric, non_v2-AC-XY. Default is True.
metrics (List[str]) – List of metrics to compare between subset and entire callset. Default is [“AC”, “AN”, “nhomalt”].
- Return type:
None
- Returns:
None
- gnomad.assessment.validity_checks.sum_group_callstats(t, sexes=['XX', 'XY'], subsets=[''], pops=['afr', 'amr', 'asj', 'eas', 'fin', 'mid', 'nfe', 'remaining', 'sas'], groups=['adj'], additional_subsets_and_pops=None, verbose=False, sort_order=['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group'], delimiter='-', metric_first_field=True, metrics=['AC', 'AN', 'nhomalt'])
Compute the sum of annotations for a specified group of annotations, and compare to the annotated version.
Displays results from checking the sum of the specified annotations in stdout. Also checks that annotations for all expected sample populations are present.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
sexes (List[str]) – List of sexes in table.
subsets (List[str]) – List of sample subsets that contain pops passed in pops parameter. An empty string, e.g. “”, should be passed to test entire callset. Default is [“”].
pops (List[str]) – List of pops contained within the subsets. Default is POPS[CURRENT_MAJOR_RELEASE][“exomes”].
groups (List[str]) – List of callstat groups, e.g. “adj” and “raw” contained within the callset. gnomAD does not store the raw callstats for the pop or sex groupings of any subset. Default is [“adj”].
additional_subsets_and_pops (Dict[str, List[str]]) – Dict with subset (keys) and list of the subset’s specific populations (values). Default is None.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks. Default is False.
sort_order (List[str]) – List containing order to sort label group combinations. Default is SORT_ORDER.
delimiter (str) – String to use as delimiter when making group label combinations. Default is “-”.
metric_first_field (bool) – If True, metric precedes label group, e.g. AC-afr-male. If False, label group precedes metric, afr-male-AC. Default is True.
metrics (List[str]) – List of metrics to sum and compare to annotated versions. Default is [“AC”, “AN”, “nhomalt”].
- Return type:
None
- Returns:
None
- gnomad.assessment.validity_checks.summarize_variants(t, expected_contigs=None)
Get summary of variants in a MatrixTable or Table.
Print the number of variants to stdout and check that each chromosome has variant calls. If requested, check that all expected contigs are found in the variant summary and that no unexpected contigs are found.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table to be checked.
expected_contigs (List[str]) – List of contigs expected to be found in the input.
- Return type:
Struct
- Returns:
Struct of variant summary
- gnomad.assessment.validity_checks.check_raw_and_adj_callstats(t, subsets, verbose, delimiter='-', metric_first_field=True)
Perform validity checks on raw and adj data in input Table/MatrixTable.
- Check that:
Raw AC and AF are not 0
AC and AF are not negative
Raw values for AC, AN, nhomalt in each sample subset are greater than or equal to their corresponding adj values
Raw and adj call stat annotations must be in an info struct annotation on the Table/MatrixTable, e.g. t.info.AC-raw.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table to check.
subsets (List[str]) – List of sample subsets.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks.
delimiter (str) – String to use as delimiter when making group label combinations. Default is “-”.
metric_first_field (bool) – If True, metric precedes label group, e.g. AC-afr-male. If False, label group precedes metric, afr-male-AC. Default is True.
- Return type:
None
- Returns:
None
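The three kinds of checks listed for this function can be sketched on a single site's info dict. This is a plain-Python illustration under stated assumptions: the helper name is hypothetical, "not negative" is applied to the adj values here, and fields are assumed to be named metric, delimiter, then group (e.g. AC-raw).

```python
from typing import Dict, List


def sketch_raw_adj_failures(info: Dict[str, float], delimiter: str = "-") -> List[str]:
    """Return descriptions of the conditions one site violates (hypothetical)."""

    def raw(metric: str) -> float:
        return info[f"{metric}{delimiter}raw"]

    def adj(metric: str) -> float:
        return info[f"{metric}{delimiter}adj"]

    failed: List[str] = []
    for metric in ("AC", "AF"):
        if raw(metric) == 0:
            failed.append(f"{metric} raw is not 0")
        if adj(metric) < 0:
            failed.append(f"{metric} adj is not negative")
    for metric in ("AC", "AN", "nhomalt"):
        if raw(metric) < adj(metric):
            failed.append(f"{metric} raw >= adj")
    return failed


site_info = {
    "AC-raw": 3, "AC-adj": 4,  # adj exceeds raw, which should never happen
    "AF-raw": 0.1, "AF-adj": 0.1,
    "AN-raw": 30, "AN-adj": 28,
    "nhomalt-raw": 1, "nhomalt-adj": 1,
}
print(sketch_raw_adj_failures(site_info))
```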
- gnomad.assessment.validity_checks.check_sex_chr_metrics(t, info_metrics, contigs, verbose, delimiter='-')
Perform validity checks for annotations on the sex chromosomes.
- Check:
That metrics for chrY variants in XX samples are NA and not 0
That nhomalt counts are equal to XX nhomalt counts for all non-PAR chrX variants
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
info_metrics (List[str]) – List of metrics in info struct of input Table.
contigs (List[str]) – List of contigs present in input Table.
verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks.
delimiter (str) – String to use as the delimiter in XX metrics. Default is “-”.
- Return type:
None
- Returns:
None
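The first check (XX metrics on chrY should be missing rather than 0) can be illustrated in plain Python. The row layout, the helper name, and the field name AC-XX are assumptions made for this sketch; missing values are represented by None.

```python
from typing import Dict, List, Optional, Tuple

Info = Dict[str, Optional[int]]


def sketch_chry_xx_violations(
    rows: List[Tuple[str, int, Info]], xx_fields: List[str]
) -> List[Tuple[int, List[str]]]:
    """List chrY positions where any XX metric is defined (hypothetical helper)."""
    violations = []
    for contig, position, info in rows:
        if contig != "chrY":
            continue
        # A value of 0 is still a violation: the metric should be missing.
        defined = [f for f in xx_fields if info.get(f) is not None]
        if defined:
            violations.append((position, defined))
    return violations


rows = [
    ("chrX", 100, {"AC-XX": 2}),
    ("chrY", 200, {"AC-XX": None}),  # correct: missing
    ("chrY", 300, {"AC-XX": 0}),     # incorrect: 0 instead of missing
]
violations = sketch_chry_xx_violations(rows, ["AC-XX"])
print(violations)
```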
- gnomad.assessment.validity_checks.compute_missingness(t, info_metrics, non_info_metrics, n_sites, missingness_threshold)
Check amount of missingness in all row annotations.
Print a metric to stdout if the percentage of missing values in that metric’s annotation exceeds the missingness_threshold.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
info_metrics (List[str]) – List of metrics in info struct of input Table.
non_info_metrics (List[str]) – List of row annotations minus info struct from input Table.
n_sites (int) – Number of sites in input Table.
missingness_threshold (float) – Upper cutoff for allowed amount of missingness.
- Return type:
None
- Returns:
None
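A plain-Python sketch of the threshold comparison follows. It assumes the threshold is expressed as a fraction between 0 and 1 and that missing values are None; sketch_missingness is a hypothetical name and returns the flagged metrics instead of printing them.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def sketch_missingness(
    rows: List[Row], metrics: List[str], missingness_threshold: float
) -> Dict[str, float]:
    """Return metrics whose missing fraction exceeds the threshold (hypothetical)."""
    n_sites = len(rows)
    flagged: Dict[str, float] = {}
    for metric in metrics:
        n_missing = sum(1 for row in rows if row.get(metric) is None)
        frac_missing = n_missing / n_sites
        if frac_missing > missingness_threshold:
            flagged[metric] = frac_missing
    return flagged


rows = [
    {"AC": 1, "FS": None},
    {"AC": 2, "FS": None},
    {"AC": None, "FS": 1.0},
    {"AC": 3, "FS": None},
]
flagged = sketch_missingness(rows, ["AC", "FS"], missingness_threshold=0.5)
print(flagged)
```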
- gnomad.assessment.validity_checks.vcf_field_check(t, header_dict, row_annotations=None, entry_annotations=None, hists=['gq_hist_alt', 'gq_hist_all', 'dp_hist_alt', 'dp_hist_all', 'ab_hist_alt'])
Check that all VCF fields and descriptions are present in input Table and VCF header dictionary.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table to be exported to VCF.
header_dict (Dict[str, Dict[str, Dict[str, str]]]) – VCF header dictionary.
row_annotations (List[str]) – List of row annotations in MatrixTable or Table.
entry_annotations (List[str]) – List of entry annotations to use if running this check on a MatrixTable.
hists (List[str]) – List of variant histogram annotations. Default is HISTS.
- Return type:
bool
- Returns:
Boolean with whether all expected fields and descriptions are present.
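The idea of cross-checking table fields against a header dictionary can be sketched as below. The nested layout with an "info" section and a "Description" key per field is an assumption for this example (it matches the nested Dict[str, Dict[str, Dict[str, str]]] type above); sketch_vcf_field_check is hypothetical and only covers info fields.

```python
from typing import Dict, List

HeaderDict = Dict[str, Dict[str, Dict[str, str]]]


def sketch_vcf_field_check(info_fields: List[str], header_dict: HeaderDict) -> bool:
    """Return True only if fields and header entries agree (hypothetical helper)."""
    header_info = header_dict.get("info", {})
    problems = {
        "fields missing from header": sorted(set(info_fields) - set(header_info)),
        "header entries missing from table": sorted(set(header_info) - set(info_fields)),
        "header entries without a description": sorted(
            name for name, meta in header_info.items() if not meta.get("Description")
        ),
    }
    for label, names in problems.items():
        if names:
            print(f"{label}: {names}")
    return not any(problems.values())


header = {
    "info": {
        "AC": {"Description": "Alternate allele count"},
        "AN": {"Description": ""},  # empty description
    }
}
all_present = sketch_vcf_field_check(["AC", "AF"], header)
print(all_present)
```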
- gnomad.assessment.validity_checks.check_global_and_row_annot_lengths(t, row_to_globals_check, check_all_rows=False)
Check that the lengths of row annotations match the lengths of associated global annotations.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
row_to_globals_check (Dict[str, List[str]]) – Dictionary with row annotation (key) and list of associated global annotations (value) to compare.
check_all_rows (bool) – If True, check all rows in t; if False, check only the first row. Default is False.
- Return type:
None
- Returns:
None
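The shape of row_to_globals_check and the comparison it drives can be shown with plain dicts and lists. In this sketch a single row is checked, mirroring check_all_rows=False; the helper name and the annotation names freq, freq_meta, and freq_index_dict are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Tuple


def sketch_annot_length_mismatches(
    row: Dict[str, List[Any]],
    global_annotations: Dict[str, Any],
    row_to_globals_check: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (row field, global field) pairs whose lengths differ (hypothetical)."""
    mismatches = []
    for row_field, global_fields in row_to_globals_check.items():
        for global_field in global_fields:
            if len(row[row_field]) != len(global_annotations[global_field]):
                mismatches.append((row_field, global_field))
    return mismatches


first_row = {"freq": [{"AC": 1}, {"AC": 0}, {"AC": 1}]}
global_annotations = {
    "freq_meta": [{"group": "adj"}, {"group": "raw"}, {"group": "adj", "pop": "afr"}],
    "freq_index_dict": {"adj": 0, "raw": 1},  # one entry short
}
mismatches = sketch_annot_length_mismatches(
    first_row, global_annotations, {"freq": ["freq_meta", "freq_index_dict"]}
)
print(mismatches)
```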
- gnomad.assessment.validity_checks.pprint_global_anns(t)
Pretty print global annotations.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table.
- Return type:
None
- gnomad.assessment.validity_checks.validate_release_t(t, subsets=[''], pops=['afr', 'amr', 'asj', 'eas', 'fin', 'mid', 'nfe', 'remaining', 'sas'], missingness_threshold=0.5, site_gt_check_expr=None, verbose=False, show_percent_sites=True, delimiter='-', metric_first_field=True, sum_metrics=['AC', 'AN', 'nhomalt'], sexes=['XX', 'XY'], groups=['adj'], sample_sum_sets_and_pops=None, sort_order=['subset', 'downsampling', 'popmax', 'grpmax', 'pop', 'gen_anc', 'subpop', 'sex', 'group'], variant_filter_field='RF', problematic_regions=['lcr', 'segdup', 'nonpar'], single_filter_count=False, summarize_variants_check=True, filters_check=True, raw_adj_check=True, subset_freq_check=True, samples_sum_check=True, sex_chr_check=True, missingness_check=True, pprint_globals=False, row_to_globals_check=None, check_all_rows_in_row_to_global_check=False)
Perform a battery of validity checks on a specified group of subsets in a MatrixTable containing variant annotations.
- Includes:
Summaries of % filter status for different partitions of variants
Histogram outlier bin checks
Checks on AC, AN, and AF annotations
Checks that subgroup annotation values add up to the supergroup annotation values
Checks on sex-chromosome annotations
Summaries of % missingness in variant annotations
All annotations must be within an info struct, e.g. t.info.AC-raw.
- Parameters:
t (Union[MatrixTable, Table]) – Input MatrixTable or Table containing variant annotations to check.
subsets (List[str]) – List of subsets to be checked.
pops (List[str]) – List of pops within main callset. Default is POPS[CURRENT_MAJOR_RELEASE][“exomes”].
missingness_threshold (float) – Upper cutoff for allowed amount of missingness. Default is 0.5.
site_gt_check_expr (Dict[str, BooleanExpression]) – Optional boolean expression or dictionary of strings and boolean expressions typically used to log how many monoallelic or 100% heterozygous sites are in the Table.
verbose (bool) – If True, display top values of relevant annotations being checked, regardless of whether check conditions are violated; if False, display only top values of relevant annotations if check conditions are violated.
show_percent_sites (bool) – Show percentage of sites that fail checks. Default is True.
delimiter (str) – String to use as delimiter when making group label combinations. Default is “-”.
metric_first_field (bool) – If True, metric precedes label group, e.g. AC-afr-male. If False, label group precedes metric, afr-male-AC. Default is True.
sum_metrics (List[str]) – List of metrics to sum and compare to annotated versions and between subsets and entire callset. Default is [“AC”, “AN”, “nhomalt”].
sexes (List[str]) – List of sexes in table. Default is SEXES.
groups (List[str]) – List of callstat groups, e.g. “adj” and “raw” contained within the callset. gnomAD does not store the raw callstats for the pop or sex groupings of any subset. Default is [“adj”].
sample_sum_sets_and_pops (Dict[str, List[str]]) – Dict with subset (keys) and populations within subset (values) for sample sum check.
sort_order (List[str]) – List containing order to sort label group combinations. Default is SORT_ORDER.
variant_filter_field (str) – String of variant filtration used in the filters annotation on t (e.g. RF, VQSR, AS_VQSR). Default is “RF”.
problematic_regions (List[str]) – List of regions considered problematic to run filter check in. Default is [“lcr”, “segdup”, “nonpar”].
single_filter_count (bool) – If True, explode the Table’s filter column and give a supplemental total count of each filter. Default is False.
summarize_variants_check (bool) – When True, runs the summarize_variants method. Default is True.
filters_check (bool) – When True, runs the summarize_variant_filters method. Default is True.
raw_adj_check (bool) – When True, runs the check_raw_and_adj_callstats method. Default is True.
subset_freq_check (bool) – When True, runs the compare_subset_freqs method. Default is True.
samples_sum_check (bool) – When True, runs the sum_group_callstats method. Default is True.
sex_chr_check (bool) – When True, runs the check_sex_chr_metrics method. Default is True.
missingness_check (bool) – When True, runs the compute_missingness method. Default is True.
pprint_globals (bool) – When True, pretty print the globals of the input Table. Default is False.
row_to_globals_check (Optional[Dict[str, List[str]]]) – Optional dictionary of row annotations (keys) and lists of associated global annotations (values) to be checked. When passed, function checks that the lengths of the global and row annotations are equal.
check_all_rows_in_row_to_global_check (bool) – If True, check all rows in t in row_to_globals_check; if False, check only the first row. Default is False.
- Return type:
None
- Returns:
None (stdout display of results from the battery of validity checks).
- gnomad.assessment.validity_checks.count_vep_annotated_variants_per_interval(vep_ht, interval_ht)
Calculate the count of VEP annotated variants in vep_ht per interval defined by interval_ht.
Note
vep_ht must contain the ‘vep.transcript_consequences’ array field, which contains a ‘biotype’ field to determine whether a variant is in a “protein-coding” gene.
interval_ht should be indexed by ‘locus’ and contain a ‘gene_stable_ID’ field. For example, an interval Table containing the intervals of protein-coding genes of a specific Ensembl release.
- The returned Table will have the following fields added:
n_total_variants: The number of total variants in the interval.
n_pcg_variants: The number of variants in the interval that are annotated as “protein-coding”.
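The two counts can be illustrated with a plain-Python sketch on a single contig. Everything here is an assumption for illustration: the helper name, the gene IDs, the inclusive (start, end) intervals, and the biotype string "protein_coding" used to mark protein-coding consequences; the real function joins Hail Tables.

```python
from typing import Dict, List, Tuple


def sketch_count_per_interval(
    variants: List[Tuple[int, List[str]]],
    intervals: Dict[str, Tuple[int, int]],
) -> Dict[str, Dict[str, int]]:
    """Count total and protein-coding variants per interval (hypothetical helper)."""
    counts = {
        gene: {"n_total_variants": 0, "n_pcg_variants": 0} for gene in intervals
    }
    for position, biotypes in variants:
        for gene, (start, end) in intervals.items():
            if start <= position <= end:
                counts[gene]["n_total_variants"] += 1
                # A variant counts once if any consequence is protein-coding.
                if "protein_coding" in biotypes:
                    counts[gene]["n_pcg_variants"] += 1
    return counts


intervals = {"GENE_A": (100, 200), "GENE_B": (300, 400)}
variants = [
    (150, ["protein_coding"]),
    (180, ["lncRNA"]),
    (350, ["protein_coding", "lncRNA"]),
    (500, ["protein_coding"]),  # outside every interval
]
counts = sketch_count_per_interval(variants, intervals)
print(counts)
```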