gnomad_qc.federated.federated_validity_checks

Script to perform validity checks on input federated data or final release files.

Module Functions

gnomad_qc.federated.federated_validity_checks.get_table_kind(...)

Determine whether a markdown table corresponds to "global" or "row" fields by scanning upward from the table header line.

gnomad_qc.federated.federated_validity_checks.hail_type_from_string(...)

Convert a type string from the markdown to a Hail type.

gnomad_qc.federated.federated_validity_checks.is_concrete_type(htype)

Determine whether a Hail type represents a "concrete" field that should be added to field_types, as opposed to an empty container (such as an empty array or struct).

gnomad_qc.federated.federated_validity_checks.parse_field_necessity_from_md(md_text)

Create dictionary of field necessity from parsing markdown text.

gnomad_qc.federated.federated_validity_checks.log_field_validation_results(...)

Log the results of field existence and type validation.

gnomad_qc.federated.federated_validity_checks.validate_config(...)

Validate JSON config inputs.

gnomad_qc.federated.federated_validity_checks.validate_config_fields_in_ht(ht, ...)

Check that necessary fields defined in the JSON config are present in the Hail Table.

gnomad_qc.federated.federated_validity_checks.validate_required_fields(ht, ...)

Validate that the table contains the required global and row fields and that their values are of the expected types.

gnomad_qc.federated.federated_validity_checks.check_fields_not_in_requirements(ht, ...)

Warn about fields in HT missing from requirements.

gnomad_qc.federated.federated_validity_checks.filter_to_test_partitions(ht)

Filter the Table to a specified number of partitions on autosomes and sex chromosomes for testing purposes.

gnomad_qc.federated.federated_validity_checks.check_missingness(ht)

Check for and report the fraction of missing data in row annotations.

gnomad_qc.federated.federated_validity_checks.run_row_to_globals_length_check(ht, ...)

Build the row_to_globals_check mapping from config and run check_global_and_row_annot_lengths.

gnomad_qc.federated.federated_validity_checks.add_info_annotations(ht, ...)

Add select annotations to info if present in the Table.

gnomad_qc.federated.federated_validity_checks.validate_federated_data(ht, ...)

Perform validity checks on federated data.

gnomad_qc.federated.federated_validity_checks.create_logtest_ht([...])

Create a test Hail Table with nested struct annotations to test log output.

gnomad_qc.federated.federated_validity_checks.load_gnomad_data(...)

Load gnomAD data based on specified input file and parameters.

gnomad_qc.federated.federated_validity_checks.main(args)

Perform validity checks for federated data.

gnomad_qc.federated.federated_validity_checks.get_table_kind(lines, header_index)[source]

Determine whether a markdown table corresponds to “global” or “row” fields by scanning upward from the table header line.

Parameters:
  • lines – The full list of lines from the markdown document.

  • header_index – The index of the table header line (the line with column names).

Return type:

str

Returns:

String ‘global’ if the nearest preceding section marker indicates global fields, ‘row’ if it indicates row fields, or ‘None’ if neither is found.
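As a rough illustration of the upward scan described above, the following sketch walks backward from the table header until it meets a section marker. The function name `sketch_table_kind` and the marker convention (a markdown heading that mentions "global" or "row") are assumptions made for this example; the markers the real `get_table_kind` looks for are not shown on this page. The sketch returns Python `None` when no marker is found.

```python
from typing import List, Optional


def sketch_table_kind(lines: List[str], header_index: int) -> Optional[str]:
    """Hypothetical stand-in: scan upward from a table header for a section marker."""
    # Walk from the line just above the header back to the top of the document.
    for i in range(header_index - 1, -1, -1):
        line = lines[i].strip().lower()
        # Assumption: section markers are markdown headings that mention
        # "global" or "row"; the real module may use different markers.
        if line.startswith("#"):
            if "global" in line:
                return "global"
            if "row" in line:
                return "row"
    return None


# Illustrative markdown document with two tables.
doc = [
    "## Global fields",
    "",
    "| Field | Type |",
    "|---|---|",
    "| freq_meta | array<dict<str, str>> |",
    "## Row fields",
    "",
    "| Field | Type |",
    "|---|---|",
    "| info.QD | float64 |",
]

print(sketch_table_kind(doc, 2))  # header line of the first table
print(sketch_table_kind(doc, 7))  # header line of the second table
```

Because the scan stops at the nearest preceding heading that names a kind, the second table is classified from its own heading rather than from the earlier one.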

gnomad_qc.federated.federated_validity_checks.hail_type_from_string(type_str)[source]

Convert a type string from the markdown to a Hail type.

This function expects flattened fields (no nested dicts or nested structs). Complex nested types may not be fully supported.

Parameters:

type_str (str) – Type string from markdown text, such as int32.

Return type:

Any

Returns:

Hail type represented by the type string.

gnomad_qc.federated.federated_validity_checks.is_concrete_type(htype)[source]

Determine whether a Hail type represents a “concrete” field that should be added to field_types, as opposed to an empty container (such as an empty array or struct).

Empty structs and arrays are not concrete types. Arrays are interpreted recursively.

Parameters:

htype – Hail type to check (e.g., hl.tint32, hl.tarray(hl.tfloat64), hl.tstruct()).

Return type:

bool

Returns:

True if the Hail type is considered “concrete”, False otherwise.

gnomad_qc.federated.federated_validity_checks.parse_field_necessity_from_md(md_text)[source]

Create dictionary of field necessity from parsing markdown text.

Parameters:

md_text (str) – Markdown text to parse.

Return type:

Tuple[Dict[str, str], Dict[str, Dict[str, Any]]]

Returns:

Dictionary of field names and their necessity, and dictionary split into ‘global_field_types’ and ‘row_field_types’ keys, containing field names and their types.
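The shape of the returned tuple can be sketched as follows. This is only an illustration of the structure described above: the field names are examples, and plain strings stand in for the type objects that the real function stores.

```python
from typing import Any, Dict, Tuple

# Hypothetical example of the two dictionaries described above. The field
# names are illustrative, and plain strings stand in for the type objects.
field_necessities: Dict[str, str] = {
    "freq_meta": "required",
    "info.QD": "optional",
}
field_types: Dict[str, Dict[str, Any]] = {
    "global_field_types": {"freq_meta": "array<dict<str, str>>"},
    "row_field_types": {"info.QD": "float64"},
}

result: Tuple[Dict[str, str], Dict[str, Dict[str, Any]]] = (
    field_necessities,
    field_types,
)

# Unpack the tuple the same way a caller would.
necessities, types = result
for kind in ("global_field_types", "row_field_types"):
    for name, type_str in types[kind].items():
        print(kind, name, type_str, necessities.get(name, "unlisted"))
```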

gnomad_qc.federated.federated_validity_checks.log_field_validation_results(field_issues, fields_validated, type_issues, types_validated)[source]

Log the results of field existence and type validation.

Parameters:
  • field_issues (Dict[str, Dict[str, List[str]]]) – Nested dictionary mapping necessity (“required”, “optional”) and annotation_kind (“row”, “global”) to list of missing field names.

  • fields_validated (Dict[str, Dict[str, List[str]]]) – Nested dictionary mapping necessity (“required”, “optional”) and annotation_kind (“row”, “global”) to list of fields successfully found.

  • type_issues (List[str]) – List of strings describing fields with incorrect or mismatched types.

  • types_validated (List[str]) – List of strings describing successful type validations.

Return type:

None

Returns:

None
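To make the nested layout of `field_issues` concrete, the sketch below builds one such dictionary and turns it into log messages. The helper `summarize_issues`, the logger name, and the message wording are all assumptions for illustration and do not reproduce the module's actual log output.

```python
import logging
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("field_validation_sketch")  # hypothetical logger name

# Necessity -> annotation kind -> list of missing field names.
field_issues: Dict[str, Dict[str, List[str]]] = {
    "required": {"row": ["info.QD"], "global": []},
    "optional": {"row": [], "global": []},
}


def summarize_issues(issues: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Hypothetical helper: build one message per non-empty (necessity, kind) pair."""
    messages = []
    for necessity, by_kind in issues.items():
        for kind, names in by_kind.items():
            if names:
                messages.append(
                    f"Missing {necessity} {kind} fields: {', '.join(names)}"
                )
    return messages


for message in summarize_issues(field_issues):
    logger.warning(message)
```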

gnomad_qc.federated.federated_validity_checks.validate_config(config, schema)[source]

Validate JSON config inputs.

Parameters:
  • config (Dict[str, Any]) – JSON configuration for parameter inputs.

  • schema (Dict[str, Any]) – JSON schema to use for validation.

Return type:

None

Returns:

None.

gnomad_qc.federated.federated_validity_checks.validate_config_fields_in_ht(ht, config)[source]

Check that necessary fields defined in the JSON config are present in the Hail Table.

Parameters:
  • ht (Table) – Hail Table.

  • config (Dict[str, Any]) – JSON configuration for parameter inputs.

Return type:

None

Returns:

None.

gnomad_qc.federated.federated_validity_checks.validate_required_fields(ht, field_types, field_necessities, validate_all_fields=False)[source]

Validate that the table contains the required global and row fields and that their values are of the expected types.

Note

Required fields can be nested (e.g., ‘info.QD’ indicates that the ‘QD’ field is nested within the ‘info’ struct).

Parameters:
  • ht (Table) – Table to validate.

  • field_types (Dict[str, Dict[str, Any]]) – Nested dictionary of both global and row fields and their expected types. There are two keys: “global_field_types” and “row_field_types”, respectively containing the global and row fields as keys and their expected types as values.

  • field_necessities (Dict[str, str]) – Flat dictionary with annotation fields as keys and field necessity (“required” or “optional”) as values.

  • validate_all_fields (bool) – Whether to validate all fields or only the required/optional ones.

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

Returns:

Tuple of fields checked and whether or not they passed validation checks.
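The dotted notation for nested fields mentioned in the note (such as ‘info.QD’) can be resolved one path component at a time. The sketch below shows that idea on plain nested dictionaries, which stand in for a table schema; the helper `lookup_nested` is hypothetical and is not part of the module, which inspects a Hail Table instead.

```python
from typing import Any, Dict, Optional


def lookup_nested(schema: Dict[str, Any], dotted_name: str) -> Optional[Any]:
    """Return the entry at a dotted path in a nested dict, or None if absent."""
    current: Any = schema
    for part in dotted_name.split("."):
        # Stop as soon as a component is missing or the parent is not a struct.
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Plain nested dicts stand in for a table's row schema (an assumption; the
# real function inspects a Hail Table). Type names are kept as strings.
row_schema = {"info": {"QD": "float64"}, "filters": "set<str>"}

print(lookup_nested(row_schema, "info.QD"))  # present: nested inside 'info'
print(lookup_nested(row_schema, "info.FS"))  # absent: reported as missing
```

A validator built on this lookup would record a missing field when the result is `None` and otherwise compare the found type against the expected one.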

gnomad_qc.federated.federated_validity_checks.check_fields_not_in_requirements(ht, field_types)[source]

Warn about fields in HT missing from requirements.

Parameters:
  • ht (Table) – Hail Table.

  • field_types (Dict[str, Dict[str, Any]]) – Nested dictionary of both global and row fields and their expected types. There should be two keys: “global_field_types” and “row_field_types”.

Return type:

None

Returns:

None.

gnomad_qc.federated.federated_validity_checks.filter_to_test_partitions(ht, test_n_partitions=2)[source]

Filter the Table to a specified number of partitions on autosomes and sex chromosomes for testing purposes.

Parameters:
  • ht (Table) – Input Table.

  • test_n_partitions (int) – Number of partitions to filter to. Default is 2.

Return type:

Table

Returns:

Filtered Table with only the specified number of partitions.

gnomad_qc.federated.federated_validity_checks.check_missingness(ht, missingness_threshold=0.5, structs_to_not_traverse=('vep',))[source]

Check for and report the fraction of missing data in row annotations.

For struct annotations, missingness is checked recursively unless the annotation name is included in structs_to_not_traverse, in which case only top-level missingness of the struct itself is checked.

Parameters:
  • ht (Table) – Input Table.

  • missingness_threshold (float) – Upper cutoff for allowed amount of missingness. Default is 0.50.

  • structs_to_not_traverse (Optional[Tuple[str]]) – Optional tuple of top-level struct row annotations that should be treated as a single field rather than recursively traversed. Default is (“vep”,).

Return type:

None

Returns:

None
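The traversal rule above can be sketched in plain Python on a list of dictionaries standing in for table rows. The helpers `flatten` and `missing_fractions` are hypothetical, every row is assumed to have the same shape with traversed structs always present, and a field is assumed to be flagged only when its missing fraction is strictly greater than the threshold.

```python
from typing import Any, Dict, List, Tuple


def flatten(
    row: Dict[str, Any], skip: Tuple[str, ...], prefix: str = ""
) -> Dict[str, Any]:
    """Flatten nested dicts to dotted paths, leaving skipped top-level structs whole."""
    flat: Dict[str, Any] = {}
    for name, value in row.items():
        path = f"{prefix}{name}"
        is_skipped = prefix == "" and name in skip
        if isinstance(value, dict) and not is_skipped:
            flat.update(flatten(value, skip, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def missing_fractions(
    rows: List[Dict[str, Any]], skip: Tuple[str, ...] = ("vep",)
) -> Dict[str, float]:
    """Fraction of rows in which each flattened field is None."""
    counts: Dict[str, int] = {}
    for row in rows:
        for path, value in flatten(row, skip).items():
            counts[path] = counts.get(path, 0) + int(value is None)
    return {path: n / len(rows) for path, n in counts.items()}


# Illustrative rows; 'vep' is treated as a single field and not traversed.
rows = [
    {"info": {"QD": 1.5, "FS": None}, "vep": {"csq": "x"}},
    {"info": {"QD": None, "FS": None}, "vep": None},
    {"info": {"QD": 2.0, "FS": None}, "vep": {"csq": "y"}},
    {"info": {"QD": 3.0, "FS": 0.1}, "vep": {"csq": None}},
]

missingness_threshold = 0.5
fractions = missing_fractions(rows)
flagged = [path for path, frac in fractions.items() if frac > missingness_threshold]
print(fractions)
print(flagged)
```

In the last row `vep` is present but its inner `csq` value is missing; because `vep` is not traversed, that row does not count toward the missingness of `vep`.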

gnomad_qc.federated.federated_validity_checks.run_row_to_globals_length_check(ht, config, check_all_rows=True)[source]

Build the row_to_globals_check mapping from config and run check_global_and_row_annot_lengths.

Parameters:
  • ht (Table) – Hail table to check.

  • config (Dict[str, Any]) – Configuration dictionary containing freq_fields and optional faf_fields.

  • check_all_rows (bool) – Whether to check all rows. If False, only checks first rows. Default is True.

Return type:

None

Returns:

None

gnomad_qc.federated.federated_validity_checks.add_info_annotations(ht, region_flag_fields, allele_type_fields)[source]

Add select annotations to info if present in the Table.

Parameters:
  • ht (Table) – Table to annotate.

  • region_flag_fields (List[str]) – List of region flag fields to check for and add to info if present in the Table.

  • allele_type_fields (List[str]) – List of allele type fields to check for and add to info if present in the Table.

Return type:

Table

Returns:

Annotated Table with new info field.

gnomad_qc.federated.federated_validity_checks.validate_federated_data(ht, freq_meta_expr, missingness_threshold=0.5, struct_annotations_to_skip_missingness=None, freq_annotations_to_sum=['AC', 'AN', 'homozygote_count'], sort_order=['subset', 'downsampling', 'gen_anc', 'sex', 'group'], nhomalt_metric='nhomalt', verbose=False, subsets=None, variant_filter_field='AS_VQSR', problematic_regions=['lcr', 'non_par', 'segdup'], site_gt_check_expr=None)[source]

Perform validity checks on federated data.

Parameters:
  • ht (Table) – Input Table.

  • freq_meta_expr (ArrayExpression) – Metadata expression that contains the values of the elements in meta_indexed_expr. The most often used expression is freq_meta to index into a ‘freq’ array (example: ht.freq_meta).

  • missingness_threshold (float) – Upper cutoff for allowed amount of missingness. Default is 0.50.

  • struct_annotations_to_skip_missingness (Optional[List[str]]) – Optional list of top-level struct row annotations that should be treated as a single field rather than recursively traversed when checking missingness. Default is None.

  • freq_annotations_to_sum (List[str]) – List of annotation fields within meta_expr to sum. Default is [‘AC’, ‘AN’, ‘homozygote_count’].

  • sort_order (List[str]) – Order in which groupings are unfurled into flattened annotations. Default is [“subset”, “downsampling”, “gen_anc”, “sex”, “group”].

  • nhomalt_metric (str) – Name of metric denoting homozygous alternate count. Default is “nhomalt”.

  • verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks. Default is False.

  • subsets (List[str]) – List of sample subsets.

  • variant_filter_field (str) – String of variant filtration used in the filters annotation on ht (e.g. RF, VQSR, AS_VQSR). Default is “AS_VQSR”.

  • problematic_regions (List[str]) – List of regions considered problematic to run filter check in. Default is [“lcr”, “non_par”, “segdup”].

  • site_gt_check_expr (Dict[str, BooleanExpression]) – Optional dictionary of strings and boolean expressions typically used to log how many monoallelic or 100% heterozygous sites are in the Table.

Return type:

None

Returns:

None

gnomad_qc.federated.federated_validity_checks.create_logtest_ht(exclude_xnonpar_y=False)[source]

Create a test Hail Table with nested struct annotations to test log output.

Parameters:

exclude_xnonpar_y (bool) – If True, exclude chrX non-pseudoautosomal region and chrY variants when making test data. Default is False.

Return type:

Table

Returns:

Table to use for testing log output.

gnomad_qc.federated.federated_validity_checks.load_gnomad_data(gnomad_input_file, version, data_type='genomes', test=False, sample_set=None, public_release=None, environment=None)[source]

Load gnomAD data based on specified input file and parameters.

Parameters:
  • gnomad_input_file (str) – Name of resource to load, either “freq” or “release_sites”.

  • version (str) – Version to load, for example “4.0”, “4.1”, or “5.0”.

  • data_type (str) – Type of gnomAD data to load, either “exomes” or “genomes”.

  • test (bool) – If True, load test version of the data. Default is False.

  • sample_set (Optional[str]) – Sample set of annotation resource. One of “aou”, “gnomad”, or “merged”. If None, uses the default defined by the underlying resource function. Default is None.

  • public_release (Optional[bool]) – Whether or not to use the public version of the release. If None, uses the default defined by the underlying resource function. Default is None.

  • environment (Optional[str]) – Environment to use. Must be one of “rwb”, “batch”, or “dataproc”. If None, uses the default defined by the underlying resource function. Default is None.

Return type:

Table

Returns:

Hail Table of the specified gnomAD data.

gnomad_qc.federated.federated_validity_checks.main(args)[source]

Perform validity checks for federated data.