gnomad_qc.v5.data_ingestion.federated_validity_checks

Script to perform validity checks on input federated data or final release files.

Module Functions

gnomad_qc.v5.data_ingestion.federated_validity_checks.get_table_kind(...)

Determine whether a markdown table corresponds to "global" or "row" fields by scanning upward from the table header line.

gnomad_qc.v5.data_ingestion.federated_validity_checks.hail_type_from_string(...)

Convert a type string from the markdown to a Hail type.

gnomad_qc.v5.data_ingestion.federated_validity_checks.is_concrete_type(htype)

Determine whether a Hail type represents a "concrete" field that should be added to field_types, as opposed to an empty container (such as an empty array or struct).

gnomad_qc.v5.data_ingestion.federated_validity_checks.parse_field_necessity_from_md(md_text)

Create dictionary of field necessity from parsing markdown text.

gnomad_qc.v5.data_ingestion.federated_validity_checks.log_field_validation_results(...)

Log the results of field existence and type validation.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_config(...)

Validate JSON config inputs.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_config_fields_in_ht(ht, ...)

Check that necessary fields defined in the JSON config are present in the Hail Table.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_required_fields(ht, ...)

Validate that the table contains the required global and row fields and that their values are of the expected types.

gnomad_qc.v5.data_ingestion.federated_validity_checks.check_missingness(ht)

Check for and report the fraction of missing data in the Table.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_federated_data(ht, ...)

Perform validity checks on federated data.

gnomad_qc.v5.data_ingestion.federated_validity_checks.create_logtest_ht([...])

Create a test Hail Table with nested struct annotations to test log output.

gnomad_qc.v5.data_ingestion.federated_validity_checks.main(args)

Perform validity checks for federated data.

gnomad_qc.v5.data_ingestion.federated_validity_checks.get_table_kind(lines, header_index)

Determine whether a markdown table corresponds to “global” or “row” fields by scanning upward from the table header line.

Parameters:
  • lines – The full list of lines from the markdown document.

  • header_index – The index of the table header line (the line with column names).

Return type:

str

Returns:

String ‘global’ if the nearest preceding section marker indicates global fields, ‘row’ if it indicates row fields, or ‘None’ if neither is found.
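The upward scan can be sketched as follows. This is a minimal illustration, not the module's implementation: the name find_table_kind and the assumption that section markers are markdown headings containing the word "global" or "row" are hypothetical.

```python
from typing import List, Optional


def find_table_kind(lines: List[str], header_index: int) -> Optional[str]:
    """Walk upward from the table header to the nearest section heading."""
    for line in reversed(lines[:header_index]):
        text = line.strip().lower()
        # Assumption: section markers are headings such as "## Global fields".
        if not text.startswith("#"):
            continue
        if "global" in text:
            return "global"
        if "row" in text:
            return "row"
    # No recognizable marker above the table.
    return None


doc = [
    "## Global fields",
    "",
    "| Field | Type |",
    "|-------|------|",
    "| version | str |",
    "## Row fields",
    "",
    "| Field | Type |",
]
print(find_table_kind(doc, 2))  # table under the first heading
print(find_table_kind(doc, 7))  # table under the second heading
```

Scanning in reverse from header_index means the nearest preceding marker wins, so a document with several tables per section is handled without tracking state.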

gnomad_qc.v5.data_ingestion.federated_validity_checks.hail_type_from_string(type_str)

Convert a type string from the markdown to a Hail type.

This function expects flattened fields (no nested dicts or nested structs). Complex nested types may not be fully supported.

Parameters:

type_str (str) – Type string from markdown text, such as int32.

Return type:

Any

Returns:

Hail type represented by the type string.

gnomad_qc.v5.data_ingestion.federated_validity_checks.is_concrete_type(htype)

Determine whether a Hail type represents a “concrete” field that should be added to field_types, as opposed to an empty container (such as an empty array or struct).

Empty structs and arrays are not concrete types. Arrays are interpreted recursively.

Parameters:

htype – Hail type to check (ex: hl.tint32, hl.tarray(hl.tfloat64), hl.tstruct()).

Return type:

bool

Returns:

Whether the Hail type is considered “concrete”.
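The recursive rule can be illustrated without Hail. In this sketch, types are modelled with plain Python values as a stand-in: a string for a primitive, a ("array", element) tuple for an array, and a dict for a struct. That encoding, the name is_concrete_sketch, and the choice to treat any non-empty struct as concrete are assumptions for illustration; the real function operates on Hail type objects.

```python
from typing import Any


def is_concrete_sketch(t: Any) -> bool:
    """Return False for empty structs and for arrays that bottom out in one."""
    if isinstance(t, dict):
        # Stand-in for a struct: concrete only if it declares at least one field.
        return len(t) > 0
    if isinstance(t, tuple) and len(t) == 2 and t[0] == "array":
        # Stand-in for an array: defer to the element type.
        return is_concrete_sketch(t[1])
    # Anything else is treated as a primitive such as "int32".
    return True


print(is_concrete_sketch("int32"))
print(is_concrete_sketch(("array", "float64")))
print(is_concrete_sketch({}))
print(is_concrete_sketch(("array", ("array", {}))))
```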

gnomad_qc.v5.data_ingestion.federated_validity_checks.parse_field_necessity_from_md(md_text)

Create dictionary of field necessity from parsing markdown text.

Parameters:

md_text (str) – Markdown text to parse.

Return type:

Tuple[Dict[str, str], Dict[str, Dict[str, Any]]]

Returns:

Dictionary of field names and their necessity, and dictionary split into ‘global_field_types’ and ‘row_field_types’ keys, containing field names and their types.
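To make the shape of the returned tuple concrete, here is a minimal sketch of this kind of parsing. The markdown layout (three columns named Field, Type, and Necessity under headings that mention "global" or "row") and the name parse_tables_sketch are assumptions, and types are kept as plain strings rather than converted to Hail types.

```python
from typing import Dict, Optional, Tuple


def parse_tables_sketch(md_text: str) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]:
    necessities: Dict[str, str] = {}
    field_types: Dict[str, Dict[str, str]] = {
        "global_field_types": {},
        "row_field_types": {},
    }
    kind: Optional[str] = None
    for line in md_text.splitlines():
        text = line.strip()
        if text.startswith("#"):
            # Assumption: a heading names the kind of the tables that follow.
            lowered = text.lower()
            kind = "global" if "global" in lowered else "row" if "row" in lowered else None
            continue
        if kind is None or not text.startswith("|"):
            continue
        cells = [c.strip() for c in text.strip("|").split("|")]
        # Skip malformed rows, the header row, and the |---|---| separator row.
        if len(cells) != 3 or cells[0].lower() == "field" or set(cells[0]) <= set("-: "):
            continue
        name, type_str, necessity = cells
        necessities[name] = necessity.lower()
        field_types[f"{kind}_field_types"][name] = type_str
    return necessities, field_types


md = """\
## Global fields
| Field | Type | Necessity |
|-------|------|-----------|
| version | str | Required |

## Row fields
| Field | Type | Necessity |
|-------|------|-----------|
| info.QD | float64 | Optional |
"""
necessities, field_types = parse_tables_sketch(md)
print(necessities)
print(field_types)
```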

gnomad_qc.v5.data_ingestion.federated_validity_checks.log_field_validation_results(field_issues, fields_validated, type_issues, types_validated)

Log the results of field existence and type validation.

Parameters:
  • field_issues (Dict[str, Dict[str, List[str]]]) – Nested dictionary mapping necessity (“required”, “optional”) and annotation_kind (“row”, “global”) to list of missing field names.

  • fields_validated (Dict[str, Dict[str, List[str]]]) – Nested dictionary mapping necessity (“required”, “optional”) and annotation_kind (“row”, “global”) to list of fields successfully found.

  • type_issues (List[str]) – List of strings describing fields with incorrect or mismatched types.

  • types_validated (List[str]) – List of strings describing successful type validations.

Return type:

None

Returns:

None
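The nested dictionaries are easiest to read with an example. The sketch below builds a field_issues value in the documented shape and turns it into log messages. The helper summarize_issues, the message wording, and the field names "freq" and "downsamplings" are hypothetical, chosen only to show the structure.

```python
import logging
from typing import Dict, List

logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.INFO)
logger = logging.getLogger("field_validation_sketch")


def summarize_issues(field_issues: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Emit one warning per (necessity, annotation kind) pair with missing fields."""
    messages = []
    for necessity, by_kind in field_issues.items():
        for kind, names in by_kind.items():
            if names:
                messages.append(f"Missing {necessity} {kind} fields: {', '.join(names)}")
    for message in messages:
        logger.warning(message)
    return messages


# Outer key: necessity. Inner key: annotation kind. Value: missing field names.
field_issues = {
    "required": {"row": ["freq"], "global": []},
    "optional": {"row": [], "global": ["downsamplings"]},
}
messages = summarize_issues(field_issues)
```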

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_config(config, schema)

Validate JSON config inputs.

Parameters:
  • config (Dict[str, Any]) – JSON configuration for parameter inputs.

  • schema (Dict[str, Any]) – JSON schema to use for validation.

Return type:

None

Returns:

None.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_config_fields_in_ht(ht, config)

Check that necessary fields defined in the JSON config are present in the Hail Table.

Parameters:
  • ht (Table) – Hail Table.

  • config (Dict[str, Any]) – JSON configuration for parameter inputs.

Return type:

None

Returns:

None.

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_required_fields(ht, field_types, field_necessities, validate_all_fields=False)

Validate that the table contains the required global and row fields and that their values are of the expected types.

Note

Required fields can be nested (e.g., ‘info.QD’ indicates that the ‘QD’ field is nested within the ‘info’ struct).

Parameters:
  • ht (Table) – Table to validate.

  • field_types (Dict[str, Dict[str, Any]]) – Nested dictionary of both global and row fields and their expected types. There are two keys: “global_field_types” and “row_field_types”, respectively containing the global and row fields as keys and their expected types as values.

  • field_necessities (Dict[str, str]) – Flat dictionary with annotation fields as keys and field necessity (“required” or “optional”) as values.

  • validate_all_fields (bool) – Whether to validate all fields or only the required/optional ones. Default is False.

Return type:

Tuple[Dict[str, Any], Dict[str, Any]]

Returns:

Tuple of fields checked and whether or not they passed validation checks.
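The dotted-name convention for nested fields (such as 'info.QD') can be sketched with nested dictionaries standing in for a table schema. The helper lookup_nested, the example schema, and the type strings are assumptions for illustration; the real function inspects a Hail Table.

```python
from typing import Any, Dict, Optional


def lookup_nested(schema: Dict[str, Any], dotted_name: str) -> Optional[Any]:
    """Follow a dotted name such as 'info.QD' through nested dicts."""
    node: Any = schema
    for part in dotted_name.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Stand-in for a row schema: struct fields are dicts, leaf types are strings.
row_schema = {"info": {"QD": "float64"}, "filters": "set<str>"}
expected_types = {"info.QD": "float64", "info.MQ": "float64", "filters": "array<str>"}

missing, mismatched = [], []
for name, expected in expected_types.items():
    found = lookup_nested(row_schema, name)
    if found is None:
        missing.append(name)  # field is absent from the schema
    elif found != expected:
        mismatched.append(name)  # field exists but has a different type
print(missing, mismatched)
```

Keeping existence failures and type failures in separate lists mirrors the split between field issues and type issues in the documented logging parameters.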

gnomad_qc.v5.data_ingestion.federated_validity_checks.check_missingness(ht, missingness_threshold=0.5, struct_annotations=['grpmax', 'fafmax', 'histograms'])

Check for and report the fraction of missing data in the Table.

Parameters:
  • ht (Table) – Input Table.

  • missingness_threshold (float) – Upper cutoff for allowed amount of missingness. Default is 0.50.

  • struct_annotations (List[str]) – List of struct annotations to check for missingness. Default is [‘grpmax’, ‘fafmax’, ‘histograms’].

Return type:

None

Returns:

None
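The threshold comparison can be sketched on plain Python rows. This is an illustration only: the helper missing_fractions, the sample rows, and the choice of a strict "greater than" comparison against the threshold are assumptions, and the real function computes missingness on a Hail Table.

```python
from typing import Dict, List, Optional


def missing_fractions(rows: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Return, per field, the fraction of rows where the value is None or absent."""
    total = len(rows)
    if total == 0:
        return {}
    fields = sorted({name for row in rows for name in row})
    return {
        name: sum(1 for row in rows if row.get(name) is None) / total
        for name in fields
    }


rows = [
    {"AC": 1, "grpmax": None},
    {"AC": 2, "grpmax": None},
    {"AC": None, "grpmax": 0.1},
    {"AC": 4, "grpmax": None},
]
missingness_threshold = 0.5
fractions = missing_fractions(rows)
# Assumption: a field is flagged only when it exceeds the threshold.
flagged = [name for name, frac in fractions.items() if frac > missingness_threshold]
print(fractions, flagged)
```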

gnomad_qc.v5.data_ingestion.federated_validity_checks.validate_federated_data(ht, freq_meta_expr, missingness_threshold=0.5, struct_annotations_for_missingness=['grpmax', 'fafmax', 'histograms'], freq_annotations_to_sum=['AC', 'AN', 'homozygote_count'], sort_order=['subset', 'downsampling', 'gen_anc', 'sex', 'group'], nhomalt_metric='nhomalt', verbose=False, subsets=None, variant_filter_field='AS_VQSR', problematic_regions=['lcr', 'non_par', 'segdup'], site_gt_check_expr=None)

Perform validity checks on federated data.

Parameters:
  • ht (Table) – Input Table.

  • freq_meta_expr (ArrayExpression) – Metadata expression that contains the values of the elements in meta_indexed_expr. The most often used expression is freq_meta to index into a ‘freq’ array (example: ht.freq_meta).

  • freq_annotations_to_sum (List[str]) – List of annotation fields within meta_expr to sum. Default is [‘AC’, ‘AN’, ‘homozygote_count’].

  • sort_order (List[str]) – Order in which groupings are unfurled into flattened annotations. Default is [“subset”, “downsampling”, “gen_anc”, “sex”, “group”].

  • nhomalt_metric (str) – Name of metric denoting homozygous alternate count. Default is “nhomalt”.

  • verbose (bool) – If True, show top values of annotations being checked, including checks that pass; if False, show only top values of annotations that fail checks. Default is False.

  • subsets (List[str]) – List of sample subsets.

  • variant_filter_field (str) – String of variant filtration used in the filters annotation on ht (e.g. RF, VQSR, AS_VQSR). Default is “AS_VQSR”.

  • problematic_regions (List[str]) – List of regions considered problematic to run filter check in. Default is [“lcr”, “non_par”, “segdup”].

  • site_gt_check_expr (Dict[str, BooleanExpression]) – Optional dictionary of strings and boolean expressions typically used to log how many monoallelic or 100% heterozygous sites are in the Table.

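  • missingness_threshold (float) – Upper cutoff for allowed amount of missingness. Default is 0.50.

  • struct_annotations_for_missingness (List[str]) – List of struct annotations to check for missingness. Default is [‘grpmax’, ‘fafmax’, ‘histograms’].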

Return type:

None

Returns:

None

gnomad_qc.v5.data_ingestion.federated_validity_checks.create_logtest_ht(exclude_xnonpar_y=False)

Create a test Hail Table with nested struct annotations to test log output.

Parameters:

exclude_xnonpar_y (bool) – If True, exclude chrX non-pseudoautosomal region and chrY variants when making test data. Default is False.

Return type:

Table

Returns:

Table to use for testing log output.

gnomad_qc.v5.data_ingestion.federated_validity_checks.main(args)

Perform validity checks for federated data.