gnomad_qc.v4.annotations.recover_and_complete_vep115

Complete VEP annotation for gnomAD context HT.

Background:

This script was created to recover from a VEP 115 run that failed after processing ~99% (37797/38029 partitions) of the gnomAD context HT (all possible SNVs). The job ran for an extended period before failing at task 32305 due to a VEP JSON parsing error. The error was caused by variant chr18:16770181 A>C, where VEP annotated the context field with ‘-nan’, resulting in:

com.fasterxml.jackson.core.JsonParseException: Unexpected character (‘n’ (code 110)) in numeric value: expected digit (0-9) to follow minus sign

After filtering that variant, another variant in the same region also failed with the same error. To ensure successful completion, the entire chr18 centromere region (chr18:15460900-20861207) is now excluded from the main VEP run and handled separately (see Step 6).
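The parsing failure described above can be reproduced in miniature. This is a minimal sketch using Python's standard json module rather than the Jackson parser named in the error. Both reject a minus sign that is not followed by a digit. The record strings are hypothetical and only mimic a numeric field named context.

```python
import json

# Hypothetical VEP output fragments. The field name "context" follows the
# description above; the surrounding structure is illustrative only.
good_record = '{"context": -0.5}'
bad_record = '{"context": -nan}'


def parses_as_json(text: str) -> bool:
    """Return True if text is valid JSON according to Python's json module."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(good_record))  # True
print(parses_as_json(bad_record))   # False: '-' must be followed by a digit
```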

Rather than rerun VEP on the entire context HT, this script:

  • Reconstructs the partially written HT by updating metadata files.

  • Identifies which variants still need VEP annotation.

  • Filters out the chr18 centromere region to prevent crashes.

  • Runs VEP only on the remaining unannotated variants.

  • Combines all results into a complete VEP-annotated context HT.

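Note: Variants in the chr18 centromere are skipped by the main VEP run and are annotated separately with a modified configuration (Step 6). Any variant not covered by a VEP source will have missing VEP annotations and should be investigated separately.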

Pipeline Steps:

  • Step 1: Copy partial HT to temp location.

  • Step 2: Extract partition metadata from index files and vep_context HT.

  • Step 3: Reconstruct partial HT by updating metadata files.

  • Step 4: Filter context HT to variants missing VEP (excluding chr18 centromere).

  • Step 5: Run VEP on remaining variants (excludes chr18 centromere).

  • Step 6: Run VEP on chr18 centromere variants with modified config.

  • Step 7: Combine all VEP results and add metadata to final HT.

Module Functions

gnomad_qc.v4.annotations.recover_and_complete_vep115.copy_partial_ht(...)

Copy entire partial HT directory to output location.

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_partition_counts()

Get all partition counts from vep_context HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_bounds()

Get range bounds from vep_context HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partition_info(...)

Load partition counts, range bounds, and partition file names from a JSON file.

gnomad_qc.v4.annotations.recover_and_complete_vep115.extract_partition_metadata_and_save(ht_path)

Extract partition metadata from index files and vep_context HT, then save for later reuse.

gnomad_qc.v4.annotations.recover_and_complete_vep115.reconstruct_partial_ht(...)

Reconstruct the partially written HT by updating its metadata from a schema reference HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_context_ht([...])

Load the gnomAD context HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partial_vep_ht(...)

Load the partial VEP HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.prepare_context_ht(ht)

Prepare context HT by dropping existing VEP annotations.

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_variants_that_need_vep(...)

Get variants that need VEP.

gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_problematic_variants(ht)

Filter out variants in regions that cause VEP to fail.

gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_to_centromere_variants(ht)

Filter to ONLY variants in the chr18 centromere region.

gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_remaining(ht)

Run VEP on variants that need it.

gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_centromere(ht)

Run VEP on chr18 centromere variants.

gnomad_qc.v4.annotations.recover_and_complete_vep115.add_vep_metadata(ht, ...)

Add VEP metadata (version, help, config) to global annotations.

gnomad_qc.v4.annotations.recover_and_complete_vep115.combine_vep_results(...)

Combine VEP results from partial HT and newly VEPed variants.

gnomad_qc.v4.annotations.recover_and_complete_vep115.main(args)

Run the complete VEP 115 recovery and completion annotation pipeline.

gnomad_qc.v4.annotations.recover_and_complete_vep115.copy_partial_ht(partial_path, output_path)[source]

Copy entire partial HT directory to output location.

Parameters:
  • partial_path (str) – Source path of partial HT.

  • output_path (str) – Destination path for copied HT.

Return type:

None

Returns:

None

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_partition_counts()[source]

Get all partition counts from vep_context HT.

Return type:

list[int]

Returns:

List of partition counts for all partitions in vep_context HT.

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_bounds()[source]

Get range bounds from vep_context HT.

Return type:

list[dict]

Returns:

List of range bounds, where bounds[i] is the range bound for partition i.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partition_info(output_path)[source]

Load partition counts, range bounds, and partition file names from a JSON file.

Parameters:

output_path (str) – Path to load the info file from.

Return type:

tuple[list[int], list[dict], list[str]]

Returns:

Tuple of (counts list, bounds list, partition file names list).

gnomad_qc.v4.annotations.recover_and_complete_vep115.extract_partition_metadata_and_save(ht_path)[source]

Extract partition metadata from index files and vep_context HT, then save for later reuse.

This function:

  • Reads partition counts from index metadata files (validated against vep_context).

  • Gets partition bounds from vep_context HT.

  • Derives partition file names from index directory names.

  • Saves all metadata to partition_info.json.

Parameters:

ht_path (str) – Path to HT.

Return type:

tuple[list[int], list[dict], list[str]]

Returns:

Tuple of (counts list, bounds list, partition file names list).
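A minimal sketch of how the saved metadata could round-trip through a JSON file, matching the tuple shape returned by load_partition_info. The JSON key names, the shape of each bound, the helper names, and the use of the local filesystem are all assumptions made for illustration; the script's actual file layout is not documented here.

```python
import json
import os
import tempfile


def save_partition_info(path, counts, bounds, part_file_names):
    """Write the three metadata lists to one JSON file (hypothetical keys)."""
    info = {
        "partition_counts": counts,
        "partition_bounds": bounds,
        "part_file_names": part_file_names,
    }
    with open(path, "w") as f:
        json.dump(info, f)


def read_partition_info(path):
    """Return (counts, bounds, part_file_names) from the JSON file."""
    with open(path) as f:
        info = json.load(f)
    return (
        info["partition_counts"],
        info["partition_bounds"],
        info["part_file_names"],
    )


# Illustrative values only; real bounds and file names come from the HT.
example_bounds = [
    {"start": "chr1:1", "end": "chr1:1000"},
    {"start": "chr1:1001", "end": "chr1:2000"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    info_path = os.path.join(tmp_dir, "partition_info.json")
    save_partition_info(
        info_path, [10, 12], example_bounds, ["part-00000", "part-00001"]
    )
    counts, bounds, names = read_partition_info(info_path)
```

Saving the metadata once means the later reconstruction step can be rerun without re-reading every index file.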

gnomad_qc.v4.annotations.recover_and_complete_vep115.reconstruct_partial_ht(schema_ref_path, output_path, partition_counts, partition_bounds, part_file_names)[source]

Reconstruct the partially written HT by updating its metadata from a schema reference HT.

Assumes the partial HT has already been copied to output_path and partition info has been read.

Parameters:
  • schema_ref_path (str) – Path to schema reference HT.

  • output_path (str) – Path to copied partial HT (will update metadata in place).

  • partition_counts (list[int]) – List of per-partition row counts.

  • partition_bounds (list[dict]) – List of per-partition range bounds.

  • part_file_names (list[str]) – List of partition file names.

Return type:

Table

Returns:

The reconstructed Hail Table.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_context_ht(version='101')[source]

Load the gnomAD context HT.

Parameters:

version (str) – Version of the context HT to load.

Return type:

Table

Returns:

Context Hail Table.

gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partial_vep_ht(partial_vep_ht_path)[source]

Load the partial VEP HT.

Parameters:

partial_vep_ht_path (str) – Path to partial VEP HT.

Return type:

Table

Returns:

Partial VEP Hail Table.

gnomad_qc.v4.annotations.recover_and_complete_vep115.prepare_context_ht(ht)[source]

Prepare context HT by dropping existing VEP annotations.

Parameters:

ht (Table) – Context Hail Table.

Return type:

Table

Returns:

Prepared context Hail Table.

gnomad_qc.v4.annotations.recover_and_complete_vep115.get_variants_that_need_vep(context_ht, partial_vep_ht)[source]

Get variants that need VEP.

Parameters:
  • context_ht (Table) – Prepared context Hail Table.

  • partial_vep_ht (Table) – Partial VEP Hail Table with VEP annotations on the context HT key.

Return type:

Table

Returns:

Hail Table of variants needing VEP.
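Conceptually, selecting the variants that still need VEP is an anti-join of the context keys against the keys already present in the partial VEP table. The sketch below models that with plain Python tuples instead of Hail Tables. The function name and the (contig, position, ref, alt) key layout are assumptions for illustration, not the module's implementation.

```python
# A variant key modeled as (contig, position, ref, alt); illustrative layout.
Variant = tuple[str, int, str, str]


def variants_needing_vep(
    context_keys: list[Variant], annotated_keys: list[Variant]
) -> list[Variant]:
    """Keep context keys that are absent from the annotated keys (anti-join)."""
    annotated = set(annotated_keys)
    return [key for key in context_keys if key not in annotated]


context = [
    ("chr1", 100, "A", "C"),
    ("chr1", 100, "A", "G"),
    ("chr1", 101, "T", "C"),
]
already_annotated = [("chr1", 100, "A", "C")]

remaining = variants_needing_vep(context, already_annotated)
```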

gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_problematic_variants(ht)[source]

Filter out variants in regions that cause VEP to fail.

Specifically filters the chr18 centromere (chr18:15460900-20861207) which contains variants that cause VEP to return ‘-nan’ in the context field, leading to JSON parsing errors. This region will be investigated separately.

Parameters:

ht (Table) – Hail Table to filter.

Return type:

Table

Returns:

Filtered Hail Table excluding chr18 centromere.

gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_to_centromere_variants(ht)[source]

Filter to ONLY variants in the chr18 centromere region.

This function is used to isolate centromere variants for VEP processing with a modified configuration that excludes the context plugin (which causes ‘-nan’ errors).

Parameters:

ht (Table) – Hail Table to filter.

Return type:

Table

Returns:

Filtered Hail Table containing only chr18 centromere variants.
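filter_problematic_variants and filter_to_centromere_variants are complements of each other over the same interval. A minimal pure-Python sketch of that split follows. It treats both interval endpoints as inclusive, which is an assumption, and models variants as (contig, position) pairs rather than Hail Table rows. The helper names are hypothetical.

```python
# chr18 centromere interval from the description above.
# Treating both endpoints as inclusive is an assumption.
CENTROMERE = ("chr18", 15460900, 20861207)


def in_centromere(contig: str, position: int) -> bool:
    """Return True if the locus falls inside the chr18 centromere interval."""
    region_contig, start, end = CENTROMERE
    return contig == region_contig and start <= position <= end


def exclude_centromere(variants):
    """Analogue of filter_problematic_variants: drop centromere loci."""
    return [v for v in variants if not in_centromere(v[0], v[1])]


def keep_only_centromere(variants):
    """Analogue of filter_to_centromere_variants: keep centromere loci."""
    return [v for v in variants if in_centromere(v[0], v[1])]


loci = [
    ("chr18", 15460899),  # just before the interval
    ("chr18", 16770181),  # position of the variant that triggered the failure
    ("chr17", 16770181),  # same position on a different contig
]

outside = exclude_centromere(loci)
inside = keep_only_centromere(loci)
```

Because the two filters partition the input, every variant goes to exactly one of the two VEP runs.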

gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_remaining(ht, vep_config_path='file:///vep_data/vep-gcloud.json')[source]

Run VEP on variants that need it.

Parameters:
  • ht (Table) – Hail Table of variants needing VEP.

  • vep_config_path (str) – Path to VEP config file.

Return type:

Table

Returns:

Hail Table with VEP annotations (or the original input table if revep_count is 0, i.e. no variants need VEP).

gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_centromere(ht, vep_config_path='file:///vep_data/vep-gcloud.json')[source]

Run VEP on chr18 centromere variants.

WARNING: This step requires using a VEP init script that does NOT include the ‘context’ plugin in the VEP command, as the context plugin causes ‘-nan’ errors for variants in centromeric regions.

Parameters:
  • ht (Table) – Hail Table of centromere variants needing VEP.

  • vep_config_path (str) – Path to VEP config file.

Return type:

Table

Returns:

Hail Table with VEP annotations.

gnomad_qc.v4.annotations.recover_and_complete_vep115.add_vep_metadata(ht, vep_config_path)[source]

Add VEP metadata (version, help, config) to global annotations.

Parameters:
  • ht (Table) – Hail Table with VEP annotations.

  • vep_config_path (str) – Path to VEP config file.

Return type:

Table

Returns:

Hail Table with metadata annotations.

gnomad_qc.v4.annotations.recover_and_complete_vep115.combine_vep_results(context_ht, partial_vep_ht, revep_ht, centromere_revep_ht, vep_config_path='file:///vep_data/vep-gcloud.json')[source]

Combine VEP results from partial HT and newly VEPed variants.

Annotates context HT with VEP from multiple sources in priority order:

  1. Partial VEP HT (from original run).

  2. ReVEP HT (newly VEPed non-centromere variants).

  3. Centromere ReVEP HT (VEPed centromere variants).

Variants not covered by any source will have missing VEP annotations.

Parameters:
  • context_ht (Table) – Context Hail Table.

  • partial_vep_ht (Table) – Partial VEP Hail Table with VEP annotations on the context HT key.

  • revep_ht (Table) – Hail Table with newly VEPed variants (excluding centromere).

  • centromere_revep_ht (Table) – Hail Table with VEPed centromere variants.

  • vep_config_path (str) – Path to VEP config file.

Return type:

Table

Returns:

Final combined Hail Table with VEP metadata.
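The priority order described for combine_vep_results amounts to taking, for each context key, the first non-missing annotation across the three sources. The sketch below shows that logic with dictionaries standing in for Hail Tables. The function name, key format, and annotation values are hypothetical, and the real function also attaches VEP metadata, which is omitted here.

```python
def combine_by_priority(context_keys, partial, revep, centromere_revep):
    """For each key, take the first non-missing annotation in priority order."""
    sources = (partial, revep, centromere_revep)
    combined = {}
    for key in context_keys:
        combined[key] = next(
            (src[key] for src in sources if src.get(key) is not None),
            None,  # no source covers this key: annotation stays missing
        )
    return combined


keys = ["v1", "v2", "v3", "v4"]
partial_vep = {"v1": "vep_from_original_run"}
revep = {"v1": "vep_rerun_ignored", "v2": "vep_from_rerun"}
centromere_revep = {"v3": "vep_from_centromere_run"}

result = combine_by_priority(keys, partial_vep, revep, centromere_revep)
```

Here v1 keeps the annotation from the original run even though the rerun also has one, and v4 stays missing because no source covers it.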

gnomad_qc.v4.annotations.recover_and_complete_vep115.main(args)[source]

Run the complete VEP 115 recovery and completion annotation pipeline.

Return type:

None