gnomad_qc.v4.annotations.recover_and_complete_vep115
Complete VEP annotation for gnomAD context HT.
Background:
This script was created to recover from a VEP 115 run that failed after processing ~99% (37797/38029 partitions) of the gnomAD context HT (all possible SNVs). The job ran for an extended period before failing at task 32305 due to a VEP JSON parsing error. The error was caused by variant chr18:16770181 A>C, where VEP annotated the context field with ‘-nan’, resulting in:
com.fasterxml.jackson.core.JsonParseException: Unexpected character (‘n’ (code 110)) in numeric value: expected digit (0-9) to follow minus sign
After filtering that variant, another variant in the same region also failed with the same error. To ensure successful completion, the entire chr18 centromere region (chr18:15460900-20861207) is now excluded from VEP processing.
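The failure is easy to reproduce outside VEP: `-nan` is not a valid JSON number, so a strict parser rejects the whole record. The sketch below uses Python's standard `json` module rather than the Jackson parser named in the error, and the record string is a made-up, heavily trimmed stand-in for real VEP output.

```python
import json

# Hypothetical, trimmed stand-in for a VEP JSON record; only the bare
# '-nan' value matters for this illustration.
bad_record = '{"context": -nan}'
good_record = '{"context": null}'


def parses_as_json(text: str) -> bool:
    """Return True if `text` is valid JSON for Python's json module."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(bad_record))
print(parses_as_json(good_record))
```

Because one bad value fails the whole record, excluding the affected region is a simpler recovery path than trying to patch individual rows.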
Rather than rerun VEP on the entire context HT, this script:
Reconstructs the partially written HT by updating metadata files.
Identifies which variants still need VEP annotation.
Filters out the chr18 centromere region to prevent crashes.
Runs VEP only on the remaining unannotated variants.
Combines all results into a complete VEP-annotated context HT.
Note: Variants in the chr18 centromere will have missing VEP annotations and should be investigated separately.
Pipeline Steps:
Step 1: Copy partial HT to temp location.
Step 2: Extract partition metadata from index files and vep_context HT.
Step 3: Reconstruct partial HT by updating metadata files.
Step 4: Filter context HT to variants missing VEP (excluding chr18 centromere).
Step 5: Run VEP on remaining variants (excludes chr18 centromere).
Step 6: Run VEP on chr18 centromere variants with modified config.
Step 7: Combine all VEP results and add metadata to final HT.
Module Functions
- gnomad_qc.v4.annotations.recover_and_complete_vep115.copy_partial_ht(partial_path, output_path)
Copy entire partial HT directory to output location.
- Parameters:
partial_path (str) – Source path of partial HT.
output_path (str) – Destination path for copied HT.
- Return type:
None
- Returns:
None
- gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_partition_counts()
Get all partition counts from vep_context HT.
- Return type:
list[int]
- Returns:
List of partition counts for all partitions in vep_context HT.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.get_context_ht_bounds()
Get range bounds from vep_context HT.
- Return type:
list[dict]
- Returns:
List of bounds, where bounds[i] is the range bound for partition i.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partition_info(output_path)
Load partition counts, range bounds, and partition file names from a JSON file.
- Parameters:
output_path (str) – Path to load the info file from.
- Return type:
tuple[list[int], list[dict], list[str]]
- Returns:
Tuple of (counts list, bounds list, partition file names list).
- gnomad_qc.v4.annotations.recover_and_complete_vep115.extract_partition_metadata_and_save(ht_path)
Extract partition metadata from index files and vep_context HT, then save for later reuse.
This function:
Reads partition counts from index metadata files (validated against vep_context).
Gets partition bounds from vep_context HT.
Derives partition file names from index directory names.
Saves all metadata to partition_info.json.
- Parameters:
ht_path (str) – Path to HT.
- Return type:
tuple[list[int], list[dict], list[str]]
- Returns:
Tuple of (counts list, bounds list, partition file names list).
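The save-and-reload round trip for partition metadata can be sketched as follows. This is a minimal illustration, not the module's implementation: the JSON key names, the shape of each bound dict, the `_sketch` function names, and the use of local file I/O (the real script most likely reads and writes cloud storage) are all assumptions.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple

PartitionInfo = Tuple[List[int], List[Dict], List[str]]


def save_partition_info_sketch(
    path: str, counts: List[int], bounds: List[Dict], part_file_names: List[str]
) -> None:
    """Write per-partition metadata to a JSON file (key names are assumed)."""
    # Each list should hold exactly one entry per partition.
    if not (len(counts) == len(bounds) == len(part_file_names)):
        raise ValueError("counts, bounds, and part_file_names must have equal length")
    info = {"counts": counts, "bounds": bounds, "part_file_names": part_file_names}
    with open(path, "w") as f:
        json.dump(info, f, indent=2)


def load_partition_info_sketch(path: str) -> PartitionInfo:
    """Read the metadata back as (counts, bounds, part_file_names)."""
    with open(path) as f:
        info = json.load(f)
    return info["counts"], info["bounds"], info["part_file_names"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        info_path = os.path.join(tmp, "partition_info.json")
        # Placeholder values; real bounds would describe Hail key intervals.
        save_partition_info_sketch(
            info_path, [3, 5], [{"start": 1, "end": 2}, {"start": 3, "end": 4}],
            ["part-0", "part-1"],
        )
        print(load_partition_info_sketch(info_path))
```

Persisting the metadata once means the later reconstruction step can be retried without rereading every index file.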
- gnomad_qc.v4.annotations.recover_and_complete_vep115.reconstruct_partial_ht(schema_ref_path, output_path, partition_counts, partition_bounds, part_file_names)
Reconstruct the partially written HT by updating metadata from schema reference.
Assumes the partial HT has already been copied to output_path and partition info has been read.
- Parameters:
schema_ref_path (str) – Path to schema reference HT.
output_path (str) – Path to copied partial HT (will update metadata in place).
partition_counts (list[int]) – List of per-partition row counts.
partition_bounds (list[dict]) – List of per-partition range bounds.
part_file_names (list[str]) – List of partition file names.
- Return type:
Table
- Returns:
The reconstructed Hail Table.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.load_context_ht(version='101')
Load the gnomAD context HT.
- Parameters:
version (str) – Version of the context HT to load.
- Return type:
Table
- Returns:
Context Hail Table.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.load_partial_vep_ht(partial_vep_ht_path)
Load the partial VEP HT.
- Parameters:
partial_vep_ht_path (str) – Path to partial VEP HT.
- Return type:
Table
- Returns:
Partial VEP Hail Table.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.prepare_context_ht(ht)
Prepare context HT by dropping existing VEP annotations.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.get_variants_that_need_vep(context_ht, partial_vep_ht)
Get variants that need VEP.
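Conceptually this is an anti-join: keep the context variants whose key is absent from the partial VEP HT. The plain-Python sketch below illustrates that logic on small in-memory keys. It is an assumption about the approach, not the module's code, which operates on Hail Tables, and `variants_missing_vep` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

# (contig, position, ref, alt); a simplified stand-in for the HT key.
VariantKey = Tuple[str, int, str, str]


def variants_missing_vep(
    context_keys: Iterable[VariantKey], annotated_keys: Iterable[VariantKey]
) -> List[VariantKey]:
    """Return context keys that have no entry among the annotated keys."""
    annotated = set(annotated_keys)
    # Preserve the context ordering so downstream steps see sorted input.
    return [key for key in context_keys if key not in annotated]


if __name__ == "__main__":
    context = [("chr1", 100, "A", "C"), ("chr1", 100, "A", "G"), ("chr1", 101, "T", "C")]
    annotated = [("chr1", 100, "A", "C")]
    print(variants_missing_vep(context, annotated))
```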
- gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_problematic_variants(ht)
Filter out variants in regions that cause VEP to fail.
Specifically filters the chr18 centromere (chr18:15460900-20861207) which contains variants that cause VEP to return ‘-nan’ in the context field, leading to JSON parsing errors. This region will be investigated separately.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.filter_to_centromere_variants(ht)
Filter to ONLY variants in the chr18 centromere region.
This function is used to isolate centromere variants for VEP processing with a modified configuration that excludes the context plugin (which causes ‘-nan’ errors).
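The two filters are complements of each other over the same interval. A plain-Python sketch of that split is shown below; it works on simple tuples rather than Hail Tables, the helper names are hypothetical, and treating both interval endpoints as inclusive is an assumption.

```python
from typing import Iterable, List, Tuple

# Interval quoted in the module documentation. Inclusive endpoints are an
# assumption made for this sketch.
CENTROMERE_CONTIG = "chr18"
CENTROMERE_START = 15460900
CENTROMERE_END = 20861207

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)


def in_chr18_centromere(variant: Variant) -> bool:
    """Return True if the variant lies inside the excluded chr18 interval."""
    contig, pos, _, _ = variant
    return contig == CENTROMERE_CONTIG and CENTROMERE_START <= pos <= CENTROMERE_END


def split_by_centromere(
    variants: Iterable[Variant],
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants into (outside centromere, inside centromere)."""
    outside: List[Variant] = []
    inside: List[Variant] = []
    for variant in variants:
        (inside if in_chr18_centromere(variant) else outside).append(variant)
    return outside, inside


if __name__ == "__main__":
    demo = [("chr18", 16770181, "A", "C"), ("chr18", 100, "A", "C"), ("chr1", 16770181, "A", "C")]
    print(split_by_centromere(demo))
```

Every variant lands in exactly one of the two groups, which is what lets the standard run and the modified-config run be combined afterwards without overlap.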
- gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_remaining(ht, vep_config_path='file:///vep_data/vep-gcloud.json')
Run VEP on variants that need it.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.run_vep_on_centromere(ht, vep_config_path='file:///vep_data/vep-gcloud.json')
Run VEP on chr18 centromere variants.
WARNING: This step requires using a VEP init script that does NOT include the ‘context’ plugin in the VEP command, as the context plugin causes ‘-nan’ errors for variants in centromeric regions.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.add_vep_metadata(ht, vep_config_path)
Add VEP metadata (version, help, config) to global annotations.
- gnomad_qc.v4.annotations.recover_and_complete_vep115.combine_vep_results(context_ht, partial_vep_ht, revep_ht, centromere_revep_ht, vep_config_path='file:///vep_data/vep-gcloud.json')
Combine VEP results from partial HT and newly VEPed variants.
Annotates context HT with VEP from multiple sources in priority order:
Partial VEP HT (from original run).
ReVEP HT (newly VEPed non-centromere variants).
Centromere ReVEP HT (VEPed centromere variants).
Variants not covered by any source will have missing VEP annotations.
- Parameters:
context_ht (Table) – Context Hail Table.
partial_vep_ht (Table) – Partial VEP Hail Table with VEP annotations on the context HT key.
revep_ht (Table) – Hail Table with newly VEPed variants (excluding centromere).
centromere_revep_ht (Table) – Hail Table with VEPed centromere variants.
vep_config_path (str) – Path to VEP config file.
- Return type:
Table
- Returns:
Final combined Hail Table with VEP metadata.
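The priority-ordered combination described above amounts to taking the first non-missing annotation across the three sources. Below is a minimal plain-Python sketch of that rule, with dictionaries standing in for the Hail Tables. The helper names are hypothetical, and the actual function presumably does this with Hail expressions rather than Python loops.

```python
from typing import Dict, Hashable, Iterable, Optional, Sequence


def first_available(key: Hashable, sources: Sequence[Dict]) -> Optional[object]:
    """Return the first non-missing value for `key`, scanning sources in order."""
    for source in sources:
        value = source.get(key)
        if value is not None:
            return value
    return None  # Not covered by any source: annotation stays missing.


def combine_sketch(
    context_keys: Iterable[Hashable],
    partial_vep: Dict,
    revep: Dict,
    centromere_revep: Dict,
) -> Dict:
    """Annotate every context key using the documented priority order."""
    sources = [partial_vep, revep, centromere_revep]
    return {key: first_available(key, sources) for key in context_keys}


if __name__ == "__main__":
    result = combine_sketch(
        ["v1", "v2", "v3", "v4"],
        partial_vep={"v1": "from_partial"},
        revep={"v1": "from_revep", "v2": "from_revep"},
        centromere_revep={"v3": "from_centromere"},
    )
    print(result)
```

Since the three sources are expected to cover disjoint sets of variants, the priority order mainly acts as a safeguard if a variant ever appears in more than one of them.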