gnomad_qc.v5.annotations.generate_frequency
Script to generate frequency data for gnomAD v5.
This script calculates variant frequencies and histograms for:
1. gnomAD dataset - updating v4 frequencies by subtracting consent withdrawal samples
2. AoU dataset - using either pre-computed allele numbers or a densify approach
Processing Workflow:
gnomAD (--process-gnomad):
1. Load v4 frequency table (contains frequencies and age histograms)
2. Prepare consent withdrawal VDS (split multi-allelics, annotate metadata)
3. Calculate frequencies and age histograms for consent samples
4. Subtract from v4 frequencies to get updated gnomAD v5 frequencies
AoU (--process-aou):
1. Load AoU VDS with metadata
2. Prepare VDS (annotate group membership, adjust for ploidy, split multi-allelics)
3. Calculate frequencies using either:
   - All sites ANs (efficient, requires pre-computed AN values)
   - Densify approach (standard, more resource intensive)
4. Generate age histograms during frequency calculation
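The difference between the two AoU options in step 3 can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not the script's implementation: the `call_stats` helper, the tuple encoding of genotypes, and the example values are all hypothetical. In a sparse representation the reference calls are not stored at each variant site, so AN must either be recovered by densifying or supplied from a pre-computed value.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical encoding: a genotype is a tuple of allele indices
# (0 = reference, 1 = alternate), or None when the call is missing.
# A haploid call has length 1, which is how ploidy enters AN.
Genotype = Optional[Tuple[int, ...]]


def call_stats(
    genotypes: Sequence[Genotype], precomputed_an: Optional[int] = None
) -> Dict[str, Optional[float]]:
    """Compute AC, AN, and AF for one site (illustrative helper only)."""
    # AC counts alternate alleles over all non-missing calls.
    ac = sum(sum(1 for a in gt if a == 1) for gt in genotypes if gt is not None)
    if precomputed_an is not None:
        # "All sites ANs" route: AN is looked up, so reference calls
        # do not need to be present in the input.
        an = precomputed_an
    else:
        # "Densify" route: every non-missing call, including reference
        # calls, must be present to count called alleles.
        an = sum(len(gt) for gt in genotypes if gt is not None)
    af = ac / an if an > 0 else None
    return {"AC": ac, "AN": an, "AF": af}


# Dense view of a site: all samples present, one missing, one haploid.
dense_genotypes = [(0, 1), (1, 1), (0, 0), None, (1,)]
dense = call_stats(dense_genotypes)

# Sparse view of the same site: only non-reference calls are stored,
# and AN comes from a pre-computed value.
sparse_genotypes = [(0, 1), (1, 1), (1,)]
sparse = call_stats(sparse_genotypes, precomputed_an=7)
```

Both routes give the same AC, AN, and AF here. The pre-computed route simply avoids materializing the reference calls.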
Usage Examples:
# Process AoU dataset using all-sites ANs.
python generate_frequency.py --process-aou --use-all-sites-ans --environment rwb
# Process AoU on batch/QoB with custom resources.
python generate_frequency.py --process-aou --environment batch --app-name "aou_freq" --driver-cores 8 --worker-memory highmem
# Process gnomAD consent withdrawals.
python generate_frequency.py --process-gnomad --environment dataproc
# Run gnomAD in test mode.
python generate_frequency.py --process-gnomad --test --test-partitions 2
Generate frequency data for gnomAD v5.
usage: gnomad_qc.v5.annotations.generate_frequency.py [-h] [--overwrite]
[--test]
[--test-partitions TEST_PARTITIONS]
[--process-gnomad]
[--process-aou]
[--use-all-sites-ans]
[--environment {rwb,batch,dataproc}]
[--tmp-dir-days TMP_DIR_DAYS]
[--gcp-billing-project GCP_BILLING_PROJECT]
[--app-name APP_NAME]
[--driver-cores DRIVER_CORES]
[--driver-memory DRIVER_MEMORY]
[--worker-cores WORKER_CORES]
[--worker-memory WORKER_MEMORY]
Named Arguments
- --overwrite
Overwrite existing hail Tables.
Default: False
testing options
- --test
Filter to the first N partitions of full VDS for testing (N controlled by --test-partitions).
Default: False
- --test-partitions
Number of partitions to use in test mode. Default is 2.
Default: 2
processing steps
- --process-gnomad
Process gnomAD dataset for frequency calculations.
Default: False
- --process-aou
Process All of Us dataset for frequency calculations.
Default: False
- --use-all-sites-ans
Use all sites ANs in frequency calculations to avoid a densify.
Default: False
environment configuration
- --environment
Possible choices: rwb, batch, dataproc
Environment to run in.
Default: "rwb"
- --tmp-dir-days
Number of days for temp directory retention. Default is 4.
Default: 4
- --gcp-billing-project
Google Cloud billing project for reading requester pays buckets.
Default: "broad-mpg-gnomad"
batch configuration
Optional parameters for batch/QoB backend (only used when --environment=batch).
- --app-name
Job name for batch/QoB backend.
- --driver-cores
Number of cores for driver node.
- --driver-memory
Memory type for driver node (e.g., 'highmem').
- --worker-cores
Number of cores for worker nodes.
- --worker-memory
Memory type for worker nodes (e.g., 'highmem').
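The argument groups documented above could be assembled with `argparse` roughly as follows. This is a minimal sketch reconstructed from the listed flags and defaults, not the module's actual parser: the function name `build_parser` is hypothetical, and the `int` types on the core counts and retention days are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Generate frequency data for gnomAD v5."
    )
    parser.add_argument(
        "--overwrite", action="store_true", help="Overwrite existing hail Tables."
    )

    testing = parser.add_argument_group("testing options")
    testing.add_argument("--test", action="store_true")
    testing.add_argument("--test-partitions", type=int, default=2)

    steps = parser.add_argument_group("processing steps")
    steps.add_argument("--process-gnomad", action="store_true")
    steps.add_argument("--process-aou", action="store_true")
    steps.add_argument("--use-all-sites-ans", action="store_true")

    env = parser.add_argument_group("environment configuration")
    env.add_argument(
        "--environment", choices=["rwb", "batch", "dataproc"], default="rwb"
    )
    env.add_argument("--tmp-dir-days", type=int, default=4)
    env.add_argument("--gcp-billing-project", default="broad-mpg-gnomad")

    batch = parser.add_argument_group(
        "batch configuration",
        "Optional parameters for batch/QoB backend "
        "(only used when --environment=batch).",
    )
    batch.add_argument("--app-name")
    batch.add_argument("--driver-cores", type=int)  # int type is an assumption
    batch.add_argument("--driver-memory")
    batch.add_argument("--worker-cores", type=int)  # int type is an assumption
    batch.add_argument("--worker-memory")
    return parser


# Parse the batch/QoB usage example shown earlier on this page.
args = build_parser().parse_args(
    [
        "--process-aou",
        "--environment", "batch",
        "--app-name", "aou_freq",
        "--driver-cores", "8",
        "--worker-memory", "highmem",
    ]
)
```

Flags that are not passed keep their documented defaults, and the batch options are `None` unless given.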
Module Functions
| mt_hist_fields(mt) | Annotate allele balance quality metrics histograms and age histograms onto MatrixTable. |
| process_aou_dataset([test, use_all_sites_ans]) | Process All of Us dataset for frequency calculations and age histograms. |
| process_gnomad_dataset([test, test_partitions]) | Process gnomAD dataset to update v4 frequency HT by removing consent withdrawal samples. |
| | Generate v5 frequency data. |
| | Get script argument parser. |
- gnomad_qc.v5.annotations.generate_frequency.mt_hist_fields(mt)[source]
Annotate allele balance quality metrics histograms and age histograms onto MatrixTable.
- Parameters:
mt (MatrixTable) – Input MatrixTable.
- Return type:
- Returns:
Struct with allele balance, quality metrics histograms, and age histograms.
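To make the shape of an age histogram concrete, here is a small plain-Python sketch that bins ages into a struct-like dict with `bin_edges`, `bin_freq`, `n_smaller`, and `n_larger`, the same fields Hail's `hl.agg.hist` aggregator returns. The helper name `age_hist` and the range of 30 to 80 with 10 bins are assumptions chosen for illustration; the settings used by `mt_hist_fields` are not stated on this page, and the allele balance and quality metric histograms are not covered here.

```python
from typing import Dict, Iterable, List, Union


def age_hist(
    ages: Iterable[float], start: float, end: float, bins: int
) -> Dict[str, Union[List[float], List[int], int]]:
    """Bin ages into equal-width bins between start and end (illustrative)."""
    width = (end - start) / bins
    bin_freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for age in ages:
        if age < start:
            n_smaller += 1
        elif age > end:
            n_larger += 1
        else:
            # Bins are left-inclusive; the last bin also includes `end`.
            idx = min(int((age - start) // width), bins - 1)
            bin_freq[idx] += 1
    return {
        "bin_edges": [start + i * width for i in range(bins + 1)],
        "bin_freq": bin_freq,
        "n_smaller": n_smaller,
        "n_larger": n_larger,
    }


# Hypothetical ages for carriers of one variant.
hist = age_hist([25, 31, 38, 45, 80, 92], start=30, end=80, bins=10)
```

Because the histogram is a set of counts, a second histogram computed on a subset of samples with the same bin settings can be subtracted from it element by element.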
- gnomad_qc.v5.annotations.generate_frequency.process_aou_dataset(test=False, use_all_sites_ans=False)[source]
Process All of Us dataset for frequency calculations and age histograms.
This function efficiently processes the AoU VDS by:
1. Computing complete frequency struct (uses imported AN from AoU all site ANs if requested)
2. Generating age histograms within the frequency calculation
- Parameters:
test (bool) – Whether to run in test mode.
use_all_sites_ans (bool) – Whether to use all sites ANs for frequency calculations.
- Return type:
- Returns:
Table with freq and age_hists annotations for AoU dataset.
- gnomad_qc.v5.annotations.generate_frequency.process_gnomad_dataset(test=False, test_partitions=2)[source]
Process gnomAD dataset to update v4 frequency HT by removing consent withdrawal samples.
This function performs frequency adjustment by:
1. Loading v4 frequency HT (contains both frequencies and age histograms)
2. Loading consent withdrawal VDS
3. Filtering to sites present in BOTH consent VDS AND v4 frequency table
4. Calculating frequencies and age histograms for consent withdrawal samples
5. Subtracting both frequencies and age histograms from v4 frequency HT
6. Only overwriting fields that were actually updated in the final output
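Steps 3, 5, and 6 above can be sketched in plain Python with dictionaries standing in for Hail Tables. This is a simplified illustration, not the function's implementation: the helper names, the string site keys, and the use of a single `AC`/`AN`/`homozygote_count`/`AF` record per site (rather than an array of per-group frequency structs) are all assumptions. Age histograms would be subtracted the same way, bin by bin.

```python
from typing import Dict, Optional, Union

Freq = Dict[str, Union[int, float, None]]


def subtract_counts(v4: Freq, consent: Freq) -> Freq:
    """Subtract consent-withdrawal counts from v4 counts and recompute AF."""
    ac = v4["AC"] - consent["AC"]
    an = v4["AN"] - consent["AN"]
    hom = v4["homozygote_count"] - consent["homozygote_count"]
    af: Optional[float] = ac / an if an > 0 else None
    return {"AC": ac, "AN": an, "homozygote_count": hom, "AF": af}


def update_freq(
    v4_freq: Dict[str, Freq], consent_freq: Dict[str, Freq]
) -> Dict[str, Freq]:
    """Return v4 frequencies with consent-withdrawal samples removed."""
    updated = dict(v4_freq)  # sites without consent samples stay untouched
    # Only sites present in BOTH inputs are adjusted.
    for site in v4_freq.keys() & consent_freq.keys():
        updated[site] = subtract_counts(v4_freq[site], consent_freq[site])
    return updated


# Hypothetical example data.
v4_freq = {
    "chr1:100:A:T": {"AC": 12, "AN": 100, "homozygote_count": 1, "AF": 0.12},
    "chr1:200:G:C": {"AC": 5, "AN": 50, "homozygote_count": 0, "AF": 0.1},
}
consent_freq = {
    "chr1:100:A:T": {"AC": 2, "AN": 20, "homozygote_count": 0},
    "chr1:300:C:G": {"AC": 1, "AN": 20, "homozygote_count": 0},  # not in v4
}

v5_freq = update_freq(v4_freq, consent_freq)
```

In this example only `chr1:100:A:T` changes. `chr1:200:G:C` is carried over as-is, and `chr1:300:C:G` is ignored because it is absent from the v4 table.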
- Parameters:
test (bool) – Whether to run in test mode. If True, filters full v4 VDS to first N partitions (N controlled by test_partitions).
test_partitions (int) – Number of partitions to use in test mode. Default is 2.
- Return type:
- Returns:
Frequency HT with updated frequencies and age histograms for the gnomAD dataset.