gnomad_qc.v4.annotations.vrs_annotation_batch
This is a batch script which adds VRS IDs to a Hail Table by creating sharded VCFs, running a vrs-annotation script on each shard, and merging the results into the original Hail Table.
- It is advised to run this from a backend (e.g. hailctl batch submit) to avoid losing progress in case of disconnection.
Example command to use:
hailctl batch submit --image-name us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084 gnomad_qc/gnomad_qc/v4/annotations/vrs_annotation_batch.py -- --billing-project gnomad-annot --working-bucket gnomad-tmp-4day --image us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084 --header-path gs://gnomad/v4.0/annotations/exomes/vrs-header-fix.txt --run-vrs --annotate-original --overwrite --backend-mode batch --data-type exomes --test
usage: gnomad_qc.v4.annotations.vrs_annotation_batch.py [-h]
[--billing-project BILLING_PROJECT]
[--image IMAGE]
[--working-bucket WORKING_BUCKET]
[--partitions-for-vcf-export PARTITIONS_FOR_VCF_EXPORT]
(--data-type {exomes,genomes} | --input-path INPUT_PATH)
[--output-vrs-path OUTPUT_VRS_PATH]
[--output-vrs-anno-ori-path OUTPUT_VRS_ANNO_ORI_PATH]
[--test]
[--header-path HEADER_PATH]
[--downsample DOWNSAMPLE]
[--disk-size DISK_SIZE]
[--memory MEMORY]
[--seqrepo-mount SEQREPO_MOUNT]
[--overwrite]
[--run-vrs]
[--annotate-original]
[--hail-rand-seed HAIL_RAND_SEED]
[--backend-mode {spark,batch}]
[--tmp-dir-hail TMP_DIR_HAIL]
Named Arguments
- --billing-project
Project to bill.
- --image
Image in a GCP Artifact Registry repository.
Default: "us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084"
- --working-bucket
Name of GCP Bucket to output intermediate files (sharded VCFs and checkpointed HTs) to.
Default: "gnomad-tmp-4day"
- --partitions-for-vcf-export
Number of partitions to use when exporting the Table to a sharded VCF (each partition is exported as a separate VCF). This value determines the number of jobs that will be run, since the VRS annotation script is run in parallel on the VCF shards.
- --data-type
Possible choices: exomes, genomes
Data type to annotate (exomes or genomes).
- --input-path
Full path of Hail Table to annotate.
- --output-vrs-path
Full path of Hail Table to write VRS annotations to.
- --output-vrs-anno-ori-path
Full path of Hail Table to write VRS annotations with the original annotations added back.
- --test
Filter to only 200 partitions for testing purposes.
Default: False
- --header-path
Full path of a txt file containing lines to append to VCF headers for fields that may be missing when exporting the Table to VCF.
- --downsample
Proportion to which to downsample the original Hail Table input.
Default: 1.0
- --disk-size
Amount of disk (GB) to allocate to each Hail Batch job.
- --memory
Amount of memory (GB) to allocate to each Hail Batch job.
- --seqrepo-mount
Bucket to mount and read from using Hail Batch's CloudFuse for access to seqrepo. Note that this does have performance implications.
- --overwrite
Boolean passed to ht.write(overwrite=...) determining whether or not to overwrite existing output for the final Table and checkpointed files.
Default: False
- --run-vrs
Pass argument to run VRS annotation on the dataset of choice. Specifying --run-vrs also requires setting --backend-mode to 'batch', which is the default.
Default: False
- --annotate-original
Pass argument to add VRS annotations back to original dataset.
Default: False
- --hail-rand-seed
Random seed for Hail.
Default: 5
- --backend-mode
Possible choices: spark, batch
Mode in which to run Hail: either 'spark' or 'batch' (for Query-on-Batch).
Default: "batch"
- --tmp-dir-hail
Directory to use for temporary files when initializing Hail.
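To make the fan-out controlled by --partitions-for-vcf-export concrete: each Table partition becomes one VCF shard, and one annotation job is launched per shard. The pure-Python sketch below illustrates the mapping; the bucket, prefix, and exact shard file-name pattern are illustrative assumptions, not the script's actual paths.

```python
def shard_vcf_paths(bucket, prefix, n_partitions):
    """Return one VCF shard path per Table partition.

    Hail's parallel VCF export writes one file per partition with
    zero-padded part numbers; the naming shown here is illustrative.
    """
    return [
        f"gs://{bucket}/{prefix}/part-{i:05d}.vcf.bgz" for i in range(n_partitions)
    ]

# Three partitions -> three shards -> three parallel annotation jobs.
paths = shard_vcf_paths("gnomad-tmp-4day", "vrs-shards", 3)
```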
Module Functions
- Initialize a hail batch job with some default parameters.
- Create job and initialize gcloud authentication and gsutil commands.
- Generate VRS annotations for a Hail Table of variants.
- Get script argument parser.
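The final step of the script, merging the VRS results into the original Hail Table, is conceptually a keyed left join of the annotations back onto the original variants. The real script uses Hail Table operations; the dict-based sketch below is only a pure-Python analogy, and the variant keys and VRS ID string are illustrative.

```python
# Original variant rows, keyed by (contig, position, ref, alt).
original = {
    ("chr1", 12345, "A", "T"): {"AC": 3},
    ("chr1", 67890, "G", "C"): {"AC": 1},
}
# VRS annotations produced by the sharded jobs, keyed the same way.
vrs = {("chr1", 12345, "A", "T"): "ga4gh:VA.illustrative-digest"}

# Left join: every original row is kept; variants with no VRS result get None.
annotated = {k: {**v, "vrs_id": vrs.get(k)} for k, v in original.items()}
```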
- gnomad_qc.v4.annotations.vrs_annotation_batch.init_job(batch, name=None, image=None, cpu=None, memory=None, disk_size=None)[source]
Initialize a hail batch job with some default parameters.
- Parameters:
  - batch – Batch object
  - name (str) – job label which will show up in the Batch web UI
  - image (str) – docker image name (e.g. "weisburd/image-name@sha256:aa19845da5")
  - cpu (float) – number of CPUs (between 0.25 and 16)
  - memory (float) – amount of RAM in GB (e.g. 3.75)
  - disk_size (float) – amount of disk in GB (e.g. 50)
- Returns:
  new job object
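Based on the documented parameters, init_job might look roughly like the sketch below. The setter names (new_job, image, cpu, memory, storage) follow the standard hailtop.batch Job API, but the defaults and exact resource strings in the real script may differ; this is an assumption-laden re-sketch, not the actual implementation.

```python
def init_job(batch, name=None, image=None, cpu=None, memory=None, disk_size=None):
    # `batch` is expected to be a hailtop.batch.Batch. Only parameters
    # that were explicitly provided are applied, leaving Batch defaults
    # in place otherwise.
    j = batch.new_job(name=name)
    if image is not None:
        j.image(image)
    if cpu is not None:
        j.cpu(cpu)
    if memory is not None:
        j.memory(f"{memory}Gi")  # RAM in GB, e.g. 3.75
    if disk_size is not None:
        j.storage(f"{disk_size}Gi")  # disk in GB, e.g. 50
    return j
```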
- gnomad_qc.v4.annotations.vrs_annotation_batch.init_job_with_gcloud(batch, name=None, image=None, cpu=None, memory=None, disk_size=None, mount=None)[source]
Create job and initialize gcloud authentication and gsutil commands.
Wraps Ben Weisburd's init_job (https://github.com/broadinstitute/tgg_methods/blob/master/tgg/batch/batch_utils.py#L160) with additional gcloud steps.
- Parameters:
  - batch – Batch object.
  - name (str) – Job label which will show up in the Batch web UI.
  - image (str) – Docker image name (e.g. "us-central1-docker.pkg.dev/broad-mpg-gnomad/ga4gh-vrs/marten_0615_vrs0_8_4").
  - cpu (float) – Number of CPUs (between 0.25 and 16).
  - memory (float) – Amount of RAM in GB (e.g. 3.75).
  - disk_size (float) – Amount of disk in GB (e.g. 50).
  - mount (str) – Name of GCP Bucket to mount using cloudfuse.
- Returns:
  New job object.
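A self-contained sketch of what init_job_with_gcloud plausibly does with these parameters is shown below. Job.cloudfuse and Job.command are real hailtop.batch Job methods, but the "/seqrepo" mount point and the service-account key-file path are assumptions for illustration; the real script's gcloud steps may differ.

```python
def init_job_with_gcloud(batch, name=None, image=None, cpu=None,
                         memory=None, disk_size=None, mount=None):
    # Create the job and apply only the resources that were provided.
    j = batch.new_job(name=name)
    if image is not None:
        j.image(image)
    if cpu is not None:
        j.cpu(cpu)
    if memory is not None:
        j.memory(f"{memory}Gi")
    if disk_size is not None:
        j.storage(f"{disk_size}Gi")
    if mount is not None:
        # Mount the seqrepo bucket read-only via CloudFuse; the local
        # mount point "/seqrepo" is an illustrative choice.
        j.cloudfuse(mount, "/seqrepo")
    # Activate the job's service account so gcloud/gsutil commands work
    # inside the container; the key-file path is an assumption.
    j.command(
        "gcloud -q auth activate-service-account --key-file=/gsa-key/key.json"
    )
    return j
```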