gnomad_qc.v4.annotations.vrs_annotation_batch
This is a batch script which adds VRS IDs to a Hail Table by creating sharded VCFs, running a vrs-annotation script on each shard, and merging the results into the original Hail Table.
- It is advised to run this from a backend (e.g. hailctl batch submit) to avoid losing progress in case of disconnection.
Example command to use:
hailctl batch submit --image-name us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084 gnomad_qc/gnomad_qc/v4/annotations/vrs_annotation_batch.py -- --billing-project gnomad-annot --working-bucket gnomad-tmp-4day --image us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084 --header-path gs://gnomad/v4.0/annotations/exomes/vrs-header-fix.txt --run-vrs --annotate-original --overwrite --backend-mode batch --data-type exomes --test
usage: gnomad_qc.v4.annotations.vrs_annotation_batch.py [-h]
[--billing-project BILLING_PROJECT]
[--image IMAGE]
[--working-bucket WORKING_BUCKET]
[--partitions-for-vcf-export PARTITIONS_FOR_VCF_EXPORT]
(--data-type {exomes,genomes} | --input-path INPUT_PATH)
[--output-vrs-path OUTPUT_VRS_PATH]
[--output-vrs-anno-ori-path OUTPUT_VRS_ANNO_ORI_PATH]
[--test]
[--header-path HEADER_PATH]
[--downsample DOWNSAMPLE]
[--disk-size DISK_SIZE]
[--memory MEMORY]
[--seqrepo-mount SEQREPO_MOUNT]
[--overwrite]
[--run-vrs]
[--annotate-original]
[--hail-rand-seed HAIL_RAND_SEED]
[--backend-mode {spark,batch}]
[--tmp-dir-hail TMP_DIR_HAIL]
Named Arguments
- --billing-project
Project to bill.
- --image
Image in a GCP Artifact Registry repository.
Default: "us-central1-docker.pkg.dev/broad-mpg-gnomad/images/vrs084"
- --working-bucket
Name of GCP Bucket to output intermediate files (sharded VCFs and checkpointed HTs) to.
Default: "gnomad-tmp-4day"
- --partitions-for-vcf-export
Number of partitions to use when exporting the Table to a sharded VCF (each partition is exported as a separate VCF). This value determines the number of jobs that will be run, since the VRS annotation script is run in parallel on the VCF shards.
- --data-type
Possible choices: exomes, genomes
Data type to annotate (exomes or genomes).
- --input-path
Full path of Hail Table to annotate.
- --output-vrs-path
Full path of Hail Table to write VRS annotations to.
- --output-vrs-anno-ori-path
Full path of Hail Table to write VRS annotations with the original annotations added back.
- --test
Filter to only 200 partitions for testing purposes.
Default: False
- --header-path
Full path of a txt file containing lines to append to VCF headers for fields that may be missing when exporting the Table to VCF.
- --downsample
Proportion to which to downsample the original Hail Table input.
Default: 1.0
- --disk-size
Amount of disk (GB) to allocate to each Hail Batch job.
- --memory
Amount of memory (GB) to allocate to each Hail Batch job.
- --seqrepo-mount
Bucket to mount and read from using Hail Batch's CloudFuse for access to seqrepo. Note that this does have performance implications.
- --overwrite
Boolean passed to ht.write(overwrite=...) determining whether or not to overwrite existing output for the final Table and checkpointed files.
Default: False
- --run-vrs
Pass argument to run VRS annotation on the dataset of choice. Specifying --run-vrs also requires setting --backend-mode to 'batch', which is the default.
Default: False
- --annotate-original
Pass argument to add VRS annotations back to original dataset.
Default: False
- --hail-rand-seed
Random seed for Hail.
Default: 5
- --backend-mode
Possible choices: spark, batch
Mode in which to run Hail: either 'spark' or 'batch' (for Query-on-Batch).
Default: "batch"
- --tmp-dir-hail
Directory to use for temporary files when initializing Hail.
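To make the fan-out controlled by --partitions-for-vcf-export concrete: each Table partition becomes one VCF shard, and one annotation job is launched per shard. The pure-Python sketch below illustrates the mapping; the bucket, prefix, and exact shard file-name pattern are illustrative assumptions, not the script's actual paths.

```python
def shard_vcf_paths(bucket, prefix, n_partitions):
    """Return one VCF shard path per Table partition.

    Hail's parallel VCF export writes one file per partition with
    zero-padded part numbers; the naming shown here is illustrative.
    """
    return [
        f"gs://{bucket}/{prefix}/part-{i:05d}.vcf.bgz" for i in range(n_partitions)
    ]

# Three partitions -> three shards -> three parallel annotation jobs.
paths = shard_vcf_paths("gnomad-tmp-4day", "vrs-shards", 3)
```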
Module Functions
- Initialize a hail batch job with some default parameters.
- Create job and initialize gcloud authentication and gsutil commands.
- Generate VRS annotations for a Hail Table of variants.
- Get script argument parser.
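The final step of the script, merging the VRS results into the original Hail Table, is conceptually a keyed left join of the annotations back onto the original variants. The real script uses Hail Table operations; the dict-based sketch below is only a pure-Python analogy, and the variant keys and VRS ID string are illustrative.

```python
# Original variant rows, keyed by (contig, position, ref, alt).
original = {
    ("chr1", 12345, "A", "T"): {"AC": 3},
    ("chr1", 67890, "G", "C"): {"AC": 1},
}
# VRS annotations produced by the sharded jobs, keyed the same way.
vrs = {("chr1", 12345, "A", "T"): "ga4gh:VA.illustrative-digest"}

# Left join: every original row is kept; variants with no VRS result get None.
annotated = {k: {**v, "vrs_id": vrs.get(k)} for k, v in original.items()}
```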
- gnomad_qc.v4.annotations.vrs_annotation_batch.init_job(batch, name=None, image=None, cpu=None, memory=None, disk_size=None)[source]
Initialize a hail batch job with some default parameters.
- Parameters:
  - batch – Batch object
  - name (str) – job label which will show up in the Batch web UI
  - image (str) – docker image name (e.g. "weisburd/image-name@sha256:aa19845da5")
  - cpu (float) – number of CPUs (between 0.25 and 16)
  - memory (float) – amount of RAM in GB (e.g. 3.75)
  - disk_size (float) – amount of disk in GB (e.g. 50)
- Returns:
  new job object
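Based on the documented parameters, init_job might look roughly like the sketch below. The setter names (new_job, image, cpu, memory, storage) follow the standard hailtop.batch Job API, but the defaults and exact resource strings in the real script may differ; this is an assumption-laden re-sketch, not the actual implementation.

```python
def init_job(batch, name=None, image=None, cpu=None, memory=None, disk_size=None):
    # `batch` is expected to be a hailtop.batch.Batch. Only parameters
    # that were explicitly provided are applied, leaving Batch defaults
    # in place otherwise.
    j = batch.new_job(name=name)
    if image is not None:
        j.image(image)
    if cpu is not None:
        j.cpu(cpu)
    if memory is not None:
        j.memory(f"{memory}Gi")  # RAM in GB, e.g. 3.75
    if disk_size is not None:
        j.storage(f"{disk_size}Gi")  # disk in GB, e.g. 50
    return j
```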
- gnomad_qc.v4.annotations.vrs_annotation_batch.init_job_with_gcloud(batch, name=None, image=None, cpu=None, memory=None, disk_size=None, mount=None)[source]
Create job and initialize gcloud authentication and gsutil commands.
Wraps Ben Weisburd's init_job (https://github.com/broadinstitute/tgg_methods/blob/master/tgg/batch/batch_utils.py#L160) with additional gcloud steps.
- Parameters:
  - batch – Batch object.
  - name (str) – Job label which will show up in the Batch web UI.
  - image (str) – Docker image name (e.g. "us-central1-docker.pkg.dev/broad-mpg-gnomad/ga4gh-vrs/marten_0615_vrs0_8_4").
  - cpu (float) – Number of CPUs (between 0.25 and 16).
  - memory (float) – Amount of RAM in GB (e.g. 3.75).
  - disk_size (float) – Amount of disk in GB (e.g. 50).
  - mount (str) – Name of GCP Bucket to mount using cloudfuse.
- Returns:
  New job object.
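A self-contained sketch of what init_job_with_gcloud plausibly does with these parameters is shown below. Job.cloudfuse and Job.command are real hailtop.batch Job methods, but the "/seqrepo" mount point and the service-account key-file path are assumptions for illustration; the real script's gcloud steps may differ.

```python
def init_job_with_gcloud(batch, name=None, image=None, cpu=None,
                         memory=None, disk_size=None, mount=None):
    # Create the job and apply only the resources that were provided.
    j = batch.new_job(name=name)
    if image is not None:
        j.image(image)
    if cpu is not None:
        j.cpu(cpu)
    if memory is not None:
        j.memory(f"{memory}Gi")
    if disk_size is not None:
        j.storage(f"{disk_size}Gi")
    if mount is not None:
        # Mount the seqrepo bucket read-only via CloudFuse; the local
        # mount point "/seqrepo" is an illustrative choice.
        j.cloudfuse(mount, "/seqrepo")
    # Activate the job's service account so gcloud/gsutil commands work
    # inside the container; the key-file path is an assumption.
    j.command(
        "gcloud -q auth activate-service-account --key-file=/gsa-key/key.json"
    )
    return j
```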