gnomad_qc.v5.annotations.generate_variant_qc_annotations

Script to generate annotations for variant QC on gnomAD v5.

usage: gnomad_qc.v5.annotations.generate_variant_qc_annotations.py
       [-h] [--environment {rwb,batch}] [--app-name APP_NAME]
       [--driver-cores DRIVER_CORES] [--driver-memory DRIVER_MEMORY]
       [--worker-cores WORKER_CORES] [--worker-memory WORKER_MEMORY]
       [--overwrite] [--test] [--test-n-partitions [TEST_N_PARTITIONS]]
       [--generate-trio-stats] [--generate-sibling-stats] [--create-info-ht]
       [--lowqual-indel-phred-het-prior LOWQUAL_INDEL_PHRED_HET_PRIOR]
       [--export-info-vcf] [--create-variant-qc-annotation-ht]
       [--impute-features] [--n-partitions N_PARTITIONS]
       [--export-true-positive-vcfs] [--transmitted-singletons]
       [--sibling-singletons]

Named Arguments

--environment

Possible choices: rwb, batch

Environment where script will run.

Default: “rwb”

--app-name

Job name for batch/QoB backend.

--driver-cores

Number of cores for the driver node. Applies to Batch environment only. Hail default is 1 if unspecified.

--driver-memory

Memory for driver node. Applies to Batch environment only. Hail default is ‘standard’ if unspecified.

--worker-cores

Number of cores for worker nodes. Applies to Batch environment only. Hail default is 1 if unspecified.

--worker-memory

Memory for worker nodes. Applies to Batch environment only. Hail default is ‘standard’ if unspecified.

--overwrite

Overwrite output files.

Default: False

--test

Write to test path.

Default: False

--test-n-partitions

Use only n partitions of the VDS as input for testing purposes (default: 2).
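
The usage line shows this flag as [--test-n-partitions [TEST_N_PARTITIONS]], meaning its value is optional. The sketch below shows how an optional-value flag of this shape behaves in argparse. It is a stand-in, not the script's actual parser: the use of nargs="?", const=2, and a default of None when the flag is absent are assumptions inferred from the usage line and the "(default: 2)" note.

```python
import argparse

# Hypothetical stand-in for two of the script's flags. The real parser is
# returned by get_script_argument_parser() and may be configured differently.
parser = argparse.ArgumentParser()
parser.add_argument("--test", action="store_true", help="Write to test path.")
parser.add_argument(
    "--test-n-partitions",
    nargs="?",     # value is optional, matching [TEST_N_PARTITIONS] in the usage
    const=2,       # value used when the flag is given without a number
    default=None,  # value used when the flag is absent (assumed here)
    type=int,
)

absent = parser.parse_args([])
bare = parser.parse_args(["--test-n-partitions"])
explicit = parser.parse_args(["--test-n-partitions", "10"])

print(absent.test_n_partitions, bare.test_n_partitions, explicit.test_n_partitions)
# prints: None 2 10
```

Under these assumptions, passing the bare flag selects 2 partitions, and a number after the flag overrides that value.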

--generate-trio-stats

Calculates trio stats.

Default: False

--generate-sibling-stats

Calculates sibling stats.

Default: False

--create-info-ht

Create the info HT containing annotations needed for variant QC.

Default: False

--lowqual-indel-phred-het-prior

Phred-scaled prior for a het genotype at a site with a low quality indel. Default is 40. We use 1/10k bases (phred=40) to be more consistent with the filtering used by Broad’s Data Sciences Platform for VQSR.

Default: 40
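
The link between the Phred-scaled prior of 40 and the rate of 1 in 10,000 bases is the standard Phred conversion, p = 10^(-Q/10). A minimal sketch of that arithmetic follows. The helper names are illustrative and are not part of gnomad_qc.

```python
import math


def phred_to_prob(phred: float) -> float:
    """Convert a Phred-scaled value Q to a probability, p = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


def prob_to_phred(prob: float) -> float:
    """Convert a probability p to a Phred-scaled value, Q = -10 * log10(p)."""
    return -10 * math.log10(prob)


het_prior = phred_to_prob(40)        # approximately 1e-4, i.e. 1 in 10,000
round_trip = prob_to_phred(1 / 10_000)  # approximately 40
print(het_prior, round_trip)
```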

--export-info-vcf

Export the info HT as a VCF.

Default: False

Variant QC annotation HT parameters.

--create-variant-qc-annotation-ht

Creates an annotated HT with features for variant QC.

Default: False

--impute-features

If set, imputation is performed for variant QC features.

Default: False

--n-partitions

Desired number of partitions for variant QC annotation HT.

Default: 5000

Export true positive VCFs

Arguments used to define true positive variant set.

--export-true-positive-vcfs

Exports true positive variants (--transmitted-singletons and/or --sibling-singletons) to VCF files.

Default: False

--transmitted-singletons

Include transmitted singletons in the exports of true positive variants to VCF files.

Default: False

--sibling-singletons

Include sibling singletons in the exports of true positive variants to VCF files.

Default: False

Module Functions

gnomad_qc.v5.annotations.generate_variant_qc_annotations.generate_ac_info_ht(vds)

Compute AC and AC_raw annotations for each allele count filter group.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.create_info_ht(...)

Import a VCF of AoU annotated sites, reformat annotations, and add AS_lowqual.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.run_generate_trio_stats(mt, ...)

Generate trio transmission stats from a dense trio MatrixTable and pedigree info.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.run_generate_sib_stats(mt, ...)

Generate sibling stats from a MatrixTable and relatedness info.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.create_variant_qc_annotation_ht(...)

Create a Table with all necessary annotations for variant QC.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.get_tp_ht_for_vcf_export(ht)

Get Tables with raw and adj true positive variants to export as a VCF for use in VQSR.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.main(args)

Generate all variant annotations needed for variant QC.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.generate_ac_info_ht(vds)[source]

Compute AC and AC_raw annotations for each allele count filter group.

Function also adds AS_pab_max and allele_info annotations.

Parameters:

vds (VariantDataset) – VariantDataset to use for computing AC and AC_raw annotations.

Return type:

Table

Returns:

Table with AC and AC_raw annotations split by high quality, release, and unrelated.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.create_info_ht(vcf_path, header_path, lowqual_indel_phred_het_prior=40, vds=None, test=False)[source]

Import a VCF of AoU annotated sites, reformat annotations, and add AS_lowqual.

Parameters:
  • vcf_path (str) – Path to the annotated sites-only VCF.

  • header_path (str) – Path to the header file for the VCF.

  • lowqual_indel_phred_het_prior (int) – Phred-scaled prior for a het genotype at a site with a low quality indel. Default is 40. We use 1/10k bases (phred=40) to be more consistent with the filtering used by Broad’s Data Sciences Platform for VQSR.

  • vds (VariantDataset) – VariantDataset to use for computing AC and AC_raw annotations.

  • test (bool) – Whether to run a test using just the first two partitions of the loaded VCF.

Return type:

Table

Returns:

Hail Table with reformatted annotations.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.run_generate_trio_stats(mt, fam_ped)[source]

Generate trio transmission stats from a dense trio MatrixTable and pedigree info.

Parameters:
  • mt (MatrixTable) – Dense trio MatrixTable.

  • fam_ped (Pedigree) – Pedigree containing trio info.

Return type:

Table

Returns:

Table containing trio stats.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.run_generate_sib_stats(mt, relatedness_ht)[source]

Generate sibling stats from a MatrixTable and relatedness info.

Parameters:
  • mt (MatrixTable) – Input MatrixTable.

  • relatedness_ht (Table) – Table containing relatedness info.

Return type:

Table

Returns:

Table containing sibling stats.

gnomad_qc.v5.annotations.generate_variant_qc_annotations.create_variant_qc_annotation_ht(info_ht, trio_stats_ht, sib_stats_ht, impute_features=True, n_partitions=5000)[source]

Create a Table with all necessary annotations for variant QC.

Annotations that are included:

Features for RF:
  • variant_type

  • allele_type

  • n_alt_alleles

  • has_star

  • AS_QD

  • AS_pab_max

  • AS_MQRankSum

  • AS_SOR

  • AS_ReadPosRankSum

Training sites (bool):
  • transmitted_singleton

  • sibling_singleton

  • fail_hard_filters - (ht.AS_QD < 0.5) | (ht.AS_FS > 60) | (ht.AS_MQ < 30)
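
The fail_hard_filters expression above can be restated in plain Python to make the thresholds and the OR logic explicit. This is only an illustration on scalar values. The documented expression operates on Hail Table fields, and Hail's handling of missing values is not modeled here.

```python
def fail_hard_filters(as_qd: float, as_fs: float, as_mq: float) -> bool:
    """Return True if any hard-filter threshold is violated.

    Plain-Python restatement of:
    (ht.AS_QD < 0.5) | (ht.AS_FS > 60) | (ht.AS_MQ < 30)
    """
    return (as_qd < 0.5) or (as_fs > 60) or (as_mq < 30)


# One failing metric is enough to fail the site.
print(fail_hard_filters(as_qd=0.4, as_fs=10.0, as_mq=60.0))   # prints: True
# Values exactly at the thresholds do not fail (strict inequalities).
print(fail_hard_filters(as_qd=0.5, as_fs=60.0, as_mq=30.0))   # prints: False
```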

Parameters:
  • info_ht (Table) – Info Table with split multi-allelics.

  • trio_stats_ht (Table) – Table with trio statistics.

  • sib_stats_ht (Table) – Table with sibling statistics.

  • impute_features (bool) – Whether to impute missing feature values with the feature median, computed separately for each variant type.

  • n_partitions (int) – Number of partitions to use for final annotated Table.

Return type:

Table

Returns:

Hail Table with all annotations needed for variant QC.
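
Median imputation by variant type, as described for impute_features, can be sketched on plain Python rows. This is not the Hail implementation used by the function. The helper name impute_by_variant_type, the row layout, and the example values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def impute_by_variant_type(rows, features, group_key="variant_type"):
    """Fill missing (None) feature values with the per-group median."""
    # Collect the observed values for each (variant type, feature) pair.
    observed = defaultdict(list)
    for row in rows:
        for feature in features:
            if row.get(feature) is not None:
                observed[(row[group_key], feature)].append(row[feature])
    medians = {key: median(values) for key, values in observed.items()}

    # Copy each row and fill gaps; a group with no observed values stays None.
    imputed = []
    for row in rows:
        new_row = dict(row)
        for feature in features:
            if new_row.get(feature) is None:
                new_row[feature] = medians.get((row[group_key], feature))
        imputed.append(new_row)
    return imputed


rows = [
    {"variant_type": "snv", "AS_QD": 10.0},
    {"variant_type": "snv", "AS_QD": 14.0},
    {"variant_type": "snv", "AS_QD": None},
    {"variant_type": "indel", "AS_QD": 4.0},
    {"variant_type": "indel", "AS_QD": None},
]
result = impute_by_variant_type(rows, features=["AS_QD"])
print([row["AS_QD"] for row in result])
# prints: [10.0, 14.0, 12.0, 4.0, 4.0]
```

Grouping by variant type keeps the imputed value for a missing indel feature from being pulled toward the SNV distribution, and vice versa.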

gnomad_qc.v5.annotations.generate_variant_qc_annotations.get_tp_ht_for_vcf_export(ht, transmitted_singletons=False, sibling_singletons=False)[source]

Get Tables with raw and adj true positive variants to export as a VCF for use in VQSR.

Parameters:
  • ht (Table) – Input Table with transmitted singleton and sibling singleton information.

  • transmitted_singletons (bool) – Whether to include transmitted singletons in the true positive variants.

  • sibling_singletons (bool) – Whether to include sibling singletons in the true positive variants.

Return type:

Dict[str, Table]

Returns:

Dictionary of ‘raw’ and ‘adj’ true positive variant sites Tables.
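
The shape of the return value, a dictionary with one 'raw' and one 'adj' set of sites, can be illustrated with plain Python rows. This sketch is not the function's implementation. The field names (transmitted_singleton_raw, sibling_singleton_adj, and so on), the helper name, and the choice to raise an error when neither flag is set are assumptions made for illustration.

```python
def select_true_positives(rows, transmitted_singletons=False, sibling_singletons=False):
    """Return {'raw': [...], 'adj': [...]} with rows flagged as true positives."""
    if not (transmitted_singletons or sibling_singletons):
        # Assumed behavior: at least one true positive set must be requested.
        raise ValueError("Set transmitted_singletons and/or sibling_singletons.")

    prefixes = []
    if transmitted_singletons:
        prefixes.append("transmitted_singleton")
    if sibling_singletons:
        prefixes.append("sibling_singleton")

    # Keep a row in a group if any requested flag is True for that group.
    return {
        group: [
            row for row in rows
            if any(row.get(f"{prefix}_{group}", False) for prefix in prefixes)
        ]
        for group in ("raw", "adj")
    }


rows = [
    {"locus": "chr1:100", "transmitted_singleton_raw": True, "transmitted_singleton_adj": True,
     "sibling_singleton_raw": False, "sibling_singleton_adj": False},
    {"locus": "chr1:200", "transmitted_singleton_raw": False, "transmitted_singleton_adj": False,
     "sibling_singleton_raw": True, "sibling_singleton_adj": False},
    {"locus": "chr1:300", "transmitted_singleton_raw": False, "transmitted_singleton_adj": False,
     "sibling_singleton_raw": False, "sibling_singleton_adj": False},
]
both = select_true_positives(rows, transmitted_singletons=True, sibling_singletons=True)
print([r["locus"] for r in both["raw"]], [r["locus"] for r in both["adj"]])
# prints: ['chr1:100', 'chr1:200'] ['chr1:100']
```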

gnomad_qc.v5.annotations.generate_variant_qc_annotations.main(args)[source]

Generate all variant annotations needed for variant QC.