gnomad_qc.v4.create_release.create_release_sites_ht

Script to create release sites HT for v4.0 exomes and genomes.

usage: gnomad_qc.v4.create_release.create_release_sites_ht.py
       [-h] [--new-partition-percent NEW_PARTITION_PERCENT] [--overwrite]
       [-v VERSION] [-t] [-d {exomes,genomes}]
       [-j TABLES_FOR_JOIN [TABLES_FOR_JOIN ...]]
       [-b {dbsnp,filters,freq,info,region_flags,in_silico,vep}]
       [--release-exists] [--slack-channel SLACK_CHANNEL]
       [--n-partitions N_PARTITIONS]

Named Arguments

--new-partition-percent

Percent of the starting dataset's partitions to use for the release HT, given as a fraction. Default is 1.1 (110%).

Default: 1.1

--overwrite

Overwrite data

Default: False

-v, --version

The version of gnomAD.

Default: “4.1”

-t, --test

Runs a test on the PCSK9 region, chr1:55039447-55064852.

Default: False

-d, --data-type

Possible choices: exomes, genomes

Data type to create release HT for.

Default: “exomes”

-j, --tables-for-join

Tables to join for release

Default: [‘dbsnp’, ‘filters’, ‘freq’, ‘info’, ‘region_flags’, ‘in_silico’, ‘vep’]

-b, --base-table

Possible choices: dbsnp, filters, freq, info, region_flags, in_silico, vep

Base table for interval partition calculation.

Default: “freq”

--release-exists

Whether the release HT already exists.

Default: False

--slack-channel

Slack channel to post results and notifications to.

--n-partitions

Number of partitions to naive coalesce the release Table to.

Default: 10000
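
The flags and defaults listed above can be mirrored with a small argparse parser. This is a minimal sketch reconstructed from the usage text only, not the script's own get_script_argument_parser(); the helper name build_parser is hypothetical, and --slack-channel is assumed to default to None because no default is documented.

```python
import argparse

# Table names taken from the documented choices for -j and -b.
TABLES = ["dbsnp", "filters", "freq", "info", "region_flags", "in_silico", "vep"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--new-partition-percent", type=float, default=1.1)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("-v", "--version", default="4.1")
    parser.add_argument("-t", "--test", action="store_true")
    parser.add_argument(
        "-d", "--data-type", choices=["exomes", "genomes"], default="exomes"
    )
    parser.add_argument("-j", "--tables-for-join", nargs="+", default=TABLES)
    parser.add_argument("-b", "--base-table", choices=TABLES, default="freq")
    parser.add_argument("--release-exists", action="store_true")
    parser.add_argument("--slack-channel")  # assumed default: None
    parser.add_argument("--n-partitions", type=int, default=10000)
    return parser


# Example: a test run on genomes that joins only two tables.
args = build_parser().parse_args(["--test", "-d", "genomes", "-j", "freq", "info"])
print(args.data_type, args.tables_for_join, args.base_table)
```

Flags that are not passed keep their documented defaults, so the example run still uses freq as the base table and 10000 partitions.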

Module Functions

gnomad_qc.v4.create_release.create_release_sites_ht.get_config(...)

Get configuration dictionary for specified data type.

gnomad_qc.v4.create_release.create_release_sites_ht.drop_v3_subsets(freq_ht)

Drop the frequencies of all v3 subsets except 'hgdp' and 'tgp' from freq_ht.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_joint_faf_select(ht, **_)

Drop faf95 from 'grpmax'.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_freq_select(ht, ...)

Drop faf95 from both 'gnomad' and 'non_ukb' in 'grpmax' and rename gen_anc_faf_max to fafmax.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_in_silico_select(ht, **_)

Get in silico predictors from VEP for release.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_region_flags_select(ht, ...)

Select region flags for release.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_filters_select(ht, **_)

Select gnomAD filter HT fields for release dataset.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_filters_select_globals(ht)

Select filter HT globals for release dataset.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_info_select(ht, ...)

Select fields for info Hail Table annotation in release.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_vep_select(ht, **_)

Select fields for the VEP Hail Table annotation in the release.

gnomad_qc.v4.create_release.create_release_sites_ht.get_select_global_fields(ht, ...)

Generate a dictionary of globals to select by checking the config of all tables joined.

gnomad_qc.v4.create_release.create_release_sites_ht.get_select_fields(...)

Generate a select dict from traversing the base_ht and extracting annotations.

gnomad_qc.v4.create_release.create_release_sites_ht.get_final_ht_fields(ht)

Get the final fields for the release HT.

gnomad_qc.v4.create_release.create_release_sites_ht.get_ht(...)

Return the appropriate Hail table with selects applied.

gnomad_qc.v4.create_release.create_release_sites_ht.join_hts(...)

Outer join a list of Hail Tables.

gnomad_qc.v4.create_release.create_release_sites_ht.main(args)

Create release HT.

gnomad_qc.v4.create_release.create_release_sites_ht.get_script_argument_parser()

Get script argument parser.


gnomad_qc.v4.create_release.create_release_sites_ht.get_config(data_type, release_exists=False)

Get configuration dictionary for specified data type.

Format:

'<Name of dataset>': {
    'ht': '<Optional Hail Table for direct annotation extraction. This is not used for the join.>',
    'path': 'gs://path/to/hail_table.ht',
    'select': '<Optional list of fields to select or dict of new field name to location of old field in the dataset.>',
    'field_name': '<Optional name of root annotation in combined dataset, defaults to name of dataset.>',
    'custom_select': '<Optional function name of custom select function that is needed for more advanced logic>',
    'select_globals': '<Optional list of globals to select or dict of new global field name to old global field name. If not specified, all globals are selected.>'
},
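
As an illustration of the format above, the sketch below fills in two entries as a plain Python dictionary and checks that only the documented keys are used. Every path and field name in it is a placeholder, and custom_example_select and unknown_keys are hypothetical helpers that are not part of the module.

```python
from typing import Any, Dict, List

# Keys documented in the format description above.
ALLOWED_KEYS = {"ht", "path", "select", "field_name", "custom_select", "select_globals"}


def custom_example_select(ht: Any, **_: Any) -> Dict[str, Any]:
    """Hypothetical custom select; the real ones return Hail expressions."""
    return {}


# Dataset names come from the CLI choices; paths and field names are placeholders.
example_config: Dict[str, Dict[str, Any]] = {
    "dbsnp": {
        "path": "gs://path/to/hail_table.ht",
        "select": ["example_field"],  # list form: keep fields under their own names
    },
    "freq": {
        "path": "gs://path/to/hail_table.ht",
        "select": {"new_name": "old_field"},  # dict form: new name -> old location
        "custom_select": custom_example_select,
        "select_globals": ["example_global"],
    },
}


def unknown_keys(config: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Return, per dataset, any keys that the documented format does not list."""
    problems = {}
    for name, entry in config.items():
        extra = sorted(set(entry) - ALLOWED_KEYS)
        if extra:
            problems[name] = extra
    return problems


print(unknown_keys(example_config))  # {}
```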

Warning

The ‘in_silico’ key’s ‘ht’ logic is handled separately because its value is a list of HTs. In this list, the phyloP HT is keyed by locus only, so the ‘ht’ code below sets the join key to 1 whenever an HT’s keys are not {‘locus’, ‘alleles’}; this grabs the first key of ht.key.dtype.values(), e.g. ‘locus’. Before this logic is used for any future in_silico predictor, confirm that its keys are ‘locus’, with or without ‘alleles’.
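
The key-selection rule in this warning can be sketched with plain dictionaries standing in for Hail Tables. This is only an illustration of the rule as described, under the assumption that rows are stored by key tuple; sketch_lookup and the sample rows are hypothetical and no Hail calls are involved.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def sketch_lookup(
    rows: Dict[Tuple, Dict[str, Any]],
    key_fields: Sequence[str],
    locus: str,
    alleles: Tuple[str, ...],
) -> Optional[Dict[str, Any]]:
    """Look up a variant in a table keyed by (locus, alleles) or by locus only."""
    if set(key_fields) == {"locus", "alleles"}:
        # Full key: match on both locus and alleles.
        return rows.get((locus, alleles))
    # Otherwise use only the first key field, assumed here to be 'locus'.
    return rows.get((locus,))


# Made-up rows: one table keyed by variant, one keyed by locus only.
variant_keyed = {("chr1:100", ("A", "T")): {"score": 12.3}}
locus_keyed = {("chr1:100",): {"phylop": 1.7}}

print(sketch_lookup(variant_keyed, ["locus", "alleles"], "chr1:100", ("A", "T")))
print(sketch_lookup(locus_keyed, ["locus"], "chr1:100", ("A", "T")))
```

A table whose first key is not the locus would silently return nothing in this sketch, which is the failure the warning asks maintainers to rule out by checking keys first.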

Parameters:
  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

  • release_exists (bool) – Whether the release HT already exists.

Return type:

Dict[str, Dict[str, Expression]]

Returns:

Dict of dataset’s configs.

gnomad_qc.v4.create_release.create_release_sites_ht.drop_v3_subsets(freq_ht)

Drop the frequencies of all v3 subsets except ‘hgdp’ and ‘tgp’ from freq_ht.

Parameters:

freq_ht (Table) – v4.0 genomes freq Table.

Return type:

Table

Returns:

v4.0 genomes freq Table with some v3 subsets dropped.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_joint_faf_select(ht, **_)

Drop faf95 from ‘grpmax’.

This annotation will be combined with the others from joint_faf’s select in the config. See note in custom_freq_select explaining why this field is removed.

Parameters:

ht (Table) – Joint FAF Hail Table.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_freq_select(ht, data_type)

Drop faf95 from both ‘gnomad’ and ‘non_ukb’ in ‘grpmax’ and rename gen_anc_faf_max to fafmax.

These annotations will be combined with the others from freq’s select in the config.

Note

  • The faf95 field in the grpmax struct is the FAF of the genetic ancestry group with the largest AF (grpmax AF).

  • The FAF fields within the gen_anc_faf_max struct contain the FAFs from the genetic ancestry group(s) with the largest FAFs.

  • These values aren’t necessarily the same; the group with the highest AF for a variant isn’t necessarily the group with the highest FAF for that variant.

  • The filtering allele frequencies used by the community are the values within the gen_anc_faf_max struct, NOT the grpmax FAF, which is why we drop grpmax.faf95 and rename gen_anc_faf_max to fafmax.
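
The distinction drawn in this note can be seen with a toy example. The numbers below are made up for illustration and the group names are placeholders; the sketch only shows that picking the group by largest AF and picking it by largest FAF can give different faf95 values.

```python
# Made-up per-group values: allele frequency (AF) and filtering allele frequency (faf95).
groups = {
    "group_a": {"AF": 0.0100, "faf95": 0.0040},
    "group_b": {"AF": 0.0080, "faf95": 0.0070},
}

# grpmax: the group with the largest AF; its faf95 is what grpmax.faf95 would hold.
grpmax_group = max(groups, key=lambda g: groups[g]["AF"])
grpmax_faf95 = groups[grpmax_group]["faf95"]

# fafmax: the group with the largest FAF, regardless of its AF.
fafmax_group = max(groups, key=lambda g: groups[g]["faf95"])
fafmax_faf95 = groups[fafmax_group]["faf95"]

print(grpmax_group, grpmax_faf95)  # group_a 0.004
print(fafmax_group, fafmax_faf95)  # group_b 0.007
```

Here the grpmax group is group_a, but the largest FAF belongs to group_b, so a filter based on grpmax.faf95 would use 0.004 where the community convention expects 0.007.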

Parameters:
  • ht (Table) – Freq Hail Table.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_in_silico_select(ht, **_)

Get in silico predictors from VEP for release.

This function currently selects only SIFT and Polyphen from VEP.

Parameters:

ht (Table) – VEP Hail Table.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_region_flags_select(ht, data_type)

Select region flags for release.

Parameters:
  • ht (Table) – Hail Table.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_filters_select(ht, **_)

Select gnomAD filter HT fields for release dataset.

Extract “results” field and rename based on filtering method.

Parameters:

ht (Table) – Filters Hail Table.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_filters_select_globals(ht)

Select filter HT globals for release dataset.

Parameters:

ht (Table) – Filters Hail Table.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_info_select(ht, data_type)

Select fields for info Hail Table annotation in release.

The info field requires fields from the freq HT and the filters HT so those are pulled in here along with all info HT fields. It also adds the allele_info struct to release HT.

Parameters:
  • ht (Table) – Info Hail Table.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.custom_vep_select(ht, **_)

Select fields for the VEP Hail Table annotation in the release.

Parameters:

ht (Table) – VEP Hail Table.

Return type:

Dict[str, Expression]

Returns:

Select expression dict.

gnomad_qc.v4.create_release.create_release_sites_ht.get_select_global_fields(ht, data_type, tables_for_join=['dbsnp', 'filters', 'freq', 'info', 'region_flags', 'in_silico', 'vep'])

Generate a dictionary of globals to select by checking the config of all tables joined.

Note

This function will place the globals within the select_globals value above any globals returned from custom_select_globals. If ordering is important, use only custom_select_globals.

Parameters:
  • ht (Table) – Final joined HT with globals.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

  • tables_for_join (List[str]) – List of tables to join into final release HT.

Return type:

Dict[str, Expression]

Returns:

Select mapping from global annotation name to ht annotation.
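
The ordering described in the note for this function can be illustrated with ordinary dictionaries, which preserve insertion order in Python. This is a sketch of the ordering behaviour only, assuming the two sources are merged into one mapping; sketch_merge_globals and the global names are hypothetical.

```python
from typing import Any, Dict


def sketch_merge_globals(
    select_globals: Dict[str, Any], custom_globals: Dict[str, Any]
) -> Dict[str, Any]:
    """Place 'select_globals' entries first, then those from a custom function."""
    merged = dict(select_globals)  # entries from the config's select_globals
    merged.update(custom_globals)  # entries from custom_select_globals follow
    return merged


# Placeholder global names and values.
merged = sketch_merge_globals(
    {"example_global_a": 1, "example_global_b": 2},
    {"example_custom_global": 3},
)
print(list(merged))
```

Because the config entries always come first in this scheme, a table that needs a specific global order would have to supply every global through the custom function, as the note advises.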

gnomad_qc.v4.create_release.create_release_sites_ht.get_select_fields(selects, base_ht)

Generate a select dict from traversing the base_ht and extracting annotations.

Parameters:
  • selects (Union[List, Dict]) – Mapping or list of selections.

  • base_ht (Table) – Base Hail Table to traverse.

Return type:

Dict[str, Expression]

Returns:

Select mapping from annotation name to base_ht annotation.
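
The traversal described for this function can be sketched with a nested dictionary standing in for base_ht. The sketch assumes that a list keeps top-level fields under their own names and that a dict maps a new name to a dot-separated location; both the dotted-path convention and the helper name sketch_select_fields are assumptions, not the module's confirmed behaviour.

```python
from typing import Any, Dict, List, Union


def sketch_select_fields(
    selects: Union[List[str], Dict[str, str]], base: Dict[str, Any]
) -> Dict[str, Any]:
    """Resolve selections against a nested dict standing in for base_ht."""
    if isinstance(selects, list):
        # List form: keep each top-level field under its own name.
        return {name: base[name] for name in selects}
    resolved = {}
    for new_name, location in selects.items():
        # Dict form: walk the (assumed) dot-separated path one level at a time.
        value: Any = base
        for part in location.split("."):
            value = value[part]
        resolved[new_name] = value
    return resolved


# Made-up stand-in for a table row schema.
base = {"rsid": "rs1", "info": {"qual": {"QD": 12.5}}}

print(sketch_select_fields(["rsid"], base))
print(sketch_select_fields({"QD": "info.qual.QD"}, base))
```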

gnomad_qc.v4.create_release.create_release_sites_ht.get_final_ht_fields(ht)

Get the final fields for the release HT.

Create a dictionary of lists of fields that are in the FINALIZED_SCHEMA and are present in the HT. If a field is not present in the HT, log a warning.

Parameters:

ht (Table) – Hail Table.

Return type:

Dict[str, List[str]]

Returns:

Dict of final fields for the release HT.
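
The filtering step described for this function can be sketched without Hail as follows. EXAMPLE_SCHEMA is a hypothetical stand-in for FINALIZED_SCHEMA, and its section names ("globals", "rows") and field names are assumptions made for illustration.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

# Hypothetical stand-in for FINALIZED_SCHEMA; the real constant lives in the module.
EXAMPLE_SCHEMA: Dict[str, List[str]] = {
    "globals": ["example_global", "version"],
    "rows": ["freq", "filters", "vep"],
}


def sketch_final_fields(
    schema: Dict[str, List[str]], present: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Keep schema fields that are present; log a warning for each missing one."""
    final: Dict[str, List[str]] = {}
    for section, fields in schema.items():
        available = set(present.get(section, []))
        final[section] = []
        for field in fields:
            if field in available:
                final[section].append(field)
            else:
                logger.warning("Field %s is missing from the HT %s", field, section)
    return final


# Fields found on a made-up table: 'vep' and 'example_global' are absent.
present = {"globals": ["version"], "rows": ["freq", "filters", "extra_field"]}
print(sketch_final_fields(EXAMPLE_SCHEMA, present))
```

Fields that exist on the table but not in the schema, such as extra_field here, are dropped without a warning in this sketch, and the schema's ordering is preserved.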

gnomad_qc.v4.create_release.create_release_sites_ht.get_ht(dataset, _intervals, data_type, test, release_exists)

Return the appropriate Hail table with selects applied.

Parameters:
  • dataset (str) – Hail Table to join.

  • _intervals (MultipleTypeChecker) – Intervals for reading in the Hail Table. Used to optimize the join.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

  • test (bool) – Whether call is for a test run.

  • release_exists (bool) – Whether the release HT already exists.

Return type:

Table

Returns:

Hail Table with fields to select.

gnomad_qc.v4.create_release.create_release_sites_ht.join_hts(base_table, tables, new_partition_percent, test, data_type, release_exists, version)

Outer join a list of Hail Tables.

Parameters:
  • base_table (Table) – Dataset to use for interval partitioning.

  • tables (List[str]) – List of tables to join.

  • new_partition_percent (float) – Percent of base_table partitions used for final release Hail Table.

  • test (bool) – Whether this is for a test run.

  • data_type (str) – Dataset’s data type: ‘exomes’ or ‘genomes’.

  • release_exists (bool) – Whether the release HT already exists.

  • version (str) – Release version.

Return type:

Table

Returns:

Hail Table with datasets joined.
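
The outer join and the partition scaling can be sketched with dictionaries keyed by (locus, alleles). This is a conceptual illustration only: the real function operates on Hail Tables, the sample rows are made up, the helper names are hypothetical, and truncating the scaled partition count with int() is an assumption.

```python
from functools import reduce
from typing import Any, Dict, List, Tuple

Key = Tuple[str, Tuple[str, ...]]
Rows = Dict[Key, Dict[str, Any]]


def outer_join(left: Rows, right: Rows) -> Rows:
    """Keep every key found in either table and merge the annotations."""
    keys = sorted(set(left) | set(right))
    return {key: {**left.get(key, {}), **right.get(key, {})} for key in keys}


def sketch_join(tables: List[Rows]) -> Rows:
    """Outer join a list of tables from left to right."""
    return reduce(outer_join, tables)


def sketch_new_partitions(base_partitions: int, new_partition_percent: float) -> int:
    """Scale the base table's partition count (rounding down is an assumption)."""
    return int(base_partitions * new_partition_percent)


# Made-up rows for two of the joined tables.
freq = {
    ("chr1:100", ("A", "T")): {"freq": 0.01},
    ("chr1:200", ("G", "C")): {"freq": 0.2},
}
vep = {
    ("chr1:200", ("G", "C")): {"vep": "example_a"},
    ("chr1:300", ("T", "TA")): {"vep": "example_b"},
}

joined = sketch_join([freq, vep])
print(len(joined))  # 3
print(sketch_new_partitions(10000, 1.1))  # 11000
```

Keys present in only one table survive the join with just that table's annotations, which is the behaviour that distinguishes an outer join from an inner one.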

gnomad_qc.v4.create_release.create_release_sites_ht.main(args)

Create release HT.