gnomad_qc.v5.resources.basics

Script containing generic resources.

Module Functions

gnomad_qc.v5.resources.basics.qc_temp_prefix([...])

Return path to temporary QC bucket.

gnomad_qc.v5.resources.basics.get_checkpoint_path(name)

Create a checkpoint path for Table or MatrixTable.

gnomad_qc.v5.resources.basics.get_logging_path(name)

Create a path for Hail log files.

gnomad_qc.v5.resources.basics.get_aou_vds([...])

Load the AoU VDS.

gnomad_qc.v5.resources.basics.get_aou_failing_genomic_metrics_samples()

Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.

gnomad_qc.v5.resources.basics.get_samples_to_exclude([...])

Get set of AoU sample IDs to exclude.

gnomad_qc.v5.resources.basics.add_project_prefix_to_sample_collisions(t, ...)

Add project prefix to sample IDs that exist in multiple projects.

gnomad_qc.v5.resources.basics.qc_temp_prefix(version='5.0', environment='dataproc')

Return path to temporary QC bucket.

Parameters:
  • version (str) – Version of annotation path to return.

  • environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.

Return type:

str

Returns:

Path to bucket with temporary QC data.

gnomad_qc.v5.resources.basics.get_checkpoint_path(name, version='5.0', mt=False, environment='dataproc')

Create a checkpoint path for Table or MatrixTable.

Parameters:
  • name (str) – Name of intermediate Table/MatrixTable.

  • version (str) – Version of annotation path to return.

  • mt (bool) – Whether path is for a MatrixTable, default is False.

  • environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.

Return type:

str

Returns:

Output checkpoint path.
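A minimal usage sketch, assuming the gnomad_qc package is installed in the calling environment. The checkpoint name "sample_qc_tmp" and the wrapper function are hypothetical and used only for illustration. The import is deferred into the function body so that defining the function does not itself require the package.

```python
def example_checkpoint_paths():
    """Return example checkpoint paths for a Table and a MatrixTable."""
    # Deferred import: gnomad_qc is only needed when the function is called.
    from gnomad_qc.v5.resources.basics import get_checkpoint_path

    # "sample_qc_tmp" is a hypothetical checkpoint name.
    # Defaults: version='5.0', mt=False, environment='dataproc'.
    ht_path = get_checkpoint_path("sample_qc_tmp")

    # Request a MatrixTable path in the 'rwb' environment instead.
    mt_path = get_checkpoint_path("sample_qc_tmp", mt=True, environment="rwb")
    return ht_path, mt_path
```

Both calls return plain strings, so the results can be passed directly to any function that accepts an output path.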

gnomad_qc.v5.resources.basics.get_logging_path(name, version='5.0', environment='dataproc')

Create a path for Hail log files.

Parameters:
  • name (str) – Name of log file.

  • version (str) – Version of annotation path to return.

  • environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.

Return type:

str

Returns:

Output log path.

gnomad_qc.v5.resources.basics.get_aou_vds(split=False, remove_hard_filtered_samples=True, filter_samples=None, test=False, filter_partitions=None, chrom=None, autosomes_only=False, sex_chr_only=False, filter_variant_ht=None, filter_intervals=None, split_reference_blocks=True, remove_dead_alleles=True, entries_to_keep=None, checkpoint_variant_data=False, naive_coalesce_partitions=None)

Load the AoU VDS.

Parameters:
  • split (bool) – Whether to split multi-allelic variants in the VDS. Note: this performs the split on the VDS rather than loading an already-split VDS. Default is False.

  • remove_hard_filtered_samples (bool) – Whether to remove samples that failed hard filters (only relevant after hard filtering is complete). Default is True.

  • filter_samples (Union[List[str], Table, None]) – Optional samples to filter the VDS to. Can be a list of sample IDs or a Table with sample IDs.

  • test (bool) – Whether to load the test VDS instead of the full VDS. The test VDS includes 10 samples selected from the full dataset for testing purposes. Default is False.

  • filter_partitions (Optional[List[int]]) – Optional argument to filter the VDS to a list of specific partitions.

  • chrom (Union[str, List[str], Set[str], None]) – Optional argument to filter the VDS to a specific chromosome(s).

  • autosomes_only (bool) – Whether to include only autosomes. Default is False.

  • sex_chr_only (bool) – Whether to include only sex chromosomes. Default is False.

  • filter_variant_ht (Optional[Table]) – Optional argument to filter the VDS to a specific set of variants. Only supported when splitting the VDS.

  • filter_intervals (Optional[List[Union[str, tinterval]]]) – Optional argument to filter the VDS to specific intervals.

  • split_reference_blocks (bool) – Whether to split the reference data at the edges of the intervals defined by filter_intervals. Default is True.

  • remove_dead_alleles (bool) – Whether to remove dead alleles when removing samples. Default is True.

  • entries_to_keep (Optional[List[str]]) – Optional list of entries to keep in the variant data. If splitting the VDS, use the global entries (e.g. ‘GT’) instead of the local entries (e.g. ‘LGT’) to keep.

  • checkpoint_variant_data (bool) – Whether to checkpoint the variant data MT after splitting and filtering. Default is False.

  • naive_coalesce_partitions (Optional[int]) – Optional number of partitions to coalesce the VDS to. Default is None.

Return type:

VariantDataset

Returns:

AoU v8 VDS.
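A minimal usage sketch of the parameters above, assuming the gnomad_qc package is installed and the caller has access to the AoU data. The wrapper function is hypothetical, and the contig name "chr20" is an assumption about how chromosomes are named in the VDS. The import is deferred so that defining the function does not require the package.

```python
def load_test_vds_chr20():
    """Load the test VDS restricted to one chromosome, with split multi-allelics."""
    # Deferred import: gnomad_qc and data access are only needed at call time.
    from gnomad_qc.v5.resources.basics import get_aou_vds

    return get_aou_vds(
        test=True,               # load the test VDS rather than the full VDS
        chrom="chr20",           # assumed contig name; adjust to the VDS's naming
        split=True,              # split multi-allelic variants
        entries_to_keep=["GT"],  # global entry name, because split=True
    )
```

Because split=True is set, the sketch passes the global entry 'GT' rather than the local entry 'LGT', as the entries_to_keep description requires.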

gnomad_qc.v5.resources.basics.aou_acaf_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/splitMT/hail.mt)

AoU v8 ACAF (Allele Count/Allele Frequency threshold) MatrixTable.

MatrixTable contains only variants with AF > 1% or AC > 100 in any genetic ancestry group.

See https://support.researchallofus.org/hc/en-us/articles/29475228181908-How-the-All-of-Us-Genomic-data-are-organized#01JJK0HH53FX9XQRDQ5HQFZW9B and https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK for more information.
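The inclusion rule stated above (AF > 1% or AC > 100 in any genetic ancestry group) can be sketched in plain Python. This is only an illustration of the rule as described here, not the AoU implementation; the GroupFreq type, the function name, and the group labels are hypothetical.

```python
from typing import Dict, NamedTuple


class GroupFreq(NamedTuple):
    """Allele count and allele frequency for one genetic ancestry group."""

    ac: int
    af: float


def passes_acaf_threshold(
    freq_by_group: Dict[str, GroupFreq],
    af_min: float = 0.01,
    ac_min: int = 100,
) -> bool:
    """Return True if any group has AF > af_min or AC > ac_min."""
    # Strict inequalities, matching "AF > 1% or AC > 100".
    return any(g.af > af_min or g.ac > ac_min for g in freq_by_group.values())


# Hypothetical group labels, for illustration only.
common_in_one_group = {"group_a": GroupFreq(ac=5, af=0.02), "group_b": GroupFreq(ac=1, af=0.0001)}
rare_everywhere = {"group_a": GroupFreq(ac=100, af=0.01)}
```

With these inputs, passes_acaf_threshold(common_in_one_group) is True because one group exceeds the AF threshold, while passes_acaf_threshold(rare_everywhere) is False because neither inequality is strictly satisfied.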

gnomad_qc.v5.resources.basics.aou_exome_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/exome/splitMT/hail.mt)

AoU v8 Exome MatrixTable.

MatrixTable contains only variants in exons (with 15 bp padding on either side) as defined by GENCODE v42 basic.

See the links listed under aou_acaf_mt above for more information.

gnomad_qc.v5.resources.basics.get_aou_failing_genomic_metrics_samples()

Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.

Note

Samples with low mean coverage (<30x), genome coverage (<90% at 20x), All of Us Hereditary Disease Risk gene (AoUHDR) coverage (<95% at 20x), or aligned_q30_bases (<8e10) were expected to be excluded from the AoU callset. However, some such samples are still present. AoU is preparing to publish a known issue in their quality report related to this. This note will be updated with a link once the issue is published.

In addition, we exclude samples with ambiguous sex ploidy (i.e., not “XX” or “XY”) from the callset.

Return type:

SetExpression

Returns:

SetExpression of samples that fail the coverage filters or have a sex ploidy other than XX or XY.
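The criteria in the note above can be sketched in plain Python. This illustrates the thresholds only and is not the function's Hail implementation; the field names are hypothetical, and the coverage fields are assumed to hold percentages (0 to 100).

```python
from typing import Dict, Iterable, Set, Union

Row = Dict[str, Union[str, float]]


def failing_samples(rows: Iterable[Row]) -> Set[str]:
    """Return IDs of samples failing coverage criteria or with ambiguous sex ploidy."""
    failing = set()
    for r in rows:
        low_coverage = (
            r["mean_coverage"] < 30             # mean coverage below 30x
            or r["genome_coverage_pct"] < 90    # under 90% of genome at 20x
            or r["aou_hdr_coverage_pct"] < 95   # under 95% of AoUHDR genes at 20x
            or r["aligned_q30_bases"] < 8e10    # fewer than 8e10 aligned Q30 bases
        )
        ambiguous_ploidy = r["sex_ploidy"] not in {"XX", "XY"}
        if low_coverage or ambiguous_ploidy:
            failing.add(r["s"])
    return failing


# Hypothetical metrics rows, for illustration only.
example_rows = [
    {"s": "A", "mean_coverage": 35.0, "genome_coverage_pct": 95.0,
     "aou_hdr_coverage_pct": 99.0, "aligned_q30_bases": 9e10, "sex_ploidy": "XX"},
    {"s": "B", "mean_coverage": 28.0, "genome_coverage_pct": 95.0,
     "aou_hdr_coverage_pct": 99.0, "aligned_q30_bases": 9e10, "sex_ploidy": "XY"},
    {"s": "C", "mean_coverage": 35.0, "genome_coverage_pct": 95.0,
     "aou_hdr_coverage_pct": 99.0, "aligned_q30_bases": 9e10, "sex_ploidy": "XXY"},
]
```

Here sample A passes every check, B fails on mean coverage, and C fails on sex ploidy, so failing_samples(example_rows) contains B and C.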

gnomad_qc.v5.resources.basics.get_samples_to_exclude(filter_samples=None, overwrite=False)

Get set of AoU sample IDs to exclude.

Note

If filter_samples is a Hail Table, it must contain a field named ‘s’ with sample IDs.

Parameters:
  • filter_samples (Union[List[str], Table, None]) – Optional additional samples to remove. Can be a list of sample IDs or a Table with sample IDs.

  • overwrite (bool) – Whether to overwrite the existing samples_to_exclude resource. Default is False.

Return type:

SetExpression

Returns:

SetExpression containing IDs of samples to exclude from v5 analysis.
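As a rough illustration of how the two accepted forms of filter_samples could be folded into one exclusion set, the plain-Python sketch below models a Table as a list of dicts with an 's' key. The function name and the base_exclusions argument are hypothetical, and how the real function builds its base set is not described here.

```python
from typing import Dict, Iterable, List, Optional, Set, Union

SampleInput = Union[List[str], List[Dict[str, str]]]


def combine_exclusions(
    base_exclusions: Iterable[str],
    filter_samples: Optional[SampleInput] = None,
) -> Set[str]:
    """Union a base exclusion set with optional additional sample IDs."""
    extra = set()
    if filter_samples is not None:
        for item in filter_samples:
            # A dict stands in for a Table row with an 's' field;
            # a bare string stands in for an entry of a list of sample IDs.
            extra.add(item["s"] if isinstance(item, dict) else item)
    return set(base_exclusions) | extra


# Hypothetical sample IDs, for illustration only.
from_list = combine_exclusions({"A", "B"}, ["C"])
from_rows = combine_exclusions({"A", "B"}, [{"s": "B"}, {"s": "D"}])
```

Both input forms yield a plain set of IDs: from_list holds A, B, and C, and from_rows holds A, B, and D.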

gnomad_qc.v5.resources.basics.add_project_prefix_to_sample_collisions(t, sample_collisions, project=None, sample_id_field='s')

Add project prefix to sample IDs that exist in multiple projects.

Parameters:
  • t (Union[Table, MatrixTable]) – Table/MatrixTable whose colliding sample IDs should receive a project prefix.

  • sample_collisions (Table) – Table of sample IDs that exist in multiple projects.

  • project (Optional[str]) – Optional project name to prepend to sample collisions. If not set, the ‘ht.project’ annotation is used. Default is None.

  • sample_id_field (str) – Field name for sample IDs in the table.

Return type:

Table

Returns:

Table with project prefix added to sample IDs.
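The renaming behavior described above can be sketched in plain Python, with rows modeled as dicts. This is an illustration only, not the Hail implementation: the function name is hypothetical, and the underscore separator between project and sample ID is an assumption.

```python
from typing import Dict, Iterable, List, Optional, Set


def add_project_prefix(
    rows: Iterable[Dict[str, str]],
    collisions: Set[str],
    project: Optional[str] = None,
    sample_id_field: str = "s",
    sep: str = "_",  # assumed separator; the real format is not documented here
) -> List[Dict[str, str]]:
    """Prefix colliding sample IDs with a project name, leaving others unchanged."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        sample_id = row[sample_id_field]
        if sample_id in collisions:
            # Use the explicit project if given, else the row's own annotation.
            prefix = project if project is not None else row["project"]
            new_row[sample_id_field] = f"{prefix}{sep}{sample_id}"
        out.append(new_row)
    return out


# Hypothetical rows and collision set, for illustration only.
example = [{"s": "S1", "project": "aou"}, {"s": "S2", "project": "aou"}]
renamed = add_project_prefix(example, collisions={"S1"})
```

Only S1 is in the collision set, so it becomes "aou_S1" while S2 keeps its original ID. Passing project="gnomad" would override the per-row annotation and give "gnomad_S1".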