gnomad_qc.v5.resources.basics
Script containing generic resources.
Module Functions
- qc_temp_prefix: Return path to temporary QC bucket.
- get_checkpoint_path: Create a checkpoint path for Table or MatrixTable.
- get_logging_path: Create a path for Hail log files.
- get_aou_vds: Load the AoU VDS.
- get_aou_failing_genomic_metrics_samples: Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.
- get_samples_to_exclude: Get set of AoU sample IDs to exclude.
- add_project_prefix_to_sample_collisions: Add project prefix to sample IDs that exist in multiple projects.
- gnomad_qc.v5.resources.basics.qc_temp_prefix(version='5.0', environment='dataproc')
Return path to temporary QC bucket.
- Parameters:
version (str) – Version of annotation path to return.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Path to bucket with temporary QC data.
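As an illustration of how an environment-dependent prefix like this can be built, here is a minimal sketch. It is not the gnomad_qc implementation: the bucket names, the path layout, and the helper name `sketch_qc_temp_prefix` are all hypothetical placeholders.

```python
# Hypothetical bucket names: the real values are defined in gnomad_qc and are
# not reproduced here.
_TEMP_BUCKETS = {
    "dataproc": "gs://example-dataproc-tmp",
    "rwb": "gs://example-rwb-tmp",
}


def sketch_qc_temp_prefix(version: str = "5.0", environment: str = "dataproc") -> str:
    """Build a versioned temporary QC prefix for the given compute environment."""
    if environment not in _TEMP_BUCKETS:
        raise ValueError(
            f"environment must be one of {sorted(_TEMP_BUCKETS)}, got {environment!r}"
        )
    # The directory layout below is an assumption made for illustration only.
    return f"{_TEMP_BUCKETS[environment]}/gnomad.genomes.v{version}.qc_data/"


print(sketch_qc_temp_prefix())
print(sketch_qc_temp_prefix(environment="rwb"))
```

Rejecting unknown environments early keeps a typo such as ‘dataprc’ from silently producing a path in the wrong bucket.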
- gnomad_qc.v5.resources.basics.get_checkpoint_path(name, version='5.0', mt=False, environment='dataproc')
Create a checkpoint path for Table or MatrixTable.
- Parameters:
name (str) – Name of intermediate Table/MatrixTable.
version (str) – Version of annotation path to return.
mt (bool) – Whether path is for a MatrixTable, default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Output checkpoint path.
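The `mt` flag only changes which kind of Hail object the path is meant for. A minimal sketch of that idea follows, assuming the common Hail convention of `.ht` for Tables and `.mt` for MatrixTables. The `temp_prefix` argument, the `checkpoint_files` directory name, and the helper name are hypothetical and not taken from gnomad_qc.

```python
def sketch_checkpoint_path(
    name: str,
    version: str = "5.0",
    mt: bool = False,
    temp_prefix: str = "gs://example-tmp/",  # hypothetical placeholder prefix
) -> str:
    """Return a checkpoint path whose extension reflects the Hail object type."""
    # ".mt" for a MatrixTable, ".ht" for a Table (naming convention assumed).
    suffix = "mt" if mt else "ht"
    # "checkpoint_files" is an illustrative directory name, not the real layout.
    return f"{temp_prefix.rstrip('/')}/v{version}/checkpoint_files/{name}.{suffix}"


print(sketch_checkpoint_path("dense_sites"))
print(sketch_checkpoint_path("dense_sites", mt=True))
```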
- gnomad_qc.v5.resources.basics.get_logging_path(name, version='5.0', environment='dataproc')
Create a path for Hail log files.
- Parameters:
name (str) – Name of log file.
version (str) – Version of annotation path to return.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Output log path.
- gnomad_qc.v5.resources.basics.get_aou_vds(split=False, remove_hard_filtered_samples=True, filter_samples=None, test=False, filter_partitions=None, chrom=None, autosomes_only=False, sex_chr_only=False, filter_variant_ht=None, filter_intervals=None, split_reference_blocks=True, remove_dead_alleles=True, entries_to_keep=None, checkpoint_variant_data=False, naive_coalesce_partitions=None)
Load the AoU VDS.
- Parameters:
split (bool) – Whether to split multi-allelic variants in the VDS. Note: this will perform a split on the VDS rather than grab an already split VDS. Default is False.
remove_hard_filtered_samples (bool) – Whether to remove samples that failed hard filters (only relevant after hard filtering is complete). Default is True.
filter_samples (Union[List[str], Table, None]) – Optional samples to filter the VDS to. Can be a list of sample IDs or a Table with sample IDs.
test (bool) – Whether to load the test VDS instead of the full VDS. The test VDS includes 10 samples selected from the full dataset for testing purposes. Default is False.
filter_partitions (Optional[List[int]]) – Optional argument to filter the VDS to a list of specific partitions.
chrom (Union[str, List[str], Set[str], None]) – Optional argument to filter the VDS to a specific chromosome(s).
autosomes_only (bool) – Whether to include only autosomes. Default is False.
sex_chr_only (bool) – Whether to include only sex chromosomes. Default is False.
filter_variant_ht (Optional[Table]) – Optional argument to filter the VDS to a specific set of variants. Only supported when splitting the VDS.
filter_intervals (Optional[List[Union[str, tinterval]]]) – Optional argument to filter the VDS to specific intervals.
split_reference_blocks (bool) – Whether to split the reference data at the edges of the intervals defined by filter_intervals. Default is True.
remove_dead_alleles (bool) – Whether to remove dead alleles when removing samples. Default is True.
entries_to_keep (Optional[List[str]]) – Optional list of entries to keep in the variant data. If splitting the VDS, use the global entries (e.g. ‘GT’) instead of the local entries (e.g. ‘LGT’) to keep.
checkpoint_variant_data (bool) – Whether to checkpoint the variant data MT after splitting and filtering. Default is False.
naive_coalesce_partitions (Optional[int]) – Optional number of partitions to coalesce the VDS to. Default is None.
- Return type:
VariantDataset
- Returns:
AoU v8 VDS.
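Several of the parameters above interact: `chrom` accepts a string, list, or set, and `filter_variant_ht` is documented as supported only when splitting. The sketch below shows one way such arguments could be normalized and checked before any data is loaded. It is an illustration only: the helper names are hypothetical, and treating `autosomes_only` and `sex_chr_only` as mutually exclusive is an assumption, not documented behavior.

```python
from typing import List, Optional, Set, Union


def normalize_chrom(
    chrom: Union[str, List[str], Set[str], None],
) -> Optional[List[str]]:
    """Return contigs as a sorted list, or None when no chromosome filter is set."""
    if chrom is None:
        return None
    if isinstance(chrom, str):
        return [chrom]
    return sorted(chrom)


def check_vds_options(
    split: bool = False,
    autosomes_only: bool = False,
    sex_chr_only: bool = False,
    has_filter_variant_ht: bool = False,
) -> None:
    """Raise ValueError for option combinations this sketch treats as invalid."""
    # Assumption: the two contig flags cannot both be set.
    if autosomes_only and sex_chr_only:
        raise ValueError("Set at most one of autosomes_only and sex_chr_only.")
    # Documented above: variant filtering is only supported when splitting.
    if has_filter_variant_ht and not split:
        raise ValueError("filter_variant_ht is only supported when split=True.")


print(normalize_chrom("chr20"))
print(normalize_chrom({"chrX", "chr1"}))
check_vds_options(split=True, has_filter_variant_ht=True)
```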
- gnomad_qc.v5.resources.basics.aou_acaf_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/splitMT/hail.mt)
AoU v8 ACAF (Allele Count/Allele Frequency threshold) MatrixTable.
MatrixTable contains only variants with AF > 1% or AC > 100 in any genetic ancestry group.
See https://support.researchallofus.org/hc/en-us/articles/29475228181908-How-the-All-of-Us-Genomic-data-are-organized#01JJK0HH53FX9XQRDQ5HQFZW9B and https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK for more information.
- gnomad_qc.v5.resources.basics.aou_exome_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/exome/splitMT/hail.mt)
AoU v8 Exome MatrixTable.
MatrixTable contains only variants in exons (with 15 bp padding on either side) as defined by GENCODE v42 basic.
See the links listed under aou_acaf_mt above for more information.
- gnomad_qc.v5.resources.basics.get_aou_failing_genomic_metrics_samples()
Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.
Note
Samples with low mean coverage (<30x), genome coverage (<90% at 20x), All of Us Hereditary Disease Risk gene (AoUHDR) coverage (<95% at 20x), or aligned_q30_bases (<8e10) were expected to be excluded from the AoU callset. However, some such samples are still present. AoU is preparing to publish a known issue in their quality report related to this. This note will be updated with a link once the issue is published.
In addition, we exclude samples with ambiguous sex ploidy (i.e., not “XX” or “XY”) from the callset.
- Return type:
SetExpression
- Returns:
SetExpression of samples that fail coverage filters or whose sex ploidy is not XX or XY.
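The thresholds in the note above can be written out as a plain-Python filter. This is a sketch of the criteria only, using dictionaries in place of the Hail Table the real function works on. Every field name (`mean_coverage`, `genome_coverage`, `aou_hdr_coverage`, `aligned_q30_bases`, `sex_ploidy`) is a hypothetical stand-in for whatever the AoU genomic metrics file actually uses.

```python
from typing import Iterable, Set

# Thresholds quoted in the note above.
MIN_MEAN_COVERAGE = 30.0        # mean coverage, in x
MIN_GENOME_COVERAGE_20X = 0.90  # fraction of the genome covered at 20x
MIN_AOUHDR_COVERAGE_20X = 0.95  # fraction of AoUHDR genes covered at 20x
MIN_ALIGNED_Q30_BASES = 8e10
UNAMBIGUOUS_SEX_PLOIDIES = {"XX", "XY"}


def fails_genomic_metrics(row: dict) -> bool:
    """Return True if a sample fails any coverage filter or has ambiguous ploidy."""
    return (
        row["mean_coverage"] < MIN_MEAN_COVERAGE
        or row["genome_coverage"] < MIN_GENOME_COVERAGE_20X
        or row["aou_hdr_coverage"] < MIN_AOUHDR_COVERAGE_20X
        or row["aligned_q30_bases"] < MIN_ALIGNED_Q30_BASES
        or row["sex_ploidy"] not in UNAMBIGUOUS_SEX_PLOIDIES
    )


def failing_sample_ids(rows: Iterable[dict]) -> Set[str]:
    """Collect the IDs (field 's') of all failing samples."""
    return {row["s"] for row in rows if fails_genomic_metrics(row)}


passing = {
    "s": "A", "mean_coverage": 35.0, "genome_coverage": 0.95,
    "aou_hdr_coverage": 0.99, "aligned_q30_bases": 9e10, "sex_ploidy": "XX",
}
low_coverage = dict(passing, s="B", mean_coverage=25.0)
ambiguous_ploidy = dict(passing, s="C", sex_ploidy="XXY")
print(sorted(failing_sample_ids([passing, low_coverage, ambiguous_ploidy])))
```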
- gnomad_qc.v5.resources.basics.get_samples_to_exclude(filter_samples=None, overwrite=False)
Get set of AoU sample IDs to exclude.
Note
If filter_samples is a Hail Table, it must contain a field named ‘s’ with sample IDs.
- Parameters:
filter_samples (Union[List[str], Table, None]) – Optional additional samples to remove. Can be a list of sample IDs or a Table with sample IDs.
overwrite (bool) – Whether to overwrite the existing samples_to_exclude resource. Default is False.
- Return type:
SetExpression
- Returns:
SetExpression containing IDs of samples to exclude from v5 analysis.
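The two accepted forms of `filter_samples` can be reduced to a single set of IDs before being combined with the samples already marked for exclusion. The sketch below illustrates that step in plain Python, with a list of dictionaries standing in for a Hail Table. The helper names are hypothetical, and the real function may combine its inputs differently.

```python
from typing import Dict, Iterable, List, Optional, Set, Union

# A list of row dictionaries stands in for a Hail Table in this sketch.
TableLike = List[Dict[str, str]]


def collect_sample_ids(
    filter_samples: Optional[Union[List[str], TableLike]],
) -> Set[str]:
    """Normalize a list of IDs or a table-like list of rows into a set of IDs."""
    if filter_samples is None:
        return set()
    ids: Set[str] = set()
    for item in filter_samples:
        if isinstance(item, str):
            ids.add(item)
        elif "s" in item:
            ids.add(item["s"])
        else:
            # Mirrors the requirement that a Table must have a field named 's'.
            raise ValueError("Table-like input must contain a field named 's'.")
    return ids


def sketch_samples_to_exclude(
    failing: Iterable[str],
    filter_samples: Optional[Union[List[str], TableLike]] = None,
) -> Set[str]:
    """Union the failing samples with any additional samples to remove."""
    return set(failing) | collect_sample_ids(filter_samples)


print(sorted(sketch_samples_to_exclude({"B", "C"}, ["D"])))
print(sorted(sketch_samples_to_exclude({"B"}, [{"s": "E"}])))
```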
- gnomad_qc.v5.resources.basics.add_project_prefix_to_sample_collisions(t, sample_collisions, project=None, sample_id_field='s')
Add project prefix to sample IDs that exist in multiple projects.
- Parameters:
t (Union[Table, MatrixTable]) – Table/MatrixTable to add project prefix to sample IDs.
sample_collisions (Table) – Table of sample IDs that exist in multiple projects.
project (Optional[str]) – Optional project name to prepend to sample collisions. If not set, will use ‘ht.project’ annotation. Default is None.
sample_id_field (str) – Field name for sample IDs in the table.
- Return type:
- Returns:
Table with project prefix added to sample IDs.
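The renaming rule can be illustrated without Hail: only IDs that appear in the collision set are rewritten, and the prefix comes from the `project` argument when given, otherwise from each row's own project annotation. The sketch below uses dictionaries in place of table rows. The `_` separator, the `project` row field, and the helper name are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Set


def sketch_add_project_prefix(
    rows: List[Dict[str, str]],
    sample_collisions: Set[str],
    project: Optional[str] = None,
    sample_id_field: str = "s",
) -> List[Dict[str, str]]:
    """Return copies of rows with colliding sample IDs prefixed by project."""
    renamed = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        sample_id = row[sample_id_field]
        if sample_id in sample_collisions:
            # Fall back to the row's own project annotation when none is given.
            prefix = project if project is not None else row["project"]
            # The "_" separator is an assumption made for this sketch.
            new_row[sample_id_field] = f"{prefix}_{sample_id}"
        renamed.append(new_row)
    return renamed


rows = [
    {"s": "1001", "project": "aou"},
    {"s": "2002", "project": "aou"},
]
print(sketch_add_project_prefix(rows, sample_collisions={"1001"}))
```

IDs that are not in the collision set are left unchanged, so joins against other tables keyed by the original ID keep working for those samples.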