gnomad_qc.v5.resources.basics
Script containing generic resources.
Module Functions
qc_temp_prefix | Return path to temporary QC bucket.
get_checkpoint_path | Create a checkpoint path for Table or MatrixTable.
get_logging_path | Create a path for Hail log files.
get_aou_vds | Load the AoU VDS.
get_gnomad_v5_genomes_vds | Get gnomAD v5 genomes VariantDataset with desired filtering and metadata annotations.
get_aou_failing_genomic_metrics_samples | Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.
get_samples_to_exclude | Get set of AoU sample IDs to exclude.
add_project_prefix_to_sample_collisions | Add project prefix to sample IDs that exist in multiple projects.
- gnomad_qc.v5.resources.basics.qc_temp_prefix(version='5.0', environment='dataproc')[source]
Return path to temporary QC bucket.
- Parameters:
version (str) – Version of annotation path to return.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Path to bucket with temporary QC data.
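A minimal sketch of how an environment-dependent prefix of this kind could be assembled. The bucket names and the path layout below are placeholders chosen for illustration, not the real gnomAD or AoU locations; only the parameter names, their defaults, and the two accepted environment values are taken from the documentation above.

```python
# Hypothetical bucket names, for illustration only.
_EXAMPLE_TMP_BUCKETS = {
    "dataproc": "gs://example-dataproc-tmp",
    "rwb": "gs://example-rwb-tmp",
}


def sketch_qc_temp_prefix(version: str = "5.0", environment: str = "dataproc") -> str:
    """Return an illustrative temporary QC prefix for a version and environment."""
    if environment not in _EXAMPLE_TMP_BUCKETS:
        raise ValueError(
            f"environment must be one of {sorted(_EXAMPLE_TMP_BUCKETS)}, "
            f"got {environment!r}"
        )
    # The 'v<version>/temp/' layout is an assumed convention, not the real one.
    return f"{_EXAMPLE_TMP_BUCKETS[environment]}/v{version}/temp/"
```

Rejecting unknown environments up front keeps a typo such as 'dataprc' from silently producing a path in the wrong bucket.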
- gnomad_qc.v5.resources.basics.get_checkpoint_path(name, version='5.0', mt=False, environment='dataproc')[source]
Create a checkpoint path for Table or MatrixTable.
- Parameters:
name (str) – Name of intermediate Table/MatrixTable.
version (str) – Version of annotation path to return.
mt (bool) – Whether path is for a MatrixTable, default is False.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Output checkpoint path.
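The effect of the mt flag can be sketched as a choice of file extension. This is an illustration under stated assumptions: the prefix argument and the 'checkpoints' directory are hypothetical, and the '.ht'/'.mt' extensions follow the usual Hail naming convention rather than anything confirmed by this page.

```python
def sketch_checkpoint_path(
    name: str,
    version: str = "5.0",
    mt: bool = False,
    prefix: str = "gs://example-tmp",  # hypothetical bucket, not a real one
) -> str:
    """Build an illustrative checkpoint path for a Table or MatrixTable."""
    # Assumed convention: Tables end in '.ht', MatrixTables in '.mt'.
    extension = "mt" if mt else "ht"
    return f"{prefix}/v{version}/checkpoints/{name}.{extension}"
```

Calling sketch_checkpoint_path('dense_mt', mt=True) yields a path ending in 'dense_mt.mt', while the default call for the same name ends in 'dense_mt.ht'.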
- gnomad_qc.v5.resources.basics.get_logging_path(name, version='5.0', environment='dataproc')[source]
Create a path for Hail log files.
- Parameters:
name (str) – Name of log file.
version (str) – Version of annotation path to return.
environment (str) – Compute environment, either ‘dataproc’ or ‘rwb’. Defaults to ‘dataproc’.
- Return type:
str
- Returns:
Output log path.
- gnomad_qc.v5.resources.basics.get_aou_vds(split=False, remove_hard_filtered_samples=True, filter_samples=None, test=False, filter_partitions=None, chrom=None, autosomes_only=False, sex_chr_only=False, filter_variant_ht=None, filter_intervals=None, split_reference_blocks=True, remove_dead_alleles=True, entries_to_keep=None, checkpoint_variant_data=False, naive_coalesce_partitions=None)[source]
Load the AoU VDS.
- Parameters:
split (bool) – Whether to split multi-allelic variants in the VDS. Note: this will perform a split on the VDS rather than grab an already split VDS. Default is False.
remove_hard_filtered_samples (bool) – Whether to remove samples that failed hard filters (only relevant after hard filtering is complete). Default is True.
filter_samples (Union[List[str], Table, None]) – Optional samples to filter the VDS to. Can be a list of sample IDs or a Table with sample IDs.
test (bool) – Whether to load the test VDS instead of the full VDS. The test VDS includes 10 samples selected from the full dataset for testing purposes. Default is False.
filter_partitions (Optional[List[int]]) – Optional argument to filter the VDS to a list of specific partitions.
chrom (Union[str, List[str], Set[str], None]) – Optional argument to filter the VDS to a specific chromosome(s).
autosomes_only (bool) – Whether to include only autosomes. Default is False.
sex_chr_only (bool) – Whether to include only sex chromosomes. Default is False.
filter_variant_ht (Optional[Table]) – Optional argument to filter the VDS to a specific set of variants. Only supported when splitting the VDS.
filter_intervals (Optional[List[Union[str, tinterval]]]) – Optional argument to filter the VDS to specific intervals.
split_reference_blocks (bool) – Whether to split the reference data at the edges of the intervals defined by filter_intervals. Default is True.
remove_dead_alleles (bool) – Whether to remove dead alleles when removing samples. Default is True.
entries_to_keep (Optional[List[str]]) – Optional list of entries to keep in the variant data. If splitting the VDS, use the global entries (e.g. ‘GT’) instead of the local entries (e.g. ‘LGT’) to keep.
checkpoint_variant_data (bool) – Whether to checkpoint the variant data MT after splitting and filtering. Default is False.
naive_coalesce_partitions (Optional[int]) – Optional number of partitions to coalesce the VDS to. Default is None.
- Return type:
VariantDataset
- Returns:
AoU v8 VDS.
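The chrom, autosomes_only, and sex_chr_only parameters all narrow the VDS by contig. The sketch below shows one way such options could be resolved into a single contig list before filtering. It is not the library's implementation: the helper name, the GRCh38-style contig names, and the rule that the three options are mutually exclusive are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Union

# Assumed GRCh38-style contig names.
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]
SEX_CHROMOSOMES = ["chrX", "chrY"]


def resolve_contigs(
    chrom: Union[str, List[str], Set[str], None] = None,
    autosomes_only: bool = False,
    sex_chr_only: bool = False,
) -> Optional[List[str]]:
    """Turn the contig-related options into a list of contigs, or None for all."""
    # Assumption: the three options are treated as mutually exclusive.
    if sum([chrom is not None, autosomes_only, sex_chr_only]) > 1:
        raise ValueError(
            "Set at most one of chrom, autosomes_only, and sex_chr_only."
        )
    if autosomes_only:
        return list(AUTOSOMES)
    if sex_chr_only:
        return list(SEX_CHROMOSOMES)
    if chrom is None:
        return None
    if isinstance(chrom, str):
        return [chrom]
    # Sets have no order, so sort them; lists keep the caller's order.
    return sorted(chrom) if isinstance(chrom, set) else list(chrom)
```

Normalizing a single string, a list, or a set into one list means the downstream filtering step only has to handle one input shape.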
- gnomad_qc.v5.resources.basics.get_gnomad_v5_genomes_vds(split=False, remove_hard_filtered_samples=True, release_only=False, annotate_meta=False, test=False, filter_partitions=None, chrom=None, autosomes_only=False, sex_chr_only=False, filter_variant_ht=None, filter_intervals=None, split_reference_blocks=True, entries_to_keep=None, annotate_het_non_ref=False, naive_coalesce_partitions=None, filter_samples_ht=None)[source]
Get gnomAD v5 genomes VariantDataset with desired filtering and metadata annotations.
- Parameters:
split (bool) – Whether to split multi-allelic variants in the VDS. Note: this will perform a split on the VDS rather than grab an already split VDS. Default is False.
remove_hard_filtered_samples (bool) – Whether to remove samples that failed hard filters (only relevant after sample QC).
release_only (bool) – Whether to filter the VDS to only samples available for v5 release (distinct from v4 release due to samples to drop for consent reasons). Requires that v5 sample metadata has been computed.
annotate_meta (bool) – Whether to add v4 genomes metadata to VDS variant_data in ‘meta’ column.
test (bool) – Whether to use the test VDS instead of the full v4 genomes VDS.
filter_partitions (Optional[List[int]]) – Optional argument to filter the VDS to specific partitions in the provided list.
chrom (Union[str, List[str], Set[str], None]) – Optional argument to filter the VDS to a specific chromosome(s).
autosomes_only (bool) – Whether to filter the VDS to autosomes only. Default is False.
sex_chr_only (bool) – Whether to filter the VDS to sex chromosomes only. Default is False.
filter_variant_ht (Optional[Table]) – Optional argument to filter the VDS to a specific set of variants. Only supported when splitting the VDS.
filter_intervals (Optional[List[Union[str, tinterval]]]) – Optional argument to filter the VDS to specific intervals.
split_reference_blocks (bool) – Whether to split the reference data at the edges of the intervals defined by filter_intervals. Default is True.
entries_to_keep (Optional[List[str]]) – Optional argument to keep only specific entries in the returned VDS. If splitting the VDS, use the global entries (e.g. ‘GT’) instead of the local entries (e.g. ‘LGT’) to keep.
annotate_het_non_ref (bool) – Whether to annotate non-reference heterozygotes (as ‘_het_non_ref’) in the variant data. Default is False.
naive_coalesce_partitions (Optional[int]) – Optional argument to coalesce the VDS to a specific number of partitions using naive coalesce.
filter_samples_ht (Optional[Table]) – Optional Table of samples to filter the VDS to.
- Return type:
VariantDataset
- Returns:
gnomAD v4 genomes VariantDataset with chosen annotations and filters.
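The entries_to_keep note distinguishes local entry names (such as ‘LGT’) from their global counterparts (such as ‘GT’), which are the ones present after splitting. The sketch below illustrates that distinction with a small helper that translates local names to global ones when split is set. The helper is hypothetical, and translating the names is an illustrative choice: this page does not say how the documented function treats a local name passed together with split=True.

```python
from typing import List

# Local-allele entry fields and their global equivalents after splitting.
LOCAL_TO_GLOBAL = {"LGT": "GT", "LAD": "AD", "LPL": "PL", "LPGT": "PGT"}


def entries_for_split(entries_to_keep: List[str], split: bool) -> List[str]:
    """Return the entry names to request, using global names when splitting."""
    if not split:
        return list(entries_to_keep)
    translated = [LOCAL_TO_GLOBAL.get(entry, entry) for entry in entries_to_keep]
    # Drop duplicates (e.g. both 'LGT' and 'GT' given) while keeping order.
    return list(dict.fromkeys(translated))
```

Entries without a local form, such as 'GQ' or 'DP', pass through unchanged in either mode.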
- gnomad_qc.v5.resources.basics.aou_acaf_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/splitMT/hail.mt)
AoU v8 ACAF (Allele Count/Allele Frequency threshold) MatrixTable.
MatrixTable contains only variants with AF > 1% or AC > 100 in any genetic ancestry group.
See https://support.researchallofus.org/hc/en-us/articles/29475228181908-How-the-All-of-Us-Genomic-data-are-organized#01JJK0HH53FX9XQRDQ5HQFZW9B and https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK for more information.
- gnomad_qc.v5.resources.basics.aou_exome_mt = MatrixTableResource(path=gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/exome/splitMT/hail.mt)
AoU v8 Exome MatrixTable.
MatrixTable contains only variants in exons (with 15 bp padding on either side) as defined by GENCODE v42 basic.
See the links in the aou_acaf_mt entry above for more information.
- gnomad_qc.v5.resources.basics.get_aou_failing_genomic_metrics_samples()[source]
Import AoU genomic metrics and filter to samples that fail specific quality criteria, including low coverage and ambiguous sex ploidy.
Note
Samples with low mean coverage (<30x), genome coverage (<90% at 20x), All of Us Hereditary Disease Risk gene (AoUHDR) coverage (<95% at 20x), or aligned_q30_bases (<8e10) were expected to be excluded from the AoU callset. However, some such samples are still present. AoU is preparing to publish a known issue in their quality report related to this. This note will be updated with a link once the issue is published.
In addition, we exclude samples with ambiguous sex ploidy (i.e., not “XX” or “XY”) from the callset.
- Return type:
Set[str]
- Returns:
Set of IDs for samples that fail the coverage filters or have a sex ploidy other than XX or XY.
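The criteria in the note above can be sketched in plain Python. The thresholds come from the note; the field names, the use of fractions for the coverage percentages, and the list-of-dicts input are assumptions standing in for the actual AoU genomic metrics file and its Hail import.

```python
from typing import Dict, Iterable, Set

VALID_SEX_PLOIDIES = {"XX", "XY"}


def sketch_failing_samples(metrics: Iterable[Dict]) -> Set[str]:
    """Collect IDs of samples failing coverage filters or with ambiguous ploidy."""
    failing = set()
    for row in metrics:
        # Field names below are hypothetical placeholders.
        low_coverage = (
            row["mean_coverage"] < 30              # mean coverage below 30x
            or row["genome_coverage_20x"] < 0.90   # under 90% of genome at 20x
            or row["aou_hdr_coverage_20x"] < 0.95  # under 95% of AoUHDR genes at 20x
            or row["aligned_q30_bases"] < 8e10
        )
        ambiguous_ploidy = row["sex_ploidy"] not in VALID_SEX_PLOIDIES
        if low_coverage or ambiguous_ploidy:
            failing.add(row["s"])
    return failing
```

A sample is flagged if any single criterion fails, so the result is the union of the coverage failures and the ambiguous-ploidy samples.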
- gnomad_qc.v5.resources.basics.get_samples_to_exclude(filter_samples=None, overwrite=False)[source]
Get set of AoU sample IDs to exclude.
Note
If filter_samples is a Hail Table, it must contain a field named ‘s’ with sample IDs.
- Parameters:
filter_samples (Union[List[str], Table, None]) – Optional additional samples to remove. Can be a list of sample IDs or a Table with sample IDs.
overwrite (bool) – Whether to overwrite the existing samples_to_exclude resource. Default is False.
- Return type:
SetExpression
- Returns:
SetExpression containing IDs of samples to exclude from v5 analysis.
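A rough sketch of how the optional filter_samples argument could be folded into a base exclusion set. It uses plain Python containers rather than Hail: a list of dicts with an ‘s’ key stands in for a Hail Table, and the helper name and the base_exclusions argument are hypothetical. The requirement that table rows carry an ‘s’ field mirrors the note above.

```python
from typing import Dict, Iterable, Optional, Set, Union


def sketch_samples_to_exclude(
    base_exclusions: Iterable[str],
    filter_samples: Optional[Iterable[Union[str, Dict]]] = None,
) -> Set[str]:
    """Union a base exclusion set with optional extra sample IDs."""
    excluded = set(base_exclusions)
    if filter_samples is None:
        return excluded
    for item in filter_samples:
        if isinstance(item, str):
            # Plain list of sample IDs.
            excluded.add(item)
        elif "s" in item:
            # Table-like row: the sample ID lives in the 's' field.
            excluded.add(item["s"])
        else:
            raise ValueError("Table rows must contain a field named 's'.")
    return excluded
```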
- gnomad_qc.v5.resources.basics.add_project_prefix_to_sample_collisions(t, sample_collisions, project=None, sample_id_field='s')[source]
Add project prefix to sample IDs that exist in multiple projects.
- Parameters:
t (Union[Table, MatrixTable]) – Table/MatrixTable to add project prefix to sample IDs.
sample_collisions (Table) – Table of sample IDs that exist in multiple projects.
project (Optional[str]) – Optional project name to prepend to sample collisions. If not set, will use ‘ht.project’ annotation. Default is None.
sample_id_field (str) – Field name for sample IDs in the table.
- Return type:
- Returns:
Table with project prefix added to sample IDs.
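The renaming logic can be illustrated without Hail. In the sketch below, a list of IDs stands in for the table, a set stands in for sample_collisions, and a dict of per-sample projects stands in for the ‘ht.project’ annotation used when project is not given. The underscore separator and all helper and argument names are assumptions; the page does not state the exact prefix format.

```python
from typing import Dict, List, Optional, Set


def sketch_prefix_collisions(
    sample_ids: List[str],
    collisions: Set[str],
    project: Optional[str] = None,
    sample_projects: Optional[Dict[str, str]] = None,
    sep: str = "_",  # assumed separator between project and sample ID
) -> List[str]:
    """Prefix only the colliding sample IDs with their project name."""
    if project is None and sample_projects is None:
        raise ValueError("Provide either project or sample_projects.")
    renamed = []
    for s in sample_ids:
        if s in collisions:
            # Fall back to the per-sample project when no single project is given.
            prefix = project if project is not None else sample_projects[s]
            renamed.append(f"{prefix}{sep}{s}")
        else:
            renamed.append(s)
    return renamed
```

Only IDs listed as collisions are renamed, so samples unique to one project keep their original IDs.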