gnomad.sample_qc.pipeline
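gnomad.sample_qc.pipeline.filter_rows_for_qc – Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds.
gnomad.sample_qc.pipeline.get_qc_mt – Create a QC-ready MT.
gnomad.sample_qc.pipeline.infer_sex_karyotype – Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.
gnomad.sample_qc.pipeline.annotate_sex – Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy.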
- gnomad.sample_qc.pipeline.filter_rows_for_qc(mt, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, bi_allelic_only=True, snv_only=True)[source]
Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds.
AF and callrate thresholds are taken from gnomAD QC; inbreeding coeff, MQ, FS and QD filters are taken from GATK best practices.
Note
This function expects the typical info annotation of type struct with fields MQ, FS and QD if applying hard filters.
- Parameters:
mt (MatrixTable) – Input MT.
min_af (Optional[float]) – Minimum site AF to keep. Not applied if set to None.
min_callrate (Optional[float]) – Minimum site call rate to keep. Not applied if set to None.
min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.
min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.
apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.
bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.
snv_only (bool) – Whether to only keep SNVs or include other variant types.
- Return type:
MatrixTable
- Returns:
Annotated and filtered MT.
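The threshold semantics described above (a threshold set to None is skipped, and hard filters are applied only when requested) can be sketched in plain Python. This is a minimal illustration and not the gnomAD implementation: the SiteStats class, its field names, and the passes_site_qc helper are hypothetical, and treating each minimum as inclusive is an assumption based on the wording "minimum ... to keep".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteStats:
    """Hypothetical per-site summary; field names are illustrative only."""

    af: float
    callrate: float
    inbreeding_coeff: float
    hwe_p_value: float
    qd: float
    fs: float
    mq: float


def passes_site_qc(
    site: SiteStats,
    min_af: Optional[float] = 0.001,
    min_callrate: Optional[float] = 0.99,
    min_inbreeding_coeff_threshold: Optional[float] = -0.8,
    min_hardy_weinberg_threshold: Optional[float] = 1e-8,
    apply_hard_filters: bool = True,
) -> bool:
    """Return True if the site passes every enabled threshold."""
    # Pair each optional minimum with the statistic it constrains.
    checks = [
        (min_af, site.af),
        (min_callrate, site.callrate),
        (min_inbreeding_coeff_threshold, site.inbreeding_coeff),
        (min_hardy_weinberg_threshold, site.hwe_p_value),
    ]
    for threshold, value in checks:
        # A threshold of None disables that filter.
        # Assumption: the minimum itself is kept (inclusive comparison).
        if threshold is not None and value < threshold:
            return False

    # Hard filters as stated in the parameter docs: QD >= 2, FS <= 60, MQ >= 30.
    if apply_hard_filters and not (site.qd >= 2 and site.fs <= 60 and site.mq >= 30):
        return False
    return True


common = SiteStats(af=0.05, callrate=0.995, inbreeding_coeff=0.0,
                   hwe_p_value=0.5, qd=12.0, fs=3.0, mq=60.0)
rare = SiteStats(af=0.0001, callrate=0.995, inbreeding_coeff=0.0,
                 hwe_p_value=0.5, qd=12.0, fs=3.0, mq=60.0)
print(passes_site_qc(common), passes_site_qc(rare), passes_site_qc(rare, min_af=None))
```

The rare site is rejected under the default min_af and kept once that threshold is disabled with None.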
- gnomad.sample_qc.pipeline.get_qc_mt(mt, bi_allelic_only=True, snv_only=True, adj_only=True, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, ld_r2=0.1, filter_lcr=True, filter_decoy=True, filter_segdup=True, filter_exome_low_coverage_regions=False, high_conf_regions=None, checkpoint_path=None, n_partitions=None, block_size=None)[source]
Create a QC-ready MT.
- Has options to filter to the following:
Variants outside known problematic regions
Bi-allelic sites only
SNVs only
Variants passing hard thresholds
Variants passing the set call rate and MAF thresholds
Genotypes passing gnomAD ADJ criteria (GQ>=20, DP>=10, AB>0.2 for hets)
In addition, the MT will be LD-pruned if ld_r2 is set.
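As a rough illustration of the ADJ genotype criteria listed above, the check can be written as a small predicate. This is a sketch under stated assumptions, not the library's code: passes_adj and its arguments are hypothetical names, allele balance is taken to be alternate-allele depth divided by DP, and the strict AB > 0.2 comparison follows the text above.

```python
def passes_adj(
    gq: int,
    dp: int,
    is_het: bool,
    alt_depth: int,
    min_gq: int = 20,
    min_dp: int = 10,
    min_ab: float = 0.2,
) -> bool:
    """Return True if a genotype meets the ADJ-style thresholds."""
    # Genotype quality and depth apply to every genotype.
    if gq < min_gq or dp < min_dp:
        return False
    # Allele balance is only checked for heterozygous calls.
    if is_het:
        # Guard against division by zero if min_dp is lowered to 0.
        if dp <= 0:
            return False
        # Assumption: AB = alt allele depth / total depth.
        return alt_depth / dp > min_ab
    return True


print(passes_adj(gq=30, dp=20, is_het=True, alt_depth=10))  # balanced het
print(passes_adj(gq=30, dp=20, is_het=True, alt_depth=2))   # skewed het
```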
- Parameters:
mt (MatrixTable) – Input MT.
bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.
snv_only (bool) – Whether to only keep SNVs or include other variant types.
adj_only (bool) – If set, only ADJ genotypes are kept. This filter is applied before the call rate and AF calculation.
min_af (Optional[float]) – Minimum allele frequency to keep. Not applied if set to None.
min_callrate (Optional[float]) – Minimum call rate to keep. Not applied if set to None.
min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.
min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.
apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.
ld_r2 (Optional[float]) – Minimum r2 to keep when LD-pruning (set to None for no LD pruning).
filter_lcr (bool) – Filter LCR regions.
filter_decoy (bool) – Filter decoy regions.
filter_segdup (bool) – Filter segmental duplication regions.
filter_exome_low_coverage_regions (bool) – If set, only high coverage exome regions (computed from gnomAD) are kept.
high_conf_regions (Optional[List[str]]) – If given, the data will be filtered to only include variants in those regions.
checkpoint_path (Optional[str]) – If given, the QC MT will be checkpointed to the specified path before running LD pruning. If not specified, persist will be used instead.
n_partitions (Optional[int]) – If given, the QC MT will be repartitioned to the specified number of partitions before running LD pruning. checkpoint_path must also be specified, as the MT will first be written to checkpoint_path before being reread with the new number of partitions.
block_size (Optional[int]) – If given, set the block size to this value when LD pruning.
- Return type:
MatrixTable
- Returns:
Filtered MT.
- gnomad.sample_qc.pipeline.infer_sex_karyotype(ploidy_ht, f_stat_cutoff=0.5, use_gaussian_mixture_model=False, normal_ploidy_cutoff=5, aneuploidy_cutoff=6, chr_x_frac_hom_alt_expr=None, normal_chr_x_hom_alt_cutoff=5)[source]
Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.
This function uses get_ploidy_cutoffs to determine X and Y ploidy cutoffs and then get_sex_expr to get karyotype annotations from those cutoffs.
By default f_stat_cutoff will be used to roughly split samples into ‘XX’ and ‘XY’ for use in get_ploidy_cutoffs. If use_gaussian_mixture_model is True a gaussian mixture model will be used to split samples into ‘XX’ and ‘XY’ instead of f-stat.
- Parameters:
ploidy_ht (Table) – Input Table with chromosome X and chromosome Y ploidy values and optionally f-stat.
f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY are above cutoff. Default is 0.5.
use_gaussian_mixture_model (bool) – Use gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat.
normal_ploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for XX, XY karyotypes.
aneuploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for aneuploidies.
chr_x_frac_hom_alt_expr (Optional[NumericExpression]) – Fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.
normal_chr_x_hom_alt_cutoff (int) – Number of standard deviations to use when determining cutoffs for the fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X for XX and XY karyotypes. Only used if chr_x_frac_hom_alt_expr is supplied.
- Return type:
Table
- Returns:
Table of samples' imputed sex karyotypes.
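The two ideas in this entry, a rough f-stat split followed by cutoffs expressed as a number of standard deviations, can be illustrated with the standard library alone. This is a simplified sketch, not the logic of get_ploidy_cutoffs: rough_split and ploidy_band are hypothetical helpers, the symmetric mean ± k·sd band is an assumption, and the real function may combine the groups differently.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def rough_split(
    samples: List[Dict[str, float]], f_stat_cutoff: float = 0.5
) -> Tuple[List[Dict[str, float]], List[Dict[str, float]]]:
    """Split samples into rough XX (below cutoff) and XY (at or above) groups."""
    # Assumption: a sample exactly at the cutoff is placed in the XY group.
    xx = [s for s in samples if s["f_stat"] < f_stat_cutoff]
    xy = [s for s in samples if s["f_stat"] >= f_stat_cutoff]
    return xx, xy


def ploidy_band(values: List[float], n_sd: float) -> Tuple[float, float]:
    """Return (mean - n_sd * sd, mean + n_sd * sd) for a list of ploidy values."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation; needs >= 2 values
    return centre - n_sd * spread, centre + n_sd * spread


samples = [
    {"f_stat": 0.02, "chrX_ploidy": 1.9},
    {"f_stat": -0.05, "chrX_ploidy": 2.0},
    {"f_stat": 0.10, "chrX_ploidy": 2.1},
    {"f_stat": 0.97, "chrX_ploidy": 0.95},
    {"f_stat": 0.99, "chrX_ploidy": 1.05},
]
xx, xy = rough_split(samples)
# Band for the rough XX group using normal_ploidy_cutoff=5 standard deviations.
xx_low, xx_high = ploidy_band([s["chrX_ploidy"] for s in xx], n_sd=5)
print(len(xx), len(xy), round(xx_low, 3), round(xx_high, 3))
```

A wider multiplier, as with aneuploidy_cutoff, widens the band and so makes an aneuploidy call more conservative.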
- gnomad.sample_qc.pipeline.annotate_sex(mtds, is_sparse=True, excluded_intervals=None, included_intervals=None, normalization_contig='chr20', sites_ht=None, aaf_expr=None, gt_expr='GT', f_stat_cutoff=0.5, aaf_threshold=0.001, variants_only_x_ploidy=False, variants_only_y_ploidy=False, variants_filter_lcr=True, variants_filter_segdup=True, variants_filter_decoy=False, variants_snv_only=False, coverage_mt=None, compute_x_frac_variants_hom_alt=False, compute_fstat=True, infer_karyotype=True, use_gaussian_mixture_model=False)[source]
Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy.
- Return Table with the following fields:
s (str): Sample
`normalization_contig`_mean_dp (float32): Sample’s mean coverage over the specified `normalization_contig`.
chrX_mean_dp (float32): Sample’s mean coverage over chromosome X.
chrY_mean_dp (float32): Sample’s mean coverage over chromosome Y.
chrX_ploidy (float32): Sample’s imputed ploidy over chromosome X.
chrY_ploidy (float32): Sample’s imputed ploidy over chromosome Y.
- If compute_fstat:
f_stat (float64): Sample f-stat. Calculated using hl.impute_sex.
n_called (int64): Number of variants with a genotype call. Calculated using hl.impute_sex.
expected_homs (float64): Expected number of homozygotes. Calculated using hl.impute_sex.
observed_homs (int64): Observed number of homozygotes. Calculated using hl.impute_sex.
- If infer_karyotype:
X_karyotype (str): Sample’s chromosome X karyotype.
Y_karyotype (str): Sample’s chromosome Y karyotype.
sex_karyotype (str): Sample’s sex karyotype.
Note
In order to infer sex karyotype (`infer_karyotype`=True), one of `compute_fstat` or `use_gaussian_mixture_model` must be set to True.
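For orientation, the f-stat fields listed above are related by the inbreeding-coefficient estimate F = (O - E) / (N - E), which is how the Hail documentation for hl.impute_sex describes it, with O the observed homozygotes, E the expected homozygotes and N the number of called variants. The sketch below restates that formula and the argument constraint from the note in plain Python; f_stat and check_karyotype_args are hypothetical helper names, not part of gnomad or Hail.

```python
def f_stat(observed_homs: int, expected_homs: float, n_called: int) -> float:
    """Inbreeding coefficient estimate F = (O - E) / (N - E)."""
    return (observed_homs - expected_homs) / (n_called - expected_homs)


def check_karyotype_args(
    infer_karyotype: bool, compute_fstat: bool, use_gaussian_mixture_model: bool
) -> None:
    """Raise if karyotype inference is requested without a way to split XX from XY."""
    if infer_karyotype and not (compute_fstat or use_gaussian_mixture_model):
        raise ValueError(
            "infer_karyotype=True requires compute_fstat or "
            "use_gaussian_mixture_model to be True"
        )


# All called chrX variants homozygous: F is 1, as expected for an XY sample.
print(f_stat(observed_homs=100, expected_homs=60.0, n_called=100))
# Observed homozygosity equal to expectation: F is 0.
print(f_stat(observed_homs=60, expected_homs=60.0, n_called=100))
```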
- Parameters:
mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset.
is_sparse (bool) – Whether input MatrixTable is in sparse data format. Default is True.
excluded_intervals (Optional[Table]) – Optional table of intervals to exclude from the computation. This option is currently not implemented for imputing sex chromosome ploidy on a VDS.
included_intervals (Optional[Table]) – Optional table of intervals to use in the computation. REQUIRED for exomes.
normalization_contig (str) – Which chromosome to use to normalize sex chromosome coverage. Used in determining sex chromosome ploidies. Default is “chr20”.
sites_ht (Optional[Table]) – Optional Table of sites and alternate allele frequencies for filtering the input MatrixTable prior to imputing sex.
aaf_expr (Optional[str]) – Optional. Name of field in input MatrixTable with alternate allele frequency.
gt_expr (str) – Name of entry field storing the genotype. Default is ‘GT’.
f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY samples are above cutoff. Default is 0.5.
aaf_threshold (float) – Minimum alternate allele frequency to be used in f-stat calculations. Default is 0.001.
variants_only_x_ploidy (bool) – Whether to use depth of only variant data for the x ploidy estimation.
variants_only_y_ploidy (bool) – Whether to use depth of only variant data for the y ploidy estimation.
variants_filter_lcr (bool) – Whether to filter out variants in LCR regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.
variants_filter_segdup (bool) – Whether to filter out variants in segdup regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.
variants_filter_decoy (bool) – Whether to filter out variants in decoy regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False. Note: this option doesn’t exist for GRCh38.
variants_snv_only (bool) – Whether to filter to only single nucleotide variants for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False.
coverage_mt (Optional[MatrixTable]) – Optional precomputed coverage MatrixTable to use in reference based VDS ploidy estimation.
compute_x_frac_variants_hom_alt (bool) – Whether to return an annotation for the fraction of homozygous alternate variants on chromosome X. Default is False.
compute_fstat (bool) – Whether to compute f-stat. Default is True.
infer_karyotype (bool) – Whether to infer sex karyotypes. Default is True.
use_gaussian_mixture_model (bool) – Whether to use gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat. Default is False.
- Return type:
Table
- Returns:
Table of samples and their imputed sex karyotypes.
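The relationship between the mean-coverage fields and the ploidy fields in the returned Table can be sketched as a simple normalization. The formula below (twice the sex chromosome's mean depth divided by the normalization contig's mean depth) matches how Hail documents its sex chromosome ploidy estimate; whether annotate_sex applies exactly this in every code path is an assumption, and estimate_ploidy is a hypothetical helper name.

```python
def estimate_ploidy(chrom_mean_dp: float, normalization_contig_mean_dp: float) -> float:
    """Estimate ploidy as 2 * (chromosome mean DP / normalization contig mean DP)."""
    if normalization_contig_mean_dp <= 0:
        raise ValueError("normalization contig mean DP must be positive")
    # The normalization contig (e.g. chr20) is assumed to be diploid, hence the 2.
    return 2 * chrom_mean_dp / normalization_contig_mean_dp


# Hypothetical sample with 30x mean coverage on the normalization contig.
chr20_mean_dp = 30.0
print(estimate_ploidy(30.0, chr20_mean_dp))  # chrX covered like an autosome
print(estimate_ploidy(15.0, chr20_mean_dp))  # chrX at half the autosomal depth
```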