gnomad.sample_qc.pipeline

gnomad.sample_qc.pipeline.filter_rows_for_qc(mt)

Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds.

gnomad.sample_qc.pipeline.get_qc_mt(mt[, ...])

Create a QC-ready MT.

gnomad.sample_qc.pipeline.infer_sex_karyotype(...)

Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.

gnomad.sample_qc.pipeline.annotate_sex(mtds)

Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy.

gnomad.sample_qc.pipeline.filter_rows_for_qc(mt, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, bi_allelic_only=True, snv_only=True)

Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds.

AF and callrate thresholds are taken from gnomAD QC; inbreeding coeff, MQ, FS and QD filters are taken from GATK best practices.

Note

If hard filters are applied, this function expects the typical info annotation: a struct with fields MQ, FS and QD.

Parameters:
  • mt (MatrixTable) – Input MT

  • min_af (Optional[float]) – Minimum site AF to keep. Not applied if set to None.

  • min_callrate (Optional[float]) – Minimum site call rate to keep. Not applied if set to None.

  • min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.

  • min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.

  • apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.

  • bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.

  • snv_only (bool) – Whether to only keep SNVs or include other variant types.

Return type:

MatrixTable

Returns:

Annotated and filtered MatrixTable.
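The thresholding described above can be sketched in plain Python, independent of Hail. This is an illustration only, not the library's implementation: the SiteStats record, the passes_site_filters helper, and the use of strict ">" comparisons for the AF, call rate, inbreeding coefficient and HW thresholds are assumptions. The hard-filter bounds (QD >= 2, FS <= 60, MQ >= 30) are taken from the parameter description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteStats:
    """Hypothetical per-site summary; field names are illustrative only."""
    af: float
    callrate: float
    inbreeding_coeff: float
    p_value_hwe: float
    qd: float
    fs: float
    mq: float


def passes_site_filters(
    site: SiteStats,
    min_af: Optional[float] = 0.001,
    min_callrate: Optional[float] = 0.99,
    min_inbreeding_coeff_threshold: Optional[float] = -0.8,
    min_hardy_weinberg_threshold: Optional[float] = 1e-8,
    apply_hard_filters: bool = True,
) -> bool:
    # A threshold set to None is skipped ("Not applied if set to None").
    # Strict ">" is an assumption; the real function may use ">=".
    if min_af is not None and not site.af > min_af:
        return False
    if min_callrate is not None and not site.callrate > min_callrate:
        return False
    if (min_inbreeding_coeff_threshold is not None
            and not site.inbreeding_coeff > min_inbreeding_coeff_threshold):
        return False
    if (min_hardy_weinberg_threshold is not None
            and not site.p_value_hwe > min_hardy_weinberg_threshold):
        return False
    if apply_hard_filters:
        # Hard-filter bounds as documented: QD >= 2, FS <= 60, MQ >= 30.
        return site.qd >= 2 and site.fs <= 60 and site.mq >= 30
    return True


good = SiteStats(af=0.05, callrate=0.995, inbreeding_coeff=0.01,
                 p_value_hwe=0.3, qd=12.0, fs=3.5, mq=59.0)
rare = SiteStats(af=0.0002, callrate=0.995, inbreeding_coeff=0.01,
                 p_value_hwe=0.3, qd=12.0, fs=3.5, mq=59.0)

print(passes_site_filters(good))               # True
print(passes_site_filters(rare))               # False: AF is below min_af
print(passes_site_filters(rare, min_af=None))  # True: AF threshold disabled
```

Passing None for a threshold turns that single check off without affecting the others, which is the behaviour the Optional[float] parameters describe.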

gnomad.sample_qc.pipeline.get_qc_mt(mt, bi_allelic_only=True, snv_only=True, adj_only=True, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, ld_r2=0.1, filter_lcr=True, filter_decoy=True, filter_segdup=True, filter_exome_low_coverage_regions=False, high_conf_regions=None, checkpoint_path=None, n_partitions=None, block_size=None)

Create a QC-ready MT.

Has options to filter to the following:
  • Variants outside known problematic regions

  • Bi-allelic sites only

  • SNVs only

  • Variants passing hard thresholds

  • Variants passing the set call rate and MAF thresholds

  • Genotypes passing the gnomAD ADJ criteria (GQ>=20, DP>=10, AB>0.2 for hets)

In addition, the MT will be LD-pruned if ld_r2 is set.
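As a rough illustration of the ADJ genotype criteria listed above, the check can be written as a small pure-Python predicate. The is_adj helper and its arguments are hypothetical, and passing the allele balance in as a precomputed value is an assumption; this sketch does not show how the library derives AB from the genotype fields.

```python
from typing import Optional


def is_adj(gq: int, dp: int, is_het: bool, ab: Optional[float] = None) -> bool:
    """Return True if a genotype passes the ADJ thresholds quoted above.

    gq: genotype quality, dp: read depth, ab: allele balance (hets only).
    """
    # GQ >= 20 and DP >= 10 apply to every genotype.
    if gq < 20 or dp < 10:
        return False
    # AB > 0.2 is only required for heterozygous genotypes.
    if is_het:
        return ab is not None and ab > 0.2
    return True


print(is_adj(gq=30, dp=15, is_het=False))          # True
print(is_adj(gq=30, dp=15, is_het=True, ab=0.45))  # True
print(is_adj(gq=30, dp=15, is_het=True, ab=0.10))  # False: skewed allele balance
print(is_adj(gq=15, dp=15, is_het=False))          # False: GQ below 20
```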

Parameters:
  • mt (MatrixTable) – Input MT.

  • bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.

  • snv_only (bool) – Whether to only keep SNVs or include other variant types.

  • adj_only (bool) – If set, only ADJ genotypes are kept. This filter is applied before the call rate and AF calculation.

  • min_af (Optional[float]) – Minimum allele frequency to keep. Not applied if set to None.

  • min_callrate (Optional[float]) – Minimum call rate to keep. Not applied if set to None.

  • min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.

  • min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.

  • apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.

  • ld_r2 (Optional[float]) – Minimum r2 to keep when LD-pruning (set to None for no LD pruning).

  • filter_lcr (bool) – Filter LCR regions.

  • filter_decoy (bool) – Filter decoy regions.

  • filter_segdup (bool) – Filter segmental duplication regions.

  • filter_exome_low_coverage_regions (bool) – If set, only high coverage exome regions (computed from gnomAD) are kept.

  • high_conf_regions (Optional[List[str]]) – If given, the data will be filtered to only include variants in those regions.

  • checkpoint_path (Optional[str]) – If given, the QC MT will be checkpointed to the specified path before running LD pruning. If not specified, persist will be used instead.

  • n_partitions (Optional[int]) – If given, the QC MT will be repartitioned to the specified number of partitions before running LD pruning. checkpoint_path must also be specified as the MT will first be written to the checkpoint_path before being reread with the new number of partitions.

  • block_size (Optional[int]) – If given, set the block size to this value when LD pruning.

Return type:

MatrixTable

Returns:

Filtered MT.

gnomad.sample_qc.pipeline.infer_sex_karyotype(ploidy_ht, f_stat_cutoff=0.5, use_gaussian_mixture_model=False, normal_ploidy_cutoff=5, aneuploidy_cutoff=6, chr_x_frac_hom_alt_expr=None, normal_chr_x_hom_alt_cutoff=5)

Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.

This function uses get_ploidy_cutoffs to determine X and Y ploidy cutoffs and then get_sex_expr to get karyotype annotations from those cutoffs.

By default, f_stat_cutoff is used to roughly split samples into ‘XX’ and ‘XY’ for use in get_ploidy_cutoffs. If use_gaussian_mixture_model is True, a Gaussian mixture model is used to split samples into ‘XX’ and ‘XY’ instead of the f-stat.

Parameters:
  • ploidy_ht (Table) – Input Table with chromosome X and chromosome Y ploidy values and optionally f-stat.

  • f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY are above cutoff. Default is 0.5.

  • use_gaussian_mixture_model (bool) – Use gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat.

  • normal_ploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for XX, XY karyotypes.

  • aneuploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for aneuploidies.

  • chr_x_frac_hom_alt_expr (Optional[NumericExpression]) – Fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.

  • normal_chr_x_hom_alt_cutoff (int) – Number of standard deviations to use when determining cutoffs for the fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X for XX and XY karyotypes. Only used if chr_x_frac_hom_alt_expr is supplied.

Return type:

Table

Returns:

Table of samples’ imputed sex karyotypes.
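The two ideas in this description, a rough f-stat split followed by cutoffs expressed as a number of standard deviations, can be sketched in plain Python. This is a simplified illustration and not what get_ploidy_cutoffs or get_sex_expr actually compute: the rough_split and ploidy_band helpers, the sample dictionaries, the "mean plus or minus n standard deviations" band, and the handling of samples exactly at the cutoff are all assumptions.

```python
from statistics import mean, stdev


def rough_split(samples, f_stat_cutoff=0.5):
    """Split samples into rough XX / XY groups by f-stat.

    XX samples are assumed to fall below the cutoff and XY samples above it;
    placing ties in the XY group is an arbitrary choice for this sketch.
    """
    xx = [s for s in samples if s["f_stat"] < f_stat_cutoff]
    xy = [s for s in samples if s["f_stat"] >= f_stat_cutoff]
    return xx, xy


def ploidy_band(values, n_stdev):
    """Return (lower, upper) as mean -/+ n_stdev sample standard deviations."""
    m, sd = mean(values), stdev(values)
    return m - n_stdev * sd, m + n_stdev * sd


# Hypothetical samples; values are made up for illustration.
samples = [
    {"s": "A", "f_stat": -0.10, "chrX_ploidy": 1.98},
    {"s": "B", "f_stat": 0.05, "chrX_ploidy": 2.02},
    {"s": "C", "f_stat": 0.00, "chrX_ploidy": 2.00},
    {"s": "D", "f_stat": 0.95, "chrX_ploidy": 0.99},
    {"s": "E", "f_stat": 0.98, "chrX_ploidy": 1.01},
    {"s": "F", "f_stat": 0.99, "chrX_ploidy": 1.00},
]

xx, xy = rough_split(samples, f_stat_cutoff=0.5)
# normal_ploidy_cutoff=5 is the documented default number of standard deviations.
xx_low, xx_high = ploidy_band([s["chrX_ploidy"] for s in xx], n_stdev=5)
print([s["s"] for s in xx], [s["s"] for s in xy])
print(round(xx_low, 3), round(xx_high, 3))
```

A sample whose chromosome X ploidy falls outside every such band would be a candidate for an ambiguous or aneuploid call; the wider aneuploidy_cutoff plays that role in the real function.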

gnomad.sample_qc.pipeline.annotate_sex(mtds, is_sparse=True, excluded_intervals=None, included_intervals=None, normalization_contig='chr20', sites_ht=None, aaf_expr=None, gt_expr='GT', f_stat_cutoff=0.5, aaf_threshold=0.001, variants_only_x_ploidy=False, variants_only_y_ploidy=False, variants_filter_lcr=True, variants_filter_segdup=True, variants_filter_decoy=False, variants_snv_only=False, coverage_mt=None, compute_x_frac_variants_hom_alt=False, compute_fstat=True, infer_karyotype=True, use_gaussian_mixture_model=False)

Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy.

Return Table with the following fields:
  • s (str): Sample

  • `normalization_contig`_mean_dp (float32): Sample’s mean coverage over the specified `normalization_contig`.

  • chrX_mean_dp (float32): Sample’s mean coverage over chromosome X.

  • chrY_mean_dp (float32): Sample’s mean coverage over chromosome Y.

  • chrX_ploidy (float32): Sample’s imputed ploidy over chromosome X.

  • chrY_ploidy (float32): Sample’s imputed ploidy over chromosome Y.

If compute_fstat:
  • f_stat (float64): Sample f-stat. Calculated using hl.impute_sex.

  • n_called (int64): Number of variants with a genotype call. Calculated using hl.impute_sex.

  • expected_homs (float64): Expected number of homozygotes. Calculated using hl.impute_sex.

  • observed_homs (int64): Observed number of homozygotes. Calculated using hl.impute_sex.

If infer_karyotype:
  • X_karyotype (str): Sample’s chromosome X karyotype.

  • Y_karyotype (str): Sample’s chromosome Y karyotype.

  • sex_karyotype (str): Sample’s sex karyotype.

Note

In order to infer sex karyotype (infer_karyotype=True), one of compute_fstat or use_gaussian_mixture_model must be set to True.
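The relationship between the mean-depth fields and the ploidy fields can be illustrated with a short sketch. It assumes the normalization contig is diploid, so that ploidy is twice the ratio of sex-chromosome depth to normalization-contig depth. That formula and the estimate_ploidy helper are assumptions made for illustration, not a statement of the library's exact computation.

```python
def estimate_ploidy(sex_chrom_mean_dp: float,
                    normalization_contig_mean_dp: float) -> float:
    """Estimate ploidy from mean depths, assuming a diploid normalization contig."""
    if normalization_contig_mean_dp <= 0:
        raise ValueError("normalization contig mean depth must be positive")
    # Two copies of the autosome yield normalization_contig_mean_dp, so one
    # copy corresponds to half of that depth.
    return 2.0 * sex_chrom_mean_dp / normalization_contig_mean_dp


# Hypothetical depths for a single sample, normalized on chr20.
chr20_mean_dp = 30.0
chrX_ploidy = estimate_ploidy(15.3, chr20_mean_dp)
chrY_ploidy = estimate_ploidy(14.7, chr20_mean_dp)
print(round(chrX_ploidy, 2), round(chrY_ploidy, 2))
```

With these made-up depths both estimates sit close to 1, the pattern expected for an XY sample.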

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset.

  • is_sparse (bool) – Whether input MatrixTable is in sparse data format. Default is True.

  • excluded_intervals (Optional[Table]) – Optional table of intervals to exclude from the computation. This option is currently not implemented for imputing sex chromosome ploidy on a VDS.

  • included_intervals (Optional[Table]) – Optional table of intervals to use in the computation. REQUIRED for exomes.

  • normalization_contig (str) – Which chromosome to use to normalize sex chromosome coverage. Used in determining sex chromosome ploidies. Default is “chr20”.

  • sites_ht (Optional[Table]) – Optional Table of sites and alternate allele frequencies for filtering the input MatrixTable prior to imputing sex.

  • aaf_expr (Optional[str]) – Optional. Name of field in input MatrixTable with alternate allele frequency.

  • gt_expr (str) – Name of entry field storing the genotype. Default is ‘GT’.

  • f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY samples are above cutoff. Default is 0.5.

  • aaf_threshold (float) – Minimum alternate allele frequency to be used in f-stat calculations. Default is 0.001.

  • variants_only_x_ploidy (bool) – Whether to use depth of only variant data for the x ploidy estimation.

  • variants_only_y_ploidy (bool) – Whether to use depth of only variant data for the y ploidy estimation.

  • variants_filter_lcr (bool) – Whether to filter out variants in LCR regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.

  • variants_filter_segdup (bool) – Whether to filter out variants in segdup regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.

  • variants_filter_decoy (bool) – Whether to filter out variants in decoy regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False. Note: this option doesn’t exist for GRCh38.

  • variants_snv_only (bool) – Whether to filter to only single nucleotide variants for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False.

  • coverage_mt (Optional[MatrixTable]) – Optional precomputed coverage MatrixTable to use in reference based VDS ploidy estimation.

  • compute_x_frac_variants_hom_alt (bool) – Whether to return an annotation for the fraction of homozygous alternate variants on chromosome X. Default is False.

  • compute_fstat (bool) – Whether to compute f-stat. Default is True.

  • infer_karyotype (bool) – Whether to infer sex karyotypes. Default is True.

  • use_gaussian_mixture_model (bool) – Whether to use gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat. Default is False.

Return type:

Table

Returns:

Table of samples and their imputed sex karyotypes.
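The f_stat, n_called, expected_homs and observed_homs fields are related through the usual inbreeding-coefficient form, sketched below in plain Python. The f_stat helper is hypothetical, and the sketch takes expected_homs as given; how hl.impute_sex derives it from allele frequencies is not shown here and should be checked against the Hail documentation.

```python
def f_stat(observed_homs: float, expected_homs: float, n_called: float) -> float:
    """Inbreeding coefficient: (observed - expected) / (n_called - expected)."""
    denominator = n_called - expected_homs
    if denominator == 0:
        raise ValueError("n_called must differ from expected_homs")
    return (observed_homs - expected_homs) / denominator


# Hypothetical chromosome X counts for two samples.
# An XY sample has one X copy, so every called genotype looks homozygous.
xy_like = f_stat(observed_homs=1000, expected_homs=700.0, n_called=1000)
# An XX sample shows roughly the expected number of homozygotes.
xx_like = f_stat(observed_homs=700, expected_homs=700.0, n_called=1000)

f_stat_cutoff = 0.5  # documented default
print(xy_like, xy_like > f_stat_cutoff)  # 1.0 True
print(xx_like, xx_like > f_stat_cutoff)  # 0.0 False
```

This is why XX samples are assumed to fall below f_stat_cutoff and XY samples above it.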