gnomad.sample_qc.pipeline
| filter_rows_for_qc | Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds. |
| get_qc_mt | Create a QC-ready MT. |
| infer_sex_karyotype | Create a Table with X_karyotype, Y_karyotype, and sex_karyotype. |
| annotate_sex | Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy. |
- gnomad.sample_qc.pipeline.filter_rows_for_qc(mt, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, bi_allelic_only=True, snv_only=True)
- Annotate rows with sites_callrate, site_inbreeding_coeff and af, then apply thresholds.
- AF and call rate thresholds are taken from gnomAD QC; inbreeding coefficient, MQ, FS and QD filters are taken from GATK best practices.
- Note: This function expects the typical info annotation of type struct with fields MQ, FS and QD if applying hard filters.
- Parameters:
- mt (MatrixTable) – Input MT.
- min_af (Optional[float]) – Minimum site AF to keep. Not applied if set to None.
- min_callrate (Optional[float]) – Minimum site call rate to keep. Not applied if set to None.
- min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.
- min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.
- apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.
- bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.
- snv_only (bool) – Whether to only keep SNVs or include other variant types.
 
- Return type:
- Returns:
- Annotated and filtered table.
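The threshold logic described above can be illustrated with a minimal pure-Python sketch. It does not use Hail and is not the library's implementation: the function name `passes_site_filters`, the dict layout, and the `hwe_p_value` key are hypothetical, and the use of strict `>` for the optional thresholds is an assumption. The hard-filter bounds (QD >= 2, FS <= 60, MQ >= 30) and the "not applied if set to None" behavior come from the parameter descriptions.

```python
from typing import Optional


def passes_site_filters(
    site: dict,
    min_af: Optional[float] = 0.001,
    min_callrate: Optional[float] = 0.99,
    min_inbreeding_coeff_threshold: Optional[float] = -0.8,
    min_hardy_weinberg_threshold: Optional[float] = 1e-8,
    apply_hard_filters: bool = True,
) -> bool:
    """Return True if a site (hypothetical plain-dict layout) passes all enabled filters."""
    # Optional thresholds: a value of None disables the corresponding check.
    # Strict ">" is an assumption made for this sketch.
    if min_af is not None and not site["af"] > min_af:
        return False
    if min_callrate is not None and not site["sites_callrate"] > min_callrate:
        return False
    if (
        min_inbreeding_coeff_threshold is not None
        and not site["site_inbreeding_coeff"] > min_inbreeding_coeff_threshold
    ):
        return False
    if (
        min_hardy_weinberg_threshold is not None
        and not site["hwe_p_value"] > min_hardy_weinberg_threshold
    ):
        return False
    # Hard filters as documented: QD >= 2, FS <= 60 and MQ >= 30.
    if apply_hard_filters:
        info = site["info"]
        if info["QD"] < 2 or info["FS"] > 60 or info["MQ"] < 30:
            return False
    return True


example_site = {
    "af": 0.05,
    "sites_callrate": 0.995,
    "site_inbreeding_coeff": 0.01,
    "hwe_p_value": 0.3,
    "info": {"QD": 12.0, "FS": 3.5, "MQ": 59.0},
}
```

Calling `passes_site_filters(example_site)` keeps the site; passing `min_af=None` skips the AF check entirely, mirroring the documented behavior of the optional parameters.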
 
- gnomad.sample_qc.pipeline.get_qc_mt(mt, bi_allelic_only=True, snv_only=True, adj_only=True, min_af=0.001, min_callrate=0.99, min_inbreeding_coeff_threshold=-0.8, min_hardy_weinberg_threshold=1e-08, apply_hard_filters=True, ld_r2=0.1, filter_lcr=True, filter_decoy=True, filter_segdup=True, filter_exome_low_coverage_regions=False, high_conf_regions=None, checkpoint_path=None, n_partitions=None, block_size=None)
- Create a QC-ready MT.
- Has options to filter to the following:
- Variants outside known problematic regions 
- Bi-allelic sites only 
- SNVs only 
- Variants passing hard thresholds 
- Variants passing the set call rate and MAF thresholds 
- Genotypes passing on gnomAD ADJ criteria (GQ>=20, DP>=10, AB>0.2 for hets) 
 
- In addition, the MT will be LD-pruned if ld_r2 is set.
- Parameters:
- mt (MatrixTable) – Input MT.
- bi_allelic_only (bool) – Whether to only keep bi-allelic sites or include multi-allelic sites too.
- snv_only (bool) – Whether to only keep SNVs or include other variant types.
- adj_only (bool) – If set, only ADJ genotypes are kept. This filter is applied before the call rate and AF calculation.
- min_af (Optional[float]) – Minimum allele frequency to keep. Not applied if set to None.
- min_callrate (Optional[float]) – Minimum call rate to keep. Not applied if set to None.
- min_inbreeding_coeff_threshold (Optional[float]) – Minimum site inbreeding coefficient to keep. Not applied if set to None.
- min_hardy_weinberg_threshold (Optional[float]) – Minimum site HW test p-value to keep. Not applied if set to None.
- apply_hard_filters (bool) – Whether to apply standard GATK default site hard filters: QD >= 2, FS <= 60 and MQ >= 30.
- ld_r2 (Optional[float]) – Minimum r2 to keep when LD-pruning (set to None for no LD pruning).
- filter_lcr (bool) – Filter LCR regions.
- filter_decoy (bool) – Filter decoy regions.
- filter_segdup (bool) – Filter segmental duplication regions.
- filter_exome_low_coverage_regions (bool) – If set, only high coverage exome regions (computed from gnomAD) are kept.
- high_conf_regions (Optional[List[str]]) – If given, the data will be filtered to only include variants in those regions.
- checkpoint_path (Optional[str]) – If given, the QC MT will be checkpointed to the specified path before running LD pruning. If not specified, persist will be used instead.
- n_partitions (Optional[int]) – If given, the QC MT will be repartitioned to the specified number of partitions before running LD pruning. checkpoint_path must also be specified, as the MT will first be written to checkpoint_path before being reread with the new number of partitions.
- block_size (Optional[int]) – If given, set the block size to this value when LD pruning.
 
- Return type:
- Returns:
- Filtered MT. 
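The ADJ genotype criteria listed above (GQ>=20, DP>=10, AB>0.2 for hets) can be sketched in plain Python as follows. This is an illustration only, not the gnomAD implementation: the function name `is_adj` and its argument names are hypothetical, and treating a het genotype with missing allele balance as failing is an assumption of this sketch.

```python
from typing import Optional


def is_adj(gq: int, dp: int, is_het: bool, ab: Optional[float] = None) -> bool:
    """Check one genotype against the ADJ thresholds quoted in the description."""
    # Genotype quality and depth thresholds apply to every genotype.
    if gq < 20 or dp < 10:
        return False
    # Allele balance is only checked for heterozygous genotypes.
    if is_het:
        # Assumption: a het call without an allele balance value fails.
        return ab is not None and ab > 0.2
    return True
```

Because adj_only is applied before the call rate and AF calculation, genotypes failing a check like this would not contribute to either statistic.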
 
- gnomad.sample_qc.pipeline.infer_sex_karyotype(ploidy_ht, f_stat_cutoff=0.5, use_gaussian_mixture_model=False, normal_ploidy_cutoff=5, aneuploidy_cutoff=6, chr_x_frac_hom_alt_expr=None, normal_chr_x_hom_alt_cutoff=5)
- Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.
- This function uses get_ploidy_cutoffs to determine X and Y ploidy cutoffs and then get_sex_expr to get karyotype annotations from those cutoffs.
- By default, f_stat_cutoff will be used to roughly split samples into ‘XX’ and ‘XY’ for use in get_ploidy_cutoffs. If use_gaussian_mixture_model is True, a Gaussian mixture model will be used to split samples into ‘XX’ and ‘XY’ instead of f-stat.
- Parameters:
- ploidy_ht (Table) – Input Table with chromosome X and chromosome Y ploidy values and optionally f-stat.
- f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY samples are above cutoff. Default is 0.5.
- use_gaussian_mixture_model (bool) – Use Gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat.
- normal_ploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for XX, XY karyotypes.
- aneuploidy_cutoff (int) – Number of standard deviations to use when determining sex chromosome ploidy cutoffs for aneuploidies.
- chr_x_frac_hom_alt_expr (Optional[NumericExpression]) – Fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.
- normal_chr_x_hom_alt_cutoff (int) – Number of standard deviations to use when determining cutoffs for the fraction of homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X for XX and XY karyotypes. Only used if chr_x_frac_hom_alt_expr is supplied.
 
- Return type:
- Returns:
- Table of samples’ imputed sex karyotypes.
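The two ideas in the description, a rough XX/XY split on f_stat_cutoff followed by cutoffs expressed as a number of standard deviations, can be sketched in plain Python. This is a simplified illustration and not how get_ploidy_cutoffs is actually implemented: the helper names `split_by_f_stat` and `stdev_band`, the dict layout, the symmetric band around the mean, and the treatment of samples exactly at the cutoff are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def split_by_f_stat(
    samples: List[Dict[str, float]], f_stat_cutoff: float = 0.5
) -> Tuple[List[Dict[str, float]], List[Dict[str, float]]]:
    """Roughly split samples into XX (below cutoff) and XY (at or above cutoff)."""
    xx = [s for s in samples if s["f_stat"] < f_stat_cutoff]
    # Assumption: samples exactly at the cutoff are grouped with XY.
    xy = [s for s in samples if s["f_stat"] >= f_stat_cutoff]
    return xx, xy


def stdev_band(values: List[float], n_stdev: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds n_stdev standard deviations around the mean."""
    center = mean(values)
    spread = pstdev(values)
    return center - n_stdev * spread, center + n_stdev * spread


samples = [
    {"f_stat": 0.05, "chrX_ploidy": 1.9},
    {"f_stat": 0.10, "chrX_ploidy": 2.0},
    {"f_stat": -0.02, "chrX_ploidy": 2.1},
    {"f_stat": 0.95, "chrX_ploidy": 1.0},
    {"f_stat": 0.98, "chrX_ploidy": 1.05},
]
xx, xy = split_by_f_stat(samples)
# Band for the XX group using the default normal_ploidy_cutoff of 5.
xx_lower, xx_upper = stdev_band([s["chrX_ploidy"] for s in xx], n_stdev=5)
```

A larger n_stdev, such as the default aneuploidy_cutoff of 6, yields a wider band than the default normal_ploidy_cutoff of 5 for the same group of samples.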
 
- gnomad.sample_qc.pipeline.annotate_sex(mtds, is_sparse=True, excluded_intervals=None, included_intervals=None, normalization_contig='chr20', sites_ht=None, aaf_expr=None, gt_expr='GT', f_stat_cutoff=0.5, aaf_threshold=0.001, variants_only_x_ploidy=False, variants_only_y_ploidy=False, variants_filter_lcr=True, variants_filter_segdup=True, variants_filter_decoy=False, variants_snv_only=False, coverage_mt=None, compute_x_frac_variants_hom_alt=False, compute_fstat=True, infer_karyotype=True, use_gaussian_mixture_model=False)
- Impute sample sex based on X-chromosome heterozygosity and sex chromosome ploidy.
- Return Table with the following fields:
- s (str): Sample 
- {normalization_contig}_mean_dp (float32): Sample’s mean coverage over the specified normalization_contig (for example, chr20_mean_dp with the default).
- chrX_mean_dp (float32): Sample’s mean coverage over chromosome X. 
- chrY_mean_dp (float32): Sample’s mean coverage over chromosome Y. 
- chrX_ploidy (float32): Sample’s imputed ploidy over chromosome X. 
- chrY_ploidy (float32): Sample’s imputed ploidy over chromosome Y. 
 - If compute_fstat:
- f_stat (float64): Sample f-stat. Calculated using hl.impute_sex. 
- n_called (int64): Number of variants with a genotype call. Calculated using hl.impute_sex. 
- expected_homs (float64): Expected number of homozygotes. Calculated using hl.impute_sex. 
- observed_homs (int64): Observed number of homozygotes. Calculated using hl.impute_sex. 
 
- If infer_karyotype:
- X_karyotype (str): Sample’s chromosome X karyotype. 
- Y_karyotype (str): Sample’s chromosome Y karyotype. 
- sex_karyotype (str): Sample’s sex karyotype. 
 
 
- Note: In order to infer sex karyotype (infer_karyotype=True), one of compute_fstat or use_gaussian_mixture_model must be set to True.
- Parameters:
- mtds (Union[MatrixTable, VariantDataset]) – Input MatrixTable or VariantDataset.
- is_sparse (bool) – Whether input MatrixTable is in sparse data format. Default is True.
- excluded_intervals (Optional[Table]) – Optional table of intervals to exclude from the computation. This option is currently not implemented for imputing sex chromosome ploidy on a VDS.
- included_intervals (Optional[Table]) – Optional table of intervals to use in the computation. REQUIRED for exomes.
- normalization_contig (str) – Which chromosome to use to normalize sex chromosome coverage. Used in determining sex chromosome ploidies. Default is “chr20”.
- sites_ht (Optional[Table]) – Optional Table of sites and alternate allele frequencies for filtering the input MatrixTable prior to imputing sex.
- aaf_expr (Optional[str]) – Optional. Name of field in input MatrixTable with alternate allele frequency.
- gt_expr (str) – Name of entry field storing the genotype. Default is ‘GT’.
- f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY samples are above cutoff. Default is 0.5.
- aaf_threshold (float) – Minimum alternate allele frequency to be used in f-stat calculations. Default is 0.001.
- variants_only_x_ploidy (bool) – Whether to use depth of only variant data for the X ploidy estimation.
- variants_only_y_ploidy (bool) – Whether to use depth of only variant data for the Y ploidy estimation.
- variants_filter_lcr (bool) – Whether to filter out variants in LCR regions for variants-only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.
- variants_filter_segdup (bool) – Whether to filter out variants in segdup regions for variants-only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.
- variants_filter_decoy (bool) – Whether to filter out variants in decoy regions for variants-only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False. Note: this option doesn’t exist for GRCh38.
- variants_snv_only (bool) – Whether to filter to only single nucleotide variants for variants-only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False.
- coverage_mt (Optional[MatrixTable]) – Optional precomputed coverage MatrixTable to use in reference-based VDS ploidy estimation.
- compute_x_frac_variants_hom_alt (bool) – Whether to return an annotation for the fraction of homozygous alternate variants on chromosome X. Default is False.
- compute_fstat (bool) – Whether to compute f-stat. Default is True.
- infer_karyotype (bool) – Whether to infer sex karyotypes. Default is True.
- use_gaussian_mixture_model (bool) – Whether to use Gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat. Default is False.
 
- Return type:
- Returns:
- Table of samples and their imputed sex karyotypes.
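As a rough illustration of how the mean-depth fields relate to the ploidy fields, the sketch below normalizes sex chromosome coverage by the coverage of the normalization contig. The formula assumes the normalization contig is diploid and is an assumption of this sketch, not something stated in the documentation above; the function names `estimate_ploidy` and `check_karyotype_args` are hypothetical. The second helper simply encodes the documented requirement on infer_karyotype.

```python
def estimate_ploidy(chrom_mean_dp: float, normalization_contig_mean_dp: float) -> float:
    """Estimate ploidy by scaling coverage against a contig assumed to be diploid."""
    if normalization_contig_mean_dp <= 0:
        raise ValueError("normalization contig mean depth must be positive")
    # Assumption: the normalization contig (e.g. chr20) has two copies.
    return 2.0 * chrom_mean_dp / normalization_contig_mean_dp


def check_karyotype_args(
    infer_karyotype: bool, compute_fstat: bool, use_gaussian_mixture_model: bool
) -> None:
    """Raise if karyotype inference is requested without a way to split XX and XY."""
    if infer_karyotype and not (compute_fstat or use_gaussian_mixture_model):
        raise ValueError(
            "infer_karyotype=True requires compute_fstat or "
            "use_gaussian_mixture_model to be True"
        )
```

Under this assumption, a sample with chrX_mean_dp of 15 and a chr20 mean depth of 30 would get a chrX ploidy estimate of 1.0, while equal depths would give 2.0.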