gnomad_qc.v4.sample_qc.sex_inference

Script to impute chromosomal sex karyotype annotation.

usage: gnomad_qc.v4.sample_qc.sex_inference.py [-h] [--overwrite] [--test]
                                               [--slack-channel SLACK_CHANNEL]
                                               [--determine-fstat-sites]
                                               [--min-callrate MIN_CALLRATE]
                                               [--approx-af-and-no-callrate]
                                               [--fstat-n-partitions FSTAT_N_PARTITIONS]
                                               [--sex-imputation-interval-qc]
                                               [--read-sex-imputation-coverage-mt-if-exists]
                                               [--normalization-contig NORMALIZATION_CONTIG]
                                               [--mean-dp-thresholds MEAN_DP_THRESHOLDS [MEAN_DP_THRESHOLDS ...]]
                                               [--interval-qc-n-partitions INTERVAL_QC_N_PARTITIONS]
                                               [--impute-sex-ploidy]
                                               [--f-stat-ukb-var]
                                               [--min-af MIN_AF]
                                               [--f-stat-cutoff F_STAT_CUTOFF]
                                               [--high-qual-intervals | --high-qual-per-platform | --high-qual-all-platforms]
                                               [--min-platform-size MIN_PLATFORM_SIZE]
                                               [--variant-depth-only-x-ploidy]
                                               [--variant-depth-only-y-ploidy]
                                               [--omit-variant-depth-ploidy-lcr-filter]
                                               [--omit-variant-depth-ploidy-segdup-filter]
                                               [--variant-depth-ploidy-snv-only]
                                               [--omit-compute-x-frac-variants-hom-alt]
                                               [--high-qual-by-mean-fraction-over-dp-0 | --high-qual-by-fraction-samples-over-cov]
                                               [--sex-mean-fraction-over-dp-0 SEX_MEAN_FRACTION_OVER_DP_0]
                                               [--norm-mean-fraction-over-dp-0 NORM_MEAN_FRACTION_OVER_DP_0]
                                               [--x-cov X_COV] [--y-cov Y_COV]
                                               [--norm-cov NORM_COV]
                                               [--fraction-samples-x FRACTION_SAMPLES_X]
                                               [--fraction-samples-y FRACTION_SAMPLES_Y]
                                               [--fraction-samples-norm FRACTION_SAMPLES_NORM]
                                               [--annotate-sex-karyotype]
                                               [--use-gmm-for-ploidy-cutoffs]
                                               [--apply-x-frac-hom-alt-cutoffs]
                                               [--per-platform]
                                               [--sex-karyotype-cutoffs SEX_KARYOTYPE_CUTOFFS]

Named Arguments

--overwrite

Overwrite output files.

Default: False

--test

Test the pipeline using the gnomAD v4 test dataset.

Default: False

--slack-channel

Slack channel to post results and notifications to.

Determine f-stat sites

Arguments used for determining sites to use for f-stat calculations.

--determine-fstat-sites

Create Table of common (> value specified by ‘--min-af’), bi-allelic SNPs on chromosome X for f-stat calculations. Additionally filter to high callrate (> value specified by ‘--min-callrate’) variants if ‘--approx-af-and-no-callrate’ is not used. NOTE: This requires a densify of chrX!

Default: False

--min-callrate

Minimum variant callrate.

Default: 0.99

--approx-af-and-no-callrate

Whether to approximate allele frequency with AC/(n_samples * 2) and use no callrate cutoff for determination of f-stat sites.

Default: False

--fstat-n-partitions

Number of desired partitions for the f-stat sites output Table.

Default: 1000

Sex chromosome interval QC

Arguments used for making an interval QC HT from the sex imputation interval coverage MT.

--sex-imputation-interval-qc

Create a Table of the fraction of samples per interval and per platform with mean DP over thresholds specified by ‘--mean-dp-thresholds’.

Default: False

--read-sex-imputation-coverage-mt-if-exists

Whether to use the sex imputation coverage MT if it already exists rather than remaking this intermediate temporary file.

Default: False

--normalization-contig

Which autosomal chromosome to use for normalizing the coverage of chromosomes X and Y.

Default: “chr20”

--mean-dp-thresholds

List of mean DP cutoffs to use for determining the fraction of samples with mean coverage >= the cutoff for each interval.

Default: [5, 10, 15, 20, 25]
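
As an illustration of the statistic described for ‘--mean-dp-thresholds’, the following minimal sketch computes, for a single interval, the fraction of samples whose mean DP is at or above each cutoff. It uses plain Python lists rather than the Hail MatrixTable the script operates on; the function name and the sample values are hypothetical.

```python
from typing import Dict, List


def fraction_over_thresholds(
    sample_mean_dp: List[float], thresholds: List[int]
) -> Dict[int, float]:
    """Return, per threshold, the fraction of samples with mean DP >= threshold."""
    n_samples = len(sample_mean_dp)
    return {
        t: sum(dp >= t for dp in sample_mean_dp) / n_samples for t in thresholds
    }


# Hypothetical mean DP of four samples over one interval.
interval_mean_dp = [4.0, 12.0, 18.0, 30.0]
fractions = fraction_over_thresholds(interval_mean_dp, [5, 10, 15, 20, 25])
```

Each key of the result corresponds to one entry of the threshold list, which is why the coverage levels passed later via ‘--x-cov’, ‘--y-cov’, and ‘--norm-cov’ must be among the thresholds used here.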

--interval-qc-n-partitions

Number of desired partitions for the sex imputation interval QC output Table.

Default: 500

Impute sex ploidy

Arguments used for imputing sex chromosome ploidy.

--impute-sex-ploidy

Run sex chromosome ploidy imputation.

Default: False

--normalization-contig

Which autosomal chromosome to use for normalizing the coverage of chromosomes X and Y.

Default: “chr20”

--read-sex-imputation-coverage-mt-if-exists

Whether to use the sex imputation coverage MT if it already exists rather than remaking this intermediate temporary file.

Default: False

--f-stat-ukb-var

Whether to use UK Biobank high callrate (0.99) and common variants (UKB allele frequency > value specified by ‘--min-af’) for f-stat computation instead of the sites determined by ‘--determine-fstat-sites’.

Default: False

--min-af

Minimum variant allele frequency to retain variant.

Default: 0.001

--f-stat-cutoff

Cutoff for f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY are above cutoff.

Default: -1.0
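
The sketch below illustrates how an f-stat can be used for the rough ‘XX’/‘XY’ split. It assumes the f-stat is the inbreeding coefficient F = (O - E) / (N - E), where O is the observed number of homozygous calls, E the expected number given the allele frequencies, and N the number of called sites. The function names, the genotype encoding, and the cutoff of 0.5 used in the example are hypothetical and are not taken from the script.

```python
from typing import List, Tuple


def f_stat(genotypes_and_af: List[Tuple[int, float]]) -> float:
    """Inbreeding coefficient F = (O - E) / (N - E) over called chrX sites.

    Each element is (n_alt_alleles, alt_allele_frequency) for one called site,
    where n_alt_alleles is 0, 1, or 2.
    """
    n_called = len(genotypes_and_af)
    observed_homs = sum(1 for gt, _ in genotypes_and_af if gt != 1)
    expected_homs = sum(1 - 2 * af * (1 - af) for _, af in genotypes_and_af)
    return (observed_homs - expected_homs) / (n_called - expected_homs)


def rough_split(f: float, cutoff: float) -> str:
    """Assign 'XX' below the cutoff and 'XY' above it (boundary handling is assumed)."""
    return "XY" if f > cutoff else "XX"


# Hypothetical samples: only homozygous calls vs. a mix with heterozygous calls.
xy_like = [(0, 0.5), (2, 0.5), (0, 0.5), (2, 0.5)]
xx_like = [(1, 0.5), (0, 0.5), (1, 0.5), (2, 0.5)]
labels = [rough_split(f_stat(s), cutoff=0.5) for s in (xy_like, xx_like)]
```

Samples with a single X chromosome show no heterozygous calls outside the PAR regions, so their F is high, whereas samples with two X chromosomes have F closer to zero.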

--high-qual-intervals

Whether to filter to high quality intervals for the sex ploidy imputation. Can’t be used at the same time as ‘--high-qual-per-platform’ or ‘--high-qual-all-platforms’.

Default: False

--high-qual-per-platform

Whether to filter to per platform high quality intervals for the sex ploidy imputation. Can’t be used at the same time as ‘--high-qual-intervals’ or ‘--high-qual-all-platforms’.

Default: False

--high-qual-all-platforms

Whether to filter to high quality intervals for the sex ploidy imputation. Use only intervals that are considered high quality across all platforms. Can’t be used at the same time as ‘--high-qual-intervals’ or ‘--high-qual-per-platform’.

Default: False
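
The three high quality interval options are shown as alternatives in the usage string. One way to express such a constraint with the standard library is an argparse mutually exclusive group, sketched below. How the script's own parser is constructed is not shown on this page, so this is an assumption based on the usage string rather than a copy of the implementation.

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
for flag in (
    "--high-qual-intervals",
    "--high-qual-per-platform",
    "--high-qual-all-platforms",
):
    group.add_argument(flag, action="store_true")

# A single flag from the group parses normally.
args = parser.parse_args(["--high-qual-per-platform"])

# Supplying two flags from the group makes argparse exit with a usage error.
try:
    parser.parse_args(["--high-qual-intervals", "--high-qual-all-platforms"])
    conflict_rejected = False
except SystemExit:
    conflict_rejected = True
```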

--min-platform-size

Required size of a platform to be considered in ‘--high-qual-all-platforms’. Only platforms with more than ‘min_platform_size’ samples are used to determine intervals that are high quality across all platforms.

Default: 100
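
A minimal sketch of the selection described for ‘--high-qual-all-platforms’ and ‘--min-platform-size’ follows. It works on plain dictionaries and sets instead of Hail Tables; the function name, platform names, and interval labels are hypothetical.

```python
from typing import Dict, Set


def high_qual_across_platforms(
    platform_sizes: Dict[str, int],
    pass_by_platform: Dict[str, Set[str]],
    min_platform_size: int = 100,
) -> Set[str]:
    """Intersect passing intervals over platforms with > min_platform_size samples."""
    kept = [p for p, n in platform_sizes.items() if n > min_platform_size]
    if not kept:
        return set()
    return set.intersection(*(pass_by_platform[p] for p in kept))


# Hypothetical platforms; platform_2 is too small to be considered.
sizes = {"platform_0": 5000, "platform_1": 250, "platform_2": 40}
passing = {
    "platform_0": {"interval_a", "interval_b", "interval_c"},
    "platform_1": {"interval_b", "interval_c"},
    "platform_2": {"interval_c"},
}
high_qual = high_qual_across_platforms(sizes, passing)
```

Excluding small platforms keeps a handful of poorly covered samples from removing intervals that are well covered everywhere else.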

--variant-depth-only-x-ploidy

Whether to use depth of variant data for the x ploidy estimation instead of the default behavior that will use reference blocks.

Default: False

--variant-depth-only-y-ploidy

Whether to use depth of variant data for the y ploidy estimation instead of the default behavior that will use reference blocks.

Default: False

--omit-variant-depth-ploidy-lcr-filter

Whether to omit filtering out variants in LCR regions for the variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X.

Default: False

--omit-variant-depth-ploidy-segdup-filter

Whether to omit filtering out variants in segdup regions for the variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X.

Default: False

--variant-depth-ploidy-snv-only

Whether to filter to only single nucleotide variants for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X.

Default: False

--omit-compute-x-frac-variants-hom-alt

Whether to omit the computation of the fraction of homozygous alternate variants on chromosome X.

Default: False

--high-qual-by-mean-fraction-over-dp-0

Whether to use the mean fraction of bases over DP 0 to determine high quality intervals. Can’t be set at the same time as ‘--high-qual-by-fraction-samples-over-cov’.

Default: False

--high-qual-by-fraction-samples-over-cov

Whether to determine high quality intervals using the fraction of samples with a mean interval coverage over a specified coverage level for chrX (--x-cov), chrY (--y-cov), and the normalization contig (--norm-cov). Can’t be set at the same time as ‘--high-qual-by-mean-fraction-over-dp-0’.

Default: False

--sex-mean-fraction-over-dp-0

Mean fraction of bases over DP 0 used to define high quality intervals on sex chromosomes.

Default: 0.4

--norm-mean-fraction-over-dp-0

Mean fraction of bases over DP 0 used to define high quality intervals on the normalization chromosome.

Default: 0.99

--x-cov

Mean coverage level used to define high quality intervals on chromosome X. Aggregate mean for this coverage level must be in the sex chromosome interval QC HT (must be value in ‘--mean-dp-thresholds’ list used to create the QC HT)!

Default: 10

--y-cov

Mean coverage level used to define high quality intervals on chromosome Y. Aggregate mean for this coverage level must be in the sex chromosome interval QC HT (must be value in ‘--mean-dp-thresholds’ list used to create the QC HT)!

Default: 5

--norm-cov

Mean coverage level used to define high quality intervals on the normalization autosome. Aggregate mean for this coverage level must be in the sex chromosome interval QC HT (must be value in ‘--mean-dp-thresholds’ list used to create the QC HT)!

Default: 20

--fraction-samples-x

Fraction of samples at the coverage specified by ‘--x-cov’ used to determine high quality intervals on chromosome X.

Default: 0.8

--fraction-samples-y

Fraction of samples at the coverage specified by ‘--y-cov’ used to determine high quality intervals on chromosome Y.

Default: 0.35

--fraction-samples-norm

Fraction of samples at the coverage specified by ‘--norm-cov’ used to determine high quality intervals on the normalization chromosome specified by ‘--normalization-contig’.

Default: 0.85
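
The coverage and fraction arguments above combine into a per-interval pass/fail decision. The sketch below shows one plausible reading using the documented defaults. The function name, the "chrX"/"chrY"/"norm" labels, and the use of a strict ">" comparison are assumptions made for illustration.

```python
from typing import Dict

# Defaults taken from the argument descriptions above.
COV = {"chrX": 10, "chrY": 5, "norm": 20}
FRACTION = {"chrX": 0.8, "chrY": 0.35, "norm": 0.85}


def pass_interval_qc(contig_class: str, fraction_over_cov: Dict[int, float]) -> bool:
    """Return True if enough samples reach the contig's coverage level.

    fraction_over_cov maps a mean DP threshold to the fraction of samples whose
    mean DP over the interval is at or above that threshold.
    """
    return fraction_over_cov[COV[contig_class]] > FRACTION[contig_class]


# Hypothetical interval QC rows, keyed by mean DP threshold.
chrx_interval = {5: 0.95, 10: 0.85, 15: 0.60, 20: 0.30, 25: 0.10}
chry_interval = {5: 0.30, 10: 0.10, 15: 0.02, 20: 0.00, 25: 0.00}
results = (
    pass_interval_qc("chrX", chrx_interval),
    pass_interval_qc("chrY", chry_interval),
)
```

The lookup by coverage level fails with a KeyError when the requested level is missing, which mirrors the requirement that each coverage value be present in the list used to create the interval QC HT.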

Annotate sex karyotype

Arguments used for annotating sex karyotype.

--annotate-sex-karyotype

Run sex karyotype inference.

Default: False

--use-gmm-for-ploidy-cutoffs

Whether to use Gaussian mixture model to roughly split samples into ‘XX’ and ‘XY’ instead of f-stat.

Default: False

--apply-x-frac-hom-alt-cutoffs

Whether to apply ‘XX’ and ‘XY’ cutoffs for the fraction of homozygous alternate genotypes on chromosome X and use them to infer sex karyotype.

Default: False

--per-platform

Whether to run the karyotype inference per platform.

Default: False

--sex-karyotype-cutoffs

Optional path to JSON file containing sex karyotype X and Y ploidy cutoffs to use for karyotype annotation instead of inferring cutoffs. If ‘--apply-x-frac-hom-alt-cutoffs’ is used, this file must also include cutoffs for the fraction of homozygous alternate genotypes on chromosome X.

Module Functions

gnomad_qc.v4.sample_qc.sex_inference.determine_fstat_sites(vds)

Write a Table with chromosome X SNPs that are bi-allelic, common, and high callrate by default.

gnomad_qc.v4.sample_qc.sex_inference.load_platform_ht([...])

Load platform assignment Table or test Table and return an error if requested Table does not exist.

gnomad_qc.v4.sample_qc.sex_inference.prepare_sex_imputation_coverage_mt([...])

Prepare the sex imputation coverage MT.

gnomad_qc.v4.sample_qc.sex_inference.compute_sex_ploidy(...)

Impute sex chromosome ploidy, and optionally chrX heterozygosity and fraction homozygous alternate variants on chrX.

gnomad_qc.v4.sample_qc.sex_inference.annotate_sex_karyotype_from_ploidy_cutoffs(...)

Determine sex karyotype annotation based on chromosome X and chromosome Y ploidy estimates and ploidy cutoffs.

gnomad_qc.v4.sample_qc.sex_inference.infer_sex_karyotype_from_ploidy(...)

Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.

gnomad_qc.v4.sample_qc.sex_inference.reformat_ploidy_cutoffs_for_json(ht)

Format x_ploidy_cutoffs and y_ploidy_cutoffs global annotations for JSON export.

gnomad_qc.v4.sample_qc.sex_inference.main(args)

Impute chromosomal sex karyotype annotation.

gnomad_qc.v4.sample_qc.sex_inference.get_script_argument_parser()

Get script argument parser.

gnomad_qc.v4.sample_qc.sex_inference.determine_fstat_sites(vds, approx_af_and_no_callrate=False, min_af=0.001, min_callrate=0.99)

Write a Table with chromosome X SNPs that are bi-allelic, common, and high callrate by default.

This Table is designed to be used as a variant filter in sex imputation for f-stat computation.

Warning

By default approx_af_and_no_callrate is False and the final Table will be filtered to high callrate (> value specified by min_callrate) variants. This requires a densify of chrX!

Note

If approx_af_and_no_callrate is True, allele frequency is approximated with AC/(n_samples * 2) and no callrate filter is used.

Parameters:
  • vds (VariantDataset) – Input VariantDataset.

  • approx_af_and_no_callrate (bool) – Whether to approximate allele frequency with AC/(n_samples * 2) and use no callrate cutoff to filter sites.

  • min_af (float) – Minimum alternate allele frequency cutoff used to filter sites.

  • min_callrate (float) – Minimum callrate cutoff used to filter sites.

Return type:

Table

Returns:

Table of chromosome X sites to be used for f-stat computation.
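
The approximation mentioned in the note for determine_fstat_sites, AC/(n_samples * 2), can be sketched as follows. The helper names and the example counts are hypothetical, and the strict ">" comparison against min_af is assumed from the wording of the command line help.

```python
def approx_af(ac: int, n_samples: int) -> float:
    """Approximate alternate allele frequency as AC / (n_samples * 2)."""
    return ac / (n_samples * 2)


def keep_site(ac: int, n_samples: int, min_af: float = 0.001) -> bool:
    """Keep a site when its approximate allele frequency exceeds min_af."""
    return approx_af(ac, n_samples) > min_af


# Hypothetical allele counts in a cohort of 10,000 samples.
common_site = keep_site(ac=30, n_samples=10_000)  # approximate AF 0.0015
rare_site = keep_site(ac=10, n_samples=10_000)  # approximate AF 0.0005
```

Because the denominator uses the total sample count rather than the number of called alleles, this approximation needs no per-site callrate, which is consistent with the note that no callrate filter is applied in this mode.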

gnomad_qc.v4.sample_qc.sex_inference.load_platform_ht(test=False, calling_interval_name='intersection', calling_interval_padding=50)

Load platform assignment Table or test Table and return an error if requested Table does not exist.

Note

If test is True and the test platform assignment Table does not exist, the function will load the final platform assignment Table instead if it already exists.

Parameters:
  • test (bool) – Whether a test platform assignment Table should be loaded.

  • calling_interval_name (str) – Name of calling intervals to use for interval coverage. One of: ‘ukb’, ‘broad’, or ‘intersection’. Only used if test is True.

  • calling_interval_padding (int) – Number of base pairs of padding to use on the calling intervals. One of 0 or 50 bp. Only used if test is True.

Return type:

Table

Returns:

Platform assignment Table.

gnomad_qc.v4.sample_qc.sex_inference.prepare_sex_imputation_coverage_mt(normalization_contig='chr20', test=False, read_if_exists=False)

Prepare the sex imputation coverage MT.

Filter the full interval coverage MatrixTable to the specified normalization contig and hard filtered samples (before sex hard filter) and union it with the sex coverage MatrixTable after excluding intervals that overlap PAR regions.

Parameters:
  • normalization_contig (str) – Which autosomal chromosome to use for normalizing the coverage of chromosomes X and Y. Default is ‘chr20’.

  • test (bool) – Whether to use gnomAD v4 test dataset. Default is False.

  • read_if_exists (bool) – Whether to use the sex imputation coverage MT if it already exists rather than remaking this intermediate temporary file. Default is False.

Return type:

MatrixTable

Returns:

Interval coverage MatrixTable for sex imputation.

gnomad_qc.v4.sample_qc.sex_inference.compute_sex_ploidy(vds, coverage_mt, interval_qc_ht=None, high_qual_per_platform=False, platform_ht=None, normalization_contig='chr20', variant_depth_only_x_ploidy=False, variant_depth_only_y_ploidy=False, variant_depth_only_ploidy_filter_lcr=True, variant_depth_only_ploidy_filter_segdup=True, variant_depth_only_ploidy_snv_only=False, compute_x_frac_variants_hom_alt=True, freq_ht=None, min_af=0.001, f_stat_cutoff=-1.0)

Impute sex chromosome ploidy, and optionally chrX heterozygosity and fraction homozygous alternate variants on chrX.

This function imputes sex chromosome ploidy from a VDS and a sex inference specific coverage MT (created by prepare_sex_imputation_coverage_mt).

With no additional parameters passed, chrX and chrY ploidy will be imputed using Hail’s hail.vds.impute_sex_chromosome_ploidy method which computes chromosome ploidy using reference block DP per calling interval (using intervals in coverage_mt). This method breaks up the reference blocks at the calling interval boundaries, maintaining all reference block END information for the mean DP per interval computation.

There is also the option to impute ploidy using mean variant depth within the specified calling intervals instead of using reference block depths. This can be defined differently for chrX and chrY using variant_depth_only_x_ploidy and variant_depth_only_y_ploidy.

If an interval_qc_ht Table is supplied, only high quality intervals will be used in sex chromosome ploidy imputation. High quality intervals are defined by ‘interval_qc_ht.pass_interval_qc’.

If high_qual_per_platform is True, interval_qc_ht and platform_ht must be supplied, and ‘interval_qc_ht.pass_interval_qc’ should be a Struct with one BooleanExpression per platform.

Parameters:
  • vds (VariantDataset) – Input VDS for use in sex inference.

  • coverage_mt (MatrixTable) – Input sex inference specific interval coverage MatrixTable.

  • interval_qc_ht (Optional[Table]) – Optional interval QC Table to use for filtering to high quality intervals before sex ploidy imputation.

  • high_qual_per_platform (bool) – Whether to filter to per platform high quality intervals for the sex ploidy imputation. Default is False.

  • platform_ht (Optional[Table]) – Input platform assignment Table. This is only needed if high_qual_per_platform is True.

  • normalization_contig (str) – Which autosomal chromosome to use for normalizing the coverage of chromosomes X and Y. Default is ‘chr20’.

  • variant_depth_only_x_ploidy (bool) – Whether to use depth of variant data within calling intervals instead of reference data for chrX ploidy estimation. Default will only use reference data.

  • variant_depth_only_y_ploidy (bool) – Whether to use depth of variant data within calling intervals instead of reference data for chrY ploidy estimation. Default will only use reference data.

  • variant_depth_only_ploidy_filter_lcr (bool) – Whether to filter out variants in LCR regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.

  • variant_depth_only_ploidy_filter_segdup (bool) – Whether to filter out variants in segdup regions for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is True.

  • variant_depth_only_ploidy_snv_only (bool) – Whether to filter to only single nucleotide variants for variants only ploidy estimation and fraction of homozygous alternate variants on chromosome X. Default is False.

  • compute_x_frac_variants_hom_alt (bool) – Whether to return an annotation for the fraction of homozygous alternate variants on chromosome X. Default is True.

  • freq_ht (Optional[Table]) – Optional Table to use for f-stat allele frequency cutoff. The input VDS is filtered to sites in this Table prior to running Hail’s impute_sex module, and alternate allele frequency is used from this Table with a min_af cutoff.

  • min_af (float) – Minimum alternate allele frequency to be used in f-stat calculations. Default is 0.001.

  • f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY are above cutoff. Default is -1.0.

Return type:

Table

Returns:

Table with imputed ploidies.
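
To make the role of normalization_contig concrete, the sketch below estimates ploidy as twice the ratio of mean depth on a sex chromosome to mean depth on the normalization contig. This reflects a common depth-based approach and is an assumption about the general idea, not a description of the exact computation in compute_sex_ploidy or Hail. It also ignores interval lengths, which a real implementation may weight by. All names and depth values are hypothetical.

```python
from statistics import mean
from typing import List


def estimate_ploidy(
    sex_chrom_interval_dp: List[float], norm_interval_dp: List[float]
) -> float:
    """Estimate ploidy as 2 * mean(sex chromosome DP) / mean(normalization DP)."""
    return 2 * mean(sex_chrom_interval_dp) / mean(norm_interval_dp)


# Hypothetical per-interval mean depths for one sample.
chr20_dp = [30.0, 30.0, 30.0]  # diploid normalization contig
chrx_dp_one_copy = [15.0, 15.0]
chrx_dp_two_copies = [30.0, 30.0]

ploidy_one_copy = estimate_ploidy(chrx_dp_one_copy, chr20_dp)
ploidy_two_copies = estimate_ploidy(chrx_dp_two_copies, chr20_dp)
```

Restricting the input intervals to those passing interval QC, as described for interval_qc_ht, changes which depths enter the two means but not the form of the ratio.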

gnomad_qc.v4.sample_qc.sex_inference.annotate_sex_karyotype_from_ploidy_cutoffs(ploidy_ht, sex_karyotype_ploidy_cutoffs, per_platform=False, apply_x_frac_hom_alt_cutoffs=False)

Determine sex karyotype annotation based on chromosome X and chromosome Y ploidy estimates and ploidy cutoffs.

ploidy_ht must include the following annotations:

  • chrX_ploidy: chromosome X ploidy estimate

  • chrY_ploidy: chromosome Y ploidy estimate

The expected format of sex_karyotype_ploidy_cutoffs is one of:

  • If per_platform is False:

{
    "x_ploidy_cutoffs": {
        "upper_cutoff_X": 2.6,
        "lower_cutoff_XX": 1.9,
        "upper_cutoff_XX": 6.8,
        "lower_cutoff_XXX": 6.6
    },
    "y_ploidy_cutoffs": {
        "lower_cutoff_Y": 0.2,
        "upper_cutoff_Y": 1.3,
        "lower_cutoff_YY": 1.4
    }
}
  • If per_platform is True:

{
    "x_ploidy_cutoffs": {
        "platform_1": {
            "upper_cutoff_X": 2.6,
            "lower_cutoff_XX": 1.9,
            "upper_cutoff_XX": 6.2,
            "lower_cutoff_XXX": 6.6
        },
        "platform_0": {
            "upper_cutoff_X": 1.6,
            "lower_cutoff_XX": 1.5,
            "upper_cutoff_XX": 3.3,
            "lower_cutoff_XXX": 3.5
        },
        ...
    },
    "y_ploidy_cutoffs": {
        "platform_1": {
            "lower_cutoff_Y": 0.2,
            "upper_cutoff_Y": 1.3,
            "lower_cutoff_YY": 1.4
        },
        "platform_0": {
            "lower_cutoff_Y": 0.1,
            "upper_cutoff_Y": 1.2,
            "lower_cutoff_YY": 1.1
        },
        ...
    }
}

Returns a Table with the following annotations:

  • X_karyotype: Sample assigned X karyotype.

  • Y_karyotype: Sample assigned Y karyotype.

  • sex_karyotype: Combination of X_karyotype and Y_karyotype.

Parameters:
  • ploidy_ht (Table) – Table with chromosome X and chromosome Y ploidies.

  • sex_karyotype_ploidy_cutoffs (Union[Dict[str, Dict[str, Dict[str, float]]], Dict[str, Dict[str, float]]]) – Dictionary of sex karyotype ploidy cutoffs.

  • per_platform (bool) – Whether the sex_karyotype_ploidy_cutoffs should be applied per platform.

  • apply_x_frac_hom_alt_cutoffs (bool) – Whether to apply cutoffs for the fraction homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.

Return type:

Table

Returns:

Sex karyotype Table.
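
The sketch below shows how cutoffs in the non-per-platform format could be applied to one sample's ploidy estimates. The comparison order, the strictness of each comparison, and the "ambiguous" label are assumptions made for illustration; the cutoff values are hypothetical and the function names are not part of the module.

```python
from typing import Dict


def x_karyotype(x_ploidy: float, c: Dict[str, float]) -> str:
    """Assign an X karyotype from chrX ploidy and x_ploidy_cutoffs."""
    if x_ploidy < c["upper_cutoff_X"]:
        return "X"
    if c["lower_cutoff_XX"] < x_ploidy < c["upper_cutoff_XX"]:
        return "XX"
    if x_ploidy >= c["lower_cutoff_XXX"]:
        return "XXX"
    return "ambiguous"


def y_karyotype(y_ploidy: float, c: Dict[str, float]) -> str:
    """Assign a Y karyotype from chrY ploidy and y_ploidy_cutoffs."""
    if y_ploidy < c["lower_cutoff_Y"]:
        return ""
    if c["lower_cutoff_Y"] < y_ploidy < c["upper_cutoff_Y"]:
        return "Y"
    if y_ploidy >= c["lower_cutoff_YY"]:
        return "YY"
    return "ambiguous"


def sex_karyotype(
    x_ploidy: float, y_ploidy: float, cutoffs: Dict[str, Dict[str, float]]
) -> str:
    """Combine X and Y karyotypes; any ambiguous part makes the result ambiguous."""
    xk = x_karyotype(x_ploidy, cutoffs["x_ploidy_cutoffs"])
    yk = y_karyotype(y_ploidy, cutoffs["y_ploidy_cutoffs"])
    return "ambiguous" if "ambiguous" in (xk, yk) else xk + yk


# Hypothetical cutoffs in the non-per-platform format.
example_cutoffs = {
    "x_ploidy_cutoffs": {
        "upper_cutoff_X": 1.4,
        "lower_cutoff_XX": 1.6,
        "upper_cutoff_XX": 2.5,
        "lower_cutoff_XXX": 2.7,
    },
    "y_ploidy_cutoffs": {
        "lower_cutoff_Y": 0.2,
        "upper_cutoff_Y": 1.3,
        "lower_cutoff_YY": 1.4,
    },
}
```

For example, sex_karyotype(2.0, 0.05, example_cutoffs) yields "XX" and sex_karyotype(1.0, 1.0, example_cutoffs) yields "XY", while a chrX ploidy of 1.5 falls between the X and XX ranges and is reported as ambiguous under these assumptions.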

gnomad_qc.v4.sample_qc.sex_inference.infer_sex_karyotype_from_ploidy(ploidy_ht, per_platform=False, f_stat_cutoff=-1.0, use_gmm_for_ploidy_cutoffs=False, apply_x_frac_hom_alt_cutoffs=False)

Create a Table with X_karyotype, Y_karyotype, and sex_karyotype.

Parameters:
  • ploidy_ht (Table) – Table with chromosome X and chromosome Y ploidies, and f-stat if not use_gmm_for_ploidy_cutoffs.

  • per_platform (bool) – Whether the sex karyotype ploidy cutoff inference should be applied per platform.

  • f_stat_cutoff (float) – f-stat to roughly divide ‘XX’ from ‘XY’ samples. Assumes XX samples are below cutoff and XY are above cutoff.

  • use_gmm_for_ploidy_cutoffs (bool) – Use Gaussian mixture model to split samples into ‘XX’ and ‘XY’ instead of f-stat.

  • apply_x_frac_hom_alt_cutoffs (bool) – Whether to apply cutoffs for the fraction homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.

Return type:

Table

Returns:

Table of imputed sex karyotypes.
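
The fraction used by apply_x_frac_hom_alt_cutoffs is defined above as hom-alt/(hom-alt + het). A minimal sketch of that ratio for one sample follows; the function name and the genotype counts are hypothetical, and returning None when there are no non-reference genotypes is an assumption.

```python
from typing import Optional


def frac_hom_alt(n_hom_alt: int, n_het: int) -> Optional[float]:
    """Fraction of non-reference chrX genotypes that are homozygous alternate."""
    total = n_hom_alt + n_het
    return n_hom_alt / total if total else None


# Hypothetical chrX genotype counts for two samples.
mostly_hom_alt = frac_hom_alt(n_hom_alt=90, n_het=10)
balanced = frac_hom_alt(n_hom_alt=40, n_het=60)
```

A sample with a single X chromosome is expected to have very few heterozygous calls outside the PAR regions, so its fraction should be close to 1, which is what makes this value useful as an additional check on the ploidy-based assignment.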

gnomad_qc.v4.sample_qc.sex_inference.reformat_ploidy_cutoffs_for_json(ht, per_platform=False, include_x_frac_hom_alt_cutoffs=True)

Format x_ploidy_cutoffs and y_ploidy_cutoffs global annotations for JSON export.

Parameters:
  • ht (Table) – Table including globals for x_ploidy_cutoffs and y_ploidy_cutoffs.

  • per_platform (bool) – Whether the ploidy global cutoffs are per platform.

  • include_x_frac_hom_alt_cutoffs (bool) – Whether to include cutoffs for the fraction homozygous alternate genotypes (hom-alt/(hom-alt + het)) on chromosome X.

Return type:

dict

Returns:

Dictionary of X and Y ploidy cutoffs for JSON export.
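
Once the cutoffs are available as a plain dictionary, they can be written and read back with the standard json module. The sketch below only demonstrates that round trip for the non-per-platform layout; the cutoff values are hypothetical and the code is not taken from the module.

```python
import json

# Hypothetical cutoffs in the non-per-platform layout.
cutoffs = {
    "x_ploidy_cutoffs": {
        "upper_cutoff_X": 1.4,
        "lower_cutoff_XX": 1.6,
        "upper_cutoff_XX": 2.5,
        "lower_cutoff_XXX": 2.7,
    },
    "y_ploidy_cutoffs": {
        "lower_cutoff_Y": 0.2,
        "upper_cutoff_Y": 1.3,
        "lower_cutoff_YY": 1.4,
    },
}

# Serialize to a JSON string and parse it back.
serialized = json.dumps(cutoffs, indent=4)
restored = json.loads(serialized)
```

A file produced this way has the shape expected by ‘--sex-karyotype-cutoffs’, so exported cutoffs can be reviewed, adjusted by hand, and supplied to a later run.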

gnomad_qc.v4.sample_qc.sex_inference.main(args)

Impute chromosomal sex karyotype annotation.