gnomad_qc.v4.sample_qc.interval_qc
Script to define high quality intervals based on per interval aggregate statistics over samples.
Two methods are available for defining high quality intervals:
- Mean fraction of bases over DP 0.
- Fraction of samples with a mean interval coverage over a specified coverage, where the coverage cutoff differs between autosomes and sex chromosomes.
Aggregate statistics over samples can also be stratified by platform to determine per-platform high quality intervals.
usage: gnomad_qc.v4.sample_qc.interval_qc.py [-h] [--overwrite] [--test]
[--slack-channel SLACK_CHANNEL]
[--sex-chr-interval-coverage]
[--calling-interval-name {ukb,broad,intersection}]
[--calling-interval-padding {0,50}]
[--generate-interval-qc-ht]
[--mean-dp-thresholds MEAN_DP_THRESHOLDS [MEAN_DP_THRESHOLDS ...]]
[--generate-interval-qc-pass-ht]
[--per-platform | --all-platforms]
[--min-platform-size MIN_PLATFORM_SIZE]
[--by-mean-fraction-over-dp-0 | --by-fraction-samples-over-cov]
[--mean-fraction-over-dp-0 MEAN_FRACTION_OVER_DP_0]
[--autosome-par-xx-cov AUTOSOME_PAR_XX_COV]
[--xy-nonpar-cov XY_NONPAR_COV]
[--fraction-samples FRACTION_SAMPLES]
Named Arguments
- --overwrite
Overwrite output files.
Default: False
- --test
Test using only 10 partitions on chr1 and chrX, and all of chrY.
Default: False
- --slack-channel
Slack channel to post results and notifications to.
Sex chromosome interval coverage
Arguments used for computing interval coverage on sex chromosomes.
- --sex-chr-interval-coverage
Create a MatrixTable of interval-by-sample coverage on sex chromosomes with intervals split at PAR regions.
Default: False
- --calling-interval-name
Possible choices: ukb, broad, intersection
Name of calling intervals to use for interval coverage. One of: ‘ukb’, ‘broad’, or ‘intersection’.
Default: “intersection”
- --calling-interval-padding
Possible choices: 0, 50
Number of base pair padding to use on the calling intervals. One of 0 or 50 bp.
Default: 50
Compute aggregate interval stats for interval QC
Arguments used for computing interval QC stats.
- --generate-interval-qc-ht
Compute aggregate interval stats for interval QC from coverage MatrixTables.
Default: False
- --mean-dp-thresholds
List of mean DP cutoffs to determine the fraction of samples with mean coverage >= the cutoff for each interval.
Default: [5, 10, 15, 20, 25]
Generate interval QC pass annotation
Arguments used for determining intervals that pass QC.
- --generate-interval-qc-pass-ht
Create Table that contains an ‘interval_qc_pass’ annotation indicating whether the interval passes high-quality criteria.
Default: False
- --per-platform
Whether to make the interval QC pass annotation a DictionaryExpression with interval QC pass per platform.
Default: False
- --all-platforms
Whether to consider an interval as passing QC only if it passes interval QC per platform across all platforms (with a sample size above ‘--min-platform-size’).
Default: False
- --min-platform-size
Required size of a platform to be considered in ‘--all-platforms’. Only platforms with more than ‘--min-platform-size’ samples are used to determine intervals that are high quality across all platforms.
Default: 100
- --by-mean-fraction-over-dp-0
Whether to use the mean fraction of bases over DP 0 to determine high quality intervals. Can’t be set at the same time as ‘--by-fraction-samples-over-cov’.
Default: False
- --by-fraction-samples-over-cov
Whether to determine high quality intervals using the fraction of samples (‘--fraction-samples’) with a mean interval coverage over a specified coverage: ‘--autosome-par-xx-cov’ for intervals on the autosomes, the sex chromosome PAR, and chrX in XX individuals, and ‘--xy-nonpar-cov’ for intervals on non-PAR chrX and non-PAR chrY in XY individuals. Can’t be set at the same time as ‘--by-mean-fraction-over-dp-0’.
Default: False
- --mean-fraction-over-dp-0
Mean fraction of bases over DP 0 used to define high quality intervals.
Default: 0.99
- --autosome-par-xx-cov
Mean coverage level used to define high coverage intervals on the autosomes, sex chromosome PAR, and chrX in XX individuals. This field must be in the interval coverage MatrixTables!
Default: 20
- --xy-nonpar-cov
Mean coverage level used to define high coverage intervals on non-PAR chrX and non-PAR chrY in XY individuals. This field must be in the interval coverage MatrixTables!
Default: 10
- --fraction-samples
Fraction of samples with mean coverage greater than ‘--autosome-par-xx-cov’/‘--xy-nonpar-cov’ over the interval to determine high coverage intervals.
Default: 0.85
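The usage line shows two pairs of options that cannot be combined: ‘--per-platform’ with ‘--all-platforms’, and ‘--by-mean-fraction-over-dp-0’ with ‘--by-fraction-samples-over-cov’. The following is a minimal sketch of how such a layout can be expressed with argparse. It is not the script's own parser: the function name build_sketch_parser and the prog name are hypothetical, the int/float types are assumptions, and only a subset of the documented flags and defaults is reproduced.

```python
import argparse


def build_sketch_parser() -> argparse.ArgumentParser:
    """Hypothetical, partial reconstruction of the documented pass-QC options."""
    parser = argparse.ArgumentParser(prog="interval_qc_sketch")
    parser.add_argument("--generate-interval-qc-pass-ht", action="store_true")

    # The usage line lists these two flags as alternatives: [--per-platform | --all-platforms].
    platform_group = parser.add_mutually_exclusive_group()
    platform_group.add_argument("--per-platform", action="store_true")
    platform_group.add_argument("--all-platforms", action="store_true")
    parser.add_argument("--min-platform-size", type=int, default=100)

    # Likewise, only one method for defining high quality intervals may be chosen.
    method_group = parser.add_mutually_exclusive_group()
    method_group.add_argument("--by-mean-fraction-over-dp-0", action="store_true")
    method_group.add_argument("--by-fraction-samples-over-cov", action="store_true")

    # Documented defaults; the argument types are assumptions.
    parser.add_argument("--mean-fraction-over-dp-0", type=float, default=0.99)
    parser.add_argument("--autosome-par-xx-cov", type=int, default=20)
    parser.add_argument("--xy-nonpar-cov", type=int, default=10)
    parser.add_argument("--fraction-samples", type=float, default=0.85)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_sketch_parser().parse_args(
    [
        "--generate-interval-qc-pass-ht",
        "--all-platforms",
        "--by-fraction-samples-over-cov",
    ]
)
print(args.all_platforms, args.fraction_samples, args.autosome_par_xx_cov)
```

Passing both flags of either pair makes argparse exit with a usage error, which matches the "Can’t be set at the same time" notes above.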
Module Functions
generate_sex_chr_interval_coverage_mt | Create a MatrixTable of interval-by-sample coverage on sex chromosomes with intervals split at PAR regions.
filter_to_test | Filter mt to num_partitions partitions on chr1 and sex_mt to num_partitions partitions on chrX and all of chrY.
compute_interval_qc | Compute interval QC aggregate statistics per interval, per platform, and optionally split by sex karyotype.
get_high_qual_cutoff_dict | Create a dictionary specifying annotations and cutoffs to use for determining high quality intervals.
get_interval_qc_pass | Add interval_qc_pass annotation to indicate whether the interval is high quality.
annotate_interval_qc_filter | Annotate a Table/MatrixTable with 'pass_interval_qc' using get_interval_qc_pass.
Define high quality intervals based on aggregate statistics over samples.
Get script argument parser.
- gnomad_qc.v4.sample_qc.interval_qc.generate_sex_chr_interval_coverage_mt(vds, calling_intervals_ht)
Create a MatrixTable of interval-by-sample coverage on sex chromosomes with intervals split at PAR regions.
- Parameters:
vds (VariantDataset) – Input VariantDataset.
calling_intervals_ht (Table) – Calling interval Table.
- Return type:
MatrixTable
- Returns:
MatrixTable with interval coverage per sample on sex chromosomes.
- gnomad_qc.v4.sample_qc.interval_qc.filter_to_test(mt, sex_mt, num_partitions=10)
Filter mt to num_partitions partitions on chr1 and sex_mt to num_partitions partitions on chrX and all of chrY.
Note
This returns the first num_partitions in mt, the first num_partitions in sex_mt, and all of chrY. It assumes that the first num_partitions in mt are on chr1 and that the first num_partitions in sex_mt are on chrX. If num_partitions is too high, this may not hold true.
- Parameters:
mt (MatrixTable) – Input MatrixTable to filter to specified number of partitions on chr1.
sex_mt (MatrixTable) – Input MatrixTable to filter to specified number of partitions on chrX and all of chrY.
num_partitions (int) – Number of partitions to grab from mt.
- Return type:
Tuple[MatrixTable, MatrixTable]
- Returns:
Input MatrixTables filtered to num_partitions on chr1, chrX, and all of chrY.
- gnomad_qc.v4.sample_qc.interval_qc.compute_interval_qc(mt, platform_ht, mean_dp_thresholds=[5, 10, 15, 20, 25], split_by_sex=False)
Compute interval QC aggregate statistics per interval, per platform, and optionally split by sex karyotype.
The following annotations must be on mt (interval-by-sample MT output from hl.vds.interval_coverage):
interval - Genomic interval of interest.
mean_dp - Mean depth of bases across the interval.
fraction_over_dp_threshold - Fraction of interval (in bases) above each DP threshold. Second element must be dp >= 1 (dp > 0).
If split_by_sex:
sex_karyotype - StringExpression annotation with sex karyotype information including ‘XX’ and ‘XY’ values.
The platform_ht must have a ‘qc_platform’ annotation indicating the platform each sample was assigned.
Returns a Table with the following annotations:
interval_mean_dp - Mean DP of the interval across ‘all’ samples and optionally split by ‘XX’ and ‘XY’.
fraction_over_{dp}x - Fraction of samples with mean DP >= ‘dp’, with one annotation for each ‘dp’ in mean_dp_thresholds. Computed across ‘all’ samples and optionally split by ‘XX’ and ‘XY’.
mean_fraction_over_dp_0 - Mean of the fraction of the interval (in bases) that is dp > 0. Computed across ‘all’ samples and optionally split by ‘XX’ and ‘XY’.
platform_interval_mean_dp - Same as ‘interval_mean_dp’, but instead of containing a single value, ‘all’ (and ‘XX’ and ‘XY’ if split_by_sex is True) contains a dictionary of per platform values.
platform_fraction_over_{dp}x - Same as ‘fraction_over_{dp}x’, but instead of containing a single value, ‘all’ (and ‘XX’ and ‘XY’ if split_by_sex is True) contains a dictionary of per platform values.
platform_mean_fraction_over_dp_0 - Same as ‘mean_fraction_over_dp_0’, but instead of containing a single value, ‘all’ (and ‘XX’ and ‘XY’ if split_by_sex is True) contains a dictionary of per platform values.
- Parameters:
mt (MatrixTable) – Input interval coverage MatrixTable.
platform_ht (Table) – Input platform assignment Table.
mean_dp_thresholds (List[int]) – List of mean DP thresholds to use for computing the fraction of samples with mean interval DP >= the threshold.
split_by_sex (bool) – Whether the interval QC should be stratified by sex. If True, mt must be annotated with ‘sex_karyotype’.
- Return type:
Table
- Returns:
Table with interval QC annotations.
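To make the returned annotations concrete, here is a plain-Python sketch of the per-interval aggregation for a single interval and a single stratum (for example ‘all’). It does not use Hail and is not the function's implementation; the name sketch_interval_stats and the input lists are hypothetical. The >= comparison follows the description of mean_dp_thresholds.

```python
from statistics import mean
from typing import Dict, Sequence


def sketch_interval_stats(
    mean_dp: Sequence[float],
    fraction_over_dp_0: Sequence[float],
    mean_dp_thresholds: Sequence[int] = (5, 10, 15, 20, 25),
) -> Dict[str, float]:
    """Aggregate per-sample values for one interval (hypothetical helper).

    mean_dp[i] is sample i's mean depth over the interval, and
    fraction_over_dp_0[i] is the fraction of the interval's bases with dp > 0.
    """
    stats = {
        "interval_mean_dp": mean(mean_dp),
        "mean_fraction_over_dp_0": mean(fraction_over_dp_0),
    }
    for dp in mean_dp_thresholds:
        # Fraction of samples whose mean interval depth is >= the threshold.
        stats[f"fraction_over_{dp}x"] = sum(d >= dp for d in mean_dp) / len(mean_dp)
    return stats


# Four made-up samples covering one interval.
example = sketch_interval_stats(
    mean_dp=[4.0, 12.0, 22.0, 30.0],
    fraction_over_dp_0=[0.9, 1.0, 1.0, 1.0],
)
print(example["interval_mean_dp"])   # 17.0
print(example["fraction_over_20x"])  # 0.5
```

The platform_* annotations apply the same aggregation after grouping samples by their ‘qc_platform’ assignment, giving one value per platform instead of a single value.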
- gnomad_qc.v4.sample_qc.interval_qc.get_high_qual_cutoff_dict(autosome_par_cutoff, x_nonpar_cutoff, y_nonpar_cutoff, autosome_par_qc_ann, x_nonpar_qc_ann, y_nonpar_qc_ann, split_by_sex=False)
Create a dictionary specifying annotations and cutoffs to use for determining high quality intervals.
This Dictionary is meant to be used as input to get_interval_qc_pass.
If split_by_sex is True, the ‘x_non_par’ dictionary value will contain a cutoff for both ‘XX’ and ‘XY’, and ‘y_non_par’ will contain a cutoff for only ‘XY’.
The returned dictionary will be in this form if split_by_sex is False:
    {
        'autosome_par': [(autosome_par_qc_ann, 'all', autosome_par_cutoff)],
        'x_non_par': [(x_nonpar_qc_ann, 'all', x_nonpar_cutoff)],
        'y_non_par': [(y_nonpar_qc_ann, 'all', y_nonpar_cutoff)]
    }
The returned dictionary will be in this form if split_by_sex is True:
    {
        'autosome_par': [(autosome_par_qc_ann, 'all', autosome_par_cutoff)],
        'x_non_par': [
            (x_nonpar_qc_ann, 'XX', x_nonpar_cutoff),
            (y_nonpar_qc_ann, 'XY', y_nonpar_cutoff)
        ],
        'y_non_par': [(y_nonpar_qc_ann, 'XY', y_nonpar_cutoff)]
    }
- Parameters:
autosome_par_cutoff (float) – Cutoff to define high coverage intervals for autosome and PAR intervals. Intervals with autosome_par_qc_ann > autosome_par_cutoff are considered high coverage.
x_nonpar_cutoff (float) – Cutoff to define high coverage intervals for chromosome X non-PAR intervals (for XX individuals if split_by_sex is True). Intervals with x_nonpar_qc_ann > x_nonpar_cutoff are considered high coverage.
y_nonpar_cutoff (float) – Cutoff to define high coverage intervals for chromosome Y non-PAR intervals (for XY individuals if split_by_sex is True). Intervals with y_nonpar_qc_ann > y_nonpar_cutoff are considered high coverage. Also used to define high coverage X non-PAR intervals if split_by_sex is True.
autosome_par_qc_ann (str) – Annotation in an interval QC HT that will be used to filter high coverage intervals for autosomes and PAR regions.
x_nonpar_qc_ann (str) – Annotation in an interval QC HT that will be used to filter high coverage intervals for chromosome X non-PAR regions.
y_nonpar_qc_ann (str) – Annotation in an interval QC HT that will be used to filter high coverage intervals for chromosome Y non-PAR regions. Also used for chromosome X non-PAR regions if split_by_sex is True.
split_by_sex (bool) – Whether to split ‘x_non_par’ and ‘y_non_par’ cutoffs based on sex karyotype. Default is False.
- Return type:
Dict[str, List[Tuple[str, str, float]]]
- Returns:
Dictionary of annotations and cutoffs to use to define high quality intervals.
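The two documented return shapes can be reproduced with a few lines of plain Python. This is only a sketch that mirrors the forms described above, not the module's source; the name sketch_high_qual_cutoff_dict is hypothetical and chosen to avoid confusion with the real function.

```python
from typing import Dict, List, Tuple

CutoffDict = Dict[str, List[Tuple[str, str, float]]]


def sketch_high_qual_cutoff_dict(
    autosome_par_cutoff: float,
    x_nonpar_cutoff: float,
    y_nonpar_cutoff: float,
    autosome_par_qc_ann: str,
    x_nonpar_qc_ann: str,
    y_nonpar_qc_ann: str,
    split_by_sex: bool = False,
) -> CutoffDict:
    """Build a cutoff dictionary in the documented shape (hypothetical helper)."""
    if split_by_sex:
        # chrX non-PAR is judged separately in XX and XY samples; the XY entry
        # reuses the chrY non-PAR annotation and cutoff, as documented.
        x_non_par = [
            (x_nonpar_qc_ann, "XX", x_nonpar_cutoff),
            (y_nonpar_qc_ann, "XY", y_nonpar_cutoff),
        ]
        y_non_par = [(y_nonpar_qc_ann, "XY", y_nonpar_cutoff)]
    else:
        x_non_par = [(x_nonpar_qc_ann, "all", x_nonpar_cutoff)]
        y_non_par = [(y_nonpar_qc_ann, "all", y_nonpar_cutoff)]
    return {
        "autosome_par": [(autosome_par_qc_ann, "all", autosome_par_cutoff)],
        "x_non_par": x_non_par,
        "y_non_par": y_non_par,
    }


# Cutoffs built from the script defaults: 20x / 10x coverage, 0.85 of samples.
cutoffs = sketch_high_qual_cutoff_dict(
    0.85, 0.85, 0.85,
    "fraction_over_20x", "fraction_over_20x", "fraction_over_10x",
    split_by_sex=True,
)
print(cutoffs["x_non_par"])
```

Each tuple reads as (annotation, sex karyotype stratum, cutoff), so one region can carry several criteria.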
- gnomad_qc.v4.sample_qc.interval_qc.get_interval_qc_pass(interval_qc_ht, high_qual_cutoffs, per_platform=False, all_platforms=False, min_platform_size=100)
Add interval_qc_pass annotation to indicate whether the interval is high quality.
interval_qc_ht is the output of compute_interval_qc and contains annotations that can be used in the high_qual_cutoffs dictionary to indicate intervals that are considered high quality.
The high_qual_cutoffs dictionary can be created using get_high_qual_cutoff_dict. It specifies annotations and cutoffs to use for determining high quality intervals. Annotations in the high_qual_cutoffs dictionary must exist in the interval_qc_ht Table. The high_qual_cutoffs dictionary must have the following keys: ‘autosome_par’, ‘x_non_par’ and ‘y_non_par’. Each key specifies a list of annotations and cutoffs to use for filtering.
Example of high_qual_cutoffs dictionary using annotations for the proportion of samples over a specified coverage:
    {
        'autosome_par': [('fraction_over_20x', 'all', 0.85)],
        'x_non_par': [('fraction_over_10x', 'all', 0.80)],
        'y_non_par': [('fraction_over_5x', 'all', 0.35)]
    }
Example of high_qual_cutoffs dictionary using annotations for the proportion of samples over a specified coverage and specifying differences by sex karyotype:
    {
        'autosome_par': [('fraction_over_20x', 'all', 0.85)],
        'x_non_par': [
            ('fraction_over_20x', 'XX', 0.85),
            ('fraction_over_10x', 'XY', 0.85)
        ],
        'y_non_par': [('fraction_over_10x', 'XY', 0.85)]
    }
Only one of ‘per_platform’ and ‘all_platforms’ can be True, and if per_platform or all_platforms is True, a prefix of “platform_” is added before the annotation in the high_qual_cutoffs dictionary.
- Parameters:
interval_qc_ht (Table) – Input interval QC Table.
high_qual_cutoffs (Dict[str, List[Tuple[str, str, float]]]) – Dictionary containing annotations and cutoffs to use for filtering to high coverage intervals.
per_platform (bool) – Whether to make the interval QC pass annotation a DictionaryExpression with interval QC pass per platform.
all_platforms (bool) – Whether to consider an interval as passing QC only if it passes interval QC per platform across all platforms (with a sample size above min_platform_size).
min_platform_size (int) – Required size of a platform to be considered in all_platforms. Only platforms that have # of samples > ‘min_platform_size’ are used to determine intervals that have a high coverage across all platforms.
- Return type:
Table
- Returns:
Table with an ‘interval_qc_pass’ annotation indicating whether each interval is high quality.
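The pass decision for one interval can be illustrated in plain Python. This sketch rests on two assumptions that the text above does not state outright: every (annotation, stratum, cutoff) tuple listed for a region must hold, and the comparison is a strict >, following the parameter descriptions of get_high_qual_cutoff_dict. The helper names and the nested-dict layout of the statistics are hypothetical. The real Table stores per-platform values inside ‘platform_’-prefixed annotations rather than in one dictionary per platform.

```python
from typing import Dict, List, Tuple

Criteria = List[Tuple[str, str, float]]
# stats[annotation][stratum] -> value for one interval, e.g. stats["fraction_over_20x"]["XX"].
Stats = Dict[str, Dict[str, float]]


def region_passes(stats: Stats, criteria: Criteria) -> bool:
    """Return True if every criterion holds (assumed AND semantics, strict >)."""
    return all(stats[ann][stratum] > cutoff for ann, stratum, cutoff in criteria)


def passes_all_platforms(
    platform_stats: Dict[str, Stats],
    platform_sizes: Dict[str, int],
    criteria: Criteria,
    min_platform_size: int = 100,
) -> bool:
    """Require a pass on every platform with more than min_platform_size samples."""
    eligible = [p for p, n in platform_sizes.items() if n > min_platform_size]
    # In this sketch, an interval with no eligible platform passes vacuously.
    return all(region_passes(platform_stats[p], criteria) for p in eligible)


x_non_par = [("fraction_over_20x", "XX", 0.85), ("fraction_over_10x", "XY", 0.85)]

# Made-up per-platform statistics for one chrX non-PAR interval.
platform_stats = {
    "platform_a": {
        "fraction_over_20x": {"XX": 0.88, "XY": 0.40},
        "fraction_over_10x": {"XX": 0.95, "XY": 0.86},
    },
    "platform_b": {
        "fraction_over_20x": {"XX": 0.60, "XY": 0.10},
        "fraction_over_10x": {"XX": 0.70, "XY": 0.50},
    },
}
platform_sizes = {"platform_a": 500, "platform_b": 50}

print(region_passes(platform_stats["platform_a"], x_non_par))          # True
print(passes_all_platforms(platform_stats, platform_sizes, x_non_par))  # True
```

Here platform_b fails the criteria but has only 50 samples, so it is ignored at the default min_platform_size of 100. Lowering the threshold to 10 would make the interval fail.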
- gnomad_qc.v4.sample_qc.interval_qc.annotate_interval_qc_filter(t, **kwargs)
Annotate a Table/MatrixTable with ‘pass_interval_qc’ using get_interval_qc_pass.
Passes the interval QC Table resource and kwargs to get_interval_qc_pass.
- Parameters:
t (Union[MatrixTable, Table]) – Input Table or MatrixTable.
kwargs – Optional keyword arguments to pass to get_interval_qc_pass.
- Return type:
Union[MatrixTable, Table]
- Returns:
Input Table or MatrixTable annotated with ‘pass_interval_qc’.