gnomad_qc.v4.sample_qc.relatedness
Script to compute relatedness estimates among pairs of samples in the callset.
usage: gnomad_qc.v4.sample_qc.relatedness.py [-h] [-o] [--test]
[--min-emission-kinship MIN_EMISSION_KINSHIP]
[--relatedness-n-partitions RELATEDNESS_N_PARTITIONS]
[--prepare-cuking-inputs]
[--print-cuking-command]
[--cuking-split-factor CUKING_SPLIT_FACTOR]
[--create-cuking-relatedness-table]
[--run-ibd-on-cuking-pairs]
[--ibd-min-cuking-kin IBD_MIN_CUKING_KIN]
[--ibd-max-cuking-ibs0 IBD_MAX_CUKING_IBS0]
[--ibd-max-samples IBD_MAX_SAMPLES]
[--run-pc-relate-pca]
[--n-pca-pcs N_PCA_PCS]
[--create-pc-relate-relatedness-table]
[--n-pc-relate-pcs N_PC_RELATE_PCS]
[--min-individual-maf MIN_INDIVIDUAL_MAF]
[--block-size BLOCK_SIZE]
[--finalize-relatedness-ht]
[--finalize-relatedness-method {cuking,pc_relate}]
[--second-degree-min-kin SECOND_DEGREE_MIN_KIN]
[--parent-child-max-ibd0-or-ibs0-over-ibs2 PARENT_CHILD_MAX_IBD0_OR_IBS0_OVER_IBS2]
[--second-degree-sibling-lower-cutoff-slope SECOND_DEGREE_SIBLING_LOWER_CUTOFF_SLOPE]
[--second-degree-sibling-lower-cutoff-intercept SECOND_DEGREE_SIBLING_LOWER_CUTOFF_INTERCEPT]
[--second-degree-upper-sibling-lower-cutoff-slope SECOND_DEGREE_UPPER_SIBLING_LOWER_CUTOFF_SLOPE]
[--second-degree-upper-sibling-lower-cutoff-intercept SECOND_DEGREE_UPPER_SIBLING_LOWER_CUTOFF_INTERCEPT]
[--duplicate-twin-min-kin DUPLICATE_TWIN_MIN_KIN]
[--duplicate-twin-ibd1-min DUPLICATE_TWIN_IBD1_MIN]
[--duplicate-twin-ibd1-max DUPLICATE_TWIN_IBD1_MAX]
[--compute-related-samples-to-drop]
[--release]
[--slack-channel SLACK_CHANNEL]
Named Arguments
- -o, --overwrite
Overwrite output files.
Default: False
- --test
Use a test MatrixTableResource as input.
Default: False
- --slack-channel
Slack channel to post results and notifications to.
Common relatedness estimate arguments
Arguments relevant to both cuKING and PC-relate relatedness estimates.
- --min-emission-kinship
Minimum kinship threshold for emitting a pair of samples in the relatedness output.
Default: 0.05
- --relatedness-n-partitions
Number of desired partitions for the relatedness Table.
Default: 100
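As a reduced illustration of how the flags documented so far map onto parsed attributes, the sketch below rebuilds only these four options with `argparse`. It is not the script's actual parser: the `description` string and the argument list passed to `parse_args` are assumptions made for the example, while the flag names and defaults are the ones listed above.

```python
import argparse

# Reduced sketch: only the flags documented above, not the full relatedness parser.
parser = argparse.ArgumentParser(description="Reduced sketch of the relatedness CLI.")
parser.add_argument("-o", "--overwrite", action="store_true", help="Overwrite output files.")
parser.add_argument("--test", action="store_true", help="Use a test MatrixTableResource as input.")

common = parser.add_argument_group("Common relatedness estimate arguments")
common.add_argument("--min-emission-kinship", type=float, default=0.05)
common.add_argument("--relatedness-n-partitions", type=int, default=100)

# Hypothetical invocation; dashes in flag names become underscores in attribute names.
args = parser.parse_args(["--test", "--min-emission-kinship", "0.1"])
print(args.test, args.min_emission_kinship, args.relatedness_n_partitions)
```

Flags that are not passed keep their documented defaults, so `args.relatedness_n_partitions` stays at 100 in this invocation.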
cuKING specific relatedness arguments
Arguments specific to computing relatedness estimates using cuKING.
- --prepare-cuking-inputs
Converts the dense QC MatrixTable to a Parquet format suitable for cuKING.
Default: False
- --create-cuking-relatedness-table
Convert the cuKING outputs to a standard Hail Table.
Default: False
Finalize relatedness specific arguments
Arguments specific to creating the final relatedness Table, including adding a ‘relationship’ annotation for each pair. Note: The defaults provided for the slope and intercept cutoffs were determined from visualization of the cuKING kinship distribution and the IBS0/IBS2 vs. kinship plot.
- --finalize-relatedness-ht
Whether to finalize the relatedness HT.
Default: False
- --finalize-relatedness-method
Possible choices: cuking, pc_relate
Which relatedness method to use for finalized relatedness Table. Options are ‘cuking’ and ‘pc_relate’. Default is ‘cuking’.
Default: “cuking”
- --second-degree-min-kin
Minimum kinship threshold for filtering a pair of samples with a second degree relationship when filtering related individuals. Default is 0.08838835. Bycroft et al. (2018) calculates a theoretical kinship of 0.08838835 for a second degree relationship cutoff. This cutoff should be determined by evaluation of the kinship distribution.
Default: 0.08838835
- --parent-child-max-ibd0-or-ibs0-over-ibs2
Maximum value of IBD0 (if ‘--finalize-relatedness-method’ is ‘pc_relate’) or IBS0/IBS2 (if ‘--finalize-relatedness-method’ is ‘cuking’) for a parent-child pair.
Default: 5.2e-05
- --second-degree-sibling-lower-cutoff-slope
Slope of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
Default: -0.0019
- --second-degree-sibling-lower-cutoff-intercept
Intercept of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
Default: 0.00058
- --second-degree-upper-sibling-lower-cutoff-slope
Slope of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
Default: -0.01
- --second-degree-upper-sibling-lower-cutoff-intercept
Intercept of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
Default: 0.0022
- --duplicate-twin-min-kin
Minimum kinship for duplicate or twin pairs.
Default: 0.42
- --duplicate-twin-ibd1-min
Minimum IBD1 cutoff for duplicate or twin pairs. Only used when ‘--finalize-relatedness-method’ is ‘pc_relate’. Note: the min is used because pc_relate can output large negative values in some corner cases.
Default: -0.15
- --duplicate-twin-ibd1-max
Maximum IBD1 cutoff for duplicate or twin pairs. Only used when ‘--finalize-relatedness-method’ is ‘pc_relate’.
Default: 0.1
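Each slope and intercept pair above defines a straight line used as a cutoff. The sketch below only evaluates those two lines at a given kinship, assuming kinship is on the x-axis and IBS0/IBS2 (or IBD0 for ‘pc_relate’) is on the y-axis, as the note about the IBS0/IBS2 vs. kinship plot suggests. The helper name `line_cutoff` is hypothetical, and the sketch does not reproduce how the pipeline turns these lines into relationship labels.

```python
def line_cutoff(kin: float, slope: float, intercept: float) -> float:
    """Return the height of a cutoff line at kinship `kin` (hypothetical helper)."""
    return slope * kin + intercept


# Defaults documented above.
LOWER = {"slope": -0.0019, "intercept": 0.00058}  # --second-degree-sibling-lower-cutoff-*
UPPER = {"slope": -0.01, "intercept": 0.0022}  # --second-degree-upper-sibling-lower-cutoff-*

kin = 0.1  # example kinship value, chosen only for illustration
lower = line_cutoff(kin, **LOWER)
upper = line_cutoff(kin, **UPPER)
print(f"kin={kin}: lower cutoff={lower:.5f}, upper cutoff={upper:.5f}")
```

Because both lines slope downward at different rates, the gap between them changes with kinship, which is why the defaults are best checked against the plotted distribution rather than read in isolation.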
Compute related samples to drop
Arguments used to determine related samples that should be dropped from the ancestry PCA or release.
- --compute-related-samples-to-drop
Determine the minimal set of related samples to prune for ancestry PCA, or for the release if ‘--release’ is used.
Default: False
- --release
Whether to determine related samples to drop for the release based on outlier filtering of sample QC metrics.
Default: False
Module Functions
| print_cuking_command | Print the command to submit a Cloud Batch job for running cuKING. |
| compute_ibd_on_cuking_pair_subset | Run Hail's identity by descent on a subset of related pairs identified by cuKING. |
| finalize_relatedness_ht | Create the finalized relatedness Table including adding a 'relationship' annotation for each pair. |
| compute_rank_ht | Add a rank to each sample for use when breaking maximal independent set ties. |
| run_compute_related_samples_to_drop | Determine the minimal set of related samples to prune for ancestry PCA or release. |
| get_relatedness_resources | Get PipelineResourceCollection for all resources needed in the relatedness pipeline. |
| | Compute relatedness estimates among pairs of samples in the callset. |
| | Get script argument parser. |
- gnomad_qc.v4.sample_qc.relatedness.print_cuking_command(cuking_input_path, cuking_output_path, min_emission_kinship=0.5, cuking_split_factor=4)
Print the command to submit a Cloud Batch job for running cuKING.
- Parameters:
cuking_input_path (str) – Path to the cuKING input Parquet files.
cuking_output_path (str) – Path to the cuKING output Parquet files.
min_emission_kinship (float) – Minimum kinship threshold for emitting a pair of samples in the relatedness output.
cuking_split_factor (int) – Split factor to use for splitting the full relatedness matrix table into equally sized submatrices that are computed independently to parallelize the relatedness computation and decrease the memory requirements. For example, to halve memory requirements, the full matrix can be split into 4 × 4 equally sized submatrices (i.e. a ‘split factor’ of 4). Only the ‘upper triangular’ submatrices need to be evaluated due to the symmetry of the relatedness matrix, leading to 10 shards. Default is 4.
- Return type:
None
- Returns:
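The shard count quoted for cuking_split_factor follows from counting the upper triangular blocks of a k × k grid, diagonal included, which gives k(k + 1)/2. The sketch below is only that arithmetic; the function names are hypothetical and are not part of the module or of cuKING.

```python
from typing import List, Tuple


def upper_triangular_shards(split_factor: int) -> int:
    """Count the blocks on or above the diagonal of a split_factor x split_factor grid."""
    if split_factor < 1:
        raise ValueError("split_factor must be a positive integer")
    return split_factor * (split_factor + 1) // 2


def shard_coordinates(split_factor: int) -> List[Tuple[int, int]]:
    """List the (row, column) block indices that need to be evaluated."""
    return [(i, j) for i in range(split_factor) for j in range(i, split_factor)]


print(upper_triangular_shards(4))
print(shard_coordinates(2))
```

For the default split factor of 4, this yields the 10 shards mentioned in the parameter description.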
- gnomad_qc.v4.sample_qc.relatedness.compute_ibd_on_cuking_pair_subset(mt, relatedness_ht, ibd_min_cuking_kin=0.16, ibd_max_cuking_ibs0=50, ibd_max_samples=10000)
Run Hail’s identity by descent on a subset of related pairs identified by cuKING.
A pair gets an identity by descent annotation if it meets either the ibd_min_cuking_kin threshold or the ibd_max_cuking_ibs0 threshold.
- Parameters:
mt (MatrixTable) – QC MatrixTable.
relatedness_ht (Table) – cuKING relatedness Table.
ibd_min_cuking_kin (float) – Minimum cuKING kinship for pair to be included in IBD estimates. Default is 0.16.
ibd_max_cuking_ibs0 (int) – Maximum cuKING IBS0 for pair to be included in IBD estimates. Default is 50. This default was determined from looking at the cuKING Kinship vs. IBS0 plot for gnomAD v3 + gnomAD v4.
ibd_max_samples (int) – Maximum number of samples to include in each IBD run.
- Return type:
Table
- Returns:
Table containing identity by descent metrics on related sample pairs.
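The "either threshold" selection rule can be illustrated on plain Python dictionaries. This is a sketch under stated assumptions, not the Hail implementation: the field names `kin` and `ibs0`, the sample IDs, and the use of inclusive comparisons are all assumptions made for the example.

```python
from typing import Dict, List


def select_pairs_for_ibd(
    pairs: List[Dict],
    ibd_min_cuking_kin: float = 0.16,
    ibd_max_cuking_ibs0: int = 50,
) -> List[Dict]:
    """Keep pairs that meet the kinship threshold OR the IBS0 threshold.

    Inclusive comparisons are an assumption; the real function may use strict ones.
    """
    return [
        p
        for p in pairs
        if p["kin"] >= ibd_min_cuking_kin or p["ibs0"] <= ibd_max_cuking_ibs0
    ]


# Hypothetical pairs: only the field names 'kin' and 'ibs0' matter for the rule.
pairs = [
    {"i": "s1", "j": "s2", "kin": 0.25, "ibs0": 900},  # passes on kinship
    {"i": "s3", "j": "s4", "kin": 0.10, "ibs0": 12},  # passes on IBS0
    {"i": "s5", "j": "s6", "kin": 0.06, "ibs0": 4000},  # passes on neither
]
selected = select_pairs_for_ibd(pairs)
print([(p["i"], p["j"]) for p in selected])
```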
- gnomad_qc.v4.sample_qc.relatedness.finalize_relatedness_ht(ht, meta_ht, relatedness_method, relatedness_args)
Create the finalized relatedness Table including adding a ‘relationship’ annotation for each pair.
- The relatedness_args dictionary should have the following keys:
‘second_degree_min_kin’: Minimum kinship threshold for filtering a pair of samples with a second degree relationship when filtering related individuals. Default is 0.08838835. Bycroft et al. (2018) calculates a theoretical kinship of 0.08838835 for a second degree relationship cutoff. This cutoff should be determined by evaluation of the kinship distribution.
‘parent_child_max_ibd0_or_ibs0_over_ibs2’: Maximum value of IBD0 (if relatedness_method is ‘pc_relate’) or IBS0/IBS2 (if relatedness_method is ‘cuking’) for a parent-child pair.
‘second_degree_sibling_lower_cutoff_slope’: Slope of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
‘second_degree_sibling_lower_cutoff_intercept’: Intercept of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
‘second_degree_upper_sibling_lower_cutoff_slope’: Slope of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
‘second_degree_upper_sibling_lower_cutoff_intercept’: Intercept of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
‘duplicate_twin_min_kin’: Minimum kinship for duplicate or twin pairs.
‘duplicate_twin_ibd1_min’: Minimum IBD1 cutoff for duplicate or twin pairs. Only used when relatedness_method is ‘pc_relate’. Note: the min is used because pc_relate can output large negative values in some corner cases.
‘duplicate_twin_ibd1_max’: Maximum IBD1 cutoff for duplicate or twin pairs. Only used when relatedness_method is ‘pc_relate’.
- The following annotations are added to ht:
‘relationship’: Relationship annotation for the pair. Returned by get_slope_int_relationship_expr.
‘gnomad_v3_duplicate’: Whether the sample is a duplicate of a sample found in the gnomAD v3.1 genomes.
‘gnomad_v3_release_duplicate’: Whether the sample is a duplicate of a sample found in the gnomAD v3.1 release genomes.
- Parameters:
ht (Table) – Input relatedness Table.
meta_ht (Table) – Input metadata Table. Used to add v3 overlap annotations.
relatedness_method (str) – Which relatedness method to use for finalized relatedness Table. Options are ‘cuking’ and ‘pc_relate’. Default is ‘cuking’.
relatedness_args (Dict[str, float]) – Dictionary of arguments to be passed to get_slope_int_relationship_expr.
- Return type:
Table
- Returns:
Finalized relatedness Table
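The ‘second_degree_min_kin’ default of 0.08838835 equals 2^(-3.5) rounded to eight decimal places, which is the lower bound of the kinship range that KING-style inference assigns to second degree relatives. The sketch below only checks that arithmetic; the helper name `kinship_lower_bound` is hypothetical and the general formula 2^-(degree + 1.5) is stated as the assumed pattern behind the number.

```python
def kinship_lower_bound(degree: int) -> float:
    """Lower kinship bound for a relationship degree, assuming 2 ** -(degree + 1.5)."""
    return 2.0 ** -(degree + 1.5)


second_degree_min_kin = kinship_lower_bound(2)
print(round(second_degree_min_kin, 8))
```

As the key description notes, this theoretical value is a starting point, and the cutoff should still be evaluated against the observed kinship distribution.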
- gnomad_qc.v4.sample_qc.relatedness.compute_rank_ht(ht, filter_ht=None)
Add a rank to each sample for use when breaking maximal independent set ties.
Favor v3 release samples, then v4 samples over v3 non-release samples, then higher chr20 mean DP.
If filter_ht is provided, rank based on filtering ‘outlier_filtered’ annotation first.
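The ordering described above can be expressed as a tuple sort key. The following is a plain-Python sketch, not the Hail code: the field names `s`, `is_v3`, `v3_release` and `chr20_mean_dp`, the sample records, and the convention that a lower rank is preferred are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def rank_key(sample: Dict, outlier_filtered: Optional[Set[str]] = None) -> Tuple:
    """Build a sort key: smaller keys are preferred when breaking ties."""
    if sample["is_v3"] and sample["v3_release"]:
        tier = 0  # v3 release samples first
    elif not sample["is_v3"]:
        tier = 1  # then v4 samples
    else:
        tier = 2  # then v3 non-release samples
    key = (tier, -sample["chr20_mean_dp"])  # higher chr20 mean DP is better
    if outlier_filtered is not None:
        # Samples that are not outlier filtered sort ahead of filtered ones.
        key = (sample["s"] in outlier_filtered,) + key
    return key


def rank_samples(samples: List[Dict], outlier_filtered: Optional[Set[str]] = None) -> Dict[str, int]:
    ordered = sorted(samples, key=lambda s: rank_key(s, outlier_filtered))
    return {s["s"]: rank for rank, s in enumerate(ordered)}


# Hypothetical sample records.
samples = [
    {"s": "A", "is_v3": True, "v3_release": True, "chr20_mean_dp": 30.0},
    {"s": "B", "is_v3": False, "v3_release": False, "chr20_mean_dp": 40.0},
    {"s": "C", "is_v3": True, "v3_release": False, "chr20_mean_dp": 50.0},
    {"s": "D", "is_v3": False, "v3_release": False, "chr20_mean_dp": 45.0},
]
print(rank_samples(samples))
print(rank_samples(samples, outlier_filtered={"A"}))
```

Putting the outlier flag first in the tuple means a filtered sample ranks below every unfiltered one regardless of its v3 release status or depth.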
- gnomad_qc.v4.sample_qc.relatedness.run_compute_related_samples_to_drop(ht, meta_ht, release=False, filter_ht=None)
Determine the minimal set of related samples to prune for ancestry PCA or release.
Runs compute_related_samples_to_drop in gnomad_methods after computing the sample rankings using compute_rank_ht.
When release is True, filter_ht is used to rank samples by the ‘outlier_filtered’ annotation, favoring samples that are not filtered. Samples marked ‘outlier_filtered’ in filter_ht are also included in the returned samples_to_drop_ht Table.
- Parameters:
ht (Table) – Input relatedness Table.
meta_ht (Table) – Metadata Table with v3_meta.v3_release, releasable, chr20_mean_dp annotations to be used in ranking and filtering of related individuals.
release (bool) – Whether to determine related samples to drop for the release based on outlier filtering of sample QC metrics. filter_ht must be supplied.
filter_ht (Optional[Table]) – Optional Table with outlier filtering of sample QC metrics to use if release is True.
- Return type:
Tuple[Table, Table]
- Returns:
Table with sample rank and a Table with related samples to drop.
- gnomad_qc.v4.sample_qc.relatedness.get_relatedness_resources(test, release, relatedness_method, overwrite)
Get PipelineResourceCollection for all resources needed in the relatedness pipeline.
- Parameters:
test (bool) – Whether to gather all resources for the test dataset.
release (bool) – Whether to get release resource for ‘compute-related-samples-to-drop’ step of the relatedness pipeline.
relatedness_method (str) – The relatedness method to use for ‘finalize-relatedness-ht’ resources.
overwrite (bool) – Whether to overwrite resources if they exist.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the relatedness inference pipeline.