gnomad.sample_qc.relatedness
UNRELATED | String representation for a pair of unrelated individuals in this module.
SECOND_DEGREE_RELATIVES | String representation for a pair of 2nd degree relatives in this module.
PARENT_CHILD | String representation for a parent-child pair in this module.
SIBLINGS | String representation for a sibling pair in this module.
DUPLICATE_OR_TWINS | String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.
AMBIGUOUS_RELATIONSHIP | String representation for a pair of samples whose relationship is ambiguous.
get_duplicated_samples | Extract the list of duplicate samples using a Table output from pc_relate.
get_duplicated_samples_ht | Create a HT with duplicated samples sets.
explode_duplicate_samples_ht | Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.
get_relationship_expr | Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.
get_slope_int_relationship_expr | Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.
infer_families | Generate a pedigree containing trios inferred from the relationship_ht.
create_fake_pedigree | Generate a pedigree made of trios created by sampling 3 random samples in the sample list.
compute_related_samples_to_drop | Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.
filter_to_trios | Filter a Table, MatrixTable or VariantDataset to a set of trios in fam_ht.
generate_trio_stats_expr | Generate a row-wise expression containing trio transmission stats.
generate_sib_stats_expr | Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.
- gnomad.sample_qc.relatedness.UNRELATED = 'unrelated'
String representation for a pair of unrelated individuals in this module. Typically >2nd degree relatives, but the threshold is user-dependent.
- gnomad.sample_qc.relatedness.SECOND_DEGREE_RELATIVES = 'second degree relatives'
String representation for a pair of 2nd degree relatives in this module.
- gnomad.sample_qc.relatedness.PARENT_CHILD = 'parent-child'
String representation for a parent-child pair in this module.
- gnomad.sample_qc.relatedness.SIBLINGS = 'siblings'
String representation for a sibling pair in this module.
- gnomad.sample_qc.relatedness.DUPLICATE_OR_TWINS = 'duplicate/twins'
String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.
- gnomad.sample_qc.relatedness.AMBIGUOUS_RELATIONSHIP = 'ambiguous'
String representation for a pair of samples whose relationship is ambiguous. This is used for a pair of samples whose kinship/IBD values do not correspond to any biological relationship between two individuals.
- gnomad.sample_qc.relatedness.get_duplicated_samples(relationship_ht, i_col='i', j_col='j', rel_col='relationship')
Extract the list of duplicate samples using a Table output from pc_relate.
- Parameters:
relationship_ht (Table) – Table with relationships between pairs of samples
i_col (str) – Column containing the 1st sample
j_col (str) – Column containing the 2nd sample
rel_col (str) – Column containing the sample pair relationship annotated with get_relationship_expr
- Return type:
List[Set[str]]
- Returns:
List of sets of samples that are duplicates
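The grouping of duplicate pairs into sets can be illustrated without Hail. The following is a minimal pure-Python sketch, not the library's implementation: it assumes the pairs have already been collected as (i, j, relationship) tuples and that duplicates are grouped transitively. The helper name group_duplicates is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

DUPLICATE_OR_TWINS = "duplicate/twins"


def group_duplicates(pairs: Iterable[Tuple[str, str, str]]) -> List[Set[str]]:
    """Merge duplicate pairs into sets of samples that are duplicates of each other.

    Hypothetical helper: assumes transitive grouping (A~B and B~C puts A, B, C together).
    """
    groups: List[Set[str]] = []
    for i, j, rel in pairs:
        if rel != DUPLICATE_OR_TWINS:
            continue  # Only duplicate/twin pairs contribute to the groups.
        # Merge the new pair with every existing group that shares a sample with it.
        touched = [g for g in groups if i in g or j in g]
        merged = {i, j}.union(*touched)
        # Groups that were merged overlap with `merged`; untouched groups are disjoint.
        groups = [g for g in groups if g.isdisjoint(merged)] + [merged]
    return groups


pairs = [
    ("A", "B", DUPLICATE_OR_TWINS),
    ("C", "D", DUPLICATE_OR_TWINS),
    ("B", "C", DUPLICATE_OR_TWINS),
    ("E", "F", "siblings"),
]
duplicate_sets = group_duplicates(pairs)
```

Here the third pair links the two earlier groups, so all four samples end up in a single set, while the sibling pair is ignored.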
- gnomad.sample_qc.relatedness.get_duplicated_samples_ht(duplicated_samples, samples_rankings_ht, rank_ann='rank')
Create a HT with duplicated samples sets.
Each row is indexed by the sample that is kept and also contains the set of duplicate samples that should be filtered.
samples_rankings_ht is a HT containing a global rank for each of the samples (smaller is better).
- Parameters:
duplicated_samples (List[Set[str]]) – List of sets of duplicated samples
samples_rankings_ht (Table) – HT with global rank for each sample
rank_ann (str) – Annotation in samples_rankings_ht containing each sample's global rank (smaller is better).
- Returns:
HT with duplicate sample sets, including which to keep/filter
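The keep/filter decision described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: ranks are given as a dict rather than a Hail Table, ties are broken by sample name, and the helper name pick_kept_samples is hypothetical.

```python
from typing import Dict, List, Set


def pick_kept_samples(
    duplicated_samples: List[Set[str]], ranks: Dict[str, int]
) -> Dict[str, Set[str]]:
    """Map the kept sample of each duplicate set to the samples that should be filtered.

    Hypothetical helper: the sample with the smallest rank is kept (smaller is better).
    """
    kept_to_filtered: Dict[str, Set[str]] = {}
    for dup_set in duplicated_samples:
        # Sort by rank, then by name so that ties are resolved deterministically.
        ordered = sorted(dup_set, key=lambda s: (ranks[s], s))
        kept_to_filtered[ordered[0]] = set(ordered[1:])
    return kept_to_filtered


kept = pick_kept_samples([{"A", "B", "C"}], {"A": 3, "B": 1, "C": 2})
```

In this example B has the best rank, so it indexes the row and A and C form the set to filter.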
- gnomad.sample_qc.relatedness.explode_duplicate_samples_ht(dups_ht)
Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.
An additional annotation, dup_filtered, is added to indicate which of the duplicated samples was kept. Requires a field filtered whose type is the same as the key of the input duplicated samples Table.
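The explode step can be pictured as follows. This is a minimal pure-Python sketch that assumes the input is a dict from the kept sample to its filtered duplicates; the polarity of dup_filtered (False for the kept sample, True for the filtered ones) and the output field name s are assumptions made for illustration.

```python
from typing import Dict, List, Set


def explode_duplicates(kept_to_filtered: Dict[str, Set[str]]) -> List[dict]:
    """Turn one row per duplicate set into one row per sample.

    Hypothetical helper: `dup_filtered` is assumed to be False for the kept sample
    and True for every sample that should be filtered.
    """
    rows: List[dict] = []
    for kept, filtered in kept_to_filtered.items():
        rows.append({"s": kept, "dup_filtered": False})
        for sample in sorted(filtered):
            rows.append({"s": sample, "dup_filtered": True})
    return rows


rows = explode_duplicates({"B": {"A", "C"}})
```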
- gnomad.sample_qc.relatedness.get_relationship_expr(kin_expr, ibd0_expr, ibd1_expr, ibd2_expr, first_degree_kin_thresholds=(0.19, 0.4), second_degree_min_kin=0.1, ibd0_0_max=0.025, ibd0_25_thresholds=(0.1, 0.425), ibd1_0_thresholds=(-0.15, 0.1), ibd1_50_thresholds=(0.275, 0.75), ibd1_100_min=0.75, ibd2_0_max=0.125, ibd2_25_thresholds=(0.1, 0.5), ibd2_100_thresholds=(0.75, 1.25))
Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.
The kinship coefficient values in the defaults are in line with those output from hail.methods.pc_relate (https://hail.is/docs/0.2/methods/genetics.html?highlight=pc_relate#hail.methods.pc_relate).
- Parameters:
kin_expr (NumericExpression) – Kin coefficient expression
ibd0_expr (NumericExpression) – IBD0 expression
ibd1_expr (NumericExpression) – IBD1 expression
ibd2_expr (NumericExpression) – IBD2 expression
first_degree_kin_thresholds (Tuple[float, float]) – (min, max) kinship threshold for 1st degree relatives
second_degree_min_kin (float) – min kinship threshold for 2nd degree relatives
ibd0_0_max (float) – max IBD0 threshold for 0 IBD0 sharing
ibd0_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD0 sharing
ibd1_0_thresholds (Tuple[float, float]) – (min, max) thresholds for 0 IBD1 sharing. Note that the min is there because pc_relate can output large negative values in some corner cases.
ibd1_50_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.5 IBD1 sharing
ibd1_100_min (float) – min IBD1 threshold for 1.0 IBD1 sharing
ibd2_0_max (float) – max IBD2 threshold for 0 IBD2 sharing
ibd2_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD2 sharing
ibd2_100_thresholds (Tuple[float, float]) – (min, max) thresholds for 1.00 IBD2 sharing. Note that the min is there because pc_relate can output much larger IBD2 values in some corner cases.
- Return type:
- Returns:
The relationship annotation using the constants defined in this module.
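To see how the thresholds might interact, here is a pure-Python sketch that works on plain floats instead of Hail expressions. The order in which the rules are checked is an assumption inferred from the parameter descriptions (parent-child: IBD0 near 0, IBD1 near 1, IBD2 near 0; siblings: 0.25 / 0.5 / 0.25; duplicates: 0 / 0 / 1), not a copy of the library's logic. The function name classify_ibd is hypothetical.

```python
UNRELATED = "unrelated"
SECOND_DEGREE_RELATIVES = "second degree relatives"
PARENT_CHILD = "parent-child"
SIBLINGS = "siblings"
DUPLICATE_OR_TWINS = "duplicate/twins"
AMBIGUOUS_RELATIONSHIP = "ambiguous"


def in_range(value, bounds):
    """Return True if value lies within the inclusive (min, max) bounds."""
    return bounds[0] <= value <= bounds[1]


def classify_ibd(
    kin, ibd0, ibd1, ibd2,
    first_degree_kin_thresholds=(0.19, 0.4),
    second_degree_min_kin=0.1,
    ibd0_0_max=0.025,
    ibd0_25_thresholds=(0.1, 0.425),
    ibd1_0_thresholds=(-0.15, 0.1),
    ibd1_50_thresholds=(0.275, 0.75),
    ibd1_100_min=0.75,
    ibd2_0_max=0.125,
    ibd2_25_thresholds=(0.1, 0.5),
    ibd2_100_thresholds=(0.75, 1.25),
):
    """Assumed rule order; a sketch only, not the library implementation."""
    if kin < second_degree_min_kin:
        return UNRELATED
    if kin < first_degree_kin_thresholds[0]:
        return SECOND_DEGREE_RELATIVES
    if kin < first_degree_kin_thresholds[1]:
        # First degree relatives: separate parent-child from siblings using IBD sharing.
        if ibd0 <= ibd0_0_max and ibd1 >= ibd1_100_min and ibd2 <= ibd2_0_max:
            return PARENT_CHILD
        if (
            in_range(ibd0, ibd0_25_thresholds)
            and in_range(ibd1, ibd1_50_thresholds)
            and in_range(ibd2, ibd2_25_thresholds)
        ):
            return SIBLINGS
        return AMBIGUOUS_RELATIONSHIP
    # Kinship above the first degree range: candidate duplicates or MZ twins.
    if (
        ibd0 <= ibd0_0_max
        and in_range(ibd1, ibd1_0_thresholds)
        and in_range(ibd2, ibd2_100_thresholds)
    ):
        return DUPLICATE_OR_TWINS
    return AMBIGUOUS_RELATIONSHIP


parent_child = classify_ibd(0.25, 0.0, 1.0, 0.0)
siblings = classify_ibd(0.25, 0.25, 0.5, 0.25)
```

A pair with first degree kinship whose IBD values match neither pattern (for example IBD0 = 0.5) falls through to the ambiguous label in this sketch.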
- gnomad.sample_qc.relatedness.get_slope_int_relationship_expr(kin_expr, y_expr, parent_child_max_y, second_degree_sibling_lower_cutoff_slope, second_degree_sibling_lower_cutoff_intercept, second_degree_upper_sibling_lower_cutoff_slope, second_degree_upper_sibling_lower_cutoff_intercept, duplicate_twin_min_kin=0.42, second_degree_min_kin=0.08838835, duplicate_twin_ibd1_min=-0.15, duplicate_twin_ibd1_max=0.1, ibd1_expr=None)
Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.
The kinship coefficient (kin_expr) and an additional metric (y_expr) are used to define the relationship between a pair of samples. For this function, the slopes and intercepts should refer to cutoff lines where the x-axis, or independent variable, is the kinship coefficient and the y-axis, or dependent variable, is the metric defined by y_expr. Typically, the y-axis metric is IBS0, IBS0/IBS2, or IBD0.
Note
No defaults are provided for the slope and intercept cutoffs because they are highly dependent on the dataset and the metric used in y_expr.
- The relationship expression is determined as follows:
  - If kin_expr < second_degree_min_kin -> UNRELATED
  - If kin_expr > duplicate_twin_min_kin:
    - If y_expr < parent_child_max_y:
      - If ibd1_expr is defined:
        - If duplicate_twin_ibd1_min <= ibd1_expr <= duplicate_twin_ibd1_max -> DUPLICATE_OR_TWINS
        - Else -> AMBIGUOUS_RELATIONSHIP
      - Else -> DUPLICATE_OR_TWINS
  - If y_expr < parent_child_max_y -> PARENT_CHILD
  - If pair is over second_degree_sibling_lower_cutoff line:
    - If pair is over second_degree_upper_sibling_lower_cutoff line -> SIBLINGS
    - Else -> SECOND_DEGREE_RELATIVES
  - If none of the above conditions are met -> AMBIGUOUS_RELATIONSHIP
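The decision rules listed above can be traced on plain floats. The sketch below is a pure-Python rendering of those rules, not the Hail expression the function returns. It assumes that "over a line" means y > slope * kin + intercept, it uses the second_degree_min_kin value quoted in the parameter description, and the cutoff values in the usage lines are made up for illustration only. The name classify_pair is hypothetical.

```python
UNRELATED = "unrelated"
SECOND_DEGREE_RELATIVES = "second degree relatives"
PARENT_CHILD = "parent-child"
SIBLINGS = "siblings"
DUPLICATE_OR_TWINS = "duplicate/twins"
AMBIGUOUS_RELATIONSHIP = "ambiguous"


def classify_pair(
    kin, y, parent_child_max_y,
    lower_slope, lower_intercept,   # second_degree_sibling_lower_cutoff line
    upper_slope, upper_intercept,   # second_degree_upper_sibling_lower_cutoff line
    duplicate_twin_min_kin=0.42,
    second_degree_min_kin=0.08838835,
    duplicate_twin_ibd1_min=-0.15,
    duplicate_twin_ibd1_max=0.1,
    ibd1=None,
):
    """Follow the documented rule order on float inputs (sketch only)."""
    if kin < second_degree_min_kin:
        return UNRELATED
    if kin > duplicate_twin_min_kin and y < parent_child_max_y:
        if ibd1 is None:
            return DUPLICATE_OR_TWINS
        if duplicate_twin_ibd1_min <= ibd1 <= duplicate_twin_ibd1_max:
            return DUPLICATE_OR_TWINS
        return AMBIGUOUS_RELATIONSHIP
    if y < parent_child_max_y:
        return PARENT_CHILD
    # Assumption: a pair is "over" a cutoff line when y lies strictly above it.
    if y > lower_slope * kin + lower_intercept:
        if y > upper_slope * kin + upper_intercept:
            return SIBLINGS
        return SECOND_DEGREE_RELATIVES
    return AMBIGUOUS_RELATIONSHIP


# Hypothetical cutoffs chosen only to exercise every branch.
cutoffs = dict(
    parent_child_max_y=0.1,
    lower_slope=1.0, lower_intercept=0.0,
    upper_slope=2.0, upper_intercept=0.0,
)
sib = classify_pair(0.25, 0.6, **cutoffs)
second = classify_pair(0.25, 0.3, **cutoffs)
```

Because the cutoff lines depend heavily on the dataset and on the metric used for y, real values should come from inspecting the kinship versus y scatter plot rather than from this example.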
- Parameters:
kin_expr (NumericExpression) – Kin coefficient expression. Used as the x-axis, or independent variable, for the slope and intercept cutoffs.
y_expr (NumericExpression) – Expression for the metric to use as the y-axis, or dependent variable, for the slope and intercept cutoffs. This is typically an expression for IBS0, IBS0/IBS2, or IBD0.
parent_child_max_y (float) – Maximum value of the metric defined by y_expr for a parent-child pair.
second_degree_sibling_lower_cutoff_slope (float) – Slope of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
second_degree_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
second_degree_upper_sibling_lower_cutoff_slope (float) – Slope of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
second_degree_upper_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
duplicate_twin_min_kin (float) – Minimum kinship for duplicate or twin pairs. Default is 0.42.
second_degree_min_kin (float) – Minimum kinship threshold for 2nd degree relatives. Default is 0.08838835. Bycroft et al. (2018) calculates a theoretical kinship of 0.08838835 for a second degree relationship cutoff, but this cutoff should be determined by evaluation of the kinship distribution.
ibd1_expr (Optional[NumericExpression]) – Optional IBD1 expression. If this expression is provided, duplicate_twin_ibd1_min and duplicate_twin_ibd1_max will be used as an additional cutoff for duplicate or twin pairs.
duplicate_twin_ibd1_min (float) – Minimum IBD1 cutoff for duplicate or twin pairs. Note: the min is there because pc_relate can output large negative values in some corner cases.
duplicate_twin_ibd1_max (float) – Maximum IBD1 cutoff for duplicate or twin pairs.
- Returns:
The relationship annotation using the constants defined in this module.
- gnomad.sample_qc.relatedness.infer_families(relationship_ht, sex, duplicate_samples_ht, i_col='i', j_col='j', relationship_col='relationship')
Generate a pedigree containing trios inferred from the relationship_ht.
This function takes a hail Table with a row for each pair of related individuals i, j in the data (it’s OK to have unrelated samples too).
The relationship_col should be a column specifying the relationship between each pair of samples, as defined in this module’s constants.
This function returns a pedigree containing trios inferred from the data. Family ID can be the same for multiple trios if one or more members of the trios are related (e.g. sibs, multi-generational family). Trios are ordered by family ID.
Note
This function only returns complete trios defined as: one child, one father and one mother (sex is required for both parents).
- Parameters:
relationship_ht (Table) – Input relationship table
sex (Union[Table, Dict[str, bool]]) – A Table or dict giving the sex for each sample (TRUE=female, FALSE=male). If a Table is given, it should have a field is_female.
duplicate_samples_ht (Table) – All duplicated samples TO REMOVE (if not provided, this function won’t work, as it assumes that each child has exactly two parents)
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
relationship_col (str) – Column containing the relationship for the sample pair as defined in this module’s constants.
- Return type:
- Returns:
Pedigree of complete trios
- gnomad.sample_qc.relatedness.create_fake_pedigree(n, sample_list, exclude_real_probands=False, max_tries=10, real_pedigree=None, sample_list_stratification=None)
Generate a pedigree made of trios created by sampling 3 random samples in the sample list.
If real_pedigree is given, the resulting fake trios will not include any proband-parent combination that is present in the real pedigree.
Each sample can be used only once as a proband in the resulting trios.
Sex of probands in fake trios is random.
- Parameters:
n (int) – Number of fake trios desired in the pedigree.
sample_list (List[str]) – List of samples.
exclude_real_probands (bool) – If set, then fake trio probands cannot be among the real trio probands.
max_tries (int) – Maximum number of sampling attempts before bailing out (prevents an infinite loop if n is too large w.r.t. the number of samples).
real_pedigree (Optional[Pedigree]) – Optional pedigree to exclude children from.
sample_list_stratification (Optional[Dict[str, str]]) – Optional dictionary with samples as keys and a value that should be used to stratify samples in sample_list into groups that the trio should be picked from. This ensures that each fake trio contains samples from a single stratum. For example, if all samples within a fake trio should be chosen from the same platform, this can be a dictionary of sample: platform.
- Return type:
- Returns:
Fake pedigree.
- gnomad.sample_qc.relatedness.compute_related_samples_to_drop(relatedness_ht, rank_ht, kin_threshold, filtered_samples=None, min_related_hard_filter=None, keep_samples=None, keep_samples_when_related=False)
Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.
Note
relatedness_ht should be keyed by exactly two fields of the same type, identifying the pair of samples for each row.
rank_ht should be keyed by a single key of the same type as a single sample identifier in relatedness_ht.
- Parameters:
relatedness_ht (Table) – Relatedness HT, as produced by e.g. pc-relate
kin_threshold (float) – Kinship threshold to consider two samples as related
rank_ht (Table) – Table with a global rank for each sample (smaller is preferred)
filtered_samples (Optional[SetExpression]) – An optional set of samples to exclude (e.g. these samples were hard-filtered). These samples will then appear in the resulting samples to drop.
min_related_hard_filter (Optional[int]) – If provided, any sample that is related to more samples than this parameter will be filtered prior to computing the maximal independent set and appear in the results.
keep_samples (Optional[SetExpression]) – An optional set of samples that must be kept. An error is raised (when keep_samples_when_related is False) if any two samples in the list are among the related pairs.
keep_samples_when_related (bool) – Don’t raise an error if keep_samples contains related samples, and keep related samples. Default is False.
- Return type:
- Returns:
A Table with the list of the samples to drop along with their rank.
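The idea of dropping related samples until no related pair remains can be sketched in pure Python. This is a greedy heuristic written for illustration: it repeatedly removes the sample with the most remaining relatives and breaks ties by dropping the worse (larger) rank. Whether the library uses exactly this strategy, and whether the threshold comparison is strict, are assumptions here. The helper name samples_to_drop is hypothetical, and the optional filtering parameters are left out.

```python
from typing import Dict, Iterable, Set, Tuple


def samples_to_drop(
    pairs: Iterable[Tuple[str, str, float]],
    ranks: Dict[str, int],
    kin_threshold: float,
) -> Set[str]:
    """Greedy sketch: return samples to drop so that no related pair remains."""
    # Build an undirected graph with an edge for each pair above the threshold
    # (strict comparison is an assumption).
    neighbors: Dict[str, Set[str]] = {}
    for i, j, kin in pairs:
        if kin > kin_threshold:
            neighbors.setdefault(i, set()).add(j)
            neighbors.setdefault(j, set()).add(i)

    dropped: Set[str] = set()
    while any(neighbors.values()):
        # Pick the sample with the most remaining relatives; on ties, the largest rank.
        worst = max(neighbors, key=lambda s: (len(neighbors[s]), ranks[s]))
        for relative in neighbors.pop(worst):
            neighbors[relative].discard(worst)
        dropped.add(worst)
    return dropped


pairs = [("A", "B", 0.3), ("B", "C", 0.3), ("D", "E", 0.05)]
ranks = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
to_drop = samples_to_drop(pairs, ranks, kin_threshold=0.1)
```

In the example, B is related to both A and C, so removing B alone leaves an unrelated set; D and E fall below the threshold and are untouched.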
- gnomad.sample_qc.relatedness.filter_to_trios(mtds, fam_ht)
Filter a Table, MatrixTable or VariantDataset to a set of trios in fam_ht.
Note
Using filter_cols in MatrixTable will not affect the number of rows (variants), however, using filter_samples in VariantDataset will remove the variants that are not present in any of the trios.
- Parameters:
mtds (Union[Table, MatrixTable, VariantDataset]) – A Variant Dataset or a Matrix Table or a Table to filter to only trios.
fam_ht (Table) – A Table of trios to filter to, loaded using hl.import_fam.
- Return type:
Union[Table, MatrixTable, VariantDataset]
- Returns:
A Table, MatrixTable or VariantDataset with only the trios in fam_ht.
- gnomad.sample_qc.relatedness.generate_trio_stats_expr(trio_mt, transmitted_strata={'raw': True}, de_novo_strata={'raw': True}, ac_strata={'raw': True}, proband_is_female_expr=None)
Generate a row-wise expression containing trio transmission stats.
- The expression will generate the following counts:
Number of alleles in het parents transmitted to the proband
Number of alleles in het parents not transmitted to the proband
Number of de novo mutations
Parent allele count
Proband allele count
Transmission and de novo mutation metrics and allele counts can be stratified using additional filters. transmitted_strata, de_novo_strata, and ac_strata all expect a dictionary of filtering expressions keyed by their desired suffix to append for labeling. The default will perform counts using all genotypes and append ‘raw’ to the label.
Note
Expects that mt is dense. If dealing with a sparse MT, hl.experimental.densify must be run first.
- Parameters:
trio_mt (MatrixTable) – A standard trio MT (with the format as produced by hail.methods.trio_matrix)
transmitted_strata (Dict[str, BooleanExpression]) – Strata for the transmission counts
de_novo_strata (Dict[str, BooleanExpression]) – Strata for the de novo counts
ac_strata (Dict[str, BooleanExpression]) – Strata for the parent and child allele counts
proband_is_female_expr (Optional[BooleanExpression]) – An optional expression giving the sex of the proband. If not given, DNMs are only computed for autosomes.
- Return type:
- Returns:
An expression with the counts
- gnomad.sample_qc.relatedness.generate_sib_stats_expr(mt, sib_ht, i_col='i', j_col='j', strata={'raw': True}, is_female=None)
Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.
The sibling sharing counts can be stratified with additional filters using strata.
Note
This function expects that the mt has either been split or filtered to only bi-allelics. If a sample has multiple sibling pairs, only one pair will be counted.
- Parameters:
mt (MatrixTable) – Input matrix table
sib_ht (Table) – Table defining sibling pairs with one sample in a col (i_col) and the second in another col (j_col)
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
strata (Dict[str, BooleanExpression]) – Dict with additional strata to use when computing shared sibling variant counts
is_female (Optional[BooleanExpression]) – An optional column in mt giving the sample sex. If not given, counts are only computed for autosomes.
- Return type:
- Returns:
An expression with the sibling shared variant counts