gnomad.sample_qc.relatedness

gnomad.sample_qc.relatedness.UNRELATED

String representation for a pair of unrelated individuals in this module.

gnomad.sample_qc.relatedness.SECOND_DEGREE_RELATIVES

String representation for a pair of 2nd degree relatives in this module.

gnomad.sample_qc.relatedness.PARENT_CHILD

String representation for a parent-child pair in this module.

gnomad.sample_qc.relatedness.SIBLINGS

String representation for a sibling pair in this module.

gnomad.sample_qc.relatedness.DUPLICATE_OR_TWINS

String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.

gnomad.sample_qc.relatedness.AMBIGUOUS_RELATIONSHIP

String representation for a pair of samples whose relationship is ambiguous.

gnomad.sample_qc.relatedness.get_duplicated_samples(...)

Extract the list of duplicate samples using a Table output from pc_relate.

gnomad.sample_qc.relatedness.get_duplicated_samples_ht(...)

Create a HT with duplicated samples sets.

gnomad.sample_qc.relatedness.explode_duplicate_samples_ht(dups_ht)

Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.

gnomad.sample_qc.relatedness.get_relationship_expr(...)

Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.

gnomad.sample_qc.relatedness.get_slope_int_relationship_expr(...)

Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.

gnomad.sample_qc.relatedness.infer_families(...)

Generate a pedigree containing trios inferred from the relationship_ht.

gnomad.sample_qc.relatedness.create_fake_pedigree(n, ...)

Generate a pedigree made of trios created by sampling 3 random samples in the sample list.

gnomad.sample_qc.relatedness.compute_related_samples_to_drop(...)

Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.

gnomad.sample_qc.relatedness.filter_to_trios(...)

Filter a Matrix Table or a Variant Dataset to a set of trios in fam_ht.

gnomad.sample_qc.relatedness.generate_trio_stats_expr(trio_mt)

Generate a row-wise expression containing trio transmission stats.

gnomad.sample_qc.relatedness.generate_sib_stats_expr(mt, ...)

Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.

gnomad.sample_qc.relatedness.UNRELATED = 'unrelated'

String representation for a pair of unrelated individuals in this module. Typically more distant than 2nd degree relatives, but the threshold is user-dependent.

gnomad.sample_qc.relatedness.SECOND_DEGREE_RELATIVES = 'second degree relatives'

String representation for a pair of 2nd degree relatives in this module.

gnomad.sample_qc.relatedness.PARENT_CHILD = 'parent-child'

String representation for a parent-child pair in this module.

gnomad.sample_qc.relatedness.SIBLINGS = 'siblings'

String representation for a sibling pair in this module.

gnomad.sample_qc.relatedness.DUPLICATE_OR_TWINS = 'duplicate/twins'

String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.

gnomad.sample_qc.relatedness.AMBIGUOUS_RELATIONSHIP = 'ambiguous'

String representation for a pair of samples whose relationship is ambiguous. This is used for a pair of samples whose kinship/IBD values do not correspond to any biological relationship between two individuals.

gnomad.sample_qc.relatedness.get_duplicated_samples(relationship_ht, i_col='i', j_col='j', rel_col='relationship')[source]

Extract the list of duplicate samples using a Table output from pc_relate.

Parameters:
  • relationship_ht (Table) – Table with relationships between pairs of samples

  • i_col (str) – Column containing the 1st sample

  • j_col (str) – Column containing the 2nd sample

  • rel_col (str) – Column containing the sample pair relationship annotated with get_relationship_expr

Return type:

List[Set[str]]

Returns:

List of sets of samples that are duplicates
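The grouping idea can be sketched in plain Python, assuming that duplicate pairs are merged transitively into sets. The helper name group_duplicates and the tuple-based input are hypothetical stand-ins for rows of relationship_ht; this is an illustration, not the library's implementation, which operates on a Hail Table.

```python
from typing import Iterable, List, Set, Tuple

DUPLICATE_OR_TWINS = "duplicate/twins"


def group_duplicates(pairs: Iterable[Tuple[str, str, str]]) -> List[Set[str]]:
    """Merge (i, j, relationship) rows labelled as duplicates into sample sets.

    Hypothetical helper: assumes duplicates are transitive (a~b and b~c => {a, b, c}).
    """
    groups: List[Set[str]] = []
    for i, j, rel in pairs:
        if rel != DUPLICATE_OR_TWINS:
            continue
        # Collect every existing group touching either sample, then merge them.
        touching = [g for g in groups if i in g or j in g]
        merged = {i, j}.union(*touching)
        groups = [g for g in groups if g not in touching]
        groups.append(merged)
    return groups


rows = [
    ("s1", "s2", "duplicate/twins"),
    ("s2", "s3", "duplicate/twins"),
    ("s4", "s5", "siblings"),
    ("s6", "s7", "duplicate/twins"),
]
dup_sets = group_duplicates(rows)
```

Here s1, s2 and s3 end up in one set because they are linked through s2, while the sibling pair is ignored.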

gnomad.sample_qc.relatedness.get_duplicated_samples_ht(duplicated_samples, samples_rankings_ht, rank_ann='rank')[source]

Create a HT with duplicated samples sets.

Each row is indexed by the sample that is kept and also contains the set of duplicate samples that should be filtered.

samples_rankings_ht is a HT containing a global rank for each of the samples (smaller is better).

Parameters:
  • duplicated_samples (List[Set[str]]) – List of sets of duplicated samples

  • samples_rankings_ht (Table) – HT with global rank for each sample

  • rank_ann (str) – Annotation in samples_rankings_ht containing each sample's global rank (smaller is better).

Returns:

HT with duplicate sample sets, including which to keep/filter
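As a rough plain-Python sketch of the keep/filter split described above: within each duplicate set, the sample with the smallest rank is kept and the others are marked for filtering. The function name pick_kept_sample and the dict-based rank lookup are hypothetical; the real function works on Hail Tables.

```python
from typing import Dict, List, Set


def pick_kept_sample(
    duplicated_samples: List[Set[str]], rank: Dict[str, int]
) -> Dict[str, List[str]]:
    """Map the best-ranked sample of each set to the duplicates to filter.

    Hypothetical helper; `rank` plays the role of samples_rankings_ht[rank_ann].
    """
    result: Dict[str, List[str]] = {}
    for dup_set in duplicated_samples:
        ordered = sorted(dup_set, key=lambda s: rank[s])  # smaller rank is better
        result[ordered[0]] = ordered[1:]
    return result


kept = pick_kept_sample(
    [{"s1", "s2", "s3"}, {"s6", "s7"}],
    {"s1": 5, "s2": 1, "s3": 9, "s6": 2, "s7": 4},
)
```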

gnomad.sample_qc.relatedness.explode_duplicate_samples_ht(dups_ht)[source]

Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.

An additional annotation, dup_filtered, is added to indicate which of the duplicated samples was kept. Requires a field filtered, whose element type should be the same as the key of the input duplicated samples Table.

Parameters:

dups_ht (Table) – Input HT

Return type:

Table

Returns:

Flattened HT
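The flattening step can be pictured with a small plain-Python sketch. It assumes that dup_filtered is False for the kept sample and True for the filtered ones; the row layout and the helper name explode_duplicates are hypothetical.

```python
from typing import Dict, List


def explode_duplicates(kept_to_filtered: Dict[str, List[str]]) -> List[dict]:
    """Turn {kept: [filtered, ...]} into one row per sample.

    Assumption: dup_filtered is True for samples that were filtered out.
    """
    rows: List[dict] = []
    for kept, filtered in kept_to_filtered.items():
        rows.append({"s": kept, "dup_filtered": False})
        rows.extend({"s": s, "dup_filtered": True} for s in filtered)
    return rows


flat = explode_duplicates({"s2": ["s1", "s3"], "s6": ["s7"]})
```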

gnomad.sample_qc.relatedness.get_relationship_expr(kin_expr, ibd0_expr, ibd1_expr, ibd2_expr, first_degree_kin_thresholds=(0.19, 0.4), second_degree_min_kin=0.1, ibd0_0_max=0.025, ibd0_25_thresholds=(0.1, 0.425), ibd1_0_thresholds=(-0.15, 0.1), ibd1_50_thresholds=(0.275, 0.75), ibd1_100_min=0.75, ibd2_0_max=0.125, ibd2_25_thresholds=(0.1, 0.5), ibd2_100_thresholds=(0.75, 1.25))[source]

Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.

The kinship coefficient values in the defaults are in line with those output from hail.methods.pc_relate <https://hail.is/docs/0.2/methods/genetics.html?highlight=pc_relate#hail.methods.pc_relate>.

Parameters:
  • kin_expr (NumericExpression) – Kin coefficient expression

  • ibd0_expr (NumericExpression) – IBD0 expression

  • ibd1_expr (NumericExpression) – IBD1 expression

  • ibd2_expr (NumericExpression) – IBD2 expression

  • first_degree_kin_thresholds (Tuple[float, float]) – (min, max) kinship threshold for 1st degree relatives

  • second_degree_min_kin (float) – min kinship threshold for 2nd degree relatives

  • ibd0_0_max (float) – max IBD0 threshold for 0 IBD0 sharing

  • ibd0_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD0 sharing

  • ibd1_0_thresholds (Tuple[float, float]) – (min, max) thresholds for 0 IBD1 sharing. Note that the min is there because pc_relate can output large negative values in some corner cases.

  • ibd1_50_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.5 IBD1 sharing

  • ibd1_100_min (float) – min IBD1 threshold for 1.0 IBD1 sharing

  • ibd2_0_max (float) – max IBD2 threshold for 0 IBD2 sharing

  • ibd2_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD2 sharing

  • ibd2_100_thresholds (Tuple[float, float]) – (min, max) thresholds for 1.00 IBD2 sharing. Note that the max is there because pc_relate can output much larger IBD2 values in some corner cases.

Return type:

StringExpression

Returns:

The relationship annotation using the constants defined in this module.
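To make the thresholds concrete, here is a scalar plain-Python sketch of one plausible way they could combine. The order of the checks and the use of inclusive bounds are assumptions, since they are not stated above, and the library itself builds a Hail expression rather than evaluating Python floats. The function name classify_pair is hypothetical.

```python
def classify_pair(
    kin, ibd0, ibd1, ibd2,
    first_degree_kin_thresholds=(0.19, 0.4),
    second_degree_min_kin=0.1,
    ibd0_0_max=0.025,
    ibd0_25_thresholds=(0.1, 0.425),
    ibd1_0_thresholds=(-0.15, 0.1),
    ibd1_50_thresholds=(0.275, 0.75),
    ibd1_100_min=0.75,
    ibd2_0_max=0.125,
    ibd2_25_thresholds=(0.1, 0.5),
    ibd2_100_thresholds=(0.75, 1.25),
):
    """Hypothetical scalar classifier; check order is an assumption."""

    def within(x, bounds):
        return bounds[0] <= x <= bounds[1]

    if kin < second_degree_min_kin:
        return "unrelated"
    if kin < first_degree_kin_thresholds[0]:
        return "second degree relatives"
    if kin < first_degree_kin_thresholds[1]:
        # Parent-child: IBD0 ~ 0, IBD1 ~ 1, IBD2 ~ 0.
        if ibd0 <= ibd0_0_max and ibd1 >= ibd1_100_min and ibd2 <= ibd2_0_max:
            return "parent-child"
        # Siblings: IBD0 ~ 0.25, IBD1 ~ 0.5, IBD2 ~ 0.25.
        if (
            within(ibd0, ibd0_25_thresholds)
            and within(ibd1, ibd1_50_thresholds)
            and within(ibd2, ibd2_25_thresholds)
        ):
            return "siblings"
        return "ambiguous"
    # Duplicates/twins: IBD0 ~ 0, IBD1 ~ 0, IBD2 ~ 1.
    if (
        ibd0 <= ibd0_0_max
        and within(ibd1, ibd1_0_thresholds)
        and within(ibd2, ibd2_100_thresholds)
    ):
        return "duplicate/twins"
    return "ambiguous"
```

With the default thresholds, textbook IBD values such as (kin=0.25, IBD0=0, IBD1=1, IBD2=0) fall in the parent-child branch, and (kin=0.25, IBD0=0.25, IBD1=0.5, IBD2=0.25) in the sibling branch.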

gnomad.sample_qc.relatedness.get_slope_int_relationship_expr(kin_expr, y_expr, parent_child_max_y, second_degree_sibling_lower_cutoff_slope, second_degree_sibling_lower_cutoff_intercept, second_degree_upper_sibling_lower_cutoff_slope, second_degree_upper_sibling_lower_cutoff_intercept, duplicate_twin_min_kin=0.42, second_degree_min_kin=0.1, duplicate_twin_ibd1_min=-0.15, duplicate_twin_ibd1_max=0.1, ibd1_expr=None)[source]

Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.

The kinship coefficient (kin_expr) and an additional metric (y_expr) are used to define the relationship between a pair of samples. For this function, the slopes and intercepts should refer to cutoff lines where the x-axis, or independent variable, is the kinship coefficient and the y-axis, or dependent variable, is the metric defined by y_expr. Typically, the y-axis metric is IBS0, IBS0/IBS2, or IBD0.

Note

No defaults are provided for the slope and intercept cutoffs because they are highly dependent on the dataset and the metric used in y_expr.

The relationship expression is determined as follows:
  • If kin_expr < second_degree_min_kin -> UNRELATED

  • If kin_expr > duplicate_twin_min_kin:
    • If y_expr < parent_child_max_y:
      • If ibd1_expr is defined:
        • If duplicate_twin_ibd1_min <= ibd1_expr <= duplicate_twin_ibd1_max -> DUPLICATE_OR_TWINS

        • Else -> AMBIGUOUS_RELATIONSHIP

      • Else -> DUPLICATE_OR_TWINS

  • If y_expr < parent_child_max_y -> PARENT_CHILD

  • If pair is over second_degree_sibling_lower_cutoff line:
    • If pair is over second_degree_upper_sibling_lower_cutoff line -> SIBLINGS

    • Else -> SECOND_DEGREE_RELATIVES

  • If none of the above conditions are met -> AMBIGUOUS_RELATIONSHIP
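The decision steps above can be sketched as a scalar plain-Python function. Two points are assumptions: "over a line" is read as y > slope * kin + intercept, and a pair with kin above duplicate_twin_min_kin but y at or above parent_child_max_y falls through to the later checks. The function name and the shortened parameter names (lower_*, upper_*) are hypothetical.

```python
def classify_slope_int(
    kin, y, parent_child_max_y,
    lower_slope, lower_intercept,   # second_degree_sibling_lower_cutoff_*
    upper_slope, upper_intercept,   # second_degree_upper_sibling_lower_cutoff_*
    duplicate_twin_min_kin=0.42,
    second_degree_min_kin=0.1,
    duplicate_twin_ibd1_min=-0.15,
    duplicate_twin_ibd1_max=0.1,
    ibd1=None,
):
    """Hypothetical scalar version of the documented decision steps."""
    if kin < second_degree_min_kin:
        return "unrelated"
    if kin > duplicate_twin_min_kin and y < parent_child_max_y:
        # The IBD1 check only applies when an IBD1 value is supplied.
        if ibd1 is None or duplicate_twin_ibd1_min <= ibd1 <= duplicate_twin_ibd1_max:
            return "duplicate/twins"
        return "ambiguous"
    if y < parent_child_max_y:
        return "parent-child"
    # Assumption: "over the line" means strictly above it on the y-axis.
    if y > lower_slope * kin + lower_intercept:
        if y > upper_slope * kin + upper_intercept:
            return "siblings"
        return "second degree relatives"
    return "ambiguous"


# Illustrative cutoffs only; real values depend on the dataset and on y_expr.
cutoffs = dict(
    parent_child_max_y=0.05,
    lower_slope=-2.0, lower_intercept=0.45,
    upper_slope=-10.0, upper_intercept=2.25,
)
labels = [
    classify_slope_int(0.05, 0.9, **cutoffs),
    classify_slope_int(0.5, 0.0, **cutoffs),
    classify_slope_int(0.25, 0.0, **cutoffs),
    classify_slope_int(0.25, 0.25, **cutoffs),
    classify_slope_int(0.125, 0.5, **cutoffs),
]
```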

Parameters:
  • kin_expr (NumericExpression) – Kin coefficient expression. Used as the x-axis, or independent variable, for the slope and intercept cutoffs.

  • y_expr (NumericExpression) – Expression for the metric to use as the y-axis, or dependent variable, for the slope and intercept cutoffs. This is typically an expression for IBS0, IBS0/IBS2, or IBD0.

  • parent_child_max_y (float) – Maximum value of the metric defined by y_expr for a parent-child pair.

  • second_degree_sibling_lower_cutoff_slope (float) – Slope of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.

  • second_degree_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.

  • second_degree_upper_sibling_lower_cutoff_slope (float) – Slope of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.

  • second_degree_upper_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.

  • duplicate_twin_min_kin (float) – Minimum kinship for duplicate or twin pairs. Default is 0.42.

  • second_degree_min_kin (float) – Minimum kinship threshold for 2nd degree relatives. Default is 0.08838835. Bycroft et al. (2018) calculate a theoretical kinship of 0.08838835 for a second degree relationship cutoff, but this cutoff should be determined by evaluating the kinship distribution.

  • ibd1_expr (Optional[NumericExpression]) – Optional IBD1 expression. If this expression is provided, duplicate_twin_ibd1_min and duplicate_twin_ibd1_max will be used as an additional cutoff for duplicate or twin pairs.

  • duplicate_twin_ibd1_min (float) – Minimum IBD1 cutoff for duplicate or twin pairs. Note: the min is because pc_relate can output large negative values in some corner cases.

  • duplicate_twin_ibd1_max (float) – Maximum IBD1 cutoff for duplicate or twin pairs.

Returns:

The relationship annotation using the constants defined in this module.

gnomad.sample_qc.relatedness.infer_families(relationship_ht, sex, duplicate_samples_ht, i_col='i', j_col='j', relationship_col='relationship')[source]

Generate a pedigree containing trios inferred from the relationship_ht.

This function takes a hail Table with a row for each pair of related individuals i, j in the data (it’s OK to have unrelated samples too).

The relationship_col should be a column specifying the relationship between each pair of samples, as defined in this module’s constants.

This function returns a pedigree containing trios inferred from the data. Family ID can be the same for multiple trios if one or more members of the trios are related (e.g. sibs, multi-generational family). Trios are ordered by family ID.

Note

This function only returns complete trios defined as: one child, one father and one mother (sex is required for both parents).

Parameters:
  • relationship_ht (Table) – Input relationship table

  • sex (Union[Table, Dict[str, bool]]) – A Table or dict giving the sex for each sample (True = female, False = male). If a Table is given, it should have a field is_female.

  • duplicate_samples_ht (Table) – All duplicated samples TO REMOVE (if not provided, this function won’t work, as it assumes that each child has exactly two parents)

  • i_col (str) – Column containing the 1st sample of the pair in the relationship table

  • j_col (str) – Column containing the 2nd sample of the pair in the relationship table

  • relationship_col (str) – Column containing the relationship for the sample pair, as defined in this module’s constants.

Return type:

Pedigree

Returns:

Pedigree of complete trios
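For intuition only, the following plain-Python sketch shows a simplified heuristic for picking complete trios out of pairwise relationships: a sample is treated as a child when it has two parent-child relatives of opposite sex who are not themselves listed as related. This is an illustrative assumption and not the library's algorithm; the function name infer_trios and the tuple-based input are hypothetical, and duplicates are assumed to have been removed already.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def infer_trios(
    pairs: List[Tuple[str, str, str]], is_female: Dict[str, bool]
) -> List[Tuple[str, str, str]]:
    """Return (child, father, mother) tuples from (i, j, relationship) rows."""
    related: Set[frozenset] = {frozenset((i, j)) for i, j, _ in pairs}
    parent_child: Dict[str, Set[str]] = {}
    for i, j, rel in pairs:
        if rel == "parent-child":
            parent_child.setdefault(i, set()).add(j)
            parent_child.setdefault(j, set()).add(i)

    trios = []
    for child, candidates in sorted(parent_child.items()):
        for a, b in combinations(sorted(candidates), 2):
            if a not in is_female or b not in is_female:
                continue  # sex is required for both parents
            if is_female[a] == is_female[b]:
                continue  # need one father and one mother
            if frozenset((a, b)) in related:
                continue  # e.g. two siblings of `child`'s own children
            father, mother = (b, a) if is_female[a] else (a, b)
            trios.append((child, father, mother))
    return trios


pairs = [
    ("kid", "dad", "parent-child"),
    ("kid", "mom", "parent-child"),
    ("sib", "dad", "parent-child"),
    ("sib", "mom", "parent-child"),
    ("kid", "sib", "siblings"),
]
sex = {"dad": False, "mom": True, "kid": True, "sib": False}
trios = infer_trios(pairs, sex)
```

In this example dad has two parent-child relatives of opposite sex (kid and sib), but they are siblings of each other, so they are not mistaken for his parents.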

gnomad.sample_qc.relatedness.create_fake_pedigree(n, sample_list, exclude_real_probands=False, max_tries=10, real_pedigree=None, sample_list_stratification=None)[source]

Generate a pedigree made of trios created by sampling 3 random samples in the sample list.

  • If real_pedigree is given, the resulting fake trios will not include any proband-parent combination that is present in the real trios.

  • Each sample can be used only once as a proband in the resulting trios.

  • Sex of probands in fake trios is random.

Parameters:
  • n (int) – Number of fake trios desired in the pedigree.

  • sample_list (List[str]) – List of samples.

  • exclude_real_probands (bool) – If set, probands of the fake trios cannot be probands in the real trios.

  • max_tries (int) – Maximum number of sampling attempts before bailing out (prevents an infinite loop if n is too large w.r.t. the number of samples).

  • real_pedigree (Optional[Pedigree]) – Optional pedigree to exclude children from.

  • sample_list_stratification (Optional[Dict[str, str]]) – Optional dictionary with samples as keys and a value that should be used to stratify samples in sample_list into groups that the trio should be picked from. This ensures that each fake trio will contain samples from only the same stratification. For example, if all samples within a fake trio should be chosen from the same platform, this can be a dictionary of sample: platform.

Return type:

Pedigree

Returns:

Fake pedigree.
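A stripped-down plain-Python sketch of the sampling idea, assuming that each attempt draws three distinct samples and that a proband is used at most once. The function name, the seed argument, and the interpretation of max_tries as a count of individual draws are hypothetical; the real function returns a Hail Pedigree and supports the extra options listed above.

```python
import random
from typing import List, Tuple


def sample_fake_trios(
    n: int, sample_list: List[str], max_tries: int = 10, seed: int = 0
) -> List[Tuple[str, str, str]]:
    """Draw up to n (proband, father, mother) triples of distinct samples."""
    rng = random.Random(seed)
    used_probands = set()
    trios: List[Tuple[str, str, str]] = []
    tries = 0
    # Assumption: max_tries bounds the number of draws, so the loop always ends.
    while len(trios) < n and tries < max_tries:
        tries += 1
        proband, father, mother = rng.sample(sample_list, 3)
        if proband in used_probands:
            continue  # each sample may be a proband only once
        used_probands.add(proband)
        trios.append((proband, father, mother))
    return trios


samples = [f"s{i}" for i in range(20)]
fake = sample_fake_trios(5, samples, max_tries=1000)
```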

gnomad.sample_qc.relatedness.compute_related_samples_to_drop(...)

Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.

Note

  • relatedness_ht should be keyed by exactly two fields of the same type, identifying the pair of samples for each row.

  • rank_ht should be keyed by a single key of the same type as a single sample identifier in relatedness_ht.

Parameters:
  • relatedness_ht (Table) – relatedness HT, as produced by e.g. pc-relate

  • kin_threshold (float) – Kinship threshold to consider two samples as related

  • rank_ht (Table) – Table with a global rank for each sample (smaller is preferred)

  • filtered_samples (Optional[SetExpression]) – An optional set of samples to exclude (e.g. samples that were hard-filtered). These samples will then appear in the resulting samples to drop.

  • min_related_hard_filter (Optional[int]) – If provided, any sample that is related to more samples than this parameter will be filtered prior to computing the maximal independent set and appear in the results.

  • keep_samples (Optional[SetExpression]) – An optional set of samples that must be kept. An error is raised (when keep_samples_when_related is False) if any two samples in the list are among the related pairs.

  • keep_samples_when_related (bool) – Don’t raise an error if keep_samples contains related samples, and keep related samples. Default is False.

Return type:

Table

Returns:

A Table with the list of the samples to drop along with their rank.
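To illustrate the role of the rank, here is a greedy plain-Python sketch: the sample with the most remaining relatives is dropped first, with ties broken by the worse (larger) rank, until no related pair is left. This is a heuristic stand-in and is not guaranteed to find a maximal independent set; how the library resolves ties and whether the kinship comparison is strict are assumptions here, and the function name is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def samples_to_drop(
    pairs: List[Tuple[str, str, float]], rank: Dict[str, int], kin_threshold: float
) -> Dict[str, int]:
    """Greedily pick samples to drop so that no related pair remains."""
    neighbors: Dict[str, Set[str]] = {}
    for i, j, kin in pairs:
        if kin > kin_threshold:  # assumption: strict comparison
            neighbors.setdefault(i, set()).add(j)
            neighbors.setdefault(j, set()).add(i)

    dropped: Dict[str, int] = {}
    while True:
        candidates = [s for s, n in neighbors.items() if n]
        if not candidates:
            break
        # Most relatives first; among ties, the larger (worse) rank is dropped.
        worst = max(candidates, key=lambda s: (len(neighbors[s]), rank[s]))
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
        dropped[worst] = rank[worst]
    return dropped


pairs = [("a", "b", 0.3), ("b", "c", 0.3), ("d", "e", 0.3), ("f", "g", 0.01)]
rank = {"a": 1, "b": 2, "c": 3, "d": 5, "e": 4, "f": 6, "g": 7}
to_drop = samples_to_drop(pairs, rank, kin_threshold=0.1)
```

Dropping b removes both of its related pairs at once, and d is dropped rather than e because d has the worse rank.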

gnomad.sample_qc.relatedness.filter_to_trios(mtds, fam_ht)[source]

Filter a Matrix Table or a Variant Dataset to a set of trios in fam_ht.

Note

Using filter_cols in MatrixTable will not affect the number of rows (variants); however, using filter_samples in VariantDataset will remove the variants that are not present in any of the trios.

Parameters:
  • mtds (Union[MatrixTable, VariantDataset]) – A Variant Dataset or a Matrix Table to filter to only trios

  • fam_ht (Table) – A Table of trios to filter to, loaded using hl.import_fam

Return type:

Union[MatrixTable, VariantDataset]

Returns:

A Matrix Table or a Variant Dataset with only the trios in fam_ht
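Conceptually, the filter reduces to collecting every sample ID that appears in a trio and keeping only those columns. The plain-Python sketch below assumes the fam rows carry id, pat_id and mat_id fields, as a Table loaded with hl.import_fam is understood to; the dict-based "matrix" and both helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def trio_sample_ids(fam_rows: Iterable[dict]) -> Set[str]:
    """Collect proband, father and mother IDs from fam rows."""
    ids: Set[str] = set()
    for row in fam_rows:
        # Assumed field names: id (proband), pat_id (father), mat_id (mother).
        ids.update(row[key] for key in ("id", "pat_id", "mat_id"))
    return ids


def filter_columns(calls: Dict[str, List[int]], keep: Set[str]) -> Dict[str, List[int]]:
    """Keep only the samples (columns) that belong to a trio."""
    return {sample: gts for sample, gts in calls.items() if sample in keep}


fam_rows = [{"id": "kid1", "pat_id": "dad1", "mat_id": "mom1"}]
calls = {"kid1": [0, 1], "dad1": [1, 1], "mom1": [0, 0], "other": [2, 0]}
trio_calls = filter_columns(calls, trio_sample_ids(fam_rows))
```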

gnomad.sample_qc.relatedness.generate_trio_stats_expr(trio_mt, transmitted_strata={'raw': True}, de_novo_strata={'raw': True}, ac_strata={'raw': True}, proband_is_female_expr=None)[source]

Generate a row-wise expression containing trio transmission stats.

The expression will generate the following counts:
  • Number of alleles in het parents transmitted to the proband

  • Number of alleles in het parents not transmitted to the proband

  • Number of de novo mutations

  • Parent allele count

  • Proband allele count

Transmission and de novo mutation metrics and allele counts can be stratified using additional filters. transmitted_strata, de_novo_strata, and ac_strata all expect a dictionary of filtering expressions keyed by their desired suffix to append for labeling. The default will perform counts using all genotypes and append ‘raw’ to the label.

Note

Expects that mt is dense. If dealing with a sparse MT, hl.experimental.densify must be run first.

Parameters:
  • trio_mt (MatrixTable) – A standard trio MT (with the format as produced by hail.methods.trio_matrix)

  • transmitted_strata (Dict[str, BooleanExpression]) – Strata for the transmission counts

  • de_novo_strata (Dict[str, BooleanExpression]) – Strata for the de novo counts

  • ac_strata (Dict[str, BooleanExpression]) – Strata for the parent and child allele counts

  • proband_is_female_expr (Optional[BooleanExpression]) – An optional expression giving the sex of the proband. If not given, DNMs are only computed for autosomes.

Return type:

StructExpression

Returns:

An expression with the counts
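The transmitted / untransmitted counts can be illustrated for a single trio at a biallelic autosomal site with a plain-Python sketch. Genotypes are given as alternate allele counts (0, 1, 2); a hom-alt parent always passes on one alternate allele, so whatever remains in the proband must have come from the het parents. The function name is hypothetical, and Mendelian-inconsistent configurations (including de novo calls) are simply returned as None here rather than counted.

```python
from typing import Optional, Tuple


def transmission_counts(
    proband_alt: int, father_alt: int, mother_alt: int
) -> Optional[Tuple[int, int]]:
    """Return (transmitted, untransmitted) alt alleles from het parents."""
    n_het_parents = (father_alt == 1) + (mother_alt == 1)
    from_hom_alt_parents = (father_alt == 2) + (mother_alt == 2)
    transmitted = proband_alt - from_hom_alt_parents
    if not 0 <= transmitted <= n_het_parents:
        return None  # not explainable by Mendelian transmission
    return transmitted, n_het_parents - transmitted


examples = {
    (1, 1, 1): transmission_counts(1, 1, 1),
    (0, 1, 1): transmission_counts(0, 1, 1),
    (2, 1, 2): transmission_counts(2, 1, 2),
    (1, 2, 1): transmission_counts(1, 2, 1),
    (1, 0, 0): transmission_counts(1, 0, 0),
}
```

The last configuration, a het proband with two hom-ref parents, is the pattern a de novo count would pick up.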

gnomad.sample_qc.relatedness.generate_sib_stats_expr(mt, sib_ht, i_col='i', j_col='j', strata={'raw': True}, is_female=None)[source]

Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.

The sibling sharing counts can be stratified with additional filters using strata.

Note

This function expects that the mt has either been split or filtered to only bi-allelics. If a sample has multiple sibling pairs, only one pair will be counted.

Parameters:
  • mt (MatrixTable) – Input matrix table

  • sib_ht (Table) – Table defining sibling pairs with one sample in a col (i_col) and the second in another col (j_col)

  • i_col (str) – Column containing the 1st sample of the pair in the relationship table

  • j_col (str) – Column containing the 2nd sample of the pair in the relationship table

  • strata (Dict[str, BooleanExpression]) – Dict with additional strata to use when computing shared sibling variant counts

  • is_female (Optional[BooleanExpression]) – An optional column in mt giving the sample sex. If not given, counts are only computed for autosomes.

Return type:

StructExpression

Returns:

An expression with the sibling shared variant counts
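For a single variant, the sharing count can be pictured with the plain-Python sketch below. It assumes that the alleles "in common" for a pair are the minimum of the two alternate allele counts, and that a sample already used in one pair is skipped in later pairs; both points are interpretations of the description above, not confirmed library behavior, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def shared_alt_alleles(
    alt_counts: Dict[str, int], sib_pairs: List[Tuple[str, str]]
) -> int:
    """Sum alternate alleles shared by sibling pairs at one variant."""
    seen = set()
    total = 0
    for i, j in sib_pairs:
        if i in seen or j in seen:
            continue  # assumption: each sample contributes to one pair only
        seen.update((i, j))
        # Assumption: shared alleles = min of the two alt allele counts.
        total += min(alt_counts.get(i, 0), alt_counts.get(j, 0))
    return total


counts = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 2, "f": 2}
n_shared = shared_alt_alleles(counts, [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")])
```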