gnomad.sample_qc.relatedness

`gnomad.sample_qc.relatedness.UNRELATED`	String representation for a pair of unrelated individuals in this module.
`gnomad.sample_qc.relatedness.SECOND_DEGREE_RELATIVES`	String representation for a pair of 2nd degree relatives in this module.
`gnomad.sample_qc.relatedness.PARENT_CHILD`	String representation for a parent-child pair in this module.
`gnomad.sample_qc.relatedness.SIBLINGS`	String representation for a sibling pair in this module.
`gnomad.sample_qc.relatedness.DUPLICATE_OR_TWINS`	String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.
`gnomad.sample_qc.relatedness.AMBIGUOUS_RELATIONSHIP`	String representation for a pair of samples whose relationship is ambiguous.
`gnomad.sample_qc.relatedness.get_duplicated_samples`(...)	Extract the list of duplicate samples using a Table output from pc_relate.
`gnomad.sample_qc.relatedness.get_duplicated_samples_ht`(...)	Create a HT with duplicated samples sets.
`gnomad.sample_qc.relatedness.explode_duplicate_samples_ht`(dups_ht)	Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.
`gnomad.sample_qc.relatedness.get_relationship_expr`(...)	Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.
`gnomad.sample_qc.relatedness.get_slope_int_relationship_expr`(...)	Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.
`gnomad.sample_qc.relatedness.infer_families`(...)	Generate a pedigree containing trios inferred from the relationship_ht.
`gnomad.sample_qc.relatedness.create_fake_pedigree`(n, ...)	Generate a pedigree made of trios created by sampling 3 random samples in the sample list.
`gnomad.sample_qc.relatedness.compute_related_samples_to_drop`(...)	Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.
`gnomad.sample_qc.relatedness.filter_to_trios`(...)	Filter a Table, MatrixTable or VariantDataset to a set of trios in fam_ht.
`gnomad.sample_qc.relatedness.generate_trio_stats_expr`(trio_mt)	Generate a row-wise expression containing trio transmission stats.
`gnomad.sample_qc.relatedness.generate_sib_stats_expr`(mt, ...)	Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.
`gnomad.sample_qc.relatedness.calculate_de_novo_post_prob`(...)	Calculate the posterior probability of a de novo mutation.
`gnomad.sample_qc.relatedness.default_get_de_novo_expr`(...)	Get the de novo status of a variant based on the proband and parent genotypes.

gnomad.sample_qc.relatedness.UNRELATED = 'unrelated': String representation for a pair of unrelated individuals in this module. Typically >2nd degree relatives, but the threshold is user-dependent.

gnomad.sample_qc.relatedness.SECOND_DEGREE_RELATIVES = 'second degree relatives': String representation for a pair of 2nd degree relatives in this module.

gnomad.sample_qc.relatedness.PARENT_CHILD = 'parent-child': String representation for a parent-child pair in this module.

gnomad.sample_qc.relatedness.SIBLINGS = 'siblings': String representation for a sibling pair in this module.

gnomad.sample_qc.relatedness.DUPLICATE_OR_TWINS = 'duplicate/twins': String representation for a pair of samples who are identical (either MZ twins or duplicates) in this module.

gnomad.sample_qc.relatedness.AMBIGUOUS_RELATIONSHIP = 'ambiguous': String representation for a pair of samples whose relationship is ambiguous. This is used for a pair of samples whose kinship/IBD values do not correspond to any biological relationship between two individuals.

gnomad.sample_qc.relatedness.get_duplicated_samples(relationship_ht, i_col='i', j_col='j', rel_col='relationship')

Extract the list of duplicate samples using a Table output from pc_relate.

Parameters:

relationship_ht (Table) – Table with relationships between pairs of samples
i_col (str) – Column containing the 1st sample
j_col (str) – Column containing the 2nd sample
rel_col (str) – Column containing the sample pair relationship annotated with get_relationship_expr

Return type:

List[Set[str]]

Returns:

List of sets of samples that are duplicates
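The return shape (a list of sets) can be illustrated without Hail. The following is a minimal plain-Python sketch, not the library's implementation: it assumes the pairs have already been collected as `(i, j, relationship)` tuples and that duplicates are merged transitively (if s1 duplicates s2 and s2 duplicates s3, all three land in one set). The helper name `group_duplicates` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Value of the DUPLICATE_OR_TWINS constant documented above.
DUPLICATE_OR_TWINS = "duplicate/twins"


def group_duplicates(pairs: Iterable[Tuple[str, str, str]]) -> List[Set[str]]:
    """Merge duplicate pairs (i, j, relationship) into disjoint sets of samples.

    Assumption: duplicate relationships are treated as transitive.
    """
    groups: Dict[str, Set[str]] = {}  # sample -> the set it currently belongs to
    for i, j, rel in pairs:
        if rel != DUPLICATE_OR_TWINS:
            continue  # only duplicate/twin pairs contribute
        merged = groups.get(i, {i}) | groups.get(j, {j})
        for sample in merged:
            groups[sample] = merged
    # Several samples point at the same set object, so de-duplicate by identity.
    unique = {id(s): s for s in groups.values()}
    return list(unique.values())


pairs = [
    ("s1", "s2", "duplicate/twins"),
    ("s2", "s3", "duplicate/twins"),
    ("s4", "s5", "siblings"),
    ("s6", "s7", "duplicate/twins"),
]
dup_sets = group_duplicates(pairs)
```

Here `dup_sets` holds two sets: one with s1, s2 and s3, and one with s6 and s7. The sibling pair is ignored.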

gnomad.sample_qc.relatedness.get_duplicated_samples_ht(duplicated_samples, samples_rankings_ht, rank_ann='rank')

Create a HT with duplicated samples sets.

Each row is indexed by the sample that is kept and also contains the set of duplicate samples that should be filtered.

samples_rankings_ht is a HT containing a global rank for each of the samples (smaller is better).

Parameters:

duplicated_samples (List[Set[str]]) – List of sets of duplicated samples
samples_rankings_ht (Table) – HT with global rank for each sample
rank_ann (str) – Annotation in samples_rankings_ht containing each sample's global rank (smaller is better).

Returns:

HT with duplicate sample sets, including which to keep/filter
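The keep/filter decision described above (smaller rank is better) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: ranks are supplied as a dict rather than a Hail Table, and the row keys `kept` and `filtered` as well as the helper name `pick_kept_sample` are chosen for the example.

```python
from typing import Dict, List, Set


def pick_kept_sample(
    duplicated_samples: List[Set[str]], rank: Dict[str, int]
) -> List[dict]:
    """For each duplicate set, keep the best-ranked sample and filter the rest."""
    rows = []
    for dup_set in duplicated_samples:
        ordered = sorted(dup_set, key=lambda s: rank[s])  # smaller rank is better
        rows.append({"kept": ordered[0], "filtered": ordered[1:]})
    return rows


rows = pick_kept_sample(
    [{"s1", "s2", "s3"}, {"s6", "s7"}],
    {"s1": 5, "s2": 1, "s3": 9, "s6": 2, "s7": 4},
)
```

With these ranks, s2 is kept for the first set and s6 for the second.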

gnomad.sample_qc.relatedness.explode_duplicate_samples_ht(dups_ht)

Explode the result of get_duplicated_samples_ht, so that each line contains a single sample.

An additional annotation, dup_filtered, is added to indicate which of the duplicated samples was kept. Requires a field filtered whose type is the same as the key of the input duplicated samples Table.

Parameters:

dups_ht (Table) – Input HT

Return type:

Table

Returns:

Flattened HT
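As a rough picture of what exploding means here, the plain-Python sketch below turns one row per duplicate set into one row per sample. It assumes rows shaped like `{"kept": ..., "filtered": [...]}` and assumes that dup_filtered is False for the kept sample and True for the others. The key names `kept` and `sample` and the helper name are hypothetical; only `filtered` and `dup_filtered` are named in the description above.

```python
from typing import List


def explode_duplicate_rows(rows: List[dict]) -> List[dict]:
    """Return one row per sample, flagged with whether it was filtered."""
    exploded = []
    for row in rows:
        # The kept sample comes first and is not filtered.
        exploded.append({"sample": row["kept"], "dup_filtered": False})
        for sample in row["filtered"]:
            exploded.append({"sample": sample, "dup_filtered": True})
    return exploded


flat = explode_duplicate_rows([{"kept": "s2", "filtered": ["s1", "s3"]}])
```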

gnomad.sample_qc.relatedness.get_relationship_expr(kin_expr, ibd0_expr, ibd1_expr, ibd2_expr, first_degree_kin_thresholds=(0.19, 0.4), second_degree_min_kin=0.1, ibd0_0_max=0.025, ibd0_25_thresholds=(0.1, 0.425), ibd1_0_thresholds=(-0.15, 0.1), ibd1_50_thresholds=(0.275, 0.75), ibd1_100_min=0.75, ibd2_0_max=0.125, ibd2_25_thresholds=(0.1, 0.5), ibd2_100_thresholds=(0.75, 1.25))

Return an expression indicating the relationship between a pair of samples given their kin coefficient and IBD0, IBD1, IBD2 values.

The kinship coefficient values in the defaults are in line with those output from [hail.methods.pc_relate](https://hail.is/docs/0.2/methods/genetics.html?highlight=pc_relate#hail.methods.pc_relate).

Parameters:

kin_expr (NumericExpression) – Kin coefficient expression
ibd0_expr (NumericExpression) – IBD0 expression
ibd1_expr (NumericExpression) – IBD1 expression
ibd2_expr (NumericExpression) – IBD2 expression
first_degree_kin_thresholds (Tuple[float, float]) – (min, max) kinship threshold for 1st degree relatives
second_degree_min_kin (float) – min kinship threshold for 2nd degree relatives
ibd0_0_max (float) – max IBD0 threshold for 0 IBD0 sharing
ibd0_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD0 sharing
ibd1_0_thresholds (Tuple[float, float]) – (min, max) thresholds for 0 IBD1 sharing. Note that the min is there because pc_relate can output large negative values in some corner cases.
ibd1_50_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.5 IBD1 sharing
ibd1_100_min (float) – min IBD1 threshold for 1.0 IBD1 sharing
ibd2_0_max (float) – max IBD2 threshold for 0 IBD2 sharing
ibd2_25_thresholds (Tuple[float, float]) – (min, max) thresholds for 0.25 IBD2 sharing
ibd2_100_thresholds (Tuple[float, float]) – (min, max) thresholds for 1.00 IBD2 sharing. Note that the min is there because pc_relate can output much larger IBD2 values in some corner cases.

Return type:

StringExpression

Returns:

The relationship annotation using the constants defined in this module.
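To get a feel for the default windows, the plain-Python sketch below checks that the textbook kinship and IBD values for outbred pairs fall inside them. How the library combines the windows is not stated here, so the mapping of windows to relationships is an assumption inferred from the parameter names (for example, ibd1_100_min for parent-child pairs, which share exactly one allele IBD everywhere), as are the inclusive comparisons. The helper names are hypothetical.

```python
# Default windows copied from the signature of get_relationship_expr above.
FIRST_DEGREE_KIN = (0.19, 0.4)
IBD0_0_MAX = 0.025
IBD0_25 = (0.1, 0.425)
IBD1_0 = (-0.15, 0.1)
IBD1_50 = (0.275, 0.75)
IBD1_100_MIN = 0.75
IBD2_0_MAX = 0.125
IBD2_25 = (0.1, 0.5)
IBD2_100 = (0.75, 1.25)


def within(value, bounds):
    """Return True if value lies inside the inclusive (min, max) window."""
    low, high = bounds
    return low <= value <= high


def fits_parent_child(kin, ibd0, ibd1, ibd2):
    # Assumed window combination: first-degree kin, IBD0 ~ 0, IBD1 ~ 1, IBD2 ~ 0.
    return (
        within(kin, FIRST_DEGREE_KIN)
        and ibd0 <= IBD0_0_MAX
        and ibd1 >= IBD1_100_MIN
        and ibd2 <= IBD2_0_MAX
    )


def fits_siblings(kin, ibd0, ibd1, ibd2):
    # Assumed window combination: first-degree kin, IBD0 ~ 0.25, IBD1 ~ 0.5, IBD2 ~ 0.25.
    return (
        within(kin, FIRST_DEGREE_KIN)
        and within(ibd0, IBD0_25)
        and within(ibd1, IBD1_50)
        and within(ibd2, IBD2_25)
    )


def fits_duplicate(kin, ibd0, ibd1, ibd2):
    # Assumed window combination: kin above the first-degree max, IBD2 ~ 1.
    return (
        kin > FIRST_DEGREE_KIN[1]
        and ibd0 <= IBD0_0_MAX
        and within(ibd1, IBD1_0)
        and within(ibd2, IBD2_100)
    )


# Textbook expectations (kin, IBD0, IBD1, IBD2) for outbred pairs.
checks = {
    "parent-child": fits_parent_child(0.25, 0.0, 1.0, 0.0),
    "siblings": fits_siblings(0.25, 0.25, 0.5, 0.25),
    "duplicate/twins": fits_duplicate(0.5, 0.0, 0.0, 1.0),
}
```

Parent-child and sibling pairs share the same expected kinship (0.25), which is why the IBD windows are needed to separate them.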

gnomad.sample_qc.relatedness.get_slope_int_relationship_expr(kin_expr, y_expr, parent_child_max_y, second_degree_sibling_lower_cutoff_slope, second_degree_sibling_lower_cutoff_intercept, second_degree_upper_sibling_lower_cutoff_slope, second_degree_upper_sibling_lower_cutoff_intercept, duplicate_twin_min_kin=0.42, second_degree_min_kin=0.1, duplicate_twin_ibd1_min=-0.15, duplicate_twin_ibd1_max=0.1, ibd1_expr=None)

Return an expression indicating the relationship between a pair of samples given slope and intercept cutoffs.

The kinship coefficient (kin_expr) and an additional metric (y_expr) are used to define the relationship between a pair of samples. For this function, the slopes and intercepts should refer to cutoff lines where the x-axis (independent variable) is the kinship coefficient and the y-axis (dependent variable) is the metric defined by y_expr. Typically, the y-axis metric is IBS0, IBS0/IBS2, or IBD0.

Note

No defaults are provided for the slope and intercept cutoffs because they are highly dependent on the dataset and the metric used in y_expr.

The relationship expression is determined as follows:

If kin_expr < second_degree_min_kin -> UNRELATED
If kin_expr > duplicate_twin_min_kin:
- If y_expr < parent_child_max_y:
  - If ibd1_expr is defined:
    - If duplicate_twin_ibd1_min <= ibd1_expr <= duplicate_twin_ibd1_max -> DUPLICATE_OR_TWINS
    - Else -> AMBIGUOUS_RELATIONSHIP
  - Else -> DUPLICATE_OR_TWINS
If y_expr < parent_child_max_y -> PARENT_CHILD
If pair is over second_degree_sibling_lower_cutoff line:
- If pair is over second_degree_upper_sibling_lower_cutoff line -> SIBLINGS
- Else -> SECOND_DEGREE_RELATIVES
If none of the above conditions are met -> AMBIGUOUS_RELATIONSHIP

Parameters:

kin_expr (NumericExpression) – Kin coefficient expression. Used as the x-axis, or independent variable, for the slope and intercept cutoffs.
y_expr (NumericExpression) – Expression for the metric to use as the y-axis, or dependent variable, for the slope and intercept cutoffs. This is typically an expression for IBS0, IBS0/IBS2, or IBD0.
parent_child_max_y (float) – Maximum value of the metric defined by y_expr for a parent-child pair.
second_degree_sibling_lower_cutoff_slope (float) – Slope of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
second_degree_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as a lower cutoff for second degree relatives and siblings from parent-child pairs.
second_degree_upper_sibling_lower_cutoff_slope (float) – Slope of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
second_degree_upper_sibling_lower_cutoff_intercept (float) – Intercept of the line to use as an upper cutoff for second degree relatives and a lower cutoff for siblings.
duplicate_twin_min_kin (float) – Minimum kinship for duplicate or twin pairs. Default is 0.42.
second_degree_min_kin (float) – Minimum kinship threshold for 2nd degree relatives. Default is 0.1. Bycroft et al. (2018) calculates a theoretical kinship of 0.08838835 for a second degree relationship cutoff, but this cutoff should be determined by evaluation of the kinship distribution.
ibd1_expr (Optional[NumericExpression]) – Optional IBD1 expression. If this expression is provided, duplicate_twin_ibd1_min and duplicate_twin_ibd1_max will be used as an additional cutoff for duplicate or twin pairs.
duplicate_twin_ibd1_min (float) – Minimum IBD1 cutoff for duplicate or twin pairs. Note: the min is because pc_relate can output large negative values in some corner cases.
duplicate_twin_ibd1_max (float) – Maximum IBD1 cutoff for duplicate or twin pairs.

Returns:

The relationship annotation using the constants defined in this module.
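The decision rules listed for this function can be transliterated into plain Python on scalar values. This is a sketch under stated assumptions, not the Hail expression the library builds: "over a line" is taken to mean y > slope * kin + intercept, and a pair with kin above duplicate_twin_min_kin but y at or above parent_child_max_y is assumed to fall through to the later rules (the description does not say). The function name `classify_pair`, the shortened argument names and the cutoff values in the usage lines are hypothetical.

```python
from typing import Optional

UNRELATED = "unrelated"
SECOND_DEGREE_RELATIVES = "second degree relatives"
PARENT_CHILD = "parent-child"
SIBLINGS = "siblings"
DUPLICATE_OR_TWINS = "duplicate/twins"
AMBIGUOUS_RELATIONSHIP = "ambiguous"


def classify_pair(
    kin: float,
    y: float,
    parent_child_max_y: float,
    lower_slope: float,
    lower_intercept: float,
    upper_slope: float,
    upper_intercept: float,
    duplicate_twin_min_kin: float = 0.42,
    second_degree_min_kin: float = 0.1,
    duplicate_twin_ibd1_min: float = -0.15,
    duplicate_twin_ibd1_max: float = 0.1,
    ibd1: Optional[float] = None,
) -> str:
    """Scalar version of the documented rule list (assumptions in the lead-in)."""
    if kin < second_degree_min_kin:
        return UNRELATED
    if kin > duplicate_twin_min_kin and y < parent_child_max_y:
        if ibd1 is None:
            return DUPLICATE_OR_TWINS
        if duplicate_twin_ibd1_min <= ibd1 <= duplicate_twin_ibd1_max:
            return DUPLICATE_OR_TWINS
        return AMBIGUOUS_RELATIONSHIP
    if y < parent_child_max_y:
        return PARENT_CHILD
    # Assumption: "over" a cutoff line means strictly above it.
    if y > lower_slope * kin + lower_intercept:
        if y > upper_slope * kin + upper_intercept:
            return SIBLINGS
        return SECOND_DEGREE_RELATIVES
    return AMBIGUOUS_RELATIONSHIP


# Hypothetical cutoffs for y = IBD0; real values must be tuned per dataset.
cutoffs = dict(
    parent_child_max_y=0.05,
    lower_slope=0.0,
    lower_intercept=0.1,
    upper_slope=-4.0,
    upper_intercept=1.1,
)
sibling_call = classify_pair(0.25, 0.25, **cutoffs)
second_degree_call = classify_pair(0.125, 0.5, **cutoffs)
```

With these made-up cutoffs, a pair at (kin=0.25, IBD0=0.25) is called siblings and a pair at (kin=0.125, IBD0=0.5) is called second degree relatives.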

gnomad.sample_qc.relatedness.infer_families(relationship_ht, sex, duplicate_samples_ht, i_col='i', j_col='j', relationship_col='relationship')

Generate a pedigree containing trios inferred from the relationship_ht.

This function takes a hail Table with a row for each pair of related individuals i, j in the data (it’s OK to have unrelated samples too).

The relationship_col should be a column specifying the relationship between each pair of samples, as defined in this module’s constants.

This function returns a pedigree containing trios inferred from the data. Family ID can be the same for multiple trios if one or more members of the trios are related (e.g. sibs, multi-generational family). Trios are ordered by family ID.

Note

This function only returns complete trios defined as: one child, one father, and one mother (sex is required for both parents).

Parameters:

relationship_ht (Table) – Input relationship table
sex (Union[Table, Dict[str, bool]]) – A Table or dict giving the sex for each sample (True = XX, False = XY). If a Table is given, it should have a field is_xx.
duplicate_samples_ht (Table) – All duplicated samples TO REMOVE (if not provided, this function won’t work, as it assumes that each child has exactly two parents).
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
relationship_col (str) – Column containing the relationship for the sample pair, as defined in this module’s constants.

Return type:

Pedigree

Returns:

Pedigree of complete trios

gnomad.sample_qc.relatedness.create_fake_pedigree(n, sample_list, exclude_real_probands=False, max_tries=10, real_pedigree=None, sample_list_stratification=None)

Generate a pedigree made of trios created by sampling 3 random samples in the sample list.

If real_pedigree is given, the resulting fake trios will not include any proband-parent combination that is present in the real trios.
Each sample can be used only once as a proband in the resulting trios.
Sex of probands in fake trios is random.

Parameters:

n (int) – Number of fake trios desired in the pedigree.
sample_list (List[str]) – List of samples.
exclude_real_probands (bool) – If set, then fake trios probands cannot be in the real trios probands.
max_tries (int) – Maximum number of sampling attempts before bailing out (prevents an infinite loop if n is too large relative to the number of samples).
real_pedigree (Optional[Pedigree]) – Optional pedigree to exclude children from.
sample_list_stratification (Optional[Dict[str, str]]) – Optional dictionary with samples as keys and a value that should be used to stratify samples in sample_list into groups that the trio should be picked from. This ensures that each fake trio will contain samples from only the same stratification. For example, if all samples within a fake trio should be chosen from the same platform, this can be a dictionary of sample: platform.

Return type:

Pedigree

Returns:

Fake pedigree.
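The sampling idea can be sketched with the standard library alone. This is an illustration, not the library's implementation: it returns plain tuples instead of a Hail Pedigree, ignores real_pedigree, exclude_real_probands and stratification, and assumes that one "try" is a round in which the remaining number of trios is drawn. The function name, the `seed` argument and the tuple layout are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Trio = Tuple[str, str, str]  # (proband, father, mother), roles assigned at random


def sample_fake_trios(
    n: int, sample_list: List[str], max_tries: int = 10, seed: Optional[int] = None
) -> List[Trio]:
    """Draw up to n fake trios; each sample is used at most once as a proband."""
    rng = random.Random(seed)
    trios: List[Trio] = []
    used_probands = set()
    tries = 0
    while len(trios) < n and tries < max_tries:
        tries += 1
        # One round: try to draw as many trios as are still missing.
        for _ in range(n - len(trios)):
            proband, father, mother = rng.sample(sample_list, 3)  # 3 distinct samples
            if proband not in used_probands:
                used_probands.add(proband)
                trios.append((proband, father, mother))
    return trios


fake_trios = sample_fake_trios(5, [f"s{i}" for i in range(30)], seed=0)
```

Capping the number of rounds is what keeps the loop finite when n is close to, or larger than, the number of available samples.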

gnomad.sample_qc.relatedness.compute_related_samples_to_drop(relatedness_ht, rank_ht, kin_threshold, filtered_samples=None, min_related_hard_filter=None, keep_samples=None, keep_samples_when_related=False)

Compute a Table with the list of samples to drop (and their global rank) to get the maximal independent set of unrelated samples.

Note

relatedness_ht should be keyed by exactly two fields of the same type, identifying the pair of samples for each row.
rank_ht should be keyed by a single key of the same type as a single sample identifier in relatedness_ht.

Parameters:

relatedness_ht (Table) – relatedness HT, as produced by e.g. pc-relate
kin_threshold (float) – Kinship threshold to consider two samples as related
rank_ht (Table) – Table with a global rank for each sample (smaller is preferred)
filtered_samples (Optional[SetExpression]) – An optional set of samples to exclude (e.g. these samples were hard-filtered). These samples will then appear in the resulting samples to drop.
min_related_hard_filter (Optional[int]) – If provided, any sample that is related to more samples than this parameter will be filtered prior to computing the maximal independent set and appear in the results.
keep_samples (Optional[SetExpression]) – An optional set of samples that must be kept. An error is raised (when keep_samples_when_related is False) if any two samples in the list are among the related pairs.
keep_samples_when_related (bool) – Don’t raise an error if keep_samples contains related samples, and keep related samples. Default is False.

Return type:

Table

Returns:

A Table with the list of the samples to drop along with their rank.
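The goal (drop as few samples as possible so that no related pair remains, preferring to keep better-ranked samples) can be illustrated with a simple greedy heuristic in plain Python. This is not the algorithm the library or Hail uses and it does not guarantee a maximal set: it repeatedly drops the sample with the most remaining relatives, breaking ties by the worse (larger) rank. Treating a pair as related when kin > kin_threshold is also an assumption, and the function name is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_samples_to_drop(
    pairs: List[Tuple[str, str, float]], rank: Dict[str, int], kin_threshold: float
) -> Set[str]:
    """Greedy heuristic: drop samples until no related pair is left."""
    neighbours: Dict[str, Set[str]] = {}
    for i, j, kin in pairs:
        if kin > kin_threshold:  # assumption: strictly above the threshold
            neighbours.setdefault(i, set()).add(j)
            neighbours.setdefault(j, set()).add(i)

    dropped: Set[str] = set()
    while any(neighbours.values()):
        # Most remaining relatives first; ties broken by the worse (larger) rank.
        worst = max(neighbours, key=lambda s: (len(neighbours[s]), rank[s]))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        dropped.add(worst)
    return dropped


pairs = [("a", "b", 0.30), ("b", "c", 0.25), ("c", "d", 0.05)]
rank = {"a": 1, "b": 2, "c": 3, "d": 4}
dropped = greedy_samples_to_drop(pairs, rank, kin_threshold=0.1)
```

In this example only b is dropped: it is related to both a and c, and removing it leaves a, c and d mutually unrelated at the chosen threshold.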

gnomad.sample_qc.relatedness.filter_to_trios(mtds, fam_ht)

Filter a Table, MatrixTable or VariantDataset to a set of trios in fam_ht.

Note

Using filter_cols in MatrixTable will not affect the number of rows (variants), however, using filter_samples in VariantDataset will remove the variants that are not present in any of the trios.

Parameters:

mtds (Union[Table, MatrixTable, VariantDataset]) – A Variant Dataset or a Matrix Table or a Table to filter to only trios.
fam_ht (Table) – A Table of trios to filter to, loaded using hl.import_fam.

Return type:

Union[Table, MatrixTable, VariantDataset]

Returns:

A Table, MatrixTable or VariantDataset with only the trios in fam_ht.

gnomad.sample_qc.relatedness.generate_trio_stats_expr(trio_mt, transmitted_strata={'raw': True}, de_novo_strata={'raw': True}, ac_strata={'raw': True}, proband_is_female_expr=None)

Generate a row-wise expression containing trio transmission stats.

The expression will generate the following counts:

Number of alleles in het parents transmitted to the proband
Number of alleles in het parents not transmitted to the proband
Number of de novo mutations
Parent allele count
Proband allele count

Transmission and de novo mutation metrics and allele counts can be stratified using additional filters. transmitted_strata, de_novo_strata, and ac_strata all expect a dictionary of filtering expressions keyed by their desired suffix to append for labeling. The default will perform counts using all genotypes and append ‘raw’ to the label.

Note

Expects that mt is dense. If dealing with a sparse MT, hl.experimental.densify must be run first.

Parameters:

trio_mt (MatrixTable) – A standard trio MT (with the format as produced by hail.methods.trio_matrix)
transmitted_strata (Dict[str, BooleanExpression]) – Strata for the transmission counts
de_novo_strata (Dict[str, BooleanExpression]) – Strata for the de novo counts
ac_strata (Dict[str, BooleanExpression]) – Strata for the parent and child allele counts
proband_is_female_expr (Optional[BooleanExpression]) – An optional expression giving the karyotype of the proband (XX=True, XY=False). If not given, DNMs are only computed for autosomes.

Return type:

StructExpression

Returns:

An expression with the counts
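The first two counts in the list above can be pictured on a single autosomal, biallelic site with plain Python. This sketch is a simplification and not the library's expression: it assumes genotypes are given as alternate-allele counts (0, 1 or 2) and only considers trios where exactly one parent is het and the other is hom-ref, so that transmission is unambiguous. The key names follow the 'raw' suffix convention described above but are otherwise hypothetical, as is the function name.

```python
from typing import Dict, Iterable, Tuple


def count_transmission(trios: Iterable[Tuple[int, int, int]]) -> Dict[str, int]:
    """Count transmitted / untransmitted alt alleles from het parents.

    Each item is (father_n_alt, mother_n_alt, proband_n_alt) at one site.
    """
    counts = {"n_transmitted_raw": 0, "n_untransmitted_raw": 0}
    for father, mother, proband in trios:
        # Assumption: keep only trios with one het parent and one hom-ref parent.
        if sorted((father, mother)) != [0, 1]:
            continue
        if proband == 1:
            counts["n_transmitted_raw"] += 1
        elif proband == 0:
            counts["n_untransmitted_raw"] += 1
        # proband == 2 is Mendelian-inconsistent here and is skipped.
    return counts


site_counts = count_transmission([(0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 2), (0, 0, 1)])
```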

gnomad.sample_qc.relatedness.generate_sib_stats_expr(mt, sib_ht, i_col='i', j_col='j', strata={'raw': True}, is_female=None)

Generate a row-wise expression containing the number of alternate alleles in common between sibling pairs.

The sibling sharing counts can be stratified with additional filters using strata.

Note

This function expects that the mt has either been split or filtered to only bi-allelics. If a sample has multiple sibling pairs, only one pair will be counted.

Parameters:

mt (MatrixTable) – Input matrix table
sib_ht (Table) – Table defining sibling pairs with one sample in a col (i_col) and the second in another col (j_col)
i_col (str) – Column containing the 1st sample of the pair in the relationship table
j_col (str) – Column containing the 2nd sample of the pair in the relationship table
strata (Dict[str, BooleanExpression]) – Dict with additional strata to use when computing shared sibling variant counts
is_female (Optional[BooleanExpression]) – An optional column in mt giving the sample sex. If not given, counts are only computed for autosomes.

Return type:

StructExpression

Returns:

An expression with the sibling shared variant counts

gnomad.sample_qc.relatedness.calculate_de_novo_post_prob(proband_pl_expr, father_pl_expr, mother_pl_expr, diploid_expr, hemi_x_expr, hemi_y_expr, freq_prior_expr, min_pop_prior=3.3333333333333333e-06, de_novo_prior=3.3333333333333334e-08)

Calculate the posterior probability of a de novo mutation.

This function computes the posterior probability of a de novo mutation (P_dn) based on the genotype likelihoods of the proband and parents, along with the population frequency prior for the variant. The method is adapted from Kaitlin Samocha’s de novo caller and Hail’s de_novo function. However, neither approach explicitly documented how to compute de novo probabilities for hemizygous genotypes in XY individuals. To address this, we provide the full set of equations in this docstring.

The posterior probability of an event being truly de novo vs. the probability it was a missed heterozygote call in one of the two parents is:

\[P_{dn} = \frac{P(DN \mid \text{data})}{P(DN \mid \text{data}) + P(\text{missed het in parent(s)} \mid \text{data})}\]

The terms are defined as follows:

\(P(DN \mid \text{data})\) is the probability that the variant is de novo, given the observed genotype data.
\(P(\text{missed het in parent(s)} \mid \text{data})\) is the probability that the heterozygous variant was missed in at least one parent.

Applying Bayes’ theorem to the numerator and denominator yields:

\[P_{dn} = \frac{P(\text{data} \mid DN) \cdot P(DN)}{P(\text{data} \mid DN) \cdot P(DN) + P(\text{data} \mid \text{missed het in parent(s)}) \cdot P(\text{missed het in parent(s)})}\]

where:

\(P(\text{data} \mid DN)\): Probability of observing the data under the assumption of a de novo mutation.
- Autosomes and PAR regions:
  
  \[P(\text{data} \mid DN) = P(\text{hom_ref in father}) \cdot P(\text{hom_ref in mother}) \cdot P(\text{het in proband})\]
For hemizygous calls in XY individuals, the probability of observing the data under the assumption of a de novo mutation is defined separately.

Note that hemizygous calls in XY individuals will be reported as homozygous alternate without any sex ploidy adjustments, which is why the formulas below use P(hom_alt in proband).
- X non-PAR regions (XY only):
  
  \[P(\text{data} \mid DN) = P(\text{hom_ref in mother}) \cdot P(\text{hom_alt in proband})\]
- Y non-PAR regions (XY only):
  
  \[P(\text{data} \mid DN) = P(\text{hom_ref in father}) \cdot P(\text{hom_alt in proband})\]
\(P(DN)\): The prior probability of a de novo mutation from literature, defined as:

\[P(DN) = \frac{1}{3 \times 10^7}\]
\(P(\text{data} \mid \text{missed het in parent(s)})\): Probability of observing the data under the assumption of a missed het in a parent.
- Autosomes and PAR regions:
  
  \[P(\text{data} \mid \text{missed het in parents}) = ( P(\text{het in father}) \cdot P(\text{hom_ref in mother}) + P(\text{hom_ref in father}) \cdot P(\text{het in mother})) \cdot P(\text{het in proband})\]
- X non-PAR regions (XY only):
  
  \[P(\text{data} \mid \text{missed het in mother}) = (P(\text{het in mother}) + P(\text{hom_alt in mother})) \cdot P(\text{hom_alt in proband})\]
- Y non-PAR regions (XY only):
  
  \[P(\text{data} \mid \text{missed het in father}) = (P(\text{het in father}) + P(\text{hom_alt in father})) \cdot P(\text{hom_alt in proband})\]
\(P(\text{missed het in parent(s)})\): Prior that at least one parent is heterozygous. Depends on alternate allele frequency:

\[P(\text{het in one parent}) = 1 - (1 - \text{freq_prior})^4\]

where \(\text{freq_prior}\) is the population frequency prior for the variant.

Parameters:

proband_pl_expr (ArrayExpression) – Phred-scaled genotype likelihoods for the proband.
father_pl_expr (ArrayExpression) – Phred-scaled genotype likelihoods for the father.
mother_pl_expr (ArrayExpression) – Phred-scaled genotype likelihoods for the mother.
diploid_expr (BooleanExpression) – Boolean expression indicating a diploid genotype.
hemi_x_expr (BooleanExpression) – Boolean expression indicating a hemizygous genotype on the X chromosome.
hemi_y_expr (BooleanExpression) – Boolean expression indicating a hemizygous genotype on the Y chromosome.
freq_prior_expr (Float64Expression) – Population frequency prior for the variant.
min_pop_prior (Optional[float]) – Minimum population frequency prior (default: \(\text{100/3e7}\)).
de_novo_prior (Optional[float]) – Prior probability of a de novo mutation (default: \(\text{1/3e7}\)).

Return type:

Float64Expression

Returns:

Posterior probability of a de novo mutation (P_dn).
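The equations above can be evaluated on plain numbers to see how the terms interact. The sketch below is a scalar illustration, not the Hail expression the function returns. It assumes that each Phred-scaled PL is converted to an unnormalized likelihood with 10^(-PL/10), that PL arrays are ordered (hom_ref, het, hom_alt), and that min_pop_prior acts as a floor on freq_prior. The function name and the `region` argument (standing in for diploid_expr, hemi_x_expr and hemi_y_expr) are hypothetical.

```python
from typing import Sequence


def pl_to_likelihood(pl: float) -> float:
    """Convert a Phred-scaled likelihood to a linear-scale likelihood."""
    return 10 ** (-pl / 10)


def de_novo_posterior(
    proband_pl: Sequence[float],
    father_pl: Sequence[float],
    mother_pl: Sequence[float],
    freq_prior: float,
    region: str = "diploid",
    min_pop_prior: float = 100 / 3e7,
    de_novo_prior: float = 1 / 3e7,
) -> float:
    """Evaluate P_dn for one trio using the equations in the description above."""
    p = [pl_to_likelihood(x) for x in proband_pl]  # hom_ref, het, hom_alt
    f = [pl_to_likelihood(x) for x in father_pl]
    m = [pl_to_likelihood(x) for x in mother_pl]

    if region == "diploid":  # autosomes and PAR regions
        data_dn = f[0] * m[0] * p[1]
        data_missed = (f[1] * m[0] + f[0] * m[1]) * p[1]
    elif region == "hemi_x":  # X non-PAR, XY proband
        data_dn = m[0] * p[2]
        data_missed = (m[1] + m[2]) * p[2]
    elif region == "hemi_y":  # Y non-PAR, XY proband
        data_dn = f[0] * p[2]
        data_missed = (f[1] + f[2]) * p[2]
    else:
        raise ValueError(f"unknown region: {region}")

    # Assumption: min_pop_prior is a lower bound on the frequency prior.
    prior_missed = 1 - (1 - max(freq_prior, min_pop_prior)) ** 4
    numerator = data_dn * de_novo_prior
    return numerator / (numerator + data_missed * prior_missed)


# Confident het proband, confident hom-ref parents, very rare variant.
strong = de_novo_posterior([60, 0, 60], [0, 60, 120], [0, 60, 120], freq_prior=0.0)
# Same proband, but the father's het call is as likely as hom-ref.
weak = de_novo_posterior([60, 0, 60], [0, 0, 60], [0, 60, 120], freq_prior=0.0)
```

The first call gives a posterior above 0.99, while the second drops well below 0.05 because a missed heterozygous call in the father explains the data much better than a new mutation.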

gnomad.sample_qc.relatedness.default_get_de_novo_expr(locus_expr, alleles_expr, proband_expr, father_expr, mother_expr, is_xx_expr, freq_prior_expr, min_pop_prior=3.3333333333333333e-06, de_novo_prior=3.3333333333333334e-08, min_dp_ratio=0.1, min_gq=20, min_proband_ab=0.2, max_parent_ab=0.05, min_de_novo_p=0.05, high_conf_dp_ratio=0.2, dp_threshold_snp=10, high_med_conf_ab=0.3, low_conf_ab=0.2, high_conf_p=0.99, med_conf_p=0.5)

Get the de novo status of a variant based on the proband and parent genotypes.

Confidence thresholds (from Kaitlin Samocha’s de novo caller):

Category	P(de novo)	AB	AD	DP	DR	GQ
FAIL	< 0.05	AB(parents) > 0.05 OR AB(proband) < 0.2	0		< 0.1	< 20
HIGH (Indel)	> 0.99	> 0.3			> 0.2
HIGH (SNV) 1	> 0.99	> 0.3			> 0.2
HIGH (SNV) 2	> 0.5	> 0.3		> 10
MEDIUM	> 0.5	> 0.3
LOW	>= 0.05	>= 0.2

AB: Proband AB. FAIL criteria also includes threshold for parent(s).
AD: Sum of parent(s) AD.
DP: Proband DP.
DR: Defined as DP(proband) / DP(parent(s)).
GQ: Proband GQ.

Note

The “LOW” confidence category differs slightly from the criteria in the original code (P(de novo) > 0.05 and AB(proband) > 0.2), as it is designed to fill the gap for variants that do not meet the FAIL criteria but would otherwise remain unclassified.
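The threshold table can be read as a cascade, sketched below in plain Python on scalar values. This is an interpretation, not the library's implementation: it assumes the FAIL criteria are combined with OR, that an AD entry of 0 means the parents' summed AD is zero, and that categories are tested from HIGH down to LOW. The function and argument names are hypothetical; the default values are taken from the signature above.

```python
def classify_de_novo(
    p_de_novo: float,
    proband_ab: float,
    parent_ab: float,
    parent_sum_ad: int,
    proband_dp: int,
    dp_ratio: float,
    proband_gq: int,
    is_snv: bool,
    min_de_novo_p: float = 0.05,
    min_proband_ab: float = 0.2,
    max_parent_ab: float = 0.05,
    min_dp_ratio: float = 0.1,
    min_gq: int = 20,
    high_conf_p: float = 0.99,
    med_conf_p: float = 0.5,
    high_med_conf_ab: float = 0.3,
    high_conf_dp_ratio: float = 0.2,
    dp_threshold_snp: int = 10,
) -> str:
    """Return FAIL, HIGH, MEDIUM or LOW following the table above."""
    # Assumption: any single FAIL criterion is enough to fail the call.
    failed = (
        p_de_novo < min_de_novo_p
        or parent_ab > max_parent_ab
        or proband_ab < min_proband_ab
        or parent_sum_ad == 0
        or dp_ratio < min_dp_ratio
        or proband_gq < min_gq
    )
    if failed:
        return "FAIL"
    good_ab = proband_ab > high_med_conf_ab
    # HIGH (Indel) and HIGH (SNV) 1 share the same thresholds.
    if p_de_novo > high_conf_p and good_ab and dp_ratio > high_conf_dp_ratio:
        return "HIGH"
    # HIGH (SNV) 2: lower probability bar, but requires proband depth.
    if is_snv and p_de_novo > med_conf_p and good_ab and proband_dp > dp_threshold_snp:
        return "HIGH"
    if p_de_novo > med_conf_p and good_ab:
        return "MEDIUM"
    # Everything that passed the FAIL checks has P >= 0.05 and AB >= 0.2.
    return "LOW"


high_call = classify_de_novo(0.995, 0.4, 0.0, 30, 25, 0.5, 60, is_snv=True)
```

Because the FAIL checks already enforce P(de novo) >= 0.05 and AB(proband) >= 0.2 under the default thresholds, LOW acts as the catch-all for any passing call that is not HIGH or MEDIUM.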

The de novo confidence is calculated as a simplified version of the one previously described in Kaitlin Samocha’s [de novo caller](https://github.com/ksamocha/de_novo_scripts) and Hail’s [de_novo](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.de_novo) method. This simplified version is the same as Hail’s methods when using the ignore_in_sample_allele_frequency parameter. The main difference is that this mode should be used when families larger than a single trio are in the dataset, in which an allele might be de novo in a parent and transmitted to a child in the dataset. This mode will not consider the allele count (AC) in the dataset, and will only consider the Phred-scaled likelihoods (PL) of the child and parents, allele balance (AB) of the child and parents, the genotype quality (GQ) of the child, the depth (DP) of the child and parents, and the population frequency prior.

Warning

This method assumes that the PL and AD fields are present in the genotype fields of the child and parents. If they are missing, this method will not work. Many of our larger datasets have the PL and AD fields intentionally removed to save storage space. If this is the reason that the PL and AD fields are missing, the only way to use this method is to set them to their approximate values:

PL=hl.or_else(PL, [0, GQ, 2 * GQ])
AD=hl.or_else(AD, [DP, 0])
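The same fallback can be written on plain Python values to make its effect visible. This sketch only mirrors the two approximations above on scalars; the function name is hypothetical and it is not a substitute for annotating the Hail entry fields.

```python
from typing import List, Optional, Tuple


def fill_missing_pl_ad(
    gq: int, dp: int, pl: Optional[List[int]] = None, ad: Optional[List[int]] = None
) -> Tuple[List[int], List[int]]:
    """Return (PL, AD), substituting the approximations when a field is missing."""
    approx_pl = pl if pl is not None else [0, gq, 2 * gq]
    approx_ad = ad if ad is not None else [dp, 0]
    return approx_pl, approx_ad


filled = fill_missing_pl_ad(gq=30, dp=12)
untouched = fill_missing_pl_ad(gq=30, dp=12, pl=[40, 0, 50], ad=[6, 6])
```

The substituted values describe a homozygous-reference call: a PL of 0 for the first genotype and all reads assigned to the reference allele. Existing PL and AD values are left as they are.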

Parameters:

locus_expr (LocusExpression) – Variant locus.
alleles_expr (ArrayExpression) – Variant alleles. Function assumes all variants are biallelic, meaning that multiallelic variants in the input dataset should be split prior to running this function.
proband_expr (StructExpression) – Proband genotype info; required fields: GT, DP, GQ, AD, PL.
father_expr (StructExpression) – Father genotype info; required fields: GT, DP, GQ, AD, PL.
mother_expr (StructExpression) – Mother genotype info; required fields: GT, DP, GQ, AD, PL.
is_xx_expr (BooleanExpression) – Whether the proband has XX sex karyotype.
freq_prior_expr (Float64Expression) – Population frequency prior for the variant.
min_pop_prior (float) – Minimum population frequency prior. Default is 100 / 3e7.
de_novo_prior (float) – Prior probability of a de novo mutation. Default is 1 / 3e7.
min_dp_ratio (float) – Minimum depth ratio for proband to parents. Default is 0.1.
min_gq (int) – Minimum genotype quality for the proband. Default is 20.
min_proband_ab (float) – Minimum allele balance for the proband. Default is 0.2.
max_parent_ab (float) – Maximum allele balance for parents. Default is 0.05.
min_de_novo_p (float) – Minimum probability for variant to be called de novo. Default is 0.05.
high_conf_dp_ratio (float) – DP ratio threshold of proband DP to combined DP in parents for high confidence. Default is 0.2.
dp_threshold_snp (int) – Minimum depth for high-confidence SNPs. Default is 10.
high_med_conf_ab (float) – AB threshold for high/medium confidence. Default is 0.3.
low_conf_ab (float) – AB threshold for low confidence. Default is 0.2.
high_conf_p (float) – P(de novo) threshold for high confidence. Default is 0.99.
med_conf_p (float) – P(de novo) threshold for medium confidence. Default is 0.5.

Return type:

StructExpression

Returns:

StructExpression with variant de novo status and confidence of de novo call.