gnomad_qc.v4.sample_qc.identify_trios

Script to identify trios from relatedness data and filter based on Mendel errors and de novos.

usage: gnomad_qc.v4.sample_qc.identify_trios.py [-h] [--test] [--overwrite]
                                                [--slack-channel SLACK_CHANNEL]
                                                [--identify-duplicates]
                                                [--infer-families]
                                                [--create-fake-pedigree]
                                                [--fake-fam-prop FAKE_FAM_PROP]
                                                [--run-mendel-errors]
                                                [--finalize-ped]
                                                [--max-mendel-z [MAX_MENDEL_Z]]
                                                [--max-de-novo-z [MAX_DE_NOVO_Z]]
                                                [--stratify-ukb]
                                                [--max-mendel MAX_MENDEL]
                                                [--ukb-max-mendel UKB_MAX_MENDEL]
                                                [--max-de-novo MAX_DE_NOVO]
                                                [--ukb-max-de-novo UKB_MAX_DE_NOVO]
                                                [--seed SEED]

Named Arguments

--test

Run Mendel errors on only five partitions of the MT.

Default: False

--overwrite

Overwrite all data from this subset.

Default: False

--slack-channel

Slack channel to post results and notifications to.

Duplicate identification

--identify-duplicates

Create a table with duplicate samples indicating which one is the best to use based on the ranking of all samples after sample QC metric outlier filtering (ranking used to determine related samples to drop for the release).

Default: False

Pedigree inference

--infer-families

Infer families and trios using the relatedness Table and duplicate Table.

Default: False

Fake Pedigree creation

--create-fake-pedigree

Create a fake Pedigree from unrelated samples in the data for comparison to the inferred Pedigree.

Default: False

--fake-fam-prop

Number of fake trios to generate as a proportion of the total number of trios found in the data.

Default: 0.1
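To make the proportion concrete, here is a minimal sketch of how it could translate into a trio count. The helper name `n_fake_trios` is hypothetical, and truncating the product to an integer is an assumption; the rounding the script actually applies is not stated on this page.

```python
def n_fake_trios(n_inferred_trios: int, fake_fam_prop: float = 0.1) -> int:
    """Hypothetical helper: number of fake trios for a given proportion.

    Assumption: the product is truncated toward zero.
    """
    if fake_fam_prop < 0:
        raise ValueError("fake_fam_prop must be non-negative")
    return int(n_inferred_trios * fake_fam_prop)


# With 1,000 inferred trios and the default proportion, 100 fake trios
# would be generated under this assumption.
print(n_fake_trios(1000))
```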

Mendel error calculation

--run-mendel-errors

Calculate Mendel errors for the inferred and fake Pedigrees on chr20.

Default: False

Pedigree filtering for final Pedigree generation

--finalize-ped

Create final families/trios ped files by excluding trios whose number of Mendel errors or de novos is an outlier. Outliers can be defined as trios that have more Mendel errors or de novos than the counts specified in --max-mendel and --max-de-novo, respectively. They can also be defined as trios whose Mendel errors or de novos are more than --max-mendel-z or --max-de-novo-z standard deviations above the mean across inferred trios.

Default: False

--max-mendel-z

Max number of standard deviations above the mean Mendel errors across inferred trios to keep a trio. If the flag is given without a value, 3 is used.

--max-de-novo-z

Max number of standard deviations above the mean de novos across inferred trios to keep a trio. If the flag is given without a value, 3 is used.
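The usage line lists these two flags as `[--max-mendel-z [MAX_MENDEL_Z]]` and `[--max-de-novo-z [MAX_DE_NOVO_Z]]`, which is how argparse renders an option whose value is itself optional. A minimal sketch of that pattern follows. The `type=int` and `default=None` choices are assumptions, not taken from the script.

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="?" makes the value optional; const is used when the flag is
# present without a value. type and default here are assumptions.
parser.add_argument("--max-mendel-z", nargs="?", const=3, type=int, default=None)
parser.add_argument("--max-de-novo-z", nargs="?", const=3, type=int, default=None)

flag_absent = parser.parse_args([])
flag_bare = parser.parse_args(["--max-mendel-z"])
flag_with_value = parser.parse_args(["--max-mendel-z", "4"])

print(flag_absent.max_mendel_z)      # None
print(flag_bare.max_mendel_z)        # 3
print(flag_with_value.max_mendel_z)  # 4
```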

--stratify-ukb

Stratify the Mendel error and de novo standard deviation cutoffs by UKB and non-UKB samples.

Default: False

--max-mendel

Maximum number of raw Mendel errors for real trios. If specified and --ukb-max-mendel is not, --max-mendel will be used for all samples. If both --max-mendel and --ukb-max-mendel are specified, --max-mendel will be used for non-UKB samples and --ukb-max-mendel will be used for UKB samples.

--ukb-max-mendel

Maximum number of raw Mendel errors for real trios in UKB samples. If specified, --max-mendel must also be specified for non-UKB samples. If not specified, but --max-mendel is, --max-mendel will be used for all samples.

--max-de-novo

Maximum number of raw de novo mutations for real trios. If specified and --ukb-max-de-novo is not, --max-de-novo will be used for all samples. If both --max-de-novo and --ukb-max-de-novo are specified, --max-de-novo will be used for non-UKB samples and --ukb-max-de-novo will be used for UKB samples.

--ukb-max-de-novo

Maximum number of raw de novo mutations for real trios in UKB samples. If specified, --max-de-novo must also be specified for non-UKB samples. If not specified, but --max-de-novo is, --max-de-novo will be used for all samples.
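The interaction between a general cutoff and its UKB-specific counterpart, as described for the four raw-count options above, can be summarised in a short sketch. The function `resolve_cutoffs` is hypothetical and only mirrors the documented rules; raising `ValueError` for the invalid combination is an assumption about how the error is reported.

```python
from typing import Optional, Tuple


def resolve_cutoffs(
    max_n: Optional[int], ukb_max_n: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper returning (non_ukb_cutoff, ukb_cutoff)."""
    if ukb_max_n is not None and max_n is None:
        # A UKB cutoff on its own leaves non-UKB samples without a cutoff.
        raise ValueError("A UKB cutoff requires the general cutoff to be set too.")
    if ukb_max_n is None:
        # Only the general cutoff (or neither) is set: use it for all samples.
        return max_n, max_n
    # Both are set: general cutoff for non-UKB, UKB cutoff for UKB samples.
    return max_n, ukb_max_n


print(resolve_cutoffs(10, None))  # (10, 10)
print(resolve_cutoffs(10, 20))    # (10, 20)
```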

--seed

Random seed for choosing one random trio per family to keep after filtering.

Default: 24

Module Functions

gnomad_qc.v4.sample_qc.identify_trios.families_to_trios(ped)

Convert a Pedigree with families to a Pedigree with only one random trio per family.

gnomad_qc.v4.sample_qc.identify_trios.filter_relatedness_ht(ht, ...)

Filter relatedness Table to only include pairs of samples that are both exomes and not QC-filtered.

gnomad_qc.v4.sample_qc.identify_trios.platform_table_to_dict(...)

Convert a Table with platform assignments to a sample:platform dictionary.

gnomad_qc.v4.sample_qc.identify_trios.run_create_fake_pedigree(...)

Generate a fake Pedigree with fake_fam_prop defining the proportion of the number of trios in ped to use.

gnomad_qc.v4.sample_qc.identify_trios.run_mendel_errors(...)

Run Hail's mendel_errors on chr20 of the VDS subset to samples in ped and fake_ped.

gnomad_qc.v4.sample_qc.identify_trios.filter_ped_to_same_platform(...)

Filter a Pedigree to only trios where all samples are from the same platform.

gnomad_qc.v4.sample_qc.identify_trios.filter_ped(...)

Filter a Pedigree based on Mendel errors and de novo metrics.

gnomad_qc.v4.sample_qc.identify_trios.get_trio_resources(...)

Get PipelineResourceCollection for all resources needed in the trio identification pipeline.

gnomad_qc.v4.sample_qc.identify_trios.main(args)

Identify trios and filter based on Mendel errors and de novos.

gnomad_qc.v4.sample_qc.identify_trios.get_script_argument_parser()

Get script argument parser.


gnomad_qc.v4.sample_qc.identify_trios.families_to_trios(ped, seed=24)[source]

Convert a Pedigree with families to a Pedigree with only one random trio per family.

Parameters:
  • ped (Pedigree) – Pedigree with families.

  • seed (int) – Random seed for choosing trio to keep from each family. Default is 24.

Return type:

Pedigree

Returns:

Pedigree with only one trio per family.
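The idea of keeping one reproducibly random trio per family can be sketched in plain Python, without Hail. Everything here is illustrative: the `Trio` tuple layout, the `families` mapping, and the helper `one_trio_per_family` are hypothetical stand-ins for the Pedigree objects the real function works on, and the sampling procedure is an assumption.

```python
import random
from typing import Dict, List, Tuple

Trio = Tuple[str, str, str]  # hypothetical layout: (proband, father, mother)


def one_trio_per_family(
    families: Dict[str, List[Trio]], seed: int = 24
) -> Dict[str, Trio]:
    """Pick one trio per family using a seeded random number generator."""
    rng = random.Random(seed)
    # Iterate in sorted order so the result depends only on the seed
    # and the data, not on dictionary insertion order.
    return {fam_id: rng.choice(trios) for fam_id, trios in sorted(families.items())}


families = {
    "fam1": [("kid1", "dad1", "mom1"), ("kid2", "dad1", "mom1")],
    "fam2": [("kid3", "dad2", "mom2")],
}
chosen = one_trio_per_family(families, seed=24)
```

Passing the same seed again returns the same selection, which is the point of exposing `seed` as a parameter.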

gnomad_qc.v4.sample_qc.identify_trios.filter_relatedness_ht(ht, filter_ht)[source]

Filter relatedness Table to only include pairs of samples that are both exomes and not QC-filtered.

Parameters:
  • ht (Table) – Relatedness Table.

  • filter_ht (Table) – Outlier filtering Table.

Return type:

Table

Returns:

Filtered relatedness Table.

gnomad_qc.v4.sample_qc.identify_trios.platform_table_to_dict(platform_ht)[source]

Convert a Table with platform assignments to a sample:platform dictionary.

Note

This function assumes that the Table has a qc_platform field.

Parameters:

platform_ht (Table) – Table with platform assignments.

Return type:

Dict[str, str]

Returns:

Sample:platform dictionary.

gnomad_qc.v4.sample_qc.identify_trios.run_create_fake_pedigree(ped, filter_ht, platform_ht=None, fake_fam_prop=0.1)[source]

Generate a fake Pedigree with fake_fam_prop defining the proportion of the number of trios in ped to use.

Parameters:
  • ped (Pedigree) – Pedigree to use for generating fake Pedigree.

  • filter_ht (Table) – Outlier filtering Table.

  • platform_ht (Optional[Table]) – Optional table with platform assignments. Default is None.

  • fake_fam_prop (float) – Proportion of trios in ped to use for generating fake Pedigree. Default is 0.1.

Return type:

Pedigree

Returns:

Fake Pedigree.

gnomad_qc.v4.sample_qc.identify_trios.run_mendel_errors(ped, fake_ped, interval_qc_pass_ht=None, test=False)[source]

Run Hail’s mendel_errors on chr20 of the VDS subset to samples in ped and fake_ped.

Parameters:
  • ped (Pedigree) – Inferred Pedigree.

  • fake_ped (Pedigree) – Fake Pedigree.

  • interval_qc_pass_ht (Optional[Table]) – Optional interval QC pass Table that contains an ‘interval_qc_pass’ annotation indicating whether the interval passes high-quality criteria. This annotation is used to filter the MatrixTable before running mendel_errors. Default is None.

  • test (bool) – Whether to run on five partitions of the VDS for testing. Default is False.

Return type:

Table

Returns:

Table with Mendel errors on chr20.

gnomad_qc.v4.sample_qc.identify_trios.filter_ped_to_same_platform(ped, platform_ht)[source]

Filter a Pedigree to only trios where all samples are from the same platform.

Parameters:
  • ped (Pedigree) – Pedigree to filter.

  • platform_ht (Table) – Table with platform assignments.

Return type:

Pedigree

Returns:

Filtered Pedigree.
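A minimal plain-Python sketch of the same-platform rule, using a sample:platform dictionary like the one `platform_table_to_dict` is described as returning. The helper `trios_on_same_platform`, the tuple representation of a trio, and the decision to drop trios with an unassigned sample are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Trio = Tuple[str, str, str]  # hypothetical layout: (proband, father, mother)


def trios_on_same_platform(
    trios: List[Trio], sample_platform: Dict[str, str]
) -> List[Trio]:
    """Keep trios whose three samples share one known platform."""
    kept = []
    for trio in trios:
        platforms = {sample_platform.get(sample) for sample in trio}
        # Assumption: a sample with no platform assignment disqualifies the trio.
        if len(platforms) == 1 and None not in platforms:
            kept.append(trio)
    return kept


sample_platform = {"kid1": "platform_1", "dad1": "platform_1", "mom1": "platform_1",
                   "kid2": "platform_1", "dad2": "platform_2", "mom2": "platform_1"}
trios = [("kid1", "dad1", "mom1"), ("kid2", "dad2", "mom2")]
same_platform_trios = trios_on_same_platform(trios, sample_platform)
```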

gnomad_qc.v4.sample_qc.identify_trios.filter_ped(ped, mendel_ht, max_mendel_z=3, max_de_novo_z=3, stratify_ukb=False, max_mendel_n=None, ukb_max_mendel_n=None, max_de_novo_n=None, ukb_max_de_novo_n=None)[source]

Filter a Pedigree based on Mendel errors and de novo metrics.

Parameters:
  • ped (Pedigree) – Pedigree to filter.

  • mendel_ht (Table) – Table with Mendel errors.

  • max_mendel_z (Optional[int]) – Optional maximum z-score for Mendel error metrics. Default is 3.

  • max_de_novo_z (Optional[int]) – Optional maximum z-score for de novo metrics. Default is 3.

  • stratify_ukb (bool) – Whether to stratify z-score cutoffs by trio UK Biobank status. Default is False.

  • max_mendel_n (Optional[int]) – Optional maximum Mendel error count. If specified and ukb_max_mendel_n is not, max_mendel_n will be used for all samples. If both max_mendel_n and ukb_max_mendel_n are specified, max_mendel_n will be used for non-UKB samples and ukb_max_mendel_n will be used for UKB samples. Default is None.

  • ukb_max_mendel_n (Optional[int]) – Optional maximum Mendel error count for trios in UK Biobank. If specified, max_mendel_n must also be specified for non-UKB samples. If not specified, but max_mendel_n is, max_mendel_n will be used for all samples. Default is None.

  • max_de_novo_n (Optional[int]) – Optional maximum de novo count. If specified and ukb_max_de_novo_n is not, max_de_novo_n will be used for all samples. If both max_de_novo_n and ukb_max_de_novo_n are specified, max_de_novo_n will be used for non-UKB samples and ukb_max_de_novo_n will be used for UKB samples. Default is None.

  • ukb_max_de_novo_n (Optional[int]) – Optional maximum de novo count for trios in UK Biobank. If specified, max_de_novo_n must also be specified for non-UKB samples. If not specified, but max_de_novo_n is, max_de_novo_n will be used for all samples. Default is None.

Return type:

Tuple[Pedigree, Dict[str, Dict[str, int]]]

Returns:

Tuple of filtered Pedigree and dictionary of filtering parameters.
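The z-score cutoff can be illustrated with a small standalone sketch: a trio is flagged when its count lies more than `max_z` standard deviations above the mean across trios. The function `z_score_outliers` and the input mapping are hypothetical, and using the population standard deviation is an assumption; this is not the function's actual implementation.

```python
from statistics import mean, pstdev
from typing import Dict, Set


def z_score_outliers(counts: Dict[str, int], max_z: float = 3) -> Set[str]:
    """Return trio IDs whose count is more than max_z SDs above the mean."""
    values = list(counts.values())
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # All trios have the same count, so none can be an outlier.
        return set()
    return {trio for trio, n in counts.items() if (n - mu) / sd > max_z}


# Twenty trios with 10 Mendel errors each and one trio with 100.
mendel_counts = {f"trio_{i}": 10 for i in range(20)}
mendel_counts["trio_20"] = 100
outliers = z_score_outliers(mendel_counts, max_z=3)
```

Note that with only a handful of trios a single extreme value inflates the standard deviation enough that it may not exceed the cutoff, which is one reason raw-count thresholds are offered alongside the z-score ones.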

gnomad_qc.v4.sample_qc.identify_trios.get_trio_resources(overwrite, test)[source]

Get PipelineResourceCollection for all resources needed in the trio identification pipeline.

Parameters:
  • overwrite (bool) – Whether to overwrite existing resources.

  • test (bool) – Whether to use test resources.

Return type:

PipelineResourceCollection

Returns:

PipelineResourceCollection containing resources for all steps of the trio identification pipeline.

gnomad_qc.v4.sample_qc.identify_trios.main(args)[source]

Identify trios and filter based on Mendel errors and de novos.