gnomad_qc.v4.sample_qc.identify_trios
Script to identify trios from relatedness data and filter based on Mendel errors and de novos.
usage: gnomad_qc.v4.sample_qc.identify_trios.py [-h] [--test] [--overwrite]
[--slack-channel SLACK_CHANNEL]
[--identify-duplicates]
[--infer-families]
[--create-fake-pedigree]
[--fake-fam-prop FAKE_FAM_PROP]
[--run-mendel-errors]
[--finalize-ped]
[--max-mendel-z [MAX_MENDEL_Z]]
[--max-de-novo-z [MAX_DE_NOVO_Z]]
[--stratify-ukb]
[--max-mendel MAX_MENDEL]
[--ukb-max-mendel UKB_MAX_MENDEL]
[--max-de-novo MAX_DE_NOVO]
[--ukb-max-de-novo UKB_MAX_DE_NOVO]
[--seed SEED]
Named Arguments
- --test
Run Mendel errors on only five partitions of the MT.
Default: False
- --overwrite
Overwrite all data from this subset.
Default: False
- --slack-channel
Slack channel to post results and notifications to.
Duplicate identification
- --identify-duplicates
Create a table with duplicate samples indicating which one is the best to use based on the ranking of all samples after sample QC metric outlier filtering (ranking used to determine related samples to drop for the release).
Default: False
Pedigree inference
- --infer-families
Infer families and trios using the relatedness Table and duplicate Table.
Default: False
Fake Pedigree creation
- --create-fake-pedigree
Create a fake Pedigree from unrelated samples in the data for comparison to the inferred Pedigree.
Default: False
- --fake-fam-prop
Number of fake trios to generate as a proportion of the total number of trios found in the data.
Default: 0.1
Mendel error calculation
- --run-mendel-errors
Calculate Mendel errors for the inferred and fake Pedigrees on chr20.
Default: False
Pedigree filtering for final Pedigree generation
- --finalize-ped
Create final families/trios ped files by excluding trios whose number of Mendel errors or de novos is an outlier. Outliers can be defined as trios with more Mendel errors or de novos than the values specified in --max-mendel and --max-de-novo, respectively. They can also be defined as trios whose Mendel errors or de novos are more than --max-mendel-z or --max-de-novo-z standard deviations above the mean across inferred trios.
Default: False
- --max-mendel-z
Max number of standard deviations above the mean Mendel errors across inferred trios to keep a trio. If flag is set, default is 3.
- --max-de-novo-z
Max number of standard deviations above the mean de novos across inferred trios to keep a trio. If flag is set, default is 3.
- --stratify-ukb
Stratify the Mendel error and de novo standard deviation cutoffs by UKB and non-UKB samples.
Default: False
- --max-mendel
Maximum number of raw Mendel errors for real trios. If specified and --ukb-max-mendel is not, --max-mendel will be used for all samples. If both --max-mendel and --ukb-max-mendel are specified, --max-mendel will be used for non-UKB samples and --ukb-max-mendel will be used for UKB samples.
- --ukb-max-mendel
Maximum number of raw Mendel errors for real trios in UKB samples. If specified, --max-mendel must also be specified for non-UKB samples. If not specified, but --max-mendel is, --max-mendel will be used for all samples.
- --max-de-novo
Maximum number of raw de novo mutations for real trios. If specified and --ukb-max-de-novo is not, --max-de-novo will be used for all samples. If both --max-de-novo and --ukb-max-de-novo are specified, --max-de-novo will be used for non-UKB samples and --ukb-max-de-novo will be used for UKB samples.
- --ukb-max-de-novo
Maximum number of raw de novo mutations for real trios in UKB samples. If specified, --max-de-novo must also be specified for non-UKB samples. If not specified, but --max-de-novo is, --max-de-novo will be used for all samples.
- --seed
Random seed for choosing one random trio per family to keep after filtering.
Default: 24
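The usage line shows --max-mendel-z and --max-de-novo-z with an optional value, and their help text says the default is 3 only when the flag is set. One way to get that behaviour with argparse is nargs="?" together with const. The sketch below is an illustration, not the script's actual parser: build_parser is a hypothetical name, and type=int is assumed from the Optional[int] annotation in the filter_ped signature.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the two z-score flags."""
    parser = argparse.ArgumentParser(description="Sketch of the z-score flags.")
    # nargs="?" makes the value optional:
    #   flag absent        -> default (None, so no z-score filter is requested)
    #   flag with no value -> const (3)
    #   flag with a value  -> that value, converted by type
    parser.add_argument("--max-mendel-z", nargs="?", const=3, type=int, default=None)
    parser.add_argument("--max-de-novo-z", nargs="?", const=3, type=int, default=None)
    return parser


parser = build_parser()
absent = parser.parse_args([])
bare = parser.parse_args(["--max-mendel-z"])
explicit = parser.parse_args(["--max-mendel-z", "4", "--max-de-novo-z"])
```

With this setup, omitting a flag leaves its value as None, which a caller could treat as "skip the z-score cutoff for this metric".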
Module Functions
- Convert a Pedigree with families to a Pedigree with only one random trio per family.
- Filter relatedness Table to only include pairs of samples that are both exomes and not QC-filtered.
- Convert a Table with platform assignments to a sample:platform dictionary.
- Generate a fake Pedigree with fake_fam_prop defining the proportion of the number of trios in ped to use.
- Run Hail's mendel_errors on chr20 of the VDS subset to samples in ped and fake_ped.
- Filter a Pedigree to only trios where all samples are from the same platform.
- Filter a Pedigree based on Mendel errors and de novo metrics.
- Get PipelineResourceCollection for all resources needed in the trio identification pipeline.
- Identify trios and filter based on Mendel errors and de novos.
- Get script argument parser.
- gnomad_qc.v4.sample_qc.identify_trios.families_to_trios(ped, seed=24)[source]
Convert a Pedigree with families to a Pedigree with only one random trio per family.
Filter relatedness Table to only include pairs of samples that are both exomes and not QC-filtered.
- gnomad_qc.v4.sample_qc.identify_trios.platform_table_to_dict(platform_ht)[source]
Convert a Table with platform assignments to a sample:platform dictionary.
Note
This function assumes that the Table has a qc_platform field.
- Parameters:
platform_ht (Table) – Table with platform assignments.
- Return type:
Dict[str, str]
- Returns:
Sample:platform dictionary.
- gnomad_qc.v4.sample_qc.identify_trios.run_create_fake_pedigree(ped, filter_ht, platform_ht=None, fake_fam_prop=0.1)[source]
Generate a fake Pedigree with fake_fam_prop defining the proportion of the number of trios in ped to use.
- Parameters:
ped (Pedigree) – Pedigree to use for generating fake Pedigree.
filter_ht (Table) – Outlier filtering Table.
platform_ht (Optional[Table]) – Optional table with platform assignments. Default is None.
fake_fam_prop (float) – Proportion of trios in ped to use for generating fake Pedigree. Default is 0.1.
- Return type:
Pedigree
- Returns:
Fake Pedigree.
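To make the role of fake_fam_prop concrete, here is a minimal sketch that sizes the fake Pedigree as a proportion of the real trio count and builds each fake trio from three distinct unrelated samples. It is a simplified assumption of the idea, not the function's implementation: make_fake_trios, its seed argument, and the round-down of the trio count are hypothetical, and outlier filtering and platform handling are not modelled.

```python
import random
from typing import List, Sequence, Tuple


def make_fake_trios(
    n_real_trios: int,
    unrelated_samples: Sequence[str],
    fake_fam_prop: float = 0.1,
    seed: int = 24,
) -> List[Tuple[str, str, str]]:
    """Build fake (child, father, mother) trios from unrelated samples."""
    # Assumption: the number of fake trios is rounded down.
    n_fake = int(n_real_trios * fake_fam_prop)
    if 3 * n_fake > len(unrelated_samples):
        raise ValueError("Not enough unrelated samples for the requested trios.")
    rng = random.Random(seed)
    # Sampling without replacement keeps every sample in at most one fake trio.
    drawn = rng.sample(list(unrelated_samples), 3 * n_fake)
    return [(drawn[i], drawn[i + 1], drawn[i + 2]) for i in range(0, 3 * n_fake, 3)]


samples = [f"s{i}" for i in range(100)]
fake_trios = make_fake_trios(n_real_trios=50, unrelated_samples=samples)
```

Because the members of a fake trio are unrelated, their Mendel error counts give a baseline for what a wrongly inferred trio would look like.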
- gnomad_qc.v4.sample_qc.identify_trios.run_mendel_errors(ped, fake_ped, interval_qc_pass_ht=None, test=False)[source]
Run Hail’s mendel_errors on chr20 of the VDS subset to samples in ped and fake_ped.
- Parameters:
ped (Pedigree) – Inferred Pedigree.
fake_ped (Pedigree) – Fake Pedigree.
interval_qc_pass_ht (Optional[Table]) – Optional interval QC pass Table that contains an ‘interval_qc_pass’ annotation indicating whether the interval passes high-quality criteria. This annotation is used to filter the MatrixTable before running mendel_errors. Default is None.
test (bool) – Whether to run on five partitions of the VDS for testing. Default is False.
- Return type:
Table
- Returns:
Table with Mendel errors on chr20.
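For readers unfamiliar with the metric, the sketch below shows what a Mendel error means at a single biallelic autosomal site: the child's genotype cannot be formed from one allele transmitted by each parent. This is a toy check in plain Python, not Hail's mendel_errors, which works on a MatrixTable and covers cases this sketch ignores (for example missing genotypes and sex chromosomes). All function names here are hypothetical.

```python
from itertools import product
from typing import Iterable, Set, Tuple


def transmissible(alt_count: int) -> Set[int]:
    """Alt-allele counts (0 or 1) a parent with this diploid genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[alt_count]


def is_mendel_error(father: int, mother: int, child: int) -> bool:
    """True if the child's alt-allele count is impossible given both parents."""
    possible = {a + b for a, b in product(transmissible(father), transmissible(mother))}
    return child not in possible


def count_mendel_errors(sites: Iterable[Tuple[int, int, int]]) -> int:
    """Count violating sites among (father, mother, child) alt-allele counts."""
    return sum(is_mendel_error(f, m, c) for f, m, c in sites)


# Genotypes are alt-allele counts: 0 = hom-ref, 1 = het, 2 = hom-alt.
example_sites = [(0, 0, 0), (0, 0, 1), (1, 2, 0), (1, 1, 2)]
n_errors = count_mendel_errors(example_sites)
```

A true trio should show few such violations, while a fake trio of unrelated samples should show many, which is what makes the count useful for validating inferred trios.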
- gnomad_qc.v4.sample_qc.identify_trios.filter_ped_to_same_platform(ped, platform_ht)[source]
Filter a Pedigree to only trios where all samples are from the same platform.
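The sample:platform dictionary and the same-platform filter fit together naturally, so the sketch below shows both on plain Python data. The qc_platform field name comes from the note on platform_table_to_dict; the sample key "s", the row dictionaries, and both function names are assumptions made for illustration, and no Hail objects are used.

```python
from typing import Dict, Iterable, List, Tuple

TrioIds = Tuple[str, str, str]  # (child, father, mother) sample IDs


def platform_rows_to_dict(rows: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Map each sample ID (assumed key 's') to its 'qc_platform' value."""
    return {row["s"]: row["qc_platform"] for row in rows}


def trios_on_same_platform(
    trios: Iterable[TrioIds], platform_by_sample: Dict[str, str]
) -> List[TrioIds]:
    """Keep trios whose three members all have the same known platform."""
    kept = []
    for trio in trios:
        platforms = {platform_by_sample.get(sample) for sample in trio}
        # A sample with no platform assignment yields None and drops the trio.
        if len(platforms) == 1 and None not in platforms:
            kept.append(trio)
    return kept


rows = [
    {"s": "child1", "qc_platform": "platform_0"},
    {"s": "dad1", "qc_platform": "platform_0"},
    {"s": "mom1", "qc_platform": "platform_0"},
    {"s": "child2", "qc_platform": "platform_0"},
    {"s": "dad2", "qc_platform": "platform_1"},
    {"s": "mom2", "qc_platform": "platform_0"},
]
platform_by_sample = platform_rows_to_dict(rows)
same_platform = trios_on_same_platform(
    [("child1", "dad1", "mom1"), ("child2", "dad2", "mom2"), ("child3", "dad1", "mom1")],
    platform_by_sample,
)
```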
- gnomad_qc.v4.sample_qc.identify_trios.filter_ped(ped, mendel_ht, max_mendel_z=3, max_de_novo_z=3, stratify_ukb=False, max_mendel_n=None, ukb_max_mendel_n=None, max_de_novo_n=None, ukb_max_de_novo_n=None)[source]
Filter a Pedigree based on Mendel errors and de novo metrics.
- Parameters:
ped (Pedigree) – Pedigree to filter.
mendel_ht (Table) – Table with Mendel errors.
max_mendel_z (Optional[int]) – Optional maximum z-score for Mendel error metrics. Default is 3.
max_de_novo_z (Optional[int]) – Optional maximum z-score for de novo metrics. Default is 3.
stratify_ukb (bool) – Whether to stratify z-score cutoffs by trio UK Biobank status. Default is False.
max_mendel_n (Optional[int]) – Optional maximum Mendel error count. If specified and ukb_max_mendel_n is not, max_mendel_n will be used for all samples. If both max_mendel_n and ukb_max_mendel_n are specified, max_mendel_n will be used for non-UKB samples and ukb_max_mendel_n will be used for UKB samples. Default is None.
ukb_max_mendel_n (Optional[int]) – Optional maximum Mendel error count for trios in UK Biobank. If specified, max_mendel_n must also be specified for non-UKB samples. If not specified, but max_mendel_n is, max_mendel_n will be used for all samples. Default is None.
max_de_novo_n (Optional[int]) – Optional maximum de novo count. If specified and ukb_max_de_novo_n is not, max_de_novo_n will be used for all samples. If both max_de_novo_n and ukb_max_de_novo_n are specified, max_de_novo_n will be used for non-UKB samples and ukb_max_de_novo_n will be used for UKB samples. Default is None.
ukb_max_de_novo_n (Optional[int]) – Optional maximum de novo count for trios in UK Biobank. If specified, max_de_novo_n must also be specified for non-UKB samples. If not specified, but max_de_novo_n is, max_de_novo_n will be used for all samples. Default is None.
- Return type:
Tuple[Pedigree, Dict[str, Dict[str, int]]]
- Returns:
Tuple of filtered Pedigree and dictionary of filtering parameters.
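The two kinds of cutoff described in the parameters can be sketched as follows: a z-score cutoff computed as the mean plus a number of standard deviations, and a raw-count cutoff that falls back from the UKB-specific value to the general one. This is a minimal sketch of the documented rules, not the body of filter_ped: the helper names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def z_cutoff(values: Sequence[float], max_z: float) -> float:
    """Largest value allowed: mean + max_z standard deviations (population SD assumed)."""
    return mean(values) + max_z * pstdev(values)


def resolve_raw_cutoff(
    max_n: Optional[int], ukb_max_n: Optional[int], is_ukb: bool
) -> Optional[int]:
    """Pick the raw-count cutoff for a trio following the documented fallback rules."""
    if ukb_max_n is not None and max_n is None:
        raise ValueError("A UKB cutoff requires the general cutoff to be set too.")
    if is_ukb and ukb_max_n is not None:
        return ukb_max_n
    # Used for non-UKB trios, and for all trios when no UKB cutoff is given.
    return max_n


def keep_trio(value: float, cutoff: Optional[float]) -> bool:
    """A trio passes when no cutoff applies or its value does not exceed it."""
    return cutoff is None or value <= cutoff


# Hypothetical Mendel error counts: 19 typical trios and one clear outlier.
mendel_counts = [10] * 19 + [200]
cutoff = z_cutoff(mendel_counts, max_z=3)
kept_counts = [n for n in mendel_counts if keep_trio(n, cutoff)]
```

Note that with very few trios a single outlier inflates the standard deviation enough that it may never exceed a cutoff of 3, so z-score filtering is only informative on reasonably large sets of inferred trios.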
- gnomad_qc.v4.sample_qc.identify_trios.get_trio_resources(overwrite, test)[source]
Get PipelineResourceCollection for all resources needed in the trio identification pipeline.
- Parameters:
overwrite (bool) – Whether to overwrite existing resources.
test (bool) – Whether to use test resources.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the trio identification pipeline.