gnomad_qc.v4.variant_qc.random_forest
Script for running random forest model on gnomAD v4 variant QC data.
usage: gnomad_qc.v4.variant_qc.random_forest.py [-h]
[--slack_channel SLACK_CHANNEL]
[--overwrite] [--test]
[--model-id MODEL_ID]
[--list-rf-runs] [--train-rf]
[--apply-rf]
[--compute-info-method {AS,quasi,set_long_AS_missing}]
[--features FEATURES [FEATURES ...]]
[--fp-to-tp FP_TO_TP]
[--test-intervals TEST_INTERVALS [TEST_INTERVALS ...]]
[--num-trees NUM_TREES]
[--max-depth MAX_DEPTH]
[--adj]
[--transmitted-singletons]
[--sibling-singletons]
[--filter-centromere-telomere]
[--interval-qc-filter]
Named Arguments
- --slack_channel
Slack channel to post results and notifications to.
- --overwrite
Overwrite all data from this subset.
Default: False
- --test
If the dataset should be filtered to chr22 for testing (also filtered to the evaluation interval(s) specified by --test-intervals).
Default: False
- --model-id
Model ID. Created by --train-rf and only needed for --apply-rf without running --train-rf.
Actions
- --list-rf-runs
Lists all previous RF runs, along with their model ID, parameters and testing results.
Default: False
- --train-rf
Trains RF model.
Default: False
- --apply-rf
Applies RF model to the data.
Default: False
Random Forest parameters
- --compute-info-method
Possible choices: AS, quasi, set_long_AS_missing
Method of computing the INFO score to use for the variant QC features. Default is ‘AS’.
Default: “AS”
- --features
Features to use in the random forests model.
Default: ['AS_MQRankSum', 'AS_pab_max', 'AS_QD', 'AS_ReadPosRankSum', 'AS_SOR', 'allele_type', 'has_star', 'n_alt_alleles', 'variant_type']
- --fp-to-tp
Ratio of FPs to TPs for training the RF model. If 0, all training examples are used. Default is 1.0.
Default: 1.0
- --test-intervals
The specified interval(s) will be held out for testing and evaluation only. Default is “chr20”.
Default: “chr20”
- --num-trees
Number of trees in the RF model. Default is 500.
Default: 500
- --max-depth
Maximum tree depth in the RF model. Default is 5.
Default: 5
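The --fp-to-tp ratio can be illustrated with a small calculation. The sketch below assumes the balance is reached by downsampling whichever class is over-represented; the function name `sampling_fractions` and this exact strategy are illustrative assumptions, not taken from the gnomAD code.

```python
from typing import Tuple


def sampling_fractions(n_tp: int, n_fp: int, fp_to_tp: float = 1.0) -> Tuple[float, float]:
    """Return (tp_keep_fraction, fp_keep_fraction) for a target FP:TP ratio.

    Hypothetical helper: assumes the over-represented class is downsampled
    so that kept FPs / kept TPs is approximately `fp_to_tp`.
    """
    # A ratio of 0 means "use all training examples", as documented above.
    if fp_to_tp <= 0 or n_tp == 0 or n_fp == 0:
        return 1.0, 1.0

    desired_fp = fp_to_tp * n_tp
    if desired_fp < n_fp:
        # Too many FPs: keep every TP and only a fraction of the FPs.
        return 1.0, desired_fp / n_fp
    # Too few FPs: keep every FP and only a fraction of the TPs.
    return n_fp / desired_fp, 1.0


balanced = sampling_fractions(n_tp=1000, n_fp=4000, fp_to_tp=1.0)
keep_all = sampling_fractions(n_tp=1000, n_fp=4000, fp_to_tp=0)
```

With 1,000 TPs and 4,000 FPs at the default ratio of 1.0, this sketch keeps all TPs and a quarter of the FPs.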
Training data parameters
- --adj
Use adj genotypes for transmitted/sibling singletons.
Default: False
- --transmitted-singletons
Include transmitted singletons in training.
Default: False
- --sibling-singletons
Include sibling singletons in training.
Default: False
- --filter-centromere-telomere
Train RF without centromeres and telomeres.
Default: False
- --interval-qc-filter
Whether interval QC should be applied for RF training.
Default: False
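To show how the documented flags and defaults fit together, here is a minimal argparse sketch covering a subset of them. It is a reconstruction from the listing above, not the module's own parser; `build_parser` and `needs_model_id` are hypothetical names, and whether the real script enforces the --model-id rule this way is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of a subset of the documented flags."""
    parser = argparse.ArgumentParser(description="Sketch of the random forest CLI.")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--model-id")

    actions = parser.add_argument_group("Actions")
    actions.add_argument("--train-rf", action="store_true")
    actions.add_argument("--apply-rf", action="store_true")

    rf = parser.add_argument_group("Random Forest parameters")
    rf.add_argument(
        "--compute-info-method",
        default="AS",
        choices=["AS", "quasi", "set_long_AS_missing"],
    )
    rf.add_argument("--fp-to-tp", type=float, default=1.0)
    rf.add_argument("--test-intervals", nargs="+", default="chr20")
    rf.add_argument("--num-trees", type=int, default=500)
    rf.add_argument("--max-depth", type=int, default=5)
    return parser


def needs_model_id(args: argparse.Namespace) -> bool:
    """True when --apply-rf is requested without --train-rf and no ID is given."""
    return args.apply_rf and not args.train_rf and args.model_id is None


args = build_parser().parse_args(["--train-rf", "--num-trees", "100"])
apply_only = build_parser().parse_args(["--apply-rf"])
```

Here `args` carries the overridden tree count alongside the documented defaults, while `apply_only` is the case in which a model ID would have to be supplied.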
Module Functions
- Train random forest model using train_rf_model.
- Add RF model run to RF run list.
- Get PipelineResourceCollection for all resources needed in the variant QC pipeline.
- Run random forest variant QC pipeline.
- Get script argument parser.
- gnomad_qc.v4.variant_qc.random_forest.train_rf(ht, test=False, features=None, fp_to_tp=1.0, num_trees=500, max_depth=5, transmitted_singletons=False, sibling_singletons=False, adj=False, filter_centromere_telomere=False, test_intervals='chr20', interval_qc_pass_ht=None)
Train random forest model using train_rf_model.
- Parameters:
ht (Table) – Table containing annotations needed for RF training.
test (bool) – Whether to filter the input Table to chr20 and test_intervals for test purposes. Default is False.
features (List[str]) – List of features to use in the random forests model. When no features list is provided, the global FEATURES is used.
fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used. Default is 1.0.
num_trees (int) – Number of trees in the RF model. Default is 500.
max_depth (int) – Maximum tree depth in the RF model. Default is 5.
transmitted_singletons (bool) – Whether to use transmitted singletons for training. Default is False.
sibling_singletons (bool) – Whether to use sibling singletons for training. Default is False.
adj (bool) – Whether to use adj genotypes for transmitted/sibling singletons instead of raw. Default is False, in which case raw genotypes are used.
filter_centromere_telomere (bool) – Filter centromeres and telomeres before training. Default is False.
test_intervals (Union[str, List[str]]) – Specified interval(s) will be held out for testing and evaluation only. Default is “chr20”.
interval_qc_pass_ht (Optional[Table]) – Optional interval QC pass Table that contains an ‘interval_qc_pass’ annotation indicating whether the interval passes high-quality criteria. This annotation is used to filter the Table before training the RF model. Default is None.
- Return type:
Tuple[Table, Any]
- Returns:
Input ht annotated with training information and the RF model.
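The test_intervals parameter accepts either a single string or a list of strings, and the matching variants are held out from training. The sketch below illustrates that idea on plain Python dictionaries. It assumes each interval is a whole contig name; the real function operates on a Hail Table and genomic intervals, and `split_by_test_intervals` is a hypothetical helper.

```python
from typing import Dict, List, Tuple, Union

Variant = Dict[str, object]


def split_by_test_intervals(
    variants: List[Variant],
    test_intervals: Union[str, List[str]] = "chr20",
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants into (training, held-out) lists by contig name.

    Simplification: every interval is treated as a whole contig.
    """
    # Normalise the Union[str, List[str]] input to a list.
    if isinstance(test_intervals, str):
        test_intervals = [test_intervals]
    held_out_contigs = set(test_intervals)

    train = [v for v in variants if v["contig"] not in held_out_contigs]
    held_out = [v for v in variants if v["contig"] in held_out_contigs]
    return train, held_out


example_variants = [
    {"contig": "chr1", "pos": 100},
    {"contig": "chr20", "pos": 200},
    {"contig": "chr22", "pos": 300},
]
train_set, test_set = split_by_test_intervals(example_variants)
```

Passing `["chr20", "chr22"]` instead of the default would hold out both contigs.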
- gnomad_qc.v4.variant_qc.random_forest.add_model_to_run_list(ht, model_id, rf_runs, rf_run_path)
Add RF model run to RF run list.
- Parameters:
ht (Table) – Table containing RF model run information as globals.
model_id (str) – ID of RF model run.
rf_runs (Dict[str, Any]) – Dictionary containing current RF run information.
rf_run_path (str) – Path to RF run list.
- Return type:
None
- Returns:
None
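The bookkeeping described here can be sketched as follows. This is an illustration only: a plain dictionary stands in for the Table globals, the run list is assumed to be stored as JSON on the local filesystem, and `add_run` and the example keys are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict


def add_run(
    run_globals: Dict[str, Any],
    model_id: str,
    rf_runs: Dict[str, Any],
    rf_run_path: str,
) -> None:
    """Record one model run in the run dictionary and write it to disk.

    Assumptions: `run_globals` replaces the Table globals, and the run
    list is serialised as JSON (the storage format is not stated above).
    """
    rf_runs[model_id] = dict(run_globals)
    with open(rf_run_path, "w") as f:
        json.dump(rf_runs, f, indent=2)


with tempfile.TemporaryDirectory() as tmp_dir:
    run_list_path = os.path.join(tmp_dir, "rf_runs.json")
    runs: Dict[str, Any] = {}
    add_run({"num_trees": 500, "max_depth": 5}, "rf_example", runs, run_list_path)
    # Read the file back to confirm what was written.
    with open(run_list_path) as f:
        saved_runs = json.load(f)
```

Like the documented function, the sketch returns None and works by updating `rf_runs` and the file at `rf_run_path`.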
- gnomad_qc.v4.variant_qc.random_forest.get_variant_qc_resources(test, overwrite, model_id=None)
Get PipelineResourceCollection for all resources needed in the variant QC pipeline.
- Parameters:
test (bool) – Whether to gather all resources for testing.
overwrite (bool) – Whether to overwrite resources if they exist.
model_id (str) – Model ID to use for RF model. If not provided, a new model ID will be generated.
- Return type:
PipelineResourceCollection
- Returns:
PipelineResourceCollection containing resources for all steps of the variant QC pipeline.
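The fallback to a newly generated model ID could look like the sketch below. The ID format is not stated in this documentation, so the `rf_` prefix, the eight hexadecimal characters, and the collision check against existing runs are all assumptions, as is the helper name `resolve_model_id`.

```python
import uuid
from typing import Collection, Optional


def resolve_model_id(model_id: Optional[str], existing_ids: Collection[str]) -> str:
    """Return the given model ID, or generate one not already in use.

    Assumed format: "rf_" followed by 8 hex characters from a random UUID.
    """
    if model_id is not None:
        return model_id
    while True:
        candidate = f"rf_{uuid.uuid4().hex[:8]}"
        # Retry on the unlikely chance the ID collides with an earlier run.
        if candidate not in existing_ids:
            return candidate


known_ids = {"rf_1a2b3c4d"}
reused_id = resolve_model_id("rf_1a2b3c4d", known_ids)
fresh_id = resolve_model_id(None, known_ids)
```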