gnomad_qc.v4.variant_qc.random_forest

Script for running random forest model on gnomAD v4 variant QC data.

usage: gnomad_qc.v4.variant_qc.random_forest.py [-h]
                                                [--slack_channel SLACK_CHANNEL]
                                                [--overwrite] [--test]
                                                [--model-id MODEL_ID]
                                                [--list-rf-runs] [--train-rf]
                                                [--apply-rf]
                                                [--compute-info-method {AS,quasi,set_long_AS_missing}]
                                                [--features FEATURES [FEATURES ...]]
                                                [--fp-to-tp FP_TO_TP]
                                                [--test-intervals TEST_INTERVALS [TEST_INTERVALS ...]]
                                                [--num-trees NUM_TREES]
                                                [--max-depth MAX_DEPTH]
                                                [--adj]
                                                [--transmitted-singletons]
                                                [--sibling-singletons]
                                                [--filter-centromere-telomere]
                                                [--interval-qc-filter]

Named Arguments

--slack_channel

Slack channel to post results and notifications to.

--overwrite

Overwrite all data from this subset.

Default: False

--test

If the dataset should be filtered to chr22 for testing (also filtered to evaluation interval specified by --test-intervals).

Default: False

--model-id

Model ID. Created by --train-rf and only needed for --apply-rf without running --train-rf.

Actions

--list-rf-runs

Lists all previous RF runs, along with their model ID, parameters and testing results.

Default: False

--train-rf

Trains RF model.

Default: False

--apply-rf

Applies RF model to the data.

Default: False

Random Forest parameters

--compute-info-method

Possible choices: AS, quasi, set_long_AS_missing

Method of computing the INFO score to use for the variant QC features. Default is ‘AS’.

Default: “AS”

--features

Features to use in the random forest model.

Default: [‘AS_MQRankSum’, ‘AS_pab_max’, ‘AS_QD’, ‘AS_ReadPosRankSum’, ‘AS_SOR’, ‘allele_type’, ‘has_star’, ‘n_alt_alleles’, ‘variant_type’]

--fp-to-tp

Ratio of FPs to TPs for training the RF model. If 0, all training examples are used. Default is 1.0.

Default: 1.0
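The effect of --fp-to-tp can be illustrated with a small sketch. This is not the pipeline's sampling code: the function name `sampling_probabilities` is hypothetical, and the choice to downsample whichever class is over-represented is an assumption made for illustration. Only the behavior "0 means all training examples are used" comes from the description above.

```python
from typing import Tuple


def sampling_probabilities(
    n_tp: int, n_fp: int, fp_to_tp: float = 1.0
) -> Tuple[float, float]:
    """Return (prob_tp, prob_fp), the fraction of each class to keep.

    Hypothetical helper: assumes the over-represented class is
    downsampled until n_fp_kept / n_tp_kept == fp_to_tp.
    """
    if fp_to_tp < 0:
        raise ValueError("fp_to_tp must be non-negative")
    # A ratio of 0 means "use every training example" (from the docs above).
    if fp_to_tp == 0 or n_tp == 0 or n_fp == 0:
        return 1.0, 1.0
    desired_fp = fp_to_tp * n_tp
    if desired_fp < n_fp:
        # Too many FPs: keep all TPs, downsample FPs.
        return 1.0, desired_fp / n_fp
    # Too few FPs: keep all FPs, downsample TPs.
    return n_fp / desired_fp, 1.0


prob_tp, prob_fp = sampling_probabilities(n_tp=1000, n_fp=4000, fp_to_tp=1.0)
```

With 1000 TPs and 4000 FPs at a ratio of 1.0, the sketch keeps every TP and a quarter of the FPs, giving roughly equal class sizes.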

--test-intervals

The specified interval(s) will be held out for testing and evaluation only. Default is “chr20”.

Default: “chr20”
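As a rough illustration of holding out intervals, the sketch below splits a list of variants into training and held-out sets. It uses plain Python tuples rather than the Hail Tables the pipeline operates on; the helper names, the `contig:start-end` string format, and the inclusive treatment of both interval ends are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# (contig, start, end); start and end are None for a whole-contig interval.
Interval = Tuple[str, Optional[int], Optional[int]]


def parse_interval(text: str) -> Interval:
    """Parse 'chr20' or 'chr2:100-200' (hypothetical format, inclusive ends)."""
    if ":" not in text:
        return text, None, None
    contig, span = text.split(":", 1)
    start, end = span.split("-", 1)
    return contig, int(start), int(end)


def is_held_out(contig: str, pos: int, intervals: List[Interval]) -> bool:
    """Return True if the position falls inside any held-out interval."""
    for c, start, end in intervals:
        if contig == c and (start is None or start <= pos <= end):
            return True
    return False


intervals = [parse_interval("chr20"), parse_interval("chr2:100-200")]
variants = [("chr20", 5), ("chr1", 5), ("chr2", 150), ("chr2", 250)]

# Held-out variants are used for testing and evaluation only.
held_out = [v for v in variants if is_held_out(v[0], v[1], intervals)]
training = [v for v in variants if not is_held_out(v[0], v[1], intervals)]
```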

--num-trees

Number of trees in the RF model. Default is 500.

Default: 500

--max-depth

Maximum tree depth in the RF model. Default is 5.

Default: 5

Training data parameters

--adj

Use adj genotypes for transmitted/sibling singletons.

Default: False

--transmitted-singletons

Include transmitted singletons in training.

Default: False

--sibling-singletons

Include sibling singletons in training.

Default: False

--filter-centromere-telomere

Train RF without centromeres and telomeres.

Default: False

--interval-qc-filter

Whether interval QC should be applied for RF training.

Default: False
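The argument groups documented above could be reproduced with argparse roughly as follows. This is a sketch reconstructed from the usage text, not the body of get_script_argument_parser(): the function name `build_parser` is hypothetical, help strings are omitted, and the argument types are inferred from the documented defaults.

```python
import argparse

# Default feature list copied from the --features documentation above.
DEFAULT_FEATURES = [
    "AS_MQRankSum", "AS_pab_max", "AS_QD", "AS_ReadPosRankSum", "AS_SOR",
    "allele_type", "has_star", "n_alt_alleles", "variant_type",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the script's real argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--slack_channel")
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--model-id")

    actions = parser.add_argument_group("Actions")
    for flag in ("--list-rf-runs", "--train-rf", "--apply-rf"):
        actions.add_argument(flag, action="store_true")

    rf = parser.add_argument_group("Random Forest parameters")
    rf.add_argument(
        "--compute-info-method",
        default="AS",
        choices=["AS", "quasi", "set_long_AS_missing"],
    )
    rf.add_argument("--features", nargs="+", default=DEFAULT_FEATURES)
    rf.add_argument("--fp-to-tp", type=float, default=1.0)
    rf.add_argument("--test-intervals", nargs="+", default="chr20")
    rf.add_argument("--num-trees", type=int, default=500)
    rf.add_argument("--max-depth", type=int, default=5)

    training = parser.add_argument_group("Training data parameters")
    for flag in (
        "--adj",
        "--transmitted-singletons",
        "--sibling-singletons",
        "--filter-centromere-telomere",
        "--interval-qc-filter",
    ):
        training.add_argument(flag, action="store_true")
    return parser


# Example: train with 100 trees and hold out two contigs.
args = build_parser().parse_args(
    ["--train-rf", "--num-trees", "100", "--test-intervals", "chr20", "chr21"]
)
```

Flags that are not passed keep the defaults listed above, so `args.max_depth` is 5 and `args.apply_rf` is False in this example.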

Module Functions

gnomad_qc.v4.variant_qc.random_forest.train_rf(ht)

Train random forest model using train_rf_model.

gnomad_qc.v4.variant_qc.random_forest.add_model_to_run_list(ht, ...)

Add RF model run to RF run list.

gnomad_qc.v4.variant_qc.random_forest.get_variant_qc_resources(...)

Get PipelineResourceCollection for all resources needed in the variant QC pipeline.

gnomad_qc.v4.variant_qc.random_forest.main(args)

Run random forest variant QC pipeline.

gnomad_qc.v4.variant_qc.random_forest.get_script_argument_parser()

Get script argument parser.


gnomad_qc.v4.variant_qc.random_forest.train_rf(ht, test=False, features=None, fp_to_tp=1.0, num_trees=500, max_depth=5, transmitted_singletons=False, sibling_singletons=False, adj=False, filter_centromere_telomere=False, test_intervals='chr20', interval_qc_pass_ht=None)

Train random forest model using train_rf_model.

Parameters:
  • ht (Table) – Table containing annotations needed for RF training.

  • test (bool) – Whether to filter the input Table to chr20 and test_intervals for test purposes. Default is False.

  • features (List[str]) – List of features to use in the random forest model. When no features list is provided, the global FEATURES is used.

  • fp_to_tp (float) – Ratio of FPs to TPs for creating the RF model. If set to 0, all training examples are used. Default is 1.0.

  • num_trees (int) – Number of trees in the RF model. Default is 500.

  • max_depth (int) – Maximum tree depth in the RF model. Default is 5.

  • transmitted_singletons (bool) – Whether to use transmitted singletons for training. Default is False.

  • sibling_singletons (bool) – Whether to use sibling singletons for training. Default is False.

  • adj (bool) – Whether to use adj genotypes for transmitted/sibling singletons instead of raw. Default is False and raw is used.

  • filter_centromere_telomere (bool) – Filter centromeres and telomeres before training. Default is False.

  • test_intervals (Union[str, List[str]]) – Specified interval(s) will be held out for testing and evaluation only. Default is “chr20”.

  • interval_qc_pass_ht (Optional[Table]) – Optional interval QC pass Table that contains an ‘interval_qc_pass’ annotation indicating whether the interval passes high-quality criteria. This annotation is used to filter the Table before training the RF model. Default is None.

Return type:

Tuple[Table, Any]

Returns:

Input ht annotated with training information and the RF model.

gnomad_qc.v4.variant_qc.random_forest.add_model_to_run_list(ht, model_id, rf_runs, rf_run_path)

Add RF model run to RF run list.

Parameters:
  • ht (Table) – Table containing RF model run information as globals.

  • model_id (str) – ID of RF model run.

  • rf_runs (Dict[str, Any]) – Dictionary containing current RF run information.

  • rf_run_path (str) – Path to RF run list.

Return type:

None

Returns:

None
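The bookkeeping described for add_model_to_run_list can be sketched in plain Python. This is an illustration only: the helper name `add_run` is hypothetical, a plain dictionary stands in for the globals read from the Hail Table, and storing the run list as JSON on the local filesystem is an assumption.

```python
import json
import os
import tempfile
from typing import Any, Dict


def add_run(
    run_info: Dict[str, Any],
    model_id: str,
    rf_runs: Dict[str, Any],
    rf_run_path: str,
) -> None:
    """Record one run under its model ID and rewrite the run list file.

    Hypothetical stand-in: run_info plays the role of the Table globals.
    """
    rf_runs[model_id] = run_info
    with open(rf_run_path, "w") as f:
        json.dump(rf_runs, f, indent=2)


# Example with made-up values; "rf_example" is not a real model ID.
rf_runs: Dict[str, Any] = {}
rf_run_path = os.path.join(tempfile.mkdtemp(), "rf_runs.json")
add_run({"num_trees": 500, "max_depth": 5}, "rf_example", rf_runs, rf_run_path)

with open(rf_run_path) as f:
    saved_runs = json.load(f)
```

Keying the list by model ID means a later run of --apply-rf can look up the parameters of a specific model from the same file.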

gnomad_qc.v4.variant_qc.random_forest.get_variant_qc_resources(test, overwrite, model_id=None)

Get PipelineResourceCollection for all resources needed in the variant QC pipeline.

Parameters:
  • test (bool) – Whether to gather all resources for testing.

  • overwrite (bool) – Whether to overwrite resources if they exist.

  • model_id (str) – Model ID to use for RF model. If not provided, a new model ID will be generated.

Return type:

PipelineResourceCollection

Returns:

PipelineResourceCollection containing resources for all steps of the variant QC pipeline.

gnomad_qc.v4.variant_qc.random_forest.main(args)

Run random forest variant QC pipeline.