gnomad.variant_qc.random_forest

gnomad.variant_qc.random_forest.run_rf_test(mt)

Run a dummy test RF on a given MT.

gnomad.variant_qc.random_forest.check_ht_fields_for_spark(ht, ...)

Check specified fields of a hail table for Spark DataFrame conversion (type and name).

gnomad.variant_qc.random_forest.get_columns_quantiles(ht, ...)

Compute approximate quantiles of specified numeric fields from non-missing values.

gnomad.variant_qc.random_forest.median_impute_features(ht)

Numerical features in the Table are median-imputed by Hail's approx_median.

gnomad.variant_qc.random_forest.ht_to_rf_df(ht, ...)

Create a Spark dataframe ready for RF from a HT.

gnomad.variant_qc.random_forest.get_features_importance(...)

Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.

gnomad.variant_qc.random_forest.get_labels(...)

Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.

gnomad.variant_qc.random_forest.test_model(ht, ...)

A wrapper to test a model on a set of examples with known labels.

gnomad.variant_qc.random_forest.apply_rf_model(ht, ...)

Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.

gnomad.variant_qc.random_forest.save_model(...)

Save a Random Forest pipeline model.

gnomad.variant_qc.random_forest.load_model(...)

Load a Random Forest pipeline model.

gnomad.variant_qc.random_forest.train_rf(ht, ...)

Train a Random Forest (RF) pipeline model.

gnomad.variant_qc.random_forest.get_rf_runs(...)

Load RF run data from JSON file.

gnomad.variant_qc.random_forest.get_run_data(...)

Create a Dict containing information about the RF input arguments and feature importance.

gnomad.variant_qc.random_forest.pretty_print_runs(runs)

Print the information for the RF runs loaded from the json file storing the RF run hashes -> info.

gnomad.variant_qc.random_forest.run_rf_test(mt, output='/tmp')

Run a dummy test RF on a given MT.

  1. Creates row annotations and labels to run model on

  2. Trains a RF pipeline model (including median imputation of missing values in created annotations)

  3. Saves the RF pipeline model

  4. Applies the model to the MT and prints features importance

Parameters:
  • mt (MatrixTable) – Input MT

  • output (str) – Output files prefix to save the RF model

Return type:

Tuple[PipelineModel, Table]

Returns:

RF model and Table after applying RF model
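The steps above can be sketched with the module's public functions. This is a minimal sketch and not the body of run_rf_test: the helper name rf_workflow_sketch is hypothetical, the input is assumed to be a Table that already carries the feature and label columns, and median imputation is assumed to be a separate call made before training. Imports are deferred so the function can be defined without Hail or Spark installed.

```python
def rf_workflow_sketch(ht, features, label, out_path):
    """Hypothetical helper: impute, train, save, apply, then report importances."""
    # Deferred import: Hail/Spark are only needed when the function is called.
    from gnomad.variant_qc.random_forest import (
        apply_rf_model,
        get_features_importance,
        median_impute_features,
        save_model,
        train_rf,
    )

    # Assumption: missing numeric features are median-imputed before training.
    ht = median_impute_features(ht)

    # Train with the documented defaults (num_trees=500, max_depth=5).
    rf_model = train_rf(ht, features, label)

    # Persist the pipeline model, replacing any previous model at out_path.
    save_model(rf_model, out_path, overwrite=True)

    # Annotate the Table with RF probabilities and predictions.
    ht = apply_rf_model(ht, rf_model, features, label=label)

    # Dict of feature name -> importance.
    importance = get_features_importance(rf_model)
    return rf_model, ht, importance
```

The feature list passed to apply_rf_model is the same one used for training, since the documentation for that function requires the two to match.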

gnomad.variant_qc.random_forest.check_ht_fields_for_spark(ht, fields)

Check specified fields of a hail table for Spark DataFrame conversion (type and name).

Parameters:
  • ht (Table) – input Table

  • fields (List[str]) – Fields to test

Return type:

None

Returns:

None

gnomad.variant_qc.random_forest.get_columns_quantiles(ht, fields, quantiles, relative_error=0.001)

Compute approximate quantiles of specified numeric fields from non-missing values. Non-numeric fields are ignored.

This function returns a Dict of column name -> list of quantiles in the same order specified. If a column only has NAs, None is returned.

Parameters:
  • ht (Table) – input HT

  • fields (List[str]) – list of numeric fields for which to compute quantiles

  • quantiles (List[float]) – list of quantiles to return (e.g. [0.5] would return the median)

  • relative_error (int) – The relative error on the quantile approximation

Return type:

Dict[str, List[float]]

Returns:

Dict of column -> quantiles
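The shape of the returned Dict can be illustrated in plain Python. This is a sketch under stated assumptions: columns_quantiles_sketch is a hypothetical name, rows are plain dicts with None for missing values, and an exact nearest-rank quantile stands in for Hail's approximate computation (so relative_error has no counterpart here). Skipping of non-numeric fields is not modelled.

```python
import math
from typing import Dict, List, Optional


def columns_quantiles_sketch(
    rows: List[dict], fields: List[str], quantiles: List[float]
) -> Dict[str, Optional[List[float]]]:
    """Return field -> quantiles (in the requested order), or None if all values are missing."""
    result: Dict[str, Optional[List[float]]] = {}
    for field in fields:
        # Only non-missing values contribute to the quantiles.
        values = sorted(row[field] for row in rows if row.get(field) is not None)
        if not values:
            result[field] = None
            continue
        n = len(values)
        # Nearest-rank quantile: smallest value with at least q * n values at or below it.
        result[field] = [values[max(0, math.ceil(q * n) - 1)] for q in quantiles]
    return result


rows = [
    {"qd": 1.0, "fs": None},
    {"qd": 3.0, "fs": None},
    {"qd": 2.0, "fs": None},
    {"qd": None, "fs": None},
]
quantile_table = columns_quantiles_sketch(rows, ["qd", "fs"], [0.5, 1.0])
```

Here quantile_table maps "qd" to its median and maximum, in that order, and maps "fs" to None because every value is missing.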

gnomad.variant_qc.random_forest.median_impute_features(ht, strata=None)

Numerical features in the Table are median-imputed by Hail’s approx_median.

If a strata dict is given, imputation is based on the median of each stratum.

The annotations that are added to the Table are
  • feature_imputed - A row annotation indicating if each numerical feature was imputed or not.

  • features_median - A global annotation containing the median of the numerical features. If strata is given, this struct will also be broken down by the given strata.

  • variants_by_strata - An additional global annotation with the variant counts by strata that will only be added if imputing by a given strata.

Parameters:
  • ht (Table) – Table containing all samples and features for median imputation.

  • strata (Optional[Dict[str, Expression]]) – Optional dict of stratification expressions. If given, feature medians are imputed per stratum (default None).

Return type:

Table

Returns:

Feature Table imputed using approximate median values.
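Stratified median imputation can be illustrated on plain Python rows. This is a minimal sketch, not the library's implementation: median_impute_sketch is a hypothetical name, a single feature is handled, strata are read from an existing key rather than a Hail expression, and the exact median from the standard library replaces approx_median.

```python
from collections import defaultdict
from statistics import median


def median_impute_sketch(rows, feature, stratum_key=None):
    """Fill missing values of `feature` with the median of the row's stratum."""

    def stratum_of(row):
        # Without a stratum key, every row falls into one shared group.
        return row.get(stratum_key) if stratum_key else None

    # Collect non-missing values per stratum.
    groups = defaultdict(list)
    for row in rows:
        if row[feature] is not None:
            groups[stratum_of(row)].append(row[feature])
    medians = {key: median(values) for key, values in groups.items()}

    imputed = []
    for row in rows:
        missing = row[feature] is None
        new_row = dict(row)
        if missing:
            # Stays None if the whole stratum had no observed values.
            new_row[feature] = medians.get(stratum_of(row))
        # Mirrors the idea of the feature_imputed row annotation.
        new_row["feature_imputed"] = {feature: missing}
        imputed.append(new_row)
    return imputed, medians


rows = [
    {"type": "snv", "qd": 1.0},
    {"type": "snv", "qd": 3.0},
    {"type": "snv", "qd": None},
    {"type": "indel", "qd": 10.0},
    {"type": "indel", "qd": None},
]
by_type, type_medians = median_impute_sketch(rows, "qd", stratum_key="type")
overall, overall_medians = median_impute_sketch(rows, "qd")
```

With the stratum key, the missing SNV value is filled from the SNV median and the missing indel value from the indel median. Without it, both are filled from the single overall median.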

gnomad.variant_qc.random_forest.ht_to_rf_df(ht, features, label=None, index=None)

Create a Spark dataframe ready for RF from a HT.

Rows with any missing features are dropped. Missing labels are replaced with ‘NA’.

Note

Only basic types are supported!

Parameters:
  • ht (Table) – Input HT

  • features (List[str]) – Features that will be used for RF

  • label (str) – Optional label column that will be predicted by RF

  • index (str) – Optional index column to keep (e.g. for joining results back at a later stage)

Return type:

DataFrame

Returns:

Spark Dataframe
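The row-handling rules described for this function (drop rows with any missing feature, replace missing labels with 'NA') can be sketched on plain Python dicts. The name rows_for_rf_sketch is hypothetical, and the sketch covers only the filtering logic; conversion to a Spark DataFrame and handling of the index column are left out.

```python
def rows_for_rf_sketch(rows, features, label=None):
    """Keep rows with complete features; replace a missing label with 'NA'."""
    prepared = []
    for row in rows:
        # A single missing feature is enough to drop the row.
        if any(row.get(feature) is None for feature in features):
            continue
        out = {feature: row[feature] for feature in features}
        if label is not None:
            value = row.get(label)
            out[label] = "NA" if value is None else value
        prepared.append(out)
    return prepared


rows = [
    {"qd": 12.0, "fs": 0.5, "rf_label": "TP"},
    {"qd": None, "fs": 1.0, "rf_label": "FP"},  # dropped: missing feature
    {"qd": 7.0, "fs": 2.0, "rf_label": None},   # kept: label becomes 'NA'
]
prepared = rows_for_rf_sketch(rows, ["qd", "fs"], label="rf_label")
```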

gnomad.variant_qc.random_forest.get_features_importance(rf_pipeline, rf_index=-2, assembler_index=-3)

Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.

Parameters:
  • rf_pipeline (PipelineModel) – Input pipeline

  • rf_index (int) – index of the RandomForestClassifier stage

  • assembler_index (int) – index of the VectorAssembler stage

Return type:

Dict[str, float]

Returns:

feature importance for each feature in the RF model
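The default indices suggest a pipeline whose VectorAssembler stage comes immediately before the RandomForestClassifier stage, with one further stage after the classifier. Under that assumption, the extraction might look like the sketch below. It relies on two PySpark members, getInputCols() on the assembler and featureImportances on the fitted classifier, and is exercised here with stand-in objects rather than a real PipelineModel; features_importance_sketch is a hypothetical name.

```python
from types import SimpleNamespace


def features_importance_sketch(rf_pipeline, rf_index=-2, assembler_index=-3):
    """Pair each assembler input column with the classifier's importance score."""
    feature_names = rf_pipeline.stages[assembler_index].getInputCols()
    importances = rf_pipeline.stages[rf_index].featureImportances
    # Importances are indexed in the same order as the assembler's input columns.
    return {name: float(importances[i]) for i, name in enumerate(feature_names)}


# Stand-ins for pipeline stages (assumption: only these attributes are needed).
assembler = SimpleNamespace(getInputCols=lambda: ["qd", "fs"])
classifier = SimpleNamespace(featureImportances=[0.75, 0.25])
pipeline = SimpleNamespace(stages=["indexer", assembler, classifier, "last_stage"])
importance = features_importance_sketch(pipeline)
```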

gnomad.variant_qc.random_forest.get_labels(rf_pipeline)

Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.

Parameters:

rf_pipeline (PipelineModel) – Input pipeline

Return type:

List[str]

Returns:

labels

gnomad.variant_qc.random_forest.test_model(ht, rf_model, features, label, prediction_col_name='rf_prediction')

A wrapper to test a model on a set of examples with known labels.

  1. Runs the model on the data

  2. Prints confusion matrix and accuracy

  3. Returns the confusion matrix as a list of structs

Parameters:
  • ht (Table) – Input table

  • rf_model (PipelineModel) – RF Model

  • features (List[str]) – Columns containing features that were used in the model

  • label (str) – Column containing label to be predicted

  • prediction_col_name (str) – Where to store the prediction

Return type:

List[tstruct]

Returns:

A list containing structs with {label, prediction, n}
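The {label, prediction, n} structure and the accuracy figure can be illustrated in plain Python. This is a sketch of the bookkeeping only, assuming predictions are already present on each row; confusion_counts_sketch is a hypothetical name and the output ordering is a choice made for this example.

```python
from collections import Counter


def confusion_counts_sketch(rows, label="rf_label", prediction="rf_prediction"):
    """Count (label, prediction) pairs and compute overall accuracy."""
    counts = Counter((row[label], row[prediction]) for row in rows)
    matrix = [
        {"label": lab, "prediction": pred, "n": n}
        for (lab, pred), n in sorted(counts.items())
    ]
    # Accuracy is the share of rows whose prediction equals the label.
    correct = sum(n for (lab, pred), n in counts.items() if lab == pred)
    accuracy = correct / len(rows) if rows else float("nan")
    return matrix, accuracy


rows = [
    {"rf_label": "TP", "rf_prediction": "TP"},
    {"rf_label": "TP", "rf_prediction": "FP"},
    {"rf_label": "FP", "rf_prediction": "FP"},
    {"rf_label": "FP", "rf_prediction": "FP"},
]
matrix, accuracy = confusion_counts_sketch(rows)
```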

gnomad.variant_qc.random_forest.apply_rf_model(ht, rf_model, features, label=None, probability_col_name='rf_probability', prediction_col_name='rf_prediction')

Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.

Parameters:
  • ht (Table) – Input HT

  • rf_model (PipelineModel) – Random Forest pipeline model

  • features (List[str]) – List of feature columns in the pipeline. Must match the model's list of features.

  • label (str) – Optional column containing labels. Must match the model's labels.

  • probability_col_name (str) – Name of the column that will store the RF probabilities

  • prediction_col_name (str) – Name of the column that will store the RF predictions

Return type:

Table

Returns:

Table with RF columns

gnomad.variant_qc.random_forest.save_model(rf_pipeline, out_path, overwrite=False)

Save a Random Forest pipeline model.

Parameters:
  • rf_pipeline (PipelineModel) – Pipeline to save

  • out_path (str) – Output path

  • overwrite (bool) – If set, will overwrite existing file(s) at output location

Return type:

None

Returns:

Nothing

gnomad.variant_qc.random_forest.load_model(input_path)

Load a Random Forest pipeline model.

Parameters:

input_path (str) – Location of model to load

Return type:

PipelineModel

Returns:

Random Forest pipeline model

gnomad.variant_qc.random_forest.train_rf(ht, features, label, num_trees=500, max_depth=5)

Train a Random Forest (RF) pipeline model.

Parameters:
  • ht (Table) – Input HT

  • features (List[str]) – List of columns to be used as features

  • label (str) – Column containing the label to predict

  • num_trees (int) – Number of trees to use

  • max_depth (int) – Maximum tree depth

Return type:

PipelineModel

Returns:

Random Forest pipeline model

gnomad.variant_qc.random_forest.get_rf_runs(rf_json_fp)

Load RF run data from JSON file.

Parameters:

rf_json_fp (str) – File path to the RF JSON file.

Return type:

Dict

Returns:

Dictionary containing the content of the JSON file, or an empty dictionary if the file wasn’t found.
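The load-or-empty behaviour can be sketched for a local file. This is an illustration under stated assumptions: load_runs_sketch is a hypothetical name, only the local filesystem is handled, and the run contents written in the example are made-up placeholders rather than the library's schema.

```python
import json
import os
import tempfile


def load_runs_sketch(path):
    """Return the parsed JSON content, or an empty dict if the file is absent."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


# Round trip through a temporary file (placeholder content for illustration).
with tempfile.TemporaryDirectory() as tmp_dir:
    runs_path = os.path.join(tmp_dir, "rf_runs.json")
    missing_runs = load_runs_sketch(runs_path)
    with open(runs_path, "w") as handle:
        json.dump({"run_hash_1": {"note": "placeholder"}}, handle)
    loaded_runs = load_runs_sketch(runs_path)
```

Returning an empty dict for a missing file lets a caller add the first run to the result without special-casing the initial state.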

gnomad.variant_qc.random_forest.get_run_data(input_args, test_intervals, features_importance, test_results)

Create a Dict containing information about the RF input arguments and feature importance.

Parameters:
  • input_args (Dict[str, bool]) – Dictionary of model input arguments

  • test_intervals (List[str]) – Intervals withheld from training to be used in testing

  • features_importance (Dict[str, float]) – Feature importance returned by the RF

  • test_results (List[tstruct]) – Accuracy results from applying RF model to the test intervals

Return type:

Dict

Returns:

Dict of RF information

gnomad.variant_qc.random_forest.pretty_print_runs(runs, label_col='rf_label', prediction_col_name='rf_prediction')

Print the information for the RF runs loaded from the json file storing the RF run hashes -> info.

Parameters:
  • runs (Dict) – Dictionary containing JSON input loaded from RF run file

  • label_col (str) – Name of the RF label column

  • prediction_col_name (str) – Name of the RF prediction column

Return type:

None

Returns:

Nothing – only prints information