gnomad.variant_qc.random_forest

`gnomad.variant_qc.random_forest.run_rf_test`(mt)	Run a dummy test RF on a given MT.
`gnomad.variant_qc.random_forest.check_ht_fields_for_spark`(ht, ...)	Check specified fields of a hail table for Spark DataFrame conversion (type and name).
`gnomad.variant_qc.random_forest.get_columns_quantiles`(ht, ...)	Compute approximate quantiles of specified numeric fields from non-missing values.
`gnomad.variant_qc.random_forest.median_impute_features`(ht)	Numerical features in the Table are median-imputed by Hail's approx_median.
`gnomad.variant_qc.random_forest.ht_to_rf_df`(ht, ...)	Create a Spark dataframe ready for RF from a HT.
`gnomad.variant_qc.random_forest.get_features_importance`(...)	Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.
`gnomad.variant_qc.random_forest.get_labels`(...)	Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.
`gnomad.variant_qc.random_forest.test_model`(ht, ...)	A wrapper to test a model on a set of examples with known labels.
`gnomad.variant_qc.random_forest.apply_rf_model`(ht, ...)	Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.
`gnomad.variant_qc.random_forest.save_model`(...)	Save a Random Forest pipeline model.
`gnomad.variant_qc.random_forest.load_model`(...)	Load a Random Forest pipeline model.
`gnomad.variant_qc.random_forest.train_rf`(ht, ...)	Train a Random Forest (RF) pipeline model.
`gnomad.variant_qc.random_forest.get_rf_runs`(...)	Load RF run data from JSON file.
`gnomad.variant_qc.random_forest.get_run_data`(...)	Create a Dict containing information about the RF input arguments and feature importance.
`gnomad.variant_qc.random_forest.pretty_print_runs`(runs)	Print the information for the RF runs loaded from the JSON file storing the RF run hashes -> info.

gnomad.variant_qc.random_forest.run_rf_test(mt, output='/tmp')[source]

Run a dummy test RF on a given MT.

1. Creates row annotations and labels to run the model on.
2. Trains an RF pipeline model (including median imputation of missing values in the created annotations).
3. Saves the RF pipeline model.
4. Applies the model to the MT and prints the features importance.

Parameters:

mt (MatrixTable) – Input MT
output (str) – Output files prefix to save the RF model

Return type:

Tuple[PipelineModel, Table]

Returns:

RF model and Table after applying the RF model

gnomad.variant_qc.random_forest.check_ht_fields_for_spark(ht, fields)[source]

Check specified fields of a hail table for Spark DataFrame conversion (type and name).

Parameters:

ht (Table) – input Table
fields (List[str]) – Fields to test

Return type:

None

Returns:

None

gnomad.variant_qc.random_forest.get_columns_quantiles(ht, fields, quantiles, relative_error=0.001)[source]

Compute approximate quantiles of specified numeric fields from non-missing values. Non-numeric fields are ignored.

This function returns a Dict of column name -> list of quantiles in the same order specified. If a column only has NAs, None is returned.

Parameters:

ht (Table) – input HT
fields (List[str]) – list of fields for which to compute quantiles
quantiles (List[float]) – list of quantiles to return (e.g. [0.5] would return the median)
relative_error (float) – The relative error on the quantile approximation

Return type:

Dict[str, List[float]]

Returns:

Dict of column -> quantiles
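
To make the return shape concrete, here is a minimal pure-Python sketch. It computes exact nearest-rank quantiles rather than the approximation controlled by `relative_error`, and it assumes every listed field is numeric. The helper name `columns_quantiles_sketch`, the list-of-dicts input, and the field names `qd` and `fs` are illustrative assumptions, not part of gnomad.

```python
import math
from typing import Dict, List, Optional


def columns_quantiles_sketch(
    rows: List[dict], fields: List[str], quantiles: List[float]
) -> Dict[str, Optional[List[float]]]:
    """Exact nearest-rank quantiles per field, using non-missing values only."""
    result: Dict[str, Optional[List[float]]] = {}
    for field in fields:
        # Missing values (None) are excluded before ranking.
        values = sorted(r[field] for r in rows if r[field] is not None)
        if not values:
            # A column with only missing values maps to None.
            result[field] = None
            continue
        result[field] = [
            values[max(0, math.ceil(q * len(values)) - 1)] for q in quantiles
        ]
    return result


# Hypothetical rows: "fs" is missing everywhere, "qd" is missing once.
rows = [
    {"qd": 1.0, "fs": None},
    {"qd": 3.0, "fs": None},
    {"qd": None, "fs": None},
    {"qd": 2.0, "fs": None},
]
out = columns_quantiles_sketch(rows, ["qd", "fs"], [0.5, 1.0])
print(out)  # {'qd': [2.0, 3.0], 'fs': None}
```

The quantiles for each column come back in the same order as the `quantiles` argument, so `out["qd"][0]` is the median here.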

gnomad.variant_qc.random_forest.median_impute_features(ht, strata=None)[source]

Numerical features in the Table are median-imputed by Hail’s approx_median.

If a strata dict is given, imputation is done based on the median of each stratification.

The annotations that are added to the Table are

feature_imputed - A row annotation indicating if each numerical feature was imputed or not.
features_median - A global annotation containing the median of the numerical features. If strata is given, this struct will also be broken down by the given strata.
variants_by_strata - An additional global annotation with the variant counts by strata that will only be added if imputing by a given strata.

Parameters:

ht (Table) – Table containing all samples and features for median imputation.
strata (Optional[Dict[str, Expression]]) – Optional dict of strata expressions; if given, feature medians are computed and imputed within each stratum (default None).

Return type:

Table

Returns:

Feature Table imputed using approximate median values.
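
The stratified imputation described above can be sketched in plain Python. This is an illustration of the idea only: it uses exact medians from the standard library instead of Hail's approx_median, stratifies on a single row key rather than a dict of expressions, and assumes every stratum has at least one non-missing value per feature. The function name `median_impute_sketch` and the fields `qd` and `is_snv` are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Hashable, List, Optional, Tuple


def median_impute_sketch(
    rows: List[dict], features: List[str], stratum_key: Optional[str] = None
) -> Tuple[List[dict], Dict[Hashable, Dict[str, float]], Dict[Hashable, int]]:
    """Fill missing feature values with the exact median of their stratum."""
    # Group rows by stratum; without a stratum key, all rows form one group.
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[stratum_key] if stratum_key else "all"].append(row)

    # Analogue of features_median: per-stratum median of non-missing values.
    medians = {
        stratum: {
            f: median(r[f] for r in members if r[f] is not None)
            for f in features
        }
        for stratum, members in groups.items()
    }
    # Analogue of variants_by_strata: number of rows per stratum.
    counts = {stratum: len(members) for stratum, members in groups.items()}

    imputed_rows = []
    for row in rows:
        stratum = row[stratum_key] if stratum_key else "all"
        new_row = dict(row)
        # Analogue of feature_imputed: which features were filled in.
        new_row["feature_imputed"] = {f: row[f] is None for f in features}
        for f in features:
            if row[f] is None:
                new_row[f] = medians[stratum][f]
        imputed_rows.append(new_row)
    return imputed_rows, medians, counts


rows = [
    {"qd": 2.0, "is_snv": True},
    {"qd": None, "is_snv": True},
    {"qd": 4.0, "is_snv": True},
    {"qd": 10.0, "is_snv": False},
    {"qd": None, "is_snv": False},
]
imputed, medians, counts = median_impute_sketch(rows, ["qd"], stratum_key="is_snv")
print(imputed[1]["qd"], imputed[4]["qd"])  # 3.0 10.0
```

Each missing `qd` value receives the median of its own stratum, so the two imputed rows get different values.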

gnomad.variant_qc.random_forest.ht_to_rf_df(ht, features, label=None, index=None)[source]

Create a Spark dataframe ready for RF from a HT.

Rows with any missing features are dropped. Missing labels are replaced with ‘NA’.

Note

Only basic types are supported!

Parameters:

ht (Table) – Input HT
features (List[str]) – Features that will be used for RF
label (str) – Optional label column that will be predicted by RF
index (str) – Optional index column to keep (e.g. for joining results back at a later stage)

Return type:

DataFrame

Returns:

Spark Dataframe
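
The row-handling rules stated above (drop rows with any missing feature, replace a missing label with 'NA') can be illustrated without Spark. The sketch below works on plain dicts; the name `rf_rows_sketch` and the columns `qd`, `fs`, and `rf_label` are assumptions made for the example and say nothing about how the conversion is implemented in gnomad.

```python
from typing import List, Optional


def rf_rows_sketch(
    rows: List[dict], features: List[str], label: Optional[str] = None
) -> List[dict]:
    """Keep rows with complete features; replace a missing label with 'NA'."""
    kept = []
    for row in rows:
        if any(row.get(f) is None for f in features):
            continue  # a single missing feature drops the whole row
        out = {f: row[f] for f in features}
        if label is not None:
            out[label] = row[label] if row.get(label) is not None else "NA"
        kept.append(out)
    return kept


rows = [
    {"qd": 1.5, "fs": 0.2, "rf_label": "TP"},
    {"qd": None, "fs": 0.1, "rf_label": "FP"},  # dropped: missing feature
    {"qd": 3.0, "fs": 0.4, "rf_label": None},   # kept: label becomes 'NA'
]
kept = rf_rows_sketch(rows, ["qd", "fs"], label="rf_label")
print(kept)
```

Because rows with missing features are dropped, imputing features first (for example with `median_impute_features`) keeps those rows available to the model.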

gnomad.variant_qc.random_forest.get_features_importance(rf_pipeline, rf_index=-2, assembler_index=-3)[source]

Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.

Parameters:

rf_pipeline (PipelineModel) – Input pipeline
rf_index (int) – index of the RandomForestClassifier stage
assembler_index (int) – index of the VectorAssembler stage

Return type:

Dict[str, float]

Returns:

feature importance for each feature in the RF model
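
The default indices suggest a pipeline in which the RandomForestClassifier is the second-to-last stage and the VectorAssembler the third-to-last. The sketch below mimics that lookup with stand-in objects instead of real Spark stages. The stage layout, the attribute names `input_cols` and `importances`, and the feature names are all assumptions for illustration.

```python
from types import SimpleNamespace
from typing import Dict, List

# Stand-ins for pipeline stages (not real Spark objects).
stages = [
    SimpleNamespace(kind="StringIndexer"),
    SimpleNamespace(kind="VectorAssembler", input_cols=["qd", "fs", "mq"]),
    SimpleNamespace(kind="RandomForestClassifier", importances=[0.6, 0.1, 0.3]),
    SimpleNamespace(kind="final stage (hypothetical)"),
]


def features_importance_sketch(
    stages: List[SimpleNamespace], rf_index: int = -2, assembler_index: int = -3
) -> Dict[str, float]:
    """Pair each assembled feature name with its importance weight."""
    names = stages[assembler_index].input_cols
    weights = stages[rf_index].importances
    return dict(zip(names, weights))


importance = features_importance_sketch(stages)
# Print features from most to least important.
for name, weight in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{weight:.2f}")
```

If a pipeline is built with a different stage order, `rf_index` and `assembler_index` need to point at the matching stages.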

gnomad.variant_qc.random_forest.get_labels(rf_pipeline)[source]

Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.

Parameters:

rf_pipeline (PipelineModel) – Input pipeline

Return type:

List[str]

Returns:

labels

gnomad.variant_qc.random_forest.test_model(ht, rf_model, features, label, prediction_col_name='rf_prediction')[source]

A wrapper to test a model on a set of examples with known labels.

1. Runs the model on the data.
2. Prints the confusion matrix and accuracy.
3. Returns the confusion matrix as a list of structs.

Parameters:

ht (Table) – Input table
rf_model (PipelineModel) – RF Model
features (List[str]) – Columns containing features that were used in the model
label (str) – Column containing label to be predicted
prediction_col_name (str) – Where to store the prediction

Return type:

List[tstruct]

Returns:

A list containing structs with {label, prediction, n}
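
As an illustration of the returned structure, the sketch below builds a confusion matrix and accuracy from (label, prediction) pairs in plain Python. It uses dicts with the keys `label`, `prediction`, and `n` in place of Hail structs. The function name `confusion_sketch` and the `TP`/`FP` labels are assumptions for the example.

```python
from collections import Counter
from typing import List, Tuple


def confusion_sketch(pairs: List[Tuple[str, str]]) -> Tuple[List[dict], float]:
    """Count each (label, prediction) combination and compute accuracy."""
    counts = Counter(pairs)
    matrix = [
        {"label": label, "prediction": prediction, "n": n}
        for (label, prediction), n in sorted(counts.items())
    ]
    # Accuracy is the share of examples whose prediction equals the label.
    correct = sum(n for (label, prediction), n in counts.items() if label == prediction)
    return matrix, correct / len(pairs)


pairs = [("TP", "TP"), ("TP", "TP"), ("TP", "FP"), ("FP", "FP"), ("FP", "TP")]
matrix, accuracy = confusion_sketch(pairs)
for entry in matrix:
    print(entry)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.60
```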

gnomad.variant_qc.random_forest.apply_rf_model(ht, rf_model, features, label=None, probability_col_name='rf_probability', prediction_col_name='rf_prediction')[source]

Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.

Parameters:

ht (Table) – Input HT
rf_model (PipelineModel) – Random Forest pipeline model
features (List[str]) – List of feature columns in the pipeline. Must match the list of features used by the model.
label (str) – Optional column containing labels. Must match the labels used by the model.
probability_col_name (str) – Name of the column that will store the RF probabilities
prediction_col_name (str) – Name of the column that will store the RF predictions

Return type:

Table

Returns:

Table with RF columns

gnomad.variant_qc.random_forest.save_model(rf_pipeline, out_path, overwrite=False)[source]

Save a Random Forest pipeline model.

Parameters:

rf_pipeline (PipelineModel) – Pipeline to save
out_path (str) – Output path
overwrite (bool) – If set, will overwrite existing file(s) at output location

Return type:

None

Returns:

Nothing

gnomad.variant_qc.random_forest.load_model(input_path)[source]

Load a Random Forest pipeline model.

Parameters:

input_path (str) – Location of model to load

Return type:

PipelineModel

Returns:

Random Forest pipeline model

gnomad.variant_qc.random_forest.train_rf(ht, features, label, num_trees=500, max_depth=5)[source]

Train a Random Forest (RF) pipeline model.

Parameters:

ht (Table) – Input HT
features (List[str]) – List of columns to be used as features
label (str) – Column containing the label to predict
num_trees (int) – Number of trees to use
max_depth (int) – Maximum tree depth

Return type:

PipelineModel

Returns:

Random Forest pipeline model
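
A typical sequence chains `train_rf`, `save_model`, `load_model`, and `apply_rf_model`. The sketch below uses only the signatures documented on this page; the wrapper name `train_save_apply`, its arguments, and the split into a training table and a full table are assumptions for illustration. The gnomad imports are placed inside the function so that the sketch can be defined without Hail or Spark installed; calling it requires both, plus an initialized Hail session.

```python
def train_save_apply(train_ht, full_ht, features, label, model_path):
    """Train on `train_ht`, persist and reload the model, then annotate `full_ht`.

    Assumes `train_ht` holds only the rows intended for training and that
    missing feature values have already been handled in both tables.
    """
    # Imported lazily so this sketch can be defined without Hail/Spark.
    from gnomad.variant_qc.random_forest import (
        apply_rf_model,
        get_features_importance,
        load_model,
        save_model,
        train_rf,
    )

    # Train with the documented defaults made explicit.
    rf_model = train_rf(train_ht, features=features, label=label, num_trees=500, max_depth=5)

    # Persist the pipeline model, then reload it from the same location.
    save_model(rf_model, out_path=model_path, overwrite=True)
    rf_model = load_model(model_path)

    # Inspect which features carry the most weight.
    print(get_features_importance(rf_model))

    # Annotate every row with rf_probability and rf_prediction (default names).
    return apply_rf_model(full_ht, rf_model, features=features, label=label)
```

The same `features` list is passed to both `train_rf` and `apply_rf_model`, since the applied features are expected to match those of the model.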

gnomad.variant_qc.random_forest.get_rf_runs(rf_json_fp)[source]

Load RF run data from JSON file.

Parameters:

rf_json_fp (str) – File path to the RF JSON file.

Return type:

Dict

Returns:

Dictionary containing the content of the JSON file, or an empty dictionary if the file wasn’t found.
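
The "empty dictionary if the file wasn't found" behavior can be sketched with the standard library alone. This version only handles local paths; how gnomad itself reads the file (for example from cloud storage) is not shown here. The function name `load_runs_sketch`, the run hash `a1b2c3`, and the argument name inside it are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict


def load_runs_sketch(rf_json_fp: str) -> Dict:
    """Return the parsed JSON content, or {} when the file does not exist."""
    if not os.path.exists(rf_json_fp):
        return {}
    with open(rf_json_fp) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_runs.json")
    missing = load_runs_sketch(path)  # file not written yet

    # Write a hypothetical run record keyed by a run hash, then load it.
    with open(path, "w") as f:
        json.dump({"a1b2c3": {"input_args": {"some_flag": True}}}, f)
    loaded = load_runs_sketch(path)

print(missing)  # {}
print(loaded)   # {'a1b2c3': {'input_args': {'some_flag': True}}}
```

Returning an empty dictionary lets a first run add its entry without special-casing the missing file.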

gnomad.variant_qc.random_forest.get_run_data(input_args, test_intervals, features_importance, test_results)[source]

Create a Dict containing information about the RF input arguments and feature importance.

Parameters:

input_args (Dict[str, bool]) – Dictionary of model input arguments
test_intervals (List[str]) – Intervals withheld from training to be used in testing
features_importance (Dict[str, float]) – Feature importance returned by the RF
test_results (List[tstruct]) – Accuracy results from applying RF model to the test intervals

Return type:

Dict

Returns:

Dict of RF information

gnomad.variant_qc.random_forest.pretty_print_runs(runs, label_col='rf_label', prediction_col_name='rf_prediction')[source]

Print the information for the RF runs loaded from the JSON file storing the RF run hashes -> info.

Parameters:

runs (Dict) – Dictionary containing JSON input loaded from RF run file
label_col (str) – Name of the RF label column
prediction_col_name (str) – Name of the RF prediction column

Return type:

None

Returns:

Nothing – only prints information