gnomad.variant_qc.random_forest
run_rf_test | Run a dummy test RF on a given MT.
check_ht_fields_for_spark | Check specified fields of a hail table for Spark DataFrame conversion (type and name).
get_columns_quantiles | Compute approximate quantiles of specified numeric fields from non-missing values.
median_impute_features | Numerical features in the Table are median-imputed by Hail's approx_median.
ht_to_rf_df | Create a Spark dataframe ready for RF from a HT.
get_features_importance | Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.
get_labels | Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.
test_model | A wrapper to test a model on a set of examples with known labels.
apply_rf_model | Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.
save_model | Save a Random Forest pipeline model.
load_model | Load a Random Forest pipeline model.
train_rf | Train a Random Forest (RF) pipeline model.
get_rf_runs | Load RF run data from JSON file.
get_run_data | Create a Dict containing information about the RF input arguments and feature importance.
pretty_print_runs | Print the information for the RF runs loaded from the json file storing the RF run hashes -> info.
- gnomad.variant_qc.random_forest.run_rf_test(mt, output='/tmp')[source]
Run a dummy test RF on a given MT.
1. Creates row annotations and labels to run the model on.
2. Trains an RF pipeline model (including median imputation of missing values in the created annotations).
3. Saves the RF pipeline model.
4. Applies the model to the MT and prints the feature importance.
- Parameters:
mt (MatrixTable) – Input MT
output (str) – Output files prefix to save the RF model
- Return type:
Tuple[PipelineModel, Table]
- Returns:
RF model and Table after applying the RF model
- gnomad.variant_qc.random_forest.check_ht_fields_for_spark(ht, fields)[source]
Check specified fields of a hail table for Spark DataFrame conversion (type and name).
- Parameters:
ht (Table) – Input Table
fields (List[str]) – Fields to test
- Return type:
None
- Returns:
None
- gnomad.variant_qc.random_forest.get_columns_quantiles(ht, fields, quantiles, relative_error=0.001)[source]
Compute approximate quantiles of specified numeric fields from non-missing values. Non-numeric fields are ignored.
This function returns a Dict of column name -> list of quantiles in the same order specified. If a column only has NAs, None is returned.
- Parameters:
ht (Table) – Input HT
fields (List[str]) – List of features to impute. If none given, all numerical features with missing data are imputed
quantiles (List[float]) – List of quantiles to return (e.g. [0.5] would return the median)
relative_error (int) – The relative error on the quantile approximation
- Return type:
Dict[str, List[float]]
- Returns:
Dict of column -> quantiles
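The return contract described above (numeric fields only, missing values skipped, None for an all-missing column) can be illustrated without Hail or Spark. The following is a minimal plain-Python sketch, not the gnomad implementation: it computes exact nearest-rank quantiles rather than approximate ones, and the helper name `column_quantiles` and the row-dict input format are assumptions made for illustration.

```python
import math
from numbers import Real
from typing import Dict, List, Optional


def column_quantiles(
    rows: List[dict], fields: List[str], quantiles: List[float]
) -> Dict[str, Optional[List[float]]]:
    """Exact nearest-rank quantiles per field (hypothetical helper, not gnomad code)."""
    result: Dict[str, Optional[List[float]]] = {}
    for field in fields:
        # Keep only non-missing values for this field.
        present = [row.get(field) for row in rows if row.get(field) is not None]
        # Non-numeric fields are ignored: they get no entry in the result.
        if any(isinstance(v, bool) or not isinstance(v, Real) for v in present):
            continue
        # A column with only missing values maps to None.
        if not present:
            result[field] = None
            continue
        ordered = sorted(present)
        n = len(ordered)
        # Nearest-rank quantile: the smallest value with at least q*n values <= it.
        result[field] = [ordered[max(0, math.ceil(q * n) - 1)] for q in quantiles]
    return result
```

Quantiles are returned in the order requested, so `[0.5]` yields a one-element list holding the median.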
- gnomad.variant_qc.random_forest.median_impute_features(ht, strata=None)[source]
Numerical features in the Table are median-imputed by Hail’s approx_median.
If a strata dict is given, imputation is done based on the median of each stratification.
- The annotations that are added to the Table are
feature_imputed - A row annotation indicating if each numerical feature was imputed or not.
features_median - A global annotation containing the median of the numerical features. If strata is given, this struct will also be broken down by the given strata.
variants_by_strata - An additional global annotation with the variant counts by strata that will only be added if imputing by a given strata.
- Parameters:
ht (Table) – Table containing all samples and features for median imputation.
strata (Optional[Dict[str, Expression]]) – Optional dict of expressions by which to stratify the median imputation (default None).
- Return type:
Table
- Returns:
Feature Table imputed using approximate median values.
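To make the three annotations concrete, here is a minimal plain-Python sketch of stratified median imputation. It is not the gnomad implementation: it uses exact medians from the standard library instead of Hail's approx_median, takes strata as a list of field names rather than a dict of Hail expressions, and returns the two "global" values alongside the rows. The helper name `median_impute` is hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional, Tuple


def median_impute(
    rows: List[dict], features: List[str], strata: Optional[List[str]] = None
) -> Tuple[List[dict], Dict[tuple, Dict[str, float]], Dict[tuple, int]]:
    """Impute missing (None) features with the median of their stratum."""
    strata = strata or []

    def stratum(row: dict) -> tuple:
        # With no strata every row falls into the single stratum ().
        return tuple(row[s] for s in strata)

    observed = defaultdict(lambda: defaultdict(list))
    variants_by_strata: Dict[tuple, int] = defaultdict(int)
    for row in rows:
        variants_by_strata[stratum(row)] += 1
        for f in features:
            if row[f] is not None:
                observed[stratum(row)][f].append(row[f])

    # Analogue of the features_median global annotation.
    features_median = {
        key: {f: median(values) for f, values in by_feature.items()}
        for key, by_feature in observed.items()
    }

    imputed = []
    for row in rows:
        new_row = dict(row)
        # Analogue of the feature_imputed row annotation.
        new_row["feature_imputed"] = {f: row[f] is None for f in features}
        for f in features:
            if row[f] is None:
                # Stays None if the stratum has no observed value for f.
                new_row[f] = features_median.get(stratum(row), {}).get(f)
        imputed.append(new_row)
    return imputed, features_median, dict(variants_by_strata)
```

Stratifying matters when feature distributions differ between groups (for example by variant type), since a single global median would pull imputed values toward the larger group.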
- gnomad.variant_qc.random_forest.ht_to_rf_df(ht, features, label=None, index=None)[source]
Create a Spark dataframe ready for RF from a HT.
Rows with any missing features are dropped. Missing labels are replaced with ‘NA’.
Note: Only basic types are supported!
- Parameters:
ht (Table) – Input HT
features (List[str]) – Features that will be used for RF
label (str) – Optional label column that will be predicted by RF
index (str) – Optional index column to keep (e.g. for joining results back at a later stage)
- Return type:
DataFrame
- Returns:
Spark DataFrame
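The row-level rules stated above (drop rows with any missing feature, replace missing labels with 'NA', optionally keep an index column) can be sketched on plain Python dicts. This is only an illustration of the filtering logic under those stated rules, not the conversion to a Spark DataFrame itself; the helper name `prepare_rf_rows` is hypothetical.

```python
from typing import List, Optional


def prepare_rf_rows(
    rows: List[dict],
    features: List[str],
    label: Optional[str] = None,
    index: Optional[str] = None,
) -> List[dict]:
    """Keep only complete rows and the columns needed for RF."""
    prepared = []
    for row in rows:
        # Rows with any missing feature are dropped.
        if any(row.get(f) is None for f in features):
            continue
        out = {f: row[f] for f in features}
        if label is not None:
            # Missing labels are replaced with the string 'NA'.
            value = row.get(label)
            out[label] = value if value is not None else "NA"
        if index is not None:
            # The index column is carried through for later joins.
            out[index] = row.get(index)
        prepared.append(out)
    return prepared
```

Because incomplete rows are dropped rather than filled, features are usually imputed (for example with median_impute_features) before this step if those rows should receive predictions.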
- gnomad.variant_qc.random_forest.get_features_importance(rf_pipeline, rf_index=-2, assembler_index=-3)[source]
Extract the features importance from a Pipeline model containing a RandomForestClassifier stage.
- Parameters:
rf_pipeline (PipelineModel) – Input pipeline
rf_index (int) – Index of the RandomForestClassifier stage
assembler_index (int) – Index of the VectorAssembler stage
- Return type:
Dict[str, float]
- Returns:
Feature importance for each feature in the RF model
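The returned Dict[str, float] pairs each feature name with its importance. As a rough illustration of that shape, the sketch below pairs a list of feature names with a list of importance values and ranks the result. The two inputs stand in for what the VectorAssembler and RandomForestClassifier stages would provide; the helper names and all values used with them are hypothetical, and this is not the gnomad implementation.

```python
from typing import Dict, List, Tuple


def pair_importance(feature_names: List[str], importances: List[float]) -> Dict[str, float]:
    """Map each feature name to its importance, position by position."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    return dict(zip(feature_names, importances))


def rank_features(features_importance: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort features from most to least important."""
    return sorted(features_importance.items(), key=lambda kv: kv[1], reverse=True)
```

Ranking the dictionary this way is a convenient first check that the model is not dominated by a single unexpected feature.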
- gnomad.variant_qc.random_forest.get_labels(rf_pipeline)[source]
Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.
- Parameters:
rf_pipeline (PipelineModel) – Input pipeline
- Return type:
List[str]
- Returns:
Labels
- gnomad.variant_qc.random_forest.test_model(ht, rf_model, features, label, prediction_col_name='rf_prediction')[source]
A wrapper to test a model on a set of examples with known labels.
Runs the model on the data
Prints confusion matrix and accuracy
Returns confusion matrix as a list of struct
- Parameters:
ht (Table) – Input table
rf_model (PipelineModel) – RF model
features (List[str]) – Columns containing features that were used in the model
label (str) – Column containing label to be predicted
prediction_col_name (str) – Where to store the prediction
- Return type:
List[tstruct]
- Returns:
A list containing structs with {label, prediction, n}
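The confusion matrix returned here is a flat list of {label, prediction, n} entries rather than a 2D array. The following plain-Python sketch shows how such a list can be built from labelled predictions and how accuracy follows from it. It is an illustration only: the literal keys "label", "prediction" and "n", the helper names, and the dict-based rows are assumptions, and the real function operates on a Hail Table.

```python
from collections import Counter
from typing import Dict, List


def confusion_counts(
    rows: List[dict], label: str, prediction_col_name: str = "rf_prediction"
) -> List[Dict[str, object]]:
    """Count rows for every observed (label, prediction) pair."""
    counts = Counter((row[label], row[prediction_col_name]) for row in rows)
    return [
        {"label": lab, "prediction": pred, "n": n}
        for (lab, pred), n in sorted(counts.items())
    ]


def accuracy(confusion: List[Dict[str, object]]) -> float:
    """Fraction of rows whose prediction equals their label."""
    total = sum(entry["n"] for entry in confusion)
    correct = sum(e["n"] for e in confusion if e["label"] == e["prediction"])
    return correct / total if total else float("nan")
```

Only pairs that actually occur appear in the list, so a missing (label, prediction) combination means a count of zero.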
- gnomad.variant_qc.random_forest.apply_rf_model(ht, rf_model, features, label=None, probability_col_name='rf_probability', prediction_col_name='rf_prediction')[source]
Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.
- Parameters:
ht (Table) – Input HT
rf_model (PipelineModel) – Random Forest pipeline model
features (List[str]) – List of feature columns in the pipeline. !Should match the model list of features!
label (str) – Optional column containing labels. !Should match the model labels!
probability_col_name (str) – Name of the column that will store the RF probabilities
prediction_col_name (str) – Name of the column that will store the RF predictions
- Return type:
Table
- Returns:
Table with RF columns
- gnomad.variant_qc.random_forest.save_model(rf_pipeline, out_path, overwrite=False)[source]
Save a Random Forest pipeline model.
- Parameters:
rf_pipeline (PipelineModel) – Pipeline to save
out_path (str) – Output path
overwrite (bool) – If set, will overwrite existing file(s) at output location
- Return type:
None
- Returns:
Nothing
- gnomad.variant_qc.random_forest.load_model(input_path)[source]
Load a Random Forest pipeline model.
- Parameters:
input_path (str) – Location of model to load
- Return type:
PipelineModel
- Returns:
Random Forest pipeline model
- gnomad.variant_qc.random_forest.train_rf(ht, features, label, num_trees=500, max_depth=5)[source]
Train a Random Forest (RF) pipeline model.
- Parameters:
ht (Table) – Input HT
features (List[str]) – List of columns to be used as features
label (str) – Column containing the label to predict
num_trees (int) – Number of trees to use
max_depth (int) – Maximum tree depth
- Return type:
PipelineModel
- Returns:
Random Forest pipeline model
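The functions documented on this page can be chained into a train, save, reload and apply workflow. The sketch below uses only the signatures listed in this reference; the wrapper name `train_save_and_apply` and its `model_path` argument are hypothetical, and it assumes `ht` is a Hail Table that already holds complete feature columns and a label column. It has not been run against a real dataset here, so treat it as an outline rather than a tested recipe.

```python
from typing import List


def train_save_and_apply(ht, features: List[str], label: str, model_path: str):
    """Train an RF model on ht, persist it, reload it, and annotate ht with predictions."""
    # Imported lazily so this sketch can be loaded without Hail or Spark installed.
    from gnomad.variant_qc.random_forest import (
        apply_rf_model,
        get_features_importance,
        load_model,
        save_model,
        train_rf,
    )

    # Defaults from the documented signature are spelled out for clarity.
    rf_model = train_rf(ht, features=features, label=label, num_trees=500, max_depth=5)

    # Persist the pipeline, then reload it to confirm the saved copy is usable.
    save_model(rf_model, model_path, overwrite=True)
    rf_model = load_model(model_path)

    # Dict of feature name -> importance.
    importance = get_features_importance(rf_model)

    # The feature list (and label, if given) should match what the model was trained on.
    result_ht = apply_rf_model(ht, rf_model, features=features, label=label)
    return result_ht, importance
```

With the default column names, the returned Table carries the annotations `rf_probability` and `rf_prediction`.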
- gnomad.variant_qc.random_forest.get_rf_runs(rf_json_fp)[source]
Load RF run data from JSON file.
- Parameters:
rf_json_fp (str) – File path to rf json file.
- Return type:
Dict
- Returns:
Dictionary containing the content of the JSON file, or an empty dictionary if the file wasn’t found.
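The "empty dictionary if the file wasn't found" behaviour means callers can always add a new run to the result without a separate existence check. A minimal sketch of that pattern for a local file is shown below; the helper name `load_rf_runs` is hypothetical, and unlike the real function it makes no attempt to read from cloud storage paths.

```python
import json
import os
from typing import Dict


def load_rf_runs(rf_json_fp: str) -> Dict:
    """Load RF run data from a local JSON file, or return {} if it does not exist."""
    if not os.path.exists(rf_json_fp):
        # First run: there is nothing recorded yet.
        return {}
    with open(rf_json_fp) as f:
        return json.load(f)
```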
- gnomad.variant_qc.random_forest.get_run_data(input_args, test_intervals, features_importance, test_results)[source]
Create a Dict containing information about the RF input arguments and feature importance.
- Parameters:
input_args (Dict[str, bool]) – Dictionary of model input arguments
test_intervals (List[str]) – Intervals withheld from training to be used in testing
features_importance (Dict[str, float]) – Feature importance returned by the RF
test_results (List[tstruct]) – Accuracy results from applying RF model to the test intervals
- Return type:
Dict
- Returns:
Dict of RF information
- gnomad.variant_qc.random_forest.pretty_print_runs(runs, label_col='rf_label', prediction_col_name='rf_prediction')[source]
Print the information for the RF runs loaded from the json file storing the RF run hashes -> info.
- Parameters:
runs (Dict) – Dictionary containing JSON input loaded from RF run file
label_col (str) – Name of the RF label column
prediction_col_name (str) – Name of the RF prediction column
- Return type:
None
- Returns:
Nothing – only prints information