gnomad.variant_qc.random_forest
| run_rf_test | Run a dummy test RF on a given MT. |
| check_ht_fields_for_spark | Check specified fields of a Hail Table for Spark DataFrame conversion (type and name). |
| get_columns_quantiles | Compute approximate quantiles of specified numeric fields from non-missing values. |
| median_impute_features | Numerical features in the Table are median-imputed by Hail's approx_median. |
| ht_to_rf_df | Create a Spark dataframe ready for RF from a HT. |
| get_features_importance | Extract the features importance from a Pipeline model containing a RandomForestClassifier stage. |
| get_labels | Return the labels from the StringIndexer stage at index 0 from an RF pipeline model. |
| test_model | A wrapper to test a model on a set of examples with known labels. |
| apply_rf_model | Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions. |
| save_model | Save a Random Forest pipeline model. |
| load_model | Load a Random Forest pipeline model. |
| train_rf | Train a Random Forest (RF) pipeline model. |
| get_rf_runs | Load RF run data from JSON file. |
| get_run_data | Create a Dict containing information about the RF input arguments and feature importance. |
| pretty_print_runs | Print the information for the RF runs loaded from the JSON file storing the RF run hashes -> info. |
- gnomad.variant_qc.random_forest.run_rf_test(mt, output='/tmp')
- Run a dummy test RF on a given MT.
- Creates row annotations and labels to run the model on
- Trains an RF pipeline model (including median imputation of missing values in the created annotations)
- Saves the RF pipeline model
- Applies the model to the MT and prints the feature importance

- Parameters:
- mt (MatrixTable) – Input MT
- output (str) – Output file prefix used to save the RF model

- Return type:
- Tuple[PipelineModel, Table]
- Returns:
- RF model and MatrixTable after applying the RF model
 
- gnomad.variant_qc.random_forest.check_ht_fields_for_spark(ht, fields)
- Check specified fields of a Hail Table for Spark DataFrame conversion (type and name).
- Parameters:
- ht (Table) – Input Table
- fields (List[str]) – Fields to test

- Return type:
- None
- Returns:
- None
 
- gnomad.variant_qc.random_forest.get_columns_quantiles(ht, fields, quantiles, relative_error=0.001)
- Compute approximate quantiles of specified numeric fields from non-missing values. Non-numeric fields are ignored.
- This function returns a Dict of column name -> list of quantiles, in the same order as specified. If a column only has NAs, None is returned.
- Parameters:
- ht (Table) – Input HT
- fields (List[str]) – List of fields for which to compute quantiles
- quantiles (List[float]) – List of quantiles to return (e.g. [0.5] would return the median)
- relative_error (int) – The relative error on the quantile approximation

- Return type:
- Dict[str, List[float]]
- Returns:
- Dict of column -> quantiles
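To make the return contract concrete, here is a minimal plain-Python sketch of the behaviour described above: quantiles are computed from non-missing values only, returned in the requested order, and a column with no values yields None. It uses exact nearest-rank quantiles on in-memory lists rather than Hail's approximate computation, so `relative_error` has no counterpart here. `columns_quantiles_sketch` and its input layout are hypothetical and not part of gnomad.

```python
import math
from typing import Dict, List, Optional


def columns_quantiles_sketch(
    columns: Dict[str, List[Optional[float]]],
    quantiles: List[float],
) -> Dict[str, Optional[List[float]]]:
    """Exact nearest-rank quantiles per column, ignoring missing (None) values."""
    result: Dict[str, Optional[List[float]]] = {}
    for name, values in columns.items():
        present = sorted(v for v in values if v is not None)
        if not present:
            # A column that only has missing values maps to None
            result[name] = None
            continue
        result[name] = [
            # Nearest rank: smallest value with at least q * n values at or below it
            present[max(math.ceil(q * len(present)), 1) - 1]
            for q in quantiles
        ]
    return result
```

Calling it with `quantiles=[0.5]` returns one median per column, mirroring the example given for the `quantiles` parameter.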
 
- gnomad.variant_qc.random_forest.median_impute_features(ht, strata=None)
- Numerical features in the Table are median-imputed by Hail's approx_median.
- If a strata dict is given, imputation is done based on the median of each stratification.
- The annotations that are added to the Table are:
- feature_imputed - A row annotation indicating if each numerical feature was imputed or not.
- features_median - A global annotation containing the median of the numerical features. If strata is given, this struct will also be broken down by the given strata.
- variants_by_strata - An additional global annotation with the variant counts by strata that will only be added if imputing by a given strata.

- Parameters:
- ht (Table) – Table containing all samples and features for median imputation.
- strata (Optional[Dict[str, Expression]]) – Optional dict of strata expressions; if given, feature medians are imputed per stratum (default None).

- Return type:
- Table
- Returns:
- Feature Table imputed using approximate median values.
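The stratified imputation described above can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: it works on a list of dict rows, uses an exact median from the standard library instead of Hail's approx_median, and returns the three pieces of information as ordinary Python objects. `median_impute_sketch` and `strata_key` are hypothetical names.

```python
from collections import defaultdict
from statistics import median
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple

Row = Dict[str, Any]


def median_impute_sketch(
    rows: List[Row],
    features: List[str],
    strata_key: Optional[Callable[[Row], Hashable]] = None,
) -> Tuple[List[Row], Dict[Tuple[Hashable, str], float], Dict[Hashable, int]]:
    """Fill missing features with the median of their stratum."""

    def stratum(row: Row) -> Hashable:
        # Without a strata function, every row falls in a single stratum
        return strata_key(row) if strata_key is not None else "all"

    observed = defaultdict(list)
    variants_by_strata = defaultdict(int)
    for row in rows:
        variants_by_strata[stratum(row)] += 1
        for feature in features:
            if row[feature] is not None:
                observed[(stratum(row), feature)].append(row[feature])

    # Median per (stratum, feature), computed from non-missing values only
    features_median = {key: median(values) for key, values in observed.items()}

    imputed = []
    for row in rows:
        new_row = dict(row)
        # Record which features were missing before imputation
        new_row["feature_imputed"] = {f: row[f] is None for f in features}
        for feature in features:
            if row[feature] is None:
                # Stays None if the stratum has no observed value for this feature
                new_row[feature] = features_median.get((stratum(row), feature))
        imputed.append(new_row)
    return imputed, features_median, dict(variants_by_strata)
```

Passing a `strata_key` such as `lambda row: row["snv"]` would impute SNVs and non-SNVs from separate medians, which is the kind of split the strata dict is meant to express.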
 
- gnomad.variant_qc.random_forest.ht_to_rf_df(ht, features, label=None, index=None)
- Create a Spark dataframe ready for RF from a HT.
- Rows with any missing features are dropped. Missing labels are replaced with 'NA'.
- Note: Only basic types are supported!
- Parameters:
- ht (Table) – Input HT
- features (List[str]) – Features that will be used for RF
- label (str) – Optional label column that will be predicted by RF
- index (str) – Optional index column to keep (e.g. for joining results back at a later stage)

- Return type:
- DataFrame
- Returns:
- Spark Dataframe
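The row-level rules stated above (drop rows with any missing feature, replace missing labels with 'NA', optionally keep an index column) can be illustrated without Spark. The following is a plain-Python sketch over dict rows under those stated rules; `prepare_rf_rows` is a hypothetical helper and does not reproduce the actual Table-to-DataFrame conversion.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def prepare_rf_rows(
    rows: List[Row],
    features: List[str],
    label: Optional[str] = None,
    index: Optional[str] = None,
) -> List[Row]:
    """Select RF columns, dropping rows with missing features."""
    prepared = []
    for row in rows:
        if any(row.get(f) is None for f in features):
            continue  # rows with any missing feature are dropped
        out = {f: row[f] for f in features}
        if label is not None:
            value = row.get(label)
            # Missing labels are kept, but marked with the string 'NA'
            out[label] = "NA" if value is None else value
        if index is not None:
            # The index column lets results be joined back later
            out[index] = row.get(index)
        prepared.append(out)
    return prepared
```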
 
- gnomad.variant_qc.random_forest.get_features_importance(rf_pipeline, rf_index=-2, assembler_index=-3)
- Extract the feature importance from a Pipeline model containing a RandomForestClassifier stage.
- Parameters:
- rf_pipeline (PipelineModel) – Input pipeline
- rf_index (int) – Index of the RandomForestClassifier stage
- assembler_index (int) – Index of the VectorAssembler stage

- Return type:
- Dict[str, float]
- Returns:
- Feature importance for each feature in the RF model
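The default indices count stages from the end of the pipeline. The sketch below shows how such negative indices could pair each feature name with its importance, assuming a four-stage layout (label indexer, VectorAssembler, trained random forest, label converter) that is consistent with the defaults but not confirmed by this page. `features_importance_sketch` is a hypothetical name, and the stand-in stages only mimic the two PySpark members used here (`getInputCols()` and `featureImportances`); a real PySpark importance vector may need converting, for example with `toArray()`, before it is zipped.

```python
from types import SimpleNamespace
from typing import Dict


def features_importance_sketch(
    pipeline, rf_index: int = -2, assembler_index: int = -3
) -> Dict[str, float]:
    """Pair the assembler's input columns with the RF stage's importances."""
    feature_names = pipeline.stages[assembler_index].getInputCols()
    importances = pipeline.stages[rf_index].featureImportances
    return {name: float(value) for name, value in zip(feature_names, importances)}


# Stand-in pipeline (not real PySpark objects) using the assumed stage order
demo_pipeline = SimpleNamespace(
    stages=[
        SimpleNamespace(),  # label indexer (index 0)
        SimpleNamespace(getInputCols=lambda: ["qd", "fs", "mq"]),  # assembler (index -3)
        SimpleNamespace(featureImportances=[0.6, 0.3, 0.1]),  # trained RF (index -2)
        SimpleNamespace(),  # label converter (index -1)
    ]
)
demo_importance = features_importance_sketch(demo_pipeline)
```

If a pipeline were built with a different number of stages, `rf_index` and `assembler_index` would need to be adjusted accordingly.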
 
- gnomad.variant_qc.random_forest.get_labels(rf_pipeline)
- Return the labels from the StringIndexer stage at index 0 from an RF pipeline model.
- Parameters:
- rf_pipeline (PipelineModel) – Input pipeline
- Return type:
- List[str]
- Returns:
- Labels
 
- gnomad.variant_qc.random_forest.test_model(ht, rf_model, features, label, prediction_col_name='rf_prediction')
- A wrapper to test a model on a set of examples with known labels.
- Runs the model on the data
- Prints the confusion matrix and accuracy
- Returns the confusion matrix as a list of structs

- Parameters:
- ht (Table) – Input table
- rf_model (PipelineModel) – RF Model
- features (List[str]) – Columns containing features that were used in the model
- label (str) – Column containing label to be predicted
- prediction_col_name (str) – Where to store the prediction

- Return type:
- List[tstruct]
- Returns:
- A list containing structs with {label, prediction, n}
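The shape of the returned confusion matrix, one {label, prediction, n} entry per observed combination, can be illustrated with a small plain-Python sketch. It takes already-computed labels and predictions as lists rather than running a model, and uses dicts in place of Hail structs; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple, Union

Entry = Dict[str, Union[str, int]]


def confusion_counts(labels: List[str], predictions: List[str]) -> Tuple[List[Entry], float]:
    """Count each (label, prediction) pair and compute overall accuracy."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    counts = Counter(zip(labels, predictions))
    matrix = [
        {"label": label, "prediction": prediction, "n": n}
        for (label, prediction), n in sorted(counts.items())
    ]
    # Accuracy is the fraction of examples whose prediction equals the label
    correct = sum(n for (label, prediction), n in counts.items() if label == prediction)
    accuracy = correct / len(labels) if labels else 0.0
    return matrix, accuracy
```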
 
- gnomad.variant_qc.random_forest.apply_rf_model(ht, rf_model, features, label=None, probability_col_name='rf_probability', prediction_col_name='rf_prediction')
- Apply a Random Forest (RF) pipeline model to a Table and annotate the RF probabilities and predictions.
- Parameters:
- ht (Table) – Input HT
- rf_model (PipelineModel) – Random Forest pipeline model
- features (List[str]) – List of feature columns in the pipeline. Should match the model's list of features!
- label (str) – Optional column containing labels. Should match the model's labels!
- probability_col_name (str) – Name of the column that will store the RF probabilities
- prediction_col_name (str) – Name of the column that will store the RF predictions

- Return type:
- Table
- Returns:
- Table with RF columns
 
- gnomad.variant_qc.random_forest.save_model(rf_pipeline, out_path, overwrite=False)
- Save a Random Forest pipeline model.
- Parameters:
- rf_pipeline (PipelineModel) – Pipeline to save
- out_path (str) – Output path
- overwrite (bool) – If set, will overwrite existing file(s) at output location

- Return type:
- None
- Returns:
- Nothing
 
- gnomad.variant_qc.random_forest.load_model(input_path)
- Load a Random Forest pipeline model.
- Parameters:
- input_path (str) – Location of model to load
- Return type:
- PipelineModel
- Returns:
- Random Forest pipeline model
 
- gnomad.variant_qc.random_forest.train_rf(ht, features, label, num_trees=500, max_depth=5)
- Train a Random Forest (RF) pipeline model.
- Parameters:
- ht (Table) – Input HT
- features (List[str]) – List of columns to be used as features
- label (str) – Column containing the label to predict
- num_trees (int) – Number of trees to use
- max_depth (int) – Maximum tree depth

- Return type:
- PipelineModel
- Returns:
- Random Forest pipeline model
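As a rough illustration of how the functions on this page might be chained, the sketch below imputes, trains, saves, inspects, and applies a model using only the signatures documented here. `train_and_apply` is a hypothetical helper, and it assumes that every row of `ht` carries a usable label and that `features` and `label` name existing basic-typed fields; in practice the training Table is often a labeled subset of the Table the model is applied to.

```python
from typing import List


def train_and_apply(ht, features: List[str], label: str, model_path: str):
    """Train an RF model on `ht`, save it, and annotate `ht` with its predictions."""
    # Imported inside the function so the sketch can be loaded without Hail or Spark
    from gnomad.variant_qc.random_forest import (
        apply_rf_model,
        get_features_importance,
        median_impute_features,
        save_model,
        train_rf,
    )

    # Fill missing numerical features so that rows are not dropped for missingness
    ht = median_impute_features(ht)

    # Train with the documented defaults, written out here for visibility
    rf_model = train_rf(ht, features=features, label=label, num_trees=500, max_depth=5)
    save_model(rf_model, out_path=model_path, overwrite=True)

    # Dict of feature name -> importance, useful for a quick sanity check
    importance = get_features_importance(rf_model)

    # Annotate the Table with the RF probability and prediction columns
    ht = apply_rf_model(ht, rf_model, features=features, label=label)
    return rf_model, importance, ht
```

A saved model can later be restored with `load_model(model_path)` and passed to `apply_rf_model` with the same feature list.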
 
- gnomad.variant_qc.random_forest.get_rf_runs(rf_json_fp)
- Load RF run data from JSON file.
- Parameters:
- rf_json_fp (str) – File path to the RF JSON file.
- Return type:
- Dict
- Returns:
- Dictionary containing the content of the JSON file, or an empty dictionary if the file wasn't found.
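The "empty dictionary if the file wasn't found" contract can be sketched as follows. This is a local-filesystem illustration only: it uses Python's built-in `open`, whereas the library may read through a different file layer (for example to support cloud paths), which is an assumption not confirmed here. `load_runs_sketch` is a hypothetical name.

```python
import json
from typing import Any, Dict


def load_runs_sketch(path: str) -> Dict[str, Any]:
    """Return the parsed JSON at `path`, or an empty dict if the file is missing."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        # A missing file is treated as "no runs recorded yet"
        return {}
```

Returning an empty dict rather than raising lets a caller add the first run to the result and write it back without special-casing a missing file.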
 
- gnomad.variant_qc.random_forest.get_run_data(input_args, test_intervals, features_importance, test_results)
- Create a Dict containing information about the RF input arguments and feature importance.
- Parameters:
- input_args (Dict[str, bool]) – Dictionary of model input arguments
- test_intervals (List[str]) – Intervals withheld from training to be used in testing
- features_importance (Dict[str, float]) – Feature importance returned by the RF
- test_results (List[tstruct]) – Accuracy results from applying RF model to the test intervals

- Return type:
- Dict
- Returns:
- Dict of RF information
 
- gnomad.variant_qc.random_forest.pretty_print_runs(runs, label_col='rf_label', prediction_col_name='rf_prediction')
- Print the information for the RF runs loaded from the JSON file storing the RF run hashes -> info.
- Parameters:
- runs (Dict) – Dictionary containing JSON input loaded from RF run file
- label_col (str) – Name of the RF label column
- prediction_col_name (str) – Name of the RF prediction column

- Return type:
- None
- Returns:
- Nothing; only prints information