gnomad.sample_qc.ancestry
| pc_project | Project samples in mt on pre-computed PCs. |
| apply_onnx_classification_model | Apply an ONNX classification model fit to a pandas dataframe data_pd. |
| apply_sklearn_classification_model | Apply an sklearn classification model fit to a pandas dataframe data_pd. |
| convert_sklearn_rf_to_onnx | Convert a sklearn random forest model to ONNX. |
| assign_genetic_ancestry_pcs | Use a random forest model to assign genetic ancestry labels based on the results of PCA. |
| | Run PCA excluding the given related or additional samples, and project those samples in the PC space to return scores for all samples. |
- gnomad.sample_qc.ancestry.pc_project(mt, loadings_ht, loading_location='loadings', af_location='pca_af')
- Project samples in mt on pre-computed PCs.
- Parameters:
- mt (MatrixTable) – MT containing the samples to project.
- loadings_ht (Table) – HT containing the PCA loadings and allele frequencies used for the PCA.
- loading_location (str) – Location of the expression for loadings in loadings_ht.
- af_location (str) – Location of the expression for allele frequency in loadings_ht.
 
- Return type:
- Table
- Returns:
- Table with scores calculated from loadings, stored in the column scores.
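
The arithmetic behind such a projection can be sketched in plain Python. This is a minimal illustration, not the library's implementation. It assumes the loadings come from an HWE-normalized PCA (such as Hail's hwe_normalized_pca), so each alt-allele count is centered by 2 * af and scaled by sqrt(n_variants * 2 * af * (1 - af)) before being multiplied by the loadings. The function name project_sample and the toy values are hypothetical; the real function works on Hail MatrixTable and Table objects rather than Python lists.

```python
import math


def project_sample(genotypes, loadings, afs):
    """Project one sample onto pre-computed PCs (illustrative sketch).

    genotypes: alt-allele counts (0, 1 or 2) per variant, None if missing
    loadings:  per-variant list of PC loadings
    afs:       per-variant allele frequency used for the PCA
    """
    # Assumption: only variants with 0 < af < 1 are informative.
    usable = [i for i, af in enumerate(afs) if 0.0 < af < 1.0]
    n_variants = len(usable)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for i in usable:
        gt = genotypes[i]
        if gt is None:
            continue  # assumption: missing genotypes contribute nothing
        af = afs[i]
        # HWE normalization: center by the expected count, scale by its spread.
        gt_norm = (gt - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += loadings[i][k] * gt_norm
    return scores


# Toy data: three variants, two PCs.
toy_scores = project_sample(
    genotypes=[0, 1, 2],
    loadings=[[0.5, 0.1], [0.2, -0.3], [0.4, 0.2]],
    afs=[0.5, 0.5, 0.5],
)
print(toy_scores)
```

Because the allele frequencies stored with the loadings are reused for normalization, projected samples land in the same PC space as the samples used to fit the PCA.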
 
- gnomad.sample_qc.ancestry.apply_onnx_classification_model(data_pd, fit)
- Apply an ONNX classification model fit to a pandas dataframe data_pd.
- Parameters:
- data_pd (DataFrame) – Pandas dataframe containing the data to be classified.
- fit (ModelProto) – ONNX model to be applied.
 
- Return type:
- Tuple[ndarray, DataFrame]
- Returns:
- Tuple of classification and probabilities.
 
- gnomad.sample_qc.ancestry.apply_sklearn_classification_model(data_pd, fit)
- Apply an sklearn classification model fit to a pandas dataframe data_pd.
- Parameters:
- data_pd (DataFrame) – Pandas dataframe containing the data to be classified.
- fit (Any) – Sklearn model to be applied.
 
- Return type:
- Tuple[ndarray, DataFrame]
- Returns:
- Tuple of classification and probabilities.
 
- gnomad.sample_qc.ancestry.convert_sklearn_rf_to_onnx(fit, target_opset=None)
- Convert a sklearn random forest model to ONNX.
- Parameters:
- fit (Any) – Sklearn random forest model to be converted.
- target_opset (Optional[int]) – An optional target ONNX opset version to convert the model to.
 
- Return type:
- ModelProto
- Returns:
- ONNX model.
 
- gnomad.sample_qc.ancestry.assign_genetic_ancestry_pcs(gen_anc_pca_scores, pc_cols, known_col='known_label', fit=None, seed=42, prop_train=0.8, n_estimators=100, min_prob=0.9, output_col='gen_anc', missing_label='oth', pc_expr='scores', convert_model_func=None, apply_model_func=<function apply_sklearn_classification_model>, n_partitions=None)
- Use a random forest model to assign genetic ancestry labels based on the results of PCA.
- Default values for model and assignment parameters are those used in gnomAD.
- As input, this function can take either:
- A Hail Table (typically the output of hwe_normalized_pca). In this case, pc_cols should be one of:
  - A list of integers where each element is one of the PCs to use.
  - A list of strings where each element is one of the PCs to use.
  - An ArrayExpression of Floats where each element is one of the PCs to use.
  A Hail Table is returned as output.
- A Pandas DataFrame. In this case:
  - Each PC should be in a separate column, and pc_cols is the list of all the columns containing the PCs to use.
  - A Pandas DataFrame is returned as output.
 
- Note: If you have a Pandas DataFrame with all PCs stored as an array in a single column, expand_pd_array_col can be used to expand this column into multiple PC columns.
- Parameters:
- gen_anc_pca_scores (Union[Table, DataFrame]) – Input Hail Table or Pandas DataFrame.
- pc_cols (Union[ArrayExpression, List[int], List[str]]) – List of which PCs to use, or of the columns storing the PCs to use. Values should be 1-based: a list of integers when passing a Hail Table (e.g. [1, 2, 4, 5]) or a list of strings when passing a Pandas DataFrame (e.g. ["PC1", "PC2", "PC4", "PC5"]). When passing a Hail Table, this can also be an ArrayExpression containing all the PCs to use.
- known_col (str) – Column storing the known genetic ancestry labels.
- fit (Any) – Fit from a previously trained random forest model (i.e., the output from a previous RandomForestClassifier() call).
- seed (int) – Random seed.
- prop_train (float) – Proportion of known data used for training.
- n_estimators (int) – Number of trees to use in the RF model.
- min_prob (float) – Minimum probability of belonging to a given genetic ancestry group for the genetic ancestry group to be set (otherwise set to None).
- output_col (str) – Output column storing the assigned genetic ancestry.
- missing_label (str) – Label for samples for which the assignment probability is smaller than min_prob.
- pc_expr (Union[ArrayExpression, str]) – Column storing the list of PCs. Only used if pc_cols is a list of integers. Default is scores.
- convert_model_func (Optional[Callable[[Any], Any]]) – Optional function to convert the model to ONNX format. Default is no conversion.
- apply_model_func (Callable[[DataFrame, Any], Any]) – Function to apply the model to the data. Default is apply_sklearn_classification_model, which applies a sklearn classification model to the data. This default works if no fit is set, or if the supplied fit is a sklearn classification model.
- n_partitions (Optional[int]) – Optional number of partitions to repartition the genetic ancestry group inference table to.
 
- Return type:
- Tuple[Union[Table, DataFrame], Any]
- Returns:
- Hail Table or Pandas DataFrame (depending on input) containing sample IDs and imputed genetic ancestry labels, and the trained random forest model.
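
Two conventions from the parameter descriptions can be illustrated with a short, self-contained sketch: the 1-based PC indices in pc_cols, and the min_prob cutoff that falls back to missing_label. This is not the library's code. The helper names select_pcs and assign_label are hypothetical, and the class probabilities are made-up values standing in for the output of a random forest classifier.

```python
def select_pcs(scores, pc_cols):
    """Pick PCs from a scores array using 1-based indices, e.g. [1, 2, 4, 5]."""
    return [scores[i - 1] for i in pc_cols]


def assign_label(probs, min_prob=0.9, missing_label="oth"):
    """Return the most probable label, or missing_label if it falls below min_prob.

    probs: mapping of genetic ancestry label -> predicted probability
    """
    best_label, best_prob = max(probs.items(), key=lambda kv: kv[1])
    # A sample keeps its label only when the top probability reaches min_prob.
    return best_label if best_prob >= min_prob else missing_label


# Hypothetical PCA scores for one sample (PC1..PC5).
sample_scores = [0.12, -0.03, 0.40, 0.07, -0.21]
features = select_pcs(sample_scores, [1, 2, 4, 5])  # skips PC3

# Hypothetical class probabilities for two samples.
confident = assign_label({"afr": 0.95, "nfe": 0.05})
ambiguous = assign_label({"afr": 0.60, "nfe": 0.40})
print(features, confident, ambiguous)
```

Raising min_prob trades coverage for confidence: fewer samples receive a specific label, and more fall back to missing_label.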
 
- Run PCA excluding the given related or additional samples, and project those samples in the PC space to return scores for all samples.
- The related_samples_to_drop and additional_samples_to_drop Tables have to be keyed by the sample ID, and all samples present in these tables will be excluded from the PCA.
- The loadings Table returned also contains a pca_af annotation, which is the allele frequency used for PCA. This is useful to project other samples in the PC space.
- Parameters:
- qc_mt (MatrixTable) – Input QC MT.
- related_samples_to_drop (Optional[Table]) – Optional table of related samples to drop when generating the PCs. These samples will be projected in the PC space.
- additional_samples_to_drop (Optional[Table]) – Optional table of additional samples to drop when generating the PCs. These samples will be projected in the PC space.
- n_pcs (int) – Number of PCs to compute.
- autosomes_only (bool) – Whether to run the analysis on autosomes only.
 
- Return type:
- Returns:
- eigenvalues, scores and loadings
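
The sample bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper split_samples is hypothetical, and plain lists of sample IDs stand in for the keyed Hail Tables. Samples listed in either drop table are held out of the PCA and projected afterwards; all remaining samples are used to fit the PCs.

```python
def split_samples(sample_ids, related_to_drop=None, additional_to_drop=None):
    """Partition samples into those used to fit the PCA and those projected later."""
    # Both drop lists are optional; a sample in either one is excluded from the PCA.
    excluded = set(related_to_drop or ()) | set(additional_to_drop or ())
    pca_samples = [s for s in sample_ids if s not in excluded]
    projected_samples = [s for s in sample_ids if s in excluded]
    return pca_samples, projected_samples


pca_samples, projected_samples = split_samples(
    ["s1", "s2", "s3", "s4"],
    related_to_drop=["s2"],
    additional_to_drop=["s4"],
)
print(pca_samples, projected_samples)
```

In the workflow the text describes, PCA would be run on the first group, the second group would be projected using the returned loadings and pca_af, and the two sets of scores would be combined so that every sample has scores.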