gnomad.sample_qc.ancestry

gnomad.sample_qc.ancestry.pc_project(mt, ...)

Project samples in mt on pre-computed PCs.

gnomad.sample_qc.ancestry.apply_onnx_classification_model(...)

Apply an ONNX classification model fit to a pandas dataframe data_pd.

gnomad.sample_qc.ancestry.apply_sklearn_classification_model(...)

Apply an sklearn classification model fit to a pandas dataframe data_pd.

gnomad.sample_qc.ancestry.convert_sklearn_rf_to_onnx(fit)

Convert a sklearn random forest model to ONNX.

gnomad.sample_qc.ancestry.assign_population_pcs(...)

Use a random forest model to assign population labels based on the results of PCA.

gnomad.sample_qc.ancestry.run_pca_with_relateds(qc_mt)

Run PCA excluding the given related or additional samples, and project those samples in the PC space to return scores for all samples.

gnomad.sample_qc.ancestry.pc_project(mt, loadings_ht, loading_location='loadings', af_location='pca_af')

Project samples in mt on pre-computed PCs.

Parameters:
  • mt (MatrixTable) – MT containing the samples to project

  • loadings_ht (Table) – HT containing the PCA loadings and allele frequencies used for the PCA

  • loading_location (str) – Location of expression for loadings in loadings_ht

  • af_location (str) – Location of expression for allele frequency in loadings_ht

Return type:

Table

Returns:

Table with scores calculated from loadings in column scores
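To make the projection concrete, here is a minimal pure-Python sketch of how a single sample's scores could be computed from loadings and allele frequencies. The Hardy-Weinberg normalization shown is an assumption (it is not stated in this reference), and project_sample and its inputs are hypothetical names used only for illustration. This is not the library's Hail implementation.

```python
import math


def project_sample(alt_allele_counts, loadings, allele_freqs):
    """Project one sample onto pre-computed PCs (illustrative sketch).

    alt_allele_counts: per-variant alternate allele count (0, 1, 2, or None if missing).
    loadings: per-variant list of loadings, one value per PC.
    allele_freqs: per-variant allele frequency used when the PCA was run.
    """
    # Assumption: only variants with 0 < AF < 1 are usable, and their count
    # is the normalization constant.
    usable = [0.0 < af < 1.0 for af in allele_freqs]
    n_variants = sum(usable)
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs

    for gt, load, af, ok in zip(alt_allele_counts, loadings, allele_freqs, usable):
        if gt is None or not ok:
            continue  # skip missing genotypes and monomorphic sites
        # Assumed HWE-normalized genotype: centered at 2*AF, scaled by the
        # binomial standard deviation and the number of variants.
        gt_norm = (gt - 2 * af) / math.sqrt(n_variants * 2 * af * (1 - af))
        for k in range(n_pcs):
            scores[k] += load[k] * gt_norm
    return scores


# Toy example: two variants, two PCs.
scores = project_sample([2, 0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

With these toy inputs each variant contributes to exactly one PC, so the sign of each score reflects whether the sample carries more or fewer alternate alleles than expected from the allele frequency.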

gnomad.sample_qc.ancestry.apply_onnx_classification_model(data_pd, fit)

Apply an ONNX classification model fit to a pandas dataframe data_pd.

Parameters:
  • data_pd (DataFrame) – Pandas dataframe containing the data to be classified.

  • fit (ModelProto) – ONNX model to be applied.

Return type:

Tuple[ndarray, DataFrame]

Returns:

Tuple of classification and probabilities.

gnomad.sample_qc.ancestry.apply_sklearn_classification_model(data_pd, fit)

Apply an sklearn classification model fit to a pandas dataframe data_pd.

Parameters:
  • data_pd (DataFrame) – Pandas dataframe containing the data to be classified.

  • fit (Any) – Sklearn model to be applied.

Return type:

Tuple[ndarray, DataFrame]

Returns:

Tuple of classification and probabilities.

gnomad.sample_qc.ancestry.convert_sklearn_rf_to_onnx(fit, target_opset=None)

Convert a sklearn random forest model to ONNX.

Parameters:
  • fit (Any) – Sklearn random forest model to be converted.

  • target_opset (Optional[int]) – An optional target ONNX opset version to convert the model to.

Return type:

ModelProto

Returns:

ONNX model.
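As a hedged illustration of where this conversion fits, the sketch below passes convert_sklearn_rf_to_onnx and apply_onnx_classification_model to assign_population_pcs through its convert_model_func and apply_model_func parameters. The wrapper name assign_pops_with_onnx is hypothetical, and the assumption that the freshly trained model is converted before being applied is inferred from the parameter descriptions rather than confirmed here.

```python
def assign_pops_with_onnx(pop_pca_scores, pc_cols):
    """Hypothetical helper: assign populations using an ONNX-converted RF model."""
    # Imported lazily so this sketch can be defined without gnomad installed.
    from gnomad.sample_qc.ancestry import (
        apply_onnx_classification_model,
        assign_population_pcs,
        convert_sklearn_rf_to_onnx,
    )

    # Returns a tuple: (Table or DataFrame with assigned labels, model).
    return assign_population_pcs(
        pop_pca_scores,
        pc_cols=pc_cols,
        convert_model_func=convert_sklearn_rf_to_onnx,
        apply_model_func=apply_onnx_classification_model,
    )
```

Pairing the two functions matters: the default apply_model_func expects a sklearn model, so an ONNX model needs the ONNX apply function.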

gnomad.sample_qc.ancestry.assign_population_pcs(pop_pca_scores, pc_cols, known_col='known_pop', fit=None, seed=42, prop_train=0.8, n_estimators=100, min_prob=0.9, output_col='pop', missing_label='oth', pc_expr='scores', convert_model_func=None, apply_model_func=<function apply_sklearn_classification_model>)

Use a random forest model to assign population labels based on the results of PCA.

Default values for model and assignment parameters are those used in gnomAD.

As input, this function can either take:
  • A Hail Table (typically the output of hwe_normalized_pca). In this case:
    • pc_cols should be one of:
      • A list of integers where each element is one of the PCs to use.

      • A list of strings where each element is one of the PCs to use.

      • An ArrayExpression of Floats where each element is one of the PCs to use.

    • A Hail Table will be returned as output.

  • A Pandas DataFrame. In this case:
    • Each PC should be in a separate column and pc_cols is the list of all the columns containing the PCs to use.

    • A pandas DataFrame is returned as output.

Note

If you have a Pandas DataFrame with all PCs stored as an array in a single column, expand_pd_array_col can be used to expand this column into multiple PC columns.

Parameters:
  • pop_pca_scores (Union[Table, DataFrame]) – Input Hail Table or Pandas Dataframe.

  • pc_cols (Union[ArrayExpression, List[int], List[str]]) – List of the PCs to use, or of the columns storing the PCs to use. Values provided should be 1-based and should be a list of integers when passing in a Hail Table (e.g. [1, 2, 4, 5]) or a list of strings when passing in a Pandas DataFrame (e.g. ["PC1", "PC2", "PC4", "PC5"]). When passing a Hail Table, this can also be an ArrayExpression containing all the PCs to use.

  • known_col (str) – Column storing the known population labels.

  • fit (Any) – Fit from a previously trained random forest model (i.e., the output from a previous RandomForestClassifier() call).

  • seed (int) – Random seed.

  • prop_train (float) – Proportion of known data used for training.

  • n_estimators (int) – Number of trees to use in the RF model.

  • min_prob (float) – Minimum probability of belonging to a given population for the population to be set (otherwise set to missing_label).

  • output_col (str) – Output column storing the assigned population.

  • missing_label (str) – Label for samples for which the assignment probability is smaller than min_prob.

  • pc_expr (Union[ArrayExpression, str]) – Column storing the list of PCs. Only used if pc_cols is a List of integers. Default is scores.

  • convert_model_func (Optional[Callable[[Any], Any]]) – Optional function to convert the model to ONNX format. Default is no conversion.

  • apply_model_func (Callable[[DataFrame, Any], Any]) – Function to apply the model to the data. Default is apply_sklearn_classification_model, which will apply a sklearn classification model to the data. This default will work if no fit is set, or the supplied fit is a sklearn classification model.

Return type:

Tuple[Union[Table, DataFrame], Any]

Returns:

Tuple of the Hail Table or Pandas DataFrame (depending on input) containing sample IDs and imputed population labels, and the trained random forest model.
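The interaction between min_prob and missing_label can be sketched in plain Python. This illustrates only the thresholding rule described in the parameter list; it is not the library's implementation, and assign_label and the example probabilities are hypothetical.

```python
def assign_label(probabilities, min_prob=0.9, missing_label="oth"):
    """Pick the most probable population, or missing_label if it is not confident enough.

    probabilities: dict mapping population label -> predicted probability.
    """
    best_pop, best_prob = max(probabilities.items(), key=lambda item: item[1])
    # Samples whose top probability is smaller than min_prob get missing_label.
    return best_pop if best_prob >= min_prob else missing_label


confident = assign_label({"afr": 0.95, "nfe": 0.05})
ambiguous = assign_label({"afr": 0.60, "nfe": 0.40})
```

Here confident is "afr" and ambiguous is "oth", since 0.60 falls below the default min_prob of 0.9.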

gnomad.sample_qc.ancestry.run_pca_with_relateds(qc_mt, related_samples_to_drop=None, additional_samples_to_drop=None, n_pcs=10, autosomes_only=True)

Run PCA excluding the given related or additional samples, and project those samples in the PC space to return scores for all samples.

The related_samples_to_drop and additional_samples_to_drop Tables have to be keyed by the sample ID and all samples present in these tables will be excluded from the PCA.

The loadings Table returned also contains a pca_af annotation which is the allele frequency used for PCA. This is useful to project other samples in the PC space.

Parameters:
  • qc_mt (MatrixTable) – Input QC MT

  • related_samples_to_drop (Optional[Table]) – Optional table of related samples to drop when generating the PCs. These samples will be projected in the PC space.

  • additional_samples_to_drop (Optional[Table]) – Optional table of additional samples to drop when generating the PCs. These samples will be projected in the PC space.

  • n_pcs (int) – Number of PCs to compute

  • autosomes_only (bool) – Whether to run the analysis on autosomes only

Return type:

Tuple[List[float], Table, Table]

Returns:

eigenvalues, scores and loadings
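Because the returned loadings Table carries both the loadings and the pca_af annotation, it matches the default loading_location and af_location of pc_project. A minimal usage sketch follows, assuming qc_mt, related_samples_ht, and other_mt already exist; the wrapper name pca_then_project and those argument names are hypothetical.

```python
def pca_then_project(qc_mt, related_samples_ht, other_mt):
    """Hypothetical helper: run PCA without relateds, then project another dataset."""
    # Imported lazily so this sketch can be defined without gnomad or Hail installed.
    from gnomad.sample_qc.ancestry import pc_project, run_pca_with_relateds

    # related_samples_ht must be keyed by sample ID; those samples are left out
    # of the PCA and projected afterwards, so scores_ht covers all samples.
    eigenvalues, scores_ht, loadings_ht = run_pca_with_relateds(
        qc_mt,
        related_samples_to_drop=related_samples_ht,
        n_pcs=10,
    )

    # Project samples from a different MatrixTable into the same PC space,
    # relying on the default 'loadings' and 'pca_af' locations.
    other_scores_ht = pc_project(other_mt, loadings_ht)
    return eigenvalues, scores_ht, other_scores_ht
```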