BVAS Selector

class bvas.BVASSelector(Y, Gamma, mutations, S=5.0, tau=100.0, nu_eff_multiplier=1.0, genotype_matrix=None, variant_names=None)[source]

Bases: object

Main analysis class for Bayesian Viral Allele Selection (BVAS). Combines a Gaussian diffusion-based likelihood with Bayesian Variable Selection. Uses BVASSampler under the hood to do MCMC inference.

Usage:

selector = BVASSelector(Y, Gamma, mutations, S=10.0, tau=100.0)
selector.run(T=2000, T_burnin=1000)
print(selector.summary)
print(selector.growth_rates)
print(selector.stats)

Note that computations will be done using the device and dtype of the provided torch.Tensor`s, i.e. `Y and Gamma. If you would like computations to be done with a GPU make sure that these tensors are on the GPU. We recommend doing all computations in 64-bit precision, i.e. Y.dtype == Gamma.dtype == torch.float64.

GPU usage:

selector = BVASSelector(Y.cuda().double(), Gamma.cuda().double(), mutations, S=10.0, tau=100.0)
selector.run(T=2000, T_burnin=1000)

The inputs Y and Gamma are defined in terms of region-specific allele frequencies \({\mathbf x}_r(t)\) and region-specific effective population sizes \(\nu_r\) as follows.

\[ \begin{align}\begin{aligned}&{\mathbf y}_r(t) = {\mathbf x}_r(t + 1) - {\mathbf x}_r(t)\\&\bar{\mathbf{Y}}^\nu \equiv \sum_r \nu_r \sum_t {\mathbf y}_r(t)\\&\Lambda_{r,ab}(t) = x_{r,ab}(t) - x_{r,a}(t) x_{r,b}(t)\\&\bar{\mathbf{\Lambda}}^\nu \equiv \sum_r \nu_r \sum_t {\mathbf \Lambda}_r(t)\end{aligned}\end{align} \]

where \(x_{r,ab}(t)\) denote pairwise allele frequencies in region \(r\).

Parameters:
  • Y (torch.Tensor) – A vector of shape (A,) that encodes integrated alelle frequency increments for each allele and where A is the number of alleles.

  • Gamma (torch.Tensor) – A matrix of shape (A, A) that encodes information about second moments of allele frequencies.

  • mutations (list) – A list of strings of length A that encodes the names of the A alleles in Y.

  • S – Controls the expected number of alleles to include in the model a priori. Defaults to 5.0. To specify allele-level prior inclusion probabilities provide a A-dimensional torch.Tensor of the form (h_1, …, h_A). If a tuple of positive floats (alpha, beta) is provided, the a priori inclusion probability is a latent variable governed by the corresponding Beta prior so that the sparsity level is inferred from the data. Note that for a given choice of alpha and beta the expected number of alleles to include in the model a priori is given by \(\frac{\alpha}{\alpha + \beta} \times A\). We caution that this approach may be a poor choice for very noisy genomic surveillance data. Also note that the mean number of covariates in the posterior can vary significantly from prior expectations, since the posterior is in effect a compromise between the prior and the observed data.

  • tau (float) – Controls the precision of the coefficients in the prior. Defaults to 100.0.

  • nu_eff_multiplier (float) – Additional factor by which to multiply the effective population size, i.e. on top of whatever was done when computing Y and Gamma. Defaults to 1.0.

  • genotype_matrix (torch.Tensor) – A matrix of shape (num_variants, A) that encodes the genotype of various viral variants. If included the sampler will compute variant-level growth rates during inference for the varaints in genotype_matrix. Defaults to None. If not None, user must also provide variant_names.

  • variant_names (list) – A list of names of the variants specified by genotype_matrix. Must have the same length as the leading dimension of genotype_matrix. Defaults to None.

run(T, T_burnin, seed=None)[source]

Run MCMC inference for \(T + T_{\rm burn-in}\) iterations. The leading \(T_{\rm burn-in}\) iterations are discarded. After completion the results of the MCMC run can be accessed in the summary, growth_rates, and stats attributes.

The summary pandas.DataFrame contains six columns. The first column lists the Posterior Inclusion Probability (PIP) for each covariate. The second column lists the posterior mean of the coefficient that corresponds to each covariate. The third column lists the posterior standard deviation for each coefficient. The fourth and fifth columns are analogous to the second and third columns, respectively, with the difference that the fourth and fifth columns report conditional posterior statistics. For example, the fourth column reports the posterior mean of each coefficient conditioned on the corresponding covariate being included in the model. The sixth column is the PIP rank.

The growth_rates pandas.DataFrame reports estimated relative growth rates for each variant in genotype_matrix if the latter was provided. Note that growth rates are relative to the wild-type variant with all-zeros genotype.

Parameters:
  • T (int) – Positive integer that controls the number of MCMC samples that are generated (i.e. after burn-in/adaptation).

  • T_burnin (int) – Positive integer that controls the number of MCMC samples that are generated during burn-in/adaptation.

  • seed (int) – Random number seed for reproducibility. Defaults to None.