Waddington Optimal Transport (wot) uses time-course data to infer how the probability distribution of cells in gene-expression space evolves over time, by using the mathematical approach of optimal transport.

Installation


Dependencies

This packages depends on Python 3.

Several other python packages are required, but they can easily be installed through pip or conda

Install from pip

The recommended and easiest way to install wot is via pip

pip install --user wot

wot is then installed and ready to use.

Usage


wot consists of several tools. Each tool can be used with the syntax wot <tool>.

Help is available for each tool with wot <tool> -h. For instance:

wot optimal_transport -h

In the following sections, each command is described with an example and a table containing the core options. Required options are in bold font.


To help you get started with wot, we provide an archive containing all the simulated data that was used to compute all the pictures shown throughout this documentation. You can download it here and run the commands described thereafter in the same directory.

Transport maps

wot optimal_transport --matrix matrix.txt \
 --cell_days days.txt --out tmaps

This command will create a file tmaps_{A}_{B}.h5ad for each pair {A}, {B} of consecutive timepoints. These transport maps can then be converted to any format you find convenient with the convert_matrix tool.

Option Description
--matrix Normalized gene expression matrix. See formats
--cell_days Timestamps for each cell. See formats
--tranpose Swap the rows and column of the input matrix
--epsilon Regularization parameter that controls the entropy of the transport map
default : 0.05
--lambda1 Regularization parameter that controls the fidelity of the constraint on p
default : 1
--lambda2 Regularization parameter that controls the fidelity of the constraint on q
default : 50
--local_pca Number of dimensions to use when doing PCA
default : 30
--growth_iters Number of growth iterations for learning the growth rate.
default : 1
--cell_growth_rates File with "id" and "cell_growth_rate" headers corresponding to cell id and growth rate per day.
--gene_filter File with one gene id per line to use (e.g. variable genes)
Local PCA

The default transport cost uses Principal Component Analysis to reduce the dimension of the data before computing distances pairs of timepoints. By default, wot uses 30 dimensions.

Dimensionality reduction can be disabled with --local_pca 0

If you specify more dimensions for the PCA than your dataset has genes, wot will skip PCA and print a warning.

Trajectories

Ancestors and descendants in wot are computed through the trajectory tool.

You can select a cell set by specifying a cell set file. You can either manually edit this type of file, or generate it from a gene set file using the cells_by_gene_set tool. Please note that the order of the cell ids in the output trajectory file may differ from that of the expression matrix used to generate transport maps.

wot trajectory --tmap tmaps \
 --cell_set cell_sets.gmt --time 10 --out traj.txt
Option Description
--tmap Prefix of the transport maps configuration file
--cell_set Target cell set. See formats
--time The target timepoint to consider if some cell sets span across multiple timepoints
--out Output filename
default : wot_trajectory.txt. Please note that the order of the cell ids in the output trajectory file may differ from that of the expression matrix used to generate transport maps.

Ancestor census

The ancestor census command computes the amount of mass of an ancestor distribution falls into each cell set.

wot census --tmap tmaps  \
 --cell_set cell_sets.gmt \
 --out census --time 10

This command creates several census files named <prefix>_<cellset>.txt, for instance census_tip1.txt. See formats for more information.

Option Description
--tmap Prefix of the transport maps configuration file
--cell_set Target cell sets. See formats
--time The target timepoint to consider if some cell sets span across multiple timepoints
--out Output filenames prefix.
default : 'census'

The trajectory trends command computes the mean and variance of each gene in the specified matrix for the given trajectories. Please use the trajectory tool to compute trajectories.

wot trajectory_trends --trajectory tmaps \
 --matrix matrix.txt \
 --out trends 

This will create a file trends_<trajectory name>.txt with the mean expression profile among ancestors/descendants and trends_<trajectory name>.variance.txt with the variance for each feature at each timepoint

Option Description
--trajectory The trajectory produced by the trajectory tool
--matrix Normalized gene expression matrix. See formats
--out Output filenames prefix.
default : 'trends'

Local regulatory model via differential expression

The local enrichment command finds the genes that are differentially expressed between two sets of cells.

The input matrices must have timepoints on rows, and genes on columns. This is the format created by the trajectory trends command.

wot local_enrichment --score t_test \
 --matrix1 trends_set1.mean.txt --variance1 trends_set1.variance.txt \
 --matrix2 trends_set2.mean.txt --variance2 trends_set2.variance.txt

This will create files <timepoint>.rnk for each timepoint, containing each gene’s score.

Option Description
--matrix1 A matrix with timepoints on rows and features, such as genes or pathways on columns See formats
--variance1 A matrix with timepoints on rows and features on columns with the variance of each gene, associated with matrix1. See formats
--matrix2 A matrix with timepoints on rows and features, such as genes or pathways on columns See formats
--variance2 A matrix with timepoints on rows and features on columns with the variance of each gene, associated with matrix2. See formats
--score Method to compute differential expression score.
Available:
  • signal to noise (s2n)
  • mean difference (mean_difference)
  • t-test (t_test)
  • fold change (fold_change)
--comparisons The timepoints to compare. See detailled description below

This commands accepts two types of input configurations. The first one is as presented in the example with two matrices. It will compare the two matrices for all matching timepoints, and output a <timepoint>.rnk file for each of those.

Alternatively, you can specify a single matrix, and the default behavior will be to compare entries of the matrix for all consecutives timepoints. This would create files names <timepoint1>_<timepoint2>.rnk instead.

For more control over which comparisons are performed, you can specify a file with a tab-separated pair of timepoints on each line with the --comparisons parameter. The first of the two will refer to the timepoint of the selected entry in the first matrix, and the second to the timpepoints of the selected entry in the second matrix, which is identical to the first one if only one was specified. You will then get a file named <timepoint1>_<timepoint2>.rnk for each pair of timepoints in the comparisons file.

Validation

You can easily validate the transport maps that have been computed above.

Say, for instance, that you have data at time points 0, 1, and 2. You could compute the transport map from 0 to 2, and then interpolate with optimal transport to see what distribution is expected at time 1 by this model. It is then easy to compare with the actual distributions at those time points, and thus check that the transport maps properly predict the cell’s trajectory.

wot optimal_transport_validation --matrix matrix.txt \
 --cell_days days.txt --covariate covariate.txt

This would create a validation summary, as a text file, containing all the information needed to evaluate the accuracy of the predictions.

Option Description
--matrix Normalized gene expression matrix. See formats
--cell_days Timestamps for each cell. See formats
--covariate Covariate value for each cell. See formats
--out The filename for the validation summary
default : validation_summary.txt

The validation tool also accepts all options of the optimal_transport tool. Mandatory options are reproduced here for convenience.

Covariate

To measure the quality of interpolation, we compare the distance between batches of cells at the same time point.

The covariate values may be specified in a tab-separated text file. It must have exactly two headers : “id” and “covariate”. Each subsequent line must consist of a cell name, a tab, and a covariate value.

Null hypothesises

wot tests the results of the Optimal Transport predictions against three different null hypothesises.

  • First point : The state of the population at time t0 is a good approximation for its state between times t0 and t1. This is equivalent to saying cells stay where they were at time t0, and then instantly move to their next state exactly at time t1.
  • Last point : The state of the population at time t1 is a good approximation for its state between times t0 and t1. This is equivalent to saying cells are in a given state at time t0, and then instantly move to their next state, and stay there until time t1.
  • Randomized interpolation : Linear interpolation between randomly-picked pairs of points from t0 and t1 is a good approximation for the state of the population between t0 and t1.

Validation summary

The validation summary, generated as text file, is organized as follows :

Each line contains information about the relation between two cell sets :

  • interval_start and interval_end indicate which pair of timepoints is being considered.
  • pair0 and pair1 indicate which sets are being considered:
    • P is the real population at the intermediate timepoint
    • F is the estimated population in first point hypothesis
    • L is the estimated population in last point hypothesis
    • R is the estimated population in randomized interpolation hypothesis
    • I is the estimated population in optimal transport hypothesis
  • distance is the Wasserstein distance between the two sets considered

pair0 and pair1 will have a suffix of the form _cvX_cvY or _cvZ when covariates are used, to indicate which batch they were extracted from.

Embedding

One method of visualizing data in two or three dimensions is Force-directed Layout Embedding (FLE). We compute this nearest neighbor graph in diffusion component space. We recommend using forceatlas2-3d to run the force directed layout. You can generate a graph to use as input to force directed layout using the command neighborhood_graph

wot neighborhood_graph --matrix matrix.txt
Option Description
--matrix Normalized gene expression matrix. See formats
--neighbors Number of nearest neighbors to consider
default : 100
--neighbors_diff Number of nearest neighbors to use in diffusion component space
default : 20
--n_comps Number of diffusion components
default : 20
--gene_filter File with one gene id per line to use (e.g. variable genes)
--out Output filename
Output

This command will create two files containing the projection of the data in two dimensions: <prefix>.csv and <prefix>.h5ad

The h5ad is just a regular 2D matrix that can be converted to any other format with the convert_matrix tool. The csv is provided for convenience and can be given directly to the wot_server interactive tool.

Supported file formats


Gene expression matrices

The matrix file specifies the gene expression matrix to use.

The following formats are accepted by all tools: mtx, txt, h5ad, loom, and gct. Please note that wot expects cells on the rows and genes on the columns, except for the mtx format.

Text

The text format consists of tab or comma separated columns with genes on the columns and cells on the rows.

The first row, the header, must consist of an “id” field, and then the list of genes to be considered.

Each subsequent row will give the expression level of each gene for a given cell.

The first field must be a unique identifier for the cell, and then the tab or comma separated list of expression levels for each gene/feature.

Example:

idgene_1gene_2gene_3
cell_11.212.25.4
cell_22.34.15.0
MTX

The MTX format is a sparse matrix format with genes on the rows and cells on the columns as output by Cell Ranger. You should also have TSV files with genes and barcode sequences corresponding to row and column indices, respectively. These files must be located in the same folder as the MTX file with the same base file name. For example if the MTX file is my_data.mtx, you should also have a my_data.genes.txt file and a my_data.barcodes.txt file.

GCT

A GCT file is a tab-delimited text file. Please see description here

Loom

A HDF5 file for efficient storage and access of large datases. Please see description at http://loompy.org/


You can convert input matrices from one format to another :

wot convert_matrix --format txt *.loom

This command converts all loom files in the current directory to text.

Supported input formats are mtx, hdf5, h5, h5ad, loom, and gct.

Supported output formats are txt, loom, and gct. (default: loom)

Cell timestamps (cell day file)

The timestamp associated with each cell of the matrix file is specified in the days file. This file must be a tab-separated plain text file, with two header fields: “id” and “day”.

Example:

idday
cell_11
cell_22.5

OT Configuration file

There are several options to specify Optimal Transport parameters in wot.

The easiest is to just use constant parameters and specify them when computing transport maps with the --epsilon or --lambda1 options.

If you want more control over what parameters are used, you can use a detailed configuration file. There are two kinds of configuration files accepted by wot.

Configuration per timepoint

You can specify each parameter at each timepoint. When computing a transport map between two timepoints, the average of the two parameters for the considered timepoints will be taken into account.

For instance, if you have prior knowledge of the amount of entropy at each timepoint, you could specify a different value of epsilon for each timepoint, and those would be used to compute more accurate transport maps.

The configuration file is a tab-separated text file that starts with a header that must contain a column named t, for the timepoint, and then the name of any parameter you want to set. Any parameter not specified in this file can be specified as being constant as previously, with the command-line arguments --epsilon, --lambda1, --tolerance, etc. .

Example:

tepsilon
00.001
10.002
20.005
30.008
3.50.01
40.005
50.001

Configuration per pair of timepoints

If you want to be even more explicit about what parameters to use for each transport map computation, you can specify parameters for pairs of timepoints.

As previously, the configuration is specified in a tab-separated text file. Its header must have columns t0 and t1, for source and destination timepoints.

Bear in mind though, that any pair of timepoints not specified in this file will not be computable. That means you should at least put all pairs of consecutive timepoints if you want to be able to compute full trajectories.

Example:

t0t1lambda1
0150
1280
2430
4510

This can for instance be used if you want to skip a timepoint (note how timepoints 3 or 3.5 are not present here). If a timepoint is present in the dataset but not in this configuration file, it will be ignored.

You can use as many parameter columns as you want, even none. All parameters not specified here can be specified as being constant as previously, with the command-line arguments --epsilon, --lambda1, --tolerance, etc. .

Gene sets

Gene sets can be in gmx (Gene MatriX), gmt (Gene Matrix Transposed), or grp format.

The gmt format is convenient to store large databases of gene sets. However, for a handful of sets, the gmx format might offer better excel-editablity.

More information on the gene set formats can be found in the Broad Institute Software Documentation

GMT

The gmt format consists of one gene set per line. Each line is a tab-separated list composed as follows :

  • The gene set name (can contain spaces)
  • A commentary / description of the gene set (may be empty or contain spaces)
  • A tab-separated list of genes

Example:

Tip1The first tipgene_2gene_1
Tip2The second tipgene_3
Tip3The third tipgene_4gene_1
GMX

The gmx format is the transposed of the gmx format. Each column represents a gene set. It is also tab-separated.

Example:

Tip1Tip2Tip3
The first tipThe second tipThe third tip
gene_2gene_3gene_4
gene_1gene_1
GRP

The grp format contains a single gene set in a simple newline-delimited text format.

Example:

gene_1
gene_2
gene_3

Cell sets

Cell sets can be described using the same formats as gene sets. Simply list the ids of the cell in a set where you would have listed the name of the genes in a gene set.

Cell selecting tool

If you want to select a cell sets corresponding to a list of gene sets, you may use the cells_by_gene_set command-line tool provided byt wot.

wot cells_by_gene_set --matrix matrix.txt --gene_sets gene_sets.gmt \
 --out cell_sets.gmt --format gmt --quantile 0.99

You can select which proportion of the cells having each gene to select with the --quantile option. The default value is 0.99, which would select the top 1% of each gene. Choosing 0.5 for instance would select every cell that has all genes above the median in the population.

Census file

Census files are datasets files : tab-separated text files with a header. The header consists of an “id” field, and then the list of cell sets for the census.

Each subsequent row will give the proportion of ancestors that pertained in each of the mentionned cell sets.

The id is the time at which the ancestors lived.

Example:

idtip1tip2tip3
0.00.150.050.05
1.00.280.050.03
2.00.420.030.02
3.00.720.020.01
4.00.890.000.00
5.00.990.000.00

Covariate file

The batch associated with each cell of the matrix file is specified in the covariate file. This file must be a tab-separated plain text file, with two header fields: “id” and “covariate”.

Example:

idcovariate
cell_10
cell_21