File Formats

Gene expression matrices

The matrix file specifies the gene expression matrix to use.

The following formats are accepted by all tools: mtx, txt, h5ad, and loom. Note that wot expects cells on the rows and genes on the columns, except for the mtx format, which stores genes on the rows and cells on the columns.

Text

The text format consists of tab- or comma-separated columns, with genes on the columns and cells on the rows.

The first row, the header, must consist of an “id” field followed by the list of genes to be considered.

Each subsequent row gives the expression level of each gene for a given cell.

The first field of each row must be a unique identifier for the cell, followed by the tab- or comma-separated list of expression levels for each gene/feature.

Example:

id	gene_1	gene_2	gene_3
cell_1	1.2	12.2	5.4
cell_2	2.3	4.1	5.0
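The layout above can be read with the Python standard library alone. The sketch below is only an illustration of the format, not wot's own reader; the function name read_expression_text is hypothetical, and the delimiter is guessed from the header line.

```python
import csv
import io


def read_expression_text(handle):
    """Parse a tab- or comma-separated expression matrix (cells on rows).

    Hypothetical helper for illustration; not part of wot.
    """
    first_line = handle.readline()
    # Guess the delimiter from the header: tab if present, otherwise comma.
    delimiter = "\t" if "\t" in first_line else ","
    header = next(csv.reader([first_line], delimiter=delimiter))
    genes = header[1:]  # the first header field is "id"
    cell_ids, rows = [], []
    for record in csv.reader(handle, delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        cell_ids.append(record[0])
        rows.append([float(value) for value in record[1:]])
    return cell_ids, genes, rows


example = (
    "id\tgene_1\tgene_2\tgene_3\n"
    "cell_1\t1.2\t12.2\t5.4\n"
    "cell_2\t2.3\t4.1\t5.0\n"
)
cell_ids, genes, rows = read_expression_text(io.StringIO(example))
```

Here cell_ids holds the row identifiers, genes the column names, and rows one list of floats per cell.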

MTX

The MTX format is a sparse matrix format with genes on the rows and cells on the columns, as output by Cell Ranger. You should also have TSV files listing the genes and the barcode sequences, which correspond to the row and column indices, respectively. These files must be located in the same folder as the MTX file and share its base file name. For example, if the MTX file is my_data.mtx, you should also have a my_data.genes.txt file and a my_data.barcodes.txt file.
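The naming convention for the companion files can be sketched as follows. The helper mtx_companion_files is hypothetical and only derives the expected paths from the MTX path; it does not check that the files exist.

```python
from pathlib import Path


def mtx_companion_files(mtx_path):
    """Return the expected genes and barcodes paths for an MTX file.

    Hypothetical helper: same folder, same base name, as described above.
    """
    mtx_path = Path(mtx_path)
    genes = mtx_path.with_name(mtx_path.stem + ".genes.txt")
    barcodes = mtx_path.with_name(mtx_path.stem + ".barcodes.txt")
    return genes, barcodes


genes_path, barcodes_path = mtx_companion_files("data/my_data.mtx")
```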

H5AD

An HDF5 file that provides a scalable way of keeping track of data together with learned annotations. Please see the description at https://anndata.readthedocs.io

Loom

An HDF5 file for efficient storage and access of large datasets. Please see the description at http://loompy.org/

Cell Days

The timestamp associated with each cell of the matrix file is specified in the days file. This file must be a tab- or comma-separated plain text file with two header fields: “id” and “day”.

Example:

id	day
cell_1	1
cell_2	2.5
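As a rough illustration of how a days file might be consumed, the sketch below groups cell identifiers by timepoint. The function cells_by_day is a hypothetical helper, not wot's API, and it assumes the “id” and “day” header fields described above.

```python
import csv
import io
from collections import defaultdict


def cells_by_day(text):
    """Group cell ids by day from the contents of a days file.

    Hypothetical helper for illustration; not part of wot.
    """
    # Guess the delimiter from the header line: tab if present, else comma.
    delimiter = "\t" if "\t" in text.splitlines()[0] else ","
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        groups[float(row["day"])].append(row["id"])
    return dict(groups)


days = cells_by_day("id\tday\ncell_1\t1\ncell_2\t2.5\n")
```

Days are converted to floats so that fractional timepoints such as 2.5 are preserved.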

Gene/Cell sets

Gene or cell sets can be in gmx (Gene MatriX), gmt (Gene Matrix Transposed), or grp format.

The gmt format is convenient for storing large databases of sets. For a handful of sets, however, the gmx format may be easier to edit in a spreadsheet program such as Excel.

More information on these formats can be found here

GMT

The gmt format consists of one set per line. Each line is a tab-separated list composed as follows:

The set name (can contain spaces)
A commentary / description of the set (may be empty or contain spaces)
A tab-separated list of set members

Example:

Set1	set 1 description	gene_2	gene_1
Set2	set 2 description	gene_3
Set3	set 3 description	gene_4	gene_1
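The three fields of each line map naturally onto a small parser. The following is a minimal sketch, assuming well-formed input with at least a name and a description on every line; parse_gmt is a hypothetical name, not a wot function.

```python
def parse_gmt(text):
    """Parse GMT text into {set name: (description, members)}.

    Hypothetical helper for illustration; not part of wot.
    """
    sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        name, description = fields[0], fields[1]
        # Everything after the description is a set member; drop empty fields.
        members = [member for member in fields[2:] if member]
        sets[name] = (description, members)
    return sets


gmt_text = (
    "Set1\tset 1 description\tgene_2\tgene_1\n"
    "Set2\tset 2 description\tgene_3\n"
    "Set3\tset 3 description\tgene_4\tgene_1\n"
)
gene_sets = parse_gmt(gmt_text)
```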

GMX

The gmx format is the transpose of the gmt format: each column represents a set. It is also tab-separated.

Example:

Set1	Set2	Set3
set 1 description	set 2 description	set 3 description
gene_2	gene_3	gene_4
gene_1		gene_1
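Because GMX stores one set per column where GMT stores one set per line, converting between the two amounts to a transpose with padding for sets of unequal size. The sketch below uses itertools.zip_longest for that; gmt_to_gmx is a hypothetical helper name.

```python
from itertools import zip_longest


def gmt_to_gmx(gmt_text):
    """Transpose GMT text (one set per line) into GMX text (one set per column).

    Hypothetical helper for illustration; not part of wot.
    """
    rows = [line.split("\t") for line in gmt_text.splitlines() if line.strip()]
    # Shorter sets are padded with empty cells so every column has equal length.
    columns = zip_longest(*rows, fillvalue="")
    return "\n".join("\t".join(column) for column in columns)


gmx_text = gmt_to_gmx(
    "Set1\tset 1 description\tgene_2\tgene_1\n"
    "Set2\tset 2 description\tgene_3\n"
    "Set3\tset 3 description\tgene_4\tgene_1\n"
)
```

Applied to the three-set GMT example, this yields the GMX example shown above, including the empty cell in the Set2 column.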

GRP

The grp format contains a single set in a simple newline-delimited text format.

Example:

gene_1
gene_2
gene_3

Covariate file

The batch associated with each cell of the matrix file is specified in the covariate file. This file must be a tab- or comma-separated plain text file with two header fields: “id” and “covariate”.

Example:

id	covariate
cell_1	0
cell_2	1

OT Configuration file

There are several options to specify Optimal Transport parameters in wot.

The easiest is to just use constant parameters and specify them when computing transport maps with the --epsilon or --lambda1 options.

If you want more control over what parameters are used, you can use a detailed configuration file. There are two kinds of configuration files accepted by wot.

Configuration per timepoint

You can specify each parameter at each timepoint. When computing a transport map between two timepoints, the average of the parameter values at those two timepoints is used.

For instance, if you have prior knowledge of the amount of entropy at each timepoint, you could specify a different value of epsilon for each timepoint, and those would be used to compute more accurate transport maps.

The configuration file is a tab-separated text file that starts with a header. The header must contain a column named t, for the timepoint, followed by the name of each parameter you want to set. Any parameter not specified in this file can still be given as a constant, as described above, with the command-line arguments --epsilon, --lambda1, --tolerance, etc.

Example:

t	epsilon
0	0.001
1	0.002
2	0.005
3	0.008
3.5	0.01
4	0.005
5	0.001
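To make the averaging rule concrete, the sketch below reads the example table and computes the epsilon that would apply to the transport map between timepoints 3 and 3.5. It only mirrors the rule stated above; read_timepoint_config and pair_parameter are hypothetical helpers, not wot functions.

```python
import csv
import io

CONFIG_TEXT = (
    "t\tepsilon\n"
    "0\t0.001\n1\t0.002\n2\t0.005\n3\t0.008\n"
    "3.5\t0.01\n4\t0.005\n5\t0.001\n"
)


def read_timepoint_config(text):
    """Map each timepoint to a dict of its parameter values."""
    config = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        timepoint = float(row.pop("t"))
        config[timepoint] = {name: float(value) for name, value in row.items()}
    return config


def pair_parameter(config, t0, t1, name):
    """Average a parameter over the two timepoints of a transport map."""
    return (config[t0][name] + config[t1][name]) / 2


config = read_timepoint_config(CONFIG_TEXT)
epsilon_3_to_3_5 = pair_parameter(config, 3, 3.5, "epsilon")
```

With the values above, the result is the mean of 0.008 and 0.01, that is, approximately 0.009.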

Configuration per pair of timepoints

If you want to be even more explicit about what parameters to use for each transport map computation, you can specify parameters for pairs of timepoints.

As previously, the configuration is specified in a tab-separated text file. Its header must have columns t0 and t1, for source and destination timepoints.

Bear in mind, though, that transport maps cannot be computed for any pair of timepoints not specified in this file. You should therefore include at least all pairs of consecutive timepoints if you want to be able to compute full trajectories.

Example:

t0	t1	lambda1
0	1	50
1	2	80
2	4	30
4	5	10

This can be used, for instance, to skip a timepoint (note that timepoints 3 and 3.5 are not present here). If a timepoint is present in the dataset but not in this configuration file, it will be ignored.
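Before running a long computation, it can help to check which dataset timepoints a pair configuration leaves out and whether the listed pairs link up into one chain. The sketch below is an assumption-laden illustration of such a check; check_pair_config is a hypothetical helper and does not reflect how wot validates its input.

```python
def check_pair_config(pairs, dataset_timepoints):
    """Report ignored timepoints and whether the pairs form one chain.

    Hypothetical helper for illustration; not part of wot.
    """
    covered = {timepoint for pair in pairs for timepoint in pair}
    # Timepoints in the dataset that appear in no pair will be ignored.
    ignored = sorted(set(dataset_timepoints) - covered)
    ordered = sorted(pairs)
    # A full trajectory needs each destination to be the next pair's source.
    chained = all(a[1] == b[0] for a, b in zip(ordered, ordered[1:]))
    return ignored, chained


pairs = [(0, 1), (1, 2), (2, 4), (4, 5)]
ignored, chained = check_pair_config(pairs, [0, 1, 2, 3, 3.5, 4, 5])
```

For the example configuration, ignored lists the skipped timepoints 3 and 3.5, and chained is true because each pair starts where the previous one ends.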

You can use as many parameter columns as you want, or none at all. Any parameter not specified here can still be given as a constant, as described above, with the command-line arguments --epsilon, --lambda1, --tolerance, etc.

Census file

Census files are dataset files: tab-separated text files with a header. The header consists of an “id” field followed by the list of cell sets for the census.

Each subsequent row gives the proportion of ancestors that belonged to each of the listed cell sets.

The id is the time at which the ancestors lived.

Example:

id	tip1	tip2	tip3
0.0	0.15	0.05	0.05
1.0	0.28	0.05	0.03
2.0	0.42	0.03	0.02
3.0	0.72	0.02	0.01
4.0	0.89	0.00	0.00
5.0	0.99	0.00	0.00