16 Functional Pseudotime Analysis

In this lab, we will analyze a single cell RNA-seq dataset that will teach us about several methods to infer the differentiation trajectory of a set of cells. These methods can order a set of individual cells along a path / trajectory / lineage, and assign a pseudotime value to each cell that represents where the cell is along that path. This can be a starting point for further analysis to determine gene expression programs driving interesting cell phenotypes. As you are running the code, think about how the algorithms work and what you like and do not like about the assumptions and utilities provided by the algorithm.

Note: you can increase the system memory available to Docker by going to Docker -> Preferences -> Advanced and shifting the Memory slider.

16.1 Load settings and packages

16.2 First look at the differentiation data from Deng et al.

We will use a nice SMART-Seq2 single cell RNA-seq data from Single-Cell RNA-Seq Reveals Dynamic, Random Monoallelic Gene Expression in Mammalian Cells. Here is one relevant detail from their paper: “To investigate allele-specific gene expression at single-cell resolution, we isolated 269 individual cells dissociated from in vivo F1 embryos (CAST/EiJ × C57BL/6J, hereafter abbreviated as CAST and C57, respectively) from oocyte to blastocyst stages of mouse preimplantation development (PD)”

Let us take a first look at the Deng data. One simple approach to ordering cells in pseudotime is to use PCA. By carrying out PCA and labeling the cells by the stage at which they were collected, we can see how well the principal components separate cells along a differentiation trajectory.

# Read in single cell data.
path.deng <- paste0(mydir, "deng-reads.rds")
deng_SCE <- readRDS(path.deng)

# What class is the deng_SCE object, and how is it organized?
class(deng_SCE)
## [1] "SingleCellExperiment"
## attr(,"package")
## [1] "SingleCellExperiment"
structure(deng_SCE)
## class: SingleCellExperiment 
## dim: 22431 268 
## metadata(0):
## assays(2): counts logcounts
## rownames(22431): Hvcn1 Gbp7 ... Sox5 Alg11
## rowData names(10): feature_symbol is_feature_control ...
##   total_counts log10_total_counts
## colnames(268): 16cell 16cell.1 ... zy.2 zy.3
## colData names(30): cell_type2 cell_type1 ... pct_counts_ERCC
##   is_cell_control
## reducedDimNames(0):
## spikeNames(1): ERCC
# How many mouse cells are at each stage?
table(deng_SCE$cell_type1)
## 
## 16cell  2cell  4cell  8cell  blast zygote 
##     50     22     14     37    133     12
table(deng_SCE$cell_type2)
## 
##     16cell      4cell      8cell early2cell earlyblast  late2cell 
##         50         14         37          8         43         10 
##  lateblast   mid2cell   midblast         zy 
##         30         12         60          4
# Re-order the levels of the factor storing the cell developmental stage.
deng_SCE$cell_type2 <- factor(deng_SCE$cell_type2,
                              levels = c("zy", "early2cell", "mid2cell", "late2cell", 
                                         "4cell", "8cell", "16cell", "earlyblast", "midblast",
                                         "lateblast"))

# Run PCA on Deng data. Use the runPCA function from the SingleCellExperiment package.
deng_SCE <- runPCA(deng_SCE, ncomponents = 50)

# Use the reducedDim function to access the PCA and store the results. 
pca <- reducedDim(deng_SCE, "PCA")

# Describe how the PCA is stored in a matrix. Why does it have this structure?
head(pca)
##                PC1       PC2       PC3        PC4        PC5        PC6
## 16cell   -4.616718 -13.67367 1.6860222 -0.3971110  0.4331834 -3.3351871
## 16cell.1 -5.528782 -11.18717 2.9383081  0.2381910  0.1644197 -0.8666549
## 16cell.2 -5.067554 -13.50145 1.6200888 -1.4489428 -0.3560664 -2.7057527
## 16cell.3 -5.604999 -12.42408 1.2092483  0.7560287 -0.9812613  0.1008600
## 16cell.4 -4.989056 -12.64989 0.9518061  2.0114694  2.1210373 -0.5469395
## 16cell.5 -4.605128 -13.35725 1.2909256 -0.6736846  2.1882002 -2.7699870
##                 PC7        PC8        PC9      PC10      PC11       PC12
## 16cell   -0.6236209  0.3621665 -1.2910579  2.386634 -1.824875  0.4400858
## 16cell.1 -0.4232062 -1.0859542 -2.0577693  2.566871 -2.663335 -3.2087906
## 16cell.2  1.7515857  0.1552950 -1.5951345  1.879552 -3.237261 -0.9710867
## 16cell.3  1.0103357 -0.4448119 -0.6081911 -1.059523 -4.416895 -0.7229430
## 16cell.4  0.0563789 -0.7070924 -1.2665595  2.733815 -1.090173 -0.7688942
## 16cell.5  0.9010465  1.0445247 -2.8906108  2.886413 -1.844421 -4.2520016
##                 PC13        PC14       PC15       PC16       PC17
## 16cell   -0.77758746  2.11643418 -0.8244988  0.4831872 -0.8213527
## 16cell.1  1.95567784 -1.12330816 -2.0169080  2.0721840  0.4924598
## 16cell.2  1.77982275  0.02196743 -2.0566764  2.1791387 -1.8205026
## 16cell.3  1.82741705 -2.96140322 -2.0898322 -1.8926069 -1.2088311
## 16cell.4  0.01732681 -0.77442110 -2.4652798  2.6683790 -1.0930128
## 16cell.5  1.26660076 -0.08437871 -1.9085246  0.8621796 -0.2821138
##                  PC18       PC19       PC20        PC21        PC22
## 16cell    2.912998472 -0.1114755  0.6633269 -0.88984647  2.73819461
## 16cell.1  1.555585976 -1.1341038 -2.4003738 -1.00116216  2.79983998
## 16cell.2  1.544638100  0.2775379 -1.2832673 -0.46297320  0.08161412
## 16cell.3 -1.164002801  0.8564781 -1.0032480 -0.02165817  2.81656780
## 16cell.4  0.492283751  0.1123406  0.6449514 -1.48016504  1.69637235
## 16cell.5 -0.004085098  0.9727746 -0.8545002 -0.82326712 -0.49085584
##                PC23        PC24       PC25       PC26       PC27
## 16cell   -1.2833723  0.26093070 -0.9025236 -0.9199517  0.7705592
## 16cell.1 -2.0626007 -1.49145200 -0.3073604 -1.6213086 -0.4668769
## 16cell.2 -0.6273122 -0.82463839 -0.3514344 -0.9143962  0.1285442
## 16cell.3  1.1540990  1.64935692 -2.8926707  0.5279823  1.0364766
## 16cell.4  1.3036491 -0.04931057 -1.8065901 -2.3286922  0.7928030
## 16cell.5 -2.4572424  0.28548367  0.9919387 -1.3399723 -0.4663864
##                PC28       PC29       PC30       PC31          PC32
## 16cell   -1.1942837  3.3616090 -0.8339894 -2.6264466  0.2088315099
## 16cell.1  0.9442624 -1.3557627  1.5239438 -0.3786813 -2.0367506324
## 16cell.2 -1.4033309  0.5669192 -0.5764371  0.2240873  0.2092731123
## 16cell.3  2.1822899  0.3327657 -3.9188841 -0.2036249 -1.1033810361
## 16cell.4  1.5970783 -1.1079176 -0.6441109 -2.8524393  0.0004546378
## 16cell.5  2.7023565  1.1374510 -2.8720566 -1.0064513  1.8063507928
##                PC33       PC34        PC35       PC36       PC37
## 16cell    1.1394308  0.3840000 -0.30028319  0.5470062 -2.2521121
## 16cell.1  0.4802093 -1.5461206  0.57785418  0.2136878 -3.1427911
## 16cell.2 -0.6020883  0.2587835 -1.39670584  1.0217782 -0.9796248
## 16cell.3 -2.0222558 -1.5459463  0.98029013  1.6982863 -2.3282394
## 16cell.4 -0.8323239  0.7580553 -0.08445444 -0.3581607  1.2995240
## 16cell.5 -0.2765822  1.4528263 -0.01488841 -2.0533203  0.9582204
##                 PC38       PC39       PC40       PC41        PC42
## 16cell    1.54130056 -0.5503021  1.8218871 -1.2491319  0.02774323
## 16cell.1 -0.06859658  1.0455128  3.3915829  0.8911611  1.76258371
## 16cell.2 -0.93454517 -1.4591025  0.1484064  0.7292182  0.89423648
## 16cell.3 -0.95838032  0.1684409  0.6182237  0.8339745 -0.21181011
## 16cell.4  1.57202989  0.8189163 -2.3331272 -0.3108283 -0.31856432
## 16cell.5  1.45162192  2.1979823  1.0272293 -3.8152357  0.07293698
##                PC43       PC44      PC45       PC46       PC47       PC48
## 16cell   -1.2648452  1.3478074 -1.489203  1.6186833 -0.3300131  0.8603107
## 16cell.1  1.2310412 -0.3182549 -1.857164 -2.9553566 -2.2298687  0.4699274
## 16cell.2  0.6198856  1.9024444  0.096746  1.0150184  1.5574517 -0.3872533
## 16cell.3 -1.1234170  1.0566629 -1.311480 -0.4328987  2.7491211 -0.6665591
## 16cell.4  0.5818659 -1.2131792  1.086794  0.3857532  2.3153699  1.2776942
## 16cell.5 -0.8961840 -0.1003433  1.667234  1.5554497 -0.9303923 -0.8338826
##                PC49        PC50
## 16cell   -1.1294170  0.54900311
## 16cell.1 -0.9080707 -1.77789141
## 16cell.2  1.4262577 -0.08018494
## 16cell.3 -0.4120873  2.00067268
## 16cell.4 -0.9602715 -0.02929732
## 16cell.5  1.2653639  1.63364381
dim(pca)
## [1] 268  50
# Add PCA data to the deng_SCE object.
deng_SCE$PC1 <- pca[, 1]
deng_SCE$PC2 <- pca[, 2]

# Plot PC biplot with cells colored by cell_type2. 
# colData(deng_SCE) accesses the cell metadata DataFrame object for deng_SCE.
# Look at Figure 1A of the paper as a comparison to your PC biplot.
ggplot(as.data.frame(colData(deng_SCE)), aes(x = PC1, y = PC2, color = cell_type2)) + geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("PC1") + ylab("PC2") + ggtitle("PC biplot")

# PCA is a simple approach and can be good to compare to more complex algorithms 
# designed to capture differentiation processes. As a simple measure of pseudotime 
# we can use the coordinates of PC1.
# Plot PC1 vs cell_type2. 
deng_SCE$pseudotime_PC1 <- rank(deng_SCE$PC1)  # rank cells by their PC1 score
ggplot(as.data.frame(colData(deng_SCE)), aes(x = pseudotime_PC1, y = cell_type2, 
                                             colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("PC1") + ylab("Timepoint") +
    ggtitle("Cells ordered by first principal component")

ggsave(paste0(mydir, "/pseudotime_PC1.png"))
## Saving 7 x 5 in image
# Try separating the cell types using other PCs. How does the separation look?

16.3 Diffusion map pseudotime

Let us see how a more advance trajectory inference method, diffusion maps and diffusion pseudotime, performs at placing cells along the expected differentiation trajectory.

Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data.

Angerer et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called destiny.

We will use two forms of pseudotime: the first diffusion component and the diffusion pseudotime.

#  Prepare a counts matrix with labeled rows and columns. 
deng <- logcounts(deng_SCE)  # access log-transformed counts matrix
cellLabels <- deng_SCE$cell_type2
colnames(deng) <- cellLabels

# Make a diffusion map.
dm <- DiffusionMap(t(deng))

# Optional: Try different sigma values when making diffusion map.
# dm <- DiffusionMap(t(deng), sigma = "local")  # use local option to set sigma
# sigmas <- find_sigmas(t(deng), verbose = FALSE)  # find optimal sigma
# dm <- DiffusionMap(t(deng), sigma = optimal_sigma(sigmas))  

# Plot diffusion component 1 vs diffusion component 2 (DC1 vs DC2). 
tmp <- data.frame(DC1 = eigenvectors(dm)[, 1],
                  DC2 = eigenvectors(dm)[, 2],
                  Timepoint = deng_SCE$cell_type2)
ggplot(tmp, aes(x = DC1, y = DC2, colour = Timepoint)) +
    geom_point() + scale_color_tableau() + 
    xlab("Diffusion component 1") + 
    ylab("Diffusion component 2") +
    theme_classic()

# Try plotting higher diffusion components against one another.

# Next, let us use the first diffusion component (DC1) as a measure of pseudotime.
# How does the separation by cell stage look?
deng_SCE$pseudotime_diffusionmap <- rank(eigenvectors(dm)[,1])    # rank cells by their dpt
ggplot(as.data.frame(colData(deng_SCE)), 
       aes(x = pseudotime_diffusionmap, 
           y = cell_type2, colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("Diffusion component 1 (DC1)") + ylab("Timepoint") +
    ggtitle("Cells ordered by DC1")

ggsave(paste0(mydir, "/pseudotime_DC1.png"))
## Saving 7 x 5 in image
# Plot eigenvalues of diffusion distance matrix. How many diffusion components would you use?
# This is analagous to the PC elbow plot (scree plot) that we previously used to assess how 
# many PCs to use in downstream applications like clustering.
plot(eigenvalues(dm), ylim = 0:1, pch = 20, xlab = 'Diffusion component (DC)', ylab = 'Eigenvalue')

# What happens if you run the diffusion map on the PCs? Why would one do this?
rownames(pca) <- cellLabels
dm <- DiffusionMap(pca)

# Diffusion pseudotime calculation. 
# Set index or tip of pseudotime calculation to be a zygotic cell (cell 268). 
dpt <- DPT(dm, tips = 268)

# Plot DC1 vs DC2 and color the cells by their inferred diffusion pseudotime.
# We can accesss diffusion pseudotime via dpt$dpt.
df <- data.frame(DC1 = eigenvectors(dm)[, 1], DC2 = eigenvectors(dm)[, 2], 
                 dptval = dpt$dpt, cell_type2 = deng_SCE$cell_type2)
p1 <- ggplot(df) + geom_point(aes(x = DC1, y = DC2, color = dptval))
p2 <- ggplot(df) + geom_point(aes(x = DC1, y = DC2, color = cell_type2))
p <- plot_grid(p1, p2)
p

save_plot(paste0(mydir, "/dpt_celltype.png"), p, base_height = 5, base_aspect_ratio = 2)

# Plot diffusion pseudotime vs timepoint. 
# Which separates the data better, DC1 or diffusion pseudotime?
deng_SCE$pseudotime_dpt <- rank(dpt$dpt) 
ggplot(as.data.frame(colData(deng_SCE)), 
       aes(x = pseudotime_dpt, 
           y = cell_type2, colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("Diffusion map pseudotime (dpt)") +
    ylab("Timepoint") +
    ggtitle("Cells ordered by diffusion map pseudotime")

ggsave(paste0(mydir, "/pseudotime_dpt.png"))
## Saving 7 x 5 in image
# Save current progress.
# save(deng_SCE, file = Rda.destiny.path)
# To load the data, run the following command.
# load(Rda.destiny.path)

16.4 Slingshot map pseudotime

Let us see how another advance trajectory inference method, Slingshot, performs at placing cells along the expected differentiation trajectory.

library(slingshot)
library(Seurat)

# Read the Slingshot documentation (?slingshot) and then run Slingshot below. 
# Given your understanding of the algorithm and the documentation, what is one 
# major set of parameters we omitted here when running Slingshot?
sce <- slingshot(deng_SCE, reducedDim = 'PCA')  # no clusters

# Plot PC1 vs PC2 colored by Slingshot pseudotime.
colors <- rainbow(50, alpha = 1)
plot(reducedDims(sce)$PCA, col = colors[cut(sce$slingPseudotime_1,breaks=50)], pch=16, asp = 1)
lines(SlingshotDataSet(sce), lwd=2)

# Plot Slingshot pseudotime vs cell stage. 
ggplot(as.data.frame(colData(deng_SCE)), aes(x = sce$slingPseudotime_1, y = cell_type2, 
                              colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("Slingshot pseudotime") + ylab("Timepoint") +
    ggtitle("Cells ordered by Slingshot pseudotime")

# Cluster cells using the Seurat workflow below.
gcdata <- CreateSeuratObject(counts = counts(deng_SCE), min.cells = 0, min.genes = 0, project = "slingshot")
gcdata <- NormalizeData(object = gcdata, normalization.method = "LogNormalize", 
    scale.factor = 10000)
gcdata <- FindVariableGenes(object = gcdata, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.1, x.high.cutoff = 3, y.cutoff = 0.5)
gcdata <- ScaleData(object = gcdata, do.center = T, do.scale = F)
gcdata <- RunPCA(object = gcdata, pc.genes = gcdata@var.genes, do.print = TRUE, pcs.print = 1:5, 
    genes.print = 5)
gcdata <- FindClusters(object = gcdata, reduction.type = "pca", dims.use = 1:20, 
    resolution = 0.6, print.output = 0, save.SNN = TRUE)

# Add clustering information from Seurat to the deng_SCE object
# Then run Slingshot using these cluster assignments.
deng_SCE$slingPseudotime_1 <- NULL  # remove old slingshot pseudotime data
colData(deng_SCE)$Seurat_clusters <- as.character(gcdata@ident)  # go from factor to character
deng_SCE <- slingshot(deng_SCE, clusterLabels = 'Seurat_clusters', reducedDim = 'PCA')

# Plot PC1 vs PC2 colored by Slingshot pseudotime.
colors <- rainbow(50, alpha = 1)
plot(reducedDims(deng_SCE)$PCA, col = colors[cut(deng_SCE$slingPseudotime_1,breaks=50)], pch=16, asp = 1)
lines(SlingshotDataSet(deng_SCE), lwd=2)

# Plot Slingshot pseudotime vs cell stage. 
ggplot(as.data.frame(colData(deng_SCE)), aes(x = slingPseudotime_1, y = cell_type2, 
                              colour = cell_type2)) +
    geom_quasirandom(groupOnX = FALSE) +
    scale_color_tableau() + theme_classic() +
    xlab("Slingshot pseudotime") + ylab("Timepoint") +
    ggtitle("Cells ordered by Slingshot pseudotime")
ggsave(paste0(mydir, "/pseudotime_slingshot.png"))

# Save current progress.
# save(deng_SCE, file = Rda.slingshot.path)
# To load the data, run the following command.
# load(Rda.slingshot.path)

16.5 Find temporally expressed genes

In this final analysis code chunk, we will identify temporally expressed genes, ie those genes whose expression is changing in a continuous manner over pseudotime. To do this, we will fit a GAM with a LOESS term for pseudotime. Functions for fitting and working with generalized additive models, as described in “Generalized Additive Models” (Hastie and Tibshirani, 1990). Read more about GAMs

install.packages("gam")
library(gam)

# Only look at the 1,000 most variable genes when identifying temporally expressesd genes.
# Identify the variable genes by ranking all genes by their variance.
Y <- log2(counts(deng_SCE) + 1)
var1K <- names(sort(apply(Y, 1, var),decreasing = TRUE))[1:1000]
Y <- Y[var1K, ]  # only counts for variable genes

# Fit GAM for each gene using pseudotime as independent variable.
t <- deng_SCE$slingPseudotime_1
gam.pval <- apply(Y, 1, function(z){
  d <- data.frame(z=z, t=t)
  tmp <- gam(z ~ lo(t), data=d)
  p <- summary(tmp)[4][[1]][1,5]
  p
})

# Identify genes with the most significant time-dependent model fit.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:100]  

# Prepare and plot a heatmap of the top genes that vary their expression over pseudotime.
require(clusterExperiment)
heatdata <- as.matrix(gcdata@data[rownames(gcdata@data) %in% topgenes, order(t, na.last = NA)])
heatclus <- gcdata@ident[order(t, na.last = NA)]
png(paste0(mydir, "heatmap_time_genes.png"), width=10, height=10, units = "in", res=200)
ce <- ClusterExperiment(heatdata, heatclus, transformation = log1p)
clusterExperiment::plotHeatmap(ce, clusterSamplesData = "orderSamplesValue", visualizeData = 'transformed', cexRow = 1.5, fontsize = 15)
dev.off()

16.6 Comparison of the different trajectory inference methods

How do the trajectories inferred by PCA, diffusion pseudotime, and slingshot pseudotime compare to one another?

install.packages("corrplot")
library(corrplot)

# Prepare data frame with different pseudotime measures.
df_pseudotime <- as.data.frame(colData(deng_SCE)[, c("pseudotime_PC1", "pseudotime_dpt", "slingPseudotime_1")])
colnames(df_pseudotime) <- c("PC1", "diffusion", "slingshot")

# Plot correlation between different pseudotime measures.
corrplot.mixed(cor(df_pseudotime, use = "na.or.complete"), 
               order = "hclust", tl.col = "black",
               main = "Correlation matrix for pseudotime results",
               mar = c(0, 0, 3.1, 0))

16.7 Plots of gene expression over time.

Visualize how some of the temporally expressed genes change in time.

plotExpression(deng_SCE, "Obox5", x = "PC1", 
               colour_by = "cell_type2", show_violin = FALSE,
               show_smooth = TRUE)

plotExpression(deng_SCE, "Obox5", x = "pseudotime_dpt", 
               colour_by = "cell_type2", show_violin = FALSE,
               show_smooth = TRUE)

plotExpression(deng_SCE, "Obox5", x = "slingPseudotime_1", 
               colour_by = "cell_type2", show_violin = FALSE,
               show_smooth = TRUE)

16.8 Acknowledgements

This document builds off chapter 8.4 from the Hemberg lab single cell RNA-seq course, from the Destiny vignette and from the Slingshot vignette.