18 CITE-seq and scATAC-seq

In this lab, we will look at how single cell RNA-seq and single cell protein expression measurement datasets can be jointly analyzed, as part of a CITE-Seq experiment. To learn more about how the antibody barcode matrix is computationally generated from the sequencing data, please visit CITE-seq-Count. To learn more about CITE-Seq and feature barcoding, please visit the CITE-seq site.

Note: you can increase the system memory available to Docker by going to Docker -> Preferences -> Advanced and shifting the Memory slider.

18.1 Load settings and packages

18.2 Load in the data

This vignette demonstrates new features that allow users to analyze and explore multi-modal data with Seurat. While this represents an initial release, we are excited to release significant new functionality for multi-modal datasets in the future.

Here, we analyze a dataset of 8,617 cord blood mononuclear cells (CBMCs), produced with CITE-seq, where we simultaneously measure the single cell transcriptomes alongside the expression of 11 surface proteins, whose levels are quantified with DNA-barcoded antibodies. First, we load in two count matrices : one for the RNA measurements, and one for the antibody-derived tags (ADT). You can download the ADT file here and the RNA file here

# Load in the RNA UMI matrix

# Note that this dataset also contains ~5% of mouse cells, which we can use
# as negative controls for the protein measurements. For this reason, the
# gene expression matrix has HUMAN_ or MOUSE_ appended to the beginning of
# each gene.
cbmc.rna <- read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1)
## HUMAN_hsa-mir-7515                 0                0
## HUMAN_hsa-mir-8072                 0                0
## MOUSE_0610007N19Rik               14                2
## MOUSE_0610007P14Rik                2                9
# To make life a bit easier going forward, we're going to discard all but
# the top 100 most highly expressed mouse genes, and remove the 'HUMAN_'
# from the CITE-seq prefix
cbmc.rna.collapsed <- CollapseSpeciesExpressionMatrix(cbmc.rna)
rm(cbmc.rna)  # free up memory

# Load in the ADT UMI matrix
cbmc.adt <- read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1)

# To avoid any confusion where genes and proteins might have the same name,
# we'll append 'CITE_' to each of the ADT rownames. This is not strictly
# necessary, but it helps for clarity
cbmc.citeseq <- cbmc.adt
rownames(cbmc.citeseq) <- paste0("CITE_", rownames(cbmc.adt))

# Lastly, we observed poor enrichments for CCR5, CCR7, and CD10 - and
# therefore remove them from the matrix.
cbmc.citeseq <- cbmc.citeseq[setdiff(rownames(cbmc.citeseq), c("CITE_CCR5", "CITE_CCR7", "CITE_CD10")), ]

# Look at structure of ADT matrix.
## CD3                  60               52               89
## CD4                  72               49              112
## CD8                  76               59               61
## CD45RA              575             3943              682
## CD56                 64               68               87
## CD16                161              107              117
## CD10                156               95              113
## CD11c                77               65               65
## CD14                206              129              169
## CD19                 70              665               79
# What fraction of cells in the ADT and RNA matrix overlap?
length(intersect(colnames(cbmc.rna.collapsed), colnames(cbmc.citeseq))) / length(union(colnames(cbmc.rna.collapsed), colnames(cbmc.citeseq)))
## [1] 1

18.3 Setup a Seurat object, and cluster cells based on RNA expression

The steps below represent a quick clustering of the PBMCs based on the scRNA-seq data. For more detail on individual steps or more advanced options, see our PBMC clustering guided tutorial here

cbmc <- CreateSeuratObject(counts = cbmc.rna.collapsed)

# This code sub-samples the data in order to speed up calculations and not use too much memory.
# cbmc <- SetAllIdent(cbmc, id = "orig.ident")
# cbmc <- SubsetData(cbmc, max.cells.per.ident = 2000, random.seed = 1)
# cbmc.citeseq <- cbmc.citeseq[, cbmc@cell.names]

# standard log-normalization
cbmc <- NormalizeData(cbmc)

# choose ~1k variable genes
cbmc <- FindVariableFeatures(cbmc, do.plot = FALSE, y.cutoff = 0.5)

# standard scaling 
cbmc <- ScaleData(cbmc, display.progress = FALSE)

# Run PCA, select 13 PCs for tSNE visualization and graph-based clustering
cbmc <- RunPCA(cbmc, pcs.print = 0)
## PCElbowPlot(cbmc)

# Cluster the cells using the first 13 principal components.
cbmc <- FindClusters(cbmc, dims.use = 1:13, print.output = FALSE)
cbmc <- RunTSNE(cbmc, dims.use = 1:13)

# Find the markers that define each cluster, and use these to annotate the
# clusters, we use max.cells.per.ident to speed up the process
cbmc.rna.markers <- FindAllMarkers(cbmc, max.cells.per.ident = 100, logfc.threshold = log(2), only.pos = TRUE, min.diff.pct = 0.3, do.print = F)

# Examine top marker genes and identify cell types.
cbmc.rna.markers %>% group_by(cluster) %>% top_n(5)
genes <- c("CD3D", "CD4", "CD8A", "CD14", "MS4A1", "KLRB1", "CD34")
FeaturePlot(cbmc, genes, cols.use = c("lightgrey", "blue"))

# Which cluster consists of mouse cells?
cbmc.rna.markers %>% filter(cluster == 3)

current.cluster.ids <- 0:15
# Note, for simplicity we are merging two CD14+ Mono clusters (that differ
# in the expression of HLA-DR genes), and two NK clusters (that differ in
# cell cycle stage)
new.cluster.ids <- c("CD4 T", "CD14+ Mono", "CD14+ Mono", "NK", "Mouse", "B", "CD8 T", "CD16+ Mono", "Unknown", "CD34+", "Mk", "Eryth", "DC", "Mouse", "pDC", "NK")
cbmc@ident <- plyr::mapvalues(x = cbmc@ident, from = current.cluster.ids, to = new.cluster.ids)

TSNEPlot(cbmc, do.label = TRUE, pt.size = 0.5)

# Save current progress.
# save(cbmc, cbmc.rna.markers, file = Rda.RNA.path)
# To load the data, run the following command.
# load(Rda.RNA.path)

18.4 Add the protein expression levels to the Seurat object

Seurat v2.1 allows you to store information from multiple assays in the same object, as long as the data is multi-modal (collected on the same set of cells). You can use the SetAssayData and GetAssayData accessor functions to add and fetch data from additional assays.

# We will define a CITE assay, and store raw data for it. Note that it's
# convenient, but not required, to use the same name as the rowname prefix
# we defined earlier.

# If you are interested in how these data are internally stored, you can
# check out the @assay slot, and the assay class, which is defined in
# multimodal.R Note that RNA data is still stored in its normal slots, but
# can also be accessed using GetAssayData and SetAssayData, using the 'RNA'
# assay

cbmc <- SetAssayData(cbmc, assay.type = "CITE", slot = "raw.data", new.data = cbmc.citeseq)
GetAssayData(cbmc, assay.type = "CITE", slot = "raw.data")[1:3,1:3]

# Now we can repeat the preprocessing (normalization and scaling) steps that
# we typically run with RNA, but modifying the 'assay.type' argument.  For
# CITE-seq data, we do not recommend typical LogNormalization. Instead, we
# use a centered log-ratio (CLR) normalization, computed independently for
# each gene.  This is a slightly improved procedure from the original
# publication, and we will release more advanced versions of CITE-seq
# normalizations soon.
cbmc <- NormalizeData(cbmc, assay.type = "CITE", normalization.method = "genesCLR")
GetAssayData(cbmc, assay.type = "CITE", slot = "data")[1:3,1:3]

cbmc <- ScaleData(cbmc, assay.type = "CITE", display.progress = FALSE)

18.5 Visualize protein levels on RNA clusters

You can use the names of any ADT markers, (i.e. “CITE_CD4”), in FetchData, FeaturePlot, RidgePlot, GenePlot, DoHeatmap, or any other visualization features

# In this plot, protein (ADT) levels are on top, and RNA levels are on bottom
FeaturePlot(cbmc, features.plot = c("CITE_CD3", "CITE_CD11c", "CITE_CD8", "CITE_CD16", 
    "CD3E", "ITGAX", "CD8A", "FCGR3A"), min.cutoff = "q05", max.cutoff = "q95", 
    nCol = 4, cols.use = c("lightgrey", "blue"), pt.size = 0.5)
# How do the gene and protein expression levels compare to one another?

# Compare gene and protein expression levels for the other 6 antibodies.
FeaturePlot(cbmc, features.plot = c("CITE_CD4", "CITE_CD45RA", "CITE_CD56", "CITE_CD14", "CITE_CD19", "CITE_CD34", "CD4", "PTPRC", "NCAM1", "CD14", "CD19", "CD34"), min.cutoff = "q05", max.cutoff = "q95", nCol = 6, cols.use = c("lightgrey", "blue"), pt.size = 0.5)

# Ridge plots are another useful visualization.
RidgePlot(cbmc, features.plot = c("CITE_CD3", "CITE_CD11c", "CITE_CD8", "CITE_CD16"), 
    nCol = 2)

par(mfrow = c(1, 2))
# Draw ADT scatter plots (like biaxial plots for FACS). Note that you can
# even 'gate' cells if desired by setting do.identify = TRUE or 
# interact with cells by setting do.hover = TRUE
GenePlot(cbmc, gene1 = "CITE_CD19", gene2 = "CITE_CD3", cex = 0.5)

# view relationship between protein and RNA
GenePlot(cbmc, gene1 = "CITE_CD3", gene2 = "CD3E", cex.use = 0.5)

# Let's plot CD4 vs CD8 levels in T cells
tcells <- SubsetData(cbmc, ident.use = c("CD4 T", "CD8 T"))

par(mfrow = c(1, 2))
GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", cex = 0.5)

# Let's look at the raw (non-normalized) ADT counts. You can see the values
# are quite high, particularly in comparison to RNA values. This is due to
# the significantl higher protein copy number in cells, which significantly
# reduces 'drop-out' in ADT data
GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", use.raw = TRUE, cex = 0.5)

# If you look a bit more closely, you'll see that our CD8 T cell cluster is
# enriched for CD8 T cells, but still contains many CD4+ CD8- T cells.  This
# is because Naive CD4 and CD8 T cells are quite similar transcriptomically,
# and the RNA dropout levels for CD4 and CD8 are quite high.  This
# demonstrates the challenge of defining subtle immune cell differences from
# scRNA-seq data alone.

# What fraction of T cells are double negative in gene expression? (CD4- and CD8-)
# You can use an interactive plot to gate on the cells (do.identify = T) or use 
# Boolean conditions on CD4 and CD8A expression to find double negative cells.
# cells <- GenePlot(tcells, gene1 = "CD4", gene2 = "CD8A", cex = 0.5, do.identify = T)
# length(cells) / length(tcells@cell.names)
length(which(tcells@data["CD4", ] == 0 & tcells@data["CD8A", ] == 0))

# What fraction of T cells are double negative in protein expression? (CD4- and CD8-)
# cells <- GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", cex = 0.5, do.identify = T)
# length(cells) / length(tcells@cell.names)
length(which(tcells@assay$CITE@data["CITE_CD4", ] < 1 & 
               tcells@assay$CITE@data["CITE_CD8", ] < 1))

18.6 Identify differentially expressed proteins between clusters

mono.markers <- FindMarkers(cbmc, "CD14+ Mono", "CD16+ Mono", assay.type = "CITE", logfc.threshold = log(1.5))

# Plot the expression of the monocyte markers you identified.
FeaturePlot(cbmc, features.plot = c("CITE_CD14", "CITE_CD16"), min.cutoff = "q05", max.cutoff = "q95", nCol = 4, cols.use = c("lightgrey", "blue"), pt.size = 0.5)

# Note that we observe CD14 protein expression only on CD4+ T cells, as has
# been previously observed in the literature
tcell.markers <- FindMarkers(cbmc, ident.1 = "CD4 T", ident.2 = "CD8 T", assay.type = "CITE", logfc.threshold = log(1.5))

# Downsample the clusters to a maximum of 300 cells each (makes the heatmap
# easier to see for small clusters)
cbmc.small <- SubsetData(cbmc, max.cells.per.ident = 300)
# Find protein markers for all clusters, and draw a heatmap
adt.markers <- FindAllMarkers(cbmc.small, assay.type = "CITE", only.pos = TRUE, 
    print.bar = F)

DoHeatmap(cbmc.small, genes.use = unique(adt.markers$gene), assay.type = "CITE", 
    slim.col.label = TRUE, remove.key = TRUE, group.label.rot = TRUE)

# You can see that our unknown cells co-express both myeloid and lymphoid
# markers (true at the RNA level as well). They are likely cell clumps
# (multiplets) that should be discarded. We'll remove the mouse cells now as
# well
cbmc <- SubsetData(cbmc, ident.remove = c("Unknown", "Mouse"))

18.7 Cluster directly on protein levels

You can also run dimensional reduction and graph-based clustering directly on CITE-seq data

# We will store the results in a new object, cbmc_cite.
cbmc_cite <- RunPCA(cbmc, pc.genes = rownames(cbmc.citeseq), assay.type = "CITE", pcs.print = 0)
PCAPlot(cbmc_cite, pt.size = 0.5)

# Since we only have 10 markers, instead of doing PCA, we'll just use a
# standard euclidean distance matrix here.  Also, this provides a good
# opportunity to demonstrate how to do visualization and clustering using a
# custom distance matrix in Seurat.
adt.data <- GetAssayData(cbmc_cite, assay.type = "CITE", slot = "data")
adt.dist <- as.matrix(dist(t(adt.data)))
# Why do we not use PCA to do dimensionality reduction here?
# Is Euclidean distance a good distance metric in this case?

# Before we recluster the data on ADT levels, we'll stash the RNA cluster
# IDs for later
cbmc_cite <- StashIdent(cbmc_cite, "rnaClusterID")

# Now, we rerun tSNE using our distance matrix defined only on ADT (protein)
# levels.
cbmc_cite <- RunTSNE(cbmc_cite, distance.matrix = adt.dist)

# We can also rerun clustering using the same distance matrix. We'll start
# with a very coarse clustering (resolution=0.2)
cbmc_cite <- FindClusters(cbmc_cite, distance.matrix = adt.dist, print.output = FALSE, resolution = 0.2)

# We can compare the RNA and protein clustering, and use this to annotate
# the protein clustering (we could also of course use FindMarkers)
clustering.table <- table(cbmc_cite@ident, cbmc_cite@meta.data$rnaClusterID)

current.cluster.ids <- 0:10
# Note, for simplicity we are merging two CD14+ Mono clusters (that differ
# in the expression of HLA-DR genes), and two NK clusters (that differ in
# cell cycle stage)
new.cluster.ids <- c("CD4 T", "CD14+ Mono", "NK", "B", "CD8 T", "CD34+", "Unknown1", 
    "CD16+ Mono", "Unknown2", "pDC", "Unknown3")
cbmc_cite@ident <- plyr::mapvalues(x = cbmc_cite@ident, from = current.cluster.ids, 
    to = new.cluster.ids)

tsne_rnaClusters <- TSNEPlot(cbmc_cite, do.return = TRUE, group.by = "rnaClusterID", 
    pt.size = 0.5, do.label = T)
tsne_rnaClusters <- tsne_rnaClusters + ggtitle("Clustering based on scRNA-seq") + 
    theme(plot.title = element_text(hjust = 0.5))

tsne_adtClusters <- TSNEPlot(cbmc_cite, do.return = TRUE, pt.size = 0.5, do.label = T)
tsne_adtClusters <- tsne_adtClusters + ggtitle("Clustering based on ADT signal") + 
    theme(plot.title = element_text(hjust = 0.5))

# Note: for this comparison, both the RNA and protein clustering are
# visualized on a tSNE generated using the ADT distance matrix.
plot_grid(tsne_rnaClusters, tsne_adtClusters, ncol = 2)

# What differences if any do you see between the clustering based on scRNA-seq
# and the clustering based on ADT signal?
# How could we combine these datasets in a joint, integrative analysis?

# Save current progress.
# save(cbmc_cite, file = Rda.protein.path)
# To load the data, run the following command.
# load(Rda.protein.path)

The ADT-based clustering yields similar results, but with a few differences

  • Clustering is improved for CD4/CD8 T cell populations, based on the robust ADT data for CD4, CD8, CD14, and CD45RA
  • However, some clusters for which the ADT data does not contain good distinguishing protein markers (i.e. Mk/Ery/DC) lose separation
  • Notably, our unknown populations again correspond to rare populations that co-express markers for different lineages
  • Unknown 1 (Myeloid/CD4), (Myeloid/CD8), and 3 (B/CD4) likely correspond to additional doublets, whose signal was too subtle to cluster separately in scRNA-seq
  • You can verify this using FindMarkers at the RNA level, as well

18.8 Additional exploration: another example of multi-modal analysis

For another nice example of multi-modal analysis, please explore this single cell ATAC-Seq vignette

18.9 Acknowledgements

This document is largely a tutorial from Seurat website, with some small modifications. The official vignette is available at CITE-Seq Seurat.