18 CITE-seq and scATAC-seq
In this lab, we will look at how single cell RNA-seq and single cell protein expression measurement datasets can be jointly analyzed, as part of a CITE-Seq experiment. To learn more about how the antibody barcode matrix is computationally generated from the sequencing data, please visit CITE-seq-Count. To learn more about CITE-Seq and feature barcoding, please visit the CITE-seq site.
Note: you can increase the system memory available to Docker by going to Docker -> Preferences -> Advanced and shifting the Memory slider.
18.1 Load settings and packages
18.2 Load in the data
This vignette demonstrates new features that allow users to analyze and explore multi-modal data with Seurat. While this represents an initial release, we are excited to release significant new functionality for multi-modal datasets in the future.
Here, we analyze a dataset of 8,617 cord blood mononuclear cells (CBMCs), produced with CITE-seq, where we simultaneously measure the single cell transcriptomes alongside the expression of 11 surface proteins, whose levels are quantified with DNA-barcoded antibodies. First, we load in two count matrices : one for the RNA measurements, and one for the antibody-derived tags (ADT). You can download the ADT file here and the RNA file here
# Load in the RNA UMI matrix # Note that this dataset also contains ~5% of mouse cells, which we can use # as negative controls for the protein measurements. For this reason, the # gene expression matrix has HUMAN_ or MOUSE_ appended to the beginning of # each gene. cbmc.rna <- read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1) cbmc.rna[20400:20403,1:2]
## CTGTTTACACCGCTAG CTCTACGGTGTGGCTC ## HUMAN_hsa-mir-7515 0 0 ## HUMAN_hsa-mir-8072 0 0 ## MOUSE_0610007N19Rik 14 2 ## MOUSE_0610007P14Rik 2 9
# To make life a bit easier going forward, we're going to discard all but # the top 100 most highly expressed mouse genes, and remove the 'HUMAN_' # from the CITE-seq prefix cbmc.rna.collapsed <- CollapseSpeciesExpressionMatrix(cbmc.rna) rm(cbmc.rna) # free up memory # Load in the ADT UMI matrix cbmc.adt <- read.csv(paste0(mydir, "GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz"), sep = ",", header = TRUE, row.names = 1) # To avoid any confusion where genes and proteins might have the same name, # we'll append 'CITE_' to each of the ADT rownames. This is not strictly # necessary, but it helps for clarity cbmc.citeseq <- cbmc.adt rownames(cbmc.citeseq) <- paste0("CITE_", rownames(cbmc.adt)) # Lastly, we observed poor enrichments for CCR5, CCR7, and CD10 - and # therefore remove them from the matrix. cbmc.citeseq <- cbmc.citeseq[setdiff(rownames(cbmc.citeseq), c("CITE_CCR5", "CITE_CCR7", "CITE_CD10")), ] # Look at structure of ADT matrix. cbmc.adt[1:10,1:3]
## CTGTTTACACCGCTAG CTCTACGGTGTGGCTC AGCAGCCAGGCTCATT ## CD3 60 52 89 ## CD4 72 49 112 ## CD8 76 59 61 ## CD45RA 575 3943 682 ## CD56 64 68 87 ## CD16 161 107 117 ## CD10 156 95 113 ## CD11c 77 65 65 ## CD14 206 129 169 ## CD19 70 665 79
# What fraction of cells in the ADT and RNA matrix overlap? length(intersect(colnames(cbmc.rna.collapsed), colnames(cbmc.citeseq))) / length(union(colnames(cbmc.rna.collapsed), colnames(cbmc.citeseq)))
##  1
18.3 Setup a Seurat object, and cluster cells based on RNA expression
The steps below represent a quick clustering of the PBMCs based on the scRNA-seq data. For more detail on individual steps or more advanced options, see our PBMC clustering guided tutorial here
cbmc <- CreateSeuratObject(counts = cbmc.rna.collapsed) # This code sub-samples the data in order to speed up calculations and not use too much memory. # cbmc <- SetAllIdent(cbmc, id = "orig.ident") # cbmc <- SubsetData(cbmc, max.cells.per.ident = 2000, random.seed = 1) # cbmc.citeseq <- cbmc.citeseq[, email@example.com] # standard log-normalization cbmc <- NormalizeData(cbmc) # choose ~1k variable genes cbmc <- FindVariableFeatures(cbmc, do.plot = FALSE, y.cutoff = 0.5) # standard scaling cbmc <- ScaleData(cbmc, display.progress = FALSE) # Run PCA, select 13 PCs for tSNE visualization and graph-based clustering cbmc <- RunPCA(cbmc, pcs.print = 0) ## PCElbowPlot(cbmc) # Cluster the cells using the first 13 principal components. cbmc <- FindClusters(cbmc, dims.use = 1:13, print.output = FALSE) cbmc <- RunTSNE(cbmc, dims.use = 1:13) # Find the markers that define each cluster, and use these to annotate the # clusters, we use max.cells.per.ident to speed up the process cbmc.rna.markers <- FindAllMarkers(cbmc, max.cells.per.ident = 100, logfc.threshold = log(2), only.pos = TRUE, min.diff.pct = 0.3, do.print = F) # Examine top marker genes and identify cell types. cbmc.rna.markers %>% group_by(cluster) %>% top_n(5) genes <- c("CD3D", "CD4", "CD8A", "CD14", "MS4A1", "KLRB1", "CD34") FeaturePlot(cbmc, genes, cols.use = c("lightgrey", "blue")) # Which cluster consists of mouse cells? cbmc.rna.markers %>% filter(cluster == 3) current.cluster.ids <- 0:15 # Note, for simplicity we are merging two CD14+ Mono clusters (that differ # in the expression of HLA-DR genes), and two NK clusters (that differ in # cell cycle stage) new.cluster.ids <- c("CD4 T", "CD14+ Mono", "CD14+ Mono", "NK", "Mouse", "B", "CD8 T", "CD16+ Mono", "Unknown", "CD34+", "Mk", "Eryth", "DC", "Mouse", "pDC", "NK") cbmc@ident <- plyr::mapvalues(x = cbmc@ident, from = current.cluster.ids, to = new.cluster.ids) TSNEPlot(cbmc, do.label = TRUE, pt.size = 0.5) # Save current progress. # save(cbmc, cbmc.rna.markers, file = Rda.RNA.path) # To load the data, run the following command. # load(Rda.RNA.path)
18.4 Add the protein expression levels to the Seurat object
Seurat v2.1 allows you to store information from multiple assays in the same object, as long as the data is multi-modal (collected on the same set of cells). You can use the SetAssayData and GetAssayData accessor functions to add and fetch data from additional assays.
# We will define a CITE assay, and store raw data for it. Note that it's # convenient, but not required, to use the same name as the rowname prefix # we defined earlier. # If you are interested in how these data are internally stored, you can # check out the @assay slot, and the assay class, which is defined in # multimodal.R Note that RNA data is still stored in its normal slots, but # can also be accessed using GetAssayData and SetAssayData, using the 'RNA' # assay cbmc <- SetAssayData(cbmc, assay.type = "CITE", slot = "raw.data", new.data = cbmc.citeseq) GetAssayData(cbmc, assay.type = "CITE", slot = "raw.data")[1:3,1:3] cbmc@assay$CITE@raw.data[1:3,1:3] # Now we can repeat the preprocessing (normalization and scaling) steps that # we typically run with RNA, but modifying the 'assay.type' argument. For # CITE-seq data, we do not recommend typical LogNormalization. Instead, we # use a centered log-ratio (CLR) normalization, computed independently for # each gene. This is a slightly improved procedure from the original # publication, and we will release more advanced versions of CITE-seq # normalizations soon. cbmc <- NormalizeData(cbmc, assay.type = "CITE", normalization.method = "genesCLR") GetAssayData(cbmc, assay.type = "CITE", slot = "data")[1:3,1:3] cbmc <- ScaleData(cbmc, assay.type = "CITE", display.progress = FALSE) cbmc@assay$CITE@scale.data[1:3,1:3]
18.5 Visualize protein levels on RNA clusters
You can use the names of any ADT markers, (i.e. “CITE_CD4”), in FetchData, FeaturePlot, RidgePlot, GenePlot, DoHeatmap, or any other visualization features
# In this plot, protein (ADT) levels are on top, and RNA levels are on bottom FeaturePlot(cbmc, features.plot = c("CITE_CD3", "CITE_CD11c", "CITE_CD8", "CITE_CD16", "CD3E", "ITGAX", "CD8A", "FCGR3A"), min.cutoff = "q05", max.cutoff = "q95", nCol = 4, cols.use = c("lightgrey", "blue"), pt.size = 0.5) # How do the gene and protein expression levels compare to one another? # Compare gene and protein expression levels for the other 6 antibodies. FeaturePlot(cbmc, features.plot = c("CITE_CD4", "CITE_CD45RA", "CITE_CD56", "CITE_CD14", "CITE_CD19", "CITE_CD34", "CD4", "PTPRC", "NCAM1", "CD14", "CD19", "CD34"), min.cutoff = "q05", max.cutoff = "q95", nCol = 6, cols.use = c("lightgrey", "blue"), pt.size = 0.5) # Ridge plots are another useful visualization. RidgePlot(cbmc, features.plot = c("CITE_CD3", "CITE_CD11c", "CITE_CD8", "CITE_CD16"), nCol = 2) par(mfrow = c(1, 2)) # Draw ADT scatter plots (like biaxial plots for FACS). Note that you can # even 'gate' cells if desired by setting do.identify = TRUE or # interact with cells by setting do.hover = TRUE GenePlot(cbmc, gene1 = "CITE_CD19", gene2 = "CITE_CD3", cex = 0.5) # view relationship between protein and RNA GenePlot(cbmc, gene1 = "CITE_CD3", gene2 = "CD3E", cex.use = 0.5) # Let's plot CD4 vs CD8 levels in T cells tcells <- SubsetData(cbmc, ident.use = c("CD4 T", "CD8 T")) par(mfrow = c(1, 2)) GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", cex = 0.5) # Let's look at the raw (non-normalized) ADT counts. You can see the values # are quite high, particularly in comparison to RNA values. This is due to # the significantl higher protein copy number in cells, which significantly # reduces 'drop-out' in ADT data GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", use.raw = TRUE, cex = 0.5) # If you look a bit more closely, you'll see that our CD8 T cell cluster is # enriched for CD8 T cells, but still contains many CD4+ CD8- T cells. This # is because Naive CD4 and CD8 T cells are quite similar transcriptomically, # and the RNA dropout levels for CD4 and CD8 are quite high. This # demonstrates the challenge of defining subtle immune cell differences from # scRNA-seq data alone. # What fraction of T cells are double negative in gene expression? (CD4- and CD8-) # You can use an interactive plot to gate on the cells (do.identify = T) or use # Boolean conditions on CD4 and CD8A expression to find double negative cells. # cells <- GenePlot(tcells, gene1 = "CD4", gene2 = "CD8A", cex = 0.5, do.identify = T) # length(cells) / length(firstname.lastname@example.org) length(which(tcells@data["CD4", ] == 0 & tcells@data["CD8A", ] == 0)) # What fraction of T cells are double negative in protein expression? (CD4- and CD8-) # cells <- GenePlot(tcells, gene1 = "CITE_CD4", gene2 = "CITE_CD8", cex = 0.5, do.identify = T) # length(cells) / length(email@example.com) length(which(tcells@assay$CITE@data["CITE_CD4", ] < 1 & tcells@assay$CITE@data["CITE_CD8", ] < 1))
18.6 Identify differentially expressed proteins between clusters
mono.markers <- FindMarkers(cbmc, "CD14+ Mono", "CD16+ Mono", assay.type = "CITE", logfc.threshold = log(1.5)) head(mono.markers) # Plot the expression of the monocyte markers you identified. FeaturePlot(cbmc, features.plot = c("CITE_CD14", "CITE_CD16"), min.cutoff = "q05", max.cutoff = "q95", nCol = 4, cols.use = c("lightgrey", "blue"), pt.size = 0.5) # Note that we observe CD14 protein expression only on CD4+ T cells, as has # been previously observed in the literature tcell.markers <- FindMarkers(cbmc, ident.1 = "CD4 T", ident.2 = "CD8 T", assay.type = "CITE", logfc.threshold = log(1.5)) head(tcell.markers) # Downsample the clusters to a maximum of 300 cells each (makes the heatmap # easier to see for small clusters) cbmc.small <- SubsetData(cbmc, max.cells.per.ident = 300) # Find protein markers for all clusters, and draw a heatmap adt.markers <- FindAllMarkers(cbmc.small, assay.type = "CITE", only.pos = TRUE, print.bar = F) DoHeatmap(cbmc.small, genes.use = unique(adt.markers$gene), assay.type = "CITE", slim.col.label = TRUE, remove.key = TRUE, group.label.rot = TRUE) # You can see that our unknown cells co-express both myeloid and lymphoid # markers (true at the RNA level as well). They are likely cell clumps # (multiplets) that should be discarded. We'll remove the mouse cells now as # well cbmc <- SubsetData(cbmc, ident.remove = c("Unknown", "Mouse"))
18.7 Cluster directly on protein levels
You can also run dimensional reduction and graph-based clustering directly on CITE-seq data
# We will store the results in a new object, cbmc_cite. cbmc_cite <- RunPCA(cbmc, pc.genes = rownames(cbmc.citeseq), assay.type = "CITE", pcs.print = 0) PCAPlot(cbmc_cite, pt.size = 0.5) # Since we only have 10 markers, instead of doing PCA, we'll just use a # standard euclidean distance matrix here. Also, this provides a good # opportunity to demonstrate how to do visualization and clustering using a # custom distance matrix in Seurat. adt.data <- GetAssayData(cbmc_cite, assay.type = "CITE", slot = "data") adt.dist <- as.matrix(dist(t(adt.data))) # Why do we not use PCA to do dimensionality reduction here? # Is Euclidean distance a good distance metric in this case? PCElbowPlot(cbmc_cite) # Before we recluster the data on ADT levels, we'll stash the RNA cluster # IDs for later cbmc_cite <- StashIdent(cbmc_cite, "rnaClusterID") # Now, we rerun tSNE using our distance matrix defined only on ADT (protein) # levels. cbmc_cite <- RunTSNE(cbmc_cite, distance.matrix = adt.dist) # We can also rerun clustering using the same distance matrix. We'll start # with a very coarse clustering (resolution=0.2) cbmc_cite <- FindClusters(cbmc_cite, distance.matrix = adt.dist, print.output = FALSE, resolution = 0.2) # We can compare the RNA and protein clustering, and use this to annotate # the protein clustering (we could also of course use FindMarkers) clustering.table <- table(cbmc_cite@ident, firstname.lastname@example.org$rnaClusterID) current.cluster.ids <- 0:10 # Note, for simplicity we are merging two CD14+ Mono clusters (that differ # in the expression of HLA-DR genes), and two NK clusters (that differ in # cell cycle stage) new.cluster.ids <- c("CD4 T", "CD14+ Mono", "NK", "B", "CD8 T", "CD34+", "Unknown1", "CD16+ Mono", "Unknown2", "pDC", "Unknown3") cbmc_cite@ident <- plyr::mapvalues(x = cbmc_cite@ident, from = current.cluster.ids, to = new.cluster.ids) tsne_rnaClusters <- TSNEPlot(cbmc_cite, do.return = TRUE, group.by = "rnaClusterID", pt.size = 0.5, do.label = T) tsne_rnaClusters <- tsne_rnaClusters + ggtitle("Clustering based on scRNA-seq") + theme(plot.title = element_text(hjust = 0.5)) tsne_adtClusters <- TSNEPlot(cbmc_cite, do.return = TRUE, pt.size = 0.5, do.label = T) tsne_adtClusters <- tsne_adtClusters + ggtitle("Clustering based on ADT signal") + theme(plot.title = element_text(hjust = 0.5)) # Note: for this comparison, both the RNA and protein clustering are # visualized on a tSNE generated using the ADT distance matrix. plot_grid(tsne_rnaClusters, tsne_adtClusters, ncol = 2) # What differences if any do you see between the clustering based on scRNA-seq # and the clustering based on ADT signal? # How could we combine these datasets in a joint, integrative analysis? # Save current progress. # save(cbmc_cite, file = Rda.protein.path) # To load the data, run the following command. # load(Rda.protein.path)
The ADT-based clustering yields similar results, but with a few differences
- Clustering is improved for CD4/CD8 T cell populations, based on the robust ADT data for CD4, CD8, CD14, and CD45RA
- However, some clusters for which the ADT data does not contain good distinguishing protein markers (i.e. Mk/Ery/DC) lose separation
- Notably, our unknown populations again correspond to rare populations that co-express markers for different lineages
- Unknown 1 (Myeloid/CD4), (Myeloid/CD8), and 3 (B/CD4) likely correspond to additional doublets, whose signal was too subtle to cluster separately in scRNA-seq
- You can verify this using FindMarkers at the RNA level, as well
18.8 Additional exploration: another example of multi-modal analysis
For another nice example of multi-modal analysis, please explore this single cell ATAC-Seq vignette
This document is largely a tutorial from Seurat website, with some small modifications. The official vignette is available at CITE-Seq Seurat.