# 9 Data Wrangling scRNAseq

## 9.1 Goal

• To give you experience with the analysis of single cell RNA sequencing (scRNA-seq) including performing quality control and identifying cell type subsets.
• To introduce you to scRNA-seq analysis using the Seurat package.

## 9.2 Introduction

Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are drop-out (the excessive amount of zeros due to limiting mRNA) and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.

For this tutorial, we will be analyzing the a dataset of Non-Small Cell Lung Cancer Cells (NSCLC) freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex), using the Seurat R package (http://satijalab.org/seurat/), a popular and powerful set of tools to conduct scRNA-seq analysis in R. In this dataset, there are 7802 single cells that were sequenced on the Illumina NovaSeq 6000. Please note this tutorial borrows heavily from Seurat’s tutorials, so feel free to go through them in more detail.

When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.

library(Seurat)
library(dplyr)
library(Matrix)
library(gdata)

### 9.2.2 Read in NSCLC counts matrix.

The data for Non-Small Cell Lung Cancer Cells (NSCLC) is freely available from 10X Genomics (https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0/vdj_v1_hs_nsclc_5gex). We start by reading in the counts matrix generated by the Cell Ranger count program.

Task: Change the directory name to mydir/ where you saved your data

dirname <- "/mydir/Lab08-Data-Wrangling-scRNAseq/"
counts_matrix_filename = paste0(dirname,"/filtered_gene_bc_matrices/GRCh38/")
counts <- Read10X(data.dir = counts_matrix_filename)  # Seurat function to read in 10x count data

# To minimize memory use on the docker - choose only the first 1000 cells
counts <- counts[,1:1000]

### 9.2.3 Let’s examine the sparse counts matrix

counts[1:10, 1:3]

Here we see the upper left corner of the sparse matrix. The columns are indexed by 10x cell barcodes (each 16 nt long), and the rows are the gene names. We mentioned these matrices are sparse, here we see only zeroes (indicated by the “.” symbol); this is the most common value in these sparse matrices. Next, let us look at the dimensions of this matrix.

### 9.2.4 How big is the matrix?

dim(counts) # report number of genes (rows) and number of cells (columns)

Here we see the counts matrix has 33694 genes and 7802 cells.

### 9.2.5 How much memory does a sparse matrix take up relative to a dense matrix?

object.size(counts) # size in bytes
object.size(as.matrix(counts)) # size in bytes

We see here that the sparse matrix takes 225 Mb in memory while storing the matrix in a dense format (where all count values including zeros are stored) takes almost 10 times as much memory! This memory saving is very important, especially as data sets are now being created that are beyond a million cells. These matrices can become unmanageable without special computing resources.

In the sparse representation, we assume that the majority of count values in a matrix are zero. We only store the non-zero values. This is implemented in the Matrix package using a dgTMatrix object.

## 9.3 Filtering low-quality cells

You can learn a lot about your scRNA-seq data’s quality with simple plotting. Let’s do some plotting to look at the number of reads per cell, reads per genes, expressed genes per cell (often called complexity), and rarity of genes (cells expressing genes).

### 9.3.1 Look at the summary counts for genes and cells

counts_per_cell <- Matrix::colSums(counts)
counts_per_gene <- Matrix::rowSums(counts)
genes_per_cell <- Matrix::colSums(counts>0) # count gene only if it has non-zero reads mapped.

Task: In a similar way, can you calculate cells per genes? replace the ‘?’ in the command below

#### cells_per_gene <- Matrix::?(counts>?) # only count cells where the gene is expressed
cells_per_gene <- Matrix::rowSums(counts>0) # only count cells where the gene is expressed

colSums and rowSums are functions that work on each row or column in a matrix and return the column sums or row sums as a vector. If this is true counts_per_cell should have 1 entry per cell. Let’s make sure the length of the returned vector matches the matrix dimension for column. How would you do that? ( Hint:length() ).

Notes: 1. Matrix::colSums is a way to force functions from the Matrix library to be used. There are many libraries that implement colSums, we are forcing the one from the Matrix library to be used here to make sure it handles the dgTmatrix (sparse matrix) correctly. This is good practice.

hist(log10(counts_per_cell+1),main='counts per cell',col='wheat')
hist(log10(genes_per_cell+1), main='genes per cell', col='wheat')
plot(counts_per_cell, genes_per_cell, log='xy', col='wheat')
title('counts vs genes per cell')

Here we see examples of plotting a new plot, the histogram. R makes this really easy with the hist function. We are also transforming the values to log10 before plotting, this is done with the log10 method. When logging count data, the + 1 is used to avoid log10(0) which is not defined.

Can you a histogram of counts per gene in log10 scale?

hist(log10(counts_per_gene+1), main='counts per gene', col='wheat')
### hist(?(?+1), main='counts per gene', col='wheat')

### 9.3.2 Plot cells ranked by their number of detected genes.

Here we rank each cell by its library complexity, ie the number of genes detected per cell. This is a very useful plot as it shows the distribution of library complexity in the sequencing run. One can use this plot to investigate observations (potential cells) that are actually failed libraries (lower end outliers) or observations that are cell doublets (higher end outliers).

plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')

## 9.4 Beginning with Seurat: http://satijalab.org/seurat/

### 9.4.1 Creating a seurat object

To analyze our single cell data we will use a seurat object. Can you create an Seurat object with the 10x data and save it in an object called ‘seurat’? hint: CreateSeuratObject(). Can you include only genes that are are expressed in 3 or more cells and cells with complexity of 350 genes or more? How many genes are you left with? How many cells?

### seurat<-CreateSeuratObject(raw.data = counts, ? = 3, ? = 350, project = "10X_NSCLC")
seurat<-CreateSeuratObject(raw.data = counts, min.cells = 3, min.genes = 350, project = "10X_NSCLC")

Almost all our analysis will be on the single object, of class Seurat. This object contains various “slots” (designated by seurat@slotname) that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.

seurat@raw.data is a slot that stores the original gene count matrix. We can view the first 10 rows (genes) and the first 10 columns (cells).

seurat@raw.data[1:10,1:10]

## 9.5 Preprocessing step 1 : Filter out low-quality cells

The Seurat object initialization step above only considered cells that expressed at least 350 genes. Additionally, we would like to exclude cells that are damaged. A common metric to judge this (although by no means the only one) is the relative expression of mitochondrially derived genes. When the cells apoptose due to stress, their mitochondria becomes leaky and there is widespread RNA degradation. Thus a relative enrichment of mitochondrially derived genes can be a tell-tale sign of cell stress. Here, we compute the proportion of transcripts that are of mitochondrial origin for every cell (percent.mito), and visualize its distribution as a violin plot. We also use the GenePlot function to observe how percent.mito correlates with other metrics.

# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.  For non-UMI
# data, nUMI represents the sum of the non-normalized values within a cell We calculate the percentage of mitochondrial
# genes here and store it in percent.mito using AddMetaData.  We use object@raw.data since this represents
# non-transformed and non-log-normalized counts The % of UMI mapping to MT-genes is a common scRNA-seq QC metric.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = seurat@data), value = TRUE)
percent.mito <- Matrix::colSums(seurat@raw.data[mito.genes, ])/Matrix::colSums(seurat@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to stash QC stats.  This also allows us to plot the
# metadata values using the Seurat's VlnPlot().
head(seurat@meta.data)  # Before adding

Task: Can add the percentage if mitochondrial genes to the seurat object meta data? If you dont remember the name of the parameter you can type ?AddMetaData in the console.

#### seurat <- AddMetaData(object = seurat, ? = ?, col.name = "percent.mito")
VlnPlot(object = seurat, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3,  point.size.use = 0.1)

Here we calculated the percent mitochondrial reads and added it to the Seurat object in the slot named meta.data. This allowed us to plot using the violin plot function provided by Seurat.

A third metric we use is the number of house keeping genes expressed in a cell. These genes reflect commomn processes active in a cell and hence are a good global quality measure. They are also abundant and are usually steadliy expressed in cells, thus less sensitive to the high dropout.

# Load the the list of house keeping genes
hkgenes <- read.table(paste0(dirname,"/tirosh_house_keeping.txt"), skip = 2)
hkgenes <- as.vector(hkgenes\$V1)

hkgenes.found <- which(toupper(rownames(seurat@data)) %in% hkgenes)

Possible task: Feel like challenging yourself? write the code to do the following: 1. Sum the number of detected house keeping genes for each cell 2. Add this information as meta data to seurat 3. plot all metrics: “nGene”, “nUMI”, “percent.mito”,“n.exp.hkgenes” using VlnPlot 4. Scroll down to see if you got it!

If you feel like going through a more guided version, scroll down and follow the instructions.

Alternative task: Sum the number of detected house keeping genes for each cell, then add this to the meta data

#### n.expressed.hkgenes <- ?(seurat@data[hkgenes.found, ] > ?)
#### seurat <- AddMetaData(object = ?, ? = ?, col.name = "n.exp.hkgenes")
n.expressed.hkgenes <- Matrix::colSums(seurat@data[hkgenes.found, ] > 0)
VlnPlot(object = seurat, features.plot = c("nGene", "nUMI", "percent.mito","n.exp.hkgenes"), nCol = 4,  point.size.use = 0.1)

Is there a correlation between the measurements? For example, number of UMIs with number of genes? Possible task: Feel like challenging yourself? write the code to do the following: Can you plot the nGene vs nUMI (hint:GenePlot)? What is the correlation? Do you see a strange subpopulation? What do you think happened with these cells?

seurat

Scroll down to see the command

Alternative task: Can you plot the nGene vs nUMI (hint:GenePlot)? What is the correlation? Do you see a strange subpopulation? What do you think happened with these cells?

### GenePlot(object = seurat, gene1 = ?, gene2 = ?)
GenePlot(object = seurat, gene1 = "nUMI", gene2 = "nGene")

## 9.6 Examine contents of Seurat object

str(seurat)

These are the slots in the Seurat object. Some of the slots are automatically updated by Seurat as you move through analysis. Take a moment to look through the information, knowing the slots allow you to leverage work Seurat has already done for you.

VlnPlot(object = seurat, features.plot = c("nGene"), group.by = c('orig.ident'))

Here we plot the number of genes per cell by what Seurat calls orig.ident. Identity is a concept that is used in the Seurat object to refer to the cell identity. In this case, the cell identity is 10X_NSCLC, but after we cluster the cells, the cell identity will be whatever cluster the cell belongs to. We will see how identity updates as we go throught the analysis.

Next, let’s filter the cells based on the quality control metrics. Filter based on: 1. nGene 2. percent.mito 3. n.exp.hkgenes Task: Change the thresholds to what you think they should be according to the violin plots

VlnPlot(object = seurat, features.plot = c("nGene","percent.mito","n.exp.hkgenes"), nCol = 3,  point.size.use = 0.1)
#### seurat <- FilterCells(object = seurat, subset.names = c("nGene", "percent.mito","n.exp.hkgenes"), low.thresholds = c(350, -Inf,55), high.thresholds = c(5000, 0.1, Inf))

How many cells are you left with?

seurat

### 9.6.1 Preprocessing step 2 : Expression normalization

After removing unwanted genes cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. There have been many methods to normalize the data, but this is the simplest and the most intuitive. The division by total expression is done to change all expression counts to a relative measure, since experience has suggested that technical factors (e.g. capture rate, efficiency of reverse transcription) are largely responsible for the variation in the number of molecules per cell, although genuine biological factors (e.g. cell cycle stage, cell size) also play a smaller, but non-negligible role. The log-transformation is a commonly used transformation that has many desirable properties, such as variance stabilization (can you think of others?).

seurat <- NormalizeData(object = seurat, normalization.method = "LogNormalize", scale.factor = 1e4)

Well there you have it! A filtered and normalized gene-expression data set. A great accomplishment for your first dive into scRNA-Seq analysis. Well done!

## 9.7 Detection of variable genes across the single cells

Seurat calculates highly variable genes and focuses on these for downstream analysis. FindVariableGenes calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression. This function is unchanged from (Macosko et al.), but new methods for variable gene expression identification are coming soon. We suggest that users set these parameters to mark visual outliers on the dispersion plot, but the exact parameter settings may vary based on the data type, heterogeneity in the sample, and normalization strategy. The parameters here identify ~3,000 variable genes, and represent typical parameter settings for UMI data that is normalized to a total of 1e4 molecules.

#seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5, num.bin=20)  # if this fails, experiment with the num.bin setting
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 1, y.cutoff = 0.5, num.bin=20)  # if this fails, experiment with the num.bin setting
length(seurat@var.genes)

We can see the Seurat object slots have updated for the FindVariableGenes section. Let’s use the slot to see how many variable genes we found.

str(seurat)
length(x = seurat@var.genes)

Task: how does changing the parameters for find variable genes function changes the number of the found genes? Play with the parameters - what makes the function find more variable genes? less?

seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 1, y.cutoff = 0.5, num.bin=40, do.plot = FALSE)
length(seurat@var.genes)
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 1, y.cutoff = 0.5, num.bin=10, do.plot = FALSE)
length(seurat@var.genes)
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 1, y.cutoff = 0.5, num.bin=20, do.plot = FALSE)
length(seurat@var.genes)
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.2, x.high.cutoff = 1, y.cutoff = 0.5, num.bin=20, do.plot = FALSE)
length(seurat@var.genes)
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 0.1, y.cutoff = 0.5, num.bin=20, do.plot = FALSE)
length(seurat@var.genes)
seurat <- FindVariableGenes(object = seurat, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 1, y.cutoff = 2, num.bin=10, do.plot = FALSE)
length(seurat@var.genes)

## 9.8 Gene set expression across cells

Sometimes we want to ask what is the expression of a set of a genes across cells. This set of genes may make up a gene expression program we are interested in. Another benefit at looking at gene sets is it reduces the effects of drop outs.

Below, we look at genes involved in: T cells, the cell cycle and the stress signature upon cell dissociation. We calculate these genes average expression levels on the single cell level, while controlling for technical effects.

# Read in a list of cell cycle markers, from Tirosh et al, 2015.
# We can segregate this list into markers of G2/M phase and markers of S phase.
s.genes <- cc.genes[1:43]
g2m.genes <- cc.genes[44:97]

seurat <- CellCycleScoring(object = seurat, s.genes = s.genes, g2m.genes = g2m.genes, set.ident = T)

Task: Use markers for dissociation to calculate dissociation score

# Genes upregulated during dissociation of tissue into single cells.
genes.dissoc <- c("ATF3", "BTG2", "CEBPB", "CEBPD", "CXCL3", "CXCL2", "CXCL1", "DNAJA1", "DNAJB1", "DUSP1", "EGR1", "FOS", "FOSB", "HSP90AA1", "HSP90AB1", "HSPA1A", "HSPA1B", "HSPA1A", "HSPA1B", "HSPA8", "HSPB1", "HSPE1", "HSPH1", "ID3", "IER2", "JUN", "JUNB", "JUND", "MT1X", "NFKBIA", "NR4A1", "PPP1R15A", "SOCS3", "ZFP36")
#### seurat <- ?(?, genes.list = list(?), ctrl.size = 20, enrich.name = "genes_dissoc")
seurat <- AddModuleScore(seurat, genes.list = list(genes.dissoc), ctrl.size = 20, enrich.name = "genes_dissoc")

Task: Plot the correlation between number of genes and S score. How do we know the name of these scores in the seurat meta data?

### GenePlot(seurat, "S.Score", "nGene")
GenePlot(seurat, "S.Score", "nGene")

Bonus: Can you cluster the data based on the variable genes alone?

Congratulations! You can identify and visualize cell subsets and the marker genes that describe these cell subsets. This is a very powerful analysis pattern often seen in publications. Well done!