1 Introduction

1.1 COURSE OVERVIEW

In recent years, single-cell RNA-seq (scRNA-seq) has become widely used for transcriptome analysis in many areas of biology. In contrast to bulk RNA-seq, scRNA-seq provides quantitative measurements of the expression of every gene in a single cell. However, analyzing scRNA-seq data requires novel methods, and some of the assumptions underlying methods developed for bulk RNA-seq experiments are no longer valid. In this course we will cover all steps of scRNA-seq processing, starting from the raw reads coming off the sequencer. The course covers common analysis strategies using state-of-the-art methods, and we also discuss the central biological questions that can be addressed using scRNA-seq.

1.2 TARGETED AUDIENCE & ASSUMED BACKGROUND

This course is aimed at researchers and technical workers who are or will be analyzing scRNA-seq data. The material is suitable both for experimentalists who want to learn more about data analysis and for computational biologists who want to learn about scRNA-seq methods. Examples demonstrated in this course can be applied to any experimental protocol or biological system.

The requirements for this course are:

1. Working knowledge of Unix (managing files, running programs).
2. Programming experience in R (writing a function, basic I/O operations, variable types, using packages). Bioconductor experience is a plus.
3. Familiarity with next-generation sequencing data and its analysis (using alignment and quantification tools for bulk sequencing data).

1.3 COURSE FORMAT

The course runs over five days. Each day includes a lecture and a laboratory component. The lectures introduce the topics of discussion, and the laboratory sessions focus on practical, hands-on analysis of scRNA-seq data. These sessions combine mirroring exercises, in which you follow the instructor to learn a skill, with individual exercises in which you apply that skill on your own. During and after each exercise, the interpretation of results will be discussed as a group. Computing will be done using a combination of tools installed on attendees' laptop computers and web resources accessed via a web browser.

1.4 GETTING STARTED

1.5 SESSION CONTENT

1.5.1 Monday – Classes from 08:00 to 16:00 (1 hr lunch break, 40 min of coffee breaks in total)

Shared Google doc - course notes, ideas/questions/challenges/interesting topics you would like to explore.

1.5.1.1 Lecture 1 – scRNA-Seq experimental design (Orr)

  • Overview of course
  • General introduction: cell atlas overviews
  • Comparison of Bulk and single cell RNA-Seq
  • Overview of available scRNA-seq technologies (10x) and experimental protocols
  • scRNA-Seq experimental design and analysis workflow?

1.5.1.2 Lab 1 – Understanding sequencing raw data, downloading Docker if not done already (Kirk)

1.5.1.3 Lab based around data wrangling from public data repositories: get data from 10x website, single cell portal, from GEO (fastqs, counts)

  • Shell and Unix commands to navigate directories, create folders, open files
  • Raw file formats

1.5.1.4 Lecture 2 - Intro to data processing: from BCL file to BAM file; transcriptome quantification: from BAM file to counts (Dana)

  • scRNA-Seq processing workflow starting with choice of sequencer (NextSeq, HiSeq, MiSeq), barcode swapping and BCL files
  • Overview of popular tools and algorithms
  • Common single-cell analyses and interpretation
  • Sequencing data: alignment and quality control
  • Exploring alignments: where reads fall, mutations, splicing
  • Read & UMI counting (including kallisto alignment-free pseudocounts), how RSEM works (length dependence, sequencing depth, multimapping reads), CellRanger (dropEst), bustools
  • 10x barcode structure and links to Perturb-seq
  • Gene length & coverage
  • Gene expression units (count data Smart-seq2 counts or 10x UMIs vs expression data)
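
As a rough illustration of the UMI counting idea listed above, the sketch below collapses reads that share a cell barcode, gene and UMI into a single molecule. The `(cell_barcode, umi, gene)` record layout, the function name and the toy values are assumptions made for this example. Real pipelines such as CellRanger or bustools also correct sequencing errors in barcodes and UMIs, which is left out here.

```python
from collections import defaultdict

def count_umis(reads):
    """Collapse reads into UMI counts per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples. This record
    layout is a hypothetical simplification of aligned, tagged reads.
    """
    molecules = defaultdict(set)
    for barcode, umi, gene in reads:
        # Reads from the same original molecule share barcode, gene and UMI,
        # so storing UMIs in a set counts each molecule once.
        molecules[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in molecules.items()}

# Toy input with illustrative values only
reads = [
    ("AAAC", "TTG", "GAPDH"),
    ("AAAC", "TTG", "GAPDH"),  # duplicate read of the same molecule
    ("AAAC", "CCA", "GAPDH"),
    ("AAAC", "TTG", "ACTB"),
    ("GGGT", "TTG", "GAPDH"),
]
umi_counts = count_umis(reads)
for (barcode, gene), n in sorted(umi_counts.items()):
    print(barcode, gene, n)
```

The duplicate read contributes to the read count but not to the UMI count, which is the main difference between the two kinds of count data mentioned above.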

1.5.1.5 Lab 2 – Processing raw scRNA-Seq data (Dana), Docker setup (Kirk)

  • Data outputs from different scRNAseq technologies (10x, Smart-seq2) - process both?
  • Demultiplexing sequencing data
  • Read quality control (CellRanger, dropEst, FastQC)
  • Run bowtie2 on 2 wells to demonstrate alignment
  • Read alignment and visualization (kallisto, RSEM, IGV)
  • Demultiplexing
  • FastQC
  • Align (STAR/TopHat/kallisto)
  • IGV - scope to be decided; possible uses include mutation detection, copying sequences, and searching for alternative splicing.

1.5.1.6 Flash talks (1.5 hr, break into 2 groups of 13) placed into a dropbox

One slide advertising or summarizing your poster, so that you can introduce yourselves and we can get to know each other. 2 minutes each, no questions. Two sessions, 15 people each.

1.5.2 Tuesday – Classes from 08:00 to 16:00

1.5.2.1 Lab 3 - Introduction to R (Kirk)

  • Some R overview slides, https://r4ds.had.co.nz/
  • Installing packages
  • Data-types
  • Data manipulation, slicing
  • Strings manipulations
  • Introducing object oriented programming / S4 objects
  • Visualization tools
  • Bonus: recreate Seurat's FeaturePlot using plain ggplot
  • Bonus: run RSEM on Dana’s bam files if you are bored

1.5.2.2 Lecture 3 - Expression QC, normalisation and gene-level batch correction (Orr)

  • What CellRanger does for quality filtering
  • PBMC data
  • Normalisation methods https://www.nature.com/articles/nmeth.4292
  • Doublets, empty droplets, CellBender
  • Barcode swapping
  • Regression with technical covariates
  • What about imputation?

1.5.2.3 Lab 4 – Data wrangling for scRNAseq data (Dana)

  • Data structures and file formats for single-cell data
  • Quality control of cells and genes (doublets, ambient, empty drops)
  • Data exploration: violin plots…
  • Introducing Seurat object
  • Genes
  • Housekeeping genes
  • Mitochondrial genes
  • Filter
  • Normalize
  • Find variable genes
  • Scaling
  • Regression
  • Calculate a signature
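
To make the mitochondrial-gene, filter and normalize steps above concrete, here is a minimal sketch in plain Python. It assumes human gene symbols, where mitochondrial genes start with `MT-`. The thresholds, function names and toy counts are hypothetical, and the scaling to 10,000 counts per cell followed by `log1p` is one common log-normalization choice rather than the only option. In the lab itself these steps are carried out with Seurat in R.

```python
import math

def mito_fraction(cell_counts):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(cell_counts.values())
    mito = sum(c for gene, c in cell_counts.items() if gene.startswith("MT-"))
    return mito / total if total else 0.0

def filter_cells(cells, min_counts=3, max_mito=0.2):
    """Keep cells with enough counts and a low mitochondrial fraction.

    Both thresholds are illustrative; real values depend on the dataset.
    """
    return {
        name: counts
        for name, counts in cells.items()
        if sum(counts.values()) >= min_counts and mito_fraction(counts) <= max_mito
    }

def log_normalize(cell_counts, scale_factor=10_000):
    """Scale counts to a common library size, then apply log(1 + x)."""
    total = sum(cell_counts.values())  # non-zero for cells that passed the filter
    return {g: math.log1p(c / total * scale_factor) for g, c in cell_counts.items()}

# Toy counts with illustrative values only
cells = {
    "cell_1": {"GAPDH": 6, "ACTB": 3, "MT-CO1": 1},  # 10 counts, 10% mitochondrial
    "cell_2": {"GAPDH": 1, "MT-CO1": 4},             # 80% mitochondrial: removed
    "cell_3": {"ACTB": 1},                           # too few counts: removed
}
kept = filter_cells(cells)
normalized = {name: log_normalize(counts) for name, counts in kept.items()}
print(sorted(kept))
```

Filtering before normalizing matters here: a cell with almost no counts would otherwise be scaled up to the same library size as a healthy cell.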

1.5.2.4 Flash talks (1.5 hr, break into 2 groups of 13) placed into a dropbox

1 slide advertising or summarizing the poster. So you can introduce yourselves and we can get to know each other. No questions, 2 minutes. Two sessions, 15 people each.

1.5.3 Wednesday – Classes from 08:00 to 16:00

1.5.3.1 Lecture 4 (may start late Tuesday) - Identifying cell populations (Kirk)

  • Feature selection
  • Dimensionality reduction
  • Clustering and assigning identity (Louvain, NMF, topic models, variational autoencoder)
  • Differential expression tests

1.5.3.2 Lab 5 – Feature selection & Clustering analysis (Kirk)

  • Parameters and clustering
  • Comparison of feature selection methods

1.5.3.3 Lecture 5 - Batch effects correction (Orr)

  • Batch correction methods (regress out batch, scaling within batch, Seurat v3, MNN, LIGER, Harmony, scVI, scGen)
  • Evaluation methods for batch correction (ARI, average silhouette width, kBET…)
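
As an illustration of one of the evaluation metrics listed above, the following sketch computes the adjusted Rand index (ARI) between two labelings of the same cells from their contingency counts. The function name and the toy labels are assumptions for this example. In a batch-correction setting one would typically compare cluster assignments against known cell-type labels, where a higher ARI indicates better agreement.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items grouped together in both labelings, in A only, in B only
    sum_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every item in one cluster
        return 1.0
    return (sum_ab - expected) / (max_index - expected)

# Toy labels with illustrative values only
cell_types = ["T", "T", "B", "B"]
same_grouping = [1, 1, 0, 0]       # same partition, different label names
crossed_grouping = [0, 1, 0, 1]    # splits every cell type across clusters

ari_same = adjusted_rand_index(cell_types, same_grouping)
ari_crossed = adjusted_rand_index(cell_types, crossed_grouping)
print(ari_same, ari_crossed)
```

Because the index is corrected for chance, it does not depend on how clusters are named, and it can drop below zero when two labelings agree less than random assignment would.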

1.5.3.4 Lab 6 - Correcting batch effects (Orr)

  • Comparison of batch correction methods on the Seurat pancreas data (use Seurat Wrappers?)

1.5.4 Thursday – Classes from 08:00 to 16:00

  • Deciding on discussion topics for the next day based on the shared Google doc.

1.5.4.1 Lecture 6 - Advanced topics (Kirk)

  • Pseudotime inference
  • Differential expression through pseudotime
  • Deep learning or spatial data, depending on the questionnaire (about 20 min: autoencoders as nonlinear dimension reduction, scVI, what questions to ask to evaluate whether a more advanced model helps, how to decide whether it is safe to use a method, tradeoffs between method complexity and interpretability)

1.5.4.2 Lab 7 - Functional and Pseudotime analysis (Orr)

1.5.4.3 Lecture 7 - Single-cell multi-omic technologies (Dana)

  • Introduction to other omic data types
  • Integrating scRNA-seq with other single-cell modalities (CITE, Perturb, ATAC, methylation…)

1.5.5 Friday – Classes from 08:00 to 16:00

Small group discussion on selected topics through hangouts.

1.5.5.1 Lab 10 - Small dataset for analysis and office hours focused on select topics (Dana)

For the project on the last day (planned for the whole day), Dana will prepare datasets for roughly three or more mut glioma tumors, which students will download beforehand. The datasets may need to be subsampled to save time. Possible analyses include pseudotime, scVI and NMF. Students will work in groups of 3.