A whistlestop tour of the book contents

It can be hard to get a sense of whether a book like ours is right for you just based on the Amazon listing (which we don’t have direct control over as authors). To address that problem, I figured I’d write a short series of blog posts dedicated to providing insight into “What’s in the can?”, ie what our book is about, and what it’s like to work through, to supplement the information provided in the listing.

In this first installment, I adapted a Twitter thread I had posted a while back that people seemed to find helpful, consisting of one tweet per chapter, each providing a brief and informal description of the chapter’s topic and goals, with a sampling of illustrations from that chapter. The format and Twitter’s brutally low character limit forced me to be very concise, and really drill down to the essentials. The result emphasizes the progression from one topic to the next, and reflects our vision of the book as an educational journey rather than a collection of facts.

Here we go.

1. Introduction

Why you should care about the cloud, and how bioinformatics / life sciences research benefits from moving to a cloud-based ecosystem for data sharing and analysis. No, the cloud environment is not perfect; yes, it really is a game changer.

Advantages of the cloud-centric model

2. Genomics in a Nutshell

A primer for newcomers to the field of genomics, covering foundational terms and concepts such as genes, DNA and genomic variation, plus the technical basics of sequencing and handling genomic data.

Types of variants

—

Types of variants

3. Computing Technology Basics for Life Scientists

CPU, GPU, TPU, FPGA, OMG GTFO – no really, just some basic hardware terminology, plus an introduction to key concepts like parallelism, pipelining, containers and virtual machines in fairly plain language.

Computing infrastructure

4. First Steps in the Cloud

Finally we get to do some hands-on work (on Google Cloud). Set up an account, get free credits, practice managing data in storage buckets and interacting with a Docker container, get a nice custom VM set up to do some genomics.

Create a VM

—

Working in a VM

5. First Steps with GATK

Let’s meet the workhorse of genomics! We start with a general overview, requirements, command line syntax, the usual – then dive into calling variants with HaplotypeCaller, plus some visual troubleshooting and variant filtering concepts.

Main GATK use cases

—

Variants in IGV

6. GATK Best Practices for Germline Short Variant Discovery

Step by step examination of what may be the most commonly run genomics pipeline in the world, with highlights on joint calling for populations and deep learning for single-sample analysis.

Joint calling overview

—

Joint calling on a trio

7. GATK Best Practices for Somatic Variant Discovery

Switching gears to cancer genomics with a rundown of how somatic calling is different; step by step through the pipelines for somatic short variants (Mutect2) and copy number alterations.

Tumor progression

—

Tumor-Normal SNV calling

8. Automating Analysis Execution with Workflows

Halfway point; we pivot to the challenges of automating and scaling up these analyses, introducing the Cromwell workflow system and the portable Workflow Description Language (WDL).

Visualizing a workflow graph

9. Deciphering Real Genomics Workflows

We pretend to stumble across 2 mystery workflows, go through a systematic process of investigating their content to understand what they do and how they do it, learning useful WDL features along the way.

A real GATK pipeline

10. Running Single Workflows at Scale with Pipelines API

So far we’ve been running everything on our little custom VM. Now it’s time to unleash the full power of the cloud by dispatching workflow tasks to multiple machines – with surprisingly little effort.

Running on GCP via PAPI

—

Parallel execution on GCE

11. Running Many Workflows Conveniently in Terra

Now we’re scaling up to arbitrary numbers of samples, using the managed Cromwell server in Terra, an open platform for secure data access and analysis.

Using Cromwell via Terra

—

Job manager

12. Interactive Analysis in Jupyter Notebook

Circling back to the GATK work from earlier chapters, we examine what that would all look like done in Jupyter Notebooks instead of the terminal shell. Between embedded IGV and ggplots galore, it looks good!

List of notebooks

—

Notebook header

—

IGV in a notebook

13. Assembling Your Own Workspace in Terra

Crossing the bridge from canned examples to importing your own data and methods into Terra in a few different scenarios. Draws on other services in the ecosystem including workflow and data repositories.

Importing data

—

Importing a workflow

14. Making a Fully Reproducible Paper

Capstone case study on computational reproducibility involving synthetic data creation, GATK, downstream analysis and real biological findings by Dr. Matthieu Miossec et al.

Study overview

And that’s it; that’s our book in a nutshell. Four hundred pages summarized in a couple dozen sentences and images… Hopefully this gives you enough of a taste to help you gauge whether the book might be a good fit for you.

In the next installment of “What’s in the can?”, we’ll talk about the semi-official companion booklet we made for the book figures.

Written on April 29, 2021