scANVI Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| scANVI_v1.0.0 | April, 2026 | WARP Pipelines | Please file an issue in WARP |

Introduction to the scANVI workflow

The scANVI pipeline is a cloud-optimized WDL workflow that performs cell type label transfer on Multiome data using scVI and scANVI (single-cell ANnotation using Variational Inference) deep generative models. It integrates single-cell RNA-seq (GEX) and ATAC-seq data with an annotated reference dataset to transfer cell type labels via semi-supervised learning.

The pipeline is split into two tasks — PreprocessFilter (CPU-only) and MultiomeLabelTransfer (GPU) — so that expensive GPU time is reserved exclusively for model training and inference. All data loading, quality filtering, barcode alignment, and gene-activity-matrix conversion happen on a CPU node first; the GPU node receives ready-to-train h5ad files and never re-runs preprocessing.

Want to use scANVI for your publication?

The pipeline is designed to consume the outputs of the Multiome and PeakCalling WARP pipelines. Cite the pipeline using the WARP citation in the Citing section below.

Quickstart table

The following table provides a quick glance at the scANVI pipeline features:

| Pipeline features | Description | Source |
| --- | --- | --- |
| Assay type | 10x single-cell / single-nucleus Multiome (GEX + ATAC) | 10x Genomics |
| Overall workflow | CPU preprocessing + GPU SCVI/SCANVI label transfer | Code available on GitHub |
| Workflow language | WDL 1.0 | openWDL |
| Models | SCVI (unsupervised VAE) + SCANVI (semi-supervised classifier) | scvi-tools 1.2 |
| ATAC gene-activity conversion | Cell-by-bin matrix → gene activity matrix (hg38 GENCODE) | snapatac2 2.7 |
| Data input format | Three AnnData h5ad files (GEX, ATAC cell-by-bin, annotated reference) | AnnData |
| Data output format | Annotated h5ad files with predicted cell types and UMAP | AnnData |

Set-up

scANVI installation

To download the latest scANVI release, see the release tags prefixed with "scANVI" on the WARP releases page. All scANVI pipeline releases are documented in the scANVI changelog.

scANVI can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra.

Inputs

scANVI accepts inputs in two modes — direct file inputs or bucket mode. Direct file inputs take precedence when both are provided.

Example input JSON files are available in the example_inputs folder.
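As a rough illustration of the two input modes, the sketch below builds one inputs JSON per mode using the input names from the Workflow inputs table. The `scANVI.` key prefix (that is, the workflow name), the `sample1` identifier, and every gs:// path are placeholder assumptions, not values taken from the example_inputs folder.

```python
import json

# Assumption: the WDL workflow is named "scANVI", so the execution engine
# expects input keys of the form "scANVI.<input name>".
WORKFLOW = "scANVI"


def make_inputs(input_id, **inputs):
    """Build an inputs dict, dropping optional values that are unset."""
    keyed = {f"{WORKFLOW}.input_id": input_id}
    for name, value in inputs.items():
        if value is not None:
            keyed[f"{WORKFLOW}.{name}"] = value
    return keyed


# Direct file mode: all three h5ad files are passed explicitly.
direct_mode = make_inputs(
    "sample1",
    gex_h5ad="gs://example-bucket/sample1/gex.h5ad",
    atac_h5ad="gs://example-bucket/sample1/atac.h5ad",
    ref_h5ad="gs://example-bucket/reference/ref.h5ad",
)

# Bucket mode: only the bucket is given; the filenames keep their defaults.
bucket_mode = make_inputs("sample1", input_bucket="gs://example-bucket/sample1")

print(json.dumps(direct_mode, indent=2))
print(json.dumps(bucket_mode, indent=2))
```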

Workflow inputs

| Input | Type | Description | Default |
| --- | --- | --- | --- |
| input_id | String | Unique identifier prepended to all output filenames. | None (required) |
| input_bucket | String? | GCS bucket path containing input h5ad files (e.g., gs://bucket/path/to/inputs). Used when direct file inputs are not provided. | |
| gex_h5ad | File? | Gene expression AnnData h5ad file from Multiome / Optimus. | |
| atac_h5ad | File? | ATAC cell-by-bin AnnData h5ad file from Multiome / PeakCalling. | |
| ref_h5ad | File? | Annotated reference AnnData h5ad file with cell type labels in obs['final_annotation']. | |
| gex_filename | String | Expected GEX h5ad filename in the input bucket. | "gex.h5ad" |
| atac_filename | String | Expected ATAC h5ad filename in the input bucket. | "atac.h5ad" |
| ref_filename | String | Expected reference h5ad filename in the input bucket. | "ref.h5ad" |

Input mode precedence

If gex_h5ad, atac_h5ad, and ref_h5ad are supplied, they are used directly and input_bucket is ignored. Otherwise, the three filenames are downloaded from input_bucket via gsutil. The pipeline fails fast if any input file is missing or empty.
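The precedence rule can be sketched as a small resolver. `resolve_inputs` is a hypothetical helper written for illustration, not part of the pipeline; the error raised when neither mode is usable is likewise an assumption about how a fail-fast check might look.

```python
def resolve_inputs(gex_h5ad=None, atac_h5ad=None, ref_h5ad=None,
                   input_bucket=None,
                   gex_filename="gex.h5ad",
                   atac_filename="atac.h5ad",
                   ref_filename="ref.h5ad"):
    """Return (mode, {role: location}) following the documented precedence."""
    direct = {"gex": gex_h5ad, "atac": atac_h5ad, "ref": ref_h5ad}
    if all(direct.values()):
        # All three direct files supplied: input_bucket is ignored.
        return "direct", direct
    if input_bucket:
        base = input_bucket.rstrip("/")
        return "bucket", {
            "gex": f"{base}/{gex_filename}",
            "atac": f"{base}/{atac_filename}",
            "ref": f"{base}/{ref_filename}",
        }
    # Assumption: with neither mode fully specified, stop before any work starts.
    raise ValueError("Provide all three h5ad files or an input_bucket.")


mode, paths = resolve_inputs(input_bucket="gs://example-bucket/run1/")
print(mode, paths["gex"])  # bucket gs://example-bucket/run1/gex.h5ad
```

Note that a partially specified set of direct files falls through to bucket mode in this sketch, which mirrors the "otherwise" wording above.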

Reference requirements

The reference h5ad must contain cell type annotations in obs['final_annotation']. The query datasets (GEX and ATAC) do not need pre-existing annotations — placeholder Unknown labels are added automatically before training.
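A minimal sketch of these two rules follows. The helper names are hypothetical, and the objects below are simple stand-ins that model only `.obs`. With a real AnnData object, `.obs` is a pandas DataFrame, where `in` tests column names and assigning a scalar fills every row, so the same two functions should apply unchanged.

```python
from types import SimpleNamespace

LABEL_KEY = "final_annotation"
UNLABELED = "Unknown"


def check_reference(ref):
    """Fail early if the reference lacks the required annotation column."""
    if LABEL_KEY not in ref.obs:
        raise KeyError(f"reference obs is missing '{LABEL_KEY}'")


def add_placeholder_labels(query):
    """Mark every query cell as unlabeled so SCANVI treats it as a target."""
    query.obs[LABEL_KEY] = UNLABELED
    return query


# Stand-ins for AnnData objects: only the .obs mapping is modelled here.
ref = SimpleNamespace(obs={LABEL_KEY: ["T cell", "B cell"]})
gex = SimpleNamespace(obs={})

check_reference(ref)
add_placeholder_labels(gex)
print(gex.obs)  # {'final_annotation': 'Unknown'}
```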

scANVI tasks and tools

The scANVI workflow defines two tasks inline. Both use the same Docker image; only the second task is allocated GPUs.

| Task name | Tool | Software | Description |
| --- | --- | --- | --- |
| PreprocessFilter | scanpy, snapatac2 | Python | Loads the three input h5ad files, patches missing columns, filters GEX to STARsolo cell calls, intersects barcodes between GEX and ATAC, and converts the ATAC cell-by-bin matrix into a gene activity matrix. |
| MultiomeLabelTransfer | scvi-tools | Python (GPU) | Trains an SCVI model on the three preprocessed AnnData objects, then trains an SCANVI classifier from the SCVI model to transfer reference cell-type labels onto the unlabeled GEX and ATAC cells. Computes UMAP and writes annotated outputs. |

Overall, the scANVI workflow:

  1. Preprocesses and filters the three input h5ad files (CPU).
  2. Trains SCVI / SCANVI models and transfers labels (GPU).

1. PreprocessFilter (CPU-only)

Loads and preprocesses the three input h5ad files on a CPU-only node. No GPU is allocated for this task. Steps:

  1. Load datasets — Reads GEX (scanpy), ATAC cell-by-bin (snapatac2), and reference (scanpy).
  2. Patch missing columns — Adds star_IsCell = True to GEX and gex_barcodes (from index) to ATAC if absent, ensuring compatibility across upstream pipelines.
  3. Filter GEX — Retains STARsolo cell calls (star_IsCell == True), then removes genes and cells with fewer than 3 counts.
  4. Prepare GEX — Sets batch label; copies counts into a counts layer.
  5. Reindex ATAC — Sets ATAC obs index to gex_barcodes so barcodes align with GEX.
  6. Shared barcode filtering — Intersects GEX and ATAC barcodes; subsets both to matched cells.
  7. Batch labels — GEX → pd-multiome_sci_gex, ATAC → pd-multiome_sci_atac.
  8. Placeholder annotations — Adds final_annotation = "Unknown" to query datasets.
  9. Gene activity matrix — Converts the ATAC cell-by-bin matrix into a gene activity matrix via snapatac2.pp.make_gene_matrix (hg38 GENCODE annotation).
  10. Modality tags — GEX → rna_unannotated, ATAC activity → atac_unannotated, reference → rna_annotated.
  11. Write outputs — Three ~{input_id}_preprocessed_*.h5ad files.
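The query-side steps above could look roughly like the following. This is a sketch under stated assumptions, not the task's actual script: the `batch` obs column name is a guess, `make_gene_matrix` is called with default arguments (the pipeline may pass more), labels are applied after the conversion so they are present on the returned objects, and reference handling, modality tags, and file writing are omitted. scanpy and snapatac2 are imported inside the function so the barcode helper can be tried on its own.

```python
def shared_barcodes(gex_barcodes, atac_barcodes):
    """Barcodes present in both modalities, kept in GEX order."""
    atac_set = set(atac_barcodes)
    return [bc for bc in gex_barcodes if bc in atac_set]


def preprocess_queries(gex_path, atac_path):
    """Steps 1-9 for the two query datasets (reference handling omitted)."""
    # Imported lazily: these packages are only needed when the task runs.
    import scanpy as sc
    import snapatac2 as snap

    gex = sc.read_h5ad(gex_path)
    atac = snap.read(atac_path, backed=None)  # load fully into memory

    # Step 2: patch columns that some upstream pipelines omit.
    if "star_IsCell" not in gex.obs:
        gex.obs["star_IsCell"] = True
    if "gex_barcodes" not in atac.obs:
        atac.obs["gex_barcodes"] = list(atac.obs_names)

    # Step 3: keep STARsolo cell calls, then drop low-count genes and cells.
    gex = gex[gex.obs["star_IsCell"] == True].copy()
    sc.pp.filter_genes(gex, min_counts=3)
    sc.pp.filter_cells(gex, min_counts=3)

    # Step 4: keep a copy of the raw counts in a dedicated layer.
    gex.layers["counts"] = gex.X.copy()

    # Steps 5-6: align ATAC barcodes to GEX and keep matched cells only.
    atac.obs_names = atac.obs["gex_barcodes"].astype(str).to_numpy()
    shared = shared_barcodes(gex.obs_names, atac.obs_names)
    gex, atac = gex[shared].copy(), atac[shared].copy()

    # Step 9: cell-by-bin matrix -> gene activity matrix (hg38 GENCODE).
    activity = snap.pp.make_gene_matrix(atac, gene_anno=snap.genome.hg38)

    # Steps 7-8: batch labels and placeholder annotations.
    # Assumption: the obs column holding the batch label is called "batch".
    gex.obs["batch"] = "pd-multiome_sci_gex"
    activity.obs["batch"] = "pd-multiome_sci_atac"
    for adata in (gex, activity):
        adata.obs["final_annotation"] = "Unknown"
    return gex, activity


print(shared_barcodes(["AAAC-1", "AAAG-1", "AACT-1"], ["AACT-1", "AAAC-1"]))
# ['AAAC-1', 'AACT-1']
```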

2. MultiomeLabelTransfer (GPU)

Loads the three preprocessed h5ad files and performs only model training, label transfer, and output finalization. It imports individual functions (run_multi_model, transfer_labels, finalize_output) from the container's multiome_label_transfer.py module — the script's main() function is never called, so no preprocessing is repeated.

  1. Load preprocessed data: Reads the three h5ad files produced by PreprocessFilter. No filtering, reindexing, or conversion is performed.
  2. Train SCVI: run_multi_model() concatenates the three AnnData objects, keeps genes detected in at least 5 cells, selects 5,000 highly variable genes (Seurat v3, batch-aware), then trains an SCVI model (unsupervised VAE: 2 layers, 30 latent dimensions, negative-binomial likelihood, gene-batch dispersion, up to 500 epochs with early stopping).
  3. Train SCANVI: The same function initializes SCANVI from the trained SCVI model and performs semi-supervised training using the reference cell type labels (final_annotation), propagating annotations to unlabeled GEX and ATAC cells (up to 500 epochs, 100 samples per label).
  4. Predict labels: transfer_labels() uses the trained SCANVI model to predict cell types (C_scANVI) for every cell, extracts the latent representation (X_scANVI), and computes a neighborhood graph and UMAP embedding.
  5. Propagate labels: Copies the predicted C_scANVI labels from the concatenated object back into the original GEX and ATAC AnnData objects using the barcode-suffix index created by ad.concat.
  6. Write annotated matrices: Saves ~{input_id}_gex_annotated_matrix.h5ad and ~{input_id}_atac_annotated_matrix.h5ad.
  7. Finalize predictions: finalize_output() adds placeholder metadata (biosample, donor, species, disease, organ, library prep, sex), renames final_annotation to celltype, and stores the counts in .raw for Single Cell Portal (SCP) ingest. Writes ~{input_id}_SCANVI_predictions.h5ad.
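The training and prediction steps map onto scvi-tools and scanpy calls roughly as below. This is a sketch under assumptions, not the container's code: the concat keys (`gex`, `atac`, `ref`), the `-` suffix separator, the `batch` obs column, and raw counts living in `.X` are all guesses, and the real run_multi_model() / transfer_labels() signatures are not documented here. Heavy libraries are imported inside the function so the small label-propagation helper can be run without a GPU stack.

```python
# Hyperparameters quoted in the steps above.
SCVI_KWARGS = dict(n_layers=2, n_latent=30, gene_likelihood="nb",
                   dispersion="gene-batch")


def train_and_predict(gex, atac_activity, ref):
    """Steps 2-4: SCVI, then SCANVI, then predictions and UMAP."""
    import anndata as ad
    import scanpy as sc
    import scvi

    # Assumption: keys and separator; obs names become "<barcode>-<key>".
    adata = ad.concat({"gex": gex, "atac": atac_activity, "ref": ref},
                      index_unique="-")

    sc.pp.filter_genes(adata, min_cells=5)
    sc.pp.highly_variable_genes(adata, n_top_genes=5000, flavor="seurat_v3",
                                batch_key="batch", subset=True)

    scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
    vae = scvi.model.SCVI(adata, **SCVI_KWARGS)
    vae.train(max_epochs=500, early_stopping=True)

    scanvi = scvi.model.SCANVI.from_scvi_model(
        vae, unlabeled_category="Unknown", labels_key="final_annotation")
    scanvi.train(max_epochs=500, n_samples_per_label=100)

    adata.obs["C_scANVI"] = scanvi.predict(adata)
    adata.obsm["X_scANVI"] = scanvi.get_latent_representation(adata)
    sc.pp.neighbors(adata, use_rep="X_scANVI")
    sc.tl.umap(adata)
    return adata


def propagate_labels(predictions, barcodes, key, sep="-"):
    """Step 5: look up '<barcode><sep><key>' for each original barcode."""
    return [predictions[f"{bc}{sep}{key}"] for bc in barcodes]


# Toy predictions keyed the way the concatenated index would be.
toy = {"AAAC-1-gex": "T cell", "AAAC-1-atac": "T cell", "AAAG-1-gex": "B cell"}
print(propagate_labels(toy, ["AAAG-1", "AAAC-1"], key="gex"))
# ['B cell', 'T cell']
```

`propagate_labels` accepts any mapping with label-based lookup, so a pandas Series such as `adata.obs["C_scANVI"]` would work in place of the toy dict.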

Workflow diagram

                   ┌──────────────────────┐
                   │   Input h5ad files   │
                   │   (GEX, ATAC, Ref)   │
                   └──────────┬───────────┘
                              │
                              ▼
                   ┌────────────────────────┐
                   │    PreprocessFilter    │  CPU-only
                   │ (load, filter, align,  │
                   │  gene activity matrix) │
                   └──────────┬─────────────┘
                              │
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
         preprocessed   preprocessed   preprocessed
           _gex.h5ad   _atac_activity    _ref.h5ad
                            .h5ad
               │              │              │
               └──────────────┼──────────────┘
                              │
                              ▼
                   ┌────────────────────────────┐
                   │ MultiomeLabelTransfer      │  GPU
                   │ (import run_multi_model,   │
                   │  transfer_labels,          │
                   │  finalize_output;          │
                   │  main() is NOT called)     │
                   └──────────┬─────────────────┘
                              │
          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
 SCANVI_predictions     gex_annotated      atac_annotated
        .h5ad           _matrix.h5ad        _matrix.h5ad

Design rationale

The container script multiome_label_transfer.py bundles preprocessing and model training inside a single main() function. If the GPU task called main(), all preprocessing would run a second time on the GPU node — wasting expensive GPU hours on CPU-bound work that has already been completed.

Instead, the pipeline:

  • Runs all preprocessing in PreprocessFilter on a CPU-only VM (no GPU cost).
  • In MultiomeLabelTransfer, imports only the three model-training / label-transfer / finalization functions from the script, bypassing main() entirely.

This avoids duplicated work: every preprocessing step executes exactly once, on the CPU node, and GPU time is spent only on model training, label transfer, and output finalization.
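The import-without-main pattern can be demonstrated with a toy module. The sketch assumes the container script guards its entry point with `if __name__ == "__main__":`, which is the usual way for a script to be importable without side effects. The module below is a hypothetical stand-in, not the real multiome_label_transfer.py.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for the container script; the real functions differ.
SCRIPT = textwrap.dedent('''
    CALLS = []

    def preprocess():
        CALLS.append("preprocess")

    def run_multi_model(adatas):
        CALLS.append("run_multi_model")
        return adatas

    def main():
        preprocess()
        run_multi_model([])

    if __name__ == "__main__":
        main()
''')

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_label_transfer.py").write_text(SCRIPT)
    sys.path.insert(0, tmp)
    try:
        demo = importlib.import_module("demo_label_transfer")
    finally:
        sys.path.remove(tmp)

# Importing defines the functions but does not execute main(),
# so the preprocessing function is never triggered.
demo.run_multi_model(["gex", "atac", "ref"])
print(demo.CALLS)  # ['run_multi_model']
```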

Outputs

All output filenames are prefixed with ~{input_id}_.

| Output Variable Name | Filename | Output Type | Output Format |
| --- | --- | --- | --- |
| scanvi_predictions_h5ad | <input_id>_SCANVI_predictions.h5ad | Combined AnnData with SCANVI cell type predictions, UMAP, and metadata. | H5AD |
| gex_annotated_h5ad | <input_id>_gex_annotated_matrix.h5ad | Preprocessed GEX AnnData annotated with transferred cell type labels. | H5AD |
| atac_annotated_h5ad | <input_id>_atac_annotated_matrix.h5ad | ATAC gene-activity AnnData annotated with transferred cell type labels. | H5AD |
| pipeline_version_out | N/A | Version of the processing pipeline run on this data. | String |
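For downstream scripts, the naming scheme in the table can be captured in a small helper. Both functions are hypothetical conveniences, and the `sample1` identifier is a placeholder.

```python
from pathlib import Path


def expected_outputs(input_id):
    """Map each file-type output variable to its filename for a given run."""
    return {
        "scanvi_predictions_h5ad": f"{input_id}_SCANVI_predictions.h5ad",
        "gex_annotated_h5ad": f"{input_id}_gex_annotated_matrix.h5ad",
        "atac_annotated_h5ad": f"{input_id}_atac_annotated_matrix.h5ad",
    }


def missing_outputs(input_id, directory="."):
    """List expected output files that are not present in a directory."""
    return [name for name in expected_outputs(input_id).values()
            if not Path(directory, name).is_file()]


print(expected_outputs("sample1")["gex_annotated_h5ad"])
# sample1_gex_annotated_matrix.h5ad
```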

Runtime configuration

Both tasks use the same Docker image (pinned by digest). GPU and CUDA setup is handled entirely by the execution engine — the container does not configure the GPU environment itself.

Task 1 — PreprocessFilter (CPU-only)

| Attribute | Value |
| --- | --- |
| docker | us.gcr.io/broad-gotc-prod/scvi-scanvi@sha256:81fe915a045bd2929a1c457f4a0061055c6ea42fa3f88e9352b618e4a6e47b58 |
| bootDiskSizeGb | 20 |
| disks | local-disk 1000 SSD |
| memory | 120 GiB |
| cpu | 32 |
| maxRetries | 1 |

Task 2 — MultiomeLabelTransfer (GPU)

| Attribute | Value |
| --- | --- |
| docker | us.gcr.io/broad-gotc-prod/scvi-scanvi@sha256:81fe915a045bd2929a1c457f4a0061055c6ea42fa3f88e9352b618e4a6e47b58 |
| bootDiskSizeGb | 20 |
| disks | local-disk 500 SSD |
| memory | 120 GiB |
| cpu | 32 |
| hardware_gpu_type | nvidia-tesla-t4 |
| gpuCount | 2 |
| nvidia_driver_version | 535.104.05 |
| maxRetries | 1 |

GPU driver compatibility

Driver version 535.104.05 is compatible with CUDA 12.x and NVIDIA T4 GPUs and has been verified working on GCP / Terra with the scvi-scanvi container.
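When debugging a run, a quick check of what PyTorch sees inside the container can confirm that the driver and GPUs were attached by the execution engine. The helper below is a hypothetical diagnostic, not part of the pipeline. It assumes PyTorch is importable in the image (scvi-tools depends on it) and degrades gracefully where it is not.

```python
def gpu_summary():
    """Report what PyTorch can see; returns a plain dict suitable for logging."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False,
                "cuda_version": None, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        devices = [torch.cuda.get_device_name(i)
                   for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "cuda_available": available,
            "cuda_version": torch.version.cuda, "devices": devices}


print(gpu_summary())
```

On the GPU task as configured above, one would expect two T4 devices to be listed; on a CPU-only node the device list is empty.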

Docker image

The scvi-scanvi image is maintained in warp-tools. Key libraries: scvi-tools 1.2, snapatac2 2.7, scanpy, anndata.

Versioning

All scANVI pipeline releases are documented in the scANVI changelog.

Citing the scANVI Pipeline

When citing WARP, please use the following:

Kylee Degatano, Aseel Awdeh, Robert Sidney Cox III, Wes Dingman, George Grant, Farzaneh Khajouei, Elizabeth Kiernan, Kishori Konwar, Kaylee L Mathews, Kevin Palis, Nikelle Petrillo, Geraldine Van der Auwera, Chengchen (Rex) Wang, Jessica Way. "Warp Analysis Research Pipelines: Cloud-optimized workflows for biological data processing and reproducible analysis." Bioinformatics, 2025; https://doi.org/10.1093/bioinformatics/btaf494

Please also cite the underlying scvi-tools models:

  • Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N. "Deep generative modeling for single-cell transcriptomics." Nature Methods 15, 1053–1058 (2018).
  • Xu, C., Lopez, R., Mehlman, E., Regier, J., Jordan, M.I., Yosef, N. "Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models." Molecular Systems Biology 17, e9620 (2021).

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.