Skip to main content

CEMBA Overview

Pipeline VersionDate UpdatedDocumentation AuthorQuestions or Feedback
CEMBA_v1.1.0February, 2021Elizabeth KiernanPlease file GitHub issues in warp or contact Kylee Degatano


Introduction to the CEMBA Workflow#

CEMBA is a pipeline developed by the BRAIN Initiative that supports the processing of multiplexed single-nuclei bisulfite sequencing data. It is an alignment and methylated base calling pipeline that trims adaptors, attaches cell barcodes, aligns reads to the genome, filters reads based on quality, and creates both a VCF and ALLC file with methylation-site coverage.


Interested in using the pipeline for your publication? See the “CEMBA publication methods” for a generic "methods" style description of the pipeline.

Quick Start Table#

Pipeline FeaturesDescriptionSource
Assay TypeSingle-nucleus methylcytosine sequencing (snmC-seq)Luo et al. 2017
Overall WorkflowAligns reads and identifies methylated basesCode available from Github
Workflow LanguageWDL 1.0openWDL
Genomic Reference SequenceGRCH38 and GRCM38GENCODE
AlignerBISMARK v0.21.0 with --bowtie2Bismark
Variant CallerGATK
Data Input File FormatFile format in which sequencing data is providedZipped FASTQs (.fastq.gz)
Data Output File FormatFile formats in which CEMBA output is providedBAM, VCF, ALLC


CEMBA Installation and Requirements#

The CEMBA pipeline code is written in the Workflow Description Language (WDL) and can be downloaded by cloning the GitHub repository WARP. For the latest release of CEMBA, please see the release tags prefixed with "CEMBA" here. CEMBA can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms.

The workflow can also be run on Terra using the Methyl-c-seq_Pipeline workspace. This workspace contains the CEMBA workflow, workflow configurations, required reference data and other inputs, and demultiplexed example testing data.


CEMBA pipeline inputs are detailed in the example human configuration file (CEMBA.inputs.json). Genomic reference files were built using the BuildCembaReferencesWDL script. See descriptions of all inputs in the tables below.

Sample data input#

The pipeline accepts paired-end reads in the form of two compressed FASTQ files (fastq.gz). FASTQ files may represent a single cell sample or in the case of multiplexed samples, multiple cells.

Parameter NameDescription
fastq_r1_gzipped_inputCompressed FASTQ (.gz) for R1
fastq_r2_gzipped_inputCompressed FASTQ (.gz) for R2

Additional inputs#

Parameter NameDescription
barcode_white_listList of known cell barcodes
output_base_sample_namePrefix for all pipeline output files (final and intermediate)
barcode_start_posBase location of barcode start
barcode_lengthLength of cell barcode (bp)
reference_fastaReference FASTA
reference_fasta_indexReference FASTA index
fwd_converted_reference_fastaBisulfite-converted forward reference genome reads for Bismark alignment
rev_converted_reference_fastaBisulfite-converted reverse reference genome reads for Bismark alignment
reference_dictionaryReference genome dictionary
fwd_bowtie2_index_filesForward bowtie2 index files
rev_bowtie2_index_filesReverse bowtie2 index files
quality_cutoffInterval representing the number of base pairs to remove from 5’ and 3’ end in order to trim low quality reads
min_length_paired_end_trimAn interval to specify a minimum read length to avoid empty reads in paired-end mode
min_length_single_end_trimAn interval to specify a minimum read length to avoid empty reads in single-end mode
read1_adapter_seqThe R1 adaptor sequence
read2_adapter_seqThe R2 adaptor sequence
cut_lengthInterval provided to trim degenerate bases, random primer indexes, and Adaptase C/T tail
paired_end_runBoolean; if true, workflow will run in paired-end mode
remove_duplicatesBoolean; if true Picard will remove duplicates and report duplication removal metrics
extract_and_attach_barcodes_in_single_end_runBoolean; if true, workflow will create an unaligned BAM and extract barcodes
min_map_qualityNumerical value that represents minimum map quality; if provided Samtools will filter reads and produce a BAM for reads above value and reads below value
read_group_library_nameLibrary preparation type used for read group; default is "Methylation"
read_group_platform_nameSequencing platform used for read group; default is "Illumina"
read_group_platform_unit_namePlatform unit for the read group (i.e. run barcode); default is "smmC-Seq"

CEMBA Tasks and Tools

The CEMBA.wdl implements the workflow by importing individual "tasks" written in the WDL script.

CEMBA Task Summary#

The table and summary sections below detail the tasks and tools of the CEMBA pipeline; the code is available through GitHub. Each task can be found in the CEMBA WDL If you are looking for the specific parameters of each task/tool, please see the command {} section of the WDL script.

TrimCutadapt v1.18Trim
CreateUnmappedBamPicard v2.18.23Create uBAM for attaching
ExtractCellBarcodessctools v0.3.4Use whitelist to extract barcodes and tag to
TrimCutadapt v1.18Trim degenerate bases, primer index, C/T Adaptase tail of
TrimCutadapt v1.18Trim bases, primer index, C/T Adaptase tail of
AlignBismark v0.21.0Map multiplexed samples as single-end with
SortPicard v2.18.23Sort BAM(s) in coordinate
FilterDuplicatesPicard v2.18.23Removes duplicate reads from
Get MethylationReportBismark v0.21.0Produce methylation report for duplicates-filtered
FilterMapQualitySamtools v1.9Further filter duplicate-removed BAM by map
GetMethylationReportBismark v0.21.0Produce methylation report for reads above map quality and below map
AttachBarcodesPicard v2.18.23Add barcodes from the tagged uBAM to the aligned
MergeBamsSamtools v.19Merge R1 and R2 BAM files into single
AddReadGroupGATK v4.1.2.0Add read groups to the merged BAMbroadinstitute/gatk:
SortPicard v2.18.23Sort in coordinate order after adding read
IndexBamSamtools v1.9Index the output
MethylationTypeCallerGATK v4.1.2.0Produce a VCF with locus-specific methylation informationbroadinstitute/gatk:
VCFtoALLCPythonCreates an ALLC file from the VCF produced with
ComputeCoverageDepthSamtools v1.9Compute number of sites with coverage greater than

Prior to running: set-up the workflow for using multiplexed samples#

The pipeline uses paired-end reads, but it can only perform multiplexing when running in single-end mode. If you have multiplexed samples and want to attach cell barcodes, you must run the pipeline in single-end mode. If you do not wish to attach cell barcodes, you may run in paired-end mode (even if your samples are multiplexed). You can specify single-end mode or paired-end mode using the paired_end_run boolean in the configuration file. You will also need to adjust the extract_and_attach_barcodes_in_single_end_run boolean to true if you want to attach barcodes.

1. Trim adaptors#

The CEMBA workflow Trim task uses Cutadapt software to remove the Read1 (R1) and Read2 (R2) adaptor sequences specified in the input configuration from the zipped R1 and R2 FASTQ files. Low-quality reads are trimmed from the 5’ and 3’ ends using the interval specified in the quality_cutoff input parameter. To avoid empty reads, a threshold for read length is set using the min_length_paired_end_trim option.

2. Extract cell barcodes#

CEMBA can extract cell barcodes from multiplexed samples if the extract_and_attach_barcodes_in_single_end_run boolean is true and the samples are run in single-end mode. To do this, the workflow uses the CreateUnmappedBam and ExtractCellBarcodes tasks to first make an unaligned BAM (uBAM) for the trimmed R1 FASTQ and then tag barcodes identified with the barcode_white_list input to the uBAM.

3. Trim degenerate bases, random primer indexes, and Adaptase C/T tail#

After barcode extraction, the Trim task is used a second time to remove additional bases resulting from R1 random primer indexes (often used as barcodes) and the R2 C/T tail introduced by the Adaptase enzyme. Reads are trimmed using the cut_length input. The read length threshold is set by the min_length_single_end input.

4. Align to a reference genome#

The resulting trimmed FASTQ files can be aligned to a reference either in single-end mode for multiplexed samples or paired-end mode. For all modes, the workflow aligns with Bismark with --bowtie2 option. For paired-end or the single-end mode for R1, the workflow uses a directional option with --pbat parameter. For R2, the directional option is turned off.

5. Sort, remove duplicates, and filter#

The aligned BAM(s) are scattered and sorted in coordinate order using Picard. Duplicate reads are then removed from the sorted BAM. If a min_map_quality is provided in the input, reads will be filtered accordingly and a BAM produced for all reads above the min_map_quality and a BAM for reads below the min_map_quality.

6. Generate methylation reports#

Methylation reports are generated using the Bismark at two steps in the workflow: after the removal of duplicates and again after filtering on min_map_quality. The bismark_methylation_extraction function with the -- comprehensive --merge_non_CpG --report options outputs multiple reports which are detailed in the Bismark documentation. These outputs include mbias, splitting, CpG context, and non-CpG context reports.

7. Attach barcodes, merge BAMs, add read groups, sort and index BAMs#

In the AttachBarcodes task, Picard attaches the barcodes in the R1 uBAM to the aligned, duplicate-removed, and if applicable, filtered, R1 BAM. This produces a tagged_mapped.bam file. Once the barcodes are attached, the MergeBams task uses Samtools to merge the (barcoded if applicable) R1 BAM with the aligned and filtered R2 BAM. Read groups are then attached to the merged BAM file with GATK4 and the BAM is sorted with Picard. The BAM is indexed with Samtools.

8. Call methylated bases#

Methylated bases are identified using the MethylationTypeCaller task which calls the GATK4 function MethylationTypeCaller. This produces a VCF with methylation calls.

9. Create ALLC file#

The VCF containing methylation calls is used to create an additional ALLC ("all-cytosine") file which can be used for downstream differential methylated regions in downstream analyses.

10. Compute coverage depth#

The ComputeCoverageDepth task uses Samtools to calculate any region in the filtered, sorted BAM with a coverage depth greater than 1. This interval is read in the stdout of the workflow.


The table below details the pipeline outputs. If using multiplexed samples, the final files will represent reads from multiple cells and the output is not yet split by cell barcode.

Workflow Output NameDescriptionFile Type (when applicable)
bam_sort_outputFinal aligned, filtered, (barcoded), and sorted BAMBAM
bam_index_outputIndex file for the final BAMINDEX
methylation_vcfVCF file demarcating methylation sitesVCF
methylation_vcf_indexIndex file for the VCFINDEX
total_depth_countInterval demarcated in the stdout file for reads with coverage greater than 1NA
mapping_reportsMapping reports generated by Bismark alignmentArray
metric_remove_dup_outputDuplication metrics generated by PicardTXT
methylation_mbias_report_outputMbias reports generated by Bismark for the sorted duplicate-removed BAM and the two map quality-filtered BAMs (reads above map quality and reads below map quality)TXT
methylation_splitting_report_outputSplitting reports generated by Bismark for the sorted duplicate-removed BAM and the two map quality-filtered BAMs (reads above map quality and reads below map quality)TXT
methylation_CpG_context_report_outputCpG context reports generated by Bismark for the sorted duplicate-removed BAM and the two map quality-filtered BAMs (reads above map quality and reads below map quality)TXT
methylation_non_CpG_context_report_outputNon-CpG context reports generated by Bismark for the sorted duplicate-removed BAM and the two map quality-filtered BAMs (reads above map quality and reads below map quality)TXT


All CEMBA pipeline releases are documented in the CEMBA changelog.

Citing the CEMBA Pipeline#

Please identify the pipeline in your methods section using the CEMBA Pipeline's SciCrunch resource identifier.

  • Ex: CEMBA MethylC Seq Pipeline (RRID:SCR_021219)

Consortia Support#

This pipeline is supported and used by the BRAIN Initiative Cell Census Network (BICCN).

If your organization also uses this pipeline, we would love to list you! Please reach out to us by contacting Kylee Degatano.

Have Suggestions?#

Please help us make our tools better by contacting Kylee Degatano for pipeline-related suggestions or questions.