
BuildIndices Overview

| Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
| --- | --- | --- | --- |
| BuildIndices_v3.0.0 | December, 2023 | Kaylee Mathews | Please file an issue in WARP. |

BuildIndices_diagram

Introduction to the BuildIndices workflow

The BuildIndices workflow is an open-source, cloud-optimized pipeline developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) and the BRAIN Initiative Cell Atlas Network (BICAN).

Overall, the workflow filters GTF files for selected gene biotypes, calculates chromosome sizes, and builds reference bundles containing the files required by the STAR and bwa-mem2 aligners.

Quickstart table

The following table provides a quick glance at the BuildIndices pipeline features:

| Pipeline features | Description | Source |
| --- | --- | --- |
| Overall workflow | Reference bundle creation for STAR and bwa-mem2 aligners | Code available on GitHub |
| Workflow language | WDL 1.0 | openWDL |
| Genomic Reference Sequence | GRCh38 human genome primary sequence, M32 (GRCm39) mouse genome primary sequence, and release 103 (GCF_003339765.1) macaque genome primary sequence | GENCODE human reference files, GENCODE mouse reference files, and NCBI macaque reference files |
| Gene annotation reference (GTF) | Reference containing gene annotations | GENCODE human GTF, GENCODE mouse GTF, and NCBI macaque GTF |
| Reference builders | STAR, bwa-mem2 | Dobin et al. 2013, Vasimuddin et al. 2019 |
| Data input file format | File format in which reference files are provided | FASTA, GTF, TSV |
| Data output file format | File formats in which BuildIndices output is provided | GTF, TAR, TXT |

Set-up

BuildIndices installation

To download the latest BuildIndices release, see the release tags prefixed with "BuildIndices" on the WARP releases page. All BuildIndices pipeline releases are documented in the BuildIndices changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

If you’re running a BuildIndices workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP releases page (see the folder website/docs/Pipelines/BuildIndices_Pipeline).

The BuildIndices pipeline can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.

Inputs

The BuildIndices workflow inputs are specified in JSON configuration files. Configuration files for macaque and mouse references can be found in the WARP repository.

Input descriptions

| Parameter name | Description | Type |
| --- | --- | --- |
| genome_source | Describes the source of the reference genome listed in the GTF file; used to name output files; can be set to “NCBI” or “GENCODE”. | String |
| gtf_annotation_version | Version or release of the reference genome listed in the GTF file; used to name STAR output files; e.g., “M32”, “103”. | String |
| genome_build | Assembly accession (NCBI) or version (GENCODE) of the reference genome listed in the GTF file; used to name output files; e.g., “GRCm39”, “GCF_003339765.1”. | String |
| organism | Organism of the reference genome; used to name the output files; can be set to “Macaque”, “Mouse”, “Human”, or any other organism matching the reference genome. | String |
| annotations_gtf | GTF file containing gene annotations; used to build the STAR reference files. | File |
| genome_fa | Genome FASTA file used for building indices. | File |
| biotypes | TSV file containing the gene biotype attributes to include in the modified GTF file; the first column contains the biotype and the second column contains “Y” to include or “N” to exclude the biotype; GENCODE biotypes are used for GENCODE references and RefSeq biotypes are used for NCBI references. | File |
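As a rough illustration of how the parameters above fit together, the following Python sketch assembles an inputs JSON for a mouse reference. The `BuildIndices.` key prefix assumes the usual Cromwell convention of `<workflow name>.<input name>`, and every file path is a hypothetical placeholder rather than a real location; the configuration files in the WARP repository remain the authoritative examples.

```python
import json

# Hypothetical example values; the gs:// paths are placeholders, not real files.
# The "BuildIndices." prefix is an assumption based on the common Cromwell
# convention of <workflow name>.<input name> for top-level inputs.
inputs = {
    "BuildIndices.genome_source": "GENCODE",
    "BuildIndices.gtf_annotation_version": "M32",
    "BuildIndices.genome_build": "GRCm39",
    "BuildIndices.organism": "Mouse",
    "BuildIndices.annotations_gtf": "gs://example-bucket/annotation.gtf",
    "BuildIndices.genome_fa": "gs://example-bucket/genome.fa",
    "BuildIndices.biotypes": "gs://example-bucket/biotypes.tsv",
}

# Serialize to the JSON text that would be saved as the configuration file.
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```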

BuildIndices tasks and tools

Overall, the BuildIndices workflow:

  1. Checks inputs, modifies reference files, and creates the STAR index.
  2. Calculates chromosome sizes.
  3. Builds the reference bundle for bwa-mem2.

The tasks and tools used in the BuildIndices workflow are detailed in the table below.

To see specific tool parameters, select the workflow WDL link; then find the task and view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image, which is specified in the # runtime values section of the task WDL as docker: .

| Task name | Tool | Software | Description |
| --- | --- | --- | --- |
| BuildStarSingleNucleus | modify_gtf.py, STAR | warp-tools, STAR | Checks that the input GTF file contains the input genome source, genome build version, and annotation version with correct build source information, modifies files for the STAR aligner, and creates the STAR index file. |
| CalculateChromosomeSizes | faidx | Samtools | Reads the genome FASTA file to create a FASTA index file that contains the genome chromosome sizes. |
| BuildBWAreference | index | bwa-mem2 | Builds the reference bundle for the bwa-mem2 aligner. |

1. Check inputs, modify reference files, and create STAR index file

Check inputs

The BuildStarSingleNucleus task reads the input GTF file and verifies that the genome_source, genome_build, and gtf_annotation_version listed in the file match the input values provided to the pipeline.
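A minimal sketch of this kind of check is shown below. It assumes the relevant values appear in the GTF comment header (lines starting with `#`) and uses a plain substring match; the actual logic in the BuildStarSingleNucleus task may differ, and the function name `missing_from_header` and the header lines are illustrative only.

```python
def missing_from_header(gtf_lines, expected_values):
    """Return the expected values that do not appear in the GTF comment header."""
    # Join all comment lines so each value can be searched for in one string.
    header = " ".join(line.strip() for line in gtf_lines if line.startswith("#"))
    return [value for value in expected_values if value not in header]


# Illustrative GENCODE-style header followed by one annotation record.
example_gtf = [
    "##description: evidence-based annotation of the mouse genome (GRCm39), version M32 (Ensembl 109)\n",
    "##provider: GENCODE\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
]

ok = missing_from_header(example_gtf, ["GENCODE", "GRCm39", "M32"])
mismatch = missing_from_header(example_gtf, ["GENCODE", "GRCh38", "M32"])
print(ok, mismatch)
```

An empty list means every provided input was found in the header; a non-empty list names the values that would cause the check to fail.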

Modify reference files and create STAR index

The BuildStarSingleNucleus task uses a custom Python script, modify_gtf.py, and a list of biotypes (example) to filter the input GTF file, keeping only the biotypes marked with “Y” in the second column of the list. The defaults in the custom code produce reference outputs that are similar to those built with 10x Genomics reference scripts.
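The filtering idea can be sketched in a few lines of Python. This is not the modify_gtf.py script itself, which likely applies additional rules; the function names are hypothetical, and the sketch assumes the biotype is stored in a `gene_type` attribute (GENCODE style) or a `gene_biotype` attribute (NCBI style).

```python
import csv
import io
import re

# Matches the biotype attribute in either GENCODE or NCBI style (an assumption).
BIOTYPE_RE = re.compile(r'(?:gene_type|gene_biotype) "([^"]+)"')


def load_allowed_biotypes(tsv_text):
    """Collect biotypes whose second column is 'Y'."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0] for row in reader if len(row) >= 2 and row[1].strip() == "Y"}


def filter_gtf(gtf_lines, allowed):
    """Keep comment lines and records whose biotype is in the allowed set."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        match = BIOTYPE_RE.search(line)
        if match and match.group(1) in allowed:
            kept.append(line)
    return kept


# Toy inputs for illustration only.
biotypes_tsv = "protein_coding\tY\nrRNA\tN\n"
gtf_lines = [
    "##provider: GENCODE\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_type "rRNA";\n',
]

allowed = load_allowed_biotypes(biotypes_tsv)
filtered = filter_gtf(gtf_lines, allowed)
print(len(filtered))
```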

The task uses the filtered GTF file and STAR --runMode genomeGenerate to generate the index file for the STAR aligner. Outputs of the task include the modified GTF and compressed STAR index files.
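For orientation, the sketch below builds the argument list for a `STAR --runMode genomeGenerate` call in Python. The flags are standard STAR options, but the thread count, the `--sjdbOverhang` value, and the file names are assumptions chosen for illustration; the command {} section of the task WDL shows the parameters the pipeline actually uses.

```python
def star_genome_generate_cmd(genome_dir, genome_fa, gtf, threads=8, sjdb_overhang=100):
    """Build an argv list for STAR index generation (values are illustrative)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,          # output directory for the index
        "--genomeFastaFiles", genome_fa,    # genome FASTA input
        "--sjdbGTFfile", gtf,               # filtered GTF from the previous step
        "--sjdbOverhang", str(sjdb_overhang),
        "--runThreadN", str(threads),
    ]


# Hypothetical file names; pass the list to subprocess.run(...) to execute it.
cmd = star_genome_generate_cmd("star_index", "genome.fa", "modified.annotation.gtf")
print(" ".join(cmd))
```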

2. Calculate chromosome sizes

The CalculateChromosomeSizes task uses Samtools to create and output a FASTA index file that contains the genome chromosome sizes, which can be used in downstream tools like SnapATAC2.
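To make the relationship between a FASTA index and chromosome sizes concrete, the sketch below parses `.fai` text, whose first two tab-separated columns are the sequence name and its length. Whether the pipeline derives chrom.sizes exactly this way is an assumption, and the index contents are toy values.

```python
def chrom_sizes_from_fai(fai_text):
    """Extract (name, length) pairs from the first two columns of a .fai file."""
    sizes = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        sizes.append((fields[0], int(fields[1])))
    return sizes


# Toy index: name, length, byte offset, bases per line, bytes per line.
toy_fai = "chr1\t1000\t6\t60\t61\nchr2\t500\t1029\t60\t61\n"

sizes = chrom_sizes_from_fai(toy_fai)
chrom_sizes_text = "".join(f"{name}\t{length}\n" for name, length in sizes)
print(chrom_sizes_text, end="")
```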

3. Build reference bundle for bwa-mem2

The BuildBWAreference task uses the chromosome sizes file and bwa-mem2 to prepare the genome FASTA file for alignment and builds, compresses, and outputs the reference bundle for the bwa-mem2 aligner.
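The two halves of this step, indexing and bundling, can be sketched as follows. `bwa-mem2 index` is the real indexing subcommand, but the helper names are hypothetical, and the demo packs placeholder files because the exact contents of the real bundle are not listed here.

```python
import tarfile
import tempfile
from pathlib import Path


def bwa_mem2_index_cmd(genome_fa):
    """Build the argv for `bwa-mem2 index` on the given FASTA."""
    return ["bwa-mem2", "index", genome_fa]


def bundle(paths, tar_path):
    """Pack the given files into a TAR archive, stored flat by file name."""
    with tarfile.open(tar_path, "w") as tar:
        for path in paths:
            tar.add(path, arcname=Path(path).name)
    return tar_path


# Demo with placeholder files standing in for real reference files.
with tempfile.TemporaryDirectory() as tmp:
    members = []
    for name in ("genome.fa", "chrom.sizes"):
        path = Path(tmp) / name
        path.write_text("placeholder\n")
        members.append(str(path))
    tar_path = bundle(members, str(Path(tmp) / "reference_bundle.tar"))
    with tarfile.open(tar_path) as tar:
        bundled_names = sorted(tar.getnames())

index_cmd = bwa_mem2_index_cmd("genome.fa")
print(index_cmd)
print(bundled_names)
```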

Outputs

The following table lists the output variables and files produced by the pipeline.

| Output name | Filename, if applicable | Output format and description |
| --- | --- | --- |
| snSS2_star_index | modified_star2.7.10a-<organism>-<genome_source>-build-<genome_build>-<gtf_annotation_version>.tar | TAR file containing a species-specific reference genome and GTF file for STAR alignment. |
| pipeline_version_out | BuildIndices_v<pipeline_version> | String describing the version of the BuildIndices pipeline used. |
| snSS2_annotation_gtf_modified | modified_v<gtf_annotation_version>.annotation.gtf | GTF file containing gene annotations filtered for selected biotypes. |
| reference_bundle | bwa-mem2-2.2.1-<organism>-<genome_source>-build-<genome_build>.tar | TAR file containing the reference index files for bwa-mem2 alignment. |
| chromosome_sizes | chrom.sizes | Text file containing chromosome sizes for the genome build. |
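The filename templates above can be expanded with a short Python helper, shown here as a sketch. The function name is hypothetical; the templates are copied from the table, and the example values are the mouse settings mentioned in the input descriptions.

```python
def output_filenames(organism, genome_source, genome_build, gtf_annotation_version):
    """Expand the documented filename templates for a given set of inputs."""
    return {
        "snSS2_star_index": (
            f"modified_star2.7.10a-{organism}-{genome_source}"
            f"-build-{genome_build}-{gtf_annotation_version}.tar"
        ),
        "snSS2_annotation_gtf_modified": f"modified_v{gtf_annotation_version}.annotation.gtf",
        "reference_bundle": f"bwa-mem2-2.2.1-{organism}-{genome_source}-build-{genome_build}.tar",
        "chromosome_sizes": "chrom.sizes",
    }


names = output_filenames("Mouse", "GENCODE", "GRCm39", "M32")
for output_name, filename in names.items():
    print(f"{output_name}: {filename}")
```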

Versioning and testing

All BuildIndices pipeline releases are documented in the BuildIndices changelog and tested manually using reference JSON files.

Citing the BuildIndices Pipeline

If you use the BuildIndices Pipeline in your research, please consider citing our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Consortia support

This pipeline is supported by the BRAIN Initiative (BICCN and BICAN).

If your organization also uses this pipeline, we would like to list you! Please reach out to us by filing an issue in WARP.

Feedback

Please help us make our tools better by filing an issue in WARP for pipeline-related suggestions or questions.