BuildIndices Overview
Pipeline Version | Date Updated | Documentation Author | Questions or Feedback |
---|---|---|---|
BuildIndices_v3.0.0 | December, 2023 | Kaylee Mathews | Please file an issue in WARP. |
Introduction to the BuildIndices workflow
The BuildIndices workflow is an open-source, cloud-optimized pipeline developed in collaboration with the BRAIN Initiative Cell Census Network (BICCN) and the BRAIN Initiative Cell Atlas Network (BICAN).
Overall, the workflow filters GTF files for selected gene biotypes, calculates chromosome sizes, and builds reference bundles with required files for STAR and bwa-mem2 aligners.
Quickstart table
The following table provides a quick glance at the BuildIndices pipeline features:
Pipeline features | Description | Source |
---|---|---|
Overall workflow | Reference bundle creation for STAR and bwa-mem2 aligners | Code available on GitHub |
Workflow language | WDL 1.0 | openWDL |
Genomic Reference Sequence | GRCh38 human genome primary sequence, M32 (GRCm39) mouse genome primary sequence, and release 103 (GCF_003339765.1) macaque genome primary sequence | GENCODE human reference files, GENCODE mouse reference files, and NCBI macaque reference files |
Gene annotation reference (GTF) | Reference containing gene annotations | GENCODE human GTF, GENCODE mouse GTF, and NCBI macaque GTF |
Reference builders | STAR, bwa-mem2 | Dobin et al. 2013, Vasimuddin et al. 2019 |
Data input file format | File format in which reference files are provided | FASTA, GTF, TSV |
Data output file format | File formats in which BuildIndices output is provided | GTF, TAR, TXT |
Set-up
BuildIndices installation
To download the latest BuildIndices release, see the release tags prefixed with "BuildIndices" on the WARP releases page. All BuildIndices pipeline releases are documented in the BuildIndices changelog.
To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.
If you’re running a BuildIndices workflow version prior to the latest release, the accompanying documentation for that release may be downloaded with the source code on the WARP releases page (see the folder website/docs/Pipelines/BuildIndices_Pipeline).
The BuildIndices pipeline can be deployed using Cromwell, a GA4GH-compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.
Inputs
The BuildIndices workflow inputs are specified in JSON configuration files. Configuration files for macaque and mouse references can be found in the WARP repository.
Input descriptions
Parameter name | Description | Type |
---|---|---|
genome_source | Describes the source of the reference genome listed in the GTF file; used to name output files; can be set to “NCBI” or “GENCODE”. | String |
gtf_annotation_version | Version or release of the reference genome listed in the GTF file; used to name STAR output files; ex. “M32”, “103”. | String |
genome_build | Assembly accession (NCBI) or version (GENCODE) of the reference genome listed in the GTF file; used to name output files; ex. “GRCm39”, “GCF_003339765.1”. | String |
organism | Organism of the reference genome; used to name the output files; can be set to “Macaque”, “Mouse”, “Human”, or any other organism matching the reference genome. | String |
annotations_gtf | GTF file containing gene annotations; used to build the STAR reference files. | File |
genome_fa | Genome FASTA file used for building indices. | File |
biotypes | TSV file containing the gene biotype attributes to include in the modified GTF file; the first column contains the biotype and the second column contains “Y” to include or “N” to exclude the biotype; GENCODE biotypes are used for GENCODE references and RefSeq biotypes are used for NCBI references. | File |
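As an illustration of how the inputs above fit together, the following is a minimal sketch that assembles an inputs JSON in Python. It assumes the Cromwell convention of prefixing each key with the workflow name (taken here to be `BuildIndices`); the `build_inputs` helper and all file paths are hypothetical placeholders, not files shipped with the pipeline.

```python
import json

def build_inputs(genome_source, gtf_annotation_version, genome_build, organism,
                 annotations_gtf, genome_fa, biotypes,
                 workflow_name="BuildIndices"):
    """Return a dict of workflow inputs keyed as '<workflow>.<parameter>'.

    The key prefix is an assumption based on how Cromwell usually names inputs.
    """
    # The input table allows only these two values for genome_source.
    if genome_source not in ("NCBI", "GENCODE"):
        raise ValueError("genome_source must be 'NCBI' or 'GENCODE'")
    values = {
        "genome_source": genome_source,
        "gtf_annotation_version": gtf_annotation_version,
        "genome_build": genome_build,
        "organism": organism,
        "annotations_gtf": annotations_gtf,
        "genome_fa": genome_fa,
        "biotypes": biotypes,
    }
    return {f"{workflow_name}.{key}": value for key, value in values.items()}

# Hypothetical paths, for illustration only.
example_inputs = build_inputs(
    genome_source="GENCODE",
    gtf_annotation_version="M32",
    genome_build="GRCm39",
    organism="Mouse",
    annotations_gtf="gs://example-bucket/annotation.gtf",
    genome_fa="gs://example-bucket/genome.fa",
    biotypes="gs://example-bucket/biotypes.tsv",
)
print(json.dumps(example_inputs, indent=2))
```

The configuration files in the WARP repository remain the authoritative examples; this sketch only shows the shape of the seven parameters.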
BuildIndices tasks and tools
Overall, the BuildIndices workflow:
- Checks inputs, modifies reference files, and creates STAR index.
- Calculates chromosome sizes.
- Builds reference bundle for bwa-mem2.
The tasks and tools used in the BuildIndices workflow are detailed in the table below.
To see specific tool parameters, select the workflow WDL link; then find the task and view the command {} section of the task in the WDL script. To view or use the exact tool software, see the task's Docker image, which is specified in the task WDL # runtime values section as docker:.
Task name | Tool | Software | Description |
---|---|---|---|
BuildStarSingleNucleus | modify_gtf.py, STAR | warp-tools, STAR | Checks that the input GTF file contains input genome source, genome build version, and annotation version with correct build source information, modifies files for the STAR aligner, and creates STAR index file. |
CalculateChromosomeSizes | faidx | Samtools | Reads the genome FASTA file to create a FASTA index file that contains the genome chromosome sizes. |
BuildBWAreference | index | bwa-mem2 | Builds the reference bundle for the bwa-mem2 aligner. |
1. Check inputs, modify reference files, and create STAR index file
Check inputs
The BuildStarSingleNucleus task reads the input GTF file and verifies that the genome_source, genome_build, and gtf_annotation_version listed in the file match the input values provided to the pipeline.
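The check described above could be sketched as follows. This is not the task's actual code: it assumes the metadata appears in the leading `#` comment lines of the GTF and that a plain substring match is sufficient. The `header_lines` and `missing_metadata` helpers and the toy header are hypothetical.

```python
def header_lines(gtf_lines):
    """Collect the leading comment lines (those starting with '#')."""
    header = []
    for line in gtf_lines:
        if not line.startswith("#"):
            break
        header.append(line.rstrip("\n"))
    return header

def missing_metadata(gtf_lines, **expected):
    """Return the names of expected values that do not occur in the header."""
    text = "\n".join(header_lines(gtf_lines))
    return [name for name, value in expected.items() if value not in text]

# Toy GTF content, invented for illustration.
toy_gtf = [
    "##description: toy annotation of the mouse genome (GRCm39), version M32\n",
    "##provider: GENCODE\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
]

ok = missing_metadata(toy_gtf, genome_source="GENCODE",
                      genome_build="GRCm39", gtf_annotation_version="M32")
bad = missing_metadata(toy_gtf, genome_source="GENCODE",
                       genome_build="GRCm38", gtf_annotation_version="M32")
print(ok, bad)
```

A mismatch, such as the wrong genome build in the second call, is reported by name so the caller can stop before building any index.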
Modify reference files and create STAR index
The BuildStarSingleNucleus task uses a custom Python script, modify_gtf.py, and a list of biotypes (example) to filter the input GTF file for only the biotypes marked with the value “Y” in the second column of the list. The defaults in the custom code produce reference outputs that are similar to those built with 10x Genomics reference scripts.
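To make the Y/N selection concrete, here is a minimal sketch of biotype filtering. It is not modify_gtf.py itself, which applies additional logic. The sketch assumes the biotype is stored in the `gene_type` attribute (GENCODE style) or the `gene_biotype` attribute (NCBI style); the function names and the toy data are hypothetical.

```python
import csv
import io
import re

def load_included_biotypes(tsv_text):
    """Return the set of biotypes whose second column is 'Y'."""
    included = set()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) >= 2 and row[1].strip() == "Y":
            included.add(row[0].strip())
    return included

# Matches: gene_type "x" or gene_biotype "x" inside the attribute column.
BIOTYPE_RE = re.compile(r'\b(?:gene_type|gene_biotype) "([^"]+)"')

def filter_gtf_lines(lines, included):
    """Yield comment lines plus records whose biotype is in `included`."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = BIOTYPE_RE.search(fields[8])
        if match and match.group(1) in included:
            yield line

# Toy data, invented for illustration.
biotypes_tsv = "protein_coding\tY\nlncRNA\tY\npseudogene\tN\n"
gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_type "pseudogene";\n',
    'chr2\tBestRefSeq\tgene\t1\t50\t.\t+\t.\tgene_id "G3"; gene_biotype "lncRNA";\n',
]
kept = list(filter_gtf_lines(gtf, load_included_biotypes(biotypes_tsv)))
print("".join(kept), end="")
```

In this toy input, the pseudogene record is dropped because its biotype is marked “N”, while the header line and the two “Y” records pass through unchanged.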
The task uses the filtered GTF file and STAR --runMode genomeGenerate to generate the index file for the STAR aligner. Outputs of the task include the modified GTF and compressed STAR index files.
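For orientation, the sketch below assembles a STAR genome-generation command line in Python without running it. The flags shown are standard STAR options, but the exact parameters the task passes are defined in the WDL and may differ; the helper name, paths, and thread count are hypothetical.

```python
import shlex

def star_genome_generate_cmd(genome_fa, filtered_gtf, genome_dir, threads=8):
    """Build (but do not run) a STAR genomeGenerate command as an argv list."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,          # output directory for the index
        "--genomeFastaFiles", genome_fa,    # genome FASTA input
        "--sjdbGTFfile", filtered_gtf,      # biotype-filtered annotations
        "--runThreadN", str(threads),
    ]

# Hypothetical file names, for illustration only.
cmd = star_genome_generate_cmd("genome.fa", "modified.annotation.gtf", "star_index")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it on a machine where STAR is installed; consult the task's command section for the settings the pipeline actually uses.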
2. Calculate chromosome sizes
The CalculateChromosomeSizes task uses Samtools to create and output a FASTA index file that contains the genome chromosome sizes, which can be used in downstream tools like SnapATAC2.
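The pipeline relies on Samtools for this step. Purely as an illustration of what "chromosome sizes" means, the sketch below computes sequence lengths from FASTA text in plain Python. It reproduces only the name and length columns; a real samtools faidx index also records the byte offset, bases per line, and line width. The function name and toy sequences are hypothetical.

```python
def chromosome_sizes(fasta_lines):
    """Map each sequence name to its total length, in file order."""
    sizes = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token.
            name = line[1:].split()[0]
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes

# Toy FASTA content, invented for illustration.
toy_fasta = ">chr1 toy sequence\nACGT\nAC\n>chr2\nGGG\n".splitlines()
sizes = chromosome_sizes(toy_fasta)
for chrom, length in sizes.items():
    print(f"{chrom}\t{length}")
```

The tab-separated name and length pairs printed here mirror the kind of two-column table that downstream tools typically expect for chromosome sizes.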
3. Build reference bundle for bwa-mem2
The BuildBWAreference task uses the chromosome sizes file and bwa-mem2 to prepare the genome FASTA file for alignment. It then builds, compresses, and outputs the reference bundle for the bwa-mem2 aligner.
Outputs
The following table lists the output variables and files produced by the pipeline.
Output name | Filename, if applicable | Output format and description |
---|---|---|
snSS2_star_index | modified_star2.7.10a-<organism>-<genome_source>-build-<genome_build>-<gtf_annotation_version>.tar | TAR file containing a species-specific reference genome and GTF file for STAR alignment. |
pipeline_version_out | BuildIndices_v<pipeline_version> | String describing the version of the BuildIndices pipeline used. |
snSS2_annotation_gtf_modified | modified_v<gtf_annotation_version>.annotation.gtf | GTF file containing gene annotations filtered for selected biotypes. |
reference_bundle | bwa-mem2-2.2.1-<organism>-<genome_source>-build-<genome_build>.tar | TAR file containing the reference index files for bwa-mem2 alignment. |
chromosome_sizes | chrom.sizes | Text file containing chromosome sizes for the genome build. |
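The filename templates in the table can be expanded from the input values. The following sketch does this in Python for a mouse GENCODE reference; the `output_filenames` helper is hypothetical, and the tool versions embedded in the names are copied from the table rather than read from the pipeline.

```python
def output_filenames(organism, genome_source, genome_build, gtf_annotation_version):
    """Expand the filename templates listed in the outputs table."""
    return {
        "snSS2_star_index": (
            f"modified_star2.7.10a-{organism}-{genome_source}"
            f"-build-{genome_build}-{gtf_annotation_version}.tar"
        ),
        "snSS2_annotation_gtf_modified": (
            f"modified_v{gtf_annotation_version}.annotation.gtf"
        ),
        "reference_bundle": (
            f"bwa-mem2-2.2.1-{organism}-{genome_source}-build-{genome_build}.tar"
        ),
        "chromosome_sizes": "chrom.sizes",
    }

names = output_filenames("Mouse", "GENCODE", "GRCm39", "M32")
for output_name, filename in names.items():
    print(f"{output_name}: {filename}")
```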
Versioning and testing
All BuildIndices pipeline releases are documented in the BuildIndices changelog and tested manually using reference JSON files.
Citing the BuildIndices Pipeline
If you use the BuildIndices Pipeline in your research, please consider citing our preprint:
Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1
Consortia support
This pipeline is supported by the BRAIN Initiative (BICCN and BICAN).
If your organization also uses this pipeline, we would like to list you! Please reach out to us by filing an issue in WARP.
Feedback
Please help us make our tools better by filing an issue in WARP for pipeline-related suggestions or questions.