Skip to main content

Ultima Genomics Whole Genome Germline Overview

Pipeline VersionDate UpdatedDocumentation AuthorsQuestions or Feedback
UltimaGenomicsWholeGenomeGermline_v1.0.16March, 2024Elizabeth Kiernan & Kaylee MathewsPlease file an issue in WARP.

UG_diagram

Introduction to the UG_WGS workflow

The Ultima Genomics Whole Genome Germline (UG_WGS) workflow is an open-source, cloud-optimized workflow for processing whole-genome germline sequencing data generated using the Ultima Genomics sequencing platform.

Background: Ultima Genomics sequencing

Ultima Genomics sequencing is a novel technology that produces single-read, flow-based data (Almogy et al., 2022). The sequencing platform works by flowing one nucleotide at a time in order, iteratively. This is in contrast to traditional technologies that do all four nucleotides at once. This iterative approach ensures that only one dNTP is responsible for the signal and it does not require the blocking of dNTPs.

What does the workflow do?

The workflow requires either an aligned CRAM output of the sequencing platform or an unmapped BAM as input. Overall, it aligns reads to a reference genome, marks duplicate reads, calls variants, post-processes variants in the output VCF in preparation for joint calling, and calculates quality control metrics. The workflow outputs a (re)aligned CRAM, an annotated GVCF with index, and quality metrics.

Set-up

Installation

To download the latest release of the UG_WGS pipeline, see the release tags prefixed with "UG_WGS" on the WARP releases page. All releases of the UG_WGS pipeline are documented in the UG_WGS changelog.

To search releases of this and other pipelines, use the WARP command-line tool Wreleaser.

The UG_WGS pipeline can be deployed using Cromwell, a GA4GH compliant, flexible workflow management system that supports multiple computing platforms. The workflow can also be run in Terra, a cloud-based analysis platform.

Inputs

The UG_WGS workflow inputs are specified in JSON configuration files. An example configuration file can be found in the input_files folder in the WARP repository.

Multiple workflow inputs are in the form of structs, which are defined in UG_WGS structs WDL.

Input descriptions

The workflow takes in an aligned CRAM (output of the Ultima Genomics sequencing platform) or an unmapped BAM file for one sample and one read group.

The workflow input variables are listed below. If an input variable is part of a struct, the struct name is listed in the Struct column.

Input variable nameStructDescriptionType
input_cram_listN/AArray of CRAM files to be used as workflow input; must be specified if input_bam_list is not provided.Array [File]
input_bam_listN/AArray of unmapped BAM files to be used as workflow input; must be specified if input_cram_list is not provided.Array [File]
base_file_nameN/ABase name for each of the output files.String
contamination_sites_pathContaminationSitesPath to contamination site files.String
contamination_sites_vcfContaminationSitesContamination site VCF.File
contamination_sites_vcf_indexContaminationSitesIndex for contamination site VCF.File
ref_fastaReferencesReference FASTA file used for alignment with BWA-MEM.File
ref_fasta_indexReferencesReference FASTA index file used for alignment with BWA-MEM.File
ref_dictReferencesDictionary file used for alignment with BWA-MEM.File
ref_altAlignmentReferencesReference files used for alignment with BWA-MEM.File
ref_ambAlignmentReferencesReference files used for alignment with BWA-MEM.File
ref_annAlignmentReferencesReference files used for alignment with BWA-MEM.File
ref_bwtAlignmentReferencesReference files used for alignment with BWA-MEM.File
ref_pacAlignmentReferencesReference files used for alignment with BWA-MEM.File
ref_saAlignmentReferencesReference files used for alignment with BWA-MEM.File
wgs_calling_interval_listVariantCallingSettingsInterval list used for variant calling with HaplotypeCaller.File
break_bands_at_multiples_ofVariantCallingSettingsBreaks reference bands up at genomic positions that are multiples of this number; used to reduce GVCF file size.Int
haplotype_scatter_countVariantCallingSettingsScatter count used for variant calling.Int
make_haplotype_bamN/ABoolean indicating if HaplotypeCaller should output a bamout file; default is set to "false".Boolean
rsq_thresholdNAThreshold for a read quality metric that is produced by the sequencing platform; default set to 1.0Float
annotation_intervalsVcfPostProcessingAnnotation intervals used for filtering and annotating the HaplotypeCaller output VCF.Array[File]
filtering_model_no_gtVcfPostProcessingOptional filtering file defining the model for VCF postprocessing.File
af_only_gnomadVcfPostProcessingVCF with gnomAD allele frequencies used for metric stratification in the AnnotateVCF_AF task.File
af_only_gnomad_indexVcfPostProcessingIndex for the af_only_gnomad VCF.File
filter_cg_insertionsVcfPostProcessingBoolean that indicates whether to filter CG insertions in HaplotypeCaller output VCF.Boolean
filtering_blocklist_fileVcfPostProcessingOptional file used to flag genomic locations that can’t be trusted while filtering the HaplotypeCaller output VCF.File
training_blocklist_fileVcfPostProcessingOptional interval file for training a model to postprocess the HaplotypeCaller output VCF to identify false positives.File
exome_weightVcfPostProcessingOptional interval for exome weight that is used to train a model to identify false positives in the HaplotypeCaller output VCF.Int
exome_weight_annotationVcfPostProcessingOptional string to annotate exome weights.String
interval_list_overrideVcfPostProcessingOptional interval list to use for VCF postprocessingFile
runs_fileVcfPostProcessingBED file of homopolymer runs that can be difficult to sequence.File
filtering_model_with_gt_name_overrideVcfPostProcessingOptional string to describe the optional filtering model with gt.String
max_duplication_in_reasonable_sampleVcfPostProcessingNumber indicating the maximum duplication expected in a sample.Float
max_chimerism_in_reasonable_sampleVcfPostProcessingNumber indicating the maximum chimerism expected in a sample.Float
ref_dbsnpVcfPostProcessingVCF with dbSNP annotations to be added to the HaplotypeCaller output VCF.File
ref_dbsnp_indexVcfPostProcessingIndex for the dbSNP annotation VCF.File
wgs_coverage_interval_listVcfPostProcessingInterval list for the CollectWgsMetrics tool.File
remove_low_tree_score_sites_cutoffVcfPostProcessingOptional float indicating a cutoff for low tree scoresFloat
annotations_to_keep_command_for_reblockingVcfPostProcessingOptional command (--annotations-to-keep) for the ReblockGVCF task indicating what types annotations to keep in the reblocked GVCF.String
increase_disk_sizeN/AInterval used to increase disk size; set to 20 GB by default.Int
increase_metrics_disk_sizeN/AInterval used to adjust disk size; set to 80 by default.Int
filtering_model_no_gt_nameN/AString to describe the optional filtering model; default set to "rf_model_ignore_gt_incl_hpol_runs".String
merge_bam_fileN/ABoolean indicating if by-interval bamout files from HaplotypeCaller should be merged into a single BAM.Boolean
reads_per_splitN/ANumber of reads by which to split the CRAM prior to alignment.Int
pipeline_versionN/AString that describes the pipeline version number.String

Reference files

Reference files, such as the hg38- and dbSNP-related files are located in a public Google bucket. See the example input configuration file for cloud locations.

UG_WGS tasks and tools

The UG_WGS workflow imports additional WDL scripts that contain the different workflow tasks. When applicable, links to these additional WDL scripts (subtasks/subworkflows) are provided in the summary section below.

  • Workflow tasks use different software tools to manipulate the workflow input data. To see specific tool parameters, select the task WDL link in the table; then find the task and view the command {} section of the task in the WDL script.

  • To view or use the exact tool software, see the task's Docker image which is specified in the task WDL input {} section as String docker =.

  • Docker images for the UG_WGS workflow are not yet versioned and officially released, but they are publicly available to test the workflow.

Overall, the UG_WGS workflow:

  1. Aligns with BWA-MEM and marks duplicates.
  2. Converts a merged BAM to CRAM and validates the CRAM.
  3. Extracts the nucleotide flow order.
  4. Calculates quality control metrics.
  5. Performs variant calling with HaplotypeCaller.
  6. Merges BAMs and converts the GVCF to VCF.
  7. Performs VCF post-processing in preparation for joint calling.
Multiple tasks are imported from nested sub-workflows

When applicable, links to the sub-workflow WDLs and nested tasks WDLs are provided.

1. Align and mark duplicates

Sub-workflow name and link: UltimaGenomicsWholeGenomeCramOnly

The table below details the subtasks called by the AlignmentAndMarkDuplicates task, which splits the CRAM or BAM into subfiles, converts them to uBAM format, converts the uBAM to FASTQ for alignment with BWA-MEM, and marks duplicate reads in the resulting BAM files.

The Picard tool MarkDuplicatesSpark has been adapted to handle ambiguity in the aligned start of a read. This functionality is applied using the tool’s --flowbased parameter.

Subtask name and WDL linkToolSoftwareDescription
Tasks.SplitCram as SplitInputCramsplitcrammerIf CRAM is the workflow input, splits the CRAM and outputs a split CRAM.
AlignmentTasks.SamSplitter as SplitInputBamSplitSamByNumberOfReadsPicardIf BAM is workflow input, splits the BAM and outputs an array of BAMs.
Tasks.ConvertCramOrBamToUBam as ConvertToUbamview, RevertSamsamtools, PicardConverts the split CRAM or BAM file to uBAM.
Tasks.SamToFastqAndBwaMemAndMbaSamToFastq, bwa mem, MergeBamAlignmentPicard, bwa memConverts each uBAM to FASTQ format, aligns with BWA-MEM, and merges the alignment in the resulting BAM with metadata from the uBAM.
Tasks.MarkDuplicatesSparkMarkDuplicatesSparkGATKFlags duplicate reads in the array of aligned and merged BAMs to create a new output BAM and index.

2. Convert BAM to CRAM and validate the CRAM

Sub-workflow name and link: UltimaGenomicsWholeGenomeCramOnly

The two tasks below are used to convert each duplicate-marked BAM to CRAM and validate the resulting files.

TASK name and WDL linkToolSoftwareDescription
Utilities.ConvertToCramviewsamtoolsConverts the duplicated-marked BAM to CRAM.
QC.ValidateSamFile as ValidateCramValidateSamFilePicardValidates the CRAM file.

3. Extract nucleotide flow order

Sub-workflow name and link: UltimaGenomicsWholeGenomeCramOnly

The flow order is the order in which nucleotides are passed during sequencing. This information is captured in the header of the BAM and is extracted as a text file for downstream processing.

TASK name and WDL linkToolSoftwareDescription
Tasks.ExtractSampleNameFlowOrderGetSampleNameGATKExtracts the flow order from the BAM header into a text file that is used in downstream VCF post-processing.

4. Calculate quality control metrics

Sub-workflow name and link: UltimaGenomicsWholeGenomeCramOnly

The workflow uses a contamination estimation that has been adapted to use only the highest quality SNP sites based on flow cycles and local realignment. This is different from the Whole Genome Germline Single Sample pipeline in that HaplotypeCaller performs local realignment and feeds that output to VerifyBAM ID, which cleans up the alignment.

Subtask name and WDL linkToolSoftwareDescription
Tasks.HaplotypeCaller as HaplotypeCallerForContaminationHaplotypeCallerGATKRuns HaplotypeCaller using an interval list of variants with high allele frequencies (contamination_sites_vcf).
Tasks.CheckContaminationVerifyBamIDVerifyBamIDChecks contamination in the HaplotypeCallerForContamination bamout file.
Tasks.CollectDuplicateMetricsCollectDuplicateMetricsPicardChecks duplication metrics in the aggregated, duplicate-marked BAM file.
QC.CollectQualityYieldMetricsCollectQualityYieldMetricsPicardCalculates QC metrics on the duplicated-marked BAM.
Tasks.CollectWgsMetricsCollectWgsMetricsPicardCollects WGS metrics on the duplicate-marked BAM using stringent thresholds.
Tasks.CollectRawWgsMetricsCollectRawWgsMetricsPicardCollects the raw WGS metrics on the duplicated-marked BAM with commonly used QC metrics.
Tasks.CollectAggregationMetricsCollectMultipleMetricsPicardPerforms QC on the aligned, duplicated-marked BAM.
QC.CheckPreValidationcustom scriptpython3Checks chimerism and duplicate files for a given threshold using a custom python script.

5. Variant call with HaplotypeCaller

The workflow implements initial variant calling with a version of HaplotypeCaller that handles flow-based data. This version replaces the classic Hidden Markov Model (HMM) with a flow-based likelihood model that more accurately accounts for sequencing errors present in the data.

Task name and WDL linkToolSoftwareDescription
Utilities.ScatterIntervalListIntervalListToolsPicardSplits the calling interval list into sub-intervals in order to perform variant calling on the sub-intervals.
Tasks.HaplotypeCallerHaplotypeCallerGATKPerforms initial variant calling on the aligned BAM file and outputs sub-interval GVCFs and a bamout file.

6. Merge VCFs and BAMs and convert GVCF to VCF

The workflow performs multiple post-processing steps to prepare the VCF for downstream joint calling. The HaplotypeCaller GVCF outputs are merged into a single GVCF and then converted to VCF in preparation for this post-processing.

Task name and WDL linkToolSoftwareDescription
VariantDiscoverTasks.MergeVCFsMergeVcfsPicardMerges the array of GVCFs from HaplotypeCaller into one VCF and index.
Tasks.MergeBamsMergeSamFilesPicardMerges the HaplotypeCaller bamout files into a single BAM file.
Tasks.ConvertGVCFtoVCFGenotypeGVCFsGATKConverts to GVCF to VCF format in preparation for post-processing.

7. Perform VCF post-processing

The workflow performs post-processing steps to prepare the VCF for downstream joint calling. First, it annotates the merged HaplotypeCaller output VCF with dbSNP variants, then it trains a model, a random forest classifier (Almogy et al., 2022), to distinguish between true positive variants and false positives. Next, the model is applied to filter variants in the VCF.

Task name and WDL linkToolSoftwareDescription
Tasks.AnnotateVCFVariantAnnotatorGATKAdds dbSNP annotations to the HaplotypeCaller output VCF and outputs a new VCF with index.
Tasks.AddIntervalAnnotationsToVCFannotatebcftoolsAdds additional interval annotations to the VCF.
Tasks.TrainModelCustom script (train_models_pipeline.py)genomics.py3Trains a model that can be applied for variant filtering to distinguish true and false-positive variants.
Tasks.AnnotateVCF_AFview, indexvcfanno, bcftoolsAdds gnomAD allele frequencies to the VCF for downstream metric calculation.
Tasks.FilterVCFfilter_variants_pipeline.pygenomics.py3Applies a trained model to filter for high quality VCF variants.
Tasks.MoveAnnotationsToGvcfannotatebcftoolsMove the annotations and the tree score from the filtered VCF to produces a final post-processed GVCF and index.
ReblockGVCF.ReblockGVCFReblockGVCFGATKReblocks the annotated GVCF. For more information on why the WARP germline pipelines perform reblocking, see the blog on reblocking.

8. Outputs

Workflow outputs are described in the table below.

Output variable nameDescriptionType
output_gvcfFinal annotated, reblocked GVCF.File
output_gvcf_indexIndex file for the final annotated, reblocked GVCF.File
output_vcfAnnotated VCF file.File
output_vcf_indexIndex file for the annotated VCF.File
haplotype_bamOptional merged, duplicated-marked BAM output.File
haplotype_bam_indexIndex for the optional merged, duplicate-marked BAM output.File
output_cramFinal duplicate-marked CRAM file.File
output_cram_indexIndex file for the final CRAM.File
output_cram_md5MD5 sum for the final CRAM.File
selfSMContamination estimate from VerifyBamID.File
contaminationEstimated contamination from the CheckContamination task.Float
filtered_vcfFiltered VCF file.File
filtered_vcf_indexIndex for the filtered VCF.File
quality_yield_metricsThe quality metrics calculated for the unmapped BAM files.File
wgs_metricsMetrics from the CollectWgsMetrics tool.File
raw_wgs_metricsMetrics from the CollectRawWgsMetrics tool.File
duplicate_metricsDuplicate read metrics from the MarkDuplicates tool.File
agg_alignment_summary_metricsAlignment summary metrics for the aggregated, duplicate-marked BAM.File
agg_alignment_summary_pdfOptional PDF of the alignment summary metrics for the aggregated BAM.File
agg_gc_bias_detail_metricsGC bias detail metrics for the aggregated, duplicate-marked BAM.File
agg_gc_bias_pdfPDF of GC bias for the aggregated, duplicate-marked BAM.File
agg_gc_bias_summary_metricsBait bias summary metrics for the aggregated, duplicate-marked BAM.File
agg_quality_distribution_pdfPDF of the quality distribution for the aggregated, duplicate-marked BAM.File
agg_quality_distribution_metricsQuality distribution metrics for the aggregated, duplicate-marked BAM.File
duplication_rateDuplication rate.Float
chimerism_rateChimerism rate.Float
is_outlier_dataBoolean that indicates if the duplication rate and chimerism rate are above a specified threshold.Boolean
sample_nameSample name from aligned BAM header.String
flow_orderFlow order (FO) from the aligned BAM header.String
barcodeBarcode from the aligned BAM header.String
idID from the aligned BAM header.String

Using outputs for downstream joint calling

The outputs of the UG_WGS workflow are not yet compatible with the WARP Ultimate Genomics Joint Genotyping workflow, but efforts to release the workflow are in-progress.

Versioning

All UG_WGS pipeline releases are documented in the pipeline changelog.

Citing the UG_WGS Pipeline

If you use the UG_WGS Pipeline in your research, please cite our preprint:

Degatano, K.; Awdeh, A.; Dingman, W.; Grant, G.; Khajouei, F.; Kiernan, E.; Konwar, K.; Mathews, K.; Palis, K.; Petrillo, N.; Van der Auwera, G.; Wang, C.; Way, J.; Pipelines, W. WDL Analysis Research Pipelines: Cloud-Optimized Workflows for Biological Data Processing and Reproducible Analysis. Preprints 2024, 2024012131. https://doi.org/10.20944/preprints202401.2131.v1

Feedback

Please help us make our tools better by filing an issue in WARP; we welcome pipeline-related suggestions or questions.