TrainGCNV

WDL source code

GATK-gCNV is a method for detecting rare germline copy number variants (CNVs) from short-read sequencing read-depth information. The TrainGCNV module trains a gCNV model for use in the GatherBatchEvidence workflow.

The samples used for training should be homogeneous (in terms of sequencing platform, coverage, library preparation, etc.) and similar to the samples to which the model will be applied.

For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. However, for larger cohorts, especially those with multiple data sources, we recommend training a separate model for each batch or group of batches (see batching section for details). The model can be trained on all or a subset of the samples to which it will be applied. A subset of 100 randomly selected samples from the batch is a reasonable input size for training the model; when the n_samples_subsample input is provided, the TrainGCNV workflow can automatically perform this random selection.
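As a rough illustration of the random selection described above, the following Python sketch picks a training subset outside the workflow, for example to prepare a predetermined list for sample_ids_training_subset. It is not the workflow's own implementation: the function name, the fixed seed, and the example sample IDs are assumptions made for illustration.

```python
import random


def pick_training_subset(sample_ids, n_subsample=100, seed=42):
    """Return a reproducible random subset of sample IDs for model training.

    Hypothetical helper: the name, default seed, and behavior for small
    batches are illustrative assumptions, not part of the TrainGCNV workflow.
    """
    sample_ids = list(sample_ids)
    if n_subsample >= len(sample_ids):
        # Batch is already small enough; train on every sample.
        return sample_ids
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    chosen = set(rng.sample(sample_ids, n_subsample))
    # Preserve the original batch order of the selected samples.
    return [s for s in sample_ids if s in chosen]


if __name__ == "__main__":
    batch = [f"sample_{i:03d}" for i in range(250)]  # made-up sample IDs
    subset = pick_training_subset(batch)
    print(len(subset), subset[:3])
```

Fixing the seed makes the selection repeatable, which can help when the same training subset needs to be reconstructed later.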

[Diagram: recommended invocation order of the workflows.]

Inputs

Most of the optional inputs of the workflow map to the optional arguments of the tool the workflow uses, GATK-GermlineCNVCaller; refer to the documentation of that tool for a description of these optional inputs. We recommend that most users use the defaults.

Note: All array inputs of sample data must match in order. For example, the order of the samples array should match that of the count_files array.
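One way to keep the arrays aligned is to derive both from a single mapping of sample ID to count file. The sketch below shows this in Python. The helper name, the example IDs and paths, and the `TrainGCNV.` key prefix (the usual `workflow.input` form of a WDL inputs JSON) are assumptions, so adjust them to your own setup.

```python
import json


def build_sample_inputs(counts_by_sample, workflow_name="TrainGCNV"):
    """Build order-matched samples/count_files arrays for a WDL inputs JSON.

    counts_by_sample maps sample ID -> path of its *.rd.txt.gz count file.
    The workflow_name prefix is an assumption; use the name in your WDL.
    """
    samples = sorted(counts_by_sample)
    # Derive count_files from the same ordering so that index i refers to
    # the same sample in both arrays.
    count_files = [counts_by_sample[s] for s in samples]
    return {
        f"{workflow_name}.samples": samples,
        f"{workflow_name}.count_files": count_files,
    }


if __name__ == "__main__":
    example = {  # made-up IDs and paths
        "NA12878": "gs://my-bucket/NA12878.rd.txt.gz",
        "HG00096": "gs://my-bucket/HG00096.rd.txt.gz",
    }
    print(json.dumps(build_sample_inputs(example), indent=2))
```

Because both arrays are built from one mapping, a sample can never be paired with another sample's count file.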

`samples`

Sample IDs

`count_files`

Per-sample binned read counts (*.rd.txt.gz) generated in the GatherSampleEvidence workflow.

Optional `n_samples_subsample`, `sample_ids_training_subset`

Provide one of these inputs to subset the input batch. n_samples_subsample randomly selects a subset of the given size, while sample_ids_training_subset specifies a predetermined subset. These options are provided for convenience in Terra.

Optional `outlier_sample_ids`

Provide a file containing sample IDs, delimited by new lines, to exclude from the gCNV model training process. This list can be based on outlier samples identified following EvidenceQC that are still retained in the cohort.
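A newline-delimited ID file of this kind can be produced with a few lines of Python. The sketch below is illustrative only: the helper names and the output filename are hypothetical, and the deduplication and whitespace trimming are conveniences rather than documented requirements of the workflow.

```python
from pathlib import Path


def format_outlier_ids(outlier_ids):
    """Return newline-delimited text with one unique, non-empty ID per line."""
    seen = {}
    for raw in outlier_ids:
        sample_id = raw.strip()
        if sample_id:
            seen.setdefault(sample_id, None)  # dict keeps first-seen order
    return "".join(f"{sample_id}\n" for sample_id in seen)


def write_outlier_file(outlier_ids, path):
    """Write the formatted IDs to path (hypothetical filename chosen by you)."""
    Path(path).write_text(format_outlier_ids(outlier_ids))


if __name__ == "__main__":
    # Made-up IDs; in practice these would come from reviewing EvidenceQC.
    write_outlier_file(["sample_017", "sample_142"], "outlier_sample_ids.txt")
```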

Outputs

`cohort_contig_ploidy_model_tar`

Contig ploidy model tarball.

`cohort_gcnv_model_tars`

CNV model tarballs scattered across genomic intervals.

`cohort_contig_ploidy_calls_tar`

Contig ploidy calls for the submitted batch.

`cohort_gcnv_calls_tars`

CNV call tarballs scattered by sample and genomic region prior to segmentation.

`cohort_genotyped_segments_vcfs`

Single-sample VCFs of CNV calls for the submitted batch.

`cohort_gcnv_tracking_tars`

Convergence tracking logs.

`cohort_genotyped_intervals_vcfs`

Single-sample VCFs for the submitted batch containing per-interval genotypes prior to segmentation.

`cohort_denoised_copy_ratios`

TSV files containing denoised copy ratios in each sample.

Optional `annotated_intervals`

The count files from GatherSampleEvidence, with adjacent intervals combined into locus-sorted DepthEvidence files using the GATK CondenseDepthEvidence tool and then annotated with GC content, mappability, and segmental-duplication content using the GATK-AnnotateIntervals tool. This output is generated only if do_explicit_gc_correction is set to True, which is disabled by default.

Optional `filtered_intervals_cnv`, `filtered_intervals_ploidy`

Intervals of read count bins to be used for CNV and ploidy calling after filtering for problematic regions (e.g. high GC content). This output is generated if filter_intervals is set to True. Enabled by default.
