TrainGCNV
GATK-gCNV
is a method for detecting rare germline copy number variants (CNVs)
from short-read sequencing read-depth information.
The TrainGCNV
module trains a gCNV model for use in the GatherBatchEvidence workflow.
The samples used for training should be homogeneous (concerning sequencing platform, coverage, library preparation, etc.) and similar to the samples on which the model will be applied.
For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient.
However, for larger cohorts, especially those with multiple data sources,
we recommend training a separate model for each batch or group of batches (see
batching section for details).
The model can be trained on all or a subset of the samples to which it will be applied.
A subset of 100 randomly selected samples from the batch is a reasonable
input size for training the model; when the n_samples_subsample
input is provided,
the TrainGCNV
workflow can automatically perform this random selection.
The following diagram illustrates the recommended invocation order:
Inputs
The majority of the optional inputs of the workflow map to the optional arguments of the
tool the workflow uses, GATK-GermlineCNVCaller
; hence, you may refer to the
documentation
of the tool for a description on these optional inputs. We recommend that most users use the defaults.
All array inputs of sample data must match in order. For example, the order of the samples
array should match that
of the count_files
array.
samples
Sample IDs
count_files
Per-sample binned read counts (*.rd.txt.gz
) generated in the GatherSampleEvidence workflow.
Optional n_samples_subsample
, sample_ids_training_subset
Provide one of these inputs to subset the input batch. n_samples_subsample
will randomly subset, while
sample_ids_training_subset
is for defining a predetermined subset. These options are provided for convenience in Terra.
Outputs
cohort_contig_ploidy_model_tar
Contig ploidy model tarball.
cohort_gcnv_model_tars
CNV model tarballs scattered across genomic intervals.
cohort_contig_ploidy_calls_tar
Contig ploidy calls for the submitted batch.
cohort_gcnv_calls_tars
CNV call tarballs scattered by sample and genomic region prior to segmentation.
cohort_genotyped_segments_vcfs
Single-sample VCFs of CNV calls for the submitted batch.
cohort_gcnv_tracking_tars
Convergence tracking logs.
cohort_genotyped_intervals_vcfs
Single-sample VCFs for the submitted batch containing per-interval genotypes prior to segmentation.
cohort_denoised_copy_ratios
TSV files containing denoised copy ratios in each sample.
Optional annotated_intervals
The count files from GatherSampleEvidence with adjacent intervals combined into
locus-sorted DepthEvidence
files using GATK CondenseDepthEvidence
tool, which are
annotated with GC content, mappability, and segmental-duplication content using
GATK-AnnotateIntervals
tool. This output is generated if do_explicit_gc_correction
is set to True
. Disabled by default.
Optional filtered_intervals_cnv
, filtered_intervals_ploidy
Intervals of read count bins to be used for CNV and ploidy calling after filtering for problematic regions (e.g.
high GC content). This output is generated if filter_intervals
is set to True
. Enabled by default.