TrainGCNV
GATK-gCNV is a method for detecting rare germline copy number variants (CNVs) from short-read sequencing read-depth information. The TrainGCNV module trains a gCNV model for use in the GatherBatchEvidence workflow. The upstream and downstream dependencies of the TrainGCNV module are illustrated in the following diagram.
The samples used for training should be homogeneous (in sequencing platform, coverage, library preparation, etc.) and similar to the samples on which the model will be applied in terms of sample type, library preparation protocol, sequencer, and sequencing center.
For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient.
However, for larger cohorts, especially those with multiple data sources, we recommend training a separate model for each batch or group of batches (see the batching section for details).
The model can be trained on all or a subset of the samples to which it will be applied.
A subset of 100 randomly selected samples from the batch is a reasonable input size for training the model; when the n_samples_subsample input is provided, the TrainGCNV workflow can automatically perform this random selection.
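To make the subsampling step concrete, the following is a minimal sketch of drawing a random training subset while keeping each sample ID paired with its count file. It is not the workflow's own implementation: the function name `subsample_for_training`, the `seed` argument, and the example file names are hypothetical and used only for illustration.

```python
import random


def subsample_for_training(samples, count_files, n_samples_subsample, seed=0):
    """Return a random subset of samples and their count files, order preserved."""
    if len(samples) != len(count_files):
        raise ValueError("samples and count_files must have the same length")
    if n_samples_subsample >= len(samples):
        # Nothing to subsample: train on every sample.
        return list(samples), list(count_files)
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible
    # Draw indices rather than IDs so each sample stays paired with its file.
    keep = sorted(rng.sample(range(len(samples)), n_samples_subsample))
    return [samples[i] for i in keep], [count_files[i] for i in keep]


# Hypothetical batch of 250 samples with made-up file names.
samples = [f"sample_{i:03d}" for i in range(250)]
count_files = [f"{s}.counts.tsv.gz" for s in samples]
train_samples, train_counts = subsample_for_training(samples, count_files, 100)
print(len(train_samples), "samples selected for training")
```

Sampling indices instead of sample IDs is a deliberate choice here: it guarantees that the two returned lists stay in matching order.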
The following diagram illustrates the upstream and downstream workflows of the TrainGCNV workflow in the recommended invocation order. You may refer to this diagram for the overall recommended invocation order.
Inputs
This section provides a brief description of the required inputs of the TrainGCNV workflow. For a description of the optional inputs and their default values, you may refer to the source code of the TrainGCNV workflow. Additionally, the majority of the optional inputs of the workflow map to the optional arguments of the tool the workflow uses, GATK GermlineCNVCaller; hence, you may refer to the documentation of the tool for a description of these optional inputs.
samples
A list of sample IDs. The order of IDs in this list should match the order of files in count_files.
count_files
A list of per-sample coverage counts generated in the GatherSampleEvidence workflow.
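Because samples and count_files are matched purely by position, a quick check before launching the workflow can catch a shuffled list. The sketch below assumes, purely for illustration, that each count file's base name begins with its sample ID followed by a dot; that naming convention and the helper name `find_order_mismatches` are assumptions, not something the workflow requires.

```python
import posixpath


def find_order_mismatches(samples, count_files):
    """Return (index, sample_id, count_file) for entries that look mispaired."""
    if len(samples) != len(count_files):
        raise ValueError("samples and count_files must have the same length")
    mismatches = []
    for i, (sample_id, path) in enumerate(zip(samples, count_files)):
        name = posixpath.basename(path)
        # Assumed naming convention: "<sample_id>.<suffix>". Requiring the dot
        # avoids matching "s1" against a file that belongs to "s10".
        if not name.startswith(sample_id + "."):
            mismatches.append((i, sample_id, path))
    return mismatches


# Hypothetical inputs in which the two files are swapped.
problems = find_order_mismatches(
    ["NA0001", "NA0002"],
    ["counts/NA0002.counts.tsv.gz", "counts/NA0001.counts.tsv.gz"],
)
for index, sample_id, path in problems:
    print(f"position {index}: {sample_id} is paired with {path}")
```

If your files follow a different naming scheme, the comparison inside the loop is the only part that needs to change.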
contig_ploidy_priors
A tabular file with ploidy prior probabilities per contig. You may find the link to this input in this reference and a description of the file format here.
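As an illustration of what such a table can look like, the sketch below parses a small tab-separated priors table and runs a basic sanity check on it. The column names (CONTIG_NAME, PLOIDY_PRIOR_0, PLOIDY_PRIOR_1, ...) follow the layout described in the GATK documentation as I understand it, and the numeric values are made up; treat both as assumptions and consult the linked format description for the authoritative definition. The check that each row sums to 1 is a sanity check chosen for this sketch.

```python
import csv
import io
import math


def read_contig_ploidy_priors(handle):
    """Parse a tab-separated priors table into {contig: [prior per ploidy]}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[0] != "CONTIG_NAME":
        raise ValueError("first column is expected to be CONTIG_NAME")
    priors = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        contig, values = row[0], [float(v) for v in row[1:]]
        if len(values) != len(header) - 1:
            raise ValueError(f"{contig}: expected {len(header) - 1} priors")
        # Sanity check used by this sketch: priors form a probability distribution.
        if not math.isclose(sum(values), 1.0, abs_tol=1e-6):
            raise ValueError(f"{contig}: priors sum to {sum(values)}, not 1")
        priors[contig] = values
    return priors


# Made-up example table; the values are illustrative only.
example = "\n".join([
    "\t".join(["CONTIG_NAME", "PLOIDY_PRIOR_0", "PLOIDY_PRIOR_1",
               "PLOIDY_PRIOR_2", "PLOIDY_PRIOR_3"]),
    "\t".join(["chr1", "0.01", "0.01", "0.97", "0.01"]),
    "\t".join(["chrX", "0.01", "0.49", "0.49", "0.01"]),
    "\t".join(["chrY", "0.50", "0.50", "0.00", "0.00"]),
])
priors = read_contig_ploidy_priors(io.StringIO(example))
print(sorted(priors))
```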
reference_fasta
reference_fasta, reference_index, and reference_dict are, respectively, the reference genome sequence in FASTA format, its index file, and the corresponding dictionary file. You may find links to these files in this reference.
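The inputs described above can be collected into an inputs JSON file. The sketch below assumes the common WDL convention of keying each input as WorkflowName.input_name, with TrainGCNV as the workflow name; all paths are hypothetical placeholders. It covers only the inputs described in this section, so the actual workflow may require additional inputs (see its source code).

```python
import json


def build_train_gcnv_inputs(samples, count_files, contig_ploidy_priors,
                            reference_fasta, reference_index, reference_dict,
                            n_samples_subsample=None):
    """Assemble a dict of the inputs described in this section."""
    if len(samples) != len(count_files):
        raise ValueError("samples and count_files must have the same length")
    inputs = {
        "TrainGCNV.samples": list(samples),
        "TrainGCNV.count_files": list(count_files),
        "TrainGCNV.contig_ploidy_priors": contig_ploidy_priors,
        "TrainGCNV.reference_fasta": reference_fasta,
        "TrainGCNV.reference_index": reference_index,
        "TrainGCNV.reference_dict": reference_dict,
    }
    if n_samples_subsample is not None:
        # Optional: ask the workflow to pick a random training subset.
        inputs["TrainGCNV.n_samples_subsample"] = n_samples_subsample
    return inputs


# All paths below are hypothetical placeholders.
inputs = build_train_gcnv_inputs(
    samples=["NA0001", "NA0002"],
    count_files=["counts/NA0001.counts.tsv.gz", "counts/NA0002.counts.tsv.gz"],
    contig_ploidy_priors="resources/contig_ploidy_priors.tsv",
    reference_fasta="resources/reference.fasta",
    reference_index="resources/reference.fasta.fai",
    reference_dict="resources/reference.dict",
    n_samples_subsample=100,
)
print(json.dumps(inputs, indent=2))
```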
Outputs
Optional annotated_intervals
The count files from GatherSampleEvidence, with adjacent intervals combined into locus-sorted DepthEvidence files using the GATK CondenseDepthEvidence tool and then annotated with GC content, mappability, and segmental-duplication content using the GATK AnnotateIntervals tool. This output is generated if the optional input do_explicit_gc_correction is set to True.