
TrainGCNV

GATK-gCNV is a method for detecting rare germline copy number variants (CNVs) from short-read sequencing read-depth information. The TrainGCNV module trains a gCNV model for use in the GatherBatchEvidence workflow. The upstream and downstream dependencies of the TrainGCNV module are illustrated in the following diagram.

The samples used for training should be homogeneous with respect to sequencing platform, coverage, and library preparation, and they should be similar to the samples to which the model will be applied in terms of sample type, library preparation protocol, sequencer, and sequencing center.

For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. For larger cohorts, especially those with multiple data sources, we recommend training a separate model for each batch or group of batches (see the batching section for details). The model can be trained on all or a subset of the samples to which it will be applied. A subset of 100 randomly selected samples from the batch is a reasonable input size for training the model; when the n_samples_subsample input is provided, the TrainGCNV workflow performs this random selection automatically.
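If you prefer to choose the training subset yourself before invoking the workflow, the random selection described above can be sketched as follows. This is a minimal illustration, not the workflow's own implementation: the function name subsample_training_set and the seed argument are hypothetical, and the sketch only assumes that sample IDs and count files are given as two parallel lists.

```python
import random


def subsample_training_set(samples, count_files, n_samples_subsample, seed=0):
    """Randomly pick paired (sample ID, count file) entries.

    Hypothetical helper for illustration only. The two input lists are
    assumed to be parallel: count_files[i] belongs to samples[i].
    """
    if len(samples) != len(count_files):
        raise ValueError("samples and count_files must have the same length")
    # Nothing to subsample if the batch is already small enough.
    if n_samples_subsample >= len(samples):
        return list(samples), list(count_files)
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    # Sample indices (not values) so both lists stay aligned; sorting keeps
    # the original relative order of the selected samples.
    chosen = sorted(rng.sample(range(len(samples)), n_samples_subsample))
    return [samples[i] for i in chosen], [count_files[i] for i in chosen]
```

Sampling indices rather than the lists themselves keeps each sample ID paired with its own count file, which matters because the workflow matches the two inputs by position.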

The following diagram illustrates the upstream and downstream workflows of the TrainGCNV workflow in the recommended invocation order.

Inputs

This section briefly describes the required inputs of the TrainGCNV workflow. For a description of the optional inputs and their default values, refer to the source code of the TrainGCNV workflow. Most of the optional inputs map to the optional arguments of GATK GermlineCNVCaller, the tool the workflow uses, so the documentation of that tool also describes them.

samples

A list of sample IDs. The order of IDs in this list should match the order of files in count_files.

count_files

A list of per-sample coverage counts generated in the GatherSampleEvidence workflow.
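Because samples and count_files are matched by position, it can help to build both lists from a single mapping so that they cannot drift out of order. The sketch below is one way to do that. The helper name build_train_gcnv_inputs, the sample IDs, and the file paths are hypothetical; the "TrainGCNV." key prefix assumes the usual WDL inputs-JSON convention of workflow name followed by input name.

```python
import json


def build_train_gcnv_inputs(sample_to_counts):
    """Return an inputs dict with aligned samples and count_files lists.

    sample_to_counts maps each sample ID to the path of its count file.
    Hypothetical helper for illustration only.
    """
    samples = sorted(sample_to_counts)  # one fixed order drives both lists
    return {
        "TrainGCNV.samples": samples,
        "TrainGCNV.count_files": [sample_to_counts[s] for s in samples],
    }


if __name__ == "__main__":
    # Placeholder IDs and paths; replace with the GatherSampleEvidence outputs.
    example = {
        "sample_B": "counts/sample_B.counts.tsv.gz",
        "sample_A": "counts/sample_A.counts.tsv.gz",
    }
    print(json.dumps(build_train_gcnv_inputs(example), indent=2))
```

The remaining required inputs, such as contig_ploidy_priors and the reference files, would be added to the same dictionary before writing it out.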

contig_ploidy_priors

A tabular file with the ploidy prior probabilities for each contig. You may find a link to this input in this reference and a description of the file format here.
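A quick sanity check on a priors table before training can catch formatting mistakes early. The sketch below assumes a tab-separated file with a CONTIG_NAME column followed by PLOIDY_PRIOR_0, PLOIDY_PRIOR_1, and so on, and it assumes the priors in each row are meant to sum to 1. Confirm both assumptions against the file-format description referenced above. The function name check_ploidy_priors is hypothetical.

```python
import csv


def check_ploidy_priors(handle, tolerance=1e-6):
    """Return the contigs whose prior probabilities do not sum to 1.

    Assumes a tab-separated table with a CONTIG_NAME column and one
    PLOIDY_PRIOR_<n> column per ploidy state (assumed layout; verify it
    against the official file-format description).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    bad_contigs = []
    for row in reader:
        priors = [
            float(value)
            for column, value in row.items()
            if column.startswith("PLOIDY_PRIOR_")
        ]
        if abs(sum(priors) - 1.0) > tolerance:
            bad_contigs.append(row["CONTIG_NAME"])
    return bad_contigs
```

For example, calling check_ploidy_priors(open("contig_ploidy_priors.tsv")) on a well-formed table should return an empty list; the file name here is a placeholder.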

reference_fasta

The inputs reference_fasta, reference_index, and reference_dict are, respectively, the reference genome sequence in FASTA format, its index file, and the corresponding sequence dictionary file. You may find links to these files in this reference.

Outputs

Optional annotated_intervals

The count files from GatherSampleEvidence, with adjacent intervals combined into locus-sorted DepthEvidence files using the GATK CondenseDepthEvidence tool and then annotated with GC content, mappability, and segmental-duplication content using the GATK AnnotateIntervals tool. This output is generated only if the optional input do_explicit_gc_correction is set to True.

Optional filtered_intervals_cnv

Optional cohort_contig_ploidy_model_tar

Optional cohort_gcnv_model_tars